Mistral today released an open-source voice model that could rival paid voice AI offerings such as those from ElevenLabs and Hume AI. The company said the model bridges the gap between proprietary speech recognition models and more open, yet error-prone, alternatives.
Voxtral, which Mistral will release under an Apache 2.0 license, is available in a 24B-parameter version and a 3B variant. The larger model is intended for applications at scale, while the smaller version is aimed at local and edge use cases.
“Voice was humanity’s first interface—long before writing or typing, it let us share ideas, coordinate work, and build relationships. As digital systems become more capable, voice is returning as our most natural form of human-computer interaction,” Mistral said in a blog post. “Yet today’s systems remain limited—unreliable, proprietary, and too brittle for real-world use. Closing this gap demands tools with exceptional transcription, deep understanding, multilingual fluency, and open, flexible deployment.”
Voxtral is available through Mistral’s API and through a transcription-only endpoint on its website. The models are also accessible through Le Chat, Mistral’s chat platform.
Mistral said that speech AI has “meant choosing between two trade-offs”: open-source automatic speech recognition models often have limited semantic understanding, while closed models with strong language understanding come at a high cost.
Bridging the gap
The company said Voxtral “offers state-of-the-art accuracy and native semantic understanding in the open, at less than half the price of comparable APIs.”
With a 32K-token context, Voxtral can transcribe up to 30 minutes of audio, or handle up to 40 minutes for audio understanding tasks. It offers built-in question answering and summarization, meaning the model can answer questions about the audio content and generate summaries without switching to a separate mode. U …