Microsoft launches 3 new AI models in direct shot at OpenAI and Google

by | Apr 2, 2026 | Technology

Microsoft on Wednesday launched three new foundational AI models it built entirely in-house — a state-of-the-art speech transcription system, a voice generation engine, and an upgraded image creator — marking the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI, Google, and other frontier labs on model development, not just distribution.The trio of models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — are available immediately through Microsoft Foundry and a new MAI Playground. They span three of the most commercially valuable modalities in enterprise AI: converting speech to text, generating realistic human voice, and creating images. Together, they represent the opening salvo from Microsoft’s superintelligence team, which Suleyman formed just six months ago to pursue what he calls “AI self-sufficiency.””I’m very excited that we’ve now got the first models out, which are the very best in the world for transcription,” Suleyman told VentureBeat in an exclusive interview ahead of the launch. “Not only that, we’re able to deliver the model with half the GPUs of the state-of-the-art competition.”The announcement lands at a precarious moment for Microsoft. The company’s stock just closed its worst quarter since the 2008 financial crisis, as investors increasingly demand proof that hundreds of billions of dollars in AI infrastructure spending will translate into revenue. These models — priced aggressively and positioned to reduce Microsoft’s own cost of goods sold — are Suleyman’s first answer to that pressure.Microsoft’s new transcription model claims best-in-class accuracy across 25 languagesMAI-Transcribe-1 is the headline release. The speech-to-text model achieves the lowest average Word Error Rate on the FLEURS benchmark — the industry-standard multilingual test — across the top 25 languages by Microsoft product usage, averaging 3.8% WER. According to Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 each.The model uses a transformer-based text decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC files up to 200MB, and Microsoft says its batch transcription speed is 2.5 times faster than the existing Microsoft Azure Fast offering. Diarization, contextual biasing, and streaming are listed as “coming soon.” Microsoft is already testing …

Article Attribution | Read More at Article Source