Voice agents have been expensive to run and painful to orchestrate, not because the models can’t handle conversation, but because context ceilings forced enterprises to build session resets, state compression, and reconstruction layers into every deployment. OpenAI’s three new voice models are designed to reduce that overhead, and they change how engineers can think about building voice into a larger agent stack.

GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper integrate real-time audio into the model management stack as discrete orchestration primitives, separating conversational reasoning, translation, and transcription into specialized components rather than bundling them in a single voice product.

The company said in a blog post that Realtime-2 is its first voice model “with GPT-5 class reasoning” and can handle difficult requests while keeping conversations flowing naturally. Realtime-Translate understands more than 70 languages and translates them into 13 others at the speaker’s pace, and Realtime-Whisper is its new speech-to-text transcription model.

These three functions no longer sit inside a single stack or model. GPT-Realtime-2 could technically handle transcription, but OpenAI is routing distinct tasks to specialized models: Realtime-Translate for multilingual speech and Realtime-Whisper for transcription. Enterprises can assign each task to the appropriate model rather than routing everything through a single, all-encompassing voice system; a minimal sketch of this routing pattern appears at the end of this section.

The new OpenAI models compete against Mistral’s Voxtral models, which also break transcription out as a dedicated capability and target enterprise use cases.

What enterprises should do

More enterprises are seeing the value of voice agents, both because more people are becoming comfortable conversing with an AI agent and because voice customer interactions yield rich data.

Organizations evaluating these models will need to consider their orchestration architecture, not just model quality …
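To make the routing idea concrete, here is a minimal sketch of a task-to-model dispatch layer. This is an illustration, not OpenAI’s published API: the model identifier strings, the VoiceRequest shape, and the route function are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical model identifiers for illustration only; the real model
# names and calling conventions may differ from what OpenAI ships.
TASK_MODELS = {
    "conversation": "gpt-realtime-2",
    "translation": "gpt-realtime-translate",
    "transcription": "gpt-realtime-whisper",
}

@dataclass
class VoiceRequest:
    task: str                        # "conversation", "translation", or "transcription"
    audio: bytes                     # raw audio payload
    target_lang: str | None = None   # only meaningful for translation tasks

def route(request: VoiceRequest) -> str:
    """Pick the specialized model for a task instead of sending every
    request through one general-purpose voice model."""
    try:
        return TASK_MODELS[request.task]
    except KeyError:
        raise ValueError(f"unknown voice task: {request.task}")

# Usage: a support conversation and a live-translation turn are routed to
# different models, so each can be monitored, priced, and swapped independently.
support_turn = VoiceRequest(task="conversation", audio=b"...")
translate_turn = VoiceRequest(task="translation", audio=b"...", target_lang="es")
print(route(support_turn))    # gpt-realtime-2
print(route(translate_turn))  # gpt-realtime-translate
```

Keeping the dispatch table explicit is the point of the pattern: each task can be scaled, priced, or replaced with another vendor’s model without touching the conversational path.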