Mere hours after OpenAI updated its flagship foundation model GPT-5 to GPT-5.1, promising reduced token usage overall and a more pleasant personality with more preset options, Chinese search giant Baidu unveiled its next-generation foundation model, ERNIE 5.0, alongside a suite of AI product upgrades and strategic international expansions. The goal: to position itself as a global contender in the increasingly competitive enterprise AI market.

Announced at the company’s Baidu World 2025 event, ERNIE 5.0 is a proprietary, natively omni-modal model designed to jointly process and generate content across text, images, audio, and video. Unlike Baidu’s recently released ERNIE-4.5-VL-28B-A3B-Thinking, which is open source under an enterprise-friendly and permissive Apache 2.0 license, ERNIE 5.0 is available only via Baidu’s ERNIE Bot website (I needed to select it manually from the model picker dropdown) and the Qianfan cloud platform application programming interface (API) for enterprise customers.

Alongside the model launch, Baidu introduced major updates to its digital human platform, no-code tools, and general-purpose AI agents, all targeted at expanding its AI footprint beyond China. The company also introduced ERNIE 5.0 Preview 1022, a variant optimized for text-intensive tasks, alongside the general preview model that balances across modalities.

Baidu emphasized that ERNIE 5.0 represents a shift in how intelligence is deployed at scale, with CEO Robin Li stating: “When you internalize AI, it becomes a native capability and transforms intelligence from a cost into a source of productivity.”

Where ERNIE 5.0 outshines GPT-5 and Gemini 2.5 Pro

ERNIE 5.0’s benchmark results suggest that Baidu has achieved parity, or near-parity, with the top Western foundation models across a wide spectrum of tasks. In public benchmark slides shared during the Baidu World 2025 event, ERNIE 5.0 Preview outperformed or matched OpenAI’s GPT-5-High and Google’s Gemini 2.5 Pro in multimodal reasoning, document understanding, and image-based QA, while also demonstrating strong language modeling and code execution abilities. The company emphasized the model’s ability to handle joint inputs and outputs across modalities, rather than relying on post-hoc modality fusion, which it fra …
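For enterprise developers weighing the Qianfan route mentioned above, here is a minimal sketch of what a chat call might look like. It assumes Qianfan exposes an OpenAI-compatible chat-completions endpoint and uses a hypothetical model identifier ("ernie-5.0"); the base URL, model name, and authentication details are assumptions to be confirmed against Baidu's Qianfan documentation, not official values.

```python
# A minimal sketch, not an official Baidu example. Assumes Qianfan offers an
# OpenAI-compatible chat-completions endpoint; the base_url and model ID below
# are placeholders and should be verified in the Qianfan documentation.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_QIANFAN_API_KEY",              # credential issued from the Qianfan console
    base_url="https://qianfan.baidubce.com/v2",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="ernie-5.0",  # hypothetical model ID, for illustration only
    messages=[
        {
            "role": "user",
            "content": "Summarize this quarterly report in three bullet points.",
        }
    ],
)

# Print the model's reply text
print(response.choices[0].message.content)
```

If Qianfan's interface does follow the OpenAI-compatible convention, swapping an existing GPT-based integration over to ERNIE 5.0 would mostly be a matter of changing the base URL, API key, and model name, which is presumably part of the appeal for enterprise customers evaluating alternatives.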