If the AI industry had an equivalent to the recording industry’s “song of the summer” (a hit that catches on in the warmer months here in the Northern Hemisphere and is heard playing everywhere), the title would clearly go to Alibaba’s Qwen Team.
Over just the past week, the frontier model AI research division of the Chinese e-commerce behemoth has released not one, not two, not three, but four (!!) new open source generative AI models, each posting record-setting benchmark scores that best even some leading proprietary options.
Last night, Qwen Team capped it off with the release of Qwen3-235B-A22B-Thinking-2507, its updated reasoning large language model (LLM). Reasoning models take longer to respond than a non-reasoning or “instruct” LLM, engaging in “chains of thought,” self-reflection, and self-checking that ideally yield more correct and comprehensive responses on more difficult tasks.
Indeed, the new Qwen3-Thinking-2507, as we’ll call it for short, now leads or closely trails top-performing models across several major benchmarks.
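For developers who want to kick the tires, the weights are released openly on Hugging Face and can be driven through the standard Transformers chat workflow. The sketch below is illustrative rather than official: it assumes the checkpoint ID Qwen/Qwen3-235B-A22B-Thinking-2507 and the Qwen3 convention of a special </think> token (assumed id 151668) marking where the chain-of-thought ends and the user-facing answer begins.

```python
# Minimal sketch of querying the new reasoning model with Hugging Face Transformers.
# Assumptions: the checkpoint ID below, and the Qwen3 convention of a </think>
# token (id 151668) separating the chain-of-thought from the final answer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B-Thinking-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning models emit long chains of thought, so allow a generous token budget.
output_ids = model.generate(**inputs, max_new_tokens=32768)[0][
    len(inputs.input_ids[0]):
].tolist()

# Split the self-reflection ("thinking") from the final answer at </think>.
THINK_END_ID = 151668  # assumed id of the </think> token in Qwen3 tokenizers
try:
    split = len(output_ids) - output_ids[::-1].index(THINK_END_ID)
except ValueError:
    split = 0  # no </think> found; treat the whole output as the answer

thinking = tokenizer.decode(output_ids[:split], skip_special_tokens=True)
answer = tokenizer.decode(output_ids[split:], skip_special_tokens=True)
print(answer)
```

Separating the two spans this way lets an application log or hide the model’s lengthy self-checking while showing users only the final response.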
As AI influencer and news aggregator Andrew Curran wrote on X: “Qwen’s strongest reasoning model has arrived, and it is at the frontier.”
On the AIME25 benchmark, designed to evaluate problem-solving ability in mathematical and logical contexts, Qwen3-Thinking-2507 scores 92.3, narrowly trailing OpenAI’s o4-mini (92.7) while clearly outpacing Gemini-2.5 Pro (88.0).
The model also shows a commanding performance on LiveCodeBench v6, scoring 74.1, ahead of Google’s Gemini-2.5 Pro (72.5) and OpenAI’s o4-mini (71.8), and significantly outperforming its earlier version, which posted 55.7.
In GPQA, a benchmark for graduate-level multiple-choice questions, the model achieves 81.1, nearly matching Deepseek-R1-0528 (81.0) and trailing Gemini-2.5 Pro’s top mark of 86.4.
On Arena-Hard v2, which evaluates alignment and subjective preference through win rates, Qwen3-Thinking-2507 scores 79.7, placing it ahead of all competitors.
The results show that this model not only surpasses its predecessor in every major category but also sets a new standard for what open-source, reasoning-focused models can achieve.
A shift away from ‘hybrid reasoning’
The release …