MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost

May 27, 2026 | Technology

Among the many Chinese AI companies and laboratories vying for market share and attention (no pun intended) in the global marketplace, MiniMax stands out for its commitment to providing frontier-level intelligence across a range of modalities, including text, coding, and video (through its Hailuo model series), often under permissive, enterprise-friendly, standard open source licenses.

Now, MiniMax is again raising the eyebrows of AI power users and developers around the world by releasing a new, in-depth technical report on the making of its popular M2 series of language models (M2, M2.5, and M2.7), shedding light on its numerous engineering innovations and clever approaches. The company and its leaders also teased a whole new sparse attention approach for its upcoming MiniMax M3 series of models, which it says yields up to 15.6 times faster decoding (or LLM response) speed at long contexts (a million tokens) by adopting a custom sub-quadratic framework. In so doing, MiniMax has designed M3 to make ultra-long-context AI agent deployment economically viable.

The M2 report is noteworthy for any enterprise working with AI models, and especially those looking to fine-tune and train their own in-house. After all, MiniMax’s M2 series models often topped open source AI performance benchmarks worldwide when they were released. While that title has since been claimed by several other Chinese labs, including DeepSeek and Xiaomi, MiniMax’s new report offers a blueprint that enterprises around the world can use to improve AI model and agent performance.

As Adina Yakup of Hugging Face observed on X, “Beyond the benchmarks, they’ve done some really solid work on MoE efficiency and agent oriented design. Excited to see where M3 goes next!”

The attention dilemma

The core technical architecture of the M2 series relies on a sparse Mixture-of-Experts (MoE) decoder-only Transformer layout used by numerous other state-of-the-art LLMs.

The foundational backbone houses 229.9 billion total parameters, yet maintains a remarkably lean operational footprint by activating just 9.8 billion parameters per token across 256 fine-grained experts. To optimize routing and avoid …
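
To make the parameter figures quoted above concrete, here is a minimal Python sketch of what sparse expert activation means in practice. It is an illustration under stated assumptions, not MiniMax's implementation: the totals (229.9 billion parameters, 9.8 billion active, 256 experts) come from the article, while `TOP_K`, the random router scores, and the function names are hypothetical stand-ins, since the excerpt does not say how many experts fire per token or how the router is built.

```python
import math
import random

# Figures quoted in the article.
TOTAL_PARAMS_B = 229.9   # total parameters, in billions
ACTIVE_PARAMS_B = 9.8    # parameters activated per token, in billions
NUM_EXPERTS = 256        # fine-grained experts

# ASSUMPTION: the article does not state how many experts are selected
# per token. 8 is a placeholder used only for illustration.
TOP_K = 8


def active_fraction(active_b, total_b):
    """Share of the model's parameters that participate in one token."""
    return active_b / total_b


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, k):
    """Generic top-k gating: keep the k highest-probability experts and
    renormalize their weights so they sum to 1."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Active share per token: {share:.1%}")

    # Hypothetical router output: one random score per expert.
    random.seed(0)
    scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    gates = route_token(scores, TOP_K)
    print(f"Experts used for this token: {sorted(gates)}")
```

Dividing 9.8 by 229.9 shows that roughly 4.3% of the model's weights take part in any single token, which is where the "lean operational footprint" comes from. The remaining experts stay idle for that token but are still held in memory.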
