The generative AI era began for most people with the launch of OpenAI’s ChatGPT in late 2022, but the underlying technology dates back to Google’s seminal 2017 paper “Attention Is All You Need.” That paper introduced the “Transformer” neural network architecture, which allows AI models to weigh the importance of different words in a sentence (or pixels in an image) differently and to train on information in parallel.

Yet while Transformers deliver unparalleled model quality and have underpinned most of the major generative AI models used today, they are computationally gluttonous. Their compute grows quadratically with sequence length and their memory grows linearly, which makes large-scale inference an expensive, often prohibitive, endeavor. That cost led some researchers to develop a new architecture, Mamba, in 2023, which has since been included in hybrid Mamba-Transformer models like Nvidia’s Nemotron 3 Super.

Now, the same researchers behind the original Mamba architecture, including leaders Albert Gu of Carnegie Mellon and Tri Dao of Princeton, have released the latest version of their architecture, Mamba-3, as a language model under a permissive Apache 2.0 open source license, making it immediately available to developers, including enterprises for commercial purposes. A technical paper has also been published on arXiv.org.

This model signals a paradigm shift from training efficiency to an “inference-first” design. As Gu noted in the official announcement, while Mamba-2 focused on breaking pretraining bottlenecks, Mamba-3 aims to solve the “cold GPU” problem: the reality that during decoding, modern hardware often remains idle, waiting for memory movement rather than performing computation.

Perplexity (no, not the company) and the newfound efficiency of Mamba-3

Mamba, including Mamba-3, is a type of State Space Model (SSM). An SSM is effectively a high-speed “summary machine” for AI. While many popular models (like the ones behind ChatGPT) have to re-examin …