Nvidia’s new technique cuts LLM reasoning costs by 8x without losing accuracy

Feb 12, 2026 | Technology

Researchers at Nvidia have developed a technique that can reduce the memory costs of large language model reasoning by up to eight times. Their technique, called dynamic memory sparsification (DMS), compresses the key-value (KV) cache, the temporary memory LLMs generate and store as they process prompts and reason through problems and documents.

While researchers have proposed various methods to compress this cache before, most struggle to do so without degrading the model’s intelligence. Nvidia’s approach manages to discard much of the cache while maintaining (and in some cases improving) the model’s reasoning capabilities.

Experiments show that DMS enables LLMs to “think” longer and explore more solutions without the usual penalty in speed or memory costs.

The bottleneck of reasoning

LLMs improve their performance on complex tasks by generating “chain-of-thought” tokens, essentially writing out their reasoning steps before arriving at a final answer. Inference-time scaling techniques leverage this by giving the model a larger budget to generate these thinking tokens or to explore multiple potential reasoning paths in parallel.

However, this improved reasoning comes with a significant computational cost. As the model generates more tokens, it builds up a KV cache. For real-world applications, the KV cache is a major bottleneck: as the reasoning chain grows, the cache grows linearly, consuming vast amounts of GPU memory. This forces the hardware to spend more time reading data from memory than actually computing, which slows down generation and increases latency. It also caps the number of users a system can serve simultaneously, as running out of VRAM causes the system to crash or slow to a crawl.

Nvidia researchers frame this not just as a technical hurdle, but as a fundamental economic one for the enterprise. “The question isn’t just about hardware quantity; it’s about whether your infrastructure is processing 100 reasoning threads or 800 threads for the same cost,” P …
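To get a sense of why the cache becomes the bottleneck, it helps to put rough numbers on it: every generated token stores a key vector and a value vector in each attention layer, so the cache scales with layers × KV heads × head dimension × sequence length. The Python sketch below is a back-of-envelope estimate only; the model shape (32 layers, 8 KV heads, 128-dimensional heads, fp16 storage) is an assumed, Llama-style configuration, not a figure from Nvidia's work.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Rough KV-cache size: keys + values for every layer, KV head, and token.

    bytes_per_elem=2 assumes fp16/bf16 storage. The model shape passed in
    below is illustrative, not taken from any specific Nvidia model.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed, illustrative model shape with grouped-query attention.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_000)
print(f"{size / 1e9:.1f} GB per sequence")  # ~4.2 GB for a 32,000-token reasoning chain
```

At roughly 4 GB per 32,000-token reasoning chain under these assumptions, a few dozen concurrent reasoning threads already exceed the VRAM of a single GPU, which is the capacity ceiling the article describes; an eight-fold compression of that cache raises the ceiling accordingly.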
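The article does not describe how DMS decides which cache entries to discard, so the snippet below should not be read as Nvidia's algorithm. It is a hypothetical, minimal illustration of KV-cache sparsification in general: keep a fixed budget of key/value pairs per attention head and evict the entries that have mattered least, so each new token has far less cached data to read. The function name, the scoring rule, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def sparsify_kv_cache(keys: np.ndarray, values: np.ndarray,
                      attn_scores: np.ndarray, budget: int):
    """Hypothetical KV-cache sparsification (not Nvidia's DMS).

    keys, values : (seq_len, head_dim) cached entries for one attention head.
    attn_scores  : (seq_len,) accumulated attention each cached token has received.
    budget       : number of entries to keep.

    Keeps the `budget` most-attended entries in their original order and
    discards the rest, shrinking the cache read on every decoding step.
    """
    if keys.shape[0] <= budget:
        return keys, values
    keep = np.sort(np.argsort(attn_scores)[-budget:])  # top-budget indices, original order
    return keys[keep], values[keep]


# Toy example: an 8x compression of a 4,096-entry cache down to 512 entries.
rng = np.random.default_rng(0)
k = rng.standard_normal((4096, 128)).astype(np.float16)
v = rng.standard_normal((4096, 128)).astype(np.float16)
scores = rng.random(4096)  # stand-in for accumulated attention weights
k_small, v_small = sparsify_kv_cache(k, v, scores, budget=512)
print(k.nbytes / k_small.nbytes)  # -> 8.0
```

A heuristic like "keep the most-attended entries" is only one of many possible scoring rules, and a learned method such as DMS may behave quite differently; what carries over is the arithmetic at the end, where an 8x smaller cache means 8x less memory to hold and to stream through on every decoding step.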
