Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) aimed at reducing memory demands without sacrificing output quality. The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has released the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, allowing organizations to use it, modify it, and deploy it commercially, all for free.

Across models of different sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit-width. Models that previously required more than 60 GB of memory can run on roughly 20 GB setups, a critical enabler for running large models on a single high-end GPU or even on multi-GPU consumer-grade setups.

In practice, this means models that once demanded enterprise GPUs such as NVIDIA’s A100 80GB (around $19,000) or H100 (upwards of $30,000) can now run on far more affordable hardware, such as a single NVIDIA GeForce RTX 4090 (around $1,600).

For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs like the RTX 4090 are available on many platforms for $1–1.50 per hour. Over extended inference workloads, that difference can add up to thousands of dollars, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups previously constrained by memory.

Tackling the Memory Challenge of LLMs

Running large models often requires compromises between performance and size. In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values: very small, very large, and with fractional parts.

This flexibility is helpful because weights and activations can vary dramatically in scale during training and inference, and floating point lets the model represent them precisely. (For example, a weight could be 0.0023 or 123.45, and floating point can capture both with decent precision.)
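To make the memory arithmetic behind those figures concrete, here is a minimal back-of-the-envelope sketch. It is not part of the SINQ release; the 32-billion-parameter model size and the effective bits-per-weight values are illustrative assumptions chosen to roughly match the "over 60 GB down to about 20 GB" example above.

```python
# Back-of-the-envelope memory estimate for storing model weights.
# The 32B parameter count and the ~5 effective bits per quantized weight
# (4-bit values plus per-group scale metadata) are illustrative assumptions,
# not figures taken from the SINQ paper.

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to store the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 32e9  # hypothetical 32-billion-parameter model

fp16_gb = weight_memory_gb(params, 16.0)  # standard half-precision floats
int4_gb = weight_memory_gb(params, 5.0)   # ~4-bit weights plus quantization overhead

print(f"FP16 weights:      {fp16_gb:.1f} GB")   # ~64 GB
print(f"~4-bit quantized:  {int4_gb:.1f} GB")   # ~20 GB
print(f"Reduction:         {(1 - int4_gb / fp16_gb) * 100:.0f}%")  # ~69%
```

Note that this only counts the weights; activations and the KV cache add to the footprint at inference time, so real deployments need some additional headroom beyond these estimates.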