One of the most widely used techniques to make AI models more efficient, quantization, has limits — and the industry could be fast approaching them.
In the context of AI, quantization refers to lowering the number of bits — the smallest units a computer can process — needed to represent information. Consider this analogy: When someone asks the time, you’d probably say “noon” — not “oh twelve hundred, one second, and four milliseconds.” That’s quantizing; both answers are correct, but one is slightly more precise. How much precision you actually need depends on the context.
AI models consist of several components that can be quantized, most notably their parameters, the internal variables models use to make predictions or decisions. This is convenient, considering models perform millions of calculations when run. Quantized models with fewer bits representing their parameters are less demanding mathematically, and therefore computationally. (To be clear, this is a different process from “distilling,” which trains a smaller model to mimic the outputs of a larger one rather than lowering the precision of an existing model’s parameters.)
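To make the idea concrete, here is a minimal sketch of one common approach, symmetric 8-bit quantization, written in plain Python. It is an illustration under stated assumptions, not the method used by any particular model or study: the function names `quantize_int8` and `dequantize` are hypothetical, and the weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero list, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up "parameters" for illustration only.
weights = [0.82, -1.27, 0.003, 0.5]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integers:", q)
print("restored:", restored)
print("largest rounding error:", max_error)
```

Each weight now fits in 8 bits instead of the 16 or 32 typically used for a floating-point value, at the cost of a small rounding error. Note that the tiny weight 0.003 collapses to zero, which is the kind of lost precision the trade-off involves.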
But quantization may have more trade-offs than previously assumed.
The ever-shrinking model
According to a study from researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon, quantized models perform worse if the original, unquantized version of the model was trained over a long period on lots of data. In other words, at a certain point, it may actually be better to just train a smaller model rather than cook down a big one.
That could spell bad news for AI companies training extremely large models (known to improve answer quality) and then quantizing them in an effort to make them less expensive to serve.
The effects are already manifesting. A few months ago, developers and academics reported that quantizing Meta’s Llama 3 model tended to be “more harmful” compared to other models, potentially due to the way it wa …