One of the most commonly used techniques to enhance the efficiency of AI models is quantization. This method involves reducing the number of bits required to represent information in a computer. To put it simply, it’s like saying “noon” instead of the exact time with all the seconds and milliseconds. The level of precision needed depends on the situation.
AI models are made up of several components that can be quantized, most notably the parameters, the internal variables a model uses to make predictions. A model represented with fewer bits needs less memory and less compute to run. However, the trade-offs that come with quantization are now becoming more apparent.
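To make the idea concrete, here is a minimal sketch, assuming nothing beyond NumPy, of how a small array of model weights might be quantized from 32-bit floats down to 8-bit integers using simple symmetric rounding. The weight values and helper names are invented for illustration; production quantization schemes are considerably more sophisticated.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric, per-tensor quantization of float weights to int8."""
    # Choose a scale so the largest absolute weight maps to the int8 limit (127).
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

# Toy example: each weight shrinks from 4 bytes to 1 byte,
# at the cost of a small rounding error.
weights = np.array([0.12, -0.98, 0.45, 0.003, -0.27], dtype=np.float32)
q, scale = quantize_int8(weights)
print("int8 values:", q)
print("max reconstruction error:", np.abs(weights - dequantize_int8(q, scale)).max())
```

The quantized copy takes a quarter of the memory; the price is the small rounding error printed at the end, the same kind of imprecision as saying "noon" instead of the exact time.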
A recent study by researchers from Harvard, Stanford, MIT, Databricks, and Carnegie Mellon found that quantized models tend to perform worse if the original, unquantized model was trained for a long time on large amounts of data. In other words, past a certain point it may be better to train a smaller model from scratch than to shrink a larger one.
This could pose challenges for AI companies that train massive models to improve accuracy and then quantize them to cut serving costs. The effects are already visible, with reports suggesting that quantizing certain models, such as Meta's Llama 3, can be noticeably harmful, likely because of how those models were trained.
While many AI labs focus on training models on vast datasets to enhance AI capabilities, there is evidence that scaling up eventually reaches a point of diminishing returns. Despite this, the industry seems reluctant to move away from this approach.
To address the degradation of quantized models, researchers suggest training models in low precision to improve robustness. Precision refers to the number of digits a numerical data type can accurately represent. Most models today are trained at 16-bit or half precision and post-train quantized to 8-bit precision for efficiency.
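To illustrate the memory side of that pipeline, here is a small sketch, again using NumPy and an invented layer rather than any real model, of half-precision weights being post-train quantized to 8 bits. Halving the bytes per weight is exactly the efficiency gain the technique is after; the study's concern is what the rounding does to quality once the original model has been trained long enough.

```python
import numpy as np

# A hypothetical layer of weights stored in half precision (16-bit floats),
# the format most models are trained in today.
rng = np.random.default_rng(0)
fp16_weights = rng.normal(size=(1024, 1024)).astype(np.float16)

# Post-training quantization: map the half-precision values onto 8-bit integers
# using a single per-tensor scale (real toolchains typically use per-channel
# scales and calibration data).
scale = float(np.abs(fp16_weights).max()) / 127.0
int8_weights = np.clip(np.round(fp16_weights / scale), -127, 127).astype(np.int8)

print(f"fp16 weights: {fp16_weights.nbytes / 1e6:.1f} MB")  # ~2.1 MB
print(f"int8 weights: {int8_weights.nbytes / 1e6:.1f} MB")  # ~1.0 MB
```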
Although hardware vendors like Nvidia are pushing lower precision for quantized model inference, extremely low precision can cause a noticeable drop in quality unless the original model is exceptionally large. This underscores how complex AI models are and how little room there is to reduce precision without hurting performance.
In short, reducing inference costs through quantization has real limits, and there is no one-size-fits-all solution. The researchers stress the importance of understanding these trade-offs and of developing new architectures that can stabilize low-precision training. Careful data curation and filtering, combined with such architectures, could make future AI models both more efficient and more capable.