Researchers claim to have found a new way to run AI language models more efficiently by eliminating matrix multiplication from the process. Matrix multiplication is a core neural network operation that GPU chips currently accelerate. The team, drawn from the University of California, Santa Cruz, UC Davis, LuxiTech, and Soochow University, detailed its findings in a recent preprint paper.
Matrix multiplication, also known as “MatMul,” dominates the computational workload of most neural networks today. GPUs excel at executing these operations in parallel, which is a major reason Nvidia holds a dominant share of the data center GPU market. These chips currently power AI systems such as ChatGPT and Google Gemini.
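To see why this one operation matters so much, consider a single dense layer: nearly all of its compute is one matrix multiplication between inputs and weights. The NumPy sketch below is purely illustrative (the shapes and names are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((1, 512))    # one input vector (batch size 1)
W = rng.standard_normal((512, 512))  # learned weight matrix

# The forward pass of a dense layer is, at its core, one MatMul:
# roughly 512 * 512 multiply-add operations for this single example.
y = x @ W
print(y.shape)  # (1, 512)
```

Stacked across dozens of layers and billions of parameters, these multiply-add operations are exactly the workload GPUs are built to accelerate.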
In their paper, titled “Scalable MatMul-free Language Modeling,” the researchers describe building a custom 2.7 billion parameter model without using MatMul that performs comparably to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU accelerated by a custom-programmed FPGA chip that draws about 13 watts of power.
The researchers believe that pairing their approach with more efficient hardware, such as FPGAs, could lead to hardware-friendly architectures that consume far less power, with significant implications for the environmental impact and operational costs of AI systems. While the technique has not yet been peer-reviewed, the researchers argue that their results challenge the prevailing assumption that matrix multiplication is indispensable for building high-performing language models.
By eliminating matrix multiplications, the researchers aim to make large language models more accessible, efficient, and sustainable, especially for deployment on resource-constrained hardware like smartphones. They cite the earlier BitNet technique, which demonstrated that binary and ternary weights can work in language models, as an important precursor to their work. However, BitNet still relied on matrix multiplications in its self-attention mechanism, which prompted the researchers to design a fully “MatMul-free” architecture that removes those operations from attention as well while maintaining performance.
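As a rough intuition for why ternary weights help, once every weight is restricted to -1, 0, or +1, a matrix multiplication collapses into additions and subtractions. The sketch below illustrates that equivalence on random data; it is a toy demonstration of the general idea, not the authors' implementation or kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(512)              # input activations
W = rng.integers(-1, 2, size=(512, 256))  # hypothetical ternary weights in {-1, 0, +1}

# Dense layer the usual way: a full matrix multiplication.
y_matmul = x @ W

# Ternary shortcut: with weights limited to -1, 0, or +1, each output
# element is just a signed sum of selected inputs, i.e. additions and
# subtractions with no multiplications at all.
y_ternary = np.array([x[W[:, j] == 1].sum() - x[W[:, j] == -1].sum()
                      for j in range(W.shape[1])])

print(np.allclose(y_matmul, y_ternary))   # True
```

Hardware that only needs to add and subtract can be far simpler and less power-hungry than hardware built for dense floating-point multiplication, which is the property the MatMul-free approach tries to exploit end to end.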