
DeepMind has introduced a new approach called Parameter Efficient Expert Retrieval (PEER) to address the limitations of current Mixture-of-Experts (MoE) techniques used in scaling large language models. While MoE has been effective at increasing model capacity without a proportional increase in computational cost, it has been limited in the number of experts it can practically accommodate. PEER aims to scale MoE models to millions of experts, offering a more efficient performance-compute tradeoff for large language models.

Scaling language models has been a major focus in recent years, as increasing the number of parameters generally yields better capabilities. However, scaling beyond a certain point runs into computational and memory constraints. Within each transformer block, the attention and feedforward layers do most of the work, and the feedforward layers account for the majority of the block's parameters.
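To see why the feedforward layers dominate the parameter budget, a back-of-the-envelope count for a single transformer block is enough; the hidden size and 4x expansion factor below are common illustrative choices, not figures from DeepMind's paper:

```python
# Rough parameter count for one transformer block (illustrative sizes only).
d_model = 4096           # model hidden dimension (assumed)
d_ff = 4 * d_model       # feedforward expansion, a common 4x convention (assumed)

# Self-attention: query, key, value, and output projections, each d_model x d_model.
attn_params = 4 * d_model * d_model

# Feedforward: two linear layers, d_model -> d_ff and d_ff -> d_model.
ffn_params = 2 * d_model * d_ff

total = attn_params + ffn_params
print(f"attention: {attn_params / 1e6:.1f}M parameters")               # 67.1M
print(f"feedforward: {ffn_params / 1e6:.1f}M parameters")              # 134.2M
print(f"feedforward share of block: {100 * ffn_params / total:.0f}%")  # 67%
```

Under these assumptions the feedforward sublayer holds roughly two-thirds of the block's parameters, which is why it is the natural target for sparsification.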

MoE addresses these challenges by replacing the dense feedforward layer with specialized expert modules, each containing a fraction of the parameters of the full layer. By routing each input to only a few experts, MoE allows for increased model capacity without a proportional increase in computational cost. The number of experts in an MoE model plays a key role in its performance, with studies suggesting that higher granularity, meaning more and smaller experts, can lead to better results.
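As a rough illustration of that routing idea, here is a minimal top-k MoE feedforward layer in PyTorch; it is a sketch under assumed sizes, not the implementation used in any particular model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal top-k mixture-of-experts feedforward layer (illustrative sketch)."""

    def __init__(self, d_model=512, d_expert=1024, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feedforward network holding only a fraction
        # of the parameters of a full dense layer of comparable capacity.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(num_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, d_model)
        scores = self.router(x)                      # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute scales with
        # top_k rather than with the total number of experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

A dense feedforward layer would apply all of its parameters to every token; here only top_k of the num_experts expert networks run per token, which is where the improved performance-compute tradeoff comes from.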

PEER differs from existing MoE architectures by using a learned index to efficiently route each input to a very large pool of experts. Its experts are tiny, each a single neuron in the hidden layer, so the layer effectively assembles its feedforward computation for every token from a shared pool of hidden neurons, which promotes knowledge transfer and parameter efficiency. Additionally, PEER's multi-head retrieval approach, in which several query heads select experts from the same shared pool, helps it handle a vast number of experts efficiently.
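The sketch below captures the flavor of that design in a simplified form: a shared pool of single-neuron experts, a learned set of keys acting as the retrieval index, and several query heads selecting from the same pool. It scores every key by brute force, whereas PEER's actual retrieval scheme keeps selection cheap even with millions of experts; all sizes here are assumptions for illustration and far smaller than what the paper targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertRetrievalLayer(nn.Module):
    """Simplified PEER-style layer: many single-neuron experts selected by a
    learned key lookup (illustrative sketch, not the paper's implementation)."""

    def __init__(self, d_model=512, num_experts=16384, top_k=16, num_heads=4):
        super().__init__()
        # Each "expert" is one hidden neuron: a down-projection vector and an
        # up-projection vector, both of size d_model.
        self.down = nn.Embedding(num_experts, d_model)
        self.up = nn.Embedding(num_experts, d_model)
        # Learned keys used as the retrieval index over the expert pool.
        self.keys = nn.Parameter(torch.randn(num_experts, d_model) / d_model ** 0.5)
        # Multi-head retrieval: several query projections share the same experts.
        self.queries = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_heads)])
        self.top_k = top_k

    def forward(self, x):                              # x: (num_tokens, d_model)
        out = torch.zeros_like(x)
        for query_proj in self.queries:
            q = query_proj(x)                          # (num_tokens, d_model)
            scores = q @ self.keys.t()                 # (num_tokens, num_experts)
            topv, topi = scores.topk(self.top_k, dim=-1)
            gate = F.softmax(topv, dim=-1)             # (num_tokens, top_k)
            down = self.down(topi)                     # (num_tokens, top_k, d_model)
            up = self.up(topi)                         # (num_tokens, top_k, d_model)
            # Each retrieved expert computes one hidden activation, then its
            # up-projection vector is accumulated into the output.
            h = F.relu((down * x.unsqueeze(1)).sum(-1))        # (num_tokens, top_k)
            out += ((gate * h).unsqueeze(-1) * up).sum(dim=1)  # (num_tokens, d_model)
        return out

layer = TinyExpertRetrievalLayer()
tokens = torch.randn(8, 512)           # a small batch of token representations
print(layer(tokens).shape)             # torch.Size([8, 512])
```

Because each retrieved expert contributes only one hidden activation and one output vector, the selected experts together behave like a sparse, per-token slice of one very large feedforward layer.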

Potential applications of PEER include fine-tuning models for new tasks and dynamically adding new knowledge and features to language models. In DeepMind's evaluations, PEER achieves a better performance-compute tradeoff than dense feedforward layers and other MoE architectures, and increasing the number of experts in a PEER model further reduces perplexity.

The findings from DeepMind’s research challenge the conventional belief that MoE models are limited by the number of experts they can accommodate. PEER’s approach to scaling MoE to millions of experts opens up new possibilities for reducing training and serving costs for large language models. This innovative architecture could have significant implications for the future development of language models and their applications across various domains.