Attention is a crucial element in the transformer architecture used in large language models (LLMs). However, as LLMs continue to grow in size and handle longer input sequences, the computational cost of attention becomes a bottleneck.
To tackle this challenge, a team of researchers from Colfax Research, Meta, Nvidia, Georgia Tech, Princeton University, and Together AI has developed FlashAttention-3, a new technique that significantly accelerates attention computation on Nvidia Hopper GPUs (H100 and H800).
The transformer’s attention mechanism allows the model to weigh the relationships between tokens in an input sequence. While highly effective, it is computationally expensive, with cost growing quadratically with sequence length. As LLMs scale up to process longer sequences, attention becomes a major bottleneck.
Modern hardware accelerators like GPUs are optimized for matrix multiplication, the building block of deep learning models. Attention, however, mixes matrix multiplications with special functions that GPUs handle far less efficiently. The softmax used to normalize attention scores, for instance, relies on exponentiation, which runs at much lower throughput than matrix multiplication on modern GPUs and can therefore become the limiting step.
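To make both points concrete, here is a minimal NumPy sketch of standard (naive) scaled dot-product attention for a single head. All names and sizes are illustrative; note the (N, N) score matrix, which grows quadratically with sequence length, and the softmax exponentiation sandwiched between the two matrix multiplications.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention for one head.

    Q, K, V: (seq_len, head_dim) arrays. The `scores` matrix is
    (seq_len, seq_len), so both compute and memory grow quadratically
    with sequence length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # matmul: (N, N) scores
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)                         # softmax exponentiation
    weights /= weights.sum(axis=-1, keepdims=True)   # normalize over keys
    return weights @ V                               # matmul: (N, head_dim)

# Doubling the sequence length quadruples the size of `scores`.
N, d = 4096, 128
Q = np.random.randn(N, d).astype(np.float32)
K = np.random.randn(N, d).astype(np.float32)
V = np.random.randn(N, d).astype(np.float32)
print(naive_attention(Q, K, V).shape)  # (4096, 128)
```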
FlashAttention-3 builds on the success of its predecessors, FlashAttention and FlashAttention-2, by further optimizing how resources are used on Nvidia Hopper GPUs. By scheduling operations so that computation overlaps with data movement between GPU memory tiers, interleaving the block-wise matrix multiplications with softmax, and adding low-precision techniques that speed up computation while limiting accuracy loss, FlashAttention-3 reaches up to 75% of the H100 GPU’s maximum throughput, a 1.5–2x speedup over previous versions.
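The Hopper-specific scheduling and low-precision paths live in CUDA kernels, but the idea shared by the whole FlashAttention family, computing attention block by block with an online softmax so the full score matrix is never materialized, can be sketched in a few lines. The following is a simplified illustration in NumPy, not the actual FlashAttention-3 kernel; it omits the asynchrony, warp scheduling, and FP8 handling described above.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=256):
    """Block-wise attention with an online (streaming) softmax.

    Simplified sketch of the tiling idea behind the FlashAttention
    family: only one (N, block_size) slice of scores exists at a time.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(N, -np.inf)   # running max per query row
    row_sum = np.zeros(N)           # running softmax denominator

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                 # one block of scores

        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)      # rescale earlier partial sums
        p = np.exp(scores - new_max[:, None])       # block-local exponentials

        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max

    return out / row_sum[:, None]
```

Its output matches the naive version above up to floating-point rounding, but it only ever holds one block of scores at a time, which is what lets real implementations keep the working set in fast on-chip memory instead of repeatedly reading and writing the full score matrix.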
The faster attention computation provided by FlashAttention-3 offers several benefits for LLM development and applications. It can significantly reduce the time needed to train LLMs, enabling experimentation with larger models and datasets. Moreover, it can enhance LLMs’ ability to process longer sequences efficiently, opening up new applications in areas like long-form document understanding and in-context learning.
By utilizing a higher percentage of GPU capacity, FlashAttention-3 can also reduce the number of accelerators required to run LLMs, thereby cutting down production costs. The researchers have made FlashAttention-3 open-source under a permissive license and plan to integrate it into popular deep learning libraries, making it more accessible for researchers and developers to leverage its performance benefits.
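As a rough illustration of what using the library looks like in practice, the call below follows the flash-attn package’s existing Python interface as shipped in its 2.x releases; the exact import path and availability of the FlashAttention-3 Hopper kernels depend on the installed version and on having an H100-class GPU, so treat this as a sketch rather than a definitive recipe.

```python
# Illustrative only: uses the flash-attn 2.x Python interface. The
# FlashAttention-3 (Hopper) kernels come from the same repository, but
# their import path and build requirements may differ by version.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 4096, 16, 128
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention kernel; causal=True applies the autoregressive (decoder) mask.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (2, 4096, 16, 128)
```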
In conclusion, FlashAttention-3 represents a significant advancement in optimizing attention computation for LLMs on Nvidia Hopper GPUs. By maximizing hardware capabilities and improving efficiency, this technique paves the way for faster LLM training, enhanced model capabilities, and cost savings in production environments.