One-line definition
Sparse attention is exact attention restricted to a structured sparse mask, so cost scales as $O(kN)$ for some constant $k \ll N$ instead of $O(N^2)$, while the mask is designed so information can still propagate to all positions in a small number of layers.
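A minimal sketch of the core idea, assuming single-head tensors of shape (N, d) and a precomputed boolean mask (the helper name `masked_attention` and the shapes are hypothetical): the softmax is exact over the allowed pairs; disallowed pairs are set to $-\infty$ and get zero weight.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask):
    # q, k, v: (N, d); mask: (N, N) boolean, True where attention is allowed.
    scores = q @ k.T / q.shape[-1] ** 0.5               # raw (N, N) scores
    scores = scores.masked_fill(~mask, float("-inf"))   # drop disallowed pairs
    weights = F.softmax(scores, dim=-1)                 # exact softmax over allowed keys
    return weights @ v
```

Note that this naive version still materializes the full $N \times N$ score matrix; it shows the semantics, not the speedup (see the tradeoffs section below).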
Why it matters
Dense self-attention costs $O(N^2)$ in compute and $O(N^2)$ in memory. For long inputs (long documents, long-form audio, code repos) this becomes the binding constraint. Empirically, most attention weights are near zero, so computing the full dense matrix is wasteful.
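A quick back-of-envelope with hypothetical dimensions makes the quadratic term concrete:

```python
N = 32_768                                   # sequence length (hypothetical)
heads, layers, bytes_fp16 = 32, 32, 2        # hypothetical model shape, fp16 scores
one_matrix = N * N * bytes_fp16              # one attention score matrix, one head
print(one_matrix / 2**30)                    # 2.0 GiB per head per layer
print(one_matrix * heads * layers / 2**40)   # 2.0 TiB if all were materialized at once
```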
Sparse attention was the dominant pre-2023 approach to long context. It is still relevant for encoder-only long-document models (clinical notes, legal contracts, scientific papers).
The two canonical patterns
Longformer (Beltagy et al., 2020)
- Sliding window: each token attends to the $w$ neighbors on each side. Local context, like a 1D CNN receptive field.
- Global attention: a small set of designated tokens (e.g., [CLS], question tokens in QA) attend to everything and are attended to by everything.
Cost: $O(N(w + g))$ per layer, where $g$ is the number of global tokens.
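A sketch of the Longformer-style mask (hypothetical sizes and helper name; `True` = attention allowed), which can be fed to the `masked_attention` helper above:

```python
import torch

def longformer_mask(N, w, global_idx):
    i = torch.arange(N)
    mask = (i[:, None] - i[None, :]).abs() <= w   # sliding window: w neighbors each side
    mask[global_idx, :] = True                    # global tokens attend to everything
    mask[:, global_idx] = True                    # everything attends to global tokens
    return mask

mask = longformer_mask(N=4096, w=128, global_idx=torch.tensor([0]))  # e.g. [CLS] at 0
```

Each row has at most $2w + 1 + g$ allowed keys, which is where the per-layer cost above comes from.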
BigBird (Zaheer et al., 2020)
Adds a random component to Longformer's pattern, giving three components:
- Window (local context).
- Random: each token attends to $r$ random positions across the sequence (a small-world shortcut, so any two tokens are connected within $O(\log N)$ hops).
- Global: same as Longformer.
Cost: $O(N)$ per layer for fixed $w$, $r$, and $g$. The BigBird paper proves that this pattern is a universal approximator of sequence functions and is Turing complete.
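The random component can be sketched on top of the `longformer_mask` helper above (hypothetical $r$ and helper name; the real BigBird implementation works on blocks of tokens rather than single positions for efficiency):

```python
import torch

def bigbird_mask(N, w, r, global_idx):
    mask = longformer_mask(N, w, global_idx)          # window + global, as above
    rand_keys = torch.randint(0, N, (N, r))           # r random key positions per query
    mask[torch.arange(N)[:, None], rand_keys] = True  # small-world shortcuts
    return mask
```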
Tradeoffs vs. dense attention
- Quality on standard NLP benchmarks: comparable to dense attention at sequence lengths where both fit; clearly better at long lengths where dense attention doesn’t fit.
- Wall-clock speedup is implementation-bound. Sparse masks need custom CUDA kernels (e.g., blocksparse kernels or block-sparse FlashAttention) to actually beat dense FlashAttention; the sketch after this list illustrates why.
- Generation quality of decoder-only sparse-attention LLMs has lagged dense-attention LLMs in practice; dense + GQA + KV cache is the standard for chat models. Sparse attention is more common in encoders and retrieval models.
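To make the implementation-bound point concrete, here is a minimal sketch (hypothetical helper name and shapes, simplified to non-overlapping blocks rather than a true sliding window): the structure has to be baked into the computation itself, because a mask applied on top of dense scores still does $N \times N$ work.

```python
import torch
import torch.nn.functional as F

def block_local_attention(q, k, v, B):
    # q, k, v: (N, d) with N divisible by B; each token attends only within its block.
    N, d = q.shape
    qb, kb, vb = (t.view(N // B, B, d) for t in (q, k, v))
    scores = qb @ kb.transpose(1, 2) / d ** 0.5    # (N/B, B, B): N*B scores, not N*N
    return (F.softmax(scores, dim=-1) @ vb).view(N, d)
```

Block-sparse kernels do this kind of restructuring, but fused and tiled; without such a kernel, the masked version from the first sketch still pays the full dense cost.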
Common pitfalls
- Assuming sparsity automatically means speed. Without a kernel that exploits the structure, you’ll be slower than dense FlashAttention.
- Confusing sparse with low-rank. Sparse attention keeps the softmax exact but over fewer pairs; low-rank methods (Linformer, Performer; see linear attention) approximate the softmax matrix itself.
- Picking $w$ too small. Information propagates at most $w$ positions per layer, so with window 32 and 24 layers it can travel only about 768 positions: information at position 0 cannot reach position 5000 in one forward pass (see the quick check below). Either widen the window or add global tokens.
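A quick check of the receptive-field arithmetic, using the hypothetical numbers from the example above:

```python
w, layers = 32, 24
reach = w * layers          # 768 positions: nowhere near 5000
min_w = -(-5000 // layers)  # 209: smallest per-side window that spans 5000 in 24 layers
print(reach, min_w)
```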