One-line definition
Linear attention replaces the $n \times n$ softmax matrix with an explicit factorization through a low-dimensional space, so the per-layer cost drops from $O(n^2 d)$ to $O(nkd)$ for some $k \ll n$.
Why it matters
Sparse attention (see sparse attention) keeps the softmax exact but computes it over fewer query–key pairs. Linear attention approximates the softmax itself, exploiting the empirical observation that the attention matrix is approximately low-rank.
In practice, modern decoder LLMs do not use linear attention. Quality drops are non-trivial at scale, and FlashAttention has made dense attention competitive in wall-clock time. Linear attention is most relevant in domains with extreme $n$ (genomics, time series with millions of steps) or in research on sub-quadratic alternatives.
Two main families
Project the sequence axis (Linformer, Wang et al., 2020)
Learn fixed projection matrices $E, F \in \mathbb{R}^{k \times n}$ with $k \ll n$. Replace $K, V \in \mathbb{R}^{n \times d}$ with $EK, FV \in \mathbb{R}^{k \times d}$:
The softmax is now over an $n \times k$ score matrix, $\mathrm{softmax}\!\left(Q(EK)^\top/\sqrt{d}\right)(FV)$. Cost: $O(nkd)$, linear in $n$. Caveat: $E$ and $F$ are sized for a fixed $n$ at training time, so you cannot extrapolate to longer sequences without re-training.
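A minimal NumPy sketch of this projection step, assuming single-head attention with $d$-dimensional rows; the function name, the $1/\sqrt{n}$ initialization, and the random `E`, `F` are illustrative assumptions (in the actual Linformer, $E$ and $F$ are learned parameters).

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) projections along the sequence axis, k << n."""
    d = Q.shape[-1]
    K_proj = E @ K                                  # (k, d): keys projected to k positions
    V_proj = F @ V                                  # (k, d): values projected to k positions
    scores = Q @ K_proj.T / np.sqrt(d)              # (n, k) instead of (n, n)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the k projected positions
    return weights @ V_proj                         # (n, d); total cost O(n k d)

n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)            # (1024, 64)
```

Note that `E` and `F` have shape $(k, n)$, which is exactly why they cannot be reused on a longer sequence at inference.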
Replace softmax with a kernel (Performer, Choromanski et al., 2020)
Softmax attention can be written through a kernel $\kappa(q, k) = \exp\!\left(q^\top k/\sqrt{d}\right)$. Approximate this kernel with random features $\phi$ such that $\phi(q)^\top \phi(k) \approx \kappa(q, k)$.
Then $\mathrm{softmax}\!\left(QK^\top/\sqrt{d}\right)V \approx \phi(Q)\left(\phi(K)^\top V\right)$ up to row-wise normalization. The right-hand side is computed right-to-left: $\phi(K)^\top V$ is $r \times d$, then $\phi(Q)\left(\phi(K)^\top V\right)$ is $n \times d$. Cost: $O(nrd)$, linear in $n$, and it works for arbitrary $n$ at inference (no fixed projection).
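A sketch of the kernelized path under the same single-head assumptions. The feature map used here is the positive random-feature construction $\phi(x) = \exp\!\left(Wx - \lVert x\rVert^2/2\right)/\sqrt{r}$, a simplified stand-in for FAVOR+ (the actual Performer adds orthogonal features and feature redrawing); `r`, `W`, and the function names are assumptions for illustration.

```python
import numpy as np

def phi(X, W):
    """Positive random features: E[phi(q) . phi(k)] = exp(q . k) for rows w ~ N(0, I)."""
    r = W.shape[0]
    proj = X @ W.T                                   # (n, r)
    sq = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)  # ||x||^2 / 2 per row
    return np.exp(proj - sq) / np.sqrt(r)

def performer_attention(Q, K, V, W):
    d = Q.shape[-1]
    Qp = phi(Q / d ** 0.25, W)           # fold the 1/sqrt(d) temperature into q and k
    Kp = phi(K / d ** 0.25, W)
    KV = Kp.T @ V                        # (r, d), computed first: O(n r d)
    norm = Qp @ Kp.sum(axis=0)           # (n,) row-wise softmax normalizer
    return (Qp @ KV) / norm[:, None]     # (n, d); the n x n matrix is never formed

n, d, r = 4096, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((r, d))          # random feature directions
out = performer_attention(Q, K, V, W)    # (4096, 64)
```

The only large intermediates are the $(n, r)$ feature matrices and the $(r, d)$ summary `KV`, which is where the linear-in-$n$ cost comes from.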
When to use linear attention in 2026
- Sequence lengths $n$ where FlashAttention is still too slow or doesn't fit in memory.
- Encoder-only models on very long inputs.
- Real-time inference with strict latency budgets where some quality loss is tolerable.
For chat-style decoder LLMs, dense attention with FlashAttention + GQA + KV cache remains the production default.
Common pitfalls
- Comparing FLOPs without measuring wall-clock. Linear attention's $O(n)$ scaling only beats FlashAttention at large $n$; the crossover is implementation-dependent and often higher than naive analysis suggests.
- Forgetting the constants. Linformer's $k$ and Performer's $r$ may need to be in the hundreds for good quality, so the linear scaling carries a large constant.
- Assuming all softmax-replacement schemes preserve the autoregressive mask trivially. Kernelized attention requires careful handling for causal masking (recursive cumulative sums; see the sketch below).
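As a sketch of that last point: causal kernelized attention can be computed with running prefix sums over the keys and values, so position $t$ only attends to positions $j \le t$. The $\mathrm{elu}(x)+1$ feature map is the choice popularized by "linear transformer" work and is an assumption here, as are the function names.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, elementwise, always positive

def causal_linear_attention(Q, K, V):
    """Q, K, V: (n, d). O(n * d_phi * d) time, O(d_phi * d) recurrent state."""
    n, d = V.shape
    Qp, Kp = phi(Q), phi(K)                 # (n, d_phi) feature maps
    S = np.zeros((Qp.shape[1], d))          # running sum of phi(k_j) v_j^T for j <= t
    z = np.zeros(Qp.shape[1])               # running sum of phi(k_j) for j <= t
    out = np.empty((n, d))
    for t in range(n):                      # strictly causal update order
        S += np.outer(Kp[t], V[t])
        z += Kp[t]
        out[t] = (Qp[t] @ S) / (Qp[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
out = causal_linear_attention(Q, K, V)      # (512, 64)
```

The Python loop is for clarity; practical implementations typically chunk or fuse this scan to keep it parallel on accelerators.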