One-line definition
Linear attention replaces the $n \times n$ softmax matrix with an explicit factorization through a low-dimensional space, so the per-layer cost drops from $O(n^2 d)$ to $O(nkd)$ for some $k \ll n$.
Why it matters
Sparse attention (see sparse attention) keeps the softmax exact but computes it over fewer query–key pairs. Linear attention approximates the softmax itself, exploiting the empirical observation that the attention matrix is approximately low-rank.
In practice, modern decoder LLMs do not use linear attention. Quality drops are non-trivial at scale, and FlashAttention has made dense attention competitive in wall-clock time. Linear attention is most relevant in domains with extreme $n$ (genomics, time series with millions of steps) or in research on sub-quadratic alternatives.
Two main families
Project the sequence axis (Linformer, Wang et al., 2020)
Learn fixed projection matrices $E, F \in \mathbb{R}^{k \times n}$ with $k \ll n$. Replace $K, V \in \mathbb{R}^{n \times d}$ with $EK, FV \in \mathbb{R}^{k \times d}$:
The softmax is now over an $n \times k$ score matrix, $\mathrm{softmax}\!\left(Q(EK)^\top/\sqrt{d}\right)(FV)$. Cost: $O(nkd)$, linear in $n$. Caveat: $E$ and $F$ are sized for a fixed $n$ at training time, so you cannot extrapolate to longer sequences without re-training.
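A minimal NumPy sketch of this projection step, assuming single-head attention with $d$-dimensional rows; the function name, the $1/\sqrt{n}$ initialization, and the random `E`, `F` are illustrative assumptions (in the actual Linformer, $E$ and $F$ are learned parameters).

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n) projections along the sequence axis, k << n."""
    d = Q.shape[-1]
    K_proj = E @ K                                  # (k, d): keys projected to k positions
    V_proj = F @ V                                  # (k, d): values projected to k positions
    scores = Q @ K_proj.T / np.sqrt(d)              # (n, k) instead of (n, n)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the k projected positions
    return weights @ V_proj                         # (n, d); total cost O(n k d)

n, d, k = 1024, 64, 128
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)            # (1024, 64)
```

Note that `E` and `F` have shape $(k, n)$, which is exactly why they cannot be reused on a longer sequence at inference.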
Replace softmax with a kernel (Performer, Choromanski et al., 2020)
Softmax attention can be written through a kernel $\kappa(q, k) = \exp\!\left(q^\top k/\sqrt{d}\right)$. Approximate this kernel with random features $\phi$ such that $\phi(q)^\top \phi(k) \approx \kappa(q, k)$.
Then $\mathrm{softmax}\!\left(QK^\top/\sqrt{d}\right)V \approx \phi(Q)\left(\phi(K)^\top V\right)$ up to row-wise normalization. The right-hand side is computed right-to-left: $\phi(K)^\top V$ is $r \times d$, then $\phi(Q)\left(\phi(K)^\top V\right)$ is $n \times d$. Cost: $O(nrd)$, linear in $n$, and it works for arbitrary $n$ at inference (no fixed projection).
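A sketch of the kernelized path under the same single-head assumptions. The feature map used here is the positive random-feature construction $\phi(x) = \exp\!\left(Wx - \lVert x\rVert^2/2\right)/\sqrt{r}$, a simplified stand-in for FAVOR+ (the actual Performer adds orthogonal features and feature redrawing); `r`, `W`, and the function names are assumptions for illustration.

```python
import numpy as np

def phi(X, W):
    """Positive random features: E[phi(q) . phi(k)] = exp(q . k) for rows w ~ N(0, I)."""
    r = W.shape[0]
    proj = X @ W.T                                   # (n, r)
    sq = 0.5 * (X ** 2).sum(axis=-1, keepdims=True)  # ||x||^2 / 2 per row
    return np.exp(proj - sq) / np.sqrt(r)

def performer_attention(Q, K, V, W):
    d = Q.shape[-1]
    Qp = phi(Q / d ** 0.25, W)           # fold the 1/sqrt(d) temperature into q and k
    Kp = phi(K / d ** 0.25, W)
    KV = Kp.T @ V                        # (r, d), computed first: O(n r d)
    norm = Qp @ Kp.sum(axis=0)           # (n,) row-wise softmax normalizer
    return (Qp @ KV) / norm[:, None]     # (n, d); the n x n matrix is never formed

n, d, r = 4096, 64, 256
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((r, d))          # random feature directions
out = performer_attention(Q, K, V, W)    # (4096, 64)
```

The only large intermediates are the $(n, r)$ feature matrices and the $(r, d)$ summary `KV`, which is where the linear-in-$n$ cost comes from.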
When to use linear attention in 2026
- Sequence lengths $n$ where FlashAttention is still too slow or doesn't fit in memory.
- Encoder-only models on very long inputs.
- Real-time inference with strict latency budgets where some quality loss is tolerable.
For chat-style decoder LLMs, dense attention with FlashAttention + GQA + KV cache remains the production default.
Common pitfalls
- Comparing FLOPs without measuring wall-clock. Linear attention's $O(n)$ scaling only beats FlashAttention at large $n$; the crossover is implementation-dependent and often higher than naive analysis suggests.
- Forgetting the constants. Linformer's $k$ and Performer's $r$ may need to be in the hundreds for good quality, so the linear scaling carries a large constant.
- Assuming all softmax-replacement schemes preserve the autoregressive mask trivially. Kernelized attention requires careful handling for causal masking (recursive cumulative sums; see the sketch below).
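As a sketch of that last point: causal kernelized attention can be computed with running prefix sums over the keys and values, so position $t$ only attends to positions $j \le t$. The $\mathrm{elu}(x)+1$ feature map is the choice popularized by "linear transformer" work and is an assumption here, as are the function names.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, elementwise, always positive

def causal_linear_attention(Q, K, V):
    """Q, K, V: (n, d). O(n * d_phi * d) time, O(d_phi * d) recurrent state."""
    n, d = V.shape
    Qp, Kp = phi(Q), phi(K)                 # (n, d_phi) feature maps
    S = np.zeros((Qp.shape[1], d))          # running sum of phi(k_j) v_j^T for j <= t
    z = np.zeros(Qp.shape[1])               # running sum of phi(k_j) for j <= t
    out = np.empty((n, d))
    for t in range(n):                      # strictly causal update order
        S += np.outer(Kp[t], V[t])
        z += Kp[t]
        out[t] = (Qp[t] @ S) / (Qp[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
out = causal_linear_attention(Q, K, V)      # (512, 64)
```

The Python loop is for clarity; practical implementations typically chunk or fuse this scan to keep it parallel on accelerators.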