
Multi-head attention: why one head is not enough

Run h independent attention computations in parallel, then concatenate. Each head specializes in a different relation. The mechanism most senior candidates can write but few can motivate.

Reviewed · 3 min read

One-line definition

Multi-head attention projects Q, K, and V into h lower-dimensional subspaces, runs scaled dot-product attention independently in each, and concatenates the results before a final output projection. Same FLOPs as one large head; very different inductive bias.

Why it matters

Single-head attention computes one weighted average per position. That single distribution has to encode every relation the model needs: syntactic, positional, semantic, coreferential. In practice it cannot, and ablations show that single-head transformers underperform multi-head transformers at matched parameter count (Vaswani et al., 2017).

Multiple heads let different attention patterns coexist. One head learns “previous token,” another “matching bracket,” another “this noun’s modifier.” Probing studies on BERT show many heads fire on syntactic dependencies that linguists recognize (Clark et al., 2019).

The mechanism

Given input X of shape (n, d_model) and head count h, with per-head dimension d_k = d_model / h:

  1. Project: Q = X·W_Q, K = X·W_K, V = X·W_V, each of shape (n, d_model). Reshape each to (h, n, d_k).
  2. Per-head attention: for each head i, head_i = softmax(Q_i·K_iᵀ / √d_k)·V_i.
  3. Concatenate: stack the heads back into shape (n, d_model).
  4. Output projection: output = Concat(head_1, …, head_h)·W_O.

Total parameters: 4·d_model² (the four projection matrices W_Q, W_K, W_V, W_O). FLOPs: O(n·d_model² + n²·d_model). Identical to single-head; the heads share the budget.
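
A minimal NumPy sketch of the four steps above. Shapes follow the list; the weights are random placeholders and the function names are illustrative, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)           # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d_model); each weight matrix: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):                                # (n, d_model) -> (h, n, d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    # 1. Project and reshape
    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    # 2. Scaled dot-product attention, independently per head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)

    # 3. Concatenate heads back to (n, d_model)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)

    # 4. Output projection
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 10, 64, 8
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
out = multi_head_attention(rng.standard_normal((n, d_model)), W_Q, W_K, W_V, W_O, h)
print(out.shape)                                       # (10, 64)
print(sum(w.size for w in (W_Q, W_K, W_V, W_O)))       # 4 * d_model**2 = 16384
```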

Why split the dimension

If you keep d_k = d_model per head and run h heads, you multiply parameters and compute by h. Splitting d_model across the h heads (d_k = d_model / h) keeps the cost matched to a single-head baseline, so any gain is attributable to the multiplicity itself, not extra capacity. This is the design choice that makes the comparison meaningful.
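
A quick arithmetic check of that claim; d_model = 512 is an arbitrary example value:

```python
# Parameter count of the Q/K/V/output projections, split vs. unsplit heads.
d_model = 512
for h in (1, 8, 16):
    split = 4 * d_model * d_model        # d_k = d_model / h: constant in h
    unsplit = 4 * h * d_model * d_model  # d_k = d_model per head: grows linearly in h
    print(f"h={h:2d}  split={split:,}  unsplit={unsplit:,}")
# split stays 1,048,576 for every h; unsplit grows to 16,777,216 at h=16
```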

Variants

  • Multi-query attention (MQA): share K and V across all heads; only Q is per-head. The KV-cache shrinks by a factor of h (a cache-size sketch follows this list). See GQA and MQA.
  • Grouped-query attention (GQA): share across groups of heads. Compromise between full MHA and MQA. The Llama 2/3 default.
  • Cross-attention: Q comes from one sequence, K and V from another. See self-attention vs cross-attention.
  • Sliding-window / sparse: restrict each head to a local window or learned sparse pattern.
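
A rough cache-size comparison under assumed, illustrative settings (8k context, 32 query heads, d_k = 128, 8 KV groups); the numbers are not tied to any particular model:

```python
# Per-layer KV-cache entries for one sequence: 2 (keys + values) * context * kv_heads * d_k.
n_ctx, d_k = 8192, 128

def kv_entries(kv_heads):
    return 2 * n_ctx * kv_heads * d_k

print("MHA (32 KV heads):", kv_entries(32))   # 67,108,864 entries per layer
print("GQA ( 8 KV heads):", kv_entries(8))    # 4x smaller than MHA
print("MQA ( 1 KV head): ", kv_entries(1))    # 32x smaller than MHA
```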

Tradeoffs

  • Head count: 8 to 32 is typical. More heads with smaller d_k can hurt expressiveness; fewer heads with larger d_k loses specialization. d_k of 64 to 128 is the modern sweet spot.
  • KV-cache memory scales linearly with the number of heads h in vanilla MHA. The motivation for MQA and GQA at long context.

Common pitfalls

  • Equating “more heads” with “more capacity.” Splitting fixes the parameter budget; it is a structural choice, not a scale-up.
  • Reading the post-softmax weights as “what the model attends to.” Heads are mixed in W_O. Single-head probes can be misleading.
  • Treating MHA as the compute bottleneck. At typical context lengths the FFN is the larger block: attention compute scales with n²·d_model + n·d_model² while FFN compute scales with n·d_model², so the FFN dominates until n reaches a few times d_model (a rough count follows this list).
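
A rough per-layer FLOP count that makes the crossover concrete. The constants assume a 4x FFN expansion and count one multiply-add as one op; treat them as order-of-magnitude, not exact:

```python
# Attention ~ 4*n*d^2 (projections) + 2*n^2*d (scores + values); FFN with 4x expansion ~ 8*n*d^2.
# Ratio = 0.5 + n / (4*d), so the FFN dominates until n reaches roughly 2*d.
d = 4096
for n in (1_024, 8_192, 65_536):
    attn = 4 * n * d**2 + 2 * n**2 * d
    ffn = 8 * n * d**2
    print(f"n={n:>6}  attention/FFN = {attn / ffn:.2f}")
# n=  1024  attention/FFN = 0.56
# n=  8192  attention/FFN = 1.00
# n= 65536  attention/FFN = 4.50
```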