One-line definition
GQA and MQA reduce the number of distinct K/V projection heads while keeping the full set of Q heads, so multiple query heads share the same key and value tensors. MQA is the extreme case: one K/V head total. GQA picks an intermediate number of K/V groups.
Why it matters
The KV cache dominates LLM serving memory at long contexts (see KV cache). Cutting the number of K/V heads cuts cache size proportionally:
- MHA (standard, e.g. GPT-3): K and V have the same number of heads as Q.
- MQA (Shazeer, 2019): 1 K and 1 V head shared across all Q heads. ~num_heads× smaller cache.
- GQA (Ainslie et al., 2023): G groups, each shared across num_heads / G query heads. A tunable midpoint.
Llama 2 70B uses GQA with 8 K/V groups for 64 query heads (8× cache reduction). Llama 3, Mistral, Qwen, and most modern decoders default to GQA.
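To make the savings concrete, here is a back-of-the-envelope sketch in Python. The shape parameters (80 layers, 64 query heads, head dim 128, fp16 cache) roughly match Llama 2 70B, but the helper function and exact numbers are illustrative only.

```python
# Rough KV cache size: 2 tensors (K and V) per layer,
# each [kv_heads, seq_len, head_dim], stored in fp16 (2 bytes per element).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama 2 70B-like shapes at a 4k context.
mha   = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa_8 = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096)

print(f"MHA:   {mha   / 2**30:.2f} GiB per sequence")  # ~10 GiB
print(f"GQA-8: {gqa_8 / 2**30:.2f} GiB per sequence")  # ~1.25 GiB, 8x smaller
```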
The mechanism
In standard multi-head attention, for each head $i \in \{1, \dots, H\}$:

$$\mathrm{head}_i = \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

with $Q_i = X W_i^{Q}$ and $K_i = X W_i^{K}$, $V_i = X W_i^{V}$.

In GQA with $G$ groups, the $H$ query heads are partitioned into contiguous groups of size $H / G$. All query heads in the same group attend to the same shared $K_g, V_g$. MQA is GQA with $G = 1$.

Implementation: project K and V to dimension $G \cdot d_{\mathrm{head}}$ instead of $H \cdot d_{\mathrm{head}}$, then broadcast (repeat) each K/V head across its matching Q heads before the matmul.
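A minimal PyTorch-style sketch of that broadcast, assuming pre-computed projection matrices; the tensor names and the explicit `repeat_interleave` are for clarity rather than a claim about any particular library's kernel, and the causal mask is omitted.

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, num_q_heads, num_kv_heads):
    """x: [batch, seq, d_model]; wq: [d_model, num_q_heads * d_head];
    wk, wv: [d_model, num_kv_heads * d_head]. Causal mask omitted for brevity."""
    b, s, _ = x.shape
    d_head = wq.shape[1] // num_q_heads
    group = num_q_heads // num_kv_heads            # Q heads per shared K/V head

    q = (x @ wq).view(b, s, num_q_heads, d_head).transpose(1, 2)   # [b, Hq,  s, d]
    k = (x @ wk).view(b, s, num_kv_heads, d_head).transpose(1, 2)  # [b, Hkv, s, d]
    v = (x @ wv).view(b, s, num_kv_heads, d_head).transpose(1, 2)

    # Broadcast each K/V head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)                          # [b, Hq, s, d]
    v = v.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5               # [b, Hq, s, s]
    out = F.softmax(scores, dim=-1) @ v                            # [b, Hq, s, d]
    return out.transpose(1, 2).reshape(b, s, num_q_heads * d_head)
```

Only the smaller K/V tensors need to live in the cache; fused kernels typically do the per-group broadcast on the fly instead of materializing the repeated copies, which is why the saving shows up in memory and bandwidth rather than FLOPs.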
Tradeoffs
| Variant | KV heads | Cache size | Quality | Used by |
|---|---|---|---|---|
| MHA | = Q heads (e.g. 64) | 1× (baseline) | baseline | GPT-3, original Llama |
| GQA-8 | 8 | 1/8× (at 64 Q heads) | ~baseline | Llama 2/3 70B, Mistral |
| MQA | 1 | 1/num_heads× | small drop | PaLM, Falcon |
GQA recovers nearly all of MHA's quality while keeping most of MQA's cache savings, and it is the dominant choice in 2026.
Common pitfalls
- Confusing K/V heads with Q heads. GQA shrinks only the number of K/V heads; Q keeps its full head count.
- Assuming the speedup is in compute. GQA mostly saves memory (cache + bandwidth), not FLOPs. The matmul cost barely changes.
- Re-training cost. You generally cannot convert MHA → GQA post-hoc: the K/V projections were trained per-head, so distillation or partial re-training is required. The GQA paper initializes the grouped projections by mean-pooling the original K/V heads and then uptrains (see the sketch after this list).
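For reference, that conversion recipe starts by mean-pooling the per-head K/V projection weights into their groups before uptraining. A rough sketch of the pooling step, assuming weights are laid out as [d_model, num_heads * d_head] with heads contiguous along the last dimension (the layout is an assumption):

```python
import torch

def pool_kv_heads(w_kv: torch.Tensor, num_heads: int, num_groups: int) -> torch.Tensor:
    """Mean-pool per-head K or V projection weights into grouped heads.
    w_kv: [d_model, num_heads * d_head], heads contiguous along the last dim."""
    d_model, total = w_kv.shape
    d_head = total // num_heads
    group = num_heads // num_groups
    # [d_model, num_groups, group, d_head] -> average the heads within each group
    pooled = w_kv.view(d_model, num_groups, group, d_head).mean(dim=2)
    return pooled.reshape(d_model, num_groups * d_head)
```

The pooled weights are only an initialization; quality is recovered by uptraining on a small fraction of the original pretraining compute.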