Grouped-query and multi-query attention (GQA, MQA)

Share K and V heads across query heads to shrink the KV cache 4-8x with negligible quality loss. Standard in modern decoder LLMs.

Reviewed · 2 min read

One-line definition

GQA and MQA reduce the number of distinct K/V projection heads while keeping the full set of Q heads, so multiple query heads share the same key and value tensors. MQA is the extreme case: one K/V head total. GQA picks an intermediate number of K/V groups.

Why it matters

The KV cache dominates LLM serving memory at long contexts (see KV cache). Cutting the number of K/V heads cuts cache size proportionally:

  • MHA (standard, e.g. GPT-3): K and V have the same number of heads as Q.
  • MQA (Shazeer, 2019): 1 K and 1 V head shared across all Q heads. ~num_heads× smaller cache.
  • GQA (Ainslie et al., 2023): G groups, each shared across num_heads / G Q heads. Tunable midpoint.

Llama 2 70B uses GQA with 8 K/V groups for 64 query heads (8× cache reduction). Llama 3, Mistral, Qwen, and most modern decoders default to GQA.
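The cache arithmetic is easy to check directly. A minimal sketch, using Llama 2 70B-like shapes (80 layers, 64 Q heads, head dim 128; fp16) to compare a hypothetical MHA variant against the actual GQA-8 configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2 tensors (K and V), one per layer, per K/V head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama 2 70B-like shapes: 80 layers, head_dim 128, 4096-token context
mha = kv_cache_bytes(80, 64, 128, seq_len=4096, batch=1)  # if it used all 64 heads for K/V
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)   # actual GQA with 8 K/V groups

print(mha / 2**30, gqa / 2**30)  # → 10.0 1.25 (GiB): an 8× reduction
```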

The mechanism

In standard multi-head attention, for each head i (i = 1, …, num_heads):

head_i = softmax(Q_i K_i^T / √d_k) V_i

with Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V, and d_k = d_model / num_heads.

In GQA with G groups, the num_heads query heads are partitioned into contiguous groups of size num_heads / G. All query heads in the same group attend to the same shared K_g and V_g. MQA is GQA with G = 1.

Implementation: project K and V to dimension G · d_k instead of num_heads · d_k, then broadcast (repeat) each K/V head across its num_heads / G matching Q heads before the attention matmuls.
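The broadcast step above can be sketched in a few lines of NumPy. Shapes and values here are illustrative only (8 query heads, 2 K/V groups, tiny dimensions; no causal mask, single sequence):

```python
import numpy as np

h, G, d_k, n = 8, 2, 4, 5  # query heads, K/V groups, head dim, seq len
rng = np.random.default_rng(0)
q = rng.standard_normal((h, n, d_k))  # one Q tensor per query head
k = rng.standard_normal((G, n, d_k))  # only G K heads...
v = rng.standard_normal((G, n, d_k))  # ...and G V heads

# Broadcast: repeat each K/V group across the h // G query heads it serves
k_full = np.repeat(k, h // G, axis=0)  # (h, n, d_k)
v_full = np.repeat(v, h // G, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)              # softmax over keys
out = weights @ v_full                                 # (h, n, d_k)
```

In a real serving stack the repeat is often done implicitly (e.g. by broadcasting over a group axis) so the duplicated K/V tensors are never materialized; only the G-head versions live in the cache.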

Tradeoffs

| Variant | K/V heads | Cache size | Quality | Used by |
| --- | --- | --- | --- | --- |
| MHA | = num_heads | baseline | baseline | GPT-3, original Llama |
| GQA-8 | 8 | 8× smaller | ~baseline | Llama 2/3 70B, Mistral |
| MQA | 1 | num_heads× smaller | small drop | PaLM, Falcon |

GQA recovers nearly all of MHA's quality while keeping most of MQA's cache savings, and as of 2026 it is the dominant choice.

Common pitfalls

  • Confusing K/V heads with Q heads. GQA shrinks the K/V head count only; the full set of Q heads is kept.
  • Assuming the speedup is in compute. GQA mostly saves memory (cache + bandwidth), not FLOPs. The matmul cost barely changes.
  • Re-training cost. You generally cannot convert MHA → GQA post-hoc; the K/V projections were trained per-head. Distillation or partial re-training is required.
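On that last point, Ainslie et al. initialize the GQA checkpoint by mean-pooling the per-head K/V projection matrices within each group, then uptrain briefly. A minimal sketch of the pooling step (shapes are illustrative, not from any real model):

```python
import numpy as np

d_model, h, d_k, G = 16, 8, 4, 2  # illustrative sizes; G target K/V groups
rng = np.random.default_rng(1)
w_k = rng.standard_normal((h, d_model, d_k))  # trained per-head K projections

# Average each contiguous group of h // G head projections into one shared head
w_k_gqa = w_k.reshape(G, h // G, d_model, d_k).mean(axis=1)  # (G, d_model, d_k)
```

The pooled weights are only a starting point; without the subsequent uptraining, quality degrades noticeably.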