KV cache: how LLM inference avoids quadratic decode cost

The single most important optimization in autoregressive decoding. Without it, every step of a 1000-token generation would re-run attention over the full prefix, costing O(1000²) operations per new token instead of O(1000).


One-line definition

The KV cache stores the key (K) and value (V) projections from previous tokens during autoregressive decoding, so each new token only needs to compute attention against the cache rather than recomputing K and V from scratch over the full prefix.

Why it matters

Critical optimization for autoregressive decoding. Without it, each generated token costs O(T²) attention operations (re-encoding the full prefix) rather than O(T).

Standard in all production LLM serving. Essential for understanding:

  • Why decoding is memory-bound, not compute-bound
  • How serving cost scales with context length
  • Why long contexts get expensive (KV cache size grows linearly with context)
  • How techniques like paged attention (vLLM), grouped-query attention, and multi-query attention work

The mechanism

During autoregressive decoding, after generating token t, the model has already computed K and V for tokens 0..t-1. To generate token t+1:

Without KV cache (wasteful):

  1. Re-encode the full prefix tokens 0..t.
  2. Compute K, V for all t+1 positions.
  3. Compute attention for the new token using all K, V.

With KV cache:

  1. Compute K, V only for the new token (token t).
  2. Append (K_t, V_t) to the cache.
  3. Compute attention from the new token’s Q against the cached K (size t+1) and V (size t+1).
  4. Output is one token; cache grows by one position.

Per layer and per attention head, the cache stores tensors of shape (batch, t, d_head) for K and V each. As decoding progresses, the cache grows by one position per token.
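
A minimal sketch of one cached decode step, for a single layer and a single head (a hypothetical decode_step helper with toy projection matrices W_q, W_k, W_v; illustrative shapes, not any particular library's API):

    import torch

    torch.manual_seed(0)
    d_model, d_head = 64, 64

    # Toy projection weights; a real layer has many heads, biases, and an output projection.
    W_q = torch.randn(d_model, d_head) / d_model**0.5
    W_k = torch.randn(d_model, d_head) / d_model**0.5
    W_v = torch.randn(d_model, d_head) / d_model**0.5

    def decode_step(x_new, cache_k, cache_v):
        """One KV-cached decode step.

        x_new:   (batch, 1, d_model)  hidden state of the newest token only
        cache_k: (batch, t, d_head)   keys for tokens 0..t-1
        cache_v: (batch, t, d_head)   values for tokens 0..t-1
        """
        q = x_new @ W_q                                # Q is computed fresh, never cached
        k_new = x_new @ W_k                            # K, V only for the new token
        v_new = x_new @ W_v
        cache_k = torch.cat([cache_k, k_new], dim=1)   # cache grows by one position
        cache_v = torch.cat([cache_v, v_new], dim=1)

        scores = q @ cache_k.transpose(1, 2) / d_head**0.5   # (batch, 1, t+1)
        out = scores.softmax(dim=-1) @ cache_v               # (batch, 1, d_head)
        return out, cache_k, cache_v

    # Usage: start with an empty cache and feed one token at a time.
    batch = 1
    cache_k = torch.empty(batch, 0, d_head)
    cache_v = torch.empty(batch, 0, d_head)
    for _ in range(5):
        x_new = torch.randn(batch, 1, d_model)
        out, cache_k, cache_v = decode_step(x_new, cache_k, cache_v)
    print(cache_k.shape)   # torch.Size([1, 5, 64]): one cached position per step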

The memory cost

For a typical LLM:

KV cache size = 2 (K and V) * num_layers * num_kv_heads * d_head * sequence_length * batch_size * dtype_bytes

For a 70B model (e.g., 80 layers, 8 KV heads, 128 d_head) with BF16, batch 1, sequence 8K:

2 * 80 * 8 * 128 * 8192 * 1 * 2 bytes = ~2.7 GB per request

Increase batch to 32: ~86 GB. Increase context to 32K at the same batch: ~344 GB. The KV cache dominates serving memory, exceeding the weights at long contexts; this is what drives much of current serving optimization work.
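
The formula is easy to sanity-check in a few lines of Python (the configuration below is the same illustrative 70B-style setup as above, not any specific model card):

    def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
        # Factor of 2: the cache stores both K and V.
        return 2 * num_layers * num_kv_heads * d_head * seq_len * batch * dtype_bytes

    GB = 1e9
    cfg = dict(num_layers=80, num_kv_heads=8, d_head=128, dtype_bytes=2)  # BF16
    print(kv_cache_bytes(seq_len=8192,  batch=1,  **cfg) / GB)   # ~2.7 GB per request
    print(kv_cache_bytes(seq_len=8192,  batch=32, **cfg) / GB)   # ~86 GB
    print(kv_cache_bytes(seq_len=32768, batch=32, **cfg) / GB)   # ~344 GB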

What an interviewer expects you to say

If asked about KV cache:

  1. Frame the problem: autoregressive decoding would otherwise be O(T²) per generated token.
  2. Describe the mechanism: cache K and V from previous tokens, compute K, V only for the new token.
  3. Quote the memory formula: 2 * layers * kv_heads * d_head * seq_len * batch * dtype_bytes per request.
  4. Mention that decoding is memory-bound because of weight-loading per step (the matmul is just one token wide).
  5. Bonus: mention multi-query attention (MQA), grouped-query attention (GQA), and paged attention as KV-cache-aware optimizations.

Multi-query and grouped-query attention

Standard multi-head attention has H attention heads, each with its own K and V projections. The KV cache scales linearly with H.

Multi-query attention (MQA): all H heads share a single K and a single V projection. KV cache is H times smaller. Quality cost: small but non-zero. Used in PaLM, Falcon.

Grouped-query attention (GQA): heads are partitioned into G groups, each group shares K, V. Cache is H/G times smaller. Compromise between standard MHA and MQA. Used in LLaMA-2 70B onwards, Mistral, most modern open LLMs.

KV cache size reduction directly lowers memory and cost. Quality regressions are mild relative to savings.
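
A shape-level sketch of how GQA shares cached K/V across groups of query heads at attention time (toy tensors; broadcasting the shared heads with repeat_interleave is one common implementation choice, not the only one):

    import torch

    batch, t = 1, 500
    num_q_heads, num_kv_heads, d_head = 64, 8, 128   # GQA: 8 query heads per KV head
    group = num_q_heads // num_kv_heads

    q = torch.randn(batch, num_q_heads, 1, d_head)         # new token's queries, all heads
    k_cache = torch.randn(batch, num_kv_heads, t, d_head)  # cache holds 8 KV heads, not 64
    v_cache = torch.randn(batch, num_kv_heads, t, d_head)

    # Broadcast each cached KV head to its group of query heads.
    k = k_cache.repeat_interleave(group, dim=1)             # (batch, 64, t, d_head)
    v = v_cache.repeat_interleave(group, dim=1)

    scores = q @ k.transpose(-1, -2) / d_head**0.5           # (batch, 64, 1, t)
    out = scores.softmax(dim=-1) @ v                         # (batch, 64, 1, d_head)

    # The cache is num_q_heads / num_kv_heads = 8x smaller than full MHA;
    # MQA is the special case num_kv_heads = 1.
    print(out.shape)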

Paged attention (vLLM)

The naive KV cache allocates contiguous memory for each request. This wastes memory (you have to allocate up to max_seq_len upfront) and fragments memory across requests.

Paged attention treats the KV cache like virtual memory: fixed-size pages (e.g., 16 tokens each) are allocated on demand, and pages holding a shared prefix can be reused across requests.

The vLLM serving system popularized this. Most modern LLM serving (TensorRT-LLM, SGLang, etc.) now uses some form of paged KV cache.
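
A toy sketch of the bookkeeping idea (not vLLM's actual implementation): each request keeps a block table mapping logical token positions to fixed-size physical pages drawn from a shared pool, allocated only when the current page fills up.

    BLOCK_SIZE = 16  # tokens per page

    class PagedKVCache:
        """Toy block allocator: pages are handed out on demand, not reserved up front."""

        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))  # shared physical pool
            self.block_tables = {}                      # request_id -> [physical block ids]
            self.lengths = {}                           # request_id -> tokens written so far

        def append_token(self, request_id):
            table = self.block_tables.setdefault(request_id, [])
            length = self.lengths.get(request_id, 0)
            if length % BLOCK_SIZE == 0:                # current page is full: grab a new one
                table.append(self.free_blocks.pop())
            self.lengths[request_id] = length + 1
            block = table[length // BLOCK_SIZE]
            return block, length % BLOCK_SIZE           # where this token's K/V get written

        def free(self, request_id):
            self.free_blocks.extend(self.block_tables.pop(request_id, []))
            self.lengths.pop(request_id, None)

    cache = PagedKVCache(num_blocks=1024)
    for _ in range(20):                                 # 20 tokens use only 2 pages
        block, offset = cache.append_token("req-0")
    print(cache.block_tables["req-0"])

Prefix sharing, where multiple requests point at the same physical pages with copy-on-write when they diverge, is the other half of the design and is omitted here.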

Common confusions

  • Q in cache? No. Q is fresh per token. Only K, V cached.
  • Eliminates quadratic cost? Only recomputation. Attention is still O(T) per token; total decode is O(T²). Trades compute for memory.
  • “Bigger KV cache = better quality.” No, it’s just longer context. Quality depends on whether the model can use the long context, not on cache size.
  • “All models have KV cache.” All autoregressive transformers do during decoding. Encoder-only models (BERT) don’t decode. Diffusion models don’t have a KV cache because they don’t autoregress over time.
  • “GQA is just a quality compromise.” It’s primarily a serving cost optimization. The quality compromise is mild; the cost savings are substantial at scale.

Why decoding is memory-bound

Each decode step:

  1. Load the model weights from HBM.
  2. Compute one new K, V (matmul against new token’s input only).
  3. Compute Q against cached K (matmul).
  4. Compute attention against V (matmul).
  5. Apply output projection and FFN.

The weight-loading dominates. The matmuls are tiny (one token wide), so arithmetic intensity is low: few FLOPs per byte loaded. Tensor cores sit idle most of the time.
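
A back-of-the-envelope arithmetic-intensity estimate for one batch-1 decode step (illustrative 70B-scale numbers; KV-cache reads and activation traffic are ignored, and they only push intensity lower):

    # Per decode step at batch size 1, every weight is read from HBM once but
    # multiplied by a single token's activations.
    params = 70e9
    dtype_bytes = 2                      # BF16 weights
    bytes_moved = params * dtype_bytes   # ~140 GB of weight traffic per step
    flops = 2 * params                   # ~140 GFLOPs: one multiply-add per weight per token

    print(flops / bytes_moved)           # ~1 FLOP per byte loaded

    # A modern datacenter GPU can sustain on the order of hundreds of FLOPs per byte of
    # HBM bandwidth, so at ~1 FLOP/byte the step is limited by memory, not compute.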

This is why:

  • Speculative decoding works: the verify pass is essentially free because the cost is weight-loading, not arithmetic.
  • Batching helps: multiple requests share the same weight load.
  • Quantization helps a lot: cuts weight-loading time directly.
  • KV cache is the next bottleneck: at long contexts, the cache itself becomes large and expensive to read.

Related: FlashAttention, Speculative decoding. Related interview: How would you reduce LLM inference cost by 10x?