One-line definition
A small “draft” model proposes K candidate tokens cheaply; a single parallel forward pass of the large target model verifies them using a rejection-sampling rule that provably preserves the target model’s output distribution.
Why it matters
LLM decoding is autoregressive: each token depends on the previous one, so the GPU spends most of its time waiting on the next sequential step. Each forward pass on a single token is memory-bound: you read all of the model’s weights from HBM, but each weight matrix is used for only a single matrix-vector product. Tensor cores are barely used.
Speculative decoding turns this serial work into batched work. The large model runs one forward pass per K-token chunk instead of K forward passes. Wall-clock latency drops 2-3× with no quality change.
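To make “memory-bound” concrete, here is a back-of-envelope check; the model size and bandwidth below are illustrative round numbers, not measurements:

```python
# Back-of-envelope latency floor for ONE decode step of a large dense model.
# All numbers are illustrative assumptions, not benchmarks.
params = 70e9            # 70B-parameter target model
bytes_per_param = 2      # fp16 / bf16 weights
hbm_bandwidth = 3.3e12   # ~3.3 TB/s, roughly H100-class HBM

weight_bytes = params * bytes_per_param
latency_floor = weight_bytes / hbm_bandwidth          # seconds, just to stream the weights
print(f"{latency_floor * 1e3:.0f} ms per token from weight reads alone")   # ~42 ms

# The FLOPs for a batch-of-1 matvec finish far sooner than this, so the GPU
# spends the step waiting on memory. Verifying K tokens in one pass reuses
# the same weight read, which is why the verify pass is nearly free.
```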
Speculative decoding is a standard optimization for LLM serving in 2026. Major serving systems (vLLM, TGI, SGLang, TensorRT-LLM) support it.
The mechanism
Setup: large target model M (the one whose distribution we want to sample from), small draft model m (cheap; e.g., a 1B distilled version of a 70B M).
Per cycle:
- Draft. Run m autoregressively for K steps to propose tokens x̂₁ … x̂_K. This is cheap because m is small.
- Verify. Run M once, in parallel, over the prefix extended with the K draft tokens. This gives M’s distribution at every position: p_M(· | prefix, x̂₁ … x̂ᵢ) for i = 0 … K (the i = K distribution feeds the bonus token below).
- Accept/reject. Sweep i = 1 … K:
- Accept x̂ᵢ with probability α = min(1, p_M(x̂ᵢ) / p_m(x̂ᵢ)).
- On the first reject at position i*: resample a new token from the corrected distribution q(x) = normalize(max(0, p_M(x) − p_m(x))). Discard x̂_{i*+1} … x̂_K and continue from the new token.
- Bonus token. If all K drafts are accepted, sample one extra token from p_M(· | prefix, x̂₁ … x̂_K). So a perfect cycle yields K+1 accepted tokens for the cost of one M forward pass. (A minimal code sketch of the whole cycle follows this list.)
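A minimal sketch of one cycle in Python. Here draft_probs and target_probs are hypothetical stand-ins for the two models, each returning a next-token distribution over the vocabulary; a real implementation would get all of M’s per-position distributions from one batched forward pass and would manage KV-caches, which this sketch ignores:

```python
import numpy as np

def speculative_cycle(prefix, draft_probs, target_probs, K, rng):
    """One draft-verify-accept cycle of speculative sampling.

    draft_probs(seq)  -> p_m(. | seq), a 1-D numpy array over the vocab
    target_probs(seq) -> p_M(. | seq), same shape (called per position here
                         for clarity; in practice one batched forward pass
                         of M yields the distribution at every position)
    Returns the list of tokens emitted by this cycle.
    """
    # 1. Draft: run the small model autoregressively for K steps.
    seq, drafts, p_m = list(prefix), [], []
    for _ in range(K):
        q = draft_probs(seq)
        tok = rng.choice(len(q), p=q)
        drafts.append(tok)
        p_m.append(q)
        seq.append(tok)

    # 2. Verify: target distributions at positions 0..K
    #    (position i conditions on the prefix plus the first i drafts;
    #     position K feeds the bonus token).
    p_M = [target_probs(list(prefix) + drafts[:i]) for i in range(K + 1)]

    # 3. Accept/reject sweep over the draft tokens.
    out = []
    for i, tok in enumerate(drafts):
        if rng.random() < min(1.0, p_M[i][tok] / p_m[i][tok]):
            out.append(tok)                      # accept the draft token
        else:
            # First rejection: resample from normalize(max(0, p_M - p_m))
            # and discard the remaining drafts.
            residual = np.maximum(p_M[i] - p_m[i], 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            return out

    # 4. Bonus token: all K drafts accepted, sample one more from p_M.
    out.append(rng.choice(len(p_M[K]), p=p_M[K]))
    return out
```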
Why it’s lossless
The accept/reject rule is a special case of rejection sampling chosen specifically so that the marginal distribution of the output token at each position is exactly p_M. The proof is one page of careful algebra; the upshot is the output stream is statistically indistinguishable from sampling from M directly.
This is the key selling point. Unlike quantization or distillation, speculative decoding is not a quality/speed trade-off; it is a pure speedup. (At least in theory; in practice, numerical issues, KV-cache subtleties, and tokenizer mismatches can introduce tiny deviations.)
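For intuition, here is the compressed version of that algebra for a single verified position. Token x can be emitted either because the draft proposed it and it was accepted, or because a rejection triggered a resample from the corrected distribution q:

```latex
\begin{aligned}
P(\text{emit } x)
  &= \underbrace{p_m(x)\,\min\!\Big(1,\tfrac{p_M(x)}{p_m(x)}\Big)}_{\text{draft accepted}}
   \;+\; \underbrace{\Big(1 - \textstyle\sum_y \min\big(p_m(y),\,p_M(y)\big)\Big)\, q(x)}_{\text{rejected, resampled}} \\
  &= \min\big(p_m(x),\,p_M(x)\big)
   \;+\; \Big(\textstyle\sum_y \max\big(0,\,p_M(y)-p_m(y)\big)\Big)\,
     \frac{\max\big(0,\,p_M(x)-p_m(x)\big)}{\sum_y \max\big(0,\,p_M(y)-p_m(y)\big)} \\
  &= \min\big(p_m(x),\,p_M(x)\big) + \max\big(0,\,p_M(x)-p_m(x)\big) \;=\; p_M(x).
\end{aligned}
```

Both branches together collapse to exactly p_M(x), so no approximation error accumulates across positions.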
Speedup analysis
Let α̂ = expected acceptance rate per token (a property of how well m mimics M).
Expected tokens emitted per cycle: roughly (1 − α̂^(K+1)) / (1 − α̂). This figure already includes the final bonus or correction token.
Cost per cycle: K × cost(m) + 1 × cost(M).
If cost(m) ≪ cost(M), the wall-clock speedup is approximately the expected tokens per cycle. In practice:
- α̂ ≈ 0.6-0.8 with a well-distilled draft model → 2-3× speedup typical.
- α̂ ≈ 0.85+ for code generation (high agreement on syntax) → 4-5× possible.
- α̂ ≈ 0.4 with a poorly-matched draft → speedup < 1.5×, sometimes a net slowdown.
The choice of K matters: too small, you don’t amortize the M forward pass; too large, late-position drafts almost always get rejected. K = 4-8 is the typical sweet spot.
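A quick sanity check of that trade-off using the formula above; the acceptance rate (α̂ = 0.7) and the draft cost (5% of a target forward pass) are assumed, illustrative numbers:

```python
# Expected wall-clock speedup as a function of K, per the formula above.
# alpha (acceptance rate) and c (draft cost / target cost) are assumptions.
def expected_speedup(alpha: float, K: int, c: float = 0.05) -> float:
    tokens_per_cycle = (1 - alpha ** (K + 1)) / (1 - alpha)
    cost_per_cycle = K * c + 1.0      # in units of one target forward pass
    return tokens_per_cycle / cost_per_cycle

for K in (2, 4, 6, 8, 12):
    print(f"K={K:2d}  speedup ~ {expected_speedup(0.7, K):.2f}x")
# Speedup climbs, peaks somewhere around K = 4-8, then decays: late drafts
# are mostly rejected while the draft cost keeps growing linearly in K.
```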
Variants
- Self-speculation / Medusa. M itself produces K drafts via extra prediction heads attached to its top layer (a minimal sketch appears at the end of this section). No separate draft model needed. Lower α̂ than a dedicated draft, but no extra model to maintain.
- EAGLE. Trains a small “feature regressor” on top of M’s hidden states that predicts the next token’s hidden state cheaply. Better α̂ than vanilla self-speculation.
- Tree speculation. m proposes a tree of candidates (multiple branches at each step), M verifies all branches in one batched pass, the longest accepted prefix is kept. Higher per-cycle yield at the cost of more verifier work.
- Lookahead decoding. No draft model at all; uses parallel n-gram speculation. Lower speedup but trivially deployable.
In 2026, EAGLE-2 / EAGLE-3 are SOTA; Medusa is the simplest to implement; tree speculation is what high-end serving systems use under the hood.
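To make the self-speculation idea concrete, here is a minimal sketch of Medusa-style heads (illustrative PyTorch, not the reference implementation; the real Medusa heads use residual blocks and verify their candidates with tree attention):

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """K extra prediction heads on the target model's final hidden state.

    Head i proposes the token (i + 1) steps ahead from the same hidden state,
    so one target forward pass yields a whole draft chunk with no draft model.
    """
    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.SiLU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_heads)
        ])

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: (batch, hidden_size) -- the hidden state that would
        # normally feed only the standard LM head.
        return [head(last_hidden) for head in self.heads]
```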
What an interviewer expects you to say
If asked about speculative decoding:
- Frame the problem: decoding is memory-bound, GPUs are idle, the n in the matmul is 1.
- Explain draft + verify + accept/reject, with the key insight that M’s verify pass is essentially free because the cost was dominated by weight-loading, not by the K-fold extra matmul.
- State that it’s lossless (preserves M’s distribution) and explain why this is non-obvious and important.
- Quote a realistic speedup number (2-3×).
- Bonus: mention Medusa, EAGLE, or tree speculation as variants.
Common confusions
- “It’s an approximation.” No. The output distribution is exactly p_M. (Modulo floating-point.)
- “It only helps for greedy decoding.” No. It works for sampling too; the rejection rule is defined in terms of probabilities precisely because sampling is the general case.
- “It needs the draft model to be fine-tuned to match the target.” Helpful but not required. Even a much smaller off-the-shelf model can give α̂ ≈ 0.6.
- “You can use any small model as the draft.” The draft must use the same tokenizer as the target. Tokenizer mismatch is a common deployment pitfall.
- “It saves FLOPs.” No, it does more FLOPs (the wasted draft tokens that get rejected, plus the K-token verification pass). The wins are wall-clock and GPU utilization.
Why interviewers care
This question tests whether you understand:
- Why decoding is memory-bound (the most important fact about LLM inference).
- The difference between batched and sequential workloads on GPU.
- Lossless vs. lossy optimizations (a common confusion).
- That you’ve kept up with serving developments since 2023.
If you can also discuss how speculative decoding interacts with KV-cache, batching, and continuous batching, you’re at a level the interviewer probably wants to hire.
Reading list
- Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023)
- Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
- Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads (Cai et al., 2024)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (Li et al., 2024)
Related: FlashAttention (the other big inference optimization). Related interview question: “Walk me through how you’d serve an LLM with low latency” (coming soon).