Autoregressive vs. diffusion generation

Two paradigms for generative modeling: predict the next element step-by-step (autoregressive) or iteratively denoise from pure noise (diffusion). Different costs, different strengths.


One-line definition

Autoregressive (AR) models factorize the joint distribution into per-element conditionals and generate one element at a time. Diffusion models learn to invert a Markov noising process and generate by iteratively denoising from Gaussian noise. AR dominates language; diffusion dominates images.

Why it matters

The two paradigms produce very different production tradeoffs:

| Aspect | Autoregressive | Diffusion |
|---|---|---|
| Sampling | Sequential, one step per token | Sequential, ~10–1000 denoise steps |
| Parallelism within a sample | None during generation | Full (each denoise step processes the whole sample in parallel) |
| Quality scaling | Compute and data | Compute and data, plus step count |
| Modality strength | Discrete sequences (text, code) | Continuous signals (images, audio) |
| Conditioning | Prefix prompt | Cross-attention or classifier-free guidance |
| Likelihood | Exact, easy to compute | Variational lower bound; sample-based |

Autoregressive

The model factorizes the joint distribution by the chain rule:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

A neural net (transformer, RNN) parameterizes each conditional $p_\theta(x_t \mid x_{<t})$. Training: maximize log-likelihood = minimize cross-entropy (one prediction per token, all positions in parallel via teacher forcing). Sampling: feed back the previous output, generate the next.
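
A minimal sketch of both halves, assuming a hypothetical `model` that maps token ids `(batch, length)` to next-token logits `(batch, length, vocab)` (PyTorch; names are illustrative, not any particular library's API):

```python
import torch
import torch.nn.functional as F

def ar_loss(model, tokens):
    """Teacher-forced cross-entropy: every position predicted in parallel."""
    logits = model(tokens[:, :-1])                      # predict token t from tokens < t
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

@torch.no_grad()
def ar_sample(model, prefix, max_new_tokens):
    """Serial sampling: each new token conditions on everything generated so far."""
    x = prefix
    for _ in range(max_new_tokens):
        logits = model(x)[:, -1]                        # distribution over the next token
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        x = torch.cat([x, next_tok], dim=1)
    return x
```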

Strengths: exact likelihood, simple training, parallel teacher-forced loss, strong on discrete sequences.

Weakness: serial sampling. Each step waits for the previous. This is the bottleneck that motivates speculative decoding.

Diffusion

Define a forward noising process

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad t = 1, \dots, T,$$

with $x_0$ drawn from the data. Each forward step is a small Gaussian perturbation governed by the variance schedule $\beta_t$. The reverse (denoising) step is approximated by a learned model $p_\theta$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$
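
As a concrete sketch, a linear $\beta_t$ schedule and the closed-form forward sample $q(x_t \mid x_0)$ (PyTorch; `T`, `betas`, `q_sample` are illustrative names):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_t
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a = alphas_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1)
    return a * x0 + s * eps
```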

Training: sample a clean $x_0$, sample noise $\epsilon \sim \mathcal{N}(0, I)$, sample a timestep $t$, compute the noisy $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ (with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$), and fit the noise predictor $\epsilon_\theta(x_t, t)$ to $\epsilon$ with MSE.
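
Expressed as one training step, reusing the `alphas_bar` schedule from the sketch above (`model` is any network mapping $(x_t, t)$ to a noise prediction; all names illustrative):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alphas_bar):
    """One DDPM-style training step: predict the injected noise with MSE."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,))     # random timestep per example
    eps = torch.randn_like(x0)                           # target noise
    a = alphas_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1)
    x_t = a * x0 + s * eps                               # noisy input x_t
    return F.mse_loss(model(x_t, t), eps)                # fit eps_theta to eps
```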

Sampling: draw $x_T \sim \mathcal{N}(0, I)$, then iterate the reverse step $T$ times (typically 10–1000).
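
A minimal sketch of ancestral (DDPM-style) sampling under the $\epsilon$-prediction parameterization, with the common choice $\sigma_t^2 = \beta_t$ (illustrative names only):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alphas_bar):
    """Start from pure noise and iterate the learned reverse step T times."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))       # predicted noise eps_theta(x_t, t)
        mean = (x - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)   # add noise except at t = 0
        else:
            x = mean
    return x
```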

Strengths: high-fidelity continuous generation, stable training (no GAN instability), good likelihood estimates with importance-weighted ELBO.

Weakness: slow sampling. Most acceleration work (DDIM, distillation, consistency models) focuses on reducing the step count.
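
For example, the deterministic DDIM update ($\eta = 0$) lets sampling jump along a strided subset of timesteps, making step count an explicit quality/throughput knob. A hedged sketch with illustrative names:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_bar, num_steps=50):
    """Deterministic DDIM: visit only num_steps timesteps instead of all T."""
    ts = torch.linspace(len(alphas_bar) - 1, 0, num_steps).long()   # strided timestep subset
    x = torch.randn(shape)
    for i in range(len(ts) - 1):
        t, t_prev = int(ts[i]), int(ts[i + 1])
        eps = model(x, torch.full((shape[0],), t))
        x0_pred = (x - (1 - alphas_bar[t]).sqrt() * eps) / alphas_bar[t].sqrt()
        x = alphas_bar[t_prev].sqrt() * x0_pred + (1 - alphas_bar[t_prev]).sqrt() * eps
    return x
```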

When AR vs. diffusion

| Modality | Production default (2026) |
|---|---|
| Text | Autoregressive (Llama, GPT, Mistral) |
| Code | Autoregressive (GPT-4, Codex) |
| Images | Diffusion (Stable Diffusion, FLUX, Imagen) |
| Audio | Mixed: AR (WaveNet legacy), diffusion (modern TTS), latent autoregressive |
| Video | Diffusion (Sora, Veo, Stable Video) with latent compression |
| Molecules / proteins | Diffusion (RFdiffusion) |

For text, AR has structural advantages: discrete vocabulary, natural causal ordering, and chain-of-thought reasoning emerges from sequential generation. For images, no natural sequential ordering exists, and diffusion’s iterative refinement maps better to gradual denoising.

Hybrid and emerging approaches

  • Latent diffusion (Stable Diffusion): VAE compresses to latent space; diffusion happens there.
  • Discrete diffusion for text (D3PM, Plaid): apply diffusion to discrete tokens with a categorical noise process.
  • Flow matching (Meta, 2023): generalizes diffusion to a deterministic ODE; faster sampling (a minimal sketch follows this list).
  • Consistency models (Song 2023): one or few-step diffusion via distillation.
  • Diffusion language models (LLaDA, 2024): diffusion applied to text; competitive with AR at smaller scale, scaling unclear.
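
One common instantiation of flow matching uses a linear (rectified-flow-style) path: interpolate between noise and data, regress a velocity network onto the constant displacement, and sample by integrating the ODE. A minimal sketch under those assumptions, with a hypothetical velocity model `model(x, t)`:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1):
    """Linear-path flow matching: target velocity for x_t = (1-t)x0 + t*x1 is x1 - x0."""
    x0 = torch.randn_like(x1)                    # noise endpoint
    t = torch.rand(x1.shape[0], 1)               # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    return F.mse_loss(model(x_t, t.squeeze(-1)), x1 - x0)

@torch.no_grad()
def flow_sample(model, shape, num_steps=20):
    """Integrate dx/dt = v_theta(x, t) from noise (t=0) to data (t=1) with Euler steps."""
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * model(x, t)
    return x
```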

Common pitfalls

  • Comparing AR perplexity with diffusion ELBO directly. Different objectives; not directly comparable.
  • Treating diffusion step count as fixed. It is a sample-time hyperparameter; reducing it improves throughput at a quality cost.
  • Forgetting AR is parallel during training. The “AR is slow” concern applies to inference, not training.
  • Assuming diffusion is universally better than GANs. It is for image fidelity and training stability, not necessarily for inference speed; GANs still dominate latency-critical settings.