One-line definition
A diffusion model (Ho et al., 2020; Sohl-Dickstein et al., 2015) defines a forward Markov chain that gradually adds Gaussian noise to data $x_0$, and learns a neural network to reverse it by predicting the noise added at each step. Sampling iterates the learned reverse process from pure noise.
Why it matters
As of 2026, diffusion is the dominant paradigm for high-fidelity generation in continuous modalities:
- Images: Stable Diffusion, DALL-E 3, Midjourney, Imagen, FLUX.
- Video: Sora, Veo, Runway Gen-3.
- Audio: Stable Audio, AudioLDM, Suno.
- Molecules / proteins: RFdiffusion (Watson, Juergens, Bennett et al., 2023) for protein structure generation; widely used in the lab of David Baker, who shared the 2024 Chemistry Nobel for computational protein design.
Compared to GANs and VAEs, diffusion offers stable training, excellent sample diversity, and natural conditional generation. Its main weakness, slow iterative sampling, is the focus of active research (DDIM, distillation, consistency models).
The forward process
Define a variance schedule $\beta_1 < \beta_2 < \dots < \beta_T$ (small values, increasing). The forward step:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

This is a fixed (non-learned) Markov chain. After $T$ steps with an appropriate schedule, $x_T \approx \mathcal{N}(0, I)$.
A useful identity: writing $\alpha_t = 1-\beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$, you can sample $x_t$ directly from $x_0$ in closed form:

$$q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big)$$

So $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$.
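A minimal sketch of this forward process in PyTorch, assuming a linear $\beta$ schedule; the schedule constants and the helper name `q_sample` are illustrative, not from any particular library:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_1 ... beta_T: small, increasing
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # bar(alpha)_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x0) in one shot: sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    abar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # reshape for broadcasting
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
```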
Training
For a sample $x_0$ from the dataset:
- Pick a random timestep $t \sim \mathrm{Uniform}\{1, \dots, T\}$.
- Sample noise $\epsilon \sim \mathcal{N}(0, I)$.
- Form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.
- Train $\epsilon_\theta(x_t, t)$ with MSE: $L = \|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
That’s it. One simple loss, no adversary, no special tricks. This is why diffusion training is so stable: it is denoising regression on a closed-form forward process.
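The whole training step fits in a few lines. A sketch reusing the schedule and `q_sample` above; the interface `model(x_t, t) -> predicted noise` is an assumption:

```python
import torch.nn.functional as F

def training_step(model, x0):
    """One training step: denoising MSE at a random timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # random timestep per example
    eps = torch.randn_like(x0)                       # the noise the model must recover
    x_t = q_sample(x0, t, eps)                       # closed-form forward sample
    return F.mse_loss(model(x_t, t), eps)            # plain regression loss
```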
Sampling (reverse process)
DDPM-style ancestral sampling (Ho et al., 2020):
- Start with $x_T \sim \mathcal{N}(0, I)$.
- For $t = T, \dots, 1$:
  - Compute predicted noise $\epsilon_\theta(x_t, t)$.
  - Sample $z \sim \mathcal{N}(0, I)$ (or set $z = 0$ at $t = 1$).
  - Update $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z$, with $\sigma_t^2 = \beta_t$.
- Return $x_0$.
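A sketch of this loop, continuing the snippets above (the $\sigma_t^2 = \beta_t$ choice follows Ho et al., 2020; the `model(x, t)` interface is assumed as before):

```python
@torch.no_grad()
def ddpm_sample(model, shape):
    """DDPM ancestral sampling with sigma_t^2 = beta_t."""
    x = torch.randn(shape)                            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                       # predicted noise at step t
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                # add sigma_t * z (no noise at the last step)
    return x
```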
DDPM uses $T = 1000$ steps. DDIM (Song et al., 2020) reduces this to ~50 steps with a deterministic update, sketched below. Consistency models (Song et al., 2023) and distillation further reduce it to 1–4 steps.
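One deterministic DDIM step ($\eta = 0$) reconstructs the implied $x_0$ and re-noises it at an earlier timestep, which is what allows jumping over steps; a sketch with the same assumed interface:

```python
@torch.no_grad()
def ddim_step(model, x, t, t_prev):
    """One deterministic DDIM update from timestep t down to t_prev (eta = 0)."""
    t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
    eps = model(x, t_batch)
    abar = alpha_bars[t]
    abar_prev = alpha_bars[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    x0_pred = (x - (1 - abar).sqrt() * eps) / abar.sqrt()   # implied clean sample
    return abar_prev.sqrt() * x0_pred + (1 - abar_prev).sqrt() * eps
```

Iterating this over a strided subsequence of timesteps (e.g., every 20th) gives the ~50-step sampler.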
Conditional generation
Add a condition $c$ (text embedding, class label, image) to the network: $\epsilon_\theta(x_t, t, c)$. Standard conditioning mechanism: cross-attention from the denoiser’s intermediate feature maps to the text embeddings (T5 or CLIP).
Classifier-free guidance (Ho & Salimans, 2022) interpolates between conditional and unconditional predictions:

$$\hat\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$

with guidance scale $w > 1$ (commonly 5–10) for text-to-image. Higher $w$ trades sample diversity for adherence to the prompt.
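A sketch of the guided prediction, assuming a conditional interface `model(x_t, t, c)` and a learned null embedding `null_cond` (both names hypothetical):

```python
def guided_eps(model, x_t, t, cond, null_cond, w=7.5):
    """Classifier-free guidance: push the unconditional prediction toward the conditional one."""
    eps_uncond = model(x_t, t, null_cond)  # prediction with the "empty" condition
    eps_cond = model(x_t, t, cond)         # prediction with the real condition
    return eps_uncond + w * (eps_cond - eps_uncond)
```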
Latent diffusion (Stable Diffusion)
Train a VAE to compress images to a small latent space (typically 1/8 spatial resolution, 4 channels). Run diffusion in the latent space. Decode the final latent with the VAE decoder.
Result: diffusion over 64×64 latents gives roughly 512×512 pixel-space quality at much lower compute. The 2022 Stable Diffusion release (Rombach et al., 2022) made high-quality text-to-image practical on consumer GPUs.
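The pipeline, in the same sketch style; the `vae.decode` and `unet` interfaces here are assumptions, not a specific library’s API:

```python
@torch.no_grad()
def latent_text_to_image(vae, unet, text_emb, latent_shape=(1, 4, 64, 64)):
    """Run reverse diffusion in latent space, then decode once with the VAE."""
    denoiser = lambda z, t: unet(z, t, text_emb)   # text-conditioned latent denoiser
    z0 = ddpm_sample(denoiser, latent_shape)       # iterative denoising over 64x64x4 latents
    return vae.decode(z0)                          # single decode back to pixel space
```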
Sample quality vs. step count
Diffusion sample quality is determined by:
- Model capacity and training data.
- Number of denoising steps (more = better, with diminishing returns).
- Sampler (DDIM, DPM-Solver, Euler, Heun): different ODE/SDE solvers with different quality/speed tradeoffs.
- Classifier-free guidance scale.
A useful rule: 30–50 steps with a modern sampler such as DDIM or DPM-Solver roughly matches 1000-step DDPM quality.
Common pitfalls
- Confusing $\epsilon$-prediction with $x_0$-prediction. Diffusion can be parameterized as predicting the noise $\epsilon$, the clean image $x_0$, or the velocity $v$ (v-parameterization). They are mathematically related but not identical; see the conversion sketch after this list.
- Forgetting that the forward process is fixed, not learned. Only the reverse is parameterized.
- Treating diffusion as an exact likelihood model. Diffusion optimizes a variational bound (ELBO); for likelihood-based comparison, use importance-weighted bounds (IWAE-style) or compare on FID/sample quality instead.
- Running fewer steps than the model was trained for, without testing. Some samplers degrade sharply below ~10 steps; check empirically.
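The parameterization conversions in the first pitfall follow directly from the forward identity. A sketch using the schedule constants defined earlier; the $v$ formula is the one from Salimans & Ho (2022):

```python
def eps_to_x0_and_v(x_t, t, eps):
    """Convert a predicted epsilon into the equivalent x0 and v predictions."""
    abar = alpha_bars[t].view(-1, *([1] * (x_t.dim() - 1)))
    x0 = (x_t - (1 - abar).sqrt() * eps) / abar.sqrt()  # invert x_t = sqrt(abar) x0 + sqrt(1-abar) eps
    v = abar.sqrt() * eps - (1 - abar).sqrt() * x0      # v-parameterization target
    return x0, v
```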
Related
- Variational autoencoders. VAEs are the latent compressor in latent diffusion.
- Autoregressive vs. diffusion. Broader paradigm comparison.