One-line definition
Discrete gradient estimators approximate $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$ when $z$ is discrete: the case where you cannot reparameterize the sample as a smooth function of $\theta$ and noise. The three you must know: REINFORCE (score function), Gumbel-Softmax (continuous relaxation), and the straight-through estimator.
Why it matters
The reparameterization trick handles continuous latents (Gaussian VAEs). But many models sample discrete objects — categorical latents, hard attention, tokens, architecture choices, RL actions. You can’t push a gradient through argmax or a categorical sample, so you need an estimator. This is the deep-DL follow-up to “explain the reparameterization trick,” and it underpins RLHF (which uses the score-function estimator) and discrete latent-variable models.
The core problem
We want $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$. The expectation is a sum over discrete $z$; the sampling operation $z \sim p_\theta(z)$ is non-differentiable. The two families of solutions trade bias for variance.
1. Score-function estimator (REINFORCE / likelihood ratio)
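Use the log-derivative identity $\nabla_\theta p_\theta(z) = p_\theta(z)\,\nabla_\theta \log p_\theta(z)$:
$$\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)] = \mathbb{E}_{z \sim p_\theta(z)}\big[f(z)\,\nabla_\theta \log p_\theta(z)\big]$$
Unbiased, and requires only that you can sample $z$ and evaluate $\nabla_\theta \log p_\theta(z)$; $f$ can be a black box (non-differentiable, even an environment reward).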
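The estimator can be checked numerically on a tiny categorical distribution. The sketch below assumes a softmax parameterization over three logits and a made-up reward table standing in for the black-box $f$; the function names are illustrative, not from any library.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, f):
    """Exact gradient of E_{z ~ softmax(logits)}[f(z)] w.r.t. the logits, by enumeration."""
    p = softmax(logits)
    mean_f = sum(p[k] * f(k) for k in range(len(p)))
    # For a softmax parameterization: dE[f]/dlogit_j = p_j * (f(j) - E[f])
    return [p[j] * (f(j) - mean_f) for j in range(len(p))]

def score_function_gradient(logits, f, num_samples, rng):
    """Monte Carlo average of f(z) * grad log p(z)."""
    p = softmax(logits)
    k = len(p)
    grad = [0.0] * k
    for _ in range(num_samples):
        z = rng.choices(range(k), weights=p)[0]
        fz = f(z)  # f is only evaluated, never differentiated
        for j in range(k):
            # d log p(z) / d logit_j = 1[j == z] - p_j
            score_j = (1.0 if j == z else 0.0) - p[j]
            grad[j] += fz * score_j
    return [g / num_samples for g in grad]

rng = random.Random(0)
logits = [0.5, -0.2, 1.0]
reward = lambda z: [1.0, 3.0, -2.0][z]  # hypothetical black-box f

exact = exact_gradient(logits, reward)
estimate = score_function_gradient(logits, reward, 50_000, rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

With enough samples the two vectors agree closely; with few samples the estimate is noisy, which is the variance problem discussed next.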
The catch is high variance. Mitigations:
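- Baselines / control variates: subtract a baseline $b$ that doesn’t depend on $z$: $(f(z) - b)\,\nabla_\theta \log p_\theta(z)$. Still unbiased (since $\mathbb{E}_{z \sim p_\theta(z)}[\nabla_\theta \log p_\theta(z)] = 0$), lower variance. The value-function baseline in actor-critic is exactly this.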
- More samples, advantage normalization, etc.
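To see the effect of a baseline without sampling noise, the single-sample estimator's mean and variance can be computed exactly by enumerating a small categorical. This is a minimal sketch: the reward values and the choice of the mean reward as baseline are assumptions made for illustration, and the mean reward is not claimed to be the variance-optimal baseline.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_moments(logits, f, baseline):
    """Exact mean and total variance of (f(z) - b) * grad log p(z), by enumeration."""
    p = softmax(logits)
    k = len(p)
    mean = [0.0] * k
    second = [0.0] * k
    for z in range(k):
        for j in range(k):
            score_j = (1.0 if j == z else 0.0) - p[j]
            g = (f[z] - baseline) * score_j
            mean[j] += p[z] * g
            second[j] += p[z] * g * g
    variance = sum(second[j] - mean[j] ** 2 for j in range(k))
    return mean, variance

logits = [0.0, 0.0, 0.0]
f = [10.0, 11.0, 12.0]  # illustrative rewards with a large constant offset
p = softmax(logits)
mean_f = sum(pi * fi for pi, fi in zip(p, f))

mean_plain, var_plain = estimator_moments(logits, f, baseline=0.0)
mean_base, var_base = estimator_moments(logits, f, baseline=mean_f)
print("mean (no baseline):  ", mean_plain)
print("mean (with baseline):", mean_base)
print("variance:", var_plain, "->", var_base)
```

The means match, so the baseline adds no bias, while the variance drops sharply because the large constant offset in the rewards no longer multiplies the score.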
This estimator is policy-gradient RL. REINFORCE, A2C, and PPO are all score-function estimators with progressively better variance control.
2. Gumbel-Softmax (Concrete distribution)
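Relax the discrete sample into a continuous one you can reparameterize. The Gumbel-Max trick says a categorical sample with class probabilities $\pi_i = p_\theta(z = i)$ equals
$$z = \operatorname{one\_hot}\Big(\arg\max_i \,(\log \pi_i + g_i)\Big), \qquad g_i \sim \mathrm{Gumbel}(0, 1) \text{ i.i.d.}$$
Replace the non-differentiable argmax with a temperature-$\tau$ softmax:
$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_j \exp\big((\log \pi_j + g_j)/\tau\big)}$$
Now $y$ is a differentiable, reparameterized sample (a point on the simplex). As $\tau \to 0$, $y$ approaches a one-hot vector but the gradient variance blows up; as $\tau$ grows, samples are smooth but biased toward uniform. You anneal $\tau$ downward during training. Low variance, biased.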
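The sampling side of the relaxation can be sketched in plain Python. The helper names and the example probabilities below are assumptions for illustration; the code draws Gumbel noise by inverse-CDF sampling, returns both the hard Gumbel-Max index and the relaxed sample from the same noise, and shows how the temperature changes the relaxed sample.

```python
import math
import random

def sample_gumbel(rng):
    # Gumbel(0, 1) via inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_softmax_sample(log_probs, tau, rng):
    """Return (hard_index, soft_sample) computed from the same Gumbel noise."""
    perturbed = [lp + sample_gumbel(rng) for lp in log_probs]
    hard = max(range(len(perturbed)), key=perturbed.__getitem__)  # Gumbel-Max
    m = max(perturbed)
    exps = [math.exp((v - m) / tau) for v in perturbed]  # temperature softmax
    total = sum(exps)
    soft = [e / total for e in exps]
    return hard, soft

probs = [0.2, 0.5, 0.3]
log_probs = [math.log(p) for p in probs]

# Gumbel-Max check: hard-sample frequencies should approach probs.
rng = random.Random(0)
n = 20_000
counts = [0] * len(probs)
for i in range(n):
    hard, soft = gumbel_softmax_sample(log_probs, tau=1.0, rng=rng)
    counts[hard] += 1
freqs = [c / n for c in counts]
print("empirical frequencies:", freqs)

# Same noise (same seed), two temperatures.
_, soft_cold = gumbel_softmax_sample(log_probs, tau=0.05, rng=random.Random(1))
_, soft_hot = gumbel_softmax_sample(log_probs, tau=50.0, rng=random.Random(1))
print("tau=0.05:", soft_cold)
print("tau=50:  ", soft_hot)
```

The low-temperature sample is close to one-hot and the high-temperature sample is close to uniform. In a real model the relaxed sample would be fed through an autodiff framework so gradients reach the logits; that part is omitted here.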
3. Straight-through estimator (STE)
Forward pass: use the hard discrete value (e.g. argmax, or a threshold). Backward pass: pretend the operation was the identity (or the softmax), and pass the gradient straight through.
Straight-Through Gumbel-Softmax combines both: hard one-hot forward, soft Gumbel-Softmax gradient backward, so the rest of the network sees a genuine discrete sample. STE is biased (the backward op isn’t the true derivative) but cheap and empirically effective; it is the workhorse behind VQ-VAE training (where it carries the decoder’s gradient across the non-differentiable codebook lookup to the encoder) and binarized/quantized networks.
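A hand-written backward pass makes the mismatch between forward and backward explicit. The sketch below assumes a toy problem chosen only for illustration: a single scalar weight, a round-to-nearest quantizer, and a squared-error loss on the quantized value. The true derivative of the quantizer is zero almost everywhere, so plain gradient descent would never move the weight; the straight-through step treats it as the identity.

```python
import math

def quantize(w):
    """Hard forward op: round to nearest integer (true derivative is 0 almost everywhere)."""
    return math.floor(w + 0.5)

def ste_step(w, target, lr):
    q = quantize(w)                # forward: hard discrete value
    loss = (q - target) ** 2
    grad_q = 2.0 * (q - target)    # exact dL/dq
    grad_w = grad_q * 1.0          # backward: pretend dq/dw = 1 (straight-through)
    return w - lr * grad_w, loss

w, target = 0.3, 3
losses = []
for _ in range(20):
    w, loss = ste_step(w, target, lr=0.1)
    losses.append(loss)

print("final w:", w, "quantized:", quantize(w))
print("losses:", losses)
```

In autodiff frameworks the same idea is commonly written as `w + stop_gradient(quantize(w) - w)`, using the framework's stop-gradient or detach operation, so the forward value is the quantized one while the gradient flows as if the op were the identity.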
The bias-variance tradeoff
| Estimator | Bias | Variance | Needs differentiable $f$? | Typical use |
|---|---|---|---|---|
| Score function (REINFORCE) | Unbiased | High | No | RL, RLHF, black-box reward |
| Gumbel-Softmax | Biased (for $\tau > 0$) | Low | Yes | Discrete latents (categorical VAE) |
| Straight-through | Biased | Low | Yes (via surrogate) | VQ-VAE, quantization, hard attention |
The dividing question: can you differentiate $f$? If not (an environment, a metric, a sampled-then-scored pipeline), you’re forced onto the score-function estimator. If you can, the relaxation methods give far lower variance.
What an interviewer expects you to say
- State why reparameterization fails for discrete $z$ (you can’t write a discrete sample as a smooth function of noise and $\theta$).
- Give the score-function estimator with the log-derivative identity $\nabla_\theta p_\theta(z) = p_\theta(z)\,\nabla_\theta \log p_\theta(z)$, that it’s unbiased but high-variance, and that baselines reduce variance without adding bias.
- Explain Gumbel-Softmax as the reparameterizable relaxation with a temperature $\tau$ you anneal: biased, low variance.
- Describe the straight-through estimator (hard forward, soft/identity backward) and that it trains VQ-VAE and quantized nets.
- Connect to practice: RLHF uses score-function (PPO) because text is discrete and a 50K-way Gumbel-Softmax is impractical; DPO sidesteps sampling entirely.
Common confusions
- “You can just backprop through argmax.” Its gradient is zero almost everywhere; that’s the whole problem.
- “REINFORCE is biased.” It’s unbiased; its issue is variance. Baselines fix variance, not bias.
- “Gumbel-Softmax is exact.” It’s biased for any $\tau > 0$; only the $\tau \to 0$ limit is exact, and there the gradient is uselessly high-variance.
- “Straight-through has a principled gradient.” It doesn’t — it’s a useful heuristic (the backward op deliberately mismatches the forward op).
- “These are RL-only / VAE-only tricks.” They’re general: hard attention, neural architecture search, discrete communication, and quantization all use them.
Related: Explain the reparameterization trick, Policy gradient, PPO, Variational autoencoders, Quantization.