Discrete gradient estimators

How to get gradients through a sampling step over discrete variables, where the reparameterization trick doesn't apply. Covers the score-function (REINFORCE) estimator, the straight-through estimator, and Gumbel-Softmax.

One-line definition

Discrete gradient estimators approximate $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$ when $z$ is discrete, the case where you cannot reparameterize the sample as a smooth function of $\theta$ and noise. The three you must know: REINFORCE (score function), Gumbel-Softmax (continuous relaxation), and the straight-through estimator.

Why it matters

The reparameterization trick handles continuous latents (Gaussian VAEs). But many models sample discrete objects — categorical latents, hard attention, tokens, architecture choices, RL actions. You can’t push a gradient through argmax or a categorical sample, so you need an estimator. This is the deep-DL follow-up to “explain the reparameterization trick,” and it underpins RLHF (which uses the score-function estimator) and discrete latent-variable models.

The core problem

We want $\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$. The expectation is a sum over discrete $z$; the sampling operation is non-differentiable. The two families of solutions trade bias for variance.
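To make the obstacle concrete, here is a short worked step. It assumes $z$ ranges over a finite set, $p_\theta(z)$ is its probability under parameters $\theta$, and $f$ is the downstream objective (notation assumed for illustration):

$$\nabla_\theta \, \mathbb{E}_{z \sim p_\theta(z)}[f(z)] = \nabla_\theta \sum_{z} p_\theta(z)\, f(z) = \sum_{z} f(z)\, \nabla_\theta p_\theta(z)$$

The right-hand side is exact, but it needs one term per possible value of $z$, and it is not written as an expectation under $p_\theta$, so it cannot be estimated from samples as it stands. The estimators in this article either rewrite it as an expectation or swap the discrete sample for a differentiable surrogate.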

1. Score-function estimator (REINFORCE / likelihood ratio)

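Use the log-derivative identity $\nabla_\theta p_\theta(z) = p_\theta(z)\, \nabla_\theta \log p_\theta(z)$:

$$\nabla_\theta \mathbb{E}_{z \sim p_\theta(z)}[f(z)] = \mathbb{E}_{z \sim p_\theta(z)}\big[f(z)\, \nabla_\theta \log p_\theta(z)\big]$$

Unbiased, and requires only that you can sample $z$ and evaluate $\nabla_\theta \log p_\theta(z)$; $f$ can be a black box (non-differentiable, even an environment reward).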
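A minimal sketch of the score-function estimator, assuming a three-way categorical distribution parameterized by softmax logits and a reward table standing in for the black-box function. The names `exact_gradient`, `score_function_gradient`, `logits`, and `f_values` are illustrative, not from any library; gradients are written out by hand so the example needs only the standard library.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, f):
    """Exact d E[f(z)] / d logits, obtained by enumerating every category."""
    p = softmax(logits)
    mean_f = sum(pk * fk for pk, fk in zip(p, f))
    # For a softmax parameterization: d E[f] / d logit_i = p_i * (f_i - E[f])
    return [pi * (fi - mean_f) for pi, fi in zip(p, f)]


def score_function_gradient(logits, f, num_samples, rng):
    """Monte Carlo average of f(z) * grad log p(z) over sampled z."""
    p = softmax(logits)
    k = len(logits)
    grad = [0.0] * k
    for _ in range(num_samples):
        z = rng.choices(range(k), weights=p)[0]
        for i in range(k):
            # d log softmax(logits)[z] / d logit_i = 1[i == z] - p_i
            score = (1.0 if i == z else 0.0) - p[i]
            grad[i] += f[z] * score / num_samples
    return grad


rng = random.Random(0)
logits = [0.5, -0.2, 0.1]
f_values = [1.0, 3.0, -2.0]  # the estimator only evaluates f, never differentiates it

exact = exact_gradient(logits, f_values)
estimate = score_function_gradient(logits, f_values, 20000, rng)
```

With enough samples `estimate` should land close to `exact`. Shrinking `num_samples` makes the noise visible, which is the variance problem discussed next.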

The catch is high variance. Mitigations:

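  • Baselines / control variates: subtract a baseline $b$ that doesn’t depend on $z$: $\mathbb{E}\big[(f(z) - b)\, \nabla_\theta \log p_\theta(z)\big]$. Still unbiased (since $\mathbb{E}[\nabla_\theta \log p_\theta(z)] = 0$), lower variance. The value-function baseline in actor-critic is exactly this.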
  • More samples, advantage normalization, etc.
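The effect of a baseline can be checked exactly on a toy problem. The sketch below assumes a three-way softmax categorical and rewards with a large shared offset, and uses the mean reward as the baseline (one simple choice, not necessarily the optimal one). It enumerates every outcome to get the exact mean and variance of the single-sample estimator; `estimator_moments` and the other names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_moments(logits, f, baseline):
    """Exact per-logit mean and variance of (f(z) - baseline) * grad log p(z),
    computed by enumerating every possible sample z."""
    p = softmax(logits)
    k = len(p)
    means, variances = [], []
    for i in range(k):
        # Value of the i-th gradient component for each possible z
        vals = [(f[z] - baseline) * ((1.0 if i == z else 0.0) - p[i])
                for z in range(k)]
        mean = sum(p[z] * vals[z] for z in range(k))
        var = sum(p[z] * (vals[z] - mean) ** 2 for z in range(k))
        means.append(mean)
        variances.append(var)
    return means, variances


logits = [0.5, -0.2, 0.1]
f_values = [10.0, 12.0, 11.0]  # rewards with a large shared offset

p = softmax(logits)
mean_reward = sum(pk * fk for pk, fk in zip(p, f_values))

mean_plain, var_plain = estimator_moments(logits, f_values, baseline=0.0)
mean_base, var_base = estimator_moments(logits, f_values, baseline=mean_reward)
```

Here `mean_plain` and `mean_base` agree (the baseline adds no bias), while `var_base` is far smaller than `var_plain` because the shared offset in the rewards no longer multiplies the score.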

This estimator is policy-gradient RL. REINFORCE, A2C, and PPO are all score-function estimators with progressively better variance control.

2. Gumbel-Softmax (Concrete distribution)

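Relax the discrete sample into a continuous one you can reparameterize. The Gumbel-Max trick says a categorical sample with class probabilities $\pi_i$ equals

$$z = \operatorname{one\_hot}\Big(\arg\max_i \,(\log \pi_i + g_i)\Big), \qquad g_i \sim \text{Gumbel}(0, 1)$$

Replace the non-differentiable argmax with a temperature-$\tau$ softmax:

$$y_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}{\sum_j \exp\big((\log \pi_j + g_j)/\tau\big)}$$

Now $y$ is a differentiable, reparameterized sample (a point on the simplex). As $\tau \to 0$, $y$ approaches a one-hot vector but the gradient variance blows up; as $\tau$ grows, samples are smooth but biased toward uniform. You anneal $\tau$ downward during training. Low variance, biased.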
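A small standard-library sketch of both steps, assuming class probabilities `probs` and Gumbel(0, 1) noise drawn by inverse-CDF sampling. The function names are illustrative. The same noise vector is reused at two temperatures so the effect of the temperature is isolated from the randomness.

```python
import math
import random


def sample_gumbel(rng):
    # Gumbel(0, 1) via inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_max(log_probs, noise):
    """Hard categorical sample: index of the largest perturbed log-probability."""
    perturbed = [lp + g for lp, g in zip(log_probs, noise)]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def gumbel_softmax(log_probs, noise, tau):
    """Relaxed sample: softmax of the perturbed log-probabilities divided by tau."""
    scaled = [(lp + g) / tau for lp, g in zip(log_probs, noise)]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
probs = [0.6, 0.3, 0.1]
log_probs = [math.log(p) for p in probs]

# Gumbel-Max draws should follow the categorical distribution.
num_draws = 20000
counts = [0] * len(probs)
for _ in range(num_draws):
    noise = [sample_gumbel(rng) for _ in probs]
    counts[gumbel_max(log_probs, noise)] += 1
frequencies = [c / num_draws for c in counts]

# One noise draw, relaxed at a low and a high temperature.
noise = [sample_gumbel(rng) for _ in probs]
hard_index = gumbel_max(log_probs, noise)
soft_cold = gumbel_softmax(log_probs, noise, tau=0.1)   # sharper, closer to one-hot
soft_hot = gumbel_softmax(log_probs, noise, tau=10.0)   # smoother, closer to uniform
```

In a real model the relaxed vector would be produced inside an autodiff framework so that gradients flow through the softmax back to the logits; this sketch only shows the forward sampling.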

3. Straight-through estimator (STE)

Forward pass: use the hard discrete value (e.g. argmax, or a threshold). Backward pass: pretend the operation was the identity (or the softmax), and pass the gradient straight through.

Straight-Through Gumbel-Softmax combines both: hard one-hot forward, soft Gumbel-Softmax gradient backward, so the rest of the network sees a genuine discrete sample. STE is biased (the backward op isn’t the true derivative) but cheap and empirically effective; it is the workhorse behind VQ-VAE training (where it carries the gradient past the codebook lookup back to the encoder) and binarized/quantized networks.
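A hand-rolled sketch of the threshold case, assuming a single scalar pre-activation `x`, a hard binarization to {-1, +1}, and a squared-error loss (all hypothetical choices for illustration). It contrasts the true derivative, measured by finite differences, with the straight-through gradient, then uses the latter for a few descent steps.

```python
def binarize(x):
    """Forward pass: hard threshold to {-1, +1}."""
    return 1.0 if x >= 0.0 else -1.0


def loss(q, target):
    return (q - target) ** 2


def finite_difference_gradient(x, target, eps=1e-4):
    """Numerical derivative of the loss through the hard threshold."""
    up = loss(binarize(x + eps), target)
    down = loss(binarize(x - eps), target)
    return (up - down) / (2.0 * eps)


def ste_gradient(x, target):
    """Backward pass under the straight-through rule: treat dq/dx as 1."""
    q = binarize(x)
    dloss_dq = 2.0 * (q - target)
    dq_dx_surrogate = 1.0  # identity stand-in; the true derivative is 0 almost everywhere
    return dloss_dq * dq_dx_surrogate


x = 0.7
target = -1.0

true_grad = finite_difference_gradient(x, target)  # 0.0: no learning signal
ste_grad = ste_gradient(x, target)                 # 4.0: biased but usable

# Gradient descent with the STE gradient pushes x across the threshold.
learning_rate = 0.1
steps = 0
while binarize(x) != target and steps < 100:
    x -= learning_rate * ste_gradient(x, target)
    steps += 1
```

In an autodiff framework the same effect is usually obtained by writing the output as the soft value plus a gradient-stopped correction (hard minus soft), so the forward value is hard while the backward pass sees only the soft path.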

The bias-variance tradeoff

| Estimator | Bias | Variance | Needs differentiable $f$? | Typical use |
| --- | --- | --- | --- | --- |
| Score function (REINFORCE) | Unbiased | High | No | RL, RLHF, black-box reward |
| Gumbel-Softmax | Biased ($\tau > 0$) | Low | Yes | Discrete latents (categorical VAE) |
| Straight-through | Biased | Low | Yes (via surrogate) | VQ-VAE, quantization, hard attention |

The dividing question: can you differentiate $f$? If not (an environment, a metric, a sampled-then-scored pipeline), you’re forced onto the score-function estimator. If you can, the relaxation methods give far lower variance.

What an interviewer expects you to say

  1. State why reparameterization fails for discrete $z$ (you can’t write a discrete sample as a smooth function of noise and $\theta$).
  2. Give the score-function estimator with the log-derivative identity, and say that it’s unbiased but high-variance and that baselines reduce variance without adding bias.
  3. Explain Gumbel-Softmax as the reparameterizable relaxation with a temperature $\tau$ you anneal: biased, low variance.
  4. Describe the straight-through estimator (hard forward, soft/identity backward) and that it trains VQ-VAE and quantized nets.
  5. Connect to practice: RLHF uses score-function (PPO) because text is discrete and a 50K-way Gumbel-Softmax is impractical; DPO sidesteps sampling entirely.

Common confusions

  • “You can just backprop through argmax.” Its gradient is zero almost everywhere; that’s the whole problem.
  • “REINFORCE is biased.” It’s unbiased; its issue is variance. Baselines fix variance, not bias.
  • “Gumbel-Softmax is exact.” It’s biased for any $\tau > 0$; only the limit $\tau \to 0$ is exact, and there the gradient is uselessly high-variance.
  • “Straight-through has a principled gradient.” It doesn’t — it’s a useful heuristic (the backward op deliberately mismatches the forward op).
  • “These are RL-only / VAE-only tricks.” They’re general: hard attention, neural architecture search, discrete communication, and quantization all use them.

Related: Explain the reparameterization trick, Policy gradient, PPO, Variational autoencoders, Quantization.