Discrete gradient estimators

One-line definition

Discrete gradient estimators approximate $\nabla_{θ} E_{z \sim p_{θ}} [f (z)]$ when $z$ is discrete — the case where you cannot reparameterize the sample as a smooth function of $θ$ and noise. The three you must know: REINFORCE (score function), Gumbel-Softmax (continuous relaxation), and the straight-through estimator.

Why it matters

The reparameterization trick handles continuous latents (Gaussian VAEs). But many models sample discrete objects: categorical latents, hard attention, tokens, architecture choices, RL actions. You can’t push a gradient through argmax or a categorical sample, so you need an estimator. This is the standard deep-learning follow-up to “explain the reparameterization trick,” and it underpins RLHF (which uses the score-function estimator) and discrete latent-variable models.

The core problem

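We want $\nabla_{θ} E_{z \sim p_{θ} (z)} [f (z)]$. The expectation is a sum over discrete $z$; the sampling operation is non-differentiable. The two families of solutions, score-function estimators and continuous relaxations (straight-through included), sit at opposite ends of a bias-variance tradeoff: the first is unbiased but noisy, the second is low-variance but biased.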

1. Score-function estimator (REINFORCE / likelihood ratio)

Use the log-derivative identity $\nabla_{θ} p_{θ} (z) = p_{θ} (z) \nabla_{θ} \log p_{θ} (z)$:

$$ \nabla_{θ} E_{z \sim p_{θ}} [f (z)] = E_{z \sim p_{θ}} [f (z) \nabla_{θ} \log p_{θ} (z)] . $$
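Spelled out, the step from the identity to the estimator is short. The chain below assumes only that $f$ does not depend on $θ$ and that the sum over $z$ is finite, so the gradient can move inside it:

$$
\begin{aligned}
\nabla_{θ} E_{z \sim p_{θ}} [f (z)]
&= \nabla_{θ} \sum_{z} p_{θ} (z) f (z) \\
&= \sum_{z} f (z) \nabla_{θ} p_{θ} (z) \\
&= \sum_{z} f (z) p_{θ} (z) \nabla_{θ} \log p_{θ} (z) \\
&= E_{z \sim p_{θ}} [f (z) \nabla_{θ} \log p_{θ} (z)] .
\end{aligned}
$$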

The estimator is unbiased and requires only that you can sample $z$ and evaluate (and differentiate) $\log p_{θ} (z)$; $f$ can be a black box (non-differentiable, even an environment reward).
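A minimal sketch of the estimator, assuming a categorical distribution parameterized by softmax logits (the text does not fix a parameterization, so that choice and all function names here are illustrative). Because the categories can be enumerated in this toy case, the Monte Carlo estimate can be compared against the exact gradient.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, f):
    """Exact gradient of E[f(z)] w.r.t. the logits, by enumerating categories."""
    p = softmax(logits)
    mean_f = sum(p_k * f(k) for k, p_k in enumerate(p))
    # For softmax logits: dE/dlogit_k = p_k * (f(k) - E[f]).
    return [p_k * (f(k) - mean_f) for k, p_k in enumerate(p)]


def score_function_gradient(logits, f, num_samples, rng):
    """Monte Carlo average of f(z) * grad log p(z) over sampled z."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        z = rng.choices(range(len(p)), weights=p)[0]
        fz = f(z)  # f is only evaluated, never differentiated
        for k in range(len(logits)):
            # d log p(z) / d logit_k = 1[k == z] - p_k
            score_k = (1.0 if k == z else 0.0) - p[k]
            grad[k] += fz * score_k / num_samples
    return grad


def reward(z):
    """Hypothetical black-box reward: 1 if category 1 was drawn, else 0."""
    return float(z == 1)


logits = [0.5, -1.0, 2.0]
exact = exact_gradient(logits, reward)
estimate = score_function_gradient(logits, reward, 20000, random.Random(0))
print("exact:   ", exact)
print("estimate:", estimate)
```

The estimate should agree with the exact gradient up to sampling noise; shrinking `num_samples` makes the noise visible quickly, which is the variance problem in miniature.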

The catch is high variance. Mitigations:

Baselines / control variates: subtract a baseline $b$ that doesn’t depend on $z$: $(f (z) - b) \nabla_{θ} \log p_{θ} (z)$. Still unbiased (since $E [\nabla_{θ} \log p_{θ}] = 0$), and lower variance when $b$ is close to the expected value of $f$. The value-function baseline in actor-critic is exactly this.
More samples, advantage normalization, etc.
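The effect of a baseline is easiest to see numerically. The sketch below is a toy setup, not a recipe: it assumes a uniform three-way categorical, a hypothetical reward with a large constant offset, and a baseline equal to the exact expected reward (in practice the baseline would be estimated, for example with a running average or a learned value function). Both estimators are fed the same random draws.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_sample_gradients(logits, f, baseline, num_samples, rng):
    """Single-sample estimates (f(z) - b) * d log p(z) / d logit_0, one per draw."""
    p = softmax(logits)
    out = []
    for _ in range(num_samples):
        z = rng.choices(range(len(p)), weights=p)[0]
        score_0 = (1.0 if z == 0 else 0.0) - p[0]
        out.append((f(z) - baseline) * score_0)
    return out


def reward(z):
    """Hypothetical reward with a large offset: 10, 11, or 12."""
    return 10.0 + z


logits = [0.0, 0.0, 0.0]
p = softmax(logits)
mean_reward = sum(p_k * reward(k) for k, p_k in enumerate(p))

# Same seed for both runs, so both estimators see identical samples.
plain = per_sample_gradients(logits, reward, 0.0, 20000, random.Random(0))
centered = per_sample_gradients(logits, reward, mean_reward, 20000, random.Random(0))

mean_plain = statistics.fmean(plain)
mean_centered = statistics.fmean(centered)
var_plain = statistics.pvariance(plain)
var_centered = statistics.pvariance(centered)
print("means:    ", mean_plain, mean_centered)
print("variances:", var_plain, var_centered)
```

Both means estimate the same gradient component (here $p_0 (f(0) - E[f]) = -1/3$), while the per-sample variance with the baseline is far smaller, because the constant offset in the reward no longer multiplies the score.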

This estimator is policy-gradient RL. REINFORCE, A2C, and PPO all build on the score-function estimator with progressively stronger variance control (PPO’s clipped surrogate gives up strict unbiasedness in exchange for stability).

2. Gumbel-Softmax (Concrete distribution)

Relax the discrete sample into a continuous one you can reparameterize. The Gumbel-Max trick says a sample from a categorical with class probabilities $π_{i}$ equals

$$ z = \arg\max_{i} (\log π_{i} + g_{i}), \quad g_{i} \sim \mathrm{Gumbel} (0, 1) . $$

Replace the non-differentiable argmax with a temperature-$τ$ softmax:

$$ y_{i} = \frac{\exp ((\log π_{i} + g_{i}) / τ)}{\sum_{j} \exp ((\log π_{j} + g_{j}) / τ)} . $$

Now $y$ is a differentiable, reparameterized sample (a point on the simplex). As $τ \to 0$, $y$ approaches a one-hot vector but the gradient variance blows up; as $τ$ grows, samples become smooth but drift toward uniform, which increases the bias. A common schedule anneals $τ$ downward during training. Net result: low variance, biased.
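The sampling side of Gumbel-Max and Gumbel-Softmax fits in a few lines. This is a sketch under assumptions: the probabilities, temperatures, and function names are made up for illustration, and only the forward sampling is shown (no gradients). It checks two things empirically: the argmax of the perturbed log-probabilities follows $π$, and the relaxed sample $y$ is closer to one-hot at low $τ$.

```python
import math
import random


def sample_gumbel(rng):
    """Gumbel(0, 1) noise via inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_softmax_sample(log_pi, tau, rng):
    """Return (y, hard_index): relaxed sample and Gumbel-Max argmax from the same noise."""
    perturbed = [lp + sample_gumbel(rng) for lp in log_pi]
    hard_index = max(range(len(perturbed)), key=perturbed.__getitem__)
    scaled = [v / tau for v in perturbed]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps], hard_index


pi = [0.2, 0.3, 0.5]
log_pi = [math.log(p) for p in pi]
rng = random.Random(0)

# Gumbel-Max check: argmax frequencies should approach pi.
num_draws = 20000
counts = [0] * len(pi)
for _ in range(num_draws):
    _, idx = gumbel_softmax_sample(log_pi, 1.0, rng)
    counts[idx] += 1
freqs = [c / num_draws for c in counts]


def mean_peak(tau, n=2000):
    """Average of the largest coordinate of y; near 1 means near one-hot."""
    return sum(max(gumbel_softmax_sample(log_pi, tau, rng)[0]) for _ in range(n)) / n


peak_cold = mean_peak(0.1)
peak_hot = mean_peak(5.0)
print("argmax frequencies:", freqs)
print("mean peak at tau=0.1:", peak_cold, " at tau=5.0:", peak_hot)
```

In an autodiff framework the same computation would be written with tensor ops so that gradients flow from `y` back to `log_pi`; the noise `g` is sampled independently of the parameters, which is what makes this a reparameterization.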

3. Straight-through estimator (STE)

Forward pass: use the hard discrete value (e.g. argmax, or a threshold). Backward pass: pretend the operation was the identity (or the softmax), and pass the gradient straight through.

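$$ \text{forward: } z = \mathrm{one\_hot} (\arg\max_{i} \mathrm{logits}_{i}), \qquad \text{backward: } \frac{\partial z}{\partial \mathrm{logits}} \approx \frac{\partial\, \mathrm{softmax} (\mathrm{logits})}{\partial \mathrm{logits}} . $$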

Straight-Through Gumbel-Softmax combines both: a hard one-hot sample in the forward pass and the soft Gumbel-Softmax gradient in the backward pass, so the rest of the network sees a genuine discrete sample. STE is biased (the backward op isn’t the true derivative) but cheap and empirically effective; it is the workhorse behind VQ-VAE (where it copies the decoder’s gradient past the non-differentiable codebook lookup to the encoder) and binarized/quantized networks.
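To make the forward/backward mismatch concrete, here is a hand-written sketch of the softmax-surrogate variant without an autodiff library. The downstream loss $L(z) = w \cdot z$, the weights, and the function names are all hypothetical. It compares the true derivative of the hard forward pass (estimated by finite differences) with the gradient the straight-through backward pass reports.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(logits):
    """Forward pass: hard one-hot vector at the largest logit."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(logits))]


def ste_backward(logits, grad_z):
    """Backward pass: push dL/dz through the softmax Jacobian as a surrogate."""
    y = softmax(logits)
    inner = sum(g * y_i for g, y_i in zip(grad_z, y))
    # (J^T grad_z)_k with J_ik = y_i * (1[i == k] - y_k)
    return [y_k * (g_k - inner) for g_k, y_k in zip(grad_z, y)]


logits = [1.0, 0.2, -0.5]
weights = [0.0, 3.0, 1.0]  # hypothetical downstream loss L(z) = weights . z

z = one_hot_argmax(logits)  # [1.0, 0.0, 0.0]
loss = sum(w * z_i for w, z_i in zip(weights, z))
grad_logits = ste_backward(logits, weights)  # dL/dz = weights for a linear loss

# True derivative of the hard forward pass: a small nudge leaves the argmax
# unchanged, so the finite difference is exactly zero.
eps = 1e-4
nudged = [logits[0] + eps] + logits[1:]
loss_nudged = sum(w * z_i for w, z_i in zip(weights, one_hot_argmax(nudged)))
finite_diff = (loss_nudged - loss) / eps

print("hard forward z:", z)
print("finite-difference gradient:", finite_diff)
print("straight-through gradient: ", grad_logits)
```

The finite difference is zero while the straight-through gradient is not, which is the bias in plain view. Autodiff frameworks usually get the same behavior by writing the forward value as the soft value plus a gradient-blocked correction, for example `y_soft + stop_gradient(y_hard - y_soft)`.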

The bias-variance tradeoff

Estimator	Bias	Variance	Needs differentiable $f$?	Typical use
Score function (REINFORCE)	Unbiased	High	No	RL, RLHF, black-box reward
Gumbel-Softmax	Biased ($τ > 0$)	Low	Yes	Discrete latents (categorical VAE)
Straight-through	Biased	Low	Yes (via surrogate)	VQ-VAE, quantization, hard attention

The dividing question: can you differentiate $f$? If not (an environment, a metric, a sampled-then-scored pipeline), you’re forced onto the score-function estimator. If you can, the relaxation methods give far lower variance.

What an interviewer expects you to say

State why reparameterization fails for discrete $z$ (you can’t write a discrete sample as a smooth function of noise and $θ$).
Give the score-function estimator with the $\nabla \log p$ identity, that it’s unbiased but high-variance, and that baselines reduce variance without adding bias.
Explain Gumbel-Softmax as the reparameterizable relaxation with a temperature you anneal — biased, low variance.
Describe the straight-through estimator (hard forward, soft/identity backward) and that it trains VQ-VAE and quantized nets.
Connect to practice: RLHF uses score-function (PPO) because text is discrete and a 50K-way Gumbel-Softmax is impractical; DPO sidesteps sampling entirely.

Common confusions

“You can just backprop through argmax.” Its gradient is zero almost everywhere; that’s the whole problem.
“REINFORCE is biased.” It’s unbiased; its issue is variance. Baselines fix variance, not bias.
“Gumbel-Softmax is exact.” It’s biased for any $τ > 0$; only the $τ \to 0$ limit is exact, and there the gradient is uselessly high-variance.
“Straight-through has a principled gradient.” It doesn’t — it’s a useful heuristic (the backward op deliberately mismatches the forward op).
“These are RL-only / VAE-only tricks.” They’re general: hard attention, neural architecture search, discrete communication, and quantization all use them.