Advantage estimation and GAE

Policy gradients need a low-variance estimate of how much better an action was than average. GAE is the standard answer: an exponentially weighted blend of n-step returns.


One-line definition

The advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ measures how much better an action is than the policy's average behavior in that state. Generalized Advantage Estimation (GAE, Schulman et al., 2016) estimates it as an exponentially weighted average of $n$-step TD residuals, controlled by a single parameter $\lambda \in [0, 1]$.

Why it matters

Policy gradient methods optimize $J(\theta) = \mathbb{E}\big[\sum_t \gamma^t r_t\big]$ via the gradient estimator $\hat{g} = \mathbb{E}\big[\sum_t \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$. The choice of $\Psi_t$ controls the bias-variance tradeoff:

  • $\Psi_t = \sum_{t'} \gamma^{t'} r_{t'}$ (full return): unbiased, high variance.
  • $\Psi_t = Q^\pi(s_t, a_t)$: lower variance but needs an action-value estimator.
  • $\Psi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$: same expectation as $Q^\pi$ but with the baseline subtracted, lower variance.

Substituting an estimator for $A^\pi$ introduces bias. GAE makes this tradeoff explicit and tunable. It is the default advantage estimator in PPO, the most widely deployed RL algorithm.

The mechanism

Define the TD residual at step $t$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

The $n$-step advantage estimate is

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)$$

GAE blends all $n$-step estimates with exponential weight $\lambda$:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\hat{A}_t^{(n)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$$

In code this collapses to a backward recursion:
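
A minimal NumPy sketch of the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; the helper name, array layout, and done-masking are illustrative assumptions rather than a fixed API:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE recursion: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}.

    rewards, dones : arrays of length T
    values         : array of length T + 1, including the bootstrap V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # value-function regression targets
    return advantages, returns
```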

A single backward pass over the trajectory.

The two knobs

  • $\gamma$ (discount): how much future reward matters. Part of the problem definition; usually 0.99 for episodic tasks, 0.95 to 0.999 for continuing tasks.
  • $\lambda$ (GAE): the bias-variance dial (both limits are written out after this list).
    • $\lambda = 0$ recovers the 1-step TD residual $\delta_t$: low variance, biased by value-function errors.
    • $\lambda = 1$ recovers the full Monte Carlo return minus $V(s_t)$: unbiased, high variance.
    • $\lambda = 0.95$ to $0.97$ is the standard range for PPO.
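
Both limits can be read off the GAE sum; for $\lambda = 1$ the value terms telescope (assuming the tail $\gamma^K V(s_{t+K})$ vanishes):

$$\lambda = 0: \quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\lambda = 1: \quad \hat{A}_t = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k} = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - V(s_t)$$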

Why subtracting a baseline reduces variance

For any baseline $b(s_t)$ that depends only on the state, $\mathbb{E}_{a_t \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = 0$. So the gradient estimator

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - b(s_t)\big)$$

has the same expectation but lower variance when $b(s_t)$ correlates with the return $R_t$. A near-optimal choice is $b(s_t) = V^\pi(s_t)$, hence the advantage formulation.
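
The identity behind this is one line: the score function averages to zero under the policy, so a state-dependent baseline adds nothing in expectation (shown here for a discrete action space):

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$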

How it is used in PPO

PPO trains an actor and a value critic jointly. Each rollout iteration proceeds as follows (a code sketch follows the list):

  1. Run the policy for $T$ steps in $N$ parallel environments. Collect transitions.
  2. Run the value network on every observed state to get $V(s_t)$.
  3. Compute $\delta_t$ and then $\hat{A}_t$ via the GAE recursion.
  4. Compute returns as $\hat{R}_t = \hat{A}_t + V(s_t)$ for the value-function regression target.
  5. Normalize advantages (subtract mean, divide by std) per batch. Important for training stability.
  6. Train the policy with the clipped objective using $\hat{A}_t$, and train the value network on $\hat{R}_t$.
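
A minimal sketch of steps 2–6 on dummy data, reusing the compute_gae helper from the mechanism section (array names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 128                               # rollout length (step 1; toy data stands in for real transitions)
rewards = rng.normal(size=T)
dones = np.zeros(T)                   # no terminations in this toy rollout
values = rng.normal(size=T + 1)       # step 2: V(s_0), ..., V(s_T) from the critic

# Steps 3-4: GAE advantages and value-function targets.
advantages, returns = compute_gae(rewards, values, dones, gamma=0.99, lam=0.95)

# Step 5: per-batch advantage normalization.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Step 6: `advantages` feeds the clipped policy loss, `returns` the value-regression loss.
```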

Common pitfalls

  • Forgetting to bootstrap on truncation. When an episode is cut off mid-trajectory (truncated rather than terminated), the final residual should use $\gamma V(s_{t+1})$ as the bootstrap instead of zero. Conflating truncation with termination is a frequent bug (see the sketch after this list).
  • Not normalizing advantages. PPO almost always benefits from per-batch advantage normalization.
  • Using $\lambda = 1$ with a large $\gamma$ on long horizons. Variance explodes.
  • Applying GAE in off-policy settings without correction. GAE assumes on-policy data. With a replay buffer and importance sampling, V-trace or Retrace is the corrected version.
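
A sketch of the termination-vs-truncation distinction for the first pitfall, using Gymnasium-style terminated/truncated flags (the helper and variable names are illustrative):

```python
import numpy as np

def td_residuals(rewards, values, next_values, terminated, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) * (1 - terminated_t) - V(s_t).

    Only a true termination zeroes the bootstrap. A truncated episode keeps
    gamma * V(s_{t+1}), so the tail of the return is not silently dropped.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    terminated = np.asarray(terminated, dtype=np.float64)
    return rewards + gamma * next_values * (1.0 - terminated) - values
```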