Advantage estimation and GAE

Policy gradients need a low-variance estimate of how much better an action was than average. GAE is the standard answer: an exponentially weighted blend of n-step returns.


One-line definition

The advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ measures how much better an action is than the policy's average behavior in that state. Generalized Advantage Estimation (GAE, Schulman et al., 2016) estimates it as an exponentially weighted average of $n$-step TD residuals, controlled by a single parameter $\lambda \in [0, 1]$.

Why it matters

Policy gradient methods optimize $J(\theta) = \mathbb{E}\big[\sum_t \gamma^t r_t\big]$ via the gradient estimator $\hat{g} = \mathbb{E}\big[\sum_t \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$. The choice of $\Psi_t$ controls the bias-variance tradeoff:

  • $\Psi_t = \sum_{t'} \gamma^{t'} r_{t'}$ (full return): unbiased, high variance.
  • $\Psi_t = Q^\pi(s_t, a_t)$: lower variance but needs an action-value estimator.
  • $\Psi_t = A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$: same expectation as $Q^\pi$ but with the baseline subtracted, lower variance.

Substituting an estimator for $A^\pi$ introduces bias. GAE makes this tradeoff explicit and tunable. It is the default advantage estimator in PPO, the most widely deployed RL algorithm.

The mechanism

Define the TD residual at step $t$:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

The $n$-step advantage estimate is

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t)$$

GAE blends all $n$-step estimates with exponential weight $\lambda$:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1}\hat{A}_t^{(n)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$$

In code this collapses to a backward recursion:
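
A minimal NumPy sketch of the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$; the helper name, array layout, and done-masking are illustrative assumptions rather than a fixed API:

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE recursion: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}.

    rewards, dones : arrays of length T
    values         : array of length T + 1, including the bootstrap V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # value-function regression targets
    return advantages, returns
```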

A single backward pass over the trajectory.

The two knobs

  • $\gamma$ (discount): how much future reward matters. Part of the problem definition; usually 0.99 for episodic tasks, 0.95 to 0.999 for continuing tasks.
  • $\lambda$ (GAE): the bias-variance dial (both limits are written out after this list).
    • $\lambda = 0$ recovers the 1-step TD residual $\delta_t$: low variance, biased by value-function errors.
    • $\lambda = 1$ recovers the full Monte Carlo return minus $V(s_t)$: unbiased, high variance.
    • $\lambda = 0.95$ to $0.97$ is the standard range for PPO.
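
Both limits can be read off the GAE sum; for $\lambda = 1$ the value terms telescope (assuming the tail $\gamma^K V(s_{t+K})$ vanishes):

$$\lambda = 0: \quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$$\lambda = 1: \quad \hat{A}_t = \sum_{k=0}^{\infty} \gamma^k \delta_{t+k} = \sum_{k=0}^{\infty} \gamma^k r_{t+k} - V(s_t)$$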

Why subtracting a baseline reduces variance

For any baseline $b(s_t)$ that depends only on the state, $\mathbb{E}_{a_t \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = 0$. So the gradient estimator

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - b(s_t)\big)$$

has the same expectation but lower variance when $b(s_t)$ correlates with the return $R_t$. A near-optimal choice is $b(s_t) = V^\pi(s_t)$, hence the advantage formulation.
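
The identity behind this is one line: the score function averages to zero under the policy, so a state-dependent baseline adds nothing in expectation (shown here for a discrete action space):

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$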

How it is used in PPO

PPO trains an actor and a value critic jointly. Each rollout iteration proceeds as follows (a code sketch follows the list):

  1. Run the policy for $T$ steps in $N$ parallel environments. Collect transitions.
  2. Run the value network on every observed state to get $V(s_t)$.
  3. Compute $\delta_t$ and then $\hat{A}_t$ via the GAE recursion.
  4. Compute returns as $\hat{R}_t = \hat{A}_t + V(s_t)$ for the value-function regression target.
  5. Normalize advantages (subtract mean, divide by std) per batch. Important for training stability.
  6. Train the policy with the clipped objective using $\hat{A}_t$, and train the value network on $\hat{R}_t$.
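
A minimal sketch of steps 2–6 on dummy data, reusing the compute_gae helper from the mechanism section (array names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 128                               # rollout length (step 1; toy data stands in for real transitions)
rewards = rng.normal(size=T)
dones = np.zeros(T)                   # no terminations in this toy rollout
values = rng.normal(size=T + 1)       # step 2: V(s_0), ..., V(s_T) from the critic

# Steps 3-4: GAE advantages and value-function targets.
advantages, returns = compute_gae(rewards, values, dones, gamma=0.99, lam=0.95)

# Step 5: per-batch advantage normalization.
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Step 6: `advantages` feeds the clipped policy loss, `returns` the value-regression loss.
```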

Common pitfalls

  • Forgetting to bootstrap on truncation. When an episode is cut off mid-trajectory (truncated rather than terminated), the final residual should use $\gamma V(s_{t+1})$ as the bootstrap instead of zero. Conflating truncation with termination is a frequent bug (see the sketch after this list).
  • Not normalizing advantages. PPO almost always benefits from per-batch advantage normalization.
  • Using $\lambda = 1$ with a large $\gamma$ on long horizons. Variance explodes.
  • Applying GAE in off-policy settings without correction. GAE assumes on-policy data. With a replay buffer and importance sampling, V-trace or Retrace is the corrected version.
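
A sketch of the termination-vs-truncation distinction for the first pitfall, using Gymnasium-style terminated/truncated flags (the helper and variable names are illustrative):

```python
import numpy as np

def td_residuals(rewards, values, next_values, terminated, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) * (1 - terminated_t) - V(s_t).

    Only a true termination zeroes the bootstrap. A truncated episode keeps
    gamma * V(s_{t+1}), so the tail of the return is not silently dropped.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    terminated = np.asarray(terminated, dtype=np.float64)
    return rewards + gamma * next_values * (1.0 - terminated) - values
```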