
Policy gradient methods

Directly optimize the policy by following the gradient of expected return. REINFORCE, actor-critic, and the foundation of modern RL.

Reviewed · 3 min read

One-line definition

Policy gradient methods directly parametrize a policy $\pi_\theta(a \mid s)$ and optimize by ascending the gradient of expected return:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$

This is the policy gradient theorem (Sutton et al., 1999).

Why it matters

Policy gradient methods are the foundation of modern continuous-control RL (SAC, PPO), of large-scale RL (AlphaGo’s policy net, OpenAI Five), and of LLM alignment (RLHF uses PPO). They are essential whenever:

  • The action space is continuous (no easy $\arg\max_a Q(s, a)$).
  • A stochastic policy is desirable (exploration, multi-modal optimal policies).
  • You can write a clean, differentiable policy parametrization.

REINFORCE

The simplest policy gradient (Williams, 1992):

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the return-to-go from step $t$.

Sample a trajectory by running the policy. For each step, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ and weight it by the actual return $G_t$. Backprop. Take a gradient step.
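The loop above can be sketched in a few lines of plain Python. The two-armed bandit, its payout probabilities, and the learning rate are illustrative assumptions, and the log-probability gradient is written out by hand since the policy is just two softmax logits:

```python
import math
import random

# REINFORCE sketch on a toy two-armed bandit (assumed setup):
# arm 1 pays off more often, so the policy should learn to prefer it.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def pull(arm, rng):
    # Arm 0 pays +1 with prob 0.2, arm 1 with prob 0.8 (assumed rewards).
    return 1.0 if rng.random() < (0.2 if arm == 0 else 0.8) else 0.0

rng = random.Random(0)
theta = [0.0, 0.0]   # policy parameters (logits)
lr = 0.1

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if rng.random() < probs[0] else 1   # sample an action from pi_theta
    G = pull(a, rng)                          # one-step episode: return = reward
    # For a softmax policy, d log pi(a) / d theta_i = 1[i == a] - probs[i];
    # weight the gradient by the sampled return G.
    for i in range(2):
        theta[i] += lr * ((1.0 if i == a else 0.0) - probs[i]) * G
```

After training, the probability mass should have shifted toward the better-paying arm.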

Pros: extremely general; works for any differentiable policy. Cons: high variance. The sampled return is a noisy estimate of expected return, so the gradient estimate is dominated by rollout luck.

Variance reduction: baselines

Subtract any state-dependent baseline $b(s_t)$ from $G_t$. This does not change the expected gradient (no bias) but can drastically reduce variance:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)$$

A near-optimal baseline is the value function $V^\pi(s_t)$. This leads directly to actor-critic.
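A toy illustration of why this helps (the single-state setup and reward values are assumptions chosen to make the effect stark): both estimators are unbiased for the same gradient component, but subtracting a baseline near the expected reward collapses the variance.

```python
import random

# One state, two actions under a fixed 50/50 policy, deterministic
# rewards of 5 and 6 (assumed). Estimate the policy-gradient component
# for the second logit, with and without a baseline.

rng = random.Random(1)
probs = [0.5, 0.5]                 # current policy pi(a|s)
rewards = {0: 5.0, 1: 6.0}         # per-action rewards

def grad_estimate(baseline):
    # One REINFORCE-style sample of the gradient w.r.t. logit 1.
    a = 0 if rng.random() < probs[0] else 1
    weight = rewards[a] - baseline                 # (return - baseline)
    # For a softmax policy, d log pi(a) / d theta_1 = 1[a == 1] - probs[1].
    return ((1.0 if a == 1 else 0.0) - probs[1]) * weight

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

no_base = [grad_estimate(0.0) for _ in range(10000)]
with_base = [grad_estimate(5.5) for _ in range(10000)]  # baseline = E[reward]

print(variance(no_base), variance(with_base))  # baseline variance is far smaller
```

Here the rewards are both large and positive, so without a baseline every sampled action gets reinforced and only their relative frequency sorts things out; centering the returns makes each sample informative on its own.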

Actor-critic

Two networks:

  • Actor $\pi_\theta(a \mid s)$: the policy.
  • Critic $V_\phi(s)$: a value-function estimator.

Use the critic as the baseline. Define the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, estimated in practice by the one-step TD error $\hat{A}_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. Update:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

The critic is trained by TD: minimize $\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2$.

This is the basic A2C / A3C algorithm.
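A single actor-critic update can be sketched in tabular form (the states, learning rates, and toy transition are assumptions): the TD error serves both as the critic's regression error and as the actor's advantage weight.

```python
import math

# Tabular actor-critic sketch: critic V is a dict over states, the actor
# keeps softmax logits per state. All names and constants are illustrative.
gamma = 0.99
alpha_v, alpha_pi = 0.1, 0.05

V = {"s0": 0.0, "s1": 0.0}       # critic
logits = {"s0": [0.0, 0.0]}      # actor parameters for state s0

def softmax(ls):
    m = max(ls)
    e = [math.exp(x - m) for x in ls]
    z = sum(e)
    return [x / z for x in e]

def update(s, a, r, s_next):
    # One-step TD error = advantage estimate for (s, a).
    delta = r + gamma * V[s_next] - V[s]
    # Critic: move V(s) toward the TD target.
    V[s] += alpha_v * delta
    # Actor: policy-gradient step weighted by the advantage.
    p = softmax(logits[s])
    for i in range(len(logits[s])):
        logits[s][i] += alpha_pi * ((1.0 if i == a else 0.0) - p[i]) * delta

update("s0", 1, 1.0, "s1")   # a rewarded transition raises V(s0) and logit 1
```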

Generalized Advantage Estimation (GAE)

Instead of one-step or full-return advantages, use an exponentially weighted sum of TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda$ trades bias for variance: $\lambda = 0$ gives one-step TD; $\lambda = 1$ gives Monte-Carlo returns (minus the baseline). Standard in PPO.
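A minimal implementation of the recursion (function name and defaults are illustrative). The backward pass uses the identity $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$:

```python
# GAE over one finite trajectory. `values` has length T+1: it includes
# the bootstrap value of the state after the final step.
def gae(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        running = delta + gamma * lam * running                  # A_t = d_t + (g*l) A_{t+1}
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step TD errors; `lam=1` recovers discounted Monte-Carlo returns minus the value baseline.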

Trust regions and PPO

Naive policy gradient updates can drastically change the policy in one step, leading to collapse. Trust Region Policy Optimization (TRPO; Schulman et al., 2015) constrains the KL divergence between old and new policies. PPO (Schulman et al., 2017) replaces the constraint with a clipped surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\epsilon \approx 0.1$–$0.2$ typical.

PPO is the dominant policy gradient algorithm in 2026: simpler than TRPO, robust, well-understood, used in RLHF.
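The clipped term can be computed per sample in a few lines (a sketch of the objective, not any particular library's API; working with log-probabilities avoids numerical overflow in the ratio):

```python
import math

# Per-sample PPO clipped surrogate; maximize the returned value.
def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)            # r_t(theta)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(r, 1-eps, 1+eps)
    # Pessimistic (min) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage there is no gain from pushing the ratio above $1+\epsilon$; with a negative advantage there is no gain from pushing it below $1-\epsilon$. Either way, a single update has no incentive to move far from $\pi_{\theta_{\mathrm{old}}}$.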

Off-policy actor-critic

For sample efficiency in continuous control, use a replay buffer with importance sampling correction or Q-function critics. DDPG, TD3, SAC are the standard off-policy actor-critic algorithms. SAC adds a maximum-entropy bonus that encourages exploration.

Common pitfalls

  • High variance. Always use a baseline; always normalize advantages within a batch (subtract mean, divide by std).
  • No entropy bonus. Policies collapse to near-deterministic too early; add an entropy term $\beta\,\mathcal{H}\!\left[\pi_\theta(\cdot \mid s_t)\right]$ to the objective to encourage exploration.
  • Reusing samples without importance correction. On-policy methods (REINFORCE, PPO) sample from the current policy; using stale samples introduces bias. PPO’s clipping caps the staleness.
  • Mismatch between rollout and learning. GPU-driven RL often has the policy version drift between rollout and gradient update; PPO’s IS ratio handles small mismatches.
  • Treating policy gradient as universally low-variance. It isn’t. Q-learning often has lower variance for discrete-action problems.
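The advantage normalization from the first pitfall, as a standalone sketch (the function name is illustrative; in practice this is usually one line of numpy or torch over the whole batch):

```python
# Normalize a batch of advantages: subtract the mean, divide by the std.
# The small eps guards against division by zero on a constant batch.
def normalize_advantages(advs, eps=1e-8):
    n = len(advs)
    mean = sum(advs) / n
    var = sum((a - mean) ** 2 for a in advs) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advs]
```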