One-line definition
Policy gradient methods directly parametrize a policy $\pi_\theta(a \mid s)$ and optimize it by ascending the gradient of the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

This is the policy gradient theorem (Sutton et al., 1999).
Why it matters
Policy gradient methods are the foundation of modern continuous-control RL (SAC, PPO), of large-scale RL (AlphaGo’s policy net, OpenAI Five), and of LLM alignment (RLHF uses PPO). They are essential whenever:
- The action space is continuous (no easy $\arg\max_a Q(s, a)$).
- A stochastic policy is desirable (exploration, multi-modal optimal policies).
- You can write a clean, differentiable policy parametrization.
REINFORCE
The simplest policy gradient (Williams, 1992):

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $G_t = \sum_{t' \ge t} \gamma^{t' - t} r_{t'}$ is the return-to-go from step $t$.
Sample a trajectory by running the policy. For each step, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ and weight it by the actual return $G_t$. Backprop. Take a gradient step.
Pros: extremely general; works for any differentiable policy. Cons: high variance. A single rollout's return is a noisy estimate of the expected return, so the gradient estimate is dominated by the luck of individual rollouts.
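A minimal sketch of one REINFORCE update in PyTorch. The network sizes, learning rate, and environment interface are illustrative assumptions, not specified above:

```python
import torch
import torch.nn as nn

# Hypothetical small policy network for a discrete action space (obs_dim=4, 2 actions).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One REINFORCE gradient step from a single rollout.

    states: (T, obs_dim) float tensor, actions: (T,) long tensor,
    rewards: list of floats, one per step.
    """
    # Return-to-go G_t for every step of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # log pi_theta(a_t | s_t), weighted by the actual return.
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi * returns).mean()   # minimize negative -> ascend E[return]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```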
Variance reduction: baselines
Subtract any state-dependent baseline $b(s_t)$ from $G_t$. This does not change the expected gradient (so no bias) but can drastically reduce variance:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)$$
A near-optimal baseline is a value function $V^\pi(s_t)$. This leads directly to actor-critic.
Actor-critic
Two networks:
- Actor $\pi_\theta(a \mid s)$. The policy.
- Critic $V_\phi(s)$. Value function estimator.
Use the critic as the baseline. Define the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, estimated in practice as $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. Update:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t$$

The critic is trained by TD: minimize $\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2$.
This is the basic A2C / A3C algorithm.
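A sketch of the basic actor-critic (A2C-style) update over a batch of transitions. The two-network layout, shared optimizer, and coefficients are illustrative assumptions:

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # pi_theta
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # V_phi
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def a2c_update(s, a, r, s_next, done, gamma=0.99):
    """One update from a batch of transitions (s, a, r, s', done).

    s, s_next: (B, obs_dim) float tensors; a: (B,) long; r, done: (B,) float.
    """
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = r + gamma * (1.0 - done) * v_next

    # Advantage A_t = r + gamma V(s') - V(s), detached so it only weights the actor.
    advantage = (td_target - v).detach()

    log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi * advantage).mean()
    critic_loss = (td_target - v).pow(2).mean()   # squared TD error

    opt.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    opt.step()
```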
Generalized Advantage Estimation (GAE)
Instead of one-step or full-return advantages, use a weighted sum of TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda$ trades bias for variance: $\lambda = 0$ recovers one-step TD, $\lambda = 1$ recovers Monte Carlo returns. Standard in PPO.
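A sketch of the GAE computation as a backward recursion over TD errors; the function name and signature are assumptions:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: (T,) tensors; values: (T+1,) tensor including the
    bootstrap value of the final state. Returns (T,) advantages.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```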
Trust regions and PPO
Naive policy gradient updates can drastically change the policy in one step, leading to collapse. Trust Region Policy Optimization (TRPO; Schulman et al., 2015) constrains the KL divergence between old and new policies. PPO (Schulman et al., 2017) replaces the constraint with a clipped surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and $\epsilon = 0.2$ typical.
PPO is the dominant policy gradient algorithm in 2026: simpler than TRPO, robust, well-understood, used in RLHF.
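A sketch of the clipped surrogate as a loss to minimize, assuming precomputed old log-probabilities and advantage estimates (both assumptions of this snippet):

```python
import torch

def ppo_clip_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize.

    log_pi_new: log pi_theta(a_t|s_t) under the current policy (with gradient),
    log_pi_old: the same under the rollout policy (detached),
    advantages: GAE estimates, typically normalized within the batch.
    """
    ratio = torch.exp(log_pi_new - log_pi_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximize the elementwise min of the two terms -> minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```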
Off-policy actor-critic
For sample efficiency in continuous control, use a replay buffer with importance sampling correction or Q-function critics. DDPG, TD3, SAC are the standard off-policy actor-critic algorithms. SAC adds a maximum-entropy bonus that encourages exploration.
Common pitfalls
- High variance. Always use a baseline; always normalize advantages within a batch (subtract mean, divide by std); see the sketch after this list.
- No entropy bonus. Policies collapse to deterministic too early; add an entropy bonus $\beta\,\mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big]$ to the objective to encourage exploration.
- Reusing samples without importance correction. On-policy methods (REINFORCE, PPO) sample from the current policy; using stale samples introduces bias. PPO’s clipping caps the staleness.
- Mismatch between rollout and learning. In GPU-driven RL, the policy version often drifts between rollout collection and the gradient update; PPO's importance-sampling ratio handles small mismatches.
- Treating policy gradient as universally low-variance. It isn’t. Q-learning often has lower variance for discrete-action problems.
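Two of the fixes above in code form: batch advantage normalization and an entropy bonus for a categorical policy. The coefficient values and function names are illustrative assumptions:

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    # Subtract the batch mean and divide by the batch std.
    return (adv - adv.mean()) / (adv.std() + eps)

def entropy_bonus(logits, beta=0.01):
    # beta * H[pi_theta(.|s)] for a categorical policy; add to the objective
    # (equivalently, subtract from the loss) to discourage premature collapse.
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return beta * entropy
```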
Related
- Q-learning. Value-based alternative.
- RLHF and DPO. PPO applied to LLM alignment.
- PPO. Algorithm-level details.