One-line definition
Policy gradient methods directly parametrize a policy $\pi_\theta(a \mid s)$ and optimize it by ascending the gradient of the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]$$

This is the policy gradient theorem (Sutton et al., 1999).
Why it matters
Policy gradient methods are the foundation of modern continuous-control RL (SAC, PPO), of large-scale RL (AlphaGo’s policy net, OpenAI Five), and of LLM alignment (RLHF uses PPO). They are essential whenever:
- The action space is continuous (no easy $\arg\max_a Q(s, a)$).
- A stochastic policy is desirable (exploration, multi-modal optimal policies).
- You can write a clean, differentiable policy parametrization.
REINFORCE
The simplest policy gradient (Williams, 1992):

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

where $G_t = \sum_{t' \ge t} \gamma^{t' - t} r_{t'}$ is the return-to-go from step $t$.
Sample a trajectory by running the policy. For each step, compute $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ and weight it by the actual return $G_t$. Backprop. Take a gradient step.
Pros: extremely general; works for any differentiable policy. Cons: high variance. A single rollout's return is a noisy estimate of the expected return, so the gradient estimate is dominated by the luck of individual rollouts.
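A minimal sketch of one REINFORCE update in PyTorch. The network sizes, learning rate, and environment interface are illustrative assumptions, not specified above:

```python
import torch
import torch.nn as nn

# Hypothetical small policy network for a discrete action space (obs_dim=4, 2 actions).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One REINFORCE gradient step from a single rollout.

    states: (T, obs_dim) float tensor, actions: (T,) long tensor,
    rewards: list of floats, one per step.
    """
    # Return-to-go G_t for every step of the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # log pi_theta(a_t | s_t), weighted by the actual return.
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi * returns).mean()   # minimize negative -> ascend E[return]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```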
Variance reduction: baselines
Subtract any state-dependent baseline $b(s_t)$ from $G_t$. This does not change the expected gradient (so no bias) but can drastically reduce variance:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)$$
A near-optimal baseline is a value function $V^\pi(s_t)$. This leads directly to actor-critic.
Actor-critic
Two networks:
- Actor $\pi_\theta(a \mid s)$. The policy.
- Critic $V_\phi(s)$. Value function estimator.
Use the critic as the baseline. Define the advantage $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, estimated in practice as $A_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. Update:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t$$

The critic is trained by TD: minimize $\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2$.
This is the basic A2C / A3C algorithm.
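A sketch of the basic actor-critic (A2C-style) update over a batch of transitions. The two-network layout, shared optimizer, and coefficients are illustrative assumptions:

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # pi_theta
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # V_phi
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def a2c_update(s, a, r, s_next, done, gamma=0.99):
    """One update from a batch of transitions (s, a, r, s', done).

    s, s_next: (B, obs_dim) float tensors; a: (B,) long; r, done: (B,) float.
    """
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = r + gamma * (1.0 - done) * v_next

    # Advantage A_t = r + gamma V(s') - V(s), detached so it only weights the actor.
    advantage = (td_target - v).detach()

    log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
    actor_loss = -(log_pi * advantage).mean()
    critic_loss = (td_target - v).pow(2).mean()   # squared TD error

    opt.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    opt.step()
```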
Generalized Advantage Estimation (GAE)
Instead of one-step or full-return advantages, use a weighted sum of TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\lambda$ trades bias for variance: $\lambda = 0$ recovers one-step TD, $\lambda = 1$ recovers Monte Carlo returns. Standard in PPO.
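A sketch of the GAE computation as a backward recursion over TD errors; the function name and signature are assumptions:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: (T,) tensors; values: (T+1,) tensor including the
    bootstrap value of the final state. Returns (T,) advantages.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```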
Trust regions and PPO
Naive policy gradient updates can drastically change the policy in one step, leading to collapse. Trust Region Policy Optimization (TRPO; Schulman et al., 2015) constrains the KL divergence between old and new policies. PPO (Schulman et al., 2017) replaces the constraint with a clipped surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and $\epsilon = 0.2$ typical.
PPO is the dominant policy gradient algorithm in 2026: simpler than TRPO, robust, well-understood, used in RLHF.
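A sketch of the clipped surrogate as a loss to minimize, assuming precomputed old log-probabilities and advantage estimates (both assumptions of this snippet):

```python
import torch

def ppo_clip_loss(log_pi_new, log_pi_old, advantages, eps=0.2):
    """PPO clipped surrogate objective, returned as a loss to minimize.

    log_pi_new: log pi_theta(a_t|s_t) under the current policy (with gradient),
    log_pi_old: the same under the rollout policy (detached),
    advantages: GAE estimates, typically normalized within the batch.
    """
    ratio = torch.exp(log_pi_new - log_pi_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximize the elementwise min of the two terms -> minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```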
Off-policy actor-critic
For sample efficiency in continuous control, use a replay buffer with importance sampling correction or Q-function critics. DDPG, TD3, SAC are the standard off-policy actor-critic algorithms. SAC adds a maximum-entropy bonus that encourages exploration.
Common pitfalls
- High variance. Always use a baseline; always normalize advantages within a batch (subtract mean, divide by std); see the sketch after this list.
- No entropy bonus. Policies collapse to deterministic too early; add an entropy bonus $\beta\,\mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big]$ to the objective to encourage exploration.
- Reusing samples without importance correction. On-policy methods (REINFORCE, PPO) sample from the current policy; using stale samples introduces bias. PPO’s clipping caps the staleness.
- Mismatch between rollout and learning. In GPU-driven RL, the policy version often drifts between rollout collection and the gradient update; PPO's importance-sampling ratio handles small mismatches.
- Treating policy gradient as universally low-variance. It isn’t. Q-learning often has lower variance for discrete-action problems.
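Two of the fixes above in code form: batch advantage normalization and an entropy bonus for a categorical policy. The coefficient values and function names are illustrative assumptions:

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    # Subtract the batch mean and divide by the batch std.
    return (adv - adv.mean()) / (adv.std() + eps)

def entropy_bonus(logits, beta=0.01):
    # beta * H[pi_theta(.|s)] for a categorical policy; add to the objective
    # (equivalently, subtract from the loss) to discourage premature collapse.
    log_p = torch.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()
    return beta * entropy
```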
Related
- Q-learning. Value-based alternative.
- RLHF and DPO. PPO applied to LLM alignment.
- PPO. Algorithm-level details.