One-line definition
PPO (Schulman et al., 2017) is a policy-gradient algorithm that keeps each gradient update close to the previous policy by clipping the ratio of new to old action probabilities. It is the default RL algorithm for continuous control, large-scale games, and RLHF in 2026.
Why it matters
PPO is the most-used deep RL algorithm for several reasons:
- Stable across domains. Robotics, Atari, locomotion, and language-model fine-tuning all use it with similar hyperparameters.
- Simple: first-order only, with no Hessian-vector products or conjugate-gradient steps (unlike TRPO).
- Mini-batch friendly: samples can be reused for several epochs per rollout.
- Backbone of RLHF: ChatGPT, Claude, and Llama-Instruct were all trained with PPO on reward-model signals.
The objective
For a policy $\pi_\theta$ and the old policy $\pi_{\theta_{\text{old}}}$ that collected the latest rollout, define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

PPO maximizes the clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$

with advantage estimate $\hat{A}_t$ (typically GAE) and clip parameter $\epsilon$ (commonly 0.2).
The min ensures that when the advantage is positive, the objective is capped at $(1+\epsilon)\hat{A}_t$, preventing the new policy from over-committing to the action; when the advantage is negative, it is floored at $(1-\epsilon)\hat{A}_t$, so there is no incentive to push the action's probability much further down.
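As a concrete sketch of that objective (PyTorch assumed; the function and tensor names are illustrative, not from the paper):

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, to be *maximized* (negate it for a loss)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: element-wise minimum, then average over the batch.
    return torch.min(unclipped, clipped).mean()
```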
Full PPO objective
In addition to $L^{\text{CLIP}}$, PPO usually includes:
- Value function loss: $L^{VF}(\theta) = \big(V_\theta(s_t) - V_t^{\text{target}}\big)^2$, trained jointly with the policy.
- Entropy bonus: $S[\pi_\theta](s_t)$. Encourages exploration.
Total:

$$L(\theta) = \hat{\mathbb{E}}_t\!\left[L^{\text{CLIP}}(\theta) - c_1\, L^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$

Standard values: $c_1 = 0.5$, $c_2 = 0.01$.
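A sketch of the combined loss under those defaults, reusing the `clipped_surrogate` helper from the sketch above (PyTorch assumed; `dist` stands for a `torch.distributions` object for the current policy):

```python
def ppo_loss(dist, actions, old_log_probs, advantages, values, returns,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Total PPO loss to *minimize*: -surrogate + value loss - entropy bonus."""
    new_log_probs = dist.log_prob(actions)
    policy_objective = clipped_surrogate(new_log_probs, old_log_probs,
                                         advantages, clip_eps)
    value_loss = ((values - returns) ** 2).mean()   # squared-error critic loss
    entropy = dist.entropy().mean()                  # exploration bonus
    return -policy_objective + vf_coef * value_loss - ent_coef * entropy
```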
The training loop
For each iteration:
- Rollout: run the current policy in the environment for $T$ steps × $N$ parallel actors. Collect $(s_t, a_t, r_t, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$.
- Compute advantages: $\hat{A}_t$ via GAE, using the current value function $V_\theta$.
- Optimize: SGD (or Adam) on $L(\theta)$ for $K$ epochs over the rollout data, in mini-batches.
- Update: set $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ at the start of the next iteration.
Typical: $T = 2048$ steps per actor, $N = 8$–$64$ actors, $K = 3$–$10$ epochs, mini-batch size 64–256, $\gamma = 0.99$, $\lambda = 0.95$, $\epsilon = 0.2$, learning rate $3 \times 10^{-4}$.
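A minimal sketch of the GAE step in that loop (NumPy assumed; array names and shapes are illustrative):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    `values` has length T + 1 (includes the bootstrap value of the final state);
    `dones[t]` is 1.0 if the episode ended at step t, else 0.0.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future deltas.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value-function loss
    return advantages, returns
```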
Why clipping helps
Without clipping, naive policy gradient with off-policy samples (multiple SGD epochs over the same rollout) can cause runaway updates: a single very-positive advantage gets multiplied by an unbounded ratio, the policy moves drastically, the next sample is from a much-changed distribution, and learning collapses.
Clipping bounds the per-step policy change. The min in the surrogate keeps the algorithm from “overshooting” in either direction.
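A small numeric illustration of that bound (the numbers are made up for the example): with $\epsilon = 0.2$ and a positive advantage, a sample whose ratio has drifted past $1 + \epsilon$ contributes zero gradient, so the update cannot keep pushing in that direction.

```python
import torch

advantage = torch.tensor(5.0)
eps = 0.2

for ratio in [1.1, 1.5, 3.0]:
    r = torch.tensor(ratio, requires_grad=True)
    surrogate = torch.min(r * advantage,
                          torch.clamp(r, 1 - eps, 1 + eps) * advantage)
    surrogate.backward()
    # Gradient is 5.0 while the ratio is inside (0.8, 1.2); once it leaves the
    # clip range the surrogate is flat and the gradient drops to 0.0.
    print(f"ratio={ratio:.1f}  surrogate={surrogate.item():.1f}  grad={r.grad.item():.1f}")
```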
PPO for RLHF
In RLHF (see RLHF and DPO):
- State: prompt + partial response.
- Action: next token.
- Reward: scalar score from the reward model at the end of the response, minus a per-token KL penalty against a frozen reference model.
- Policy: the LLM being aligned.
The KL penalty is critical. It prevents the policy from drifting into reward-model-exploiting nonsense. Without it, PPO will find adversarial token sequences that maximize the reward model's score but produce gibberish.
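A sketch of that reward shaping (PyTorch assumed; the `kl_coef` value and tensor names are illustrative choices, not fixed by any particular RLHF implementation):

```python
import torch

def rlhf_token_rewards(policy_logprobs, ref_logprobs, rm_score, kl_coef=0.1):
    """Per-token PPO rewards for RLHF: KL penalty everywhere, RM score on the last token.

    policy_logprobs, ref_logprobs: log-probs of the sampled response tokens under
    the policy and the frozen reference model, shape (response_len,).
    rm_score: scalar reward-model score for the full response.
    """
    with torch.no_grad():
        # Per-token KL penalty, approximated by the log-prob gap on the sampled tokens.
        kl_penalty = policy_logprobs - ref_logprobs
        rewards = -kl_coef * kl_penalty
        # The reward model scores the whole response; attribute it to the final token.
        rewards[-1] += rm_score
    return rewards
```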
PPO vs. alternatives
| Algorithm | When to use |
|---|---|
| PPO | Default for on-policy with continuous or discrete actions; RLHF |
| SAC | Off-policy continuous control with sample efficiency required |
| DQN | Discrete actions, off-policy data abundant |
| DPO (LLM) | Direct preference optimization without explicit reward model. Increasingly displacing PPO for LLM alignment |
| TRPO | When KL constraint matters more than simplicity |
Common pitfalls
- Wrong advantage normalization. Normalize advantages within each mini-batch (zero mean, unit std); without it, learning is often unstable.
- Using too many epochs per rollout. More than about 10 epochs leads to too much off-policy drift; clipping limits the bias, but the variance grows.
- Forgetting the value-function loss coefficient. $c_1 \approx 0.5$ keeps the actor and critic losses on a similar scale.
- Logging episodic reward only. Watch entropy, the KL between consecutive policies, value-function explained variance, and the clip fraction; each diagnoses a different failure mode (see the sketch after this list).
- Treating PPO as parameter-free. It has many hyperparameters; defaults work, but tuning often gives ~2× sample efficiency.
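A sketch of the normalization and per-update diagnostics mentioned above (PyTorch assumed; the helper names are illustrative):

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    """Zero-mean, unit-std normalization within a mini-batch."""
    return (adv - adv.mean()) / (adv.std() + eps)

def ppo_diagnostics(new_log_probs, old_log_probs, clip_eps=0.2):
    """Cheap per-update health checks: approximate KL and clip fraction."""
    with torch.no_grad():
        log_ratio = new_log_probs - old_log_probs
        ratio = torch.exp(log_ratio)
        # Low-variance estimator of KL(old || new) on the sampled actions.
        approx_kl = ((ratio - 1.0) - log_ratio).mean()
        # Fraction of samples whose ratio left the clip range.
        clip_fraction = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return {"approx_kl": approx_kl.item(), "clip_fraction": clip_fraction.item()}
```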
Related
- Policy gradient. Foundational algorithm.
- RLHF and DPO. LLM-alignment context.