One-line definition
PPO (Schulman et al., 2017) is a policy-gradient algorithm that keeps each gradient update close to the previous policy by clipping the ratio of new to old action probabilities. It is the default RL algorithm for continuous control, large-scale games, and RLHF in 2026.
Why it matters
PPO is the most-used deep RL algorithm for several reasons:
- Stable across domains. Robotics, Atari, locomotion, and language-model fine-tuning all use it with similar hyperparameters.
- Simple: first-order only, with no Hessian-vector products or conjugate-gradient steps (unlike TRPO).
- Mini-batch friendly: samples can be reused for several epochs per rollout.
- Backbone of RLHF: ChatGPT, Claude, and Llama-Instruct were all trained with PPO on reward-model signals.
The objective
For a policy $\pi_\theta$ and the old policy $\pi_{\theta_{\text{old}}}$ that collected the latest rollout, define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

PPO maximizes the clipped surrogate:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right]$$

with advantage estimate $\hat{A}_t$ (typically GAE) and clip parameter $\epsilon$ (commonly 0.2).
The min ensures that when the advantage is positive, the objective is capped at $(1+\epsilon)\hat{A}_t$, preventing the new policy from over-committing to the action; when the advantage is negative, it is floored at $(1-\epsilon)\hat{A}_t$, so there is no incentive to push the action's probability much further down.
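As a concrete sketch of that objective (PyTorch assumed; the function and tensor names are illustrative, not from the paper):

```python
import torch

def clipped_surrogate(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, to be *maximized* (negate it for a loss)."""
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: element-wise minimum, then average over the batch.
    return torch.min(unclipped, clipped).mean()
```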
Full PPO objective
In addition to $L^{\text{CLIP}}$, PPO usually includes:
- Value function loss: $L^{VF}(\theta) = \big(V_\theta(s_t) - V_t^{\text{target}}\big)^2$, trained jointly with the policy.
- Entropy bonus: $S[\pi_\theta](s_t)$. Encourages exploration.
Total:

$$L(\theta) = \hat{\mathbb{E}}_t\!\left[L^{\text{CLIP}}(\theta) - c_1\, L^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$

Standard values: $c_1 = 0.5$, $c_2 = 0.01$.
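A sketch of the combined loss under those defaults, reusing the `clipped_surrogate` helper from the sketch above (PyTorch assumed; `dist` stands for a `torch.distributions` object for the current policy):

```python
def ppo_loss(dist, actions, old_log_probs, advantages, values, returns,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Total PPO loss to *minimize*: -surrogate + value loss - entropy bonus."""
    new_log_probs = dist.log_prob(actions)
    policy_objective = clipped_surrogate(new_log_probs, old_log_probs,
                                         advantages, clip_eps)
    value_loss = ((values - returns) ** 2).mean()   # squared-error critic loss
    entropy = dist.entropy().mean()                  # exploration bonus
    return -policy_objective + vf_coef * value_loss - ent_coef * entropy
```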
The training loop
For each iteration:
- Rollout: run the current policy in the environment for $T$ steps × $N$ parallel actors. Collect $(s_t, a_t, r_t, \log \pi_{\theta_{\text{old}}}(a_t \mid s_t))$.
- Compute advantages: $\hat{A}_t$ via GAE, using the current value function $V_\theta$.
- Optimize: SGD (or Adam) on $L(\theta)$ for $K$ epochs over the rollout data, in mini-batches.
- Update: set $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ at the start of the next iteration.
Typical: $T = 2048$ steps per actor, $N = 8$–$64$ actors, $K = 3$–$10$ epochs, mini-batch size 64–256, $\gamma = 0.99$, $\lambda = 0.95$, $\epsilon = 0.2$, learning rate $3 \times 10^{-4}$.
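A minimal sketch of the GAE step in that loop (NumPy assumed; array names and shapes are illustrative):

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    `values` has length T + 1 (includes the bootstrap value of the final state);
    `dones[t]` is 1.0 if the episode ended at step t, else 0.0.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future deltas.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value-function loss
    return advantages, returns
```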
Why clipping helps
Without clipping, naive policy gradient with off-policy samples (multiple SGD epochs over the same rollout) can cause runaway updates: a single very-positive advantage gets multiplied by an unbounded ratio, the policy moves drastically, the next sample is from a much-changed distribution, and learning collapses.
Clipping bounds the per-step policy change. The min in the surrogate keeps the algorithm from “overshooting” in either direction.
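A small numeric illustration of that bound (the numbers are made up for the example): with $\epsilon = 0.2$ and a positive advantage, a sample whose ratio has drifted past $1 + \epsilon$ contributes zero gradient, so the update cannot keep pushing in that direction.

```python
import torch

advantage = torch.tensor(5.0)
eps = 0.2

for ratio in [1.1, 1.5, 3.0]:
    r = torch.tensor(ratio, requires_grad=True)
    surrogate = torch.min(r * advantage,
                          torch.clamp(r, 1 - eps, 1 + eps) * advantage)
    surrogate.backward()
    # Gradient is 5.0 while the ratio is inside (0.8, 1.2); once it leaves the
    # clip range the surrogate is flat and the gradient drops to 0.0.
    print(f"ratio={ratio:.1f}  surrogate={surrogate.item():.1f}  grad={r.grad.item():.1f}")
```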
PPO for RLHF
In RLHF (see RLHF and DPO):
- State: prompt + partial response.
- Action: next token.
- Reward: scalar score from the reward model at the end of the response, minus a per-token KL penalty against a frozen reference model.
- Policy: the LLM being aligned.
The KL penalty is critical. It prevents the policy from drifting into reward-model-exploiting nonsense. Without it, PPO will find adversarial token sequences that maximize the reward model's score but produce gibberish.
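A sketch of that reward shaping (PyTorch assumed; the `kl_coef` value and tensor names are illustrative choices, not fixed by any particular RLHF implementation):

```python
import torch

def rlhf_token_rewards(policy_logprobs, ref_logprobs, rm_score, kl_coef=0.1):
    """Per-token PPO rewards for RLHF: KL penalty everywhere, RM score on the last token.

    policy_logprobs, ref_logprobs: log-probs of the sampled response tokens under
    the policy and the frozen reference model, shape (response_len,).
    rm_score: scalar reward-model score for the full response.
    """
    with torch.no_grad():
        # Per-token KL penalty, approximated by the log-prob gap on the sampled tokens.
        kl_penalty = policy_logprobs - ref_logprobs
        rewards = -kl_coef * kl_penalty
        # The reward model scores the whole response; attribute it to the final token.
        rewards[-1] += rm_score
    return rewards
```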
PPO vs. alternatives
| Algorithm | When to use |
|---|---|
| PPO | Default for on-policy with continuous or discrete actions; RLHF |
| SAC | Off-policy continuous control with sample efficiency required |
| DQN | Discrete actions, off-policy data abundant |
| DPO (LLM) | Direct preference optimization without explicit reward model. Increasingly displacing PPO for LLM alignment |
| TRPO | When KL constraint matters more than simplicity |
Common pitfalls
- Wrong advantage normalization. Normalize advantages within each mini-batch (zero mean, unit std); without it, learning is often unstable.
- Using too many epochs per rollout. More than about 10 epochs leads to too much off-policy drift; clipping limits the bias, but the variance grows.
- Forgetting the value-function loss coefficient. $c_1 \approx 0.5$ keeps the actor and critic losses on a similar scale.
- Logging episodic reward only. Watch entropy, the KL between consecutive policies, value-function explained variance, and the clip fraction; each diagnoses a different failure mode (see the sketch after this list).
- Treating PPO as parameter-free. It has many hyperparameters; defaults work, but tuning often gives ~2× sample efficiency.
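A sketch of the normalization and per-update diagnostics mentioned above (PyTorch assumed; the helper names are illustrative):

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    """Zero-mean, unit-std normalization within a mini-batch."""
    return (adv - adv.mean()) / (adv.std() + eps)

def ppo_diagnostics(new_log_probs, old_log_probs, clip_eps=0.2):
    """Cheap per-update health checks: approximate KL and clip fraction."""
    with torch.no_grad():
        log_ratio = new_log_probs - old_log_probs
        ratio = torch.exp(log_ratio)
        # Low-variance estimator of KL(old || new) on the sampled actions.
        approx_kl = ((ratio - 1.0) - log_ratio).mean()
        # Fraction of samples whose ratio left the clip range.
        clip_fraction = ((ratio - 1.0).abs() > clip_eps).float().mean()
    return {"approx_kl": approx_kl.item(), "clip_fraction": clip_fraction.item()}
```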
Related
- Policy gradient. Foundational algorithm.
- RLHF and DPO. LLM-alignment context.