Value-based vs. policy-based RL

Two paradigms in reinforcement learning. Value-based learns Q(s, a) and acts greedily; policy-based directly parametrizes the policy. When to use which.


One-line definition

Value-based RL learns a value function (typically Q(s, a)) and acts greedily with respect to it: π(s) = argmax_a Q(s, a). Policy-based RL directly parametrizes a stochastic policy π_θ(a | s) and optimizes the expected return J(θ) via the policy gradient. Actor-critic combines both.
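
To make the contrast concrete, here is a minimal numpy sketch of how each paradigm chooses an action. The Q-table and the policy logits are random stand-ins for learned quantities, not the output of any real training run:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
state = 2  # index into a small tabular state space

# Value-based: act greedily w.r.t. a learned Q-table.
Q = rng.normal(size=(5, n_actions))        # Q[s, a], stand-in for a learned table
greedy_action = int(np.argmax(Q[state]))   # pi(s) = argmax_a Q(s, a)

# Policy-based: sample from a parametrized stochastic policy (softmax over logits).
logits = rng.normal(size=(5, n_actions))   # theta, stand-in for learned parameters
probs = np.exp(logits[state] - logits[state].max())
probs /= probs.sum()                       # pi_theta(a | s)
sampled_action = int(rng.choice(n_actions, p=probs))
```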

Why it matters

Choosing the right paradigm is a central decision in RL system design. A mismatch leads to poor sample efficiency, instability, or a method that simply does not work. For instance, value-based methods are awkward in continuous action spaces, and pure policy-based methods are sample-inefficient in tabular settings.

Value-based methods

Examples: Q-learning, DQN, Rainbow, distributional Q-learning.

Strengths:

  • Off-policy data reuse: replay buffers enable training on old data, giving a large effective sample size.
  • Lower variance than vanilla policy gradient.
  • Good for discrete action spaces with moderate cardinality.

Weaknesses:

  • Continuous actions require argmax_a Q(s, a), a separate optimization problem at every step. DDPG / TD3 learn a deterministic actor to approximate this argmax.
  • No stochastic policies: acting greedily w.r.t. Q is deterministic. For exploration, ε-greedy (or similar) must be added externally.
  • Maximization bias: max_a Q(s, a) overestimates the true value when the Q estimates are noisy.
  • Many tricks needed for stable deep variants: target networks, prioritized replay, double DQN.

Use when:

  • Discrete actions, environment can be simulated cheaply (Atari, board games).
  • Need to reuse offline data.
  • Value function approximation is structurally easy (e.g., low-d state).
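
As a concrete reference point for this family, below is a minimal sketch of tabular Q-learning with ε-greedy exploration. The environment loop is omitted and the single transition is hand-made, so read it as an illustration of the update rule rather than a working agent; because the update only needs (s, a, r, s') tuples, the transitions could equally come from a replay buffer.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Exploration must be bolted on: acting greedily w.r.t. Q is deterministic."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One off-policy TD update toward r + gamma * max_a' Q(s', a')."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Single illustrative transition (s=0, action chosen eps-greedily, r=1.0, s'=2, non-terminal).
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
a = epsilon_greedy(Q, 0, epsilon=0.1, rng=rng)
Q = q_learning_update(Q, 0, a, 1.0, 2, done=False)
```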

Policy-based methods

Examples: REINFORCE, A2C, A3C, TRPO, PPO.

Strengths:

  • Continuous actions are natural: parametrize π_θ(a | s) as a Gaussian or similar.
  • Stochastic policies built-in: useful for exploration, mixed-strategy equilibria, partial observability.
  • Direct objective: maximize expected return.
  • Simpler theoretical framing: no Bellman equations, just expectation gradients.

Weaknesses:

  • Sample-inefficient: the standard policy gradient is on-policy, so data must be discarded after each gradient step.
  • High variance: the Monte Carlo gradient estimator is noisy without baselines.
  • Local optima: policy gradient can get stuck in deterministic suboptimal policies.

Use when:

  • Continuous control (robotics, physical simulation).
  • Stochastic policy needed (exploration, multi-agent).
  • Policy is naturally differentiable but value function is not (e.g., LLMs as policies in RLHF).
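
A corresponding sketch for this family: REINFORCE on a tabular softmax policy with a mean-return baseline (numpy only; the episode is hand-made and the learning rates are arbitrary). Note the on-policy constraint: the episode must come from the current policy and is thrown away after the update.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.1):
    """One REINFORCE step on a tabular softmax policy.

    episode: list of (state, action, reward) tuples sampled from the current policy.
    """
    rewards = [r for _, _, r in episode]
    # Monte Carlo returns G_t = sum_{k >= t} gamma^(k-t) r_k, computed backwards.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = np.array(returns[::-1])
    baseline = returns.mean()  # crude baseline to reduce variance
    for (s, a, _), G in zip(episode, returns):
        probs = softmax(theta[s])
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0                    # d log pi(a|s) / d theta[s]
        theta[s] += lr * (G - baseline) * grad_log_pi
    return theta

# One hand-made episode in a 3-state, 2-action problem.
theta = np.zeros((3, 2))
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]
theta = reinforce_update(theta, episode)
```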

Actor-critic: combining the two

An actor-critic algorithm trains both:

  • Actor π_θ(a | s): the policy.
  • Critic V(s) or Q(s, a): the value function, used as a baseline / target for the actor’s gradient.

The critic reduces the variance of the policy gradient; the actor handles continuous actions cleanly. Almost all modern RL algorithms (PPO, SAC, DDPG, TD3, IMPALA) are actor-critic.
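
A minimal sketch of the combination, assuming a tabular critic V(s) and a tabular softmax actor: the critic's TD error stands in for the advantage and replaces the Monte Carlo return in the policy-gradient update.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.05, lr_critic=0.1):
    """One-step advantage actor-critic on tabular parameters."""
    td_target = r + (0.0 if done else gamma * V[s_next])
    delta = td_target - V[s]           # TD error, used as the advantage estimate
    V[s] += lr_critic * delta          # critic: move V(s) toward the TD target
    probs = softmax(theta[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += lr_actor * delta * grad_log_pi   # actor: policy gradient with baseline
    return theta, V

theta, V = np.zeros((3, 2)), np.zeros(3)
theta, V = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=2, done=False)
```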

Decision matrix

Problem | First choice
Discrete actions, abundant simulation | DQN / Rainbow
Continuous control, sample-efficient | SAC
Continuous control, simple and robust | PPO
LLM alignment | PPO or DPO
Multi-agent | MAPPO, IMPALA
Real-world robotics with limited samples | SAC or model-based RL
Board games / planning | AlphaZero-style (MCTS + learned policy/value)
Partially observable, RNN policy | PPO with LSTM/transformer policy

What about model-based RL?

A third paradigm: learn a dynamics model and plan with it. Examples: Dreamer, MuZero, World Models. Strengths: extreme sample efficiency. Weaknesses: dynamics model errors compound; engineering complexity. Used when real-world samples are expensive (robotics, healthcare).
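
For orientation only, here is a toy sketch of the learn-a-model-and-plan idea using random-shooting model-predictive control. `learned_model` is a hypothetical stand-in for a fitted dynamics-and-reward model, and the compounding-error problem shows up in the length of the imagined rollout:

```python
import numpy as np

rng = np.random.default_rng(0)

def learned_model(state, action):
    """Stand-in for a learned dynamics + reward model (here a toy linear rule)."""
    next_state = state + 0.1 * (action - 0.5)
    reward = -abs(next_state)        # reward for staying near zero
    return next_state, reward

def plan_random_shooting(state, horizon=5, n_candidates=64):
    """Pick the first action of the best random action sequence under the model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(0.0, 1.0, size=horizon)
        s, total = state, 0.0
        for a in actions:
            s, r = learned_model(s, a)   # model errors compound along this rollout
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

action = plan_random_shooting(state=0.3)
```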

Common pitfalls

  • Choosing value-based methods for continuous control. The action argmax makes this awkward; reach for SAC or the DDPG family if you must use Q-learning.
  • Choosing pure policy gradient for sample-rich discrete problems. DQN-family is much more sample-efficient.
  • Treating actor-critic as fundamentally different. It’s just policy gradient with a learned baseline; understand both pieces.
  • Ignoring variance in policy gradient. Without baselines + advantage normalization, you’ll see noisy curves and slow learning.
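
On the last point, advantage normalization is essentially a one-liner; a sketch of the usual per-batch standardization (ε added only for numerical safety):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Standardize advantages per batch; a cheap, common variance-reduction trick."""
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

adv = normalize_advantages([2.0, -1.0, 0.5, 3.0])
```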