Q-learning

Learn the action-value function Q(s, a) by Bellman backups. The foundation of value-based RL. DQN, Rainbow, and the original Atari breakthroughs.


One-line definition

Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that learns the optimal action-value function $Q^*(s, a)$: the expected return from taking action $a$ in state $s$ and then acting optimally. It learns $Q^*$ by iterating the Bellman optimality update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Why it matters

Q-learning is the canonical value-based RL algorithm. Combined with deep neural networks (DQN; Mnih et al., 2015), it produced the original deep RL breakthroughs on Atari and remains the foundation of value-based methods. Knowing Q-learning is a prerequisite for understanding target networks, experience replay, double DQN, dueling networks, and the relationship to actor-critic methods.

The setup

A Markov decision process (MDP): states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, transition $P(s' \mid s, a)$, reward $r(s, a)$, discount $\gamma \in [0, 1)$.

The optimal action-value function:

$$Q^*(s, a) = \max_\pi \, \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi \right]$$

The optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$.

Bellman optimality

$Q^*$ satisfies the Bellman optimality equation:

$$Q^*(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]$$

Q-learning approximates this by sampling: take action $a$ in state $s$, observe reward $r$ and next state $s'$, then update $Q(s, a)$ toward the target $r + \gamma \max_{a'} Q(s', a')$.

Tabular Q-learning

For small finite state-action spaces, store $Q(s, a)$ as a table. Sample transitions $(s, a, r, s')$ and apply the update above with learning rate $\alpha$. Guaranteed to converge to $Q^*$ if every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions.

Off-policy: the update uses $\max_{a'} Q(s', a')$ regardless of which action is actually taken next. This decouples the exploration policy (e.g., $\epsilon$-greedy) from the learned greedy policy.
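A minimal tabular sketch, assuming a Gymnasium-style environment with discrete state and action spaces (e.g., FrozenLake); the environment API, episode count, and hyperparameters are illustrative, not values from this article:

```python
import numpy as np

def tabular_q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Exploration policy: epsilon-greedy over the current table.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated

            # Off-policy target: bootstrap with the max over next actions,
            # regardless of which action is actually taken next.
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    return Q
```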

Deep Q-Networks (DQN)

For large state spaces, replace the table with a neural network $Q_\theta(s, a)$. Two essential tricks make this work:

  1. Experience replay: store transitions in a buffer, sample mini-batches uniformly. Breaks temporal correlation; stabilizes training; enables data reuse.
  2. Target network: maintain a separate frozen copy $Q_{\theta^-}$ and compute the target as $r + \gamma \max_{a'} Q_{\theta^-}(s', a')$. Copy $\theta^- \leftarrow \theta$ every $C$ steps. Prevents the target from chasing itself.

The DQN loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]$$
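Below is a minimal sketch of this loss in PyTorch. The network shapes, the `batch` dictionary layout, and the hyperparameters are assumptions for illustration, not details from this article:

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """One DQN loss evaluation on a replay mini-batch.

    Assumes online_net / target_net map states of shape (B, obs_dim) to
    Q-values of shape (B, n_actions); `batch` is an illustrative dict of
    tensors sampled uniformly from a replay buffer.
    """
    s, a, r = batch["s"], batch["a"], batch["r"]
    s_next, done = batch["s_next"], batch["done"]

    # Q_theta(s, a) for the actions actually taken.
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen target network (no gradient).
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next

    return F.smooth_l1_loss(q_sa, target)
```

Every $C$ steps the target parameters are refreshed, e.g. with `target_net.load_state_dict(online_net.state_dict())`.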

Variants

  • Double DQN (van Hasselt 2015): select the $\arg\max$ action with the online network, evaluate it with the target network. Reduces overestimation bias (see the sketch after this list).
  • Dueling DQN (Wang 2016): factor $Q(s, a) = V(s) + A(s, a)$ into separate value and advantage streams.
  • Prioritized replay (Schaul 2015): sample transitions with high TD error more often.
  • Rainbow (Hessel 2018): combines six improvements; canonical strong baseline.
  • Distributional RL (C51, IQN): predict the distribution of returns, not just the mean.
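As a rough sketch of the Double DQN change mentioned above (tensor shapes follow the DQN loss sketch; names are illustrative):

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: action chosen by the online net, valued by the target net."""
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1, keepdim=True)   # argmax from online network
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # value from target network
        return r + gamma * (1.0 - done) * q_next
```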

Limitations

  • Maximization bias: $\max_{a'} Q(s', a')$ is biased upward when $Q$ is noisy. Double DQN partially fixes this.
  • Continuous action spaces: $\max_a Q_\theta(s, a)$ becomes a non-trivial optimization; use actor-critic methods with an explicit policy (DDPG, TD3, SAC) instead.
  • Sample efficiency: deep Q-learning needs millions of environment steps; impractical for slow simulators.
  • Off-policy data: learning from stale replay data can destabilize training; DQN implementations often need careful replay-buffer management.

Q-learning vs. policy gradient

| Method | Q-learning | Policy gradient |
| --- | --- | --- |
| Learns | $Q^*(s, a)$ | $\pi_\theta(a \mid s)$ directly |
| Policy | Implicit ($\arg\max_a Q$) | Explicit |
| On/off-policy | Off-policy | Usually on-policy |
| Continuous actions | Hard ($\max_a$ over actions) | Natural |
| Variance | Lower | Higher |
| Sample efficiency | Higher (data reuse) | Lower |

In practice for continuous control: SAC (combines Q-learning with a stochastic policy). For discrete actions with a small action space: the DQN family. For large discrete action spaces (LLMs): policy gradient or DPO-family methods.

Common pitfalls

  • Skipping the target network. Loss explodes; training diverges.
  • Skipping experience replay. Successive samples are highly correlated; gradient estimates are biased.
  • Confusing on-policy and off-policy. Q-learning is off-policy: you can learn from old data with a different policy.
  • Ignoring exploration. The greedy policy from a randomly initialized $Q$ is terrible; $\epsilon$-greedy with decaying $\epsilon$ is the standard (see the schedule sketch below).
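For the last point, a tiny sketch of a linearly decayed $\epsilon$-greedy schedule (constants are illustrative):

```python
import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```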