One-line definition
Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that learns the optimal action-value function $Q^*(s,a)$: the expected return from taking action $a$ in state $s$ and then acting optimally. It does so by iterating the Bellman optimality update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
Why it matters
Q-learning is the canonical value-based RL algorithm. Combined with deep neural networks (DQN; Mnih et al., 2015), it produced the original deep RL breakthroughs on Atari and remains the foundation of value-based methods. Knowing Q-learning is a prerequisite for understanding target networks, experience replay, double DQN, dueling networks, and the relationship to actor-critic methods.
The setup
A Markov decision process (MDP): states $\mathcal{S}$, actions $\mathcal{A}$, transitions $P(s' \mid s, a)$, rewards $r(s,a)$, discount $\gamma \in [0, 1)$.
The optimal action-value function:

$$Q^*(s,a) = \max_\pi \, \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a \right]$$

The optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
Bellman optimality
$Q^*$ satisfies the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right]$$

Q-learning approximates this by sampling: take action $a$ in state $s$, observe $r$ and $s'$, and update $Q(s,a)$ toward the target $r + \gamma \max_{a'} Q(s',a')$.
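For concreteness, a worked single-transition example with illustrative numbers: if $r = 1$, $\gamma = 0.9$, and $\max_{a'} Q(s',a') = 2$, the target is $1 + 0.9 \times 2 = 2.8$; with learning rate $\alpha = 0.5$ and current estimate $Q(s,a) = 1$, the update gives $Q(s,a) \leftarrow 1 + 0.5\,(2.8 - 1) = 1.9$.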
Tabular Q-learning
For small finite state-action spaces, store $Q(s,a)$ as a table. Sample transitions $(s, a, r, s')$ and apply the update above with learning rate $\alpha$. Guaranteed to converge to $Q^*$ if every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$).
Off-policy: the update uses $\max_{a'} Q(s',a')$ regardless of which action is actually taken next. This decouples the exploration policy (e.g., $\varepsilon$-greedy) from the learned greedy policy.
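A minimal tabular sketch, assuming a Gymnasium-style environment with integer observations (`env.reset()`/`env.step()`); the hyperparameters are illustrative:

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=5000,
                       alpha=0.1, gamma=0.99, eps=1.0, eps_decay=0.999):
    """Off-policy tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q-table
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target uses max over next actions, regardless of the action taken next
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        eps *= eps_decay  # decay exploration over time
    return Q
```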
Deep Q-Networks (DQN)
For large state spaces, replace the table with a neural network $Q_\theta(s,a)$. Two essential tricks make this work:
- Experience replay: store transitions in a buffer, sample mini-batches uniformly. Breaks temporal correlation; stabilizes training; enables data reuse.
- Target network: maintain a separate frozen copy $Q_{\theta^-}$ for the target $r + \gamma \max_{a'} Q_{\theta^-}(s',a')$. Update $\theta^- \leftarrow \theta$ every $C$ steps. Prevents the target from chasing itself.
The DQN loss:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a) \right)^2 \right]$$
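A sketch of one gradient step on this loss in PyTorch; the names `q_net`, `target_net`, and the batch layout are assumptions, and DQN details such as the Huber loss and gradient clipping are omitted:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the DQN objective for a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch  # tensors from the replay buffer

    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen target network (no gradient flows through it)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps, sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```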
Variants
- Double DQN (van Hasselt et al., 2015): use the $\arg\max_{a'}$ from the online network, the value from the target network. Reduces overestimation bias (sketched after this list).
- Dueling DQN (Wang et al., 2016): factor $Q(s,a) = V(s) + A(s,a)$, with the advantage stream mean-centered for identifiability.
- Prioritized replay (Schaul 2015): sample transitions with high TD error more often.
- Rainbow (Hessel 2018): combines six improvements; canonical strong baseline.
- Distributional RL (C51, IQN): predict the distribution of returns, not just the mean.
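A sketch of how the Double DQN target differs from the vanilla target, using the same assumed `q_net`/`target_net` names as above:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network picks the action, the target network evaluates it."""
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # argmax from online net
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # value from target net
    return rewards + gamma * (1.0 - dones) * next_q
```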
Limitations
- Maximization bias: $\max_{a'} Q(s',a')$ is biased upward when $Q$ is noisy. Double DQN partially fixes this.
- Continuous action spaces: $\max_a Q(s,a)$ becomes a non-trivial optimization; use actor-critic methods with an explicit policy (DDPG, TD3, SAC) instead.
- Sample efficiency: deep Q-learning needs millions of environment steps; impractical for slow simulators.
- Off-policy instability: with function approximation, bootstrapping from heavily off-policy data can destabilize or diverge; DQN implementations typically need careful replay-buffer management.
Q-learning vs. policy gradient
| Method | Q-learning | Policy gradient |
|---|---|---|
| Learns | $Q(s,a)$ | $\pi_\theta(a \mid s)$ directly |
| Policy | Implicit ($\arg\max_a Q(s,a)$) | Explicit |
| On/off policy | Off-policy | Usually on-policy |
| Continuous actions | Hard ($\max_a$ over continuous actions) | Natural |
| Variance | Lower | Higher |
| Sample efficiency | Higher (data reuse) | Lower |
In practice: for continuous control, SAC (combines Q-learning-style critics with a stochastic policy); for discrete actions with a small action space, the DQN family; for large discrete action spaces (LLMs), policy gradient or the DPO family.
Common pitfalls
- Skipping the target network. Loss explodes; training diverges.
- Skipping experience replay. Successive samples are highly correlated; gradient estimates are biased.
- Confusing on-policy and off-policy. Q-learning is off-policy: you can learn from old data with a different policy.
- Ignoring exploration. The greedy policy from a randomly initialized $Q$ is terrible; $\varepsilon$-greedy with a decaying $\varepsilon$ is the standard (see the sketch after this list).
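A common exploration schedule is to anneal $\varepsilon$ linearly and then hold it; the constants below are illustrative, not the exact Atari settings:

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps, then hold."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```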
Related
- Policy gradient. Alternative paradigm.
- Markov chains. MDP background.