One-line definition
Q-learning (Watkins, 1989) is an off-policy temporal-difference algorithm that learns the optimal action-value function $Q^*(s,a)$: the expected return from taking action $a$ in state $s$ and then acting optimally. It does so by iterating the Bellman optimality update:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
Why it matters
Q-learning is the canonical value-based RL algorithm. Combined with deep neural networks (DQN; Mnih et al., 2015), it produced the original deep RL breakthroughs on Atari and remains the foundation of value-based methods. Knowing Q-learning is a prerequisite for understanding target networks, experience replay, double DQN, dueling networks, and the relationship to actor-critic methods.
The setup
A Markov decision process (MDP): states $\mathcal{S}$, actions $\mathcal{A}$, transitions $P(s' \mid s, a)$, rewards $r(s,a)$, discount $\gamma \in [0, 1)$.
The optimal action-value function:

$$Q^*(s,a) = \max_\pi \, \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a \right]$$

The optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
Bellman optimality
$Q^*$ satisfies the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[ r(s,a) + \gamma \max_{a'} Q^*(s',a') \right]$$

Q-learning approximates this by sampling: take action $a$ in state $s$, observe $r$ and $s'$, and update $Q(s,a)$ toward the target $r + \gamma \max_{a'} Q(s',a')$.
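For concreteness, a worked single-transition example with illustrative numbers: if $r = 1$, $\gamma = 0.9$, and $\max_{a'} Q(s',a') = 2$, the target is $1 + 0.9 \times 2 = 2.8$; with learning rate $\alpha = 0.5$ and current estimate $Q(s,a) = 1$, the update gives $Q(s,a) \leftarrow 1 + 0.5\,(2.8 - 1) = 1.9$.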
Tabular Q-learning
For small finite state-action spaces, store $Q(s,a)$ as a table. Sample transitions $(s, a, r, s')$ and apply the update above with learning rate $\alpha$. Guaranteed to converge to $Q^*$ if every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions ($\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$).
Off-policy: the update uses $\max_{a'} Q(s',a')$ regardless of which action is actually taken next. This decouples the exploration policy (e.g., $\varepsilon$-greedy) from the learned greedy policy.
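A minimal tabular sketch, assuming a Gymnasium-style environment with integer observations (`env.reset()`/`env.step()`); the hyperparameters are illustrative:

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=5000,
                       alpha=0.1, gamma=0.99, eps=1.0, eps_decay=0.999):
    """Off-policy tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q-table
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target uses max over next actions, regardless of the action taken next
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
        eps *= eps_decay  # decay exploration over time
    return Q
```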
Deep Q-Networks (DQN)
For large state spaces, replace the table with a neural network $Q_\theta(s,a)$. Two essential tricks make this work:
- Experience replay: store transitions in a buffer, sample mini-batches uniformly. Breaks temporal correlation; stabilizes training; enables data reuse.
- Target network: maintain a separate frozen copy $Q_{\theta^-}$ for the target $r + \gamma \max_{a'} Q_{\theta^-}(s',a')$. Update $\theta^- \leftarrow \theta$ every $C$ steps. Prevents the target from chasing itself.
The DQN loss:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a) \right)^2 \right]$$
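A sketch of one gradient step on this loss in PyTorch; the names `q_net`, `target_net`, and the batch layout are assumptions, and DQN details such as the Huber loss and gradient clipping are omitted:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the DQN objective for a sampled mini-batch."""
    states, actions, rewards, next_states, dones = batch  # tensors from the replay buffer

    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target from the frozen target network (no gradient flows through it)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C steps, sync the target network:
# target_net.load_state_dict(q_net.state_dict())
```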
Variants
- Double DQN (van Hasselt et al., 2015): use the $\arg\max_{a'}$ from the online network, the value from the target network. Reduces overestimation bias (sketched after this list).
- Dueling DQN (Wang et al., 2016): factor $Q(s,a) = V(s) + A(s,a)$, with the advantage stream mean-centered for identifiability.
- Prioritized replay (Schaul 2015): sample transitions with high TD error more often.
- Rainbow (Hessel 2018): combines six improvements; canonical strong baseline.
- Distributional RL (C51, IQN): predict the distribution of returns, not just the mean.
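A sketch of how the Double DQN target differs from the vanilla target, using the same assumed `q_net`/`target_net` names as above:

```python
import torch

@torch.no_grad()
def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network picks the action, the target network evaluates it."""
    best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # argmax from online net
    next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # value from target net
    return rewards + gamma * (1.0 - dones) * next_q
```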
Limitations
- Maximization bias: $\max_{a'} Q(s',a')$ is biased upward when $Q$ is noisy. Double DQN partially fixes this.
- Continuous action spaces: $\max_a Q(s,a)$ becomes a non-trivial optimization; use actor-critic methods with an explicit policy (DDPG, TD3, SAC) instead.
- Sample efficiency: deep Q-learning needs millions of environment steps; impractical for slow simulators.
- Off-policy instability: with function approximation, bootstrapping from heavily off-policy data can destabilize or diverge; DQN implementations typically need careful replay-buffer management.
Q-learning vs. policy gradient
| Method | Q-learning | Policy gradient |
|---|---|---|
| Learns | $Q(s,a)$ | $\pi_\theta(a \mid s)$ directly |
| Policy | Implicit ($\arg\max_a Q(s,a)$) | Explicit |
| On/off policy | Off-policy | Usually on-policy |
| Continuous actions | Hard ($\max_a$ over continuous actions) | Natural |
| Variance | Lower | Higher |
| Sample efficiency | Higher (data reuse) | Lower |
In practice: for continuous control, SAC (combines Q-learning-style critics with a stochastic policy); for discrete actions with a small action space, the DQN family; for large discrete action spaces (LLMs), policy gradient or the DPO family.
Common pitfalls
- Skipping the target network. Loss explodes; training diverges.
- Skipping experience replay. Successive samples are highly correlated; gradient estimates are biased.
- Confusing on-policy and off-policy. Q-learning is off-policy: you can learn from old data with a different policy.
- Ignoring exploration. The greedy policy from a randomly initialized $Q$ is terrible; $\varepsilon$-greedy with a decaying $\varepsilon$ is the standard (see the sketch after this list).
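A common exploration schedule is to anneal $\varepsilon$ linearly and then hold it; the constants below are illustrative, not the exact Atari settings:

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps, then hold."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```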
Related
- Policy gradient. Alternative paradigm.
- Markov chains. MDP background.