One-line definition
Exploration means choosing actions to gather information; exploitation means choosing actions that look best given current information. Three canonical strategies: $\epsilon$-greedy (occasional random actions), UCB (optimism in the face of uncertainty), and Thompson sampling (sample from the posterior).
Why it matters
Every learning agent faces this trade-off: multi-armed bandits, reinforcement learning, recommender systems with online learning, A/B-testing platforms with adaptive allocation. The wrong strategy either wastes data on suboptimal actions forever (over-exploration) or never converges to the best policy (under-exploration).
The three classical strategies are still the practical tools. Modern RL replaces them with more sophisticated mechanisms (policy entropy, count-based bonuses, RND), but the underlying tension is the same.
The setup
$K$ actions (arms). Arm $a$ has unknown mean reward $\mu_a$. At each round $t$, you pick one arm $a_t$ and observe a noisy reward sample. You want to maximize cumulative reward.
The optimal arm is $a^* = \arg\max_a \mu_a$. Regret after $T$ rounds is $R_T = T\mu_{a^*} - \sum_{t=1}^{T} \mu_{a_t}$. Lower regret is better; it is the standard benchmark for comparing strategies.
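A minimal sketch of this setup in Python (assuming Bernoulli-reward arms; the arm means and helper names here are illustrative, not part of any standard library):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
mu = rng.uniform(0.1, 0.9, size=K)   # true (hidden) mean reward per arm
best = mu.max()

def pull(a):
    """Observe a noisy Bernoulli reward from arm a."""
    return float(rng.random() < mu[a])

def cumulative_regret(chosen_arms):
    """T * mu_best minus the sum of the chosen arms' true means."""
    return len(chosen_arms) * best - mu[chosen_arms].sum()
```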
$\epsilon$-greedy
With probability $1-\epsilon$, pick the action with the highest empirical mean reward $\hat{\mu}_a$. With probability $\epsilon$, pick a random action.
| Pros | Trivial to implement; one hyperparameter |
| Cons | Non-decaying $\epsilon$ keeps wasting samples on suboptimal arms; achieves linear regret |
| Fix | Decay $\epsilon$ over time, e.g. $\epsilon_t \propto 1/t$. Achieves $O(\log T)$ regret |
In RL: $\epsilon$-greedy is the default exploration scheme for DQN-family algorithms. A typical schedule starts at $\epsilon = 1.0$ and anneals to 0.05–0.1 over roughly the first million steps.
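A sketch of $\epsilon$-greedy with a decaying schedule. `pull(a)` is assumed to return a reward sample (as in the setup sketch above), and the decay constant `eps0` is illustrative and needs tuning:

```python
import numpy as np

def epsilon_greedy(pull, K, T, eps0=5.0, seed=None):
    """epsilon-greedy with eps_t = min(1, eps0 / t); pull(a) returns a reward sample."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)
    means = np.zeros(K)                          # empirical mean reward per arm
    for t in range(1, T + 1):
        eps = min(1.0, eps0 / t)                 # decaying exploration rate
        if rng.random() < eps:
            a = int(rng.integers(K))             # explore: uniform random arm
        else:
            a = int(np.argmax(means))            # exploit: best empirical mean
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
    return means, counts
```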
UCB (Upper Confidence Bound)
For each arm, maintain a confidence interval on $\mu_a$. Pick the arm with the highest upper bound:

$$a_t = \arg\max_a \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{n_a}} \right]$$

where $n_a$ is the number of times arm $a$ has been pulled and $c$ is an exploration constant (typically $c = \sqrt{2}$).
The bonus $c\sqrt{\ln t / n_a}$ is a confidence width from Hoeffding’s inequality: arms with few pulls get a large bonus and are exploration candidates; well-explored arms with high empirical means dominate by exploitation.
| Pros | Provably $O(\log T)$ regret; deterministic, reproducible |
| Cons | Assumes bounded rewards; harder to extend to deep RL |
| Used in | UCB1 for bandits, UCT for tree search (a variant, PUCT, drives AlphaGo’s MCTS) |
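A sketch of the UCB1 rule under the same interface (pull(a) is assumed to return a reward in [0, 1]; $c = \sqrt{2}$ as above):

```python
import numpy as np

def ucb1(pull, K, T, c=np.sqrt(2)):
    """UCB1: pull each arm once, then always take the highest upper confidence bound."""
    counts = np.zeros(K)
    means = np.zeros(K)
    for a in range(K):                        # initialization: one pull per arm
        means[a] = pull(a)
        counts[a] = 1
    for t in range(K + 1, T + 1):
        bonus = c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(means + bonus))     # optimism in the face of uncertainty
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
    return means, counts
```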
Thompson sampling
Maintain a posterior distribution over $\mu_a$ for each arm. At each round:
- Sample from the posterior of each arm.
- Pick the arm with the highest sampled value.
- Observe the reward, update the posterior.
For Bernoulli rewards: maintain a Beta posterior per arm, sample, pick the max. For Gaussian rewards: maintain a Gaussian posterior, sample, pick the max.
| Pros | Optimal asymptotic regret; naturally handles non-stationarity (with discounting); easy to extend to contextual bandits |
| Cons | Needs a tractable posterior; randomized so non-reproducible per run |
| Used in | Production bandits at LinkedIn, Yahoo, Netflix; ad-bidding systems |
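A sketch of Thompson sampling for the Bernoulli case, with a Beta(1, 1) prior per arm (pull(a) is assumed to return 0 or 1):

```python
import numpy as np

def thompson_bernoulli(pull, K, T, seed=None):
    """Thompson sampling with conjugate Beta posteriors over each arm's mean."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(K)                        # 1 + number of observed successes
    beta = np.ones(K)                         # 1 + number of observed failures
    for _ in range(T):
        theta = rng.beta(alpha, beta)         # one posterior sample per arm
        a = int(np.argmax(theta))             # act greedily on the sampled means
        r = pull(a)
        alpha[a] += r                         # conjugate Beta update
        beta[a] += 1 - r
    return alpha / (alpha + beta)             # posterior mean per arm
```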
When each strategy is the right choice
| Situation | Strategy |
|---|---|
| Small action space, simple reward | $\epsilon$-greedy with decay |
| Provable regret bounds matter | UCB |
| Bayesian model fits naturally; posterior is tractable | Thompson sampling |
| Continuous actions (deep RL) | Entropy regularization + noise (SAC), or learned exploration bonuses |
| Sparse-reward exploration in deep RL | Curiosity-driven, count-based bonuses, RND |
Beyond bandits
In full RL, exploration is harder because actions affect future state distributions, not just immediate reward. The classical strategies adapt:
- Boltzmann exploration: sample actions with probability proportional to $\exp(Q(s,a)/\tau)$. Replaces $\epsilon$-greedy in some settings (sketch after this list).
- Maximum-entropy RL (SAC): add an entropy bonus to the reward. Encourages stochastic policies as long as the entropy bonus pays off.
- Intrinsic motivation: shape the reward with bonuses for novel states (count-based, prediction error, distillation gap).
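A sketch of Boltzmann (softmax) action selection over Q-values. The temperature $\tau$ is an assumed hyperparameter: $\tau \to 0$ approaches greedy selection, large $\tau$ approaches uniform random:

```python
import numpy as np

def boltzmann_action(q_values, tau=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))
```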
Common pitfalls
- Holding $\epsilon$ constant. A constant $\epsilon = 0.1$ means 10 percent of samples are wasted on random actions forever. Anneal it.
- Using UCB with unbounded rewards. The Hoeffding-style bounds only hold for bounded rewards. Modify or rescale.
- Comparing exploration strategies on a single seed. The reward draws (and two of the three strategies) are stochastic; report mean and variance across many seeds (see the sketch after this list).
- Treating “exploit” as the goal. Without enough exploration, the empirical best arm is biased upward and you never find the true best.
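A sketch of the multi-seed comparison mentioned above, using a Bernoulli bandit and decaying $\epsilon$-greedy as the example strategy (arm means, horizon, and seed count are illustrative):

```python
import numpy as np

def run_eps_greedy(mu, T, seed, eps0=5.0):
    """One run on a Bernoulli bandit; returns cumulative regret."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    counts = np.zeros(K)
    means = np.zeros(K)
    regret = 0.0
    for t in range(1, T + 1):
        if rng.random() < min(1.0, eps0 / t):
            a = int(rng.integers(K))
        else:
            a = int(np.argmax(means))
        r = float(rng.random() < mu[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        regret += mu.max() - mu[a]
    return regret

mu = np.array([0.2, 0.5, 0.55, 0.8])
regrets = [run_eps_greedy(mu, T=5000, seed=s) for s in range(20)]
print(f"regret over 20 seeds: {np.mean(regrets):.1f} +/- {np.std(regrets):.1f}")
```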