Exploration vs exploitation: epsilon-greedy, UCB, Thompson sampling

An RL or bandit agent has to keep trying new actions to learn while taking the best-known action to score. Three classical strategies, each with a different way of resolving the tension.


One-line definition

Exploration means choosing actions to gather information; exploitation means choosing actions that look best given current information. Three canonical strategies: ε-greedy (occasional random actions), UCB (optimism in the face of uncertainty), and Thompson sampling (sample from the posterior).

Why it matters

Every learning agent faces this trade-off: in multi-armed bandits, in reinforcement learning, in recommender systems with online learning, and in A/B-testing platforms with adaptive allocation. The wrong strategy either wastes data on suboptimal actions forever (over-exploration) or never converges to the best policy (under-exploration).

The three classical strategies are still the practical tools. Modern RL replaces them with more sophisticated mechanisms (policy entropy, count-based bonuses, RND), but the underlying tension is the same.

The setup

There are $K$ actions (arms). Arm $a$ has unknown mean reward $\mu_a$. At each round, you pick one arm and observe a noisy reward sample. You want to maximize cumulative reward.

The optimal arm is $a^* = \arg\max_a \mu_a$. Regret after round $T$ is $R(T) = T\,\mu_{a^*} - \sum_{t=1}^{T} \mu_{a_t}$, the gap between what the best arm would have earned and what you actually earned. Lower regret is better; it is the standard benchmark for comparing strategies.
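
To make the setup concrete, here is a minimal Python sketch of a Bernoulli bandit and the regret computation; the BernoulliBandit class and regret helper are illustrative names, not part of any library.

```python
# Minimal sketch of the bandit setup: K arms with unknown means, 0/1 rewards.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """K arms; arm a pays reward 1 with unknown probability mu_a, else 0."""
    def __init__(self, mus):
        self.mus = np.asarray(mus, dtype=float)

    def pull(self, a):
        return float(rng.random() < self.mus[a])

def regret(bandit, chosen_arms):
    """Cumulative regret R(T) = T * mu_star - sum_t mu_{a_t} for the arms actually pulled."""
    mu_star = bandit.mus.max()
    return len(chosen_arms) * mu_star - bandit.mus[chosen_arms].sum()
```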

ε-greedy

With probability $1 - \varepsilon$, pick the action with the highest empirical mean reward $\hat{\mu}_a$. With probability $\varepsilon$, pick a random action.

Pros: trivial to implement; one hyperparameter.
Cons: a non-decaying $\varepsilon$ keeps wasting samples on suboptimal arms and achieves only linear regret.
Fix: decay $\varepsilon$ over time, e.g. $\varepsilon_t \propto 1/t$, which achieves logarithmic regret with a suitably tuned schedule.

In RL: ε-greedy is the default exploration scheme for DQN-family algorithms. A typical schedule starts at $\varepsilon = 1.0$ and anneals to 0.05 to 0.1 over the first million or so environment steps.
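
A minimal sketch of ε-greedy with a decaying schedule, reusing the hypothetical BernoulliBandit helper from the setup sketch; the $\varepsilon_t \propto 1/t$ constant is illustrative.

```python
# Hedged sketch of epsilon-greedy with a decaying epsilon_t ~ 1/t.
import numpy as np

def epsilon_greedy(bandit, T, c=1.0, seed=1):
    rng = np.random.default_rng(seed)
    K = len(bandit.mus)
    counts = np.zeros(K)            # n_a: number of pulls per arm
    means = np.zeros(K)             # empirical mean reward per arm
    chosen = []
    for t in range(1, T + 1):
        eps = min(1.0, c * K / t)               # decaying exploration rate
        if rng.random() < eps:
            a = int(rng.integers(K))            # explore: uniform random arm
        else:
            a = int(np.argmax(means))           # exploit: best empirical mean
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        chosen.append(a)
    return chosen
```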

UCB (Upper Confidence Bound)

For each arm, maintain a confidence interval on $\mu_a$. Pick the arm with the highest upper bound:

$$a_t = \arg\max_a \Big[\, \hat{\mu}_a + c \sqrt{\tfrac{\ln t}{n_a}} \,\Big]$$

where $n_a$ is the number of times arm $a$ has been pulled and $c$ is an exploration constant (typically $c = \sqrt{2}$).

The bonus term $c\sqrt{\ln t / n_a}$ is a confidence width derived from Hoeffding’s inequality: arms with few pulls get a large bonus and become exploration candidates; well-explored arms with a high empirical mean dominate by exploitation.

Pros: provably $O(\log T)$ regret; deterministic, reproducible.
Cons: assumes bounded rewards; harder to extend to deep RL.
Used in: UCB1 for bandits, UCT for tree search (the selection rule underlying AlphaGo’s MCTS).
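
A sketch of UCB1 under the same assumptions (rewards in [0, 1], the hypothetical BernoulliBandit helper from above), with $c = \sqrt{2}$ matching the bonus in the formula.

```python
# Sketch of UCB1: optimism via a Hoeffding-style bonus, rewards assumed in [0, 1].
import numpy as np

def ucb1(bandit, T, c=np.sqrt(2)):
    K = len(bandit.mus)
    counts = np.zeros(K)
    means = np.zeros(K)
    chosen = []
    for a in range(K):                          # pull each arm once so n_a > 0
        r = bandit.pull(a)
        counts[a], means[a] = 1.0, r
        chosen.append(a)
    for t in range(K + 1, T + 1):
        bonus = c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(means + bonus))       # highest upper confidence bound wins
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        chosen.append(a)
    return chosen
```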

Thompson sampling

Maintain a posterior distribution over $\mu_a$ for each arm. At each round:

  1. Sample $\tilde{\mu}_a$ from the posterior of each arm.
  2. Pick the arm with the highest sampled value.
  3. Observe the reward, update the posterior.

For Bernoulli rewards: maintain a Beta posterior per arm, sample, pick the max. For Gaussian rewards: maintain a Gaussian posterior, sample, pick the max.
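
A sketch of the Bernoulli case with Beta(1, 1) priors, again assuming the hypothetical BernoulliBandit helper from the setup sketch.

```python
# Sketch of Thompson sampling for Bernoulli rewards with a Beta(1, 1) prior per arm.
import numpy as np

def thompson_bernoulli(bandit, T, seed=2):
    rng = np.random.default_rng(seed)
    K = len(bandit.mus)
    alpha = np.ones(K)      # 1 + observed successes per arm
    beta = np.ones(K)       # 1 + observed failures per arm
    chosen = []
    for _ in range(T):
        samples = rng.beta(alpha, beta)   # one posterior draw per arm
        a = int(np.argmax(samples))       # act greedily on the sampled means
        r = bandit.pull(a)
        alpha[a] += r                     # conjugate update: Beta(alpha + r, beta + 1 - r)
        beta[a] += 1.0 - r
        chosen.append(a)
    return chosen
```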

Pros: optimal asymptotic regret; naturally handles non-stationarity (with discounting); easy to extend to contextual bandits.
Cons: needs a tractable posterior; randomized, so runs are not reproducible.
Used in: production bandits at LinkedIn, Yahoo, Netflix; ad-bidding systems.

When each strategy is the right choice

Situation | Strategy
Small action space, simple reward | ε-greedy with decay
Provable regret bounds matter | UCB
Bayesian model fits naturally; posterior is tractable | Thompson sampling
Continuous actions (deep RL) | entropy regularization + noise (SAC), or learned exploration bonuses
Sparse-reward exploration in deep RL | curiosity-driven bonuses, count-based bonuses, RND

Beyond bandits

In full RL, exploration is harder because actions affect future state distributions, not just immediate reward. The classical strategies adapt:

  • Boltzmann exploration: sample actions with probability proportional to $\exp(Q(s,a)/\tau)$, where $\tau$ is a temperature. Replaces ε-greedy in some settings (a short sketch follows this list).
  • Maximum-entropy RL (SAC): add an entropy bonus to the reward. Encourages stochastic policies as long as the entropy bonus pays off.
  • Intrinsic motivation: shape the reward with bonuses for novel states (count-based, prediction error, distillation gap).
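
As referenced in the Boltzmann bullet above, a minimal sketch of softmax action selection over Q-values; the temperature value is an illustrative placeholder.

```python
# Sketch of Boltzmann (softmax) exploration: P(a) proportional to exp(Q(s, a) / tau).
import numpy as np

def boltzmann_action(q_values, tau=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))

# e.g. boltzmann_action([0.1, 0.5, 0.2]) picks action 1 most often, but not always.
```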

Common pitfalls

  • Holding $\varepsilon$ constant. A constant $\varepsilon = 0.1$ means 10 percent of samples are wasted forever. Anneal.
  • Using UCB with unbounded rewards. The Hoeffding-style bounds only hold for bounded rewards. Modify or rescale.
  • Comparing exploration strategies on a single seed. Reward noise and randomized action selection make any single run misleading; report mean and variance across many seeds (a short comparison harness follows this list).
  • Treating “exploit” as the goal. Without enough exploration, the empirical best arm is biased upward and you never find the true best.
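
A short comparison harness, assuming the hypothetical sketches above live in one file, that reports mean and spread of final regret across repeated runs, as the seed pitfall recommends. Arm means, horizon, and run counts are illustrative.

```python
# Sketch: compare the three strategies across repeated runs, not a single seed.
import numpy as np

if __name__ == "__main__":
    mus = [0.2, 0.5, 0.7, 0.75]       # illustrative arm means
    T, runs = 5000, 20
    strategies = {
        "eps-greedy": lambda b, s: epsilon_greedy(b, T, seed=s),
        "ucb1":       lambda b, s: ucb1(b, T),              # deterministic rule; variation comes from reward noise
        "thompson":   lambda b, s: thompson_bernoulli(b, T, seed=s),
    }
    for name, run in strategies.items():
        finals = [regret(b, run(b, s)) for s in range(runs) for b in [BernoulliBandit(mus)]]
        print(f"{name:12s} final regret: {np.mean(finals):7.1f} +/- {np.std(finals):5.1f}")
```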