Exploration vs exploitation: epsilon-greedy, UCB, Thompson sampling

An RL or bandit agent has to keep trying new actions to learn while taking the best-known action to score. Three classical strategies, each with a different way of resolving the tension.


One-line definition

Exploration means choosing actions to gather information; exploitation means choosing actions that look best given current information. Three canonical strategies: ε-greedy (occasional random actions), UCB (optimism in the face of uncertainty), and Thompson sampling (sample from the posterior).

Why it matters

Every learning agent faces this trade-off: in multi-armed bandits, in reinforcement learning, in recommender systems with online learning, and in A/B-testing platforms with adaptive allocation. The wrong strategy either wastes data on suboptimal actions forever (over-exploration) or never converges to the best policy (under-exploration).

The three classical strategies are still the practical tools. Modern RL replaces them with more sophisticated mechanisms (policy entropy, count-based bonuses, RND), but the underlying tension is the same.

The setup

There are $K$ actions (arms). Arm $a$ has unknown mean reward $\mu_a$. At each round, you pick one arm and observe a noisy reward sample. You want to maximize cumulative reward.

The optimal arm is $a^* = \arg\max_a \mu_a$. Regret after round $T$ is $R(T) = T\,\mu_{a^*} - \sum_{t=1}^{T} \mu_{a_t}$, the gap between what the best arm would have earned and what you actually earned. Lower regret is better; it is the standard benchmark for comparing strategies.
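
To make the setup concrete, here is a minimal Python sketch of a Bernoulli bandit and the regret computation; the BernoulliBandit class and regret helper are illustrative names, not part of any library.

```python
# Minimal sketch of the bandit setup: K arms with unknown means, 0/1 rewards.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBandit:
    """K arms; arm a pays reward 1 with unknown probability mu_a, else 0."""
    def __init__(self, mus):
        self.mus = np.asarray(mus, dtype=float)

    def pull(self, a):
        return float(rng.random() < self.mus[a])

def regret(bandit, chosen_arms):
    """Cumulative regret R(T) = T * mu_star - sum_t mu_{a_t} for the arms actually pulled."""
    mu_star = bandit.mus.max()
    return len(chosen_arms) * mu_star - bandit.mus[chosen_arms].sum()
```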

ε-greedy

With probability $1 - \varepsilon$, pick the action with the highest empirical mean reward $\hat{\mu}_a$. With probability $\varepsilon$, pick a random action.

Pros: trivial to implement; one hyperparameter.
Cons: a non-decaying $\varepsilon$ keeps wasting samples on suboptimal arms and achieves only linear regret.
Fix: decay $\varepsilon$ over time, e.g. $\varepsilon_t \propto 1/t$, which achieves logarithmic regret with a suitably tuned schedule.

In RL: ε-greedy is the default exploration scheme for DQN-family algorithms. A typical schedule starts at $\varepsilon = 1.0$ and anneals to 0.05 to 0.1 over the first million or so environment steps.
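
A minimal sketch of ε-greedy with a decaying schedule, reusing the hypothetical BernoulliBandit helper from the setup sketch; the $\varepsilon_t \propto 1/t$ constant is illustrative.

```python
# Hedged sketch of epsilon-greedy with a decaying epsilon_t ~ 1/t.
import numpy as np

def epsilon_greedy(bandit, T, c=1.0, seed=1):
    rng = np.random.default_rng(seed)
    K = len(bandit.mus)
    counts = np.zeros(K)            # n_a: number of pulls per arm
    means = np.zeros(K)             # empirical mean reward per arm
    chosen = []
    for t in range(1, T + 1):
        eps = min(1.0, c * K / t)               # decaying exploration rate
        if rng.random() < eps:
            a = int(rng.integers(K))            # explore: uniform random arm
        else:
            a = int(np.argmax(means))           # exploit: best empirical mean
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        chosen.append(a)
    return chosen
```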

UCB (Upper Confidence Bound)

For each arm, maintain a confidence interval on $\mu_a$. Pick the arm with the highest upper bound:

$$a_t = \arg\max_a \Big[\, \hat{\mu}_a + c \sqrt{\tfrac{\ln t}{n_a}} \,\Big]$$

where $n_a$ is the number of times arm $a$ has been pulled and $c$ is an exploration constant (typically $c = \sqrt{2}$).

The bonus term $c\sqrt{\ln t / n_a}$ is a confidence width derived from Hoeffding’s inequality: arms with few pulls get a large bonus and become exploration candidates; well-explored arms with a high empirical mean dominate by exploitation.

Pros: provably $O(\log T)$ regret; deterministic, reproducible.
Cons: assumes bounded rewards; harder to extend to deep RL.
Used in: UCB1 for bandits, UCT for tree search (the selection rule underlying AlphaGo’s MCTS).
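
A sketch of UCB1 under the same assumptions (rewards in [0, 1], the hypothetical BernoulliBandit helper from above), with $c = \sqrt{2}$ matching the bonus in the formula.

```python
# Sketch of UCB1: optimism via a Hoeffding-style bonus, rewards assumed in [0, 1].
import numpy as np

def ucb1(bandit, T, c=np.sqrt(2)):
    K = len(bandit.mus)
    counts = np.zeros(K)
    means = np.zeros(K)
    chosen = []
    for a in range(K):                          # pull each arm once so n_a > 0
        r = bandit.pull(a)
        counts[a], means[a] = 1.0, r
        chosen.append(a)
    for t in range(K + 1, T + 1):
        bonus = c * np.sqrt(np.log(t) / counts)
        a = int(np.argmax(means + bonus))       # highest upper confidence bound wins
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        chosen.append(a)
    return chosen
```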

Thompson sampling

Maintain a posterior distribution over $\mu_a$ for each arm. At each round:

  1. Sample $\tilde{\mu}_a$ from the posterior of each arm.
  2. Pick the arm with the highest sampled value.
  3. Observe the reward, update the posterior.

For Bernoulli rewards: maintain a Beta posterior per arm, sample, pick the max. For Gaussian rewards: maintain a Gaussian posterior, sample, pick the max.
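
A sketch of the Bernoulli case with Beta(1, 1) priors, again assuming the hypothetical BernoulliBandit helper from the setup sketch.

```python
# Sketch of Thompson sampling for Bernoulli rewards with a Beta(1, 1) prior per arm.
import numpy as np

def thompson_bernoulli(bandit, T, seed=2):
    rng = np.random.default_rng(seed)
    K = len(bandit.mus)
    alpha = np.ones(K)      # 1 + observed successes per arm
    beta = np.ones(K)       # 1 + observed failures per arm
    chosen = []
    for _ in range(T):
        samples = rng.beta(alpha, beta)   # one posterior draw per arm
        a = int(np.argmax(samples))       # act greedily on the sampled means
        r = bandit.pull(a)
        alpha[a] += r                     # conjugate update: Beta(alpha + r, beta + 1 - r)
        beta[a] += 1.0 - r
        chosen.append(a)
    return chosen
```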

Pros: optimal asymptotic regret; naturally handles non-stationarity (with discounting); easy to extend to contextual bandits.
Cons: needs a tractable posterior; randomized, so runs are not reproducible.
Used in: production bandits at LinkedIn, Yahoo, Netflix; ad-bidding systems.

When each strategy is the right choice

Situation | Strategy
Small action space, simple reward | ε-greedy with decay
Provable regret bounds matter | UCB
Bayesian model fits naturally; posterior is tractable | Thompson sampling
Continuous actions (deep RL) | entropy regularization + noise (SAC), or learned exploration bonuses
Sparse-reward exploration in deep RL | curiosity-driven bonuses, count-based bonuses, RND

Beyond bandits

In full RL, exploration is harder because actions affect future state distributions, not just immediate reward. The classical strategies adapt:

  • Boltzmann exploration: sample actions with probability proportional to $\exp(Q(s,a)/\tau)$, where $\tau$ is a temperature. Replaces ε-greedy in some settings (a short sketch follows this list).
  • Maximum-entropy RL (SAC): add an entropy bonus to the reward. Encourages stochastic policies as long as the entropy bonus pays off.
  • Intrinsic motivation: shape the reward with bonuses for novel states (count-based, prediction error, distillation gap).
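
As referenced in the Boltzmann bullet above, a minimal sketch of softmax action selection over Q-values; the temperature value is an illustrative placeholder.

```python
# Sketch of Boltzmann (softmax) exploration: P(a) proportional to exp(Q(s, a) / tau).
import numpy as np

def boltzmann_action(q_values, tau=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()                      # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))

# e.g. boltzmann_action([0.1, 0.5, 0.2]) picks action 1 most often, but not always.
```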

Common pitfalls

  • Holding $\varepsilon$ constant. A constant $\varepsilon = 0.1$ means 10 percent of samples are wasted forever. Anneal.
  • Using UCB with unbounded rewards. The Hoeffding-style bounds only hold for bounded rewards. Modify or rescale.
  • Comparing exploration strategies on a single seed. Reward noise and randomized action selection make any single run misleading; report mean and variance across many seeds (a short comparison harness follows this list).
  • Treating “exploit” as the goal. Without enough exploration, the empirical best arm is biased upward and you never find the true best.
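
A short comparison harness, assuming the hypothetical sketches above live in one file, that reports mean and spread of final regret across repeated runs, as the seed pitfall recommends. Arm means, horizon, and run counts are illustrative.

```python
# Sketch: compare the three strategies across repeated runs, not a single seed.
import numpy as np

if __name__ == "__main__":
    mus = [0.2, 0.5, 0.7, 0.75]       # illustrative arm means
    T, runs = 5000, 20
    strategies = {
        "eps-greedy": lambda b, s: epsilon_greedy(b, T, seed=s),
        "ucb1":       lambda b, s: ucb1(b, T),              # deterministic rule; variation comes from reward noise
        "thompson":   lambda b, s: thompson_bernoulli(b, T, seed=s),
    }
    for name, run in strategies.items():
        finals = [regret(b, run(b, s)) for s in range(runs) for b in [BernoulliBandit(mus)]]
        print(f"{name:12s} final regret: {np.mean(finals):7.1f} +/- {np.std(finals):5.1f}")
```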