
Monte Carlo and importance sampling

Estimate expectations by averaging over random samples. The simplest way to compute integrals you can't compute analytically.


One-line definition

Monte Carlo estimates $\mathbb{E}_{x \sim p}[f(x)]$ by drawing $x_1, \dots, x_N \sim p$ and computing the sample average $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$. Importance sampling corrects for sampling from a different distribution $q$ by reweighting: $\mathbb{E}_p[f(x)] = \mathbb{E}_q\!\left[f(x)\,\tfrac{p(x)}{q(x)}\right]$.

Why it matters

Almost every probabilistic ML algorithm computes intractable expectations: posterior expectations in Bayesian inference, gradients of expectations in REINFORCE, off-policy returns in RL. Monte Carlo and importance sampling are the universal hammers when you can sample from the relevant distribution but can’t integrate analytically.

Plain Monte Carlo

If $x_1, \dots, x_N \sim p$ are i.i.d.:

  • Unbiased: $\mathbb{E}[\hat{\mu}] = \mathbb{E}_p[f(x)]$.
  • Variance: $\operatorname{Var}(\hat{\mu}) = \operatorname{Var}_p(f(x))/N$.
  • Convergence rate: $O(N^{-1/2})$ regardless of dimension. (Beats deterministic numerical integration in high dimensions.)

The CLT gives Gaussian confidence intervals: $\hat{\mu} \pm 1.96\,\hat{\sigma}/\sqrt{N}$, where $\hat{\sigma}$ is the sample standard deviation of the $f(x_i)$.
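As an illustrative sketch (the example, seed, and variable names are my own, not from the text): a plain Monte Carlo estimate of $\mathbb{E}[x^2]$ under a standard normal, whose true value is 1, with a CLT confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Estimate E[f(x)] for f(x) = x^2 under x ~ N(0, 1); the true value is 1.
x = rng.standard_normal(N)
fx = x**2

mu_hat = fx.mean()                    # sample average: unbiased estimator
se = fx.std(ddof=1) / np.sqrt(N)      # standard error, shrinks as O(N^{-1/2})
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)

print(f"estimate = {mu_hat:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Note the $O(N^{-1/2})$ rate in action: quadrupling $N$ only halves the standard error.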

Importance sampling

When sampling from $p$ is hard (e.g., a posterior, a rare event), sample from a proposal $q$ instead:

$$\mathbb{E}_p[f(x)] = \mathbb{E}_q\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q.$$

The weights $w_i = p(x_i)/q(x_i)$ are importance weights. Critical:

  • $q$ must be positive wherever $p(x)\,f(x)$ is non-zero (no holes).
  • The variance of the estimator depends on how well $q$ matches $p\,|f|$. A poorly matched $q$ can make the variance enormous.
  • Self-normalized IS: when $p$ is only known up to a constant, use $\hat{\mu} = \sum_i w_i f(x_i) \,/\, \sum_i w_i$. Biased (but consistent) and always usable.
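A sketch of the idea on a rare-event example (the target, proposal, and sample size here are illustrative choices, not from the text): estimate $P(x > 4)$ under $p = \mathcal{N}(0,1)$, where plain MC from $p$ would almost never see a hit, using a proposal centered on the event.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

def norm_pdf(x, mean=0.0, std=1.0):
    """Gaussian density, written out to keep the sketch dependency-free."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Target: p = N(0, 1). Rare event f(x) = 1[x > 4]; true value ~ 3.17e-5.
# Proposal: q = N(4, 1), which puts its mass where the event happens.
x = rng.normal(4.0, 1.0, size=N)
w = norm_pdf(x) / norm_pdf(x, mean=4.0)   # importance weights p(x)/q(x)
est = np.mean(w * (x > 4.0))

print(f"IS estimate of P(x > 4): {est:.3e}")
```

With 50,000 samples from $p$ itself you would expect roughly one or two hits; the reweighted proposal gives a low-variance estimate from the same budget.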

Effective sample size

A diagnostic for how well IS is working:

$$\mathrm{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}.$$

If most weight concentrates on a single sample, $\mathrm{ESS} \approx 1$ even when $N$ is large. Check ESS before trusting an IS estimate.
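The diagnostic is one line of NumPy; the helper below is an illustrative sketch showing both extremes.

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum w)^2 / sum w^2: equals N for uniform weights,
    collapses toward 1 when one weight dominates."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Uniform weights: ESS equals N.
print(effective_sample_size(np.ones(1000)))   # 1000.0

# One dominant weight: ESS collapses toward 1 even though N = 1000.
w = np.full(1000, 1e-8)
w[0] = 1.0
print(effective_sample_size(w))
```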

Where it shows up in ML

| Use case | What's the integral |
| --- | --- |
| REINFORCE policy gradient | $\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ |
| Variational inference (ELBO gradient, alternative to reparameterization) | $\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]$ |
| Off-policy RL (importance sampling correction) | returns under the target policy, estimated from behavior-policy samples |
| Bayesian posterior predictive | $\int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$ |
| Importance-weighted autoencoders (IWAE) | tighter ELBO via $K$-sample IS |

Variance reduction

  • Control variates: subtract a quantity with known mean: $\hat{\mu} = \frac{1}{N}\sum_i \left[f(x_i) - c\left(g(x_i) - \mathbb{E}[g]\right)\right]$.
  • Antithetic variates: pair each sample $x$ with its mirror $-x$ (for $p$ symmetric about $0$).
  • Stratified sampling: divide the domain into strata and sample within each.
  • Rao–Blackwellization: condition out variables you can integrate exactly.
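A small sketch of the first trick, control variates, on an illustrative target ($\mathbb{E}[e^x]$ for $x \sim \mathrm{U}(0,1)$, true value $e - 1$) with the control $g(x) = x$, whose mean $1/2$ is known exactly; the example and coefficient estimate are my own, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.uniform(0.0, 1.0, size=N)

f = np.exp(x)   # target: E[e^x] = e - 1 ~ 1.7183 for x ~ U(0, 1)
g = x           # control variate with known mean E[g] = 0.5

# Near-optimal coefficient c* = Cov(f, g) / Var(g), estimated from the samples.
c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)
cv_est = np.mean(f - c * (g - 0.5))

print(f"plain MC estimate: {f.mean():.4f}")
print(f"control variate:   {cv_est:.4f}")
```

Because $e^x$ and $x$ are highly correlated on $[0,1]$, subtracting the centered control removes most of the variance at no extra sampling cost.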

Common pitfalls

  • Heavy-tailed importance weights. Variance can be infinite even when the true expectation exists.
  • Compounding weights across steps (e.g., multi-step IS corrections in off-policy RL): the per-step weights multiply, and variance explodes with the horizon.
  • Confusing self-normalized IS bias with MC bias. Plain MC is unbiased; self-normalized IS is biased but consistent.
  • Forgetting that $q$ must dominate $p$. A “small” hole in $q$’s support where $p$ has mass introduces bias that no number of extra samples can fix.
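To make the first pitfall concrete, an illustrative sketch (the proposal and numbers are my own choices): a Gaussian proposal narrower than the target gives weights $p/q \propto e^{c x^2}$ with infinite variance, and the ESS diagnostic duly collapses well below $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def norm_pdf(x, std=1.0):
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Target p = N(0, 1). A too-narrow proposal q = N(0, 0.6) makes the weight
# w(x) = p(x)/q(x) grow like exp(c x^2): heavy-tailed, infinite variance.
x = rng.normal(0.0, 0.6, size=N)
w = norm_pdf(x) / norm_pdf(x, std=0.6)
ess = w.sum() ** 2 / np.sum(w ** 2)

print(f"ESS = {ess:.0f} out of N = {N}")
```

The expectation of each weight is still 1, so the estimator looks fine on average; only the ESS (or inspecting the largest weights) reveals that a handful of samples carry almost all the mass.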