
Monte Carlo and importance sampling

Estimate expectations by averaging over random samples. The simplest way to compute integrals you can't compute analytically.


One-line definition

Monte Carlo estimates $\mathbb{E}_{x \sim p}[f(x)]$ by drawing $x_1, \dots, x_N \sim p$ and computing the sample average $\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$. Importance sampling corrects for sampling from a different distribution $q$ by reweighting: $\mathbb{E}_p[f(x)] = \mathbb{E}_q\!\left[f(x)\,\tfrac{p(x)}{q(x)}\right]$.

Why it matters

Almost every probabilistic ML algorithm computes intractable expectations: posterior expectations in Bayesian inference, gradients of expectations in REINFORCE, off-policy returns in RL. Monte Carlo and importance sampling are the universal hammers when you can sample from the relevant distribution but can’t integrate analytically.

Plain Monte Carlo

If $x_1, \dots, x_N \sim p$ are i.i.d.:

  • Unbiased: $\mathbb{E}[\hat{\mu}] = \mathbb{E}_p[f(x)]$.
  • Variance: $\operatorname{Var}(\hat{\mu}) = \operatorname{Var}_p(f(x))/N$.
  • Convergence rate: $O(N^{-1/2})$ regardless of dimension. (Beats deterministic numerical integration in high dimensions.)

The CLT gives Gaussian confidence intervals: $\hat{\mu} \pm 1.96\,\hat{\sigma}/\sqrt{N}$, where $\hat{\sigma}$ is the sample standard deviation of the $f(x_i)$.
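As an illustrative sketch (the example, seed, and variable names are my own, not from the text): a plain Monte Carlo estimate of $\mathbb{E}[x^2]$ under a standard normal, whose true value is 1, with a CLT confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Estimate E[f(x)] for f(x) = x^2 under x ~ N(0, 1); the true value is 1.
x = rng.standard_normal(N)
fx = x**2

mu_hat = fx.mean()                    # sample average: unbiased estimator
se = fx.std(ddof=1) / np.sqrt(N)      # standard error, shrinks as O(N^{-1/2})
ci = (mu_hat - 1.96 * se, mu_hat + 1.96 * se)

print(f"estimate = {mu_hat:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Note the $O(N^{-1/2})$ rate in action: quadrupling $N$ only halves the standard error.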

Importance sampling

When sampling from $p$ is hard (e.g., a posterior, a rare event), sample from a proposal $q$ instead:

$$\mathbb{E}_p[f(x)] = \mathbb{E}_q\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\frac{p(x_i)}{q(x_i)}, \qquad x_i \sim q.$$

The weights $w_i = p(x_i)/q(x_i)$ are importance weights. Critical:

  • $q$ must be positive wherever $p(x)\,f(x)$ is non-zero (no holes).
  • The variance of the estimator depends on how well $q$ matches $p\,|f|$. A poorly matched $q$ can make the variance enormous.
  • Self-normalized IS: when $p$ is only known up to a constant, use $\hat{\mu} = \sum_i w_i f(x_i) \,/\, \sum_i w_i$. Biased (but consistent) and always usable.
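A sketch of the idea on a rare-event example (the target, proposal, and sample size here are illustrative choices, not from the text): estimate $P(x > 4)$ under $p = \mathcal{N}(0,1)$, where plain MC from $p$ would almost never see a hit, using a proposal centered on the event.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50_000

def norm_pdf(x, mean=0.0, std=1.0):
    """Gaussian density, written out to keep the sketch dependency-free."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Target: p = N(0, 1). Rare event f(x) = 1[x > 4]; true value ~ 3.17e-5.
# Proposal: q = N(4, 1), which puts its mass where the event happens.
x = rng.normal(4.0, 1.0, size=N)
w = norm_pdf(x) / norm_pdf(x, mean=4.0)   # importance weights p(x)/q(x)
est = np.mean(w * (x > 4.0))

print(f"IS estimate of P(x > 4): {est:.3e}")
```

With 50,000 samples from $p$ itself you would expect roughly one or two hits; the reweighted proposal gives a low-variance estimate from the same budget.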

Effective sample size

A diagnostic for how well IS is working:

$$\mathrm{ESS} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}.$$

If most weight concentrates on a single sample, $\mathrm{ESS} \approx 1$ even when $N$ is large. Check ESS before trusting an IS estimate.
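The diagnostic is one line of NumPy; the helper below is an illustrative sketch showing both extremes.

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum w)^2 / sum w^2: equals N for uniform weights,
    collapses toward 1 when one weight dominates."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Uniform weights: ESS equals N.
print(effective_sample_size(np.ones(1000)))   # 1000.0

# One dominant weight: ESS collapses toward 1 even though N = 1000.
w = np.full(1000, 1e-8)
w[0] = 1.0
print(effective_sample_size(w))
```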

Where it shows up in ML

| Use case | What's the integral |
| --- | --- |
| REINFORCE policy gradient | $\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$ |
| Variational inference (ELBO gradient, alternative to reparameterization) | $\nabla_\phi\, \mathbb{E}_{q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]$ |
| Off-policy RL (importance sampling correction) | returns under the target policy, estimated from behavior-policy samples |
| Bayesian posterior predictive | $\int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta$ |
| Importance-weighted autoencoders (IWAE) | tighter ELBO via $K$-sample IS |

Variance reduction

  • Control variates: subtract a quantity with known mean: $\hat{\mu} = \frac{1}{N}\sum_i \left[f(x_i) - c\left(g(x_i) - \mathbb{E}[g]\right)\right]$.
  • Antithetic variates: pair each sample $x$ with its mirror $-x$ (for $p$ symmetric about $0$).
  • Stratified sampling: divide the domain into strata and sample within each.
  • Rao–Blackwellization: condition out variables you can integrate exactly.
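A small sketch of the first trick, control variates, on an illustrative target ($\mathbb{E}[e^x]$ for $x \sim \mathrm{U}(0,1)$, true value $e - 1$) with the control $g(x) = x$, whose mean $1/2$ is known exactly; the example and coefficient estimate are my own, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
x = rng.uniform(0.0, 1.0, size=N)

f = np.exp(x)   # target: E[e^x] = e - 1 ~ 1.7183 for x ~ U(0, 1)
g = x           # control variate with known mean E[g] = 0.5

# Near-optimal coefficient c* = Cov(f, g) / Var(g), estimated from the samples.
c = np.cov(f, g)[0, 1] / np.var(g, ddof=1)
cv_est = np.mean(f - c * (g - 0.5))

print(f"plain MC estimate: {f.mean():.4f}")
print(f"control variate:   {cv_est:.4f}")
```

Because $e^x$ and $x$ are highly correlated on $[0,1]$, subtracting the centered control removes most of the variance at no extra sampling cost.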

Common pitfalls

  • Heavy-tailed importance weights. Variance can be infinite even when the true expectation exists.
  • Compounding weights across steps (e.g., multi-step IS corrections in off-policy RL): the per-step weights multiply, and variance explodes with the horizon.
  • Confusing self-normalized IS bias with MC bias. Plain MC is unbiased; self-normalized IS is biased but consistent.
  • Forgetting that $q$ must dominate $p$. A “small” hole in $q$’s support where $p$ has mass introduces bias that no number of extra samples can fix.
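To make the first pitfall concrete, an illustrative sketch (the proposal and numbers are my own choices): a Gaussian proposal narrower than the target gives weights $p/q \propto e^{c x^2}$ with infinite variance, and the ESS diagnostic duly collapses well below $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def norm_pdf(x, std=1.0):
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Target p = N(0, 1). A too-narrow proposal q = N(0, 0.6) makes the weight
# w(x) = p(x)/q(x) grow like exp(c x^2): heavy-tailed, infinite variance.
x = rng.normal(0.0, 0.6, size=N)
w = norm_pdf(x) / norm_pdf(x, std=0.6)
ess = w.sum() ** 2 / np.sum(w ** 2)

print(f"ESS = {ess:.0f} out of N = {N}")
```

The expectation of each weight is still 1, so the estimator looks fine on average; only the ESS (or inspecting the largest weights) reveals that a handful of samples carry almost all the mass.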