One-line definition
If $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then as $n \to \infty$:

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$

The standardized sample mean converges in distribution to a Gaussian, regardless of the original distribution’s shape (as long as the variance is finite).
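As a quick empirical sanity check (an illustrative sketch; the Exponential(1) choice is arbitrary), standardized means of a skewed distribution come out approximately standard normal:

```python
import numpy as np

# Draw means of n i.i.d. Exponential(1) samples (mean 1, variance 1),
# standardize them, and check they look standard normal.
rng = np.random.default_rng(0)
n, reps = 500, 5_000
samples = rng.exponential(scale=1.0, size=(reps, n))

# z = sqrt(n) * (sample mean - mu) / sigma, with mu = sigma = 1 here
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

print(z.mean(), z.std())  # both should be close to 0 and 1
```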
Why it matters
The CLT is why we can:
- Build Gaussian-based confidence intervals for almost any estimator (sample mean, regression coefficients, A/B test deltas).
- Use $t$-tests and $z$-tests on data that isn’t itself Gaussian.
- Trust that with enough samples, a reported metric ± its standard error is approximately calibrated.
It also explains the prevalence of Gaussian assumptions in ML. Many quantities of interest are sums or averages, and so naturally tend toward Gaussian.
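As one concrete use, here is a minimal sketch of a CLT-justified Gaussian confidence interval for an A/B conversion-rate delta on Bernoulli data (the rates and sample sizes are made up for illustration):

```python
import numpy as np

# Simulated A/B test: Bernoulli conversions, clearly non-Gaussian data.
rng = np.random.default_rng(1)
a = rng.binomial(1, 0.10, size=5_000)  # control conversions
b = rng.binomial(1, 0.11, size=5_000)  # treatment conversions

# The CLT lets us treat the difference of means as approximately Gaussian.
delta = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lo, hi = delta - 1.96 * se, delta + 1.96 * se
print(f"delta = {delta:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```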
What “enough samples” means
The Berry–Esseen theorem bounds how fast convergence happens:

$$\sup_x \left| P\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C \rho}{\sigma^3 \sqrt{n}},$$

with $\rho = \mathbb{E}|X_1 - \mu|^3$ and $C$ an absolute constant. For symmetric, light-tailed distributions, $n \approx 30$ already gives an excellent Gaussian approximation. For heavy-tailed or skewed distributions you may need substantially more.
Heuristic check: plot a histogram of bootstrap means; if it looks Gaussian, the CLT has kicked in.
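That heuristic can be sketched as follows (the log-normal data and the crude symmetry statistic are illustrative choices, not a formal normality test):

```python
import numpy as np

# Resample the data, compute bootstrap means, and check whether their
# distribution looks roughly Gaussian.
rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # skewed example data

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2_000)
])

# Crude symmetry check: for a Gaussian, mean and median nearly coincide,
# so this standardized gap should be close to zero.
print(abs(boot_means.mean() - np.median(boot_means)) / boot_means.std())
```

In practice you would plot `boot_means` as a histogram and eyeball the shape.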
Variants
- Multivariate CLT: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ for a vector-valued sample mean, where $\Sigma$ is the covariance matrix of $X_1$.
- Lyapunov / Lindeberg CLT: relaxes the i.i.d. assumption to independent (not identical) with mild moment conditions.
- Martingale CLT: extends to dependent data forming a martingale; used in online learning regret analysis.
- CLT for U-statistics, M-estimators: extends to functions of multiple samples.
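A small simulation of the multivariate case (the 2-d covariance matrix and the exponential base distribution are arbitrary choices): standardized vector means should have covariance close to $\Sigma$ even for non-Gaussian inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
n, reps = 300, 2_000

# Non-Gaussian i.i.d. vectors with mean 0 and covariance Sigma:
# centered Exponential(1) components, correlated via a Cholesky factor.
A = np.linalg.cholesky(Sigma)
z = rng.exponential(1.0, size=(reps, n, 2)) - 1.0  # mean 0, variance 1
x = z @ A.T

scaled = np.sqrt(n) * x.mean(axis=1)        # shape (reps, 2)
cov_hat = np.cov(scaled, rowvar=False)      # should be close to Sigma
print(cov_hat)
```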
When CLT fails
- Infinite variance (e.g., the Cauchy distribution): the CLT does not apply; sample means do not concentrate. Normalized sums converge to stable distributions instead (the generalized CLT).
- Strong dependence: highly correlated samples violate the i.i.d. assumption; the effective sample size is much less than $n$.
- Discrete distributions on small support: the CLT applies, but the discrete approximation may be visibly bad until $n$ is large.
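The Cauchy failure mode is easy to demonstrate (a sketch; the specific sample sizes are arbitrary): the spread of the sample means does not shrink as $n$ grows.

```python
import numpy as np

# Cauchy sample means are themselves Cauchy-distributed at every n,
# so averaging more data does not concentrate them around a point.
rng = np.random.default_rng(4)
small = rng.standard_cauchy(size=(1_000, 10)).mean(axis=1)
big = rng.standard_cauchy(size=(1_000, 5_000)).mean(axis=1)

# Interquartile range is a robust spread measure (the variance is infinite).
iqr = lambda m: np.subtract(*np.percentile(m, [75, 25]))
print(iqr(small), iqr(big))  # comparable spread despite 500x more data
```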
Common pitfalls
- Treating $n = 30$ as universally enough. It is not for skewed or heavy-tailed data.
- Using the CLT for inference at small sample sizes. Below $n \approx 30$, prefer the $t$-distribution (which accounts for the sample-estimated variance) rather than the normal $z$.
- Using parametric CLT confidence intervals on data that isn’t independent. A/B tests with user-level dependence (one user, multiple events) have an effective $n$ much smaller than the event count; cluster-bootstrap or use mixed-effects models.
- Confusing “the mean is normally distributed” with “the data is normally distributed.” The CLT is about the mean, not individual samples.
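To illustrate the small-sample pitfall numerically (a sketch with $n = 10$; the hardcoded critical values are $z_{0.975} = 1.960$ and $t_{0.975,\,9} \approx 2.262$):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=5.0, scale=2.0, size=10)  # tiny sample

se = x.std(ddof=1) / np.sqrt(len(x))

# Naive z interval vs. t interval: the t interval is wider, reflecting the
# extra uncertainty from estimating the variance from only 10 points.
z_ci = (x.mean() - 1.960 * se, x.mean() + 1.960 * se)
t_ci = (x.mean() - 2.262 * se, x.mean() + 2.262 * se)
print(z_ci, t_ci)
```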