One-line definition
If $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then as $n \to \infty$:

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1).$$

The standardized sample mean converges in distribution to a Gaussian, regardless of the original distribution’s shape (as long as the variance is finite).
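As a quick empirical sanity check (an illustrative sketch; the Exponential(1) choice is arbitrary), standardized means of a skewed distribution come out approximately standard normal:

```python
import numpy as np

# Draw means of n i.i.d. Exponential(1) samples (mean 1, variance 1),
# standardize them, and check they look standard normal.
rng = np.random.default_rng(0)
n, reps = 500, 5_000
samples = rng.exponential(scale=1.0, size=(reps, n))

# z = sqrt(n) * (sample mean - mu) / sigma, with mu = sigma = 1 here
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

print(z.mean(), z.std())  # both should be close to 0 and 1
```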
Why it matters
The CLT is why we can:
- Build Gaussian-based confidence intervals for almost any estimator (sample mean, regression coefficients, A/B test deltas).
- Use $t$-tests and $z$-tests on data that isn’t itself Gaussian.
- Trust that with enough samples, a reported metric ± its standard error is approximately calibrated.
It also explains the prevalence of Gaussian assumptions in ML. Many quantities of interest are sums or averages, and so naturally tend toward Gaussian.
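As one concrete use, here is a minimal sketch of a CLT-justified Gaussian confidence interval for an A/B conversion-rate delta on Bernoulli data (the rates and sample sizes are made up for illustration):

```python
import numpy as np

# Simulated A/B test: Bernoulli conversions, clearly non-Gaussian data.
rng = np.random.default_rng(1)
a = rng.binomial(1, 0.10, size=5_000)  # control conversions
b = rng.binomial(1, 0.11, size=5_000)  # treatment conversions

# The CLT lets us treat the difference of means as approximately Gaussian.
delta = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lo, hi = delta - 1.96 * se, delta + 1.96 * se
print(f"delta = {delta:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```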
What “enough samples” means
The Berry–Esseen theorem bounds how fast convergence happens:

$$\sup_x \left| P\!\left( \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C \rho}{\sigma^3 \sqrt{n}},$$

with $\rho = \mathbb{E}|X_1 - \mu|^3$ and $C$ an absolute constant. For symmetric, light-tailed distributions, $n \approx 30$ already gives an excellent Gaussian approximation. For heavy-tailed or skewed distributions you may need substantially more.
Heuristic check: plot a histogram of bootstrap means; if it looks Gaussian, the CLT has kicked in.
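That heuristic can be sketched as follows (the log-normal data and the crude symmetry statistic are illustrative choices, not a formal normality test):

```python
import numpy as np

# Resample the data, compute bootstrap means, and check whether their
# distribution looks roughly Gaussian.
rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # skewed example data

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2_000)
])

# Crude symmetry check: for a Gaussian, mean and median nearly coincide,
# so this standardized gap should be close to zero.
print(abs(boot_means.mean() - np.median(boot_means)) / boot_means.std())
```

In practice you would plot `boot_means` as a histogram and eyeball the shape.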
Variants
- Multivariate CLT: $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \Sigma)$ for a vector-valued sample mean, where $\Sigma$ is the covariance matrix of $X_1$.
- Lyapunov / Lindeberg CLT: relaxes the i.i.d. assumption to independent (not identical) with mild moment conditions.
- Martingale CLT: extends to dependent data forming a martingale; used in online learning regret analysis.
- CLT for U-statistics, M-estimators: extends to functions of multiple samples.
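A small simulation of the multivariate case (the 2-d covariance matrix and the exponential base distribution are arbitrary choices): standardized vector means should have covariance close to $\Sigma$ even for non-Gaussian inputs.

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
n, reps = 300, 2_000

# Non-Gaussian i.i.d. vectors with mean 0 and covariance Sigma:
# centered Exponential(1) components, correlated via a Cholesky factor.
A = np.linalg.cholesky(Sigma)
z = rng.exponential(1.0, size=(reps, n, 2)) - 1.0  # mean 0, variance 1
x = z @ A.T

scaled = np.sqrt(n) * x.mean(axis=1)        # shape (reps, 2)
cov_hat = np.cov(scaled, rowvar=False)      # should be close to Sigma
print(cov_hat)
```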
When CLT fails
- Infinite variance (e.g., the Cauchy distribution): the CLT does not apply; sample means do not concentrate. Normalized sums converge to stable distributions instead (the generalized CLT).
- Strong dependence: highly correlated samples violate the i.i.d. assumption; the effective sample size is much less than $n$.
- Discrete distributions on small support: the CLT applies, but the discrete approximation may be visibly bad until $n$ is large.
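The Cauchy failure mode is easy to demonstrate (a sketch; the specific sample sizes are arbitrary): the spread of the sample means does not shrink as $n$ grows.

```python
import numpy as np

# Cauchy sample means are themselves Cauchy-distributed at every n,
# so averaging more data does not concentrate them around a point.
rng = np.random.default_rng(4)
small = rng.standard_cauchy(size=(1_000, 10)).mean(axis=1)
big = rng.standard_cauchy(size=(1_000, 5_000)).mean(axis=1)

# Interquartile range is a robust spread measure (the variance is infinite).
iqr = lambda m: np.subtract(*np.percentile(m, [75, 25]))
print(iqr(small), iqr(big))  # comparable spread despite 500x more data
```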
Common pitfalls
- Treating $n = 30$ as universally enough. It is not for skewed or heavy-tailed data.
- Using the CLT for inference at small sample sizes. Below $n \approx 30$, prefer the $t$-distribution (which accounts for the sample-estimated variance) rather than the normal $z$.
- Using parametric CLT confidence intervals on data that isn’t independent. A/B tests with user-level dependence (one user, multiple events) have an effective $n$ much smaller than the event count; cluster-bootstrap or use mixed-effects models.
- Confusing “the mean is normally distributed” with “the data is normally distributed.” The CLT is about the mean, not individual samples.
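To illustrate the small-sample pitfall numerically (a sketch with $n = 10$; the hardcoded critical values are $z_{0.975} = 1.960$ and $t_{0.975,\,9} \approx 2.262$):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=5.0, scale=2.0, size=10)  # tiny sample

se = x.std(ddof=1) / np.sqrt(len(x))

# Naive z interval vs. t interval: the t interval is wider, reflecting the
# extra uncertainty from estimating the variance from only 10 points.
z_ci = (x.mean() - 1.960 * se, x.mean() + 1.960 * se)
t_ci = (x.mean() - 2.262 * se, x.mean() + 2.262 * se)
print(z_ci, t_ci)
```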