Bias and variance of estimators

One-line definition

For an estimator $\hat{θ}$ of a parameter $θ$, bias is $E[\hat{θ}] - θ$ and variance is $E[(\hat{θ} - E[\hat{θ}])^{2}]$. Mean-squared error decomposes as $\mathrm{MSE} = \mathrm{Bias}^{2} + \mathrm{Variance}$.

Why it matters

This decomposition is the statistical version of the bias–variance tradeoff familiar from ML: complex models tend to have low bias but high variance; simple models tend to have high bias but low variance. The same accounting applies to any estimator, whether a sample mean, a regularized regression coefficient, or an importance-sampling estimate.

The decomposition

For an estimator $\hat{θ}$ of a fixed (non-random) parameter $θ$:

$$\mathrm{MSE}(\hat{θ}) = E[(\hat{θ} - θ)^{2}] = E[(\hat{θ} - E\hat{θ})^{2}] + (E\hat{θ} - θ)^{2} = \mathrm{Var}(\hat{θ}) + \mathrm{Bias}(\hat{θ})^{2}.$$

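Writing $\hat{θ} - θ = (\hat{θ} - E \hat{θ}) + (E \hat{θ} - θ)$ and squaring gives a cross-term $2 (E \hat{θ} - θ) E [\hat{θ} - E \hat{θ}]$, which vanishes because $E \hat{θ} - θ$ is a constant and $E [\hat{θ} - E \hat{θ}] = 0$. The derivation takes two lines, yet it underpins much of estimation theory.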
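The decomposition can be checked numerically. The sketch below is an illustration under assumed settings (Gaussian data, $σ^{2} = 1$, $n = 10$; the helper name `decompose` is made up for this example). It simulates the two usual variance estimators and compares the Monte Carlo MSE with variance plus squared bias.

```python
import numpy as np

def decompose(estimates, theta):
    """Return (bias, variance, mse) from Monte Carlo draws of an estimator."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - theta
    variance = estimates.var()               # ddof=0: E[(est - E est)^2]
    mse = np.mean((estimates - theta) ** 2)
    return bias, variance, mse

rng = np.random.default_rng(0)
sigma2, n, trials = 1.0, 10, 200_000         # assumed illustration settings
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

results = {}
for ddof in (1, 0):
    label = "n-1" if ddof == 1 else "n"
    estimates = samples.var(axis=1, ddof=ddof)   # one variance estimate per trial
    results[ddof] = decompose(estimates, sigma2)
    bias, variance, mse = results[ddof]
    print(f"divisor {label}: bias={bias:+.4f} var={variance:.4f} "
          f"mse={mse:.4f} var+bias^2={variance + bias**2:.4f}")
```

With empirical moments the identity holds up to floating-point error, so the last two printed columns agree. Under these Gaussian assumptions the divisor-$n$ estimator shows a negative bias but a smaller MSE than the divisor-$(n - 1)$ one.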

Why biased estimators can be useful

Unbiased estimators ($E \hat{θ} = θ$) are not always optimal in the MSE sense. A biased estimator with much lower variance can have lower MSE.

Examples:

Estimator	Bias	Variance	When better
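Sample mean	0	$σ^{2} / n$	default unbiased estimator of a mean (i.i.d. data)
Sample variance with $n - 1$	0	larger than with divisor $n$	when unbiasedness is required
Sample variance with $n$ (MLE)	$-σ^{2} / n$	smaller than with divisor $n - 1$	when minimizing MSE (Gaussian data)
Ridge regression	nonzero (shrinks coefficients toward 0)	smaller than OLS	when $X^{⊤} X$ is ill-conditioned
James-Stein estimator	shrinkage bias	lower than the sample mean	Gaussian mean in $\geq 3$ dimensions (total MSE strictly lower)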

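The James-Stein estimator (1961) famously dominates the sample mean when estimating the mean of a multivariate Gaussian in $\geq 3$ dimensions: despite being biased, its total squared-error risk is lower for every value of the true mean.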
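A small simulation can make the James-Stein claim concrete. This is a sketch under assumed settings, not a general implementation: one observation $x \sim N(θ, I_d)$ per trial (so the sample mean is just $x$), known unit variance, $d = 10$, and a hypothetical true mean of all ones. The function name `james_stein` is chosen here for illustration.

```python
import numpy as np

def james_stein(x):
    """Shrink each row of x toward the origin; assumes unit-variance Gaussian noise."""
    d = x.shape[-1]
    norm2 = np.sum(x ** 2, axis=-1, keepdims=True)
    return (1.0 - (d - 2) / norm2) * x

rng = np.random.default_rng(1)
d, trials = 10, 50_000
theta = np.ones(d)                                # hypothetical true mean
x = rng.normal(theta, 1.0, size=(trials, d))      # one observation per trial

# Total squared error summed over coordinates, averaged over trials.
mse_mean = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
print(f"total MSE, raw observation: {mse_mean:.3f}")
print(f"total MSE, James-Stein:     {mse_js:.3f}")
```

The raw observation has total MSE close to $d$, while the shrunken estimate comes out lower. The gain is largest when the true mean is near the shrinkage target and fades as it moves away.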

Connection to ML model selection

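In supervised learning, the same decomposition holds for the expected squared prediction error of a model at a fixed input $x$, with the expectation taken over training sets and the noise in $y$: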

$$E[(y - \hat{f}(x))^{2}] = \mathrm{Var}(\hat{f}(x)) + \mathrm{Bias}(\hat{f}(x))^{2} + σ_{ε}^{2}.$$

The $σ_{ε}^{2}$ term is irreducible noise. Increasing model capacity typically decreases bias but increases variance; regularization moves the other way, accepting higher bias in exchange for lower variance.
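The capacity tradeoff can be illustrated by refitting polynomials of two degrees to many noisy training sets and looking at the predictions at one test point. Everything specific here is an assumption for the sketch: the true function $\sin(2πx)$, the fixed design of 30 points, the noise level 0.3, the test point $x_0 = 0.25$, and the degrees 1 and 7.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

def f(x):
    """Hypothetical true regression function."""
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed design; only the noise is resampled
x0, sigma, trials = 0.25, 0.3, 2000

def bias2_and_variance(degree):
    """Monte Carlo squared bias and variance of the fitted value at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        y = f(x_train) + rng.normal(0.0, sigma, size=x_train.size)
        preds[t] = Polynomial.fit(x_train, y, degree)(x0)
    return (preds.mean() - f(x0)) ** 2, preds.var()

summary = {deg: bias2_and_variance(deg) for deg in (1, 7)}
for deg, (b2, var) in summary.items():
    print(f"degree {deg}: bias^2={b2:.4f} variance={var:.4f} "
          f"expected sq. error={b2 + var + sigma**2:.4f}")
```

In this setup the linear fit is dominated by squared bias, while the degree-7 fit has little bias but a larger variance. The noise term `sigma**2` is added to both and cannot be reduced by either model.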

Common pitfalls

Equating “biased” with “bad.” Many useful estimators are biased; lower MSE is what matters.
Reporting variance without specifying what’s random. “Variance of the estimator” is over re-sampling the data; “variance of the prediction” is over both data and inputs. Different objects.
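Forgetting that the cross-term vanishes only when deviations are centered at the expectation of $\hat{θ}$. Centering at any other value, such as a single realized estimate or an arbitrary offset, leaves a nonzero cross-term and breaks the decomposition.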
Confusing estimator variance with model variance. Estimator variance is a property of the estimation procedure; model variance is a property of the model class.
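To see why the centering point matters in the cross-term pitfall, one can redo the expansion with an arbitrary constant $c$ in place of $E \hat{θ}$ (the symbol $c$ is introduced here purely for illustration):

```latex
\begin{aligned}
E[(\hat{θ} - θ)^{2}] &= E[((\hat{θ} - c) + (c - θ))^{2}] \\
&= E[(\hat{θ} - c)^{2}] + 2 (c - θ)(E\hat{θ} - c) + (c - θ)^{2} .
\end{aligned}
```

The middle term is zero only when $c = E \hat{θ}$ (or, trivially, $c = θ$). The first choice recovers the variance-plus-squared-bias form; any other $c$ leaves a cross-term that must be carried along.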