Linear regression

Predict a continuous target as a linear combination of features by minimizing squared error. Closed-form solution, MLE under Gaussian noise, and the foundation everything else builds on.


One-line definition

Linear regression models $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, where $X \in \mathbb{R}^{n \times d}$ and $\beta \in \mathbb{R}^d$. The MLE / least-squares estimator is $\hat\beta = (X^\top X)^{-1} X^\top y$.

Why it matters

Linear regression is the most-analyzed model in statistics and the building block for almost everything: GLMs, kernel ridge regression, MLP last layers, factor models. Knowing its assumptions and failure modes is essential. If you don’t know when OLS is wrong, you don’t know when fancier models help.

Ordinary least squares (OLS)

Loss: $L(\beta) = \lVert y - X\beta \rVert_2^2$ (absorb the intercept into $\beta$ by adding a column of 1s to $X$).

Closed-form minimizer (when $X^\top X$ is invertible):

$$\hat\beta = (X^\top X)^{-1} X^\top y$$

This is the normal-equations solution, derived by setting $\nabla_\beta \lVert y - X\beta \rVert_2^2 = -2X^\top(y - X\beta) = 0$.

Numerically, never compute $\hat\beta$ by forming that inverse. Use one of the following (a short sketch follows the list):

  • QR decomposition: $X = QR$, then solve $R\hat\beta = Q^\top y$ by back-substitution. Stable.
  • SVD: works even when $X^\top X$ is singular (the pseudo-inverse gives the minimum-norm solution).
  • Gradient descent: for huge $n$ and $d$, where a matrix factorization doesn't fit in memory.
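A minimal NumPy sketch of these options on synthetic placeholder data (np.linalg.lstsq uses an SVD-based solver under the hood):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept as a column of 1s
beta_true = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: works here, but forming X^T X squares the condition number; avoid in general.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR: X = QR, then solve R beta = Q^T y (R is square and upper-triangular).
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD-based least squares (what lstsq does); also handles rank-deficient X (minimum-norm solution).
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_qr, beta_svd)   # all three agree on this well-conditioned problem
```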

Probabilistic interpretation

OLS is the MLE for $\beta$ under the model $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The Gaussian-noise assumption is what motivates the squared-error loss; if errors are heavy-tailed, OLS is no longer optimal (consider Huber loss or quantile regression).
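Writing out the negative log-likelihood makes the equivalence explicit (a standard derivation, reproduced here for reference):

$$
-\log p(y \mid X, \beta, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2,
$$

so maximizing the likelihood over $\beta$ is exactly minimizing the squared-error loss; $\sigma^2$ only rescales it.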

Ridge and lasso

When $X^\top X$ is ill-conditioned (collinear features, $d$ close to or larger than $n$), OLS variance explodes. Add regularization:

  • Ridge (L2): penalty $\lambda \lVert\beta\rVert_2^2$. Shrinks all coefficients toward 0; closed form: $\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$.
  • Lasso (L1): penalty $\lambda \lVert\beta\rVert_1$. Drives some coefficients to exactly 0 (sparsity); no closed form (use coordinate descent / proximal gradient).
  • Elastic net: penalty $\lambda_1 \lVert\beta\rVert_1 + \lambda_2 \lVert\beta\rVert_2^2$. Combines sparsity with the grouping of correlated features.

Ridge is the default for prediction; lasso for variable selection or interpretability.
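A small NumPy sketch of the ridge closed form on deliberately collinear synthetic features ($\lambda$ is set arbitrarily here; in practice choose it by cross-validation, and typically leave the intercept unpenalized):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)           # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)   # only x1 actually matters

# OLS: X^T X is nearly singular, so the split between the two coefficients is arbitrary and high-variance.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y, computed with a solve rather than an explicit inverse.
lam = 1.0                                      # arbitrary here; tune by cross-validation in practice
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", beta_ols)      # unstable split across the collinear pair
print("Ridge:", beta_ridge)    # shrunk and shared roughly evenly between x1 and x2
```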

Assumptions and diagnostics

The classical OLS assumptions:

  1. Linearity: $\mathbb{E}[y \mid x]$ is linear in $x$.
  2. Independence: residuals are independent.
  3. Homoskedasticity: residual variance is constant.
  4. Normality: residuals are normally distributed (only matters for inference, not for prediction).

Check by plotting the residuals: residuals vs. predicted values (linearity, homoskedasticity), residuals vs. each feature (linearity), a QQ plot (normality), and the Durbin-Watson statistic (independence, in time series).
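A sketch of those checks with NumPy, SciPy, and matplotlib (synthetic placeholder data; the Durbin-Watson statistic is computed by hand from its definition):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
resid = y - y_hat

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Residuals vs. fitted: curvature suggests non-linearity, a funnel shape suggests heteroskedasticity.
axes[0].scatter(y_hat, resid, s=10)
axes[0].axhline(0.0, color="k", lw=1)
axes[0].set(xlabel="fitted value", ylabel="residual", title="residuals vs. fitted")

# QQ plot against the normal distribution: tail deviations indicate non-normal errors.
stats.probplot(resid, dist="norm", plot=axes[1])

# Residuals in observation order; Durbin-Watson near 2 means no first-order autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
axes[2].plot(resid, lw=0.8)
axes[2].set(xlabel="observation order", ylabel="residual", title=f"Durbin-Watson = {dw:.2f}")

plt.tight_layout()
plt.show()
```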

Gauss–Markov theorem

Under the first three assumptions (with finite variance), OLS is the Best Linear Unbiased Estimator (BLUE). Minimum variance among all linear unbiased estimators. Note: biased estimators (ridge) can do better in MSE.

When to use vs. alternatives

  • Linear is enough: low-dimensional clean data, interpretability needed.
  • Non-linear: add basis functions (polynomial features; see the sketch after this list), kernels (kernel ridge regression), or use trees / neural nets.
  • Heavy-tailed errors: Huber regression, quantile regression.
  • Many irrelevant features: lasso or elastic net.
  • Hierarchical / clustered data: mixed-effects (linear mixed model).
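A minimal sketch of the basis-function route from the second bullet: the model is non-linear in $x$ but still linear in the coefficients, so plain least squares applies to the expanded design matrix (synthetic data; the degree is an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)     # clearly non-linear in x

# Polynomial basis expansion: columns [1, x, x^2, ..., x^degree].
degree = 5
Phi = np.column_stack([x ** k for k in range(degree + 1)])

coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ coef
print("training MSE:", np.mean((y - y_hat) ** 2))
```

For high degrees or many features the expanded design becomes ill-conditioned, at which point the ridge penalty above (or kernel ridge regression) is the usual fix.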

Common pitfalls

  • Inverting $X^\top X$ directly. Use QR / SVD; the explicit inverse is numerically unstable, and ridge-style regularization is often needed anyway.
  • Including highly collinear features without regularization. Coefficients become unstable and uninterpretable; drop one or use ridge.
  • Reporting $R^2$ on training data and calling it generalization. Use cross-validated $R^2$.
  • Confusing prediction intervals with confidence intervals. Confidence intervals cover the mean; prediction intervals cover individual outcomes (much wider); see the formulas below.
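The standard formulas (for OLS with Gaussian errors) make the difference concrete. For a new point $x_0$ with fitted value $\hat y_0 = x_0^\top \hat\beta$, residual variance estimate $\hat\sigma^2$, and the appropriate $t$ quantile:

$$
\text{confidence interval for the mean: } \hat y_0 \pm t\,\hat\sigma\sqrt{x_0^\top (X^\top X)^{-1} x_0},
\qquad
\text{prediction interval: } \hat y_0 \pm t\,\hat\sigma\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.
$$

The extra 1 under the square root is the irreducible noise in a single outcome, which is why prediction intervals are much wider.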