Linear regression

Predict a continuous target as a linear combination of features by minimizing squared error. Closed-form solution, MLE under Gaussian noise, and the foundation everything else builds on.


One-line definition

Linear regression models $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, where $X \in \mathbb{R}^{n \times d}$ and $\beta \in \mathbb{R}^d$. The MLE / least-squares estimator is $\hat\beta = (X^\top X)^{-1} X^\top y$.

Why it matters

Linear regression is the most-analyzed model in statistics and the building block for almost everything: GLMs, kernel ridge regression, MLP last layers, factor models. Knowing its assumptions and failure modes is essential. If you don’t know when OLS is wrong, you don’t know when fancier models help.

Ordinary least squares (OLS)

Loss: $L(\beta) = \lVert y - X\beta \rVert_2^2$ (absorb the intercept into $\beta$ by adding a column of 1s to $X$).

Closed-form minimizer (when $X^\top X$ is invertible):

$$\hat\beta = (X^\top X)^{-1} X^\top y$$

This is the normal-equations solution, derived by setting $\nabla_\beta \lVert y - X\beta \rVert_2^2 = -2X^\top(y - X\beta) = 0$.

Numerically, never compute $\hat\beta$ by forming that inverse. Use one of the following (a short sketch follows the list):

  • QR decomposition: $X = QR$, then solve $R\hat\beta = Q^\top y$ by back-substitution. Stable.
  • SVD: works even when $X^\top X$ is singular (the pseudo-inverse gives the minimum-norm solution).
  • Gradient descent: for huge $n$ and $d$, where a matrix factorization doesn't fit in memory.
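A minimal NumPy sketch of these options on synthetic placeholder data (np.linalg.lstsq uses an SVD-based solver under the hood):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # intercept as a column of 1s
beta_true = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: works here, but forming X^T X squares the condition number; avoid in general.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR: X = QR, then solve R beta = Q^T y (R is square and upper-triangular).
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# SVD-based least squares (what lstsq does); also handles rank-deficient X (minimum-norm solution).
beta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_qr, beta_svd)   # all three agree on this well-conditioned problem
```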

Probabilistic interpretation

OLS is the MLE for $\beta$ under the model $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The Gaussian-noise assumption is what motivates the squared-error loss; if errors are heavy-tailed, OLS is no longer optimal (consider Huber loss or quantile regression).
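Writing out the negative log-likelihood makes the equivalence explicit (a standard derivation, reproduced here for reference):

$$
-\log p(y \mid X, \beta, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert_2^2,
$$

so maximizing the likelihood over $\beta$ is exactly minimizing the squared-error loss; $\sigma^2$ only rescales it.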

Ridge and lasso

When $X^\top X$ is ill-conditioned (collinear features, $d$ close to or larger than $n$), OLS variance explodes. Add regularization:

  • Ridge (L2): penalty $\lambda \lVert\beta\rVert_2^2$. Shrinks all coefficients toward 0; closed form: $\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$.
  • Lasso (L1): penalty $\lambda \lVert\beta\rVert_1$. Drives some coefficients to exactly 0 (sparsity); no closed form (use coordinate descent / proximal gradient).
  • Elastic net: penalty $\lambda_1 \lVert\beta\rVert_1 + \lambda_2 \lVert\beta\rVert_2^2$. Combines sparsity with the grouping of correlated features.

Ridge is the default for prediction; lasso for variable selection or interpretability.
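A small NumPy sketch of the ridge closed form on deliberately collinear synthetic features ($\lambda$ is set arbitrarily here; in practice choose it by cross-validation, and typically leave the intercept unpenalized):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)           # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)   # only x1 actually matters

# OLS: X^T X is nearly singular, so the split between the two coefficients is arbitrary and high-variance.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y, computed with a solve rather than an explicit inverse.
lam = 1.0                                      # arbitrary here; tune by cross-validation in practice
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", beta_ols)      # unstable split across the collinear pair
print("Ridge:", beta_ridge)    # shrunk and shared roughly evenly between x1 and x2
```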

Assumptions and diagnostics

The classical OLS assumptions:

  1. Linearity: $\mathbb{E}[y \mid x]$ is linear in $x$.
  2. Independence: residuals are independent.
  3. Homoskedasticity: residual variance is constant.
  4. Normality: residuals are normally distributed (only matters for inference, not for prediction).

Check by plotting the residuals: residuals vs. predicted values (linearity, homoskedasticity), residuals vs. each feature (linearity), a QQ plot (normality), and the Durbin-Watson statistic (independence, in time series).
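A sketch of those checks with NumPy, SciPy, and matplotlib (synthetic placeholder data; the Durbin-Watson statistic is computed by hand from its definition):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-2, 2, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
resid = y - y_hat

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Residuals vs. fitted: curvature suggests non-linearity, a funnel shape suggests heteroskedasticity.
axes[0].scatter(y_hat, resid, s=10)
axes[0].axhline(0.0, color="k", lw=1)
axes[0].set(xlabel="fitted value", ylabel="residual", title="residuals vs. fitted")

# QQ plot against the normal distribution: tail deviations indicate non-normal errors.
stats.probplot(resid, dist="norm", plot=axes[1])

# Residuals in observation order; Durbin-Watson near 2 means no first-order autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
axes[2].plot(resid, lw=0.8)
axes[2].set(xlabel="observation order", ylabel="residual", title=f"Durbin-Watson = {dw:.2f}")

plt.tight_layout()
plt.show()
```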

Gauss–Markov theorem

Under the first three assumptions (with finite variance), OLS is the Best Linear Unbiased Estimator (BLUE). Minimum variance among all linear unbiased estimators. Note: biased estimators (ridge) can do better in MSE.

When to use vs. alternatives

  • Linear is enough: low-dimensional clean data, interpretability needed.
  • Non-linear: add basis functions (polynomial features; see the sketch after this list), kernels (kernel ridge regression), or use trees / neural nets.
  • Heavy-tailed errors: Huber regression, quantile regression.
  • Many irrelevant features: lasso or elastic net.
  • Hierarchical / clustered data: mixed-effects (linear mixed model).
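A minimal sketch of the basis-function route from the second bullet: the model is non-linear in $x$ but still linear in the coefficients, so plain least squares applies to the expanded design matrix (synthetic data; the degree is an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)     # clearly non-linear in x

# Polynomial basis expansion: columns [1, x, x^2, ..., x^degree].
degree = 5
Phi = np.column_stack([x ** k for k in range(degree + 1)])

coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ coef
print("training MSE:", np.mean((y - y_hat) ** 2))
```

For high degrees or many features the expanded design becomes ill-conditioned, at which point the ridge penalty above (or kernel ridge regression) is the usual fix.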

Common pitfalls

  • Inverting $X^\top X$ directly. Use QR / SVD; the explicit inverse is numerically unstable, and ridge-style regularization is often needed anyway.
  • Including highly collinear features without regularization. Coefficients become unstable and uninterpretable; drop one or use ridge.
  • Reporting $R^2$ on training data and calling it generalization. Use cross-validated $R^2$.
  • Confusing prediction intervals with confidence intervals. Confidence intervals cover the mean; prediction intervals cover individual outcomes (much wider); see the formulas below.
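The standard formulas (for OLS with Gaussian errors) make the difference concrete. For a new point $x_0$ with fitted value $\hat y_0 = x_0^\top \hat\beta$, residual variance estimate $\hat\sigma^2$, and the appropriate $t$ quantile:

$$
\text{confidence interval for the mean: } \hat y_0 \pm t\,\hat\sigma\sqrt{x_0^\top (X^\top X)^{-1} x_0},
\qquad
\text{prediction interval: } \hat y_0 \pm t\,\hat\sigma\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}.
$$

The extra 1 under the square root is the irreducible noise in a single outcome, which is why prediction intervals are much wider.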