Maximum likelihood estimation

The dominant statistical principle: pick parameters that make the observed data most probable. Reduces to minimizing cross-entropy for classification and MSE for Gaussian regression.

One-line definition

For a parametric family $\{p_\theta : \theta \in \Theta\}$ and observed data $x_1, \dots, x_n$, the maximum likelihood estimate (MLE) is

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p_\theta(x_i) = \arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(x_i).$$
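
To make the definition concrete, here is a minimal numerical sketch, assuming NumPy and SciPy are available; the Gaussian data and the starting point are made up for illustration:

```python
# Recover a Gaussian's parameters by directly minimizing the
# negative log-likelihood (synthetic data assumed).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # observed data

def nll(params):
    mu, log_sigma = params
    # parameterize by log(sigma) so the optimizer stays in sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the closed-form MLE: x.mean(), x.std(ddof=0)
```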

Why it matters

MLE underlies almost every modern ML loss function:

  • Cross-entropy for classification = MLE under a categorical model.
  • Mean-squared error = MLE under a Gaussian noise model.
  • Negative log-likelihood for language models = MLE.

When you read “minimize the negative log-likelihood,” you’re reading MLE.
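
As a quick illustration of the first bullet, the following sketch (with made-up predictions and labels) shows that the cross-entropy loss reported by a classifier is exactly the average negative log-likelihood of a categorical model:

```python
# Cross-entropy for classification is the average categorical NLL.
# probs and labels below are made-up values for illustration.
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # predicted class probabilities per example
labels = np.array([0, 1])            # observed class indices

nll = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(nll)  # this is exactly the cross-entropy loss
```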

Properties

Under regularity conditions (smooth log-likelihood, identifiable model, true parameter in the interior of the parameter space), the MLE is:

  • Consistent: $\hat{\theta}_n \xrightarrow{p} \theta^*$ as $n \to \infty$.
  • Asymptotically normal: $\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)$, where $I(\theta)$ is the Fisher information (a quick simulation follows this list).
  • Asymptotically efficient: achieves the Cramér–Rao lower bound, so no regular consistent estimator has lower asymptotic variance.
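
A rough simulation sketch of the asymptotic-normality claim for the Bernoulli MLE $\hat{p} = \bar{x}$, whose inverse Fisher information is $p(1-p)$; the sample size and trial count below are arbitrary:

```python
# Simulate sqrt(n) * (p_hat - p) for the Bernoulli MLE p_hat = x.mean();
# its variance should approach 1 / I(p) = p * (1 - p).
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.3, 2000, 5000
p_hat = rng.binomial(n, p, size=trials) / n  # one MLE per simulated dataset
z = np.sqrt(n) * (p_hat - p)
print(z.var(), p * (1 - p))  # both approximately 0.21
```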

Common cases

| Model | MLE solution |
| --- | --- |
| Gaussian mean (known $\sigma^2$) | $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Gaussian variance | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$ (biased; sample variance uses $n-1$) |
| Bernoulli($p$) | $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Categorical | empirical class frequencies |
| Linear regression with Gaussian noise | OLS: $\hat{\beta} = (X^\top X)^{-1} X^\top y$ |
| Logistic regression | no closed form; iterative (Newton, gradient methods) |
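
The linear-regression row is easy to verify numerically. A minimal sketch with synthetic data; the dimensions and noise level are arbitrary:

```python
# Under Gaussian noise the regression MLE is the OLS solution
# beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same problem stably.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 features
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=200)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```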

Connection to cross-entropy and KL

For a categorical model $q_\theta$ and empirical distribution $\hat{p}$ over the observed values, the negative log-likelihood divided by $n$ is

$$-\frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i) = -\sum_{x} \hat{p}(x)\,\log q_\theta(x) = H(\hat{p}, q_\theta) = H(\hat{p}) + \mathrm{KL}\!\left(\hat{p} \,\|\, q_\theta\right).$$

Maximizing likelihood = minimizing cross-entropy = minimizing KL from the empirical distribution to the model.
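
A small numerical check of this identity, with made-up counts and model probabilities:

```python
# Verify: average NLL = H(p_hat, q) = H(p_hat) + KL(p_hat || q).
import numpy as np

counts = np.array([5, 3, 2])   # observed class counts (made up)
p_hat = counts / counts.sum()  # empirical distribution
q = np.array([0.5, 0.3, 0.2])  # model probabilities (made up)

nll = -(counts * np.log(q)).sum() / counts.sum()
cross_entropy = -(p_hat * np.log(q)).sum()
entropy = -(p_hat * np.log(p_hat)).sum()
kl = (p_hat * np.log(p_hat / q)).sum()
print(nll, cross_entropy, entropy + kl)  # all equal up to floating point
```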

MLE vs MAP

MAP (maximum a posteriori) adds a prior: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[\log p(x \mid \theta) + \log p(\theta)\right]$. MAP equals MLE when the prior is uniform (improper on an unbounded parameter space). Common choices:

  • Gaussian prior $\leftrightarrow$ L2 regularization on $\theta$ (a ridge-regression sketch follows this list).
  • Laplace prior $\leftrightarrow$ L1 regularization.
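
A minimal sketch of the Gaussian-prior case: with prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ and Gaussian noise, the MAP estimate is ridge regression; `lam` below is an assumed regularization strength and the data are synthetic:

```python
# MAP with a Gaussian prior on theta = L2-penalized (ridge) regression:
# theta_map = argmin ||y - X theta||^2 + lam * ||theta||^2
#           = (X^T X + lam * I)^{-1} X^T y
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)
lam = 1.0  # assumed prior strength (sigma^2 / tau^2 in the Bayesian reading)

theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(theta_map)
```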

Common pitfalls

  • Treating sample variance as the MLE. The Gaussian-variance MLE divides by $n$ (biased); the sample variance divides by $n - 1$ (unbiased). Different estimators (see the NumPy check after this list).
  • Stopping at MLE without checking identifiability. If two parameter values yield identical likelihoods, MLE is non-unique.
  • Trusting MLE on small samples. Asymptotic guarantees can be misleading when $n$ is small relative to the parameter dimension; use cross-validation or Bayesian methods instead.
  • Forgetting that MLE under a misspecified model is still well-defined. It converges to the parameter whose model distribution is closest, in KL divergence, to the true data distribution within the model family.
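
For the variance pitfall above, a quick NumPy check (values made up):

```python
# NumPy's default np.var is the Gaussian MLE (divide by n);
# ddof=1 gives the unbiased sample variance (divide by n - 1).
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
print(np.var(x))           # MLE: divides by n          -> 5.25
print(np.var(x, ddof=1))   # unbiased: divides by n - 1 -> 7.0
```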