Maximum likelihood estimation

The dominant statistical principle: pick parameters that make the observed data most probable. Reduces to minimizing cross-entropy for classification and MSE for Gaussian regression.

One-line definition

For a parametric family $\{p_\theta : \theta \in \Theta\}$ and observed data $x_1, \dots, x_n$, the maximum likelihood estimate (MLE) is

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \prod_{i=1}^{n} p_\theta(x_i) = \arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(x_i).$$
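
To make the definition concrete, here is a minimal numerical sketch, assuming NumPy and SciPy are available; the Gaussian data and the starting point are made up for illustration:

```python
# Recover a Gaussian's parameters by directly minimizing the
# negative log-likelihood (synthetic data assumed).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # observed data

def nll(params):
    mu, log_sigma = params
    # parameterize by log(sigma) so the optimizer stays in sigma > 0
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the closed-form MLE: x.mean(), x.std(ddof=0)
```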

Why it matters

MLE underlies almost every modern ML loss function:

  • Cross-entropy for classification = MLE under a categorical model.
  • Mean-squared error = MLE under a Gaussian noise model.
  • Negative log-likelihood for language models = MLE.

When you read “minimize the negative log-likelihood,” you’re reading MLE.
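
As a quick illustration of the first bullet, the following sketch (with made-up predictions and labels) shows that the cross-entropy loss reported by a classifier is exactly the average negative log-likelihood of a categorical model:

```python
# Cross-entropy for classification is the average categorical NLL.
# probs and labels below are made-up values for illustration.
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # predicted class probabilities per example
labels = np.array([0, 1])            # observed class indices

nll = -np.log(probs[np.arange(len(labels)), labels]).mean()
print(nll)  # this is exactly the cross-entropy loss
```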

Properties

Under regularity conditions (smooth log-likelihood, identifiable model, true parameter in the interior of the parameter space), the MLE is:

  • Consistent: $\hat{\theta}_n \xrightarrow{p} \theta^*$ as $n \to \infty$.
  • Asymptotically normal: $\sqrt{n}\,(\hat{\theta}_n - \theta^*) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)$, where $I(\theta)$ is the Fisher information (a quick simulation follows this list).
  • Asymptotically efficient: achieves the Cramér–Rao lower bound, so no regular consistent estimator has lower asymptotic variance.
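
A rough simulation sketch of the asymptotic-normality claim for the Bernoulli MLE $\hat{p} = \bar{x}$, whose inverse Fisher information is $p(1-p)$; the sample size and trial count below are arbitrary:

```python
# Simulate sqrt(n) * (p_hat - p) for the Bernoulli MLE p_hat = x.mean();
# its variance should approach 1 / I(p) = p * (1 - p).
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.3, 2000, 5000
p_hat = rng.binomial(n, p, size=trials) / n  # one MLE per simulated dataset
z = np.sqrt(n) * (p_hat - p)
print(z.var(), p * (1 - p))  # both approximately 0.21
```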

Common cases

| Model | MLE solution |
| --- | --- |
| Gaussian mean (known $\sigma^2$) | $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Gaussian variance | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$ (biased; sample variance uses $n-1$) |
| Bernoulli($p$) | $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Categorical | empirical class frequencies |
| Linear regression with Gaussian noise | OLS: $\hat{\beta} = (X^\top X)^{-1} X^\top y$ |
| Logistic regression | no closed form; iterative (Newton, gradient methods) |
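
The linear-regression row is easy to verify numerically. A minimal sketch with synthetic data; the dimensions and noise level are arbitrary:

```python
# Under Gaussian noise the regression MLE is the OLS solution
# beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same problem stably.
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 features
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=200)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```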

Connection to cross-entropy and KL

For a categorical model $q_\theta$ and empirical distribution $\hat{p}$ over the observed values, the negative log-likelihood divided by $n$ is

$$-\frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i) = -\sum_{x} \hat{p}(x)\,\log q_\theta(x) = H(\hat{p}, q_\theta) = H(\hat{p}) + \mathrm{KL}\!\left(\hat{p} \,\|\, q_\theta\right).$$

Maximizing likelihood = minimizing cross-entropy = minimizing KL from the empirical distribution to the model.
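
A small numerical check of this identity, with made-up counts and model probabilities:

```python
# Verify: average NLL = H(p_hat, q) = H(p_hat) + KL(p_hat || q).
import numpy as np

counts = np.array([5, 3, 2])   # observed class counts (made up)
p_hat = counts / counts.sum()  # empirical distribution
q = np.array([0.5, 0.3, 0.2])  # model probabilities (made up)

nll = -(counts * np.log(q)).sum() / counts.sum()
cross_entropy = -(p_hat * np.log(q)).sum()
entropy = -(p_hat * np.log(p_hat)).sum()
kl = (p_hat * np.log(p_hat / q)).sum()
print(nll, cross_entropy, entropy + kl)  # all equal up to floating point
```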

MLE vs MAP

MAP (maximum a posteriori) adds a prior: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[\log p(x \mid \theta) + \log p(\theta)\right]$. MAP equals MLE when the prior is uniform (improper on an unbounded parameter space). Common choices:

  • Gaussian prior $\leftrightarrow$ L2 regularization on $\theta$ (a ridge-regression sketch follows this list).
  • Laplace prior $\leftrightarrow$ L1 regularization.
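
A minimal sketch of the Gaussian-prior case: with prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ and Gaussian noise, the MAP estimate is ridge regression; `lam` below is an assumed regularization strength and the data are synthetic:

```python
# MAP with a Gaussian prior on theta = L2-penalized (ridge) regression:
# theta_map = argmin ||y - X theta||^2 + lam * ||theta||^2
#           = (X^T X + lam * I)^{-1} X^T y
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)
lam = 1.0  # assumed prior strength (sigma^2 / tau^2 in the Bayesian reading)

theta_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(theta_map)
```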

Common pitfalls

  • Treating sample variance as the MLE. The Gaussian-variance MLE divides by $n$ (biased); the sample variance divides by $n - 1$ (unbiased). Different estimators (see the NumPy check after this list).
  • Stopping at MLE without checking identifiability. If two parameter values yield identical likelihoods, MLE is non-unique.
  • Trusting MLE on small samples. Asymptotic guarantees can be misleading when $n$ is small relative to the parameter dimension; use cross-validation or Bayesian methods instead.
  • Forgetting that MLE under a misspecified model is still well-defined. It converges to the parameter whose model distribution is closest, in KL divergence, to the true data distribution within the model family.
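
For the variance pitfall above, a quick NumPy check (values made up):

```python
# NumPy's default np.var is the Gaussian MLE (divide by n);
# ddof=1 gives the unbiased sample variance (divide by n - 1).
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
print(np.var(x))           # MLE: divides by n          -> 5.25
print(np.var(x, ddof=1))   # unbiased: divides by n - 1 -> 7.0
```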