## One-line definition

For a parametric family $\{p_\theta : \theta \in \Theta\}$ and observed data $x_1, \dots, x_n$, the maximum likelihood estimate (MLE) is

$$\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log p_\theta(x_i).$$
## Why it matters
MLE underlies almost every modern ML loss function:
- Cross-entropy for classification = MLE under a categorical model.
- Mean-squared error = MLE under a Gaussian noise model.
- Negative log-likelihood for language models = MLE.
When you read “minimize the negative log-likelihood,” you’re reading MLE.
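To make the equivalence concrete, here is a minimal NumPy sketch (the logits and label are assumed for illustration) showing that the cross-entropy of a one-hot target is exactly the negative log-likelihood of the observed class:

```python
import numpy as np

# Example scores for a 3-class problem (values assumed for illustration)
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> categorical model
true_class = 0
one_hot = np.eye(3)[true_class]

# Cross-entropy against a one-hot target...
cross_entropy = -(one_hot * np.log(probs)).sum()
# ...is the negative log-likelihood of the observed label.
nll = -np.log(probs[true_class])

print(cross_entropy, nll)  # identical
```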
## Properties
Under regularity conditions (smooth log-likelihood, identifiable model, true parameter in interior), MLE is:
- Consistent: $\hat\theta_n \xrightarrow{p} \theta^*$ as $n \to \infty$.
- Asymptotically normal: $\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)$, where $I(\theta)$ is the Fisher information.
- Asymptotically efficient: attains the Cramér–Rao lower bound $I(\theta^*)^{-1}$ asymptotically; no regular estimator has lower asymptotic variance.
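These guarantees are easy to check by simulation. A sketch for the Bernoulli MLE $\hat p = k/n$ (seed, sample sizes, and trial counts are arbitrary choices): the error shrinks as $n$ grows, and $\sqrt{n}(\hat p - p)$ has standard deviation close to $\sqrt{I(p)^{-1}} = \sqrt{p(1-p)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3

# Consistency: the error of p_hat = k/n shrinks as n grows
for n in (100, 1_000_000):
    err = abs(rng.binomial(1, p, size=n).mean() - p)
    print(n, err)

# Asymptotic normality: sqrt(n)(p_hat - p) ~ N(0, p(1-p)) for large n
n, trials = 500, 2000
p_hat = rng.binomial(n, p, size=trials) / n
z = np.sqrt(n) * (p_hat - p)
print(z.std(), np.sqrt(p * (1 - p)))  # these should be close
```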
## Common cases
| Model | MLE solution |
|---|---|
| Gaussian mean (known $\sigma^2$) | $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Gaussian variance | $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2$ (biased; sample variance uses $\frac{1}{n-1}$) |
| Bernoulli ($p$) | $\hat p = \frac{1}{n}\sum_{i=1}^{n} x_i$ (fraction of successes) |
| Categorical | empirical class frequencies |
| Linear regression with Gaussian noise | OLS: $\hat\beta = (X^\top X)^{-1} X^\top y$ |
| Logistic regression | no closed form; iterative (Newton, gradient methods) |
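A quick numerical check of the closed forms in the table, on simulated data (distribution parameters and sample sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gaussian: MLE of mean and variance (variance divides by n)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()

# Linear regression with Gaussian noise: the MLE is OLS
X = np.column_stack([np.ones(200), rng.normal(size=200)])
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=200)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y

print(mu_hat, var_hat)  # near 2.0 and 1.5**2 = 2.25
print(beta_hat)         # near beta_true
```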
## Connection to cross-entropy and KL
For a categorical model $p_\theta$ and empirical distribution $\hat p$, the negative log-likelihood divided by $n$ is

$$-\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i) = H(\hat p, p_\theta) = H(\hat p) + \mathrm{KL}(\hat p \,\|\, p_\theta).$$
Maximizing likelihood = minimizing cross-entropy = minimizing KL from the empirical distribution to the model.
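This identity can be verified numerically for a small categorical example (the counts and model probabilities are assumed values):

```python
import numpy as np

counts = np.array([50, 30, 20])  # observed class counts (assumed)
n = counts.sum()
p_emp = counts / n               # empirical distribution p_hat
q = np.array([0.4, 0.4, 0.2])    # model p_theta (assumed)

nll_per_sample = -(counts * np.log(q)).sum() / n
cross_entropy = -(p_emp * np.log(q)).sum()
entropy = -(p_emp * np.log(p_emp)).sum()
kl = (p_emp * np.log(p_emp / q)).sum()

# NLL/n = H(p_hat, p_theta) = H(p_hat) + KL(p_hat || p_theta)
print(nll_per_sample, cross_entropy, entropy + kl)
```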
## MLE vs MAP
MAP (maximum a posteriori) adds a prior: $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \left[\sum_{i=1}^{n} \log p_\theta(x_i) + \log p(\theta)\right]$. MAP equals MLE when the prior is uniform (possibly improper). Common choices:
- Gaussian prior $\leftrightarrow$ L2 regularization on $\theta$.
- Laplace prior $\leftrightarrow$ L1 regularization.
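A sketch of the Gaussian-prior case: for linear regression with noise variance $\sigma^2$ and prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, the MAP estimate is ridge regression with $\lambda = \sigma^2/\tau^2$ (all numeric values below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, 0.0, -2.0])
sigma, tau = 0.5, 1.0
y = X @ beta_true + sigma * rng.normal(size=100)

# Ridge / MAP closed form: (X^T X + lam I)^{-1} X^T y, lam = sigma^2 / tau^2
lam = sigma**2 / tau**2
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Sanity check: the log-posterior gradient vanishes at beta_map,
# i.e. X^T (y - X b) / sigma^2 - b / tau^2 = 0
grad = X.T @ (y - X @ beta_map) / sigma**2 - beta_map / tau**2
print(grad)  # ~ zero vector
```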
## Common pitfalls
- Treating the sample variance as the MLE. The Gaussian-variance MLE divides by $n$ (biased); the sample variance divides by $n-1$ (unbiased). They are different estimators.
- Stopping at MLE without checking identifiability. If two parameter values yield identical likelihoods, MLE is non-unique.
- Trusting MLE on small samples. Asymptotic guarantees can be misleading when $n$ is small relative to the parameter dimension; use cross-validation or Bayesian methods.
- Forgetting that MLE on a misspecified model is still well-defined. It converges to the parameter $\theta^*$ that minimizes $\mathrm{KL}(p_{\text{true}} \,\|\, p_\theta)$ over the model family.
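The first pitfall is easy to see by simulation: averaged over many small samples, the divide-by-$n$ MLE underestimates the true variance by a factor of $(n-1)/n$ (the sample size, variance, and seed below are assumed values):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var = 4.0
n = 5
samples = rng.normal(0.0, np.sqrt(true_var), size=(100_000, n))

# MLE variance: divide by n (biased downward)
mle_var = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
# Unbiased sample variance: rescale by n / (n - 1)
unbiased_var = mle_var * n / (n - 1)

print(mle_var.mean())       # ~ (n-1)/n * true_var = 3.2
print(unbiased_var.mean())  # ~ 4.0
```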