One-line definition
For random variables $\theta$ (parameters / hypothesis) and $D$ (data / evidence):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

The posterior $p(\theta \mid D)$ is proportional to the likelihood $p(D \mid \theta)$ times the prior $p(\theta)$.
Why it matters
Bayes’ rule is the only mathematically consistent way to update probabilistic beliefs given new evidence. It underlies probabilistic ML (Gaussian processes, Bayesian deep learning), classification (naive Bayes), generative models (latent variable inference), and many engineering systems (Kalman filtering, sensor fusion).
The connection to MLE: the posterior peak (the MAP estimate) collapses to the MLE under a uniform prior. So MLE is the special case of Bayesian point estimation with a flat, uninformative prior.
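In symbols (dropping $p(D)$, which does not depend on $\theta$):

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \, p(D \mid \theta)\, p(\theta), \qquad \text{which reduces to } \hat{\theta}_{\text{MLE}} = \arg\max_\theta \, p(D \mid \theta) \text{ when } p(\theta) \text{ is constant.}$$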
The four pieces
| Piece | What it is | Formula |
|---|---|---|
| Prior | Beliefs about $\theta$ before seeing data | $p(\theta)$ |
| Likelihood | How probable the data is under each hypothesis | $p(D \mid \theta)$ |
| Posterior | Updated beliefs about $\theta$ after seeing data | $p(\theta \mid D)$ |
| Evidence / marginal likelihood | Normalizing constant | $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ |
The evidence is often intractable (a high-dimensional integral). Because it does not depend on $\theta$, you can ignore it for point estimates (MAP) and many decisions.
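To make the four pieces concrete, here is a minimal grid-based sketch (not from the text above; the coin-flip data and the prior choice are made up for illustration): prior and likelihood are evaluated on a discrete grid of $\theta$ values, multiplied, and normalized, and the MAP estimate needs only the unnormalized product.

```python
import numpy as np
from scipy.stats import binom

# Illustrative setup: infer a coin's bias theta from 7 heads in 10 flips.
theta_grid = np.linspace(0.001, 0.999, 999)              # discrete grid over theta
prior = np.exp(-0.5 * ((theta_grid - 0.5) / 0.2) ** 2)   # prior mildly favoring fair coins
prior /= prior.sum()

likelihood = binom.pmf(7, n=10, p=theta_grid)             # p(D | theta) at each grid point

unnormalized = likelihood * prior                         # numerator of Bayes' rule
evidence = unnormalized.sum()                             # p(D), a cheap sum on a 1-D grid
posterior = unnormalized / evidence                       # p(theta | D)

theta_map = theta_grid[np.argmax(unnormalized)]           # MAP: the evidence is not needed
print(f"MAP estimate: {theta_map:.3f}")
```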
The classic example
A medical test is 99% accurate: $P(+ \mid \text{disease}) = 0.99$ and $P(- \mid \text{no disease}) = 0.99$. The disease has prevalence $P(\text{disease}) = 0.001$. A random person tests positive. What is $P(\text{disease} \mid +)$?
The answer is about 9%: despite the 99% test accuracy, roughly 91% of positive results are false positives. The posterior depends crucially on the prior (here, the low prevalence).
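The arithmetic, as a quick sketch in Python (variable names are illustrative):

```python
# Bayes' rule for the medical test example: p(disease | +)
p_disease = 0.001           # prior: prevalence
p_pos_given_disease = 0.99  # sensitivity
p_pos_given_healthy = 0.01  # false-positive rate (1 - specificity)

evidence = (p_pos_given_disease * p_disease
            + p_pos_given_healthy * (1 - p_disease))     # p(+)
posterior = p_pos_given_disease * p_disease / evidence   # p(disease | +)
print(f"p(disease | +) = {posterior:.3f}")               # ~0.090, so ~91% of positives are false
```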
Conjugate priors
A prior is conjugate to a likelihood if the resulting posterior is in the same family (so updating stays in closed form).
| Likelihood | Conjugate prior | Posterior family |
|---|---|---|
| Bernoulli/binomial | Beta | Beta |
| Categorical/multinomial | Dirichlet | Dirichlet |
| Gaussian (mean, known $\sigma^2$) | Gaussian | Gaussian |
| Gaussian (precision) | Gamma | Gamma |
| Poisson | Gamma | Gamma |
Used in: Thompson sampling for bandits (Beta-Bernoulli), online recsys updates, conjugate Gibbs samplers.
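As a sketch of the Beta-Bernoulli case behind Thompson sampling (the two-armed bandit below, with its made-up reward probabilities, is purely illustrative), the conjugate update is just a count increment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit with unknown success probabilities.
true_probs = [0.3, 0.6]
# Beta(1, 1) priors, i.e. uniform over each arm's success probability.
alpha = np.ones(2)
beta = np.ones(2)

for _ in range(1000):
    # Thompson sampling: draw one plausible success rate per arm from its posterior...
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))             # ...and play the arm that looks best.
    reward = rng.random() < true_probs[arm]   # Bernoulli reward

    # Conjugate update: Beta(a, b) + Bernoulli observation -> Beta(a + r, b + 1 - r).
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```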
Approximate inference (when conjugacy fails)
Modern Bayesian deep learning rarely has closed-form posteriors. Standard approximations:
- Laplace approximation: a Gaussian centered at the MAP estimate, with covariance given by the inverse Hessian of the negative log posterior there.
- Variational inference: optimize a parametric family $q_\phi(\theta)$ to minimize $\mathrm{KL}\big(q_\phi(\theta) \,\|\, p(\theta \mid D)\big)$ (equivalently, maximize the ELBO).
- MCMC (Metropolis-Hastings, HMC, NUTS): draw samples whose distribution converges to the posterior; see the sketch after this list.
- Stochastic-gradient Langevin / SGHMC: scale to large data via mini-batches.
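A minimal random-walk Metropolis-Hastings sketch, as referenced in the MCMC bullet above; the Gaussian-mean model, prior width, and step size below are made-up choices for illustration, not a recommended setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up example: posterior over a Gaussian mean mu, given data and a N(0, 10^2) prior.
data = rng.normal(loc=2.0, scale=1.0, size=50)

def log_posterior(mu):
    # log p(mu | data) up to a constant: log-likelihood + log-prior.
    log_lik = -0.5 * np.sum((data - mu) ** 2)   # Gaussian likelihood, sigma = 1
    log_prior = -0.5 * (mu / 10.0) ** 2          # Gaussian prior, sigma = 10
    return log_lik + log_prior

samples = []
mu = 0.0                                         # arbitrary starting point
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.3)        # random-walk proposal
    # Accept with probability min(1, p(proposal)/p(current)), computed in log space.
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    samples.append(mu)

burned = np.array(samples[1000:])                # drop burn-in
print(f"posterior mean ~ {burned.mean():.2f}, sd ~ {burned.std():.2f}")
```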
Common pitfalls
- Confusing likelihood with posterior. $p(D \mid \theta)$ is not a probability distribution over $\theta$; it does not integrate to 1 over $\theta$.
- Ignoring the prior in low-data regimes. With small $n$, the posterior is dominated by the prior.
- Reporting MAP without uncertainty. A posterior contains more than its mode; the spread is often the more useful information.
- Improper priors. Some “uniform” priors over unbounded parameter spaces don’t integrate; the posterior may still be proper (or may not be).