Bayes' rule and the posterior

How to update beliefs given evidence: posterior ∝ likelihood × prior. The foundation of Bayesian inference, naive Bayes, and probabilistic graphical models.


One-line definition

For random variables θ (parameters / hypothesis) and D (data / evidence):

P(θ | D) = P(D | θ) P(θ) / P(D)

The posterior P(θ | D) is proportional to the likelihood P(D | θ) times the prior P(θ).

Why it matters

Bayes’ rule is the only mathematically consistent way to update probabilistic beliefs given new evidence. It underlies probabilistic ML (Gaussian processes, Bayesian deep learning), classification (naive Bayes), generative models (latent variable inference), and many engineering systems (Kalman filtering, sensor fusion).

The connection to MLE: the posterior peak (MAP estimate) collapses to MLE under a uniform prior. So MLE is a special case of Bayesian inference with no prior beliefs.
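To make this concrete, here is a minimal sketch using a Beta-Bernoulli coin model (the data and prior parameters below are made up): with a uniform Beta(1, 1) prior the MAP equals the MLE, while an informative prior pulls the estimate toward its own center.

```python
# Sketch of MAP collapsing to MLE under a uniform prior. The Beta(a, b)
# prior is conjugate to the Bernoulli likelihood, so the posterior is
# Beta(a + k, b + n - k) in closed form.

def beta_mode(a, b):
    """Mode of Beta(a, b) for a, b > 1 -- the MAP estimate."""
    return (a - 1) / (a + b - 2)

n, k = 10, 7                 # hypothetical data: 7 heads in 10 flips
mle = k / n                  # argmax of the Bernoulli likelihood

a, b = 1.0, 1.0              # Beta(1, 1) = uniform prior on [0, 1]
map_uniform = beta_mode(a + k, b + n - k)

a, b = 5.0, 5.0              # informative prior centered at 0.5
map_informative = beta_mode(a + k, b + n - k)

print(mle, map_uniform, map_informative)
```

With the flat prior the two point estimates coincide exactly; the informative prior shrinks the MAP toward 0.5.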

The four pieces

Piece | Name | What it is
Prior | P(θ) | Beliefs about θ before seeing data
Likelihood | P(D ∣ θ) | How probable the data is under each hypothesis
Posterior | P(θ ∣ D) | Updated beliefs after seeing data
Evidence / marginal likelihood | P(D) | Normalizing constant; P(D) = ∫ P(D ∣ θ) P(θ) dθ

The evidence P(D) is often intractable (a high-dimensional integral over θ). But since it does not depend on θ, you can ignore it for point estimates and many decisions.
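A toy grid computation (with hypothetical coin data) illustrates why: dividing by P(D) only rescales the posterior, so the argmax (the MAP) is unchanged.

```python
# Unnormalized vs normalized posterior on a grid, for made-up coin data
# (7 heads in 10 flips) with a uniform prior over theta.
n, k = 10, 7
grid = [i / 1000 for i in range(1, 1000)]                 # theta values in (0, 1)

unnormalized = [t**k * (1 - t)**(n - k) for t in grid]    # likelihood x flat prior
evidence = sum(unnormalized) * (grid[1] - grid[0])        # grid approximation of P(D)
posterior = [u / evidence for u in unnormalized]          # integrates to ~1

# Rescaling by the evidence does not move the peak.
map_unnorm = grid[unnormalized.index(max(unnormalized))]
map_norm = grid[posterior.index(max(posterior))]
print(map_unnorm, map_norm)
```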

The classic example

A medical test is 99% accurate: P(positive | disease) = 0.99 and P(negative | no disease) = 0.99. The disease has prevalence P(disease) = 0.001. A random person tests positive. What is P(disease | positive)?

P(disease | positive) = (0.99 × 0.001) / (0.99 × 0.001 + 0.01 × 0.999) ≈ 0.09

Despite the 99% test accuracy, ~91% of positive results are false positives. The posterior depends crucially on the prior.
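The arithmetic is easy to check directly. This sketch assumes sensitivity = specificity = 0.99 and a prevalence of 0.001 (0.1%), consistent with the ~91% false-positive figure:

```python
# Bayes' rule for the medical-test example.
sens = 0.99        # P(positive | disease)
spec = 0.99        # P(negative | no disease)
prior = 0.001      # P(disease): assumed prevalence of 0.1%

# Evidence P(positive) via the law of total probability.
p_pos = sens * prior + (1 - spec) * (1 - prior)

posterior = sens * prior / p_pos   # P(disease | positive)
print(round(posterior, 3))         # only ~9%, despite the "99% accurate" test
```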

Conjugate priors

A prior is conjugate to a likelihood if the resulting posterior is in the same family (so updating stays in closed form).

Likelihood | Conjugate prior | Posterior family
Bernoulli / binomial | Beta | Beta
Categorical / multinomial | Dirichlet | Dirichlet
Gaussian (mean, known variance σ²) | Gaussian | Gaussian
Gaussian (precision) | Gamma | Gamma
Poisson | Gamma | Gamma

Used in: Thompson sampling for bandits (Beta-Bernoulli), online recsys updates, conjugate Gibbs samplers.
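As a sketch of the Beta-Bernoulli case driving Thompson sampling: because the posterior after each 0/1 observation is again a Beta, the "update" is just two counters. The two-armed bandit and its success rates below are made up.

```python
import random

class BetaArm:
    """One bandit arm with a Beta posterior over its success rate."""
    def __init__(self, alpha=1.0, beta=1.0):    # Beta(1, 1) = uniform prior
        self.alpha, self.beta = alpha, beta

    def update(self, reward):                   # reward is 0 or 1
        self.alpha += reward                    # posterior: Beta(a + k, b + n - k)
        self.beta += 1 - reward

    def sample(self):                           # one posterior draw
        return random.betavariate(self.alpha, self.beta)

random.seed(0)
true_rates = [0.3, 0.6]                         # hypothetical true success rates
arms = [BetaArm(), BetaArm()]
for _ in range(2000):
    i = max(range(2), key=lambda j: arms[j].sample())   # Thompson sampling step
    arms[i].update(1 if random.random() < true_rates[i] else 0)

means = [a.alpha / (a.alpha + a.beta) for a in arms]    # posterior means
print(means)
```

After enough rounds the sampler concentrates its pulls on the better arm, and that arm's posterior mean approaches its true rate.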

Approximate inference (when conjugacy fails)

Modern Bayesian deep learning rarely has closed-form posteriors. Standard approximations:

  • Laplace approximation: Gaussian centered at MAP with covariance from the Hessian.
  • Variational inference: optimize a parametric family q(θ) to minimize KL(q(θ) ‖ P(θ | D)) (equivalently, maximize the ELBO).
  • MCMC (Metropolis-Hastings, HMC, NUTS): draw samples from the posterior asymptotically.
  • Stochastic-gradient Langevin / SGHMC: scale to large data via mini-batches.
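As a toy illustration of MCMC, here is a random-walk Metropolis-Hastings sketch for a Bernoulli likelihood with a uniform prior. The exact posterior is Beta(1 + k, 1 + n − k), so the samples can be sanity-checked; the data and step size are made up.

```python
import math
import random

n, k = 10, 7                                   # hypothetical data: 7 heads in 10 flips

def log_unnorm_posterior(theta):
    """Log of likelihood x flat prior; -inf outside (0, 1)."""
    if not 0 < theta < 1:
        return -math.inf
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

random.seed(0)
theta = 0.5
log_cur = log_unnorm_posterior(theta)
samples = []
for step in range(20000):
    prop = theta + random.gauss(0, 0.1)        # symmetric random-walk proposal
    log_prop = log_unnorm_posterior(prop)
    if random.random() < math.exp(min(0.0, log_prop - log_cur)):
        theta, log_cur = prop, log_prop        # accept
    if step >= 2000:                           # discard burn-in
        samples.append(theta)

mcmc_mean = sum(samples) / len(samples)
exact_mean = (1 + k) / (2 + n)                 # mean of Beta(8, 4)
print(mcmc_mean, exact_mean)
```

Only the unnormalized posterior appears in the acceptance ratio, which is exactly why MCMC sidesteps the intractable evidence.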

Common pitfalls

  • Confusing likelihood with posterior. P(D | θ) is a function of θ but not a probability distribution over θ; it does not integrate to 1 over θ.
  • Ignoring the prior in low-data regimes. With small n, the posterior is dominated by the prior.
  • Reporting MAP without uncertainty. A posterior contains more than its mode; the spread is often the more useful information.
  • Improper priors. Some “uniform” priors over unbounded parameter spaces don’t integrate; the posterior may still be proper (or may not be).