Markov chains

Stochastic processes where the future depends only on the present, not the past. Foundation of HMMs, MCMC, and many sequence models.


One-line definition

A Markov chain is a sequence of random variables $X_1, X_2, \dots$ such that

$$P(X_{n+1} = x \mid X_1 = x_1, \dots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n).$$

The conditional distribution of the future given the present is independent of the past. For finite state spaces, the dynamics are summarized by a transition matrix $P$ where $P_{ij} = P(X_{n+1} = j \mid X_n = i)$.
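The definition can be exercised directly with a small simulation. The 3-state transition matrix below is an illustrative assumption, not from the text; the only structural requirement is that each row sums to 1:

```python
import numpy as np

# Illustrative 3-state chain (values are an assumption): rows index the
# current state, columns the next state; each row is a probability vector.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

rng = np.random.default_rng(0)

def simulate(P, x0, n, rng):
    """Sample a length-n trajectory: the next state depends only on the current one."""
    states = [x0]
    for _ in range(n - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

traj = simulate(P, x0=0, n=10_000, rng=rng)
```

Note that `simulate` never looks further back than `states[-1]` — that single line is the Markov property in code.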

Why it matters

Markov chains underlie:

  • Hidden Markov models (HMMs) for speech, biology, finance.
  • Markov chain Monte Carlo (MCMC) for Bayesian inference (Metropolis-Hastings, Gibbs).
  • PageRank (random walk on the web graph).
  • N-gram language models.
  • Reinforcement learning (Markov decision processes).
  • Diffusion models (forward and reverse Markov processes over noise levels).

The Markov property is the cleanest assumption that makes long-range stochastic systems tractable.

Stationary distribution

A distribution $\pi$ is stationary for $P$ if $\pi P = \pi$ (treating $\pi$ as a row vector). It’s a left eigenvector of $P$ with eigenvalue 1.

For an irreducible (any state reachable from any other) and aperiodic chain, the stationary distribution exists, is unique, and the chain converges to it from any starting state:

$$P(X_n = j) \to \pi_j \quad \text{as } n \to \infty.$$

The rate of convergence is governed by the second-largest eigenvalue modulus of $P$ (the mixing rate).
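Both facts — stationarity as a left eigenvector and convergence from any start — can be checked numerically. The matrix here is an illustrative assumption:

```python
import numpy as np

# Illustrative row-stochastic matrix (an assumption, not from the text).
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])

# Stationary distribution: left eigenvector of P with eigenvalue 1,
# i.e. an eigenvector of P.T, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

# Convergence: repeated multiplication by P drives any initial
# distribution toward pi, at a rate set by the second eigenvalue.
mu = np.array([1.0, 0.0, 0.0])  # start fully concentrated on state 0
for _ in range(50):
    mu = mu @ P

assert np.allclose(mu, pi, atol=1e-8)
```

Sorting the eigenvalue moduli of `P` gives the mixing rate directly: the smaller the second-largest modulus, the faster `mu` reaches `pi`.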

Detailed balance and reversibility

A chain is reversible if there is a $\pi$ such that

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j.$$

Detailed balance implies $\pi$ is stationary (sum both sides over $i$). MCMC algorithms (Metropolis–Hastings) construct chains satisfying detailed balance with respect to a target distribution. This is the trick that makes them sample from posteriors.
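A minimal random-walk Metropolis sketch makes the construction concrete. The target (a standard normal, known only up to its log density) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    """Unnormalized log density of a standard normal (illustrative target)."""
    return -0.5 * x * x

def metropolis(log_p, x0, n_steps, step=1.0, rng=rng):
    """Random-walk Metropolis: the proposal is symmetric, so accepting with
    probability min(1, p(x')/p(x)) enforces detailed balance w.r.t. p."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + step * rng.normal()
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal  # accept; on rejection the chain stays put
        samples.append(x)
    return np.array(samples)

samples = metropolis(log_target, x0=0.0, n_steps=50_000)
```

Only *ratios* of the target density appear, which is why the normalizing constant of a posterior is never needed.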

Common cases in ML

| Use case | What is the Markov chain |
| --- | --- |
| HMM | Hidden state evolves as a Markov chain |
| MCMC | Sampler defines a chain with the target as its stationary distribution |
| PageRank | Random walk on the web graph; $\pi$ = page-rank vector |
| Diffusion model | Sequence of noise levels (Gaussian transitions) |
| MDP / RL | State transitions given an action |
| Language model | Each token depends on previous tokens (a long-context Markov chain) |

Higher-order chains

A chain where $X_n$ depends on the last $k$ states ($k$-th order Markov) can be re-cast as first-order on the state space of $k$-tuples. Trigram language models are 2nd-order Markov over tokens, equivalent to first-order over bigrams.
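The re-casting is mechanical. A sketch on a hypothetical toy corpus (the sentence is made up for illustration):

```python
from collections import defaultdict

# Toy corpus (an assumption). A trigram model conditions each token on the
# previous two, i.e. a 2nd-order Markov chain over tokens.
tokens = "the cat sat on the mat the cat ran".split()

# Recast as first-order: the state is the bigram of the last two tokens.
transitions = defaultdict(list)
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    transitions[(a, b)].append(c)

# Emitting token c from state (a, b) moves the chain to state (b, c),
# so the tuple-state dynamics are first-order Markov.
```

The same trick works for any finite $k$, at the cost of a state space that grows exponentially in $k$.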

Common pitfalls

  • Assuming Markov when data has long-range dependence. The Markov assumption is often a useful approximation, but check it by holding out structure.
  • Non-converging MCMC. A chain not yet at stationarity gives biased samples; use multiple chains and convergence diagnostics ($\hat{R}$, ESS).
  • Confusing transition matrix conventions. Some texts use a row-stochastic $P$ with $\pi P = \pi$; others a column-stochastic $P$ with $P \pi = \pi$. Check whether $P$ acts on rows or columns.
  • Mistaking the stationary distribution for the marginal of $X_n$. The marginal at finite $n$ depends on the initial distribution; the stationary distribution is the limit.