One-line definition
A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ such that
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i).$$
The conditional distribution of the future given the present is independent of the past. For finite state spaces, the dynamics are summarized by a transition matrix $P$ where $P_{ij} = P(X_{n+1} = j \mid X_n = i)$.
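The definition is easy to make concrete in code. A minimal sketch, assuming a hypothetical two-state chain (the states and probabilities are invented for illustration):

```python
import numpy as np

# Hypothetical 2-state chain: state 0 = "sunny", state 1 = "rainy".
# Row i is the distribution of the next state given current state i,
# so each row of P sums to 1 (row-stochastic convention).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def sample_chain(P, x0, n, rng):
    """Sample n transitions; the next state depends only on the current one."""
    states = [x0]
    for _ in range(n):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

path = sample_chain(P, x0=0, n=10, rng=np.random.default_rng(0))
print(path)
```

Note that each step reads only `states[-1]`: that single line is the Markov property.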
Why it matters
Markov chains underlie:
- Hidden Markov models (HMMs) for speech, biology, finance.
- Markov chain Monte Carlo (MCMC) for Bayesian inference (Metropolis-Hastings, Gibbs).
- PageRank (random walk on the web graph).
- N-gram language models.
- Reinforcement learning (Markov decision processes).
- Diffusion models (forward and reverse Markov processes over noise levels).
The Markov property is the cleanest assumption that makes long-range stochastic systems tractable.
Stationary distribution
A distribution $\pi$ is stationary for $P$ if $\pi P = \pi$ (treating $\pi$ as a row vector). It’s a left eigenvector of $P$ with eigenvalue 1.
For an irreducible (any state reachable from any other) and aperiodic chain, the stationary distribution exists, is unique, and the chain converges to it from any starting state: $\lim_{n \to \infty} P(X_n = j) = \pi_j$ for every initial distribution.
The rate of convergence is governed by the second-largest eigenvalue modulus $|\lambda_2|$ of $P$ (mixing rate): the distance to $\pi$ shrinks roughly like $|\lambda_2|^n$.
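Both facts can be checked numerically on a toy chain (the matrix is a made-up example; `numpy.linalg.eig` returns eigenvalues in no particular order, hence the explicit selection):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])  # toy row-stochastic transition matrix

# Stationary pi solves pi P = pi, i.e. P.T v = v: take the eigenvector
# of P.T whose eigenvalue is 1, then normalize it to sum to 1.
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = v / v.sum()
print(pi)                   # [5/6, 1/6] for this chain

# Convergence from an arbitrary start: mu P^n -> pi.
mu = np.array([0.0, 1.0])   # start surely in state 1
for _ in range(50):
    mu = mu @ P
print(np.allclose(mu, pi))  # True

# Mixing rate: second-largest eigenvalue modulus (0.4 for this matrix),
# so the distance to pi shrinks by roughly that factor each step.
print(sorted(np.abs(vals))[-2])
```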
Detailed balance and reversibility
A chain is reversible if there is a $\pi$ such that
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j \quad \text{(detailed balance)}.$$
Detailed balance implies $\pi$ is stationary (sum both sides over $i$). MCMC algorithms (Metropolis-Hastings) construct chains satisfying detailed balance with respect to a target distribution. This is the trick that makes them sample from posteriors.
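A minimal sketch of the construction, assuming a random-walk Metropolis sampler targeting a standard normal (the step size and chain length here are arbitrary choices, not tuned values):

```python
import numpy as np

def metropolis(log_target, x0, n_steps, step, rng):
    """Random-walk Metropolis. The Gaussian proposal is symmetric, so the
    acceptance probability min(1, target(x')/target(x)) makes the chain
    satisfy detailed balance with respect to the target."""
    x, samples = x0, []
    for _ in range(n_steps):
        prop = x + step * rng.normal()
        # Accept/reject in log space for numerical stability.
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
        samples.append(x)
    return np.array(samples)

# Target: standard normal; an unnormalized log-density is all that's needed.
chain = metropolis(lambda x: -0.5 * x**2, x0=5.0, n_steps=20000,
                   step=1.0, rng=np.random.default_rng(1))
kept = chain[2000:]             # discard burn-in before stationarity
print(kept.mean(), kept.std())  # close to 0 and 1
```

Only a ratio of target densities appears, which is why the normalizing constant of a posterior is never needed.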
Common cases in ML
| Use case | What the Markov chain is |
|---|---|
| HMM | Hidden state evolves as Markov chain |
| MCMC | Sampler defines a chain with target as stationary |
| PageRank | Random walk on web graph; $\pi$ = page rank vector |
| Diffusion model | Sequence of increasingly noisy latents (Gaussian transitions) |
| MDP / RL | State transitions $P(s' \mid s, a)$ given action |
| Language model | Each token depends on the previous context window (a high-order Markov chain) |
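The PageRank row of the table can be sketched end to end. A toy example with a hypothetical 4-page link graph (the damping value 0.85 is the commonly quoted choice):

```python
import numpy as np

# Hypothetical 4-page web: adj[i, j] = 1 means page i links to page j.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)

P = adj / adj.sum(axis=1, keepdims=True)  # random surfer follows a link
d = 0.85                                  # with prob 1-d, jump anywhere
G = d * P + (1 - d) / len(P)              # irreducible + aperiodic chain

# PageRank is the stationary distribution of G, found by power iteration.
pi = np.full(len(P), 1.0 / len(P))
for _ in range(100):
    pi = pi @ G
print(pi)  # pi[2] is largest: page 2 receives the most links
```

The uniform jump term is what guarantees irreducibility and aperiodicity, so the stationary distribution exists and power iteration converges.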
Higher-order chains
A chain where $X_n$ depends on the last $k$ states ($k$-th order Markov) can be re-cast as first-order on the state space of $k$-tuples. Trigram language models are 2nd-order Markov over tokens, equivalent to first-order over bigrams.
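A sketch of the re-casting on a made-up toy corpus: the states are bigrams, and each transition shifts the window by one token.

```python
from collections import defaultdict

# Hypothetical toy corpus; a trigram model conditions each token on the
# previous two -- i.e. a first-order chain whose states are bigrams.
tokens = "the cat sat on the mat the cat ran".split()

counts = defaultdict(lambda: defaultdict(int))
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    counts[(a, b)][c] += 1   # state (a, b) transitions to state (b, c)

def next_token_probs(a, b):
    """Conditional distribution of the next token given the bigram state."""
    total = sum(counts[(a, b)].values())
    return {c: k / total for c, k in counts[(a, b)].items()}

print(next_token_probs("the", "cat"))  # {'sat': 0.5, 'ran': 0.5}
```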
Common pitfalls
- Assuming Markov when data has long-range dependence. Often a useful approximation, but check it, e.g. by testing whether conditioning on more history improves held-out prediction.
- Non-converging MCMC. A chain not yet at stationarity gives biased samples; use multiple chains and convergence diagnostics ($\hat{R}$, ESS).
- Confusing transition matrix conventions. Some texts use row-stochastic $P$ with $P_{ij} = P(i \to j)$ and stationarity $\pi P = \pi$; others use column-stochastic $P$ with $P_{ij} = P(j \to i)$ and $P \pi = \pi$. Check whether $P$ acts on row or column vectors.
- Mistaking the stationary distribution for the marginal of $X_n$. The marginal at finite $n$ depends on the initial distribution; the stationary $\pi$ is the limit.
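The last pitfall is easy to demonstrate numerically, reusing the same toy two-state matrix as an assumption:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # toy transition matrix

mu_a = np.array([1.0, 0.0])  # start surely in state 0
mu_b = np.array([0.0, 1.0])  # start surely in state 1
for _ in range(3):
    mu_a, mu_b = mu_a @ P, mu_b @ P
print(mu_a, mu_b)            # marginals at n = 3 still differ

for _ in range(200):
    mu_a, mu_b = mu_a @ P, mu_b @ P
print(np.allclose(mu_a, mu_b))  # both have reached the stationary limit
```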