One-line definition
For a hidden Markov model with $T$ timesteps and $N$ states, forward-backward computes the posterior $P(z_t = i \mid x_{1:T})$ for every $t$ in $O(TN^2)$ time. Viterbi computes $\arg\max_{z_{1:T}} P(z_{1:T} \mid x_{1:T})$ in the same time. Brute force would be $O(N^T)$.
Why it matters
These are the canonical examples of dynamic programming on sequences. Speech recognition (HMM-based decoding), part-of-speech tagging, gene-finding, conditional random fields for sequence labeling, and any structured-prediction task with a chain factor graph rely on one or both. The duality between sum (forward-backward) and max (Viterbi) is the textbook example of how the same DP works in two different semirings.
Even in modern deep-learning sequence models, Viterbi shows up: CTC decoding for speech, beam search as an approximation when the state space is too large, CRF layers on top of BERT for NER.
The setup
A hidden Markov model has:
- States $z_t \in \{1, \dots, N\}$.
- Observations $x_1, \dots, x_T$.
- Initial distribution $\pi_i = P(z_1 = i)$.
- Transitions $A_{ij} = P(z_{t+1} = j \mid z_t = i)$.
- Emissions $B_i(x_t) = P(x_t \mid z_t = i)$.
Joint probability of a hidden path $z_{1:T}$ and observation sequence $x_{1:T}$:

$$P(z_{1:T}, x_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t}\, B_{z_t}(x_t)$$

There are $N^T$ possible paths. Both algorithms factor the computation through a $T \times N$ DP table.
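A minimal sketch of this setup, assuming a toy two-state HMM with discrete observation symbols; the names (`pi`, `A`, `B`) and shapes are illustrative, not tied to any particular library. The brute-force sum at the end enumerates all $N^T$ paths purely for contrast with the DP versions below.

```python
import numpy as np

N = 2                      # number of hidden states
pi = np.array([0.6, 0.4])  # pi[i]   = P(z_1 = i)
A = np.array([[0.7, 0.3],  # A[i, j] = P(z_{t+1} = j | z_t = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i, k] = P(x_t = k | z_t = i)
              [0.1, 0.3, 0.6]])

def joint_prob(path, obs):
    """P(z_{1:T} = path, x_{1:T} = obs) for one hidden path."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

# Brute force over all N**T paths -- exponential, shown only for contrast.
obs = [0, 2, 1]
likelihood = sum(joint_prob(path, obs)
                 for path in np.ndindex(*(N,) * len(obs)))
```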
Forward algorithm
Define the forward variable

$$\alpha_t(i) = P(x_{1:t}, z_t = i)$$

Recurrence:

$$\alpha_1(i) = \pi_i B_i(x_1), \qquad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, A_{ij}\Big]\, B_j(x_{t+1})$$

The total likelihood is $P(x_{1:T}) = \sum_{i=1}^{N} \alpha_T(i)$. Computing the full table is $O(TN^2)$.
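A sketch of the forward pass in probability space, under the same assumed parameterization (discrete emissions indexed as `B[:, obs[t]]`):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return the forward table; row t (0-indexed) holds alpha_{t+1}(i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # sum_i alpha[t-1, i] * A[i, j], then scale by the emission at step t
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Total likelihood P(x_{1:T}) is the sum of the last row:
# likelihood = forward(pi, A, B, obs)[-1].sum()
```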
Backward algorithm
The mirror image:

$$\beta_t(i) = P(x_{t+1:T} \mid z_t = i)$$

with $\beta_T(i) = 1$ and

$$\beta_t(i) = \sum_{j=1}^{N} A_{ij}\, B_j(x_{t+1})\, \beta_{t+1}(j)$$
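The matching backward pass, same assumptions:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Return the backward table; row t (0-indexed) holds beta_{t+1}(i)."""
    T, N = len(obs), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                     # base case: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # sum_j A[i, j] * B[j, obs[t+1]] * beta[t+1, j]
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```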
Posterior over states (forward-backward)

$$\gamma_t(i) = P(z_t = i \mid x_{1:T}) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)}$$

This is the per-timestep posterior used in EM training of HMMs and in any system that needs marginal beliefs over hidden states.
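Combining the two tables into $\gamma$, reusing the `forward` and `backward` sketches above:

```python
import numpy as np

def posteriors(alpha, beta):
    """gamma[t, i] proportional to alpha[t, i] * beta[t, i], each row normalized."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# gamma = posteriors(forward(pi, A, B, obs), backward(pi, A, B, obs))
```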
Viterbi: the max version
Replace the sum in the forward recurrence with a max:

$$\delta_{t+1}(j) = \Big[\max_{i} \delta_t(i)\, A_{ij}\Big]\, B_j(x_{t+1})$$

with $\delta_1(i) = \pi_i B_i(x_1)$. Track the argmax to reconstruct the path:

$$\psi_{t+1}(j) = \arg\max_{i} \delta_t(i)\, A_{ij}$$

After the forward pass, the most likely path is recovered by backtracking from $\arg\max_i \delta_T(i)$ through the stored $\psi$ pointers. Same $O(TN^2)$ cost.
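A Viterbi sketch in probability space (real implementations work in log space, see below), same assumed parameterization:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta[t, j]: best path score ending in j
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: argmax backpointer
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j] = delta[t-1, i] * A[i, j]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the best final state through the stored pointers.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```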
Sum vs max: the semiring view
Both algorithms have the same shape; only the operations differ:
| Algorithm | "Add" | "Multiply" |
|---|---|---|
| Forward | $\sum$ | $\times$ |
| Viterbi | $\max$ | $\times$ (or $+$ in log space) |
Both work because the operations form a semiring (associativity, distributivity). The same DP framework computes max-marginals (Viterbi), sum-marginals (forward-backward), counts (probability of inputs), expectations (segment-level expected counts), and gradients of any of the above.
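A sketch of that shared shape: one chain DP, parameterized only by which "add" is plugged in (`np.sum` for the forward likelihood, `np.max` for the Viterbi score).

```python
import numpy as np

def chain_dp(pi, A, B, obs, add):
    """Run the chain recurrence with a pluggable 'add' reduction."""
    v = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        # reduce over the previous state i, then apply the emission at step t
        v = add(v[:, None] * A, axis=0) * B[:, obs[t]]
    return add(v, axis=0)

# likelihood = chain_dp(pi, A, B, obs, np.sum)   # forward / sum-product
# best_score = chain_dp(pi, A, B, obs, np.max)   # Viterbi / max-product
```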
In log space (always)
Multiplying many small probabilities underflows in float32. Always work in log space:

$$\log \alpha_{t+1}(j) = \log B_j(x_{t+1}) + \operatorname{logsumexp}_{i}\big(\log \alpha_t(i) + \log A_{ij}\big)$$
The logsumexp trick (subtract the max before exponentiating) keeps everything stable.
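A log-space forward pass with a hand-rolled `logsumexp` to make the max-subtraction trick explicit (`scipy.special.logsumexp` does the same thing):

```python
import numpy as np

def logsumexp(a, axis=0):
    # Subtract the max so np.exp never over/underflows; add it back outside the log.
    m = a.max(axis=axis, keepdims=True)
    return np.log(np.exp(a - m).sum(axis=axis)) + m.squeeze(axis)

def forward_log(log_pi, log_A, log_B, obs):
    T, N = len(obs), len(log_pi)
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # logsumexp over i of (log_alpha[t-1, i] + log_A[i, j]) for each j
        log_alpha[t] = (logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
                        + log_B[:, obs[t]])
    return log_alpha   # log P(x_{1:T}) = logsumexp(log_alpha[-1])
```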
Modern uses
- CRF decoding for NER: BERT produces per-token logits; a CRF layer with a learned transition matrix runs Viterbi at inference and forward-backward at training.
- CTC decoding for speech: a sum-product algorithm over alignments. Different state structure but the same DP machinery.
- Beam search as approximate Viterbi: when $N$ is too large for the full DP (e.g. autoregressive language models with vocab ~100k), beam search keeps only the top-$k$ partial paths at each step; a minimal sketch follows below.
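A minimal beam-search sketch over the same HMM parameterization; `k` is the beam width, and on a toy model with `k >= N` it degenerates to exact Viterbi.

```python
import numpy as np

def beam_search(log_pi, log_A, log_B, obs, k=4):
    # Each hypothesis is a (score, path) pair; start from the k best initial states.
    scores = log_pi + log_B[:, obs[0]]
    beams = [(scores[i], [int(i)]) for i in np.argsort(scores)[-k:]]
    for t in range(1, len(obs)):
        # Expand every hypothesis by every next state, then keep only the top k.
        candidates = [(s + log_A[path[-1], j] + log_B[j, obs[t]], path + [j])
                      for s, path in beams
                      for j in range(len(log_pi))]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:k]
    return max(beams, key=lambda c: c[0])   # (best score, best path found)
```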
Common pitfalls
- Working in probability space instead of log space. Numerical underflow is guaranteed for all but very short sequences.
- Forgetting to renormalize when doing forward-backward in float32. Some implementations renormalize at each step and accumulate the log-normalizer separately (see the sketch after this list).
- Confusing the per-timestep argmax of forward-backward with Viterbi. They are different: Viterbi gives the most likely full sequence; per-timestep argmax gives the sequence of most likely states, which can be infeasible (i.e., have zero probability under the model).
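A sketch of the rescaled forward pass mentioned in the renormalization pitfall above: normalize $\alpha_t$ at every step and accumulate the log of the normalizers separately, so the table never underflows and the log-likelihood stays exact.

```python
import numpy as np

def forward_scaled(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    log_norm = 0.0
    for t in range(T):
        unnorm = pi * B[:, obs[0]] if t == 0 else (alpha[t - 1] @ A) * B[:, obs[t]]
        c = unnorm.sum()           # per-step normalizer
        alpha[t] = unnorm / c      # scaled row: P(z_t = i | x_{1:t})
        log_norm += np.log(c)
    return alpha, log_norm         # log P(x_{1:T}) = log_norm
```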