Forward-backward and Viterbi: dynamic programming on chains

Sum and max over exponentially many paths in linear time. Forward-backward computes posteriors over hidden states; Viterbi finds the most likely state sequence. The same idea, two semirings.

Reviewed · 3 min read

One-line definition

For a hidden Markov model with $T$ timesteps and $N$ states, forward-backward computes the posterior $p(z_t = k \mid x_{1:T})$ for every $(t, k)$ in $O(TN^2)$ time. Viterbi computes $\arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})$ in the same time. Brute force would be $O(N^T)$.

Why it matters

These are the canonical examples of dynamic programming on sequences. Speech recognition (HMM-based decoding), part-of-speech tagging, gene finding, conditional random fields for sequence labeling, and any structured-prediction task with a chain factor graph rely on one or both. The duality between sum (forward-backward) and max (Viterbi) is the textbook example of how the same DP works in two different semirings.

Even in modern deep-learning sequence models, Viterbi shows up: CTC decoding for speech, beam search as an approximation when the state space is too large, CRF layers on top of BERT for NER.

The setup

A hidden Markov model has:

  • States $z_t \in \{1, \dots, N\}$.
  • Observations $x_1, \dots, x_T$.
  • Initial distribution $\pi_k = p(z_1 = k)$.
  • Transitions $A_{jk} = p(z_t = k \mid z_{t-1} = j)$.
  • Emissions $B_k(x_t) = p(x_t \mid z_t = k)$.

Joint probability of a hidden path $z_{1:T}$ and observation sequence $x_{1:T}$:

$$p(z_{1:T}, x_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t}\, B_{z_t}(x_t).$$

There are $N^T$ possible paths. Both algorithms factor the computation through a DP table.
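
As a concrete reference point, here is a minimal NumPy sketch of this setup. The three-state, four-symbol HMM and the random parameters are illustrative only, not anything prescribed by the text.

```python
# A toy HMM and the joint probability of one (path, observations) pair.
import numpy as np

N, T = 3, 5                              # hidden states, timesteps (illustrative sizes)
rng = np.random.default_rng(0)

pi = np.full(N, 1.0 / N)                 # initial distribution pi_k = p(z_1 = k)
A = rng.dirichlet(np.ones(N), size=N)    # transitions A[j, k] = p(z_t = k | z_{t-1} = j)
B = rng.dirichlet(np.ones(4), size=N)    # emissions over 4 symbols: B[k, x] = p(x | z_t = k)

x = rng.integers(0, 4, size=T)           # an observation sequence
z = rng.integers(0, N, size=T)           # one hidden path out of N**T possibilities

# Joint probability, following the factorization above.
p = pi[z[0]] * B[z[0], x[0]]
for t in range(1, T):
    p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
print(p)
```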

Forward algorithm

Define the forward variable

$$\alpha_t(k) = p(x_{1:t}, z_t = k).$$

Recurrence:

$$\alpha_1(k) = \pi_k B_k(x_1), \qquad \alpha_t(k) = B_k(x_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, A_{jk}.$$

The total likelihood is $p(x_{1:T}) = \sum_{k} \alpha_T(k)$. Computing the full table is $O(TN^2)$.
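
A sketch of the forward pass in plain probability space (the log-space version appears later), reusing the pi, A, B, x conventions from the setup sketch:

```python
import numpy as np

def forward(pi, A, B, x):
    """alpha[t, k] = p(x_{1..t}, z_t = k). O(T N^2) time."""
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        # sum over the previous state j of alpha[t-1, j] * A[j, k], then multiply by the emission
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    return alpha

# Total likelihood of the observations: sum over the last row of the table.
# likelihood = forward(pi, A, B, x)[-1].sum()
```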

Backward algorithm

The mirror image:

$$\beta_t(j) = p(x_{t+1:T} \mid z_t = j),$$

with $\beta_T(j) = 1$ and

$$\beta_t(j) = \sum_{k=1}^{N} A_{jk}\, B_k(x_{t+1})\, \beta_{t+1}(k).$$
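
The matching backward pass, under the same array conventions as the forward sketch:

```python
import numpy as np

def backward(A, B, x):
    """beta[t, j] = p(x_{t+1..T} | z_t = j), with the last row fixed to 1."""
    T, N = len(x), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # sum over the next state k of A[j, k] * B[k, x_{t+1}] * beta[t+1, k]
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```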

Posterior over states (forward-backward)

$$\gamma_t(k) \equiv p(z_t = k \mid x_{1:T}) = \frac{\alpha_t(k)\,\beta_t(k)}{\sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j)}.$$

This is the per-timestep posterior used in EM training of HMMs and in any system that needs marginal beliefs over hidden states.
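
Combining the two tables into the per-timestep posterior is one line of array arithmetic. This sketch assumes the forward and backward functions from the sketches above:

```python
import numpy as np

def posteriors(pi, A, B, x):
    # gamma[t, k] proportional to alpha[t, k] * beta[t, k]
    alpha, beta = forward(pi, A, B, x), backward(A, B, x)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize each row to p(z_t = . | x_{1..T})
```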

Viterbi: the max version

Replace the sum in the forward recurrence with a max:

$$\delta_t(k) = B_k(x_t) \max_{j} \delta_{t-1}(j)\, A_{jk},$$

with $\delta_1(k) = \pi_k B_k(x_1)$. Track the argmax to reconstruct the path:

$$\psi_t(k) = \arg\max_{j} \delta_{t-1}(j)\, A_{jk}.$$

After the forward pass, the most likely path is recovered by backtracking from $z_T^{*} = \arg\max_k \delta_T(k)$ through $z_t^{*} = \psi_{t+1}(z_{t+1}^{*})$. Same $O(TN^2)$ cost.
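
A sketch of the max-product recursion with backtracking, again under the same pi, A, B, x conventions:

```python
import numpy as np

def viterbi(pi, A, B, x):
    T, N = len(x), len(pi)
    delta = np.zeros((T, N))             # delta[t, k] = score of the best path ending in state k at time t
    psi = np.zeros((T, N), dtype=int)    # psi[t, k] = argmax predecessor state
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[j, k] = delta[t-1, j] * A[j, k]
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, x[t]] * scores.max(axis=0)
    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```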

Sum vs max: the semiring view

Both algorithms have the same shape; only the operations differ:

Algorithm | "Add"  | "Multiply"
Forward   | $+$    | $\times$
Viterbi   | $\max$ | $\times$ (or $+$ in log space)

Both work because the operations form a semiring (associativity, distributivity). The same DP framework computes max-marginals (Viterbi), sum-marginals (forward-backward), counts (probability of inputs), expectations (segment-level expected counts), and gradients of any of the above.
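
One way to make the semiring view concrete is to write the chain recursion once and pass in the "add" operation. This toy sketch does exactly that; the name chain_dp is illustrative, not a standard API.

```python
import numpy as np

def chain_dp(pi, A, B, x, add):
    """The forward-style chain recursion with a pluggable 'add' (np.sum or np.max)."""
    T = len(x)
    v = pi * B[:, x[0]]
    for t in range(1, T):
        v = B[:, x[t]] * add(v[:, None] * A, axis=0)   # combine over the previous state
    return add(v, axis=0)

# chain_dp(pi, A, B, x, np.sum) -> total likelihood p(x_{1..T})        (forward)
# chain_dp(pi, A, B, x, np.max) -> probability of the single best path (Viterbi score)
```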

In log space (always)

Multiplying many small probabilities underflows in float32. Always work in log space:

$$\log \alpha_t(k) = \log B_k(x_t) + \operatorname{logsumexp}_{j}\big(\log \alpha_{t-1}(j) + \log A_{jk}\big).$$

The logsumexp trick (subtract the max before exponentiating) keeps everything stable.
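
A stable log-space forward pass might look like the sketch below. It uses SciPy's logsumexp for convenience; a hand-rolled max-subtraction version works just as well.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(pi, A, B, x):
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(x), len(pi)
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, x[0]]
    for t in range(1, T):
        # logsumexp over the previous state j of log_alpha[t-1, j] + log_A[j, k]
        log_alpha[t] = log_B[:, x[t]] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    return log_alpha   # log p(x_{1..T}) = logsumexp(log_alpha[-1])
```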

Modern uses

  • CRF decoding for NER: BERT produces per-token logits; a CRF layer with a learned transition matrix runs Viterbi at inference and forward-backward at training.
  • CTC decoding for speech: a sum-product algorithm over alignments. Different state structure but the same DP machinery.
  • Beam search as approximate Viterbi: when the state space $N$ is too large for full DP (e.g. autoregressive language models with vocab 100k), beam search keeps only the top-$k$ partial paths at each step (a toy sketch follows this list).
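
A toy sketch of beam search on the same HMM, keeping only the top-$k$ partial hypotheses at each step. The beam_size and the exhaustive one-step expansion are illustrative, not a production decoder.

```python
import numpy as np

def beam_search(pi, A, B, x, beam_size=2):
    T = len(x)
    # Each hypothesis is (log score, path); seed the beam with the best first states.
    scores = np.log(pi) + np.log(B[:, x[0]])
    beam = [(scores[k], [k]) for k in np.argsort(scores)[-beam_size:]]
    for t in range(1, T):
        candidates = []
        for score, path in beam:
            j = path[-1]
            for k in range(A.shape[0]):
                candidates.append((score + np.log(A[j, k]) + np.log(B[k, x[t]]), path + [k]))
        beam = sorted(candidates, key=lambda c: c[0])[-beam_size:]   # prune to the top-k
    return max(beam, key=lambda c: c[0])   # (approximate best log score, path)
```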

Common pitfalls

  • Working in probability space instead of log space. Numerical underflow is guaranteed beyond even modest sequence lengths.
  • Forgetting to renormalize when doing forward-backward in float32. Some implementations renormalize at each step and accumulate the log-normalizer separately.
  • Confusing the per-timestep argmax of forward-backward with Viterbi. They are different: Viterbi gives the most likely full sequence; per-timestep argmax gives the sequence of most likely states, which can be infeasible (i.e., have zero probability under the model).
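
A quick way to see the last pitfall is to compare the two decodes side by side. This assumes the posteriors and viterbi sketches above plus the toy pi, A, B, x arrays from the setup sketch.

```python
import numpy as np

gamma = posteriors(pi, A, B, x)
pointwise = gamma.argmax(axis=1)   # sequence of per-timestep most likely states
best_path = viterbi(pi, A, B, x)   # single most likely full sequence
print(pointwise, best_path)        # these can differ; the pointwise sequence may even
                                   # use a transition with A[j, k] == 0
```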