Forward-backward and Viterbi: dynamic programming on chains

Sum and max over exponentially many paths in linear time. Forward-backward computes posteriors over hidden states; Viterbi finds the most likely state sequence. The same idea, two semirings.

Reviewed · 3 min read

One-line definition

For a hidden Markov model with $T$ timesteps and $N$ states, forward-backward computes the posterior $p(z_t = k \mid x_{1:T})$ for every $(t, k)$ in $O(TN^2)$ time. Viterbi computes $\arg\max_{z_{1:T}} p(z_{1:T} \mid x_{1:T})$ in the same time. Brute force would be $O(N^T)$.

Why it matters

These are the canonical examples of dynamic programming on sequences. Speech recognition (HMM-based decoding), part-of-speech tagging, gene finding, conditional random fields for sequence labeling, and any structured-prediction task with a chain factor graph rely on one or both. The duality between sum (forward-backward) and max (Viterbi) is the textbook example of how the same DP works in two different semirings.

Even in modern deep-learning sequence models, Viterbi shows up: CTC decoding for speech, beam search as an approximation when the state space is too large, CRF layers on top of BERT for NER.

The setup

A hidden Markov model has:

  • States $z_t \in \{1, \dots, N\}$.
  • Observations $x_1, \dots, x_T$.
  • Initial distribution $\pi_k = p(z_1 = k)$.
  • Transitions $A_{jk} = p(z_t = k \mid z_{t-1} = j)$.
  • Emissions $B_k(x_t) = p(x_t \mid z_t = k)$.

Joint probability of a hidden path $z_{1:T}$ and observation sequence $x_{1:T}$:

$$p(z_{1:T}, x_{1:T}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^{T} A_{z_{t-1} z_t}\, B_{z_t}(x_t).$$

There are $N^T$ possible paths. Both algorithms factor the computation through a DP table.
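
As a concrete reference point, here is a minimal NumPy sketch of this setup. The three-state, four-symbol HMM and the random parameters are illustrative only, not anything prescribed by the text.

```python
# A toy HMM and the joint probability of one (path, observations) pair.
import numpy as np

N, T = 3, 5                              # hidden states, timesteps (illustrative sizes)
rng = np.random.default_rng(0)

pi = np.full(N, 1.0 / N)                 # initial distribution pi_k = p(z_1 = k)
A = rng.dirichlet(np.ones(N), size=N)    # transitions A[j, k] = p(z_t = k | z_{t-1} = j)
B = rng.dirichlet(np.ones(4), size=N)    # emissions over 4 symbols: B[k, x] = p(x | z_t = k)

x = rng.integers(0, 4, size=T)           # an observation sequence
z = rng.integers(0, N, size=T)           # one hidden path out of N**T possibilities

# Joint probability, following the factorization above.
p = pi[z[0]] * B[z[0], x[0]]
for t in range(1, T):
    p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
print(p)
```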

Forward algorithm

Define the forward variable

$$\alpha_t(k) = p(x_{1:t}, z_t = k).$$

Recurrence:

$$\alpha_1(k) = \pi_k B_k(x_1), \qquad \alpha_t(k) = B_k(x_t) \sum_{j=1}^{N} \alpha_{t-1}(j)\, A_{jk}.$$

The total likelihood is $p(x_{1:T}) = \sum_{k} \alpha_T(k)$. Computing the full table is $O(TN^2)$.
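
A sketch of the forward pass in plain probability space (the log-space version appears later), reusing the pi, A, B, x conventions from the setup sketch:

```python
import numpy as np

def forward(pi, A, B, x):
    """alpha[t, k] = p(x_{1..t}, z_t = k). O(T N^2) time."""
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        # sum over the previous state j of alpha[t-1, j] * A[j, k], then multiply by the emission
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    return alpha

# Total likelihood of the observations: sum over the last row of the table.
# likelihood = forward(pi, A, B, x)[-1].sum()
```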

Backward algorithm

The mirror image:

$$\beta_t(j) = p(x_{t+1:T} \mid z_t = j),$$

with $\beta_T(j) = 1$ and

$$\beta_t(j) = \sum_{k=1}^{N} A_{jk}\, B_k(x_{t+1})\, \beta_{t+1}(k).$$
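
The matching backward pass, under the same array conventions as the forward sketch:

```python
import numpy as np

def backward(A, B, x):
    """beta[t, j] = p(x_{t+1..T} | z_t = j), with the last row fixed to 1."""
    T, N = len(x), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # sum over the next state k of A[j, k] * B[k, x_{t+1}] * beta[t+1, k]
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```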

Posterior over states (forward-backward)

$$\gamma_t(k) \equiv p(z_t = k \mid x_{1:T}) = \frac{\alpha_t(k)\,\beta_t(k)}{\sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j)}.$$

This is the per-timestep posterior used in EM training of HMMs and in any system that needs marginal beliefs over hidden states.
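
Combining the two tables into the per-timestep posterior is one line of array arithmetic. This sketch assumes the forward and backward functions from the sketches above:

```python
import numpy as np

def posteriors(pi, A, B, x):
    # gamma[t, k] proportional to alpha[t, k] * beta[t, k]
    alpha, beta = forward(pi, A, B, x), backward(A, B, x)
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize each row to p(z_t = . | x_{1..T})
```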

Viterbi: the max version

Replace the sum in the forward recurrence with a max:

$$\delta_t(k) = B_k(x_t) \max_{j} \delta_{t-1}(j)\, A_{jk},$$

with $\delta_1(k) = \pi_k B_k(x_1)$. Track the argmax to reconstruct the path:

$$\psi_t(k) = \arg\max_{j} \delta_{t-1}(j)\, A_{jk}.$$

After the forward pass, the most likely path is recovered by backtracking from $z_T^{*} = \arg\max_k \delta_T(k)$ through $z_t^{*} = \psi_{t+1}(z_{t+1}^{*})$. Same $O(TN^2)$ cost.
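
A sketch of the max-product recursion with backtracking, again under the same pi, A, B, x conventions:

```python
import numpy as np

def viterbi(pi, A, B, x):
    T, N = len(x), len(pi)
    delta = np.zeros((T, N))             # delta[t, k] = score of the best path ending in state k at time t
    psi = np.zeros((T, N), dtype=int)    # psi[t, k] = argmax predecessor state
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[j, k] = delta[t-1, j] * A[j, k]
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, x[t]] * scores.max(axis=0)
    # Backtrack from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```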

Sum vs max: the semiring view

Both algorithms have the same shape; only the operations differ:

Algorithm | "Add"  | "Multiply"
Forward   | $+$    | $\times$
Viterbi   | $\max$ | $\times$ (or $+$ in log space)

Both work because the operations form a semiring (associativity, distributivity). The same DP framework computes max-marginals (Viterbi), sum-marginals (forward-backward), counts (probability of inputs), expectations (segment-level expected counts), and gradients of any of the above.
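
One way to make the semiring view concrete is to write the chain recursion once and pass in the "add" operation. This toy sketch does exactly that; the name chain_dp is illustrative, not a standard API.

```python
import numpy as np

def chain_dp(pi, A, B, x, add):
    """The forward-style chain recursion with a pluggable 'add' (np.sum or np.max)."""
    T = len(x)
    v = pi * B[:, x[0]]
    for t in range(1, T):
        v = B[:, x[t]] * add(v[:, None] * A, axis=0)   # combine over the previous state
    return add(v, axis=0)

# chain_dp(pi, A, B, x, np.sum) -> total likelihood p(x_{1..T})        (forward)
# chain_dp(pi, A, B, x, np.max) -> probability of the single best path (Viterbi score)
```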

In log space (always)

Multiplying many small probabilities underflows in float32. Always work in log space:

$$\log \alpha_t(k) = \log B_k(x_t) + \operatorname{logsumexp}_{j}\big(\log \alpha_{t-1}(j) + \log A_{jk}\big).$$

The logsumexp trick (subtract the max before exponentiating) keeps everything stable.
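
A stable log-space forward pass might look like the sketch below. It uses SciPy's logsumexp for convenience; a hand-rolled max-subtraction version works just as well.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(pi, A, B, x):
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(x), len(pi)
    log_alpha = np.zeros((T, N))
    log_alpha[0] = log_pi + log_B[:, x[0]]
    for t in range(1, T):
        # logsumexp over the previous state j of log_alpha[t-1, j] + log_A[j, k]
        log_alpha[t] = log_B[:, x[t]] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    return log_alpha   # log p(x_{1..T}) = logsumexp(log_alpha[-1])
```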

Modern uses

  • CRF decoding for NER: BERT produces per-token logits; a CRF layer with a learned transition matrix runs Viterbi at inference and forward-backward at training.
  • CTC decoding for speech: a sum-product algorithm over alignments. Different state structure but the same DP machinery.
  • Beam search as approximate Viterbi: when the state space $N$ is too large for full DP (e.g. autoregressive language models with vocab 100k), beam search keeps only the top-$k$ partial paths at each step (a toy sketch follows this list).
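
A toy sketch of beam search on the same HMM, keeping only the top-$k$ partial hypotheses at each step. The beam_size and the exhaustive one-step expansion are illustrative, not a production decoder.

```python
import numpy as np

def beam_search(pi, A, B, x, beam_size=2):
    T = len(x)
    # Each hypothesis is (log score, path); seed the beam with the best first states.
    scores = np.log(pi) + np.log(B[:, x[0]])
    beam = [(scores[k], [k]) for k in np.argsort(scores)[-beam_size:]]
    for t in range(1, T):
        candidates = []
        for score, path in beam:
            j = path[-1]
            for k in range(A.shape[0]):
                candidates.append((score + np.log(A[j, k]) + np.log(B[k, x[t]]), path + [k]))
        beam = sorted(candidates, key=lambda c: c[0])[-beam_size:]   # prune to the top-k
    return max(beam, key=lambda c: c[0])   # (approximate best log score, path)
```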

Common pitfalls

  • Working in probability space instead of log space. Numerical underflow is guaranteed beyond even modest sequence lengths.
  • Forgetting to renormalize when doing forward-backward in float32. Some implementations renormalize at each step and accumulate the log-normalizer separately.
  • Confusing the per-timestep argmax of forward-backward with Viterbi. They are different: Viterbi gives the most likely full sequence; per-timestep argmax gives the sequence of most likely states, which can be infeasible (i.e., have zero probability under the model).
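
A quick way to see the last pitfall is to compare the two decodes side by side. This assumes the posteriors and viterbi sketches above plus the toy pi, A, B, x arrays from the setup sketch.

```python
import numpy as np

gamma = posteriors(pi, A, B, x)
pointwise = gamma.argmax(axis=1)   # sequence of per-timestep most likely states
best_path = viterbi(pi, A, B, x)   # single most likely full sequence
print(pointwise, best_path)        # these can differ; the pointwise sequence may even
                                   # use a transition with A[j, k] == 0
```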