One-line definition
A CRF is a discriminative, undirected graphical model that defines over a structured output (e.g. a label sequence), scoring entire labelings jointly through feature functions over cliques — most commonly a linear-chain CRF that couples adjacent labels.
Why it matters
CRFs are the classic answer to structured prediction: when your outputs are interdependent (the label of token depends on token ), independent per-token softmax classification is wrong because it can produce illegal or incoherent label sequences (e.g. an I-PER tag right after an O tag in BIO tagging). A CRF layer fixes this by modeling transitions.
They remain interview-relevant because:
- A linear-chain CRF on top of a BiLSTM or transformer encoder was the standard NER / POS / chunking architecture and still appears in production sequence taggers.
- CRF vs HMM is the cleanest way to show you understand generative vs discriminative modeling of sequences.
- The training objective is a clean example of a globally-normalized log-likelihood with a forward-algorithm partition function.
The model
A linear-chain CRF scores a full label sequence given input :
where is the emission / unary score (how well label fits position — often the logits from a neural encoder), is a learned transition score between adjacent labels, and
is the partition function: a sum over all possible labelings. Crucially couples the whole sequence — this is global normalization, the key difference from per-token softmax (which normalizes locally and independently).
Training and inference
The partition function looks intractable ( terms) but factorizes over the chain:
- Training: maximize . The gradient needs and the marginals, both computed by the forward algorithm (the same dynamic program as HMM forward-backward) in .
- Decoding: find with the Viterbi algorithm, also .
So a CRF reuses exactly the forward-backward and Viterbi machinery — but on a discriminatively trained, globally normalized model.
CRF vs HMM vs softmax tagger
| Models | Normalization | Features | |
|---|---|---|---|
| HMM | generative | local (per emission/transition) | tied to generative story |
| MEMM | , per-step | local (per step) → label bias | rich, but biased |
| Linear-chain CRF | global (one per sequence) | rich, no label bias | |
| Independent softmax | local, independent | no transition modeling |
The CRF’s global normalization is what cures the label-bias problem of MEMMs (locally normalized models that can’t redistribute probability mass once committed at a step).
The neural CRF (BiLSTM-CRF / Transformer-CRF)
In modern systems the encoder (BiLSTM or transformer) produces the emission scores , and a small learned transition matrix sits on top. The whole stack is trained end-to-end with the CRF negative log-likelihood. The encoder captures rich context; the CRF enforces valid, coherent label transitions. This combination reliably beats a softmax-per-token head on tasks with strong output structure (NER, slot filling).
What an interviewer expects you to say
- State that a CRF models over the whole sequence, with emission + transition scores and a global partition function .
- Explain why it beats independent softmax: it models label dependencies / transitions and avoids illegal sequences.
- Know that training uses the forward algorithm (for ) and decoding uses Viterbi, both .
- Place it on the generative-vs-discriminative map (HMM is the generative cousin) and mention the label-bias problem CRFs fix relative to MEMMs.
- Bonus: the BiLSTM-CRF / encoder-CRF pattern — neural encoder for emissions, CRF layer for structure.
Common confusions
- “CRF = HMM.” HMM is generative and locally normalized; CRF is discriminative and globally normalized. CRFs can use arbitrary, overlapping input features.
- “You need a CRF whenever you tag sequences.” Only when output structure matters. With a strong contextual encoder (large transformer), the marginal gain of a CRF head shrinks because the encoder already captures most dependencies — but it still helps enforce hard constraints.
- “The partition function is intractable.” For a chain it’s an forward pass. It’s only intractable for general (loopy) graph structures.
- “CRFs are obsolete.” The CRF layer is still a standard, cheap way to enforce coherent label sequences on top of any encoder.
Related: Forward-backward and Viterbi, Hidden Markov models, Belief propagation, Graphical models.