Skip to content
mentorship

concepts

Conditional random fields (CRFs)

The discriminative model for structured prediction. A CRF directly models p(labels | input) over a whole sequence, scoring label transitions jointly instead of classifying each token independently — the reason a linear-chain CRF sits on top of so many taggers.

Reviewed · 4 min read

One-line definition

A CRF is a discriminative, undirected graphical model that defines over a structured output (e.g. a label sequence), scoring entire labelings jointly through feature functions over cliques — most commonly a linear-chain CRF that couples adjacent labels.

Why it matters

CRFs are the classic answer to structured prediction: when your outputs are interdependent (the label of token depends on token ), independent per-token softmax classification is wrong because it can produce illegal or incoherent label sequences (e.g. an I-PER tag right after an O tag in BIO tagging). A CRF layer fixes this by modeling transitions.

They remain interview-relevant because:

  • A linear-chain CRF on top of a BiLSTM or transformer encoder was the standard NER / POS / chunking architecture and still appears in production sequence taggers.
  • CRF vs HMM is the cleanest way to show you understand generative vs discriminative modeling of sequences.
  • The training objective is a clean example of a globally-normalized log-likelihood with a forward-algorithm partition function.

The model

A linear-chain CRF scores a full label sequence given input :

where is the emission / unary score (how well label fits position — often the logits from a neural encoder), is a learned transition score between adjacent labels, and

is the partition function: a sum over all possible labelings. Crucially couples the whole sequence — this is global normalization, the key difference from per-token softmax (which normalizes locally and independently).

Training and inference

The partition function looks intractable ( terms) but factorizes over the chain:

  • Training: maximize . The gradient needs and the marginals, both computed by the forward algorithm (the same dynamic program as HMM forward-backward) in .
  • Decoding: find with the Viterbi algorithm, also .

So a CRF reuses exactly the forward-backward and Viterbi machinery — but on a discriminatively trained, globally normalized model.

CRF vs HMM vs softmax tagger

ModelsNormalizationFeatures
HMM generativelocal (per emission/transition)tied to generative story
MEMM, per-steplocal (per step) → label biasrich, but biased
Linear-chain CRFglobal (one per sequence)rich, no label bias
Independent softmaxlocal, independentno transition modeling

The CRF’s global normalization is what cures the label-bias problem of MEMMs (locally normalized models that can’t redistribute probability mass once committed at a step).

The neural CRF (BiLSTM-CRF / Transformer-CRF)

In modern systems the encoder (BiLSTM or transformer) produces the emission scores , and a small learned transition matrix sits on top. The whole stack is trained end-to-end with the CRF negative log-likelihood. The encoder captures rich context; the CRF enforces valid, coherent label transitions. This combination reliably beats a softmax-per-token head on tasks with strong output structure (NER, slot filling).

What an interviewer expects you to say

  1. State that a CRF models over the whole sequence, with emission + transition scores and a global partition function .
  2. Explain why it beats independent softmax: it models label dependencies / transitions and avoids illegal sequences.
  3. Know that training uses the forward algorithm (for ) and decoding uses Viterbi, both .
  4. Place it on the generative-vs-discriminative map (HMM is the generative cousin) and mention the label-bias problem CRFs fix relative to MEMMs.
  5. Bonus: the BiLSTM-CRF / encoder-CRF pattern — neural encoder for emissions, CRF layer for structure.

Common confusions

  • “CRF = HMM.” HMM is generative and locally normalized; CRF is discriminative and globally normalized. CRFs can use arbitrary, overlapping input features.
  • “You need a CRF whenever you tag sequences.” Only when output structure matters. With a strong contextual encoder (large transformer), the marginal gain of a CRF head shrinks because the encoder already captures most dependencies — but it still helps enforce hard constraints.
  • “The partition function is intractable.” For a chain it’s an forward pass. It’s only intractable for general (loopy) graph structures.
  • “CRFs are obsolete.” The CRF layer is still a standard, cheap way to enforce coherent label sequences on top of any encoder.

Related: Forward-backward and Viterbi, Hidden Markov models, Belief propagation, Graphical models.