Conditional random fields (CRFs)

One-line definition

A CRF is a discriminative, undirected graphical model that defines $p (y ∣ x)$ over a structured output $y$ (e.g. a label sequence), scoring entire labelings jointly through feature functions over cliques — most commonly a linear-chain CRF that couples adjacent labels.

Why it matters

CRFs are the classic answer to structured prediction: when your outputs are interdependent (the label of token $t$ depends on token $t - 1$ ), independent per-token softmax classification is wrong because it can produce illegal or incoherent label sequences (e.g. an I-PER tag right after an O tag in BIO tagging). A CRF layer fixes this by modeling transitions.

They remain interview-relevant because:

A linear-chain CRF on top of a BiLSTM or transformer encoder was the standard NER / POS / chunking architecture and still appears in production sequence taggers.
CRF vs HMM is the cleanest way to show you understand generative vs discriminative modeling of sequences.
The training objective is a clean example of a globally-normalized log-likelihood with a forward-algorithm partition function.

The model

A linear-chain CRF scores a full label sequence $y = (y_{1}, \dots, y_{T})$ given input $x$ :

p (y ∣ x) = \frac{1}{Z ( x )} exp (t \sum ψ_{t} (y_{t}, x) + t \sum A (y_{t - 1}, y_{t})),

where $ψ_{t}$ is the emission / unary score (how well label $y_{t}$ fits position $t$ — often the logits from a neural encoder), $A (y_{t - 1}, y_{t})$ is a learned transition score between adjacent labels, and

Z (x) = y^{'} \sum exp (\dots)

is the partition function: a sum over all $∣ V ∣^{T}$ possible labelings. Crucially $Z$ couples the whole sequence — this is global normalization, the key difference from per-token softmax (which normalizes locally and independently).

Training and inference

The partition function looks intractable ( $∣ V ∣^{T}$ terms) but factorizes over the chain:

Training: maximize $lo g p (y ∣ x)$ . The gradient needs $Z (x)$ and the marginals, both computed by the forward algorithm (the same dynamic program as HMM forward-backward) in $O (T ∣ V ∣^{2})$ .
Decoding: find $ar g max_{y} p (y ∣ x)$ with the Viterbi algorithm, also $O (T ∣ V ∣^{2})$ .

So a CRF reuses exactly the forward-backward and Viterbi machinery — but on a discriminatively trained, globally normalized model.

CRF vs HMM vs softmax tagger

	Models	Normalization	Features
HMM	$p (x, y)$ generative	local (per emission/transition)	tied to generative story
MEMM	$p (y ∣ x)$ , per-step	local (per step) → label bias	rich, but biased
Linear-chain CRF	$p (y ∣ x)$	global (one $Z$ per sequence)	rich, no label bias
Independent softmax	$\prod_{t} p (y_{t} ∣ x)$	local, independent	no transition modeling

The CRF’s global normalization is what cures the label-bias problem of MEMMs (locally normalized models that can’t redistribute probability mass once committed at a step).

The neural CRF (BiLSTM-CRF / Transformer-CRF)

In modern systems the encoder (BiLSTM or transformer) produces the emission scores $ψ_{t}$ , and a small learned transition matrix $A$ sits on top. The whole stack is trained end-to-end with the CRF negative log-likelihood. The encoder captures rich context; the CRF enforces valid, coherent label transitions. This combination reliably beats a softmax-per-token head on tasks with strong output structure (NER, slot filling).

What an interviewer expects you to say

State that a CRF models $p (y ∣ x)$ over the whole sequence, with emission + transition scores and a global partition function $Z$ .
Explain why it beats independent softmax: it models label dependencies / transitions and avoids illegal sequences.
Know that training uses the forward algorithm (for $Z$ ) and decoding uses Viterbi, both $O (T ∣ V ∣^{2})$ .
Place it on the generative-vs-discriminative map (HMM is the generative cousin) and mention the label-bias problem CRFs fix relative to MEMMs.
Bonus: the BiLSTM-CRF / encoder-CRF pattern — neural encoder for emissions, CRF layer for structure.

Common confusions

“CRF = HMM.” HMM is generative and locally normalized; CRF is discriminative and globally normalized. CRFs can use arbitrary, overlapping input features.
“You need a CRF whenever you tag sequences.” Only when output structure matters. With a strong contextual encoder (large transformer), the marginal gain of a CRF head shrinks because the encoder already captures most dependencies — but it still helps enforce hard constraints.
“The partition function is intractable.” For a chain it’s an $O (T ∣ V ∣^{2})$ forward pass. It’s only intractable for general (loopy) graph structures.
“CRFs are obsolete.” The CRF layer is still a standard, cheap way to enforce coherent label sequences on top of any encoder.