Connectionist Temporal Classification (CTC)

One-line definition

CTC is a loss function that trains a frame-level classifier to output a shorter label sequence without requiring a frame-to-label alignment, by introducing a special blank symbol and summing the probability of all alignments that collapse to the target.

Why it matters

CTC is the canonical interview topic for speech recognition, handwriting recognition, and any other monotonic, unaligned sequence-to-sequence task. It answers the question every ASR interviewer eventually asks: “You have 1000 audio frames and a 5-word transcript. How do you train without per-frame labels?”

CTC matters because:

It removes the need for a separate alignment model (the old HMM-GMM pipeline forced-aligned audio to phones first).
It is the foundation that RNN-T and many streaming ASR systems build on or contrast against.
The forward-backward dynamic program is the same idea as the HMM forward-backward algorithm — a clean way to show you understand marginalization over latent structure.

The setup

The network emits, for each of $T$ input frames, a probability distribution over the vocabulary $V$ plus a blank token $\emptyset$ :

$$y_{t} \in Δ^{|V| + 1}, \quad t = 1, \dots, T.$$

A path (or alignment) $π$ is one label per frame, e.g. for target CAT:

C C ∅ A ∅ T T   →  collapse  →  CAT
∅ C A A ∅ T ∅   →  collapse  →  CAT

The collapse function $B$ does two things, in order:

Merge consecutive repeated labels.
Remove all blanks.

The blank is what lets the model emit the same letter twice (L L in HELLO): insert a blank between them, L ∅ L, and the merge step won’t collapse them.
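The collapse rule is short enough to write out. The following is a minimal sketch, assuming the blank is represented by the character `-` (a stand-in for $\emptyset$) and that a path is a plain Python sequence; the name `collapse` is illustrative.

```python
from itertools import groupby

BLANK = "-"  # assumption: "-" stands in for the blank symbol

def collapse(path, blank=BLANK):
    """Apply B: merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(path)]      # step 1: merge repeats
    return [label for label in merged if label != blank]  # step 2: drop blanks

print("".join(collapse("CC-A-TT")))  # CAT
print("".join(collapse("-CAA-T-")))  # CAT
print("".join(collapse("HEL-LO")))   # HELLO (the blank keeps the two Ls apart)
print("".join(collapse("HELLO")))    # HELO  (without a blank, the Ls merge)
```

The order of the two steps matters: removing blanks first would turn `L-L` into `LL`, which the merge step would then reduce to a single `L`.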

The loss

The probability of a target sequence $l$ is the sum over every path that collapses to it:

$$p(l \mid X) = \sum_{π \in B^{-1}(l)} \prod_{t=1}^{T} y_{t}^{π_{t}}.$$

The CTC loss is $-\log p(l \mid X)$. The number of valid paths grows exponentially in $T$, so the sum is computed with a forward-backward dynamic program over an augmented label sequence $l'$ (the target with a blank inserted before, after, and between every label), which has length $2|l| + 1$.
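To make the sum over paths concrete, here is a brute-force sketch that enumerates every path for a tiny input. It assumes `y` is a NumPy array of shape `(T, K)` holding per-frame probabilities, with the blank at index 0; the function names are illustrative. It is only feasible for very small $T$, which is exactly why the dynamic program is needed.

```python
from itertools import groupby, product

import numpy as np

def collapse(path, blank=0):
    """B: merge consecutive repeats, then remove blanks."""
    merged = [k for k, _ in groupby(path)]
    return tuple(k for k in merged if k != blank)

def brute_force_prob(y, target, blank=0):
    """Sum prod_t y[t, pi_t] over every path pi that collapses to target."""
    T, K = y.shape
    total = 0.0
    for path in product(range(K), repeat=T):  # K**T paths: exponential in T
        if collapse(path, blank) == tuple(target):
            total += np.prod(y[np.arange(T), list(path)])
    return total

# Toy input (assumption): 3 frames, uniform over {blank, a, b}.
y = np.full((3, 3), 1.0 / 3.0)
p = brute_force_prob(y, target=[1])  # target "a"
print(p, -np.log(p))  # p is 6/27 (about 0.222), loss about 1.504
```

Six of the 27 paths collapse to "a" here (a∅∅, ∅a∅, ∅∅a, aa∅, ∅aa, aaa), each with probability 1/27. Note that a∅a is not among them, because it collapses to "aa".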

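Define $α_{t}(s)$ as the total probability of all length-$t$ path prefixes that emit symbol $l'_{s}$ at frame $t$ and collapse to the corresponding prefix of $l$:

$$α_{t}(s) = \left(α_{t-1}(s) + α_{t-1}(s-1) + α_{t-1}(s-2)\right) y_{t}^{l'_{s}},$$

where the $s - 2$ term is included only when $l'_{s}$ is a non-blank label and $l'_{s} \neq l'_{s-2}$, that is, when moving between two distinct non-blank labels (it skips the blank between them). The recursion starts from $α_{1}(1) = y_{1}^{\emptyset}$ and $α_{1}(2) = y_{1}^{l'_{2}}$, with $α_{1}(s) = 0$ for all other $s$. The total $p(l \mid X)$ is $α_{T}$ summed over the final two states, $α_{T}(|l'|) + α_{T}(|l'| - 1)$. Gradients flow through this DP via the backward variables $β$, giving an exact gradient in $O(T \cdot |l|)$ time.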
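The forward recursion can be sketched in a few lines of NumPy. This is a minimal illustration, assuming `y` has shape `(T, K)` with the blank at index 0 and that `target` is a list of non-blank label ids; it uses 0-based indices and works in plain probabilities for readability, whereas practical implementations work in log space to avoid underflow on long inputs.

```python
import numpy as np

def ctc_forward(y, target, blank=0):
    """Return p(target | X) via the alpha recursion over the augmented sequence."""
    T = y.shape[0]
    ext = [blank]                      # augmented sequence l': blank, l1, blank, l2, ...
    for label in target:
        ext += [label, blank]
    S = len(ext)                       # S = 2 * len(target) + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]         # start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]     # ... or with the first label

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                    # stay in the same state
            if s >= 1:
                total += alpha[t - 1, s - 1]           # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]           # skip the blank between distinct labels
            alpha[t, s] = total * y[t, ext[s]]

    p = alpha[T - 1, S - 1]            # end on the final blank ...
    if S > 1:
        p += alpha[T - 1, S - 2]       # ... or on the final label
    return p

# Toy input (assumption): 3 frames, uniform over {blank, a, b}.
y = np.full((3, 3), 1.0 / 3.0)
print(ctc_forward(y, [1]))     # about 0.222 (6/27)
print(ctc_forward(y, [1, 2]))  # about 0.185 (5/27)
print(ctc_forward(y, [1, 1]))  # about 0.037 (1/27): only the path a, blank, a
```

The last case shows the skip restriction at work: for the target "aa" the $s - 2$ transition is disallowed, so the only valid 3-frame path must pass through the blank between the two labels.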

The conditional-independence assumption

CTC factorizes $p(l \mid X) = \sum_{π \in B^{-1}(l)} \prod_{t} y_{t}^{π_{t}}$: each frame’s output depends only on $X$, not on previously emitted labels. There is no internal language model. This is CTC’s defining limitation and the main reason RNN-T exists.

Practically, CTC ASR systems are decoded with an external language model (shallow fusion / beam search with a KenLM or neural LM) to recover the linguistic dependencies CTC ignores.
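The score combination behind shallow fusion can be sketched as follows. This is a simplified illustration that re-ranks complete hypotheses rather than fusing inside the beam; the candidate strings, their probabilities, the weight `lm_weight` (the $λ$ of shallow fusion), and the function name are all made up for the example and do not come from any real model.

```python
import math

def fused_score(log_p_ctc, log_p_lm, lm_weight=0.5):
    """Shallow fusion score: log p_CTC(l | X) + lambda * log p_LM(l)."""
    return log_p_ctc + lm_weight * log_p_lm

# Hypothetical scores for two acoustically similar candidates.
candidates = {
    "recognize speech":   {"ctc": math.log(0.30), "lm": math.log(1e-3)},
    "wreck a nice beach": {"ctc": math.log(0.35), "lm": math.log(1e-6)},
}

best_acoustic = max(candidates, key=lambda h: candidates[h]["ctc"])
best_fused = max(
    candidates,
    key=lambda h: fused_score(candidates[h]["ctc"], candidates[h]["lm"]),
)
print(best_acoustic)  # wreck a nice beach
print(best_fused)     # recognize speech
```

In this toy case the CTC score alone slightly prefers the implausible transcript, and the LM term supplies the linguistic preference that the frame-independent acoustic score cannot express.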

Decoding

Method	What it does	When
Greedy / best-path	argmax per frame, then collapse	Fast, approximate; no LM
Prefix beam search	Beam over collapsed prefixes, merging paths	Standard with an external LM
CTC + LM (shallow fusion)	Add $λ \log p_{LM}$ to the hypothesis score during beam search	Production ASR

Greedy decoding does not, in general, return the argmax over label sequences (best path ≠ best labeling), because the probability of a labeling is spread over the many paths that collapse to the same string. Beam search approximates the true argmax.
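A two-frame counterexample makes the gap concrete. The sketch below assumes a vocabulary of just the blank (index 0) and one label "a" (index 1), with made-up frame probabilities. `greedy_decode` and `prefix_beam_search` are illustrative names, and the beam search is a simplified, LM-free version that works in plain probabilities rather than log-probabilities.

```python
from collections import defaultdict
from itertools import groupby

import numpy as np

def greedy_decode(y, blank=0):
    """Best path: argmax per frame, then collapse."""
    best_path = y.argmax(axis=1).tolist()
    merged = [k for k, _ in groupby(best_path)]
    return tuple(k for k in merged if k != blank)

def prefix_beam_search(y, beam_size=4, blank=0):
    """Beam over collapsed prefixes; paths with the same prefix are merged."""
    T, K = y.shape
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in non-blank)
    beam = {(): (1.0, 0.0)}
    for t in range(T):
        next_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for k in range(K):
                p = y[t, k]
                if k == blank:
                    next_beam[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == k:
                    next_beam[prefix + (k,)][1] += p_b * p   # repeat after a blank: new label
                    next_beam[prefix][1] += p_nb * p         # repeat without blank: merged
                else:
                    next_beam[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beam.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = {prefix: tuple(probs) for prefix, probs in ranked[:beam_size]}
    best_prefix, best_probs = max(beam.items(), key=lambda kv: sum(kv[1]))
    return best_prefix, sum(best_probs)

# Made-up probabilities: blank is the most likely symbol in both frames.
y = np.array([[0.6, 0.4],
              [0.6, 0.4]])
print(greedy_decode(y))        # ()  -> empty transcript, path probability 0.36
print(prefix_beam_search(y))   # labeling (1,) with probability about 0.64
```

Greedy picks the single best path (blank, blank) with probability 0.36 and returns the empty transcript. The labeling "a" is more likely overall, because its three paths (a a, a ∅, ∅ a) sum to 0.16 + 0.24 + 0.24 = 0.64, and the beam search recovers it by accumulating probability per collapsed prefix.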

What an interviewer expects you to say

Frame the problem: unknown alignment between $T$ frames and a shorter label sequence.
Introduce the blank symbol and the collapse rule (merge repeats, then drop blanks).
State that the loss marginalizes over all alignments via forward-backward DP — exact, not sampled.
Name the conditional-independence-across-frames assumption and its consequence: CTC has no built-in LM, so you fuse an external one.
Bonus: contrast with attention-based seq2seq (no monotonicity assumption, but harder to stream) and RNN-T (adds a label-dependent prediction network).

Common confusions

“The blank means silence.” No. Blank means “emit nothing / no label transition here.” Silence is just a region the acoustic model maps to blanks, but blank is a structural token, not a phoneme.
“Greedy decoding gives the most likely transcript.” It gives the most likely path, which after collapsing may not be the most likely transcript.
“CTC needs aligned data.” The whole point is that it doesn’t — it learns the alignment implicitly.
“CTC models language.” It doesn’t; it is conditionally independent across frames. Linguistic structure comes from an external LM at decode time.
“CTC only works for speech.” It works for any monotonic alignment task: handwriting recognition, OCR, lip reading, keyword spotting.