Connectionist Temporal Classification (CTC)

How you train a sequence model to map audio (or pixels) to text without knowing the alignment. CTC marginalizes over every possible alignment with a blank symbol and a forward-backward sum.

One-line definition

CTC is a loss function that trains a frame-level classifier to output a shorter label sequence without requiring a frame-to-label alignment, by introducing a special blank symbol and summing the probability of all alignments that collapse to the target.

Why it matters

CTC is the canonical interview topic for speech, handwriting, and any other monotonic, unaligned sequence-to-sequence task. It answers the question every ASR interviewer eventually asks: “You have 1000 audio frames and a 5-word transcript. How do you train without per-frame labels?”

CTC matters because:

  • It removes the need for a separate alignment model (the old HMM-GMM pipeline forced-aligned audio to phones first).
  • It is the foundation that RNN-T and many streaming ASR systems build on or contrast against.
  • The forward-backward dynamic program is the same idea as the HMM forward-backward algorithm — a clean way to show you understand marginalization over latent structure.

The setup

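The network emits, for each of $T$ input frames, a probability distribution over the vocabulary $V$ plus a blank token $\varnothing$:

$$p_t(k \mid x), \qquad k \in V \cup \{\varnothing\}, \quad t = 1, \dots, T, \qquad \sum_{k} p_t(k \mid x) = 1$$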

A path (or alignment) is one label per frame, e.g. for target CAT:

C C ∅ A ∅ T T   →  collapse  →  CAT
∅ C A A ∅ T ∅   →  collapse  →  CAT

The collapse function does two things, in order:

  1. Merge consecutive repeated labels.
  2. Remove all blanks.

The blank is what lets the model emit the same letter twice (L L in HELLO): insert a blank between them, L ∅ L, and the merge step won’t collapse them.
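
The two-step collapse rule can be sketched in a few lines of Python. This is a minimal illustration, not a library function: the name `collapse` and the use of `-` as a stand-in for the blank symbol ∅ are assumptions made for the example.

```python
from itertools import groupby

BLANK = "-"  # assumed stand-in for the blank symbol written as ∅ in the text

def collapse(path):
    """Map a frame-level path to a label string: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]   # step 1: merge consecutive repeats
    return "".join(s for s in merged if s != BLANK)    # step 2: remove all blanks

print(collapse(list("CC-A-TT")))   # CAT
print(collapse(list("-CAA-T-")))   # CAT
print(collapse(list("HEL-LO")))    # HELLO (the blank keeps the two Ls apart)
print(collapse(list("HELLO")))     # HELO  (without a blank the repeated L merges)
```

The order of the steps matters: removing blanks first would turn `L - L` into `L L`, which the merge step would then reduce to a single `L`.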

The loss

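The probability of a target sequence $y$ is the sum over every path $\pi$ that collapses to it:

$$p(y \mid x) = \sum_{\pi \,:\, \mathcal{B}(\pi) = y} \; \prod_{t=1}^{T} p_t(\pi_t \mid x)$$

where $\mathcal{B}$ is the collapse function and $p_t(\pi_t \mid x)$ is the network's probability for symbol $\pi_t$ at frame $t$.

The CTC loss is $-\log p(y \mid x)$. The number of valid paths is exponential in $T$, so the sum is computed with a forward-backward dynamic program over an augmented label sequence $y'$ (the target with a blank inserted before, after, and between every label).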
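
To make the sum over paths concrete, here is a brute-force sketch that enumerates every path and keeps those that collapse to the target. It is only feasible for toy sizes, which is the point of the dynamic program. The function names, the index-0 blank convention, and the per-frame probabilities are all assumptions invented for this example.

```python
import math
from itertools import groupby, product

def collapse(path, blank=0):
    """Merge consecutive repeats, then drop blanks."""
    merged = [k for k, _ in groupby(path)]
    return tuple(k for k in merged if k != blank)

def ctc_prob_bruteforce(probs, target, blank=0):
    """Sum the probability of every length-T path that collapses to target.

    probs[t][k] is the probability of symbol k at frame t (assumed layout).
    """
    T, K = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(K), repeat=T):        # K**T paths: exponential in T
        if collapse(path, blank) == tuple(target):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total

# Made-up distributions over (blank, 'a', 'b') for T = 3 frames.
probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
p = ctc_prob_bruteforce(probs, target=[1, 2])   # target "ab"
loss = -math.log(p)
print(p, loss)   # p is approximately 0.33
```

Five of the 27 paths collapse to "ab" here (`aab`, `abb`, `a∅b`, `∅ab`, `ab∅`), and their probabilities add up to 0.33.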

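Define $\alpha_t(s)$ = total probability of all paths ending in symbol $y'_s$ (the $s$-th symbol of the augmented sequence $y'$) at frame $t$:

$$\alpha_t(s) = \big[\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big] \; p_t(y'_s \mid x)$$

where $p_t(k \mid x)$ is the network's probability for symbol $k$ at frame $t$, and the $\alpha_{t-1}(s-2)$ term is only allowed when moving between two distinct non-blank labels (it skips a blank). The total $p(y \mid x) = \alpha_T(|y'|) + \alpha_T(|y'| - 1)$ is summed over the final two states. Gradients flow through this DP via the backward variables $\beta_t(s)$, giving an exact gradient.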
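
The forward pass can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `ctc_forward`, the `probs[t][k]` layout, and the index-0 blank are hypothetical choices for the example, and the code works with raw probabilities for readability, whereas practical implementations work in log space to avoid underflow on long inputs.

```python
def ctc_forward(probs, target, blank=0):
    """Return p(target | x) via the CTC forward recursion.

    probs[t][k]: probability of symbol k at frame t (assumed layout).
    target: list of non-blank symbol indices.
    """
    # Augmented sequence: a blank before, after, and between every label.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of partial paths that end in ext[s] at the current frame.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]          # start in the leading blank ...
    if S > 1:
        alpha[1] = probs[0][ext[1]]      # ... or in the first label

    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]                          # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]                 # advance by one position
            # Skip a blank only between two distinct non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    # Valid paths end in the last label or in the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Made-up distributions over (blank, 'a', 'b') for T = 3 frames.
probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
print(ctc_forward(probs, [1, 2]))   # approximately 0.33 for target "ab"
print(ctc_forward(probs, [1, 1]))   # approximately 0.03: only the path a, blank, a is valid
```

The cost is proportional to the number of frames times the length of the augmented sequence, instead of exponential in the number of frames. The second call shows why the skip is disallowed between identical labels: "aa" needs a blank between the two `a` frames, so no path may jump over it.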

The conditional-independence assumption

CTC factorizes $p(\pi \mid x) = \prod_{t=1}^{T} p_t(\pi_t \mid x)$: each frame’s output depends only on the input $x$, not on previously emitted labels. There is no internal language model. This is CTC’s defining limitation and the main reason RNN-T exists.

Practically, CTC ASR systems are decoded with an external language model (shallow fusion / beam search with a KenLM or neural LM) to recover the linguistic dependencies CTC ignores.
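
As a rough sketch of what shallow fusion does to the ranking, the snippet below adds a weighted LM log-probability to a CTC log-probability for two complete hypotheses. Everything here is illustrative: `fused_score`, the weight of 0.5, and all scores are made up, and a real decoder applies the same combination to partial prefixes inside beam search rather than to finished transcripts.

```python
import math

def fused_score(log_p_ctc, log_p_lm, lm_weight=0.5):
    """Shallow-fusion score: CTC log-probability plus a weighted LM log-probability."""
    return log_p_ctc + lm_weight * log_p_lm

# Hypothetical candidates with made-up scores: (transcript, log p_CTC, log p_LM).
candidates = [
    ("wreck a nice beach", math.log(0.40), math.log(0.001)),
    ("recognize speech",   math.log(0.35), math.log(0.050)),
]

acoustic_only = max(candidates, key=lambda c: fused_score(c[1], c[2], lm_weight=0.0))
with_lm = max(candidates, key=lambda c: fused_score(c[1], c[2], lm_weight=0.5))

print(acoustic_only[0])   # wreck a nice beach
print(with_lm[0])         # recognize speech
```

With the LM weight at zero the acoustically stronger hypothesis wins; with the LM term included, the hypothesis the language model finds more plausible overtakes it.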

Decoding

| Method | What it does | When |
| --- | --- | --- |
| Greedy / best-path | argmax per frame, then collapse | Fast, approximate; no LM |
| Prefix beam search | Beam over collapsed prefixes, merging paths | Standard with an external LM |
| CTC + LM (shallow fusion) | Add $\lambda \log p_{\text{LM}}(y)$ to the score during beam search | Production ASR |

Greedy decoding is not the argmax over label sequences (best path ≠ best labeling), because many paths can collapse to the same string. Beam search approximates the true argmax.
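
The gap between best path and best labeling shows up even with two frames. The sketch below compares greedy decoding with a basic prefix beam search (no LM) on made-up probabilities. The function names and the index-0 blank are assumptions for this example, and the beam search is a simplified version of the usual algorithm rather than a production decoder.

```python
from collections import defaultdict
from itertools import groupby

def greedy_decode(probs, blank=0):
    """Best path: argmax per frame, then collapse."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    merged = [k for k, _ in groupby(best_path)]
    return tuple(k for k in merged if k != blank)

def prefix_beam_search(probs, beam_width=4, blank=0):
    """Return (best_labeling, probability); labelings are tuples of symbol indices."""
    # For each prefix keep (p_blank, p_non_blank): the probability of all paths
    # that collapse to the prefix and end in a blank / a non-blank frame.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(frame):
                if k == blank:
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == k:
                    # A repeat merges into the same prefix unless a blank intervened.
                    next_beams[prefix][1] += p_nb * p
                    next_beams[prefix + (k,)][1] += p_b * p
                else:
                    next_beams[prefix + (k,)][1] += (p_b + p_nb) * p
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(ps) for prefix, ps in ranked[:beam_width]}
    best, (p_b, p_nb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best, p_b + p_nb

# Two frames over (blank, 'a'); made-up values.
probs = [[0.6, 0.4], [0.6, 0.4]]
print(greedy_decode(probs))           # ()   : the empty transcript
print(prefix_beam_search(probs)[0])   # (1,) : the transcript "a"
```

The single best path is blank-blank with probability 0.36, so greedy decoding returns the empty transcript. Three paths collapse to "a" (`∅a`, `a∅`, `aa`) with total probability 0.24 + 0.24 + 0.16 = 0.64, which the beam search finds because it merges paths that share a collapsed prefix.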

What an interviewer expects you to say

  1. Frame the problem: unknown alignment between frames and a shorter label sequence.
  2. Introduce the blank symbol and the collapse rule (merge repeats, then drop blanks).
  3. State that the loss marginalizes over all alignments via forward-backward DP — exact, not sampled.
  4. Name the conditional-independence-across-frames assumption and its consequence: CTC has no built-in LM, so you fuse an external one.
  5. Bonus: contrast with attention-based seq2seq (no monotonicity assumption, but harder to stream) and RNN-T (adds a label-dependent prediction network).

Common confusions

  • “The blank means silence.” No. Blank means “emit nothing / no label transition here.” Silence is just a region the acoustic model maps to blanks, but blank is a structural token, not a phoneme.
  • “Greedy decoding gives the most likely transcript.” It gives the most likely path, which after collapsing may not be the most likely transcript.
  • “CTC needs aligned data.” The whole point is that it doesn’t — it learns the alignment implicitly.
  • “CTC models language.” It doesn’t; it is conditionally independent across frames. Linguistic structure comes from an external LM at decode time.
  • “CTC only works for speech.” It works for any monotonic alignment task: handwriting recognition, OCR, lip reading, keyword spotting.

Related: RNN-Transducer (RNN-T), Automatic speech recognition, Forward-backward and Viterbi, Hidden Markov models.