Automatic speech recognition (ASR)

The end-to-end map from a waveform to text: features, the three modeling paradigms (CTC, RNN-T, attention), language-model fusion, and how the field moved from HMM-GMM pipelines to a single neural model.

Reviewed May 31, 2026 · 3 min read

One-line definition

ASR maps an audio waveform to a text transcript. Modern systems are end-to-end neural models — CTC, RNN-T, or attention encoder-decoders — trained directly on (audio, text) pairs, replacing the old multi-stage HMM-GMM + pronunciation-lexicon + n-gram pipeline.

Why it matters

ASR is the canonical “sequence in, sequence out, unknown alignment” problem, and it pulls together several interview-favorite ideas: feature extraction, alignment-free losses, language-model fusion, and streaming-vs-accuracy tradeoffs. It’s also a frequent applied-scientist domain (voice assistants, captioning, call-center analytics, medical scribing).

The pipeline

waveform → features → acoustic model → (decoder + LM) → text

1. Audio features

Speech audio is typically sampled at 16 kHz. Models rarely consume raw samples directly; they use frames:

Window the signal (e.g. 25 ms windows, 10 ms hop → 100 frames/sec).
Compute a log-mel spectrogram: short-time Fourier transform → mel filterbank → log. This mimics human pitch perception and is the de-facto standard input.
MFCCs (mel-frequency cepstral coefficients) add a DCT on top; they are common in classical/HMM systems but less common with deep nets, which usually take log-mel features directly.
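The framing and log-mel steps above can be sketched in plain NumPy. This is a minimal illustration, not a production front-end: the Hann window, `n_fft=512`, 80 mel bins, the HTK-style mel formula, and the function names are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(waveform, sample_rate=16000, win_ms=25, hop_ms=10, n_fft=512, n_mels=80):
    win = int(sample_rate * win_ms / 1000)  # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)  # 160 samples at 16 kHz
    if len(waveform) < win:
        raise ValueError("waveform is shorter than one analysis window")
    n_frames = 1 + (len(waveform) - win) // hop
    window = np.hanning(win)
    # Slice overlapping frames and apply the window to each.
    frames = np.stack([waveform[i * hop : i * hop + win] * window for i in range(n_frames)])
    # Power spectrum per frame: shape (n_frames, n_fft // 2 + 1).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)  # small floor avoids log(0)

# One second of a 440 Hz tone gives roughly 100 frames of 80 features each.
t = np.arange(16000) / 16000.0
features = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

Without padding at the edges, one second of audio yields slightly fewer than 100 frames; real front-ends often pad so the count is exactly `len(waveform) // hop`.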

Augmentation is critical: SpecAugment (time and frequency masking plus time warping on the spectrogram) is usually the highest-leverage ASR augmentation, and it is often combined with speed perturbation and noise mixing.
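A rough sketch of the masking part of SpecAugment, assuming a `(frames, mel_bins)` feature matrix. Time warping is omitted for brevity, and the mask counts, maximum widths, and the choice to fill masked regions with the mean value are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def spec_augment(features, num_freq_masks=2, max_freq_width=10,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Apply random frequency and time masks to a (frames, mel_bins) array."""
    rng = np.random.default_rng() if rng is None else rng
    out = features.copy()  # leave the caller's array untouched
    n_frames, n_mels = out.shape
    fill = out.mean()      # masked cells are set to the global mean

    # Frequency masks: blank out a band of adjacent mel bins across all frames.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width + 1)))
        out[:, start:start + width] = fill

    # Time masks: blank out a run of adjacent frames across all mel bins.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width + 1)))
        out[start:start + width, :] = fill

    return out
```

Masks are typically drawn fresh for every training example, so the model sees a different corrupted view of the same utterance on each pass.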

2. Acoustic / sequence model

The three end-to-end paradigms:

Paradigm	Idea	Streams?	Built-in LM?
CTC	Frame classifier + blank, marginalize alignments	Yes	No
RNN-T	CTC + label-conditioned prediction net	Yes	Yes
Attention enc-dec (LAS, Whisper)	Decoder attends over encoded audio	Hard	Yes

Encoders are usually Conformer or transformer stacks over log-mel frames. Whisper is a plain transformer encoder-decoder trained on ~680k hours of weakly-supervised web audio.
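To make the CTC row of the table concrete, here is a small sketch of the CTC output rule: merge repeated labels, then drop blanks. It uses greedy (best-path) decoding, which picks the top label per frame rather than marginalizing over alignments as the training loss does. The function names, the toy vocabulary, and the choice of index 0 for blank are assumptions for illustration.

```python
def ctc_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:
            out.append(idx)
        prev = idx
    return out

def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Pick the highest-scoring label in each frame, then collapse."""
    frame_ids = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    return "".join(vocab[i] for i in ctc_collapse(frame_ids, blank_id))

# Toy example: the blank between the two runs of "a" keeps them distinct.
vocab = ["<blank>", "a", "b"]
frame_ids = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_collapse(frame_ids))  # [1, 1, 2]
```

The blank symbol is what lets CTC emit a doubled letter: without a blank frame between them, two adjacent runs of the same label would merge into one.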

3. Language model fusion

Acoustic models benefit from an external LM, especially CTC (which has no internal LM):

Shallow fusion: add $\lambda \log p_{LM}(y)$ to the acoustic score during beam search.
Deep / cold fusion: integrate LM hidden states into the decoder.
Rescoring: generate an n-best list / lattice, then re-score with a large LM (often a transformer LM).
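The score combination behind shallow fusion and rescoring is a weighted sum of log-probabilities. The sketch below applies it to a hypothetical two-entry n-best list (the rescoring variant); shallow fusion applies the same sum to each partial hypothesis inside beam search. The hypotheses, their scores, and the weight `lam=0.3` are made-up values for illustration.

```python
def fused_score(am_logp, lm_logp, lam=0.3):
    """Acoustic log-probability plus a weighted LM log-probability."""
    return am_logp + lam * lm_logp

def rescore(nbest, lam=0.3):
    """Sort (text, am_logp, lm_logp) hypotheses by fused score, best first."""
    return sorted(nbest, key=lambda h: fused_score(h[1], h[2], lam), reverse=True)

# Hypothetical n-best list: the acoustically preferred entry is implausible text.
nbest = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -14.0),
]

print(rescore(nbest, lam=0.0)[0][0])  # acoustic score only
print(rescore(nbest, lam=0.3)[0][0])  # LM weight flips the ranking
```

In practice $\lambda$ is tuned on a held-out set, and a length bonus is often added because LM scores penalize longer hypotheses.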

The historical contrast (why end-to-end won)

The classical pipeline was HMM-GMM (later HMM-DNN): a pronunciation lexicon mapped words → phones, an HMM modeled phone-state transitions, a GMM/DNN modeled acoustics, and a separate n-gram LM handled language. It required forced alignment and lots of expert-built components.

End-to-end models collapse all of this into one network trained on (audio, text) pairs. They win on simplicity and, with enough data, on accuracy — at the cost of needing more data and giving up some modularity.

Evaluation

The primary metric is Word Error Rate (WER):

$$\mathrm{WER} = \frac{S + D + I}{N},$$

where $S, D, I$ are the numbers of substitutions, deletions, and insertions (from the minimum edit distance to the reference) and $N$ is the number of reference words. Character Error Rate (CER) is the analog for languages without clear word boundaries. Note that WER can exceed 100%, because insertions count in the numerator but not in $N$.
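A minimal WER sketch using word-level Levenshtein distance. It assumes whitespace tokenization and no text normalization (casing, punctuation), both of which real evaluations usually handle before scoring; the function name is illustrative.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)

print(wer("the cat sat", "the hat sat"))      # one substitution out of three words
print(wer("hello", "hello there my friend"))  # three insertions, one reference word
```

The second call returns a value above 1.0, which is the "WER above 100%" case: three insertions against a single reference word.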

What an interviewer expects you to say

Describe the pipeline: log-mel features → neural encoder → decoder/LM → text, and that SpecAugment is the key augmentation.
Compare the three paradigms (CTC / RNN-T / attention) on streaming, internal LM, and accuracy.
Explain external LM fusion (shallow fusion, rescoring) and why CTC needs it most.
Know WER and how it’s computed.
Bonus: explain why the field moved from HMM-GMM to end-to-end, and when you’d still prefer a streaming RNN-T (on-device, low latency) over an offline attention model (Whisper, max accuracy).

Common confusions

“Models eat raw waveforms.” Usually log-mel spectrogram frames; raw-waveform front-ends (wav2vec 2.0, SincNet) exist but features are still the norm.
“WER ≤ 100%.” False — insertions can push it above 100%.
“Whisper streams.” It’s an offline attention encoder-decoder that processes fixed 30-second windows of audio, so it needs the full utterance (or a full chunk of it) before decoding. Streaming use requires chunking workarounds. RNN-T is the native streaming choice.
“Self-supervised pretraining is irrelevant.” wav2vec 2.0 / HuBERT-style self-supervised pretraining on unlabeled audio is now standard for low-resource ASR.