One-line definition
RNN-T extends CTC with a prediction network (an internal language model over emitted tokens) and a joint network that combines acoustic and label context, producing a streamable model that marginalizes over alignments while conditioning each output on the tokens already produced.
Why it matters
RNN-T is the model behind most on-device, streaming speech recognition (Google’s Gboard/Assistant dictation, many production ASR systems). It is the natural “what fixes CTC?” follow-up in any speech interview.
The key selling points:
- Streaming by construction. Unlike attention encoder-decoder models, RNN-T emits tokens left-to-right as audio arrives, with bounded latency.
- No frame-independence assumption. The prediction network conditions on output history, so RNN-T has a built-in language model — the thing CTC lacks.
- Still alignment-free. Like CTC, it marginalizes over all valid alignments during training.
The three components
| Component | Input | Role |
|---|---|---|
| Encoder (transcription net) | Acoustic frames | Acoustic representation (the “audio” tower) |
| Prediction net | Previously emitted non-blank tokens | Label-history representation (an internal LM) |
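| Joint net | Encoder output at frame $t$ and prediction-net output at label position $u$ | Combine the two → distribution over the vocabulary plus blank |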
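The joint network is typically a small feed-forward net. A common additive form, writing $f_t$ for the encoder output at frame $t$ and $g_u$ for the prediction-net output after $u$ emitted labels (symbol names chosen here for illustration), is:

$$z_{t,u} = W_{\text{out}} \tanh\left(W_{\text{enc}} f_t + W_{\text{pred}} g_u + b\right), \qquad P(k \mid t, u) = \operatorname{softmax}(z_{t,u})_k$$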
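As a rough illustration of the joint step, the sketch below implements one common additive form: project the encoder and prediction outputs to a shared size, add, apply tanh, then project to the output vocabulary (tokens plus blank). The function names, the weight names `W_enc`, `W_pred`, `W_out`, `b`, and the toy dimensions are all assumptions made for this example, and plain Python lists stand in for tensors.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def joint(f_t, g_u, W_enc, W_pred, b, W_out):
    """Additive joint: project both inputs, add, tanh, project to tokens + blank."""
    hidden = [math.tanh(a + c + bias)
              for a, c, bias in zip(matvec(W_enc, f_t), matvec(W_pred, g_u), b)]
    return log_softmax(matvec(W_out, hidden))

# Hypothetical toy sizes: encoder dim 4, prediction dim 3, joint dim 5,
# and 6 outputs (5 tokens + blank).
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_enc, W_pred, W_out = rand_matrix(5, 4), rand_matrix(5, 3), rand_matrix(6, 5)
b = [0.0] * 5
f_t = [rng.uniform(-1, 1) for _ in range(4)]  # encoder output at one frame
g_u = [rng.uniform(-1, 1) for _ in range(3)]  # prediction-net output after some labels
log_probs = joint(f_t, g_u, W_enc, W_pred, b, W_out)
```

Calling `joint` once per (frame, label position) pair yields one output distribution per lattice node, which is where the training cost discussed later comes from.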
The output lattice and blank
RNN-T defines a 2D grid indexed by acoustic frame $t$ and label position $u$. At each node $(t, u)$ it predicts either:
- a real token → move “up” ($u \to u + 1$), staying on the same frame, or
- the blank → move “right” ($t \to t + 1$), advancing the audio.
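A path from bottom-left to top-right is one alignment. The training loss sums over all monotonic paths through this lattice:

$$\mathcal{L} = -\log P(y \mid x) = -\log \sum_{a \in \mathcal{A}(x, y)} P(a \mid x)$$

where $\mathcal{A}(x, y)$ is the set of alignments that reduce to the label sequence $y$ once blanks are removed. The sum is computed exactly with a forward-backward DP over the lattice. For $T$ frames and $U$ labels the DP visits $O(T \cdot U)$ nodes, each needing its own joint-network output over the vocabulary, whereas CTC needs only one output distribution per frame. That makes the RNN-T loss heavier than CTC's and notoriously memory-hungry, which is why fused/streaming RNN-T loss kernels exist.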
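The forward half of that DP can be sketched in a few lines. This is a minimal, unoptimized illustration and not a production loss: it assumes the joint outputs are already available as a nested list `log_probs[t][u][k]` (log-probability of output `k` at frame `t` with `u` labels emitted), that blank has index 0, and that the function name `rnnt_log_likelihood` is made up for this example.

```python
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    """log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Return log P(labels | audio) by summing over all monotonic lattice paths."""
    T, U = len(log_probs), len(labels)
    # alpha[t][u]: log-probability of reaching node (t, u) along any path.
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrived by emitting blank at (t - 1, u)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # arrived by emitting labels[u - 1] at (t, u - 1)
                alpha[t][u] = logaddexp(
                    alpha[t][u], alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]])
    # A final blank at the top-right node terminates every path.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]

# Toy check: 2 frames, 1 label, uniform distribution over 3 outputs.
# There are two alignments, each with probability (1/3)^3.
uniform = [[[math.log(1 / 3)] * 3 for _ in range(2)] for _ in range(2)]
ll = rnnt_log_likelihood(uniform, labels=[1])
```

The training loss is the negative of this value. Real implementations also run the backward pass to get gradients and batch the computation on an accelerator.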
RNN-T vs CTC vs attention seq2seq
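| | Internal LM? | Streamable? | Alignment | Training cost |
|---|---|---|---|---|
| CTC | No (frame-independent) | Yes | Monotonic, marginalized | Lower: one output distribution per frame |
| RNN-T | Yes (prediction net) | Yes | Monotonic, marginalized | Higher: one joint output per lattice node (frame × label) |
| Attention enc-dec (LAS / Whisper) | Yes (decoder) | Hard (needs full utterance) | Soft, learned attention | Per-token cross-entropy with teacher forcing; no alignment lattice |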
The mental model: RNN-T = CTC + an internal language model, at the cost of a 2D alignment lattice. Attention models are typically more accurate offline but stream poorly; RNN-T is the streaming sweet spot.
Practical notes
- The encoder is usually the largest component; modern systems use Conformer (convolution-augmented transformer) encoders.
- The prediction network can be surprisingly small — even stateless (one-token context) variants work well, because the joint acoustic+text signal carries most of the information.
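- RNN-T loss is memory-heavy: the joint output is a $B \times T \times U \times V$ tensor (batch, frames, labels, vocabulary). Production training uses function merging or pruned RNN-T losses to make it fit in memory.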
- Latency is tuned by limiting the encoder’s right-context (how far into the future it looks).
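To put a number on the memory point, here is a back-of-the-envelope calculation of the joint logits tensor against the per-frame logits CTC needs. The batch size, frame count, label count, and vocabulary size are hypothetical values picked for the arithmetic, and the helper names are made up for this example.

```python
def joint_tensor_bytes(batch, frames, labels, vocab, bytes_per_value=4):
    """Size of the full RNN-T joint logits: batch x T x (U + 1) x V values."""
    return batch * frames * (labels + 1) * vocab * bytes_per_value

def ctc_tensor_bytes(batch, frames, vocab, bytes_per_value=4):
    """Size of CTC logits: batch x T x V values."""
    return batch * frames * vocab * bytes_per_value

# Hypothetical batch: 32 utterances, 500 encoder frames, 99 labels,
# 1024 outputs, float32 values.
rnnt = joint_tensor_bytes(32, 500, 99, 1024)
ctc = ctc_tensor_bytes(32, 500, 1024)
print(f"RNN-T joint logits: {rnnt / 2**30:.1f} GiB")
print(f"CTC logits:         {ctc / 2**20:.1f} MiB")
```

Under these assumptions the joint tensor is about 6.1 GiB against 62.5 MiB for CTC, a factor of $U + 1 = 100$, and that is before storing gradients of the same shape.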
What an interviewer expects you to say
- State that RNN-T fixes CTC’s conditional-independence weakness with a prediction network conditioned on emitted tokens.
- Name the three nets: encoder, prediction, joint.
- Describe the lattice (blank = advance time, token = advance label) and state that training marginalizes over all monotonic paths.
- Explain why it streams: outputs are produced left-to-right with bounded right-context, unlike global attention.
- Bonus: mention Conformer encoders, pruned/streaming RNN-T loss, and the accuracy-vs-latency tradeoff against attention models like Whisper.
Common confusions
- “The prediction network sees audio.” It does not. It only sees previously emitted labels — it is a language model. The joint net fuses it with the encoder’s audio features.
- “RNN-T is just CTC with an LSTM.” The architectural difference is the label-conditioned prediction net and the 2D lattice; that’s what removes frame independence.
- “Attention models can’t beat RNN-T.” Offline, full-context attention models (Whisper) are typically more accurate. RNN-T wins on streaming latency and on-device deployment.
- “Blank means silence.” As in CTC, blank is a structural token meaning “advance to the next frame,” not acoustic silence.
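Two of these points (the prediction net never sees audio, and blank means "next frame") show up directly in a greedy decoding loop. The sketch below is a simplified frame-synchronous greedy search, not any particular toolkit's decoder. `predict` and `joint` are hypothetical callables standing in for the trained networks, and `toy_predict` / `toy_joint` are invented stubs so the loop can run.

```python
def greedy_decode(frames, predict, joint, blank=0, max_symbols_per_frame=3):
    """Emit tokens left to right, one encoder frame at a time."""
    hyp = []
    g = predict(hyp)              # label-history state; it never sees audio
    for f_t in frames:            # frames may arrive one at a time (streaming)
        for _ in range(max_symbols_per_frame):
            scores = joint(f_t, g)
            k = max(range(len(scores)), key=scores.__getitem__)
            if k == blank:        # blank: stop emitting, move to the next frame
                break
            hyp.append(k)         # token: stay on this frame, advance the label index
            g = predict(hyp)
    return hyp

def toy_predict(hyp):
    """Stub stateless prediction net: context is just the last emitted label."""
    return hyp[-1] if hyp else None

def toy_joint(f_t, g, vocab=4):
    """Stub joint: each toy frame is the token id it 'contains' (0 = nothing).
    Prefer that token unless it was just emitted, otherwise prefer blank."""
    scores = [0.0] * vocab
    target = 0 if (f_t == 0 or f_t == g) else f_t
    scores[target] = 1.0
    return scores

tokens = greedy_decode([0, 2, 0, 0, 3, 1], toy_predict, toy_joint)
```

In this toy run the frame containing token 2 produces both a token and a blank: the blank there is not silence, it only tells the decoder to move on to the next frame.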
Related: Connectionist Temporal Classification (CTC), Automatic speech recognition, Transformer architecture, LSTM and GRU.