RNN-Transducer (RNN-T)

The streaming-ASR workhorse. RNN-T fixes CTC's biggest weakness — its frame-independence assumption — by adding a prediction network that conditions on previously emitted tokens, while staying naturally streamable.

One-line definition

RNN-T extends CTC with a prediction network (an internal language model over emitted tokens) and a joint network that combines acoustic and label context, producing a streamable model that marginalizes over alignments while conditioning each output on the tokens already produced.

Why it matters

RNN-T is the model behind most on-device, streaming speech recognition (Google’s Gboard/Assistant dictation, many production ASR systems). It is the natural “what fixes CTC?” follow-up in any speech interview.

The key selling points:

  • Streaming by construction. Unlike attention encoder-decoder models, RNN-T emits tokens left-to-right as audio arrives, with bounded latency.
  • No frame-independence assumption. The prediction network conditions on output history, so RNN-T has a built-in language model — the thing CTC lacks.
  • Still alignment-free. Like CTC, it marginalizes over all valid alignments during training.

The three components

Component | Input | Role
Encoder (transcription net) | Acoustic frames $x_{1:T}$ | Acoustic representation $h_t$ (the “audio” tower)
Prediction net | Previously emitted non-blank tokens $y_{1:u}$ | Label-history representation $g_u$ (an internal LM)
Joint net | $h_t$ and $g_u$ | Combine → distribution over the vocabulary plus blank

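The joint network is typically a small feed-forward net. A common form, with $h_t$ the encoder output at frame $t$ and $g_u$ the prediction-net output after $u$ emitted labels, is:

$$z_{t,u} = W_{\text{out}} \tanh\left(W_{\text{enc}} h_t + W_{\text{pred}} g_u + b\right), \qquad P(k \mid t, u) = \operatorname{softmax}(z_{t,u})_k$$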
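As a minimal sketch of such a joint network for a single lattice node, assuming a one-hidden-layer tanh form, the pure-Python version below projects the encoder and prediction vectors into a shared space, adds them, and produces log-probabilities over the vocabulary plus blank. The dimensions, the random weights, and the choice of index 0 for blank are illustrative assumptions, not values from any particular system.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(z):
    """Numerically stable log-softmax over a list of logits."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def joint(h_t, g_u, W_enc, W_pred, b, W_out):
    """Log-probabilities over (blank + tokens) for one lattice node (t, u)."""
    a = matvec(W_enc, h_t)      # project the acoustic vector
    c = matvec(W_pred, g_u)     # project the label-history vector
    hidden = [math.tanh(ai + ci + bi) for ai, ci, bi in zip(a, c, b)]
    return log_softmax(matvec(W_out, hidden))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes; V counts the blank symbol (index 0 by assumption).
rng = random.Random(0)
D_ENC, D_PRED, D_JOINT, V = 4, 3, 5, 6
W_enc = rand_matrix(D_JOINT, D_ENC, rng)
W_pred = rand_matrix(D_JOINT, D_PRED, rng)
b = [0.0] * D_JOINT
W_out = rand_matrix(V, D_JOINT, rng)

h_t = [rng.uniform(-1, 1) for _ in range(D_ENC)]    # stand-in encoder output
g_u = [rng.uniform(-1, 1) for _ in range(D_PRED)]   # stand-in prediction-net output
log_probs = joint(h_t, g_u, W_enc, W_pred, b, W_out)
```

In training, this function is evaluated at every (frame, label-position) pair, which is where the memory cost of the loss comes from.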

The output lattice and blank

RNN-T defines a 2D grid indexed by acoustic frame $t$ and label position $u$. At each node $(t, u)$ it predicts either:

  • a real token → move “up” ($u \to u + 1$), staying on the same frame, or
  • the blank → move “right” ($t \to t + 1$), advancing the audio.
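To make the two moves concrete, the sketch below enumerates every alignment of a short label sequence over a few frames. It assumes the common convention that each frame is left by exactly one blank, so an alignment contains one blank per frame and ends with the blank that leaves the last frame. The names `alignments`, `collapse`, and the `"<b>"` blank marker are illustrative.

```python
from itertools import combinations
from math import comb

BLANK = "<b>"  # illustrative marker for the blank symbol

def alignments(labels, T):
    """Yield every monotonic alignment of `labels` over T frames.

    Assumed convention: exactly T blanks (one per frame), with the
    last symbol being the blank that leaves the final frame.
    """
    U = len(labels)
    n = T + U - 1                      # free positions before the final blank
    for label_slots in combinations(range(n), U):
        slots = set(label_slots)
        remaining = iter(labels)       # labels are placed in their original order
        path = [next(remaining) if i in slots else BLANK for i in range(n)]
        yield path + [BLANK]

def collapse(path):
    """Remove blanks to recover the label sequence."""
    return [s for s in path if s != BLANK]

paths = list(alignments(["h", "i"], T=3))
for p in paths:
    print(" ".join(p))
```

With 3 frames and 2 labels there are C(4, 2) = 6 such alignments, and every one of them collapses to the same label sequence.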

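A path from bottom-left to top-right is one alignment. The training loss sums over all monotonic paths through this lattice:

$$\mathcal{L} = -\log P(y \mid x) = -\log \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x)$$

where $\mathcal{B}$ removes the blanks from an alignment $a$. The sum is computed exactly with a forward-backward DP over the lattice. For $T$ frames and $U$ labels the lattice has $O(T \cdot U)$ nodes, and each node needs its own distribution over the vocabulary, so the loss is heavier than CTC’s and notoriously memory-hungry, which is why fused/streaming RNN-T loss kernels exist.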
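The forward half of that DP can be sketched in a few lines. The version below is a plain-Python illustration, not a production kernel: it assumes `log_probs[t][u]` holds the joint network's log-probabilities at lattice node (t, u), that blank has index 0, and that the final frame is left by one last blank. The function name `rnnt_log_likelihood` is hypothetical.

```python
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    """log(exp(a) + exp(b)) without overflow."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_log_likelihood(log_probs, labels, blank=0):
    """Sum over all monotonic alignments via the forward recursion.

    log_probs[t][u][k]: log P(k | frame t, u labels emitted so far),
    for t in 0..T-1 and u in 0..U.
    """
    T, U = len(log_probs), len(labels)
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive from the left: a blank was emitted at (t-1, u).
            from_blank = alpha[t - 1][u] + log_probs[t - 1][u][blank] if t > 0 else NEG_INF
            # Arrive from below: label u was emitted at (t, u-1).
            from_label = alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]] if u > 0 else NEG_INF
            alpha[t][u] = logaddexp(from_blank, from_label)
    # Leave the top-right node with a final blank.
    return alpha[T - 1][U] + log_probs[T - 1][U][blank]

# Toy check: uniform distributions over V symbols at every node.
T, U, V = 3, 2, 4
uniform = [[[math.log(1.0 / V)] * V for _ in range(U + 1)] for _ in range(T)]
labels = [1, 2]
log_lik = rnnt_log_likelihood(uniform, labels)
loss = -log_lik
```

In the toy check, each of the 6 alignments emits 5 symbols with probability 1/4 each, so the total likelihood is 6/1024. Note that the input alone is a T × (U+1) × V array per utterance, which is the memory pressure that fused and pruned losses target.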

RNN-T vs CTC vs attention seq2seq

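Model | Internal LM? | Streamable? | Alignment | Training cost
CTC | No (frame-independent) | Yes | Monotonic, marginalized | One output distribution per frame
RNN-T | Yes (prediction net) | Yes | Monotonic, marginalized | One output distribution per (frame, label position) pair; memory-heavy
Attention enc-dec (LAS / Whisper) | Yes (decoder) | Hard (needs full utterance) | Soft, learned attention | One output distribution per output token (plain cross-entropy)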

The mental model: RNN-T = CTC + an internal language model, at the cost of a 2D alignment lattice. Attention models are more accurate offline but stream poorly; RNN-T is the streaming sweet spot.

Practical notes

  • The encoder is usually the largest component; modern systems use Conformer (convolution-augmented transformer) encoders.
  • The prediction network can be surprisingly small — even stateless (one-token context) variants work well, because the joint acoustic+text signal carries most of the information.
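  • RNN-T loss is memory-heavy because the joint network produces a vocabulary-sized output at every (frame, label position) pair; production training uses function-merging or pruned RNN-T losses to fit that tensor in memory.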
  • Latency is tuned by limiting the encoder’s right-context (how far into the future it looks).
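The right-context note above can be turned into rough arithmetic. The sketch below assumes a stacked encoder in which each layer looks a fixed number of frames ahead at that layer's frame rate, so the lookahead of the stack is the sum of the per-layer lookaheads. The layer count, the 40 ms frame rate, and the helper name `encoder_lookahead_ms` are hypothetical and only illustrate the calculation.

```python
def encoder_lookahead_ms(right_context_per_layer, frame_ms_per_layer):
    """Future audio (in ms) a stacked encoder needs before emitting one output frame.

    right_context_per_layer[i]: frames of right-context used by layer i.
    frame_ms_per_layer[i]: duration of one frame at layer i's frame rate.
    """
    return sum(r * ms for r, ms in zip(right_context_per_layer, frame_ms_per_layer))

# Hypothetical 12-layer encoder running at a 40 ms frame rate
# (for example 10 ms features with 4x subsampling).
layers = 12
full = encoder_lookahead_ms([2] * layers, [40] * layers)                    # every layer looks ahead
limited = encoder_lookahead_ms([0] * (layers - 1) + [2], [40] * layers)     # only the last layer does

print(full, limited)
```

Under these assumptions, giving every layer two frames of right-context costs 960 ms of lookahead, while restricting it to the last layer costs 80 ms. This counts only the encoder's algorithmic lookahead, not compute time or endpointing.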

What an interviewer expects you to say

  1. State that RNN-T fixes CTC’s conditional-independence weakness with a prediction network conditioned on emitted tokens.
  2. Name the three nets: encoder, prediction, joint.
  3. Describe the lattice (blank = advance time, token = advance label) and state that training marginalizes over all monotonic paths.
  4. Explain why it streams: outputs are produced left-to-right with bounded right-context, unlike global attention.
  5. Bonus: mention Conformer encoders, pruned/streaming RNN-T loss, and the accuracy-vs-latency tradeoff against attention models like Whisper.
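The streaming argument is easiest to see in a greedy decoding loop. The sketch below is a simplified illustration with toy stand-ins for the three networks, not a real decoder: `toy_encode`, `toy_predict`, and `toy_joint` are hypothetical, the prediction net is stateless (it sees only the last emitted token), and blank is assumed to be index 0. The loop consumes one frame at a time, emits tokens until the joint net prefers blank, then moves to the next frame.

```python
def greedy_decode(frames, encode_frame, predict, joint, blank=0, max_symbols_per_frame=5):
    """Yield tokens as soon as they are emitted, one audio frame at a time."""
    hyp = []
    g = predict(hyp)                      # label context; never sees audio
    for x in frames:                      # frames arrive left to right
        h = encode_frame(x)               # acoustic context for this frame
        for _ in range(max_symbols_per_frame):
            scores = joint(h, g)
            k = max(range(len(scores)), key=scores.__getitem__)
            if k == blank:
                break                     # blank: advance to the next frame
            hyp.append(k)                 # token: stay on this frame
            yield k
            g = predict(hyp)              # update label context only

# Toy components (illustrative assumptions, vocabulary {0: blank, 1, 2}).
def toy_encode(x):
    return x                              # each frame already "is" its token id

def toy_predict(hyp):
    return hyp[-1] if hyp else 0          # stateless: last emitted token only

def toy_joint(h, g):
    scores = [0.0, 0.0, 0.0]
    # Prefer the frame's token unless it was just emitted; otherwise blank.
    target = h if (h != 0 and h != g) else 0
    scores[target] = 1.0
    return scores

decoded = list(greedy_decode([1, 1, 0, 2, 2], toy_encode, toy_predict, toy_joint))
print(decoded)
```

On the toy input the repeated frames do not produce repeated tokens because the label context changes the joint net's decision, which is the conditioning CTC lacks. A real system would replace the toy functions with trained networks and usually beam search.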

Common confusions

  • “The prediction network sees audio.” It does not. It only sees previously emitted labels — it is a language model. The joint net fuses it with the encoder’s audio features.
  • “RNN-T is just CTC with an LSTM.” The architectural difference is the label-conditioned prediction net and the 2D lattice; that’s what removes frame independence.
  • “Attention models can’t beat RNN-T.” Offline, full-context attention models (Whisper) are typically more accurate. RNN-T wins on streaming latency and on-device deployment.
  • “Blank means silence.” As in CTC, blank is a structural token meaning “advance to the next frame,” not acoustic silence.

Related: Connectionist Temporal Classification (CTC), Automatic speech recognition, Transformer architecture, LSTM and GRU.