One-line definition
An encoder-decoder model has two networks: an encoder that reads the input sequence and produces hidden representations, and a decoder that generates the output sequence one token at a time, attending to both the encoder’s representations and the partial output so far.
Why it matters
Encoder-decoder is the canonical architecture for sequence-to-sequence tasks where the input and output may differ in length, structure, or modality:
- Machine translation (original transformer, 2017).
- Summarization (BART, T5).
- Text-to-speech.
- Image captioning, image-to-text.
- Modern diffusion models (text encoder + denoising decoder).
Modern decoder-only LLMs (GPT, Llama, Mistral) are the dominant chat architecture, but encoder-decoder remains better for tasks with a clear input-output split and constrained output length (translation, summarization).
The structure
Input → [Encoder Stack] → encoder hidden states
Encoder hidden states → (cross-attention) → [Decoder Stack]
Previous output tokens → (causal self-attention) → [Decoder Stack]
[Decoder Stack] → Output, generated autoregressively one token at a time
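A minimal, shape-level sketch of this flow using PyTorch's stock torch.nn.Transformer as a stand-in for the two stacks (the dimensions and random tensors are illustrative, not from any particular model):

```python
import torch

# Encoder-decoder wiring with PyTorch's built-in module; d_model/nhead are illustrative.
model = torch.nn.Transformer(d_model=64, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)

src = torch.randn(1, 10, 64)   # embedded input sequence  -> encoder
tgt = torch.randn(1, 7, 64)    # embedded partial output  -> decoder
causal = model.generate_square_subsequent_mask(7)  # blocks attention to future positions

out = model(src, tgt, tgt_mask=causal)  # (1, 7, 64): one state per output position
```

The cross-attention to the encoder output happens inside the call; the causal mask applies only to the decoder's self-attention.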
Each decoder block has two attention sub-blocks:
- Self-attention over previous decoder outputs (causal masking).
- Cross-attention with queries from the decoder state and keys/values from the encoder hidden states.
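A sketch of just these two sub-blocks (residuals, layer norms, and the feed-forward layer omitted; sizes are illustrative):

```python
import torch

d = 64
self_attn  = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)
cross_attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)

dec = torch.randn(1, 7, d)    # decoder states for the previous output tokens
enc = torch.randn(1, 10, d)   # encoder hidden states

# Causal mask: True above the diagonal = a position may not attend to later positions.
causal = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)

x, _ = self_attn(dec, dec, dec, attn_mask=causal)  # self-attention, causally masked
y, _ = cross_attn(x, enc, enc)                     # queries from decoder, keys/values from encoder
```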
Encoder-only vs. decoder-only vs. encoder-decoder
| Model class | Use cases | Examples |
|---|---|---|
| Encoder-only | Embeddings, classification, retrieval | BERT, RoBERTa, sentence-T5 |
| Decoder-only | Generation, chat, code | GPT-2/3/4, Llama, Mistral, Claude |
| Encoder-decoder | Translation, summarization, structured output | T5, BART, mT5, FLAN-T5 |
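For concreteness, a sketch of running an encoder-decoder checkpoint with Hugging Face transformers (assumes the library is installed and the google/flan-t5-small checkpoint is available; the prompt is just an example):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# The input passes through the encoder once; generate() then decodes autoregressively.
inputs = tok("summarize: The encoder reads the input and the decoder writes the output.",
             return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(ids[0], skip_special_tokens=True))
```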
Why decoder-only models took over chat
Several reasons:
- In-context learning emerged in decoder-only LLMs at scale; the encoder-decoder split is unnecessary if the task is conveyed in the prompt.
- Simpler training: one stack of identical blocks, one objective (next-token prediction).
- Easier to scale: weight-sharing between encoder and decoder is awkward; decoder-only just adds layers.
- Single inference path: no separate encoder pass.
For tasks where the input is fixed and the output is a transformation of it (translation, summarization, code completion from a spec), encoder-decoder still has efficiency advantages.
Variants and their distinctions
- Original transformer (Vaswani et al., 2017): encoder-decoder for neural machine translation.
- BERT: encoder-only, masked-language-model objective.
- GPT: decoder-only, autoregressive next-token.
- T5 (Raffel et al., 2019): encoder-decoder, span-corruption objective (illustrated after this list); everything is cast as text-to-text.
- BART: encoder-decoder, denoising autoencoder.
- FLAN-T5: T5 + instruction fine-tuning.
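To illustrate T5's span-corruption objective: contiguous spans of the input are replaced by sentinel tokens, and the decoder's target reconstructs only the dropped spans. The sentinel names below follow the released T5 vocabulary; the sentence itself is just an example.

```python
# Illustrative T5-style span corruption.
original = "Thank you for inviting me to your party last week"
enc_in   = "Thank you <extra_id_0> me to your party <extra_id_1> week"  # encoder input
dec_out  = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"   # decoder target
```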
Cross-attention complexity
Cross-attention from decoder to encoder costs O(n_dec · n_enc) per layer: every output position attends to every input position. For long inputs (long-document summarization), this term dominates decoding cost. Variants:
- Sparse cross-attention: BigBird-style for long inputs.
- Encoder caching: encoder runs once per input; cached for all decoder steps.
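A minimal sketch of encoder caching during greedy decoding; `encoder`, `decoder`, `bos_id`, and `eos_id` are hypothetical stand-ins, not a specific library's API:

```python
import torch

def greedy_decode(encoder, decoder, src_ids, bos_id, eos_id, max_len=64):
    # One encoder pass per input; its hidden states are reused at every decoder step.
    enc_hidden = encoder(src_ids)
    out = [bos_id]
    for _ in range(max_len):
        tgt = torch.tensor([out])              # previous output tokens so far
        logits = decoder(tgt, enc_hidden)      # cross-attends to the cached encoder states
        next_id = int(logits[0, -1].argmax())  # greedy pick for the newest position
        out.append(next_id)
        if next_id == eos_id:
            break
    return out
```

Real implementations also cache the decoder's own key/value states across steps, but the encoder outputs are the fixed cost being reused here.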
Common pitfalls
- Confusing T5 (encoder-decoder) and BERT (encoder-only). Different training objectives, different uses.
- Using decoder-only for translation when encoder-decoder is better. Decoder-only translation works but encoder-decoder usually trains faster and gives slightly better quality at smaller scale.
- Sharing position encodings between encoder and decoder naively. Often they need different schemes (relative for encoder, RoPE for decoder).
- Recomputing encoder hidden states during decoding. They are static: the encoder runs once per input, and its outputs should be cached and reused at every decoder step.
Related
- Transformer architecture. Block-level structure.
- Attention mechanism. Both self- and cross-attention.