One-line definition
Self-attention computes $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ where $Q$, $K$, and $V$ are all derived from the same sequence. Cross-attention uses $Q$ from one sequence and $K$, $V$ from another. Same kernel, different routing.
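A minimal single-head sketch in PyTorch (names like `attention`, `x`, and `ctx` are illustrative, not from any library): the kernel is identical, and self- vs cross-attention is only a matter of which tensor supplies the keys and values.

```python
import torch
import torch.nn.functional as F

def attention(x_q, x_kv, w_q, w_k, w_v, mask=None):
    """One attention head. x_q: (L_q, d), x_kv: (L_kv, d)."""
    q, k, v = x_q @ w_q, x_kv @ w_k, x_kv @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5            # (L_q, L_kv)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v             # (L_q, d)

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
x   = torch.randn(10, d)   # one sequence
ctx = torch.randn(77, d)   # a different sequence (e.g. text conditioning)

self_out  = attention(x, x,   w_q, w_k, w_v)   # Q, K, V all from x
cross_out = attention(x, ctx, w_q, w_k, w_v)   # Q from x, K and V from ctx
```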
Why it matters
The distinction separates encoder-only models (BERT: self-attention only), decoder-only models (GPT: causal self-attention only), and encoder-decoder models (T5, the original Transformer, Whisper: both). Most multi-modal architectures (image-conditioned text, text-conditioned audio) route the conditioning modality through cross-attention.
If you can write the matmuls correctly and explain why a given layer uses one or the other, you understand most of the transformer architecture landscape.
Self-attention
Inputs: a single sequence $X \in \mathbb{R}^{L \times d}$; $Q = XW_Q$, $K = XW_K$, $V = XW_V$ are all projections of $X$.
Each token attends to every other token (subject to masking). Used in:
- BERT encoder layers: bidirectional self-attention.
- GPT decoder layers: causal self-attention. The mask sets future positions to $-\infty$ before the softmax, so token $i$ cannot attend to tokens $j > i$.
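Reusing the attention function and toy tensors from the sketch above, the causal mask is just a lower-triangular boolean matrix: True where token $i$ may attend to token $j \le i$.

```python
L = x.shape[0]
causal = torch.tril(torch.ones(L, L, dtype=torch.bool))   # True at j <= i
gpt_style = attention(x, x, w_q, w_k, w_v, mask=causal)   # token i sees tokens 0..i only
```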
Cross-attention
Inputs: a query sequence $X_q \in \mathbb{R}^{L_q \times d}$ and a key-value sequence $X_{kv} \in \mathbb{R}^{L_{kv} \times d}$; $Q = X_q W_Q$, $K = X_{kv} W_K$, $V = X_{kv} W_V$.
Same softmax, same scaling. The shape of the attention matrix is now $L_q \times L_{kv}$ instead of $L \times L$.
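A quick shape check, reusing `d` and the projection weights from the first sketch: 10 query positions against 77 key/value positions gives a 10 by 77 score matrix rather than a square one.

```python
x_q  = torch.randn(10, d)   # query-side sequence,     L_q  = 10
x_kv = torch.randn(77, d)   # key/value-side sequence, L_kv = 77
scores = (x_q @ w_q) @ (x_kv @ w_k).T / d ** 0.5
print(scores.shape)         # torch.Size([10, 77])  ->  L_q x L_kv
```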
Used wherever the model needs to “look up” information from a different source:
- Encoder-decoder transformers: decoder layers cross-attend to the encoder output. Translation, summarization, speech-to-text.
- Diffusion models with text conditioning: image-side latents cross-attend to text embeddings. Stable Diffusion, DiT.
- Perceiver / Q-Former: a small set of learned latent queries cross-attend to a large input (image patches, audio frames) to compress it.
- RAG-style architectures: model output cross-attends to retrieved document representations.
Where each lives in a transformer block
Encoder block (BERT, T5 encoder):
- Self-attention.
- FFN.
Decoder block (GPT):
- Causal self-attention.
- FFN.
Encoder-decoder block (T5 decoder, original Transformer decoder):
- Causal self-attention.
- Cross-attention to encoder output.
- FFN.
The decoder reads its own past tokens (self-attention) and the encoder’s output (cross-attention) at every layer.
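A schematic pre-norm decoder block along these lines, using `nn.MultiheadAttention` for both sub-layers; names and shapes (`enc_out`, `causal_mask`) are illustrative, and real implementations differ in normalization, dropout, and masking details.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out, causal_mask):
        # 1. Causal self-attention over the decoder's own (past) tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder output.
        h = self.norm2(x)
        x = x + self.cross_attn(h, enc_out, enc_out)[0]
        # 3. Position-wise feed-forward network.
        return x + self.ffn(self.norm3(x))

x, enc_out = torch.randn(2, 16, 512), torch.randn(2, 40, 512)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)  # True = blocked
out = DecoderBlock()(x, enc_out, mask)   # (2, 16, 512)
```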
Tradeoffs
- Compute: self-attention is $O(L^2 d)$. Cross-attention is $O(L_q L_{kv} d)$, which can be much cheaper when $L_{kv}$ is small (compressed conditioning) or much more expensive when $L_{kv}$ is large (cross-attending to a long context).
- KV-cache: at decoder inference, both self- and cross-attention have a KV-cache. The cross-attention cache is computed once from the encoder output and reused for every decoded token, which makes it cheaper than the self-attention cache that grows with the decoded sequence.
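A sketch of that asymmetry, reusing the single-head weights from the first sketch: the cross-attention K/V are projected once from a (hypothetical) encoder output, while the self-attention cache gains one entry per generated token.

```python
# Cross-attention K/V: projected once from the encoder output, reused every step.
enc_out = torch.randn(40, d)               # encoder output, fixed during decoding
k_cross, v_cross = enc_out @ w_k, enc_out @ w_v

self_k_cache, self_v_cache = [], []        # self-attention cache grows per step
for step in range(5):
    tok = torch.randn(1, d)                # newly decoded token's hidden state
    self_k_cache.append(tok @ w_k)         # one new entry per decoded token
    self_v_cache.append(tok @ w_v)
    q = tok @ w_q
    self_out  = F.softmax(q @ torch.cat(self_k_cache).T / d ** 0.5, dim=-1) @ torch.cat(self_v_cache)
    cross_out = F.softmax(q @ k_cross.T / d ** 0.5, dim=-1) @ v_cross   # k_cross/v_cross never recomputed
```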
Variants
- Causal cross-attention: rare. The query side is causally masked, but $K$ and $V$ do not come from the same sequence as $Q$, so causality is moot.
- Cross-attention with caching: precompute $K$ and $V$ from a fixed conditioning sequence (system prompt, retrieved docs) and reuse them across decoding steps.
- Asymmetric cross-attention: in Perceiver, the queries are a small learned set (e.g. 256 latents), the K/V are massive (e.g. all image patches). The model compresses an arbitrarily long input into a fixed-size latent array.
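A Perceiver-flavoured sketch of that compression (sizes illustrative, single head, no iterative refinement): a small learned latent array queries a long patch sequence, and the output size is set by the number of latents, not the input length.

```python
import torch
import torch.nn.functional as F

d, n_latents, n_patches = 64, 256, 4096
latents = torch.nn.Parameter(torch.randn(n_latents, d))   # learned query array
patches = torch.randn(n_patches, d)                        # e.g. image patch embeddings
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

q, k, v = latents @ w_q, patches @ w_k, patches @ w_v
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)    # (256, 4096): few queries, many keys
compressed = attn @ v                            # (256, 64) regardless of n_patches
```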
Common pitfalls
- Calling decoder self-attention “cross-attention.” They are different. Self-attention reads from the same sequence (the previously generated tokens); cross-attention reads from another sequence (encoder output, retrieved docs).
- Forgetting that decoder-only LLMs do not use cross-attention. Their conditioning is the prompt prefix, attended via self-attention, not a separate cross-attention path.
- Conflating attention masks with attention types. The mask shape differs (a causal mask is square and triangular; a cross-attention matrix is rectangular and usually unmasked), but the operation is the same.