One-line definition
For queries $Q$, keys $K$, and values $V$:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Each query is replaced by a weighted average of values, with weights given by query-key similarities normalized by softmax.
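A minimal NumPy sketch of this definition (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k) query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys, rows sum to 1
    return weights @ V                                   # each output row: weighted average of V rows
```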
Why it matters
Attention is the single most important architectural primitive of the past decade:
- Transformers are stacks of attention + FFN sub-layers. Every modern LLM devotes a large share of its parameters and FLOPs to attention, and attention dominates compute at long context.
- Retrieval (two-tower, cross-encoder) is dot-product attention between queries and items.
- Vision transformers, graph attention networks, diffusion models all use it.
- Memory-augmented networks use attention to access external memory.
Understanding attention at the computational and conceptual level is non-negotiable for senior ML roles.
The mechanism step by step
For a single query $q$ and a set of key-value pairs $\{(k_i, v_i)\}$:
- Score each key against the query: $s_i = q \cdot k_i / \sqrt{d_k}$.
- Normalize with softmax: $\alpha_i = \exp(s_i) / \sum_j \exp(s_j)$. The $\alpha_i$ sum to 1; they form an attention distribution over the keys.
- Aggregate: output $o = \sum_i \alpha_i v_i$.
Each output is a convex combination of values, biased toward keys most similar to the query.
The $1/\sqrt{d_k}$ scaling is critical: without it, the variance of the dot products grows linearly in $d_k$, pushing softmax into saturation regions where gradients vanish.
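A quick numerical check of that effect, using random unit-variance vectors (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 512, 64
q = rng.standard_normal(d_k)
K = rng.standard_normal((n_keys, d_k))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

unscaled = softmax(q @ K.T)                 # raw dot products have variance ~ d_k
scaled   = softmax(q @ K.T / np.sqrt(d_k))  # scaling brings the variance back to ~1

print("max weight, unscaled:", unscaled.max())  # typically near 1.0: softmax is saturated
print("max weight, scaled:  ", scaled.max())    # a much flatter attention distribution
```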
Self-attention vs. cross-attention
- Self-attention: $Q$, $K$, and $V$ are all derived from the same input. Each token attends to all other tokens in the same sequence. Used in transformer encoder layers and the self-attention sub-block of decoder layers.
- Cross-attention: $Q$ comes from one source (e.g., decoder hidden states), $K$ and $V$ from another (e.g., encoder outputs). Used in encoder-decoder transformers (T5, NMT) and modern diffusion text conditioning.
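A small sketch of the difference, assuming a toy encoder-decoder setup with made-up shapes (the `attend` helper is just plain scaled dot-product attention):

```python
import numpy as np

def attend(Q, K, V):
    # scaled dot-product attention, as defined above
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 64
enc_out = rng.standard_normal((100, d))   # encoder outputs
dec_h   = rng.standard_normal((10, d))    # decoder hidden states

self_attn  = attend(dec_h, dec_h, dec_h)       # Q, K, V all from the same (decoder) sequence
cross_attn = attend(dec_h, enc_out, enc_out)   # Q from decoder, K/V from encoder
```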
Multi-head attention
Run attention $h$ times in parallel with different learned projections (each of dimension $d_{\text{model}}/h$), concatenate the outputs, and project back. Each “head” can specialize in different relationships (syntactic, semantic, positional). Typical transformers use 8–96 heads.
In modern LLMs, the K/V head count is reduced via grouped-query attention (GQA), where multiple query heads share a smaller set of K/V heads.
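A sketch combining multi-head splitting with optional grouped-query sharing (weight names, shapes, and the `n_kv_heads` parameter are illustrative assumptions, not a specific library API):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, n_kv_heads=None):
    """x: (n, d_model). Setting n_kv_heads < n_heads gives grouped-query attention."""
    n, d_model = x.shape
    n_kv_heads = n_kv_heads or n_heads
    d_head = d_model // n_heads

    # project and split into heads: (heads, n, d_head)
    q = (x @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(n, n_kv_heads, d_head).transpose(1, 0, 2)

    # grouped-query attention: each K/V head serves n_heads // n_kv_heads query heads
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    out = softmax(scores) @ v                              # (heads, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d_model)       # concatenate heads
    return out @ Wo                                        # final output projection

# toy usage with made-up sizes
rng = np.random.default_rng(0)
d_model, n_heads, n_kv = 256, 8, 2
x  = rng.standard_normal((16, d_model))
Wq = rng.standard_normal((d_model, d_model)) * 0.02
Wk = rng.standard_normal((d_model, n_kv * d_model // n_heads)) * 0.02
Wv = rng.standard_normal((d_model, n_kv * d_model // n_heads)) * 0.02
Wo = rng.standard_normal((d_model, d_model)) * 0.02
y  = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, n_kv_heads=n_kv)  # (16, 256)
```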
Causal (autoregressive) masking
For decoder language models: token $t$ should not attend to tokens $> t$. Implement by adding $-\infty$ to the corresponding entries of the score matrix $QK^\top/\sqrt{d_k}$ before the softmax. After the softmax, those positions get weight 0.
This is what enables next-token prediction without leakage.
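A minimal causal-masking sketch in the same NumPy style (illustrative only):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where position i may only attend to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # strictly-upper-triangular (future) positions get -inf before softmax -> weight 0 after
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V
```

Row $i$ of the resulting weight matrix is zero for every column $j > i$, so token $i$ aggregates only over positions $\le i$.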
Connection to retrieval
Dot-product attention with a single query against many keys is essentially nearest-neighbor retrieval by inner-product similarity (cosine similarity, if the vectors are normalized). The softmax just turns a hard top-$k$ selection into a soft weighting over all items.
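A toy illustration of the retrieval view, with random embeddings standing in for a query and an item database (all names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
query = rng.standard_normal(d)
items = rng.standard_normal((1000, d))   # "database" of item embeddings (keys = values here)

scores = items @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

hard_top1 = items[np.argmax(scores)]     # hard nearest-neighbor retrieval
soft_read = weights @ items              # attention: soft weighted read over all items
```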
Cost
- Forward: $O(n^2 d)$ FLOPs for sequence length $n$ and head dimension $d$.
- Memory: $O(n^2)$ for the attention matrix. The dominant cost at long context.
FlashAttention reorders the computation to never materialize the full $n \times n$ matrix, dropping the attention-specific memory to $O(n)$.
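A didactic sketch of the online-softmax idea behind this reordering (not the actual FlashAttention kernel, which also tiles the queries and runs fused in on-chip SRAM; the function name and block size are illustrative):

```python
import numpy as np

def blockwise_attention(Q, K, V, block=128):
    """Numerically equivalent to softmax(QK^T/sqrt(d)) V, computed over key blocks
    so the full n x n score matrix is never built."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                    # only an (n, block) buffer
        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)            # rescale previous partial results
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

Up to floating-point error this matches the naive computation, while the per-step score buffer is $n \times \text{block}$ instead of $n \times n$.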
Common pitfalls
- Forgetting the $1/\sqrt{d_k}$ scaling. Softmax saturates; gradients vanish; training fails.
- Wrong masking for causal LM. Off-by-one errors leak future tokens; quality looks great in training but inference is broken.
- Treating attention weights as interpretation. Attention weights show what was averaged, not what was used; downstream computation may ignore the weighted result. Don’t over-interpret heatmaps.
- Confusing attention with self-attention. Attention is the general operation; self-attention is one usage.
Related
- Transformer architecture. Full assembly.
- FlashAttention. Efficient implementation.
- Grouped-query attention. Modern KV-cache optimization.