One-line definition
A two-tower model encodes the query (or user) and the item with two independent neural networks (“towers”) into a shared embedding space, scores them with a dot product or cosine, and is trained with a contrastive or sampled-softmax loss so that positive pairs score higher than negatives.
Why it matters
Two-tower (a.k.a. dual-encoder) is the dominant architecture for the retrieval stage of large-scale ranking systems: web search, e-commerce search, YouTube recommendations, ad targeting, dense passage retrieval for RAG, semantic search.
The structural advantage: item embeddings can be precomputed once per item and indexed. At query time you only run the query tower (cheap) and do an approximate-nearest-neighbor lookup (sub-linear in catalog size). Cross-encoders (where query and item are concatenated into a single network) cannot be precomputed and are 100–10000× too slow for retrieval at scale.
Architecture
query → query_tower → q ∈ R^d
item → item_tower → i ∈ R^d
score = q · i (or cosine)
- Towers: typically transformers, MLPs, or a mix. Towers usually do not share weights (different input modalities or feature sets).
- Embedding dim: 64–512 in production. Higher is more expressive; lower is faster to index and more cache-friendly.
- Output normalization: L2-normalize both towers' outputs so the dot product equals cosine similarity; this lets the ANN index run in inner-product mode (see embedding spaces).
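A minimal PyTorch sketch of this layout, assuming simple MLP towers over pre-extracted dense feature vectors (the layer sizes and d = 128 are illustrative, not from any particular system):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One tower: an MLP mapping raw features to a d-dimensional embedding."""
    def __init__(self, in_dim: int, d: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, d),
        )

    def forward(self, x):
        # L2-normalize so dot product == cosine and the ANN index
        # can run in inner-product mode.
        return F.normalize(self.net(x), dim=-1)

class TwoTower(nn.Module):
    def __init__(self, query_dim: int, item_dim: int, d: int = 128):
        super().__init__()
        self.query_tower = Tower(query_dim, d)   # run per request, online
        self.item_tower = Tower(item_dim, d)     # run offline, once per item

    def forward(self, query_feats, item_feats):
        q = self.query_tower(query_feats)        # (B, d)
        i = self.item_tower(item_feats)          # (B, d)
        return (q * i).sum(-1)                   # per-pair score q · i
```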
Training
Standard losses:
In-batch sampled softmax
For a batch of B positive (query, item) pairs, treat the other B − 1 items in the batch as negatives. Loss per query q with positive item i⁺:
L(q) = −log( exp(q · i⁺) / Σ_{j=1..B} exp(q · i_j) )
Cheap, parallelizes well, but biases toward popular items (popular items appear as negatives more often).
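A sketch of this loss, assuming L2-normalized tower outputs q and i of shape (B, d) with aligned positive pairs, so each row's positive lies on the diagonal of the B×B score matrix; the temperature tau is a common practical addition, not required by the formula above:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q, i, tau: float = 0.05):
    """q, i: (B, d) L2-normalized embeddings of aligned positive pairs."""
    logits = q @ i.T / tau                              # (B, B): query b vs. every item in the batch
    labels = torch.arange(q.size(0), device=q.device)   # positive for row b is column b
    return F.cross_entropy(logits, labels)
```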
Importance-corrected sampled softmax
Correct the in-batch sampling bias by subtracting log p(item), the item's estimated probability of appearing in a batch, from each logit before the softmax (the "logQ" correction). Standard in YouTube's two-tower retrieval model (Yi et al., 2019).
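A sketch of the correction applied to the same in-batch loss, assuming log_p holds each batch item's log sampling probability (production systems typically estimate this with a streaming frequency counter):

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(q, i, log_p, tau: float = 0.05):
    """
    q, i:  (B, d) L2-normalized embeddings of aligned positive pairs.
    log_p: (B,) log of each item's estimated probability of appearing in a batch.
    """
    logits = q @ i.T / tau - log_p                       # subtract log p(item) from that item's column
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```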
Hard negative mining
Sample hard negatives (high-scoring but incorrect items) explicitly. More expensive, but improves quality, especially once the model already separates the easy in-batch negatives and needs harder contrasts to keep improving.
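One common pattern, sketched under the assumption that hard negatives were mined offline (e.g., high-scoring non-relevant items from a previous checkpoint) and their embeddings are appended to the candidate set for every query in the batch:

```python
import torch
import torch.nn.functional as F

def loss_with_hard_negatives(q, i_pos, i_hard, tau: float = 0.05):
    """
    q:      (B, d) query embeddings
    i_pos:  (B, d) positive item embeddings
    i_hard: (H, d) embeddings of mined hard negatives, shared across the batch
    """
    candidates = torch.cat([i_pos, i_hard], dim=0)        # (B + H, d)
    logits = q @ candidates.T / tau                       # (B, B + H)
    labels = torch.arange(q.size(0), device=q.device)     # positives stay on the diagonal
    return F.cross_entropy(logits, labels)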
Two-stage architecture
In production systems, two-tower is almost always the retrieval stage, followed by a cross-encoder ranker:
- Retrieval (recall-oriented): two-tower returns the top-K (e.g., 1000) candidates from millions of items in <10 ms via ANN (see the lookup sketch after this list).
- Ranking (precision-oriented): cross-encoder or feature-rich tree model ranks the K candidates with full feature interactions.
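A minimal sketch of the retrieval step with FAISS (one possible ANN library, assumed here); an exact inner-product index is used for brevity where a production system would swap in an approximate one (HNSW, IVF-PQ):

```python
import faiss
import numpy as np

d = 128
item_emb = np.random.rand(100_000, d).astype("float32")   # offline: item-tower output for the catalog (random stand-in)
faiss.normalize_L2(item_emb)                               # normalize so inner product == cosine

index = faiss.IndexFlatIP(d)        # exact inner product; replace with an approximate index at scale
index.add(item_emb)

q = np.random.rand(1, d).astype("float32")                 # online: query-tower output for one request
faiss.normalize_L2(q)
scores, ids = index.search(q, 1000)                        # top-K candidates handed to the ranking stage
```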
Tradeoffs vs. cross-encoder
| Property | Two-tower | Cross-encoder |
|---|---|---|
| Latency at scale | sub-linear (ANN) | linear in catalog |
| Quality | lower (no query-item interactions) | higher |
| Memory | one vector per item | none (recomputed per query) |
| Use case | retrieval | reranking |
Common pitfalls
- Using two-tower for ranking when accuracy matters. Lacks fine-grained feature interactions.
- Ignoring negative sampling bias. In-batch sampled softmax favors popular items; always combine with importance correction or popularity de-biasing.
- Forgetting to refresh item embeddings. When the item tower changes (new training run), all item embeddings must be re-encoded and re-indexed. Plan for periodic offline re-embedding.
- Comparing dot vs. cosine inconsistently. Pick one (usually L2-normalized + dot) and use it everywhere.
Related
- Embedding spaces. Vector representations and indexing.
- RAG overview. Retrieval-augmented generation uses two-tower for the retrieval step.