Two-tower retrieval

Encode queries and items with separate networks into a shared embedding space; retrieve by approximate nearest neighbors. The default architecture for industrial recommenders and search.

One-line definition

A two-tower model encodes the query (or user) and the item with two independent neural networks (“towers”) into a shared embedding space, scores them with a dot product or cosine, and is trained with a contrastive or sampled-softmax loss so that positive pairs score higher than negatives.

Why it matters

Two-tower (a.k.a. dual-encoder) is the dominant architecture for the retrieval stage of large-scale ranking systems: web search, e-commerce search, YouTube recommendations, ad targeting, dense passage retrieval for RAG, semantic search.

The structural advantage: item embeddings can be precomputed once per item and indexed. At query time you only run the query tower (cheap) and do an approximate-nearest-neighbor lookup (sub-linear in catalog size). Cross-encoders (where the query and item are concatenated and fed through a single network) cannot be precomputed, and are typically 100–10,000× too slow for retrieval at scale.
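
As an illustration of that split, here is a minimal NumPy sketch, with brute-force dot products standing in for a real ANN index (FAISS, ScaNN, HNSW) and random vectors standing in for trained tower outputs; all sizes are illustrative:

import numpy as np

# --- Offline: run the item tower once per item and index the results. ---
num_items, d = 100_000, 128
item_emb = np.random.randn(num_items, d).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)   # L2-normalize

# --- Online: per query, run only the query tower, then look up neighbors. ---
q = np.random.randn(d).astype(np.float32)
q /= np.linalg.norm(q)

scores = item_emb @ q                           # brute force; an ANN index makes this sub-linear
top_k = np.argpartition(-scores, 100)[:100]     # top-100 candidate indices (unordered)
top_k = top_k[np.argsort(-scores[top_k])]       # sorted best-first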

Architecture

query  → query_tower  → q ∈ R^d
item   → item_tower   → i ∈ R^d
score  = q · i  (or cosine)
  • Towers: typically transformers, MLPs, or a mix. The towers usually do not share weights, since the two sides consume different input modalities or feature sets.
  • Embedding dim: 64–512 in production. Higher is more expressive; lower is faster to index and more cache-friendly.
  • Output normalization: L2-normalize so the dot product equals cosine; this lets the index run in inner-product mode (see embedding spaces).
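
A minimal PyTorch sketch of the layout above, assuming simple MLP towers over fixed-size feature vectors (the feature dims, hidden width, and embedding dim of 128 are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    # A small MLP encoder; in practice each tower is often a transformer.
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # L2-normalize: dot product == cosine

query_tower = Tower(in_dim=64)    # query/user features
item_tower = Tower(in_dim=300)    # item features; no weight sharing across towers

q = query_tower(torch.randn(8, 64))    # (batch, 128)
i = item_tower(torch.randn(8, 300))    # (batch, 128)
scores = (q * i).sum(dim=-1)           # dot-product score per (query, item) pair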

Training

Standard losses:

In-batch sampled softmax

For a batch of B positive (query, item) pairs, treat the other B − 1 items in the batch as negatives. Loss per query q with positive item i⁺:

L(q) = -log( exp(q · i⁺) / Σ_j exp(q · i_j) ),  with j running over all B items in the batch.

Cheap, parallelizes well, but biases toward popular items (popular items appear as negatives more often).
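
A sketch of this loss in PyTorch, assuming q and i are the batch's L2-normalized query and item embeddings for the positive pairs; the temperature is an illustrative hyperparameter:

import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q, i, temperature=0.05):
    # q, i: (B, d). Row b of the logit matrix scores query b against every item in the batch.
    logits = q @ i.T / temperature
    labels = torch.arange(q.size(0), device=q.device)   # each query's positive is the diagonal
    return F.cross_entropy(logits, labels)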

Importance-corrected sampled softmax

Correct the in-batch sampling bias by subtracting log p(item), the item's estimated sampling probability, from each negative's logit. Standard in YouTube's two-tower retrieval model (Yi et al., 2019).
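
Continuing the sketch above, the correction subtracts log p(item) from the in-batch logits before the softmax (Yi et al. apply it to every in-batch item's logit); how item_prob is estimated, e.g. via streaming frequency counts, is left out here:

import torch
import torch.nn.functional as F

def corrected_in_batch_softmax_loss(q, i, item_prob, temperature=0.05):
    # item_prob: (B,) estimated probability of each batch item being sampled into a batch.
    logits = q @ i.T / temperature
    logits = logits - torch.log(item_prob).unsqueeze(0)  # logQ correction, applied per column
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)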

Hard negative mining

Sample hard negatives (high-scoring but incorrect items) explicitly. More expensive but improves quality, especially after the model is past the easy-negatives stage.
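
One common mining recipe, sketched in NumPy: score a candidate pool with the current model's embeddings, mask out known positives, and keep the highest-scoring remainder as hard negatives for the next round of training. Pool size and num_hard are illustrative choices:

import numpy as np

def mine_hard_negatives(query_emb, pool_emb, positive_rows, num_hard=10):
    # query_emb: (d,), pool_emb: (num_pool, d), positive_rows: indices of known positives.
    scores = pool_emb @ query_emb
    scores[list(positive_rows)] = -np.inf            # never return a known positive
    hard = np.argpartition(-scores, num_hard)[:num_hard]
    return hard[np.argsort(-scores[hard])]           # highest-scoring wrong items first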

Two-stage architecture

In production systems, two-tower is almost always the retrieval stage, followed by a cross-encoder ranker:

  1. Retrieval (recall-oriented): two-tower returns top-K (e.g., 1000) candidates from millions of items in <10 ms via ANN.
  2. Ranking (precision-oriented): cross-encoder or feature-rich tree model ranks the K candidates with full feature interactions.
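
A sketch of the two stages glued together, again with a brute-force NumPy lookup standing in for the ANN index; cross_encoder_score here is a hypothetical callable representing the heavier ranking model:

import numpy as np

def retrieve_then_rank(query_feats, query_tower, item_emb, item_ids,
                       cross_encoder_score, k=1000, final_n=20):
    # Stage 1: retrieval. Encode the query and pull the top-k candidates by dot product
    # (a production system replaces this matmul with an ANN index lookup).
    q = query_tower(query_feats)
    scores = item_emb @ q
    candidates = np.argpartition(-scores, k)[:k]

    # Stage 2: ranking. Re-score only the k candidates with the expensive model,
    # which can use full query-item feature interactions.
    ranked = sorted(candidates,
                    key=lambda r: cross_encoder_score(query_feats, item_ids[r]),
                    reverse=True)
    return [item_ids[r] for r in ranked[:final_n]]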

Tradeoffs vs. cross-encoder

Property            Two-tower                             Cross-encoder
Latency at scale    sub-linear (ANN)                      linear in catalog
Quality             lower (no query-item interactions)    higher
Memory              one vector per item                   none (recomputed per query)
Use case            retrieval                             reranking

Common pitfalls

  • Using two-tower for ranking when accuracy matters. Lacks fine-grained feature interactions.
  • Ignoring negative sampling bias. In-batch sampled softmax favors popular items; always combine with importance correction or popularity de-biasing.
  • Forgetting to refresh item embeddings. When the item tower changes (new training run), all item embeddings must be re-encoded and re-indexed. Plan for periodic offline re-embedding.
  • Comparing dot vs. cosine inconsistently. Pick one (usually L2-normalized + dot) and use it everywhere.

Related

  • Embedding spaces. Vector representations and indexing.
  • RAG overview. Retrieval-augmented generation uses a two-tower model for the retrieval step.