Two-tower retrieval

Encode queries and items with separate networks into a shared embedding space; retrieve by approximate nearest neighbors. The default architecture for industrial recommenders and search.

One-line definition

A two-tower model encodes the query (or user) and the item with two independent neural networks (“towers”) into a shared embedding space, scores them with a dot product or cosine, and is trained with a contrastive or sampled-softmax loss so that positive pairs score higher than negatives.

Why it matters

Two-tower (a.k.a. dual-encoder) is the dominant architecture for the retrieval stage of large-scale ranking systems: web search, e-commerce search, YouTube recommendations, ad targeting, dense passage retrieval for RAG, semantic search.

The structural advantage: item embeddings can be precomputed once per item and indexed. At query time you only run the query tower (cheap) and do an approximate-nearest-neighbor lookup (sub-linear in catalog size). Cross-encoders (where the query and item are concatenated and fed through a single network) cannot be precomputed, and are typically 100–10,000× too slow for retrieval at scale.
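
As an illustration of that split, here is a minimal NumPy sketch, with brute-force dot products standing in for a real ANN index (FAISS, ScaNN, HNSW) and random vectors standing in for trained tower outputs; all sizes are illustrative:

import numpy as np

# --- Offline: run the item tower once per item and index the results. ---
num_items, d = 100_000, 128
item_emb = np.random.randn(num_items, d).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)   # L2-normalize

# --- Online: per query, run only the query tower, then look up neighbors. ---
q = np.random.randn(d).astype(np.float32)
q /= np.linalg.norm(q)

scores = item_emb @ q                           # brute force; an ANN index makes this sub-linear
top_k = np.argpartition(-scores, 100)[:100]     # top-100 candidate indices (unordered)
top_k = top_k[np.argsort(-scores[top_k])]       # sorted best-first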

Architecture

query  → query_tower  → q ∈ R^d
item   → item_tower   → i ∈ R^d
score  = q · i  (or cosine)
  • Towers: typically transformers, MLPs, or a mix. The towers usually do not share weights, since the two sides consume different input modalities or feature sets.
  • Embedding dim: 64–512 in production. Higher is more expressive; lower is faster to index and more cache-friendly.
  • Output normalization: L2-normalize so the dot product equals cosine; this lets the index run in inner-product mode (see embedding spaces).
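
A minimal PyTorch sketch of the layout above, assuming simple MLP towers over fixed-size feature vectors (the feature dims, hidden width, and embedding dim of 128 are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    # A small MLP encoder; in practice each tower is often a transformer.
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # L2-normalize: dot product == cosine

query_tower = Tower(in_dim=64)    # query/user features
item_tower = Tower(in_dim=300)    # item features; no weight sharing across towers

q = query_tower(torch.randn(8, 64))    # (batch, 128)
i = item_tower(torch.randn(8, 300))    # (batch, 128)
scores = (q * i).sum(dim=-1)           # dot-product score per (query, item) pair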

Training

Standard losses:

In-batch sampled softmax

For a batch of B positive (query, item) pairs, treat the other B − 1 items in the batch as negatives. Loss per query q with positive item i⁺:

L(q) = -log( exp(q · i⁺) / Σ_j exp(q · i_j) ),  with j running over all B items in the batch.

Cheap, parallelizes well, but biases toward popular items (popular items appear as negatives more often).
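
A sketch of this loss in PyTorch, assuming q and i are the batch's L2-normalized query and item embeddings for the positive pairs; the temperature is an illustrative hyperparameter:

import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q, i, temperature=0.05):
    # q, i: (B, d). Row b of the logit matrix scores query b against every item in the batch.
    logits = q @ i.T / temperature
    labels = torch.arange(q.size(0), device=q.device)   # each query's positive is the diagonal
    return F.cross_entropy(logits, labels)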

Importance-corrected sampled softmax

Correct the in-batch sampling bias by subtracting log p(item), the item's estimated sampling probability, from each negative's logit. Standard in YouTube's two-tower retrieval model (Yi et al., 2019).
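
Continuing the sketch above, the correction subtracts log p(item) from the in-batch logits before the softmax (Yi et al. apply it to every in-batch item's logit); how item_prob is estimated, e.g. via streaming frequency counts, is left out here:

import torch
import torch.nn.functional as F

def corrected_in_batch_softmax_loss(q, i, item_prob, temperature=0.05):
    # item_prob: (B,) estimated probability of each batch item being sampled into a batch.
    logits = q @ i.T / temperature
    logits = logits - torch.log(item_prob).unsqueeze(0)  # logQ correction, applied per column
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)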

Hard negative mining

Sample hard negatives (high-scoring but incorrect items) explicitly. More expensive but improves quality, especially after the model is past the easy-negatives stage.
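
One common mining recipe, sketched in NumPy: score a candidate pool with the current model's embeddings, mask out known positives, and keep the highest-scoring remainder as hard negatives for the next round of training. Pool size and num_hard are illustrative choices:

import numpy as np

def mine_hard_negatives(query_emb, pool_emb, positive_rows, num_hard=10):
    # query_emb: (d,), pool_emb: (num_pool, d), positive_rows: indices of known positives.
    scores = pool_emb @ query_emb
    scores[list(positive_rows)] = -np.inf            # never return a known positive
    hard = np.argpartition(-scores, num_hard)[:num_hard]
    return hard[np.argsort(-scores[hard])]           # highest-scoring wrong items first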

Two-stage architecture

In production systems, two-tower is almost always the retrieval stage, followed by a cross-encoder ranker:

  1. Retrieval (recall-oriented): two-tower returns top-K (e.g., 1000) candidates from millions of items in <10 ms via ANN.
  2. Ranking (precision-oriented): cross-encoder or feature-rich tree model ranks the K candidates with full feature interactions.
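
A sketch of the two stages glued together, again with a brute-force NumPy lookup standing in for the ANN index; cross_encoder_score here is a hypothetical callable representing the heavier ranking model:

import numpy as np

def retrieve_then_rank(query_feats, query_tower, item_emb, item_ids,
                       cross_encoder_score, k=1000, final_n=20):
    # Stage 1: retrieval. Encode the query and pull the top-k candidates by dot product
    # (a production system replaces this matmul with an ANN index lookup).
    q = query_tower(query_feats)
    scores = item_emb @ q
    candidates = np.argpartition(-scores, k)[:k]

    # Stage 2: ranking. Re-score only the k candidates with the expensive model,
    # which can use full query-item feature interactions.
    ranked = sorted(candidates,
                    key=lambda r: cross_encoder_score(query_feats, item_ids[r]),
                    reverse=True)
    return [item_ids[r] for r in ranked[:final_n]]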

Tradeoffs vs. cross-encoder

Property            Two-tower                             Cross-encoder
Latency at scale    sub-linear (ANN)                      linear in catalog
Quality             lower (no query-item interactions)    higher
Memory              one vector per item                   none (recomputed per query)
Use case            retrieval                             reranking

Common pitfalls

  • Using two-tower for ranking when accuracy matters. Lacks fine-grained feature interactions.
  • Ignoring negative sampling bias. In-batch sampled softmax favors popular items; always combine with importance correction or popularity de-biasing.
  • Forgetting to refresh item embeddings. When the item tower changes (new training run), all item embeddings must be re-encoded and re-indexed. Plan for periodic offline re-embedding.
  • Comparing dot vs. cosine inconsistently. Pick one (usually L2-normalized + dot) and use it everywhere.

Related

  • Embedding spaces. Vector representations and indexing.
  • RAG overview. Retrieval-augmented generation uses a two-tower model for the retrieval step.