System design case study: building personalized search ranking

An end-to-end design of a personalized search ranking system at scale, from problem framing through deployment and monitoring. The same template works for most ML system design interviews.


A personalized search ranking system at scale, structured as: problem framing, architecture, evaluation, operations. The structure generalizes to most ML system design problems.

1. Framing the problem

Before any architecture, clarify:

  • What surface? A consumer product’s main search? A vertical search (jobs, videos, products)? An enterprise search? Each has very different scale, latency, and quality bars.
  • Who’s the user? B2C with millions of users? B2B with thousands of accounts? Each user with a clear context, or strangers querying once?
  • What does “good ranking” mean? Click-through rate? Long-term engagement? Task completion? User-stated satisfaction? These point to different model architectures and labels.
  • What’s the scale? Number of items in the catalog (thousands? billions?), queries per second, latency budget (10ms? 100ms? 500ms?), cost budget.
  • What’s the personalization signal? Do we have user history? Demographics? Real-time context? How much can we use without privacy issues?

For this case study, assume:

  • A consumer streaming product (like Netflix or Spotify), main search bar.
  • Catalog: ~10M items.
  • Users: ~100M MAU.
  • Queries: ~10K QPS.
  • Latency: p99 < 200ms (search needs to feel instant).
  • Optimizing for: long-term engagement (a combination of immediate clicks, watch time, and next-week return).

2. Top-level architecture

Search ranking uses a multi-stage retrieve-then-rank pipeline. Why:

  • Scoring 10M candidates per query at production latency is infeasible with anything but the simplest models.
  • The top 10 results matter; positions 100-10M are noise.
  • The information you can extract from a query is limited; for 99% of items, “this is irrelevant” can be decided cheaply. For the top 1000, you want to invest more compute.

So:

Query → [Stage 1: Candidate Generation] → ~5K candidates →
        [Stage 2: First-pass Ranking]    → ~500 candidates →
        [Stage 3: Final Ranking]         → ~50 candidates →
        [Stage 4: Re-ranking / Diversification] → top 10 → User

Each stage uses a more expensive model on a smaller set of candidates. The stage budgets below sum to roughly 110ms, well inside the 200ms p99 target.
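
As a minimal sketch (the per-stage functions named here are hypothetical, each assumed to return items in scored order), the funnel is just a sequence of score-and-truncate steps:

    # Sketch only: generate_candidates, first_pass_rank, final_rank, and rerank
    # are illustrative stage functions, not a real API.
    def search(query, user):
        candidates = generate_candidates(query, user)                  # ~10M catalog -> ~5K, ~5ms
        candidates = first_pass_rank(query, user, candidates)[:500]    # ~5K -> ~500, ~30ms
        candidates = final_rank(query, user, candidates)[:50]          # ~500 -> ~50, ~50ms
        return rerank(query, user, candidates)[:10]                    # diversify, apply rules, ~20ms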

3. Stage 1: Candidate generation (~5ms budget)

Reduce 10M items to ~5K. Multiple parallel sources:

Lexical retrieval (BM25)

  • Standard inverted index over titles, descriptions, transcripts, metadata.
  • Handles exact-match queries and named entities (e.g. exact title lookups).
  • Critical for queries where embeddings underperform (rare terms, out-of-distribution words, exact title lookups).

Embedding-based retrieval (two-tower)

  • Pre-compute embeddings for all items via a content tower (text + metadata + popularity features).
  • At query time, compute query embedding via a query tower (query text + user features).
  • Approximate nearest-neighbor lookup (HNSW, ScaNN) in 1-5ms over the full catalog.
  • Captures semantic similarity that BM25 misses.
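
A minimal sketch of this retrieval path using FAISS's HNSW index. The query tower and the pre-computed item embedding matrix are assumed to exist; random vectors stand in for them here:

    import numpy as np
    import faiss

    dim = 128                                                           # embedding size (assumed)
    item_embeddings = np.random.rand(100_000, dim).astype("float32")   # stand-in for the item tower output

    index = faiss.IndexHNSWFlat(dim, 32)   # HNSW graph, 32 links per node
    index.add(item_embeddings)             # built offline, not per query

    def embedding_retrieval(query_embedding, k=1000):
        # query_embedding: output of the query tower (query text + user features)
        distances, item_ids = index.search(query_embedding.reshape(1, -1).astype("float32"), k)
        return list(item_ids[0])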

User-personalized retrieval

  • For users with history: sequence model over recent watches/clicks generates a “user state” embedding.
  • ANN lookup against item embeddings to retrieve “items similar to your recent activity.”
  • Independent of the query, generates a candidate set that biases toward what this user is likely to want.

Popularity-based retrieval

  • Top-K most-clicked items overall and per-segment (genre, region, demographic).
  • Catches what’s hot regardless of personalization.

Hybrid combination: union the candidate lists from each source. ~5K candidates total. Order doesn’t matter at this stage; the next stage will re-score.
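
A minimal sketch of the union, assuming each source has already returned a list of item ids (the source names are illustrative):

    def union_candidates(sources, cap=5000):
        # sources: {"lexical": [...], "embedding": [...], "user_history": [...], "popularity": [...]}
        # Deduplicate and remember which sources proposed each item; that membership
        # (and the per-source scores) becomes a feature for the first-pass ranker.
        merged = {}
        for source_name, item_ids in sources.items():
            for item_id in item_ids:
                merged.setdefault(item_id, set()).add(source_name)
        return dict(list(merged.items())[:cap])  # order is irrelevant at this stage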

Why this design

  • Each source has distinct failure modes. Combining them reduces tail failure (queries where one source returns nothing usable).
  • Pre-computed embeddings + ANN gives sub-10ms scaling to billions of items.
  • Lexical retrieval is fast and handles cases embeddings don’t.
  • User-personalized retrieval brings in items the user might want even if they didn’t query for them precisely.

4. Stage 2: First-pass ranking (~30ms budget)

Score 5K candidates with a moderately-expensive model. Goal: cut to ~500.

A typical model here:

  • Architecture: a lightweight neural ranker (DCN-V2, or a small transformer over feature embeddings).
  • Features: query text, item metadata, user features, retrieval signals (which sources returned this item with what scores), recent interaction context.
  • Output: a single relevance score per (query, item) pair.
  • Inference: batched scoring, ~30ms for 5K candidates.

This stage doesn’t need to be perfect; it just needs to keep the right items within its top ~500. Recall@500 is the metric to optimize here, not NDCG.
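
A minimal Recall@K sketch for evaluating this stage offline (the relevant set could come from the final ranker's choices or from rated data):

    def recall_at_k(predicted_ids, relevant_ids, k=500):
        # Fraction of truly relevant items that survive into this stage's top-k.
        top_k = set(predicted_ids[:k])
        relevant = set(relevant_ids)
        return len(top_k & relevant) / max(len(relevant), 1)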

5. Stage 3: Final ranking (~50ms budget)

Score 500 candidates with a more expensive model. Goal: cut to ~50.

Differences from stage 2:

  • Larger model with more capacity and richer features.
  • Multi-task: predict multiple targets simultaneously (immediate click probability, watch duration, completion probability, next-day return probability).
  • Combine multiple targets into a final score, possibly with calibration.

Modern recipe: a transformer over (query, user_history, candidate_features), producing per-task scores combined into a final ranking score.
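
A minimal PyTorch sketch of the multi-task head structure. A small MLP stands in for the shared encoder (the recipe above uses a transformer over the feature sequence), and the task list is abbreviated:

    import torch
    from torch import nn

    class MultiTaskRanker(nn.Module):
        """Shared encoder with one head per task; sketch only, not the production architecture."""
        def __init__(self, feature_dim, tasks=("click", "watch_complete", "next_day_return")):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(feature_dim, 256), nn.ReLU(),
                nn.Linear(256, 128), nn.ReLU(),
            )
            self.heads = nn.ModuleDict({task: nn.Linear(128, 1) for task in tasks})

        def forward(self, features):
            hidden = self.encoder(features)
            # One probability per task; these are later combined into the final ranking score.
            return {task: torch.sigmoid(head(hidden)).squeeze(-1) for task, head in self.heads.items()}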

6. Stage 4: Re-ranking and diversification (~20ms budget)

The top 50 from stage 3 are individually good but might form a bad list. A re-ranking step ensures:

  • Diversity: not 10 results from the same artist or franchise.
  • Calibrated representation: the right mix of categories given the query.
  • Freshness: some boost for recent content.
  • Business rules: licensing, regional availability, content-policy constraints.

Algorithm: typically a small heuristic re-rank (DPP-based diversity, or simple maximal marginal relevance), followed by hard rule filtering. Output: top 10 results.
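
A minimal maximal-marginal-relevance sketch; the relevance scores come from stage 3, and the pairwise similarity function (e.g. cosine over item embeddings) is assumed:

    def mmr_rerank(items, relevance, similarity, k=10, lam=0.7):
        # items: candidate ids; relevance: id -> stage-3 score; similarity: (id, id) -> [0, 1]
        selected = []
        remaining = list(items)
        while remaining and len(selected) < k:
            def mmr_score(i):
                redundancy = max((similarity(i, j) for j in selected), default=0.0)
                return lam * relevance[i] - (1 - lam) * redundancy
            best = max(remaining, key=mmr_score)
            selected.append(best)
            remaining.remove(best)
        return selected  # hard business-rule filters apply afterwards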

7. Labels and training

The trickiest part of search ranking is what to predict. Choices:

Implicit feedback (clicks, watches)

  • Abundant but heavily biased: we only see what we showed. Result: a model trained on observational data can learn the existing system’s biases.
  • Mitigation: inverse propensity weighting (IPW) on examined positions, counterfactual policy evaluation.
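
A minimal sketch of position-based IPW applied to the click loss; the per-position examination propensities are assumed to be estimated separately (e.g. from randomization experiments):

    import torch
    import torch.nn.functional as F

    def ipw_click_loss(logits, clicks, positions, examine_prob_by_position):
        # Up-weight each logged example by 1 / P(examined at the position it was shown),
        # so items shown low on the page are not treated as confident negatives.
        weights = 1.0 / examine_prob_by_position[positions]
        return F.binary_cross_entropy_with_logits(logits, clicks.float(), weight=weights)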

Explicit feedback (ratings, surveys)

  • High quality but very sparse.
  • Use as auxiliary training signal or for evaluation.

Long-term outcomes (next-day return, subscription retention)

  • Most aligned with business but hardest to predict directly.
  • Multi-task: predict short-term proxies, combine using long-term-aware weighting.

A common production recipe:

  • Multi-task ranker predicts click, watch_complete, watch_30s, next_day_return, etc.
  • Combine into a final score: final = sum(w_i * P(task_i)) with weights tuned offline by counterfactual eval and confirmed in A/B.
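
A minimal sketch of that combination; the weights below are placeholders, not recommended values:

    # Placeholder weights: tuned offline by counterfactual evaluation, then confirmed in A/B.
    TASK_WEIGHTS = {"click": 0.2, "watch_30s": 0.2, "watch_complete": 0.3, "next_day_return": 0.3}

    def final_score(task_probs):
        # task_probs: calibrated P(task | query, user, item) from the multi-task ranker.
        return sum(weight * task_probs[task] for task, weight in TASK_WEIGHTS.items())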

8. Cold-start

Two cold-start problems:

New users (no history)

  • Strategy: start with popularity / segment-based rankings. Apply minimal personalization.
  • Quickly accumulate signal in the first few sessions; transition to full personalization after ~10 interactions.

New items (no engagement signal)

  • Strategy: rely on content-based features (title, description, embedding from text/audio/visual).
  • Boost new items in early hours to gather initial signal.
  • After enough interactions accumulate, the model uses both content and engagement signals.

9. Evaluation

Offline metrics

  • NDCG@K, MRR, Recall@K on a held-out set of (query, ideal item) pairs (NDCG@K is sketched after this list).
  • Counterfactual estimators (IPS, doubly robust) for unbiased estimates on logged data.
  • Caveat: offline metrics on observational data are unreliable. Treat as sanity checks, not as decision metrics.
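
A minimal NDCG@K sketch over graded relevance labels, for reference:

    import math

    def dcg_at_k(relevances, k):
        # relevances: graded labels in ranked order; standard log2 position discount.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

    def ndcg_at_k(ranked_relevances, k=10):
        ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
        return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0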

Online metrics

  • Primary: long-term satisfaction proxy (next-week return rate).
  • Secondary: immediate engagement (CTR, watch time on retrieved items).
  • Guardrails: latency p99, cost per query, fairness slices (per-creator-type, per-region, per-demographic), diversity metrics.
  • A/B testing: minimum-detectable-effect calculation upfront; pre-committed sample size; multiple-comparison correction across slices.
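
A minimal sketch of the up-front sample-size calculation for a proportion metric, using the standard two-proportion normal approximation; α = 0.05 and 80% power are assumed defaults:

    import math
    from scipy.stats import norm

    def samples_per_arm(baseline_rate, relative_mde, alpha=0.05, power=0.8):
        # Per-arm sample size to detect a relative lift of `relative_mde` in a
        # proportion metric (e.g. next-week return rate), two-sided test.
        p1 = baseline_rate
        p2 = baseline_rate * (1 + relative_mde)
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

    # e.g. samples_per_arm(0.40, 0.01) is on the order of a few hundred thousand users per arm.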

Human evaluation

  • Quarterly rated relevance evaluations by trained raters on sampled queries.
  • Used to calibrate the offline metrics and to detect systematic issues.

10. Serving infrastructure

Real-time path

  • Query understanding (spell correction, query rewriting): <5ms.
  • Stage 1 (candidate generation): ~5ms.
  • Stage 2 (first-pass ranking): ~30ms.
  • Stage 3 (final ranking): ~50ms.
  • Stage 4 (re-ranking): ~20ms.
  • Total: ~110ms; p99 budget of 200ms gives ~90ms headroom.

Feature store

  • User features computed in batch (daily) plus real-time updates for the most recent activity.
  • Item features computed in batch when content is added or updated.
  • Query features computed at request time.

Index management

  • Item embeddings re-computed weekly (or on content update).
  • ANN index rebuilt nightly.
  • Lexical index updated continuously as items are added.

Model deployment

  • Models versioned with full lineage (training data hash, code hash, hyperparameters).
  • Shadow deployment for new models: run alongside production, compare predictions, no user impact.
  • Gradual rollout: 1% → 10% → 50% → 100% with monitoring at each step.
  • Automatic rollback on guardrail regression.

11. Monitoring

What to monitor:

Quality

  • Online primary metric and guardrails (compared against control).
  • Offline-online correlation (is the offline eval predicting online correctly?).
  • Failure-mode tracking: queries with no clicks, queries with regret (immediate query reformulation).

System

  • Latency per stage at p50 / p95 / p99.
  • Cost per query.
  • Cache hit rates.
  • Index freshness.

Distribution

  • Query distribution shift (are users asking new things?).
  • Content distribution shift (are new items being added at expected rates?).
  • User behavior shift (is engagement pattern changing?).

Failure modes

  • Stage failures (timeout, exception): graceful degradation to simpler ranking (see the sketch after this list).
  • Near-empty results (queries returning fewer than 5 results): often indicates a real bug.
  • Latency spikes correlated with index updates or feature pipeline lag.
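
A minimal sketch of the degradation path for a failing stage; the stage call, its timeout parameter, and the metric hook are illustrative:

    def final_rank_with_fallback(query, user, candidates):
        # If the expensive stage errors or times out, serve the first-pass order
        # (or a popularity ranking) rather than failing the whole query.
        try:
            return final_rank(query, user, candidates, timeout_ms=50)  # hypothetical stage call
        except Exception:  # includes timeouts raised by the stage client
            emit_metric("final_rank.fallback")  # hypothetical monitoring hook
            return candidates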

12. The hard problems

A few areas that are hard and tend to demand the most senior judgment on a project like this:

Feedback loops

Today’s recommendations create tomorrow’s training data. The system can self-reinforce a narrow distribution: items that get shown get clicked; clicks become training labels; the model learns to show those items more. Over time, diversity collapses.

Mitigations:

  • Explicit diversity terms in re-ranking.
  • Exploration in candidate generation (epsilon-greedy or Thompson sampling on a small fraction of impressions; sketched after this list).
  • Counterfactual augmentation: train the model on what would have happened if it had shown a wider distribution.
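
A minimal epsilon-greedy sketch for the exploration mitigation above; epsilon and the slot choice are illustrative:

    import random

    def with_exploration(ranked_items, candidate_pool, epsilon=0.02, slot=9):
        # With small probability, swap one low-position slot for a random unexposed
        # candidate so the logs contain feedback outside the current policy's picks.
        results = list(ranked_items)
        if random.random() < epsilon and len(results) > slot:
            unexposed = [c for c in candidate_pool if c not in results]
            if unexposed:
                results[slot] = random.choice(unexposed)
        return results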

Long-term vs short-term trade-offs

Optimizing immediate clicks can lead to clickbait, dissatisfaction, and long-term decline. Optimizing long-term metrics is hard because they’re sparse and slow.

Mitigations:

  • Multi-task with long-term proxies (next-week return, survey responses).
  • Online experiments with long enough horizons to detect long-term effects (often 4-8 weeks).
  • Periodic “diversity injection” to ensure the system doesn’t trap itself.

Calibration across heads

In multi-task ranking, each head is trained on a different positive/negative ratio, so the raw scores land on different scales. Without explicit calibration, the head with the largest or noisiest scores dominates the final combination.

Mitigation: per-head calibration (Platt scaling or isotonic regression) on a held-out set before combining scores.
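
A minimal per-head calibration sketch using scikit-learn's isotonic regression, fit on held-out data before scores are combined:

    from sklearn.isotonic import IsotonicRegression

    def fit_head_calibrators(valid_scores_by_head, valid_labels_by_head):
        # One monotonic calibrator per task head, fit on held-out (raw score, binary label) pairs.
        calibrators = {}
        for head, scores in valid_scores_by_head.items():
            calibrator = IsotonicRegression(out_of_bounds="clip")
            calibrator.fit(scores, valid_labels_by_head[head])
            calibrators[head] = calibrator
        return calibrators

    # Serving: calibrated = calibrators["click"].predict([raw_click_score])[0]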

Cold-start distribution

The hardest candidate-generation cases sit in the long tail (rare items and new users), which is exactly where the model has the least signal. The cold-start tail absorbs disproportionate engineering effort relative to its share of traffic, but it is also where the competitive moat lives.

Mitigation: explicit content-based signals; treat cold-start as a separate model with separate training data (metadata-rich items only, with engagement signal artificially excluded).

13. Interview probing areas

If this came up in an interview, the interviewer would push on:

  • “Why two-stage and not one-stage?”: latency math, recall vs precision trade-off.
  • “How do you train the candidate generator?”: in-batch negatives + hard negatives + distillation from the full ranker.
  • “What’s the eval bar for shipping?”: primary metric significant lift, no guardrail regressions, no slice regressions in critical segments, novelty effect decayed.
  • “Tell me about a failure mode you’d worry about.”: feedback loops, calibration drift, distribution shift in either content or query distribution.
  • “What if your offline eval and online A/B disagree?”: trust online; investigate why offline misled; rebuild offline eval on production-like distribution.

A senior candidate has clear answers to all of these. A staff candidate also discusses organizational concerns: how the team’s metric definitions get aligned with product strategy; how model release cadence interacts with the eval pipeline; how to onboard new team members on this system.

14. What to build next (roadmap)

Even after the v1 ships:

  • v1.5: better cold-start for new users (using content-based signals more aggressively in early sessions).
  • v2: query understanding (LLM-powered query rewriting, intent classification, multi-step queries).
  • v2.5: cross-surface personalization (use signals from related surfaces, such as what the user clicked on the home page, to inform search).
  • v3: generative re-ranking (LLM that takes top 20 results and re-orders / generates explanations).
  • v3.5: agentic search for complex queries (multi-step tool-using agent for queries that require synthesis across multiple items).

The specific roadmap depends on what’s working and what isn’t, but having a multi-quarter view of where the system is going is itself a senior signal.


This is the depth to aim for in a real senior interview. The interviewer doesn’t need every detail; they need to see that you’ve actually thought about the system, not just listed its components.

If you can structure a 45-minute conversation around these 14 layers and answer follow-ups on any one of them, you’re operating at the senior-staff bar.


Related: Design YouTube’s recommender, Two-tower vs cross-encoder: when to use which?, A/B testing for ML systems.