Questions
The senior ML interview canon
50 questions across 8 categories. Each with what L4 / L5 / L6 answers actually sound like, the tells that get a strong-hire vote, and the tells that get you down-leveled.
ML Fundamentals (8)
- Bayesian vs frequentist: a practitioner's framing
The textbook distinction is philosophical. The practitioner distinction is whether you can sample from a posterior cheaply, and whether you need uncertainty for downstream decisions.
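A minimal sketch of that practitioner framing, assuming a conjugate Beta-Binomial model (numbers illustrative): the posterior is closed-form, sampling it is nearly free, and the samples answer a decision question a point estimate can't.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate Beta-Binomial: the posterior is available in closed form,
# so sampling it is essentially free.
alpha_prior, beta_prior = 1.0, 1.0   # uniform prior on a conversion rate
successes, trials = 43, 1000

posterior = rng.beta(alpha_prior + successes,
                     beta_prior + trials - successes,
                     size=10_000)

# Uncertainty feeds a downstream decision, e.g. "ship only if
# P(rate > 4%) is high enough" -- a question a point estimate can't answer.
print(f"P(rate > 0.04) = {(posterior > 0.04).mean():.3f}")
```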
- Explain backprop in your own words
The textbook answer is the chain rule. The senior answer is what backprop is doing as a system: a reverse-mode auto-diff pass that reuses intermediate computations to get all gradients in one extra forward-cost pass.
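To make "reuses intermediate computations" concrete, here is a hand-rolled reverse pass over a tiny two-layer function. Every cached forward value (`h_pre`, `h`) is read again instead of recomputed, which is why the whole backward sweep costs about one extra forward pass.

```python
import numpy as np

# Forward pass for loss = sum(tanh(x @ W1) @ W2); cache intermediates.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

h_pre = x @ W1          # cached
h = np.tanh(h_pre)      # cached
out = h @ W2
loss = out.sum()

# Reverse pass: one sweep, cost comparable to the forward pass,
# and every cached intermediate is reused rather than recomputed.
d_out = np.ones_like(out)         # dloss/dout
d_W2 = h.T @ d_out                # reuses cached h
d_h = d_out @ W2.T
d_h_pre = d_h * (1 - h ** 2)      # tanh' from cached h, not recomputed
d_W1 = x.T @ d_h_pre
d_x = d_h_pre @ W1.T              # all gradients from one backward sweep
```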
- How do you choose a learning rate?
The right answer is a procedure, not a number. The wrong answers are 'use the default' and 'try a few values.'
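One such procedure, sketched: an LR range test in the spirit of Smith (2017). Not necessarily the only acceptable answer, and `model`, `loader`, `loss_fn` are placeholders.

```python
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-7, lr_max=1.0, steps=100):
    """Sweep the learning rate exponentially over a few hundred steps,
    record the loss at each rate, and note where it starts to diverge;
    a common heuristic is to pick a rate 3-10x below that point."""
    opt = torch.optim.SGD(model.parameters(), lr=lr_min)
    mult = (lr_max / lr_min) ** (1.0 / steps)
    history, lr, data = [], lr_min, iter(loader)
    for _ in range(steps):
        x, y = next(data)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        history.append((lr, loss.item()))   # loss at the rate just used
        lr *= mult
        for g in opt.param_groups:          # raise the rate for the next step
            g["lr"] = lr
    return history
```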
- How do you choose a loss function?
The loss is the objective. Picking the wrong one means optimizing for the wrong thing, no matter how well you train. The senior answer derives the loss from the problem, not from a list.
- L1 vs L2 regularization, beyond the formula
To most candidates the math looks identical: a penalty term in the loss. The senior signal is the Bayesian interpretation, the optimization geometry, and when each is the right choice.
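The Bayesian interpretation in one worked equation (the standard MAP argument):

```latex
% MAP estimation with a prior p(w) turns the penalty into a log-prior:
\hat{w} = \arg\max_w \; \log p(y \mid X, w) + \log p(w)
% Gaussian prior  p(w) \propto e^{-\lambda \|w\|_2^2}  gives L2 (ridge);
% Laplace prior   p(w) \propto e^{-\lambda \|w\|_1}    gives L1 (lasso).
% The Laplace density's spike at zero is why L1 produces exact zeros.
```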
- Walk me through the bias-variance tradeoff
The classic warm-up question. The L4 answer is the formula; the L6 answer is what it tells you about model selection in production.
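The formula the L4 answer recites, for reference (the standard decomposition for squared error):

```latex
% For y = f(x) + \varepsilon with \mathbb{E}[\varepsilon] = 0 and
% \operatorname{Var}(\varepsilon) = \sigma^2, expected test error at x splits as
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```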
- When would you not use cross-validation?
Cross-validation is a tool, not a default. The senior answer names the cases where it's wrong, expensive, or misleading.
- Why does dropout work?
The trick is that there are three valid explanations and they all matter. Which ones you reach for tells the interviewer your level.
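One of those explanations, the train/test scaling mechanics, fits in a few lines. A minimal inverted-dropout sketch:

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    """Inverted dropout: scale activations by 1/(1 - p_drop) at train
    time so that inference is a plain forward pass with no rescaling."""
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)
```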
Deep Learning Production (6)
- Explain backprop through time
BPTT is just backprop on the unrolled computation graph of a recurrent network. The interview signal is whether you understand truncation and what it costs.
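A sketch of the truncation, assuming a GRU-style single-tensor hidden state; `model`, `loss_fn`, `seq`, and `targets` are placeholders:

```python
import torch

def tbptt_epoch(model, opt, loss_fn, seq, targets, k=32):
    """Truncated BPTT: run the RNN over k-step windows and detach the
    hidden state between them, so the graph (and memory) stays bounded.
    The cost: no gradient credit flows to anything more than k steps back."""
    h = None
    for x, y in zip(seq.split(k, dim=1), targets.split(k, dim=1)):
        out, h = model(x, h)
        loss = loss_fn(out, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h.detach()   # the truncation point: gradients stop here
```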
- How do you deal with class imbalance in 2026?
Class weighting and SMOTE are the textbook answers and often the wrong ones. The senior answer matches the technique to the imbalance ratio, the cost asymmetry, and the metric you actually care about.
- How would you debug a model that's not learning?
The 'tell me how you'd debug' question is a behavioral round in disguise. The interviewer is probing your debugging instinct, not testing facts.
- Mixed precision: what's actually happening?
Beyond 'use BF16'. The senior answer explains what stays in FP32, why loss scaling exists for FP16, and the memory split.
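A minimal PyTorch AMP sketch of those moving parts (one common setup, not the only one):

```python
import torch

# Matmuls run in FP16 inside autocast, while master weights and optimizer
# state stay in FP32. GradScaler is the loss scaling FP16 needs to keep
# small gradients from flushing to zero; BF16 usually doesn't need it.
scaler = torch.cuda.amp.GradScaler()

def train_step(model, opt, loss_fn, x, y):
    opt.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(opt)                # unscales grads, skips the step on inf/nan
    scaler.update()                 # adapts the scale factor over time
```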
- Walk me through how you'd train a 100B parameter model
The question is about parallelism and memory, not about modeling. The L6 answer combines data, tensor, pipeline, and FSDP/ZeRO sharding into a coherent strategy.
- Why does Adam sometimes generalize worse than SGD?
Adam usually trains faster but in some settings finds sharper minima with worse generalization. The senior answer names the regimes where this happens and the modern fixes.
LLM Systems (13)
- Build an LLM coding assistant from scratch
The architecture decision space is large: model choice, context retrieval, IDE integration, evals. The senior answer scopes the use case before any of it.
- Design a RAG system for legal documents
Legal RAG amplifies every standard RAG concern: precise citations, no hallucinations, regulated domain, dense documents with structure. The senior answer addresses each.
- Design a system for safe LLM deployment in healthcare
Healthcare adds three constraints on top of normal LLM deployment: regulatory compliance, low tolerance for harm, and a workflow that already has clinicians as the final decision-maker.
- Fine-tuning vs prompting: the deep version
Past the basic decision tree. The senior answer covers SFT, LoRA, DPO, continued pretraining, and the operational trade-offs each introduces.
- How do you A/B test a chatbot?
Chatbot A/B testing has all the hard parts of regular A/B testing plus delayed feedback, conversational state, and metrics that are hard to define.
- How do you evaluate an agent?
Agent eval is harder than chat eval because there are intermediate steps, tool calls, and long-horizon outcomes. The senior answer evaluates trajectories, not just final outputs.
- How do you handle hallucinations in production?
There is no single solution. The senior answer is a layered system that catches different hallucination types at different stages.
- How would you build evals for a coding assistant?
Code is one of the few LLM domains where ground truth is verifiable. Use that. The senior answer combines verifiable metrics with human review for what verification can't catch.
- How would you evaluate an LLM application you've built?
A level-defining question. The same words elicit a junior, senior, or staff answer. The rubric below shows the differences.
- How would you reduce LLM inference cost by 10x?
The cost-engineering question. The L6 answer doesn't pick a technique: it diagnoses where the cost is, then picks five.
- Implement attention from scratch
The coding question that doubles as a depth check. The code is short; the conversation around it tells the level.
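The short code in question, roughly (a NumPy sketch; multi-head, dropout, and KV caching are where the conversation goes next):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention. Q: (n, d), K: (m, d), V: (m, d_v).
    The 1/sqrt(d) keeps logits from growing with dimension, which would
    saturate the softmax and kill gradients."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)              # (n, m) similarity logits
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # e.g. a causal mask
    return softmax(scores) @ V                 # (n, d_v)
```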
- Walk me through speculative decoding
The interview signal is whether you understand why decoding is memory-bound and why the verify pass is essentially free.
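A greedy-only sketch of the control flow, with hypothetical model APIs (`draft.next_token`, `target.greedy_next_at`); real systems sample and use a probabilistic acceptance rule:

```python
def speculative_decode(target, draft, tokens, k=4, max_len=128):
    """The economics: decode is memory-bandwidth-bound, so the target
    model has idle FLOPs; scoring k drafted tokens in one forward pass
    costs about the same as scoring one."""
    while len(tokens) < max_len:
        drafted = []
        for _ in range(k):                     # small model drafts k tokens
            drafted.append(draft.next_token(tokens + drafted))
        # One target forward over (tokens + drafted) yields the target's own
        # next-token choice at each of the k+1 trailing positions.
        verified = target.greedy_next_at(tokens + drafted)   # k+1 tokens
        n_ok = 0
        while n_ok < k and drafted[n_ok] == verified[n_ok]:
            n_ok += 1                          # accept the agreeing prefix
        tokens += drafted[:n_ok] + [verified[n_ok]]   # always gain >= 1 token
    return tokens
```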
- When would you fine-tune vs prompt vs RAG?
The most-asked LLM design question of 2026. The answer is a decision tree, not a preference.
Recsys & Search (8)
- Design Amazon's people also bought
A simple-sounding feature with deep recsys ground underneath. The senior answer chooses between item-item collaborative filtering, embedding similarity, and learned co-purchase models, with explicit handling of feedback loops.
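A deliberately minimal co-purchase baseline over basket data (illustrative helper, lift-style normalization). Raw co-counts just resurface popular items; dividing out item frequency is the first correction, and the feedback loop is the next thing the senior answer handles.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(baskets, min_count=5):
    """Score item pairs by lift: P(a, b) / (P(a) * P(b)).
    `baskets` is an iterable of per-order item lists."""
    item_counts, pair_counts = Counter(), Counter()
    n = len(baskets)
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    scores = {}
    for pair, c in pair_counts.items():
        if c < min_count:            # drop noisy rare pairs
            continue
        a, b = tuple(pair)
        scores[pair] = c * n / (item_counts[a] * item_counts[b])   # lift
    return scores
```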
- Design Spotify's homepage
A multi-shelf, multi-objective recommendation surface. The senior answer scopes the shelves first, then designs each as its own ranker with a meta-layer above.
- Design YouTube's recommender
The canonical recsys design question. The real test is whether you'll dive into model architecture or scope the problem first.
- How would you do cold-start for a new user?
Cold-start is solved by combining minimal explicit signal, demographic and contextual fallbacks, and aggressive exploration in the first few sessions.
- How would you evaluate a search ranker?
Search ranking eval is offline metrics for development, A/B for shipping, and human raters for absolute calibration. The senior answer uses all three and respects what each measures.
- Negative sampling strategies: what actually matters
Choice of negatives often matters more than choice of model. The senior answer ranks the strategies (in-batch, hard, BM25-mined, model-mined) and explains the trade-offs.
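In-batch negatives in a few lines (standard InfoNCE; `q_emb`/`d_emb` are batch-aligned query and positive-document embeddings):

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb, d_emb, temperature=0.05):
    """For row i, document i is the positive and every other document in
    the batch is a negative. Cheap (reuses the batch) but biased toward
    easy and popular negatives, which is why hard-negative mining
    usually follows."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / temperature             # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)     # diagonal entries are positives
```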
- Recsys in the LLM era: what changes?
Most of recsys hasn't changed; LLMs add new capabilities at specific stages. The senior answer names which stages benefit and which don't.
- Two-tower vs cross-encoder: when to use which?
The recsys / search architecture decision that comes up in every retrieval interview. The right answer is 'both, in sequence.'
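The "both, in sequence" pattern as a sketch, with hypothetical `bi_encoder` and `cross_encoder` callables:

```python
import numpy as np

def retrieve_then_rerank(query, docs, bi_encoder, cross_encoder, k=100, n=10):
    """Two-tower first: scores are a dot product over doc embeddings that
    can be precomputed and indexed, so it scales to millions of docs.
    Cross-encoder second: attends over (query, doc) jointly, far more
    accurate, but one full forward pass per candidate, so it only sees
    the top k."""
    doc_vecs = np.stack([bi_encoder(d) for d in docs])   # precomputed offline in practice
    scores = doc_vecs @ bi_encoder(query)
    top_k = np.argsort(-scores)[:k]                      # cheap first stage
    reranked = sorted(top_k, key=lambda i: -cross_encoder(query, docs[i]))
    return [docs[i] for i in reranked[:n]]               # expensive second stage
```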
ML System Design (5)
- Design a content moderation system
Moderation is a multi-policy classification problem at scale, with appeals, human review, and adversarial users. The senior answer separates policy from model and treats human review as part of the system.
- Design a feature store from scratch
A feature store solves training-serving skew, feature reuse, and lineage. The senior answer explains why each property matters and what a minimum viable version looks like.
- Design fraud detection for a payment company
Fraud has the worst data of any ML problem: heavily imbalanced, biased labels, adversarial actors, and direct money on the line. The senior answer respects all four.
- Design ML monitoring
Most ML systems fail silently. Monitoring is what tells you. The senior answer monitors data, model, and outcome layers separately.
- Design real-time personalization
Real-time personalization fails most often at the data infrastructure, not the model. The senior answer designs the feature freshness and serving stack first.
Behavioral (5)
- How do you decide what to work on?
The senior signal here is that you have an explicit prioritization framework, not just a list of interests. The L6 answer connects user value, technical leverage, and team strategy.
- How do you scope an ambiguous problem?
Scoping is the single most important senior skill. The interview tests whether you have a process, not just a definition.
- Tell me about a time you disagreed with someone senior
The standard behavioral question. The interviewer is checking whether you can hold technical positions, push back productively, and update on new information.
- Tell me about your most ambitious project
The interview is checking the size of problem you can hold in your head and the structure of how you describe it. Specificity wins.
- What's the most overrated technique in ML right now?
A trap question that rewards taste. Strong opinions, defended with reasoning, are the senior signal. Weak opinions or 'I don't know' both lose.
Math (3)
- Derive logistic regression from MLE
Standard math-screen question. The senior signal is whether you can derive it cleanly and connect MLE to cross-entropy.
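The derivation in outline (standard; the punchline is that the negative log-likelihood is exactly binary cross-entropy):

```latex
% Model: p(y{=}1 \mid x) = \sigma(w^\top x), \quad \sigma(z) = 1/(1 + e^{-z}).
% Likelihood of labels y_i \in \{0, 1\} over n i.i.d. points:
\mathcal{L}(w) = \prod_{i=1}^{n} \sigma(w^\top x_i)^{y_i}\,\big(1 - \sigma(w^\top x_i)\big)^{1 - y_i}
% Maximizing \log \mathcal{L} = minimizing the negative log-likelihood:
-\log \mathcal{L}(w) = -\sum_{i=1}^{n} \Big[ y_i \log \sigma(w^\top x_i)
  + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big) \Big]
% ...which is binary cross-entropy: MLE under a Bernoulli model and
% cross-entropy minimization are the same problem.
```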
- Explain the reparameterization trick
How VAEs propagate gradients through a sampling step. The senior answer explains the why (you can't differentiate through a sample) and the how (move the randomness outside the parameters).
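The how, in code (the standard VAE formulation):

```python
import torch

def reparameterized_sample(mu, log_var):
    """z ~ N(mu, sigma^2) rewritten as z = mu + sigma * eps, eps ~ N(0, I).
    The randomness now enters through eps, a constant w.r.t. the encoder
    parameters, so gradients flow to mu and log_var through plain
    deterministic ops. Sampling z directly would block them."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)   # the stochastic node, detached from parameters
    return mu + std * eps
```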
- Why is softmax + cross-entropy the right pairing?
The gradient simplifies to (p - y), and that's not a coincidence. The senior answer derives this and connects to GLMs and numerical stability.
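The derivation the senior answer walks through (standard; note where \(\sum_k y_k = 1\) is used):

```latex
% With logits z, p = \operatorname{softmax}(z), one-hot target y, and
% loss L = -\sum_k y_k \log p_k, the softmax Jacobian
% \partial p_k / \partial z_j = p_k(\delta_{kj} - p_j) composes with
% \partial L / \partial p_k = -y_k / p_k and telescopes:
\frac{\partial L}{\partial z_j}
  = \sum_k \frac{\partial L}{\partial p_k}\,\frac{\partial p_k}{\partial z_j}
  = p_j \sum_k y_k - y_j
  = p_j - y_j
% Clean, bounded, and never saturates: the error signal is just (p - y).
```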
Coding (2)
- Debug this training loop
A live coding question with a paste of buggy training code. The senior signal is the order in which you find bugs and what your debugging procedure looks like.
- Implement KNN efficiently
The naive solution is one line. The interview is about scaling: when does naive fail, and what do you do?
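The naive version plus the first real optimization, selection instead of a full sort (`X_train`/`y_train` are placeholders):

```python
import numpy as np

def knn_naive(X_train, y_train, x, k=5):
    """Brute force: O(n * d) per query. Fine up to roughly 1e5 points;
    past that, the conversation turns to KD-trees and ball trees (low
    dimension) or approximate indexes like HNSW/IVF (high dimension)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every point
    nearest = np.argpartition(dists, k)[:k]       # O(n) selection, no full sort
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]              # majority vote
```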