Asked in: coding rounds on LLM teams and in senior MLE interviews.
The interviewer pastes a training loop with several bugs (typically 2 to 5). You read it, find the bugs, and explain the impact of each. The signal is your procedure, not your speed.
Example: a bug-laden PyTorch loop
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = MyModel()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(10):
    for x, y in dataloader:
        # forward
        pred = model(x)
        loss = criterion(pred, y)
        # backward
        loss.backward()
        optimizer.step()

    # eval inside training loop
    if epoch % 2 == 0:
        model.eval()
        val_pred = model(x_val)
        val_loss = criterion(val_pred, y_val)
        print(f"epoch {epoch} val loss: {val_loss}")
```
What’s wrong (in order of severity):
- Missing `optimizer.zero_grad()`. Gradients accumulate across batches, so the effective learning rate is N times larger than intended. Training will likely diverge or show the wrong dynamics.
- MSE loss on what looks like a classification task (`criterion = nn.MSELoss()`, but `MyModel` may produce class logits). If this is classification, MSE gives vanishing gradients on confident-wrong predictions; it should be `nn.CrossEntropyLoss()`.
- `model.eval()` not followed by `model.train()`. After the eval block, the model stays in eval mode (dropout off, batch norm using running stats), so training resumes with broken behavior.
- No `torch.no_grad()` around eval. The validation forward pass builds a computation graph and stores activations, wasting memory. It should be wrapped in `with torch.no_grad():`.
- Validation inside the training loop on a single batch (`x` comes from the training loader; the eval block only ever sees `x_val`). If `x_val` is a fixed pre-loaded batch, validation isn't representative; if it is never defined, the code crashes.
- No data movement to the device. If the model is on GPU, `x` and `y` need to be moved (`x = x.to(device)`). The code will either crash or silently run on CPU.
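Putting the fixes together, a corrected version of the loop could look roughly like this. It assumes a classification task (hence `CrossEntropyLoss`) and a separate `val_loader`; neither assumption appears in the original snippet.

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyModel().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # assuming classification; keep MSELoss if it's regression

for epoch in range(10):
    model.train()                        # undo any earlier eval()
    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()            # clear gradients from the previous step
        pred = model(x)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()

    if epoch % 2 == 0:
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():            # no graph, no stored activations
            for x_val, y_val in val_loader:   # assumed validation DataLoader
                x_val, y_val = x_val.to(device), y_val.to(device)
                val_loss += criterion(model(x_val), y_val).item()
                n_batches += 1
        print(f"epoch {epoch} val loss: {val_loss / n_batches:.4f}")
```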
Senior debugging procedure (the actual signal)
“I’d read the loop top to bottom and find issues in this order, by impact:

1. Optimizer / gradient flow first. Missing `zero_grad`, missing `optimizer.step`, detached tensors, `.item()` calls that drop gradients. These break training entirely.
2. Loss function vs. task. MSE on classification, CrossEntropy on probabilities (instead of logits), wrong reduction.
3. Train/eval mode. Forgetting to switch back to train mode, forgetting `no_grad` in eval. These don’t crash; they are silently wrong.
4. Device placement. GPU/CPU mismatches, missing `.to(device)`, parameters on the wrong device.
5. Data shape and type issues. Batch dimension mismatches, wrong label encoding, wrong target dtype.
6. Numerical stability. No grad clipping for transformers, no LR warmup for Adam, no fp32 for stability-critical ops.

The order isn’t arbitrary; it’s roughly by likelihood-of-being-wrong and severity.”
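To make the first check concrete: a small probe like the sketch below (the helper name `check_grad_flow` is illustrative, not part of the loop above) can confirm that gradients actually reach every trainable parameter after `loss.backward()`.

```python
import torch

def check_grad_flow(model: torch.nn.Module) -> None:
    """Print per-parameter gradient norms after a backward pass."""
    for name, p in model.named_parameters():
        if p.requires_grad:
            norm = p.grad.norm().item() if p.grad is not None else float("nan")
            print(f"{name:40s} grad norm: {norm:.3e}")

# Usage after one forward/backward pass:
#   loss.backward()
#   check_grad_flow(model)   # nan/zero rows point at detached tensors or frozen params
#   assert model.training    # catches a model.eval() that was never undone
```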
What an L4 candidate misses
- Doesn’t mention `zero_grad` (the most common bug).
- Spots the eval-mode issue but not the `no_grad` issue.
- Doesn’t have an order; finds bugs randomly.
What an L5 candidate finds
- All the gradient-flow bugs.
- The loss / task mismatch.
- The eval-mode bug.
What an L6 candidate adds
- A procedure explained out loud as they read.
- Severity ranking (“this is the most impactful, but this other one would silently corrupt eval metrics”).
- A note on what they would test to verify each bug.
Tells that get you a strong-hire vote
- You explain your scanning order out loud.
- You catch `zero_grad` immediately.
- You distinguish bugs that crash from bugs that silently mislead.
- You suggest adding asserts or tests to prevent recurrence.
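One way to make that last point concrete is a pair of lightweight regression tests around a single training step. This is a sketch under the assumption of a hypothetical `train_step(model, batch)` helper that runs one forward/backward/optimizer step and returns the loss.

```python
import torch

def test_one_step_reduces_loss(model, batch, train_step):
    """Two steps on the same batch should lower the loss; a missing
    zero_grad or a detached graph usually fails this check."""
    model.train()
    first = train_step(model, batch)
    second = train_step(model, batch)
    assert second < first, "loss did not decrease on a repeated batch"

def test_all_params_update(model, batch, train_step):
    """Every trainable parameter should change after one step; frozen or
    disconnected parameters show up here."""
    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    train_step(model, batch)
    for n, p in model.named_parameters():
        if p.requires_grad:
            assert not torch.equal(before[n], p.detach()), f"{n} never updated"
```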
Tells that get you down-leveled
- Reading line by line in random order.
- Missing `zero_grad`.
- Spotting bugs but not characterizing their impact.
- Suggesting “use a framework” as the fix.
Common follow-up
“You’ve fixed all the bugs. The model still isn’t training. What now?”
The L6 answer is the same as the procedure for “How would you debug a model that’s not learning?”: overfit a single batch, check gradient norms, run an LR sweep, look at the predictions, and so on.
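For the first step of that procedure, a rough overfit-a-single-batch sketch (assuming `model`, `criterion`, and one `(x, y)` batch already exist on the same device):

```python
import torch

# Assumption: model, criterion, and a single (x, y) batch are defined and
# on the same device. A healthy setup should drive the loss close to zero
# on this one batch within a few hundred steps.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.train()
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
# If the loss plateaus far from zero, suspect the model, the loss, or the
# data pipeline rather than the hyperparameters.
```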
Related: “How would you debug a model that’s not learning?”, “Implement attention from scratch”, “How do you choose a learning rate?”.