Concept notes

124 concept notes, alphabetical within each subcategory. Each follows the same template: one-line definition, why it matters, the mechanism, what an interviewer expects, common confusions.

Linear Algebra & Math 6

  • Determinant and volume

    The determinant of a matrix is the signed volume scaling factor of the linear map. Zero determinant means the map collapses dimensions.

  • Eigenvalues and the spectral theorem

    Eigenvectors are directions a matrix only stretches. The spectral theorem says symmetric matrices have a full orthogonal eigenbasis with real eigenvalues.

  • Matrices as linear maps

    A matrix is a linear function from one vector space to another. Every operation in ML (projection, rotation, basis change, gradient flow) is matrix multiplication.

  • Matrix calculus for ML

    Gradients, Jacobians, and Hessians for vector- and matrix-valued functions. The minimum needed to derive backprop and second-order methods.

  • Positive (semi-)definite matrices

    Matrices that define inner products and proper covariances. The geometry of PSD: ellipsoids, not arbitrary shapes.

  • SVD and PCA

    The singular value decomposition factorizes any matrix into rotation × stretching × rotation. PCA is SVD applied to mean-centered data.
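
A sketch of that relationship in NumPy (synthetic data; names are illustrative): the right singular vectors of mean-centered data are the principal directions, and the squared singular values, divided by n − 1, are the component variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])  # anisotropic point cloud

Xc = X - X.mean(axis=0)                        # mean-center first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                         # top-2 principal components

# Squared singular values / (n - 1) are exactly the covariance eigenvalues
cov_eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (len(X) - 1))[::-1]
variances = S**2 / (len(X) - 1)
```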

Probability & Statistics 8

  • Bayes' rule and the posterior

    How to update beliefs given evidence: posterior ∝ likelihood × prior. The foundation of Bayesian inference, naive Bayes, and probabilistic graphical models.
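
A toy numeric instance of the update, with made-up numbers: a test with 90% sensitivity and a 5% false-positive rate applied to a 1% base rate.

```python
prior = 0.01                     # P(disease)
p_pos_given_d = 0.90             # sensitivity
p_pos_given_healthy = 0.05       # false-positive rate

# posterior ∝ likelihood × prior; normalize by the total probability of the evidence
evidence = p_pos_given_d * prior + p_pos_given_healthy * (1 - prior)
posterior = p_pos_given_d * prior / evidence   # most positives are still false
```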

  • Bias and variance of estimators

    An estimator has bias (systematic error) and variance (sample-to-sample wobble). Mean-squared error decomposes into squared bias plus variance.

  • Central limit theorem

    Appropriately normalized sums of many independent random variables converge to a Gaussian. Why nearly every error bar in ML and statistics is computed from a normal distribution.

  • Exponential family

    A unified family of distributions (Gaussian, Bernoulli, Poisson, Beta, Gamma, etc.) with shared properties: sufficient statistics, conjugate priors, simple MLE.

  • KL divergence

    An asymmetric measure of the difference between probability distributions (not a true distance). Cross-entropy minus entropy. The mathematical glue holding most of probabilistic ML together.
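
The "cross-entropy minus entropy" identity can be checked numerically with toy distributions in plain Python:

```python
import math

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
entropy = -sum(pi * math.log(pi) for pi in p)

# KL(p‖q) = H(p, q) − H(p), and it is nonnegative
assert abs(kl - (cross_entropy - entropy)) < 1e-12
```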

  • Markov chains

    Stochastic processes where the future depends only on the present, not the past. Foundation of HMMs, MCMC, and many sequence models.

  • Maximum likelihood estimation

    The dominant statistical principle: pick parameters that make the observed data most probable. Reduces to minimizing cross-entropy for classification and MSE for Gaussian regression.

  • Monte Carlo and importance sampling

    Estimate expectations by averaging over random samples. The simplest way to compute integrals you can't compute analytically.
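
A minimal sketch with a toy integrand and a fixed seed: estimate E[X²] = 1 for X ~ N(0, 1) by averaging samples.

```python
import random

random.seed(0)
n = 100_000
samples = (random.gauss(0.0, 1.0) ** 2 for _ in range(n))
estimate = sum(samples) / n    # Monte Carlo estimate of E[X^2] = Var(X) = 1
```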

Classical ML 10

  • DBSCAN

    Density-based clustering: form clusters from regions of high point density, label sparse points as noise. Handles arbitrary cluster shapes; no k to specify.

  • Decision trees

    Recursively split the feature space along axis-aligned thresholds chosen to maximize a purity criterion. The base learner of GBDT and random forests.

  • Gradient boosting (xgboost, lightgbm, catboost)

    Train trees sequentially, each one fitting the gradient of the loss with respect to the current ensemble's prediction. The dominant tabular learner in 2026.

  • k-means clustering

    Partition n points into k clusters by minimizing within-cluster variance. Lloyd's algorithm: alternate assigning points to nearest center and recomputing centers.
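
Lloyd's algorithm fits in a dozen lines of NumPy. This sketch omits empty-cluster handling and k-means++ initialization, so treat it as an illustration, not a reference implementation:

```python
import numpy as np

def lloyd(X, k, iters=25, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)                    # assign to nearest center
        centers = np.stack([X[labels == j].mean(axis=0)  # recompute centers
                            for j in range(k)])
    return labels, centers

# Two well-separated blobs: the clusters should recover them exactly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = lloyd(X, k=2)
```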

  • Linear regression

    Predict a continuous target as a linear combination of features by minimizing squared error. Closed-form solution, MLE under Gaussian noise, and the foundation everything else builds on.

  • Logistic regression

    Linear regression for binary classification: pass a linear combination through a sigmoid, train by maximum likelihood. Still the strongest non-trivial baseline for tabular classification.

  • Matrix factorization for recsys

    Decompose the user-item interaction matrix into user and item embeddings whose dot product approximates the rating. The original collaborative filtering.

  • Naive Bayes

    A trivially simple generative classifier that assumes features are conditionally independent given the class. Fast, parameter-light, surprisingly hard to beat on text.

  • Random forests

    Bag deep decision trees plus random feature subsets per split. Variance averaging beats any single tree; the dominant out-of-the-box ensemble before GBDT.

  • SVM and the kernel trick

    Maximum-margin classifier with a kernel that lets it operate in implicit high-dimensional feature spaces. Beautiful theory; less common in 2026 production.

Deep Learning Foundations 8

  • Activation functions

    ReLU, GELU, swish, sigmoid, tanh. What each does, why GELU/swish replaced ReLU in transformers, and when to use which.

  • Autoregressive vs. diffusion generation

    Two paradigms for generative modeling: predict the next element step-by-step (autoregressive) or iteratively denoise from pure noise (diffusion). Different costs, different strengths.

  • Backpropagation

    Reverse-mode automatic differentiation applied to a computation graph. The algorithm that computes gradients for every parameter in one backward pass.

  • Encoder-decoder architectures

    An encoder summarizes the input into a representation; a decoder generates the output conditioned on it. The structure behind translation, T5, summarization, and many multimodal models.

  • Exploding and vanishing gradients

    Why deep networks were untrainable before residuals, normalization, and ReLU. The math of gradient magnitudes through depth and the standard fixes.

  • Residual connections

    Add the input of a block to its output. Lets gradients flow unimpeded through depth and made networks deeper than 30 layers practical for the first time.

  • The attention mechanism

    Compute a weighted sum of values, weights derived from query-key similarity. The single operation that powers transformers, retrieval, and most of modern ML.
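
The whole mechanism in a few lines of NumPy (single head, no masking, batch dimension omitted):

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # query-key similarity
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # softmax over the keys
    return w @ V, w                                  # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 5))
out, weights = attention(Q, K, V)   # 2 queries attend over 3 key/value pairs
```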

  • Universal approximation theorem

    A neural network with one hidden layer and enough units can approximate any continuous function on a compact domain. What it does and doesn't say about deep learning.

Generative Models 4

  • Diffusion models

    Learn to invert a fixed noising process. The dominant generative paradigm for images, audio, video, and molecules in 2026.

  • Generative adversarial networks (GANs)

    Two networks compete: a generator produces samples, a discriminator distinguishes them from real data. Sharp samples, training instability, mostly displaced by diffusion in 2026.

  • Normalizing flows

    Generative models built from invertible transformations. Compute exact likelihoods and sample efficiently, at the cost of architectural restrictions.

  • Variational autoencoders (VAE)

    Encode inputs to a latent distribution, decode samples back, optimize evidence lower bound. The cleanest gateway to deep generative models.

Probabilistic Models 4

  • Expectation-Maximization (EM)

    Iterate between estimating latent variables given parameters (E-step) and updating parameters given latents (M-step). The standard tool for latent-variable MLE when the latents are unobserved.

  • Gaussian mixture models

    Model data as a weighted sum of K Gaussians. Soft clustering, density estimation, and the canonical EM example.

  • Hidden Markov models

    A latent Markov chain emits observations through a per-state distribution. Forward-backward, Viterbi, Baum-Welch. The classical sequence model toolkit.

  • Probabilistic graphical models

    Express joint distributions as graphs whose structure encodes conditional independence. Bayesian networks (directed) and Markov random fields (undirected).

Reinforcement Learning 4

  • Policy gradient methods

    Directly optimize the policy by following the gradient of expected return. REINFORCE, actor-critic, and the foundation of modern RL.

  • Proximal Policy Optimization (PPO)

    Constrain policy updates with a clipped surrogate objective. The default actor-critic algorithm in 2026 for robotics, games, and RLHF.

  • Q-learning

    Learn the action-value function Q(s, a) by Bellman backups. The foundation of value-based RL. DQN, Rainbow, and the original Atari breakthroughs.

  • Value-based vs. policy-based RL

    Two paradigms in reinforcement learning. Value-based learns Q(s, a) and acts greedily; policy-based directly parametrizes the policy. When to use which.

Computer Vision 4

  • CNN architecture

    Convolutions encode translation equivariance and locality. The structural inductive bias that powered the deep learning revolution in vision.

  • Object detection: Faster R-CNN, YOLO, DETR

    Localize and classify objects in an image. The three main architectural families: two-stage proposal-based, one-stage grid-based, and transformer-based.

  • ResNet

    Residual connections enabled networks deeper than 30 layers to train. Still the dominant backbone for transfer learning in 2026.

  • Vision transformers (ViT)

    Apply a standard transformer to a sequence of image patches. Beats CNNs at scale; the dominant backbone for foundation vision models in 2026.

LLM Internals 18

Training Fundamentals 16

  • Activation checkpointing

    Trade compute for memory: drop activations during the forward pass and recompute them during the backward pass. The cheapest way to fit a larger model on the same GPU.

  • Adam, AdamW, and the modern optimizer landscape

    Why Adam works, why AdamW is the version you actually want, and what's changed in the optimizer landscape since 2018.

  • BatchNorm vs LayerNorm (and the transformer wrinkle)

    These look similar and aren't. Mixing them up in interviews is one of the cheapest ways to lose level points. Here's the right mental model.

  • Calibration: when your model says 80% it should be right 80% of the time

    Accuracy isn't enough; you also want predictions to mean what they say. Calibration is the difference.

  • Cross-entropy and softmax

    The pairing isn't arbitrary. Cross-entropy is the negative log-likelihood under a categorical distribution, and the softmax+CE gradient simplifies to (p − y), which is why it's stable.
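
The (p − y) claim can be checked against a finite-difference gradient (toy logits, central differences):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])          # one-hot target, class 1

p = softmax(z)
analytic = p - y                        # claimed gradient of CE w.r.t. the logits

# Numeric check: central difference of the loss -log p[target]
eps = 1e-5
numeric = np.zeros(3)
for i in range(3):
    e_i = eps * np.eye(3)[i]
    numeric[i] = (-np.log(softmax(z + e_i)[1])
                  + np.log(softmax(z - e_i)[1])) / (2 * eps)
```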

  • Dropout

    Randomly zero out a fraction of activations during training. The simplest stochastic regularizer; still standard in vision and many NLP architectures.

  • Gradient accumulation

    Run several forward-backward passes before each optimizer step to simulate a larger effective batch size without the memory cost.

  • Gradient clipping

    Cap the norm of the gradient before each optimizer step. The simplest and most reliable defense against training instability.

  • Label smoothing

    Replace one-hot targets with a softened distribution that puts ε mass on the wrong classes. Improves calibration, sometimes hurts retrieval.
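
One common convention, sketched below, blends the one-hot target with the uniform distribution; some variants instead put ε/(K−1) on only the wrong classes.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    k = one_hot.shape[-1]
    # Blend the one-hot target with the uniform distribution over k classes
    return one_hot * (1.0 - eps) + eps / k

y = np.array([0.0, 1.0, 0.0])
y_smooth = smooth_labels(y)    # still sums to 1; the true class keeps most of the mass
```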

  • Learning rate schedules: warmup and cosine decay

    Why almost every modern training run linearly warms up the LR over a few hundred steps and then decays it on a cosine to near zero.

  • Mixed precision training: FP16, BF16, and FP8

    How modern transformers train at 2-4× the throughput of FP32 without quality loss. The bit layouts matter; the loss-scaling recipe matters more.

  • Mixup and CutMix

    Two data-augmentation schemes that train on convex combinations of pairs of inputs and their labels. Strong regularization for image classification; sometimes used in audio and tabular.

  • Regularization: L1, L2, dropout, early stopping, and the modern view

    The classical regularizers + the modern reality that SGD's noise is itself a regularizer. The hierarchy of choices when your model is overfitting.

  • SGD with momentum

    Add a moving average of past gradients to the update. Smoother trajectories, faster convergence in narrow valleys, and the foundation of Adam's first moment.

  • Weight decay vs. L2 regularization

    L2 adds ½λ‖θ‖² to the loss; weight decay shrinks θ multiplicatively at each step. They are equivalent under SGD but not under Adam, which is why AdamW exists.
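
A one-step check of the SGD equivalence with toy scalars. Under Adam the L2 gradient gets divided by the adaptive denominator while decoupled decay does not, which is where the two updates diverge.

```python
lr, lam = 0.1, 0.01        # learning rate, regularization strength
theta, grad = 2.0, 0.5     # current weight and loss gradient (toy numbers)

# L2: fold lambda * theta into the gradient, then take a plain SGD step
l2_step = theta - lr * (grad + lam * theta)

# Decoupled weight decay: shrink theta multiplicatively, then apply the gradient
decay_step = theta * (1.0 - lr * lam) - lr * grad
```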

  • Weight initialization (Kaiming, Xavier)

    Set the initial variance of each layer's weights so that activations and gradients neither explode nor vanish through depth. The single most impactful one-line decision in deep nets.

Systems & Infrastructure 7

ML Systems & Evaluation 10

  • A/B testing for ML systems

    The framework for proving a model change actually helps. Statistical power, novelty effects, network effects: all the things people get wrong.

  • Confusion matrix and classification metrics

    The 2x2 (or KxK) table of predictions vs. truth that every classification metric is computed from. The Rosetta stone of binary classification.

  • Cross-validation strategies

    Hold-out, k-fold, stratified, grouped, and time-series CV, and when each one is and isn't appropriate.

  • Embedding spaces and similarity metrics

    How learned vector representations encode meaning, and why cosine similarity is the default metric for retrieval and recsys.

  • Expected Calibration Error (ECE)

    How well do predicted probabilities match empirical frequencies? Bin predictions by confidence, compare bin-mean confidence to bin-accuracy.
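
A minimal equal-width-bin implementation (edge cases and bin-boundary conventions vary across libraries, so treat this as a sketch):

```python
import numpy as np

def ece(confidence, correct, n_bins=10):
    bins = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap     # weight each bin by its share of samples
    return total

conf = np.array([0.8] * 10)                # model says 80% ten times...
hits = np.array([1] * 8 + [0] * 2, float)  # ...and is right 8 of 10 times: calibrated
```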

  • Perplexity and bits per token

    The standard intrinsic metric for language models. What it measures, what units to use, and why it's a poor end-product evaluation.

  • Precision, recall, and F1

    The three metrics every classifier interview asks about. Their definitions, when to optimize which, and the F-beta generalization.
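
The definitions with toy confusion-matrix counts; the F-beta form is included since β = 1 recovers F1:

```python
tp, fp, fn = 8, 2, 4          # toy counts: true positives, false positives, false negatives

precision = tp / (tp + fp)    # of the flagged positives, how many were real
recall = tp / (tp + fn)       # of the real positives, how many were flagged
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

def f_beta(p, r, beta):
    # beta > 1 weights recall more heavily; beta = 1 recovers F1
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```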

  • Ranking metrics: NDCG, MAP, MRR

    Beyond binary precision-recall: how to measure ranking quality when order matters and labels are graded.
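
A sketch of NDCG in the linear-gain form (another common variant uses 2^rel − 1 in the numerator):

```python
import math

def dcg(rels):
    # Discounted cumulative gain: relevance discounted by log2 of the rank
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    # Normalize by the DCG of the ideal (sorted-descending) ordering
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])   # ideal order
swapped = ndcg([0, 2, 1, 3])   # best item buried at the bottom
```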

  • ROC, PR curves, and AUC

    What ROC-AUC and PR-AUC measure, when to use which, and why ROC-AUC is misleading on heavy class imbalance.

  • Two-tower retrieval

    Encode queries and items with separate networks into a shared embedding space; retrieve by approximate nearest neighbors. The default architecture for industrial recommenders and search.

Other

  • Actor-critic methods

    Policy gradient with a learned value baseline. The actor picks actions; the critic estimates how good they were. The architecture under PPO, A3C, SAC, and most modern RL.

  • Advantage estimation and GAE

    Policy gradients need a low-variance estimate of how much better an action was than average. GAE is the standard answer: an exponentially weighted blend of n-step returns.

  • Alternating least squares for collaborative filtering

    Factorize the user-item matrix into two low-rank factors. Each is a linear regression given the other, so alternate. The classical recsys workhorse before deep learning.

  • Anchor boxes and non-maximum suppression

    Object detectors predict thousands of overlapping boxes. Anchors give each prediction a prior shape; NMS prunes near-duplicates. The pre-DETR pipeline that defined the field for a decade.

  • Approximate nearest neighbors: HNSW, IVF, and product quantization

    Exact k-NN over a billion vectors is infeasible. ANN trades a small recall hit for a 100x to 10,000x speedup. The reason vector search at scale exists.

  • BERT and masked language modeling

    Train a transformer to fill in randomly masked tokens. The result is a bidirectional encoder that broke a dozen NLP benchmarks at once and defined the pretrain-then-finetune era.

  • Convolution as matrix multiplication (im2col)

    A 2D convolution is a matmul in disguise. Unfold the input into columns, multiply by a flattened filter matrix. The reason CNNs run fast on the same hardware as transformers.
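
A sketch of im2col verified against a naive sliding-window loop. As is standard in deep learning, this computes cross-correlation (no kernel flip); single channel, no padding or stride.

```python
import numpy as np

def im2col(x, kh, kw):
    H, W = x.shape
    # Each row is one kh×kw patch of the input, flattened
    return np.array([x[i:i + kh, j:j + kw].ravel()
                     for i in range(H - kh + 1)
                     for j in range(W - kw + 1)])

def conv2d(x, kernel):
    kh, kw = kernel.shape
    H, W = x.shape
    out = im2col(x, kh, kw) @ kernel.ravel()   # the convolution is now one matmul
    return out.reshape(H - kh + 1, W - kw + 1)

rng = np.random.default_rng(0)
x, k = rng.normal(size=(6, 6)), rng.normal(size=(3, 3))
fast = conv2d(x, k)

# Naive reference: slide the kernel and sum elementwise products
naive = np.array([[(x[i:i+3, j:j+3] * k).sum() for j in range(4)] for i in range(4)])
```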

  • Decoding strategies: greedy, beam, top-k, top-p, temperature

    Same model, different samplers, very different outputs. The choice of decoder is often more impactful than the last percent of training. Know the tradeoffs.
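
Temperature and top-k in a few lines (a sketch: ties at the k-th logit keep extra tokens, and top-p is omitted):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    rng = rng or np.random.default_rng(0)
    z = np.asarray(logits, float) / temperature    # temperature rescales the logits
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)      # mask everything below the k-th best
    p = np.exp(z - z.max())
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

logits = [1.0, 3.0, 0.2, 2.5]
greedy = sample_token(logits, top_k=1)             # top-1 sampling == greedy argmax
```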

  • Epistemic vs aleatoric uncertainty

    Epistemic uncertainty shrinks with more data; aleatoric does not. Conflating them produces miscalibrated systems and wasted data collection. The distinction every senior ML engineer should be able to articulate.

  • Exploration vs exploitation: epsilon-greedy, UCB, Thompson sampling

    An RL or bandit agent has to keep trying new actions to learn while taking the best-known action to score. Three classical strategies, each with a different way of resolving the tension.

  • Factorization machines

    Linear models can't capture feature interactions. Polynomial models have too many parameters. Factorization machines find a middle path: factorize the interaction matrix and learn an embedding per feature.

  • Forward-backward and Viterbi: dynamic programming on chains

    Sum and max over exponentially many paths in linear time. Forward-backward computes posteriors over hidden states; Viterbi finds the most likely state sequence. The same idea, two semirings.
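
Viterbi on a toy two-state HMM, in log space throughout; replacing max with logsumexp in the same loop gives the forward pass (the two-semirings point above).

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    # log_pi: (S,) initial, log_A: (S, S) transition, log_B: (S, K) emission log-probs
    T, S = len(obs), len(log_pi)
    delta = log_pi + log_B[:, obs[0]]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):              # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.log([0.5, 0.5])
A = np.log([[0.8, 0.2], [0.2, 0.8]])           # sticky two-state chain
B = np.log([[0.9, 0.1], [0.1, 0.9]])           # state s mostly emits symbol s
path = viterbi(pi, A, B, [0, 0, 1, 1])
```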

  • Gaussian processes

    A distribution over functions defined entirely by a covariance kernel. Predicts both a mean and a calibrated uncertainty. Beautiful theory, brutal scaling.

  • Graph neural networks: message passing as A·X·W

    Neighbors carry signal. A graph neural network averages each node's neighborhood and projects with a learned matrix. The same matmul as a CNN, on irregular structure.

  • Kernel methods and the kernel trick

    Compute inner products in a high-dimensional feature space without ever materializing the features. The mathematical move that lets a linear classifier draw nonlinear boundaries.

  • Knowledge distillation

    Train a small student to match a large teacher's outputs. The student gets richer signal than from hard labels because the teacher's soft probabilities encode similarity structure.

  • LSTM and GRU: gating as Hadamard products

    Recurrent networks fail because gradients vanish through repeated matmul. Gates fix this by using elementwise multiplication to control information flow. Then transformers replaced them anyway.

  • Microannealing and midtraining

    A short cooldown applied to a mostly-trained checkpoint with a small fraction of candidate data mixed in. The standard mid-training probe for whether a new dataset is worth including.

  • Multi-head attention: why one head is not enough

    Run h independent attention computations in parallel, then concatenate. Each head specializes in a different relation. The mechanism most senior candidates can write but few can motivate.

  • Pruning: structured vs unstructured sparsity

    Set unimportant weights to zero, recover most of the accuracy. Unstructured pruning shrinks model size; structured pruning shrinks inference time. They solve different problems.

  • Self-attention vs cross-attention

    Same operation, different inputs. Self-attention reads from one sequence; cross-attention reads from another. The distinction every encoder-decoder architecture rests on.

  • t-SNE and UMAP: nonlinear dimensionality reduction

    Both project high-dimensional data to 2D for visualization by preserving local neighborhoods. Both are easy to misread. Know what they show and what they hide.

  • Word embeddings: Word2Vec, GloVe, and the geometry of meaning

    Map words to dense vectors so that similar words land near each other. The breakthrough that proved meaning lives in geometry, not symbols.

  • WSD and WSD-S learning rate schedules

    Warmup-Stable-Decay holds the LR flat for most of training and decays at the end. WSD-S adds cyclic decay-and-rewarm probes. Both are designed for pretraining where you don't know the total token budget upfront.

  • Z-loss

    An auxiliary loss term that penalizes the squared log-partition function of the softmax. Started as a stability hack for logit blowup. Now used as the default regularizer on logit scale during long or deep cooldowns.
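
The penalized quantity is just the squared log-partition of the logits; the coefficient below is an illustrative small value, not a universal constant.

```python
import math

logits = [2.0, -1.0, 0.5]
log_z = math.log(sum(math.exp(x) for x in logits))  # log-partition of the softmax
z_loss = 1e-4 * log_z ** 2                          # added to the cross-entropy loss
```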