Concepts
Concept notes
124 concept notes, alphabetical within each subcategory. Each follows the same template: one-line definition, why it matters, the mechanism, what an interviewer expects, common confusions. Press / to search.
Linear Algebra & Math 6
- Determinant and volume
The determinant of a matrix is the signed volume scaling factor of the linear map. Zero determinant means the map collapses dimensions.
- Eigenvalues and the spectral theorem
Eigenvectors are directions a matrix only stretches. The spectral theorem says symmetric matrices have a full orthogonal eigenbasis with real eigenvalues.
- Matrices as linear maps
A matrix is a linear function from one vector space to another. Every operation in ML (projection, rotation, basis change, gradient flow) is matrix multiplication.
- Matrix calculus for ML
Gradients, Jacobians, and Hessians for vector- and matrix-valued functions. The minimum needed to derive backprop and second-order methods.
- Positive (semi-)definite matrices
Matrices that define inner products and proper covariances. The geometry of PSD: ellipsoids, not arbitrary shapes.
- SVD and PCA
The singular value decomposition factorizes any matrix into rotation × stretching × rotation. PCA is SVD applied to mean-centered data.
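To make the SVD/PCA connection above concrete, a minimal NumPy sketch (function and variable names are illustrative): mean-center, factorize, project onto the top-k right singular vectors.

```python
import numpy as np

def pca_via_svd(X, k):
    """PCA of an (n, d) data matrix via SVD: center, factorize, project onto top-k components."""
    Xc = X - X.mean(axis=0)                              # mean-center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # Xc = U @ diag(S) @ Vt
    components = Vt[:k]                                  # top-k principal directions
    explained_variance = S[:k] ** 2 / (len(X) - 1)
    return Xc @ components.T, components, explained_variance
```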
Probability & Statistics 8
- Bayes' rule and the posterior
How to update beliefs given evidence: posterior ∝ likelihood × prior. The foundation of Bayesian inference, naive Bayes, and probabilistic graphical models.
- Bias and variance of estimators
An estimator has bias (systematic error) and variance (sample-to-sample wobble). Mean-squared error decomposes into squared bias plus variance.
- Central limit theorem
Sums of many independent random variables become Gaussian. Why nearly every error bar in ML and statistics is computed from a normal distribution.
- Exponential family
A unified family of distributions (Gaussian, Bernoulli, Poisson, Beta, Gamma, etc.) with shared properties: sufficient statistics, conjugate priors, simple MLE.
- KL divergence
Asymmetric distance between probability distributions. Cross-entropy minus entropy. The mathematical glue holding most of probabilistic ML together.
- Markov chains
Stochastic processes where the future depends only on the present, not the past. Foundation of HMMs, MCMC, and many sequence models.
- Maximum likelihood estimation
The dominant statistical principle: pick parameters that make the observed data most probable. Reduces to minimizing cross-entropy for classification and MSE for Gaussian regression.
- Monte Carlo and importance sampling
Estimate expectations by averaging over random samples. The simplest way to compute integrals you can't compute analytically.
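A minimal sketch of the Monte Carlo entry above, assuming NumPy; the target, proposal, and sample counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2            # estimate E_p[f(x)] with p = N(0, 1); true value is 1

# Plain Monte Carlo: sample from p and average.
x = rng.normal(0.0, 1.0, size=100_000)
plain = f(x).mean()

# Importance sampling: sample from a proposal q = N(0, 2) and reweight by p/q.
xq = rng.normal(0.0, 2.0, size=100_000)
w = normal_pdf(xq, 0.0, 1.0) / normal_pdf(xq, 0.0, 2.0)
importance = (w * f(xq)).mean()
```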
Classical ML 10
- DBSCAN
Density-based clustering: form clusters from regions of high point density, label sparse points as noise. Handles arbitrary cluster shapes; no k to specify.
- Decision trees
Recursively split the feature space along axis-aligned thresholds chosen to maximize a purity criterion. The base learner of GBDT and random forests.
- Gradient boosting (xgboost, lightgbm, catboost)
Train trees sequentially, each one fit to the negative gradient of the loss with respect to the current ensemble's prediction. The dominant tabular learner in 2026.
- k-means clustering
Partition n points into k clusters by minimizing within-cluster variance. Lloyd's algorithm: alternate assigning points to nearest center and recomputing centers. Sketched at the end of this section.
- Linear regression
Predict a continuous target as a linear combination of features by minimizing squared error. Closed-form solution, MLE under Gaussian noise, and the foundation everything else builds on.
- Logistic regression
Linear regression for binary classification: pass a linear combination through a sigmoid, train by maximum likelihood. Still the strongest non-trivial baseline for tabular classification.
- Matrix factorization for recsys
Decompose the user-item interaction matrix into user and item embeddings whose dot product approximates the rating. The original collaborative filtering.
- Naive Bayes
A trivially simple generative classifier that assumes features are conditionally independent given the class. Fast, parameter-light, surprisingly hard to beat on text.
- Random forests
Bag deep decision trees plus random feature subsets per split. Variance averaging beats any single tree; the dominant out-of-the-box ensemble before GBDT.
- SVM and the kernel trick
Maximum-margin classifier with a kernel that lets it operate in implicit high-dimensional feature spaces. Beautiful theory; less common in 2026 production.
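Picking up the k-means entry above: a minimal NumPy sketch of Lloyd's algorithm (initialization and convergence handling kept deliberately simple).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment step and the center-update step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Update: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```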
Deep Learning Foundations 8
- Activation functions
ReLU, GELU, swish, sigmoid, tanh. What each does, why GELU/swish replaced ReLU in transformers, and when to use which.
- Autoregressive vs. diffusion generation
Two paradigms for generative modeling: predict the next element step-by-step (autoregressive) or iteratively denoise from pure noise (diffusion). Different costs, different strengths.
- Backpropagation
Reverse-mode automatic differentiation applied to a computation graph. The algorithm that computes gradients for every parameter in one backward pass.
- Encoder-decoder architectures
An encoder summarizes the input into a representation; a decoder generates the output conditioned on it. The structure behind translation, T5, summarization, and many multimodal models.
- Exploding and vanishing gradients
Why deep networks were untrainable before residuals, normalization, and ReLU. The math of gradient magnitudes through depth and the standard fixes.
- Residual connections
Add the input of a block to its output. Lets gradients flow unimpeded through depth and made networks deeper than 30 layers practical for the first time.
- The attention mechanism
Compute a weighted sum of values, weights derived from query-key similarity. The single operation that powers transformers, retrieval, and most of modern ML. Sketched at the end of this section.
- Universal approximation theorem
A neural network with one hidden layer and enough units can approximate any continuous function on a bounded domain. What it does and doesn't say about deep learning.
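The attention entry above in code: a minimal single-head, NumPy-only sketch of scaled dot-product attention (no masking or batching).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Weights from query-key similarity, output is a weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ V
```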
Generative Models 4
- Diffusion models
Learn to invert a fixed noising process. The dominant generative paradigm for images, audio, video, and molecules in 2026.
- Generative adversarial networks (GANs)
Two networks compete: a generator produces samples, a discriminator distinguishes them from real data. Sharp samples, training instability, mostly displaced by diffusion in 2026.
- Normalizing flows
Generative models built from invertible transformations. Compute exact likelihoods and sample efficiently, at the cost of architectural restrictions.
- Variational autoencoders (VAE)
Encode inputs to a latent distribution, decode samples back, optimize evidence lower bound. The cleanest gateway to deep generative models.
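Two pieces of the VAE entry above, as a NumPy sketch under the usual diagonal-Gaussian assumptions: the reparameterization trick and the closed-form KL term in the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), the regularizer in the ELBO."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
```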
Probabilistic Models 4
- Expectation-Maximization (EM)
Iterate between estimating latent variables given parameters (E-step) and updating parameters given latents (M-step). The standard tool for latent-variable MLE when the latents are unobserved.
- Gaussian mixture models
Model data as a weighted sum of K Gaussians. Soft clustering, density estimation, and the canonical EM example. Sketched at the end of this section.
- Hidden Markov models
A latent Markov chain emits observations through a per-state distribution. Forward-backward, Viterbi, Baum-Welch. The classical sequence model toolkit.
- Probabilistic graphical models
Express joint distributions as graphs whose structure encodes conditional independence. Bayesian networks (directed) and Markov random fields (undirected).
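A minimal NumPy sketch tying the EM and Gaussian-mixture entries above together: EM for a one-dimensional mixture (no covariance floors, no log-space tricks).

```python
import numpy as np

def em_gmm_1d(x, k=2, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: E-step responsibilities, M-step weighted MLE updates."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        dens = np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```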
Reinforcement Learning 4
- Policy gradient methods
Directly optimize the policy by following the gradient of expected return. REINFORCE, actor-critic, and the foundation of modern RL.
- Proximal Policy Optimization (PPO)
Constrain policy updates with a clipped surrogate objective. The default actor-critic algorithm in 2026 for robotics, games, and RLHF.
- Q-learning
Learn the action-value function Q(s, a) by Bellman backups. The foundation of value-based RL: DQN, Rainbow, and the original Atari breakthroughs. Sketched at the end of this section.
- Value-based vs. policy-based RL
Two paradigms in reinforcement learning. Value-based learns Q(s, a) and acts greedily; policy-based directly parametrizes the policy. When to use which.
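A tabular sketch of the Q-learning entry above, assuming a NumPy array Q indexed by state and action; the learning rate, discount, and epsilon are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Bellman backup: move Q[s, a] toward r + gamma * max_a' Q[s_next, a']."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(Q, s, epsilon, rng):
    """Exploration vs. exploitation: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())
```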
Computer Vision 4
- CNN architecture
Convolutions encode translation equivariance and locality. The structural inductive bias that powered the deep learning revolution in vision.
- Object detection: Faster R-CNN, YOLO, DETR
Localize and classify objects in an image. The three main architectural families: two-stage proposal-based, one-stage grid-based, and transformer-based.
- ResNet
Residual connections enabled networks deeper than 30 layers to train. Still the dominant backbone for transfer learning in 2026.
- Vision transformers (ViT)
Apply a standard transformer to a sequence of image patches. Beats CNNs at scale; the dominant backbone for foundation vision models in 2026.
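The first step of the ViT entry above, sketched in NumPy: turn an image into the sequence of flattened patches a transformer consumes (patch size illustrative; no linear projection or position embeddings).

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into a sequence of flattened patch vectors."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)        # (num_patches, p*p*C)
```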
LLM Internals 18
- Continuous batching for LLM serving
Let new requests join an in-flight batch at every decode step instead of waiting for the slowest request to finish. The other half of why vLLM is fast.
- FlashAttention
I/O-aware exact attention. Replaces the O(n²) HBM traffic with a tiled streaming softmax in SRAM. The single most important kernel-level optimization in modern transformers.
- Grouped-query and multi-query attention (GQA, MQA)
Share K and V heads across query heads to shrink the KV cache 4-8x with negligible quality loss. Standard in modern decoder LLMs.
- KV cache: how LLM inference avoids quadratic decode cost
The single most important optimization in autoregressive decoding. Without it, every decode step would recompute keys and values for the entire prefix, making each step quadratic in context length instead of linear. Sketched at the end of this section.
- Linear attention (Linformer, Performer, kernel methods)
Approximate the softmax attention matrix with a low-rank or kernel factorization so cost is linear in sequence length.
- Long-context LLMs: training and serving techniques
What makes a 1M-token context model work. Position-encoding extension, attention kernels, KV-cache management, and the tradeoffs.
- Mixture of Experts (MoE)
Replace one large feed-forward block with N smaller experts and a router that activates only k of them per token. Scales total parameter count without scaling per-token compute.
- PagedAttention and the vLLM serving model
Treat the KV cache like virtual memory: allocate in fixed-size pages, share pages across sequences, eliminate fragmentation. The reason vLLM is the default LLM server.
- Prefill vs. decode: the two phases of LLM inference
LLM inference has two cost regimes with very different bottlenecks. Mixing them up leads to wrong cost models and bad serving decisions.
- Quantization: INT8, INT4, FP8, and the inference cost picture
Reduce model precision to shrink memory and speed up inference. The trade-offs are real but increasingly small with modern techniques.
- RAG: retrieval-augmented generation
The standard pattern for grounding LLMs in your own data. Reference page; the full essay is linked at the bottom.
- RLHF, DPO, and the alignment training stack
How LLMs get from 'next-token predictor' to 'helpful assistant.' The post-training pipeline in 2026.
- RoPE, ALiBi, and the modern positional encoding landscape
Sinusoidal positional encoding appears in the original transformer paper but not in any modern LLM. Here's what replaced it and why.
- Rotary position embeddings (RoPE)
The dominant position encoding for modern LLMs. Encodes relative position by rotating Q and K in 2D subspaces, enabling clean context extrapolation.
- Sparse attention (BigBird, Longformer)
Replace the dense n×n attention mask with a sparse pattern that has O(n) non-zeros while preserving information flow across the full sequence.
- Speculative decoding
Break the autoregressive serial bottleneck without changing the output distribution. 2-3× inference speedup, free.
- Tokenization: BPE, WordPiece, and the LLM era
The critical input layer between text and model. Tokenization mismatch is a frequent source of production LLM bugs.
- Transformer architecture: a senior-level mental model
Strip away the diagram clutter. A transformer is a stack of (residual + LayerNorm + (attention or FFN)) blocks. Understanding why each piece is there is more important than memorizing the diagram.
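The KV-cache entry above, as a toy single-head NumPy sketch: each decode step projects only the new token and attends over cached keys and values, so the per-step cost is linear in context length. The projection matrices here are random stand-ins, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

K_cache, V_cache = [], []

def decode_step(h_new):
    """Project only the new token's hidden state, append its K/V to the cache, attend over the cache."""
    q, k, v = h_new @ Wq, h_new @ Wk, h_new @ Wv
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)   # (t, d): one new row per step, nothing recomputed
    weights = softmax(q @ K.T / np.sqrt(d))       # O(t) work per step
    return weights @ V
```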
Training Fundamentals 16
- Activation checkpointing
Trade compute for memory: drop activations during the forward pass and recompute them during the backward pass. The cheapest way to fit a larger model on the same GPU.
- Adam, AdamW, and the modern optimizer landscape
Why Adam works, why AdamW is the version you actually want, and what's changed in the optimizer landscape since 2018.
- BatchNorm vs LayerNorm (and the transformer wrinkle)
These look similar but aren't. Mixing them up in interviews is one of the cheapest ways to lose level points. Here's the right mental model.
- Calibration: when your model says 80% it should be right 80% of the time
Accuracy isn't enough; you also want predictions to mean what they say. Calibration is the difference.
- Cross-entropy and softmax
The pairing isn't arbitrary. Cross-entropy is the negative log-likelihood under a categorical distribution, and the softmax+CE gradient simplifies to (p − y), which is why it's stable.
- Dropout
Randomly zero out a fraction of activations during training. The simplest stochastic regularizer; still standard in vision and many NLP architectures.
- Gradient accumulation
Run several forward-backward passes before each optimizer step to simulate a larger effective batch size without the memory cost.
- Gradient clipping
Cap the norm of the gradient before each optimizer step. The simplest and most reliable defense against training instability.
- Label smoothing
Replace one-hot targets with a softened distribution that puts ε mass on the wrong classes. Improves calibration, sometimes hurts retrieval.
- Learning rate schedules: warmup and cosine decay
Why almost every modern training run linearly warms up the LR over a few hundred steps and then decays it on a cosine to near zero. Sketched at the end of this section.
- Mixed precision training: FP16, BF16, and FP8
How modern transformers train at 2-4× the throughput of FP32 without quality loss. The bit layouts matter; the loss-scaling recipe matters more.
- Mixup and CutMix
Two data-augmentation schemes that train on convex combinations of pairs of inputs and their labels. Strong regularization for image classification; sometimes used in audio and tabular.
- Regularization: L1, L2, dropout, early stopping, and the modern view
The classical regularizers + the modern reality that SGD's noise is itself a regularizer. The hierarchy of choices when your model is overfitting.
- SGD with momentum
Add a moving average of past gradients to the update. Smoother trajectories, faster convergence in narrow valleys, and the foundation of Adam's first moment.
- Weight decay vs. L2 regularization
L2 adds ½λ‖θ‖² to the loss; weight decay shrinks θ multiplicatively at each step. They are equivalent under SGD but not under Adam, which is why AdamW exists.
- Weight initialization (Kaiming, Xavier)
Set the initial variance of each layer's weights so that activations and gradients neither explode nor vanish through depth. The single most impactful one-line decision in deep nets.
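The warmup-plus-cosine schedule described above, as a small self-contained sketch; the step counts and minimum LR in the example are illustrative.

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay from max_lr down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Example: peak LR 3e-4, 500 warmup steps, 10,000 total steps.
schedule = [lr_at_step(s, 3e-4, 500, 10_000) for s in range(10_000)]
```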
Systems & Infrastructure 7
- All-reduce and other collectives
The communication primitives behind every distributed training job. All-reduce, all-gather, reduce-scatter, broadcast. What they do, costs, and when each is used.
- Floating-point formats: FP32, FP16, BF16, FP8, TF32
How modern accelerators trade precision for speed. The bit layouts of every numeric format that appears in deep learning.
- FSDP and ZeRO: sharding optimizer state, gradients, and parameters
How modern training scales beyond a single GPU's memory by partitioning the optimizer state, gradients, and parameters across the data-parallel group. A rough memory estimate is sketched at the end of this section.
- GPU memory hierarchy: HBM, SRAM, and why I/O matters more than FLOPs
Modern GPUs are memory-bound for almost everything except big matmuls. Understanding HBM vs. SRAM bandwidth is the prerequisite for FlashAttention, KV-cache reasoning, and inference cost models.
- Pipeline parallelism
Split the model across GPUs by layer; pipeline mini-batches through the stages. The way to scale across slow interconnects when TP isn't viable.
- Sequence packing with block-diagonal masks
Concatenate multiple short examples into one fixed-length sequence to eliminate padding waste. The single largest throughput win for training on skewed-length corpora.
- Tensor parallelism
Split a single matrix multiplication across multiple GPUs. The way to fit one transformer layer that doesn't fit on a single device.
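A back-of-the-envelope sketch for the FSDP/ZeRO entry above, under common assumptions (BF16 parameters and gradients, FP32 master weights plus two Adam moments). The byte counts are the standard rough accounting, not a measurement.

```python
def sharded_state_gib_per_gpu(n_params, n_gpus,
                              bytes_param=2, bytes_grad=2, bytes_optim=12):
    """Per-GPU GiB of parameters + gradients + optimizer state under full (ZeRO-3 / FSDP) sharding."""
    total_bytes = n_params * (bytes_param + bytes_grad + bytes_optim)
    return total_bytes / n_gpus / 2**30

# Example: a 7B-parameter model sharded over 8 GPUs holds roughly 13 GiB of state per GPU
# (activations, buffers, and fragmentation come on top of this).
print(f"{sharded_state_gib_per_gpu(7e9, 8):.1f} GiB")
```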
ML Systems & Evaluation 10
- A/B testing for ML systems
The framework for proving a model change actually helps. Statistical power, novelty effects, network effects, and all the other things people get wrong.
- Confusion matrix and classification metrics
The 2x2 (or KxK) table of predictions vs. truth that every classification metric is computed from. The Rosetta stone of binary classification.
- Cross-validation strategies
Hold-out, k-fold, stratified, grouped, and time-series CV, and when each one is and isn't appropriate.
- Embedding spaces and similarity metrics
How learned vector representations encode meaning, and why cosine similarity is the default metric for retrieval and recsys.
- Expected Calibration Error (ECE)
How well do predicted probabilities match empirical frequencies? Bin predictions by confidence, compare bin-mean confidence to bin-accuracy. Sketched at the end of this section.
- Perplexity and bits per token
The standard intrinsic metric for language models. What it measures, what units to use, and why it's a poor end-product evaluation.
- Precision, recall, and F1
The three metrics every classifier interview asks about. Their definitions, when to optimize which, and the F-beta generalization.
- Ranking metrics: NDCG, MAP, MRR
Beyond binary precision-recall: how to measure ranking quality when order matters and labels are graded.
- ROC, PR curves, and AUC
What ROC-AUC and PR-AUC measure, when to use which, and why ROC-AUC is misleading on heavy class imbalance.
- Two-tower retrieval
Encode queries and items with separate networks into a shared embedding space; retrieve by approximate nearest neighbors. The default architecture for industrial recommenders and search.
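The ECE recipe above, as a minimal NumPy sketch: bin by confidence, compare each bin's mean confidence to its accuracy, and weight by bin size (equal-width bins assumed).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probability of the predicted class; correct: 0/1 array of hits.
    Equal-width bins; returns the size-weighted average of |bin accuracy - bin confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```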
Other
- Actor-critic methods
Policy gradient with a learned value baseline. The actor picks actions; the critic estimates how good they were. The architecture under PPO, A3C, SAC, and most modern RL.
- Advantage estimation and GAE
Policy gradients need a low-variance estimate of how much better an action was than average. GAE is the standard answer: an exponentially weighted blend of n-step returns.
- Alternating least squares for collaborative filtering
Factorize the user-item matrix into two low-rank factors. Each is a linear regression given the other, so alternate. The classical recsys workhorse before deep learning.
- Anchor boxes and non-maximum suppression
Object detectors predict thousands of overlapping boxes. Anchors give each prediction a prior shape; NMS prunes near-duplicates. The pre-DETR pipeline that defined the field for a decade.
- Approximate nearest neighbors: HNSW, IVF, and product quantization
Exact k-NN over a billion vectors is infeasible. ANN trades a small recall hit for a 100x to 10,000x speedup. The reason vector search at scale exists.
- BERT and masked language modeling
Train a transformer to fill in randomly masked tokens. The result is a bidirectional encoder that broke a dozen NLP benchmarks at once and defined the pretrain-then-finetune era.
- Convolution as matrix multiplication (im2col)
A 2D convolution is a matmul in disguise. Unfold the input into columns, multiply by a flattened filter matrix. The reason CNNs run fast on the same hardware as transformers.
- Decoding strategies: greedy, beam, top-k, top-p, temperature
Same model, different samplers, very different outputs. The choice of decoder is often more impactful than the last percent of training. Know the tradeoffs.
- Epistemic vs aleatoric uncertainty
Epistemic uncertainty shrinks with more data; aleatoric does not. Conflating them produces miscalibrated systems and wasted data collection. The distinction every senior ML engineer should be able to articulate.
- Exploration vs exploitation: epsilon-greedy, UCB, Thompson sampling
An RL or bandit agent has to keep trying new actions to learn while taking the best-known action to score. Three classical strategies, each with a different way of resolving the tension.
- Factorization machines
Linear models can't capture feature interactions. Polynomial models have too many parameters. Factorization machines find a middle path: factorize the interaction matrix and learn an embedding per feature.
- Forward-backward and Viterbi: dynamic programming on chains
Sum and max over exponentially many paths in linear time. Forward-backward computes posteriors over hidden states; Viterbi finds the most likely state sequence. The same idea, two semirings.
- Gaussian processes
A distribution over functions defined entirely by a covariance kernel. Predicts both a mean and a calibrated uncertainty. Beautiful theory, brutal scaling.
- Graph neural networks: message passing as A·X·W
Neighbors carry signal. A graph neural network averages each node's neighborhood and projects with a learned matrix. The same matmul as a CNN, on irregular structure.
- Kernel methods and the kernel trick
Compute inner products in a high-dimensional feature space without ever materializing the features. The mathematical move that lets a linear classifier draw nonlinear boundaries.
- Knowledge distillation
Train a small student to match a large teacher's outputs. The student gets richer signal than from hard labels because the teacher's soft probabilities encode similarity structure.
- LSTM and GRU: gating as Hadamard products
Recurrent networks fail because gradients vanish through repeated matmul. Gates fix this by using elementwise multiplication to control information flow. Then transformers replaced them anyway.
- Microannealing and midtraining
A short cooldown applied to a mostly-trained checkpoint with a small fraction of candidate data mixed in. The standard mid-training probe for whether a new dataset is worth including.
- Multi-head attention: why one head is not enough
Run h independent attention computations in parallel, then concatenate. Each head specializes in a different relation. The mechanism most senior candidates can write but few can motivate.
- Pruning: structured vs unstructured sparsity
Set unimportant weights to zero, recover most of the accuracy. Unstructured pruning shrinks model size; structured pruning shrinks inference time. They solve different problems.
- Self-attention vs cross-attention
Same operation, different inputs. Self-attention reads from one sequence; cross-attention reads from another. The distinction every encoder-decoder architecture rests on.
- t-SNE and UMAP: nonlinear dimensionality reduction
Both project high-dimensional data to 2D for visualization by preserving local neighborhoods. Both are easy to misread. Know what they show and what they hide.
- Word embeddings: Word2Vec, GloVe, and the geometry of meaning
Map words to dense vectors so that similar words land near each other. The breakthrough that proved meaning lives in geometry, not symbols.
- WSD and WSD-S learning rate schedules
Warmup-Stable-Decay holds the LR flat for most of training and decays at the end. WSD-S adds cyclic decay-and-rewarm probes. Both are designed for pretraining where you don't know the total token budget upfront.
- Z-loss
An auxiliary loss term that penalizes the squared log-partition function of the softmax. Started as a stability hack for logit blowup. Now used as the default regularizer on logit scale during long or deep cooldowns.
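The z-loss entry above, as a minimal NumPy sketch: compute the log-partition function of each row of logits stably and penalize its square (the coefficient is illustrative).

```python
import numpy as np

def z_loss(logits, coeff=1e-4):
    """Auxiliary penalty on the squared log-partition function log Z = logsumexp(logits)."""
    m = logits.max(axis=-1)
    log_z = m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))
    return coeff * np.mean(log_z ** 2)
```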