One-line definition
Matrix calculus extends the ordinary derivative to functions whose inputs and/or outputs are vectors or matrices. The two organizing objects are the gradient (for scalar-valued functions) and the Jacobian (for vector-valued functions); the Hessian is the matrix of second partial derivatives.
Why it matters
Every learning algorithm computes derivatives of a scalar loss with respect to vector- or tensor-valued parameters. Knowing the right shapes and conventions saves hours of debugging. Backpropagation is matrix calculus applied recursively through a computation graph.
The four shape combinations
| Derivative | Scalar output $f \in \mathbb{R}$ | Vector output $\mathbf{f} \in \mathbb{R}^m$ |
|---|---|---|
| First ($\mathbf{x} \in \mathbb{R}^n$) | Gradient $\nabla f \in \mathbb{R}^n$ | Jacobian $J \in \mathbb{R}^{m \times n}$, $J_{ij} = \partial f_i / \partial x_j$ |
| Second | Hessian $H \in \mathbb{R}^{n \times n}$, $H_{ij} = \partial^2 f / \partial x_i \partial x_j$ | (rarely used; tensor-valued) |
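A quick way to confirm these shapes, as a minimal sketch assuming PyTorch (the dimensions and functions are arbitrary; `torch.autograd.functional` provides `jacobian` and `hessian` helpers):

```python
import torch
from torch.autograd.functional import jacobian, hessian

n, m = 4, 3
A = torch.randn(m, n)

def f_vec(x):       # f: R^n -> R^m
    return A @ x

def f_scalar(x):    # f: R^n -> R
    return (x ** 2).sum()

x = torch.randn(n)
print(jacobian(f_vec, x).shape)     # torch.Size([3, 4]): m x n Jacobian
print(jacobian(f_scalar, x).shape)  # torch.Size([4]):    gradient, same shape as x
print(hessian(f_scalar, x).shape)   # torch.Size([4, 4]): n x n Hessian
```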
Two competing conventions exist:
- Numerator layout (used in physics and calculus textbooks): $\partial y/\partial \mathbf{x}$ is a row vector matching $\mathbf{x}^\top$.
- Denominator layout (used in statistics and ML): $\partial y/\partial \mathbf{x}$ is a column vector matching $\mathbf{x}$; modern ML universally writes the gradient as a column vector with the same shape as the parameter.
For matrix parameters $W \in \mathbb{R}^{m \times n}$, the gradient $\partial L/\partial W$ has the same shape as $W$.
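To illustrate the "gradient has the shape of the parameter" convention, a minimal PyTorch sketch (the 3×4 matrix is arbitrary):

```python
import torch

W = torch.randn(3, 4, requires_grad=True)  # matrix parameter
x = torch.randn(4)
loss = (W @ x).sum()                       # scalar loss
loss.backward()
print(W.grad.shape)  # torch.Size([3, 4]): same shape as W, not as the scalar loss
```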
Identities you actually use
| Function | Gradient |
|---|---|
| $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$, $A$ symmetric | $\nabla f = 2A\mathbf{x}$ |
| $f(A) = \log\det A$ ($A$ PD) | $\nabla_A f = A^{-1}$ |
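A sanity check of the $\mathbf{x}^\top A \mathbf{x}$ identity against autograd, as a sketch assuming PyTorch (the dimension is arbitrary):

```python
import torch

n = 5
M = torch.randn(n, n)
A = (M + M.T) / 2                        # make A symmetric
x = torch.randn(n, requires_grad=True)

f = x @ A @ x                            # f(x) = x^T A x
(grad,) = torch.autograd.grad(f, x)

print(torch.allclose(grad, 2 * A @ x))   # True: gradient of x^T A x is 2Ax
```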
Chain rule
For a composition $y = f(\mathbf{u})$ with $\mathbf{u} = g(\mathbf{x})$:
- Scalar → scalar → scalar: $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$.
- Vector → vector → scalar: $\nabla_{\mathbf{x}} y = J^\top \nabla_{\mathbf{u}} y$, where $J = \partial \mathbf{u}/\partial \mathbf{x}$ is the Jacobian of $g$.
Backprop is just this chain rule applied to a computation graph layer-by-layer, with the Jacobian-vector product computed implicitly to avoid materializing the full Jacobian.
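The vector → vector → scalar case can be checked directly against autograd. A sketch assuming PyTorch, with $g(\mathbf{x}) = W\mathbf{x}$ chosen so the Jacobian is simply $W$:

```python
import torch

n, m = 4, 3
W = torch.randn(m, n)
x = torch.randn(n, requires_grad=True)

u = W @ x              # u = g(x), with Jacobian J = du/dx = W
y = (u ** 2).sum()     # y = f(u), with grad_u y = 2u

(grad_auto,) = torch.autograd.grad(y, x)
grad_manual = W.T @ (2 * u)   # chain rule: grad_x y = J^T grad_u y

print(torch.allclose(grad_auto, grad_manual))  # True
```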
Hessian and second-order methods
The Hessian $H = \nabla^2 L(\theta)$ describes local curvature. Second-order methods (Newton’s method, K-FAC, Shampoo) use it to scale gradients by inverse curvature: $\theta \leftarrow \theta - H^{-1} \nabla L(\theta)$.
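For intuition, on a quadratic loss a single Newton step lands exactly on the minimizer. A minimal sketch assuming PyTorch, with the quadratic $L(\theta) = \tfrac12 \theta^\top A \theta - \mathbf{b}^\top \theta$ made up for illustration:

```python
import torch

n = 4
M = torch.randn(n, n)
A = M @ M.T + n * torch.eye(n)   # symmetric positive definite Hessian
b = torch.randn(n)

theta = torch.zeros(n)
grad = A @ theta - b                          # gradient of 0.5 * theta^T A theta - b^T theta
theta = theta - torch.linalg.solve(A, grad)   # Newton step: theta - H^{-1} grad

print(torch.allclose(theta, torch.linalg.solve(A, b)))  # True: the exact minimizer
```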
In modern deep learning, the Hessian is too large to materialize ($O(n^2)$ entries for $n$ parameters). Approximations:
- Diagonal: keep only the diagonal entries (RMSProp and Adam’s second-moment estimate $v_t$ approximate this).
- Block-diagonal: per-layer Fisher information (K-FAC).
- Hessian-vector products: $H\mathbf{v} = \nabla_\theta\big((\nabla_\theta L)^\top \mathbf{v}\big)$, computed via two backward passes (see the sketch after this list). Used in conjugate gradient and influence functions.
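A minimal sketch of a Hessian-vector product via double backward, assuming PyTorch (the toy loss is made up):

```python
import torch

theta = torch.randn(5, requires_grad=True)
v = torch.randn(5)
loss = (theta ** 4).sum()        # any twice-differentiable scalar loss

# First backward pass: gradient, keeping the graph for a second differentiation.
(grad,) = torch.autograd.grad(loss, theta, create_graph=True)

# Second backward pass: differentiate grad . v to get Hv without forming H.
(hv,) = torch.autograd.grad(grad @ v, theta)

# Check against the explicit Hessian (only feasible for tiny problems).
H = torch.autograd.functional.hessian(lambda t: (t ** 4).sum(), theta)
print(torch.allclose(hv, H @ v))  # True
```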
Common pitfalls
- Mismatched layout conventions. Always check whether your reference uses numerator or denominator layout; the difference is a transpose.
- Treating gradients as having the same shape as the loss. They have the shape of the parameter, not the loss.
- Computing Jacobians explicitly. For a vector-to-vector function $f: \mathbb{R}^n \to \mathbb{R}^m$ with $m$ and $n$ both large, the Jacobian has $m \times n$ entries. Use vector-Jacobian products via `torch.autograd.grad` or `jax.vjp` instead (see the sketch after this list).
- Forgetting bias dimensions. For a batched affine layer $Y = XW + \mathbf{b}$, $\partial L/\partial \mathbf{b}$ has the shape of $\mathbf{b}$: the per-example contributions are summed over the batch dimension.
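A sketch of both pitfalls in code, assuming PyTorch (the shapes are arbitrary): a vector-Jacobian product computed without materializing the Jacobian, and a bias gradient whose batch dimension has been summed away.

```python
import torch

batch, n_in, n_out = 32, 100, 50
X = torch.randn(batch, n_in)
W = torch.randn(n_in, n_out, requires_grad=True)
b = torch.randn(n_out, requires_grad=True)

# Bias pitfall: dL/db has the shape of b; the batch dimension is summed away.
loss = ((X @ W + b) ** 2).mean()
loss.backward()
print(b.grad.shape)   # torch.Size([50])

# Jacobian pitfall: contract a cotangent v with the Jacobian by passing it as
# grad_outputs (a vector-Jacobian product), instead of materializing the Jacobian.
Y = X @ W + b
v = torch.randn(batch, n_out)
(vjp_W,) = torch.autograd.grad(Y, W, grad_outputs=v)
print(vjp_W.shape)    # torch.Size([100, 50]): same shape as W
```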