One-line definition
A Gaussian process (GP) places a prior over functions such that any finite collection of function values is jointly Gaussian. The GP is fully specified by a mean function $m(x)$ (usually $0$) and a covariance kernel $k(x, x')$.
Why it matters
GPs are the canonical Bayesian regression model. They give a closed-form posterior over functions, with a predictive mean and a predictive variance at every query point. The variance shrinks near observed data and grows away from it, which makes GPs the standard tool for Bayesian optimization, active learning, and any setting where uncertainty quantification matters as much as the prediction.
The cost is brutal scaling: exact inference is $\mathcal{O}(n^3)$ in time and $\mathcal{O}(n^2)$ in memory. GPs are practical up to a few thousand training points without approximation; modern variants (sparse, structured kernel, deep kernel) push that to millions.
The mechanism
For training inputs $X = (x_1, \dots, x_n)$, training targets $\mathbf{y} = (y_1, \dots, y_n)^\top$, test point $x_*$, prior covariance kernel $k$, and noise variance $\sigma_n^2$:

The joint distribution is

$$
\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},\;
\begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^\top & k(x_*, x_*) \end{bmatrix}\right),
$$

where $K_{ij} = k(x_i, x_j)$, $\mathbf{k}_*$ is the vector of $k(x_i, x_*)$, and $f_* = f(x_*)$.

Conditioning on the training data gives the Gaussian posterior:

$$
\begin{aligned}
\mu_*      &= \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}, \\
\sigma_*^2 &= k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{k}_*.
\end{aligned}
$$

Two matrix-vector products against $(K + \sigma_n^2 I)^{-1}$ give you both the predictive mean and the predictive variance at any test point. The hard part is $(K + \sigma_n^2 I)^{-1}$, which costs $\mathcal{O}(n^3)$ to compute and $\mathcal{O}(n^2)$ to store.
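A minimal NumPy sketch of these formulas, assuming an RBF kernel and toy sine data (the helper names, lengthscale, and noise level below are illustrative choices, not fixed by the text above):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel: sigma_f^2 * exp(-||a - b||^2 / (2 l^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=1e-2, **kernel_kw):
    """Exact GP posterior mean and variance at X_star via a Cholesky solve."""
    K = rbf_kernel(X, X, **kernel_kw) + noise_var * np.eye(len(X))
    k_star = rbf_kernel(X, X_star, **kernel_kw)           # n x m cross-covariance
    L = np.linalg.cholesky(K)                             # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, k_star)
    mean = k_star.T @ alpha
    var = rbf_kernel(X_star, X_star, **kernel_kw).diagonal() - np.sum(v**2, axis=0)
    return mean, var

# Toy data: noisy sine observations.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-4, 4, 100)[:, None]
mu, var = gp_posterior(X, y, X_star)   # variance grows away from the data
```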
Choosing the kernel
The kernel encodes prior beliefs about the function:
- RBF / squared exponential: $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$. Smooth functions, infinite differentiability. Default.
- Matérn: lets you tune smoothness via a half-integer parameter $\nu$. $\nu = 5/2$ is a common modern default; smoother than $\nu = 3/2$, less restrictive than RBF.
- Periodic: $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi |x - x'| / p)}{\ell^2}\right)$. For periodic data.
- Linear: $k(x, x') = \sigma_f^2\, x^\top x'$. Recovers Bayesian linear regression.
- Sums and products of valid kernels are valid kernels. Add a periodic and an RBF for “trend plus seasonality.”
Hyperparameters (lengthscale $\ell$, signal variance $\sigma_f^2$, noise $\sigma_n^2$) are typically learned by maximizing the log marginal likelihood:

$$
\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\, \mathbf{y}^\top (K + \sigma_n^2 I)^{-1} \mathbf{y} \;-\; \tfrac{1}{2} \log \lvert K + \sigma_n^2 I \rvert \;-\; \tfrac{n}{2} \log 2\pi.
$$
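As a sketch of kernel composition and hyperparameter fitting, scikit-learn's GaussianProcessRegressor maximizes exactly this log marginal likelihood over the kernel hyperparameters when you call fit; the "trend plus seasonality" kernel and the toy data below are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# "Trend plus seasonality": sum of a smooth RBF and a periodic kernel,
# plus a learned noise term. All hyperparameters (lengthscales, period,
# noise level) are fit by maximizing the log marginal likelihood.
kernel = (RBF(length_scale=1.0)
          + ExpSineSquared(length_scale=1.0, periodicity=1.0)
          + WhiteKernel(noise_level=1e-2))

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)[:, None]
y = 0.3 * X.ravel() + np.sin(2 * np.pi * X.ravel()) + 0.1 * rng.standard_normal(50)

gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)                                   # optimizes the marginal likelihood
print(gpr.kernel_)                              # learned hyperparameters
print(gpr.log_marginal_likelihood_value_)       # value at the optimum
mu, std = gpr.predict(np.linspace(0, 12, 100)[:, None], return_std=True)
```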
Scaling: sparse and approximate variants
- Inducing points (Snelson & Ghahramani, 2006). Pick $m \ll n$ inducing inputs, approximate $K$ via a low-rank decomposition. $\mathcal{O}(nm^2)$ training, $\mathcal{O}(m^2)$ prediction per test point.
- SVGP (Hensman et al., 2013). Variational inference over inducing point values. Stochastic mini-batch training. The standard for large-data GPs.
- KISS-GP / structured kernels (Wilson & Nickisch, 2015). Exploit grid structure for near-$\mathcal{O}(n)$ inference.
- Deep kernels. Replace $k(x, x')$ with $k(g_\theta(x), g_\theta(x'))$ where $g_\theta$ is a neural network. Combines deep features with calibrated uncertainty.
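A rough sketch of the low-rank idea behind inducing points, in the style of a subset-of-regressors approximation of the predictive mean. It reuses the hypothetical rbf_kernel helper from the mechanism sketch and picks inducing inputs at random, which real implementations (FITC, SVGP) improve on:

```python
import numpy as np

def inducing_point_mean(X, y, X_star, Z, kernel, noise_var=1e-2):
    """Approximate GP predictive mean with m << n inducing inputs Z.

    Subset-of-regressors style: the n x n solve is replaced by an m x m
    solve, dropping the dominant cost to O(n m^2).
    """
    m = len(Z)
    Kmm = kernel(Z, Z) + 1e-6 * np.eye(m)       # m x m, jitter for stability
    Knm = kernel(X, Z)                          # n x m
    Ksm = kernel(X_star, Z)                     # cross-covariance to test points
    A = Knm.T @ Knm + noise_var * Kmm           # m x m system instead of n x n
    w = np.linalg.solve(A, Knm.T @ y)
    return Ksm @ w

# Example: 5,000 points handled with 50 randomly chosen inducing inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(5000, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(5000)
Z = X[rng.choice(len(X), size=50, replace=False)]
mu_approx = inducing_point_mean(X, y, np.linspace(-3, 3, 100)[:, None], Z, rbf_kernel)
```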
Where GPs are still the right tool
- Bayesian optimization of expensive functions (hyperparameter search, materials discovery, drug design). The acquisition function uses the posterior variance to balance exploration and exploitation (see the sketch after this list).
- Geospatial / time-series modeling with structured covariance (kriging is just a GP).
- Small-data regression where calibrated uncertainty matters more than predictive accuracy.
- Probabilistic numerics (treat numerical algorithms as Bayesian inference).
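A minimal sketch of how the posterior variance drives exploration, using an upper-confidence-bound acquisition on top of the hypothetical gp_posterior helper from the mechanism sketch (the candidate grid and the weight beta are illustrative):

```python
import numpy as np

# Assumes the hypothetical gp_posterior helper defined in the mechanism sketch.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(10, 1))                # points evaluated so far
y = np.sin(X).ravel()                               # toy stand-in for an expensive objective

X_cand = np.linspace(-4, 4, 200)[:, None]           # candidate query points
mu, var = gp_posterior(X, y, X_cand)
beta = 2.0                                          # exploration weight (illustrative)
sigma = np.sqrt(np.clip(var, 0.0, None))            # guard against round-off negatives
ucb = mu + beta * sigma                             # high mean OR high uncertainty
x_next = X_cand[np.argmax(ucb)]                     # next point to evaluate
```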
For most large-scale supervised learning, deep nets with bootstrapping or deep ensembles deliver competitive uncertainty estimates at much better scaling.
Common pitfalls
- Using RBF without setting the lengthscale. Default lengthscale produces nearly-flat predictions or wildly oscillating ones. Always optimize.
- Ignoring the noise term $\sigma_n^2$. Without it, $K$ is often singular and inversion fails. Add a small jitter to the diagonal even for noiseless data (see the sketch after this list).
- Reading posterior variance as “model uncertainty.” GP variance is uncertainty in the function value under the assumed kernel, i.e. the prior. Misspecified kernels give miscalibrated variance.
- Treating GPs as automatic. Kernel choice is a strong prior. The model can fail silently if the kernel does not match the data structure.
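A small illustration of the jitter fix from the noise-term pitfall, again assuming the hypothetical rbf_kernel helper from the mechanism sketch:

```python
import numpy as np

X_dup = np.array([[0.0], [0.0], [1.0]])                  # duplicate inputs make K singular
K = rbf_kernel(X_dup, X_dup)                             # noiseless kernel matrix
try:
    np.linalg.cholesky(K)                                # not positive definite -> fails
except np.linalg.LinAlgError:
    print("Cholesky failed on the noiseless K")
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X_dup)))    # small diagonal jitter fixes it
```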