Kernel methods and the kernel trick

Compute inner products in a high-dimensional feature space without ever materializing the features. The mathematical move that lets a linear classifier draw nonlinear boundaries.

One-line definition

A kernel $k(x, x')$ computes the inner product $\langle \phi(x), \phi(x') \rangle$ of two points in a feature space defined by a map $\phi$, without explicitly evaluating $\phi$. Algorithms expressible in terms of inner products (SVM, Gaussian processes, kernel PCA, kernel ridge regression) can swap dot products for $k$ and operate implicitly in feature space.

Why it matters

Linear models are limited to linear decision boundaries. The classical fix was to engineer nonlinear features and run a linear model on them. Kernel methods give you that move for free: any positive-definite kernel implicitly defines a (potentially infinite-dimensional) feature space, and the algorithm only ever touches inner products.

This was the dominant nonlinear ML approach from roughly 1995 to 2012. Then deep learning replaced it for almost every supervised task. Kernels remain central to Gaussian processes, attention (the softmax of $QK^\top$ is a kernelized similarity), and certain interpretability and theoretical-analysis tools (NTK, kernel ridge regression as a baseline).

The trick

Suppose your algorithm only reads data through inner products $\langle x_i, x_j \rangle$. Substitute $k(x_i, x_j)$ for each one. You are now running the algorithm in the feature space defined by $\phi$, where $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, without ever touching $\phi$.

Concrete example: the polynomial kernel of degree 2 in $\mathbb{R}^d$:

$$k(x, z) = (x^\top z + 1)^2$$

Expand it. You will find this equals $\langle \phi(x), \phi(z) \rangle$, where $\phi$ maps $x$ to an $O(d^2)$-dimensional space of monomials up to degree 2. Computing $\phi$ explicitly is $O(d^2)$ memory and compute; computing $k(x, z)$ is $O(d)$.
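To see this concretely, here is a minimal numpy check, assuming one standard choice of explicit feature map for this kernel (the $\sqrt{2}$ weights on the linear and cross terms are what make the expansion match exactly):

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x'z + 1)^2:
    monomials up to degree 2, with sqrt(2) weights so the
    inner product reproduces the kernel exactly."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j]
             for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = phi(x) @ phi(z)     # materializes O(d^2) features
implicit = (x @ z + 1) ** 2    # O(d): the kernel trick
print(np.isclose(explicit, implicit))  # True
```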

For the RBF kernel $k(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$, the implicit feature space is infinite-dimensional. You cannot compute $\phi$ explicitly even in principle.

The Gram matrix

For a dataset of $n$ points, the kernel defines an $n \times n$ Gram matrix $K_{ij} = k(x_i, x_j)$. Many kernel algorithms reduce to operations on $K$:

  • Kernel ridge regression: $\alpha = (K + \lambda I)^{-1} y$, with predictions $f(x) = \sum_i \alpha_i \, k(x_i, x)$.
  • Kernel PCA: eigendecompose the centered $K$.
  • Gaussian processes: use $K$ as the covariance.
  • SVM: the dual formulation depends only on $K$ and the labels.

The $n \times n$ shape is the bottleneck: $O(n^2)$ memory, $O(n^3)$ to invert. This limits naive kernel methods to roughly $n \sim 10^4$–$10^5$ points.
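A minimal kernel ridge regression sketch on a toy problem, built directly on the Gram matrix; `sigma` and `lam` are illustrative choices, not recommendations. The dense solve is the $O(n^3)$ step:

```python
import numpy as np

def rbf(X, Z, sigma=1.0):
    """Gram matrix of the RBF kernel between row-sets X and Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))            # toy 1-D regression
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

lam = 1e-2                                       # ridge penalty
K = rbf(X, X)                                    # n x n Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # the O(n^3) step

X_test = np.linspace(-3, 3, 50)[:, None]
y_pred = rbf(X_test, X) @ alpha                  # f(x) = sum_i alpha_i k(x_i, x)
```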

What makes a valid kernel

$k$ must be positive semi-definite: for any finite set of points, the Gram matrix $K$ is PSD ($c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$). Equivalently, $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for some $\phi$ into some inner-product space (Mercer's theorem).
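The PSD condition is easy to spot-check numerically on a concrete point set (a check on one sample, not a proof of validity):

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """PSD check via the smallest eigenvalue; K is assumed symmetric."""
    return np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-sq / 2)            # RBF Gram matrix, sigma = 1
print(is_psd(K))               # True for any point set
```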

Common kernels:

  • Linear: $k(x, z) = x^\top z$. The trivial case.
  • Polynomial: $k(x, z) = (x^\top z + c)^p$.
  • RBF / Gaussian: $k(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$. The default.
  • Laplacian: $k(x, z) = \exp\!\left(-\|x - z\|_1 / \sigma\right)$.
  • String kernels: count shared substrings.
  • Graph kernels: count shared subgraphs.

Combinations of valid kernels (sums, products, positive scalings) are themselves valid kernels. This is the standard recipe for engineering domain-specific kernels.
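Reusing `X` and `is_psd` from the sketch above, a quick numerical spot-check of that closure (again on one point set, not a proof):

```python
K_lin = X @ X.T                                   # linear kernel
K_rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)

print(is_psd(K_lin + K_rbf))    # sum of kernels
print(is_psd(K_lin * K_rbf))    # elementwise (Schur) product
print(is_psd(3.0 * K_rbf))      # positive scaling
```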

Kernel trick in attention

A bilinear attention score $q^\top k$ is the linear kernel. Linear attention papers replace this with feature-map scores $\phi(q)^\top \phi(k)$ for cheaper computation; a softmax-attention variant can be approximated by random Fourier features for the RBF kernel.
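A sketch of the swap, assuming the $\mathrm{elu}(x) + 1$ feature map used in some linear-attention papers (one choice among many):

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1: a positive feature map, so normalizers stay positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, feat=elu_plus_one):
    """Attention with scores phi(q)'phi(k) instead of softmax(q'k).
    Associativity lets us form phi(K)'V once, so the cost is linear
    in sequence length instead of quadratic."""
    Qf, Kf = feat(Q), feat(K)
    KV = Kf.T @ V                  # (d, d_v): independent of sequence length
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)    # shape (n, d)
```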

Why deep learning won

Kernels make a strong assumption: the right similarity function is fixed in advance. Deep learning learns the feature representation jointly with the task. For high-dimensional structured data (images, text, audio), learned representations beat hand-picked kernels by huge margins.

Kernels remain useful when:

  • Data is small (Gaussian processes).
  • The kernel encodes domain knowledge (string kernels in computational biology).
  • Theoretical analysis is the goal (NTK, infinite-width networks).

Common pitfalls

  • Confusing the kernel trick with the kernel matrix. The trick is the substitution; the matrix is the data structure.
  • Using RBF without scaling. Standardize or whiten features first; the bandwidth $\sigma$ is sensitive to feature scale.
  • Treating kernel methods as scalable. The $n \times n$ Gram matrix kills naive applications above $\sim 10^5$ points. Approximations exist (Nyström, random Fourier features, inducing points for GPs); see the sketch after this list.
  • Conflating “kernel” in the SVM sense with “kernel” in the convolution sense. Two unrelated meanings of the same word.
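As an example of the approximations mentioned in the third bullet, a random-Fourier-features sketch for the RBF kernel (Rahimi and Recht, 2007); `D` controls the accuracy/cost trade-off:

```python
import numpy as np

def rff(X, D=500, sigma=1.0, seed=0):
    """Random Fourier features: z(x)'z(y) approximates
    exp(-||x - y||^2 / (2 sigma^2)) with D fixed features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z = rff(X)                          # n x D instead of n x n
K_approx = Z @ Z.T
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2)
print(np.abs(K_approx - K_exact).max())   # error shrinks as D grows
```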