One-line definition
A Support Vector Machine finds the hyperplane that separates the two classes with the largest margin. The kernel trick replaces $x$ with an implicit nonlinear feature map $\phi(x)$ that is never computed. Only the inner products $k(x, x') = \langle \phi(x), \phi(x') \rangle$ matter.
Why it matters
SVMs were the dominant classification method from ~1998 to ~2012, before deep learning took over for unstructured data and gradient-boosted decision trees (GBDT) took over for tabular data. They remain useful in low-data, high-dimensional regimes (small biology and physics datasets) and as a teaching example of margin maximization, convex optimization, and kernel methods.
The kernel trick itself remains relevant in Gaussian processes, kernel ridge regression, and modern theory (NTK).
The hard-margin SVM (separable case)
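Find $(w, b)$ minimizing $\tfrac{1}{2}\|w\|^2$ subject to $y_i (w^\top x_i + b) \ge 1$ for all $i$, with $y_i \in \{-1, +1\}$. The constraint defines a margin of width $2 / \|w\|$; minimizing $\|w\|$ maximizes the margin.
This is a convex quadratic program with linear constraints. It has a unique solution (for separable data).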
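As a small numeric illustration of why a smaller norm means a wider margin, the sketch below checks the constraint $y_i (w^\top x_i + b) \ge 1$ and the width $2 / \|w\|$ for two hand-picked candidates. The toy data and the helper names `is_feasible` and `margin_width` are made up for this example; nothing here solves the quadratic program.

```python
import numpy as np

# Toy separable data (made up for illustration): two points per class.
X = np.array([[-1.0, 0.0], [-2.0, 1.0], [1.0, 0.0], [2.0, -1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def is_feasible(w, b):
    """True if every point satisfies y_i (w^T x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1.0))

def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

b = 0.0
w_small = np.array([1.0, 0.0])   # candidate with the smaller norm
w_large = np.array([2.0, 0.0])   # same direction, twice the norm
w_tiny = np.array([0.5, 0.0])    # too small: violates the constraints

for w in (w_small, w_large, w_tiny):
    print(w, is_feasible(w, b), margin_width(w))
```

Both `w_small` and `w_large` are feasible, but `w_small` gives the wider margin, while shrinking further to `w_tiny` breaks the constraints. That is the trade-off the optimization resolves.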
The soft-margin SVM (non-separable)
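Allow some violations with slack variables $\xi_i \ge 0$:
$$\min_{w, b, \xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
$C$ trades margin width against violations. Equivalently, minimize the hinge loss $\sum_i \max(0,\, 1 - y_i (w^\top x_i + b))$ plus L2 regularization.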
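The hinge-loss view can be sketched directly: the code below runs full-batch subgradient descent on $\tfrac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i (w^\top x_i + b))$. This is one simple way to minimize the objective, not how production SVM solvers work; the toy data, the step size, the iteration count, and the names `objective` and `fit_subgradient` are all assumptions chosen for illustration.

```python
import numpy as np

# Toy data, symmetric about the origin (made up for illustration).
X = np.array([[-2.0, -1.0], [-1.0, -2.0], [-2.0, -2.0],
              [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

def objective(w, b, C):
    """0.5 * ||w||^2 + C * sum of hinge losses max(0, 1 - y_i f(x_i))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * float(w @ w) + C * float(hinge.sum())

def fit_subgradient(C=1.0, lr=0.01, steps=500):
    """Full-batch subgradient descent on the soft-margin objective."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        # Points with nonzero hinge loss are the only ones that contribute
        # to the subgradient of the loss term.
        violators = y * (X @ w + b) < 1.0
        grad_w = w - C * (y[violators][:, None] * X[violators]).sum(axis=0)
        grad_b = -C * y[violators].sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

w, b = fit_subgradient()
accuracy = float(np.mean(np.sign(X @ w + b) == y))
print(w, b, objective(w, b, C=1.0), accuracy)
```

With a fixed step size the iterates hover near the optimum rather than converging exactly; a decaying step size or a dedicated QP solver would tighten that.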
The dual formulation and the kernel trick
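The dual problem is
$$\max_{\alpha} \ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0.$$
The data appears only as inner products $\langle x_i, x_j \rangle$. Replace $\langle x_i, x_j \rangle$ with $k(x_i, x_j)$ for any positive-definite kernel. This fits a linear SVM in the implicit feature space without ever computing it.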
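To make the "never computed" claim concrete, here is a minimal check for one case where the feature map can be written down: the homogeneous degree-2 polynomial kernel $k(x, z) = (x^\top z)^2$ in two dimensions, whose explicit map is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. The function names `phi` and `poly2_kernel` and the sample vectors are illustrative choices.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in feature space
implicit = poly2_kernel(x, z)              # same value via the kernel alone
print(explicit, implicit)
```

The two numbers agree up to floating-point error. For kernels such as RBF the explicit map is infinite-dimensional, so only the kernel side of this identity can be evaluated.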
Common kernels:
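| Kernel | $k(x, x')$ | Implicit feature space |
|---|---|---|
| Linear | $x^\top x'$ | original |
| Polynomial | $(x^\top x' + c)^d$ | all monomials of degree $\le d$ |
| RBF (Gaussian) | $\exp(-\gamma \lVert x - x' \rVert^2)$ | infinite-dimensional |
| Sigmoid | $\tanh(\gamma\, x^\top x' + c)$ | (not always PSD) |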
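The kernels in the table can be written in a few lines of NumPy, and the "not always PSD" caveat for the sigmoid kernel can be checked numerically by looking at the smallest eigenvalue of the Gram matrix. This is a sketch: the parameter names (`gamma`, `coef0`, `degree`) follow common library conventions but are assumptions here, and the sample points are made up.

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, degree=3, coef0=1.0):
    return (X @ Z.T + coef0) ** degree

def rbf_kernel(X, Z, gamma=1.0):
    # Squared Euclidean distances via ||x||^2 + ||z||^2 - 2 x^T z.
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * (X @ Z.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def sigmoid_kernel(X, Z, gamma=1.0, coef0=-1.0):
    return np.tanh(gamma * (X @ Z.T) + coef0)

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized Gram matrix; negative means not PSD."""
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

# Small-norm sample points (made up for illustration).
X = np.array([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5]])

eigs = {
    "linear": min_eigenvalue(linear_kernel(X, X)),
    "polynomial": min_eigenvalue(polynomial_kernel(X, X)),
    "rbf": min_eigenvalue(rbf_kernel(X, X)),
    "sigmoid": min_eigenvalue(sigmoid_kernel(X, X)),
}
print(eigs)
```

With a negative offset and small-norm inputs, the sigmoid Gram matrix has negative diagonal entries, so it cannot be PSD. The other three stay non-negative up to rounding.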
Support vectors
After training, the optimal $w = \sum_i \alpha_i y_i x_i$ (in primal) or its kernelized analog $f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b$. Most $\alpha_i$ are zero; the points with $\alpha_i > 0$ are the support vectors. They sit on or inside the margin and entirely determine the decision boundary. Removing all non-support vectors and retraining leaves the model unchanged.
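A quick way to see this is to fit a linear SVM, rebuild $w$ from the dual coefficients, and then refit on the support vectors alone. The sketch below assumes scikit-learn is available and uses its `SVC` estimator with a large `C` to approximate the hard-margin case; the six data points are made up so that only the two points nearest the boundary should end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (made up): the closest opposite-class pair is (-1, 0) and (1, 0).
X = np.array([[-3.0, 1.0], [-1.0, 0.0], [1.0, 0.0],
              [3.0, -1.0], [-2.0, -1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1, -1, 1])

clf = SVC(kernel="linear", C=1e3).fit(X, y)
sv_idx = np.sort(clf.support_)            # indices of the support vectors

# w = sum_i alpha_i y_i x_i; dual_coef_ stores the products alpha_i * y_i.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_

# Refit using only the support vectors and compare the resulting hyperplane.
refit = SVC(kernel="linear", C=1e3).fit(X[sv_idx], y[sv_idx])

print(sv_idx, w_from_dual, clf.coef_, refit.coef_)
```

The agreement between the two fits holds only up to solver tolerance, so comparisons should use a small absolute tolerance rather than exact equality.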
When to use SVMs in 2026
| Setting | SVM vs. alternative |
|---|---|
| Small high-dim data, clean features | RBF SVM still strong baseline |
| Tabular with categorical features | GBDT wins |
| Text / images / other unstructured data | Neural nets win |
| Huge data (large $n$) | SVMs scale poorly: $O(n^2)$ to $O(n^3)$ training |
| Online learning | Logistic / linear models |
For most 2026 production work, SVMs have been displaced. Their main uses are pedagogical and in legacy codebases.
Common pitfalls
- Forgetting to scale features. RBF kernels are extremely sensitive to feature scale.
- Tuning $C$ and $\gamma$ separately. They interact; do a 2D grid search.
- Calling a kernel SVM “parametric.” With the kernel trick the parameter count grows with the number of support vectors (effectively $O(n)$); the model behaves more like nearest-neighbor than like a fixed-parameter model.
- Confusing hinge loss with logistic loss. Hinge gives margins; logistic gives probabilities. SVM is not directly probabilistic without Platt scaling.
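The first two pitfalls can be avoided together by putting the scaler and the SVM in one pipeline and searching over both hyperparameters jointly. The following is a minimal sketch assuming scikit-learn; the synthetic data (one informative feature plus one pure-noise feature on a much larger scale) and the grid values are arbitrary choices for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (made up): feature 0 carries the signal,
# feature 1 is noise on a roughly 1000x larger scale.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=4.0 * y - 2.0, scale=1.0)
noise = rng.normal(scale=1000.0, size=n)
X = np.column_stack([informative, noise])

# Scaling lives inside the pipeline so each CV fold is scaled on its own training split.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Joint 2D grid over C and gamma; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1.0, 10.0, 100.0],
    "svc__gamma": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Grids for $C$ and $\gamma$ are usually spaced logarithmically, as here, because the useful range of both spans several orders of magnitude.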
Related
- Calibration. SVM scores need Platt scaling for probabilities.
- Linear regression. Kernel ridge regression is the regression analog.
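For the calibration point, one common route is to wrap the SVM in scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"`, which fits a Platt-style sigmoid on held-out decision scores. The sketch below assumes scikit-learn is installed; the synthetic data and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

# Synthetic two-class data (made up): class means at (-1, -1) and (+1, +1).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(loc=(2.0 * y - 1.0)[:, None], scale=1.0, size=(200, 2))

# The wrapper clones and fits the SVM on each CV split, then fits a sigmoid
# mapping from decision scores to probabilities on the held-out part.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])   # one row per sample, one column per class
print(proba)
```

Each row of `proba` sums to 1. How well calibrated the probabilities are still depends on having enough held-out data for the sigmoid fit.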