One-line definition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) groups points into clusters based on local density: any point with at least minPts neighbors within distance ε is a core point; clusters are formed by chains of mutually reachable core points plus the points within ε of them; points reachable from no core point are labeled as noise.
Why it matters
Unlike k-means, DBSCAN:
- Discovers the number of clusters automatically. No k required.
- Handles arbitrary cluster shapes (concentric rings, elongated curves, anything connected).
- Identifies noise / outliers as a first-class concept.
Used in: anomaly detection, geospatial clustering (Uber, taxi data), molecule grouping, social network community detection.
The algorithm
Two parameters: ε (the neighborhood radius) and minPts (the minimum number of points to form a dense region).
For each unvisited point p:
- Find N(p), the set of all points within distance ε of p.
- If |N(p)| < minPts, label p as noise (it may later be reclassified as a border point of a cluster).
- Otherwise, p is a core point. Start a new cluster:
- Add p and every point in N(p) to the cluster.
- For each new core point in the cluster, expand it by adding that point's ε-neighbors.
- Continue until no more points can be added.
Each non-noise point ends up in exactly one cluster.
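The expansion loop above can be sketched in plain Python. This is an illustrative brute-force version (function and variable names are my own, and points are assumed to be 2D tuples); `region_query` scans every point, so it runs in O(n²):

```python
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                 # noise; may become a border point later
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:                       # expand the cluster outward
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster        # noise reclassified as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels
```

On two tight blobs plus one distant outlier, this yields cluster labels 0 and 1 for the blobs and -1 for the outlier.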
Three types of points
- Core: at least minPts neighbors within ε.
- Border: fewer than minPts neighbors, but within ε of a core point.
- Noise: neither core nor border. Sparsely located, far from any dense region.
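The three-way taxonomy can be stated directly as code (a minimal sketch; the neighbor count here includes the point itself, which matches the usual convention):

```python
def point_type(points, eps, min_pts):
    """Classify each 2D point as 'core', 'border', or 'noise'."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    types = []
    for i in range(len(points)):
        if i in core:
            types.append("core")                    # >= min_pts neighbors within eps
        elif any(j in core for j in neighbors(i)):
            types.append("border")                  # within eps of some core point
        else:
            types.append("noise")                   # neither
    return types
```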
Choosing ε and minPts
Heuristics:
- minPts: rule of thumb minPts ≥ D + 1, where D is the data dimensionality; in practice, minPts = 4 is common for 2D data and minPts = 2·D for higher dimensions.
- ε: plot the sorted distance from each point to its k-th nearest neighbor (with k = minPts or minPts − 1). Pick ε at the “knee” of the curve: below it, points sit inside dense regions; above it, they are sparse.
Choosing these badly produces either one giant cluster (ε too large) or all noise (ε too small).
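The k-distance heuristic can be computed without a plotting library; the returned list, plotted as index vs. distance, is the curve whose knee suggests ε (a brute-force 2D sketch, names my own):

```python
def k_distance(points, k):
    """Sorted distances from each 2D point to its k-th nearest neighbor."""
    result = []
    for xi, yi in points:
        dists = sorted(((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
                       for xj, yj in points)
        result.append(dists[k])    # dists[0] == 0.0 is the point itself
    return sorted(result)
```

A sharp jump near the end of the sorted list is the knee: values below it come from points inside dense regions.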
Strengths and weaknesses
Strengths:
- No k to specify.
- Arbitrary cluster shapes.
- Robust to outliers (gets noise labels).
- Deterministic for core points given fixed parameters; only border points within ε of two clusters can change assignment with the processing order.
Weaknesses:
- One ε for all clusters: fails when clusters have very different densities.
- Curse of dimensionality: in high dimensions, all pairwise distances become roughly equal, so a single ε threshold stops being meaningful. Use HDBSCAN or an alternative.
- Compute: naive O(n²) for the distance computations; roughly O(n log n) with a spatial index (KD-tree, ball tree) in low dimensions.
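The distance-concentration effect behind the curse-of-dimensionality weakness is easy to demonstrate with a small experiment (illustrative only, not part of DBSCAN itself):

```python
import random

def distance_spread(dim, n=100, seed=0):
    """Ratio of max to min pairwise distance for n random points in [0, 1]^dim.
    As dim grows, the ratio shrinks toward 1: 'near' and 'far' blur together,
    and no single eps threshold can separate them."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return max(dists) / min(dists)
```

In 2D the ratio is typically in the hundreds; at 100 dimensions it drops to low single digits.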
Variants
- HDBSCAN (Campello et al., 2013): extends DBSCAN to varying densities by building a hierarchy and extracting clusters from a stability-based selection. Often the better default in practice.
- OPTICS: similar idea, produces a reachability plot rather than discrete clusters.
When to use
| Setting | DBSCAN vs. alternative |
|---|---|
| Geospatial clusters | DBSCAN (or HDBSCAN) |
| Outlier detection | DBSCAN gets it for free |
| Convex, similar-density clusters | k-means simpler |
| High-dimensional data | DBSCAN struggles; use embedding + clustering |
| Need cluster probability | GMM or HDBSCAN |
Common pitfalls
- Picking ε without looking at the k-distance plot. Blind guesses are often off by orders of magnitude.
- Running on raw high-dim data. Reduce dimension first (PCA, UMAP, learned embedding).
- Comparing DBSCAN and k-means on accuracy. They optimize different objectives; compare on what you actually care about (e.g., user-validated cluster purity).
- Re-running DBSCAN with the same parameters on datasets of different density. ε doesn’t transfer across datasets without normalization.
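For the last pitfall, standardizing features before clustering makes ε comparable across datasets. A minimal sketch (real code would typically reach for scikit-learn's `StandardScaler`; this version assumes no feature is constant):

```python
def standardize(points):
    """Z-score each coordinate: subtract the mean, divide by the (population) std."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]      # assumes std > 0 for every column
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]
```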