One-line definition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) groups points into clusters based on local density: any point with at least minPts neighbors within distance ε is a core point; clusters are formed by chains of mutually reachable core points plus the points within ε of them; points reachable from no core point are labeled as noise.
Why it matters
Unlike k-means, DBSCAN:
- Discovers the number of clusters automatically. No k required.
- Handles arbitrary cluster shapes (concentric rings, elongated curves, anything connected).
- Identifies noise / outliers as a first-class concept.
Used in: anomaly detection, geospatial clustering (Uber, taxi data), molecule grouping, social network community detection.
The algorithm
Two parameters: ε (the neighborhood radius) and minPts (the minimum number of points to form a dense region).
For each unvisited point p:
- Find N(p), the set of all points within distance ε of p.
- If |N(p)| < minPts, label p as noise (it may later be reclassified as a border point of a cluster).
- Otherwise, p is a core point. Start a new cluster:
- Add p and every point in N(p) to the cluster.
- For each new core point in the cluster, expand it by adding that point's ε-neighbors.
- Continue until no more points can be added.
Each non-noise point ends up in exactly one cluster.
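The expansion loop above can be sketched in plain Python. This is an illustrative brute-force version (function and variable names are my own, and points are assumed to be 2D tuples); `region_query` scans every point, so it runs in O(n²):

```python
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                 # noise; may become a border point later
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:                       # expand the cluster outward
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster        # noise reclassified as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels
```

On two tight blobs plus one distant outlier, this yields cluster labels 0 and 1 for the blobs and -1 for the outlier.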
Three types of points
- Core: at least minPts neighbors within ε.
- Border: fewer than minPts neighbors, but within ε of a core point.
- Noise: neither core nor border. Sparsely located, far from any dense region.
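The three-way taxonomy can be stated directly as code (a minimal sketch; the neighbor count here includes the point itself, which matches the usual convention):

```python
def point_type(points, eps, min_pts):
    """Classify each 2D point as 'core', 'border', or 'noise'."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]
    core = {i for i in range(len(points)) if len(neighbors(i)) >= min_pts}
    types = []
    for i in range(len(points)):
        if i in core:
            types.append("core")                    # >= min_pts neighbors within eps
        elif any(j in core for j in neighbors(i)):
            types.append("border")                  # within eps of some core point
        else:
            types.append("noise")                   # neither
    return types
```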
Choosing ε and minPts
Heuristics:
- minPts: rule of thumb minPts ≥ D + 1, where D is the data dimensionality; in practice, minPts = 4 is common for 2D data and minPts = 2·D for higher dimensions.
- ε: plot the sorted distance from each point to its k-th nearest neighbor (with k = minPts or minPts − 1). Pick ε at the “knee” of the curve: below it, points sit inside dense regions; above it, they are sparse.
Choosing these badly produces either one giant cluster (ε too large) or all noise (ε too small).
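The k-distance heuristic can be computed without a plotting library; the returned list, plotted as index vs. distance, is the curve whose knee suggests ε (a brute-force 2D sketch, names my own):

```python
def k_distance(points, k):
    """Sorted distances from each 2D point to its k-th nearest neighbor."""
    result = []
    for xi, yi in points:
        dists = sorted(((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
                       for xj, yj in points)
        result.append(dists[k])    # dists[0] == 0.0 is the point itself
    return sorted(result)
```

A sharp jump near the end of the sorted list is the knee: values below it come from points inside dense regions.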
Strengths and weaknesses
Strengths:
- No k to specify.
- Arbitrary cluster shapes.
- Robust to outliers (gets noise labels).
- Deterministic for core points given fixed parameters; only border points within ε of two clusters can change assignment with the processing order.
Weaknesses:
- One ε for all clusters: fails when clusters have very different densities.
- Curse of dimensionality: in high dimensions, all pairwise distances become roughly equal, so a single ε threshold stops being meaningful. Use HDBSCAN or an alternative.
- Compute: naive O(n²) for the distance computations; roughly O(n log n) with a spatial index (KD-tree, ball tree) in low dimensions.
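The distance-concentration effect behind the curse-of-dimensionality weakness is easy to demonstrate with a small experiment (illustrative only, not part of DBSCAN itself):

```python
import random

def distance_spread(dim, n=100, seed=0):
    """Ratio of max to min pairwise distance for n random points in [0, 1]^dim.
    As dim grows, the ratio shrinks toward 1: 'near' and 'far' blur together,
    and no single eps threshold can separate them."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return max(dists) / min(dists)
```

In 2D the ratio is typically in the hundreds; at 100 dimensions it drops to low single digits.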
Variants
- HDBSCAN (Campello et al., 2013): extends DBSCAN to varying densities by building a hierarchy and extracting clusters from a stability-based selection. Often the better default in practice.
- OPTICS: similar idea, produces a reachability plot rather than discrete clusters.
When to use
| Setting | DBSCAN vs. alternative |
|---|---|
| Geospatial clusters | DBSCAN (or HDBSCAN) |
| Outlier detection | DBSCAN gets it for free |
| Convex, similar-density clusters | k-means simpler |
| High-dimensional data | DBSCAN struggles; use embedding + clustering |
| Need cluster probability | GMM or HDBSCAN |
Common pitfalls
- Picking ε without looking at the k-distance plot. Blind guesses are often off by orders of magnitude.
- Running on raw high-dim data. Reduce dimension first (PCA, UMAP, learned embedding).
- Comparing DBSCAN and k-means on accuracy. They optimize different objectives; compare on what you actually care about (e.g., user-validated cluster purity).
- Re-running DBSCAN with the same parameters on datasets of different density. ε doesn’t transfer across datasets without normalization.
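For the last pitfall, standardizing features before clustering makes ε comparable across datasets. A minimal sketch (real code would typically reach for scikit-learn's `StandardScaler`; this version assumes no feature is constant):

```python
def standardize(points):
    """Z-score each coordinate: subtract the mean, divide by the (population) std."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]      # assumes std > 0 for every column
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]
```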