Skip to content
mentorship

concepts

DBSCAN

Density-based clustering: form clusters from regions of high point density, label sparse points as noise. Handles arbitrary cluster shapes; no k to specify.

Reviewed · 3 min read

One-line definition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise; Ester et al., 1996) groups points into clusters based on local density: any point with at least neighbors within distance is a core point; clusters are formed by connected core points and the points within of them; points reachable from no core are labeled as noise.

Why it matters

Unlike k-means, DBSCAN:

  • Discovers the number of clusters automatically. No required.
  • Handles arbitrary cluster shapes (concentric rings, elongated curves, anything connected).
  • Identifies noise / outliers as a first-class concept.

Used in: anomaly detection, geospatial clustering (Uber, taxi data), molecule grouping, social network community detection.

The algorithm

Two parameters: (neighborhood radius) and (minimum points to form a dense region).

For each unvisited point :

  1. Find = all points within distance .
  2. If , label as noise (may later be reclassified as part of another cluster).
  3. Otherwise, is a core point. Start a new cluster:
    • Add and all points in to the cluster.
    • For each new core point in the cluster, expand by adding its -neighbors.
    • Continue until no more points can be added.

Each non-noise point ends up in exactly one cluster.

Three types of points

  • Core: neighbors within .
  • Border: fewer than neighbors but within of a core point.
  • Noise: neither. Sparsely located, far from any dense region.

Choosing and

Heuristics:

  • : rule of thumb where is the data dimensionality; often for 2D, for higher.
  • : plot the sorted distances to the -th nearest neighbor for each point. Pick at the “knee” of the curve. Below it the region is dense, above it sparse.

Choosing these badly produces either one giant cluster (too large ) or all noise (too small ).

Strengths and weaknesses

Strengths:

  • No to specify.
  • Arbitrary cluster shapes.
  • Robust to outliers (gets noise labels).
  • Deterministic given parameters and order of points (mostly).

Weaknesses:

  • One for all clusters: fails when clusters have very different densities.
  • Curse of dimensionality: in high-d, all points become roughly equidistant, so becomes meaningless. Use HDBSCAN or alternative.
  • Compute: naive for distance computation; with a spatial index (KD-tree, ball tree) in low dimensions.

Variants

  • HDBSCAN (Campello et al., 2013): extends DBSCAN to varying densities by building a hierarchy and extracting clusters from a stability-based selection. Often the better default in practice.
  • OPTICS: similar idea, produces a reachability plot rather than discrete clusters.

When to use

SettingDBSCAN vs. alternative
Geospatial clustersDBSCAN (or HDBSCAN)
Outlier detectionDBSCAN gets it for free
Convex, similar-density clustersk-means simpler
High-dimensional dataDBSCAN struggles; use embedding + clustering
Need cluster probabilityGMM or HDBSCAN

Common pitfalls

  • Picking without the k-distance plot. Almost always wrong by orders of magnitude.
  • Running on raw high-dim data. Reduce dimension first (PCA, UMAP, learned embedding).
  • Comparing DBSCAN and k-means on accuracy. They optimize different objectives; compare on what you actually care about (e.g., user-validated cluster purity).
  • Re-running DBSCAN with the same parameters on different-density datasets. doesn’t transfer across datasets without normalization.