k-means clustering

Partition n points into k clusters by minimizing within-cluster variance. Lloyd's algorithm: alternate assigning points to nearest center and recomputing centers.

Reviewed · 3 min read

One-line definition

k-means partitions data points \( x_1, \dots, x_n \) into clusters \( C_1, \dots, C_k \) by minimizing the sum of squared distances from each point to its cluster’s centroid:

\[
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i .
\]

Lloyd’s algorithm (1957) alternates: (a) assign each point to its nearest centroid; (b) recompute each centroid as the mean of its assigned points. Iterate to convergence.

Why it matters

k-means is the canonical clustering algorithm: cheap per iteration (O(nkd) for n points, k clusters, d dimensions), easy to implement, and a good first attempt at any unsupervised structure problem. It is used directly (customer segmentation, vector-quantization codebooks) and as a building block (initialization for GMMs, anchor selection for object detection, sub-sampling for retrieval index training).

The algorithm

Initialize centroids (random points or k-means++ for stability). Then repeat:

  1. Assignment: for each point \( x_i \), assign it to the nearest centroid, \( c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2 \).
  2. Update: for each cluster \( j \), recompute the centroid as the mean of its assigned points, \( \mu_j = \frac{1}{|C_j|} \sum_{i : c_i = j} x_i \).

Stop when assignments don’t change (or change below a threshold). Convergence is guaranteed in finitely many steps, but only to a local minimum: the objective is not convex, so different initializations can reach different solutions.
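
A minimal NumPy sketch of Lloyd’s algorithm as described above; the function name and the naive uniform-random initialization are illustrative, not taken from any library:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm. X: (n, d) array. Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Naive init: k distinct data points chosen uniformly at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing: a local minimum is reached
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```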

k-means++ initialization

Random initialization sometimes converges to bad local minima. k-means++ (Arthur & Vassilvitskii, 2007):

  1. Pick the first centroid uniformly at random from the data.
  2. For \( j = 2, \dots, k \): pick the next centroid from the data with probability proportional to \( D(x)^2 \), the squared distance from \( x \) to its nearest already-chosen centroid. Points far from existing centers are more likely to be picked.

Gives an \( O(\log k) \)-approximation to the optimum in expectation. Used as the default initialization in scikit-learn and most modern implementations.
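
A sketch of the seeding step on its own, assuming the data is a NumPy array; the function name is illustrative:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: return k initial centroids chosen from the rows of X."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centroids = [X[rng.integers(n)]]  # step 1: first centroid uniform at random
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()         # step 2: sample proportional to D(x)^2
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids, dtype=float)
```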

Choosing k

There is no universally right answer. Heuristics:

  • Elbow method: plot total within-cluster sum of squares vs. k; look for the “elbow” where adding more clusters stops paying off.
  • Silhouette score: average similarity of each point to its own cluster vs. the nearest other cluster; pick the k maximizing it.
  • Gap statistic: compare to clustering on uniform random data of the same shape.
  • Domain knowledge: often the right answer.

For very large data, run k-means with a few values of k in parallel and pick the one with the best silhouette on a subsample.
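
One way to automate that, sketched with scikit-learn’s KMeans and silhouette_score; the k range, subsample size, and function name are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_values=range(2, 11), sample_size=10_000, seed=0):
    """Fit k-means for several k and return the k with the best silhouette score."""
    rng = np.random.default_rng(seed)
    # Silhouette is quadratic in the number of points, so score on a subsample.
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    best_k, best_score = None, -1.0
    for k in k_values:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=seed).fit_predict(X)
        score = silhouette_score(X[idx], labels[idx])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```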

Assumptions

k-means implicitly assumes:

  • Spherical, equal-variance clusters (penalizes non-spherical clusters by splitting them).
  • Equal cluster sizes (large clusters are sometimes split, small ones merged).
  • Euclidean distance is meaningful in the input space.

When clusters are elongated, of unequal density, or non-convex, k-means fails. Try GMMs (which allow ellipsoidal clusters), DBSCAN (density-based, arbitrary shapes), or spectral clustering.
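
A quick way to see the failure mode, using scikit-learn’s make_moons, KMeans, and DBSCAN; the eps and min_samples values are illustrative and need tuning on real data:

```python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex clusters that violate k-means' assumptions.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means cuts each moon roughly in half along a straight boundary;
# DBSCAN follows the density and recovers the two shapes.
```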

Mini-batch k-means

For huge n, a full assignment pass over all points each iteration is expensive. Mini-batch k-means (Sculley, 2010): each iteration uses a small random subset to update the centroids. Trades a little accuracy for a large speedup.
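
A sketch using scikit-learn’s MiniBatchKMeans; the data, cluster count, and batch size here are placeholders:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a large data set.
X = np.random.default_rng(0).normal(size=(200_000, 16))

# Each step samples a small batch and nudges the affected centroids toward
# the batch means, so memory and time per step stay small.
mbk = MiniBatchKMeans(n_clusters=100, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(X)
centroids = mbk.cluster_centers_
```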

When k-means is and isn’t appropriate

| Setting | Verdict |
| --- | --- |
| Roughly spherical clusters in low dimensions | Excellent |
| Vector quantization codebook | Excellent (k = codebook size) |
| Image segmentation by color | Good (k = number of colors) |
| Customer segmentation on raw features | Try, but standardize first |
| Text or sparse data | Use spherical k-means (cosine distance; sketch below) |
| Non-convex shapes (moons, rings) | Fails; use DBSCAN or spectral |
| Large k (thousands) | Use HNSW-accelerated variants |
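
For the text and sparse-data row, a from-scratch spherical k-means sketch (not a library call): points and centroids are kept on the unit sphere so the dot product is cosine similarity. It assumes no all-zero rows.

```python
import numpy as np

def spherical_kmeans(X, k, n_iters=100, seed=0):
    """Spherical k-means: assign by cosine similarity, keep centroids unit-norm."""
    rng = np.random.default_rng(seed)
    # L2-normalize rows so a dot product equals cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment: most similar centroid (largest cosine).
        labels = (Xn @ centroids.T).argmax(axis=1)
        # Update: mean direction of each cluster, re-normalized to the unit sphere.
        for j in range(k):
            members = Xn[labels == j]
            if len(members):
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)
    return centroids, labels
```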

Common pitfalls

  • Forgetting to standardize features. k-means uses Euclidean distance, so features with larger scales dominate; standardize first (a minimal pipeline sketch follows this list).
  • Random initialization without k-means++. Different runs give different clusterings; report a consensus or use k-means++.
  • Treating cluster IDs as meaningful. Cluster 0 in one run may be cluster 7 in the next; labels are only defined up to permutation.
  • Using k-means on categorical data without thought. Use k-modes, k-prototypes, or one-hot encoding plus k-means with care.
  • Reporting one clustering as “the answer.” Run multiple times and take the best by within-cluster sum of squares.
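
A minimal standardize-then-cluster pipeline in scikit-learn, addressing the first pitfall; the synthetic data and n_clusters=5 are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw customer features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(40, 12, 1000),          # age in years
                     rng.normal(60_000, 20_000, 1000)])  # income in dollars

# Standardizing first keeps the income column from dominating the distances.
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=5, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)
```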