One-line definition
Approximate nearest neighbor (ANN) search finds vectors close to a query without comparing against every stored vector. The three dominant method families are graph-based (HNSW), inverted-file (IVF), and quantization-based (PQ, OPQ). Production systems combine them.
Why it matters
Modern embedding models produce 768- to 4096-dimensional vectors. Brute-force k-NN over N vectors of dimension d costs O(N·d) per query. At N = 10^9 and d = 1000, that is roughly a teraflop per query, infeasible at retrieval QPS.
ANN delivers a tunable accuracy-speed-memory tradeoff. Modern vector databases (Faiss, Vespa, Milvus, Pinecone, Weaviate, pgvector) all use one or more of HNSW, IVF, and PQ. RAG, recommender retrieval, image search, and semantic search at scale would not exist without ANN.
The three families
HNSW (Hierarchical Navigable Small World)
A multi-layer proximity graph (Malkov & Yashunin, 2018).
- Each node is connected to its closest neighbors at each layer.
- Top layers are sparse (long-range jumps), bottom layer is dense.
- Search: enter at the top, greedily walk toward the query, drop down a layer, repeat.
| Property | HNSW |
| --- | --- |
| Recall@10 at typical config | 95 to 99 percent |
| Memory | 1.5 to 2x the raw vectors |
| Build time | Slow (graph construction is iterative) |
| Insert | Supported, online |
| Best for | High-recall, latency-critical, in-memory deployments |
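The greedy walk described above can be sketched in numpy on a single-layer proximity graph. This is an illustration, not HNSW itself: HNSW adds the layer hierarchy and keeps a candidate beam (of size efSearch) instead of a single current node, and the brute-force graph build here is only for demonstration.

```python
import numpy as np

def build_knn_graph(vectors, num_neighbors):
    """Connect each node to its nearest neighbors (brute force, illustration only)."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # no self-edges
    return np.argsort(dists, axis=1)[:, :num_neighbors]

def greedy_search(vectors, graph, query, entry_point):
    """Walk the graph: move to whichever neighbor is closest to the query,
    stop at a node none of whose neighbors improve on it."""
    current = entry_point
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        neighbors = graph[current]
        neighbor_dists = np.linalg.norm(vectors[neighbors] - query, axis=1)
        best = int(np.argmin(neighbor_dists))
        if neighbor_dists[best] >= current_dist:
            return current                   # local minimum: walk terminates
        current, current_dist = int(neighbors[best]), neighbor_dists[best]

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 32)).astype(np.float32)
graph = build_knn_graph(data, num_neighbors=16)
q = rng.standard_normal(32).astype(np.float32)
found = greedy_search(data, graph, q, entry_point=0)
```

The walk can stop at a local minimum that is not the true nearest neighbor; HNSW's beam and its sparse upper layers exist precisely to make that failure rare.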
IVF (Inverted File Index)
Cluster the vectors with k-means into nlist partitions (Jégou et al., 2011). At query time, find the nprobe nearest centroids and only compare against vectors in those partitions.
| Property | IVF |
| --- | --- |
| Recall trade-off | Tunable via nprobe (partitions scanned per query) |
| Memory | Same as raw vectors |
| Build time | Fast (one k-means) |
| Best for | Disk-backed indexes, easy to shard |
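A minimal numpy sketch of the IVF idea, using plain Lloyd's k-means and the nlist/nprobe names from above. The function names are illustrative, not a library API:

```python
import numpy as np

def build_ivf(vectors, nlist, iters=10, seed=0):
    """Lloyd's k-means for the coarse centroids, then inverted lists
    mapping each centroid to the ids of the vectors assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - centroids[None], axis=-1), axis=1)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # final assignment against the updated centroids
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None], axis=-1), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}
    return centroids, lists

def ivf_search(query, centroids, lists, vectors, nprobe, k):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 32)).astype(np.float32)
centroids, lists = build_ivf(data, nlist=16)
q = rng.standard_normal(32).astype(np.float32)
top5 = ivf_search(q, centroids, lists, data, nprobe=2, k=5)  # scans ~1/8 of the data
```

With nprobe = nlist this degenerates to brute force; the recall loss comes entirely from true neighbors living in partitions that were not probed.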
PQ (Product Quantization)
Compress each vector by splitting it into m subvectors of dimension d/m and quantizing each subvector independently to one of k codewords, typically k = 256 so each code fits in one byte (Jégou et al., 2011).
A 1024-dim float32 vector (4096 bytes) becomes m codes of 8 bits each. With m = 64: 64 bytes, a 64x memory reduction.
Distance computation uses precomputed lookup tables: for a query, precompute the squared distance from each query subvector to each codebook entry; the distance to a stored vector is the sum of table lookups.
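The encode step and the lookup-table (asymmetric) distance can be sketched in numpy. The names are hypothetical, not a library API, and the planted codeword at the end is just a way to check the table arithmetic against an exact distance:

```python
import numpy as np

def pq_encode(x, codebooks):
    """codebooks has shape (m, k, dsub); returns one code in [0, k) per subvector."""
    m, k, dsub = codebooks.shape
    subs = x.reshape(m, dsub)
    return np.array([int(np.argmin(np.linalg.norm(codebooks[j] - subs[j], axis=1)))
                     for j in range(m)])

def pq_adc(query, codes, codebooks):
    """Asymmetric distance: build an (m, k) table of squared distances from each
    query subvector to each codeword, then sum m table lookups per stored code."""
    m, k, dsub = codebooks.shape
    table = ((codebooks - query.reshape(m, 1, dsub)) ** 2).sum(axis=-1)  # (m, k)
    return float(table[np.arange(m), codes].sum())

rng = np.random.default_rng(1)
d, m, k = 16, 4, 8
x = rng.standard_normal(d).astype(np.float32)
codebooks = rng.standard_normal((m, k, d // m)).astype(np.float32)
codebooks[:, 0] = x.reshape(m, d // m)   # plant x's subvectors as codeword 0
codes = pq_encode(x, codebooks)          # picks the planted codewords exactly
q = rng.standard_normal(d).astype(np.float32)
approx = pq_adc(q, codes, codebooks)     # here exact, since quantization error is 0
```

In a real index the table costs O(k·d) to build once per query, after which each stored vector costs only m lookups and adds, independent of d.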
| Property | PQ |
| --- | --- |
| Memory | 32x to 128x reduction |
| Recall | Lower than HNSW; depends on codebook quality |
| Best for | Memory-constrained scale, billion-vector indexes |
OPQ (Optimized PQ)
Apply a learned rotation before PQ to align the data's variance with subvector boundaries. At the same code budget it is never worse than PQ, since the identity rotation recovers plain PQ, and in practice it is consistently better.
Production combinations
Real systems combine these:
- IVF + PQ: cluster, then quantize each partition’s vectors. The Faiss workhorse.
- HNSW + PQ: HNSW graph for routing, PQ-compressed vectors for distance. Trades a recall hit for major memory savings.
- HNSW + reranking: use HNSW to get the top 100 candidates, rerank with the exact float vectors.
- IVF-HNSW: HNSW within each IVF partition.
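The HNSW + reranking combination reduces to a generic two-stage search. The sketch below substitutes a deliberately crude first stage (distance over the first 8 dimensions only) for HNSW, since the rerank step is the point:

```python
import numpy as np

def rerank(query, candidate_ids, full_vectors, k):
    """Stage 2: exact float distances, computed only over the shortlist."""
    dists = np.linalg.norm(full_vectors[candidate_ids] - query, axis=1)
    return candidate_ids[np.argsort(dists)[:k]]

rng = np.random.default_rng(2)
data = rng.standard_normal((1000, 64)).astype(np.float32)
q = rng.standard_normal(64).astype(np.float32)

# Stage 1 stand-in: cheap, inaccurate scores (HNSW over compressed codes
# in a real system).
coarse = np.linalg.norm(data[:, :8] - q[:8], axis=1)
candidates = np.argsort(coarse)[:100]        # top-100 shortlist

top10 = rerank(q, candidates, data, k=10)    # exact rerank of the shortlist
```

The exact distances cost O(candidates · d) instead of O(N · d), so the first stage only needs enough recall to get the true neighbors into the shortlist, not to rank them correctly.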
Recall vs latency vs memory
ANN has three knobs:
- Recall@k: fraction of true k-NN found.
- Latency: query time.
- Memory: bytes per vector.
You pick two. Examples:
- HNSW with high efSearch: high recall, low latency, high memory.
- IVF-PQ with small nprobe and aggressive PQ: low memory, decent latency, lower recall.
- Brute force: 100 percent recall, with cost that grows linearly in corpus size.
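Recall@k itself is simple to compute against a brute-force ground truth. A sketch, using an artificial ANN stand-in that only ever scans the first half of the corpus:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the ANN result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(3)
data = rng.standard_normal((1000, 32)).astype(np.float32)
q = rng.standard_normal(32).astype(np.float32)

# ground truth: exact top-10 by brute force
exact10 = np.argsort(np.linalg.norm(data - q, axis=1))[:10]

# ANN stand-in: exact search, but restricted to the first 500 vectors
approx10 = np.argsort(np.linalg.norm(data[:500] - q, axis=1))[:10]

r = recall_at_k(approx10.tolist(), exact10.tolist())
```

Report the k alongside the number: the same index can score very differently at recall@1 and recall@100, which is exactly the pitfall noted below.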
Common pitfalls
- Reporting recall without specifying k. Recall@1 and recall@100 measure different failure modes.
- Confusing index size with vector size. HNSW adds graph overhead on top of the raw vectors; PQ shrinks them.
- Building once on cold data. Most systems need incremental insert and delete; pick an index that supports them (HNSW, not raw IVF-PQ).
- Over-indexing precision. For semantic search, the embedding model is usually the noisier component; chasing 99.5 percent ANN recall is wasted effort if the embeddings have 90 percent retrieval precision.