Word embeddings: Word2Vec, GloVe, and the geometry of meaning

Map words to dense vectors so that similar words land near each other. The breakthrough that proved meaning lives in geometry, not symbols.

One-line definition

Word embeddings assign each word a dense vector (typically 100 to 300 dimensions) such that distributional similarity in text corresponds to geometric proximity in the embedding space. Trained from co-occurrence patterns, no explicit supervision.

Why it matters

Pre-2013 NLP represented words as one-hot vectors. The vocabulary size was the dimension; “king” and “queen” were as far apart as “king” and “table.” Word2Vec (Mikolov et al., 2013) showed that learned dense vectors satisfy famous analogies like king - man + woman ≈ queen. The geometry encodes meaning.

Modern transformers learn embeddings end-to-end as part of training. Pretrained Word2Vec / GloVe vectors are mostly historical, but the conceptual frame (meaning as geometry, training from distributional signal) is still the foundation of every embedding-based retrieval system.

Word2Vec: skip-gram

Predict context words from a target word. For a corpus $w_1, \dots, w_T$ and window size $m$, maximize the average log-probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t)$$

The probability uses two embeddings per word: a target embedding $v_w$ and a context embedding $u_w$. The score for context word $c$ given target $w$ is the dot product $u_c^\top v_w$, normalized over the vocabulary:

$$p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c' \in V} \exp(u_{c'}^\top v_w)}$$
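As code, with toy sizes and illustrative matrix names (`V` for targets, `U` for contexts; neither name comes from the paper), the full softmax looks like this:

```python
import numpy as np

# Toy sizes for illustration; real models use |V| ~ 100k-1M, dim ~ 100-300.
rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 100
V = rng.normal(scale=0.1, size=(vocab_size, dim))  # target embeddings v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))  # context embeddings u_c

def skipgram_prob(context_id, target_id):
    """p(c | w): softmax over the scores u_c' . v_w for every c'."""
    scores = U @ V[target_id]          # one dot product per vocabulary word
    scores -= scores.max()             # numerical stability
    expd = np.exp(scores)
    return expd[context_id] / expd.sum()

print(skipgram_prob(context_id=42, target_id=7))
```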

Computing this softmax over a 100k-word vocabulary at every training step is infeasible. Two tricks avoid it:

  • Hierarchical softmax: arrange the vocabulary as a binary tree. Predicting a word becomes a sequence of binary decisions, $O(\log |V|)$ per step.
  • Negative sampling: instead of normalizing over the full vocabulary, sample a few negative examples (words drawn from a noise distribution) and treat the prediction as binary classification (positive context vs. sampled negatives), $O(k + 1)$ per step where $k$ is the number of negatives. The dominant choice in practice; a sketch follows below.
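A minimal sketch of the negative-sampling loss for one (target, context) pair. The setup is the same toy one as above; the 3/4-power noise distribution follows Mikolov et al., while the counts and names are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim, k = 10_000, 100, 5
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # target embeddings
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # context embeddings
counts = rng.integers(1, 1000, size=vocab_size)     # stand-in unigram counts
noise = counts ** 0.75                               # 3/4-power noise distribution
noise = noise / noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(target_id, context_id):
    """Binary classification: real pair scored toward 1, k noise pairs toward 0."""
    neg_ids = rng.choice(vocab_size, size=k, p=noise)
    pos = np.log(sigmoid(U[context_id] @ V[target_id]))
    neg = np.log(sigmoid(-(U[neg_ids] @ V[target_id]))).sum()
    return -(pos + neg)    # k + 1 dot products instead of |V|

print(neg_sampling_loss(target_id=7, context_id=42))
```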

CBOW

The mirror image of skip-gram: predict the target from the average of context embeddings. Faster but slightly worse on rare words.
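A sketch of the CBOW direction under the same toy assumptions (matrix names `W_in` / `W_out` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim = 10_000, 100
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-side embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # target-side embeddings

def cbow_logits(context_ids):
    h = W_in[context_ids].mean(axis=0)   # one averaged context vector
    return W_out @ h                     # a score for every candidate target

print(cbow_logits([3, 14, 159, 2653]).shape)  # (10000,)
```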

GloVe

GloVe (Pennington et al., 2014) takes a different angle: factorize the global co-occurrence matrix.

Build a co-occurrence matrix $X$ where $X_{ij}$ counts how often word $j$ appears in the context of word $i$. The training objective:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $f$ is a weighting function that downweights rare pairs and caps the influence of very frequent ones. Closed-form intuition: GloVe is matrix factorization of $\log X$.
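As code, with the weighting $f$ from Pennington et al. ($x_{\max} = 100$, $\alpha = 3/4$) and otherwise illustrative names and toy data:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): downweight rare pairs, cap at 1 for frequent ones."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over X_ij > 0."""
    mask = X > 0                                        # log X_ij undefined at 0
    preds = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = preds - np.log(np.where(mask, X, 1.0))
    return np.sum(glove_weight(X) * mask * err ** 2)

rng = np.random.default_rng(3)
n, d = 500, 50                                          # toy vocabulary and dim
X = rng.poisson(0.5, size=(n, n)).astype(float)         # stand-in co-occurrences
W, W_tilde = rng.normal(size=(n, d)) * 0.1, rng.normal(size=(n, d)) * 0.1
b, b_tilde = np.zeros(n), np.zeros(n)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```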

Empirically, GloVe and Word2Vec produce comparable embeddings. GloVe is sometimes preferred because training iterates over the precomputed global co-occurrence matrix rather than re-streaming the corpus, so additional epochs are cheap.

Properties of the learned space

  • Linear analogies: vector arithmetic encodes relations (king - man + woman ≈ queen, walked - walk + run ≈ ran); see the sketch after this list.
  • Cosine similarity is the standard metric. Magnitudes correlate with frequency, so cosine factors that out.
  • Polysemy: a word with multiple senses gets one vector that averages them. The cleanest motivation for contextualized embeddings (ELMo, BERT).
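A toy demonstration of the first two properties; the 3-D vectors are hand-set for illustration, not trained:

```python
import numpy as np

emb = {  # hand-set toy vectors; real embeddings are learned and 100-300-D
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "table": np.array([0.2, 0.3, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude=()):
    """Nearest neighbor by cosine, skipping the words in the query itself."""
    return max((w for w in emb if w not in exclude),
               key=lambda w: cosine(query, emb[w]))

q = emb["king"] - emb["man"] + emb["woman"]
print(nearest(q, exclude={"king", "man", "woman"}))  # queen
```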

What replaced them

Contextualized embeddings: ELMo, BERT, every modern LLM. The same word gets different vectors in different sentences. Pretrained Word2Vec and GloVe are now mostly used as lightweight features for low-resource scenarios or as a teaching example.

Common pitfalls

  • Computing similarity as a raw dot product without L2 normalization. Cosine similarity is the normalized dot product; most modern stacks L2-normalize embeddings once and then take plain dot products (see the snippet after this list).
  • Treating analogies as deep evidence of “reasoning.” The arithmetic works because of how training data is structured, not because the model “understands” gender or tense.
  • Forgetting subword tokenization. Modern systems embed BPE pieces, not whole words. “Embeddings” in a 2025 LLM are subword embeddings.
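For the first pitfall, the usual fix is a one-time row normalization, after which a plain dot product is exactly cosine similarity (array names here are illustrative):

```python
import numpy as np

def l2_normalize(M, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against zero vectors."""
    return M / (np.linalg.norm(M, axis=-1, keepdims=True) + eps)

E = l2_normalize(np.random.default_rng(4).normal(size=(1000, 128)))
sims = E @ E[0]              # cosine similarities of every row to row 0
print(sims.max(), sims[0])   # row 0 vs itself -> 1.0
```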