Factorization machines

Linear models can't capture feature interactions. Polynomial models have too many parameters. Factorization machines find a middle path: factorize the interaction matrix and learn an embedding per feature.

One-line definition

A factorization machine (Rendle, 2010) models the pairwise interaction between features $i$ and $j$ as $\langle v_i, v_j \rangle x_i x_j$, where each feature $i$ has an embedding $v_i \in \mathbb{R}^k$. The full prediction is

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$

Why it matters

Linear models (logistic regression) are fast but miss interactions. A degree-2 polynomial model has $O(n^2)$ interaction parameters, which is infeasible at the large $n$ typical of sparse categorical features, and it learns nothing for unseen pairs. FMs sidestep both problems by factorizing the interaction matrix into rank-$k$ embeddings, sharing parameters across pairs.

Result: the FM has $O(nk)$ parameters instead of $O(n^2)$, and it generalizes to unseen feature pairs because it only needs to have seen each feature, not each pair. This made FMs the default tabular-recsys model from roughly 2010 to 2018, and they remain a strong baseline today.
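
For concreteness, with illustrative numbers: at $n = 10^6$ features and $k = 16$, the FM stores $nk = 1.6 \times 10^7$ interaction parameters, while the degree-2 polynomial would need $\binom{n}{2} \approx 5 \times 10^{11}$, roughly thirty thousand times more.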

The mechanism

Each feature $i$ gets a weight $w_i$ (linear term) and an embedding $v_i \in \mathbb{R}^k$ (interaction term). The prediction includes:

  • A global bias $w_0$.
  • Per-feature linear terms $\sum_{i=1}^{n} w_i x_i$.
  • Pairwise interactions $\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$.

The naive interaction sum is $O(kn^2)$ to evaluate, but Rendle showed it can be reformulated as

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]$$

which is $O(kn)$: linear in $n$. This is the trick that makes FMs scalable.
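
A minimal NumPy sketch of this identity (dense vectors for clarity; the function name fm_predict is illustrative, and real implementations exploit sparsity):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) reformulation of the pairwise sum.

    x : (n,) feature vector    w0 : global bias
    w : (n,) linear weights    V  : (n, k) embeddings, row i is v_i
    """
    s = V.T @ x                                  # per-factor sums: sum_i v_{i,f} x_i
    pairwise = 0.5 * (s @ s - ((V**2).T @ (x**2)).sum())
    return w0 + w @ x + pairwise

# Sanity check against the naive O(kn^2) double sum.
rng = np.random.default_rng(0)
n, k = 10, 4
x, w = rng.normal(size=n), rng.normal(size=n)
V, w0 = rng.normal(size=(n, k)), 0.5
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
assert np.isclose(fm_predict(x, w0, w, V), naive)
```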

Sparse one-hot inputs

The natural use case: categorical features, one-hot encoded. Each user-id, item-id, or category becomes a feature $i$ with embedding $v_i$. The pairwise interaction $\langle v_i, v_j \rangle x_i x_j$ is nonzero only when both $x_i \neq 0$ and $x_j \neq 0$, i.e. only between active feature pairs.

For a (user, item) example with one-hot features ($x_u = x_i = 1$, all others zero), the prediction is:

$$\hat{y} = w_0 + w_u + w_i + \langle v_u, v_i \rangle$$

This is exactly a matrix factorization recsys model with bias terms. FMs generalize matrix factorization to arbitrary numbers of features (user, item, category, time, device), all sharing the same embedding mechanism.
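
A quick numerical check of that equivalence (sizes, seed, and indices are arbitrary):

```python
import numpy as np

# With one-hot (user, item) features, only one interaction pair is active,
# so the general FM formula collapses to w0 + w_u + w_i + <v_u, v_i>.
n_users, n_items, k = 3, 5, 4
rng = np.random.default_rng(1)
w0 = 0.1
w = rng.normal(size=n_users + n_items)
V = rng.normal(size=(n_users + n_items, k))

u, i = 2, 1                                  # arbitrary user and item indices
x = np.zeros(n_users + n_items)
x[u], x[n_users + i] = 1.0, 1.0              # user block one-hot, item block one-hot

n = len(x)
pairwise = sum(V[a] @ V[b] * x[a] * x[b]
               for a in range(n) for b in range(a + 1, n))
fm = w0 + w @ x + pairwise                   # full FM prediction
mf = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]   # biased MF
assert np.isclose(fm, mf)
```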

Variants

  • Field-aware FM (FFM) (Juan et al., 2016). Each feature has multiple embeddings, one per “field”: the embedding a user feature uses when interacting with item features differs from the one it uses with time-of-day features. More parameters, better accuracy on click prediction.
  • DeepFM (Guo et al., 2017). Add a deep MLP on top of the same embeddings to capture high-order interactions. The dominant CTR-prediction architecture in industry from 2017 onwards.
  • xDeepFM, AutoInt, DCN: subsequent variations layering self-attention or explicit cross-feature networks over the FM embedding base.

Tradeoffs

  • vs logistic regression: captures pairwise interactions; needs more compute and tuning.
  • vs polynomial regression: $O(nk)$ vs $O(n^2)$ parameters; generalizes to unseen pairs.
  • vs deep learning on raw features: FM is simpler, trains faster, and is more interpretable; deep nets can capture higher-order interactions.
  • vs matrix factorization: FM generalizes MF to many sparse features beyond just (user, item).

For tabular click-through-rate prediction with high-cardinality categoricals, an FM-style embedding base (FM, DeepFM, FFM) is still the right starting point.

Common pitfalls

  • Choosing $k$ too large. Modest $k$ (tens rather than hundreds) is typical; larger $k$ overfits and is slower.
  • Forgetting the linear term. The pairwise interactions $\langle v_i, v_j \rangle x_i x_j$ cannot model main effects; both terms matter.
  • Using FM on dense numeric features without binning. Dense features can be used, but the interaction term scales with the product $x_i x_j$, and the model is more sensitive to feature scaling. Bin or normalize first (see the sketch after this list).
  • Ignoring regularization. L2 on the embeddings $v_i$ is essential when most features are rare.
  • Comparing FM to LR without matched features. FM benefits from rich categorical features; on a clean numeric baseline it often loses to LR or gradient boosting.
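
A small sketch of the binning advice above (the price feature and bin edges are illustrative):

```python
import numpy as np

# Turn a dense numeric feature (e.g. price) into one-hot bins so it can
# interact through embeddings like any other categorical feature.
price = np.array([3.2, 18.0, 55.0, 7.5])
edges = np.array([5.0, 10.0, 25.0, 50.0])    # illustrative bin boundaries
bin_idx = np.digitize(price, edges)          # bin index in 0..len(edges)
one_hot = np.eye(len(edges) + 1)[bin_idx]    # (4, 5) one-hot bin features
```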