Factorization machines

Linear models can't capture feature interactions. Polynomial models have too many parameters. Factorization machines find a middle path: factorize the interaction matrix and learn an embedding per feature.

One-line definition

A factorization machine (Rendle, 2010) models the pairwise interaction between features $i$ and $j$ as $\langle v_i, v_j \rangle x_i x_j$, where each feature $i$ has an embedding $v_i \in \mathbb{R}^k$. The full prediction is

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$

Why it matters

Linear models (logistic regression) are fast but miss interactions. A degree-2 polynomial model has $O(n^2)$ interaction parameters, which is infeasible at the large $n$ typical of sparse categorical features, and it learns nothing for unseen pairs. FMs sidestep both problems by factorizing the interaction matrix into rank-$k$ embeddings, sharing parameters across pairs.

Result: the FM has $O(nk)$ parameters instead of $O(n^2)$, and it generalizes to unseen feature pairs because it only needs to have seen each feature, not each pair. This made FMs the default tabular-recsys model from roughly 2010 to 2018, and they remain a strong baseline today.
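
For concreteness, with illustrative numbers: at $n = 10^6$ features and $k = 16$, the FM stores $nk = 1.6 \times 10^7$ interaction parameters, while the degree-2 polynomial would need $\binom{n}{2} \approx 5 \times 10^{11}$, roughly thirty thousand times more.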

The mechanism

Each feature $i$ gets a weight $w_i$ (linear term) and an embedding $v_i \in \mathbb{R}^k$ (interaction term). The prediction includes:

  • A global bias $w_0$.
  • Per-feature linear terms $\sum_{i=1}^{n} w_i x_i$.
  • Pairwise interactions $\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$.

The naive interaction sum is $O(kn^2)$ to evaluate, but Rendle showed it can be reformulated as

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]$$

which is $O(kn)$: linear in $n$. This is the trick that makes FMs scalable.
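
A minimal NumPy sketch of this identity (dense vectors for clarity; the function name fm_predict is illustrative, and real implementations exploit sparsity):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction using the O(kn) reformulation of the pairwise sum.

    x : (n,) feature vector    w0 : global bias
    w : (n,) linear weights    V  : (n, k) embeddings, row i is v_i
    """
    s = V.T @ x                                  # per-factor sums: sum_i v_{i,f} x_i
    pairwise = 0.5 * (s @ s - ((V**2).T @ (x**2)).sum())
    return w0 + w @ x + pairwise

# Sanity check against the naive O(kn^2) double sum.
rng = np.random.default_rng(0)
n, k = 10, 4
x, w = rng.normal(size=n), rng.normal(size=n)
V, w0 = rng.normal(size=(n, k)), 0.5
naive = w0 + w @ x + sum(V[i] @ V[j] * x[i] * x[j]
                         for i in range(n) for j in range(i + 1, n))
assert np.isclose(fm_predict(x, w0, w, V), naive)
```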

Sparse one-hot inputs

The natural use case: categorical features, one-hot encoded. Each user-id, item-id, or category becomes a feature $i$ with embedding $v_i$. The pairwise interaction $\langle v_i, v_j \rangle x_i x_j$ is nonzero only when both $x_i \neq 0$ and $x_j \neq 0$, i.e. only between active feature pairs.

For a (user, item) example with one-hot features ($x_u = x_i = 1$, all others zero), the prediction is:

$$\hat{y} = w_0 + w_u + w_i + \langle v_u, v_i \rangle$$

This is exactly a matrix factorization recsys model with bias terms. FMs generalize matrix factorization to arbitrary numbers of features (user, item, category, time, device), all sharing the same embedding mechanism.
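
A quick numerical check of that equivalence (sizes, seed, and indices are arbitrary):

```python
import numpy as np

# With one-hot (user, item) features, only one interaction pair is active,
# so the general FM formula collapses to w0 + w_u + w_i + <v_u, v_i>.
n_users, n_items, k = 3, 5, 4
rng = np.random.default_rng(1)
w0 = 0.1
w = rng.normal(size=n_users + n_items)
V = rng.normal(size=(n_users + n_items, k))

u, i = 2, 1                                  # arbitrary user and item indices
x = np.zeros(n_users + n_items)
x[u], x[n_users + i] = 1.0, 1.0              # user block one-hot, item block one-hot

n = len(x)
pairwise = sum(V[a] @ V[b] * x[a] * x[b]
               for a in range(n) for b in range(a + 1, n))
fm = w0 + w @ x + pairwise                   # full FM prediction
mf = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]   # biased MF
assert np.isclose(fm, mf)
```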

Variants

  • Field-aware FM (FFM) (Juan et al., 2016). Each feature has multiple embeddings, one per “field”: the embedding a user feature uses when interacting with item features differs from the one it uses with time-of-day features. More parameters, better accuracy on click prediction.
  • DeepFM (Guo et al., 2017). Add a deep MLP on top of the same embeddings to capture high-order interactions. The dominant CTR-prediction architecture in industry from 2017 onwards.
  • xDeepFM, AutoInt, DCN: subsequent variations layering self-attention or explicit cross-feature networks over the FM embedding base.

Tradeoffs

  • vs logistic regression: captures pairwise interactions; needs more compute and tuning.
  • vs polynomial regression: $O(nk)$ vs $O(n^2)$ parameters; generalizes to unseen pairs.
  • vs deep learning on raw features: FM is simpler, trains faster, and is more interpretable; deep nets can capture higher-order interactions.
  • vs matrix factorization: FM generalizes MF to many sparse features beyond just (user, item).

For tabular click-through-rate prediction with high-cardinality categoricals, an FM-style embedding base (FM, DeepFM, FFM) is still the right starting point.

Common pitfalls

  • Choosing $k$ too large. Modest $k$ (tens rather than hundreds) is typical; larger $k$ overfits and is slower.
  • Forgetting the linear term. The pairwise interactions $\langle v_i, v_j \rangle x_i x_j$ cannot model main effects; both terms matter.
  • Using FM on dense numeric features without binning. Dense features can be used, but the interaction term scales with the product $x_i x_j$, and the model is more sensitive to feature scaling. Bin or normalize first (see the sketch after this list).
  • Ignoring regularization. L2 on the embeddings $v_i$ is essential when most features are rare.
  • Comparing FM to LR without matched features. FM benefits from rich categorical features; on a clean numeric baseline it often loses to LR or gradient boosting.
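
A small sketch of the binning advice above (the price feature and bin edges are illustrative):

```python
import numpy as np

# Turn a dense numeric feature (e.g. price) into one-hot bins so it can
# interact through embeddings like any other categorical feature.
price = np.array([3.2, 18.0, 55.0, 7.5])
edges = np.array([5.0, 10.0, 25.0, 50.0])    # illustrative bin boundaries
bin_idx = np.digitize(price, edges)          # bin index in 0..len(edges)
one_hot = np.eye(len(edges) + 1)[bin_idx]    # (4, 5) one-hot bin features
```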