One-line definition
Matrix factorization for collaborative filtering (Koren et al., 2009) factorizes a sparse user-item rating matrix $R \in \mathbb{R}^{m \times n}$ into two low-rank matrices, $R \approx U V^\top$, where $U \in \mathbb{R}^{m \times k}$ contains user embeddings and $V \in \mathbb{R}^{n \times k}$ contains item embeddings. Predicted rating: $\hat{r}_{ui} = u_u^\top v_i$.
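A minimal numpy sketch of the setup (sizes and variable names here are illustrative, not tied to any library):

```python
import numpy as np

# Illustrative sizes: m users, n items, rank k.
m, n, k = 100, 50, 8
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, k))  # user embeddings, one row per user
V = rng.normal(scale=0.1, size=(n, k))  # item embeddings, one row per item

# Predicted rating for user u and item i is the dot product of their embeddings.
u, i = 3, 7
r_hat = U[u] @ V[i]

# The full predicted matrix is the low-rank product U V^T.
R_hat = U @ V.T
```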
Why it matters
MF was the dominant collaborative-filtering method from the Netflix Prize era (2006–2009) through about 2018. It still underlies modern two-tower retrieval and embedding-based recsys, and it has clean equivalences to many later techniques (matrix completion, neural MF, etc.). Knowing MF makes the move to two-tower neural models obvious.
The objective
Minimize regularized squared error on the observed ratings $(u, i) \in \Omega$:

$$\min_{U, V} \sum_{(u,i) \in \Omega} \left( r_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \lVert u_u \rVert^2 + \lVert v_i \rVert^2 \right)$$

Often add bias terms: $\hat{r}_{ui} = \mu + b_u + b_i + u_u^\top v_i$ ($\mu$ = global mean, $b_u$, $b_i$ = user/item biases).
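As a sketch, the biased objective written out directly in numpy, assuming observed ratings come as (user, item, rating) triples (all names are illustrative):

```python
import numpy as np

def mf_loss(observed, U, V, mu, b_user, b_item, lam):
    """Regularized squared error over observed ratings only.

    observed: iterable of (u, i, r) triples.
    U, V:     user/item embedding matrices (m x k, n x k).
    mu:       global mean rating; b_user, b_item: bias vectors.
    lam:      L2 regularization strength.
    """
    total = 0.0
    for u, i, r in observed:
        pred = mu + b_user[u] + b_item[i] + U[u] @ V[i]
        total += (r - pred) ** 2
        total += lam * (U[u] @ U[u] + V[i] @ V[i] + b_user[u] ** 2 + b_item[i] ** 2)
    return total
```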
Training
Loss is non-convex jointly in $U$ and $V$, but convex in either factor when the other is fixed. Standard solvers (both sketched below):
- Alternating least squares (ALS): fix $V$, solve for each $u_u$ in closed form (a per-user least squares); then fix $U$, solve for each $v_i$. Iterate. Highly parallelizable per user / item.
- SGD: sample an observed $(u, i)$ at random, update $u_u$ and $v_i$ along the negative gradient. Easier to extend with side information.
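Sketches of one update for each solver, assuming in-place numpy arrays and omitting bias terms for brevity (function names are illustrative):

```python
import numpy as np

def sgd_step(U, V, u, i, r, lr=0.01, lam=0.1):
    """One SGD update on a single observed rating (u, i, r)."""
    err = r - U[u] @ V[i]          # prediction error
    u_old = U[u].copy()            # keep old user vector for the item update
    U[u] += lr * (err * V[i] - lam * U[u])
    V[i] += lr * (err * u_old - lam * V[i])

def als_user_update(V, rated_items, ratings, lam=0.1):
    """Closed-form solve for one user's embedding with V held fixed."""
    V_u = V[rated_items]                        # (n_u, k) embeddings of items this user rated
    A = V_u.T @ V_u + lam * np.eye(V.shape[1])  # normal-equation matrix (k, k)
    b = V_u.T @ ratings                         # right-hand side (k,)
    return np.linalg.solve(A, b)                # new user embedding
```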
For implicit feedback (clicks, views; no explicit rating), the loss changes:

$$\min_{U, V} \sum_{u, i} c_{ui} \left( p_{ui} - u_u^\top v_i \right)^2 + \lambda \left( \sum_u \lVert u_u \rVert^2 + \sum_i \lVert v_i \rVert^2 \right)$$

where $p_{ui} = 1$ if user $u$ interacted with item $i$, else $p_{ui} = 0$, and $c_{ui}$ is a confidence weight (Hu et al., 2008). Critical: the sum runs over all $(u, i)$ pairs (with low confidence for negatives), not just observed positives.
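A sketch of the per-user weighted least-squares solve for the implicit case, using the Hu et al. (2008) confidence scheme $c_{ui} = 1 + \alpha \cdot \text{count}_{ui}$. This is the naive $O(nk^2)$ version; the paper's trick of precomputing $V^\top V$ avoids touching non-interacted items. Names and the `alpha` default are illustrative:

```python
import numpy as np

def implicit_als_user_update(V, interacted_items, counts, alpha=40.0, lam=0.1):
    """Weighted least-squares solve for one user under the implicit-feedback loss.

    Every item participates: p = 0 with baseline confidence 1 for non-interactions,
    p = 1 with confidence 1 + alpha * count for interactions.
    """
    n, k = V.shape
    p = np.zeros(n)
    c = np.ones(n)                                 # the "missing = 0, low confidence" negatives
    p[interacted_items] = 1.0
    c[interacted_items] = 1.0 + alpha * np.asarray(counts, dtype=float)
    A = (V * c[:, None]).T @ V + lam * np.eye(k)   # V^T C_u V + lam I
    b = (V * c[:, None]).T @ p                     # V^T C_u p_u
    return np.linalg.solve(A, b)
```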
Cold start
MF works only for users and items observed during training. For new users and items, MF gives no embedding. Workarounds:
- Hybrid models: incorporate side information (item features, user demographics).
- Two-tower neural models: encoders take features, can embed arbitrary new users/items at inference.
- Average-of-similar: until enough interactions accumulate, use content-based similarity.
This is why two-tower models displaced pure MF for production: they handle cold start naturally.
Properties
- Embeddings are not directly interpretable: the factorization is only identified up to an invertible linear transform, so individual dimensions carry no fixed meaning. PCA-rotate them if you want to look (sketch after this list).
- Rank $k$ is the main hyperparameter: typically 32–512 in production. Too small → underfits; too large → overfits and is slower to train and serve.
- Implicit dimensions: “latent factors” emerge that often correlate with interpretable concepts (genre, popularity, user activity level).
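A sketch of the PCA rotation mentioned above, via SVD of the centered embeddings (purely for inspection; variable names are illustrative):

```python
import numpy as np

# Stand-in item embeddings; in practice this would be the learned V.
V = np.random.default_rng(0).normal(size=(1000, 64))

V_centered = V - V.mean(axis=0)
# Rows of `components` are the principal directions, ordered by explained variance.
_, _, components = np.linalg.svd(V_centered, full_matrices=False)
V_rotated = V_centered @ components.T  # same geometry, axes aligned to variance
```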
Connection to two-tower models
Two-tower retrieval (see two-tower retrieval) is exactly MF generalized to neural encoders:
- MF: $u_u$ and $v_i$ are free embedding parameters, looked up by ID.
- Two-tower: $u_u = f_\theta(\text{user features})$, $v_i = g_\phi(\text{item features})$, with $f_\theta$, $g_\phi$ learned encoders.
Training is similar (sampled-softmax loss replaces squared error for retrieval). The neural version handles cold-start, side info, and personalization beyond ID embeddings.
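A schematic contrast of the two parameterizations; the encoders `f` and `g` below are hypothetical stand-ins for learned networks:

```python
import numpy as np

class MFTowers:
    """MF: embeddings are free parameters, looked up by ID."""
    def __init__(self, n_users, n_items, k, rng):
        self.U = rng.normal(scale=0.1, size=(n_users, k))
        self.V = rng.normal(scale=0.1, size=(n_items, k))

    def score(self, user_id, item_id):
        return self.U[user_id] @ self.V[item_id]

def two_tower_score(f, g, user_features, item_features):
    """Two-tower: embeddings are computed from features, so new IDs still get a vector."""
    return f(user_features) @ g(item_features)
```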
When to use plain MF in 2026
- Small-scale prototypes, when neural infrastructure is overkill.
- Strong baseline before complex models.
- Pure rating prediction tasks (Netflix Prize style).
- Embedding initialization for downstream models.
For most production systems: two-tower neural models or transformer-based ranking models have replaced MF as the primary architecture.
Common pitfalls
- Treating MF as a similarity-based system. MF embeddings are learned for prediction, not similarity; cosine of MF embeddings is not automatically meaningful as user/item similarity.
- Ignoring biases. Without user/item biases, MF spends embedding capacity modeling popular-vs-niche, which is better captured by a scalar.
- Using MF with explicit-feedback loss on implicit data. The “missing = zero rating” assumption is wrong; use weighted ALS for implicit feedback.
- Comparing MF accuracy on RMSE alone. RMSE doesn't capture top-k ranking quality, which is what matters in most recsys settings (a recall@k sketch follows this list).
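As an illustration of the gap, a minimal recall@k next to RMSE (both helpers are illustrative):

```python
import numpy as np

def rmse(preds, targets):
    preds, targets = np.asarray(preds, dtype=float), np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((preds - targets) ** 2)))

def recall_at_k(scores, held_out_items, k=10):
    """Fraction of a user's held-out items that land in the top-k by predicted score."""
    top_k = np.argsort(-np.asarray(scores))[:k]
    return len(set(top_k.tolist()) & set(held_out_items)) / len(held_out_items)
```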
Related
- Two-tower retrieval. Neural generalization.
- Embedding spaces. How the latent factors are used.