One-line definition
Alternating Least Squares (ALS) factorizes a sparse rating matrix $R \approx U V^\top$, where $U \in \mathbb{R}^{m \times k}$ holds user factors and $V \in \mathbb{R}^{n \times k}$ holds item factors. Optimization alternates: fix $V$, solve for $U$ in closed form (a linear regression per user); fix $U$, solve for $V$. Repeat.
Why it matters
The classic Netflix Prize era was largely won by matrix factorization, and ALS is the simplest training algorithm for it. SGD-based factorization is competitive on dense data, but ALS dominates when the feedback is implicit or the matrix is stored row- and column-blocked across a cluster (Spark MLlib's recommender is ALS).
ALS is still the right baseline for any recommender system before you reach for two-tower retrieval or sequence models. Cheap to train, easy to parallelize, well-understood failure modes.
The mechanism
Loss with regularization:

$$\min_{U,V}\; \sum_{(u,i) \in \Omega} \left(r_{ui} - \mathbf{u}_u^\top \mathbf{v}_i\right)^2 + \lambda\left(\sum_u \|\mathbf{u}_u\|^2 + \sum_i \|\mathbf{v}_i\|^2\right)$$

where $\Omega$ is the set of observed ratings.
Fix all $\mathbf{v}_i$. The loss in $\mathbf{u}_u$ is a ridge regression with the closed-form solution

$$\mathbf{u}_u = \left(\sum_{i \in \Omega_u} \mathbf{v}_i \mathbf{v}_i^\top + \lambda I\right)^{-1} \sum_{i \in \Omega_u} r_{ui}\,\mathbf{v}_i$$

A $k \times k$ system per user. Solve for all users in parallel. Then fix $U$ and solve for each $\mathbf{v}_i$ symmetrically. Iterate until convergence.
The objective is biconvex: convex in $U$ given $V$ and convex in $V$ given $U$, but not jointly convex. ALS finds a local minimum, which is empirically good on real recsys data.
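A minimal NumPy sketch of the two alternating sweeps, assuming a dense rating matrix small enough to hold in memory; the names (`als_explicit`, `R`, `mask`, `lam`) are this sketch's own, not from any library:

```python
import numpy as np

def als_explicit(R, mask, k=20, lam=0.1, n_iters=10, seed=0):
    """R: (m, n) rating matrix; mask: (m, n) boolean, True where observed."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)

    for _ in range(n_iters):
        # Fix V: one k x k ridge system per user over that user's items.
        for u in range(m):
            idx = mask[u]                 # items rated by user u
            Vu = V[idx]                   # (|Omega_u|, k)
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, idx])
        # Fix U: solve symmetrically per item.
        for i in range(n):
            idx = mask[:, i]              # users who rated item i
            Ui = U[idx]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[idx, i])
    return U, V
```

Production implementations vectorize or distribute the per-user and per-item solves; since each system is independent, the loops parallelize trivially.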
Implicit feedback (the practical version)
In real systems, ratings are rare. What you have is implicit signal: clicks, watches, plays. Treat all observed interactions as positives and all missing entries as weak negatives. Hu et al. (2008) reformulated ALS for this:
Replace $r_{ui}$ with a binary preference $p_{ui} = \mathbb{1}[r_{ui} > 0]$ and a confidence weight $c_{ui} = 1 + \alpha r_{ui}$, where $r_{ui}$ is the observed interaction count.
The sum now runs over all $(u, i)$ entries, not just observed ones. The closed-form ALS step still works because the per-user system can be rewritten as

$$\mathbf{u}_u = \left(V^\top C^u V + \lambda I\right)^{-1} V^\top C^u \mathbf{p}_u$$

where $C^u = \mathrm{diag}(c_{u1}, \dots, c_{un})$, with the trick that $V^\top C^u V = V^\top V + V^\top (C^u - I) V$. The first term is precomputed once and shared across users; the second is sparse, since $C^u - I$ is nonzero only on items the user interacted with.
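A sketch of one such per-user solve under these assumptions ($c_{ui} = 1 + \alpha r_{ui}$, `VtV` precomputed once per sweep); the function name and signature are illustrative:

```python
import numpy as np

def solve_user_implicit(V, VtV, item_idx, counts, lam=0.1, alpha=40.0):
    """V: (n, k) item factors; VtV: precomputed V.T @ V;
    item_idx: items this user interacted with; counts: their r_ui."""
    k = V.shape[1]
    Vu = V[item_idx]                      # only the sparse rows matter
    cu_minus_1 = alpha * counts           # diagonal of (C^u - I) on those rows
    # A = V^T C^u V + lam I = V^T V + V^T (C^u - I) V + lam I
    A = VtV + (Vu * cu_minus_1[:, None]).T @ Vu + lam * np.eye(k)
    # b = V^T C^u p_u; p_ui = 1 on interacted items, 0 elsewhere,
    # so each interacted item contributes c_ui * v_i = (1 + alpha r_ui) v_i.
    b = Vu.T @ (1.0 + cu_minus_1)
    return np.linalg.solve(A, b)
```

With `VtV = V.T @ V` computed once per sweep, the per-user cost is $O(k^2\,|\Omega_u| + k^3)$ instead of scaling with the full catalog size $n$.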
Bias terms
Real ratings have systematic shifts: some users rate high, some low; some items are universally loved. Add bias terms:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{u}_u^\top \mathbf{v}_i$$

where $\mu$ is the global mean, $b_u$ the user bias, and $b_i$ the item bias. Biases are also learned in the same alternating framework.
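For concreteness, a hedged sketch of the closed-form user-bias update with everything else held fixed; it follows from minimizing the ridge objective over $b_u$ alone, and the names are illustrative:

```python
import numpy as np

def update_user_bias(r_u, mu, b_items, u_u, V_items, lam=0.1):
    """r_u: observed ratings for one user; b_items, V_items: biases and
    factors of the items that user rated; u_u: the user's factor vector."""
    residual = r_u - mu - b_items - V_items @ u_u   # what the bias must absorb
    return residual.sum() / (len(r_u) + lam)        # ridge-regularized mean
```

The item-bias update is the mirror image over each item's raters.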
Tradeoffs vs alternatives
| Method | Pros | Cons |
|---|---|---|
| ALS | Closed-form per step, parallelizable, no learning rate | $O(k^3)$ solve per user; large $k$ is expensive |
| SGD on factorization | Tiny memory, online-friendly | Needs LR tuning, slower wall-clock at scale |
| Two-tower neural | Cold-start via features, content awareness | Needs more data, harder to train |
| BPR / pairwise loss | Better implicit-feedback ranking | Not closed-form, needs negative sampling |
For a fresh recsys project at moderate scale: ALS first, two-tower if you need cold-start handling or richer features.
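If that first step runs on Spark, the baseline is a few lines of `pyspark.ml.recommendation.ALS`; the column names and the `ratings` DataFrame below are assumptions for illustration:

```python
from pyspark.ml.recommendation import ALS

als = ALS(
    rank=64,                    # latent dimension k
    maxIter=10,
    regParam=0.1,               # lambda
    implicitPrefs=True,         # Hu et al. confidence-weighted variant
    alpha=40.0,                 # confidence scale on interaction counts
    userCol="userId",
    itemCol="itemId",
    ratingCol="count",
    coldStartStrategy="drop",   # drop NaN predictions for unseen users/items
)
model = als.fit(ratings)        # `ratings` is an assumed Spark DataFrame
top_k = model.recommendForAllUsers(10)
```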
Common pitfalls
- Treating all missing entries as negatives without confidence weighting. A user not interacting with an item could be a negative or just unseen. Confidence weighting in implicit ALS handles this.
- Choosing $k$ too large. Latent factors of 50 to 200 are typical; bigger overfits and is slower.
- Forgetting to regularize. Without $\lambda$, ALS overfits trivially on observed entries.
- Comparing to baselines that include bias terms while yours does not. Always include $\mu + b_u + b_i$ before declaring an improvement.
- Running ALS on truly massive data without distributed setup. Spark and similar systems exist exactly for this.