Alternating least squares for collaborative filtering

Factorize the user-item matrix into two low-rank factors. Each is a linear regression given the other, so alternate. The classical recsys workhorse before deep learning.


One-line definition

Alternating Least Squares (ALS) factorizes a sparse rating matrix $R \approx XY^\top$, where $X$ holds user factors (one row $x_u$ per user) and $Y$ holds item factors (one row $y_i$ per item). Optimization alternates: fix $Y$, solve for $X$ in closed form (a linear regression per user); fix $X$, solve for $Y$. Repeat.

Why it matters

The Netflix Prize era was largely won by matrix factorization, and ALS is the simplest training algorithm for it. SGD-based factorization is competitive on dense data, but ALS dominates when the data is implicit-feedback or stored row- and column-blocked across a cluster (Spark MLlib’s recommender is ALS).

ALS is still the right baseline for any recommender system before you reach for two-tower retrieval or sequence models. Cheap to train, easy to parallelize, well-understood failure modes.

The mechanism

Loss with regularization:

$$L = \sum_{(u,i) \in \Omega} \left(r_{ui} - x_u^\top y_i\right)^2 + \lambda \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)$$

where $\Omega$ is the set of observed ratings.
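As a concrete check, the objective can be evaluated directly. A minimal NumPy sketch (the function name and the dict-of-observations representation are illustrative, not a library API):

```python
import numpy as np

def als_loss(R_obs, X, Y, lam):
    """Regularized squared loss over the observed ratings.

    R_obs: dict mapping (user, item) -> rating (the observed set Omega).
    X: (n_users, k) user factors; Y: (n_items, k) item factors.
    """
    # squared error only over observed entries
    fit = sum((r - X[u] @ Y[i]) ** 2 for (u, i), r in R_obs.items())
    # L2 penalty on all factor vectors
    reg = lam * ((X ** 2).sum() + (Y ** 2).sum())
    return fit + reg
```

Tracking this value per iteration is the usual convergence check: it decreases monotonically under exact alternating solves.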

Fix all $y_i$. The loss in $x_u$ is a ridge regression:

$$x_u = \left(\sum_{i \in \Omega_u} y_i y_i^\top + \lambda I\right)^{-1} \sum_{i \in \Omega_u} r_{ui}\, y_i$$

where $\Omega_u$ is the set of items user $u$ has rated. A $k \times k$ system per user. Solve for all users in parallel. Then fix $X$ and solve for each $y_i$ symmetrically. Iterate until convergence.
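The per-user closed-form solve can be sketched in a few lines of NumPy (the helper name and data layout are assumptions for illustration):

```python
import numpy as np

def update_users(R_obs, Y, n_users, lam):
    """One ALS half-step: with item factors Y fixed, solve the
    ridge regression for each user's factor vector in closed form.

    R_obs: dict mapping (user, item) -> rating.
    """
    k = Y.shape[1]
    # group observed items and ratings by user
    by_user = {u: ([], []) for u in range(n_users)}
    for (u, i), r in R_obs.items():
        by_user[u][0].append(i)
        by_user[u][1].append(r)
    X = np.zeros((n_users, k))
    for u, (items, ratings) in by_user.items():
        if not items:
            continue  # user has no observations; leave factors at zero
        Yu = Y[items]                    # (n_u, k) factors of rated items
        A = Yu.T @ Yu + lam * np.eye(k)  # k x k normal equations
        b = Yu.T @ np.array(ratings)
        X[u] = np.linalg.solve(A, b)
    return X
```

The loop over users is embarrassingly parallel, which is why ALS distributes so well; the item half-step is the same code with the roles of `X` and `Y` swapped.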

The objective is biconvex: convex in $X$ given $Y$ and convex in $Y$ given $X$, but not jointly convex. ALS finds a local minimum, which is empirically good on real recsys data.

Implicit feedback (the practical version)

In real systems, ratings are rare. What you have is implicit signal: clicks, watches, plays. Treat all observed interactions as positives and all missing entries as weak negatives. Hu et al. (2008) reformulated ALS for this:

Replace $r_{ui}$ with a binary preference $p_{ui} = \mathbb{1}[r_{ui} > 0]$ and a confidence weight $c_{ui} = 1 + \alpha r_{ui}$, where $r_{ui}$ is the observed interaction count.

The sum is now over all $(u, i)$ entries, not just observed ones. The closed-form ALS step still works because the per-user system can be rewritten as

$$x_u = \left(Y^\top C^u Y + \lambda I\right)^{-1} Y^\top C^u p_u$$

where $C^u$ is the diagonal matrix of confidences $c_{ui}$, with the trick that $Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I) Y$. The first term $Y^\top Y$ is precomputed and shared across users; the second is sparse, since $C^u - I$ is nonzero only at user $u$'s observed items.
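A sketch of the implicit-feedback user step, using the shared Gramian plus a sparse per-user correction (function name and data layout are illustrative):

```python
import numpy as np

def implicit_user_step(Y, interactions, n_users, lam, alpha):
    """One user half-step of implicit-feedback ALS (Hu et al. 2008 style).

    interactions: dict mapping (user, item) -> interaction count.
    Confidence is c = 1 + alpha * count; preference p = 1 for observed.
    """
    n_items, k = Y.shape
    YtY = Y.T @ Y  # precomputed once, shared across all users
    by_user = {}
    for (u, i), cnt in interactions.items():
        by_user.setdefault(u, []).append((i, cnt))
    X = np.zeros((n_users, k))
    for u in range(n_users):
        A = YtY + lam * np.eye(k)
        b = np.zeros(k)
        for i, cnt in by_user.get(u, []):
            c = 1.0 + alpha * cnt
            # sparse correction: add (c - 1) * y_i y_i^T only at observed items
            A += (c - 1.0) * np.outer(Y[i], Y[i])
            # p_ui = 1 for observed items, 0 elsewhere, so only they enter b
            b += c * Y[i]
        X[u] = np.linalg.solve(A, b)
    return X
```

The key cost saving is that the loop touches only each user's observed items, even though the objective formally sums over every entry of the matrix.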

Bias terms

Real ratings have systematic shifts: some users rate high, some low; some items are universally loved. Add bias terms:

$$\hat{r}_{ui} = \mu + b_u + b_i + x_u^\top y_i$$

where $\mu$ is the global mean, $b_u$ the user bias, $b_i$ the item bias. Biases are also learned in the same alternating framework.
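With everything else fixed, each bias also has a closed-form ridge update. A minimal sketch of the biased prediction and the user-bias step (function names and the dict-of-observations layout are hypothetical):

```python
import numpy as np

def predict(mu, b_user, b_item, X, Y, u, i):
    """Bias-aware prediction: global mean + user bias + item bias + dot product."""
    return mu + b_user[u] + b_item[i] + X[u] @ Y[i]

def update_user_biases(R_obs, mu, b_item, X, Y, n_users, lam):
    """Closed-form ridge update for user biases with mu, b_i, X, Y fixed:
    b_u = (sum of user u's residuals) / (count + lam).  Assumes lam > 0
    so users with no observations get bias 0 rather than a zero division.
    """
    num = np.zeros(n_users)
    cnt = np.zeros(n_users)
    for (u, i), r in R_obs.items():
        num[u] += r - mu - b_item[i] - X[u] @ Y[i]
        cnt[u] += 1
    return num / (cnt + lam)
```

The item-bias update is the mirror image, and both slot into the same alternating loop as the factor solves.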

Tradeoffs vs alternatives

| Method | Pros | Cons |
| --- | --- | --- |
| ALS | Closed-form per step, parallelizable, no learning rate | A $k \times k$ solve per user; large $k$ is expensive |
| SGD on factorization | Tiny memory, online-friendly | Needs LR tuning, slower wall-clock at scale |
| Two-tower neural | Cold-start via features, content awareness | Needs more data, harder to train |
| BPR / pairwise loss | Better implicit-feedback ranking | Not closed-form, needs negative sampling |

For a fresh recsys project at moderate scale: ALS first, two-tower if you need cold-start handling or richer features.

Common pitfalls

  • Treating all missing entries as negatives without confidence weighting. A user not interacting with an item could be a negative or just unseen. Confidence weighting in implicit ALS handles this.
  • Choosing $k$ too large. Latent dimensions of 50 to 200 are typical; bigger overfits and is slower.
  • Forgetting to regularize. Without $\lambda$, ALS overfits trivially on observed entries.
  • Comparing to baselines that include bias terms while yours does not. Always include $\mu$, $b_u$, and $b_i$ before declaring an improvement.
  • Running ALS on truly massive data without distributed setup. Spark and similar systems exist exactly for this.