One-line definition
Alternating Least Squares (ALS) factorizes a sparse rating matrix $R \approx U V^\top$, where $U \in \mathbb{R}^{m \times k}$ holds user factors and $V \in \mathbb{R}^{n \times k}$ holds item factors. Optimization alternates: fix $V$, solve for $U$ in closed form (a linear regression per user); fix $U$, solve for $V$. Repeat.
Why it matters
The classic Netflix Prize era was largely won by matrix factorization, and ALS is the simplest training algorithm for it. SGD-based factorization is competitive on dense data, but ALS dominates when the feedback is implicit or the matrix is stored row- and column-blocked across a cluster (Spark MLlib's recommender is ALS).
ALS is still the right baseline for any recommender system before you reach for two-tower retrieval or sequence models. Cheap to train, easy to parallelize, well-understood failure modes.
The mechanism
Loss with regularization:

$$\min_{U,V}\; \sum_{(u,i) \in \Omega} \left(r_{ui} - \mathbf{u}_u^\top \mathbf{v}_i\right)^2 + \lambda\left(\sum_u \|\mathbf{u}_u\|^2 + \sum_i \|\mathbf{v}_i\|^2\right)$$

where $\Omega$ is the set of observed ratings.
Fix all $\mathbf{v}_i$. The loss in $\mathbf{u}_u$ is a ridge regression with the closed-form solution

$$\mathbf{u}_u = \left(\sum_{i \in \Omega_u} \mathbf{v}_i \mathbf{v}_i^\top + \lambda I\right)^{-1} \sum_{i \in \Omega_u} r_{ui}\,\mathbf{v}_i$$

A $k \times k$ system per user. Solve for all users in parallel. Then fix $U$ and solve for each $\mathbf{v}_i$ symmetrically. Iterate until convergence.
The objective is biconvex: convex in $U$ given $V$ and convex in $V$ given $U$, but not jointly convex. ALS finds a local minimum, which is empirically good on real recsys data.
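A minimal NumPy sketch of the two alternating sweeps, assuming a dense rating matrix small enough to hold in memory; the names (`als_explicit`, `R`, `mask`, `lam`) are this sketch's own, not from any library:

```python
import numpy as np

def als_explicit(R, mask, k=20, lam=0.1, n_iters=10, seed=0):
    """R: (m, n) rating matrix; mask: (m, n) boolean, True where observed."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    reg = lam * np.eye(k)

    for _ in range(n_iters):
        # Fix V: one k x k ridge system per user over that user's items.
        for u in range(m):
            idx = mask[u]                 # items rated by user u
            Vu = V[idx]                   # (|Omega_u|, k)
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, idx])
        # Fix U: solve symmetrically per item.
        for i in range(n):
            idx = mask[:, i]              # users who rated item i
            Ui = U[idx]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[idx, i])
    return U, V
```

Production implementations vectorize or distribute the per-user and per-item solves; since each system is independent, the loops parallelize trivially.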
Implicit feedback (the practical version)
In real systems, ratings are rare. What you have is implicit signal: clicks, watches, plays. Treat all observed interactions as positives and all missing entries as weak negatives. Hu et al. (2008) reformulated ALS for this:
Replace $r_{ui}$ with a binary preference $p_{ui} = \mathbb{1}[r_{ui} > 0]$ and a confidence weight $c_{ui} = 1 + \alpha r_{ui}$, where $r_{ui}$ is the observed interaction count.
The sum now runs over all $(u, i)$ entries, not just observed ones. The closed-form ALS step still works because the per-user system can be rewritten as

$$\mathbf{u}_u = \left(V^\top C^u V + \lambda I\right)^{-1} V^\top C^u \mathbf{p}_u$$

where $C^u = \mathrm{diag}(c_{u1}, \dots, c_{un})$, with the trick that $V^\top C^u V = V^\top V + V^\top (C^u - I) V$. The first term is precomputed once and shared across users; the second is sparse, since $C^u - I$ is nonzero only on items the user interacted with.
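A sketch of one such per-user solve under these assumptions ($c_{ui} = 1 + \alpha r_{ui}$, `VtV` precomputed once per sweep); the function name and signature are illustrative:

```python
import numpy as np

def solve_user_implicit(V, VtV, item_idx, counts, lam=0.1, alpha=40.0):
    """V: (n, k) item factors; VtV: precomputed V.T @ V;
    item_idx: items this user interacted with; counts: their r_ui."""
    k = V.shape[1]
    Vu = V[item_idx]                      # only the sparse rows matter
    cu_minus_1 = alpha * counts           # diagonal of (C^u - I) on those rows
    # A = V^T C^u V + lam I = V^T V + V^T (C^u - I) V + lam I
    A = VtV + (Vu * cu_minus_1[:, None]).T @ Vu + lam * np.eye(k)
    # b = V^T C^u p_u; p_ui = 1 on interacted items, 0 elsewhere,
    # so each interacted item contributes c_ui * v_i = (1 + alpha r_ui) v_i.
    b = Vu.T @ (1.0 + cu_minus_1)
    return np.linalg.solve(A, b)
```

With `VtV = V.T @ V` computed once per sweep, the per-user cost is $O(k^2\,|\Omega_u| + k^3)$ instead of scaling with the full catalog size $n$.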
Bias terms
Real ratings have systematic shifts: some users rate high, some low; some items are universally loved. Add bias terms:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{u}_u^\top \mathbf{v}_i$$

where $\mu$ is the global mean, $b_u$ the user bias, and $b_i$ the item bias. Biases are also learned in the same alternating framework.
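For concreteness, a hedged sketch of the closed-form user-bias update with everything else held fixed; it follows from minimizing the ridge objective over $b_u$ alone, and the names are illustrative:

```python
import numpy as np

def update_user_bias(r_u, mu, b_items, u_u, V_items, lam=0.1):
    """r_u: observed ratings for one user; b_items, V_items: biases and
    factors of the items that user rated; u_u: the user's factor vector."""
    residual = r_u - mu - b_items - V_items @ u_u   # what the bias must absorb
    return residual.sum() / (len(r_u) + lam)        # ridge-regularized mean
```

The item-bias update is the mirror image over each item's raters.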
Tradeoffs vs alternatives
| Method | Pros | Cons |
|---|---|---|
| ALS | Closed-form per step, parallelizable, no learning rate | $O(k^3)$ solve per user; large $k$ is expensive |
| SGD on factorization | Tiny memory, online-friendly | Needs LR tuning, slower wall-clock at scale |
| Two-tower neural | Cold-start via features, content awareness | Needs more data, harder to train |
| BPR / pairwise loss | Better implicit-feedback ranking | Not closed-form, needs negative sampling |
For a fresh recsys project at moderate scale: ALS first, two-tower if you need cold-start handling or richer features.
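If that first step runs on Spark, the baseline is a few lines of `pyspark.ml.recommendation.ALS`; the column names and the `ratings` DataFrame below are assumptions for illustration:

```python
from pyspark.ml.recommendation import ALS

als = ALS(
    rank=64,                    # latent dimension k
    maxIter=10,
    regParam=0.1,               # lambda
    implicitPrefs=True,         # Hu et al. confidence-weighted variant
    alpha=40.0,                 # confidence scale on interaction counts
    userCol="userId",
    itemCol="itemId",
    ratingCol="count",
    coldStartStrategy="drop",   # drop NaN predictions for unseen users/items
)
model = als.fit(ratings)        # `ratings` is an assumed Spark DataFrame
top_k = model.recommendForAllUsers(10)
```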
Common pitfalls
- Treating all missing entries as negatives without confidence weighting. A user not interacting with an item could be a negative or just unseen. Confidence weighting in implicit ALS handles this.
- Choosing $k$ too large. Latent factors of 50 to 200 are typical; bigger overfits and is slower.
- Forgetting to regularize. Without $\lambda$, ALS overfits trivially on observed entries.
- Comparing to baselines that include bias terms while yours does not. Always include $\mu + b_u + b_i$ before declaring an improvement.
- Running ALS on truly massive data without distributed setup. Spark and similar systems exist exactly for this.