Factor analysis and probabilistic PCA

The latent linear-Gaussian model behind PCA. Factor analysis explains observed variables as a few latent factors plus per-feature noise; probabilistic PCA is the isotropic-noise special case whose maximum-likelihood solution recovers classical PCA as the noise goes to zero.

Reviewed May 31, 2026 · 3 min read

One-line definition

Factor analysis (FA) is a latent linear-Gaussian model: each observation is a linear map of a few low-dimensional latent factors plus Gaussian noise. Probabilistic PCA (PPCA) is the special case with isotropic noise, and classical PCA is recovered from PPCA's maximum-likelihood solution in the zero-noise limit ($σ^{2} \to 0$).

Why it matters

This is the model that turns PCA from “an eigen-decomposition trick” into “a probabilistic generative model,” which is the framing senior interviewers want. It connects dimensionality reduction to the EM algorithm, to VAEs (a nonlinear PPCA), and to the generative-vs-discriminative discussion. It’s also a clean example of how a prior + likelihood recovers a classical algorithm as a limiting case.

The generative model

Latent factor $z \in \mathbb{R}^{k}$ with $k ≪ d$, observation $x \in \mathbb{R}^{d}$:

$$
z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(W z + μ, Ψ).
$$

$W \in \mathbb{R}^{d \times k}$ is the factor loading matrix (its columns are the latent directions), and $Ψ$ is the noise covariance. Marginalizing out $z$ gives a Gaussian whose covariance is a rank-$k$ term plus the noise covariance:

$$
x \sim \mathcal{N}(μ, W W^{⊤} + Ψ).
$$
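The marginal covariance can be checked numerically by sampling from the generative model. The following is a minimal NumPy sketch; the dimensions, the random seed, and the particular values of the loadings and noise variances are arbitrary choices made for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 200_000              # illustrative sizes (assumed, not from the text)

W = rng.normal(size=(d, k))          # factor loading matrix
mu = rng.normal(size=d)              # mean vector
psi = rng.uniform(0.1, 1.0, size=d)  # diagonal of Psi: per-feature noise variances

z = rng.normal(size=(n, k))                    # z ~ N(0, I)
eps = rng.normal(size=(n, d)) * np.sqrt(psi)   # eps ~ N(0, Psi)
x = z @ W.T + mu + eps                         # x | z ~ N(W z + mu, Psi)

model_cov = W @ W.T + np.diag(psi)             # covariance predicted by the model
sample_cov = np.cov(x, rowvar=False)           # covariance observed in the samples

# Largest entrywise gap, relative to the largest model covariance entry.
rel_err = np.abs(sample_cov - model_cov).max() / np.abs(model_cov).max()
print(f"relative covariance error: {rel_err:.4f}")
```

With this many samples the relative error should be small, which is the sense in which the latent factors account for all the cross-feature correlation.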

The whole model is the claim: the correlations between observed variables are explained by a few shared latent factors; whatever is left is independent per-feature noise.

FA vs PPCA vs PCA — it’s all about $Ψ$

Model	Noise covariance $Ψ$	Consequence
Factor analysis	diagonal $diag (ψ_{1}, \dots, ψ_{d})$	per-feature noise; scale-invariant; models unique variances
Probabilistic PCA	isotropic $σ^{2} I$	one shared noise level; MLE has closed form via eigendecomposition
Classical PCA	$σ^{2} \to 0$ limit	deterministic projection onto top-$k$ eigenvectors

The single most important distinction for interviews: FA has a diagonal noise covariance (different noise per feature); PPCA forces it isotropic (same noise everywhere). That’s why FA is invariant to rescaling individual features while PCA and PPCA are sensitive to feature scaling (hence “standardize before PCA”).
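To see why the noise structure decides scale sensitivity, one can rescale each feature by a positive constant and track what happens to the marginal covariance. The diagonal matrix $D$ and the scale factors $a_{i}$ are notation introduced here for illustration.

$$
x' = D x, \quad D = \operatorname{diag}(a_{1}, \dots, a_{d}), \quad a_{i} > 0
\quad \Longrightarrow \quad
x' \sim \mathcal{N}\big(D μ, \; (D W)(D W)^{⊤} + D Ψ D\big).
$$

For FA, $D Ψ D = \operatorname{diag}(a_{1}^{2} ψ_{1}, \dots, a_{d}^{2} ψ_{d})$ is still diagonal, so the rescaled data are described by another FA model with loadings $D W$: the fit rescales along with the features. For PPCA, $D (σ^{2} I) D = σ^{2} D^{2}$ is isotropic only when all the $a_{i}$ are equal, so rescaling features generally takes the data outside the model family and changes the fitted subspace.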

Fitting it

PPCA has a closed-form MLE: $W$ is recovered from the top-$k$ eigenvectors of the sample covariance, each scaled by $(λ_{i} - σ^{2})^{1/2}$ (up to an arbitrary rotation of the latent space), with $σ^{2}$ equal to the average of the $d - k$ discarded eigenvalues. So PPCA ≈ PCA plus a noise estimate.
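FA has no closed-form MLE: with a separate noise variance per feature, the optimal $W$ and $Ψ$ depend on each other and the solution no longer reduces to an eigendecomposition. It is typically fit with EM: the E-step computes the posterior over factors $p(z \mid x)$, and the M-step updates $W$ and $Ψ$. This is a textbook EM application.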
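Both fitting procedures can be sketched in a few lines of NumPy. The function names `ppca_mle` and `fa_em`, the random initialization, the fixed iteration count, and the small floor on the noise variances are assumptions made for this illustration; the EM updates are the standard ones written in terms of the sample covariance, and no convergence check is included.

```python
import numpy as np

def ppca_mle(X, k):
    """Closed-form PPCA maximum likelihood. Returns (mu, W, sigma2)."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # ML sample covariance (divides by n)
    evals, evecs = np.linalg.eigh(S)             # eigh returns ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    sigma2 = evals[k:].mean()                    # average of the discarded eigenvalues
    # Scale the top-k eigenvectors by sqrt(lambda_i - sigma2); the arbitrary
    # rotation of the latent space is taken to be the identity here.
    W = evecs[:, :k] * np.sqrt(np.maximum(evals[:k] - sigma2, 0.0))
    return mu, W, sigma2

def fa_em(X, k, n_iter=200, seed=0):
    """EM for factor analysis with diagonal noise. Returns (mu, W, psi)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    W = rng.normal(size=(d, k))                  # random start (an assumption)
    psi = np.diag(S).copy()                      # start from the feature variances
    for _ in range(n_iter):
        # E-step: posterior mean of z is beta @ (x - mu), beta = W^T (W W^T + Psi)^{-1}
        C = W @ W.T + np.diag(psi)
        beta = np.linalg.solve(C, W).T           # shape (k, d); C is symmetric
        S_beta = S @ beta.T                      # average of (x - mu) E[z | x]^T
        Ezz = np.eye(k) - beta @ W + beta @ S_beta   # average of E[z z^T | x]
        # M-step: update the loadings, then the per-feature noise variances
        W = np.linalg.solve(Ezz, S_beta.T).T     # equals S_beta @ inv(Ezz)
        psi = np.maximum(np.diag(S - W @ S_beta.T), 1e-8)  # floor keeps psi positive
    return mu, W, psi
```

Calling `ppca_mle(X, k)` on an `(n, d)` data array returns the loadings in one shot, while `fa_em(X, k)` has to iterate; comparing `W @ W.T + np.diag(psi)` against the sample covariance is a quick sanity check on the FA fit.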

Why the probabilistic version is worth it

Recasting PCA as a model buys you things plain PCA can’t do:

A proper likelihood → principled model comparison and a way to choose $k$.
Natural handling of missing data (marginalize unobserved dimensions in EM).
A generative model you can sample from.
Mixtures of PPCA/FA for non-linear, multi-modal structure.
The conceptual bridge to the VAE, which is “PPCA with a neural-network decoder and amortized inference.”
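The missing-data point in the list above rests on Gaussian conditioning: because the marginal over $x$ is Gaussian, the missing entries of an observation have a closed-form conditional mean given the observed ones. The sketch below shows only that conditioning step with known parameters; the function name `impute_missing` and the use of NaN to mark missing entries are assumptions for illustration, and a full EM fit with missing data would repeat this inside the E-step.

```python
import numpy as np

def impute_missing(x, mu, W, psi):
    """Fill NaN entries of x with their conditional mean under
    x ~ N(mu, W W^T + diag(psi)), given the observed entries."""
    x = np.asarray(x, dtype=float)
    miss = np.isnan(x)                 # boolean mask of missing features
    obs = ~miss
    C = W @ W.T + np.diag(psi)         # marginal covariance of x
    C_oo = C[np.ix_(obs, obs)]         # observed-observed block
    C_mo = C[np.ix_(miss, obs)]        # missing-observed block
    filled = x.copy()
    # E[x_m | x_o] = mu_m + C_mo C_oo^{-1} (x_o - mu_o)
    filled[miss] = mu[miss] + C_mo @ np.linalg.solve(C_oo, x[obs] - mu[obs])
    return filled

# Two features sharing one factor: observing the first informs the second.
x_filled = impute_missing(
    [2.0, np.nan],
    mu=np.zeros(2),
    W=np.array([[1.0], [1.0]]),
    psi=np.array([1.0, 1.0]),
)
print(x_filled)
```

In the usage example the marginal covariance is `[[2, 1], [1, 2]]`, so the missing feature is filled with half the observed value.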

What an interviewer expects you to say

Write the latent linear-Gaussian generative model and the marginal covariance $W W^{⊤} + Ψ$.
State the key difference: FA = diagonal noise, PPCA = isotropic noise, PCA = zero-noise limit of PPCA.
Explain the practical consequence: FA is invariant to per-feature rescaling; PCA/PPCA are sensitive to it, so features are usually standardized first.
Know that PPCA has a closed-form (eigendecomposition) MLE while FA needs EM.
Bonus: connect to VAEs (nonlinear PPCA) and note the probabilistic framing enables missing data, model selection, and sampling.

Common confusions

“FA and PCA are the same.” FA models per-feature (diagonal) noise and explains covariance; PCA maximizes retained variance and assumes isotropic/zero noise. They give different loadings unless the noise variance is the same for every feature.
“PPCA is fancier PCA with no payoff.” The payoff is the likelihood: model selection, missing data, sampling, mixtures.
“The factors are unique.” $W$ is only identifiable up to rotation (you can rotate $z$ and absorb the rotation into $W$), which is why “factor rotation” methods such as varimax are used to pick an interpretable solution.
“FA needs scaling like PCA.” FA is invariant to per-feature rescaling because its diagonal noise absorbs scale; PCA is not.
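The rotation ambiguity behind “the factors are unique” is easy to confirm numerically: rotating the loadings changes $W$ but leaves $W W^{⊤}$, and therefore the likelihood, untouched. A minimal sketch follows, with an arbitrary loading matrix and rotation angle chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))          # arbitrary loadings (illustrative)

theta = 0.7                          # arbitrary rotation angle in radians
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2x2 rotation, R @ R.T = I

W_rot = W @ R                        # rotated loadings

same_cov = np.allclose(W @ W.T, W_rot @ W_rot.T)   # identical marginal covariance
different_loadings = not np.allclose(W, W_rot)     # but the loadings differ
print(same_cov, different_loadings)
```

Since the data only constrain $W W^{⊤} + Ψ$, any such rotation fits equally well, and choosing among them is a matter of interpretability rather than likelihood.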