One-line definition
LSTM and GRU are recurrent units with learned gates that decide, per timestep, what to keep and what to forget. Each gate is a sigmoid output, applied via Hadamard (elementwise) product to the state it controls. The carried state is no longer multiplied by a weight matrix at every step, so gradients can flow.
Why it matters
A vanilla RNN updates its hidden state as $h_t = \tanh(W h_{t-1} + U x_t + b)$. Backpropagating through $T$ timesteps multiplies the gradient by $W^\top \, \mathrm{diag}(\tanh'(\cdot))$ a total of $T$ times. Eigenvalues of $W$ with magnitude less than 1 make it vanish; magnitude greater than 1 makes it explode. Both kill learning (Pascanu et al., 2013).
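A quick numeric check of the multiplication argument (a sketch with arbitrary sizes; the tanh derivative is ignored, which only makes vanishing worse):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 100, 64  # backprop horizon and hidden size (arbitrary)

for radius in (0.9, 1.1):  # spectral radius below / above 1
    # Random recurrent matrix, rescaled to the target spectral radius.
    W = rng.standard_normal((D, D))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))

    grad = np.ones(D)  # stand-in for dL/dh_T
    for _ in range(T):
        grad = W.T @ grad  # one backward step through h_t = tanh(W h_{t-1} + ...)
    print(f"spectral radius {radius}: |grad| after {T} steps = {np.linalg.norm(grad):.2e}")
```

The 0.9 case collapses toward zero and the 1.1 case blows up, exactly the $\lambda^T$ behavior the eigenvalue argument predicts.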
LSTMs (Hochreiter & Schmidhuber, 1997) replaced repeated matmul on the cell state with a Hadamard-product update, opening a “gradient highway.” GRUs (Cho et al., 2014) simplified this to two gates with similar empirical performance.
Both are now mostly historical. Transformers replaced them almost everywhere. But the gating idea persists in attention masking, gating in mixture-of-experts, and residual gating in modern architectures.
The LSTM cell
State: hidden state $h_t$ and cell state $c_t$.
Gates (each computed from $[h_{t-1}, x_t]$):

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

Candidate: $\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$.
Update:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$
The forget gate controls what fraction of the previous cell state survives. The input gate controls what fraction of the new candidate is written. The output gate controls what fraction of the cell is exposed.
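A single LSTM step in NumPy, transcribing the equations above (a minimal sketch: the fused weight layout and all names are illustrative, not any particular library's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. W maps the concatenated [h_prev, x] to all four
    pre-activations stacked as (forget, input, output, candidate)."""
    z = W @ np.concatenate([h_prev, x]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates, each in (0, 1)
    c_tilde = np.tanh(g)                          # candidate cell value
    c = f * c_prev + i * c_tilde                  # Hadamard update: no matmul on c
    h = o * np.tanh(c)                            # exposed hidden state
    return h, c

# Example: hidden size 8, input size 4, zero-initialized state.
H, X = 8, 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4 * H, H + X)) * 0.1, np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), W, b)
```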
Why Hadamard products fix gradients
The cell-state recurrence is $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. The Jacobian $\partial c_t / \partial c_{t-1} = \mathrm{diag}(f_t)$ has eigenvalues bounded in $(0, 1)$. If the forget gate stays near 1, gradients pass through nearly unchanged; if it goes near 0, the cell forgets cleanly. No exponential blow-up or decay.
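The same numeric check as in the vanilla-RNN case, but through the gated recurrence: each backward step multiplies by $\mathrm{diag}(f_t)$, so the gradient norm is set by the forget gate, not by a matrix's spectrum (gate values fixed for illustration):

```python
import numpy as np

T, D = 100, 64
for f in (0.99, 0.5):  # forget gate near 1 vs well below 1
    grad = np.ones(D)
    for _ in range(T):
        grad = f * grad  # backward step through c_t = f * c_{t-1} + ...
    print(f"forget gate {f}: |grad| after {T} steps = {np.linalg.norm(grad):.2e}")
```

With $f_t \approx 1$ the gradient decays only mildly over 100 steps; with $f_t = 0.5$ it decays fast, but that is deliberate forgetting, not an uncontrolled artifact of the weights.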
The GRU cell
Two gates instead of three:

$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z), \qquad r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$

$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h), \qquad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
Single hidden state, no separate cell. Three weight matrices instead of four, so roughly 25% fewer parameters at the same hidden size. Empirically comparable to LSTM on most tasks.
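The matching GRU step, under the same sketch conventions as the LSTM block above (note the reset gate hits $h_{t-1}$ before the candidate's matmul):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU timestep, convention h_t = z*h_prev + (1-z)*candidate."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)  # update gate: how much old state to keep
    r = sigmoid(Wr @ hx + br)  # reset gate: how much old state the candidate sees
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]) + bh)
    return z * h_prev + (1.0 - z) * h_tilde  # interpolate old vs new

H, X = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.standard_normal((H, H + X)) * 0.1 for _ in range(3))
h = gru_step(rng.standard_normal(X), np.zeros(H), Wz, Wr, Wh,
             np.zeros(H), np.zeros(H), np.zeros(H))
```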
Tradeoffs vs transformers
- Sequential: must process tokens one at a time. Cannot parallelize across the sequence dimension at training time.
- Linear in sequence length at inference (vs $O(T^2)$ for vanilla attention). This advantage makes RNN-style models attractive again at very long context (Mamba, RWKV, linear attention).
- No explicit attention. Information from token $i$ to token $j$ has to survive all $j - i$ intervening gate updates. Long-range dependencies are still hard in practice.
Common pitfalls
- Treating LSTM and GRU as interchangeable. They are close empirically but the cell state in LSTM gives sharper control over long-range memory.
- Using vanilla RNNs in 2025. Almost never the right choice. Either go LSTM/GRU or, more likely, transformer.
- Forgetting truncated BPTT. Backpropagating through 100k tokens is infeasible; truncate at a window (typically 64 to 256 tokens) and cut gradients at the boundary, as in the sketch below.
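A minimal truncated-BPTT loop in PyTorch (a sketch: the model, data, and window size are placeholders; the one essential move is detaching the carried state at each window boundary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
WINDOW, BATCH, DIM, HIDDEN = 128, 8, 32, 64  # hypothetical sizes

lstm = nn.LSTM(DIM, HIDDEN, batch_first=True)
head = nn.Linear(HIDDEN, DIM)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)

seq = torch.randn(BATCH, 100 * WINDOW + 1, DIM)  # stand-in for a long sequence
h = c = torch.zeros(1, BATCH, HIDDEN)

for start in range(0, 100 * WINDOW, WINDOW):
    chunk = seq[:, start : start + WINDOW]
    target = seq[:, start + 1 : start + WINDOW + 1]  # next-step prediction
    # Detach: carry the state forward, but cut the gradient at the boundary
    # so backprop never spans more than one window.
    h, c = h.detach(), c.detach()
    out, (h, c) = lstm(chunk, (h, c))
    loss = nn.functional.mse_loss(head(out), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```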