One-line definition
A graph neural network (GNN) updates each node’s features by aggregating its neighbors’ features and applying a learned transformation. The simplest variant is one matmul: $H' = \sigma(\hat{A} H W)$, where $\hat{A}$ is a normalized adjacency matrix, $H$ stacks node features row-wise, and $W$ is the learned weight matrix.
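As a concrete sketch (PyTorch, dense matrices, toy sizes; all values and names here are illustrative, not a reference implementation), the whole update is one normalized-adjacency matmul, one weight matmul, and a nonlinearity:

```python
import torch

# Toy graph: 4 nodes in two connected pairs, 3-dimensional features.
A_hat = torch.tensor([[0.5, 0.5, 0.0, 0.0],   # normalized adjacency with self-loops
                      [0.5, 0.5, 0.0, 0.0],
                      [0.0, 0.0, 0.5, 0.5],
                      [0.0, 0.0, 0.5, 0.5]])
H = torch.randn(4, 3)                          # one row of features per node
W = torch.randn(3, 8)                          # learned transformation (random here)

H_next = torch.relu(A_hat @ H @ W)             # aggregate, transform, activate
```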
Why it matters
A CNN exploits regular grid structure with shared local filters. A GNN does the same on graphs: shared parameters, local aggregation, but the neighborhood is defined by graph edges instead of pixel proximity. This unlocks ML on social networks, molecules, knowledge graphs, code ASTs, and recommender bipartite graphs.
GNNs power molecule and materials discovery (DeepMind’s GNoME), protein structure prediction (the Evoformer blocks in AlphaFold), and large-scale recommenders (Pinterest’s PinSAGE, GraphSAGE at Uber Eats). Modern transformers are arguably a special case (a complete graph with attention as the edge weighting).
The mechanism: GCN
The graph convolutional network (Kipf & Welling, 2017) is the canonical GNN:

$$H^{(l+1)} = \sigma\!\left(\hat{A}\, H^{(l)} W^{(l)}\right), \qquad \hat{A} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2},$$

where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is its degree matrix, and $\hat{A}$ is the symmetrically normalized adjacency.
Decompose:
- Aggregate: $\hat{A} H^{(l)}$ averages each node with its neighbors (and itself, via the self-loop).
- Transform: multiply by $W^{(l)}$ to mix features.
- Activate: ReLU or similar.
After $L$ layers, each node has aggregated information from its $L$-hop neighborhood. The network is a sequence of matmuls, the same operation as any other deep learning model. The graph structure shows up only in $\hat{A}$.
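A minimal sketch of one such layer and a two-layer stack (PyTorch, dense matrices for readability; `GCNLayer`, `normalize_adjacency`, and all sizes are illustrative):

```python
import torch
import torch.nn as nn

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out, bias=False)   # the learned transform W

    def forward(self, A_hat: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(A_hat @ self.linear(H))          # aggregate, transform, activate

# Two layers: each node now sees its 2-hop neighborhood.
A = (torch.rand(6, 6) < 0.3).float()
A = ((A + A.T) > 0).float().fill_diagonal_(0)              # random symmetric toy graph
A_hat = normalize_adjacency(A)
H = torch.randn(6, 16)
H = GCNLayer(16, 32)(A_hat, H)
H = GCNLayer(32, 32)(A_hat, H)
```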
Variants by aggregator
- GraphSAGE (Hamilton et al., 2017). Sample a fixed number of neighbors per node; aggregate with mean, max, or LSTM. Practical for large graphs where full-neighbor aggregation is infeasible.
- GAT (Veličković et al., 2018). Replace uniform averaging with learned attention weights per edge: $\alpha_{ij} = \operatorname{softmax}_{j \in \mathcal{N}(i)}\!\big(\operatorname{LeakyReLU}(a^\top [W h_i \,\|\, W h_j])\big)$, then $h_i' = \sigma\big(\sum_j \alpha_{ij} W h_j\big)$. Same form as transformer attention, restricted to graph neighbors (see the sketch after this list).
- GIN (Xu et al., 2019). Use sum aggregation plus a learnable $\epsilon$ that re-weights a node’s own features before an MLP. Provably as expressive as the Weisfeiler-Lehman graph isomorphism test.
- Message-passing neural networks (Gilmer et al., 2017). General framework: edges carry messages, nodes aggregate, both can be parametrized.
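A minimal sketch of the GAT-style edge attention referenced above (PyTorch, a single head, explicit loop over nodes for clarity; the edge list, dimensions, and variable names are all illustrative):

```python
import torch
import torch.nn.functional as F

# Toy setup: 4 nodes, one attention head.
num_nodes, d_in, d_out = 4, 8, 8
h = torch.randn(num_nodes, d_in)       # input node features
W = torch.randn(d_in, d_out)           # shared linear transform
a = torch.randn(2 * d_out)             # attention vector

# Directed edge list [src, dst]: src sends a message to dst. Self-loops included.
edges = torch.tensor([[0, 0], [1, 1], [2, 2], [3, 3],
                      [1, 0], [2, 0], [0, 1], [3, 2]])
src, dst = edges[:, 0], edges[:, 1]

Wh = h @ W                                                  # transform once
e = F.leaky_relu((torch.cat([Wh[dst], Wh[src]], dim=1) * a).sum(dim=1),
                 negative_slope=0.2)                        # score per edge: a^T [W h_i || W h_j]

# Softmax of scores over each node's incoming edges -> attention weights alpha_ij.
alpha = torch.zeros_like(e)
for i in range(num_nodes):
    mask = dst == i
    alpha[mask] = torch.softmax(e[mask], dim=0)

# Aggregate: attention-weighted sum of transformed neighbor features.
h_out = torch.zeros(num_nodes, d_out)
h_out.index_add_(0, dst, alpha.unsqueeze(1) * Wh[src])
h_out = F.elu(h_out)
```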
What the message-passing framework looks like
Most GNNs fit:

$$h_i^{(l+1)} = \operatorname{Update}\Big(h_i^{(l)},\ \operatorname{Aggregate}\big(\{\, \operatorname{Message}(h_i^{(l)}, h_j^{(l)}, e_{ij}) : j \in \mathcal{N}(i) \,\}\big)\Big)$$
Aggregate is permutation-invariant (sum, mean, max, attention). Different choices give different GNN families.
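The framework fits in a few lines of Python; a schematic sketch (pure Python over PyTorch tensors; `message_fn`, `aggregate_fn`, and `update_fn` are placeholders you would swap for learned modules):

```python
import torch

def message_passing_layer(h, edges, message_fn, aggregate_fn, update_fn):
    """One generic message-passing step.

    h: [N, d] node features; edges: iterable of (src, dst) pairs.
    """
    inbox = [[] for _ in range(h.shape[0])]
    for src, dst in edges:                        # edges carry messages src -> dst
        inbox[dst].append(message_fn(h[dst], h[src]))
    return torch.stack([
        update_fn(h[i], aggregate_fn(inbox[i]))   # permutation-invariant aggregate, then update
        for i in range(h.shape[0])
    ])

# GCN-flavoured choices: identity message, mean aggregation, additive update.
h = torch.randn(3, 4)
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 0), (1, 1), (2, 2)]
h_next = message_passing_layer(
    h, edges,
    message_fn=lambda h_i, h_j: h_j,
    aggregate_fn=lambda msgs: torch.stack(msgs).mean(dim=0),
    update_fn=lambda h_i, m: torch.relu(h_i + m),
)
```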
Where GNNs hit walls
- Over-smoothing. After many layers, all nodes converge to similar representations (a small demonstration follows this list). Practical depth: 2 to 5 layers for most graphs. Workarounds: residual connections, gating, jumping-knowledge networks.
- Over-squashing. Information from distant nodes gets compressed through narrow bottlenecks. Long-range dependencies are hard.
- Scalability. Full-graph training costs $O(|E|\,d + |V|\,d^2)$ per layer for feature dimension $d$; sampling (GraphSAGE) or graph clustering (Cluster-GCN) is needed for large graphs.
- Expressiveness. Standard GNNs cannot distinguish graphs that the Weisfeiler-Lehman test cannot distinguish. More expressive variants (k-GNN, subgraph GNN) are slower.
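To see over-smoothing directly, repeatedly apply the normalized averaging step with no weights and no nonlinearity and watch the node representations collapse toward one another; a small sketch on an illustrative path graph:

```python
import torch

# Toy graph: a 6-node path; row-normalized adjacency with self-loops (mean aggregation).
A = torch.zeros(6, 6)
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0
A_tilde = A + torch.eye(6)
A_hat = A_tilde / A_tilde.sum(dim=1, keepdim=True)

H = torch.randn(6, 8)
for layer in range(12):
    H = A_hat @ H                                  # aggregation only, repeated
    spread = torch.cdist(H, H).mean()              # average pairwise distance between nodes
    print(f"layer {layer + 1:2d}: mean pairwise distance = {spread.item():.4f}")
# The distance shrinks toward zero: all node representations converge (over-smoothing).
```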
Transformers as graphs
A standard transformer is a GNN on the complete graph with attention as the edge function. This is why graph transformers (Graphormer, GraphGPS) work: take a transformer, restrict the attention pattern to edges (or weight by graph distance), get a GNN with strong expressiveness.
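A sketch of that restriction (PyTorch, a single head of dot-product attention whose scores are masked to graph edges; all sizes and names are illustrative):

```python
import torch

num_nodes, d = 5, 16
H = torch.randn(num_nodes, d)
Wq = torch.randn(d, d)
Wk = torch.randn(d, d)
Wv = torch.randn(d, d)

# Adjacency with self-loops: 1 where attention is allowed, 0 elsewhere.
A = torch.tensor([[1, 1, 0, 0, 0],
                  [1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 0, 1, 1]], dtype=torch.float)

scores = (H @ Wq) @ (H @ Wk).T / d ** 0.5           # full transformer attention scores
scores = scores.masked_fill(A == 0, float("-inf"))  # restrict attention to graph edges
attn = torch.softmax(scores, dim=-1)                # each node attends only to its neighbors
H_next = attn @ (H @ Wv)
# With A all-ones this reduces to standard self-attention: the complete-graph case.
```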
Common pitfalls
- Treating GNNs as deep networks. They stay shallow (2 to 5 layers) for over-smoothing reasons.
- Forgetting self-loops. Without them, a node loses its own features after one aggregation step.
- Using sum without normalization on heterogeneous-degree graphs. High-degree nodes dominate; normalize by degree or use mean/attention (see the sketch after this list).
- Reporting accuracy without specifying the data split. Transductive vs inductive performance differ dramatically; many papers fudge this.
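A small demonstration of the degree problem noted above, on an illustrative star graph (one hub connected to many leaves):

```python
import torch

# Star graph: node 0 is a hub connected to 50 leaves.
n = 51
A = torch.zeros(n, n)
A[0, 1:] = A[1:, 0] = 1.0
A_tilde = A + torch.eye(n)                                  # self-loops

H = torch.randn(n, 8)

sum_agg = A_tilde @ H                                       # unnormalized sum aggregation
mean_agg = (A_tilde / A_tilde.sum(1, keepdim=True)) @ H     # degree-normalized (mean)

print(sum_agg[0].norm().item(), sum_agg[1].norm().item())   # hub embedding: much larger norm
print(mean_agg[0].norm().item(), mean_agg[1].norm().item()) # comparable scales across degrees
```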