Depth without nonlinearity is one matrix

Three networks on the same rings. Linear depth adds nothing; ReLU depth does.

[Figure: three panels, each reporting accuracy. (1) 1 linear layer: draws one straight line. (2) 5 linear layers: still just a straight line. (3) 5 layers + ReLU: a kink between each layer.]
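To see why a kink matters, here is a minimal sketch with a tiny hand-built network. It is not the 5-layer model from the panels: the weights `W1` and `W2` and the points `a` and `b` are hypothetical, chosen only for illustration. It relies on one property of affine maps: the value at the midpoint of two points always equals the average of the values at those points.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked, hypothetical weights: the hidden layer computes (x1, -x1),
# and the output layer sums the two hidden units.
W1 = np.array([[1.0, 0.0], [-1.0, 0.0]])
W2 = np.array([[1.0, 1.0]])

def linear_net(x):
    # Two linear layers with no nonlinearity in between.
    return W2 @ (W1 @ x)

def relu_net(x):
    # Same weights, with a ReLU between the layers (this computes |x1|).
    return W2 @ relu(W1 @ x)

a = np.array([-1.0, 0.0])
b = np.array([1.0, 0.0])
mid = (a + b) / 2

# Any affine map satisfies f(mid) == (f(a) + f(b)) / 2, so a nonzero gap
# means the function cannot be written as a single matrix plus a bias.
linear_gap = abs(linear_net(mid) - (linear_net(a) + linear_net(b)) / 2).item()
relu_gap = abs(relu_net(mid) - (relu_net(a) + relu_net(b)) / 2).item()
print(linear_gap, relu_gap)
```

With these particular weights the linear stack has a gap of 0 and the ReLU version has a gap of 1, so the ReLU network is not an affine map.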
Proof: W₅·W₄·W₃·W₂·W₁ = one 2→1 matrix
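Multiplying the five weight matrices, and carrying each layer's bias through the layers after it, gives a single affine map. Running every point through that map reproduces the 5-layer model's outputs up to floating-point rounding, because the two are the same function.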
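The collapse can be sketched in NumPy as follows. This is an illustration under assumptions, not the page's actual code: the hidden width of 8, the random weights, and the names `deep_linear` and `collapse` are all hypothetical. Only the 2→1 overall shape and the five-layer depth come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer widths: 2 inputs, four hidden layers of 8 units, 1 logit.
sizes = [2, 8, 8, 8, 8, 1]
layers = [
    (rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def deep_linear(x):
    """Run points (rows of x) through all five layers, with no nonlinearity."""
    h = x
    for W_k, b_k in layers:
        h = h @ W_k.T + b_k
    return h

def collapse(layers):
    """Fold the stack into one affine map: W <- W_k W, b <- W_k b + b_k."""
    n_in = layers[0][0].shape[1]
    W = np.eye(n_in)
    b = np.zeros(n_in)
    for W_k, b_k in layers:
        W = W_k @ W
        b = W_k @ b + b_k
    return W, b

W, b = collapse(layers)  # W has shape (1, 2), b has shape (1,)
x = rng.normal(size=(1000, 2))
max_diff = np.max(np.abs(deep_linear(x) - (x @ W.T + b)))
print("collapsed weights:", W.ravel(), "bias:", b)
print("max |logit difference|:", max_diff)
```

The reported difference should sit at the level of floating-point rounding. It is not guaranteed to be exactly zero, since the two computations perform the same arithmetic in a different order.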
Readouts: collapsed weights (w₁, w₂, bias); max |logit difference| between the 5-layer model and the collapsed map.