Why activations matter
Two networks train on the same noisy rings. The only difference: one has a single hidden layer with ReLU activations.
Controls: hidden size, steps, noise, learning rate
Linear only: can only draw a straight line
1 hidden layer + ReLU: can bend around the ring
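The contrast between the two panels can be checked without any training. The sketch below is a minimal illustration under stated assumptions, not the demo's actual code: the ring radii (1 and 3), the bounded radial noise, the brute-force line search, and the hand-picked ReLU weights are all hypothetical choices made for this example. It scores the best straight-line boundary it can find, then scores a network with one hidden layer of four ReLU units whose weights are set by hand.

```python
import math
import random


def make_rings(n_per_ring=200, noise=0.2, seed=0):
    """Two concentric rings: label 0 near radius 1, label 1 near radius 3.

    Radii and the bounded radial noise are assumptions for this sketch.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, radius in ((0, 1.0), (1, 3.0)):
        for _ in range(n_per_ring):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.uniform(-noise, noise)
            points.append((r * math.cos(angle), r * math.sin(angle)))
            labels.append(label)
    return points, labels


def accuracy(predict, points, labels):
    """Fraction of points whose predicted label matches the true label."""
    hits = sum(predict(x, y) == t for (x, y), t in zip(points, labels))
    return hits / len(labels)


def best_linear_accuracy(points, labels, n_angles=36, n_offsets=41):
    """Brute-force search over straight-line boundaries wx*x + wy*y + b > 0."""
    best = 0.0
    for i in range(n_angles):
        theta = 2.0 * math.pi * i / n_angles
        wx, wy = math.cos(theta), math.sin(theta)
        for j in range(n_offsets):
            b = -4.0 + 8.0 * j / (n_offsets - 1)
            acc = accuracy(
                lambda x, y: int(wx * x + wy * y + b > 0), points, labels
            )
            best = max(best, acc)
    return best


def relu(z):
    return max(0.0, z)


def relu_net(x, y, threshold=2.2):
    """One hidden layer of 4 ReLU units, then a linear readout.

    The hidden units are relu(x), relu(-x), relu(y), relu(-y). Their sum
    equals |x| + |y|, so the decision boundary is a diamond around the origin.
    Weights and threshold are hand-picked, not learned.
    """
    hidden = (relu(x), relu(-x), relu(y), relu(-y))
    score = sum(hidden) - threshold  # output weights all 1, bias -threshold
    return int(score > 0)


points, labels = make_rings()
linear_acc = best_linear_accuracy(points, labels)
relu_acc = accuracy(relu_net, points, labels)
print(f"best straight line: {linear_acc:.2f}")
print(f"4 ReLU units:       {relu_acc:.2f}")
```

Under these assumptions the four ReLU units compute |x| + |y|, so the boundary is a diamond that sits between the rings and classifies every point correctly. No straight line can do this: for ideal rings at radii 1 and 3, geometry caps a single line at roughly 70% accuracy. A trained network would find its own weights; the hand-picked ones only show that the hidden layer makes such a boundary representable.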