Why activations matter

Two networks are trained on the same noisy rings. The only difference between them is one hidden layer with ReLU.

Controls: hidden size, steps, noise, learning rate.

Linear only: can only draw a straight line.

1 hidden layer + ReLU: can bend around the ring.

Each panel reports its accuracy.
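The comparison can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the demo's actual code: the names `make_rings`, `forward`, and `train`, the ring radii, the weight initialization, and all default values are hypothetical. The parameters mirror the four controls (hidden size, steps, noise, learning rate). With `use_relu=False` the hidden layer stays in place but collapses into a single linear map, so that model can still only draw a straight line.

```python
import numpy as np


def make_rings(n_per_class=200, noise=0.1, seed=0):
    """Two concentric noisy rings: inner ring is class 0, outer ring is class 1."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    # Assumed radii: 1.0 for the inner ring, 2.0 for the outer ring.
    radii = np.repeat([1.0, 2.0], n_per_class) + rng.normal(0.0, noise, size=n)
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    y = np.repeat([0.0, 1.0], n_per_class)
    return X, y


def forward(X, W1, b1, W2, b2, use_relu):
    """Return hidden pre-activations, hidden activations, and P(class 1)."""
    z = X @ W1 + b1
    h = np.maximum(z, 0.0) if use_relu else z  # the only difference between the two models
    logits = np.clip(h @ W2 + b2, -30.0, 30.0)  # clip to keep exp() well behaved
    return z, h, 1.0 / (1.0 + np.exp(-logits))


def train(X, y, hidden_size=16, steps=5000, lr=0.1, use_relu=True, seed=0):
    """Full-batch gradient descent on mean binary cross-entropy; returns training accuracy."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W1 = rng.normal(0.0, 1.0, size=(2, hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_size), size=hidden_size)
    b2 = 0.0

    for _ in range(steps):
        z, h, p = forward(X, W1, b1, W2, b2, use_relu)
        d_logits = (p - y) / n              # gradient of the mean loss w.r.t. the logits
        dW2 = h.T @ d_logits
        db2 = d_logits.sum()
        dh = np.outer(d_logits, W2)
        dz = dh * (z > 0) if use_relu else dh  # ReLU passes gradient only where z > 0
        dW1 = X.T @ dz
        db1 = dz.sum(axis=0)

        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    _, _, p = forward(X, W1, b1, W2, b2, use_relu)
    return float(np.mean((p > 0.5) == (y > 0.5)))


X, y = make_rings(noise=0.1)
linear_acc = train(X, y, use_relu=False)
relu_acc = train(X, y, use_relu=True)
print(f"linear only:           accuracy {linear_acc:.2f}")
print(f"1 hidden layer + ReLU: accuracy {relu_acc:.2f}")
```

Because no straight line separates an inner ring from an outer one, the linear variant is expected to stay well below the ReLU variant. The exact numbers depend on the seed and on the settings chosen.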
