Memorization vs generalization

The same over-parameterized network is trained on 20, 200, and 2 000 noisy points. With 25% label noise, a tiny dataset can simply be memorized, mislabeled points included; a large dataset forces the network to learn the underlying rule.

Controls: hidden size, epochs, label noise, learning rate

20 training points: more parameters than data

200 training points: capacity and data balanced

2 000 training points: data overwhelms the noise
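To make the three captions concrete, it can help to count parameters against data points. The sketch below assumes a one-hidden-layer fully connected network with 2 inputs, 1 output, and a hidden size of 64. None of these values comes from the page (hidden size is an adjustable control), and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(n_in, hidden, n_out=1):
    """Weights plus biases of a one-hidden-layer fully connected network."""
    return n_in * hidden + hidden + hidden * n_out + n_out


# Assumed settings, not taken from the page.
N_INPUTS = 2
HIDDEN = 64

if __name__ == "__main__":
    params = mlp_param_count(N_INPUTS, HIDDEN)
    for n_points in (20, 200, 2000):
        ratio = params / n_points
        print(f"{n_points:>5} points: {params} parameters, {ratio:.2f} per point")
```

Under these assumed settings the network has 257 parameters: far more than 20 points, roughly on par with 200, and far fewer than 2 000, which matches the ordering of the three captions. A different hidden size shifts where the balance point falls.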

Generalization gap (test loss − train loss) · smaller is better
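The gap can be measured with a minimal NumPy sketch like the one below. It is an illustration under stated assumptions, not the page's own implementation: the labeling rule (inside or outside a circle), the 2-D inputs, the tanh hidden layer, full-batch gradient descent, binary cross-entropy loss, a test set drawn with the same label noise, and every default value are all assumed.

```python
import numpy as np


def make_data(n, noise, rng):
    """Sample n 2-D points, label them with an assumed rule, then flip some labels."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.64).astype(float)  # assumed rule: inside a circle
    flip = rng.random(n) < noise  # each label is flipped with probability `noise`
    y[flip] = 1.0 - y[flip]
    return X, y


def forward(params, X):
    """Return hidden activations and predicted probabilities."""
    W1, b1, W2, b2 = params
    H = np.tanh(X @ W1 + b1)
    z = np.clip(H @ W2 + b2, -30.0, 30.0)  # clip logits to avoid overflow in exp
    return H, 1.0 / (1.0 + np.exp(-z))


def bce(p, y):
    """Mean binary cross-entropy."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def train(X, y, hidden=64, epochs=2000, lr=0.5, seed=0):
    """Full-batch gradient descent on a one-hidden-layer network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(2, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        H, p = forward((W1, b1, W2, b2), X)
        dz = (p - y) / n                        # gradient of mean BCE w.r.t. logits
        dH = np.outer(dz, W2) * (1.0 - H ** 2)  # backprop through tanh
        W2 -= lr * (H.T @ dz)
        b2 -= lr * dz.sum()
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2


def generalization_gap(n_train, noise=0.25, seed=0, **train_kwargs):
    """Return (train loss, test loss, test loss - train loss)."""
    rng = np.random.default_rng(seed)
    X_tr, y_tr = make_data(n_train, noise, rng)
    X_te, y_te = make_data(2000, noise, rng)  # assumed: test labels are equally noisy
    params = train(X_tr, y_tr, seed=seed, **train_kwargs)
    train_loss = bce(forward(params, X_tr)[1], y_tr)
    test_loss = bce(forward(params, X_te)[1], y_te)
    return train_loss, test_loss, test_loss - train_loss


if __name__ == "__main__":
    for n in (20, 200, 2000):
        tr, te, gap = generalization_gap(n)
        print(f"{n:>5} pts  train={tr:.3f}  test={te:.3f}  gap={gap:.3f}")
```

The exact numbers depend on the seed and the assumed settings, so none are quoted here. The pattern to look for is a low training loss paired with a large gap at 20 points, and a gap that narrows as the training set grows.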

