Memorization vs generalization

The same over-parameterized network is trained on 20, 200, and 2 000 noisy points. With 25% label noise, a tiny dataset can be memorized, noise included; a large dataset forces the network to learn the underlying rule.
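
A minimal sketch of how such a noisy dataset might be built, assuming a binary classification task in which each label is flipped independently with probability 0.25. The underlying rule (`true_label`, a hypothetical circle boundary), the function names, the sampling range, and the seed are all illustrative assumptions; the page does not specify the task.

```python
import random

def true_label(x, y):
    # Hypothetical underlying rule: points inside the unit circle are class 1.
    return 1 if x * x + y * y < 1.0 else 0

def make_noisy_dataset(n_points, noise_rate=0.25, seed=0):
    """Return (points, clean_labels, noisy_labels) for n_points samples."""
    rng = random.Random(seed)
    points = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(n_points)]
    clean = [true_label(x, y) for x, y in points]
    # Flip each label independently with probability noise_rate.
    noisy = [1 - c if rng.random() < noise_rate else c for c in clean]
    return points, clean, noisy

for n in (20, 200, 2000):
    _, clean, noisy = make_noisy_dataset(n)
    flipped = sum(c != m for c, m in zip(clean, noisy))
    print(f"{n} points: {flipped} labels flipped ({flipped / n:.0%})")
```

With only 20 points the realized fraction of flipped labels can sit well away from 25%, which is part of why a small noisy sample says little about the rule.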

20 training points: more parameters than data
200 training points: capacity and data balanced
2 000 training points: data overwhelms the noise
Chart: generalization gap (test loss − train loss) at 20, 200, and 2 000 training points; smaller is better.
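
The generalization gap is the difference between two average losses. A minimal sketch, assuming binary cross-entropy as the loss (the page does not name one); `mean_bce` and `generalization_gap` are hypothetical helper names, and the probabilities below are toy inputs for illustration, not results from the demo.

```python
import math

def mean_bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy of predicted P(class 1) against 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def generalization_gap(train_probs, train_labels, test_probs, test_labels):
    """Test loss minus train loss; a larger value means more overfitting."""
    return mean_bce(test_probs, test_labels) - mean_bce(train_probs, train_labels)

# Toy inputs (assumed): confident and correct on every training point,
# equally confident but wrong on half of the test points.
train_probs, train_labels = [0.99, 0.01, 0.99, 0.01], [1, 0, 1, 0]
test_probs, test_labels = [0.99, 0.01, 0.99, 0.01], [1, 0, 0, 1]

gap = generalization_gap(train_probs, train_labels, test_probs, test_labels)
print(f"generalization gap: {gap:.3f}")
```

A network that has memorized its noisy training labels looks like the toy case here: near-zero training loss and a much higher test loss, so the gap is large and positive.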