Memorization vs generalization
The same over-parameterized network trained on 20, 200, and 2 000 noisy points. With 25% label noise, a tiny dataset can be memorized; a large dataset forces the network to learn the rule.
20 training points
more parameters than data
200 training points
capacity and data balanced
2 000 training points
data overwhelms the noise
Generalization gap (test loss − train loss) · smaller is better
Warming up…