Most neural-network explanations start with math. That’s the honest way to do it, but the ideas stick better once you’ve actually broken something. Each section below is a live demo: click Run, watch it train, change a control, run it again.

  1. Activations exist for a reason
  2. Depth without nonlinearity is a lie
  3. Embeddings learn similarity from next-token alone
  4. Memorization vs generalization

1. Activations exist for a reason

A linear model and a ReLU model, side by side, trying to separate a red ring from a blue one. The linear model can’t draw a curved boundary no matter how long it trains. Without a nonlinear activation, a stack of layers collapses to one matrix multiply, and one matrix multiply can only draw a straight boundary. The curve you need is impossible to express.
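
If you want to poke at the same idea outside the browser, here is a minimal sketch. It is not the demo's code. It assumes the two rings are concentric circles (radii 1 and 3 are my choice), it sweeps a grid of straight lines instead of training a linear model, and the ReLU network uses four hidden units with weights set by hand rather than learned.

```python
import math

def ring(radius, n=40):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Assumption: label 0 is the inner ring, label 1 is the outer ring.
data = [(p, 0) for p in ring(1.0)] + [(p, 1) for p in ring(3.0)]

def accuracy(predict):
    return sum(predict(p) == y for p, y in data) / len(data)

def linear(w1, w2, b):
    """A linear classifier: one side of a straight line is class 1."""
    return lambda p: int(w1 * p[0] + w2 * p[1] + b > 0)

def relu(z):
    return max(0.0, z)

def tiny_relu_net(p):
    """Four hidden ReLU units with hand-set weights.

    relu(x) + relu(-x) equals |x|, so the output is |x| + |y| - 2,
    which is positive on the outer ring and negative on the inner one.
    """
    hidden = [relu(p[0]), relu(-p[0]), relu(p[1]), relu(-p[1])]
    return int(sum(hidden) - 2.0 > 0)

# Sweep straight lines over 72 directions and 81 offsets, keep the best.
best_linear = max(
    accuracy(linear(math.cos(a), math.sin(a), b))
    for a in [2 * math.pi * i / 72 for i in range(72)]
    for b in [j / 10 for j in range(-40, 41)]
)
relu_accuracy = accuracy(tiny_relu_net)

print(f"best straight line:  {best_linear:.2f}")
print(f"hand-built ReLU net: {relu_accuracy:.2f}")
```

The sweep can never reach a perfect score: every point on a ring has a mirror image on the opposite side, so any line that puts the whole outer ring on one side drags part of the inner ring along with it.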


2. Depth without nonlinearity is a lie

Three networks on the same rings: 1 linear layer, 5 linear layers, 5 layers with ReLU. The two linear models track each other almost perfectly, because W₅·W₄·W₃·W₂·W₁ is just one matrix. The Proof panel computes the product so you don’t have to take my word for it.

Adding depth only helps if something nonlinear happens between layers.
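
The collapse is easy to check by hand as well. The sketch below is not the Proof panel's code. It assumes five bias-free layers with random weights and arbitrary layer sizes, multiplies them into a single matrix, and compares the two outputs on one input. The `use_relu` flag is a hypothetical switch added here to show what changes once a nonlinearity sits between the layers.

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Assumed shapes: 2 inputs, four hidden layers of 8 units, 1 output.
sizes = [2, 8, 8, 8, 8, 1]
weights = [random_matrix(sizes[i + 1], sizes[i]) for i in range(5)]  # W1..W5

def forward(x, use_relu):
    """Run x through all five layers, optionally with ReLU between them."""
    h = [[v] for v in x]  # column vector
    for i, w in enumerate(weights):
        h = matmul(w, h)
        if use_relu and i < len(weights) - 1:
            h = [[max(0.0, row[0])] for row in h]
    return h[0][0]

# Collapse the stack: W = W5 · W4 · W3 · W2 · W1, a single 1x2 matrix.
collapsed = weights[0]
for w in weights[1:]:
    collapsed = matmul(w, collapsed)

x = [0.5, -1.5]
deep_linear = forward(x, use_relu=False)
one_matrix = matmul(collapsed, [[v] for v in x])[0][0]
deep_relu = forward(x, use_relu=True)

print("five linear layers:   ", deep_linear)
print("one collapsed matrix: ", one_matrix)
print("difference:           ", abs(deep_linear - one_matrix))
print("five layers with ReLU:", deep_relu)
```

The first two numbers agree up to floating-point rounding. With ReLU switched on there is no single matrix to collapse to, because which units are zeroed out depends on the input.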


3. Embeddings learn similarity from next-token alone

A small model trained to predict the next word in a toy language. No one labeled which words are similar. The model only sees sequences. After training, click any word on the map to see its nearest neighbors.

Words that appear in the same contexts end up close together. The similarity is a side effect of the prediction task, not something anyone designed.
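
A rough way to see why this happens, without training anything: the sketch below swaps the learned embeddings for plain next-word counts. It is a stand-in and not the demo's model. The six-sentence corpus is made up, and each word's "vector" is just a tally of which words follow it. Two words that are followed by the same things get the same vector, which is the pressure a next-token model feels too.

```python
import math
from collections import Counter, defaultdict

# A made-up toy corpus; "." marks the end of a sentence.
corpus = [
    "the cat eats fish .",
    "the dog eats meat .",
    "the cat drinks milk .",
    "the dog drinks water .",
    "a cat sleeps .",
    "a dog sleeps .",
]

# For every word, count which words come right after it.
next_counts = defaultdict(Counter)
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for current, following in zip(words, words[1:]):
        next_counts[current][following] += 1

vocab = sorted(vocab)
vectors = {word: next_counts[word] for word in vocab}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def nearest(word, k=3):
    """The k words whose next-word counts look most like this word's."""
    scores = [(other, cosine(vectors[word], vectors[other]))
              for other in vocab if other != word]
    return sorted(scores, key=lambda pair: -pair[1])[:k]

for neighbor, score in nearest("cat"):
    print(f"cat ~ {neighbor}: {score:.2f}")
```

Nobody told the script that "cat" and "dog" belong together. They end up as nearest neighbors only because both are followed by "eats", "drinks", and "sleeps".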


4. Memorization vs generalization

The same network, three times, on 20, 200, and 2,000 points. Some labels are wrong on purpose. The 20-point model just memorizes its training set; the 2,000-point model can’t, so it has to generalize.

The train/test gap shrinks as the training set grows. More data is usually the fix for overfitting, not a better model.
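
The same effect shows up in a much smaller setting. The sketch below is an illustration under my own assumptions, not the demo's network: the "model" is a fixed-size lookup table of 50 bins that stores the majority label per bin, the task is a one-dimensional threshold at 0.5, and 15% of labels are flipped in both the training and test sets. The capacity stays fixed while the training set grows from 20 to 2,000 points.

```python
import random

random.seed(0)
BINS = 50     # fixed model capacity: one stored label per bin
NOISE = 0.15  # fraction of labels flipped on purpose

def sample(n):
    """n points on [0, 1); the true label is 1 when x > 0.5, some are flipped."""
    points = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < NOISE:
            y = 1 - y
        points.append((x, y))
    return points

def fit(train):
    """Store the majority label of each bin; empty or tied bins predict 0."""
    ones, total = [0] * BINS, [0] * BINS
    for x, y in train:
        b = int(x * BINS)
        ones[b] += y
        total[b] += 1
    return [int(2 * ones[b] > total[b]) for b in range(BINS)]

def accuracy(model, points):
    return sum(model[int(x * BINS)] == y for x, y in points) / len(points)

test_set = sample(2000)
results = {}
for n in (20, 200, 2000):
    train_set = sample(n)
    model = fit(train_set)
    train_acc = accuracy(model, train_set)
    test_acc = accuracy(model, test_set)
    results[n] = (train_acc, test_acc)
    print(f"n={n:5d}  train={train_acc:.2f}  test={test_acc:.2f}  "
          f"gap={train_acc - test_acc:+.2f}")
```

With 20 points most bins hold a single example, so the table simply stores whatever label it saw, flipped or not. With 2,000 points each bin holds dozens of examples and the flipped labels get outvoted, so the table can no longer fit the noise and its training accuracy drops toward its test accuracy.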


The source (Python included) is on GitHub.