Embeddings learn similarity from next-token prediction

A tiny model predicts the next word in a made-up grammar. No one tells it "cat and dog are similar", yet their embeddings cluster together.

Loss: · Accuracy:
Word map · click a word to inspect it
Embedding fingerprint
Click a word on the map.
Why they cluster — shared next-token distributions
cat ↔ dog
cat
dog

Tokens that predict the same next tokens get pulled to the same place in vector space.

Warming up…