What if building a language model was like playing with LEGOs?
This is a tiny character-level language model that learns to invent new baby names, built from scratch with Conal Elliott-style reverse-mode automatic differentiation. No PyTorch. No TensorFlow. Just TypeScript, Float32Arrays, and vibes.
It trains in ~1 second on 80 names and then dreams up new ones.
Imagine you're a kid trying to guess the next letter in a name:
"e", "m", "m" โ probably "a" (emma!)
"s", "o", "p" โ probably "h" (sophia!)
This program learns those patterns by:
- Looking at lots of real names
- Turning each letter into a secret code (an embedding)
- Mixing the codes together to guess the next letter
- Getting told "nope, wrong!" (the loss)
- Tracing backward through the math to figure out how to do better (the backprop)
- Nudging all the knobs a little bit (the gradient descent)
- Repeating thousands of times in one second
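The nudge-the-knobs step can be sketched on its own. Here's a minimal, hypothetical example (none of these names come from main.ts) that trains a single knob `w` by gradient descent to minimize `(w - 3)^2`, with the derivative written out by hand:

```typescript
// Illustrative sketch of gradient descent on one knob; not from main.ts.
let w = 0;        // one trainable knob, starting at a bad guess
const lr = 0.1;   // learning rate: how big each nudge is

for (let step = 0; step < 100; step++) {
  const loss = (w - 3) ** 2;  // "nope, wrong!" — how far off we are
  const grad = 2 * (w - 3);   // what backprop would hand us: d(loss)/dw
  w -= lr * grad;             // nudge the knob downhill
}

console.log(w.toFixed(3)); // prints "3.000" — the knob found the answer
```

The real model does exactly this, just for thousands of knobs at once, with the gradients delivered by the backpropagators instead of by hand.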
Most ML frameworks treat autodiff as a graph-rewriting compiler pass. Conal Elliott's insight is far more elegant: the derivative is just a function. Every computation returns a pair:
(value, backpropagator)
The backpropagator is a first-class function that, given the upstream sensitivity, pushes gradient to its inputs. Composition of programs gives composition of backpropagators: no tape, no graph, no magic.
Here, that shows up as the Node<T> class:
```ts
class Node<T> {
  v: T;                      // the value we computed
  back: (g: number) => void; // "hey inputs, here's how much you matter"
}
```
ELI5: A Node is like a kid who knows their answer AND knows who to blame if the answer is wrong.
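As an illustrative sketch (this `Node`, `Param`, and `mul` are re-created for this README, not lifted from main.ts), a primitive computes its value eagerly and its `back` closes over the inputs:

```typescript
// Sketch of the Node pattern; illustrative, not the exact main.ts code.
class Node<T> {
  constructor(
    public v: T,                      // the value we computed
    public back: (g: number) => void, // push gradient to our inputs
  ) {}
}

// A trainable scalar: its back() just accumulates into a gradient bucket.
class Param {
  grad = 0;
  constructor(public v: number) {}
  node(): Node<number> {
    return new Node(this.v, (g) => { this.grad += g; });
  }
}

// One primitive: multiply. Its backpropagator applies the product rule.
function mul(a: Node<number>, b: Node<number>): Node<number> {
  return new Node(a.v * b.v, (g) => {
    a.back(g * b.v); // d(ab)/da = b
    b.back(g * a.v); // d(ab)/db = a
  });
}

const x = new Param(3);
const y = new Param(4);
const z = mul(x.node(), y.node()); // z.v === 12
z.back(1.0);                       // x.grad === 4, y.grad === 3
```

Calling `z.back(1.0)` is the whole "trace back the blame" step: each knob ends up knowing exactly how much it mattered.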
The model is a bag-of-embeddings predictor (think: the simplest possible LLM).
Every primitive follows the same shape; this is the whole trick.
No tape. No graph object. Just functions calling functions.
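To make "functions calling functions" concrete, here's a hedged sketch (these helpers are illustrative, not the main.ts definitions) of two primitives composing, where calling `back` on the output runs the chain rule in reverse with no bookkeeping at all:

```typescript
// Illustrative: primitives in the (value, backpropagator) style, composed
// directly. No tape, no graph object — composition does all the work.
type N = { v: number; back: (g: number) => void };

// A leaf writes its gradient into a shared bucket when blamed.
const leaf = (v: number, grads: Record<string, number>, name: string): N => ({
  v,
  back: (g) => { grads[name] = (grads[name] ?? 0) + g; },
});

const add = (a: N, b: N): N => ({
  v: a.v + b.v,
  back: (g) => { a.back(g); b.back(g); }, // d(a+b)/da = d(a+b)/db = 1
});

const mul = (a: N, b: N): N => ({
  v: a.v * b.v,
  back: (g) => { a.back(g * b.v); b.back(g * a.v); }, // product rule
});

// y = a*b + c — calling y.back(1) unwinds the composition in reverse.
const grads: Record<string, number> = {};
const a = leaf(2, grads, "a");
const b = leaf(5, grads, "b");
const c = leaf(7, grads, "c");
const y = add(mul(a, b), c);
y.back(1.0);
console.log(y.v, grads); // 17 { a: 5, b: 2, c: 1 }
```

Nesting `add(mul(a, b), c)` automatically nests the backpropagators the same way: the structure of the program *is* the structure of the reverse pass.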
| Concept | Code | ELI5 |
|---|---|---|
| Node<T> | class Node<T> | A value that knows who to blame |
| Param | class Param | A trainable knob with a gradient bucket |
| gatherRows | E[ctx] → (B,K,D) | "Look up the secret codes for these letters" |
| sumOverK | Σ_k emb → (B,D) | "Mix the codes together" |
| matmul_W_h | h @ Wᵀ → (B,V) | "Score every possible next letter" |
| xentFromLogits | softmax + -log(p) | "How surprised were we by the right answer?" |
| loss.back(1.0) | reverse-mode AD | "OK everyone, trace back the blame!" |
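The last two rows can be sketched together. This standalone function is illustrative (the real `xentFromLogits` works on batches and returns a `Node`); it computes softmax cross-entropy for one example and its textbook gradient, `p - onehot(y)`:

```typescript
// Illustrative softmax cross-entropy for a single example; not from main.ts.
// Returns the loss and the gradient with respect to the logits.
function xent(
  logits: Float32Array,
  y: number, // index of the correct next letter
): { loss: number; grad: Float32Array } {
  const m = Math.max(...logits); // subtract the max for numerical stability
  const exps = Float32Array.from(logits, (z) => Math.exp(z - m));
  const Z = exps.reduce((s, e) => s + e, 0);
  const p = Float32Array.from(exps, (e) => e / Z); // softmax probabilities
  const loss = -Math.log(p[y]); // "how surprised were we by the right answer?"
  const grad = Float32Array.from(p);
  grad[y] -= 1; // d(loss)/d(logits) = p - onehot(y)
  return { loss, grad };
}
```

With uniform logits over three letters, the loss is `ln 3 ≈ 1.0986` and the gradient pushes the correct letter's score up and the others down, which is exactly the signal `loss.back(1.0)` feeds backward through the rest of the model.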
```sh
# In Val Town: just hit Run on main.ts
# Locally:
npx tsx main.ts
```
Output (something like):
```
Elapsed: 1.000 seconds
Steps: 4200
Final loss: 2.1234
Samples:
1 : arielle
2 : mavia
3 : elina
4 : sova
5 : nalia
...
```
- Conal Elliott, *The Simple Essence of Automatic Differentiation*: the paper that inspired this style
- Andrej Karpathy, *makemore*: a similar char-level model in Python
- 3Blue1Brown, *Neural Networks*: visual intuition for backprop
Built with zero dependencies. Just math, types, and the Conal Elliott conviction that derivatives are functions, not data structures.
