What if building a language model was like playing with LEGOs?
This is a character-level language model that learns to invent new baby names, built from scratch with Conal Elliott-style reverse-mode automatic differentiation. No PyTorch. No TensorFlow. Just TypeScript, Float32Arrays, and vibes.
It trains in ~10 seconds on all 32,033 names from Karpathy's makemore dataset and then dreams up new ones like "aylies", "marya", "laina", and "kari".
Imagine you're a kid trying to guess the next letter in a name:
"e", "m", "m" โ probably "a" (emma!)
"s", "o", "p" โ probably "h" (sophia!)
This program learns those patterns by:
- Looking at 32K real names (fetched live from GitHub!)
- Turning each letter into a secret code (an embedding)
- Gluing 3 codes together into one long vector (the concat)
- Passing them through a hidden brain layer with tanh squishing
- Scoring every possible next letter
- Getting told "nope, wrong!" (the loss)
- Tracing backward through the math to figure out how to do better (the backprop)
- Nudging all the knobs a little bit (the gradient descent)
- Repeating ~3,400 times in ten seconds
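The whole loop above can be sketched on a toy problem: minimize (w - 3)^2 in the same value-plus-backpropagator style. Names here are illustrative, not the actual API; the real model does this for 7,721 parameters at once.

```typescript
// A trainable knob with a gradient bucket.
class Param {
  constructor(public v: number, public grad = 0) {}
}

const w = new Param(0);

// Forward pass returns (value, backpropagator).
function lossOf(p: Param) {
  const diff = p.v - 3;
  return {
    v: diff * diff,                                   // the loss
    back: (g: number) => { p.grad += g * 2 * diff; }, // d(diff^2)/dp = 2*diff
  };
}

for (let step = 0; step < 100; step++) {
  w.grad = 0;              // clear old blame
  const loss = lossOf(w);  // forward
  loss.back(1.0);          // backprop: seed with sensitivity 1
  w.v -= 0.1 * w.grad;     // nudge the knob (gradient descent)
}
console.log(w.v.toFixed(3)); // 3.000
```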
Most ML frameworks treat autodiff as a graph-rewriting compiler pass. Conal Elliott's insight is far more elegant: the derivative is just a function. Every computation returns a pair:
(value, backpropagator)
The backpropagator is a first-class function that, given upstream sensitivity, pushes gradient to its inputs. Composition of programs gives composition of backpropagators: no tape, no graph, no magic.
Here, that shows up as the Node<T> class:
```typescript
class Node<T> {
  v: T;                  // the value we computed
  back: (g: T) => void;  // "hey inputs, here's how much you matter"
}
```
ELI5: A Node is like a kid who knows their answer AND knows who to blame if the answer is wrong.
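Here's a minimal scalar sketch of that blame game. The real code works over Float32Array tensors; `Leaf` and `mul` are illustrative helpers.

```typescript
class Node<T> {
  constructor(public v: T, public back: (g: T) => void) {}
}

// A leaf that accumulates blame into a gradient bucket.
class Leaf extends Node<number> {
  grad = 0;
  constructor(v: number) {
    super(v, () => {});
    this.back = (g) => { this.grad += g; };
  }
}

// multiply: the value is a*b; blame for a is g*b, blame for b is g*a
function mul(a: Node<number>, b: Node<number>): Node<number> {
  return new Node(a.v * b.v, (g) => {
    a.back(g * b.v);
    b.back(g * a.v);
  });
}

const x = new Leaf(2);
const y = new Leaf(5);
const z = mul(x, y);
z.back(1.0);                       // "OK everyone, trace back the blame!"
console.log(z.v, x.grad, y.grad);  // 10 5 2
```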
The model is a Bengio-style MLP (like Karpathy's makemore Part 2):
7,721 parameters total: small enough to fit in a tweet, powerful enough to dream up names.
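Those 7,721 parameters add up exactly if the layer sizes are vocab 27 ("a"-"z" plus a "." boundary), embedding 10, context 3, and hidden 128. These sizes are inferred from the total rather than read from the source, so treat them as an assumption:

```typescript
// Parameter count for a Bengio-style MLP with assumed sizes
// (vocab=27, emb=10, context=3, hidden=128).
const V = 27, Emb = 10, K = 3, H = 128;
const embedding = V * Emb;        // 270: one 10-dim code per character
const hidden = K * Emb * H + H;   // 3968: W1 (30x128) plus b1
const output = H * V + V;         // 3483: W2 (128x27) plus b2
console.log(embedding + hidden + output); // 7721
```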
Every primitive follows the same shape โ this is the whole trick:
No tape. No graph object. Just functions calling functions.
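For instance, two primitives written in that shape compose into tanh(2x), and their backpropagators compose into the chain rule with no extra machinery (scalar stand-ins for the tensor code):

```typescript
class Node<T> {
  constructor(public v: T, public back: (g: T) => void) {}
}

// tanhAct: squish the value; scale the blame by d tanh/dx = 1 - tanh^2
function tanhAct(x: Node<number>): Node<number> {
  const t = Math.tanh(x.v);
  return new Node(t, (g) => x.back(g * (1 - t * t)));
}

// scale by a constant: blame scales by the same constant
function scale(x: Node<number>, k: number): Node<number> {
  return new Node(x.v * k, (g) => x.back(g * k));
}

let grad = 0;
const x = new Node(0.5, (g) => { grad += g; });
const y = tanhAct(scale(x, 2));  // composing functions...
y.back(1.0);                     // ...composed the backpropagators too
console.log(grad);               // 2 * (1 - tanh(1)^2)
```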
| Concept | Code | ELI5 |
|---|---|---|
| Node<T> | class Node<T> | A value that knows who to blame |
| Param | class Param | A trainable knob with a gradient bucket |
| gatherAndConcat | E[ctx] → (B, K·Emb) | "Look up codes for 3 letters, glue into one vector" |
| linearLayer | X @ W^T + b | "Each neuron forms an opinion about the input" |
| tanhAct | tanh(x) | "Squish numbers to stay calm" |
| xentFromLogits | softmax + -log(p) | "How surprised were we by the right answer?" |
| loss.back(1.0) | reverse-mode AD | "OK everyone, trace back the blame!" |
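The last two rows are worth a sketch: softmax plus negative log-likelihood has the famously tidy gradient softmax(logits) - onehot(target). An illustrative scalar-array version; the real code batches this over Float32Arrays.

```typescript
// Sketch of xentFromLogits: softmax, then -log(p[target]), with the
// gradient shortcut d loss / d logits = softmax(logits) - onehot(target).
function xentFromLogits(logits: number[], target: number) {
  const m = Math.max(...logits);                // subtract max for stability
  const exps = logits.map((z) => Math.exp(z - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);       // softmax
  return {
    v: -Math.log(probs[target]),                // surprise at the right answer
    back: (g: number) =>
      probs.map((p, i) => g * (p - (i === target ? 1 : 0))),
  };
}

const { v, back } = xentFromLogits([2.0, 1.0, 0.1], 0);
console.log(v.toFixed(3));  // the loss (surprise)
console.log(back(1.0));     // blame pushed to the logits; sums to ~0
```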
Training on 32,033 names (90/10 train/test split) for 10 seconds:
| Metric | Value |
|---|---|
| Dataset | 32,033 names from names.txt |
| Parameters | 7,721 |
| Steps | ~3,400 |
| Train loss | ~2.33 |
| Test loss | ~2.33 |
| Time | 10 seconds |
Sample generated names: aylies, avurie, kari, marya, laina, dorie, alyni, elia
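Generation itself is just the forward pass in a loop: feed the context, sample the next letter from the softmax, slide the window, stop at the boundary token. A sketch with the trained model stubbed out by a hypothetical `nextCharProbs` (uniform here, just to keep it runnable):

```typescript
const ALPHABET = ".abcdefghijklmnopqrstuvwxyz";

// Stub standing in for the trained forward pass + softmax.
function nextCharProbs(_context: string): number[] {
  return Array(ALPHABET.length).fill(1 / ALPHABET.length);
}

function sampleName(contextSize = 3, maxLen = 20): string {
  let context = ".".repeat(contextSize);
  let name = "";
  while (name.length < maxLen) {
    const probs = nextCharProbs(context);
    let r = Math.random();
    let idx = 0;
    while (idx < probs.length - 1 && r > probs[idx]) r -= probs[idx++]; // weighted die
    const ch = ALPHABET[idx];
    if (ch === ".") break;             // "." ends the name
    name += ch;
    context = context.slice(1) + ch;   // slide the window
  }
  return name;
}

console.log(sampleName());
```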
```sh
# In Val Town: just hit Run on main.ts
# Locally:
npx tsx main.ts
```
- Conal Elliott, "The Simple Essence of Automatic Differentiation" (the paper that inspired this style)
- Andrej Karpathy, makemore (a similar char-level model in Python)
- Andrej Karpathy, microgpt (the beautiful 300-line GPT)
- Bengio et al. 2003, "A Neural Probabilistic Language Model" (the original MLP LM paper)
- 3Blue1Brown, Neural Networks series (visual intuition for backprop)
Built with zero dependencies. Just math, types, and the Conal Elliott conviction that derivatives are functions, not data structures.
