What if building a language model was like playing with LEGOs?
This is a character-level language model that learns to invent new baby names — built from scratch with Conal Elliott-style reverse-mode automatic differentiation. No PyTorch. No TensorFlow. Just TypeScript, Float32Arrays, and vibes.
It trains in ~10 seconds on all 32,033 names from Karpathy's makemore dataset and then dreams up new ones like "aylies", "marya", "laina", and "kari".
Pick your depth:
| Time | What you'll get |
|---|---|
| ⏱️ 30 seconds | Read the ELI5 section — you'll know what this does |
| ⏱️ 5 minutes | Add the Architecture section — you'll know how it works |
| ⏱️ 20 minutes | Read "One Training Step, Traced" — you'll feel the math with real numbers |
| ⏱️ 1 hour | Read main.ts guided by the inline comments — you'll understand every line |
| ⏱️ Beyond | Read "From Names to Sentences" — you'll see how this becomes ChatGPT |
Imagine you're a kid trying to guess the next letter in a name:
"e", "m", "m" → probably "a" (emma!)
"s", "o", "p" → probably "h" (sophia!)
This program learns those patterns by:
- 📖 Looking at 32K real names (fetched live from GitHub!)
- 🔢 Turning each letter into a secret code (an embedding)
- 📏 Gluing 5 codes together into one long vector (the concat)
- 🧠 Passing through a hidden brain layer with tanh squishing
- 🧮 Scoring every possible next letter
- ❌ Getting told "nope, wrong!" (the loss)
- 🔙 Tracing backward through the math to figure out how to do better (the backprop)
- 🔧 Nudging all the knobs a little bit (the gradient descent)
- 🔁 Repeating ~3,400 times in ten seconds
Most ML frameworks treat autodiff as a graph-rewriting compiler pass. Conal Elliott's insight is far more elegant: the derivative is just a function. Every computation returns a pair:
(value, backpropagator)
The backpropagator is a first-class function that, given upstream sensitivity, pushes gradient to its inputs. Composition of programs gives composition of backpropagators — no tape, no graph, no magic.
Here, that shows up as the Node<T> class:
class Node<T> {
v: T; // the value we computed
back: (g: T) => void; // "hey inputs, here's how much you matter"
}
ELI5: A Node is like a kid who knows their answer AND knows who to blame if the answer is wrong.
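To make that concrete, here is a scalar sketch of the pattern (illustrative names, not the project's exact API — the real code works on Float32Arrays): a primitive computes its value forward and returns a backpropagator closure that deposits blame into gradient buckets.

```typescript
// Scalar sketch of the Node pattern: value + backpropagator, nothing else.
class ScalarNode {
  constructor(public v: number, public back: (g: number) => void) {}
}

// gradient "buckets" for two leaf inputs
const grad = { x: 0, y: 0 };

function mul(x: number, y: number): ScalarNode {
  // d(x·y)/dx = y and d(x·y)/dy = x — the closure remembers both inputs
  return new ScalarNode(x * y, (g) => {
    grad.x += g * y;
    grad.y += g * x;
  });
}

const z = mul(3, 4);
z.back(1.0); // seed the output sensitivity with 1
// z.v === 12, grad.x === 4, grad.y === 3
```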
The model is a Bengio-style MLP (like Karpathy's makemore Part 2): five context characters are looked up in a 27 × 10 embedding table, the five 10-dim codes are concatenated into one 50-vector, passed through a 128-neuron tanh hidden layer, and projected to 27 logits — one score per possible next character.
7,721 parameters total — small enough to fit in a tweet, powerful enough to dream up names.
Every primitive follows the same shape — this is the whole trick: compute the value on the way forward, and return a back(g) closure that, given the upstream gradient, pushes blame to the inputs on the way backward.
No tape. No graph object. Just functions calling functions.
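"Functions calling functions" looks like this in miniature — a hedged scalar sketch (illustrative names) of z = tanh(w·x), where calling z.back composes the two backpropagators with no tape anywhere:

```typescript
// Leaf values with gradient buckets
const grads = { w: 0, x: 0 };
const w = { v: 0.5, back: (g: number) => { grads.w += g; } };
const x = { v: 2.0, back: (g: number) => { grads.x += g; } };

// y = w·x — its backpropagator calls the inputs' backpropagators
const y = {
  v: w.v * x.v,
  back: (g: number) => { w.back(g * x.v); x.back(g * w.v); },
};

// z = tanh(y) — multiply upstream gradient by 1 − tanh²(y), then recurse
const t = Math.tanh(y.v);
const z = { v: t, back: (g: number) => y.back(g * (1 - t * t)) };

z.back(1.0);
// grads.w === (1 − tanh²(1.0)) · 2.0 and grads.x === (1 − tanh²(1.0)) · 0.5
```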
| Concept | Code | ELI5 |
|---|---|---|
| Node<T> | class Node<T> | A value that knows who to blame |
| Param | class Param | A trainable knob with a gradient bucket |
| gatherAndConcat | E[ctx] → (B, K·Emb) | "Look up codes for 5 letters, glue into one vector" |
| linearLayer | X @ W^T + b | "Each neuron forms an opinion about the input" |
| tanhAct | tanh(x) | "Squish numbers to stay calm" |
| xentFromLogits | softmax + -log(p) | "How surprised were we by the right answer?" |
| loss.back(1.0) | reverse-mode AD | "OK everyone, trace back the blame!" |
Abstract math gets real when you follow one example through. Here's what actually happens inside a single training step. Every number below is real — this is what the code computes.
The model sees the context ".", "e", "m", "m", "a" and must predict the next character. (The name is "emma", and "." is the BOS/start token.) The correct answer is "." — the name is over.
Each of the 5 context characters looks up its row in the embedding table E (27 × 10). The character "." (id=0) pulls out 10 numbers, "e" (id=5) pulls out its 10 numbers, and so on. Then we glue all 5 vectors together in context order:
embed = [ .'s 10 floats | e's 10 floats | m's 10 floats | m's 10 floats | a's 10 floats ]
= 50 numbers total
Note: "m" appears twice, so the same 10 numbers appear in two places. During backprop, the gradients from both positions add up — that's the += in gatherAndConcat's backward pass.
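A hedged miniature of that += (tiny illustrative sizes, not the project's exact gatherAndConcat): forward gathers rows into one flat vector; backward scatter-adds slices of the upstream gradient back, so a character used at two positions receives gradient twice.

```typescript
const Emb = 2;                                      // tiny embedding size for the demo
const E = Float32Array.from([0.1, 0.2, 0.3, 0.4]);  // 2 chars × 2 dims
const dE = new Float32Array(E.length);              // gradient buckets

const ctx = [1, 1, 0];                              // char 1 appears twice
const out = new Float32Array(ctx.length * Emb);
ctx.forEach((id, k) => out.set(E.subarray(id * Emb, id * Emb + Emb), k * Emb));

// backward: scatter-add each slice of g into its source row (the +=)
function back(g: Float32Array) {
  ctx.forEach((id, k) => {
    for (let j = 0; j < Emb; j++) dE[id * Emb + j] += g[k * Emb + j];
  });
}

back(Float32Array.from([1, 1, 1, 1, 1, 1]));
// row 1 received gradient from two positions: dE = [1, 1, 2, 2]
```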
The 50-number embedding hits the hidden layer: h = tanh(W₁ · embed + b₁). Each of the 128 hidden neurons computes a weighted sum of all 50 inputs, adds its bias, then gets squished through tanh into the range (-1, +1):
pre-tanh neuron 0: 0.02×embed[0] + (-0.01)×embed[1] + ... + bias[0] = 1.47
after tanh: tanh(1.47) = 0.90
pre-tanh neuron 1: ... = -0.83
after tanh: tanh(-0.83) = -0.68
... (128 neurons total)
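The per-neuron arithmetic above fits in a few lines. A sketch (the weights and inputs here are made-up; only the traced pre-activations 1.47 and −0.83 come from the walkthrough):

```typescript
// One hidden neuron: weighted sum of its inputs plus a bias, then tanh
function neuron(weights: number[], bias: number, input: number[]): number {
  const pre = weights.reduce((sum, w, i) => sum + w * input[i], bias);
  return Math.tanh(pre); // squish into (−1, +1)
}

Math.tanh(1.47);  // ≈ 0.90, as in the trace
Math.tanh(-0.83); // ≈ −0.68
```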
The 128 hidden activations hit the output layer: logits = W₂ · h + b₂. This produces 27 raw scores, one per character:
logit["."] = 1.2, logit["a"] = 2.1, logit["b"] = -0.5, ... logit["m"] = 1.8, ...
These are just scores — they can be any number, positive or negative.
Softmax turns scores into probabilities (all positive, sum to 1):
p["."] = exp(1.2) / Σexp(all) = 0.07 ← the model gives 7% to the right answer
p["a"] = exp(2.1) / Σexp(all) = 0.18
p["m"] = exp(1.8) / Σexp(all) = 0.14
... (all 27 sum to 1.0)
The target is "." (id=0). The loss is:
loss = -log(p["."]) = -log(0.07) = 2.66
Interpretation: If the model had put 100% on ".", loss = -log(1) = 0 (not surprised at all). If it had put 0.01% on ".", loss = -log(0.0001) = 9.2 (very surprised). Our 7% gives a loss of 2.66 — not great, not terrible.
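Both formulas can be checked in a few lines. This sketch keeps only a 4-character subset of the traced logits (".", "a", "b", "m"), so the probabilities come out higher than the 27-class values above — with fewer competitors, "." gets more mass:

```typescript
// Numerically stable softmax: subtract the max before exponentiating
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / Z);
}

const logits = [1.2, 2.1, -0.5, 1.8]; // ".", "a", "b", "m" (subset for illustration)
const target = 0;                     // the correct next char is "."
const p = softmax(logits);
const loss = -Math.log(p[target]);    // cross-entropy = surprise at the truth
// here p[0] ≈ 0.18 and loss ≈ 1.70; over all 27 classes the mass spreads
// thinner, giving the p["."] ≈ 0.07 and loss ≈ 2.66 of the trace
```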
Here's where the Conal magic happens. We call loss.back(1.0) and blame flows backward through every layer:
Gradient on logits (the beautiful shortcut — no chain rule needed here!):
∂loss/∂logit["."] = p["."] - 1 = 0.07 - 1 = -0.93 ← "push this score UP a lot"
∂loss/∂logit["a"] = p["a"] - 0 = +0.18 ← "push this score down a bit"
∂loss/∂logit["m"] = p["m"] - 0 = +0.14 ← "push this score down a bit"
... (all non-targets get pushed down by their probability)
This is why the softmax+cross-entropy gradient is so elegant: it's just (probability - target). The model put too much probability on "a" and "m", too little on ".", so the gradient says "fix that."
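You can verify the shortcut numerically: nudge each logit by ±ε and compare the finite-difference slope of the loss against p_i − 1{i=target}. A self-contained sketch with illustrative logits:

```typescript
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / Z);
}
const xent = (logits: number[], t: number) => -Math.log(softmax(logits)[t]);

const logits = [1.2, 2.1, -0.5, 1.8];
const target = 0;
const p = softmax(logits);
const eps = 1e-5;

for (let i = 0; i < logits.length; i++) {
  const analytic = p[i] - (i === target ? 1 : 0); // the shortcut
  const plus = logits.slice();  plus[i] += eps;
  const minus = logits.slice(); minus[i] -= eps;
  const numeric = (xent(plus, target) - xent(minus, target)) / (2 * eps);
  // the whole chain rule collapsed to (probability − target)
  if (Math.abs(analytic - numeric) > 1e-6) throw new Error("gradient mismatch");
}
```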
Gradient flows through W₂ → through tanh (multiplied by 1 - tanh²) → through W₁ → back into the embedding table.
At the end of backward, every one of the 7,721 parameters has a gradient telling it "here's how to nudge yourself to make this prediction less surprising."
Simple gradient descent: param -= learning_rate × gradient. Every parameter shifts slightly. After thousands of steps, the model gets less and less surprised by the training data, and starts generating plausible new names.
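The update itself is one line per parameter. A sketch (the learning rate here is illustrative, not the value main.ts uses):

```typescript
const lr = 0.1;                                 // hypothetical learning rate
const params = Float32Array.from([0.5, -0.3]);  // two toy parameters
const grads  = Float32Array.from([0.2,  0.4]);  // their gradient buckets

for (let i = 0; i < params.length; i++) {
  params[i] -= lr * grads[i]; // step downhill: param -= learning_rate × gradient
}
// params ≈ [0.48, -0.34]; a typical loop then zeroes the buckets before the next step
```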
Training on 32,033 names (90/10 train/test split) for 10 seconds:
| Metric | Value |
|---|---|
| Dataset | 32,033 names from names.txt |
| Parameters | 7,721 |
| Steps | ~3,400 |
| Train loss | ~2.33 |
| Test loss | ~2.33 |
| Time | 10 seconds |
Sample generated names: aylies, avurie, kari, marya, laina, dorie, alyni, elia
This model generates baby names. ChatGPT completes sentences. What's between them? Less than you'd think.
Right now we train on 32K names with 27 characters (a-z + BOS). To handle sentences, just swap in a text file — say, Tiny Shakespeare (~1MB). The vocab grows to ~65 characters (add capitals, spaces, punctuation). Nothing else changes — the loss function, the training loop, the AD framework all work identically. V just becomes 65 instead of 27.
What you'd get: babble that looks like English character patterns, but no coherent words. Why? Because K=5 characters is barely one word of context.
With K=5 you see 5 characters of history. English words average ~5 characters, so you can barely see one word back. To complete sentences you'd want K=64 or K=128+.
But here's the problem: our model concatenates the K embeddings into one flat vector. With K=128 and Emb=10, that's a 1,280-dim input. The model must learn completely separate weights for "e in position 1" vs. "e in position 47", even though they're the same letter. The parameter count explodes, and there's no way for character 3 to selectively attend to character 97.
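Back-of-envelope arithmetic, using the document's own sizes (Emb=10, 128 hidden neurons), shows how fast the first layer's weight matrix grows with K:

```typescript
// First-layer weight count for the concat architecture: (K · Emb) × Hidden
const emb = 10, hidden = 128;
const small = 5 * emb * hidden;   // K=5   →   6,400 weights
const big = 128 * emb * hidden;   // K=128 → 163,840 weights, 25.6× more
```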
This is exactly the problem attention solves.
Attention replaces our concat-everything approach with a selective mechanism: each position gets to look at all previous positions and decide what's relevant, using learned Query/Key/Value projections.
The beautiful thing: in our Conal framework, attention decomposes into primitives we almost already have:
Q, K, V = three linear projections ← we already have linearLayer!
scores = Q · Kᵀ / √d ← dot products, we can do this
weights = softmax(scores) ← we already have softmax!
output = weights · V ← another matmul
Each of these would get a back(g) function, just like our existing primitives. The Node<T> pattern doesn't change at all.
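Here's a hedged forward-pass sketch of that decomposition for a single query position (plain number arrays, illustrative names; a real version would also return back(g) closures, exactly like the existing primitives):

```typescript
// One attention step: score each previous position with q·k/√d,
// softmax the scores into weights, then take a weighted sum of values.
function attendOne(q: number[], keys: number[][], values: number[][]): number[] {
  const d = q.length;
  const scores = keys.map((k) => k.reduce((s, kj, j) => s + kj * q[j], 0) / Math.sqrt(d));
  const m = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  const w = exps.map((e) => e / Z); // attention weights, sum to 1
  // output = weights · V
  return values[0].map((_, j) => values.reduce((acc, v, t) => acc + w[t] * v[j], 0));
}

// A query aligned with the first key mostly copies V's first row
const out = attendOne([3, 0], [[3, 0], [0, 3]], [[1, 0], [0, 1]]);
// out[0] > 0.99 — nearly all the weight lands on position 0
```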
With attention + a few more standard pieces (layer normalization, residual connections, multiple layers), you'd need ~500K–2M parameters and minutes-to-hours of training. But the result would complete basic sentences.
Here's the punchline that surprises most people:
| Component | This Project | ChatGPT |
|---|---|---|
| Loss function | -log(softmax(logits)[target]) | -log(softmax(logits)[target]) |
| AD framework | Node<T> with back(g) | Same idea, on tensors + GPUs |
| Training loop | forward → loss → backward → update | forward → loss → backward → update |
| What it learns | "after 'emm', 'a' is likely" | "after 'The capital of France is', 'Paris' is likely" |
The core algorithm is identical. Everything between here and ChatGPT is scale (more data, bigger model, faster hardware) and a few architectural upgrades (attention, normalization, better optimizers). The essence — predict the next token, measure surprise, trace blame backward, nudge the knobs — that's all here in 491 lines of TypeScript.
# In Val Town — just hit Run on main.ts
# Locally:
npx tsx main.ts
- Conal Elliott — The Simple Essence of Automatic Differentiation — the paper that inspired this style
- Andrej Karpathy — makemore — similar char-level model in Python
- Andrej Karpathy — microgpt — the beautiful 243-line GPT
- Bengio et al. 2003 — A Neural Probabilistic Language Model — the original MLP LM paper
- 3Blue1Brown — Neural Networks — visual intuition for backprop
Built with zero dependencies. Just math, types, and the Conal Elliott conviction that derivatives are functions, not data structures. ✨
