What if building a language model was like playing with LEGOs?
This is a character-level language model that learns to invent new baby names — built from scratch with Conal Elliott-style reverse-mode automatic differentiation. No PyTorch. No TensorFlow. Just TypeScript, Float32Arrays, and vibes.
It trains in ~10 seconds on all 32,033 names from Karpathy's makemore dataset and then dreams up new ones like "aylies", "marya", "laina", and "kari".
Pick your depth:
| Time | What you'll get |
|---|---|
| ⏱️ 30 seconds | Read the ELI5 section — you'll know what this does |
| ⏱️ 5 minutes | Add the Architecture section — you'll know how it works |
| ⏱️ 20 minutes | Read "One Training Step, Traced" — you'll feel the math with real numbers |
| ⏱️ 1 hour | Read main.ts guided by the inline comments — you'll understand every line |
| ⏱️ Beyond | Read "From Names to Sentences" — you'll see how this becomes ChatGPT |
Imagine you're a kid trying to guess the next letter in a name:
"e", "m", "m" → probably "a" (emma!)
"s", "o", "p" → probably "h" (sophia!)
This program learns those patterns by:
- 📖 Looking at 32K real names (fetched live from GitHub!)
- 🔢 Turning each letter into a secret code (an embedding)
- 📏 Gluing 5 codes together into one long vector (the concat)
- 🧠 Passing through a hidden brain layer with tanh squishing
- 🧮 Scoring every possible next letter
- ❌ Getting told "nope, wrong!" (the loss)
- 🔙 Tracing backward through the math to figure out how to do better (the backprop)
- 🔧 Nudging all the knobs a little bit (the gradient descent)
- 🔁 Repeating ~3,400 times in ten seconds
Most ML frameworks treat autodiff as a graph-rewriting compiler pass. Conal Elliott's insight is far more elegant: the derivative is just a function. Every computation returns a pair:
(value, backpropagator)
The backpropagator is a first-class function that, given upstream sensitivity, pushes gradient to its inputs. Composition of programs gives composition of backpropagators — no tape, no graph, no magic.
Here, that shows up as the Node<T> class:
class Node<T> {
v: T; // the value we computed
back: (g: T) => void; // "hey inputs, here's how much you matter"
}
ELI5: A Node is like a kid who knows their answer AND knows who to blame if the answer is wrong.
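To make that concrete, here is a scalar sketch of the pattern (illustrative names, not the project's exact API — the real code works on Float32Arrays): a primitive computes its value forward and returns a backpropagator closure that deposits blame into gradient buckets.

```typescript
// Scalar sketch of the Node pattern: value + backpropagator, nothing else.
class ScalarNode {
  constructor(public v: number, public back: (g: number) => void) {}
}

// gradient "buckets" for two leaf inputs
const grad = { x: 0, y: 0 };

function mul(x: number, y: number): ScalarNode {
  // d(x·y)/dx = y and d(x·y)/dy = x — the closure remembers both inputs
  return new ScalarNode(x * y, (g) => {
    grad.x += g * y;
    grad.y += g * x;
  });
}

const z = mul(3, 4);
z.back(1.0); // seed the output sensitivity with 1
// z.v === 12, grad.x === 4, grad.y === 3
```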
The model is a Bengio-style MLP (like Karpathy's makemore Part 2): five context characters are looked up in a 27 × 10 embedding table, the five 10-dim codes are concatenated into one 50-vector, passed through a 128-neuron tanh hidden layer, and projected to 27 logits — one score per possible next character.
7,721 parameters total — small enough to fit in a tweet, powerful enough to dream up names.
Every primitive follows the same shape — this is the whole trick: compute the value on the way forward, and return a back(g) closure that, given the upstream gradient, pushes blame to the inputs on the way backward.
No tape. No graph object. Just functions calling functions.
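"Functions calling functions" looks like this in miniature — a hedged scalar sketch (illustrative names) of z = tanh(w·x), where calling z.back composes the two backpropagators with no tape anywhere:

```typescript
// Leaf values with gradient buckets
const grads = { w: 0, x: 0 };
const w = { v: 0.5, back: (g: number) => { grads.w += g; } };
const x = { v: 2.0, back: (g: number) => { grads.x += g; } };

// y = w·x — its backpropagator calls the inputs' backpropagators
const y = {
  v: w.v * x.v,
  back: (g: number) => { w.back(g * x.v); x.back(g * w.v); },
};

// z = tanh(y) — multiply upstream gradient by 1 − tanh²(y), then recurse
const t = Math.tanh(y.v);
const z = { v: t, back: (g: number) => y.back(g * (1 - t * t)) };

z.back(1.0);
// grads.w === (1 − tanh²(1.0)) · 2.0 and grads.x === (1 − tanh²(1.0)) · 0.5
```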
| Concept | Code | ELI5 |
|---|---|---|
| Node<T> | class Node<T> | A value that knows who to blame |
| Param | class Param | A trainable knob with a gradient bucket |
| gatherAndConcat | E[ctx] → (B, K·Emb) | "Look up codes for 5 letters, glue into one vector" |
| linearLayer | X @ W^T + b | "Each neuron forms an opinion about the input" |
| tanhAct | tanh(x) | "Squish numbers to stay calm" |
| xentFromLogits | softmax + -log(p) | "How surprised were we by the right answer?" |
| loss.back(1.0) | reverse-mode AD | "OK everyone, trace back the blame!" |
Abstract math gets real when you follow one example through. Here's what actually happens inside a single training step. Every number below is real — this is what the code computes.
The model sees the context ".", "e", "m", "m", "a" and must predict the next character. (The name is "emma", and "." is the BOS/start token.) The correct answer is "." — the name is over.
Each of the 5 context characters looks up its row in the embedding table E (27 × 10). The character "." (id=0) pulls out 10 numbers, "e" (id=5) pulls out its 10 numbers, and so on. Then we glue all 5 vectors together in context order:
embed = [ .'s 10 floats | e's 10 floats | m's 10 floats | m's 10 floats | a's 10 floats ]
= 50 numbers total
Note: "m" appears twice, so the same 10 numbers appear in two places. During backprop, the gradients from both positions add up — that's the += in gatherAndConcat's backward pass.
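A hedged miniature of that += (tiny illustrative sizes, not the project's exact gatherAndConcat): forward gathers rows into one flat vector; backward scatter-adds slices of the upstream gradient back, so a character used at two positions receives gradient twice.

```typescript
const Emb = 2;                                      // tiny embedding size for the demo
const E = Float32Array.from([0.1, 0.2, 0.3, 0.4]);  // 2 chars × 2 dims
const dE = new Float32Array(E.length);              // gradient buckets

const ctx = [1, 1, 0];                              // char 1 appears twice
const out = new Float32Array(ctx.length * Emb);
ctx.forEach((id, k) => out.set(E.subarray(id * Emb, id * Emb + Emb), k * Emb));

// backward: scatter-add each slice of g into its source row (the +=)
function back(g: Float32Array) {
  ctx.forEach((id, k) => {
    for (let j = 0; j < Emb; j++) dE[id * Emb + j] += g[k * Emb + j];
  });
}

back(Float32Array.from([1, 1, 1, 1, 1, 1]));
// row 1 received gradient from two positions: dE = [1, 1, 2, 2]
```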
The 50-number embedding hits the hidden layer: h = tanh(W₁ · embed + b₁). Each of the 128 hidden neurons computes a weighted sum of all 50 inputs, adds its bias, then gets squished through tanh into the range (-1, +1):
pre-tanh neuron 0: 0.02×embed[0] + (-0.01)×embed[1] + ... + bias[0] = 1.47
after tanh: tanh(1.47) = 0.90
pre-tanh neuron 1: ... = -0.83
after tanh: tanh(-0.83) = -0.68
... (128 neurons total)
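The per-neuron arithmetic above fits in a few lines. A sketch (the weights and inputs here are made-up; only the traced pre-activations 1.47 and −0.83 come from the walkthrough):

```typescript
// One hidden neuron: weighted sum of its inputs plus a bias, then tanh
function neuron(weights: number[], bias: number, input: number[]): number {
  const pre = weights.reduce((sum, w, i) => sum + w * input[i], bias);
  return Math.tanh(pre); // squish into (−1, +1)
}

Math.tanh(1.47);  // ≈ 0.90, as in the trace
Math.tanh(-0.83); // ≈ −0.68
```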
The 128 hidden activations hit the output layer: logits = W₂ · h + b₂. This produces 27 raw scores, one per character:
logit["."] = 1.2, logit["a"] = 2.1, logit["b"] = -0.5, ... logit["m"] = 1.8, ...
These are just scores — they can be any number, positive or negative.
Softmax turns scores into probabilities (all positive, sum to 1):
p["."] = exp(1.2) / Σexp(all) = 0.07 ← the model gives 7% to the right answer
p["a"] = exp(2.1) / Σexp(all) = 0.18
p["m"] = exp(1.8) / Σexp(all) = 0.14
... (all 27 sum to 1.0)
The target is "." (id=0). The loss is:
loss = -log(p["."]) = -log(0.07) = 2.66
Interpretation: If the model had put 100% on ".", loss = -log(1) = 0 (not surprised at all). If it had put 0.01% on ".", loss = -log(0.0001) = 9.2 (very surprised). Our 7% gives a loss of 2.66 — not great, not terrible.
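Both formulas can be checked in a few lines. This sketch keeps only a 4-character subset of the traced logits (".", "a", "b", "m"), so the probabilities come out higher than the 27-class values above — with fewer competitors, "." gets more mass:

```typescript
// Numerically stable softmax: subtract the max before exponentiating
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / Z);
}

const logits = [1.2, 2.1, -0.5, 1.8]; // ".", "a", "b", "m" (subset for illustration)
const target = 0;                     // the correct next char is "."
const p = softmax(logits);
const loss = -Math.log(p[target]);    // cross-entropy = surprise at the truth
// here p[0] ≈ 0.18 and loss ≈ 1.70; over all 27 classes the mass spreads
// thinner, giving the p["."] ≈ 0.07 and loss ≈ 2.66 of the trace
```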
Here's where the Conal magic happens. We call loss.back(1.0) and blame flows backward through every layer:
Gradient on logits (the beautiful shortcut — no chain rule needed here!):
∂loss/∂logit["."] = p["."] - 1 = 0.07 - 1 = -0.93 ← "push this score UP a lot"
∂loss/∂logit["a"] = p["a"] - 0 = +0.18 ← "push this score down a bit"
∂loss/∂logit["m"] = p["m"] - 0 = +0.14 ← "push this score down a bit"
... (all non-targets get pushed down by their probability)
This is why the softmax+cross-entropy gradient is so elegant: it's just (probability - target). The model put too much probability on "a" and "m", too little on ".", so the gradient says "fix that."
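You can verify the shortcut numerically: nudge each logit by ±ε and compare the finite-difference slope of the loss against p_i − 1{i=target}. A self-contained sketch with illustrative logits:

```typescript
function softmax(logits: number[]): number[] {
  const m = Math.max(...logits);
  const exps = logits.map((v) => Math.exp(v - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / Z);
}
const xent = (logits: number[], t: number) => -Math.log(softmax(logits)[t]);

const logits = [1.2, 2.1, -0.5, 1.8];
const target = 0;
const p = softmax(logits);
const eps = 1e-5;

for (let i = 0; i < logits.length; i++) {
  const analytic = p[i] - (i === target ? 1 : 0); // the shortcut
  const plus = logits.slice();  plus[i] += eps;
  const minus = logits.slice(); minus[i] -= eps;
  const numeric = (xent(plus, target) - xent(minus, target)) / (2 * eps);
  // the whole chain rule collapsed to (probability − target)
  if (Math.abs(analytic - numeric) > 1e-6) throw new Error("gradient mismatch");
}
```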
Gradient flows through W₂ → through tanh (multiplied by 1 - tanh²) → through W₁ → back into the embedding table.
At the end of backward, every one of the 7,721 parameters has a gradient telling it "here's how to nudge yourself to make this prediction less surprising."
Simple gradient descent: param -= learning_rate × gradient. Every parameter shifts slightly. After thousands of steps, the model gets less and less surprised by the training data, and starts generating plausible new names.
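The update itself is one line per parameter. A sketch (the learning rate here is illustrative, not the value main.ts uses):

```typescript
const lr = 0.1;                                 // hypothetical learning rate
const params = Float32Array.from([0.5, -0.3]);  // two toy parameters
const grads  = Float32Array.from([0.2,  0.4]);  // their gradient buckets

for (let i = 0; i < params.length; i++) {
  params[i] -= lr * grads[i]; // step downhill: param -= learning_rate × gradient
}
// params ≈ [0.48, -0.34]; a typical loop then zeroes the buckets before the next step
```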
Training on 32,033 names (90/10 train/test split) for 10 seconds:
| Metric | Value |
|---|---|
| Dataset | 32,033 names from names.txt |
| Parameters | 7,721 |
| Steps | ~3,400 |
| Train loss | ~2.33 |
| Test loss | ~2.33 |
| Time | 10 seconds |
Sample generated names: aylies, avurie, kari, marya, laina, dorie, alyni, elia
This model generates baby names. ChatGPT completes sentences. What's between them? Less than you'd think.
Right now we train on 32K names with 27 characters (a-z + BOS). To handle sentences, just swap in a text file — say, Tiny Shakespeare (~1MB). The vocab grows to ~65 characters (add capitals, spaces, punctuation). Nothing else changes — the loss function, the training loop, the AD framework all work identically. V just becomes 65 instead of 27.
What you'd get: babble that looks like English character patterns, but no coherent words. Why? Because K=5 characters is barely one word of context.
With K=5 you see 5 characters of history. English words average ~5 characters, so you can barely see one word back. To complete sentences you'd want K=64 or K=128+.
But here's the problem: our model concatenates the K embeddings into one flat vector. With K=128 and Emb=10, that's a 1,280-dim input. The model must learn completely separate weights for "e in position 1" vs. "e in position 47", even though they're the same letter. The parameter count explodes, and there's no way for character 3 to selectively attend to character 97.
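Back-of-envelope arithmetic, using the document's own sizes (Emb=10, 128 hidden neurons), shows how fast the first layer's weight matrix grows with K:

```typescript
// First-layer weight count for the concat architecture: (K · Emb) × Hidden
const emb = 10, hidden = 128;
const small = 5 * emb * hidden;   // K=5   →   6,400 weights
const big = 128 * emb * hidden;   // K=128 → 163,840 weights, 25.6× more
```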
This is exactly the problem attention solves.
Attention replaces our concat-everything approach with a selective mechanism: each position gets to look at all previous positions and decide what's relevant, using learned Query/Key/Value projections.
The beautiful thing: in our Conal framework, attention decomposes into primitives we almost already have:
Q, K, V = three linear projections ← we already have linearLayer!
scores = Q · Kᵀ / √d ← dot products, we can do this
weights = softmax(scores) ← we already have softmax!
output = weights · V ← another matmul
Each of these would get a back(g) function, just like our existing primitives. The Node<T> pattern doesn't change at all.
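Here's a hedged forward-pass sketch of that decomposition for a single query position (plain number arrays, illustrative names; a real version would also return back(g) closures, exactly like the existing primitives):

```typescript
// One attention step: score each previous position with q·k/√d,
// softmax the scores into weights, then take a weighted sum of values.
function attendOne(q: number[], keys: number[][], values: number[][]): number[] {
  const d = q.length;
  const scores = keys.map((k) => k.reduce((s, kj, j) => s + kj * q[j], 0) / Math.sqrt(d));
  const m = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - m));
  const Z = exps.reduce((a, b) => a + b, 0);
  const w = exps.map((e) => e / Z); // attention weights, sum to 1
  // output = weights · V
  return values[0].map((_, j) => values.reduce((acc, v, t) => acc + w[t] * v[j], 0));
}

// A query aligned with the first key mostly copies V's first row
const out = attendOne([3, 0], [[3, 0], [0, 3]], [[1, 0], [0, 1]]);
// out[0] > 0.99 — nearly all the weight lands on position 0
```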
With attention + a few more standard pieces (layer normalization, residual connections, multiple layers), you'd need ~500K–2M parameters and minutes-to-hours of training. But the result would complete basic sentences.
Here's the punchline that surprises most people:
| Component | This Project | ChatGPT |
|---|---|---|
| Loss function | -log(softmax(logits)[target]) | -log(softmax(logits)[target]) |
| AD framework | Node<T> with back(g) | Same idea, on tensors + GPUs |
| Training loop | forward → loss → backward → update | forward → loss → backward → update |
| What it learns | "after 'emm', 'a' is likely" | "after 'The capital of France is', 'Paris' is likely" |
The core algorithm is identical. Everything between here and ChatGPT is scale (more data, bigger model, faster hardware) and a few architectural upgrades (attention, normalization, better optimizers). The essence — predict the next token, measure surprise, trace blame backward, nudge the knobs — that's all here in 491 lines of TypeScript.
# In Val Town — just hit Run on main.ts
# Locally:
npx tsx main.ts
- Conal Elliott — The Simple Essence of Automatic Differentiation — the paper that inspired this style
- Andrej Karpathy — makemore — similar char-level model in Python
- Andrej Karpathy — microgpt — the beautiful 243-line GPT
- Bengio et al. 2003 — A Neural Probabilistic Language Model — the original MLP LM paper
- 3Blue1Brown — Neural Networks — visual intuition for backprop
Built with zero dependencies. Just math, types, and the Conal Elliott conviction that derivatives are functions, not data structures. ✨
