A direct TypeScript port of Karpathy's microGPT — faithful to the original's scalar Value-based autograd, not optimized for speed.
Karpathy's microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit at the same level of abstraction — every scalar is a Value node in a dynamically-built computation graph, just like the original.
The core design choice: every number is a Value object.
```ts
class Value {
  data: number;          // one scalar
  grad: number;          // accumulated gradient: dLoss/dData
  _backward: () => void; // closure that propagates grad to the parents
  _prev: Value[];        // parent nodes in the graph
}
```
A 16-dim embedding lookup creates 16 Value objects. A vector-matrix multiply of [16] × [16, 48] creates 48 dot products, each made of 16 multiplies and 15 adds: nearly 1,500 new Value nodes. A single forward pass builds a computation graph with tens of thousands of nodes, each individually linked and walked during .backward().
This is the same approach as Karpathy's Python Value class. It makes the chain rule visible at every single scalar operation.
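Each operation returns a fresh node whose `_backward` closure applies the local chain rule. A minimal sketch of the mechanism, assuming the class shape above (abridged: the real implementation also has `pow`, `exp`, `log`, and `relu`):

```ts
// Minimal sketch of scalar autograd in this style. Every operation
// allocates a new node and records how to push gradient to its parents.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public _prev: Value[] = [],
    public _backward: () => void = () => {},
  ) {}

  add(other: Value): Value {
    const out = new Value(this.data + other.data, [this, other]);
    out._backward = () => {
      this.grad += out.grad;  // d(a+b)/da = 1
      other.grad += out.grad; // d(a+b)/db = 1
    };
    return out;
  }

  mul(other: Value): Value {
    const out = new Value(this.data * other.data, [this, other]);
    out._backward = () => {
      this.grad += other.data * out.grad; // d(a*b)/da = b
      other.grad += this.data * out.grad; // d(a*b)/db = a
    };
    return out;
  }

  backward(): void {
    // Topologically sort the graph, then apply the chain rule in reverse.
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value) => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const p of v._prev) build(p);
      topo.push(v);
    };
    build(this);
    this.grad = 1; // dLoss/dLoss
    for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
  }
}
```

Every `add` or `mul` allocates one node, which is why a single forward pass ends up with tens of thousands of them.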
Same GPT-2-style transformer as the original: token and positional embeddings feed one block of RMSNorm → causal self-attention → residual add → RMSNorm → ReLU MLP → residual add, followed by a linear head whose weights are tied to the token embeddings.
Following Karpathy's decomposition:
- Dataset — ~32k names fetched from Karpathy's `makemore` repo
- Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27)
- Autograd — Scalar `Value` class with `add`, `mul`, `pow`, `exp`, `log`, and `relu`; each operation creates a node in the graph
- Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
- Loss — Cross-entropy over next-token predictions
- Optimizer — Adam with bias correction and linear learning rate decay (sketched just after this list)
- Sampling — Temperature-controlled autoregressive generation
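That optimizer step, as a hedged sketch: `learningRate = 0.01` and `steps = 1000` come from the defaults table below, while `beta1`, `beta2`, and `eps` are typical Adam values rather than confirmed choices of this port.

```ts
// Sketch: Adam with bias correction and linear learning-rate decay.
// lrMax and totalSteps match the defaults table; beta1/beta2/eps are
// standard Adam values, not necessarily this port's exact choices.
type Param = { data: number; grad: number };

function adamStep(
  params: Param[],
  m: Float64Array, // first-moment estimates, one per parameter
  v: Float64Array, // second-moment estimates, one per parameter
  step: number,    // 1-based training step
  totalSteps = 1000,
  lrMax = 0.01,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8,
): void {
  const lr = lrMax * (1 - step / totalSteps); // linear decay to 0
  for (let i = 0; i < params.length; i++) {
    const g = params[i].grad;
    m[i] = beta1 * m[i] + (1 - beta1) * g;
    v[i] = beta2 * v[i] + (1 - beta2) * g * g;
    const mHat = m[i] / (1 - Math.pow(beta1, step)); // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, step));
    params[i].data -= (lr * mHat) / (Math.sqrt(vHat) + eps);
    params[i].grad = 0; // clear for the next backward pass
  }
}
```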
This is a Val Town script val. It trains for 1000 steps on CPU. Because every scalar is its own autograd node, it runs roughly 60x slower than a tensor-based implementation, but the code is simpler and more transparent:
```
num docs: 32033
vocab size: 27
num params: 4795
step 1 / 1000 | loss 3.5062 | 0.0s
step 101 / 1000 | loss 2.7573 | ...s
...
step 1000 / 1000 | loss 2.2891 | ...s
--- generation ---
sample 1: malede
sample 2: jara
...
```
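The samples above come from temperature-controlled autoregressive generation. A minimal sketch of that loop, where `nextLogits`, `BOS = 0`, the generation cap, and `temperature = 0.8` are illustrative assumptions; only the vocab size of 27 and the 8-token context window come from the port:

```ts
// Sketch: temperature-controlled autoregressive sampling.
const BOS = 0; // single combined BOS/EOS token; ids 1..26 are letters

function sample(
  nextLogits: (context: number[]) => number[], // model forward pass
  maxLen = 8,        // context window, from the defaults table
  temperature = 0.8, // illustrative; < 1 sharpens, > 1 flattens
): number[] {
  const out: number[] = [];
  let context = [BOS];
  for (let t = 0; t < 32; t++) { // generation cap, illustrative
    // Softmax with temperature over the 27-way logits.
    const scaled = nextLogits(context).map((l) => l / temperature);
    const maxL = Math.max(...scaled);
    const exps = scaled.map((l) => Math.exp(l - maxL)); // numerically stable
    const sum = exps.reduce((a, b) => a + b, 0);
    const probs = exps.map((e) => e / sum);
    // Inverse-CDF sampling from the categorical distribution.
    let r = Math.random();
    let next = probs.length - 1;
    for (let i = 0; i < probs.length; i++) {
      r -= probs[i];
      if (r <= 0) { next = i; break; }
    }
    if (next === BOS) break; // sampling EOS ends the name
    out.push(next);
    context = [...context, next].slice(-maxLen); // sliding context window
  }
  return out;
}
```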
Matched to Karpathy's defaults:
| Parameter | Value | Notes |
|---|---|---|
| `dModel` | 16 | Embedding dimension |
| `nHeads` | 4 | Attention heads (head dim = 4) |
| `nLayers` | 1 | Transformer blocks |
| `dFF` | 64 | FF hidden dim (4× dModel) |
| `maxLen` | 8 | Context window |
| `steps` | 1000 | Training iterations |
| `learningRate` | 0.01 | With linear decay to 0 |
| `seed` | 42 | Deterministic initialization |
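The same defaults as they might appear in code; the field names follow the table, but this particular object shape is illustrative rather than the port's actual layout:

```ts
// Hyperparameters matched to Karpathy's defaults. Values come from the
// table above; grouping them into one object is illustrative.
const config = {
  dModel: 16,         // embedding dimension
  nHeads: 4,          // attention heads (head dim = dModel / nHeads = 4)
  nLayers: 1,         // transformer blocks
  dFF: 64,            // FF hidden dim (4 * dModel)
  maxLen: 8,          // context window
  steps: 1000,        // training iterations
  learningRate: 0.01, // decayed linearly to 0
  seed: 42,           // deterministic initialization
} as const;
```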
This version uses scalar autograd — the same approach as Karpathy's original Python. Every addition, multiplication, and activation creates a new Value object in the computation graph. A single forward pass creates tens of thousands of nodes, and backward() must topologically sort and walk all of them.
The main branch uses tensor autograd instead — a single matmul operation creates one graph node with a fused backward closure that loops over all scalars in tight for loops. Same math, ~60x faster, but the individual scalar chain rule is hidden inside the closure.
| Factor | This version (scalar Values) | Main branch (Tensors) |
|---|---|---|
| Graph nodes per forward pass | ~10,000+ Value objects | ~30-40 Tensor objects |
| Object allocation | One object per scalar op | One TypedArray per tensor op |
| Backward traversal | Topological sort over ~10K nodes | Topological sort over ~30 nodes |
| Inner loop execution | V8 JIT over many small closures | V8 JIT over tight array loops |
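To make the contrast concrete, here is a hedged sketch of the fused style, with an illustrative `Tensor` class that is not the main branch's actual code: one `matmul` call does all the scalar arithmetic in plain loops and registers a single backward closure.

```ts
// Sketch: one tensor op = one graph node with a fused backward closure.
// This Tensor class is illustrative; the main branch's class differs.
class Tensor {
  grad: Float64Array;
  constructor(
    public data: Float64Array, // row-major [rows, cols]
    public rows: number,
    public cols: number,
    public _prev: Tensor[] = [],
    public _backward: () => void = () => {},
  ) {
    this.grad = new Float64Array(data.length);
  }

  // [m, k] x [k, n] -> [m, n]: thousands of scalar ops, one graph node.
  matmul(other: Tensor): Tensor {
    const m = this.rows, k = this.cols, n = other.cols;
    const out = new Tensor(new Float64Array(m * n), m, n, [this, other]);
    for (let i = 0; i < m; i++)
      for (let j = 0; j < n; j++) {
        let s = 0;
        for (let p = 0; p < k; p++) s += this.data[i * k + p] * other.data[p * n + j];
        out.data[i * n + j] = s;
      }
    out._backward = () => {
      // dA = dOut x B^T and dB = A^T x dOut, in tight TypedArray loops.
      for (let i = 0; i < m; i++)
        for (let p = 0; p < k; p++) {
          let s = 0;
          for (let j = 0; j < n; j++) s += out.grad[i * n + j] * other.data[p * n + j];
          this.grad[i * k + p] += s;
        }
      for (let p = 0; p < k; p++)
        for (let j = 0; j < n; j++) {
          let s = 0;
          for (let i = 0; i < m; i++) s += this.data[i * k + p] * out.grad[i * n + j];
          other.grad[p * n + j] += s;
        }
    };
    return out;
  }
}
```

The scalar version allocates nearly 1,500 Value nodes for the same [16] × [16, 48] multiply; here it is one node, and V8 gets long, monomorphic loops to optimize.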
Karpathy's choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. This port preserves that choice.
TypeScript's type system makes the structure visible: `ModelSpec` describes what the model is, `Model` is the behavior, `Trained` is a frozen snapshot. But the autograd is the same scalar-level approach as the Python original — no tensor tricks, no fused kernels, just `Value.add()` and `Value.mul()` all the way down.
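A sketch of how those three types might relate; the type names come from the code, but every member below is a guess for illustration:

```ts
// Illustrative only: ModelSpec, Model, and Trained are the port's type
// names, but these members are guesses, not the actual definitions.
interface ModelSpec {
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
  vocabSize: number;
}

interface Model {
  spec: ModelSpec;
  params: Value[];                     // every learnable scalar
  forward(context: number[]): Value[]; // next-token logits
}

interface Trained {
  readonly spec: ModelSpec;
  readonly weights: readonly number[]; // frozen snapshot of params[i].data
}
```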
