A direct TypeScript port of Karpathy's microGPT — faithful to the original's scalar Value-based autograd, not optimized for speed.
Karpathy's microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit at the same level of abstraction — every scalar is a Value node in a dynamically-built computation graph, just like the original.
The core design choice: every number is a Value object.
```ts
class Value {
  data: number;          // one scalar
  grad: number;          // accumulated gradient: dLoss/dData
  _backward: () => void; // closure that propagates grad to the parents
  _prev: Value[];        // parent nodes in the graph
}
```
A 16-dim embedding lookup creates 16 Value objects. A vector-matrix multiply of [16] × [16, 48] creates 48 dot products, each made of 16 multiplies and 15 adds: nearly 1,500 new Value nodes. A single forward pass builds a computation graph with tens of thousands of nodes, each individually linked and walked during .backward().
This is the same approach as Karpathy's Python Value class. It makes the chain rule visible at every single scalar operation.
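Each operation returns a fresh node whose `_backward` closure applies the local chain rule. A minimal sketch of the mechanism, assuming the class shape above (abridged: the real implementation also has `pow`, `exp`, `log`, and `relu`):

```ts
// Minimal sketch of scalar autograd in this style. Every operation
// allocates a new node and records how to push gradient to its parents.
class Value {
  grad = 0;
  constructor(
    public data: number,
    public _prev: Value[] = [],
    public _backward: () => void = () => {},
  ) {}

  add(other: Value): Value {
    const out = new Value(this.data + other.data, [this, other]);
    out._backward = () => {
      this.grad += out.grad;  // d(a+b)/da = 1
      other.grad += out.grad; // d(a+b)/db = 1
    };
    return out;
  }

  mul(other: Value): Value {
    const out = new Value(this.data * other.data, [this, other]);
    out._backward = () => {
      this.grad += other.data * out.grad; // d(a*b)/da = b
      other.grad += this.data * out.grad; // d(a*b)/db = a
    };
    return out;
  }

  backward(): void {
    // Topologically sort the graph, then apply the chain rule in reverse.
    const topo: Value[] = [];
    const visited = new Set<Value>();
    const build = (v: Value) => {
      if (visited.has(v)) return;
      visited.add(v);
      for (const p of v._prev) build(p);
      topo.push(v);
    };
    build(this);
    this.grad = 1; // dLoss/dLoss
    for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
  }
}
```

Every `add` or `mul` allocates one node, which is why a single forward pass ends up with tens of thousands of them.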
Same GPT-2-style transformer as the original: token and positional embeddings feed one block of RMSNorm → causal self-attention → residual add → RMSNorm → ReLU MLP → residual add, followed by a linear head whose weights are tied to the token embeddings.
Following Karpathy's decomposition:
- Dataset — ~32k names fetched from Karpathy's `makemore` repo
- Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27)
- Autograd — Scalar `Value` class with `add`, `mul`, `pow`, `exp`, `log`, and `relu`; each operation creates a node in the graph
- Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
- Loss — Cross-entropy over next-token predictions
- Optimizer — Adam with bias correction and linear learning rate decay (sketched just after this list)
- Sampling — Temperature-controlled autoregressive generation
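That optimizer step, as a hedged sketch: `learningRate = 0.01` and `steps = 1000` come from the defaults table below, while `beta1`, `beta2`, and `eps` are typical Adam values rather than confirmed choices of this port.

```ts
// Sketch: Adam with bias correction and linear learning-rate decay.
// lrMax and totalSteps match the defaults table; beta1/beta2/eps are
// standard Adam values, not necessarily this port's exact choices.
type Param = { data: number; grad: number };

function adamStep(
  params: Param[],
  m: Float64Array, // first-moment estimates, one per parameter
  v: Float64Array, // second-moment estimates, one per parameter
  step: number,    // 1-based training step
  totalSteps = 1000,
  lrMax = 0.01,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8,
): void {
  const lr = lrMax * (1 - step / totalSteps); // linear decay to 0
  for (let i = 0; i < params.length; i++) {
    const g = params[i].grad;
    m[i] = beta1 * m[i] + (1 - beta1) * g;
    v[i] = beta2 * v[i] + (1 - beta2) * g * g;
    const mHat = m[i] / (1 - Math.pow(beta1, step)); // bias correction
    const vHat = v[i] / (1 - Math.pow(beta2, step));
    params[i].data -= (lr * mHat) / (Math.sqrt(vHat) + eps);
    params[i].grad = 0; // clear for the next backward pass
  }
}
```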
This is a Val Town script val. It trains for 1000 steps on CPU. Because every scalar is its own autograd node, it runs roughly 60x slower than a tensor-based implementation, but the code is simpler and more transparent:
```
num docs: 32033
vocab size: 27
num params: 4795
step 1 / 1000 | loss 3.5062 | 0.0s
step 101 / 1000 | loss 2.7573 | ...s
...
step 1000 / 1000 | loss 2.2891 | ...s
--- generation ---
sample 1: malede
sample 2: jara
...
```
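The samples above come from temperature-controlled autoregressive generation. A minimal sketch of that loop, where `nextLogits`, `BOS = 0`, the generation cap, and `temperature = 0.8` are illustrative assumptions; only the vocab size of 27 and the 8-token context window come from the port:

```ts
// Sketch: temperature-controlled autoregressive sampling.
const BOS = 0; // single combined BOS/EOS token; ids 1..26 are letters

function sample(
  nextLogits: (context: number[]) => number[], // model forward pass
  maxLen = 8,        // context window, from the defaults table
  temperature = 0.8, // illustrative; < 1 sharpens, > 1 flattens
): number[] {
  const out: number[] = [];
  let context = [BOS];
  for (let t = 0; t < 32; t++) { // generation cap, illustrative
    // Softmax with temperature over the 27-way logits.
    const scaled = nextLogits(context).map((l) => l / temperature);
    const maxL = Math.max(...scaled);
    const exps = scaled.map((l) => Math.exp(l - maxL)); // numerically stable
    const sum = exps.reduce((a, b) => a + b, 0);
    const probs = exps.map((e) => e / sum);
    // Inverse-CDF sampling from the categorical distribution.
    let r = Math.random();
    let next = probs.length - 1;
    for (let i = 0; i < probs.length; i++) {
      r -= probs[i];
      if (r <= 0) { next = i; break; }
    }
    if (next === BOS) break; // sampling EOS ends the name
    out.push(next);
    context = [...context, next].slice(-maxLen); // sliding context window
  }
  return out;
}
```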
Matched to Karpathy's defaults:
| Parameter | Value | Notes |
|---|---|---|
| `dModel` | 16 | Embedding dimension |
| `nHeads` | 4 | Attention heads (head dim = 4) |
| `nLayers` | 1 | Transformer blocks |
| `dFF` | 64 | FF hidden dim (4× dModel) |
| `maxLen` | 8 | Context window |
| `steps` | 1000 | Training iterations |
| `learningRate` | 0.01 | With linear decay to 0 |
| `seed` | 42 | Deterministic initialization |
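The same defaults as they might appear in code; the field names follow the table, but this particular object shape is illustrative rather than the port's actual layout:

```ts
// Hyperparameters matched to Karpathy's defaults. Values come from the
// table above; grouping them into one object is illustrative.
const config = {
  dModel: 16,         // embedding dimension
  nHeads: 4,          // attention heads (head dim = dModel / nHeads = 4)
  nLayers: 1,         // transformer blocks
  dFF: 64,            // FF hidden dim (4 * dModel)
  maxLen: 8,          // context window
  steps: 1000,        // training iterations
  learningRate: 0.01, // decayed linearly to 0
  seed: 42,           // deterministic initialization
} as const;
```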
This version uses scalar autograd — the same approach as Karpathy's original Python. Every addition, multiplication, and activation creates a new Value object in the computation graph. A single forward pass creates tens of thousands of nodes, and backward() must topologically sort and walk all of them.
The main branch uses tensor autograd instead — a single matmul operation creates one graph node with a fused backward closure that loops over all scalars in tight for loops. Same math, ~60x faster, but the individual scalar chain rule is hidden inside the closure.
| Factor | This version (scalar Values) | Main branch (Tensors) |
|---|---|---|
| Graph nodes per forward pass | ~10,000+ Value objects | ~30-40 Tensor objects |
| Object allocation | One object per scalar op | One TypedArray per tensor op |
| Backward traversal | Topological sort over ~10K nodes | Topological sort over ~30 nodes |
| Inner loop execution | V8 JIT over many small closures | V8 JIT over tight array loops |
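To make the contrast concrete, here is a hedged sketch of the fused style, with an illustrative `Tensor` class that is not the main branch's actual code: one `matmul` call does all the scalar arithmetic in plain loops and registers a single backward closure.

```ts
// Sketch: one tensor op = one graph node with a fused backward closure.
// This Tensor class is illustrative; the main branch's class differs.
class Tensor {
  grad: Float64Array;
  constructor(
    public data: Float64Array, // row-major [rows, cols]
    public rows: number,
    public cols: number,
    public _prev: Tensor[] = [],
    public _backward: () => void = () => {},
  ) {
    this.grad = new Float64Array(data.length);
  }

  // [m, k] x [k, n] -> [m, n]: thousands of scalar ops, one graph node.
  matmul(other: Tensor): Tensor {
    const m = this.rows, k = this.cols, n = other.cols;
    const out = new Tensor(new Float64Array(m * n), m, n, [this, other]);
    for (let i = 0; i < m; i++)
      for (let j = 0; j < n; j++) {
        let s = 0;
        for (let p = 0; p < k; p++) s += this.data[i * k + p] * other.data[p * n + j];
        out.data[i * n + j] = s;
      }
    out._backward = () => {
      // dA = dOut x B^T and dB = A^T x dOut, in tight TypedArray loops.
      for (let i = 0; i < m; i++)
        for (let p = 0; p < k; p++) {
          let s = 0;
          for (let j = 0; j < n; j++) s += out.grad[i * n + j] * other.data[p * n + j];
          this.grad[i * k + p] += s;
        }
      for (let p = 0; p < k; p++)
        for (let j = 0; j < n; j++) {
          let s = 0;
          for (let i = 0; i < m; i++) s += this.data[i * k + p] * out.grad[i * n + j];
          other.grad[p * n + j] += s;
        }
    };
    return out;
  }
}
```

The scalar version allocates nearly 1,500 Value nodes for the same [16] × [16, 48] multiply; here it is one node, and V8 gets long, monomorphic loops to optimize.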
Karpathy's choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. This port preserves that choice.
TypeScript's type system makes the structure visible: `ModelSpec` describes what the model is, `Model` is the behavior, `Trained` is a frozen snapshot. But the autograd is the same scalar-level approach as the Python original — no tensor tricks, no fused kernels, just `Value.add()` and `Value.mul()` all the way down.
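A sketch of how those three types might relate; the type names come from the code, but every member below is a guess for illustration:

```ts
// Illustrative only: ModelSpec, Model, and Trained are the port's type
// names, but these members are guesses, not the actual definitions.
interface ModelSpec {
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
  vocabSize: number;
}

interface Model {
  spec: ModelSpec;
  params: Value[];                     // every learnable scalar
  forward(context: number[]): Value[]; // next-token logits
}

interface Trained {
  readonly spec: ModelSpec;
  readonly weights: readonly number[]; // frozen snapshot of params[i].data
}
```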
