A TypeScript port of Karpathy's microGPT — but written in the style of Conal Elliott's denotational design: types first, meanings first, implementation as a consequence.
Karpathy's original microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit but raises the level of abstraction — leaning into TypeScript's type system to make the structure of a language model legible, not just the math.
Conal Elliott's core idea: give a simple mathematical meaning (denotation) for each type, then define operations as if they work on meanings, not representations. The implementation is free to differ for efficiency, but must be observationally equivalent to the denotation.
Here, the "meanings" are:
| Type | Denotation (meaning) |
|---|---|
| `Tensor` | A shaped array of scalars with attached gradient and backward function — i.e., a node in a computation graph |
| `ModelSpec` | The *what* of a transformer: vocab size, dimensions, heads, layers — a pure description with no behavior |
| `Model` | A triple of (spec, initParams, forward) — a model is its specification plus two functions |
| `Trained` | A triple of (tokenizer, model, params) — a frozen snapshot: everything needed to generate |
The key move: `Model` is not a class with hidden state. It's a plain record of functions. `makeTransformerLanguageModel` takes a `ModelSpec` and returns a `Model` — a function from specification to behavior. This is the denotational design pattern: separate the what (spec) from the how (init + forward), and make the connection between them explicit and total.
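Concretely, the shape of these types might be sketched like this. This is a hedged sketch: only the groupings described above come from the repo, while the exact field names and function signatures are illustrative.

```ts
type Shape = number[];

// A node in the computation graph (see the autograd section below).
interface Tensor {
  data: Float32Array;
  grad: Float32Array;
  shape: Shape;
}

// The *what*: a pure description with no behavior.
interface ModelSpec {
  vocabSize: number;
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
}

// A model is its spec plus two functions; no hidden state, no class.
interface Model {
  spec: ModelSpec;
  initParams(seed: number): Tensor[];                  // assumed signature
  forward(params: Tensor[], tokens: number[]): Tensor; // assumed signature: returns logits
}

// A frozen snapshot: everything needed to generate.
interface Trained {
  tokenizer: { encode(s: string): number[]; decode(ids: number[]): string };
  model: Model;
  params: Tensor[];
}

// From specification to behavior.
declare function makeTransformerLanguageModel(spec: ModelSpec): Model;
```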
Each transformer layer follows the now-standard pre-norm pattern: normalize before each sub-layer, then add the residual.
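A hedged sketch of that structure in TypeScript (types and helper names are illustrative, not the repo's exact functions):

```ts
// A hedged sketch of one pre-norm block; types and helper names are illustrative.
type Tensor = { data: Float32Array; grad: Float32Array; shape: number[] };
type LayerParams = {
  attnNorm: Tensor; attn: Record<string, Tensor>;
  mlpNorm: Tensor;  mlp: Record<string, Tensor>;
};

declare function add(a: Tensor, b: Tensor): Tensor;
declare function rmsnorm(x: Tensor, weight: Tensor): Tensor;
declare function attention(x: Tensor, params: Record<string, Tensor>): Tensor; // causal, multi-head
declare function mlp(x: Tensor, params: Record<string, Tensor>): Tensor;       // ReLU MLP

// Normalize *before* each sub-layer, then add the residual.
function block(x: Tensor, layer: LayerParams): Tensor {
  x = add(x, attention(rmsnorm(x, layer.attnNorm), layer.attn)); // x + Attn(Norm(x))
  x = add(x, mlp(rmsnorm(x, layer.mlpNorm), layer.mlp));         // x + MLP(Norm(x))
  return x;
}
```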
Following Karpathy's decomposition — every LLM has exactly these parts, and nothing else:
- Dataset — ~32k names fetched from Karpathy's `makemore` repo
- Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27); see the sketch after this list
- Autograd — Micrograd-style reverse-mode AD on flat `Float32Array` tensors
- Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
- Loss — Cross-entropy over next-token predictions
- Optimizer — Adam with bias correction and linear learning rate decay
- Sampling — Temperature-controlled autoregressive generation
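As a taste of how small each piece is, here is a hedged sketch of the character-level tokenizer (26 letters plus a shared BOS/EOS token, vocab size 27). The encode/decode convention shown is illustrative, not necessarily the repo's exact one.

```ts
// Character-level tokenizer: 'a'..'z' -> 1..26, with token 0 doing
// double duty as BOS and EOS (one common convention; illustrative).
const BOS_EOS = 0;
const VOCAB_SIZE = 27;

function encode(name: string): number[] {
  const chars = [...name.toLowerCase()].map((c) => c.charCodeAt(0) - 96); // 'a' -> 1
  return [BOS_EOS, ...chars, BOS_EOS]; // wrap the document in BOS/EOS
}

function decode(ids: number[]): string {
  return ids
    .filter((id) => id !== BOS_EOS)
    .map((id) => String.fromCharCode(96 + id))
    .join("");
}

// encode("emma") -> [0, 5, 13, 13, 1, 0]
```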
This is a Val Town script val. Run it directly — it will train for 1000 steps on CPU and generate 20 sample names:
```
num docs: 32033
vocab size: 27
num params: 4795
step 1 / 1000 | loss 3.5062 | 0.0s
step 101 / 1000 | loss 2.7573 | 1.2s
...
step 1000 / 1000 | loss 2.2891 | 11.8s
--- generation ---
sample 1: malede
sample 2: jara
sample 3: kaylin
...
```
Matched to Karpathy's defaults:
| Parameter | Value | Notes |
|---|---|---|
| `dModel` | 16 | Embedding dimension |
| `nHeads` | 4 | Attention heads (head dim = 4) |
| `nLayers` | 1 | Transformer blocks |
| `dFF` | 64 | FF hidden dim (4× `dModel`) |
| `maxLen` | 8 | Context window |
| `steps` | 1000 | Training iterations |
| `learningRate` | 0.01 | With linear decay to 0 |
| `seed` | 42 | Deterministic initialization |
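Expressed against the `ModelSpec` sketch from earlier, these defaults split naturally into a model spec and a training config. Grouping the training-side values into one object is an assumption for illustration, not necessarily how the repo organizes them.

```ts
// Model-shape defaults (fields named in the table above).
const spec = {
  vocabSize: 27, // 26 letters + BOS/EOS, from the tokenizer
  dModel: 16,
  nHeads: 4,     // head dim = 16 / 4 = 4
  nLayers: 1,
  dFF: 64,       // 4 × dModel
  maxLen: 8,
};

// Training-side defaults; grouping them into a config object is an assumption.
const config = {
  steps: 1000,
  learningRate: 0.01, // decayed linearly to 0 over `steps`
  seed: 42,
};
```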
Karpathy's Python version takes about 1 minute on a MacBook. This TypeScript version runs in about 1 second. Same algorithm, same hyperparameters, same 1000 steps. The difference comes down to one architectural decision in the autograd:
Karpathy's autograd operates on individual scalar numbers, each wrapped in a Value object:
```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data          # one float
        self.grad = 0             # one float
        self._children = children
        self._local_grads = local_grads
```
A 16-dim embedding lookup creates 16 Value objects. A matmul of [7, 16] × [16, 48] creates 7 × 48 = 336 new Value objects, each individually linked in the graph. Over a single forward pass, the computation graph grows to thousands of Value nodes, each requiring a Python object allocation, pointer tracking, and a topological sort during .backward().
The graph must also be walked one scalar at a time during backprop. Python's interpreter overhead per operation is ~100ns, and there are tens of thousands of operations per step.
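To see the granularity concretely, here is an illustrative micrograd-style scalar node in TypeScript (not the repo's code, and only loosely mirroring Karpathy's `Value`), plus a count of the graph nodes one small matmul produces: the 336 output elements are only the tip, since every intermediate multiply and add is a node too.

```ts
// Minimal scalar autograd node, micrograd-style (illustrative, not the repo's Tensor).
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  mul(other: Value): Value {
    return new Value(this.data * other.data, [this, other], [other.data, this.data]);
  }
  add(other: Value): Value {
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }
}

// A [7, 16] x [16, 48] matmul at scalar granularity:
// each output element needs 16 muls and 15 adds -> 31 nodes,
// so 7 * 48 * 31 = 10,416 graph nodes for one matrix product.
let nodes = 0;
const a = Array.from({ length: 7 }, () => Array.from({ length: 16 }, () => new Value(Math.random())));
const b = Array.from({ length: 16 }, () => Array.from({ length: 48 }, () => new Value(Math.random())));
for (let i = 0; i < 7; i++) {
  for (let j = 0; j < 48; j++) {
    let acc = a[i][0].mul(b[0][j]); nodes++;
    for (let k = 1; k < 16; k++) {
      acc = acc.add(a[i][k].mul(b[k][j])); nodes += 2; // one mul node + one add node
    }
  }
}
console.log(nodes); // 10416
```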
Our autograd operates on flat Float32Array buffers with shape metadata:
```ts
class Tensor {
  data: Float32Array;      // all scalars in one contiguous buffer
  grad: Float32Array;      // same size, contiguous
  shape: Shape;
  _backward: () => void;   // one closure for the whole operation
  _parents: Tensor[];      // typically 1-2 parents, not thousands
}
```
A matmul of [7, 16] × [16, 48] creates one Tensor node in the graph, with a single backward closure that loops over all 7 × 48 × 16 = 5,376 multiply-adds in tight for loops. The computation graph for an entire forward pass has only ~30-40 nodes (not thousands).
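For contrast with the scalar version, here is a hedged, self-contained sketch of a tensor-granularity matmul plus the graph walk. It is simplified relative to the repo; the fields follow the `Tensor` shape above, but the exact code differs.

```ts
type Shape = number[];

// Minimal tensor node matching the fields above (simplified).
class Tensor {
  grad: Float32Array;
  _backward: () => void = () => {};
  _parents: Tensor[] = [];
  constructor(public data: Float32Array, public shape: Shape) {
    this.grad = new Float32Array(data.length);
  }
}

// One graph node for the whole [m, k] x [k, n] product.
function matmul(a: Tensor, b: Tensor): Tensor {
  const [m, k] = a.shape;
  const n = b.shape[1];
  const out = new Tensor(new Float32Array(m * n), [m, n]);
  for (let i = 0; i < m; i++)        // forward: tight loops over contiguous buffers
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let p = 0; p < k; p++) acc += a.data[i * k + p] * b.data[p * n + j];
      out.data[i * n + j] = acc;
    }
  out._parents = [a, b];
  out._backward = () => {            // one closure: dA += dOut * B^T, dB += A^T * dOut
    for (let i = 0; i < m; i++)
      for (let j = 0; j < n; j++) {
        const g = out.grad[i * n + j];
        for (let p = 0; p < k; p++) {
          a.grad[i * k + p] += g * b.data[p * n + j];
          b.grad[p * n + j] += g * a.data[i * k + p];
        }
      }
  };
  return out;
}

// Reverse-mode AD over the whole graph: topological sort, then run each closure once.
function backward(root: Tensor): void {
  const topo: Tensor[] = [];
  const seen = new Set<Tensor>();
  const visit = (t: Tensor) => {
    if (seen.has(t)) return;
    seen.add(t);
    for (const p of t._parents) visit(p);
    topo.push(t);
  };
  visit(root);
  root.grad.fill(1); // seed d(root)/d(root) = 1
  for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
}
```

For the [7, 16] × [16, 48] product this still performs 5,376 multiply-adds, but they run inside one JIT-compiled loop nest, and the graph gains exactly one node.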
| Factor | Karpathy (Python scalars) | Ours (TS tensors) |
|---|---|---|
| Graph nodes per forward pass | ~10,000+ Value objects | ~30-40 Tensor objects |
| Object allocation | One Python object per scalar | One TypedArray per operation |
| Backward traversal | Topological sort over ~10K nodes | Topological sort over ~30 nodes |
| Inner loop execution | Python interpreter (~100ns/op) | V8 JIT-compiled tight loops (~1-2ns/op) |
| Memory layout | Scattered Python float objects on heap | Contiguous Float32Array buffers |
| Per-operation overhead | Dict lookup + pointer chasing per scalar | One closure call, then raw array math |
The dominant factor is graph granularity: Karpathy has one autograd node per scalar multiplication, while we have one autograd node per matrix multiplication. This is the same distinction as between micrograd (scalar autograd) and PyTorch (tensor autograd), just reproduced within a single file.
Karpathy is fully aware of this — his blog post says: "Production systems use tensors (large multi-dimensional arrays of numbers)... The math is identical, just corresponds to many scalars processed in parallel." His choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. Our choice of tensor operations is equally deliberate: it shows that the same algorithm, lifted to operate on arrays, becomes fast enough to run in a second.
This is the denotational design lens: both implementations denote the same thing (a computation graph with reverse-mode AD), but the representations differ. Karpathy's representation makes the calculus legible. Ours makes the linear algebra legible.
Karpathy's Python version is the irreducible essence of a language model. This version asks: what if we took that essence and gave it more structure? TypeScript's interfaces (`ModelSpec`, `Model`, `Trained`) make the architecture of the architecture visible — you can see the separation of concerns that's implicit in the Python version.
In Conal Elliott's terms: the Python version is the implementation; this version tries to also show the denotation.
