A TypeScript port of Karpathy's microGPT — but written in the style of Conal Elliott's denotational design: types first, meanings first, implementation as a consequence.
Karpathy's original microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit but raises the level of abstraction — leaning into TypeScript's type system to make the structure of a language model legible, not just the math.
Conal Elliott's core idea: give a simple mathematical meaning (denotation) for each type, then define operations as if they work on meanings, not representations. The implementation is free to differ for efficiency, but must be observationally equivalent to the denotation.
Here, the "meanings" are:
| Type | Denotation (meaning) |
|---|---|
| `Tensor` | A shaped array of scalars with attached gradient and backward function — i.e., a node in a computation graph |
| `ModelSpec` | The *what* of a transformer: vocab size, dimensions, heads, layers — a pure description with no behavior |
| `Model` | A triple of (spec, initParams, forward) — a model is its specification plus two functions |
| `Trained` | A triple of (tokenizer, model, params) — a frozen snapshot: everything needed to generate |
The key move: `Model` is not a class with hidden state. It's a plain record of functions. `makeTransformerLanguageModel` takes a `ModelSpec` and returns a `Model` — a function from specification to behavior. This is the denotational design pattern: separate the what (spec) from the how (init + forward), and make the connection between them explicit and total.
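Concretely, the shape of these types might be sketched like this. This is a hedged sketch: only the groupings described above come from the repo, while the exact field names and function signatures are illustrative.

```ts
type Shape = number[];

// A node in the computation graph (see the autograd section below).
interface Tensor {
  data: Float32Array;
  grad: Float32Array;
  shape: Shape;
}

// The *what*: a pure description with no behavior.
interface ModelSpec {
  vocabSize: number;
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
}

// A model is its spec plus two functions; no hidden state, no class.
interface Model {
  spec: ModelSpec;
  initParams(seed: number): Tensor[];                  // assumed signature
  forward(params: Tensor[], tokens: number[]): Tensor; // assumed signature: returns logits
}

// A frozen snapshot: everything needed to generate.
interface Trained {
  tokenizer: { encode(s: string): number[]; decode(ids: number[]): string };
  model: Model;
  params: Tensor[];
}

// From specification to behavior.
declare function makeTransformerLanguageModel(spec: ModelSpec): Model;
```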
Each transformer layer follows the now-standard pre-norm pattern: normalize before each sub-layer, then add the residual.
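A hedged sketch of that structure in TypeScript (types and helper names are illustrative, not the repo's exact functions):

```ts
// A hedged sketch of one pre-norm block; types and helper names are illustrative.
type Tensor = { data: Float32Array; grad: Float32Array; shape: number[] };
type LayerParams = {
  attnNorm: Tensor; attn: Record<string, Tensor>;
  mlpNorm: Tensor;  mlp: Record<string, Tensor>;
};

declare function add(a: Tensor, b: Tensor): Tensor;
declare function rmsnorm(x: Tensor, weight: Tensor): Tensor;
declare function attention(x: Tensor, params: Record<string, Tensor>): Tensor; // causal, multi-head
declare function mlp(x: Tensor, params: Record<string, Tensor>): Tensor;       // ReLU MLP

// Normalize *before* each sub-layer, then add the residual.
function block(x: Tensor, layer: LayerParams): Tensor {
  x = add(x, attention(rmsnorm(x, layer.attnNorm), layer.attn)); // x + Attn(Norm(x))
  x = add(x, mlp(rmsnorm(x, layer.mlpNorm), layer.mlp));         // x + MLP(Norm(x))
  return x;
}
```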
Following Karpathy's decomposition — every LLM has exactly these parts, and nothing else:
- Dataset — ~32k names fetched from Karpathy's `makemore` repo
- Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27); see the sketch after this list
- Autograd — Micrograd-style reverse-mode AD on flat `Float32Array` tensors
- Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
- Loss — Cross-entropy over next-token predictions
- Optimizer — Adam with bias correction and linear learning rate decay
- Sampling — Temperature-controlled autoregressive generation
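As a taste of how small each piece is, here is a hedged sketch of the character-level tokenizer (26 letters plus a shared BOS/EOS token, vocab size 27). The encode/decode convention shown is illustrative, not necessarily the repo's exact one.

```ts
// Character-level tokenizer: 'a'..'z' -> 1..26, with token 0 doing
// double duty as BOS and EOS (one common convention; illustrative).
const BOS_EOS = 0;
const VOCAB_SIZE = 27;

function encode(name: string): number[] {
  const chars = [...name.toLowerCase()].map((c) => c.charCodeAt(0) - 96); // 'a' -> 1
  return [BOS_EOS, ...chars, BOS_EOS]; // wrap the document in BOS/EOS
}

function decode(ids: number[]): string {
  return ids
    .filter((id) => id !== BOS_EOS)
    .map((id) => String.fromCharCode(96 + id))
    .join("");
}

// encode("emma") -> [0, 5, 13, 13, 1, 0]
```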
This is a Val Town script val. Run it directly — it will train for 1000 steps on CPU and generate 20 sample names:
```
num docs: 32033
vocab size: 27
num params: 4795
step 1 / 1000 | loss 3.5062 | 0.0s
step 101 / 1000 | loss 2.7573 | 1.2s
...
step 1000 / 1000 | loss 2.2891 | 11.8s
--- generation ---
sample 1: malede
sample 2: jara
sample 3: kaylin
...
```
Matched to Karpathy's defaults:
| Parameter | Value | Notes |
|---|---|---|
| `dModel` | 16 | Embedding dimension |
| `nHeads` | 4 | Attention heads (head dim = 4) |
| `nLayers` | 1 | Transformer blocks |
| `dFF` | 64 | FF hidden dim (4× `dModel`) |
| `maxLen` | 8 | Context window |
| `steps` | 1000 | Training iterations |
| `learningRate` | 0.01 | With linear decay to 0 |
| `seed` | 42 | Deterministic initialization |
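Expressed against the `ModelSpec` sketch from earlier, these defaults split naturally into a model spec and a training config. Grouping the training-side values into one object is an assumption for illustration, not necessarily how the repo organizes them.

```ts
// Model-shape defaults (fields named in the table above).
const spec = {
  vocabSize: 27, // 26 letters + BOS/EOS, from the tokenizer
  dModel: 16,
  nHeads: 4,     // head dim = 16 / 4 = 4
  nLayers: 1,
  dFF: 64,       // 4 × dModel
  maxLen: 8,
};

// Training-side defaults; grouping them into a config object is an assumption.
const config = {
  steps: 1000,
  learningRate: 0.01, // decayed linearly to 0 over `steps`
  seed: 42,
};
```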
Karpathy's Python version takes about 1 minute on a MacBook. This TypeScript version runs in about 1 second. Same algorithm, same hyperparameters, same 1000 steps. The difference comes down to one architectural decision in the autograd:
Karpathy's autograd operates on individual scalar numbers, each wrapped in a Value object:
```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data          # one float
        self.grad = 0             # one float
        self._children = children
        self._local_grads = local_grads
```
A 16-dim embedding lookup creates 16 Value objects. A matmul of [7, 16] × [16, 48] creates 7 × 48 = 336 new Value objects, each individually linked in the graph. Over a single forward pass, the computation graph grows to thousands of Value nodes, each requiring a Python object allocation, pointer tracking, and a topological sort during .backward().
The graph must also be walked one scalar at a time during backprop. Python's interpreter overhead per operation is ~100ns, and there are tens of thousands of operations per step.
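To see the granularity concretely, here is an illustrative micrograd-style scalar node in TypeScript (not the repo's code, and only loosely mirroring Karpathy's `Value`), plus a count of the graph nodes one small matmul produces: the 336 output elements are only the tip, since every intermediate multiply and add is a node too.

```ts
// Minimal scalar autograd node, micrograd-style (illustrative, not the repo's Tensor).
class Value {
  grad = 0;
  constructor(
    public data: number,
    public children: Value[] = [],
    public localGrads: number[] = [],
  ) {}
  mul(other: Value): Value {
    return new Value(this.data * other.data, [this, other], [other.data, this.data]);
  }
  add(other: Value): Value {
    return new Value(this.data + other.data, [this, other], [1, 1]);
  }
}

// A [7, 16] x [16, 48] matmul at scalar granularity:
// each output element needs 16 muls and 15 adds -> 31 nodes,
// so 7 * 48 * 31 = 10,416 graph nodes for one matrix product.
let nodes = 0;
const a = Array.from({ length: 7 }, () => Array.from({ length: 16 }, () => new Value(Math.random())));
const b = Array.from({ length: 16 }, () => Array.from({ length: 48 }, () => new Value(Math.random())));
for (let i = 0; i < 7; i++) {
  for (let j = 0; j < 48; j++) {
    let acc = a[i][0].mul(b[0][j]); nodes++;
    for (let k = 1; k < 16; k++) {
      acc = acc.add(a[i][k].mul(b[k][j])); nodes += 2; // one mul node + one add node
    }
  }
}
console.log(nodes); // 10416
```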
Our autograd operates on flat Float32Array buffers with shape metadata:
```ts
class Tensor {
  data: Float32Array;      // all scalars in one contiguous buffer
  grad: Float32Array;      // same size, contiguous
  shape: Shape;
  _backward: () => void;   // one closure for the whole operation
  _parents: Tensor[];      // typically 1-2 parents, not thousands
}
```
A matmul of [7, 16] × [16, 48] creates one Tensor node in the graph, with a single backward closure that loops over all 7 × 48 × 16 = 5,376 multiply-adds in tight for loops. The computation graph for an entire forward pass has only ~30-40 nodes (not thousands).
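For contrast with the scalar version, here is a hedged, self-contained sketch of a tensor-granularity matmul plus the graph walk. It is simplified relative to the repo; the fields follow the `Tensor` shape above, but the exact code differs.

```ts
type Shape = number[];

// Minimal tensor node matching the fields above (simplified).
class Tensor {
  grad: Float32Array;
  _backward: () => void = () => {};
  _parents: Tensor[] = [];
  constructor(public data: Float32Array, public shape: Shape) {
    this.grad = new Float32Array(data.length);
  }
}

// One graph node for the whole [m, k] x [k, n] product.
function matmul(a: Tensor, b: Tensor): Tensor {
  const [m, k] = a.shape;
  const n = b.shape[1];
  const out = new Tensor(new Float32Array(m * n), [m, n]);
  for (let i = 0; i < m; i++)        // forward: tight loops over contiguous buffers
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let p = 0; p < k; p++) acc += a.data[i * k + p] * b.data[p * n + j];
      out.data[i * n + j] = acc;
    }
  out._parents = [a, b];
  out._backward = () => {            // one closure: dA += dOut * B^T, dB += A^T * dOut
    for (let i = 0; i < m; i++)
      for (let j = 0; j < n; j++) {
        const g = out.grad[i * n + j];
        for (let p = 0; p < k; p++) {
          a.grad[i * k + p] += g * b.data[p * n + j];
          b.grad[p * n + j] += g * a.data[i * k + p];
        }
      }
  };
  return out;
}

// Reverse-mode AD over the whole graph: topological sort, then run each closure once.
function backward(root: Tensor): void {
  const topo: Tensor[] = [];
  const seen = new Set<Tensor>();
  const visit = (t: Tensor) => {
    if (seen.has(t)) return;
    seen.add(t);
    for (const p of t._parents) visit(p);
    topo.push(t);
  };
  visit(root);
  root.grad.fill(1); // seed d(root)/d(root) = 1
  for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
}
```

For the [7, 16] × [16, 48] product this still performs 5,376 multiply-adds, but they run inside one JIT-compiled loop nest, and the graph gains exactly one node.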
| Factor | Karpathy (Python scalars) | Ours (TS tensors) |
|---|---|---|
| Graph nodes per forward pass | ~10,000+ Value objects | ~30-40 Tensor objects |
| Object allocation | One Python object per scalar | One TypedArray per operation |
| Backward traversal | Topological sort over ~10K nodes | Topological sort over ~30 nodes |
| Inner loop execution | Python interpreter (~100ns/op) | V8 JIT-compiled tight loops (~1-2ns/op) |
| Memory layout | Scattered Python float objects on heap | Contiguous Float32Array buffers |
| Per-operation overhead | Dict lookup + pointer chasing per scalar | One closure call, then raw array math |
The dominant factor is graph granularity: Karpathy has one autograd node per scalar multiplication, while we have one autograd node per matrix multiplication. This is the same distinction as between micrograd (scalar autograd) and PyTorch (tensor autograd), just reproduced within a single file.
Karpathy is fully aware of this — his blog post says: "Production systems use tensors (large multi-dimensional arrays of numbers)... The math is identical, just corresponds to many scalars processed in parallel." His choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. Our choice of tensor operations is equally deliberate: it shows that the same algorithm, lifted to operate on arrays, becomes fast enough to run in a second.
This is the denotational design lens: both implementations denote the same thing (a computation graph with reverse-mode AD), but the representations differ. Karpathy's representation makes the calculus legible. Ours makes the linear algebra legible.
Karpathy's Python version is the irreducible essence of a language model. This version asks: what if we took that essence and gave it more structure? TypeScript's interfaces (`ModelSpec`, `Model`, `Trained`) make the architecture of the architecture visible — you can see the separation of concerns that's implicit in the Python version.
In Conal Elliott's terms: the Python version is the implementation; this version tries to also show the denotation.
