microGPT.ts

A TypeScript port of Karpathy's microGPT — but written in the style of Conal Elliott's denotational design: types first, meanings first, implementation as a consequence.

Karpathy's original microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit but raises the level of abstraction — leaning into TypeScript's type system to make the structure of a language model legible, not just the math.

Denotational Design, Applied

Conal Elliott's core idea: give a simple mathematical meaning (denotation) for each type, then define operations as if they work on meanings, not representations. The implementation is free to differ for efficiency, but must be observationally equivalent to the denotation.

Here, the "meanings" are:

| Type | Denotation (meaning) |
|------|----------------------|
| Tensor | A shaped array of scalars with attached gradient and backward function — i.e., a node in a computation graph |
| ModelSpec | The what of a transformer: vocab size, dimensions, heads, layers — a pure description with no behavior |
| Model | A triple of (spec, initParams, forward) — a model is its specification plus two functions |
| Trained | A triple of (tokenizer, model, params) — a frozen snapshot: everything needed to generate |

The key move: Model is not a class with hidden state. It's a plain record of functions. makeTransformerLanguageModel takes a ModelSpec and returns a Model — a function from specification to behavior. This is the denotational design pattern: separate the what (spec) from the how (init + forward), and make the connection between them explicit and total.
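A minimal sketch of what these three interfaces might look like in TypeScript; the field names and signatures here are illustrative assumptions, not copied from main.ts:

```ts
// Sketch only: names and shapes are assumptions that mirror the table above.
declare class Tensor {} // shaped Float32Array + grad + backward closure (shown later in this README)

interface ModelSpec {
  vocabSize: number;
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
}

interface Model {
  spec: ModelSpec;
  initParams(seed: number): Tensor[];                  // spec -> fresh weights
  forward(params: Tensor[], tokens: number[]): Tensor; // tokens -> next-token logits
}

interface Trained {
  tokenizer: { encode(s: string): number[]; decode(ids: number[]): string };
  model: Model;
  params: Tensor[];
}

// The "function from specification to behavior":
declare function makeTransformerLanguageModel(spec: ModelSpec): Model;
```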

Architecture

[Mermaid diagram: end-to-end model architecture]

Forward Pass Detail

Each transformer layer follows the now-standard pre-norm pattern:

[Mermaid diagram: pre-norm transformer layer]
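
In code, the pattern the diagram describes is roughly the following; Tensor and the helper names (rmsNorm, causalSelfAttention, reluMLP, add) are placeholders, not the exact identifiers in main.ts:

```ts
// A minimal sketch of one pre-norm block, assuming helpers with these
// (placeholder) names: normalize before each sub-layer, then add the residual.
declare class Tensor {}
declare function rmsNorm(x: Tensor): Tensor;
declare function causalSelfAttention(x: Tensor): Tensor;
declare function reluMLP(x: Tensor): Tensor;
declare function add(a: Tensor, b: Tensor): Tensor;

function transformerBlock(x: Tensor): Tensor {
  x = add(x, causalSelfAttention(rmsNorm(x))); // attention sub-layer + residual
  x = add(x, reluMLP(rmsNorm(x)));             // MLP sub-layer + residual
  return x;
}
```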

The Seven Components

Following Karpathy's decomposition — every LLM has exactly these parts, and nothing else:

  1. Dataset — ~32k names fetched from Karpathy's makemore repo
  2. Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27); sketched in code after this list
  3. Autograd — Micrograd-style reverse-mode AD on flat Float32Array tensors
  4. Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
  5. Loss — Cross-entropy over next-token predictions
  6. Optimizer — Adam with bias correction and linear learning rate decay
  7. Sampling — Temperature-controlled autoregressive generation
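
For component 2, a character-level tokenizer over names is small enough to sketch in full. The function names and the choice of id 0 for the shared BOS/EOS token are assumptions for illustration, not the exact code in main.ts:

```ts
// 26 lowercase letters plus one shared BOS/EOS token at id 0 (vocab size 27).
const BOS = 0; // also serves as EOS
const chars = "abcdefghijklmnopqrstuvwxyz";

function encode(name: string): number[] {
  // "emma" -> [0, 5, 13, 13, 1, 0]  (BOS, e, m, m, a, EOS)
  const ids = [...name].map((c) => chars.indexOf(c) + 1);
  return [BOS, ...ids, BOS];
}

function decode(ids: number[]): string {
  return ids
    .filter((id) => id !== BOS)
    .map((id) => chars[id - 1])
    .join("");
}
```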

Running

This is a Val Town script val. Run it directly — it will train for 1000 steps on CPU and generate 20 sample names:

```
num docs: 32033
vocab size: 27
num params: 4795
step    1 / 1000 | loss 3.5062 | 0.0s
step  101 / 1000 | loss 2.7573 | 1.2s
...
step 1000 / 1000 | loss 2.2891 | 11.8s

--- generation ---
sample  1: malede
sample  2: jara
sample  3: kaylin
...
```

Hyperparameters

Matched to Karpathy's defaults:

| Parameter | Value | Notes |
|-----------|-------|-------|
| dModel | 16 | Embedding dimension |
| nHeads | 4 | Attention heads (head dim = 4) |
| nLayers | 1 | Transformer blocks |
| dFF | 64 | FF hidden dim (4× dModel) |
| maxLen | 8 | Context window |
| steps | 1000 | Training iterations |
| learningRate | 0.01 | With linear decay to 0 |
| seed | 42 | Deterministic initialization |
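
For reference, the same settings as a config literal; the field names mirror the table, not necessarily main.ts:

```ts
// Hyperparameters above, matched to Karpathy's defaults.
const spec = {
  vocabSize: 27, // from the tokenizer: 26 letters + BOS/EOS
  dModel: 16,
  nHeads: 4,
  nLayers: 1,
  dFF: 64,
  maxLen: 8,
};
const training = { steps: 1000, learningRate: 0.01, seed: 42 };
```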

Why is this ~60x faster than Karpathy's?

Karpathy's Python version takes about 1 minute on a MacBook. This TypeScript version runs in about 1 second. Same algorithm, same hyperparameters, same 1000 steps. The difference comes down to one architectural decision in the autograd:

Karpathy: scalar Value objects

Karpathy's autograd operates on individual scalar numbers, each wrapped in a Value object:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data              # one float
        self.grad = 0                 # one float
        self._children = children
        self._local_grads = local_grads
```

A 16-dim embedding lookup creates 16 Value objects. A matmul of [7, 16] × [16, 48] creates 7 × 48 = 336 new Value objects, each individually linked in the graph. Over a single forward pass, the computation graph grows to thousands of Value nodes, each requiring a Python object allocation, pointer tracking, and a topological sort during .backward().

The graph must also be walked one scalar at a time during backprop. Python's interpreter overhead per operation is ~100ns, and there are tens of thousands of operations per step.

Ours: Tensor with fused backward closures

Our autograd operates on flat Float32Array buffers with shape metadata:

```ts
class Tensor {
  data: Float32Array;    // all scalars in one contiguous buffer
  grad: Float32Array;    // same size, contiguous
  shape: Shape;
  _backward: () => void; // one closure for the whole operation
  _parents: Tensor[];    // typically 1-2 parents, not thousands
}
```

A matmul of [7, 16] × [16, 48] creates one Tensor node in the graph, with a single backward closure that loops over all 7 × 48 × 16 = 5,376 multiply-adds in tight for loops. The computation graph for an entire forward pass has only ~30-40 nodes (not thousands).
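
As a concrete illustration, here is roughly what such a fused node looks like. The minimal Tensor class and the matmul signature below are assumptions that mirror the fields shown above, not the exact code in main.ts:

```ts
// A self-contained sketch: one Tensor per matmul, one backward closure that
// accumulates every gradient contribution in tight loops.
class Tensor {
  grad: Float32Array;
  _backward: () => void = () => {};
  _parents: Tensor[] = [];
  constructor(public data: Float32Array, public shape: number[]) {
    this.grad = new Float32Array(data.length);
  }
}

function matmul(a: Tensor, b: Tensor): Tensor {
  const [m, k] = a.shape, n = b.shape[1];
  const out = new Tensor(new Float32Array(m * n), [m, n]);
  // Forward: fill the whole output buffer in one pass of plain loops.
  for (let i = 0; i < m; i++)
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let p = 0; p < k; p++) acc += a.data[i * k + p] * b.data[p * n + j];
      out.data[i * n + j] = acc;
    }
  out._parents = [a, b];
  // Backward: one closure computes dA += dOut * B^T and dB += A^T * dOut.
  out._backward = () => {
    for (let i = 0; i < m; i++)
      for (let j = 0; j < n; j++) {
        const g = out.grad[i * n + j];
        for (let p = 0; p < k; p++) {
          a.grad[i * k + p] += g * b.data[p * n + j];
          b.grad[p * n + j] += g * a.data[i * k + p];
        }
      }
  };
  return out;
}
```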

Where the speedup comes from

FactorKarpathy (Python scalars)Ours (TS tensors)
Graph nodes per forward pass~10,000+ Value objects~30-40 Tensor objects
Object allocationOne Python object per scalarOne TypedArray per operation
Backward traversalTopological sort over ~10K nodesTopological sort over ~30 nodes
Inner loop executionPython interpreter (~100ns/op)V8 JIT-compiled tight loops (~1-2ns/op)
Memory layoutScattered Python float objects on heapContiguous Float32Array buffers
Per-operation overheadDict lookup + pointer chasing per scalarOne closure call, then raw array math

The dominant factor is graph granularity: Karpathy has one autograd node per scalar multiplication; we have one autograd node per matrix multiplication. It is the same distinction as micrograd (scalar autograd) versus PyTorch (tensor autograd), just implemented within a single file.
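
The backward traversal over that small graph is itself short. A sketch, reusing the minimal Tensor from the matmul example above (so still an assumption rather than main.ts verbatim):

```ts
// Topologically sort the handful of Tensor nodes reachable from the loss,
// seed d(loss)/d(loss) = 1, then call each fused backward closure once.
function backward(loss: Tensor): void {
  const order: Tensor[] = [];
  const seen = new Set<Tensor>();
  const visit = (t: Tensor): void => {
    if (seen.has(t)) return;
    seen.add(t);
    for (const p of t._parents) visit(p);
    order.push(t);
  };
  visit(loss);
  loss.grad.fill(1);
  for (const t of order.reverse()) t._backward();
}
```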

Karpathy is fully aware of this — his blog post says: "Production systems use tensors (large multi-dimensional arrays of numbers)... The math is identical, just corresponds to many scalars processed in parallel." His choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. Our choice of tensor operations is equally deliberate: it shows that the same algorithm, lifted to operate on arrays, becomes fast enough to run in a second.

This is the denotational design lens: both implementations denote the same thing (a computation graph with reverse-mode AD), but the representations differ. Karpathy's representation makes the calculus legible. Ours makes the linear algebra legible.

Why TypeScript?

Karpathy's Python version is the irreducible essence of a language model. This version asks: what if we took that essence and gave it more structure? TypeScript's interfaces (ModelSpec, Model, Trained) make the architecture of the architecture visible — you can see the separation of concerns that's implicit in the Python version.

In Conal Elliott's terms: the Python version is the implementation; this version tries to also show the denotation.
