
# microGPT.ts (scalar autograd edition)

A direct TypeScript port of Karpathy's microGPT — faithful to the original's scalar Value-based autograd, not optimized for speed.

Karpathy's microGPT is a single Python file that trains and inferences a GPT with zero dependencies. As he put it: "This is the full algorithmic content of what is needed. Everything else is just efficiency." This port preserves that spirit at the same level of abstraction — every scalar is a Value node in a dynamically-built computation graph, just like the original.

## Scalar Autograd: Same as Karpathy

The core design choice: every number is a Value object.

```ts
class Value {
  data: number;          // one scalar
  grad: number;          // one scalar
  _backward: () => void;
  _prev: Value[];        // parent nodes in the graph
}
```

A 16-dim embedding lookup creates 16 Value objects. A vector-matrix multiply of [16] × [16, 48] creates 48 dot products, each made of 16 multiplies and 15 adds — over 1,000 new Value nodes. A single forward pass builds a computation graph with tens of thousands of nodes, each individually linked and walked during .backward().

This is the same approach as Karpathy's Python Value class. It makes the chain rule visible at every single scalar operation.
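
For concreteness, here is a minimal sketch of how such a class can wire and walk the graph. This is illustrative, written in the spirit of value.ts rather than copied from it:

```ts
// Minimal scalar autograd sketch (illustrative, not the exact code in value.ts).
class Value {
  grad = 0;
  _backward: () => void = () => {};
  constructor(public data: number, public _prev: Value[] = []) {}

  add(other: Value): Value {
    const out = new Value(this.data + other.data, [this, other]);
    out._backward = () => {
      this.grad += out.grad;              // d(a+b)/da = 1
      other.grad += out.grad;             // d(a+b)/db = 1
    };
    return out;
  }

  mul(other: Value): Value {
    const out = new Value(this.data * other.data, [this, other]);
    out._backward = () => {
      this.grad += other.data * out.grad; // d(a*b)/da = b
      other.grad += this.data * out.grad; // d(a*b)/db = a
    };
    return out;
  }

  backward(): void {
    // Topologically sort the graph, then apply each node's local rule in reverse.
    const topo: Value[] = [];
    const seen = new Set<Value>();
    const visit = (v: Value) => {
      if (seen.has(v)) return;
      seen.add(v);
      v._prev.forEach(visit);
      topo.push(v);
    };
    visit(this);
    this.grad = 1;
    for (let i = topo.length - 1; i >= 0; i--) topo[i]._backward();
  }
}

// c = a*b + a, so dc/da = b + 1 and dc/db = a.
const a = new Value(2), b = new Value(3);
const c = a.mul(b).add(a);
c.backward();
console.log(a.grad, b.grad); // 4 2
```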

## Architecture

Same GPT-2-style transformer as the original:

*(Mermaid diagram of the transformer architecture, not rendered here.)*
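
In lieu of the rendered diagram, the per-token data flow is roughly the following (a sketch assembled from the component list below; the identifiers are illustrative, not the val's actual names):

```ts
// Per-token data flow, GPT-2 style with one block (illustrative pseudocode):
//
//   x  = tokenEmbedding[t] + positionEmbedding[p]  // dModel = 16
//   x += attention(rmsNorm(x))                     // causal multi-head attention, residual add
//   x += mlp(rmsNorm(x))                           // ReLU MLP with dFF = 64, residual add
//   logits = rmsNorm(x) · tokenEmbeddingᵀ          // weight tying, vocab size 27
```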

## The Seven Components

Following Karpathy's decomposition:

  1. Dataset — ~32k names fetched from Karpathy's makemore repo
  2. Tokenizer — Character-level: 26 letters + 1 BOS/EOS token (vocab size 27); see the sketch after this list
  3. Autograd — Scalar Value class with add, mul, pow, exp, log, relu — each operation creates a node in the graph
  4. Architecture — 1-layer GPT-2-style transformer (RMSNorm, causal attention, ReLU MLP, weight tying)
  5. Loss — Cross-entropy over next-token predictions
  6. Optimizer — Adam with bias correction and linear learning rate decay
  7. Sampling — Temperature-controlled autoregressive generation
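
A minimal sketch of the character-level tokenizer from item 2 (the token ids here are an assumption; tokenizer.ts may assign them differently):

```ts
// Character-level tokenizer sketch: 26 letters + 1 combined BOS/EOS = vocab 27.
// Assumes id 0 for BOS/EOS; the actual tokenizer.ts may differ.
const BOS_EOS = 0;
const chars = "abcdefghijklmnopqrstuvwxyz";
const vocabSize = chars.length + 1; // 27

const encode = (name: string): number[] => [
  BOS_EOS,
  ...[...name].map((c) => chars.indexOf(c) + 1), // 'a' -> 1, ..., 'z' -> 26
  BOS_EOS,
];

const decode = (ids: number[]): string =>
  ids.filter((i) => i !== BOS_EOS).map((i) => chars[i - 1]).join("");

console.log(encode("emma")); // [0, 5, 13, 13, 1, 0]
```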

## Running

This is a Val Town script val. It trains for 1000 steps on CPU. Because every scalar is its own autograd node, it runs roughly 60x slower than a tensor-based implementation — but the code is simpler and more transparent:

```
num docs: 32033
vocab size: 27
num params: 4795
step    1 / 1000 | loss 3.5062 | 0.0s
step  101 / 1000 | loss 2.7573 | ...s
...
step 1000 / 1000 | loss 2.2891 | ...s

--- generation ---
sample  1: malede
sample  2: jara
...
```
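
The samples come from component 7, temperature-controlled autoregressive generation. A minimal sketch of that loop (the `forward` signature and the BOS/EOS id of 0 are assumptions, not the val's actual API):

```ts
// Temperature-controlled sampling sketch. Assumes `forward` returns one plain
// logit per vocab entry and that token 0 is the combined BOS/EOS marker.
const BOS_EOS = 0;

function sample(forward: (ctx: number[]) => number[], temperature = 1.0): number[] {
  const out: number[] = [BOS_EOS];                  // start from BOS
  for (let step = 0; step < 16; step++) {
    const logits = forward(out.slice(-8));          // maxLen = 8 context window
    const m = Math.max(...logits);                  // subtract max for stability
    const exps = logits.map((l) => Math.exp((l - m) / temperature));
    const z = exps.reduce((s, e) => s + e, 0);
    let r = Math.random() * z;
    let next = exps.length - 1;
    for (let i = 0; i < exps.length; i++) {
      r -= exps[i];
      if (r <= 0) { next = i; break; }
    }
    if (next === BOS_EOS) break;                    // EOS ends the name
    out.push(next);
  }
  return out.slice(1);                              // drop the leading BOS
}
```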

## Hyperparameters

Matched to Karpathy's defaults:

| Parameter | Value | Notes |
| --- | --- | --- |
| dModel | 16 | Embedding dimension |
| nHeads | 4 | Attention heads (head dim = 4) |
| nLayers | 1 | Transformer blocks |
| dFF | 64 | FF hidden dim (4× dModel) |
| maxLen | 8 | Context window |
| steps | 1000 | Training iterations |
| learningRate | 0.01 | With linear decay to 0 |
| seed | 42 | Deterministic initialization |
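
The optimizer rows translate to a standard Adam step with bias correction plus the linear decay noted above. A minimal sketch, assuming the usual beta and epsilon defaults (the val's actual constants may differ):

```ts
// Adam step with bias correction and linear learning-rate decay (sketch).
// beta1, beta2, and eps are assumed standard defaults, not confirmed values.
const beta1 = 0.9, beta2 = 0.999, eps = 1e-8;
const steps = 1000, learningRate = 0.01;

type Param = { data: number; grad: number; m: number; v: number };

function adamStep(p: Param, t: number): void {
  const lr = learningRate * (1 - t / steps);     // linear decay to 0
  p.m = beta1 * p.m + (1 - beta1) * p.grad;      // first-moment EMA
  p.v = beta2 * p.v + (1 - beta2) * p.grad ** 2; // second-moment EMA
  const mHat = p.m / (1 - beta1 ** (t + 1));     // bias correction
  const vHat = p.v / (1 - beta2 ** (t + 1));
  p.data -= (lr * mHat) / (Math.sqrt(vHat) + eps);
}
```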

## Why is this slow?

This version uses scalar autograd — the same approach as Karpathy's original Python. Every addition, multiplication, and activation creates a new Value object in the computation graph. A single forward pass creates tens of thousands of nodes, and backward() must topologically sort and walk all of them.

The main branch uses tensor autograd instead — a single matmul operation creates one graph node with a fused backward closure that loops over all scalars in tight for loops. Same math, ~60x faster, but the individual scalar chain rule is hidden inside the closure.

| Factor | This version (scalar Values) | Main branch (Tensors) |
| --- | --- | --- |
| Graph nodes per forward pass | ~10,000+ Value objects | ~30-40 Tensor objects |
| Object allocation | One object per scalar op | One TypedArray per tensor op |
| Backward traversal | Topological sort over ~10K nodes | Topological sort over ~30 nodes |
| Inner loop execution | V8 JIT over many small closures | V8 JIT over tight array loops |
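
A minimal sketch of the tensor idea (illustrative, not the main branch's actual API): one matmul yields a single graph node, and its fused backward closure runs tight loops over plain typed arrays instead of thousands of per-scalar closures.

```ts
// One graph node per matmul, with a fused backward closure (sketch).
class Tensor {
  grad: Float64Array;
  _backward: () => void = () => {};
  constructor(public data: Float64Array, public rows: number, public cols: number) {
    this.grad = new Float64Array(data.length);
  }
}

function matmul(a: Tensor, b: Tensor): Tensor {
  const out = new Tensor(new Float64Array(a.rows * b.cols), a.rows, b.cols);
  for (let i = 0; i < a.rows; i++)
    for (let j = 0; j < b.cols; j++) {
      let s = 0;
      for (let k = 0; k < a.cols; k++) s += a.data[i * a.cols + k] * b.data[k * b.cols + j];
      out.data[i * b.cols + j] = s;
    }
  out._backward = () => {
    // dA = dOut · Bᵀ and dB = Aᵀ · dOut, in tight loops over plain arrays.
    for (let i = 0; i < a.rows; i++)
      for (let j = 0; j < b.cols; j++) {
        const g = out.grad[i * b.cols + j];
        for (let k = 0; k < a.cols; k++) {
          a.grad[i * a.cols + k] += g * b.data[k * b.cols + j];
          b.grad[k * b.cols + j] += g * a.data[i * a.cols + k];
        }
      }
  };
  return out;
}
```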

Karpathy's choice of scalar Value was deliberate: it makes the chain rule visible at the most granular level. This port preserves that choice.

## Why TypeScript?

TypeScript's type system makes the structure visible: ModelSpec describes what the model is, Model is the behavior, Trained is a frozen snapshot. But the autograd is the same scalar-level approach as the Python original — no tensor tricks, no fused kernels, just Value.add() and Value.mul() all the way down.
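
Those three roles might be typed along these lines (a sketch, not the exact definitions in types.ts):

```ts
// Illustrative shapes for the three roles described above.
type Scalar = { data: number; grad: number }; // stand-in for the Value class above

interface ModelSpec {                  // what the model *is*
  dModel: number;
  nHeads: number;
  nLayers: number;
  dFF: number;
  maxLen: number;
  vocabSize: number;
}

interface Model {                      // the behavior
  spec: ModelSpec;
  forward(tokens: number[]): Scalar[]; // logits over the vocab
}

type Trained = Readonly<{              // a frozen snapshot
  spec: ModelSpec;
  params: number[];                    // plain weights, no graph attached
}>;
```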
