# μGPT — Online Learning Micro-GPT

A faithful minimal decoder-only transformer trained online under the prequential protocol. Byte-level, from scratch, in front of you.

## Architecture

```
model/
  core.ts       Config, types, init, RMSNorm, sampling utilities
  forward.ts    Full causal self-attention forward pass (all positions)
  backward.ts   Full backward pass through causal decoder block
  optim.ts      SGD + gradient clipping

mechanisms/
  index.ts      Toggleable research-backed mechanisms:
                - Selective-Backprop (skip easy tokens)
                - Experience Replay (reservoir + priority)
                - Adaptive LR (dual-EMA shift detection)

corpus/
  phases.ts     Training data phases, corpus builder, boundary tracking

engine/
  loop.ts       Prequential training loop coordinating model + mechanisms

ui/
  app.ts        (TODO) Interactive frontend — currently inline in main.ts

main.ts         HTTP entry point serving the interactive page
```

## Model

Decoder-only transformer: 1 layer, 4 heads × head-dim 4 (d_model = 16), MLP 16→64→16, separate lm_head. Pre-norm (RMSNorm), residual stream, ~7K parameters. Byte-level vocabulary of 257 (256 byte values + BOS).
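
The pre-norm step can be sketched as follows (an illustrative version, not the repo's `core.ts`; `g` is the learned per-dimension gain):

```typescript
// RMSNorm: divide by the root-mean-square of the vector, then apply a
// learned gain. Used before the attention and MLP sub-blocks (pre-norm).
function rmsNorm(x: number[], g: number[], eps = 1e-5): number[] {
  const meanSq = x.reduce((s, v) => s + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * g[i]);
}
```

With a unit gain, the output vector always has RMS ≈ 1 regardless of the input's scale, which is what keeps the residual stream numerically stable.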

The forward pass computes hidden states for all positions under a causal mask; the backward pass propagates gradients through the entire sequence in the block. That full sequence-level backward is what makes this a faithful GPT rather than a single-step "attention LM."
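
A single-head version of that causal forward pass might look like this (a minimal sketch with assumed array shapes, not the repo's multi-head `forward.ts`):

```typescript
// x: [T][d] hidden states; wq/wk/wv: [d][d] projection matrices.
function matvec(w: number[][], v: number[]): number[] {
  return w.map(row => row.reduce((s, wi, i) => s + wi * v[i], 0));
}

function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const e = xs.map(v => Math.exp(v - m));
  const z = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / z);
}

function causalAttention(
  x: number[][], wq: number[][], wk: number[][], wv: number[][],
): number[][] {
  const d = x[0].length;
  const q = x.map(h => matvec(wq, h));
  const k = x.map(h => matvec(wk, h));
  const v = x.map(h => matvec(wv, h));
  return x.map((_, t) => {
    // Causal mask: position t attends only to positions 0..t.
    const scores: number[] = [];
    for (let s = 0; s <= t; s++) {
      scores.push(q[t].reduce((a, qi, i) => a + qi * k[s][i], 0) / Math.sqrt(d));
    }
    const w = softmax(scores);
    const out = new Array(d).fill(0);
    for (let s = 0; s <= t; s++)
      for (let i = 0; i < d; i++) out[i] += w[s] * v[s][i];
    return out;
  });
}
```

Because the mask only lets position t see positions ≤ t, perturbing a later token never changes an earlier position's output, which is the property the backward pass relies on.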

## Protocol

Prequential: predict → observe → update. Loss is recorded before the update. Every byte is both evaluation and curriculum. No train/test split.
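
In code the protocol is roughly the following (a sketch against an assumed `OnlineModel` interface, not the repo's `engine/loop.ts`):

```typescript
// Prequential evaluation: score each byte BEFORE training on it.
interface OnlineModel {
  loss(ctx: number[], next: number): number;   // cross-entropy before update
  update(ctx: number[], next: number): void;   // SGD step on the same byte
}

// Because every byte is evaluated first, the running mean loss is an honest
// generalization estimate: no held-out test set is needed.
function prequential(model: OnlineModel, bytes: number[]): number {
  let total = 0;
  for (let t = 1; t < bytes.length; t++) {
    const ctx = bytes.slice(0, t);
    total += model.loss(ctx, bytes[t]);  // predict → observe (record loss)...
    model.update(ctx, bytes[t]);         // ...then update
  }
  return total / (bytes.length - 1);     // mean prequential loss (nats/byte)
}
```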

## Mechanisms

From the online/continual learning literature:

| Mechanism | Paper / origin | What it does | Strict online? |
|---|---|---|---|
| Selective-Backprop | Jiang et al. 2019 | Skip updates when loss < τ | Yes |
| Experience Replay | Reservoir sampling | Replay past windows weighted by loss | No |
| Adaptive LR | Dynamic eval literature | Boost LR on distribution shift | Yes |
| Gradient Clipping | Standard practice | Cap gradient norm for stability | Yes |
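
Two of these are small enough to sketch directly (assumed shapes, not the repo's `mechanisms/index.ts`): classic reservoir sampling (Algorithm R) for the replay buffer, and the loss-threshold gate behind Selective-Backprop:

```typescript
// Reservoir sampling keeps a uniform random subset of an unbounded stream
// in O(capacity) memory — the basis of the Experience Replay buffer.
class Reservoir<T> {
  private items: T[] = [];
  private seen = 0;
  constructor(private capacity: number) {}

  add(item: T): void {
    this.seen++;
    if (this.items.length < this.capacity) {
      this.items.push(item);
    } else {
      // After n items, each is retained with probability capacity / n.
      const j = Math.floor(Math.random() * this.seen);
      if (j < this.capacity) this.items[j] = item;
    }
  }

  sample(): T | undefined {
    if (this.items.length === 0) return undefined;
    return this.items[Math.floor(Math.random() * this.items.length)];
  }
}

// Selective-Backprop gate: skip the SGD step when the token was easy.
function shouldBackprop(loss: number, tau: number): boolean {
  return loss >= tau;
}
```

The gate is strictly online (it only looks at the current loss); the reservoir is not, since replaying stored windows revisits past data — matching the "Strict online?" column above.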

## Next steps

  • TinyStories integration for capacity-aligned training data
  • Client-side ES module imports from val files (eliminate duplication)
  • Web worker for training loop (unblock UI thread)
  • kNN/continuous cache mechanism (non-parametric memory)
  • Multi-layer support with depth controls
  • Configurable model dimensions in UI