μGPT — Online Learning Micro-GPT

A faithful minimal decoder-only transformer trained online under the prequential protocol. Byte-level, from scratch, in front of you.

Architecture

model/
  core.ts       Config, types, init, RMSNorm, sampling utilities
  forward.ts    Full causal self-attention forward pass (all positions)
  backward.ts   Full backward pass through causal decoder block
  optim.ts      SGD + gradient clipping

mechanisms/
  index.ts      Toggleable research-backed mechanisms:
                - Selective-Backprop (skip easy tokens)
                - Experience Replay (reservoir + priority)
                - Adaptive LR (dual-EMA shift detection)

corpus/
  phases.ts     Training data phases, corpus builder, boundary tracking

engine/
  loop.ts       Prequential training loop coordinating model + mechanisms

ui/
  app.ts        (TODO) Interactive frontend — currently inline in main.ts

main.ts         HTTP entry point serving the interactive page

Model

Decoder-only transformer. 1 layer, 4 heads × head dim 4 (d_model = 16), MLP 16→64→16, separate lm_head. Pre-norm (RMSNorm), residual stream, ~7K parameters. Byte-level vocabulary of 257 (256 bytes + BOS).
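The dimensions above can be pinned down as a config object. A minimal sketch; the interface and field names are illustrative, not the actual exports of core.ts:

```typescript
// Hypothetical config mirroring the dimensions above (not the real core.ts types).
interface MicroGPTConfig {
  nLayers: number;   // 1
  nHeads: number;    // 4
  dHead: number;     // 4
  dModel: number;    // nHeads * dHead = 16
  dFF: number;       // MLP hidden width: 64
  vocabSize: number; // 257 = 256 bytes + BOS
}

const config: MicroGPTConfig = {
  nLayers: 1,
  nHeads: 4,
  dHead: 4,
  dModel: 16,
  dFF: 64,
  vocabSize: 257,
};

// Sanity check: the residual stream width must equal heads x head dim.
function isConsistent(c: MicroGPTConfig): boolean {
  return c.dModel === c.nHeads * c.dHead;
}
```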

The forward pass computes hidden states for all positions under a causal mask, and the backward pass propagates gradients through the entire sequence in the block. This full-sequence path is what makes it a faithful GPT, not just an "attention LM."
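The causal mask can be shown in isolation: each query position t only attends to key positions ≤ t. A standalone sketch over raw score matrices; `causalSoftmax` is illustrative, not an export of forward.ts:

```typescript
// Causal masking + softmax over a [T x T] score matrix (illustrative).
// Row t is the attention distribution for query position t; positions
// after t are masked to -Infinity, so they receive exactly zero weight.
function causalSoftmax(scores: number[][]): number[][] {
  return scores.map((row, t) => {
    const masked = row.map((s, j) => (j <= t ? s : -Infinity));
    const max = Math.max(...masked.slice(0, t + 1)); // stabilize the exp
    const exps = masked.map((s) => Math.exp(s - max)); // exp(-Inf) = 0
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / sum);
  });
}
```

Because masked entries are exactly zero after softmax, no gradient flows from a position to its future during the backward pass.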

Protocol

Prequential: predict → observe → update. Loss is recorded before the update. Every byte is both evaluation and curriculum. No train/test split.
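The protocol is a two-step loop per byte. A minimal sketch, assuming a hypothetical `OnlineModel` interface (the real loop lives in engine/loop.ts):

```typescript
// Prequential (predict-then-update) loop. Interface names are hypothetical.
interface OnlineModel {
  lossAt(byte: number): number; // cross-entropy of the next-byte prediction
  update(byte: number): void;   // one gradient step on that same byte
}

function prequential(model: OnlineModel, stream: number[]): number[] {
  const losses: number[] = [];
  for (const byte of stream) {
    losses.push(model.lossAt(byte)); // 1. record loss BEFORE the update
    model.update(byte);              // 2. then learn from the byte
  }
  return losses; // the loss curve doubles as the evaluation
}
```

Ordering is the whole point: because the loss is logged before the parameters move, the curve is an honest held-out evaluation even though every byte is also trained on.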

Mechanisms

From the online/continual learning literature:

  Mechanism           Origin                   What it does                          Strict online?
  Selective-Backprop  Jiang et al. 2019        Skip updates when loss < τ            Yes
  Experience Replay   Reservoir sampling       Replay past windows weighted by loss  No
  Adaptive LR         Dynamic eval literature  Boost LR on distribution shift        Yes
  Gradient Clipping   Standard practice        Cap gradient norm for stability       Yes
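Two of these gates are small enough to sketch inline. The thresholds, decay rates, and shift criterion below are illustrative placeholders, not the tuned values in mechanisms/index.ts:

```typescript
// Selective-Backprop: skip the update when the token was already easy.
function shouldBackprop(loss: number, tau: number): boolean {
  return loss >= tau; // below tau the prediction was good enough; save the compute
}

// Adaptive LR via dual EMAs: when the fast average of the loss rises well
// above the slow average, treat it as a distribution shift and boost the LR.
class DualEmaLr {
  private fast = 0;
  private slow = 0;
  constructor(
    private readonly baseLr: number,
    private readonly alphaFast = 0.1,  // illustrative decay rates
    private readonly alphaSlow = 0.01,
  ) {}
  step(loss: number): number {
    this.fast += this.alphaFast * (loss - this.fast);
    this.slow += this.alphaSlow * (loss - this.slow);
    const shifted = this.fast > this.slow * 1.1; // hypothetical shift criterion
    return shifted ? this.baseLr * 2 : this.baseLr;
  }
}
```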

Next steps

  • TinyStories integration for capacity-aligned training data
  • Client-side ES module imports from val files (eliminate duplication)
  • Web worker for training loop (unblock UI thread)
  • kNN/continuous cache mechanism (non-parametric memory)
  • Multi-layer support with depth controls
  • Configurable model dimensions in UI