μGPT — Online Learning Micro-GPT

A faithful minimal decoder-only transformer trained online under the prequential protocol. Byte-level, from scratch, in front of you.

Architecture

model/
  core.ts       Config, types, init, RMSNorm, sampling utilities
  forward.ts    Full causal self-attention forward pass (all positions)
  backward.ts   Full backward pass through causal decoder block
  optim.ts      SGD + gradient clipping

mechanisms/
  index.ts      Toggleable research-backed mechanisms:
                - Selective-Backprop (skip easy tokens)
                - Experience Replay (reservoir + priority)
                - Adaptive LR (dual-EMA shift detection)

corpus/
  phases.ts     Training data phases, corpus builder, boundary tracking

engine/
  loop.ts       Prequential training loop coordinating model + mechanisms

ui/
  app.ts        (TODO) Interactive frontend — currently inline in main.ts

main.ts         HTTP entry point serving the interactive page

Model

Decoder-only transformer. 1 layer, 4 heads × head dim 4 (d_model = 16), MLP 16→64→16, separate lm_head. Pre-norm (RMSNorm), residual stream, ~7K parameters. Byte-level vocabulary of 257 (256 bytes + BOS).
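The dimensions above can be pinned down as a config object. A minimal sketch; the interface and field names are illustrative, not the actual exports of core.ts:

```typescript
// Hypothetical config mirroring the dimensions above (not the real core.ts types).
interface MicroGPTConfig {
  nLayers: number;   // 1
  nHeads: number;    // 4
  dHead: number;     // 4
  dModel: number;    // nHeads * dHead = 16
  dFF: number;       // MLP hidden width: 64
  vocabSize: number; // 257 = 256 bytes + BOS
}

const config: MicroGPTConfig = {
  nLayers: 1,
  nHeads: 4,
  dHead: 4,
  dModel: 16,
  dFF: 64,
  vocabSize: 257,
};

// Sanity check: the residual stream width must equal heads x head dim.
function isConsistent(c: MicroGPTConfig): boolean {
  return c.dModel === c.nHeads * c.dHead;
}
```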

The forward pass computes hidden states for all positions under a causal mask, and the backward pass propagates gradients through the entire sequence in the block. This full-sequence path is what makes it a faithful GPT, not just an "attention LM."
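The causal mask can be shown in isolation: each query position t only attends to key positions ≤ t. A standalone sketch over raw score matrices; `causalSoftmax` is illustrative, not an export of forward.ts:

```typescript
// Causal masking + softmax over a [T x T] score matrix (illustrative).
// Row t is the attention distribution for query position t; positions
// after t are masked to -Infinity, so they receive exactly zero weight.
function causalSoftmax(scores: number[][]): number[][] {
  return scores.map((row, t) => {
    const masked = row.map((s, j) => (j <= t ? s : -Infinity));
    const max = Math.max(...masked.slice(0, t + 1)); // stabilize the exp
    const exps = masked.map((s) => Math.exp(s - max)); // exp(-Inf) = 0
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / sum);
  });
}
```

Because masked entries are exactly zero after softmax, no gradient flows from a position to its future during the backward pass.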

Protocol

Prequential: predict → observe → update. Loss is recorded before the update. Every byte is both evaluation and curriculum. No train/test split.
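The protocol is a two-step loop per byte. A minimal sketch, assuming a hypothetical `OnlineModel` interface (the real loop lives in engine/loop.ts):

```typescript
// Prequential (predict-then-update) loop. Interface names are hypothetical.
interface OnlineModel {
  lossAt(byte: number): number; // cross-entropy of the next-byte prediction
  update(byte: number): void;   // one gradient step on that same byte
}

function prequential(model: OnlineModel, stream: number[]): number[] {
  const losses: number[] = [];
  for (const byte of stream) {
    losses.push(model.lossAt(byte)); // 1. record loss BEFORE the update
    model.update(byte);              // 2. then learn from the byte
  }
  return losses; // the loss curve doubles as the evaluation
}
```

Ordering is the whole point: because the loss is logged before the parameters move, the curve is an honest held-out evaluation even though every byte is also trained on.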

Mechanisms

From the online/continual learning literature:

  Mechanism           Origin                   What it does                          Strict online?
  Selective-Backprop  Jiang et al. 2019        Skip updates when loss < τ            Yes
  Experience Replay   Reservoir sampling       Replay past windows weighted by loss  No
  Adaptive LR         Dynamic eval literature  Boost LR on distribution shift        Yes
  Gradient Clipping   Standard practice        Cap gradient norm for stability       Yes
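Two of these gates are small enough to sketch inline. The thresholds, decay rates, and shift criterion below are illustrative placeholders, not the tuned values in mechanisms/index.ts:

```typescript
// Selective-Backprop: skip the update when the token was already easy.
function shouldBackprop(loss: number, tau: number): boolean {
  return loss >= tau; // below tau the prediction was good enough; save the compute
}

// Adaptive LR via dual EMAs: when the fast average of the loss rises well
// above the slow average, treat it as a distribution shift and boost the LR.
class DualEmaLr {
  private fast = 0;
  private slow = 0;
  constructor(
    private readonly baseLr: number,
    private readonly alphaFast = 0.1,  // illustrative decay rates
    private readonly alphaSlow = 0.01,
  ) {}
  step(loss: number): number {
    this.fast += this.alphaFast * (loss - this.fast);
    this.slow += this.alphaSlow * (loss - this.slow);
    const shifted = this.fast > this.slow * 1.1; // hypothetical shift criterion
    return shifted ? this.baseLr * 2 : this.baseLr;
  }
}
```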

Next steps

  • TinyStories integration for capacity-aligned training data
  • Client-side ES module imports from val files (eliminate duplication)
  • Web worker for training loop (unblock UI thread)
  • kNN/continuous cache mechanism (non-parametric memory)
  • Multi-layer support with depth controls
  • Configurable model dimensions in UI