A faithful minimal decoder-only transformer trained online under the prequential protocol. Byte-level, from scratch, in front of you.
```
model/
  core.ts       Config, types, init, RMSNorm, sampling utilities
  forward.ts    Full causal self-attention forward pass (all positions)
  backward.ts   Full backward pass through the causal decoder block
  optim.ts      SGD + gradient clipping
mechanisms/
  index.ts      Toggleable research-backed mechanisms:
                  - Selective-Backprop (skip easy tokens)
                  - Experience Replay (reservoir + priority)
                  - Adaptive LR (dual-EMA shift detection)
corpus/
  phases.ts     Training data phases, corpus builder, boundary tracking
engine/
  loop.ts       Prequential training loop coordinating model + mechanisms
ui/
  app.ts        (TODO) Interactive frontend; currently inline in main.ts
main.ts         HTTP entry point serving the interactive page
```
Decoder-only transformer: 1 layer, 4 heads × head dim 4 (d_model = 16), MLP 16→64→16, separate lm_head. Pre-norm (RMSNorm), residual stream, ~7K parameters. Byte vocabulary (257 = 256 bytes + BOS).
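The shape above can be written down as a config sketch. Field names here are illustrative, not the actual `core.ts` definitions:

```typescript
// Hypothetical Config shape matching the architecture described above;
// the real core.ts may name or group these fields differently.
interface Config {
  vocabSize: number; // 257 = 256 bytes + BOS
  dModel: number;    // residual stream width
  nHeads: number;    // attention heads
  headDim: number;   // per-head dimension
  dFF: number;       // MLP hidden width
  nLayers: number;
}

const config: Config = {
  vocabSize: 257,
  dModel: 16,
  nHeads: 4,
  headDim: 4, // 4 heads × 4 dims tile the 16-wide residual stream
  dFF: 64,    // MLP 16 → 64 → 16
  nLayers: 1,
};
```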
The forward pass computes hidden states for all positions with causal masking; the backward pass propagates gradients through the entire sequence and the full block. This is what makes it a faithful GPT, not just an "attention LM."
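A minimal sketch of the causal masking step, assuming a plain `scores[i][j]` matrix of scaled dot products; this is illustrative, not the actual `forward.ts` code:

```typescript
// Causal masking: position i may attend only to positions j <= i.
// Future positions are set to -Infinity so softmax gives them weight 0.
function causalSoftmax(scores: number[][]): number[][] {
  return scores.map((row, i) =>
    softmax(row.map((s, j) => (j <= i ? s : -Infinity)))
  );
}

// Numerically stable softmax that maps -Infinity entries to exactly 0.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs.filter(Number.isFinite));
  const exps = xs.map((x) => (Number.isFinite(x) ? Math.exp(x - m) : 0));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```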
Prequential: predict → observe → update. Loss is recorded before the update. Every byte is both evaluation and curriculum. No train/test split.
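The protocol can be sketched as a loop; `OnlineModel` is a stand-in interface, not the actual `engine/loop.ts` API:

```typescript
// Stand-in interface: any model that can score and then learn from a byte.
interface OnlineModel {
  loss(context: number[], nextByte: number): number; // -log p(next | context)
  update(context: number[], nextByte: number): void; // one gradient step
}

// Prequential evaluation: each byte is scored BEFORE the update that
// consumes it, so the loss curve doubles as a held-out evaluation.
function prequential(model: OnlineModel, bytes: number[]): number[] {
  const losses: number[] = [];
  for (let t = 0; t < bytes.length; t++) {
    const context = bytes.slice(0, t);
    losses.push(model.loss(context, bytes[t])); // predict → observe
    model.update(context, bytes[t]);            // then update
  }
  return losses; // every byte scored exactly once, pre-update
}
```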
From the online/continual learning literature:
| Mechanism | Paper | What it does | Strict online? |
|---|---|---|---|
| Selective-Backprop | Jiang et al. 2019 | Skip updates when loss < τ | Yes |
| Experience Replay | Reservoir sampling | Replay past windows weighted by loss | No |
| Adaptive LR | Dynamic eval literature | Boost LR on distribution shift | Yes |
| Gradient Clipping | Standard practice | Cap gradient norm for stability | Yes |
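Two of the mechanisms above can be sketched in a few lines, assuming per-token loss is the only signal. Thresholds and wiring here are illustrative, not the actual `mechanisms/index.ts`:

```typescript
// Selective-Backprop: skip the gradient step when the token was easy.
function shouldBackprop(loss: number, tau: number): boolean {
  return loss >= tau; // loss < τ: the model already predicts well, skip
}

// Adaptive LR via dual EMAs: a fast EMA rising well above a slow EMA
// signals a distribution shift, so the learning rate is boosted.
class DualEmaLr {
  private fast = 0;
  private slow = 0;
  private initialized = false;

  constructor(
    private baseLr: number,
    private fastDecay = 0.9,  // reacts quickly to recent losses
    private slowDecay = 0.99, // tracks the long-run baseline
    private boost = 3,        // LR multiplier under shift (illustrative)
  ) {}

  step(loss: number): number {
    if (!this.initialized) {
      this.fast = this.slow = loss;
      this.initialized = true;
    } else {
      this.fast = this.fastDecay * this.fast + (1 - this.fastDecay) * loss;
      this.slow = this.slowDecay * this.slow + (1 - this.slowDecay) * loss;
    }
    // Shift detected when recent loss clearly exceeds the baseline.
    return this.fast > 1.2 * this.slow ? this.baseLr * this.boost : this.baseLr;
  }
}
```

The 1.2 margin and 3× boost are placeholder values; the point is that both EMAs see the same losses but forget at different rates, so their ratio spikes exactly when the data distribution changes.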