"Don't start with the implementation. Start with what the thing means." โ The Conal Elliott principle
This document records our thinking about how to extend AutoLLM from autoregressive (one-character-at-a-time) generation to diffusion-style generation (all characters at once). We follow Conal Elliott's denotational design methodology: pin down the mathematical meaning first, derive the implementation from it, and let the algebra tell us what the code must be.
AutoLLM currently generates names left-to-right:
. → e → m → m → a → .
What if we generated the whole name at once, like how a diffusion model generates an entire image simultaneously?
Before changing anything, we must be precise about what the current model means.
Semantic type of the autoregressive model:
nextChar : Vector(K, Char) → Distribution(Char)
Given K characters of context, produce a probability distribution over the next character. The joint probability of a whole name factors as a product of conditionals:
P("emma") = P(e|.) ยท P(m|.e) ยท P(m|.em) ยท P(a|.emm) ยท P(.|.emma)
This factorization is the meaning. Left-to-right generation isn't a design choice; it's an inevitable consequence of the semantics. If we want a different generation strategy, we need a different denotation.
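As a toy illustration, the factorization can be computed directly. The conditional probabilities below are invented for the example, not taken from a trained model:

```python
import math

# Stand-in for a trained nextChar model: maps (context, next_char)
# to a probability. The values are illustrative only.
cond_probs = {
    (".", "e"): 0.10,
    (".e", "m"): 0.30,
    (".em", "m"): 0.25,
    (".emm", "a"): 0.40,
    (".emma", "."): 0.60,
}

def joint_log_prob(name):
    """P(name) = product of P(next | prefix), accumulated in log space."""
    context, logp = ".", 0.0
    for ch in name + ".":          # the trailing "." is the end-of-name token
        logp += math.log(cond_probs[(context, ch)])
        context += ch
    return logp

p_emma = math.exp(joint_log_prob("emma"))   # 0.10 · 0.30 · 0.25 · 0.40 · 0.60
```

Working in log space avoids underflow for longer names; the product of five small conditionals is already down to 0.0018 here.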
Conal's lesson: You can't get "all at once" generation by patching the sampling loop. You need to change what the model means.
For "all at once" generation, we need a model whose meaning is the joint distribution over entire names โ not factored left-to-right.
The direct joint P(x₁, x₂, ..., x_L) is intractable (27^L possibilities). So we need a tractable representation. Three candidates, each a different mathematical object:
P(x₁, ..., x_L) ≈ ∏ᵢ P(xᵢ)
Each position independent. Too weak: knowing position 1 is "e" tells you a lot about position 2. This denotation can't capture the structure of names.
P(x₁, ..., x_L) = Σ over paths: P(x⁰) · ∏ₜ P(xᵗ | xᵗ⁻¹)
where x⁰ is pure noise (all masks), xᵀ is the clean name, and each transition is a denoising step that makes the whole sequence a little less corrupted.
P(x₁, ..., x_L) = ∫ P(z) · ∏ᵢ P(xᵢ | z) dz
A latent variable z captures the "identity" of the name; conditioned on z, characters are independent.
We choose Candidate B, the Markov chain of refinements, because it's closest to the diffusion intuition and decomposes naturally into repeated application of a single learned function.
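The forward (noising) direction of that chain is just random masking. A minimal sketch, with an underscore standing in for the dedicated MASK token:

```python
import random

MASK = "_"  # stand-in for the model's dedicated MASK vocabulary entry

def corrupt(name, noise_level, rng=random):
    """Mask a noise_level fraction of positions, chosen uniformly at random.
    noise_level = 1.0 gives pure noise (x⁰); noise_level = 0.0 is the clean name."""
    L = len(name)
    k = round(noise_level * L)
    masked = set(rng.sample(range(L), k))
    return "".join(MASK if i in masked else ch for i, ch in enumerate(name))
```

Because the masked positions form a set, not a sequence, this corruption process already respects the order-independence the laws below will demand.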
The denotation is a single function:
denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char))
where:
- Name = Vector(L, Char): a fixed-length sequence of characters
- MaskedName = Vector(L, Char ∪ {MASK}): same, but some positions are masked
- NoiseLevel ∈ [0, 1]: the fraction of positions currently masked
- Distribution(Char): a categorical distribution over the 27-character vocabulary

The model takes a corrupted name and produces, for each position, a probability distribution over what character belongs there.
This type signature is the north star. Every design decision that follows must be consistent with it.
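A hypothetical Python transcription of these types, just to make the shapes concrete (the names are illustrative, and the placeholder body returns uniform distributions rather than anything learned):

```python
from typing import List, Union

Char = str                       # one of the 27 vocabulary characters
MASK = "<mask>"                  # the extra token: Char ∪ {MASK}
MaskedChar = Union[Char, str]
Distribution = List[float]       # categorical over the 27-character vocabulary
V = 27

def denoise(masked_name: List[MaskedChar],
            noise_level: float) -> List[Distribution]:
    """denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char)).
    Placeholder body: a trained model goes here; the uniform output keeps
    the type honest without claiming any real predictions."""
    return [[1.0 / V] * V for _ in masked_name]
```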
Conal would insist: what laws must this function satisfy? These laws define correctness; they are what it means for the model to be right.
denoise(name, 0) = pointDistribution(name)
If nothing is masked, the model should predict exactly what it sees, with probability 1. This is a boundary condition: a perfect denoiser at noise level 0 is the identity function.
denoise(MASK...MASK, 1) ≈ marginal distribution over characters at each position
When everything is masked, there's no context to use. The model can only rely on positional priors (e.g., names tend to start with consonants, end with vowels). This is another boundary condition.
The predictions at noise level t should be consistent with predictions at noise level t + ε. Masking one more position shouldn't cause the predictions for the remaining positions to wildly contradict what they were before. Formally: the Markov chain should converge, not oscillate.
The model's predictions for position i should depend on which positions are masked, not on the order in which they were masked. Masking is a set operation, not a sequence operation.
These four laws are our specification. Training is: find parameters such that these laws hold approximately on the data distribution.
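The two boundary laws can be checked mechanically. A sketch using a stub model that satisfies them by construction (a trained model would satisfy them only approximately; all names here are illustrative):

```python
V = 27
MASK = V  # index of the mask token in the extended 28-token vocabulary

def stub_denoise(xs, noise_level):
    """Stub obeying the boundary laws exactly: point distribution where
    unmasked (Law 1), uniform where masked (a flat stand-in for the
    per-position marginals of Law 2)."""
    out = []
    for x in xs:
        if x == MASK:
            out.append([1.0 / V] * V)
        else:
            out.append([1.0 if c == x else 0.0 for c in range(V)])
    return out

def identity_law_holds(name):
    """denoise(name, 0) = pointDistribution(name) at every position."""
    return all(dist[c] == 1.0
               for c, dist in zip(name, stub_denoise(name, 0.0)))
```

For a trained model, the same checks become tolerance-based property tests over the data distribution rather than exact equalities.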
The loss function isn't a design choice; it follows from the denotation.
The model's meaning at each position is Distribution(Char). The natural measure of fit between a predicted distribution and an observed character is the negative log-likelihood:
ℓ = -Σᵢ [isMasked(i)] · log P_model(xᵢ | corrupted_name, noise_level)
We sum only over masked positions because the unmasked positions are handed to the model verbatim as input: predicting them is trivial (Law 1 already demands the identity there), and including them would dilute the training signal from the genuinely uncertain positions.
This is cross-entropy, the same loss we already have. The denotation tells us we're applying it per-position over masked slots rather than once for a single next-character prediction.
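A minimal sketch of this per-position masked loss in plain Python (names illustrative; distributions are assumed already normalized):

```python
import math

def masked_cross_entropy(pred_dists, target, is_masked):
    """ℓ = -Σᵢ [isMasked(i)] · log P_model(xᵢ), averaged over masked slots.
    pred_dists: per-position categorical distributions (lists of floats)
    target:     list of true character IDs
    is_masked:  which positions were corrupted in the input"""
    loss, count = 0.0, 0
    for dist, t, m in zip(pred_dists, target, is_masked):
        if m:
            loss -= math.log(dist[t])
            count += 1
    return loss / max(count, 1)   # only masked positions contribute
```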
The sampling procedure is also determined by the denotation, not invented separately.
Since the meaning is a Markov chain of refinements, generation is iteration of denoise from high noise to low noise:
x⁰ = [MASK, MASK, MASK, ..., MASK]   (noise_level = 1.0)
x¹ = unmask some positions via denoise   (noise_level ≈ 0.8)
x² = unmask more positions via denoise   (noise_level ≈ 0.5)
...
xᵀ = fully unmasked name   (noise_level = 0.0)
At each step, the model predicts all positions, and we unmask the most confident predictions: the positions where the model's predicted distribution has the lowest entropy.
The number of diffusion steps T and the schedule (how many positions to unmask per step) are parameters, but the structure (iterate denoise from noise to clarity) is forced by the denotation.
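The whole procedure can be sketched as follows. `stub_denoise` here is a stand-in model with artificially varying confidence; only the loop structure reflects the design:

```python
import math

V = 27
MASK = "<m>"                       # stand-in mask token
CHARS = "abcdefghijklmnopqrstuvwxyz."

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def stub_denoise(xs, noise_level):
    """Stand-in model: a distribution peaked on 'a', sharper at later
    positions, so per-position confidence actually varies."""
    out = []
    for i, _ in enumerate(xs):
        peak = 0.5 + 0.4 * (i / max(len(xs) - 1, 1))
        rest = (1.0 - peak) / (V - 1)
        out.append([peak if c == 0 else rest for c in range(V)])
    return out

def sample(L, steps, model=stub_denoise):
    xs = [MASK] * L
    for step in range(steps):
        masked = [i for i, x in enumerate(xs) if x == MASK]
        if not masked:
            break
        dists = model(xs, len(masked) / L)
        # unmask the most confident (lowest-entropy) masked positions
        masked.sort(key=lambda i: entropy(dists[i]))
        k = max(1, len(masked) // (steps - step))
        for i in masked[:k]:
            xs[i] = CHARS[max(range(V), key=lambda c: dists[i][c])]
    return "".join(xs)
```

The linear schedule (`len(masked) // (steps - step)`) is one arbitrary choice among many; cosine or entropy-threshold schedules fit the same loop.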
Only now does Conal let us think about implementation.
The denotation constrains the interface but not the internals:
| Constraint (from denotation) | What it forces |
|---|---|
| Input is a whole sequence | Model must be bidirectional (not left-to-right) |
| Output is per-position distributions | Output shape is L ร V |
| Conditions on noise level | Noise level must be an explicit input |
| Must handle MASK tokens | Vocabulary grows by 1 (V: 27 → 28) |
| Fixed-length representation | Names padded to max length L |
The representation choice (MLP vs. attention vs. convolution) is free. For our MLP-based model, the simplest realization:
Input: L character IDs (some are MASK), plus noise_level
Embed: Look up each character in E (28 × Emb), concatenate all L
Concat: Append noise_level embedding to the flat vector
Hidden: h = tanh(W₁ · concat + b₁) (bidirectional: sees all positions)
Output: logits = W₂ · h + b₂   shape: (L × V)
This is almost identical to what we have now. The only structural differences: the input covers all L positions (some masked) rather than a K-character window, the noise level enters as an extra input, the vocabulary gains a MASK token, and the output is L × V logits instead of a single V-vector of next-character logits.
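A sketch of this forward pass in NumPy, with illustrative sizes (L=16, V=28) and an invented scalar-broadcast noise embedding; the real model's weights would be learned, not random:

```python
import numpy as np

L, V, Emb, Hidden = 16, 28, 10, 64   # V = 28: 27 chars + MASK
MASK_ID = 27

rng = np.random.default_rng(0)
E  = rng.normal(size=(V, Emb)) * 0.1                  # embedding table (28 × Emb)
W1 = rng.normal(size=(L * Emb + Emb, Hidden)) * 0.1   # +Emb slots for noise level
b1 = np.zeros(Hidden)
W2 = rng.normal(size=(Hidden, L * V)) * 0.1
b2 = np.zeros(L * V)

def noise_embedding(t):
    """Toy noise-level embedding: the scalar t broadcast to Emb dimensions."""
    return np.full(Emb, t)

def forward(char_ids, noise_level):
    x = np.concatenate([E[char_ids].reshape(-1), noise_embedding(noise_level)])
    h = np.tanh(x @ W1 + b1)             # hidden layer sees all L positions at once
    return (h @ W2 + b2).reshape(L, V)   # per-position logits

logits = forward(np.full(L, MASK_ID), 1.0)   # fully masked input, noise_level = 1
```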
The deepest Conal move: the AD framework should be a homomorphism. It should map the algebraic structure of function composition to the algebraic structure of the chain rule:
D(f ∘ g) = D(f) ∘ D(g)
Our Node<T> / back(g) pattern already satisfies this. Every primitive returns (value, backpropagator), and backpropagators compose by calling each other. This is a correct homomorphism from the category of smooth functions to the category of linear maps.
For the diffusion model, nothing changes here. We might add new primitives (masking, noise-level embedding, multi-position output), and each one gets a (value, back) pair. They compose the same way. The AD framework is already correct; we just compose different things.
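The pattern can be shown in a few lines of plain Python. This is a sketch of the idea, not the actual Node&lt;T&gt; code:

```python
import math

# Each primitive returns (value, backpropagator); the backpropagator maps
# the upstream gradient to gradients of the primitive's inputs.

def mul(a, b):
    value = a * b
    def back(g):                        # d(a·b)/da = b, d(a·b)/db = a
        return (g * b, g * a)
    return value, back

def tanh(x):
    y = math.tanh(x)
    def back(g):                        # d tanh/dx = 1 - tanh²(x)
        return (g * (1 - y * y),)
    return y, back

# D(f ∘ g) = D(f) ∘ D(g): forward composes values, backward chains backs
v1, back_mul = mul(2.0, 3.0)
v2, back_tanh = tanh(v1)
(g_v1,) = back_tanh(1.0)                # seed gradient 1.0 at the output
g_a, g_b = back_mul(g_v1)               # chained through mul
```

Adding a masking or noise-embedding primitive means writing one more function of this exact shape; nothing else in the framework moves.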
| Aspect | Current (Autoregressive) | New (Diffusion) |
|---|---|---|
| Denotation | P(xₜ ∣ x₁...xₜ₋₁) | denoise : MaskedName × NoiseLevel → Vector(L, Dist(Char)) |
| Generation | Sequential: one char at a time | Iterative: all chars, refine over T steps |
| Vocab | 27 (a-z + BOS) | 28 (+ MASK) |
| Input | K=5 char left context | L=16 chars (full name, some masked) |
| Output | V=27 logits (one position) | L×V logits (all positions) |
| Loss | Cross-entropy on next char | Cross-entropy on masked positions |
| Dataset | Sliding window within names | Whole names with random masking |
| AD framework | Node<T>, unchanged | Node<T>, unchanged |
| Laws | Chain rule of probability | Identity at noise=0, marginal at noise=1, consistency, equivariance |
Before writing any code, verify:
- The denotation: denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char))
- The loss: cross-entropy per masked position, following from the Distribution(Char) codomain
- The AD framework: Node&lt;T&gt; pattern unchanged

The model changes. The framework doesn't. That's the power of getting the denotation right.