"Don't start with the implementation. Start with what the thing means." โ The Conal Elliott principle
This document records our thinking about how to extend AutoLLM from autoregressive (one-character-at-a-time) generation to diffusion-style generation (all characters at once). We follow Conal Elliott's denotational design methodology: pin down the mathematical meaning first, derive the implementation from it, and let the algebra tell us what the code must be.
AutoLLM currently generates names left-to-right:
. → e → m → m → a → .
What if we generated the whole name at once, like how a diffusion model generates an entire image simultaneously?
Before changing anything, we must be precise about what the current model means.
Semantic type of the autoregressive model:
nextChar : Vector(K, Char) → Distribution(Char)
Given K characters of context, produce a probability distribution over the next character. The joint probability of a whole name factors as a product of conditionals:
P("emma") = P(e|.) ยท P(m|.e) ยท P(m|.em) ยท P(a|.emm) ยท P(.|.emma)
This factorization is the meaning. Left-to-right generation isn't a design choice; it's an inevitable consequence of the semantics. If we want a different generation strategy, we need a different denotation.
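As a toy illustration, the factorization can be computed directly. The conditional probabilities below are invented for the example, not taken from a trained model:

```python
import math

# Stand-in for a trained nextChar model: maps (context, next_char)
# to a probability. The values are illustrative only.
cond_probs = {
    (".", "e"): 0.10,
    (".e", "m"): 0.30,
    (".em", "m"): 0.25,
    (".emm", "a"): 0.40,
    (".emma", "."): 0.60,
}

def joint_log_prob(name):
    """P(name) = product of P(next | prefix), accumulated in log space."""
    context, logp = ".", 0.0
    for ch in name + ".":          # the trailing "." is the end-of-name token
        logp += math.log(cond_probs[(context, ch)])
        context += ch
    return logp

p_emma = math.exp(joint_log_prob("emma"))   # 0.10 · 0.30 · 0.25 · 0.40 · 0.60
```

Working in log space avoids underflow for longer names; the product of five small conditionals is already down to 0.0018 here.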
Conal's lesson: You can't get "all at once" generation by patching the sampling loop. You need to change what the model means.
For "all at once" generation, we need a model whose meaning is the joint distribution over entire names โ not factored left-to-right.
The direct joint P(x₁, x₂, ..., x_L) is intractable (27^L possibilities). So we need a tractable representation. Three candidates, each a different mathematical object:
P(x₁, ..., x_L) ≈ ∏ᵢ P(xᵢ)
Each position independent. Too weak: knowing position 1 is "e" tells you a lot about position 2. This denotation can't capture the structure of names.
P(x₁, ..., x_L) = Σ over paths: P(x⁰) · ∏ₜ P(xᵗ | xᵗ⁻¹)
where x⁰ is pure noise (all masks), xᵀ is the clean name, and each transition is a denoising step that makes the whole sequence a little less corrupted.
P(x₁, ..., x_L) = ∫ P(z) · ∏ᵢ P(xᵢ | z) dz
A latent variable z captures the "identity" of the name; conditioned on z, characters are independent.
We choose Candidate B, the Markov chain of refinements, because it's closest to the diffusion intuition and decomposes naturally into repeated application of a single learned function.
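The forward (noising) direction of that chain is just random masking. A minimal sketch, with an underscore standing in for the dedicated MASK token:

```python
import random

MASK = "_"  # stand-in for the model's dedicated MASK vocabulary entry

def corrupt(name, noise_level, rng=random):
    """Mask a noise_level fraction of positions, chosen uniformly at random.
    noise_level = 1.0 gives pure noise (x⁰); noise_level = 0.0 is the clean name."""
    L = len(name)
    k = round(noise_level * L)
    masked = set(rng.sample(range(L), k))
    return "".join(MASK if i in masked else ch for i, ch in enumerate(name))
```

Because the masked positions form a set, not a sequence, this corruption process already respects the order-independence the laws below will demand.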
The denotation is a single function:
denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char))
where:
- Name = Vector(L, Char): a fixed-length sequence of characters
- MaskedName = Vector(L, Char ∪ {MASK}): same, but some positions are masked
- NoiseLevel ∈ [0, 1]: the fraction of positions currently masked
- Distribution(Char): a categorical distribution over the 27-character vocabulary

The model takes a corrupted name and produces, for each position, a probability distribution over what character belongs there.
This type signature is the north star. Every design decision that follows must be consistent with it.
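A hypothetical Python transcription of these types, just to make the shapes concrete (the names are illustrative, and the placeholder body returns uniform distributions rather than anything learned):

```python
from typing import List, Union

Char = str                       # one of the 27 vocabulary characters
MASK = "<mask>"                  # the extra token: Char ∪ {MASK}
MaskedChar = Union[Char, str]
Distribution = List[float]       # categorical over the 27-character vocabulary
V = 27

def denoise(masked_name: List[MaskedChar],
            noise_level: float) -> List[Distribution]:
    """denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char)).
    Placeholder body: a trained model goes here; the uniform output keeps
    the type honest without claiming any real predictions."""
    return [[1.0 / V] * V for _ in masked_name]
```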
Conal would insist: what laws must this function satisfy? These laws define correctness; they are what it means for the model to be right.
denoise(name, 0) = pointDistribution(name)
If nothing is masked, the model should predict exactly what it sees, with probability 1. This is a boundary condition: a perfect denoiser at noise level 0 is the identity function.
denoise(MASK...MASK, 1) ≈ marginal distribution over characters at each position
When everything is masked, there's no context to use. The model can only rely on positional priors (e.g., names tend to start with consonants, end with vowels). This is another boundary condition.
The predictions at noise level t should be consistent with predictions at noise level t + ε. Masking one more position shouldn't cause the predictions for the remaining positions to wildly contradict what they were before. Formally: the Markov chain should converge, not oscillate.
The model's predictions for position i should depend on which positions are masked, not on the order in which they were masked. Masking is a set operation, not a sequence operation.
These four laws are our specification. Training is: find parameters such that these laws hold approximately on the data distribution.
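The two boundary laws can be checked mechanically. A sketch using a stub model that satisfies them by construction (a trained model would satisfy them only approximately; all names here are illustrative):

```python
V = 27
MASK = V  # index of the mask token in the extended 28-token vocabulary

def stub_denoise(xs, noise_level):
    """Stub obeying the boundary laws exactly: point distribution where
    unmasked (Law 1), uniform where masked (a flat stand-in for the
    per-position marginals of Law 2)."""
    out = []
    for x in xs:
        if x == MASK:
            out.append([1.0 / V] * V)
        else:
            out.append([1.0 if c == x else 0.0 for c in range(V)])
    return out

def identity_law_holds(name):
    """denoise(name, 0) = pointDistribution(name) at every position."""
    return all(dist[c] == 1.0
               for c, dist in zip(name, stub_denoise(name, 0.0)))
```

For a trained model, the same checks become tolerance-based property tests over the data distribution rather than exact equalities.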
The loss function isn't a design choice; it follows from the denotation.
The model's meaning at each position is Distribution(Char). The natural measure of fit between a predicted distribution and an observed character is the negative log-likelihood:
ℓ = -Σᵢ [isMasked(i)] · log P_model(xᵢ | corrupted_name, noise_level)
We sum only over masked positions because the unmasked positions are handed to the model verbatim as input: predicting them is trivial (Law 1 already demands the identity there), and including them would dilute the training signal from the genuinely uncertain positions.
This is cross-entropy, the same loss we already have. The denotation tells us we're applying it per-position over masked slots rather than once for a single next-character prediction.
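A minimal sketch of this per-position masked loss in plain Python (names illustrative; distributions are assumed already normalized):

```python
import math

def masked_cross_entropy(pred_dists, target, is_masked):
    """ℓ = -Σᵢ [isMasked(i)] · log P_model(xᵢ), averaged over masked slots.
    pred_dists: per-position categorical distributions (lists of floats)
    target:     list of true character IDs
    is_masked:  which positions were corrupted in the input"""
    loss, count = 0.0, 0
    for dist, t, m in zip(pred_dists, target, is_masked):
        if m:
            loss -= math.log(dist[t])
            count += 1
    return loss / max(count, 1)   # only masked positions contribute
```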
The sampling procedure is also determined by the denotation, not invented separately.
Since the meaning is a Markov chain of refinements, generation is iteration of denoise from high noise to low noise:
x⁰ = [MASK, MASK, MASK, ..., MASK]   (noise_level = 1.0)
x¹ = unmask some positions via denoise   (noise_level ≈ 0.8)
x² = unmask more positions via denoise   (noise_level ≈ 0.5)
...
xᵀ = fully unmasked name   (noise_level = 0.0)
At each step, the model predicts all positions, and we unmask the most confident predictions: the positions where the model's predicted distribution has the lowest entropy.
The number of diffusion steps T and the schedule (how many positions to unmask per step) are parameters, but the structure (iterate denoise from noise to clarity) is forced by the denotation.
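The whole procedure can be sketched as follows. `stub_denoise` here is a stand-in model with artificially varying confidence; only the loop structure reflects the design:

```python
import math

V = 27
MASK = "<m>"                       # stand-in mask token
CHARS = "abcdefghijklmnopqrstuvwxyz."

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

def stub_denoise(xs, noise_level):
    """Stand-in model: a distribution peaked on 'a', sharper at later
    positions, so per-position confidence actually varies."""
    out = []
    for i, _ in enumerate(xs):
        peak = 0.5 + 0.4 * (i / max(len(xs) - 1, 1))
        rest = (1.0 - peak) / (V - 1)
        out.append([peak if c == 0 else rest for c in range(V)])
    return out

def sample(L, steps, model=stub_denoise):
    xs = [MASK] * L
    for step in range(steps):
        masked = [i for i, x in enumerate(xs) if x == MASK]
        if not masked:
            break
        dists = model(xs, len(masked) / L)
        # unmask the most confident (lowest-entropy) masked positions
        masked.sort(key=lambda i: entropy(dists[i]))
        k = max(1, len(masked) // (steps - step))
        for i in masked[:k]:
            xs[i] = CHARS[max(range(V), key=lambda c: dists[i][c])]
    return "".join(xs)
```

The linear schedule (`len(masked) // (steps - step)`) is one arbitrary choice among many; cosine or entropy-threshold schedules fit the same loop.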
Only now does Conal let us think about implementation.
The denotation constrains the interface but not the internals:
| Constraint (from denotation) | What it forces |
|---|---|
| Input is a whole sequence | Model must be bidirectional (not left-to-right) |
| Output is per-position distributions | Output shape is L ร V |
| Conditions on noise level | Noise level must be an explicit input |
| Must handle MASK tokens | Vocabulary grows by 1 (V: 27 → 28) |
| Fixed-length representation | Names padded to max length L |
The representation choice (MLP vs. attention vs. convolution) is free. For our MLP-based model, the simplest realization:
Input: L character IDs (some are MASK), plus noise_level
Embed: Look up each character in E (28 × Emb), concatenate all L
Concat: Append noise_level embedding to the flat vector
Hidden: h = tanh(W₁ · concat + b₁) (bidirectional: sees all positions)
Output: logits = W₂ · h + b₂   shape: (L × V)
This is almost identical to what we have now. The only structural differences: the input covers all L positions (some masked) rather than a K-character window, the noise level enters as an extra input, the vocabulary gains a MASK token, and the output is L × V logits instead of a single V-vector of next-character logits.
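A sketch of this forward pass in NumPy, with illustrative sizes (L=16, V=28) and an invented scalar-broadcast noise embedding; the real model's weights would be learned, not random:

```python
import numpy as np

L, V, Emb, Hidden = 16, 28, 10, 64   # V = 28: 27 chars + MASK
MASK_ID = 27

rng = np.random.default_rng(0)
E  = rng.normal(size=(V, Emb)) * 0.1                  # embedding table (28 × Emb)
W1 = rng.normal(size=(L * Emb + Emb, Hidden)) * 0.1   # +Emb slots for noise level
b1 = np.zeros(Hidden)
W2 = rng.normal(size=(Hidden, L * V)) * 0.1
b2 = np.zeros(L * V)

def noise_embedding(t):
    """Toy noise-level embedding: the scalar t broadcast to Emb dimensions."""
    return np.full(Emb, t)

def forward(char_ids, noise_level):
    x = np.concatenate([E[char_ids].reshape(-1), noise_embedding(noise_level)])
    h = np.tanh(x @ W1 + b1)             # hidden layer sees all L positions at once
    return (h @ W2 + b2).reshape(L, V)   # per-position logits

logits = forward(np.full(L, MASK_ID), 1.0)   # fully masked input, noise_level = 1
```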
The deepest Conal move: the AD framework should be a homomorphism. It should map the algebraic structure of function composition to the algebraic structure of the chain rule:
D(f ∘ g) = D(f) ∘ D(g)
Our Node<T> / back(g) pattern already satisfies this. Every primitive returns (value, backpropagator), and backpropagators compose by calling each other. This is a correct homomorphism from the category of smooth functions to the category of linear maps.
For the diffusion model, nothing changes here. We might add new primitives (masking, noise-level embedding, multi-position output), and each one gets a (value, back) pair. They compose the same way. The AD framework is already correct; we just compose different things.
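The pattern can be shown in a few lines of plain Python. This is a sketch of the idea, not the actual Node&lt;T&gt; code:

```python
import math

# Each primitive returns (value, backpropagator); the backpropagator maps
# the upstream gradient to gradients of the primitive's inputs.

def mul(a, b):
    value = a * b
    def back(g):                        # d(a·b)/da = b, d(a·b)/db = a
        return (g * b, g * a)
    return value, back

def tanh(x):
    y = math.tanh(x)
    def back(g):                        # d tanh/dx = 1 - tanh²(x)
        return (g * (1 - y * y),)
    return y, back

# D(f ∘ g) = D(f) ∘ D(g): forward composes values, backward chains backs
v1, back_mul = mul(2.0, 3.0)
v2, back_tanh = tanh(v1)
(g_v1,) = back_tanh(1.0)                # seed gradient 1.0 at the output
g_a, g_b = back_mul(g_v1)               # chained through mul
```

Adding a masking or noise-embedding primitive means writing one more function of this exact shape; nothing else in the framework moves.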
| Aspect | Current (Autoregressive) | New (Diffusion) |
|---|---|---|
| Denotation | P(xₜ ∣ x₁...xₜ₋₁) | denoise : MaskedName × NoiseLevel → Vector(L, Dist(Char)) |
| Generation | Sequential: one char at a time | Iterative: all chars, refine over T steps |
| Vocab | 27 (a-z + BOS) | 28 (+ MASK) |
| Input | K=5 char left context | L=16 chars (full name, some masked) |
| Output | V=27 logits (one position) | L×V logits (all positions) |
| Loss | Cross-entropy on next char | Cross-entropy on masked positions |
| Dataset | Sliding window within names | Whole names with random masking |
| AD framework | Node<T>, unchanged | Node<T>, unchanged |
| Laws | Chain rule of probability | Identity at noise=0, marginal at noise=1, consistency, equivariance |
Before writing any code, verify:
- The denotation: denoise : MaskedName × NoiseLevel → Vector(L, Distribution(Char))
- The loss: cross-entropy per masked position, following from the Distribution(Char) codomain
- The AD framework: Node&lt;T&gt; pattern unchanged

The model changes. The framework doesn't. That's the power of getting the denotation right.