LessWrong AI
2026-06-29 16:38 UTC
By zw5
USR-0152-20260629-community-fo-9161b2f4
Gradient-free Single-pass Model Beats nanoGPT on Shakespeare
Beam is a character-level language model that computes count tables mapping character contexts to next-character frequencies. At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed over a symmetric Dirichlet prior ₒⱼ Each order receives a capacity score composed of two terms: Concentration: ₒ where H(pₒ) is the Shannon entropy of the smoothed distribution. This is 1 when all mass is on one token and 0 when the distribution is uniform. Reliability: where n is the total count for the current context. This saturates toward 1 as evidence accumulates and is 0 when the context has not been observed. A third term, capacity, is computed from the product of concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10: ₒₒⱼⱼ The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions: ₒₒₒ This was chosen deliberately to assign high probability to a token only when multiple weighted orders agree. The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set. Results Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens,…
Beam is a character-level language model that computes count tables mapping character contexts to next-character frequencies. At prediction time, each order looks up the current context in its count table and produces a distribution over the vocabulary, smoothed over a symmetric Dirichlet prior ₒⱼ Each order receives a capacity score composed of two terms: Concentration: ₒ where H(pₒ) is the Shannon entropy of the smoothed distribution. This is 1 when all mass is on one token and 0 when the distribution is uniform. Reliability: where n is the total count for the current context. This saturates toward 1 as evidence accumulates and is 0 when the context has not been observed. A third term, capacity, is computed from the product of concentration and reliability. The capacity scores are converted to weights via softmax at temperature τ = 0.10: ₒₒⱼⱼ The low temperature makes the routing nearly winner-take-all: the highest-capacity order almost always dominates. The final prediction is the weighted geometric mean of the per-order distributions: ₒₒₒ This was chosen deliberately to assign high probability to a token only when multiple weighted orders agree. The model has four hyperparameters: the set of context orders, α, τ, and the reliability threshold (min_count = 1). These were selected by evaluating variants on the validation set. Results Evaluation uses the nanoGPT shakespeare_char benchmark: character-level Shakespeare, about 1M training tokens, about 100K validation tokens,…
Full article content could not be extracted automatically. Read the original below.
Source:
LessWrong AI
· lesswrong.com