
Every large language model, at the very last step of producing a single word, does the same five things: it computes a score for every possible next token, scales those scores by a number called temperature, converts them into probabilities, draws a random sample weighted by those probabilities, and appends whatever it drew to the output. Everything else — the attention layers, the billions of parameters, the training run that cost millions of dollars — exists to make that first step, the scoring, as good as possible. The last four steps are comparatively simple arithmetic, and they’re the part most explanations skip entirely in favour of “it’s just predicting the next word.” This post builds those four steps for real.
The Pipeline, in Order
A model with a 100,000-token vocabulary produces 100,000 scores at every single position, most of them very low. Softmax converts that list of raw scores — logits — into a proper probability distribution: every value becomes positive and the whole list sums to exactly 1. The formula is p_i = exp(logit_i) / Σ exp(logit_j). Temperature slots in as a divisor on the logits before that normalisation: p_i = exp(logit_i / T) / Σ exp(logit_j / T). Dividing by a number smaller than 1 stretches the gaps between scores apart — the model’s favourite token gets pulled even further ahead, so the distribution sharpens towards picking the same top candidate almost every time. Dividing by a number larger than 1 compresses the gaps, flattening the distribution so weaker candidates get a real chance of being picked. At the limit, T → 0 always picks the single highest-scoring token (this is called greedy decoding), and T → ∞ approaches picking uniformly at random from the entire vocabulary.
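The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular library's implementation: the function name `softmax_with_temperature` and the three example logits are made up for the purpose, and subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T = 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    # Subtracting the largest value avoids overflow in exp() and cancels out
    # in the ratio, so the probabilities are unchanged.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores only; they do not come from a real model.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the top candidate's share growing as T shrinks and shrinking as T grows, while each list still sums to 1.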
That’s the whole mechanism. The rest of this post is watching it run.
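The remaining two steps, drawing a weighted sample and appending it, are just as short. The sketch below assumes a hypothetical three-token vocabulary and made-up logits, treats T = 0 as greedy decoding, and uses a seeded `random.Random` so that runs are repeatable.

```python
import math
import random

def sample_next(tokens, logits, temperature, rng):
    """Pick one token: argmax when T is 0, otherwise a weighted random draw."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return tokens[best]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices accepts unnormalised weights, so the softmax
    # denominator does not need to be computed separately here.
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
tokens = ["word", "token", "grid"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]            # illustrative scores
output = ["the", "next"]
output.append(sample_next(tokens, logits, 0.8, rng))
print(" ".join(output))
```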
A Real (Tiny) Language Model You Can Watch Think
Real model logits come from a transformer’s output layer — not something forty lines of JavaScript can reproduce. But the four steps after the logits are computed — temperature, softmax, sampling, append — are exactly the same regardless of how the logits were produced. So below is a genuine, tiny language model: a bigram model, trained on the short paragraph quoted underneath it, using literal word-pair counts as a stand-in for logits. Adjust temperature, click to generate one word at a time, and watch the actual probability distribution for what comes next, computed live from real counts in that text.
Training corpus: “the model predicts the next word using probabilities the model learns these probabilities from data the next word depends on the previous words spatial data is structured into grids the grid divides space into hexagons hexagons tile the plane efficiently the model can also predict rare words rarely the model predicts words it has never seen the next token is sampled from a distribution the distribution changes with temperature high temperature makes the distribution flatter low temperature makes the distribution sharper the sharper distribution favors common words the flatter distribution favors rare words sampling rare words sometimes produces nonsense sampling common words often produces repetition the best temperature depends on the task”
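For readers who would rather run the model than click it, here is a rough Python sketch of the same idea. It makes one assumption worth flagging: the logit for each successor is taken to be the logarithm of its pair count, so that T = 1 reproduces the raw count proportions. The function names `next_word_distribution` and `generate` are illustrative, not taken from the demo's own code.

```python
import math
import random
from collections import Counter, defaultdict

CORPUS = (
    "the model predicts the next word using probabilities the model learns "
    "these probabilities from data the next word depends on the previous words "
    "spatial data is structured into grids the grid divides space into hexagons "
    "hexagons tile the plane efficiently the model can also predict rare words "
    "rarely the model predicts words it has never seen the next token is sampled "
    "from a distribution the distribution changes with temperature high "
    "temperature makes the distribution flatter low temperature makes the "
    "distribution sharper the sharper distribution favors common words the "
    "flatter distribution favors rare words sampling rare words sometimes "
    "produces nonsense sampling common words often produces repetition the best "
    "temperature depends on the task"
)

# "Training": count how often each word follows each other word.
words = CORPUS.split()
counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev, temperature=1.0):
    """Probabilities for the word after `prev`; empty dict if it is a dead end."""
    successors = counts.get(prev)
    if not successors:
        return {}
    # Assumption: logit = log(count), so T = 1 gives count / total.
    scaled = {w: math.log(c) / temperature for w, c in successors.items()}
    peak = max(scaled.values())
    exps = {w: math.exp(s - peak) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def generate(start, n_words, temperature=1.0, seed=0):
    """Sample up to n_words successors one at a time, appending each draw."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(n_words):
        dist = next_word_distribution(output[-1], temperature)
        if not dist:  # this word never had a successor in the corpus
            break
        candidates = list(dist)
        weights = [dist[w] for w in candidates]
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return output

for t in (0.2, 1.0, 1.8):
    print(t, " ".join(generate("the", 12, temperature=t)))
```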
Set the slider to 0.2 and generate ten or fifteen words — you’ll see it fall into short, repetitive loops, because the sharpened distribution keeps picking the same dominant successor almost every time. Reset, set it to 1.8, and generate again — the bars flatten out, weaker candidates start getting picked, and the output drifts into word salad faster. Leave it at 1.0 and you’re looking at the unmodified distribution implied by the training counts, which is the closest this toy model gets to “what the data actually suggests” with no thumb on the scale either way. None of those three behaviours required different code — only a different number going into the same formula.
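To put numbers on those three settings, take one word from the corpus. "model" is followed by "predicts" twice and by "learns" and "can" once each. The sketch below again assumes that logits are log-counts, in which case dividing by T is the same as raising each count to the power 1/T before normalising; `distribution_at` is an illustrative name.

```python
import math

# Successor counts for "model" in the training paragraph.
successor_counts = {"predicts": 2, "learns": 1, "can": 1}

def distribution_at(temperature, counts=successor_counts):
    """Temperature-scaled probabilities, assuming logit = log(count)."""
    # exp(log(c) / T) equals c ** (1 / T), so temperature acts as an exponent on counts.
    weights = {w: math.exp(math.log(c) / temperature) for w, c in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

for t in (0.2, 1.0, 1.8):
    print(t, {w: round(p, 3) for w, p in distribution_at(t).items()})
```

Under that assumption, "predicts" holds half the probability mass at 1.0, roughly 94% at 0.2, and roughly 42% at 1.8.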
Why Real LLMs Aren’t Just Bigger Bigram Tables
Two honest differences separate this demo from a production model, and it’s worth being precise about which parts of the analogy hold. First, a bigram model only ever looks at one previous word, whereas a transformer’s logits are conditioned, through attention, on potentially thousands of prior tokens. That context is where almost all of the actual intelligence lives, and its absence is where this toy model’s “dead ends” and repetition loops come from, since one word of context is nowhere near enough to disambiguate. Second, real models operate over subword tokens, not whole words, so “hexagonal” might be three tokens rather than one; that changes the granularity but not the mechanism. The part that does carry over exactly is everything downstream of the logits: temperature, softmax, and weighted sampling are identical maths whether the logits came from word-pair counts or a 70-billion-parameter network.
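One refinement worth knowing: most production APIs also support top-p (nucleus) sampling, which truncates the candidate list to the smallest set of tokens whose cumulative probability reaches p and renormalises what is left. That discards the long tail of near-zero-probability tokens, so a high temperature is far less likely to sample something absurd from deep in the distribution’s tail. The order of the two steps differs between implementations: some truncate before applying temperature, while others scale by temperature first and truncate the already-flattened distribution, in which case a very high temperature widens the nucleus and lets more of the tail back in.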
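The truncation step on its own can be sketched as follows. The function name `top_p_filter`, the five tokens, and their probabilities are all made up for illustration. The sketch operates on probabilities that have already been computed, so where it sits relative to temperature scaling is left to the caller.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose mass reaches p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete; everything after this is discarded
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Hypothetical next-token probabilities with a thin tail.
probs = {"grid": 0.5, "plane": 0.3, "hexagon": 0.15, "zebra": 0.04, "qux": 0.01}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)
```

With p = 0.9 the two tail tokens are dropped and the remaining three are rescaled to sum to 1, so a later weighted draw can no longer land on them.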
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Low temperature: near-deterministic, good for code, extraction, factual answers
precise = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write one sentence about hexagonal grids."}],
    temperature=0.2,
)

# Higher temperature: more varied phrasing, better for brainstorming or creative writing
creative = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write one sentence about hexagonal grids."}],
    temperature=1.3,
    top_p=0.95,
)

# Print both replies so the difference between runs is visible
print(precise.choices[0].message.content)
print(creative.choices[0].message.content)
Run that pair of calls a few times each and the pattern from the toy demo holds at full scale: the temperature=0.2 call returns near-identical sentences every run, and temperature=1.3 returns something different — and occasionally something stranger — each time. Same formula, same trade-off, just with a vastly better set of logits underneath it.