You hit Enter. Your request is born as a single spark.

You hit Enter. Your question leaves your keyboard as a single request.

It races through fiber-optic cable to a datacenter — often on the other side of the planet — in milliseconds.

There it waits in a queue for a free GPU. The compute itself is only 20–100 ms of the 30–200 ms you wait; the rest is mostly time spent in line.

When a GPU is free, your text is turned into tokens — the model's true input. Meet the dots we'll follow for the rest of the journey.

Your prompt starts as plain text — a row of characters.

Byte-pair encoding repeatedly merges the most frequent adjacent characters into tokens: the model's real alphabet.

Each token is just an integer ID. The model never sees letters — which is why it miscounts them.
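The merge step can be sketched in a few lines. This is a toy version under loud assumptions: it learns merges from one short string, works on characters rather than bytes, and `bpe_merges` is an illustrative name, not any tokenizer library's API.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent pair."""
    symbols = list(text)              # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break                     # nothing repeats, so stop merging
        merges.append((a, b))
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

tokens, merges = bpe_merges("low lower lowest", num_merges=3)
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]  # the integers the model actually sees
print(tokens, ids)
```

Once the text has become `ids`, the individual letters are gone, which is the root of the letter-counting problem.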

The model has read your prompt. Generation is a loop: predict the next token, append it, repeat.

First it scores every possible next token — one probability for each entry in its vocabulary.

Temperature, top-k, and top-p reshape those odds. Drag the controls and watch the field respond.

One token is sampled from what's left and appended to the sequence.
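One way to picture the three controls is as filters applied in sequence before the draw. The sketch below assumes raw scores arrive as a plain list of logits; `sample_next` is a hypothetical helper, and real inference stacks differ in details such as whether top-p is applied before or after renormalising the top-k survivors.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Return a token ID sampled after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # subtract max for stability
    total = sum(exps)
    probs = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Draw one token from what is left; choices() renormalises the weights.
    weights = [p for p, _ in kept]
    ids = [i for _, i in kept]
    return rng.choices(ids, weights=weights, k=1)[0]

print(sample_next([2.0, 1.0, 0.1], temperature=0.7, top_k=2))
```

With `top_k=1` the function always returns the most likely token, which is greedy decoding.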

Then the loop runs again. Each token streams out — which is why the answer appears one word at a time.

Your whole prompt arrives at once — a solid slab of tokens ready to be read in a single sweep.

Prefill: every token is processed in parallel in one forward pass. The GPU's compute cores run flat out — this is compute-bound work.

That sweep produces the first token, and the mode flips. The chamber narrows from a wide wave to a single stream.

Decode: one token enters, one exits, again and again. Most compute sits idle; the memory bus does the work — this is memory-bound.

Two phases, two cost shapes. That's why time-to-first-token (0.5–2 s) and inter-token latency (20–100 ms) are never the same number.

The KV cache is the model's short-term memory — a hall of empty slots, one per token, waiting.

Prefill seeds it: every prompt token's key and value is written at once. The hall lights up blue.

Then decode appends — one new slot per token, glowing gold, extending the hall step by step.

The cache grows linearly with the conversation. Try the calculator: longer context and higher precision stretch it fast.

Fewer KV heads (GQA) shrink each slot — how a 70B model keeps its cache affordable.
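The arithmetic behind the calculator fits in one function. The shape below (80 layers, head dimension 128, 64 query heads, 8 KV heads) is an assumption chosen to resemble a Llama-2-70B-style model, and `kv_cache_bytes` is an illustrative name.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # 2x because every token stores both a key and a value in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed model shape; FP16 values (2 bytes each); 4096 tokens of context.
full = kv_cache_bytes(4096, layers=80, kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(4096, layers=80, kv_heads=8, head_dim=128)
print(f"64 KV heads: {full / 2**30:.2f} GiB")   # 10.00 GiB
print(f" 8 KV heads: {gqa / 2**30:.2f} GiB")    # 1.25 GiB
```

Under these assumptions, cutting 64 KV heads to 8 shrinks the cache eightfold for the same context length.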

Every key and value from earlier tokens sits in the cache — a row of stored answers.

The new token's query sweeps across them, scoring each key: how relevant are you to me right now?

Softmax turns those raw scores into weights between 0 and 1 that sum to one.

The output is a weighted blend — tokens the model deems relevant contribute more of their value.

That blended vector flows to the next layer, and the new token's own key and value join the cache.
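The score, softmax, and blend steps can be written out for a single head and a single new token. This is a minimal sketch with plain Python lists; `attend` is a hypothetical helper, and the division by the square root of the vector length is the usual scaled dot-product convention rather than something stated above.

```python
import math

def attend(query, cached_keys, cached_values):
    """One query vector attends over every cached key/value pair."""
    scale = 1.0 / math.sqrt(len(query))
    # Score each cached key against the query.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cached_keys]
    # Softmax: weights between 0 and 1 that sum to one.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted blend of the cached values.
    dim = len(cached_values[0])
    output = [sum(w * v[d] for w, v in zip(weights, cached_values)) for d in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attend([4.0, 0.0], keys, values)
print(weights, output)   # the first key matches the query, so its value dominates
```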

A modern GPU is a vast grid of compute units: more than ten thousand cores waiting for work.

During prefill, the whole grid ignites: one weight stream feeds all prompt tokens at once. Compute-bound.

During decode, the lights go out. A single token lights only a handful of cores; the rest sit idle.

Why? Every weight must still cross the narrow memory bridge for that one token — memory-bandwidth-bound.

More TFLOPS can't fix this; only more bandwidth can. The memory wall gets taller every GPU generation.
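A back-of-the-envelope estimate shows how lopsided decode is. The hardware figures here are round, hypothetical numbers (3 TB/s of memory bandwidth, 1000 TFLOPS of compute), the model is assumed to be 70B parameters in FP16, and the sketch ignores the KV cache, multi-GPU splits, and every other overhead.

```python
def decode_step_ms(params, bytes_per_weight, bandwidth_bytes_s, flops_s, batch=1):
    # Every weight is read from memory once per step, whatever the batch size.
    memory_ms = params * bytes_per_weight / bandwidth_bytes_s * 1e3
    # Roughly 2 FLOPs (multiply + add) per weight per token in the batch.
    compute_ms = 2 * params * batch / flops_s * 1e3
    return memory_ms, compute_ms

for batch in (1, 512):
    mem, comp = decode_step_ms(70e9, 2, 3e12, 1e15, batch=batch)
    print(f"batch={batch}: memory {mem:.1f} ms, compute {comp:.2f} ms")
```

With these assumed numbers a single token costs roughly 47 ms of weight traffic against about 0.14 ms of arithmetic; only at batch sizes in the hundreds does compute catch up.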

A single GPU serves thousands of users by batching their requests together — sharing one weight read across many tokens.

Static batching locks the batch: it waits for the room to fill, then waits for the slowest request to finish.

So fast requests sit idle in dark, wasted slots while one long request holds the whole batch hostage.

Continuous batching re-evaluates the batch every single token: finished requests slide off, new ones join immediately.

The result: steady throughput no matter how much output lengths vary — drag the variance slider to feel the gap.
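The gap can be felt with a tiny simulation. It is a sketch under strong simplifications: each request is just a count of output tokens, every decode step costs the same, prefill is ignored, and both function names are made up for illustration.

```python
from collections import deque

def static_steps(lengths, slots):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths, slots):
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:       # refill free slots every step
            active.append(queue.popleft())
        active = [r - 1 for r in active if r > 1]  # one token each; drop finished
        steps += 1
    return steps

lengths = [50, 2, 2, 2, 50, 2, 2, 2]   # two long requests among short ones
print(static_steps(lengths, slots=4))      # 100
print(continuous_steps(lengths, slots=4))  # 52
```

In the static case the second long request cannot start until the first batch drains; in the continuous case it joins as soon as a slot frees up.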

The old way reserved one big contiguous slab per request — sized for the worst case.

Most of each slab went unused: 60–80% of KV cache memory wasted to fragmentation and duplication.

PagedAttention splits memory into uniform fixed-size blocks — like an operating system paging memory.

A block table maps each sequence's logical order to whatever physical blocks happen to be free. No block needs to neighbor another.

And shared prompts live once: Copy-on-Write lets many sequences read the same blocks until one diverges.
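The bookkeeping behind the block table can be sketched as a small allocator. `PagedKVCache` and its methods are hypothetical names, not the API of any serving library, and the sketch tracks only block IDs: a real system would also copy the block's contents on a copy-on-write and handle running out of free blocks.

```python
class PagedKVCache:
    """Toy block allocator: block tables plus reference counts for sharing."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block IDs
        self.refs = {}                        # physical block -> reference count
        self.tables = {}                      # sequence ID -> physical blocks
        self.lengths = {}                     # sequence ID -> tokens stored

    def _alloc(self):
        block = self.free.pop()               # any free block will do
        self.refs[block] = 1
        return block

    def add_sequence(self, seq_id):
        self.tables[seq_id], self.lengths[seq_id] = [], 0

    def fork(self, parent, child):
        # Share every block of the parent; nothing is copied yet.
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.tables[child]:
            self.refs[block] += 1

    def append_token(self, seq_id):
        table = self.tables[seq_id]
        if self.lengths[seq_id] % self.block_size == 0:
            table.append(self._alloc())       # last block is full: take a new one
        elif self.refs[table[-1]] > 1:
            # Copy-on-write: the last block is shared, so switch to a private one.
            self.refs[table[-1]] -= 1
            table[-1] = self._alloc()
        self.lengths[seq_id] += 1

cache = PagedKVCache(num_blocks=8, block_size=4)
cache.add_sequence("a")
for _ in range(6):
    cache.append_token("a")       # 6 tokens occupy 2 blocks
cache.fork("a", "b")              # "b" shares both blocks
cache.append_token("b")           # "b" diverges: only the last block is replaced
print(cache.tables)
```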

Naive attention builds the full N×N score matrix and writes it to slow HBM — over 90% of the memory traffic.

But the output never needs the whole matrix at once. It only needs dot products, computed block by block.

FlashAttention tiles Q, K and V into small blocks that fit in tiny, blazing-fast on-chip SRAM.

Each score tile is computed, consumed by an online softmax, and discarded — it never touches HBM.

Only the final output crosses back to HBM. Same math, same FLOPs — the N×N matrix was never born in slow memory.
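The online softmax is the piece that makes tiling possible, and it can be checked in plain Python. This sketch handles one query vector, omits the usual scaling factor in both functions, and says nothing about where data lives on the chip; it only shows that processing keys tile by tile gives the same answer as scoring them all at once. Both function names are illustrative.

```python
import math

def naive_attention(q, keys, values):
    # Reference: materialise every score, then softmax, then blend.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[d] for e, v in zip(exps, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max          # this tile's scores can now be discarded
    return [a / running_sum for a in acc]

q = [1.0, 2.0]
keys = [[0.5, 0.1], [1.0, -1.0], [2.0, 0.3], [-0.7, 0.9], [0.2, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))   # same values up to rounding
```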

Standard decode produces one token per step while almost all of the GPU's compute sits idle.

So a small, fast draft model sprints ahead and guesses several tokens in a row — at a fraction of the cost.

The big target model then verifies all those guesses in a single parallel forward pass.

Matching guesses are accepted; the first miss is resampled by the big model, and everything after it is discarded.

The result: several tokens per step at big-model quality, with provably no quality loss, because the output distribution matches the big model's.
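The accept-or-reject rule is usually described as a form of rejection sampling, and a sketch of it looks like this. It assumes both models' probability distributions are already available as plain lists for every drafted position; `verify_draft` and its arguments are hypothetical names for illustration.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens kept after one verification pass.

    draft_probs[i] and target_probs[i] are the two models' distributions at
    position i; target_probs has one extra entry for the bonus token.
    """
    accepted = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i][token], draft_probs[i][token]
        # Accept with probability min(1, p/q).
        if rng.random() < min(1.0, p / q):
            accepted.append(token)
            continue
        # First miss: resample from the leftover mass max(0, p - q).
        residual = [max(0.0, a - b) for a, b in zip(target_probs[i], draft_probs[i])]
        accepted.append(rng.choices(range(len(residual)), weights=residual, k=1)[0])
        return accepted                # later guesses are discarded
    # Every guess accepted: the same pass also yields one extra token.
    bonus = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(bonus)), weights=bonus, k=1)[0])
    return accepted

same = [[0.5, 0.5], [0.5, 0.5]]
print(verify_draft([0, 1], same, same + [[1.0, 0.0]]))   # [0, 1, 0]
```

When the two models agree exactly, every guess is kept and the step yields one token more than was drafted; when they disagree, at least one token still comes out, drawn so that the overall distribution matches the target model.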

Each FP16 weight takes 2 bytes, so a 70B model is 140 GB: too big for a single 80 GB GPU.

Quantization stores each weight in fewer bits: INT8 halves it, INT4 quarters it — no retraining needed.
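The simplest version of the idea is plain round-to-nearest with one shared scale. The sketch below is that naive scheme only, written with illustrative helper names, alongside the size arithmetic for an assumed 70B-parameter model (ignoring the small overhead of storing scales).

```python
def quantize_int8(weights):
    # One scale for the whole group: map the largest magnitude to 127.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def model_gb(params, bits):
    return params * bits / 8 / 1e9

weights = [0.12, -0.5, 0.03, 0.44]
q, scale = quantize_int8(weights)
print(q, dequantize(q, scale))          # small integers, and values close to the originals

for bits in (16, 8, 4):
    print(bits, "bits:", model_gb(70e9, bits), "GB")   # 140.0, 70.0, 35.0
```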

Smart schemes (GPTQ, AWQ) protect the important weights so quality barely moves, even at 4 bits.

The KV cache compresses too: FP8 keys and values free up memory for larger batches or longer context.

Since decode is memory-bound, fewer bytes across the HBM bridge means faster and cheaper inference.

Every pricing page shows three prices: input, cached input, and output. Why three?

Input is prefill — massively parallel, compute-bound, and cheap per token.

Cached input is a KV-cache hit: those tokens skip prefill entirely, so they're nearly free.

Output is decode — sequential and memory-bound, so it's the most expensive of the three.

Price is the physics of the work. Try the calculator: every ratio traces back to silicon.
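The calculator's core is a three-term sum. The prices below are made-up placeholders chosen only to show the shape of the ratios, not any provider's actual rates, and `request_cost` is a hypothetical helper.

```python
def request_cost(input_tokens, cached_tokens, output_tokens, price_per_million):
    # Cached tokens are billed at the cached rate instead of the input rate.
    fresh = input_tokens - cached_tokens
    return (fresh * price_per_million["input"]
            + cached_tokens * price_per_million["cached_input"]
            + output_tokens * price_per_million["output"]) / 1e6

# Hypothetical prices in dollars per million tokens.
prices = {"input": 1.00, "cached_input": 0.10, "output": 4.00}

cold = request_cost(10_000, 0, 500, prices)      # nothing cached
warm = request_cost(10_000, 8_000, 500, prices)  # 8,000 prompt tokens hit the cache
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

Under these placeholder prices, 500 output tokens cost as much as 2,000 fresh input tokens, and caching most of the prompt cuts the bill by more than half.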

The answer has finished. Let's retrace the whole journey in one flight.

Your words became tokens, looped through the model one prediction at a time, and streamed back to your screen.

Underneath it all sat one villain: the memory wall — a GPU starved for data, waiting on its own memory bus.

Every optimization after that is an answer to one of two enemies.

Fight the Memory Wall: batching, paging, FlashAttention, speculative decoding, quantization.

Avoid Recomputation: the KV cache, prompt caching, prefix sharing. Two enemies. Thirteen chapters. One journey.