You built an efficient Transformer (L12) that fits on a GPU. Now serve it to real users. That means: shrink the weights to INT4 without destroying accuracy, make the KV cache fit longer contexts, and answer hundreds of requests per second with low latency. This lesson builds the full deployment stack from quantization math to speculative decoding to PagedAttention — with worked numbers at every step.
You've trained a 7B-parameter language model. It's brilliant. Now ship it. A single FP16 copy occupies 14 GB — already filling an entire consumer GPU. Serving a thousand simultaneous users means you need enough GPU memory for weights plus KV caches for every active sequence. The math doesn't work in FP16.
Beyond memory, there's latency. L12 established that autoregressive decode is memory-bandwidth bound — at batch size 1, arithmetic intensity is ~1 FLOP/byte, 156× below the A100's compute ridge. Every token you generate requires reading all the model weights from HBM. With a 7B FP16 model that's 14 GB of reads per token. Shrink the weights and you directly shrink the time per token.
The deployment stack attacks three problems simultaneously: weight quantization (fit more model in GPU memory, speed up weight reads), KV-cache memory management (serve many users without fragmentation waste), and generation throughput (batch requests intelligently, use speculative decoding to extract multiple tokens per big-model step).
Let's derive the numbers. A 7B model in FP16: 7 × 10⁹ × 2 bytes = 14 GB. In INT8: 7 GB. In INT4: 3.5 GB. On a 24 GB RTX 4090, INT4 leaves 20.5 GB for KV caches — enough for a 32k-token context batch. INT4 also means 4× fewer bytes to read per token → 4× faster decode on a memory-bandwidth-bound workload.
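The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `weight_gb` is made up for illustration, 1 GB is taken as 10⁹ bytes, and the small overhead of quantization scales and zero-points is ignored.

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), ignoring scales and zero-points."""
    return n_params * bits / 8 / 1e9

GPU_GB = 24.0  # RTX 4090 capacity, as used in the text

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    w = weight_gb(7e9, bits)
    print(f"{name}: weights {w:.1f} GB, left for KV cache {GPU_GB - w:.1f} GB")
```

Changing `7e9` or `GPU_GB` gives the same budget for other model sizes and cards.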
Select model size and bitwidth. The GPU memory line (dashed) shows an A100-80GB. Watch how INT4 opens up space for KV caches.
For CNNs and BERT-scale models, standard W8A8 quantization works beautifully — both weights and activations are quantized to INT8, arithmetic is done in INT8, and accuracy loss is under 0.5%. Then people tried W8A8 on LLMs beyond 6.7B parameters and found a cliff: perplexity explodes.
The culprit: systematic activation outliers. In large LLMs, certain hidden-state channels develop extreme values — magnitudes of 50–150 — while other channels sit near ±1. This isn't noise; the same channels are outliers across different inputs and tokens. It's a structural feature of how large transformers represent information.
Recall INT8 quantization: scale S = max|X| / 127. If one channel has magnitude 70 and everything else is ±2, then S = 70/127 ≈ 0.55. A value of 2.0 quantizes to round(2.0/0.55) = round(3.6) = 4, then dequantizes to 4 × 0.55 = 2.2. The small value comes back with a 10% error, because the outlier forces a step size that is far too coarse for it. Multiply by a weight matrix and sum over a hidden dimension of 4096, and these errors accumulate catastrophically.
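To see the effect numerically, here is a small sketch of symmetric per-tensor absmax quantization in plain Python. The function name and the toy activation values are illustrative assumptions, not taken from any library.

```python
def quantize_absmax_int8(values):
    """Symmetric per-tensor INT8: a single scale set by the largest magnitude."""
    scale = max(abs(v) for v in values) / 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale

# Same small values, with and without one outlier channel in the tensor
with_outlier, s1 = quantize_absmax_int8([70.0, 2.0, -1.5, 0.3])
without_outlier, s2 = quantize_absmax_int8([2.0, -1.5, 0.3])

print("scale with outlier:   ", round(s1, 4), "->", [round(v, 3) for v in with_outlier])
print("scale without outlier:", round(s2, 4), "->", [round(v, 3) for v in without_outlier])
```

With the outlier present, 2.0 is reconstructed as roughly 2.2; without it, the same value is reconstructed almost exactly.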
Why do outliers emerge at 6.7B? Song Han's lab and others observe this empirically — it appears to be a phase transition tied to model capacity. Below 6.7B, activations are well-behaved. Above it, the model learns to use a small fraction of channels as high-magnitude "signals." Nobody has a fully satisfying theoretical explanation, but the empirical fact is robust: every model beyond ~7B (LLaMA, OPT, BLOOM) exhibits this.
The failure mode is concrete. OPT-175B with naive INT8 activation quantization loses ~7 perplexity points (from 8.3 to 15+ on WikiText-2). For a deployed chatbot that's the difference between fluent and incoherent. W8A8 is simply not viable without addressing the outlier problem.
SmoothQuant (Xiao et al., 2022) makes a simple but powerful observation: weights are easy to quantize, activations are hard. Can we move the difficulty from activations to weights? Yes — by applying a per-channel scale to the activations and its inverse to the corresponding weight channels, mathematically preserving the output while balancing the quantization difficulty.
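Formally: given a linear layer Y = XW, insert a diagonal scaling matrix diag(s) and its inverse:

$$Y = \left(X\,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\,W\right) = \hat{X}\hat{W}$$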
The scale vector s is chosen to make both X̂ and Ŵ easy to quantize. The formula used is per-channel:

$$s_j = \frac{\max\lvert X_j\rvert^{\alpha}}{\max\lvert W_j\rvert^{\,1-\alpha}}$$
Here j indexes the input channel. When α = 0.5, we split the difficulty equally. When α → 1, we push all difficulty to weights (activations become easy, weights harder). The migration strength α is a hyperparameter tuned per model — LLaMA family typically uses α ≈ 0.85.
In practice: run a small calibration set (128–512 samples) through the FP16 model. Record the per-channel absmax of activations at each linear layer. Compute s from the formula above. Divide the LayerNorm output scale by s (or equivalently, multiply the LayerNorm output by s⁻¹). Multiply the weight matrix rows by s. Quantize both to INT8. Done — the quantized model has comparable accuracy to FP16 for models up to 530B parameters (MT-NLG 530B: FP16 perplexity 12.0, SmoothQuant INT8: 12.1).
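The smoothing step can be checked on a toy layer. The sketch below uses plain Python lists with the Y = XW convention from the text (X is tokens × input channels, W is input channels × output channels); the helper names and the example matrices are assumptions for illustration only.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def smooth_scales(X, W, alpha=0.5):
    """s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) for each input channel j."""
    act_max = [max(abs(row[j]) for row in X) for j in range(len(W))]
    w_max = [max(abs(w) for w in W[j]) for j in range(len(W))]
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, w_max)]

def smooth(X, W, alpha=0.5):
    s = smooth_scales(X, W, alpha)
    X_hat = [[x / sj for x, sj in zip(row, s)] for row in X]   # X diag(s)^-1
    W_hat = [[w * sj for w in row] for row, sj in zip(W, s)]    # diag(s) W
    return X_hat, W_hat

X = [[60.0, 1.0, -0.5], [-70.0, -2.0, 0.8]]   # input channel 0 is the outlier
W = [[0.2, -0.1], [0.5, 0.4], [-0.3, 0.6]]
X_hat, W_hat = smooth(X, W)

Y, Y_hat = matmul(X, W), matmul(X_hat, W_hat)
print("output preserved:", all(abs(a - b) < 1e-9 for ra, rb in zip(Y, Y_hat) for a, b in zip(ra, rb)))
print("channel-0 activation absmax:", max(abs(r[0]) for r in X), "->", max(abs(r[0]) for r in X_hat))
print("channel-0 weight absmax:    ", max(abs(w) for w in W[0]), "->", max(abs(w) for w in W_hat[0]))
```

At α = 0.5 the activation and weight ranges of each channel end up equal, which is the "balanced difficulty" case; the layer output is unchanged up to floating-point rounding.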
Toggle between original (hard activation, easy weight) and smoothed (balanced). Drag α to see difficulty tradeoff. Bar height = quantization difficulty (spread / dynamic range).
SmoothQuant enables W8A8. But W8A8 weights are still 2× larger than W4A16 weights. For edge deployment (a 7B model on a laptop GPU with 8 GB VRAM) you need 4-bit weights. And you need a smarter method than naive round-to-nearest (RTN), which already raises LLaMA-2-7B perplexity from 5.47 to 5.73 at INT4 g128 and to 6.66 at INT3 g128.
AWQ (Lin et al., MLSys 2024) starts from an observation: not all weights are equally important. Keeping just 1% of weight channels in FP16 (the "salient" ones) while quantizing the rest to INT4 dramatically reduces perplexity loss. The key question is: which 1%?
The naive answer is magnitude — keep the largest weights. But AWQ shows this is wrong. The right criterion is activation magnitude. A weight channel connected to a high-activation input channel is salient, because the output error from quantizing that weight is amplified by the large input.
Rather than storing 1% of weights in FP16 (which complicates hardware), AWQ uses a cleverer trick: scale salient channels up before quantizing, then divide the input by the same scale at runtime. Mathematically: Y = WX = (W·s)(s⁻¹·X). Scaling W up by s reduces its quantization error proportionally. The per-channel scale s is fused into the preceding layer (like SmoothQuant's folding trick), so runtime cost is zero.
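The scale s is found by grid search over activation statistics:

$$s = s_X^{\alpha^*}, \qquad \alpha^* = \arg\min_{\alpha}\ \bigl\lVert Q\bigl(W\,\mathrm{diag}(s_X^{\alpha})\bigr)\,\bigl(\mathrm{diag}(s_X^{\alpha})^{-1} X\bigr) - W X \bigr\rVert$$

where s_X is the per-channel average activation magnitude and Q(·) is the INT4 quantizer. The only free parameter is α ∈ [0, 1], searched over 100 steps. This is data-efficient: a tiny calibration set, no gradient computation.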
AWQ results on LLaMA-2-7B (FP16 PPL = 5.47). At INT4 g128: RTN = 5.73, GPTQ = 5.69, AWQ = 5.60. At INT3 g128 the gains are larger: RTN = 6.66, AWQ = 6.24. AWQ also generalizes to multi-modal models: VILA-7B AWQ INT4 matches FP16 on 12 VQA benchmarks.
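```python
# AWQ-style per-channel scale search (simplified sketch)
import torch

def fake_quant_int4(W, group_size=128):
    # Illustrative helper: asymmetric INT4 round-to-nearest per group of input
    # channels, returned already dequantized ("fake quantization").
    out_dim, in_dim = W.shape
    assert in_dim % group_size == 0
    Wg = W.reshape(out_dim, in_dim // group_size, group_size)
    w_min = Wg.amin(dim=-1, keepdim=True)
    w_max = Wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15
    q = ((Wg - w_min) / scale).round().clamp(0, 15)
    return (q * scale + w_min).reshape(out_dim, in_dim)

def awq_scale_search(W, X_calib, n_steps=100, group_size=128):
    # W: [out_dim, in_dim], X_calib: [calib_tokens, in_dim]
    s_X = X_calib.abs().mean(dim=0).clamp(min=1e-5)   # [in_dim] avg activation magnitude
    ref = W @ X_calib.T                                # full-precision reference output
    best_loss, best_s = float('inf'), None
    for alpha in torch.linspace(0, 1, n_steps):
        s = s_X ** alpha                               # [in_dim]
        # Scale weight channels up, quantize to INT4 group-wise, dequantize
        W_q_deq = fake_quant_int4(W * s.unsqueeze(0), group_size)
        # Apply the inverse scale to the input and measure output error
        X_scaled = X_calib / s.unsqueeze(0)
        loss = (W_q_deq @ X_scaled.T - ref).norm().item()
        if loss < best_loss:
            best_loss, best_s = loss, s.clone()
    return best_s  # fuse 1/s into the preceding LayerNorm scale

def pack_int4(w_int4):
    # Pack two 4-bit values into one uint8
    # w_int4: [N] integer tensor of values in [0, 15]
    assert w_int4.numel() % 2 == 0
    w_int4 = w_int4.reshape(-1, 2)
    return (w_int4[:, 0] | (w_int4[:, 1] << 4)).to(torch.uint8)

# Unpack at runtime: low = packed & 0x0F; high = (packed >> 4) & 0x0F
```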
AWQ finds a per-channel scaling to reduce INT4 error. GPTQ (Frantar et al., 2022) takes a different approach: it quantizes weights one by one, and after each quantization, updates the remaining unquantized weights to compensate for the error just introduced. This is approximate second-order optimization at the layer level.
The foundation is Optimal Brain Compression (OBC), which extends the classic Optimal Brain Surgeon (OBS) Hessian-based update to full-layer quantization. For a linear layer with weight matrix W and input activations X (from a calibration set), we want to minimize output error:

$$\arg\min_{\hat{W}}\ \lVert W X - \hat{W} X \rVert_2^2$$
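The Hessian of this objective with respect to W is H = 2XXᵀ. GPTQ quantizes each column of W in sequence: quantize column q, compute the quantization error e_q = w_q − Q(w_q), then update all remaining unquantized columns to absorb this error. The update rule is:

$$W_{:,\,j} \leftarrow W_{:,\,j} - e_q\,\frac{[H^{-1}]_{qj}}{[H^{-1}]_{qq}} \qquad \text{for } j > q$$

where H⁻¹ is the inverse Hessian over the columns that have not yet been quantized.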
This is precisely the OBS weight update formula. The key practical innovation in GPTQ is quantizing every row in the same column order, so H⁻¹ is computed once per layer and shared across all rows instead of being recomputed per weight. That makes the method fast enough to quantize a 175B model in ~4 GPU hours.
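```python
# GPTQ outer loop (conceptual; real implementations use a Cholesky factor and blocked updates)
import torch

def gptq_quantize_layer(W, X_calib, group_size=128, damp=0.01):
    # W: [out_dim, in_dim], X_calib: [n_calib, in_dim]; float32 tensors assumed
    H = 2 * X_calib.T @ X_calib                    # [in_dim, in_dim] Hessian
    # Damping relative to the mean diagonal, for numerical stability
    H += damp * H.diagonal().mean() * torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
    H_inv = torch.linalg.inv(H)                    # [in_dim, in_dim], inverted once per layer
    W_q = W.clone()
    for q in range(W.shape[1]):                    # iterate over input channels
        if q % group_size == 0:                    # new group: symmetric INT4 scale per row
            scale = W_q[:, q:q + group_size].abs().amax(dim=1).clamp(min=1e-8) / 7
        w_col = W_q[:, q].clone()                  # [out_dim]
        w_col_quant = (w_col / scale).round().clamp(-8, 7) * scale
        err = (w_col - w_col_quant) / H_inv[q, q]  # [out_dim] scaled quantization error
        W_q[:, q] = w_col_quant
        # Compensate the remaining (unquantized) columns
        W_q[:, q + 1:] -= err.unsqueeze(1) * H_inv[q, q + 1:].unsqueeze(0)
        # Remove column q from the inverse Hessian (one Gaussian-elimination step)
        H_inv = H_inv - torch.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    return W_q
```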
We now have two quantization regimes: W8A8 and W4A16. Which do you use, and when? The answer follows directly from the roofline model.
Recall from L12: during a decode step at batch size B, the arithmetic intensity of a weight matrix multiply is approximately B FLOPs/byte. The A100's compute ridge is 312 TFLOP/s ÷ 2 TB/s = 156 FLOPs/byte. So the decode step is memory-bandwidth bound whenever B < 156.
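The ridge-point test can be written as a tiny helper. This is a sketch under the text's own approximation that decode intensity is about B FLOPs/byte; the function names are made up here, and the default hardware figures are the A100 numbers quoted above.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs/byte at which a kernel switches from memory-bound to compute-bound."""
    return peak_flops / bandwidth_bytes_per_s

def decode_bottleneck(batch_size: int, peak_flops: float = 312e12,
                      bandwidth_bytes_per_s: float = 2e12) -> str:
    # Approximation from the text: arithmetic intensity of a decode step ~ batch size
    intensity = float(batch_size)
    if intensity < ridge_point(peak_flops, bandwidth_bytes_per_s):
        return "memory-bound"
    return "compute-bound"

for b in (1, 16, 64, 156, 256):
    print(f"batch {b:>3}: {decode_bottleneck(b)}")
```

Passing a different peak throughput (for example the INT8 rate) moves the ridge, which is why the crossover batch size depends on the quantization scheme.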
At B = 1 (single-user chat), AI = 1 FLOP/byte. W8A8 uses INT8 for both weights and activations — the weight bytes per step are halved vs FP16, so decode is 2× faster. W4A16 uses INT4 weights with FP16 activations — weight bytes are quartered, decode is 4× faster than FP16. And the arithmetic stays in FP16 (dequant weights on the fly → FP16 → tensor cores), so accuracy is excellent.
At large batch sizes (B ≫ 156), you're compute-bound. Now W8A8 wins: with both weights and activations in INT8, the matmul runs directly on INT8 tensor cores (2× the FP16 throughput). W4A16 must dequantize weights on the fly, which adds overhead at high batch. So the rule is:
| Setting | Batch size | Bottleneck | Best scheme |
|---|---|---|---|
| Edge / single user | 1–8 | Memory bandwidth | W4A16 (AWQ/GPTQ) |
| Cloud mid-traffic | 16–64 | Transitional | W4A16 or W8A8 |
| Cloud high-traffic | 128+ | Compute | W8A8 (SmoothQuant) |
Worked numbers — LLaMA-2-7B decode latency at bs=1, RTX 4090 (1 TB/s bandwidth):
FP16: 14 GB weights → 14 ms/token theoretical (14 GB / 1 TB/s). Actual TinyChat measurement: ~310 ms/token (due to inefficiency). INT8: 7 GB → 7 ms theoretical. INT4 (AWQ): 3.5 GB → 3.5 ms theoretical. TinyChat AWQ: ~110 ms/token — still overhead but 2.8× faster than FP16. On Orin (80 GB/s bandwidth): FP16 OOM; INT4 AWQ: 52 tokens/sec.
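The theoretical figures are just weight bytes divided by memory bandwidth. A minimal sketch follows; the function name is an illustrative assumption, and the result is a lower bound that ignores KV-cache reads, activations, and kernel overhead, which is why measured latencies are higher.

```python
def min_ms_per_token(n_params: float, bits: int, bandwidth_gb_s: float) -> float:
    """Lower bound on decode latency: every weight byte is read once per token."""
    weight_bytes = n_params * bits / 8
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    ms = min_ms_per_token(7e9, bits, bandwidth_gb_s=1000)  # 1 TB/s, as in the text
    print(f"{name}: {ms:.1f} ms/token, at most {1000 / ms:.0f} tokens/s")
```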
Drag bitwidth slider. Latency scales with bytes read (proportional to bitwidth). Compute stays FP16 (flat). At small batch, memory dominates — INT4 wins.
You've quantized the model weights. Now serve 100 concurrent users. Each user has a KV cache growing with their conversation. For a LLaMA-2-7B (L=32, H=32, d_h=128, FP16): one user at N=2048 tokens needs 32 × 2 × 2048 × 32 × 128 × 2 bytes = 1.07 GB. For 100 users, that's 107 GB — impossible on a single 80 GB A100.
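Turning the question around: how many 2048-token sequences fit next to the weights on an 80 GB card? The sketch below assumes 1 GB = 10⁹ bytes, contiguous worst-case allocation, and no other memory consumers; the helper names are hypothetical.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem  # 2 = keys and values

def max_sequences(gpu_gb: float, weight_gb: float, tokens_per_seq: int,
                  per_token_bytes: int) -> int:
    free_bytes = (gpu_gb - weight_gb) * 1e9
    return int(free_bytes // (tokens_per_seq * per_token_bytes))

per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)  # LLaMA-2-7B, FP16 KV
for label, w_gb in [("FP16 weights", 14.0), ("INT4 weights", 3.5)]:
    print(label, "->", max_sequences(80.0, w_gb, 2048, per_tok), "sequences of 2048 tokens")
```

Under these assumptions the card holds 61 full-length sequences with FP16 weights and 71 with INT4 weights, so weight quantization alone does not reach 100 users; the KV cache itself has to be managed more carefully.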
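The deeper problem is fragmentation. Traditional LLM inference systems pre-allocate a contiguous KV cache slot for each request, sized to the maximum possible output length. If you allow up to 2048 tokens but most answers are 100 tokens, you're wasting 95% of the allocated space. vLLM (Kwon et al., 2023) identified three waste categories: reserved slots held for tokens that have not been generated yet, internal fragmentation from over-provisioning each request to the maximum length, and external fragmentation between allocations of different sizes.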
PagedAttention solves this by treating KV cache like OS virtual memory. The KV cache is divided into fixed-size blocks (e.g., 16 tokens × H × d_h × 2 × FP16 = 16 × 32 × 128 × 2 × 2 bytes = 262 KB per block per layer for LLaMA-2-7B, or about 8.4 MB across all 32 layers). A block table maps logical block positions to physical GPU memory pages. Blocks are allocated on demand, one block at a time as tokens are generated, and freed immediately when a request finishes.
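The block-table idea can be illustrated with a toy allocator. This is a sketch of the bookkeeping only, not vLLM's implementation: the class and method names are invented for the example, and no actual key/value tensors are stored.

```python
class PagedKVAllocator:
    """Toy block table: maps each request to a list of physical block ids."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request id -> physical block ids, in logical order
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, req_id: str) -> None:
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n % self.block_size == 0:          # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[req_id] = n + 1

    def free(self, req_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)

    def wasted_slots(self, req_id: str) -> int:
        return len(self.block_tables[req_id]) * self.block_size - self.num_tokens[req_id]

alloc = PagedKVAllocator(num_physical_blocks=8)
for _ in range(100):                          # a 100-token answer
    alloc.append_token("req-0")
print(len(alloc.block_tables["req-0"]), "blocks used,", alloc.wasted_slots("req-0"), "slots wasted")
alloc.free("req-0")
print(len(alloc.free_blocks), "blocks free after the request finishes")
```

A 100-token request occupies 7 blocks of 16 slots and wastes 12 slots, compared with 1948 unused slots if 2048 had been reserved up front.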
PagedAttention also enables efficient prompt sharing: if 10 users send requests with the same system prompt (common in production — "You are a helpful assistant..."), the KV blocks for the shared prefix can be referenced by all 10 requests via copy-on-write. The system prompt only needs to be computed once.
Continuous (in-flight) batching pairs naturally with PagedAttention. Traditional static batching: wait for a full batch of N requests, run to completion, release. Problem: the longest request blocks all shorter ones. Continuous batching processes requests token by token. When one sequence finishes (hits end-of-sequence token), its slot is immediately freed and a new request can be inserted into the batch at the next decode step. GPU utilization stays high and tail latency is bounded.
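A toy step-count simulation makes the difference concrete. It is a sketch under simplifying assumptions: every decode step costs the same regardless of how many slots are filled, prefill is ignored, and all requests are queued at time zero. The function names and the request lengths are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest member finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Finished sequences free their slot at once; waiting requests join next step."""
    queue = deque(lengths)
    active = []                                  # tokens still to generate per in-flight sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:     # refill freed slots
            active.append(queue.popleft())
        steps += 1                               # one decode step: one token per active sequence
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]     # output lengths in tokens
print("static:    ", static_batching_steps(lengths, slots=4), "decode steps")
print("continuous:", continuous_batching_steps(lengths, slots=4), "decode steps")
```

For this workload static batching needs 200 decode steps and continuous batching needs 110, because short requests no longer wait behind the 100-token ones.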
Drag to set actual output length and block size. See how contiguous allocation wastes memory vs paged allocation with minimal waste.
Even with W4A16 and continuous batching, a 70B model at batch size 1 is slow. The bottleneck is sequential: every token requires one full pass through 80 transformer layers. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) exploits the fact that decode is memory-bandwidth bound, so the target model can score several tokens in one forward pass for almost the same cost as scoring one.
The protocol uses two models: a small, fast draft model (e.g., 7B) and the large target model (e.g., 70B). The draft model generates K tokens autoregressively (cheap: small model, fast). Those K tokens are then fed to the target model as K+1 input positions: the last context token plus the K draft tokens. The target model runs one forward pass over these K+1 positions, which it can do in parallel thanks to causal attention. It outputs a probability distribution for each position: K distributions that verify the draft tokens, plus one for the token that follows the last draft.
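For each of the K draft tokens, the target model either accepts or rejects based on acceptance sampling:

$$\text{accept } x_k \quad \text{if} \quad u < \min\!\left(1,\ \frac{p_{\text{target}}(x_k)}{p_{\text{draft}}(x_k)}\right)$$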
where u ~ Uniform[0,1]. If accepted, move to k+1. If rejected, sample a correction token from (p_target − p_draft)⁺ (the positive part of the difference). This is provably equivalent to sampling from p_target — output distribution is identical to running the target model alone.
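The equivalence claim can be verified exactly on a toy vocabulary. The sketch below computes, without sampling, the distribution of the single token emitted by one accept-or-correct step; the function name and the two example distributions are assumptions for illustration.

```python
def speculative_output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted by one accept/reject step."""
    # Probability that the draft proposes x AND it is accepted
    accept = [q * min(1.0, p / q) if q > 0 else 0.0 for p, q in zip(p_target, p_draft)]
    reject_mass = 1.0 - sum(accept)
    # On rejection, sample from the normalized positive part of (p_target - p_draft)
    residual = [max(p - q, 0.0) for p, q in zip(p_target, p_draft)]
    norm = sum(residual)
    if norm == 0.0:          # identical distributions: the draft is never rejected
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]

p_target = [0.6, 0.3, 0.1]
p_draft = [0.2, 0.5, 0.3]
out = speculative_output_distribution(p_target, p_draft)
print([round(v, 6) for v in out])
```

For these inputs the result equals `p_target` up to floating-point rounding, even though the draft distribution is quite different.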
Deriving expected speedup: Let p be the acceptance rate (probability each draft token is accepted). The number of accepted tokens before the first rejection follows a geometric distribution truncated at K. The expected number of tokens generated per target-model step:

$$\mathbb{E}[\text{tokens per step}] = \sum_{k=0}^{K} p^{k} = \frac{1 - p^{K+1}}{1 - p}$$
After a rejection (or after K tokens all accepted), the target model must run again (for the correction token or the next draft batch). So tokens per target-model step = (1 − p^(K+1)) / (1 − p). With p = 0.8 and K = 5: (1 − 0.8⁶) / (1 − 0.8) = (1 − 0.262) / 0.2 = 3.69 tokens per step. The wall-clock speedup = 3.69 × (cost_target / (cost_draft × K + cost_target)) — roughly 2–3× in practice.
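The two formulas translate directly into code. In this sketch the function names are invented, and the 10:1 target-to-draft cost ratio in the usage example is an assumption chosen for illustration, not a measurement.

```python
def expected_tokens_per_step(p: float, K: int) -> float:
    """(1 - p^(K+1)) / (1 - p): expected tokens emitted per target-model pass."""
    if p == 1.0:
        return K + 1.0
    return (1 - p ** (K + 1)) / (1 - p)

def wall_clock_speedup(p: float, K: int, cost_draft: float, cost_target: float) -> float:
    """Tokens per step, discounted by the time spent drafting K tokens."""
    return expected_tokens_per_step(p, K) * cost_target / (cost_draft * K + cost_target)

tokens = expected_tokens_per_step(p=0.8, K=5)
speedup = wall_clock_speedup(p=0.8, K=5, cost_draft=0.1, cost_target=1.0)
print(f"{tokens:.2f} tokens per target step, {speedup:.2f}x speedup")
```

With p = 0.8 and K = 5 this gives about 3.69 tokens per step; under the assumed cost ratio the speedup is roughly 2.5×. Sweeping K shows why larger drafts stop helping: the p^(K+1) term saturates while the drafting cost keeps growing.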
```python
import torch

def speculative_decode_step(draft_model, target_model, context, K=5):
    # Assumed interface: model(tokens) takes a 1-D LongTensor [T] and returns
    # next-token probabilities of shape [T, vocab]; row i predicts token i+1.
    # Step 1: draft K tokens with the small model
    draft_tokens, draft_dists = [], []
    ctx = context.clone()
    for _ in range(K):
        p_draft = draft_model(ctx)[-1]            # [vocab] distribution
        t = torch.multinomial(p_draft, 1)         # shape [1]
        draft_tokens.append(t)
        draft_dists.append(p_draft)               # keep the full distribution for corrections
        ctx = torch.cat([ctx, t])
    # Step 2: verify with the target model (ONE parallel forward pass)
    p_targets = target_model(ctx)[-(K + 1):]      # [K+1, vocab] in one pass
    # Step 3: accept/reject each draft token
    accepted = []
    for k in range(K):
        t = draft_tokens[k]
        p_tgt = p_targets[k, t]
        p_dft = draft_dists[k][t]
        if torch.rand(1) < p_tgt / p_dft:         # accept with probability min(1, p_tgt / p_dft)
            accepted.append(t)
        else:                                     # reject: sample a correction token
            p_corr = (p_targets[k] - draft_dists[k]).clamp(min=0)
            accepted.append(torch.multinomial(p_corr / p_corr.sum(), 1))
            break
    else:
        # No rejection: also sample the target's prediction for position K+1
        accepted.append(torch.multinomial(p_targets[K], 1))
    return torch.cat(accepted)  # 1..K+1 tokens generated in ONE target-model step
```
This is the deployment simulator. Set K (draft tokens), acceptance rate p, and model size. Watch the draft-verify cycle animate and track expected throughput gain vs naive autoregressive decoding. Adjust KV-cache quantization to see how INT4 KV further extends context capacity.
Drag K (draft tokens) and acceptance rate p. The left panel shows expected tokens/step and speedup. The right panel shows the acceptance-chain probability tree.
Worked KV-cache numbers — LLaMA-2-70B (L=80, H_kv=8 GQA, d_h=128):
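The per-token footprint follows from the shape given above. The sketch below derives it for FP16, INT8, and INT4 KV caches; the helper name is hypothetical, 1 GB is taken as 10⁹ bytes, and quantization scale overhead is ignored.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bits: int = 16) -> int:
    # Factor 2 covers keys and values; bits // 8 would truncate for INT4, so divide last
    return n_layers * 2 * n_kv_heads * head_dim * bits // 8

CONTEXT = 32_768  # tokens

for bits in (16, 8, 4):
    gqa = kv_bytes_per_token(80, 8, 128, bits)     # LLaMA-2-70B with 8 KV heads (GQA)
    print(f"KV{bits}: {gqa:,} bytes/token, {gqa * CONTEXT / 1e9:.1f} GB at 32k tokens")

# For comparison: the same model shape with one KV head per query head (64 heads)
mha = kv_bytes_per_token(80, 64, 128, 16)
print(f"Without GQA (FP16): {mha:,} bytes/token, {mha * CONTEXT / 1e9:.1f} GB at 32k tokens")
```

In FP16 this comes to 327,680 bytes per token, about 10.7 GB for one 32k-token sequence; INT4 KV cuts that to about 2.7 GB. Without GQA the FP16 figure would be 8× larger.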
Continuous batching maximizes GPU utilization. Rather than a static batch of B requests all processed together, continuous batching maintains a set of active sequences and replaces finished ones after each decode step. The scheduler tracks available KV-page memory and admits new requests as pages free up. This eliminates the "longest sequence holds everyone hostage" problem — typical implementation doubles throughput vs static batching at the same latency budget.
You now have the full deployment stack. Here is how the pieces fit together and when to reach for each tool.
| Technique | When to use | Key tradeoff |
|---|---|---|
| W8A8 (SmoothQuant) | Cloud, batch ≥ 64, throughput-critical | High throughput; needs outlier migration; SmoothQuant offline |
| W4A16 (AWQ) | Edge, laptop GPU, batch ≤ 16 | 4× memory; dequant-on-fly overhead at high batch |
| W4A16 (GPTQ) | Any low-batch; llama.cpp ecosystem | Slower to quantize; compatible with many runtimes |
| W4A8KV4 (QServe) | Cloud A100/L40S, high throughput + low memory | Best TOPS efficiency; needs custom kernel |
| INT4 KV cache | Long context (32k+) or large batch | 4× KV memory; SmoothAttention needed for accuracy |
| PagedAttention (vLLM) | Any production serving system | Near-zero fragmentation; ~4× throughput vs naive |
| Continuous batching | Variable-length outputs | GPU always busy; pairs with PagedAttention |
| Speculative decoding | Low-batch latency-critical; need matching draft model | 2–3× faster; distribution-preserving; draft model overhead |
Three numbers govern every serving system decision:
"The bottleneck is reading the model from memory, not running it. Make the model smaller, and everything downstream gets faster for free." — MIT 6.5940 course notes