LLM Deployment & Serving

Chapter 0: The Deployment Gap

You've trained a 7B-parameter language model. It's brilliant. Now ship it. A single FP16 copy occupies 14 GB — already filling an entire consumer GPU. Serving a thousand simultaneous users means you need enough GPU memory for weights plus KV caches for every active sequence. The math doesn't work in FP16.

Beyond memory, there's latency. L12 established that autoregressive decode is memory-bandwidth bound — at batch size 1, arithmetic intensity is ~1 FLOP/byte, 156× below the A100's compute ridge. Every token you generate requires reading all the model weights from HBM. With a 7B FP16 model that's 14 GB of reads per token. Shrink the weights and you directly shrink the time per token.

The deployment stack attacks three problems simultaneously: weight quantization (fit more model in GPU memory, speed up weight reads), KV-cache memory management (serve many users without fragmentation waste), and generation throughput (batch requests intelligently, use speculative decoding to extract multiple tokens per big-model step).

The core insight of this lecture: W8A8 (quantize both weights and activations to INT8) works great for CNNs and for large-batch LLM serving, but fails for LLMs at small batch because (1) activation outliers destroy INT8 accuracy, and (2) at bs=1 the bottleneck is weight bytes, not arithmetic — W4A16 (INT4 weights, FP16 activations) is the right answer for edge/single-user serving.

Let's derive the numbers. A 7B model in FP16: 7 × 10⁹ × 2 bytes = 14 GB. In INT8: 7 GB. In INT4: 3.5 GB. On a 24 GB RTX 4090, INT4 leaves 20.5 GB for KV caches — enough for a 32k-token context batch. INT4 also means 4× fewer bytes to read per token → 4× faster decode on a memory-bandwidth-bound workload.
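The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `weight_gb` and the 24 GB budget constant are illustrative, and it counts weight bytes only (decimal GB, as in the text), ignoring activations and runtime overhead.

```python
# Weight memory for a dense model at a given bitwidth (decimal GB, as in the text).
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

GPU_GB = 24  # RTX 4090 capacity used in the text

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        w = weight_gb(n_params, bits)
        left = GPU_GB - w
        status = f"{left:.1f} GB left for KV cache" if left > 0 else "does not fit"
        print(f"{name} @ {bits}-bit: {w:.1f} GB weights, {status}")
```

Under these assumptions a 13B model only fits on the 24 GB card once it is quantized, and a 70B model does not fit even at 4 bits.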

Model Memory by Bitwidth — 7B / 13B / 70B

Select model size and bitwidth. The GPU memory line (dashed) shows an A100-80GB. Watch how INT4 opens up space for KV caches.


A 7B model is deployed at FP16. Decode is slow. A profiler shows 8% GPU ALU utilization and 94% memory bandwidth utilization. What is the single most impactful fix?

(A) Upgrade to a GPU with more CUDA cores to raise ALU utilization. (B) Enable FlashAttention — the attention kernel is the bottleneck. (C) Quantize weights to INT4 (W4A16) — fewer bytes to read per token directly reduces memory-bound latency. (D) Increase batch size — more arithmetic intensity makes it compute-bound.

Chapter 1: The Outlier Problem — Why W8A8 Fails for LLMs

For CNNs and BERT-scale models, standard W8A8 quantization works beautifully — both weights and activations are quantized to INT8, arithmetic is done in INT8, and accuracy loss is under 0.5%. Then people tried W8A8 on LLMs beyond 6.7B parameters and found a cliff: perplexity explodes.

The culprit: systematic activation outliers. In large LLMs, certain hidden-state channels develop extreme values — magnitudes of 50–150 — while other channels sit near ±1. This isn't noise; the same channels are outliers across different inputs and tokens. It's a structural feature of how large transformers represent information.

Recall INT8 quantization: scale S = max|X| / 127. If one channel has magnitude 70 and everything else is ±2, then S = 70/127 ≈ 0.55. A value of 2.0 quantizes to round(2.0/0.55) = round(3.6) = 4, then dequantizes to 4 × 0.55 = 2.2: a 10% relative error on the largest of the small values, while anything below about S/2 ≈ 0.28 rounds to zero. Multiply by a weight matrix and sum over hidden dim 4096, and these errors accumulate catastrophically.
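The effect is easy to reproduce on a handful of numbers. The sketch below assumes symmetric per-tensor quantization with S = max|X| / 127, as described above; the function name and the sample values are illustrative.

```python
# Symmetric per-tensor INT8 quantization, as described above: S = max|X| / 127.
def quantize_dequantize_int8(values):
    scale = max(abs(v) for v in values) / 127
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return [c * scale for c in codes], scale

# One outlier channel (70) next to small channels.
x = [70.0, 2.0, 1.5, -0.2, 0.1]
x_hat, scale = quantize_dequantize_int8(x)
for v, v_hat in zip(x, x_hat):
    print(f"{v:6.2f} -> {v_hat:6.3f}  (abs error {abs(v - v_hat):.3f})")

# Without the outlier, the same small values survive almost unchanged.
small_hat, small_scale = quantize_dequantize_int8(x[1:])
for v, v_hat in zip(x[1:], small_hat):
    print(f"{v:6.2f} -> {v_hat:6.3f}  (no outlier)")
```

With the outlier present, the two smallest values collapse to zero. Without it, the scale shrinks to 2/127 and every value is reproduced to within about 0.01.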

Per-channel scaling helps weights, not activations. W8A8 with per-channel weight quantization (one scale per output channel of W) dramatically improves weight quantization accuracy. But activations change with every input — you can't calibrate per-channel scales at quantization time. The outlier channels are the same channels, but their exact magnitude varies per input. This is the fundamental asymmetry: weights are static and easy, activations are dynamic and hard.

Why do outliers emerge at 6.7B? Song Han's lab and others observe this empirically — it appears to be a phase transition tied to model capacity. Below 6.7B, activations are well-behaved. Above it, the model learns to use a small fraction of channels as high-magnitude "signals." Nobody has a fully satisfying theoretical explanation, but the empirical fact is robust: every model beyond ~7B (LLaMA, OPT, BLOOM) exhibits this.

The failure mode is concrete. OPT-175B with naive INT8 activation quantization loses ~7 perplexity points (from 8.3 to 15+ on WikiText-2). For a deployed chatbot that's the difference between fluent and incoherent. W8A8 is simply not viable without addressing the outlier problem.

An LLM has one activation channel with value 68 and 4095 channels with values between −2 and +2. You apply per-tensor INT8 quantization (S = max|X|/127). What happens to the small channels?

(A) They are quantized accurately; the scale is small so they use the full INT8 range. (B) S = 68/127 ≈ 0.535. A value of 1.5 → round(1.5/0.535) = round(2.8) = 3 → dequant 1.605. Values near 0 all round to 0. Catastrophic loss of precision in the small channels. (C) The outlier channel is clipped to INT8 max (127); other channels are unaffected. (D) Per-tensor quantization handles outliers correctly by centering the scale at zero.

Chapter 2: SmoothQuant — Migrate Difficulty to Weights

SmoothQuant (Xiao et al., 2022) makes a simple but powerful observation: weights are easy to quantize, activations are hard. Can we move the difficulty from activations to weights? Yes — by applying a per-channel scale to the activations and its inverse to the corresponding weight channels, mathematically preserving the output while balancing the quantization difficulty.

Formally: given a linear layer Y = XW, insert a diagonal scaling matrix diag(s), built from a per-channel scale vector s, and its inverse:

Y = X W = (X · diag(s)⁻¹) · (diag(s) · W) = X̂ · Ŵ

The scale vector s is chosen to make both X̂ and Ŵ easy to quantize. The formula used is per-channel:

s_j = max|X_j|^α / max|W_j|^(1−α)

Here j indexes the input channel. When α = 0.5, we split the difficulty equally. When α → 1, we push all difficulty to weights (activations become easy, weights harder). The migration strength α is a hyperparameter tuned per model — LLaMA family typically uses α ≈ 0.85.

Misconception: SmoothQuant changes the model weights at inference. It does not. The smoothed weight Ŵ = diag(s) · W is computed offline at quantization time, then quantized and stored. The per-channel division s⁻¹ applied to activations X is fused into the preceding LayerNorm — it's folded into the LayerNorm scale parameters, adding zero runtime cost. The inference model is identical in structure; only the stored weights differ.

In practice: run a small calibration set (128–512 samples) through the FP16 model. Record the per-channel absmax of activations at each linear layer. Compute s from the formula above. Divide the LayerNorm output scale by s (or equivalently, multiply the LayerNorm output by s⁻¹). Multiply the weight matrix rows by s. Quantize both to INT8. Done — the quantized model has comparable accuracy to FP16 for models up to 530B parameters (MT-NLG 530B: FP16 perplexity 12.0, SmoothQuant INT8: 12.1).
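The smoothing step can be checked on a toy layer. This is a sketch under stated assumptions: plain Python lists stand in for tensors, X is [tokens, in] and W is [in, out] so that Y = XW, and the helper names (`matmul`, `col_absmax`, `row_absmax`, `smooth`) are illustrative, not part of any SmoothQuant release.

```python
# SmoothQuant smoothing on a toy layer Y = X W, with X: [tokens, in], W: [in, out].
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def col_absmax(M):
    return [max(abs(v) for v in col) for col in zip(*M)]

def row_absmax(M):
    return [max(abs(v) for v in row) for row in M]

def smooth(X, W, alpha=0.5):
    act_max = col_absmax(X)        # max|X_j| per input channel j
    w_max = row_absmax(W)          # max|W_j| per input channel j (row j of W)
    s = [a ** alpha / w ** (1 - alpha) for a, w in zip(act_max, w_max)]
    X_hat = [[v / s_j for v, s_j in zip(row, s)] for row in X]   # X diag(s)^-1
    W_hat = [[v * s_j for v in row] for row, s_j in zip(W, s)]   # diag(s) W
    return X_hat, W_hat, s

X = [[60.0, 1.0, -0.5], [-70.0, 0.5, 2.0]]   # channel 0 is the outlier channel
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]

X_hat, W_hat, s = smooth(X, W, alpha=0.5)
print("activation absmax before:", col_absmax(X))
print("activation absmax after: ", col_absmax(X_hat))
print("weight absmax after:     ", row_absmax(W_hat))
```

At α = 0.5 each channel's activation and weight absmax both land on sqrt(max|X_j| · max|W_j|), so the outlier channel drops from 70 to under 4 while the layer output is unchanged.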

SmoothQuant: Activation vs Weight Difficulty

Toggle between original (hard activation, easy weight) and smoothed (balanced). Drag α to see difficulty tradeoff. Bar height = quantization difficulty (spread / dynamic range).


SmoothQuant computes s_j = max|X_j|^α / max|W_j|^(1−α). At α=1.0, what happens to the activations X̂ and weights Ŵ?

(A) Activations become harder to quantize; weights become easier. (B) Both activations and weights become equally hard to quantize. (C) s_j = max|X_j|/1 = max|X_j|. X̂ = X/diag(s) has absmax ≈ 1 (easy). Ŵ = diag(s)·W has channels scaled up by max|X_j| (harder). All difficulty migrates to weights. (D) α=1.0 disables the smoothing; the model is unchanged.

Chapter 3: AWQ — Activation-Aware Weight Quantization

SmoothQuant enables W8A8. But W8A8 weights are still 2× larger than W4A16 weights. For edge deployment (a 7B model on a laptop GPU with 8 GB VRAM) you need 4-bit weights. And you need a smarter method than naive round-to-nearest (RTN), which already costs about 0.26 perplexity points on LLaMA-2-7B at INT4 with group size 128 (5.73 vs 5.47 for FP16) and degrades faster at lower bitwidths.

AWQ (Lin et al., MLSys 2024) starts from an observation: not all weights are equally important. Keeping just 1% of weight channels in FP16 (the "salient" ones) while quantizing the rest to INT4 dramatically reduces perplexity loss. The key question is: which 1%?

The naive answer is magnitude — keep the largest weights. But AWQ shows this is wrong. The right criterion is activation magnitude. A weight channel connected to a high-activation input channel is salient, because the output error from quantizing that weight is amplified by the large input.

The AWQ error insight: For a weight w and input x, the output error from quantizing w is Δw · x. So error = Δw · x, not just Δw. Large x → large error from any quantization of w. The salient weights to protect are those in channels with large average activation magnitude — not those with large weight magnitude.

Rather than storing 1% of weights in FP16 (which complicates hardware), AWQ uses a cleverer trick: scale salient channels up before quantizing, then divide the input by the same scale at runtime. Mathematically: Y = WX = (W·s)(s⁻¹·X). Scaling W up by s reduces its quantization error proportionally. The per-channel scale s is fused into the preceding layer (like SmoothQuant's folding trick), so runtime cost is zero.

The scale s is found by grid search over activation statistics:

s = s_X^α,   α* = argmin_α ‖Q(W · diag(s)) · (diag(s)⁻¹ · X) − W X‖

where s_X is the per-channel average activation magnitude. The only free parameter is α ∈ [0, 1], searched over a fixed grid (100 steps here). This is data-efficient: it needs only a tiny calibration set and no gradient computation.

AWQ results (INT4 g128 on LLaMA-2-7B): RTN PPL = 5.73, GPTQ = 5.69, AWQ = 5.60 vs FP16 = 5.47. At INT3 g128 the gains are larger: RTN = 6.66, AWQ = 6.24 (same FP16 baseline of 5.47). AWQ also generalizes to multi-modal models: VILA-7B AWQ INT4 matches FP16 on 12 VQA benchmarks.

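python
# AWQ-style per-channel scaling (simplified)
import torch

def fake_quant_int4(W, group_size=128):
    # Illustrative stand-in for quantize_int4 + dequantize: asymmetric
    # group-wise INT4 round trip. W: [out_dim, in_dim], in_dim % group_size == 0.
    out_dim, in_dim = W.shape
    Wg = W.reshape(out_dim, in_dim // group_size, group_size)
    w_min = Wg.amin(dim=-1, keepdim=True)
    w_max = Wg.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / 15).clamp(min=1e-8)
    q = torch.round((Wg - w_min) / scale).clamp(0, 15)   # INT4 codes in [0, 15]
    return (q * scale + w_min).reshape(out_dim, in_dim)  # back to floating point

def awq_scale_search(W, X_calib, n_steps=100, group_size=128):
    # W: [out_dim, in_dim], X_calib: [calib_tokens, in_dim]
    s_X = X_calib.abs().mean(dim=0).clamp(min=1e-5)  # [in_dim] avg activation magnitude
    Y_ref = X_calib @ W.T                            # full-precision layer output
    best_loss, best_s = float('inf'), None
    for alpha in torch.linspace(0, 1, n_steps):
        s = s_X ** alpha                             # [in_dim]
        W_deq = fake_quant_int4(W * s.unsqueeze(0), group_size)  # scale up, quantize
        X_scaled = X_calib / s.unsqueeze(0)          # inverse scale on the input
        # Measure output error on the calibration set
        loss = (X_scaled @ W_deq.T - Y_ref).norm().item()
        if loss < best_loss:
            best_loss, best_s = loss, s.clone()
    return best_s   # fuse 1/s into preceding LayerNorm scale

def pack_int4(w_int4):
    # Pack two 4-bit values into one uint8 (first value in the low nibble)
    # w_int4: [N] integer tensor of values in [0, 15], N even
    assert w_int4.numel() % 2 == 0
    w = w_int4.reshape(-1, 2)
    return (w[:, 0] | (w[:, 1] << 4)).to(torch.uint8)
    # Unpack at runtime: low = packed & 0x0F; high = (packed >> 4) & 0x0F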

AWQ finds salient weight channels by looking at activation magnitudes, not weight magnitudes. Why is weight magnitude the wrong criterion?

(A) Large weights are always easy to quantize because they use the full INT4 range. (B) Output error = Δw · x. A small weight in a high-activation channel causes large output error. Selecting by weight magnitude misses these channels — only activation magnitude reveals true output sensitivity. (C) Weight magnitude is the right criterion but AWQ uses activation magnitude for implementation simplicity. (D) All weight channels contribute equally to output error; neither criterion is correct.

Chapter 4: GPTQ — Second-Order Layer-Wise Quantization

AWQ finds a per-channel scaling to reduce INT4 error. GPTQ (Frantar et al., 2022) takes a different approach: it quantizes weights one column at a time and, after each step, updates the remaining unquantized weights to compensate for the error just introduced. The update uses second-order (Hessian) information about the layer's output error, applied greedily in a fixed column order.

The foundation is Optimal Brain Compression (OBC), which extends the classic Optimal Brain Surgeon (OBS) Hessian-based update to full-layer quantization. For a linear layer with weight matrix W and input activations X (from a calibration set), we want to minimize output error:

min_Q ‖WX − QX‖_F²

The Hessian of this objective with respect to W is H = 2XX^T. GPTQ quantizes each column of W in sequence: quantize column q, compute the quantization error e_q = w_q − Q(w_q), then update all remaining unquantized columns to absorb this error. The update rule is:

W_{:, j} ← W_{:, j} − (e_q / [H^-1]_qq) · [H^-1]_{q, j}

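This is precisely the OBS weight update formula, applied with the inverse Hessian restricted to the columns that are still unquantized. The key practical innovations in GPTQ are to quantize every row in the same column order, so one H⁻¹ is shared by all rows, and to obtain the needed entries for all 4096 columns from a single Cholesky factorization of H⁻¹ instead of recomputing the inverse after each column. This makes the method fast enough to quantize a 175B model in ~4 GPU hours.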
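A two-weight example makes the compensation step concrete. This is a hand-sized sketch, not GPTQ itself: one output row, two input channels, a made-up calibration set, and a toy grid step of 0.25. The second weight is left in floating point so the effect of the update alone is visible; in GPTQ it would be quantized next.

```python
# One row of weights, two input channels: quantize w[0], compensate with w[1].
def layer_error(w_ref, w_new, X):
    # Sum over calibration samples of ((w_ref - w_new) . x)^2
    return sum(sum((a - b) * xi for a, b, xi in zip(w_ref, w_new, x)) ** 2 for x in X)

X = [(1.0, 0.8), (0.5, 0.6), (-1.2, -0.9), (0.3, 0.1)]   # correlated inputs
w = [0.37, -0.62]
grid = 0.25                                              # toy quantization step

# H = 2 X X^T (2x2) and the first row of its inverse, written out by hand
h00 = 2 * sum(a * a for a, _ in X)
h01 = 2 * sum(a * b for a, b in X)
h11 = 2 * sum(b * b for _, b in X)
det = h00 * h11 - h01 * h01
hinv00, hinv01 = h11 / det, -h01 / det

q0 = round(w[0] / grid) * grid            # quantize column 0
e0 = w[0] - q0                            # e_q = w_q - Q(w_q)
w1_comp = w[1] - (e0 / hinv00) * hinv01   # update rule from the text

err_naive = layer_error(w, [q0, w[1]], X)
err_comp = layer_error(w, [q0, w1_comp], X)
print(f"naive error {err_naive:.5f}, compensated error {err_comp:.5f}")
```

Because the two input channels are strongly correlated, shifting the second weight absorbs most of the error introduced by rounding the first.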

GPTQ vs AWQ tradeoff: GPTQ is slower to quantize (it needs a Hessian from calibration data plus a Cholesky factorization of H⁻¹ per layer) but tends to match AWQ at INT4. AWQ is faster to apply, hardware-friendlier (no special kernel needed beyond dequant-on-the-fly), and generalizes better across model types including VLMs. In practice both reach PPL within 0.1–0.2 of each other on LLaMA-2. AWQ is the preferred choice for edge deployment; GPTQ checkpoints are widely supported in serving tools (AutoGPTQ, Hugging Face Transformers, vLLM).

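python
# GPTQ outer loop (conceptual; the real implementation uses a Cholesky
# factorization of H^-1 and blocked updates instead of the explicit downdate)
import torch

def gptq_quantize_layer(W, X_calib, group_size=128, damp=0.01):
    # W: [out_dim, in_dim], X_calib: [n_calib, in_dim]
    W_q = W.clone().float()
    X = X_calib.float()
    in_dim = W_q.shape[1]
    H = 2 * X.T @ X                                   # [in_dim, in_dim] Hessian
    # Damping for numerical stability: a fraction of the mean diagonal
    H += damp * H.diagonal().mean() * torch.eye(in_dim, device=H.device)
    H_inv = torch.linalg.inv(H)
    for q in range(in_dim):                           # iterate over input dimension
        if q % group_size == 0:
            # New group: per-row INT4 grid from the current (compensated) weights
            blk = W_q[:, q:q + group_size]
            w_min, w_max = blk.amin(dim=1), blk.amax(dim=1)
            scale = ((w_max - w_min) / 15).clamp(min=1e-8)
        w_col = W_q[:, q].clone()                     # [out_dim]
        w_col_quant = torch.round((w_col - w_min) / scale).clamp(0, 15) * scale + w_min
        err = (w_col - w_col_quant) / H_inv[q, q]     # scaled quantization error
        W_q[:, q] = w_col_quant
        # Compensate remaining columns
        W_q[:, q + 1:] -= err.unsqueeze(1) * H_inv[q, q + 1:].unsqueeze(0)
        # Remove column q from the problem: downdate the inverse Hessian
        H_inv -= torch.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    return W_q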

After GPTQ quantizes column q of weight matrix W, it updates the remaining unquantized columns. What is the purpose of this update?

(A) To re-normalize the weight matrix after INT4 rounding. (B) To prevent the Hessian from becoming singular during iteration. (C) To compensate for the output error introduced by quantizing column q — the remaining weights absorb the error, minimizing total layer output error. (D) To scale the weights to the INT4 range before they are quantized.

Chapter 5: The W4A16 Sweet Spot — Why Weight-Only INT4 Wins at Decode

We now have two quantization regimes: W8A8 and W4A16. Which do you use, and when? The answer follows directly from the roofline model.

Recall from L12: during a decode step at batch size B, the arithmetic intensity of a weight matrix multiply is approximately B FLOPs/byte. The A100's compute ridge is 312 TFLOP/s ÷ 2 TB/s = 156 FLOPs/byte. So the decode step is memory-bandwidth bound whenever B < 156.

At B = 1 (single-user chat), AI = 1 FLOP/byte. W8A8 uses INT8 for both weights and activations — the weight bytes per step are halved vs FP16, so decode is 2× faster. W4A16 uses INT4 weights with FP16 activations — weight bytes are quartered, decode is 4× faster than FP16. And the arithmetic stays in FP16 (dequant weights on the fly → FP16 → tensor cores), so accuracy is excellent.

Misconception: W4A16 speeds up compute. It does not. The matrix multiply itself still runs in FP16 on tensor cores. The speedup comes entirely from reading 4× fewer bytes of weights from HBM — at memory-bound batch sizes, that IS the bottleneck. Fewer bytes read = fewer memory stall cycles = faster tokens.
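The roofline argument can be turned into a rough per-step latency model. The sketch below is a simplification under stated assumptions: it counts only weight reads (no KV-cache or activation traffic, no dequantization overhead), assumes about 2 FLOPs per weight per sequence, and uses the A100 figures quoted above as defaults. The function name `decode_step_time_ms` is illustrative.

```python
def decode_step_time_ms(n_params, batch, weight_bits,
                        bandwidth_bytes_per_s=2e12, peak_flops=312e12):
    # Memory term: every weight is read once per decode step, whatever the batch size.
    t_mem = n_params * weight_bits / 8 / bandwidth_bytes_per_s
    # Compute term: ~2 FLOPs (multiply + add) per weight per sequence in the batch.
    t_compute = 2 * n_params * batch / peak_flops
    bound = "memory" if t_mem >= t_compute else "compute"
    return max(t_mem, t_compute) * 1e3, bound

for bits in (16, 4):
    for batch in (1, 32, 512):
        ms, bound = decode_step_time_ms(7e9, batch, bits)
        print(f"W{bits}A16, B={batch:3d}: {ms:6.2f} ms/step ({bound}-bound)")
```

In this model the 4-bit weights cut the batch-1 step time by 4×, and the compute term, which does not depend on the weight bitwidth, takes over as the batch grows.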

At large batch sizes (B ≫ 156), you're compute-bound. Now W8A8 wins: it uses INT8 tensor cores (2× throughput vs FP16) and is quantized for both weights and activations, enabling hardware INT8 matmul. W4A16 dequant-on-the-fly has overhead at high batch. So the rule is:

Setting	Batch size	Bottleneck	Best scheme
Edge / single user	1–8	Memory bandwidth	W4A16 (AWQ/GPTQ)
Cloud mid-traffic	16–64	Transitional	W4A16 or W8A8
Cloud high-traffic	128+	Compute	W8A8 (SmoothQuant)

Worked numbers — LLaMA-2-7B decode latency at bs=1, RTX 4090 (1 TB/s bandwidth):

FP16: 14 GB weights → 14 ms/token theoretical (14 GB / 1 TB/s). Reported TinyChat measurement: ~310 ms/token, far above the bandwidth bound because of kernel and framework overhead.
INT8: 7 GB → 7 ms/token theoretical.
INT4 (AWQ): 3.5 GB → 3.5 ms/token theoretical. Reported TinyChat AWQ measurement: ~110 ms/token, still overhead-dominated but 2.8× faster than FP16.
Orin (80 GB/s bandwidth): FP16 runs out of memory; INT4 AWQ reaches 52 tokens/sec.

Decode Latency vs Weight Bitwidth (Memory-Bound Model)

Drag bitwidth slider. Latency scales with bytes read (proportional to bitwidth). Compute stays FP16 (flat). At small batch, memory dominates — INT4 wins.


W4A16 quantization keeps activations in FP16 and quantizes only weights to INT4. At batch size 1, why does W4A16 give ~4× faster decode than FP16 even though computation is still FP16?

(A) INT4 tensor cores run 4× faster than FP16 tensor cores. (B) Quantization reduces the number of matrix multiplications needed. (C) Decode is memory-bandwidth bound (AI ≈ 1 FLOP/byte ≪ ridge). Latency ∝ bytes read. INT4 weights are 4× fewer bytes than FP16, so the memory stall is 4× shorter — direct 4× speedup. (D) W4A16 enables activation quantization which reduces the KV cache size by 4×.

Chapter 6: PagedAttention & vLLM — KV Cache as Virtual Memory

You've quantized the model weights. Now serve 100 concurrent users. Each user has a KV cache growing with their conversation. For a LLaMA-2-7B (L=32, H=32, d_h=128, FP16): one user at N=2048 tokens needs 32 × 2 × 2048 × 32 × 128 × 2 bytes = 1.07 GB. For 100 users, that's 107 GB — impossible on a single 80 GB A100.
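The KV-cache formula generalizes to other models, batch sizes and KV bitwidths. A minimal sketch follows; the function name `kv_cache_bytes` and its parameters are illustrative, and it counts keys and values only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, batch=1, bits=16):
    # 2 = one key vector + one value vector per token, per layer, per KV head
    return n_layers * 2 * n_tokens * n_kv_heads * head_dim * batch * bits // 8

one_user = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, n_tokens=2048)
print(f"LLaMA-2-7B, 2048 tokens, 1 user:    {one_user / 1e9:.2f} GB")
print(f"LLaMA-2-7B, 2048 tokens, 100 users: {100 * one_user / 1e9:.0f} GB")
```

Passing a smaller `n_kv_heads` (grouped-query attention) or a smaller `bits` value shows directly how much each choice shrinks the cache.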

The deeper problem is fragmentation. Traditional LLM inference systems pre-allocate a contiguous KV cache slot for each request, sized to the maximum possible output length. If you allow up to 2048 tokens but most answers are 100 tokens, you're wasting 95% of the allocated space. vLLM (Kwon et al., 2023) identified three waste categories:

Internal fragmentation: pre-allocated for max output length, most unused
Reservation waste: slot reserved but tokens not yet generated
External fragmentation: gaps between sequences of different lengths

PagedAttention solves this by treating the KV cache like OS virtual memory. The KV cache is divided into fixed-size blocks (e.g., 16 tokens × H × d_h × 2 (K and V) × 2 bytes (FP16) = 16 × 32 × 128 × 2 × 2 = 262 KB per block per layer for LLaMA-2-7B). A block table maps logical block positions to physical blocks in GPU memory. Blocks are allocated on demand, one at a time as tokens are generated, and freed immediately when a request finishes.
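The block-table bookkeeping described above can be sketched in a few lines. This is a toy model, not vLLM's implementation: the class name `PagedKVAllocator` and its methods are hypothetical, physical blocks are just integer ids, and no GPU memory is involved.

```python
class PagedKVAllocator:
    """Toy block-table bookkeeping; no real GPU memory is touched."""

    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n == len(table) * self.block_size:   # last block is full (or no block yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[req_id] = n + 1

    def free(self, req_id):
        # Return every block of a finished request to the free pool.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)

alloc = PagedKVAllocator(num_physical_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-A")      # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("req-B")      # 5 tokens need 1 block
print(alloc.block_tables)
alloc.free("req-A")
print(len(alloc.free_blocks), "blocks free after req-A finishes")
```

A request never holds more than one partially filled block, and its blocks need not be adjacent in physical memory.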

Key benefit: near-zero fragmentation. With 16-token blocks, the maximum internal fragmentation per sequence is 15 tokens (one partial block at the end). A 100-token sequence occupies 7 blocks (112 slots), so 12 slots, about 11% of its allocation, sit unused, versus 1948/2048 = 95% with a contiguous 2048-token reservation. The vLLM paper reports 2–4× higher throughput than prior serving systems such as FasterTransformer and Orca at the same latency.
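The comparison can be computed for any output length. In this sketch, waste is defined as the fraction of allocated token slots that end up unused; the function names and default sizes are illustrative.

```python
import math

def contiguous_waste(actual_tokens, max_tokens=2048):
    # One contiguous slot sized for the maximum possible length.
    return (max_tokens - actual_tokens) / max_tokens

def paged_waste(actual_tokens, block_size=16):
    # Only whole blocks are allocated; at most one of them is partially filled.
    allocated = math.ceil(actual_tokens / block_size) * block_size
    return (allocated - actual_tokens) / allocated

for n in (100, 128, 500):
    print(f"{n:4d} tokens: contiguous {contiguous_waste(n):.1%} wasted, "
          f"paged {paged_waste(n):.1%} wasted")
```

Smaller blocks reduce the partial-block waste further, at the cost of a longer block table per request.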

PagedAttention also enables efficient prompt sharing: if 10 users send requests with the same system prompt (common in production — "You are a helpful assistant..."), the KV blocks for the shared prefix can be referenced by all 10 requests via copy-on-write. The system prompt only needs to be computed once.

Continuous (in-flight) batching pairs naturally with PagedAttention. Traditional static batching: wait for a full batch of N requests, run to completion, release. Problem: the longest request blocks all shorter ones. Continuous batching processes requests token by token. When one sequence finishes (hits end-of-sequence token), its slot is immediately freed and a new request can be inserted into the batch at the next decode step. GPU utilization stays high and tail latency is bounded.
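A small simulation shows the difference in decode steps. It is a sketch under simplifying assumptions: every active sequence produces exactly one token per step, prefill is ignored, all output lengths are at least 1, and the function names are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    queue = deque(lengths)
    active = []          # remaining tokens for each running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:   # fill freed slots
            active.append(queue.popleft())
        steps += 1                                   # one decode step for all active
        active = [r - 1 for r in active if r > 1]    # drop sequences that just finished
    return steps

lengths = [50, 500, 50, 50, 50, 50, 50, 50]   # output lengths in tokens
print("static:    ", static_batching_steps(lengths, batch_size=2), "decode steps")
print("continuous:", continuous_batching_steps(lengths, batch_size=2), "decode steps")
```

With static batching the 500-token request stalls its batch partner for 450 steps. With continuous batching the short requests cycle through the other slot while the long one runs.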

PagedAttention: Fragmentation vs Block Size

Drag to set actual output length and block size. See how contiguous allocation wastes memory vs paged allocation with minimal waste.


A serving system pre-allocates 2048 tokens of KV cache per request. Most requests use 80–150 tokens. What problem does PagedAttention primarily solve, and how?

(A) It reduces the per-token KV cache size by quantizing keys and values to INT4. (B) Internal fragmentation — pre-allocating 2048 tokens for 100-token responses wastes ~95% of KV memory per slot. PagedAttention allocates fixed-size blocks on demand; only the blocks actually used (plus at most one partial block) are occupied. (C) It parallelizes the attention computation across multiple GPU cores. (D) It removes the need for the KV cache by recomputing keys and values each step.

Chapter 7: Speculative Decoding — Draft Fast, Verify in Parallel

Even with W4A16 and continuous batching, a 70B model at batch size 1 is slow. The bottleneck is sequential: every token requires one full pass through 80 transformer layers. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) exploits the fact that decode is memory-bound: once the target model's weights have been read, scoring several tokens in the same forward pass costs almost the same as scoring one.

The protocol uses two models: a small, fast draft model (e.g., 7B) and the large target model (e.g., 70B). The draft model generates K tokens autoregressively (cheap because the model is small). The context plus those K draft tokens are then fed to the target model in one forward pass, which the causal attention mask lets it process in parallel. The pass yields K+1 next-token distributions: one at each of the K draft positions, plus one for the position after the last draft token.

For each of the K draft tokens, the target model either accepts or rejects based on acceptance sampling:

accept token k if u < p_target(x_k) / p_draft(x_k)

where u ~ Uniform[0,1]. If accepted, move to k+1. If rejected, sample a correction token from the normalized residual (p_target − p_draft)⁺ (the positive part of the difference, rescaled to sum to 1) and stop. This is provably equivalent to sampling from p_target: the output distribution is identical to running the target model alone.
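The equivalence can be checked exactly for a single position on a small vocabulary, without any sampling. The sketch below computes the probability that each token is emitted, either as an accepted draft or as a correction after a rejection; the function name and the two example distributions are illustrative.

```python
def speculative_output_distribution(p_target, p_draft):
    # P(emit x) = P(x drafted and accepted) + P(a draft is rejected) * residual(x)
    accept = [q * min(1.0, p / q) if q > 0 else 0.0
              for p, q in zip(p_target, p_draft)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(p - q, 0.0) for p, q in zip(p_target, p_draft)]
    z = sum(residual)
    if z == 0:                      # draft equals target: nothing is ever rejected
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]

p_target = [0.5, 0.3, 0.2]
p_draft = [0.2, 0.6, 0.2]
out = speculative_output_distribution(p_target, p_draft)
print(out)
```

The result matches `p_target` up to floating-point rounding, even though the draft distribution is quite different. A poor draft lowers the acceptance rate, not the output quality.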

Deriving expected speedup: Let p be the acceptance rate (probability each draft token is accepted). The number of accepted tokens before the first rejection follows a geometric distribution. The expected number of tokens generated per target-model step:

E[tokens per step] = ∑_{k=0}^{K−1} (k+1) · p^k (1−p) + (K+1) · p^K = (1 − p^(K+1)) / (1 − p)

Every target-model step emits one token beyond the accepted drafts, taken from the distributions it has already computed: the correction token after a rejection, or a bonus token when all K drafts are accepted. So tokens per target-model step = E[accepted drafts] + 1 = (1 − p^(K+1)) / (1 − p). With p = 0.8 and K = 5: (1 − 0.8⁶) / (1 − 0.8) = (1 − 0.262) / 0.2 = 3.69 tokens per step. The wall-clock speedup = 3.69 × (cost_target / (cost_draft × K + cost_target)), roughly 2–3× in practice.
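Both formulas are simple to evaluate for other settings. In the sketch below the function names are illustrative, and the draft-to-target cost ratio of 0.1 is an assumed value chosen for the example, not a measurement.

```python
def expected_tokens_per_step(p, K):
    # Closed form of 1 + p + p^2 + ... + p^K
    return (1 - p ** (K + 1)) / (1 - p) if p < 1 else K + 1

def wall_clock_speedup(p, K, draft_cost_ratio):
    # draft_cost_ratio = cost_draft / cost_target for one forward pass
    return expected_tokens_per_step(p, K) / (K * draft_cost_ratio + 1)

for p in (0.6, 0.7, 0.8, 0.9):
    e = expected_tokens_per_step(p, K=5)
    s = wall_clock_speedup(p, K=5, draft_cost_ratio=0.1)
    print(f"p={p}: {e:.2f} tokens/step, speedup {s:.2f}x")
```

Raising K only pays off when p is high: each extra draft token costs a full draft-model pass but contributes only p^K expected tokens.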

Speculative decoding preserves distribution exactly. This is not an approximation. The acceptance/rejection sampling guarantees that the output tokens are distributed exactly as if the target model generated them autoregressively — no quality loss. The draft model only helps if it agrees with the target model often (high p). Using the same model family (e.g., LLaMA-2-7B as draft for LLaMA-2-70B) achieves p ≈ 0.7–0.85, giving 2–3× speedup.

python
import torch

def speculative_decode_step(draft_model, target_model, context, K=5):
    # Assumed interfaces (illustrative):
    #   draft_model(tokens)  -> [vocab] next-token distribution after `tokens`
    #   target_model(tokens) -> [len(tokens), vocab]; row i is the distribution
    #                           of the token that follows tokens[:i+1]
    # context: 1-D LongTensor of token ids

    # Step 1: draft K tokens with small model
    draft_tokens, draft_dists = [], []
    ctx = context.clone()
    for _ in range(K):
        p_draft = draft_model(ctx)             # [vocab] distribution
        t = torch.multinomial(p_draft, 1)      # shape [1]
        draft_tokens.append(t); draft_dists.append(p_draft)
        ctx = torch.cat([ctx, t])

    # Step 2: verify with target model (ONE parallel forward pass)
    p_targets = target_model(ctx)[-(K + 1):]   # [K+1, vocab] in one pass

    # Step 3: accept/reject each draft token
    accepted = []
    for k in range(K):
        t = draft_tokens[k]
        p_tgt = p_targets[k, t]
        p_dft = draft_dists[k][t]
        u = torch.rand(1)
        if u < p_tgt / p_dft:      # accept
            accepted.append(t)
        else:                      # reject: sample correction from the residual
            p_corr = (p_targets[k] - draft_dists[k]).clamp(min=0)
            t_corr = torch.multinomial(p_corr / p_corr.sum(), 1)
            accepted.append(t_corr)
            break
    else:
        # No rejection: all K accepted, so also sample a bonus token from the
        # target's distribution for position K+1 (sampling, not argmax, keeps
        # the output distribution exact)
        accepted.append(torch.multinomial(p_targets[K], 1))
    return accepted  # 1..K+1 tokens generated in ONE target-model step

Speculative decoding with K=4 draft tokens and acceptance rate p=0.7. Compute the expected number of tokens generated per target-model step using E = (1 − p^(K+1)) / (1 − p).

(A) E = 4 × 0.7 = 2.8 tokens per step. (B) E = 1 / (1 − 0.7) = 3.33 tokens per step (geometric series without cap). (C) E = (1 − 0.7⁵) / (1 − 0.7) = (1 − 0.168) / 0.3 = 0.832/0.3 = 2.77 tokens per step. (D) E = K = 4, since the target model verifies K tokens in one pass.

Chapter 8: Showcase — Speculative Decoding Explorer

This is the deployment simulator. Set K (draft tokens), acceptance rate p, and model size. Watch the draft-verify cycle animate and track expected throughput gain vs naive autoregressive decoding. Adjust KV-cache quantization to see how INT4 KV further extends context capacity.

Speculative Decoding: Expected Speedup vs Acceptance Rate

Drag K (draft tokens) and acceptance rate p. The left panel shows expected tokens/step and speedup. The right panel shows the acceptance-chain probability tree.


KV cache quantization (INT8/INT4): Quantizing keys and values to INT8 halves KV memory; INT4 quarters it. QServe (MIT Han Lab) implements W4A8KV4 (4-bit weights, 8-bit activations, 4-bit KV) and reports up to 2.4× higher throughput than TensorRT-LLM on A100 and 3.5× on L40S for Qwen1.5-72B. The trick: SmoothAttention migrates Key outlier difficulty to the Query (the same idea as SmoothQuant, applied to attention), making INT4 KV quantization accurate.

Worked KV-cache numbers — LLaMA-2-70B (L=80, H_kv=8 GQA, d_h=128):

FP16, N=4096, bs=1: 80 × 2 × 4096 × 8 × 128 × 2 bytes = 1.34 GB
INT8 KV, same: 0.67 GB
INT4 KV, same: 0.34 GB
FP16, N=4096, bs=16: 21.5 GB — close to A100 limit
INT4 KV, N=4096, bs=16: 5.4 GB — room for 4× larger batch

Continuous batching maximizes GPU utilization. Rather than a static batch of B requests all processed together, continuous batching maintains a set of active sequences and replaces finished ones after each decode step. The scheduler tracks available KV-page memory and admits new requests as pages free up. This eliminates the "longest sequence holds everyone hostage" problem — typical implementation doubles throughput vs static batching at the same latency budget.

A serving system uses static batching with a batch size of 8. Some requests finish after 50 tokens; others take 500 tokens. What is the main inefficiency, and what does continuous batching fix?

(A) Static batching causes memory fragmentation in the KV cache. (B) Static batching wastes compute on padding tokens when sequences have different lengths. (C) Short requests finish but GPU slots sit idle waiting for the 500-token request to complete before accepting new work. Continuous batching replaces finished sequences mid-flight, keeping all GPU slots productive at every token step. (D) Static batching requires all requests to use the same KV-cache size.

Chapter 9: Connections & Deployment Cheat Sheet

You now have the full deployment stack. Here is how the pieces fit together and when to reach for each tool.

Technique	When to use	Key tradeoff
W8A8 (SmoothQuant)	Cloud, batch ≥ 64, throughput-critical	High throughput; needs outlier migration; SmoothQuant offline
W4A16 (AWQ)	Edge, laptop GPU, batch ≤ 16	4× memory; dequant-on-fly overhead at high batch
W4A16 (GPTQ)	Any low-batch; AutoGPTQ ecosystem	Slower to quantize; compatible with many runtimes
W4A8KV4 (QServe)	Cloud A100/L40S, high throughput + low memory	Best TOPS efficiency; needs custom kernel
INT4 KV cache	Long context (32k+) or large batch	4× KV memory; SmoothAttention needed for accuracy
PagedAttention (vLLM)	Any production serving system	Near-zero fragmentation; 2–4× throughput vs prior serving systems
Continuous batching	Variable-length outputs	GPU always busy; pairs with PagedAttention
Speculative decoding	Low-batch latency-critical; need matching draft model	2–3× faster; distribution-preserving; draft model overhead

The deployment stack is composable. A production LLM server typically combines: AWQ/GPTQ (W4A16 weights) + INT8 KV cache + PagedAttention + continuous batching + speculative decoding. The techniques are largely orthogonal, so their gains compound, though not strictly multiplicatively: speculative decoding, for example, helps most at small batch, where decode is memory-bound. vLLM supports all of these, and TinyChat adds AWQ edge deployment.

Metrics for LLM Serving

Three numbers govern every serving system decision:

TTFT (Time To First Token): driven by prefill latency. Critical for interactive UX. Dominated by prompt processing speed.
TPOT (Time Per Output Token): the decode speed. At 100 ms/token = 10 tokens/sec ≈ 450 words/min — fast enough to feel instant. Dominated by memory bandwidth and weight quantization.
Throughput: total tokens/sec across all requests. Dominated by batch size, continuous batching, and KV memory efficiency.

Related Lessons

Prerequisites in this series:

TinyML L6: Quantization II — PTQ, QAT, activation outliers, STE, GPTQ/AWQ intro
TinyML L12: Efficient Transformers — KV cache mechanics, memory-bandwidth-bound decode, MQA/GQA

Related Gleams:

CS224N L13: Reasoning II — speculative decoding + RoPE context extension
Attention Variants — MHA, MQA, GQA, FlashAttention
TinyML L5: Quantization I — INT8 from scratch, K-means, affine quant

What comes next — LLM Post-Training (L14): This lesson covered deployment-time compression. L14 covers post-training modifications to the model itself: RLHF fine-tuning, DPO, reward modeling — changing what the model says, not how fast it says it. The two are complementary: you post-train a high-quality model, then compress it with AWQ for deployment.

"The bottleneck is reading the model from memory, not running it. Make the model smaller, and everything downstream gets faster for free." — MIT 6.5940 course notes
