TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 6

Quantization II: PTQ, QAT & LLM Quantization

You've quantized the weights. Now activations show up at inference and blow your INT8 range because one channel hits 70 while the rest hover around 1. This lesson shows how to calibrate activation ranges, train through fake-quant with the Straight-Through Estimator, and tame the outlier problem that makes naïve INT8 destroy LLM accuracy.

Prerequisites: TinyML L5 (Quantization I) — affine quantization r = S(q − Z), scale/zero-point, K-means and linear quant. Basic PyTorch autograd helpful for the STE derivation.
10 chapters · 5 live canvases · derived from first principles

Chapter 0: The Outlier Problem

Lecture 5 gave you the tools to quantize weights offline. You compute the scale S and zero-point Z from the weight tensor, quantize once, and you're done. Weights are static — they don't change at inference time.

Activations are different. They're the intermediate values flowing through the network — the output of every ReLU, the input to every Linear layer. They depend entirely on the input you feed the model. A picture of a cat produces different activations than a picture of a sky. So you can't precompute their range offline the same way you did for weights.

Worse: at scale, activations develop systematic outliers. In large language models beyond roughly 6.7B parameters, Dettmers et al. (LLM.int8(), 2022) found that specific hidden channels consistently spike to values 10–100× larger than their neighbors, on every inference and regardless of input. One channel in a 4096-dimensional hidden state might hit ±70 while the other 4095 channels sit between ±1.

The outlier taxation problem: Your INT8 range has 256 discrete levels. If a single channel forces max(|x|) = 70, your scale becomes S = 70/127 ≈ 0.55. An activation at 0.1 then rounds to the nearest multiple of 0.55, which is 0, so the value is lost entirely. In general the rounding error can reach S/2 ≈ 0.28, larger than most of the values in the tensor. The outlier taxes every other value in the tensor.
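The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch assuming symmetric INT8 with the zero-point fixed at 0; the helper name `fake_quant_int8` and the sample values are illustrative and not taken from the lecture.

```python
import math

def fake_quant_int8(values):
    """Symmetric per-tensor INT8: quantize, then dequantize back to float."""
    scale = max(abs(v) for v in values) / 127
    codes = [max(-128, min(127, math.floor(v / scale + 0.5))) for v in values]
    return scale, [q * scale for q in codes]

normal = [0.1, -0.4, 0.9, -1.0]  # made-up "well-behaved" channels

# Without the outlier the step size is small and 0.1 survives almost intact.
scale_clean, deq_clean = fake_quant_int8(normal)

# With one channel at 70 the step size is about 0.55 and 0.1 collapses to 0.
scale_taxed, deq_taxed = fake_quant_int8(normal + [70.0])

print(f"clean: scale={scale_clean:.4f}, 0.1 -> {deq_clean[0]:.4f}")
print(f"taxed: scale={scale_taxed:.4f}, 0.1 -> {deq_taxed[0]:.4f}")
```

The only difference between the two calls is the single extra value, which is the whole point: one channel sets the step size for everyone.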

This lecture answers: how do you choose the quantization range for activations (calibration), how do you train the model to be quantization-friendly (QAT), and how do you handle LLMs where outliers are structural, not incidental?

Ch 0–1: The Problem + Calibration
Activation range varies at inference; min/max, percentile, and KL-divergence methods to find the best clipping threshold
Ch 2–3: Granularity + STE
Per-tensor vs per-channel vs per-group; the Straight-Through Estimator — why you can backprop through round()
Ch 4–5: QAT + Decision
Fake-quant nodes in the forward pass; QAT vs PTQ accuracy–bitwidth tradeoff; when to pay the training cost
Ch 6–7: LLM Methods
SmoothQuant migrates quantization difficulty from activations to weights; GPTQ second-order weight quant; AWQ protects salient channels
Ch 8–9: Showcase + Cheat Sheet
Full SmoothQuant explorer with α slider; decision table; W8A8 vs W4A16 guide
Outlier Taxation — see how one rogue channel inflates quantization error for all others

A 16-channel activation vector: 15 channels drawn from N(0,1), one outlier channel. Toggle the outlier and watch the quantization scale blow up — and how the error on the normal channels explodes.

L5 covered quantizing weights offline. Why can't you quantize activations the same way — compute S and Z once at model-creation time?

Chapter 1: Calibration Methods

If you can't precompute activation ranges analytically, you compute them empirically. Calibration means running a small representative dataset through your trained FP32 model, recording the activation distributions at every layer, and deriving the best S and Z from those statistics before deployment. No retraining required.

Method 1: Min/Max (naive)

The simplest approach: collect min and max across all calibration samples, and use those as rmin and rmax. This guarantees no clipping — every observed value fits in the INT8 range. The problem is that a single outlier in your calibration set can force S to accommodate it, wasting resolution on everyone else.

Misconception: "Min/max calibration is safe because nothing clips." The absence of clipping is not the same as good quantization. If one activation in your dataset is 70 and the rest are between −1 and 1, min/max gives you a scale of 70/127 ≈ 0.55. Every value with magnitude below 0.55 maps to integer code 0 or ±1. You've thrown away nearly all resolution for 99% of your values to avoid clipping 1% of them.

Method 2: Percentile Clipping

Instead of using the absolute min and max, use the 99.9th or 99.99th percentile. You're deliberately clipping a tiny fraction of extreme values, accepting a small clipping error in exchange for much better resolution across the bulk of the distribution.

Worked example. Activation vector: [0.1, 0.3, −0.5, 0.8, 0.2, −0.9, 0.4, 0.6, −0.2, 52.0]

S_percentile = percentile(|x|, p) / q_max
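The worked example can be carried through numerically. The sketch below assumes the nearest-rank definition of a percentile and a symmetric INT8 grid (q_max = 127); the function name `percentile_scale` is hypothetical. With only ten values the 99.9th percentile is the outlier itself, so p = 90 is used here purely to illustrate the effect of clipping.

```python
import math

def percentile_scale(values, p, q_max=127):
    """Scale from the p-th percentile of |x| (nearest-rank definition)."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(p * len(mags) / 100))  # 1-based rank
    return mags[rank - 1] / q_max

x = [0.1, 0.3, -0.5, 0.8, 0.2, -0.9, 0.4, 0.6, -0.2, 52.0]

s_minmax = percentile_scale(x, 100)  # p = 100 reproduces min/max: 52 / 127
s_p90 = percentile_scale(x, 90)      # threshold 0.9, so only 52.0 is clipped

print(f"min/max scale: {s_minmax:.4f}")
print(f"p=90 scale:    {s_p90:.5f}")
```

The percentile scale is smaller by a factor of 52/0.9, so the nine ordinary values get that much finer resolution, at the cost of saturating 52.0 at the threshold.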

Method 3: KL-Divergence (Entropy) Calibration

Developed by NVIDIA for TensorRT. Instead of minimizing MSE or avoiding clipping, you minimize the information loss from quantization. The idea: the FP32 activation distribution is your "truth." The INT8 representation is an approximation. KL divergence measures how much information the approximation loses.

KL(P || Q) = ∑_i P(x_i) · log( P(x_i) / Q(x_i) )

You sweep a threshold T from a small value up to the absolute max. For each T, you saturate all values above T, quantize to INT8, dequantize, and measure KL(FP32 hist || INT8 hist). The T that minimizes KL divergence is your winner. This typically clips a few percent of the distribution but dramatically reduces entropy loss.

Method 4: MSE Minimization (OCTAV/Newton-Raphson)

Directly minimize E[(x − Q(x))^2] over the clipping threshold. Under a Laplace(0, b) distribution, the optimal threshold for N-bit quantization can be solved numerically: |r|_max = 2.83b (2-bit), 3.89b (3-bit), 5.03b (4-bit). The Laplace parameter b is estimated from your calibration set.
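A brute-force version of the same idea is easy to sketch: sweep candidate thresholds and keep the one with the lowest mean squared error. This is not OCTAV or a Newton-Raphson solver, just a grid search under stated assumptions: deterministic Laplace(0, 1) samples from the inverse CDF, a symmetric 4-bit grid, and hypothetical helper names. Because the grid convention differs from the one behind the figures quoted above, the minimizer need not match them exactly.

```python
import math

def laplace_samples(n, b=1.0):
    """Deterministic Laplace(0, b) samples via the inverse CDF on an even grid."""
    out = []
    for i in range(n):
        p = (i + 0.5) / n
        out.append(b * math.log(2 * p) if p < 0.5 else -b * math.log(2 * (1 - p)))
    return out

def clip_quant_mse(xs, threshold, bits):
    """Mean squared error after clipping to [-T, T] and rounding to a symmetric grid."""
    q_max = 2 ** (bits - 1) - 1
    step = threshold / q_max
    total = 0.0
    for x in xs:
        q = max(-q_max, min(q_max, math.floor(x / step + 0.5)))
        total += (x - q * step) ** 2
    return total / len(xs)

xs = laplace_samples(2000, b=1.0)
abs_max = max(abs(x) for x in xs)

# Candidate thresholds from 0.5 up to just past the largest sample.
candidates = [0.5 + 0.1 * k for k in range(int((abs_max - 0.5) / 0.1) + 2)]
best_t = min(candidates, key=lambda t: clip_quant_mse(xs, t, bits=4))

print(f"abs max = {abs_max:.2f}, best threshold = {best_t:.2f}")
print(f"MSE at abs max = {clip_quant_mse(xs, abs_max, 4):.4f}")
print(f"MSE at best    = {clip_quant_mse(xs, best_t, 4):.4f}")
```

The sweep makes the trade-off visible: a small threshold clips too much of the tail, a large one wastes grid points on values that almost never occur.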

Calibration Explorer — compare min/max vs percentile vs KL on a vector with an outlier

A 64-value activation distribution (mostly Gaussian + one outlier at +40). Toggle calibration method and see the threshold, scale, and total quantization MSE.

An activation tensor has values almost all in [−1, 1] but one outlier at +52. After INT8 min/max calibration, values near 0.05 will be quantized to approximately:

Chapter 2: Granularity Revisited

The outlier problem has a structural dimension: outliers are not random. In LLMs, they cluster in specific channels — the same input channel index has an outlier every single token, while its neighbors are well-behaved. This motivates quantization granularity: how much of the tensor shares a single (S, Z) pair?

Per-Tensor Quantization

One S and one Z for the entire weight matrix. Works well for large models with well-behaved distributions. Fails badly when different output channels have very different dynamic ranges — a common occurrence in early layers of MobileNet-style architectures.

Per-Channel Quantization

One S and Z per output channel. If the weight matrix is W ∈ ℝ^(C_out × C_in), you have C_out scale factors. Each channel's scale is set by its own range, so a channel with large weights doesn't tax channels with small weights.

Why per-channel is standard for weights but not activations. Weight channels are fixed at export time — the Cout scales are constants stored alongside the model. Activation channels are different at every token. You'd need to compute per-channel scales at runtime, which costs extra ops and disrupts the INT8 GEMM pipeline. So weights get per-channel, activations get per-tensor (or per-token in some schemes).
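To make the per-tensor versus per-channel gap concrete, here is a small sketch on a made-up 2 × 4 weight matrix whose two output channels differ in range by a factor of about 200. It assumes symmetric INT8 with zero-point 0; `quant_error` is an illustrative helper.

```python
import math

def quant_error(row, scale, q_max=127):
    """Sum of squared errors after symmetric quantize/dequantize with a given scale."""
    err = 0.0
    for w in row:
        q = max(-q_max - 1, min(q_max, math.floor(w / scale + 0.5)))
        err += (w - q * scale) ** 2
    return err

# Two output channels with very different ranges (made-up values).
W = [
    [0.011, -0.024, 0.017, -0.008],  # small-range channel
    [3.1, -4.8, 2.2, 0.9],           # large-range channel
]

# Per-tensor: one scale from the global max.
s_tensor = max(abs(w) for row in W for w in row) / 127
err_tensor_small = quant_error(W[0], s_tensor)
err_tensor = sum(quant_error(row, s_tensor) for row in W)

# Per-channel: each output row gets its own scale.
s_rows = [max(abs(w) for w in row) / 127 for row in W]
err_channel_small = quant_error(W[0], s_rows[0])
err_channel = sum(quant_error(row, s) for row, s in zip(W, s_rows))

print(f"small channel error: per-tensor {err_tensor_small:.2e}, per-channel {err_channel_small:.2e}")
print(f"total error:         per-tensor {err_tensor:.2e}, per-channel {err_channel:.2e}")
```

The large channel is quantized identically in both cases; all of the gain comes from the small channel no longer sharing its neighbor's step size.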

Group Quantization

Group quantization is a middle ground. Instead of one scale per channel, you split each channel into groups of g elements and give each group its own scale. For INT4 weights with group size g = 128 and a 16-bit scale per group, the effective bitwidth is 4 + 16/128 = 4.125 bits (the scale overhead amortized over 128 values). Group quantization is a standard ingredient of GPTQ, AWQ, and many other modern LLM quantization schemes.
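The overhead arithmetic and the per-group scales can be written down directly. A minimal sketch, assuming symmetric INT4 (q_max = 7) and contiguous groups along the input dimension; both function names are hypothetical.

```python
def effective_bits(weight_bits, scale_bits, group_size):
    """Bits per weight once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size

def group_scales(row, group_size, q_max=7):
    """One symmetric scale per contiguous group of `group_size` weights."""
    return [
        max(abs(w) for w in row[i:i + group_size]) / q_max
        for i in range(0, len(row), group_size)
    ]

bits_g128 = effective_bits(4, 16, 128)  # the case from the text
bits_g32 = effective_bits(4, 16, 32)    # smaller groups cost more overhead

# Toy row with group size 2: each pair gets a scale fitted to its own range.
scales = group_scales([0.7, -0.1, 0.02, 0.014], group_size=2)

print(bits_g128, bits_g32, scales)
```

Shrinking the group improves the fit of each scale but raises the effective bitwidth, which is the knob being traded off.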

| Granularity | Scales per layer | Overhead | Accuracy |
|---|---|---|---|
| Per-Tensor | 1 | Minimal | Lowest |
| Per-Channel | C_out | Low | Good (weights) |
| Group (g=128) | C_out × C_in / 128 | +0.125 bits | Excellent |
| Per-Token (activations) | seq_len | Runtime compute | Better than per-tensor |
The accuracy gap between per-tensor and per-channel for MobileNetV1: Per-tensor INT8 PTQ gave −11.8% accuracy (from 70.9% FP32 to 59.1%). Switching to per-channel quantization for weights brought that to essentially 0% loss. The entire accuracy collapse was caused by sharing one scale across all output channels with very different ranges.
You have a weight matrix W ∈ ℝ^(512 × 512). If you use per-channel quantization for the output dimension, how many scale factors do you store?

Chapter 3: The Straight-Through Estimator

Post-training quantization is fast but has limits. For aggressive quantization (4-bit and below), or for small models with limited representational capacity, PTQ often can't recover the accuracy lost by discretizing weights and activations. The solution: Quantization-Aware Training (QAT) — simulate quantization during training so the model learns to work within its discrete constraints.

But there's an immediate problem. Quantization uses the round() function: Q(w) = S · round(w/S). round() is piecewise constant, so its gradient is zero almost everywhere and undefined at the jump points halfway between grid values. Backpropagation through a wall of zeros learns nothing.

∂Q(w) / ∂w = 0   (almost everywhere)
g_w = ∂L / ∂w = (∂L / ∂Q(w)) · (∂Q(w) / ∂w) = 0

The gradient is zero, so the weight update is zero, so training stops. The model freezes in place.

The STE Solution: Pretend Round is Identity

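Bengio et al. (2013) proposed the Straight-Through Estimator (STE): in the backward pass, replace ∂Q(w)/∂w with 1 (the identity). You pass the gradient straight through the quantizer as if it weren't there. Outside the quantizer's clipping range the gradient is set to 0, because a clipped value does not respond to small changes in w.

STE:  ∂Q(w) / ∂w ≈ 1   when w lies inside the clipping range (for example |w| ≤ 1), else 0
g_w = ∂L / ∂w ← ∂L / ∂Q(w)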

Is this mathematically correct? No. The true gradient is zero; the STE gradient is 1. It's a biased estimator — it introduces approximation error. But it works in practice. The intuition: the STE says "the gradient I would get if the network were continuous is a good enough proxy for updating my real weights."

Why the STE is acceptable: You're not trying to compute the exact gradient of a discrete function (that gradient doesn't exist in a useful sense). You're trying to find a signal that, when accumulated with stochastic gradient descent, moves the weights toward a lower loss. The STE provides that signal. The bias introduced is small relative to the optimization noise — SGD is already an approximation.
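A one-weight toy problem shows the difference between the true gradient and the STE. The sketch below is an illustration under assumptions, not the lecture's code: a grid step of 0.25, a squared-error loss L = (Q(w) − target)^2, and a target that sits exactly on the grid so the loop settles. The names `fake_quant` and `train` are hypothetical.

```python
import math

def fake_quant(w, scale=0.25):
    """Round w to the nearest multiple of `scale` (round-half-up)."""
    return scale * math.floor(w / scale + 0.5)

def train(use_ste, steps=20, lr=0.1, target=0.75):
    w = 0.1  # full-precision latent weight
    for _ in range(steps):
        grad_q = 2 * (fake_quant(w) - target)  # dL/dQ for L = (Q(w) - target)^2
        dq_dw = 1.0 if use_ste else 0.0         # STE pretends dQ/dw = 1; the true value is 0
        w -= lr * grad_q * dq_dw
    return w

w_true = train(use_ste=False)  # true gradient: the weight never moves
w_ste = train(use_ste=True)    # STE: the latent weight drifts until Q(w) hits the target

print(f"true gradient: w={w_true:.3f}, Q(w)={fake_quant(w_true):.2f}")
print(f"STE:           w={w_ste:.3f}, Q(w)={fake_quant(w_ste):.2f}")
```

Note that the latent weight itself does not end at 0.75; it only needs to land anywhere inside the rounding bin whose grid value is 0.75.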

Full Forward/Backward Picture

In QAT, you maintain a full-precision shadow copy of the weights Wfp32. In the forward pass, you compute Q(W) and run the layer with quantized weights. Gradients flow back through Q using the STE, updating Wfp32. The small gradient steps accumulate in full precision without rounding error. At inference, you export only the quantized weights.

Forward: Wfp32 → Q(W) → Layer → output
Fake-quant: discretize weights and activations, but run actual arithmetic in FP32
↓ backward
Backward: ∂L/∂Wfp32 ← ∂L/∂Q(W) via STE
Gradient passes through the quantizer; Wfp32 accumulates updates in full precision
↓ export
Inference: only Q(W) is used
Deploy the quantized weights; Wfp32 is discarded
STE Visualizer — the round() staircase forward + identity straight-through gradient

Blue staircase = Q(w) forward. Orange line = the STE gradient (∂L/∂w ← ∂L/∂Q(w)), which is identical to the incoming gradient, passed straight through. The true gradient of Q(w) w.r.t. w is zero everywhere except at the staircase jumps, where it is undefined.

During QAT backward pass, the gradient signal arrives at the quantizer: ∂L/∂Q(w) = −0.35. What does the STE pass downstream to ∂L/∂w?

Chapter 4: Quantization-Aware Training

With the STE in hand, you can build a QAT training loop. The key concept is the fake-quantization node (also called simulated quantization): a differentiable approximation that behaves like a quantizer in the forward pass but allows gradient flow in the backward pass.

python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """
    Forward: round to N-bit quantization grid (fake-quant).
    Backward: Straight-Through Estimator — pass gradient unchanged.
    """
    @staticmethod
    def forward(ctx, x, scale, zero_point, bits):
        q_min = -2 ** (bits - 1)
        q_max = 2 ** (bits - 1) - 1
        # Quantize: x → integer grid
        q = torch.clamp(torch.round(x / scale + zero_point), q_min, q_max)
        # Dequantize: back to float (still in forward, but now on grid)
        x_fake_quant = scale * (q - zero_point)
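        # Save a mask: gradient passes only where x is inside the representable
        # range, which depends on the zero-point as well as the scale
        lo = scale * (q_min - zero_point)
        hi = scale * (q_max - zero_point)
        ctx.save_for_backward(((x >= lo) & (x <= hi)).to(x.dtype))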
        return x_fake_quant

    @staticmethod
    def backward(ctx, grad_output):
        mask, = ctx.saved_tensors
        # STE: pass gradient through where not clipped, else 0
        grad_input = grad_output * mask
        return grad_input, None, None, None


class QATLinear(nn.Module):
    """Linear layer with fake-quant on weights and activations."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits
        self.q_max = 2 ** (bits - 1) - 1

    def forward(self, x):
        # Weight scale from the current absmax. It is detached rather than
        # learned, because FakeQuantize.backward returns no gradient for scale.
        w_scale = (self.linear.weight.detach().abs().max() / self.q_max).clamp(min=1e-8)
        w_fq = FakeQuantize.apply(self.linear.weight, w_scale, 0, self.bits)
        # Fake-quantize activations (scale from per-batch absmax; clamp avoids 0/0)
        a_scale = (x.detach().abs().max() / self.q_max).clamp(min=1e-8)
        x_fq = FakeQuantize.apply(x, a_scale, 0, self.bits)
        return nn.functional.linear(x_fq, w_fq, self.linear.bias)

BatchNorm Folding for QAT

At inference, BatchNorm layers are almost always folded into the preceding convolution/linear layer (the mean/variance subtraction and scale/shift become constant modifications to W and b). For QAT to be accurate, you must simulate this folding during training — otherwise the QAT sees weights that will be different at inference.
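The folding itself is a per-channel rescale of the weights and a shift of the bias. A minimal sketch for a single output channel, using the standard inference-mode BatchNorm formula y = γ · (z − μ) / √(σ² + ε) + β; the helper names and all numeric values are made up for illustration.

```python
import math

def fold_bn(w_row, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold one output channel's BatchNorm statistics into its weights and bias."""
    k = gamma / math.sqrt(var + eps)
    return [w * k for w in w_row], (bias - mean) * k + beta

def linear(w_row, bias, x):
    return sum(w * xi for w, xi in zip(w_row, x)) + bias

# Made-up layer parameters and BatchNorm running statistics.
w_row, bias = [0.5, -1.2, 0.3], 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.4, 2.0
x = [1.0, 2.0, -1.0]

# Reference: linear layer followed by inference-mode BatchNorm.
z = linear(w_row, bias, x)
y_ref = gamma * (z - mean) / math.sqrt(var + 1e-5) + beta

# Folded: a single linear op with rescaled weights and shifted bias.
w_fold, b_fold = fold_bn(w_row, bias, gamma, beta, mean, var)
y_fold = linear(w_fold, b_fold, x)

print(y_ref, y_fold)
```

For QAT the relevant consequence is that the fake-quant node should see `w_fold`, since those are the values that will actually be quantized at deployment.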

QAT protocol: (1) Start from a pretrained FP32 model; fine-tuning converges in 10–100× fewer steps than training from scratch. (2) Fold BN into the preceding layer. (3) Insert fake-quant nodes on the weights and output activations of every quantizable layer. (4) Train for 5–10% of the original training budget. (5) Export by running real quantization on the shadow FP32 weights.
| Model | FP32 Acc. | PTQ INT8 (per-tensor) | QAT INT8 |
|---|---|---|---|
| MobileNetV1 | 70.9% | 59.1% (−11.8%) | 70.7% (−0.2%) |
| MobileNetV2 | 71.9% | 69.8% (−2.1%) | 71.1% (−0.8%) |
| NASNet-Mobile | 74.9% | 72.2% (−2.7%) | 73.0% (−1.9%) |
Why does QAT maintain a full-precision (FP32) shadow copy of the weights during training, rather than directly training the quantized integer weights?

Chapter 5: PTQ vs QAT: When to Use What

PTQ and QAT are not competing — they're tools for different situations. The decision tree comes down to three factors: bitwidth, model size, and whether you have access to training data and compute.

When PTQ is sufficient

For INT8 quantization on large models (ResNets, BERT, large Transformers), PTQ with good calibration (percentile or KL) typically achieves less than 0.5% accuracy drop. The model has enough representational capacity to absorb the quantization error. PTQ takes minutes; QAT takes days.

When QAT is necessary

For INT4 and below, or for small efficient models (MobileNet, EfficientNet-lite), PTQ degrades unacceptably. MobileNetV1 at INT8 per-tensor loses 11.8% accuracy with PTQ — that's the difference between a useful model and a useless one. QAT recovers it to 0.2% loss. The training cost is justified.

Misconception: "If PTQ fails, just use more calibration data." More calibration helps up to a point, but PTQ is fundamentally limited by the fact that it never adjusts the weights. The model is solving for quantization parameters that fit a fixed distribution of weights. QAT actually changes the weights — it moves them to positions where the quantization grid can represent them accurately.

The Bitwidth Decision

A practical heuristic: use PTQ for INT8, consider QAT for INT4 and below. The transition point moves depending on model size — larger models can absorb INT4 PTQ (see GPTQ, AWQ for LLMs). Smaller models need QAT or at minimum per-channel calibration and careful range tuning.

PTQ vs QAT Accuracy vs Bitwidth — drag the bits slider to see the gap open at lower precision

Schematic accuracy curves for MobileNetV2-style small model. QAT (teal) holds accuracy much better than PTQ (orange) as bitwidth decreases below 8 bits.

| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Large model, INT8 | PTQ (percentile or KL) | Accuracy loss <0.5%; minutes of work |
| Small model, INT8 | PTQ per-channel or QAT | Per-tensor PTQ may collapse; per-channel often sufficient |
| Any model, INT4 | QAT preferred; try GPTQ/AWQ for LLMs | PTQ INT4 loses significant accuracy without training signal |
| LLM, W8A8 | SmoothQuant + PTQ | Outlier migration makes activations quantizable without retraining |
| LLM, W4A16 | GPTQ or AWQ | Second-order weight correction or salient-channel scaling without full QAT |
A startup needs to deploy MobileNetV2 on a microcontroller with INT4 weight quantization. They have the original training dataset and GPU compute. Which method should they use?

Chapter 6: LLM Outliers & SmoothQuant

Large language models beyond 6.7B parameters develop a peculiar quantization pathology. Certain hidden-state channels consistently activate with values 10–100× larger than their neighbors. The phenomenon is systematic: the same channel indices spike across every token, every sequence, every context. It's a structural feature of how large Transformers represent information — not a distribution artifact you can calibrate away.

A Transformer linear layer computes Y = XW. In standard INT8, you'd quantize both X (activations) and W (weights) independently, then multiply the integers. But X has channels with max|X_j| ≈ 70 while others have max|X_k| ≈ 0.5. A single per-tensor scale of 70/127 ≈ 0.55 leaves channel k with at most three usable codes (−1, 0, +1), and every value below about 0.28 rounds to zero. The matrix multiply is destroyed.

Why W8A8 failed on LLMs before SmoothQuant: Standard CNN quantization treats all channels equally. LLM activations have a structural outlier channel that dominates the scale, zero-ing out the 99% of channels with normal ranges. You can't fix this with better calibration — the outlier is a real value, not noise. You need to change the problem.

SmoothQuant: Migrate the Difficulty

The key insight from Xiao et al. (2022): weights are easy to quantize, activations are hard to quantize, but we can balance this difficulty by migrating it from activations to weights. Weights can absorb more difficulty because they're quantized per-channel. Activations must use a single scale per token.

The mathematical trick: introduce a per-channel scale vector s, with one entry s_j per input channel j. Because Y = XW, you can insert diag(s)^(−1) and diag(s) without changing the result:

Y = XW = (X diag(s)^(−1)) · (diag(s) W) = X̂ · Ŵ

Now X̂_{:,j} = X_{:,j} / s_j (divide each activation channel by s_j) and Ŵ_{j,:} = s_j · W_{j,:} (multiply the corresponding weight row by s_j). The scale migrates the magnitude from activations to weights.

Deriving the Optimal Scale

How big should s_j be? Too large, and the weights become hard to quantize. Too small, and the activations stay hard. Song Han's group derived a formula with a migration strength parameter α ∈ [0, 1]:

s_j = max(|X_j|)^α / max(|W_j|)^(1−α)

Worked numbers. Say channel j has max|X_j| = 60 and max|W_j| = 2. With α = 0.5:

s_j = 60^0.5 / 2^0.5 = 7.746 / 1.414 ≈ 5.477

After smoothing: max|X̂_j| = 60 / 5.477 ≈ 10.95 and max|Ŵ_j| = 2 × 5.477 ≈ 10.95. The outlier activation dropped from 60 to about 10.95 (a 5.5× reduction). The weight grew from 2 to about 10.95, still manageable with per-channel quantization. With α = 0.5 the two maxima are equal by construction (both are √(60 · 2)), which makes per-token activation quantization and per-channel weight quantization both effective.

The α tradeoff. α = 0: all difficulty stays in activations (original problem). α = 1: all difficulty moves to weights (weights become huge, activations perfect). Sweet spot α ≈ 0.5 balances both, making the joint quantization error minimal. The optimal α can be tuned per-layer on calibration data.
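The trade-off is easy to tabulate for the channel in the worked numbers (max|X_j| = 60, max|W_j| = 2). The sketch below is plain-Python arithmetic rather than a tensor implementation; `smooth_scale` is a hypothetical helper, and the three-channel vectors in the equivalence check are made up.

```python
def smooth_scale(act_max, w_max, alpha):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return act_max ** alpha / w_max ** (1 - alpha)

act_max, w_max = 60.0, 2.0
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    s_j = smooth_scale(act_max, w_max, alpha)
    print(f"alpha={alpha:.2f}  s={s_j:7.3f}  act_max={act_max / s_j:8.3f}  w_max={w_max * s_j:8.3f}")

# Equivalence check for one token and one output unit:
# sum_j (x_j / s_j) * (s_j * w_j) equals sum_j x_j * w_j.
x = [60.0, 0.5, -0.8]
w = [2.0, 1.0, -0.4]
scales = [smooth_scale(abs(xi), abs(wi), 0.5) for xi, wi in zip(x, w)]
y = sum(xi * wi for xi, wi in zip(x, w))
y_smooth = sum((xi / si) * (wi * si) for xi, wi, si in zip(x, w, scales))

s_half = smooth_scale(act_max, w_max, 0.5)
print(y, y_smooth, s_half)
```

At the two extremes one side absorbs the entire range (120 on one side, 1 on the other), while α = 0.5 splits it evenly.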

python
import torch

def compute_smoothquant_scale(X_calibration, W, alpha=0.5):
    """
    X_calibration: collected activation tensor [n_tokens, d_in]
    W: weight matrix [d_out, d_in] (nn.Linear layout, so Y = X @ W.T)
    Returns: per-channel scale s of shape [d_in]
    """
    # Per-channel max of activations (over all tokens); clamp avoids zero scales
    act_max = X_calibration.abs().max(dim=0).values.clamp(min=1e-5)  # [d_in]
    # Per-input-channel max of the weights (max over the d_out axis);
    # clamp here so the division below can never hit zero
    w_max = W.abs().max(dim=0).values.clamp(min=1e-5)                # [d_in]
    # Balancing scale
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return s

def apply_smooth_quant(X, W, s):
    """Fold the scale: smooth activations, absorb into weights."""
    X_smooth = X / s.unsqueeze(0)   # divide activation channel j by s_j
    W_smooth = W * s.unsqueeze(0)   # multiply input channel j of W (column j in this layout) by s_j
    # X_smooth @ W_smooth.T equals X @ W.T; quantize the two independently with INT8
    return X_smooth, W_smooth
SmoothQuant divides activation channel j by sj = 5.48 and multiplies weight row j by 5.48. Why doesn't this change the model's output Y?

Chapter 7: GPTQ & AWQ

SmoothQuant enables W8A8 (both weights and activations quantized to INT8) — excellent for batch serving on data-center GPUs. But single-query LLM inference on a local machine has a different bottleneck: memory bandwidth. At batch size 1, the GPU compute is idle while it waits for 65B parameters to transfer from HBM to registers. W8A8 halves the memory but still needs a fast GPU. For edge devices, you need W4A16: 4-bit weights, 16-bit activations — only weights are compressed, and they're decompressed on-the-fly to FP16 for compute.
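The memory-bandwidth argument reduces to simple arithmetic. The sketch below assumes that decoding one token at batch size 1 reads every weight once, and it uses a round 1000 GB/s bandwidth figure purely for illustration; neither the bandwidth number nor the function names come from the lecture.

```python
def weight_gigabytes(n_params, bits_per_weight):
    """Storage for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def ms_per_token(n_params, bits_per_weight, bandwidth_gb_s):
    """Lower bound on decode latency if every weight is read once per token."""
    return weight_gigabytes(n_params, bits_per_weight) / bandwidth_gb_s * 1e3

N_PARAMS = 65e9          # the 65B model from the text
BANDWIDTH_GB_S = 1000.0  # assumed memory bandwidth, illustrative only

for name, bits in (("FP16", 16), ("W8", 8), ("W4", 4)):
    gb = weight_gigabytes(N_PARAMS, bits)
    ms = ms_per_token(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{name:5s} weights: {gb:6.1f} GB, at least {ms:6.1f} ms per token")
```

Under this model the latency bound scales linearly with bits per weight, which is why weight-only 4-bit formats help even though the arithmetic still runs in FP16.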

GPTQ: Second-Order Weight Quantization

GPTQ (Frantar et al., 2022) is a layer-wise PTQ method that finds the best 4-bit weights by minimizing the output error: for each layer, find the quantized Ŵ that minimizes ‖WX − ŴX‖_F^2. It uses second-order information, the Hessian of this layer-wise reconstruction error (computed from the calibration inputs X), to make smarter rounding decisions than round-to-nearest.

Key insight: not all weight rounding errors are equal. A weight with high Hessian curvature (high second derivative) contributes more to output error when rounded badly than a weight with low curvature. GPTQ quantizes the weights one column at a time in a fixed order (its predecessor, Optimal Brain Quantization, picked the next weight greedily) and after each rounding updates the remaining unquantized weights to absorb the error.

GPTQ in practice: Quantizes 175B-parameter models (OPT-175B, BLOOM-176B) to 3 or 4 bits in about 4 GPU-hours. The quantization MSE is 2–4× lower than round-to-nearest. On WikiText perplexity, GPTQ 4-bit matches or nearly matches FP16 on models >13B; the larger the model, the more it can absorb quantization error.
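The round-then-compensate idea can be shown on a toy row of two weights. This is a heavily simplified sketch, not the GPTQ implementation: it assumes a fixed left-to-right order, an integer grid (step 1.0), H = X Xᵀ as the Hessian up to a constant factor, and a naive matrix inverse in place of the Cholesky-based, blocked updates used in practice. All names and values are hypothetical.

```python
import math

def invert(m):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def round_to_grid(w, step):
    return step * math.floor(w / step + 0.5)

def quantize_with_compensation(w, X, step=1.0):
    """Quantize one weight row left to right, pushing each rounding error onto later weights."""
    n = len(w)
    H = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
    h_inv = invert(H)
    w = list(w)
    q = [0.0] * n
    for i in range(n):
        q[i] = round_to_grid(w[i], step)
        err = (w[i] - q[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[j][i]          # compensate the not-yet-quantized weights
        d, col, row = h_inv[i][i], [r[i] for r in h_inv], list(h_inv[i])
        for r in range(i + 1, n):              # drop weight i from the inverse Hessian
            for c in range(i + 1, n):
                h_inv[r][c] -= col[r] * row[c] / d
    return q

def output_error(w, q, X):
    """Squared error ||wX - qX||^2 for one weight row."""
    return sum(
        sum((wi - qi) * X[i][k] for i, (wi, qi) in enumerate(zip(w, q))) ** 2
        for k in range(len(X[0]))
    )

w = [0.4, 0.4]
X = [[1.0, 0.0], [1.0, 1.0]]  # rows are input channels, columns are calibration samples

q_rtn = [round_to_grid(wi, 1.0) for wi in w]
q_comp = quantize_with_compensation(w, X)
err_rtn, err_comp = output_error(w, q_rtn, X), output_error(w, q_comp, X)
print(q_rtn, err_rtn, q_comp, err_comp)
```

Round-to-nearest sends both weights to 0. With compensation, the error from rounding the first weight is shifted onto the second, which then rounds up instead, and the layer output ends up closer to the original.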

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) starts from a striking observation: 1% of weight channels are disproportionately important for model quality. Keeping just those 1% in FP16 while quantizing the remaining 99% to INT4 dramatically reduces perplexity. But which 1% to keep?

The answer: look at activations, not weights. Weight channels paired with activation channels that have large magnitudes are the most important — a large activation multiplied by a weight error produces a large output error. These salient channels are identified by running calibration data and measuring the average activation magnitude per input channel.

Instead of keeping salient channels in FP16 (which breaks hardware uniformity), AWQ uses the same scale trick as SmoothQuant, but just for the salient channels:

Q(W · s) · (s^(−1) · X) = Q(Ŵ) · X̂

Scaling up W_salient by s before quantization shrinks the relative rounding error on the important values. The inverse scale is applied to the corresponding activation channel, which is cheap because it can typically be folded into the preceding operator.

Why does scaling up a weight channel reduce its quantization error? Rounding error is proportional to the step size Δ = max(|w|)/2^(N−1), which is shared by every weight in the quantization group. Scale one salient weight w by s > 1 and the group maximum usually does not change, so the new step Δ' ≈ Δ and the rounding error in the scaled domain is about the same as before. The output, however, is computed as Q(w·s)·(x/s), so that rounding error is divided by s: the output error goes from roughly Δ·RoundErr·x to Δ'·RoundErr·x/s, a ratio of (Δ'/Δ)·(1/s) ≈ 1/s. The trade-off is that a very large s eventually raises the group maximum, which increases Δ' and hurts the non-salient weights, so AWQ searches for s on calibration data. For salient channels, where x is large, the improvement is dramatic.
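The effect of the scale on a salient channel can be measured on a toy quantization group. The sketch below assumes symmetric INT4 with one shared step per group, a fixed scale s = 2 (AWQ itself searches for s), and made-up weights chosen so that scaling the salient weight never changes the group maximum. It averages over a sweep of salient weight values, because any single value can round favorably or unfavorably by chance.

```python
import math

def quantize_group(ws, bits=4):
    """Symmetric quantize/dequantize with one step size shared by the whole group."""
    q_max = 2 ** (bits - 1) - 1
    delta = max(abs(w) for w in ws) / q_max
    return [delta * max(-q_max - 1, min(q_max, math.floor(w / delta + 0.5))) for w in ws]

def salient_error(w_salient, s, others=(0.7, -0.33, 0.2), x_salient=20.0):
    """|output error| contributed by the salient channel when its weight is scaled by s."""
    deq = quantize_group([w_salient * s, *others])[0]
    return abs(w_salient - deq / s) * x_salient

# Sweep the salient weight over [-0.3, 0.3]; 0.3 * 2 stays below the group max of 0.7.
grid = [-0.3 + 0.001 * k for k in range(601)]
mean_plain = sum(salient_error(w, 1.0) for w in grid) / len(grid)
mean_scaled = sum(salient_error(w, 2.0) for w in grid) / len(grid)

print(f"mean error, s=1: {mean_plain:.3f}")
print(f"mean error, s=2: {mean_scaled:.3f}")
print(f"ratio: {mean_scaled / mean_plain:.2f}")
```

With the step size unchanged, the average error on the salient channel falls by about the factor s, which is the 1/s behavior described above.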

Cross-Layer Equalization (CLE) for PTQ

Before SmoothQuant, Nagel et al. (ICCV 2019) proposed Cross-Layer Equalization (CLE) for CNN PTQ. The idea: consecutive layers separated by a positive-homogeneous activation function (like ReLU) can be rescaled without changing the network output. ReLU is positive-homogeneous: ReLU(sx) = s·ReLU(x) for s > 0.

For two consecutive linear layers W^(1) and W^(2) with a ReLU between them, introduce a per-channel scale s_c:

Ŵ^(1)[c] = W^(1)[c] / s_c   (rescale output channels of layer 1)
Ŵ^(2)[:,c] = W^(2)[:,c] × s_c   (rescale input channels of layer 2)

The optimal per-channel scale that equalizes quantization ranges is: s_c = √(max|W^(1)[c]| / max|W^(2)[:,c]|). After rescaling, both layers have the same range for channel c, equal to the geometric mean of the two original ranges. CLE is data-free (no calibration data needed) and can precede any other PTQ method.
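Both properties, unchanged output and equalized ranges, can be verified on a tiny two-layer network. The sketch uses made-up 2 × 2 weights and also divides the first layer's bias by s_c, which is needed for exact equivalence when a bias is present; the helper names are illustrative.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x, b=None):
    out = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return out if b is None else [o + bi for o, bi in zip(out, b)]

# Made-up network: y = W2 · ReLU(W1 · x + b1)
W1 = [[0.02, -0.05], [4.0, -2.5]]  # hidden channel 0 is tiny, channel 1 is large
b1 = [0.01, -0.3]
W2 = [[3.0, 0.04], [-1.5, 0.02]]   # column c reads hidden channel c

n_hidden = len(W1)
r1 = [max(abs(w) for w in W1[c]) for c in range(n_hidden)]        # range of row c in layer 1
r2 = [max(abs(row[c]) for row in W2) for c in range(n_hidden)]    # range of column c in layer 2
s = [math.sqrt(r1[c] / r2[c]) for c in range(n_hidden)]

W1_eq = [[w / s[c] for w in W1[c]] for c in range(n_hidden)]
b1_eq = [b1[c] / s[c] for c in range(n_hidden)]
W2_eq = [[row[c] * s[c] for c in range(n_hidden)] for row in W2]

x = [0.7, -1.1]
y = matvec(W2, relu(matvec(W1, x, b1)))
y_eq = matvec(W2_eq, relu(matvec(W1_eq, x, b1_eq)))

r1_eq = [max(abs(w) for w in W1_eq[c]) for c in range(n_hidden)]
r2_eq = [max(abs(row[c]) for row in W2_eq) for c in range(n_hidden)]
print(y, y_eq)
print(r1_eq, r2_eq)
```

Before equalization the per-channel ranges are (0.05, 4.0) in layer 1 and (3.0, 0.04) in layer 2; afterwards each channel has the same range in both layers.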

PTQ Calibration Loop in Practice

python
import torch
from collections import defaultdict

def run_ptq_calibration(model, calibration_loader, bits=8, max_batches=32):
    """Collect activation statistics and compute per-layer quant params."""
    model.eval()
    act_stats = defaultdict(lambda: {'min': float('inf'), 'max': float('-inf'), 'all': []})
    hooks = []

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            def make_hook(n):
                def hook(m, inp, out):
                    x = out.detach()
                    act_stats[n]['min'] = min(act_stats[n]['min'], x.min().item())
                    act_stats[n]['max'] = max(act_stats[n]['max'], x.max().item())
                    # Keep a small sample of |x| per batch (slice before converting to a list)
                    act_stats[n]['all'].extend(x.abs().flatten()[:512].tolist())
                return hook
            hooks.append(module.register_forward_hook(make_hook(name)))

    try:
        with torch.no_grad():
            for i, (x, _) in enumerate(calibration_loader):
                if i >= max_batches:  # 32 batches is typically sufficient
                    break
                model(x)
    finally:
        # Always detach the hooks, even if a forward pass raises
        for h in hooks:
            h.remove()

    quant_params = {}
    q_max = (1 << (bits - 1)) - 1
    for name, stats in act_stats.items():
        # Percentile calibration (99.9th percentile of the sampled |x|)
        vals = sorted(stats['all'])
        thresh = vals[min(int(len(vals) * 0.999), len(vals) - 1)]
        scale = max(thresh, 1e-8) / q_max  # guard against an all-zero layer
        quant_params[name] = {'scale': scale, 'zero_point': 0}

    return quant_params
| Method | Type | Quantization | Best for | Requires retraining? |
|---|---|---|---|---|
| SmoothQuant | W8A8 | Weights + activations INT8 | Batch serving, datacenter | No (calibration only) |
| GPTQ | W4A16 | Weights INT4, activations FP16 | Large models, edge | No (second-order opt) |
| AWQ | W4A16 | Weights INT4 w/ salient scaling | Local inference, TinyChat | No (calibration + scale search) |
| QAT | W4A8 or W4A4 | Both quantized, training-aware | Small models, aggressive quant | Yes |
| CLE | W8A8 | Per-channel weight rescaling | Data-free CNN PTQ baseline | No (data-free) |
AWQ determines which 1% of weight channels to protect by looking at the magnitude of the corresponding activation channels, not the weight magnitudes themselves. Why?

Chapter 8: Showcase: SmoothQuant & AWQ Explorer

This is the payoff chapter. Three linked canvases let you see the full quantization difficulty story, the SmoothQuant migration in action, and the AWQ salient-channel protection — all with live controls.

Canvas A — Calibration Method Comparison (PTQ on a vector with outlier)

An activation histogram with a configurable outlier. The three colored dashed lines show where each calibration method sets its clipping threshold. Lower total MSE is better.

Canvas B — SmoothQuant Migration (drag α to shift difficulty between activations and weights)

Two bars: activation max (orange) and weight max (teal) before and after SmoothQuant. The red zone marks "hard to quantize" (above the INT8-friendly threshold). Drag α to find the sweet spot where both bars are in the green zone.

Canvas C — AWQ Salient Channel Protection (activation magnitude → which channels to protect)

16 weight channels, colored by their corresponding activation magnitude (bright = high activation = salient). Toggle "Protect top 1%" to scale up salient channels before INT4 quantization. Watch the output error (red bar) drop dramatically.

Summary: what you just saw. Canvas A: any calibration method is better than min/max in the presence of outliers; KL-divergence minimizes information loss. Canvas B: SmoothQuant's α = 0.5 sweet spot typically balances activation and weight quantization difficulty. Canvas C: protecting the 2 most salient weight channels (by activation magnitude) shrinks the INT4 quantization error by 30–60%.

Chapter 9: Connections & Cheat Sheet

Complete Decision Guide: PTQ vs QAT vs LLM Methods

| Situation | Method | Key Paper |
|---|---|---|
| Large model (ResNet, BERT+), INT8 | PTQ + KL-div calibration | TensorRT (Migacz 2017) |
| Small model (MobileNet), INT8 | PTQ per-channel or QAT | Krishnamoorthi 2018 |
| Any model, INT4 | QAT | Jacob et al. CVPR 2018 |
| LLM >6.7B, W8A8 (batch serving) | SmoothQuant + PTQ | Xiao et al. 2022 |
| LLM, W4A16 (local/edge) | GPTQ or AWQ | Frantar 2022; Lin 2023 |
| LLM, fine-tuning needed | QLoRA (LoRA adapters trained on a frozen 4-bit base model) | Dettmers 2023 |

Calibration Methods Cheat Sheet

| Method | Threshold | Strength | Weakness |
|---|---|---|---|
| Min/Max | abs max observed | No clipping | Outliers ruin scale |
| Percentile (p=99.9%) | p-th percentile | Robust to outliers | Must choose p |
| KL-Divergence | argmin_T KL(FP32 \|\| INT8) | Minimizes information loss | Computationally expensive |
| MSE (OCTAV) | argmin_T MSE | Optimal for Laplace dist. | Assumes distribution shape |

SmoothQuant One-Liner

Multiply each activation channel by s_j^(−1) and each weight row by s_j, where s_j = max(|X_j|)^α / max(|W_j|)^(1−α). The scaling is mathematically equivalent (Y = XW = (X/s)·(sW)) and is absorbed offline into weights and the preceding LayerNorm.

GPTQ One-Liner

Layer-wise second-order PTQ: quantize weights one at a time in a fixed order, and after each rounding use the inverse Hessian of the layer's reconstruction error to update the remaining unquantized weights so they absorb the just-introduced error. Enables INT4 on 175B-parameter models in ~4 GPU-hours.

AWQ One-Liner

Identify the 1% of input channels with the largest activation magnitudes. Scale those weight channels up by s before INT4 quantization (and scale the activations down correspondingly). The error on salient channels drops ∝ 1/s.

W8A8 vs W4A16 Guide

| Format | Weights | Activations | Best batch size | Speedup mechanism |
|---|---|---|---|---|
| FP16 | FP16 | FP16 | Any | Baseline |
| W8A8 | INT8 | INT8 | ≥32 (compute-bound) | INT8 GEMM throughput |
| W4A16 | INT4 | FP16 | 1–4 (memory-bound) | Reduced weight transfer |
The big picture: Quantization techniques form a layered hierarchy. L5 laid the mathematical foundation (affine mapping, integer arithmetic). This lesson built the deployment stack: calibrate activations with percentile/KL, use QAT when PTQ fails, apply SmoothQuant to tame LLM outliers, and use GPTQ/AWQ for 4-bit weight-only compression. Next up: Neural Architecture Search (L7) — finding architectures that are inherently efficient, not just compressed post-hoc.


"What I cannot create, I do not understand." — Feynman. You've now built every piece of the quantization stack from scratch — the calibration loop, the fake-quant node, the STE backward pass, the SmoothQuant scale formula. You understand quantization.