Quantization II: PTQ, QAT & LLM Quantization

Chapter 0: The Outlier Problem

Lecture 5 gave you the tools to quantize weights offline. You compute the scale S and zero-point Z from the weight tensor, quantize once, and you're done. Weights are static — they don't change at inference time.

Activations are different. They're the intermediate values flowing through the network — the output of every ReLU, the input to every Linear layer. They depend entirely on the input you feed the model. A picture of a cat produces different activations than a picture of a sky. So you can't precompute their range offline the same way you did for weights.

Worse: at scale, activations develop systematic outliers. In large language models beyond 6.7B parameters, Dettmers et al. (LLM.int8(), 2022) found that specific hidden channels consistently spike to values 10–100× larger than their neighbors on every single inference, regardless of input. One channel in a 4096-dimensional hidden state might hit ±70 while the other 4095 channels sit between ±1.

The outlier taxation problem: Your INT8 range has 256 discrete levels. If a single channel forces max(|x|) = 70, your scale becomes S = 70/127 ≈ 0.55. That means an activation at 0.1 gets rounded to the nearest multiple of 0.55, with a quantization error of up to ±0.27, which is larger than the value itself. The outlier taxes every other value in the tensor.
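The taxation effect can be reproduced in a few lines. The following is a minimal sketch in plain Python, assuming symmetric absmax quantization and a made-up 8-channel activation vector (both are illustrative choices, not taken from the lecture).

```python
def absmax_quantize(values, bits=8):
    """Symmetric quantization: one scale derived from the largest magnitude."""
    q_max = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = max(abs(v) for v in values) / q_max
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    dequant = [c * scale for c in codes]
    return scale, codes, dequant

def mean_abs_error(values, dequant, n):
    """Mean |x - Q(x)| over the first n entries (the well-behaved channels)."""
    return sum(abs(v - d) for v, d in zip(values[:n], dequant[:n])) / n

# Hypothetical activation vector with every value inside [-1, 1].
normal = [0.1, -0.3, 0.5, 0.8, -0.9, 0.25, -0.6, 0.05]

scale_clean, _, deq_clean = absmax_quantize(normal)
scale_taxed, codes_taxed, deq_taxed = absmax_quantize(normal + [70.0])

err_clean = mean_abs_error(normal, deq_clean, len(normal))
err_taxed = mean_abs_error(normal, deq_taxed, len(normal))

print(f"scale without outlier: {scale_clean:.4f}, mean |error|: {err_clean:.4f}")
print(f"scale with outlier:    {scale_taxed:.4f}, mean |error|: {err_taxed:.4f}")
print("codes with outlier:", codes_taxed)
```

With the outlier present, the eight normal channels are squeezed into a handful of integer codes around zero, while the outlier alone occupies the top of the range.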

This lecture answers: how do you choose the quantization range for activations (calibration), how do you train the model to be quantization-friendly (QAT), and how do you handle LLMs where outliers are structural, not incidental?

Ch 0–1: The Problem + Calibration

Activation range varies at inference; min/max, percentile, and KL-divergence methods to find the best clipping threshold

↓

Ch 2–3: Granularity + STE

Per-tensor vs per-channel vs per-group; the Straight-Through Estimator — why you can backprop through round()

↓

Ch 4–5: QAT + Decision

Fake-quant nodes in the forward pass; QAT vs PTQ accuracy–bitwidth tradeoff; when to pay the training cost

↓

Ch 6–7: LLM Methods

SmoothQuant migrates quantization difficulty from activations to weights; GPTQ second-order weight quant; AWQ protects salient channels

↓

Ch 8–9: Showcase + Cheat Sheet

Full SmoothQuant explorer with α slider; decision table; W8A8 vs W4A16 guide

Outlier Taxation — see how one rogue channel inflates quantization error for all others

A 16-channel activation vector: 15 channels drawn from N(0,1), one outlier channel. Toggle the outlier and watch the quantization scale blow up — and how the error on the normal channels explodes.

L5 covered quantizing weights offline. Why can't you quantize activations the same way — compute S and Z once at model-creation time?

You can: activation ranges are fixed by the weight values, so they don't change.
You can't because activations use floating-point arithmetic and weights use integers.
You can't because activations depend on the input data, so their range changes with every inference.
You can't because activations are larger tensors than weights, so they require more bits.

Chapter 1: Calibration Methods

If you can't precompute activation ranges analytically, you compute them empirically. Calibration means running a small representative dataset through your trained FP32 model, recording the activation distributions at every layer, and deriving the best S and Z from those statistics before deployment. No retraining required.

Method 1: Min/Max (naive)

The simplest approach: collect min and max across all calibration samples, and use those as r_min and r_max. This guarantees no clipping — every observed value fits in the INT8 range. The problem is that a single outlier in your calibration set can force S to accommodate it, wasting resolution on everyone else.

Misconception: "Min/max calibration is safe because nothing clips." The absence of clipping is not the same as good quantization. If one activation in your dataset is 70 and the rest are between −1 and 1, min/max gives you a scale of 70/127 ≈ 0.55. Every value less than 0.55 quantizes to either 0 or ±1 — you've thrown away all resolution for 99% of your values to avoid clipping 1% of them.

Method 2: Percentile Clipping

Instead of using the absolute min and max, use the 99.9th or 99.99th percentile. You're deliberately clipping a tiny fraction of extreme values, accepting a small clipping error in exchange for much better resolution across the bulk of the distribution.

Worked example. Activation vector: [0.1, 0.3, −0.5, 0.8, 0.2, −0.9, 0.4, 0.6, −0.2, 52.0]

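Min/max: r_max = 52.0 → S = 52.0/127 ≈ 0.409. The value 0.1 quantizes to round(0.1/0.409) = 0. Every value with magnitude below about 0.2 maps to zero!
Percentile clip: r_max = 0.9 (ignoring the 52.0 outlier) → S = 0.9/127 ≈ 0.0071. With only 10 values, 0.9 is the 90th percentile; on a realistic calibration set with thousands of values, the 99th or 99.9th percentile plays the same role. The value 0.1 maps to round(0.1/0.0071) = 14. Excellent resolution, and 52.0 just clips to INT8_MAX.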

S_percentile = percentile(|x|, p) / q_max
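The formula above can be checked against the worked example. This sketch assumes a nearest-rank percentile definition (other definitions interpolate and give different thresholds on tiny samples); the helper names are illustrative. Because the example vector has only ten values, p = 90 is what selects 0.9 as the threshold here.

```python
import math

def percentile_abs(values, p):
    """Nearest-rank p-th percentile of |x|, with p in (0, 100]."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(p * len(mags) / 100))    # 1-based rank
    return mags[rank - 1]

def symmetric_scale(r_max, bits=8):
    return r_max / (2 ** (bits - 1) - 1)

def quantize(x, scale, bits=8):
    q_max = 2 ** (bits - 1) - 1
    return max(-q_max, min(q_max, round(x / scale)))  # values beyond r_max saturate

acts = [0.1, 0.3, -0.5, 0.8, 0.2, -0.9, 0.4, 0.6, -0.2, 52.0]

s_minmax = symmetric_scale(percentile_abs(acts, 100))  # p = 100 is plain min/max
s_clip = symmetric_scale(percentile_abs(acts, 90))     # clips the single outlier

print("min/max scale:", round(s_minmax, 4), "-> code for 0.1:", quantize(0.1, s_minmax))
print("clipped scale:", round(s_clip, 4), "-> code for 0.1:", quantize(0.1, s_clip))
print("code for the 52.0 outlier after clipping:", quantize(52.0, s_clip))
```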

Method 3: KL-Divergence (Entropy) Calibration

Developed by NVIDIA for TensorRT. Instead of minimizing MSE or avoiding clipping, you minimize the information loss from quantization. The idea: the FP32 activation distribution is your "truth." The INT8 representation is an approximation. KL divergence measures how much information the approximation loses.

KL(P || Q) = ∑_i P(x_i) · log( P(x_i) / Q(x_i) )

You sweep a threshold T from a small value up to the absolute max. For each T, you saturate all values above T, quantize to INT8, dequantize, and measure KL(FP32 hist || INT8 hist). The T that minimizes KL divergence is your winner. This typically clips a few percent of the distribution but dramatically reduces entropy loss.
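The sweep described above can be sketched as follows. This is a simplified, pure-Python illustration and not TensorRT's implementation: the bin count (512), the number of quantized levels (128), the epsilon guard against log(0), and the synthetic data are all assumptions made for the example.

```python
import math
import random

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) after normalizing both histograms; eps guards against log(0)."""
    p_sum, q_sum = sum(p), sum(q)
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            pn = pi / p_sum
            qn = max(qi / q_sum, eps)
            kl += pn * math.log(pn / qn)
    return kl

def expand_quantized(hist_slice, p, levels):
    """Merge the bins into `levels` chunks, then spread each chunk's mass
    uniformly over the bins of that chunk that are non-empty in p."""
    n = len(p)
    q = [0.0] * n
    for k in range(levels):
        start, end = k * n // levels, (k + 1) * n // levels
        mass = sum(hist_slice[start:end])
        support = [i for i in range(start, end) if p[i] > 0]
        for i in support:
            q[i] = mass / len(support)
    return q

def kl_calibrate(values, n_bins=512, levels=128):
    """Return the clipping threshold T with the smallest KL(P || Q)."""
    abs_max = max(abs(v) for v in values)
    width = abs_max / n_bins
    hist = [0] * n_bins
    for v in values:
        hist[min(n_bins - 1, int(abs(v) / width))] += 1
    best_t, best_kl = abs_max, float("inf")
    for i in range(levels, n_bins + 1):      # candidate threshold T = i * width
        inside = hist[:i]
        if sum(inside) == 0:
            continue
        p = [float(h) for h in inside]
        p[-1] += sum(hist[i:])               # saturate: clipped values land in the last bin
        q = expand_quantized(inside, p, levels)
        kl = kl_divergence(p, q)
        if kl < best_kl:
            best_kl, best_t = kl, i * width
    return best_t

random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(2000)] + [40.0]  # synthetic data
threshold = kl_calibrate(acts)
print(f"abs max = 40.0, KL-selected threshold = {threshold:.3f}")
```

The reference distribution P folds the clipped mass into its last bin, while the candidate Q is built only from the bins below the threshold, so clipping too aggressively is penalized alongside coarse resolution. Which threshold wins depends on the data and on the smoothing choice, so treat the printed value as illustrative.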

Method 4: MSE Minimization (OCTAV/Newton-Raphson)

Directly minimize E[(x − Q(x))²] over the clipping threshold. Under a Laplace(0, b) distribution, the optimal threshold for N-bit quantization has closed-form solutions: |r|_max = 2.83b (2-bit), 3.89b (3-bit), 5.03b (4-bit). The Laplace parameter b is estimated from your calibration set.
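As a rough illustration of how those coefficients would be used, the sketch below estimates b as the mean absolute value (the maximum-likelihood estimate for a zero-centered Laplace distribution) and multiplies it by the coefficient for the target bitwidth. The calibration values and function names are hypothetical.

```python
# Clipping coefficients quoted in the text for a Laplace(0, b) distribution.
LAPLACE_CLIP = {2: 2.83, 3: 3.89, 4: 5.03}

def laplace_b(values):
    """Estimate b for a zero-centered Laplace distribution: mean |x|."""
    return sum(abs(v) for v in values) / len(values)

def laplace_clip_threshold(values, bits):
    """Clipping threshold |r|_max = coefficient(bits) * b."""
    return LAPLACE_CLIP[bits] * laplace_b(values)

# Hypothetical calibration sample with one large value.
calib = [-1.0, -0.5, -0.25, -0.1, 0.0, 0.1, 0.25, 0.5, 6.0]

b = laplace_b(calib)
for bits in (2, 3, 4):
    print(f"{bits}-bit: b = {b:.4f}, clip at |x| <= {laplace_clip_threshold(calib, bits):.3f}")
```

Lower bitwidths clip more tightly, because with fewer levels the resolution gained inside the range outweighs the error from saturating the tails.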

Calibration Explorer — compare min/max vs percentile vs KL on a vector with an outlier

A 64-value activation distribution (mostly Gaussian + one outlier at +40). Toggle calibration method and see the threshold, scale, and total quantization MSE.

An activation tensor has values almost all in [−1, 1] but one outlier at +52. After INT8 min/max calibration, values near 0.05 will be quantized to approximately:

0: the scale becomes 52/127 ≈ 0.41, so 0.05/0.41 ≈ 0.12 → rounds to 0.
6: they're quantized with high resolution despite the outlier.
52: all small values collapse to the outlier.
127: the scale puts all positive values at INT8 max.

Chapter 2: Granularity Revisited

The outlier problem has a structural dimension: outliers are not random. In LLMs, they cluster in specific channels — the same input channel index has an outlier every single token, while its neighbors are well-behaved. This motivates quantization granularity: how much of the tensor shares a single (S, Z) pair?

Per-Tensor Quantization

One S and one Z for the entire weight matrix. Works well for large models with well-behaved distributions. Fails badly when different output channels have very different dynamic ranges — a common occurrence in early layers of MobileNet-style architectures.

Per-Channel Quantization

One S and Z per output channel. If the weight matrix is W ∈ ℝ^{C_out × C_in}, you have C_out scale factors. Each channel's scale is set by its own range, so a channel with large weights doesn't tax channels with small weights.

Why per-channel is standard for weights but not activations. Weight channels are fixed at export time — the C_out scales are constants stored alongside the model. Activation channels are different at every token. You'd need to compute per-channel scales at runtime, which costs extra ops and disrupts the INT8 GEMM pipeline. So weights get per-channel, activations get per-tensor (or per-token in some schemes).
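To make the per-tensor versus per-channel difference concrete, here is a small sketch on a made-up 3×4 weight matrix whose last output channel has a much larger range than the others. The matrix and helper names are assumptions for illustration.

```python
def absmax_scale(values, bits=8):
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)

def fake_quant(values, scale, bits=8):
    """Quantize then dequantize, so the error can be measured in float."""
    q_max = 2 ** (bits - 1) - 1
    return [scale * max(-q_max, min(q_max, round(v / scale))) for v in values]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical weight matrix [C_out=3, C_in=4]; row 2 dominates the range.
W = [
    [0.02, -0.01, 0.03, -0.025],
    [0.40, -0.35, 0.10, 0.22],
    [6.00, -4.50, 3.20, -5.10],
]

# Per-tensor: one scale shared by all rows.
tensor_scale = absmax_scale([w for row in W for w in row])
per_tensor = [fake_quant(row, tensor_scale) for row in W]

# Per-channel: one scale per output channel (row).
channel_scales = [absmax_scale(row) for row in W]
per_channel = [fake_quant(row, s) for row, s in zip(W, channel_scales)]

for c, row in enumerate(W):
    print(f"channel {c}: per-tensor MSE {mse(row, per_tensor[c]):.2e}, "
          f"per-channel MSE {mse(row, per_channel[c]):.2e}")
```

The large-range channel is quantized identically in both schemes; the gain comes entirely from the small-range channels, which no longer inherit its scale.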

Group Quantization

Group quantization is a middle ground. Instead of one scale per channel, you split each channel into groups of g elements and give each group its own scale. For a weight at INT4 with group size g=128, the effective bitwidth is 4 + 16/128 = 4.125 bits (the 16-bit scale overhead amortized over 128 values). Group quantization is used by GPTQ, AWQ, and many other modern LLM quantization schemes.
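The effective-bitwidth arithmetic and the grouping itself can be sketched as below. This assumes symmetric quantization with one scale per group and no zero-point; the function names and the sample row are illustrative.

```python
def effective_bits(weight_bits, scale_bits, group_size):
    """Bits per weight once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size

def group_quantize(row, group_size, bits=4):
    """Quantize one weight row in consecutive groups, one scale per group."""
    q_max = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(abs(v) for v in group) / q_max or 1.0   # 1.0 for an all-zero group
        scales.append(scale)
        codes.extend(max(-q_max, min(q_max, round(v / scale))) for v in group)
    return codes, scales

def group_dequantize(codes, scales, group_size):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

print(effective_bits(4, 16, 128))   # 4.125

# Hypothetical row: the second group has a much larger range than the first.
row = [0.1, -0.2, 0.05, 0.3, 4.0, -2.0, 1.0, 3.0]
codes, scales = group_quantize(row, group_size=4)
restored = group_dequantize(codes, scales, group_size=4)
print(scales)
print(restored)
```

Because each group carries its own scale, the small values in the first group are not rounded on the coarse grid that the second group needs.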

Granularity	Scales per layer	Overhead	Accuracy
Per-Tensor	1	Minimal	Lowest
Per-Channel	C_out	Low	Good (weights)
Group (g=128)	C_out × C_in/128	+0.125 bits	Excellent
Per-Token (activations)	seq_len	Runtime compute	Better than per-tensor

The accuracy gap between per-tensor and per-channel for MobileNetV1: Per-tensor INT8 PTQ gave −11.8% accuracy (from 70.9% FP32 to 59.1%). Switching to per-channel quantization for weights brought that to essentially 0% loss. The entire accuracy collapse was caused by sharing one scale across all output channels with very different ranges.

You have a weight matrix W ∈ ℝ^{512 × 512}. If you use per-channel quantization for the output dimension, how many scale factors do you store?

1: per-tensor means one scale for the whole matrix.
512: one scale per output channel (row).
262,144: one scale per weight element.
1024: one scale per input channel and one per output channel.

Chapter 3: The Straight-Through Estimator

Post-training quantization is fast but has limits. For aggressive quantization (4-bit and below), or for small models with limited representational capacity, PTQ often can't recover the accuracy lost by discretizing weights and activations. The solution: Quantization-Aware Training (QAT) — simulate quantization during training so the model learns to work within its discrete constraints.

But there's an immediate problem. Quantization uses the round() function: Q(w) = S · round(w/S). The round() function has zero gradient almost everywhere (it's piecewise constant), and an undefined gradient at the jump points halfway between integers. Backpropagation through a wall of zeros learns nothing.

∂Q(w) / ∂w = 0 (almost everywhere)

g_w = ∂L / ∂w = (∂L / ∂Q(w)) · (∂Q(w) / ∂w) = 0

The gradient is zero, so the weight update is zero, so training stops. The model freezes in place.

The STE Solution: Pretend Round is Identity

Bengio et al. (2013) proposed the Straight-Through Estimator (STE): in the backward pass, replace ∂Q(w)/∂w with 1 (the identity). You pass the gradient straight through the quantizer as if it weren't there.

STE: ∂Q(w) / ∂w ≈ 1 when w lies inside the clipping range (for example |w| ≤ 1 for weights normalized to [−1, 1]), and 0 where w is clipped

g_w = ∂L / ∂w ← ∂L / ∂Q(w)

Is this mathematically correct? No. The true gradient is zero; the STE gradient is 1. It's a biased estimator — it introduces approximation error. But it works in practice. The intuition: the STE says "the gradient I would get if the network were continuous is a good enough proxy for updating my real weights."

Why the STE is acceptable: You're not trying to compute the exact gradient of a discrete function (that gradient doesn't exist in a useful sense). You're trying to find a signal that, when accumulated with stochastic gradient descent, moves the weights toward a lower loss. The STE provides that signal. The bias introduced is small relative to the optimization noise — SGD is already an approximation.
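A one-weight toy problem shows both halves of the argument: the real gradient of the staircase loss is zero, yet the STE signal still walks the weight to a grid point with lower loss. This is a minimal sketch; the grid step, learning rate, input, and target are arbitrary illustrative values.

```python
def quantize(w, scale=0.25):
    """Q(w) = S * round(w / S): snap w to a grid with step `scale`."""
    return scale * round(w / scale)

def loss(w, x=2.0, target=1.0):
    """Squared error of a one-weight layer that uses the quantized weight."""
    return (quantize(w) * x - target) ** 2

def true_grad(w, h=1e-4):
    """Central finite difference of the real (staircase) loss."""
    return (loss(w + h) - loss(w - h)) / (2 * h)

def ste_grad(w, x=2.0, target=1.0):
    """dL/dQ(w), passed straight through as if dQ/dw were 1."""
    return 2 * (quantize(w) * x - target) * x

w = 0.1                                   # latent full-precision weight
print("true gradient at w=0.1:", true_grad(w))
print("STE gradient at w=0.1: ", ste_grad(w))

for _ in range(50):                       # plain gradient descent on the latent weight
    w -= 0.01 * ste_grad(w)

print("latent w:", round(w, 4), "Q(w):", quantize(w), "loss:", loss(w))
```

The latent weight moves in small steps that individually change nothing in the forward pass, until it crosses a rounding boundary and the quantized weight jumps to the next level.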

Full Forward/Backward Picture

In QAT, you maintain a full-precision shadow copy of the weights W_fp32. In the forward pass, you compute Q(W) and run the layer with quantized weights. Gradients flow back through Q using the STE, updating W_fp32. The small gradient steps accumulate in full precision without rounding error. At inference, you export only the quantized weights.

Forward: W_fp32 → Q(W) → Layer → output

Fake-quant: discretize weights and activations, but run actual arithmetic in FP32

↓ backward

Backward: ∂L/∂W_fp32 ← ∂L/∂Q(W) via STE

Gradient passes through the quantizer; W_fp32 accumulates updates in full precision

↓ export

Inference: only Q(W) is used

Deploy the quantized weights; W_fp32 is discarded
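Why the shadow copy matters can be seen by applying the same stream of tiny updates to a full-precision value and to a value that is snapped back to the grid after every step. A minimal sketch with arbitrary illustrative numbers:

```python
scale = 0.1          # quantization step of the grid
update = -1e-4       # one tiny optimizer step (learning rate times gradient)

def snap(w):
    """Round w to the nearest grid point."""
    return scale * round(w / scale)

latent = 0.5         # full-precision shadow weight
on_grid = 0.5        # weight that is only ever stored in quantized form

for _ in range(1000):
    latent += update                    # small steps accumulate
    on_grid = snap(on_grid + update)    # each step is rounded away immediately

print("shadow weight after 1000 steps:", round(latent, 6), "-> quantized:", snap(latent))
print("grid-only weight after 1000 steps:", on_grid)
```

The shadow weight drifts by a full grid step and its quantized value changes, while the grid-only weight never leaves its starting point.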

STE Visualizer — the round() staircase forward + identity straight-through gradient

Blue staircase = Q(w) forward. Orange line = the STE gradient (∂L/∂w ← ∂L/∂Q(w)), which is identical to the incoming gradient, passed straight through. The true gradient of Q(w) w.r.t. w is zero everywhere except at the jump points between grid levels, where it is undefined.

During QAT backward pass, the gradient signal arrives at the quantizer: ∂L/∂Q(w) = −0.35. What does the STE pass downstream to ∂L/∂w?

0: the true derivative of round(w) is 0, so the gradient dies here.
0.35: the STE flips the sign of the gradient.
−0.35: the STE passes the gradient through unchanged (identity).
−1: the STE clips all gradients to ±1.

Chapter 4: Quantization-Aware Training

With the STE in hand, you can build a QAT training loop. The key concept is the fake-quantization node (also called simulated quantization): a differentiable approximation that behaves like a quantizer in the forward pass but allows gradient flow in the backward pass.

python
import torch
import torch.nn as nn

class FakeQuantize(torch.autograd.Function):
    """
    Forward: round to N-bit quantization grid (fake-quant).
    Backward: Straight-Through Estimator — pass gradient unchanged.
    """
    @staticmethod
    def forward(ctx, x, scale, zero_point, bits):
        q_min = -2 ** (bits - 1)
        q_max = 2 ** (bits - 1) - 1
        # Quantize: x → integer grid
        q = torch.clamp(torch.round(x / scale + zero_point), q_min, q_max)
        # Dequantize: back to float (still in forward, but now on grid)
        x_fake_quant = scale * (q - zero_point)
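        # Save a mask: the gradient passes only where x lies inside the
        # representable range [scale*(q_min - zp), scale*(q_max - zp)]
        lo = scale * (q_min - zero_point)
        hi = scale * (q_max - zero_point)
        ctx.save_for_backward(((x >= lo) & (x <= hi)).to(x.dtype))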
        return x_fake_quant

    @staticmethod
    def backward(ctx, grad_output):
        mask, = ctx.saved_tensors
        # STE: pass gradient through where not clipped, else 0
        grad_input = grad_output * mask
        return grad_input, None, None, None


class QATLinear(nn.Module):
    """Linear layer with fake-quant on weights and activations."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits
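        # Weight scale, initialized from the weight range. FakeQuantize.backward
        # returns None for `scale`, so it is a fixed buffer here; learning it
        # would need a scale gradient (as in LSQ).
        self.register_buffer(
            "w_scale",
            (self.linear.weight.detach().abs().max() / (2 ** (bits - 1) - 1)).reshape(1),
        )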

    def forward(self, x):
        # Fake-quantize weights
        w_fq = FakeQuantize.apply(self.linear.weight, self.w_scale, 0, self.bits)
        # Fake-quantize activations (scale from per-batch absmax; the clamp
        # guards against an all-zero batch)
        a_scale = (x.detach().abs().max() / (2 ** (self.bits - 1) - 1)).clamp(min=1e-8)
        x_fq = FakeQuantize.apply(x, a_scale, 0, self.bits)
        return nn.functional.linear(x_fq, w_fq, self.linear.bias)

BatchNorm Folding for QAT

At inference, BatchNorm layers are almost always folded into the preceding convolution/linear layer (the mean/variance subtraction and scale/shift become constant modifications to W and b). For QAT to be accurate, you must simulate this folding during training — otherwise the QAT sees weights that will be different at inference.
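The folding itself is a per-output-channel rescale of the weights plus a bias adjustment. The sketch below works on plain Python lists for a linear layer followed by BatchNorm in inference mode (running mean and variance fixed); the names and numbers are illustrative. In QAT, the fake-quant node would then be applied to the folded weights rather than the raw ones.

```python
import math

def fold_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Return (W_fold, b_fold) such that BN(Wx + b) == W_fold x + b_fold."""
    W_fold, b_fold = [], []
    for c, row in enumerate(W):                    # one output channel at a time
        k = gamma[c] / math.sqrt(var[c] + eps)     # per-channel BN multiplier
        W_fold.append([k * w for w in row])
        b_fold.append(k * (b[c] - mean[c]) + beta[c])
    return W_fold, b_fold

def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bc for row, bc in zip(W, b)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (v - m) / math.sqrt(s + eps) + be
            for v, g, be, m, s in zip(y, gamma, beta, mean, var)]

# Hypothetical 2-in, 2-out layer and BN statistics.
W = [[1.0, -2.0], [0.5, 0.25]]
b = [0.1, -0.3]
gamma, beta = [1.5, 0.8], [0.2, -0.1]
mean, var = [0.4, -0.2], [2.0, 0.5]
x = [3.0, -1.0]

reference = batchnorm(linear(W, b, x), gamma, beta, mean, var)
W_fold, b_fold = fold_batchnorm(W, b, gamma, beta, mean, var)
folded = linear(W_fold, b_fold, x)
print(reference)
print(folded)
```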

QAT protocol: (1) Start from a pretrained FP32 model; fine-tuning converges in 10–100× fewer steps than training from scratch. (2) Fold BN into the preceding layer. (3) Insert fake-quant nodes on the weights and output activations of every quantizable layer. (4) Train for 5–10% of the original training budget. (5) Export by running real quantization on the shadow FP32 weights.

Model	FP32 Acc.	PTQ INT8 (per-tensor)	QAT INT8
MobileNetV1	70.9%	59.1% (−11.8%)	70.7% (−0.2%)
MobileNetV2	71.9%	69.8% (−2.1%)	71.1% (−0.8%)
NASNet-Mobile	74.9%	72.2% (−2.7%)	73.0% (−1.9%)

Why does QAT maintain a full-precision (FP32) shadow copy of the weights during training, rather than directly training the quantized integer weights?

For compatibility: PyTorch optimizers can't work with integer tensors.
The gradient updates are tiny floats (e.g., 0.00001). Accumulating them directly in INT8 would round every update to zero, so no learning happens. FP32 accumulation preserves the signal.
To allow the model to use FP32 for important layers and INT8 for unimportant ones.
Because fake-quant nodes don't actually produce integer values; they produce FP32 values that look like integers.

Chapter 5: PTQ vs QAT: When to Use What

PTQ and QAT are not competing — they're tools for different situations. The decision tree comes down to three factors: bitwidth, model size, and whether you have access to training data and compute.

When PTQ is sufficient

For INT8 quantization on large models (ResNets, BERT, large Transformers), PTQ with good calibration (percentile or KL) typically achieves less than 0.5% accuracy drop. The model has enough representational capacity to absorb the quantization error. PTQ takes minutes; QAT takes days.

When QAT is necessary

For INT4 and below, or for small efficient models (MobileNet, EfficientNet-lite), PTQ degrades unacceptably. MobileNetV1 at INT8 per-tensor loses 11.8% accuracy with PTQ — that's the difference between a useful model and a useless one. QAT recovers it to 0.2% loss. The training cost is justified.

Misconception: "If PTQ fails, just use more calibration data." More calibration helps up to a point, but PTQ is fundamentally limited by the fact that it never adjusts the weights. The model is solving for quantization parameters that fit a fixed distribution of weights. QAT actually changes the weights — it moves them to positions where the quantization grid can represent them accurately.

The Bitwidth Decision

A practical heuristic: use PTQ for INT8, consider QAT for INT4 and below. The transition point moves depending on model size — larger models can absorb INT4 PTQ (see GPTQ, AWQ for LLMs). Smaller models need QAT or at minimum per-channel calibration and careful range tuning.

PTQ vs QAT Accuracy vs Bitwidth — drag the bits slider to see the gap open at lower precision

Schematic accuracy curves for MobileNetV2-style small model. QAT (teal) holds accuracy much better than PTQ (orange) as bitwidth decreases below 8 bits.

Scenario	Recommended Method	Reasoning
Large model, INT8	PTQ (percentile or KL)	Accuracy loss <0.5%; minutes of work
Small model, INT8	PTQ per-channel or QAT	Per-tensor PTQ may collapse; per-channel often sufficient
Any model, INT4	QAT preferred; try GPTQ/AWQ for LLMs	PTQ INT4 loses significant accuracy without training signal
LLM, W8A8	SmoothQuant + PTQ	Outlier migration makes activations quantizable without retraining
LLM, W4A16	GPTQ or AWQ	Second-order weight correction or salient-channel scaling without full QAT

A startup needs to deploy MobileNetV2 on a microcontroller with INT4 weight quantization. They have the original training dataset and GPU compute. Which method should they use?

PTQ with min/max calibration: fastest to implement.
PTQ with KL-divergence calibration: same speed but better range selection.
QAT: INT4 on a small model will collapse with PTQ; they have the training resources to use QAT and recover accuracy.
SmoothQuant: it handles outliers and works without retraining.

Chapter 6: LLM Outliers & SmoothQuant

Large language models beyond 6.7B parameters develop a peculiar quantization pathology. Certain hidden-state channels consistently activate with values 10–100× larger than their neighbors. The phenomenon is systematic: the same channel indices spike across every token, every sequence, every context. It's a structural feature of how large Transformers represent information — not a distribution artifact you can calibrate away.

A linear layer inside a Transformer computes Y = XW. In standard INT8, you'd quantize both X (activations) and W (weights) independently, then multiply the integers. But X has channels with max|X_j| ≈ 70 while others have max|X_k| ≈ 0.5. A single per-tensor scale of 70/127 ≈ 0.55 means channel k's values round to 0 or ±1, so almost all of their information is lost. The matrix multiply is destroyed.

Why W8A8 failed on LLMs before SmoothQuant: Standard CNN quantization treats all channels equally. LLM activations have a structural outlier channel that dominates the scale, zero-ing out the 99% of channels with normal ranges. You can't fix this with better calibration — the outlier is a real value, not noise. You need to change the problem.

SmoothQuant: Migrate the Difficulty

The key insight from Xiao et al. (2022): weights are easy to quantize, activations are hard to quantize, but we can balance this difficulty by migrating it from activations to weights. Weights can absorb more difficulty because they're quantized per-channel. Activations get at most one scale per token, so all channels within a token share it.

The mathematical trick: introduce a per-channel scale vector s, applied as the diagonal matrix diag(s). Because Y = XW, you can insert diag(s)⁻¹ and diag(s) without changing the result:

Y = XW = (X diag(s)⁻¹) · (diag(s) W) = X̂ · Ŵ

Now X̂_j = X_j / s_j (divide each activation channel by s_j) and Ŵ_{j,:} = s_j · W_{j,:} (multiply the corresponding weight row by s_j). The scale migrates magnitude from activations to weights.

Deriving the Optimal Scale

How big should s_j be? Too large — weights become hard to quantize. Too small — activations stay hard. Song Han's group derived a formula with a migration strength parameter α ∈ [0, 1]:

s_j = max(|X_j|)^α / max(|W_j|)^(1−α)

Worked numbers. Say channel j has max|X_j| = 60 and max|W_j| = 2. With α = 0.5:

s_j = 60^0.5 / 2^0.5 = 7.746 / 1.414 = 5.48

After smoothing: max|X̂_j| = 60 / 5.48 ≈ 10.95 and max|Ŵ_j| = 2 × 5.48 ≈ 10.95. The outlier activation dropped from 60 to about 10.95 (a 5.5× reduction). The weight grew from 2 to about 10.95, which is still manageable with per-channel quantization. With α = 0.5 the two ranges are equal (both are √(60 · 2) = √120), so per-token activation quantization and per-channel weight quantization are both effective.
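The worked numbers, and the effect of moving α away from 0.5, can be reproduced with a few lines of plain Python. This is a sketch of the scale formula only; the three-channel vectors used for the equivalence check are made up.

```python
def smooth_scale(act_max, w_max, alpha=0.5):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    return act_max ** alpha / w_max ** (1 - alpha)

act_max, w_max = 60.0, 2.0
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    s = smooth_scale(act_max, w_max, alpha)
    print(f"alpha={alpha:.2f}  s={s:7.3f}  "
          f"smoothed act max={act_max / s:8.3f}  smoothed weight max={w_max * s:8.3f}")

# Equivalence check on one token x and one weight column w (hypothetical values):
# (x / s) . (s * w) equals x . w up to floating-point rounding.
x = [60.0, 0.5, -0.3]
w = [0.2, -1.5, 2.0]
s_vec = [smooth_scale(abs(xi), abs(wi)) for xi, wi in zip(x, w)]
y = sum(xi * wi for xi, wi in zip(x, w))
y_smooth = sum((xi / si) * (si * wi) for xi, wi, si in zip(x, w, s_vec))
print("y:", y, "y after smoothing:", y_smooth)
```

At α = 0.5 both smoothed ranges land on √120 ≈ 10.95; pushing α toward 1 flattens the activation further at the cost of a larger weight range.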

The α tradeoff. α = 0: all difficulty stays in activations (original problem). α = 1: all difficulty moves to weights (weights become huge, activations perfect). Sweet spot α ≈ 0.5 balances both, making the joint quantization error minimal. The optimal α can be tuned per-layer on calibration data.

python
import torch

def compute_smoothquant_scale(X_calibration, W, alpha=0.5):
    """
    X_calibration: collected activation tensor [n_tokens, d_in]
    W: weight matrix [d_out, d_in]
    Returns: per-channel scale s of shape [d_in]
    """
    # Per-channel max of activations (over all tokens)
    act_max = X_calibration.abs().max(dim=0).values.clamp(min=1e-5)  # [d_in]
    # Per-input-channel max of weights (max over the d_out axis)
    w_max = W.abs().max(dim=0).values.clamp(min=1e-5)                # [d_in]
    # Balancing scale; the clamps above keep s finite and nonzero
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)
    return s

def apply_smooth_quant(X, W, s):
    """Fold the scale: smooth activations, absorb into weights."""
    X_smooth = X / s.unsqueeze(0)   # divide each activation channel j by s_j
    W_smooth = W * s.unsqueeze(0)   # multiply input channel j of W ([d_out, d_in], so column j) by s_j
    # Now quantize X_smooth and W_smooth independently with INT8
    return X_smooth, W_smooth

SmoothQuant divides activation channel j by s_j = 5.48 and multiplies weight row j by 5.48. Why doesn't this change the model's output Y?

It does change Y, and that's the point. SmoothQuant accepts a small output change to enable quantization.
Because s_j = 1 when computed from real activations, so the scale has no effect.
Because (X/s) · (s·W) = XW. The s⁻¹ and s cancel in the matrix product, so the result is mathematically equivalent to the original.
Because INT8 arithmetic is symmetric, so scaling doesn't affect the final result after dequantization.

Chapter 7: GPTQ & AWQ

SmoothQuant enables W8A8 (both weights and activations quantized to INT8) — excellent for batch serving on data-center GPUs. But single-query LLM inference on a local machine has a different bottleneck: memory bandwidth. At batch size 1, the GPU compute is idle while it waits for 65B parameters to transfer from HBM to registers. W8A8 halves the memory but still needs a fast GPU. For edge devices, you need W4A16: 4-bit weights, 16-bit activations — only weights are compressed, and they're decompressed on-the-fly to FP16 for compute.
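The memory-bandwidth argument is simple arithmetic. The sketch below computes weight storage for a 65B-parameter model at several precisions and the time needed to stream all weights once, which is a lower bound on per-token latency at batch size 1. The bandwidth figure is a hypothetical round number chosen for illustration, and the group size of 128 with 16-bit scales is likewise an assumption.

```python
def weight_gigabytes(n_params, weight_bits, group_size=None, scale_bits=16):
    """Weight storage in GB (1e9 bytes), including per-group scale overhead."""
    bits_per_weight = weight_bits
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 65e9
BANDWIDTH_GB_S = 1000.0   # hypothetical memory bandwidth, for illustration only

for name, bits, group in [("FP16", 16, None), ("INT8", 8, None), ("INT4 g=128", 4, 128)]:
    gb = weight_gigabytes(N_PARAMS, bits, group)
    ms = gb / BANDWIDTH_GB_S * 1000       # time to read every weight once
    print(f"{name:11s} {gb:7.1f} GB   >= {ms:6.1f} ms per token at batch size 1")
```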

GPTQ: Second-Order Weight Quantization

GPTQ (Frantar et al., 2022) is a layer-wise PTQ method that finds the best 4-bit weights by minimizing the output error: for each layer, find the quantized Ŵ that minimizes ||WX − ŴX||_F². It uses second-order information, the Hessian of this layer-wise reconstruction error (H = 2XXᵀ, computed from calibration inputs), to make smarter rounding decisions than round-to-nearest.

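Key insight: not all weight rounding errors are equal. A rounding error along a direction of high curvature contributes more to the output error than one along a direction of low curvature. After rounding each weight, GPTQ uses the inverse Hessian to adjust the remaining unquantized weights so that they absorb the error. Its predecessor, Optimal Brain Quantization (OBQ), greedily picked the next weight to quantize; GPTQ found that on large models a fixed column order works almost as well, which lets it process all rows of a layer together and scale to billions of parameters.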

GPTQ in practice: Quantizes 175B-parameter models (OPT-175B, BLOOM-176B) to 4-bit in about 4 GPU-hours. The quantization MSE is 2–4× lower than round-to-nearest. On WikiText perplexity, GPTQ 4-bit matches or nearly matches FP16 on models >13B; the larger the model, the more it can absorb quantization error.
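The compensation step can be illustrated on a single weight row. The sketch below is closer to OBQ run in a fixed left-to-right order than to the real GPTQ implementation (which works on blocks of columns, uses a Cholesky factorization of the inverse Hessian, and handles all rows at once); the 2×2 Hessian, the weights, and the integer grid are made up, and H is taken as XXᵀ since the factor of 2 does not affect the result.

```python
def invert(A):
    """Gauss-Jordan inverse of a small symmetric positive-definite matrix."""
    n = len(A)
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def output_error(w, w_hat, H):
    """(w - w_hat)^T H (w - w_hat): squared output error when H = X X^T."""
    d = [a - b for a, b in zip(w, w_hat)]
    return sum(d[i] * H[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))

def quantize_with_compensation(w, H, step=1.0):
    """Round weights one at a time; push each rounding error onto later weights."""
    w = list(w)
    Hinv = invert(H)
    n = len(w)
    q = [0.0] * n
    for i in range(n):                               # fixed left-to-right order
        q[i] = step * round(w[i] / step)
        err = (w[i] - q[i]) / Hinv[i][i]
        for j in range(i + 1, n):                    # adjust the unquantized weights
            w[j] -= err * Hinv[i][j]
        # Remove weight i from the inverse Hessian before moving on.
        d = Hinv[i][i]
        col = [Hinv[r][i] for r in range(n)]
        row = list(Hinv[i])
        for r in range(n):
            for c in range(n):
                Hinv[r][c] -= col[r] * row[c] / d
    return q

H = [[1.0, 0.9], [0.9, 1.0]]      # two strongly correlated input channels
w = [0.4, 0.3]

rtn = [float(round(v)) for v in w]
compensated = quantize_with_compensation(w, H)
print("round-to-nearest:", rtn, "error:", output_error(w, rtn, H))
print("with compensation:", compensated, "error:", output_error(w, compensated, H))
```

Rounding the second weight "the wrong way" looks worse weight-by-weight, but because the two inputs are correlated it cancels part of the first weight's error in the layer output.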

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) starts from a striking observation: 1% of weight channels are disproportionately important for model quality. Keeping just those 1% in FP16 while quantizing the remaining 99% to INT4 dramatically reduces perplexity. But which 1% to keep?

The answer: look at activations, not weights. Weight channels paired with activation channels that have large magnitudes are the most important — a large activation multiplied by a weight error produces a large output error. These salient channels are identified by running calibration data and measuring the average activation magnitude per input channel.

Instead of keeping salient channels in FP16 (which breaks hardware uniformity), AWQ uses the same scale trick as SmoothQuant, but just for the salient channels:

Q(W · s) · (s⁻¹ · X) = Q(Ŵ) · X̂

Scaling up W_salient by s before quantization shrinks the rounding error of those weights relative to their size. The inverse scale is applied to the corresponding activation channels, so the product is unchanged before quantization.

Why does scaling up a salient weight channel reduce its quantization error? The rounding error of each weight is roughly uniform within ±Δ/2, where Δ = max(|w|)/2^(N−1) is the step size of its quantization group. Scaling one salient channel by s > 1 usually leaves the group maximum, and therefore the step size, unchanged (Δ' ≈ Δ), because the group contains many other channels. The scaled weight w·s still picks up an absolute rounding error of about the same size, but the activation is divided by s, so the output error becomes Δ'·RoundErr·x/s ≈ (Δ·RoundErr·x)/s: smaller by a factor of about s. If s is pushed too high, the group maximum grows, Δ' increases, and the non-salient channels lose precision, which is why AWQ searches for s on calibration data.
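A tiny numeric example makes the scaling effect visible. The sketch below quantizes one group of four weights to INT4 with a shared scale, with and without scaling the salient channel by s = 2. All numbers are made up, and the step size uses q_max = 2^(N−1) − 1 as a simplification. The 1/s reduction holds on average; for an individual weight the rounding error can move either way.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest with one scale shared by the whole group."""
    q_max = 2 ** (bits - 1) - 1
    delta = max(abs(w) for w in weights) / q_max
    return [delta * max(-q_max, min(q_max, round(w / delta))) for w in weights]

weights = [0.70, 0.24, -0.41, 0.12]   # one output row, one quantization group
acts = [1.0, 20.0, 1.0, 1.0]          # channel 1 carries a large activation
salient, s = 1, 2.0

# Plain INT4: the salient weight is rounded on the shared grid.
plain = quantize_group(weights)
err_plain = abs(plain[salient] - weights[salient]) * acts[salient]

# AWQ-style: scale the salient weight up by s, quantize, and account for the
# inverse scale that is applied to the activation.
scaled = list(weights)
scaled[salient] *= s
dequant = quantize_group(scaled)
effective = dequant[salient] / s
err_scaled = abs(effective - weights[salient]) * acts[salient]

print("output error on salient channel, plain: ", round(err_plain, 4))
print("output error on salient channel, scaled:", round(err_scaled, 4))
```

Here the scaled weight (0.48) stays below the group maximum (0.70), so the step size is unchanged and only the salient channel's effective resolution improves.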

Cross-Layer Equalization (CLE) for PTQ

Before SmoothQuant, Nagel et al. (ICCV 2019) proposed Cross-Layer Equalization (CLE) for CNN PTQ. The idea: consecutive layers separated by a positive-homogeneous activation function (like ReLU) can be rescaled without changing the network output. ReLU is positive-homogeneous: ReLU(sx) = s·ReLU(x) for s > 0.

For two consecutive linear layers W⁽¹⁾ and W⁽²⁾ with a ReLU between them, introduce a per-channel scale s_c:

Ŵ⁽¹⁾[c] = W⁽¹⁾[c] / s_c (rescale output channels of layer 1)

Ŵ⁽²⁾[:,c] = W⁽²⁾[:,c] × s_c (rescale input channels of layer 2)

The per-channel scale that equalizes the quantization ranges is s_c = √(max|W⁽¹⁾[c]| / max|W⁽²⁾[:,c]|). After rescaling, both ranges equal the geometric mean of the two original ranges. If layer 1 has a bias, its entry for channel c is divided by s_c as well. CLE is data-free (no calibration data needed) and can precede any other PTQ method.
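The rescaling and its two properties (unchanged output, equalized ranges) can be checked on a toy two-layer network. This is a minimal sketch without biases; the matrices and input are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def equalize(W1, W2):
    """Divide output channel c of W1 by s_c and multiply input channel c of W2 by s_c."""
    W1_eq = [list(r) for r in W1]
    W2_eq = [list(r) for r in W2]
    for c in range(len(W1)):
        r1 = max(abs(w) for w in W1[c])          # range of row c in layer 1
        r2 = max(abs(row[c]) for row in W2)      # range of column c in layer 2
        s = math.sqrt(r1 / r2)
        W1_eq[c] = [w / s for w in W1[c]]
        for row in W2_eq:
            row[c] *= s
    return W1_eq, W2_eq

# Hypothetical layers: W1 is [hidden=2, in=3], W2 is [out=2, hidden=2].
W1 = [[8.0, -4.0, 2.0], [0.1, 0.05, -0.2]]
W2 = [[0.5, 4.0], [-0.25, 2.0]]
x = [1.0, -2.0, 0.5]

W1_eq, W2_eq = equalize(W1, W2)
y_ref = matvec(W2, relu(matvec(W1, x)))
y_eq = matvec(W2_eq, relu(matvec(W1_eq, x)))

print("output before:", y_ref)
print("output after: ", y_eq)
for c in range(2):
    print(f"channel {c}: layer-1 range {max(abs(w) for w in W1_eq[c]):.4f}, "
          f"layer-2 range {max(abs(row[c]) for row in W2_eq):.4f}")
```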

PTQ Calibration Loop in Practice

python
import torch
from collections import defaultdict

def run_ptq_calibration(model, calibration_loader, bits=8):
    """Collect activation statistics and compute per-layer quant params."""
    model.eval()
    act_stats = defaultdict(lambda: {'min': float('inf'), 'max': float('-inf'), 'all': []})
    hooks = []

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            def make_hook(n):
                def hook(m, inp, out):
                    x = out.detach()
                    act_stats[n]['min'] = min(act_stats[n]['min'], x.min().item())
                    act_stats[n]['max'] = max(act_stats[n]['max'], x.max().item())
                    # Slice before converting so only 512 values are copied to a Python list
                    act_stats[n]['all'].extend(x.abs().flatten()[:512].tolist())  # cheap sample
                return hook
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        for i, (x, _) in enumerate(calibration_loader):
            if i >= 32: break  # 32 batches is typically sufficient
            model(x)

    for h in hooks: h.remove()

    quant_params = {}
    q_max = (1 << (bits - 1)) - 1
    for name, stats in act_stats.items():
        # Percentile calibration (99.9th percentile)
        vals = sorted(stats['all'])
        thresh = vals[int(len(vals) * 0.999)]
        scale = thresh / q_max
        quant_params[name] = {'scale': scale, 'zero_point': 0}

    return quant_params

Method	Type	Quantization	Best for	Requires retraining?
SmoothQuant	W8A8	Weights + activations INT8	Batch serving, datacenter	No (calibration only)
GPTQ	W4A16	Weights INT4, activations FP16	Large models, edge	No (second-order opt)
AWQ	W4A16	Weights INT4 w/ salient scaling	Local inference, TinyChat	No (calibration + scale search)
QAT	W4A8 or W4A4	Both quantized, training-aware	Small models, aggressive quant	Yes
CLE	W8A8	Per-channel weight rescaling	Data-free CNN PTQ baseline	No (data-free)

AWQ determines which 1% of weight channels to protect by looking at the magnitude of the corresponding activation channels, not the weight magnitudes themselves. Why?

Because weights are already quantized, so their magnitude can't be measured accurately.
Because AWQ only quantizes activations, not weights.
Because output error = weight_error × activation. A weight rounding error on a channel with large activation causes large output error. Weight magnitude alone doesn't tell you this; you need the activation magnitude to gauge the impact.
Because activation magnitudes are larger than weight magnitudes, making them easier to measure.

Chapter 8: Showcase: SmoothQuant & AWQ Explorer

This is the payoff chapter. Three linked canvases let you see the full quantization difficulty story, the SmoothQuant migration in action, and the AWQ salient-channel protection — all with live controls.

Canvas A — Calibration Method Comparison (PTQ on a vector with outlier)

An activation histogram with a configurable outlier. The three colored dashed lines show where each calibration method sets its clipping threshold. Lower total MSE is better.

Canvas B — SmoothQuant Migration (drag α to shift difficulty between activations and weights)

Two bars: activation max (orange) and weight max (teal) before and after SmoothQuant. The red zone marks "hard to quantize" (above the INT8-friendly threshold). Drag α to find the sweet spot where both bars are in the green zone.

Canvas C — AWQ Salient Channel Protection (activation magnitude → which channels to protect)

16 weight channels, colored by their corresponding activation magnitude (bright = high activation = salient). Toggle "Protect top 1%" to scale up salient channels before INT4 quantization. Watch the output error (red bar) drop dramatically.

Summary: what you just saw. Canvas A: any calibration method is better than min/max in the presence of outliers; KL-divergence minimizes information loss. Canvas B: SmoothQuant's α = 0.5 sweet spot typically balances activation and weight quantization difficulty. Canvas C: protecting the 2 most salient weight channels (by activation magnitude) shrinks the INT4 quantization error by 30–60%.

Chapter 9: Connections & Cheat Sheet

Complete Decision Guide: PTQ vs QAT vs LLM Methods

Situation	Method	Key Paper
Large model (ResNet, BERT+), INT8	PTQ + KL-div calibration	TensorRT (Migacz 2017)
Small model (MobileNet), INT8	PTQ per-channel or QAT	Krishnamoorthi 2018
Any model, INT4	QAT	Jacob et al. CVPR 2018
LLM >6.7B, W8A8 (batch serving)	SmoothQuant + PTQ	Xiao et al. 2022
LLM, W4A16 (local/edge)	GPTQ or AWQ	Frantar 2022; Lin 2023
LLM, fine-tuning needed	QLoRA (LoRA adapters on a frozen 4-bit base model)	Dettmers 2023

Calibration Methods Cheat Sheet

Method	Threshold	Strength	Weakness
Min/Max	abs max observed	No clipping	Outliers ruin scale
Percentile (p=99.9%)	p-th percentile	Robust to outliers	Must choose p
KL-Divergence	argmin_T KL(FP32 || INT8)	Minimizes information loss	Computationally expensive
MSE (OCTAV)	argmin_T MSE	Optimal for Laplace dist.	Assumes distribution shape

SmoothQuant One-Liner

Multiply each activation channel by s_j⁻¹ and each weight row by s_j, where s_j = max(|X_j|)^α / max(|W_j|)^(1−α). The scaling is mathematically equivalent (Y = XW = (X/s)·(sW)) and is absorbed offline into weights and the preceding LayerNorm.

GPTQ One-Liner

Layer-wise second-order PTQ: quantize weights one column at a time in a fixed order, then use the inverse Hessian to adjust the remaining unquantized weights so they absorb the just-introduced error. Enables INT4 on 175B-parameter GPT-style models in ~4 GPU-hours.

AWQ One-Liner

Identify the 1% of input channels with the largest activation magnitudes. Scale those weight channels up by s before INT4 quantization (and scale the activations down correspondingly). The error on salient channels drops ∝ 1/s.

W8A8 vs W4A16 Guide

Format	Weights	Activations	Best batch size	Speedup mechanism
FP16	FP16	FP16	Any	Baseline
W8A8	INT8	INT8	≥32 (compute-bound)	INT8 GEMM throughput
W4A16	INT4	FP16	1–4 (memory-bound)	Reduced weight transfer

The big picture: Quantization techniques form a layered hierarchy. L5 laid the mathematical foundation (affine mapping, integer arithmetic). This lesson built the deployment stack: calibrate activations with percentile/KL, use QAT when PTQ fails, apply SmoothQuant to tame LLM outliers, and use GPTQ/AWQ for 4-bit weight-only compression. Next up: Neural Architecture Search (L7) — finding architectures that are inherently efficient, not just compressed post-hoc.

Related Lessons

TinyML L5: Quantization I — affine quantization, K-means, scale/zero-point, quantized matmul. The foundation for this lesson.
CS229S: Sparsity & Quantization — broader quantization ecosystem including mixed-precision and hardware perspectives.
TinyML L7: Neural Architecture Search I — building architectures that need less compression by design.
CS224N L9: PEFT & LoRA — QLoRA combines 4-bit quantization with LoRA fine-tuning for efficient LLM adaptation.

"What I cannot create, I do not understand." — Feynman. You've now built every piece of the quantization stack from scratch — the calibration loop, the fake-quant node, the STE backward pass, the SmoothQuant scale formula. You understand quantization.