You've quantized the weights. Now activations show up at inference and blow your INT8 range because one channel hits 70 while the rest hover around 1. This lesson shows how to calibrate activation ranges, train through fake-quant with the Straight-Through Estimator, and tame the outlier problem that makes naïve INT8 destroy LLM accuracy.
Lecture 5 gave you the tools to quantize weights offline. You compute the scale S and zero-point Z from the weight tensor, quantize once, and you're done. Weights are static — they don't change at inference time.
Activations are different. They're the intermediate values flowing through the network — the output of every ReLU, the input to every Linear layer. They depend entirely on the input you feed the model. A picture of a cat produces different activations than a picture of a sky. So you can't precompute their range offline the same way you did for weights.
Worse: at scale, activations develop systematic outliers. In large language models beyond 6.7B parameters, Dettmers et al. (LLM.int8(), 2022) found that specific hidden channels consistently spike to values 10–100× larger than their neighbors, on essentially every inference, regardless of input. One channel in a 4096-dimensional hidden state might hit ±70 while the other 4095 channels sit between ±1.
This lecture answers: how do you choose the quantization range for activations (calibration), how do you train the model to be quantization-friendly (QAT), and how do you handle LLMs where outliers are structural, not incidental?
A 16-channel activation vector: 15 channels drawn from N(0,1), one outlier channel. Toggle the outlier and watch the quantization scale blow up — and how the error on the normal channels explodes.
If you can't precompute activation ranges analytically, you compute them empirically. Calibration means running a small representative dataset through your trained FP32 model, recording the activation distributions at every layer, and deriving the best S and Z from those statistics before deployment. No retraining required.
The simplest approach, min/max calibration: collect min and max across all calibration samples, and use those as r_min and r_max. This guarantees no clipping: every observed value fits in the INT8 range. The problem is that a single outlier in your calibration set can force S to accommodate it, wasting resolution on everyone else.
Instead of using the absolute min and max, use the 99.9th or 99.99th percentile. You're deliberately clipping a tiny fraction of extreme values, accepting a small clipping error in exchange for much better resolution across the bulk of the distribution.
Worked example. Activation vector: [0.1, 0.3, −0.5, 0.8, 0.2, −0.9, 0.4, 0.6, −0.2, 52.0]
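To make the trade-off concrete, here is a minimal sketch that runs both calibrations on the vector above. It assumes symmetric quantization (Z = 0) and, because there are only ten samples, uses the 90th percentile as a stand-in for the 99.9th; the helper names are illustrative, not from any library.

```python
def symmetric_scale(threshold, bits=8):
    """Step size for symmetric quantization (Z = 0) clipped at +/- threshold."""
    return threshold / (2 ** (bits - 1) - 1)


def fake_quant(x, scale, bits=8):
    """Quantize to the integer grid, clamp, and map back to float."""
    q_max = 2 ** (bits - 1) - 1
    q = max(-q_max, min(q_max, round(x / scale)))
    return q * scale


def percentile_abs(values, pct):
    """Nearest-rank percentile of |values|; pct is an integer in 1..100."""
    mags = sorted(abs(v) for v in values)
    rank = -(-len(mags) * pct // 100)  # ceil(n * pct / 100) in integer math
    return mags[rank - 1]


def mean_abs_error(values, scale):
    return sum(abs(fake_quant(v, scale) - v) for v in values) / len(values)


acts = [0.1, 0.3, -0.5, 0.8, 0.2, -0.9, 0.4, 0.6, -0.2, 52.0]
bulk = acts[:-1]  # everything except the outlier

minmax_scale = symmetric_scale(max(abs(a) for a in acts))  # 52 / 127
clip_scale = symmetric_scale(percentile_abs(acts, 90))     # 0.9 / 127

bulk_err_minmax = mean_abs_error(bulk, minmax_scale)
bulk_err_clip = mean_abs_error(bulk, clip_scale)
outlier_after_clip = fake_quant(52.0, clip_scale)  # saturates at the threshold

print(f"min/max scale    = {minmax_scale:.4f}, bulk error = {bulk_err_minmax:.4f}")
print(f"percentile scale = {clip_scale:.4f}, bulk error = {bulk_err_clip:.4f}")
print(f"outlier 52.0 is clipped to {outlier_after_clip:.2f}")
```

With the min/max scale, the nine normal values share only a handful of integer levels. With the clipped scale they get the full grid, at the price of saturating the single outlier.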
KL-divergence (entropy) calibration was developed by NVIDIA for TensorRT. Instead of minimizing MSE or avoiding clipping, you minimize the information loss from quantization. The idea: the FP32 activation distribution is your "truth." The INT8 representation is an approximation. KL divergence measures how much information the approximation loses.
You sweep a threshold T from a small value up to the absolute max. For each T, you saturate all values above T, quantize to INT8, dequantize, and measure KL(FP32 hist || INT8 hist). The T that minimizes KL divergence is your winner. This typically clips a few percent of the distribution but dramatically reduces entropy loss.
MSE calibration directly minimizes E[(x − Q(x))²] over the clipping threshold. Under a Laplace(0, b) distribution, the optimal threshold for N-bit quantization has closed-form solutions: |r|_max = 2.83b (2-bit), 3.89b (3-bit), 5.03b (4-bit). The Laplace parameter b is estimated from your calibration set.
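A short sketch of how the closed-form thresholds could be used in practice. It assumes zero-centered activations and estimates b as the mean absolute value (the maximum-likelihood estimate for Laplace(0, b)); the multipliers are the ones quoted above, and the function names are hypothetical.

```python
import random

# Clipping multipliers quoted in the text for Laplace(0, b) inputs
LAPLACE_CLIP_MULTIPLIER = {2: 2.83, 3: 3.89, 4: 5.03}


def estimate_laplace_b(samples):
    """MLE of the Laplace scale b for zero-centered data: mean of |x|."""
    return sum(abs(x) for x in samples) / len(samples)


def mse_clip_threshold(samples, bits):
    """Clipping threshold |r|_max = multiplier(bits) * b."""
    return LAPLACE_CLIP_MULTIPLIER[bits] * estimate_laplace_b(samples)


# Synthetic calibration set: Laplace(0, 0.5) built from a signed exponential
random.seed(0)
true_b = 0.5
calibration = [
    random.choice((-1.0, 1.0)) * random.expovariate(1.0 / true_b)
    for _ in range(20000)
]

b_hat = estimate_laplace_b(calibration)
threshold_4bit = mse_clip_threshold(calibration, bits=4)
print(f"estimated b = {b_hat:.3f}, 4-bit clip threshold = {threshold_4bit:.3f}")
```

If your activations are not close to Laplace-shaped, the multipliers no longer apply and a direct sweep over thresholds is the safer choice.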
A 64-value activation distribution (mostly Gaussian + one outlier at +40). Toggle calibration method and see the threshold, scale, and total quantization MSE.
The outlier problem has a structural dimension: outliers are not random. In LLMs, they cluster in specific channels — the same input channel index has an outlier every single token, while its neighbors are well-behaved. This motivates quantization granularity: how much of the tensor shares a single (S, Z) pair?
Per-tensor quantization uses one S and one Z for the entire weight matrix. It works well for large models with well-behaved distributions. It fails badly when different output channels have very different dynamic ranges, a common occurrence in early layers of MobileNet-style architectures.
Per-channel quantization uses one S and Z per output channel. If the weight matrix is W ∈ ℝ^(C_out × C_in), you have C_out scale factors. Each channel's scale is set by its own range, so a channel with large weights doesn't tax channels with small weights.
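A tiny numeric sketch of the difference, assuming symmetric INT8 with absmax scales. The 2×4 weight matrix is made up for illustration: one output channel has small weights, the other large ones.

```python
def absmax_scale(values, bits=8):
    """Symmetric scale so that the largest magnitude maps to q_max."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


def max_quant_error(row, scale):
    """Worst-case round-trip error for one row at the given scale."""
    return max(abs(scale * round(w / scale) - w) for w in row)


W = [
    [0.02, -0.01, 0.015, -0.005],  # output channel 0: small dynamic range
    [1.5, -2.0, 0.7, -1.2],        # output channel 1: large dynamic range
]

per_tensor_scale = absmax_scale([w for row in W for w in row])
per_channel_scales = [absmax_scale(row) for row in W]

err_tensor = max_quant_error(W[0], per_tensor_scale)
err_channel = max_quant_error(W[0], per_channel_scales[0])
print(f"channel 0 worst error, per-tensor : {err_tensor:.5f}")
print(f"channel 0 worst error, per-channel: {err_channel:.5f}")
```

The large channel sets the per-tensor scale, so the small channel is rounded with a step nearly as big as its own values. With its own scale, its error shrinks by roughly the ratio of the two ranges.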
Why per-channel is standard for weights but not activations. Weight channels are fixed at export time: the C_out scales are constants stored alongside the model, and each one belongs to a single output column, so it can be applied after the integer accumulation. Activation values change at every token, and their channels run along the GEMM's inner (reduction) dimension, so a per-channel activation scale would differ for every term of the dot product and cannot be factored out of the INT8 accumulation. So weights get per-channel, activations get per-tensor (or per-token in some schemes).
Group quantization is a middle ground. Instead of one scale per channel, you split each channel into groups of g elements and give each group its own scale. For a weight at INT4 with group size g=128, the effective bitwidth is 4 + 16/128 = 4.125 bits (the 16-bit scale overhead amortized over 128 values). Group-wise scales are a standard ingredient of modern LLM weight-quantization schemes such as GPTQ and AWQ.
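The bookkeeping above can be sketched in a few lines. This assumes symmetric absmax scales stored in 16 bits and a row length divisible by the group size; the function names are illustrative.

```python
def group_absmax_scales(row, group_size, bits=4):
    """One symmetric scale per contiguous group of `group_size` weights."""
    q_max = 2 ** (bits - 1) - 1
    return [
        max(abs(w) for w in row[i:i + group_size]) / q_max
        for i in range(0, len(row), group_size)
    ]


def effective_bits(weight_bits, group_size, scale_bits=16):
    """Storage per weight once the per-group scale is amortized."""
    return weight_bits + scale_bits / group_size


def scales_per_layer(c_out, c_in, group_size):
    """Number of scales for a C_out x C_in matrix grouped along C_in."""
    return c_out * (c_in // group_size)


print(effective_bits(4, 128))              # 4.125
print(scales_per_layer(4096, 4096, 128))   # 131072
print(group_absmax_scales([0.7, -0.1, 0.05, 0.02], group_size=2))
```

In the last line, the second group gets a scale 14× finer than the first, which is the whole point: a large weight only costs resolution inside its own group.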
| Granularity | Scales per layer | Overhead | Accuracy |
|---|---|---|---|
| Per-Tensor | 1 | Minimal | Lowest |
| Per-Channel | C_out | Low | Good (weights) |
| Group (g=128) | C_out × C_in/128 | +0.125 bits | Excellent |
| Per-Token (activations) | seq_len | Runtime compute | Better than per-tensor |
Post-training quantization is fast but has limits. For aggressive quantization (4-bit and below), or for small models with limited representational capacity, PTQ often can't recover the accuracy lost by discretizing weights and activations. The solution: Quantization-Aware Training (QAT) — simulate quantization during training so the model learns to work within its discrete constraints.
But there's an immediate problem. Quantization uses the round() function: Q(w) = S · round(w/S). The round() function has zero gradient almost everywhere (it's piecewise constant), and no gradient at all at the rounding boundaries halfway between integers, where it jumps. Backpropagation through a wall of zeros learns nothing.
The gradient is zero, so the weight update is zero, so training stops. The model freezes in place.
Bengio et al. (2013) proposed the Straight-Through Estimator (STE): in the backward pass, replace ∂Q(w)/∂w with 1 (the identity). You pass the gradient straight through the quantizer as if it weren't there.
Is this mathematically correct? No. The true gradient is zero; the STE gradient is 1. It's a biased estimator — it introduces approximation error. But it works in practice. The intuition: the STE says "the gradient I would get if the network were continuous is a good enough proxy for updating my real weights."
In QAT, you maintain a full-precision shadow copy of the weights Wfp32. In the forward pass, you compute Q(W) and run the layer with quantized weights. Gradients flow back through Q using the STE, updating Wfp32. The small gradient steps accumulate in full precision without rounding error. At inference, you export only the quantized weights.
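Here is a deliberately tiny sketch of that loop: one scalar weight, one training example, and a squared-error loss. Everything in it (the step size 0.25, the learning rate, the target) is an assumption chosen for illustration. The only difference between the two runs is whether ∂Q(w)/∂w is taken as 1 (STE) or as its true value 0.

```python
def quantize(w, scale):
    """Q(w) = S * round(w / S)."""
    return scale * round(w / scale)


def train(use_ste, steps=200, lr=0.05, scale=0.25):
    w_fp = 0.1         # full-precision shadow weight
    x, y = 1.0, 1.0    # single example: we want Q(w) * x == y
    for _ in range(steps):
        w_q = quantize(w_fp, scale)        # forward uses the quantized weight
        err = w_q * x - y
        grad_wq = 2.0 * err * x            # dL/dQ(w) for L = err^2
        dq_dw = 1.0 if use_ste else 0.0    # STE vs. the true gradient of round()
        w_fp -= lr * grad_wq * dq_dw       # update accumulates in full precision
    return w_fp, quantize(w_fp, scale)


w_true, q_true = train(use_ste=False)
w_ste, q_ste = train(use_ste=True)
print(f"true gradient: shadow={w_true:.3f}, quantized={q_true:.2f}")
print(f"STE          : shadow={w_ste:.3f}, quantized={q_ste:.2f}")
```

With the true gradient the shadow weight never moves. With the STE, small updates accumulate in the shadow copy until it crosses a rounding boundary, and the quantized weight steps up the grid until the loss is zero.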
Blue staircase = Q(w) forward. Orange line = the STE gradient (∂L/∂w ← ∂L/∂Q(w)), which is identical to the incoming gradient, passed straight through. The true gradient of Q(w) w.r.t. w is zero everywhere except at the rounding boundaries (the risers of the staircase), where it is undefined.
With the STE in hand, you can build a QAT training loop. The key concept is the fake-quantization node (also called simulated quantization): a differentiable approximation that behaves like a quantizer in the forward pass but allows gradient flow in the backward pass.
```python
import torch
import torch.nn as nn


class FakeQuantize(torch.autograd.Function):
    """
    Forward: round to the N-bit quantization grid (fake-quant).
    Backward: Straight-Through Estimator, pass the gradient unchanged.
    """

    @staticmethod
    def forward(ctx, x, scale, zero_point, bits):
        q_min = -2 ** (bits - 1)
        q_max = 2 ** (bits - 1) - 1
        # Quantize: x -> integer grid
        q = torch.clamp(torch.round(x / scale + zero_point), q_min, q_max)
        # Dequantize: back to float (still in forward, but now on the grid)
        x_fake_quant = scale * (q - zero_point)
        # Save a mask: gradient passes through only inside the representable
        # range, which is shifted by zero_point
        lo = scale * (q_min - zero_point)
        hi = scale * (q_max - zero_point)
        ctx.save_for_backward(((x >= lo) & (x <= hi)).to(x.dtype))
        return x_fake_quant

    @staticmethod
    def backward(ctx, grad_output):
        mask, = ctx.saved_tensors
        # STE: pass gradient through where not clipped, else 0
        grad_input = grad_output * mask
        # No gradient is returned for scale, zero_point, or bits
        return grad_input, None, None, None


class QATLinear(nn.Module):
    """Linear layer with fake-quant on weights and activations."""

    def __init__(self, in_f, out_f, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_f, out_f)
        self.bits = bits
        # Fixed weight scale, initialized from the weight range. It is a buffer
        # rather than a learnable parameter because FakeQuantize.backward
        # returns no gradient for the scale.
        q_max = 2 ** (bits - 1) - 1
        self.register_buffer(
            "w_scale",
            (self.linear.weight.detach().abs().max() / q_max).reshape(1),
        )

    def forward(self, x):
        # Fake-quantize weights
        w_fq = FakeQuantize.apply(self.linear.weight, self.w_scale, 0, self.bits)
        # Fake-quantize activations (scale from per-batch absmax, never zero)
        q_max = 2 ** (self.bits - 1) - 1
        a_scale = (x.detach().abs().max() / q_max).clamp(min=1e-8)
        x_fq = FakeQuantize.apply(x, a_scale, 0, self.bits)
        return nn.functional.linear(x_fq, w_fq, self.linear.bias)
```
At inference, BatchNorm layers are almost always folded into the preceding convolution/linear layer (the mean/variance subtraction and scale/shift become constant modifications to W and b). For QAT to be accurate, you must simulate this folding during training — otherwise the QAT sees weights that will be different at inference.
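The folding itself is simple arithmetic. A minimal sketch for one output channel of a linear layer, assuming BatchNorm in eval mode with running statistics (function names are illustrative): the per-channel factor γ/√(σ² + ε) is multiplied into the weights, and the mean and shift are absorbed into the bias.

```python
import math


def fold_batchnorm(weight_row, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Return (folded_weight_row, folded_bias) for one output channel."""
    factor = gamma / math.sqrt(running_var + eps)
    folded_weight = [w * factor for w in weight_row]
    folded_bias = (bias - running_mean) * factor + beta
    return folded_weight, folded_bias


def linear(weight_row, bias, x):
    return sum(w * xi for w, xi in zip(weight_row, x)) + bias


def batchnorm_eval(y, gamma, beta, running_mean, running_var, eps=1e-5):
    return gamma * (y - running_mean) / math.sqrt(running_var + eps) + beta


w, b = [0.4, -1.2, 0.7], 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.25, 4.0
x = [1.0, 2.0, -0.5]

reference = batchnorm_eval(linear(w, b, x), gamma, beta, mean, var)
w_fold, b_fold = fold_batchnorm(w, b, gamma, beta, mean, var)
folded = linear(w_fold, b_fold, x)
print(reference, folded)  # the two agree up to floating-point error
```

In QAT, the fake-quant node should see `w_fold`, not `w`, because `w_fold` is what actually gets quantized at export.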
| Model | FP32 Acc. | PTQ INT8 (per-tensor) | QAT INT8 |
|---|---|---|---|
| MobileNetV1 | 70.9% | 59.1% (−11.8%) | 70.7% (−0.2%) |
| MobileNetV2 | 71.9% | 69.8% (−2.1%) | 71.1% (−0.8%) |
| NASNet-Mobile | 74.9% | 72.2% (−2.7%) | 73.0% (−1.9%) |
PTQ and QAT are not competing — they're tools for different situations. The decision tree comes down to three factors: bitwidth, model size, and whether you have access to training data and compute.
For INT8 quantization on large models (ResNets, BERT, large Transformers), PTQ with good calibration (percentile or KL) typically achieves less than 0.5% accuracy drop. The model has enough representational capacity to absorb the quantization error. PTQ takes minutes; QAT takes days.
For INT4 and below, or for small efficient models (MobileNet, EfficientNet-lite), PTQ degrades unacceptably. MobileNetV1 at INT8 per-tensor loses 11.8% accuracy with PTQ — that's the difference between a useful model and a useless one. QAT recovers it to 0.2% loss. The training cost is justified.
A practical heuristic: use PTQ for INT8, consider QAT for INT4 and below. The transition point moves depending on model size — larger models can absorb INT4 PTQ (see GPTQ, AWQ for LLMs). Smaller models need QAT or at minimum per-channel calibration and careful range tuning.
Schematic accuracy curves for MobileNetV2-style small model. QAT (teal) holds accuracy much better than PTQ (orange) as bitwidth decreases below 8 bits.
| Scenario | Recommended Method | Reasoning |
|---|---|---|
| Large model, INT8 | PTQ (percentile or KL) | Accuracy loss <0.5%; minutes of work |
| Small model, INT8 | PTQ per-channel or QAT | Per-tensor PTQ may collapse; per-channel often sufficient |
| Any model, INT4 | QAT preferred; try GPTQ/AWQ for LLMs | PTQ INT4 loses significant accuracy without training signal |
| LLM, W8A8 | SmoothQuant + PTQ | Outlier migration makes activations quantizable without retraining |
| LLM, W4A16 | GPTQ or AWQ | Second-order weight correction or salient-channel scaling without full QAT |
Large language models beyond 6.7B parameters develop a peculiar quantization pathology. Certain hidden-state channels consistently activate with values 10–100× larger than their neighbors. The phenomenon is systematic: the same channel indices spike across every token, every sequence, every context. It's a structural feature of how large Transformers represent information — not a distribution artifact you can calibrate away.
A linear layer in an LLM Transformer computes Y = XW. In standard INT8, you'd quantize both X (activations) and W (weights) independently, then multiply the integers. But X has channels with max|X_j| ≈ 70 while others have max|X_k| ≈ 0.5. A single per-tensor scale of 70/127 ≈ 0.55 means channel k's values can only land on −1, 0, or +1, and anything below about 0.28 in magnitude rounds to zero. The matrix multiply is destroyed.
The key insight from Xiao et al. (2022): weights are easy to quantize, activations are hard to quantize, but we can balance this difficulty by migrating it from activations to weights. Weights can absorb more difficulty because they're quantized per-channel. Activations must use a single scale per-token.
The mathematical trick: introduce a per-channel diagonal scale matrix S = diag(s_1, …, s_d). Because Y = XW, you can insert S⁻¹ and S without changing the result: Y = XW = (X·S⁻¹)(S·W) = X̂·Ŵ.
Now X̂_j = X_j / s_j (divide each activation channel by s_j) and Ŵ_{j,:} = s_j · W_{j,:} (multiply each corresponding weight row by s_j). The scale migrates the magnitude from activations to weights.
How big should s_j be? Too large, and weights become hard to quantize. Too small, and activations stay hard. Song Han's group derived a formula with a migration strength parameter α ∈ [0, 1]: s_j = max|X_j|^α / max|W_j|^(1−α).
Worked numbers. Say channel j has max|X_j| = 60 and max|W_j| = 2. With α = 0.5: s_j = 60^0.5 / 2^0.5 = √30 ≈ 5.48.
After smoothing: max|X̂_j| = 60 / 5.48 ≈ 10.95 and max|Ŵ_j| = 2 × 5.48 ≈ 10.95. The outlier activation dropped from 60 to about 10.95 (a 5.5× reduction). The weight grew from 2 to about 10.95, which is still manageable with per-channel quantization. Both are now in the same range, making per-token activation quantization and per-channel weight quantization both effective.
The α tradeoff. α = 0: all difficulty stays in activations (original problem). α = 1: all difficulty moves to weights (weights become huge, activations perfect). Sweet spot α ≈ 0.5 balances both, making the joint quantization error minimal. The optimal α can be tuned per-layer on calibration data.
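You can check the worked numbers and the α sweep in a few lines of plain Python. This sketch applies the scale formula s_j = max|X_j|^α / max|W_j|^(1−α) to the single channel from the example; the second channel in the equivalence check is made up for illustration.

```python
import math


def smooth_scale(act_max, w_max, alpha=0.5):
    """SmoothQuant per-channel scale for one channel."""
    return act_max ** alpha / w_max ** (1 - alpha)


act_max, w_max = 60.0, 2.0
s = smooth_scale(act_max, w_max, alpha=0.5)   # sqrt(30)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    s_a = smooth_scale(act_max, w_max, alpha)
    print(f"alpha={alpha:.2f}  s={s_a:6.2f}  "
          f"act max={act_max / s_a:7.2f}  weight max={w_max * s_a:7.2f}")

# Equivalence check on one output value: sum_j x_j w_j is unchanged
x = [60.0, 0.5]        # channel 0 is the outlier channel
w = [2.0, -1.0]
scales = [s, 1.0]      # smooth only the outlier channel here
y = sum(xi * wi for xi, wi in zip(x, w))
y_smooth = sum((xi / si) * (wi * si) for xi, wi, si in zip(x, w, scales))
print(y, y_smooth)
```

At α = 0.5 the activation and weight maxima meet at √120 ≈ 10.95. Moving α toward 1 keeps shrinking the activation range while the weight range grows by the same factor.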
```python
import torch


def compute_smoothquant_scale(X_calibration, W, alpha=0.5):
    """
    X_calibration: collected activation tensor [n_tokens, d_in]
    W: weight matrix [d_out, d_in] (nn.Linear layout)
    Returns: per-input-channel scale s of shape [d_in]
    """
    # Per-channel max of activations (over all tokens)
    act_max = X_calibration.abs().max(dim=0).values  # [d_in]
    # Per-input-channel max of weights (over all output channels)
    w_max = W.abs().max(dim=0).values  # [d_in]
    # Balancing scale; clamp w_max first so an all-zero channel cannot
    # cause a division by zero
    s = act_max.pow(alpha) / w_max.clamp(min=1e-5).pow(1 - alpha)
    return s.clamp(min=1e-5)


def apply_smooth_quant(X, W, s):
    """Fold the scale: smooth activations, absorb into weights."""
    X_smooth = X / s.unsqueeze(0)  # divide each activation channel by s_j
    W_smooth = W * s.unsqueeze(0)  # multiply each input-channel column of W by s_j
    # X_smooth @ W_smooth.T equals X @ W.T; now quantize both independently with INT8
    return X_smooth, W_smooth
```
SmoothQuant enables W8A8 (both weights and activations quantized to INT8) — excellent for batch serving on data-center GPUs. But single-query LLM inference on a local machine has a different bottleneck: memory bandwidth. At batch size 1, the GPU compute is idle while it waits for 65B parameters to transfer from HBM to registers. W8A8 halves the memory but still needs a fast GPU. For edge devices, you need W4A16: 4-bit weights, 16-bit activations — only weights are compressed, and they're decompressed on-the-fly to FP16 for compute.
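A back-of-the-envelope sketch of why weight bits dominate at batch size 1: every generated token has to stream all weights from memory once, so memory bandwidth divided by weight bytes bounds tokens per second. The 1 TB/s bandwidth below is an assumed round number, not a measurement of any device, and the estimate ignores scale overhead, the KV cache, and activations.

```python
def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8


def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound when each token requires one full pass over the weights."""
    return bandwidth_bytes_per_s / weight_bytes(n_params, bits_per_weight)


N_PARAMS = 65e9
HYPOTHETICAL_BANDWIDTH = 1.0e12  # 1 TB/s, assumed for illustration only

for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    gb = weight_bytes(N_PARAMS, bits) / 1e9
    tps = max_tokens_per_second(N_PARAMS, bits, HYPOTHETICAL_BANDWIDTH)
    print(f"{name}: {gb:6.1f} GB of weights, at most {tps:5.1f} tokens/s")
```

Whatever the real bandwidth is, the ratio holds: going from FP16 to INT4 weights cuts the bytes moved per token by 4×, which is where W4A16 gets its speedup.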
GPTQ (Frantar et al., 2022) is a layer-wise PTQ method that finds the best 4-bit weights by minimizing the output error: for each layer, find the quantized Ŵ that minimizes ‖WX − ŴX‖_F². It uses second-order information, the Hessian of this layer-wise reconstruction error (H = 2XXᵀ), to make smarter rounding decisions than round-to-nearest.
Key insight: not all weight rounding errors are equal. A weight with high Hessian curvature (high second derivative) contributes more to output error when rounded badly than a weight with low curvature. After rounding each weight, GPTQ updates the remaining unquantized weights to absorb the error. Its predecessor OBQ picks the next weight greedily; GPTQ observes that a fixed column order works almost as well on large models, which is what makes it fast enough for LLMs.
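The compensation step can be shown on a toy problem with just two weights. This is a sketch of the second-order update that GPTQ builds on, not the GPTQ algorithm itself, which processes whole columns in blocks and works with a Cholesky factorization of a dampened inverse Hessian. The weights, inputs, and grid step are made up; H = XXᵀ drops the constant factor 2, which cancels in the update.

```python
def hessian(X):
    """H = X X^T for X given as 2 input-channel rows of n samples each."""
    return [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(2)]
            for i in range(2)]


def inverse_2x2(H):
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [[H[1][1] / det, -H[0][1] / det],
            [-H[1][0] / det, H[0][0] / det]]


def quant(v, step):
    return step * round(v / step)


def output_error(w, w_hat, X):
    """|| w X - w_hat X ||^2 summed over the calibration samples."""
    return sum(((w[0] - w_hat[0]) * X[0][k] + (w[1] - w_hat[1]) * X[1][k]) ** 2
               for k in range(len(X[0])))


w = [0.37, -0.65]
step = 0.25
X = [[1.0, 2.0, -1.0, 0.5],    # two strongly correlated input channels
     [0.8, 2.1, -0.7, 0.6]]

# Baseline: round-to-nearest on both weights independently
w_rtn = [quant(w[0], step), quant(w[1], step)]

# Second-order: quantize w0, then shift w1 to absorb w0's rounding error
H_inv = inverse_2x2(hessian(X))
q0 = quant(w[0], step)
delta1 = -(w[0] - q0) / H_inv[0][0] * H_inv[1][0]
w_comp = [q0, quant(w[1] + delta1, step)]

err_rtn = output_error(w, w_rtn, X)
err_comp = output_error(w, w_comp, X)
print(f"round-to-nearest: {w_rtn}, output error {err_rtn:.4f}")
print(f"compensated     : {w_comp}, output error {err_comp:.4f}")
```

Because the two inputs are correlated, rounding w0 down can be offset by nudging w1 up before it is rounded. Here that flips w1 to a different grid point and the output error drops sharply, even though w1 itself ends up farther from its original value. After the final rounding an improvement is not guaranteed in general; it is only the pre-rounding update that is optimal.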
AWQ (Lin et al., 2023) starts from a striking observation: 1% of weight channels are disproportionately important for model quality. Keeping just those 1% in FP16 while quantizing the remaining 99% to INT4 dramatically reduces perplexity. But which 1% to keep?
The answer: look at activations, not weights. Weight channels paired with activation channels that have large magnitudes are the most important — a large activation multiplied by a weight error produces a large output error. These salient channels are identified by running calibration data and measuring the average activation magnitude per input channel.
Instead of keeping salient channels in FP16 (which breaks hardware uniformity), AWQ uses the same scale trick as SmoothQuant, but just for the salient channels: Q(w · s) · (x / s), with s > 1.
Scaling up W_salient by s before quantization shrinks the relative rounding error on the important values. The inverse scale is applied to the corresponding activation channels, which is cheap because it can be fused into the preceding operator.
Why does scaling up a weight channel reduce its quantization error? The rounding error is proportional to the step size Δ = max(|w|)/2^(N−1), where the max is taken over the whole quantization group. Scaling one salient channel by s usually leaves the group maximum unchanged, because the largest weight typically sits in another channel, so Δ′ ≈ Δ. The rounding error on the scaled weight is still about Δ′, but the output applies the inverse scale, so the error contribution becomes Δ′·(1/s)·x ≈ (Δ·x)/s, a factor of s smaller than before. For salient channels where x is large, the improvement is dramatic. If s is pushed too high, the scaled channel starts to set the group maximum, Δ′ grows, and the non-salient weights pay for it.
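A quick simulation of that argument, under stated assumptions: the group maximum (1.0) is set by a non-salient weight and stays fixed, the salient weights are small enough that scaling by s = 2 never exceeds it, and the INT4 grid is symmetric over [−7, 7]. All numbers are synthetic.

```python
import random


def int4_fake_quant(w, step):
    """Symmetric INT4 round-trip on the grid {-7, ..., 7} * step."""
    q = max(-7, min(7, round(w / step)))
    return q * step


random.seed(0)
group_max = 1.0            # assumed to come from a non-salient weight
step = group_max / 7       # step size is fixed by the group maximum
salient_weights = [random.uniform(-0.4, 0.4) for _ in range(2000)]


def mean_abs_error(s):
    """Mean |Q(w * s) / s - w| over the salient weights."""
    total = 0.0
    for w in salient_weights:
        w_hat = int4_fake_quant(w * s, step) / s
        total += abs(w_hat - w)
    return total / len(salient_weights)


err_plain = mean_abs_error(1.0)
err_scaled = mean_abs_error(2.0)
print(f"no scaling : {err_plain:.4f}")
print(f"s = 2      : {err_scaled:.4f}  (ratio {err_scaled / err_plain:.2f})")
```

The ratio comes out close to 1/s. The same experiment with a scale large enough to push s·w past the group maximum would change the step size and break the assumption, which is why AWQ searches for s instead of maximizing it.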
Before SmoothQuant, Nagel et al. (ICCV 2019) proposed Cross-Layer Equalization (CLE) for CNN PTQ. The idea: consecutive layers separated by a positive-homogeneous activation function (like ReLU) can be rescaled without changing the network output. ReLU is positive-homogeneous: ReLU(sx) = s·ReLU(x) for s > 0.
For two consecutive linear layers W(1) and W(2) with a ReLU between them, introduce a per-channel scale s_c > 0, collected in S = diag(s_1, …, s_C): W(2)·ReLU(W(1)x) = (W(2)S)·ReLU(S⁻¹W(1)x).
The optimal per-channel scale that equalizes quantization ranges is: s_c = √(max|W(1)[c]| / max|W(2)[:,c]|). After rescaling, channel c has the same range in both layers, namely the geometric mean of the two original ranges. CLE is data-free (no calibration data needed) and can precede any other PTQ method.
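A small sketch that applies this scale to a two-layer toy network and checks both claims: the output is unchanged, and each hidden channel ends up with the same range on both sides. Biases are omitted for brevity (a bias on the first layer would be divided by s_c as well), and the matrices are made up.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def forward(W1, W2, x):
    return matvec(W2, relu(matvec(W1, x)))


def equalize(W1, W2):
    """Cross-layer equalization: rows of W1 and columns of W2 share channel c."""
    n_hidden = len(W1)
    scales = []
    for c in range(n_hidden):
        r1 = max(abs(w) for w in W1[c])          # range of row c in layer 1
        r2 = max(abs(row[c]) for row in W2)      # range of column c in layer 2
        scales.append(math.sqrt(r1 / r2))
    W1_eq = [[w / scales[c] for w in W1[c]] for c in range(n_hidden)]
    W2_eq = [[row[c] * scales[c] for c in range(n_hidden)] for row in W2]
    return W1_eq, W2_eq, scales


W1 = [[8.0, -4.0], [0.1, 0.05]]     # hidden channel 0 is large, channel 1 tiny
W2 = [[0.5, 2.0], [-0.25, 1.0]]
x = [1.0, -0.5]

W1_eq, W2_eq, scales = equalize(W1, W2)
print("scales:", scales)
print("before:", forward(W1, W2, x))
print("after :", forward(W1_eq, W2_eq, x))
```

For channel 0 the ranges 8 and 0.5 both become 2; for channel 1 the ranges 0.1 and 2 both become about 0.447. A per-tensor scale on either layer now wastes far less resolution.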
```python
import torch
from collections import defaultdict


def run_ptq_calibration(model, calibration_loader, bits=8):
    """Collect activation statistics and compute per-layer quant params."""
    model.eval()
    act_stats = defaultdict(
        lambda: {'min': float('inf'), 'max': float('-inf'), 'all': []}
    )
    hooks = []

    def make_hook(n):
        def hook(m, inp, out):
            x = out.detach()
            act_stats[n]['min'] = min(act_stats[n]['min'], x.min().item())
            act_stats[n]['max'] = max(act_stats[n]['max'], x.max().item())
            # Sample: keep the first 512 magnitudes of each batch
            # (slice before .tolist() so the full tensor is never copied)
            act_stats[n]['all'].extend(x.abs().flatten()[:512].tolist())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        # Assumes the loader yields (input, label) pairs
        for i, (x, _) in enumerate(calibration_loader):
            if i >= 32:
                break  # 32 batches is typically sufficient
            model(x)

    for h in hooks:
        h.remove()

    quant_params = {}
    q_max = (1 << (bits - 1)) - 1
    for name, stats in act_stats.items():
        # Percentile calibration (99.9th percentile of sampled magnitudes)
        vals = sorted(stats['all'])
        thresh = vals[min(int(len(vals) * 0.999), len(vals) - 1)]
        scale = max(thresh, 1e-8) / q_max  # guard against an all-zero layer
        quant_params[name] = {'scale': scale, 'zero_point': 0}
    return quant_params
```
| Method | Type | Quantization | Best for | Requires retraining? |
|---|---|---|---|---|
| SmoothQuant | W8A8 | Weights + activations INT8 | Batch serving, datacenter | No (calibration only) |
| GPTQ | W4A16 | Weights INT4, activations FP16 | Large models, edge | No (second-order opt) |
| AWQ | W4A16 | Weights INT4 w/ salient scaling | Local inference, TinyChat | No (calibration + scale search) |
| QAT | W4A8 or W4A4 | Both quantized, training-aware | Small models, aggressive quant | Yes |
| CLE | W8A8 | Per-channel weight rescaling | Data-free CNN PTQ baseline | No (data-free) |
This is the payoff chapter. Three linked canvases let you see the full quantization difficulty story, the SmoothQuant migration in action, and the AWQ salient-channel protection — all with live controls.
An activation histogram with a configurable outlier. The three colored dashed lines show where each calibration method sets its clipping threshold. Lower total MSE is better.
Two bars: activation max (orange) and weight max (teal) before and after SmoothQuant. The red zone marks "hard to quantize" (above the INT8-friendly threshold). Drag α to find the sweet spot where both bars are in the green zone.
16 weight channels, colored by their corresponding activation magnitude (bright = high activation = salient). Toggle "Protect top 1%" to scale up salient channels before INT4 quantization. Watch the output error (red bar) drop dramatically.
| Situation | Method | Key Paper |
|---|---|---|
| Large model (ResNet, BERT+), INT8 | PTQ + KL-div calibration | TensorRT (Migacz 2017) |
| Small model (MobileNet), INT8 | PTQ per-channel or QAT | Krishnamoorthi 2018 |
| Any model, INT4 | QAT | Jacob et al. CVPR 2018 |
| LLM >6.7B, W8A8 (batch serving) | SmoothQuant + PTQ | Xiao et al. 2022 |
| LLM, W4A16 (local/edge) | GPTQ or AWQ | Frantar 2022; Lin 2023 |
| LLM, fine-tuning needed | QLoRA (LoRA adapters on a frozen 4-bit base) | Dettmers 2023 |
| Method | Threshold | Strength | Weakness |
|---|---|---|---|
| Min/Max | abs max observed | No clipping | Outliers ruin scale |
| Percentile (p=99.9%) | p-th percentile | Robust to outliers | Must choose p |
| KL-Divergence | argmin_T KL(FP32 ‖ INT8) | Minimizes information loss | Computationally expensive |
| MSE (OCTAV) | argmin_T MSE | Optimal for Laplace dist. | Assumes distribution shape |
SmoothQuant: multiply each activation channel by s_j⁻¹ and each weight row by s_j, where s_j = max(|X_j|)^α / max(|W_j|)^(1−α). The scaling is mathematically equivalent (Y = XW = (X/s)·(sW)) and is absorbed offline into weights and the preceding LayerNorm.
GPTQ: layer-wise second-order PTQ. Quantize weights one column at a time, then use the inverse Hessian of the layer's reconstruction error to compensate the remaining unquantized weights for the just-introduced error. Enables INT4 on 175B-parameter GPT-style models in ~4 GPU-hours.
AWQ: identify the 1% of input channels with the largest activation magnitudes. Scale those weight channels up by s before INT4 quantization (and scale the activations down correspondingly). The error on salient channels drops ∝ 1/s.
| Format | Weights | Activations | Best batch size | Speedup mechanism |
|---|---|---|---|---|
| FP16 | FP16 | FP16 | Any | Baseline |
| W8A8 | INT8 | INT8 | ≥32 (compute-bound) | INT8 GEMM throughput |
| W4A16 | INT4 | FP16 | 1–4 (memory-bound) | Reduced weight transfer |
"What I cannot create, I do not understand." — Feynman. You've now built every piece of the quantization stack from scratch — the calibration loop, the fake-quant node, the STE backward pass, the SmoothQuant scale formula. You understand quantization.