A 7B-parameter model weighs 28 GB in FP32. Quantize it to INT4 and it shrinks to 3.5 GB — small enough to live on a phone. But you're throwing away 28 of every 32 bits. This lesson shows exactly how you can do that without throwing away the model.
LLaMA-2 7B has seven billion parameters. At the default training precision (32-bit floats, four bytes per parameter) the weights alone occupy 28 gigabytes. A recent flagship phone such as the iPhone 15 Pro has 8 GB of RAM. Even a consumer GPU with 24 GB of memory cannot hold the full FP32 model.
Now consider this: an 8-bit integer takes one byte. If you could represent every weight as an integer instead of a float, the model would shrink to 7 GB — still large, but now accessible to a high-end phone. Cut to 4 bits and you're at 3.5 GB. At 2 bits, 1.75 GB. The savings are enormous. The question is whether the math can survive the reduction.
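The arithmetic above is easy to reproduce. The following is a minimal sketch that counts only the raw weight storage (the helper name `model_size_gb` is made up for illustration, and scales, codebooks, and activations are ignored):

```python
def model_size_gb(n_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Sizes for a 7B-parameter model at several bitwidths
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {model_size_gb(7e9, bits):.2f} GB")
```

The loop prints 28.00, 14.00, 7.00, 3.50, and 1.75 GB, matching the figures quoted in the text.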
Here's the energy hierarchy from Horowitz (ISSCC 2014) that motivates everything in this lecture. The numbers are for 45nm CMOS at 0.9V:
| Operation | Energy (pJ) | Relative to INT8 ADD |
|---|---|---|
| 8-bit INT ADD | 0.03 | 1× |
| 32-bit INT ADD | 0.1 | 3× |
| 16-bit FP ADD | 0.4 | 13× |
| 32-bit FP ADD | 0.9 | 30× |
| 8-bit INT MULT | 0.2 | 7× |
| 32-bit INT MULT | 3.1 | 103× |
| 16-bit FP MULT | 1.1 | 37× |
| 32-bit FP MULT | 3.7 | 123× |
Switching from FP32 to INT8 saves about 30× energy per addition. On a mobile device doing millions of operations per second, that difference is the gap between an app that's usable for 2 hours and one that drains the battery in 20 minutes.
This lesson covers three approaches. K-means quantization finds the best codebook by clustering — you choose k centroids and every weight snaps to the nearest one. Linear (affine) quantization uses a uniform grid defined by a scale and zero-point — this is the approach that enables true integer arithmetic during inference. Binary and ternary quantization (the extreme case, only ±1 or {-1, 0, +1}) is covered in Lecture 6.
Drag the bitwidth slider and watch how a 7B-parameter model shrinks. The green bar shows the iPhone 15 Pro's available RAM (8 GB).
Before you can quantize a weight, you need to understand what you're starting from — and what you're quantizing to. Every number in a neural network is stored in a specific format, and the format determines two competing properties: dynamic range (how large or small a number can be) and precision (how finely you can distinguish between nearby numbers).
An unsigned integer with $n$ bits can represent values from $0$ to $2^n - 1$. An 8-bit unsigned integer (uint8) covers $[0, 255]$. A signed integer using two's complement covers $[-2^{n-1},\ 2^{n-1} - 1]$. For 8-bit (INT8): $[-128, 127]$.
Two's complement is the standard. The value of an $n$-bit two's complement number with bits $b_{n-1}b_{n-2}\ldots b_0$ is:

$$v = -b_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} b_i\,2^i$$
Example: the 8-bit pattern 11001111 = $-1\times2^7 + 1\times2^6 + 1\times2^3 + 1\times2^2 + 1\times2^1 + 1\times2^0$ = −128 + 64 + 8 + 4 + 2 + 1 = −49.
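To check the example mechanically, here is a small sketch of a decoder. The function name `twos_complement_value` is hypothetical. It gives the most significant bit a weight of $-2^{n-1}$ and reads the remaining bits as an ordinary unsigned number:

```python
def twos_complement_value(bits: str) -> int:
    """Decode a two's complement bit string such as '11001111'."""
    n = len(bits)
    msb = int(bits[0])                          # sign-carrying bit
    rest = int(bits[1:], 2) if n > 1 else 0     # remaining bits, unsigned
    return -msb * 2 ** (n - 1) + rest


print(twos_complement_value("11001111"))  # -49
print(twos_complement_value("01111111"))  # 127
print(twos_complement_value("10000000"))  # -128
```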
Fixed-point takes a regular integer and places an implicit binary point somewhere in the bit pattern. You're using the same hardware integer instructions, but you agree in advance to interpret the result as scaled by a factor $2^{-f}$, where $f$ is the number of fractional bits.
Example: an 8-bit number 00110001 = 49 as an integer. As fixed-point with 4 fractional bits, we interpret it as 49 × $2^{-4}$ = 49 × 0.0625 = 3.0625. Fixed-point DSPs use this constantly: it's fast, deterministic, and requires no special hardware. The downside: the scale is fixed, so you can't represent both very large and very small values simultaneously.
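A short sketch of the same idea in code, assuming a format with 4 fractional bits. The helper names `to_fixed`, `from_fixed`, and `fixed_mul` are illustrative, not from any library. Note that a product of two such numbers carries 8 fractional bits, so it is shifted right by 4 to return to the original format:

```python
FRAC_BITS = 4  # assumed number of fractional bits


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Real value -> integer code, scaled by 2**frac_bits."""
    return round(x * (1 << frac_bits))


def from_fixed(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Integer code -> real value."""
    return q / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply two fixed-point codes using integer operations only."""
    return (a * b) >> frac_bits


print(from_fixed(0b00110001))   # 3.0625
a, b = to_fixed(1.5), to_fixed(2.25)
print(from_fixed(fixed_mul(a, b)))  # 3.375
```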
| Format | Sign | Exponent | Mantissa | Total | Notes |
|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | 32 | Training standard. Full range + precision. |
| FP16 | 1 | 5 | 10 | 16 | Smaller range than FP32 (max ≈ 65504). Overflow risk. |
| BF16 | 1 | 8 | 7 | 16 | Same exponent as FP32 → same range. Less precision. |
| INT8 | 1 (sign bit) | — | — | 8 | Uniform grid. Needs scale to represent real values. |
| INT4 | 1 (sign bit) | — | — | 4 | Only 16 values. Extreme compression, challenging accuracy. |
Why does BF16 matter? Because it was designed as a drop-in replacement for FP32 in training. It has the same 8-bit exponent, so it covers roughly $10^{-38}$ to $10^{38}$, the same range as FP32. You just lose precision (7 mantissa bits instead of 23). Given the noise already present in stochastic gradient updates, that precision loss is usually acceptable.
FP16, on the other hand, cuts the exponent to 5 bits, which limits normal values to magnitudes between about $6\times10^{-5}$ and 65504. Large values overflow past 65504, and small gradients underflow toward zero. The loss-scaling trick in mixed-precision training exists because of this narrow range: the loss is multiplied by a constant so that small gradients stay representable, and the scale is reduced if anything overflows.
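The FP16 range limits can be observed directly with NumPy. This is a rough sketch only: the gradient value `1e-8` and the scale factor `1024` are arbitrary choices for illustration, not values from any training recipe.

```python
import numpy as np

fp16 = np.finfo(np.float16)
print("largest FP16 value:", float(fp16.max))            # 65504.0
print("smallest normal FP16 value:", float(fp16.tiny))   # about 6.1e-05

with np.errstate(over="ignore", under="ignore"):
    overflowed = np.float32(70000.0).astype(np.float16)  # too large: becomes inf
    tiny_grad = np.float32(1e-8)
    lost = tiny_grad.astype(np.float16)                  # too small: becomes 0

    loss_scale = np.float32(1024.0)                      # hypothetical loss scale
    kept = (tiny_grad * loss_scale).astype(np.float16)   # survives as a subnormal

recovered = np.float32(kept) / loss_scale                # unscale in FP32
print(overflowed, lost, kept, recovered)
```

Scaling before the cast keeps the small gradient alive, and dividing by the same factor in FP32 afterwards recovers a value close to the original.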
The IEEE 754 floating-point standard encodes a number as three fields: a sign bit, an exponent, and a mantissa (also called the fraction or significand). For FP32 the formula is:

$$x = (-1)^{\text{sign}} \times (1 + \text{mantissa}) \times 2^{\,\text{exponent} - \text{bias}}$$

The "1 +" is the hidden bit: IEEE 754 assumes the leading 1 of a normalized number is implicit, so you don't need to store it. The bias for FP32 is 127 (= $2^{8-1} - 1$). This lets the 8-bit exponent represent both negative exponents (for tiny numbers) and positive exponents (for large numbers), centered around zero.
Step 1: Write the number in binary scientific notation.
0.265625 = 1.0625 × $2^{-2}$ = (1 + 0.0625) × $2^{-2}$
Step 2: Identify the fields. Sign = 0 (positive). Exponent = −2 + 127 = 125 = 01111101. Mantissa = 0.0625 = $2^{-4}$ = 00010000000000000000000.
Step 3: Concatenate. The full 32-bit pattern is: 0 01111101 00010000000000000000000.
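The three steps can be verified against the actual bits Python stores. This sketch uses the standard `struct` module to reinterpret a float as a 32-bit integer (the helper name `fp32_fields` is made up for illustration):

```python
import struct


def fp32_fields(x: float):
    """Return (sign, biased exponent, mantissa bits) of x encoded as FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa


s, e, m = fp32_fields(0.265625)
print(s, f"{e:08b}", f"{m:023b}")  # 0 01111101 00010000000000000000000

# Rebuild the value from the fields using the FP32 formula (bias = 127)
value = (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - 127)
print(value)  # 0.265625
```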
FP16 has 5 exponent bits and 10 mantissa bits. The bias is $2^{5-1} - 1 = 15$. Given the bit pattern 1100011100000000:
• Sign = 1 (negative).
• Exponent = $10001_2$ = 17. Actual exponent = 17 − 15 = 2.
• Mantissa = $1100000000_2$ = $2^{-1} + 2^{-2}$ = 0.5 + 0.25 = 0.75.
• Value = −(1 + 0.75) × $2^2$ = −1.75 × 4 = −7.0.
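The decoding steps above translate into a few lines of code. This is a sketch that handles normalized values only. Subnormals, infinities, and NaN are deliberately left out, and the name `decode_fp16` is hypothetical:

```python
def decode_fp16(bits: str) -> float:
    """Decode a normalized FP16 bit string; spaces between fields are allowed."""
    bits = bits.replace(" ", "")
    if len(bits) != 16:
        raise ValueError("expected 16 bits")
    sign = int(bits[0])
    exponent = int(bits[1:6], 2)           # 5 exponent bits, bias 15
    fraction = int(bits[6:], 2) / 2**10    # 10 mantissa bits as a fraction
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 15)


print(decode_fp16("1 10001 1100000000"))  # -7.0
```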
Select a format and type a decimal value. The canvas shows the exact bit fields, encoded value, and the representable grid spacing near that value. The grid spacing tells you the precision you have.
Consider the FP16 bit pattern 0 10000 1100000000 (sign=0, exp=10000, mantissa=1100000000). What decimal value does this encode? (FP16 bias = 15)

Here is the central observation of K-means quantization: the weight values of a trained neural network are not uniformly distributed. They cluster. Typical convolutional layers have bimodal or multimodal distributions, with groups of weights near certain recurring values like −1, 0, +0.5, +1.5. If you can identify those clusters and replace every weight in a cluster with a single representative value, you can store each weight as a short cluster index instead of a full 32-bit float.
The approach is weight clustering, introduced in Deep Compression (Han et al., ICLR 2016). You run k-means on all the weights of a layer to find k centroids, then replace every weight with the index of its nearest centroid. During storage you keep only the indices (short integers) and the centroid values (a small codebook of k floats).
Consider a 4×4 weight matrix with 16 weights (shown in the canvas). K-means with k=4 finds four centroids. Let's say they are: $c_0 = -1.00$, $c_1 = 0.00$, $c_2 = 1.50$, $c_3 = 2.00$.
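The weight 2.09 snaps to centroid 3 (2.00). The weight −0.98 snaps to centroid 0 (−1.00). And so on. Now instead of storing 16 float32 values (64 bytes), you store:
• 16 two-bit cluster indices: 16 × 2 bits = 32 bits = 4 bytes.
• A codebook of 4 float32 centroids: 4 × 4 bytes = 16 bytes.
That is 20 bytes in total.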
The quantization error for this particular example is small: 2.09 snaps to 2.00 with an error of 0.09, and the other weights are off by comparable amounts. The model can tolerate these errors because neural networks are over-parameterized and robust to small perturbations.
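The snapping step is a nearest-centroid lookup. The sketch below applies the four centroids from the example to eight sample weights (the same eight values used in this lesson's linear-quantization code). It assumes the codebook is already known, so no k-means iterations are run:

```python
import numpy as np

weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12], dtype=np.float32)
codebook = np.array([-1.00, 0.00, 1.50, 2.00], dtype=np.float32)  # c0..c3

# Distance from every weight to every centroid, then pick the closest one
distances = np.abs(weights[:, None] - codebook[None, :])   # shape [8, 4]
indices = distances.argmin(axis=1).astype(np.uint8)        # 2-bit codes stored as uint8

reconstructed = codebook[indices]                          # decode by table lookup
errors = np.abs(reconstructed - weights)

print("indices:      ", indices)
print("reconstructed:", reconstructed)
print("max error:    ", errors.max())
```

Only `indices` and `codebook` need to be stored. The float weights are rebuilt on demand with a single indexing operation.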
After snapping every weight to its centroid, you have quantization error that slightly degrades accuracy. You can recover most of that accuracy by fine-tuning the centroids. Here's the trick: keep the cluster assignments fixed. Run a forward pass to get the loss. Backpropagate to get gradients with respect to each weight. But instead of updating individual weights, group the gradients by cluster and add up all the gradients that belong to the same centroid. That sum is the gradient for that centroid. Update each centroid with a gradient step using its summed gradient.
This is elegant: you're doing standard SGD, but with shared parameters. All weights in the same cluster move together because they are the same parameter (the centroid). After fine-tuning, the centroids settle into positions that minimize loss, not just reconstruction error — they can drift away from the original k-means positions if that's what the loss surface demands.
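The grouped update can be sketched with `np.bincount`, which sums values that share an index. The per-weight gradients below are made-up numbers standing in for a real backward pass, and `centroid_update` is a hypothetical helper, not a function from the paper or any library:

```python
import numpy as np


def centroid_update(codebook, indices, weight_grads, lr):
    """One SGD step on the shared centroids; cluster assignments stay fixed."""
    # Sum the per-weight gradients that fall into each cluster
    grad_per_centroid = np.bincount(
        indices.ravel(), weights=weight_grads.ravel(), minlength=len(codebook)
    )
    return codebook - lr * grad_per_centroid


codebook = np.array([-1.00, 0.00, 1.50, 2.00])
indices = np.array([3, 0, 2, 1, 1, 1, 0, 3])
# Hypothetical gradients dL/dw, one per weight
weight_grads = np.array([0.10, -0.20, 0.05, 0.00, 0.30, -0.10, 0.20, -0.10])

new_codebook = centroid_update(codebook, indices, weight_grads, lr=0.1)
print(new_codebook)
```

With these numbers, clusters 0 and 3 receive gradients that cancel, so only centroids 1 and 2 move.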
Each dot is a weight value. Choose k centroids (the colored vertical lines). Click "Run K-Means" to find the optimal cluster assignments. The compression ratio and reconstruction error update live.
Let's derive exactly how much compression K-means quantization provides and verify it on real numbers. This is the calculation from Han et al., ICLR 2016.
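Suppose a layer has $M$ weight parameters, each stored as a 32-bit float. Original storage = $32M$ bits. After K-means quantization with $k$ centroids (where $k = 2^b$ for $b$-bit indices):

$$\text{indices} = bM \text{ bits}, \qquad \text{codebook} = 32 \cdot 2^b \text{ bits}$$

$$\text{compression ratio} = \frac{32M}{bM + 32 \cdot 2^b}$$

When $M$ is large ($M \gg 2^b$, i.e., many more weights than centroids), the codebook is negligible and the compression ratio is approximately:

$$\frac{32M}{bM} = \frac{32}{b}$$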
So 2-bit quantization (k=4 centroids) gives ≈ 16× compression; 4-bit (k=16) gives ≈ 8×; 8-bit (k=256) gives ≈ 4×.
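Our example: $M = 16$ weights, $k = 4$ centroids ($b = 2$ bits).
• Original: 16 × 32 bits = 512 bits = 64 bytes.
• Indices: 16 × 2 bits = 32 bits = 4 bytes.
• Codebook: 4 × 32 bits = 128 bits = 16 bytes.
• Compression ratio: 512 / (32 + 128) = 3.2×.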
With M=16 the codebook is still significant (16 bytes out of 20 total). As layers get larger — say M = 1,000,000 weights with b=4 bits — the codebook (16 × 4 = 64 bytes) is negligible, and compression is ~8×.
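The exact ratio, codebook included, takes one line to compute. A minimal sketch (the function name `kmeans_compression_ratio` is illustrative):

```python
def kmeans_compression_ratio(num_weights: int, index_bits: int, float_bits: int = 32) -> float:
    """Original bits divided by (index bits + codebook bits)."""
    original = float_bits * num_weights
    indices = index_bits * num_weights
    codebook = float_bits * 2 ** index_bits
    return original / (indices + codebook)


print(kmeans_compression_ratio(16, 2))                    # 3.2
print(round(kmeans_compression_ratio(1_000_000, 4), 3))   # 7.999
```

For the tiny 16-weight example the codebook dominates, while for a million weights the ratio is within a fraction of a percent of the 32/b approximation.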
Deep Compression combined pruning (removing ~90% of weights) + quantization (4–8 bits) + Huffman coding. Results on real models:
| Network | Original | Compressed | Ratio | Accuracy |
|---|---|---|---|---|
| AlexNet | 240 MB | 6.9 MB | 35× | +0.03% (improved!) |
| VGGNet | 550 MB | 11.3 MB | 49× | +0.41% (improved!) |
| SqueezeNet + DC | 4.8 MB | 0.47 MB | 510× vs AlexNet | AlexNet parity |
The small accuracy gains are plausibly a regularization effect of fine-tuning the centroids: clustering forces weights to share values, which can slightly reduce overfitting on some tasks.
K-means quantization is flexible but limited: it only saves storage, not compute, because weights are decoded back to floats before any arithmetic. Linear quantization (also called affine quantization, described for integer-only inference in Jacob et al., CVPR 2018) reduces both storage and compute. Every weight is an integer. Every computation during inference is integer arithmetic. The 30× energy win becomes real.
The core idea is an affine (linear) mapping between real values $r$ and integer quantized values $q$:

$$r = S\,(q - Z)$$

where $S$ is the scale factor (a positive real number, stored as float32) and $Z$ is the zero-point (an integer of the same bitwidth as $q$). To go from real to quantized (quantization), you invert the mapping:

$$q = \operatorname{round}\!\left(\frac{r}{S} + Z\right)$$

then clamp $q$ to $[q_{\min}, q_{\max}]$. For INT8: $q \in [-128, 127]$.
The scale S sets the spacing between quantization levels — smaller S means finer grid, less rounding error, but smaller representable range. You can think of S as the physical size of one "tick mark" on your integer ruler.
The zero-point $Z$ is an integer offset that ensures the real number 0.0 is exactly representable by some integer $q$. This matters for ReLU activations (which are often zero) and for padding operations. If $Z \neq 0$, the map is asymmetric: the range $[r_{\min}, r_{\max}]$ doesn't have to be symmetric around zero.
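In symmetric quantization, you set $Z = 0$ and force the real range to be symmetric around zero: $[-|r|_{\max},\ |r|_{\max}]$. The formula simplifies to:

$$r = S\,q, \qquad q = \operatorname{round}\!\left(\frac{r}{S}\right), \qquad S = \frac{|r|_{\max}}{q_{\max}}$$

with $q_{\max} = 2^{n-1} - 1$ when the integer range is restricted to $[-q_{\max}, q_{\max}]$.

Symmetric quantization is simpler and the $Z = 0$ assumption lets you simplify the quantized matmul significantly (see Chapter 7). The downside: if the distribution is actually asymmetric (e.g., post-ReLU activations, which are all non-negative), you waste half the integer range on values that don't appear.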
Asymmetric quantization lets you fit $[r_{\min}, r_{\max}]$ exactly into $[q_{\min}, q_{\max}]$ regardless of symmetry. More accurate for skewed distributions. But $Z \neq 0$ introduces extra terms in the matmul derivation.
In per-tensor quantization, a single $S$ and $Z$ apply to all weights in a layer's tensor. Simple but risky: if one channel has much larger magnitudes than others, the shared scale stretches to accommodate it, making all other channels coarsely quantized.
In per-channel quantization, each output channel of a convolutional or linear layer has its own S and Z. This is almost always better in practice — it lets each channel adapt its scale to its own magnitude range, dramatically reducing quantization error. The hardware cost is small (you only need to store N additional floats for a layer with N output channels).
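The per-tensor versus per-channel effect is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions: the weights are random, one channel is artificially scaled by 100 to play the role of the large-magnitude channel, and symmetric INT8 with range [−127, 127] is used for simplicity.

```python
import numpy as np


def symmetric_quant_dequant(w, scale, q_max=127):
    """Quantize to integers with the given scale(s), then map back to floats."""
    q = np.clip(np.round(w / scale), -q_max, q_max)
    return q * scale


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.05, size=(4, 64))   # 4 output channels, 64 inputs each
W[0] *= 100.0                             # one channel with much larger magnitudes

per_tensor_scale = np.abs(W).max() / 127                          # one scalar
per_channel_scale = np.abs(W).max(axis=1, keepdims=True) / 127    # shape [4, 1]

# Measure error only on the three "normal" channels
mse_tensor = np.mean((W[1:] - symmetric_quant_dequant(W, per_tensor_scale)[1:]) ** 2)
mse_channel = np.mean((W[1:] - symmetric_quant_dequant(W, per_channel_scale)[1:]) ** 2)

print(f"per-tensor  MSE on small channels: {mse_tensor:.2e}")
print(f"per-channel MSE on small channels: {mse_channel:.2e}")
```

With a shared scale, the small channels collapse onto a handful of integer levels. Giving each channel its own scale restores the full 8-bit resolution for every row.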
```python
import numpy as np


def quantize_asymmetric(r, n_bits=8):
    """Asymmetric linear quantization. Returns (q, S, Z)."""
    q_min = -2**(n_bits - 1)        # -128 for INT8
    q_max = 2**(n_bits - 1) - 1     # +127 for INT8
    r_min, r_max = r.min(), r.max()
    S = (r_max - r_min) / (q_max - q_min)          # scale
    Z = np.round(q_min - r_min / S).astype(int)    # zero-point
    q = np.clip(np.round(r / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z


def dequantize(q, S, Z):
    """Reconstruct float from integer + scale + zero-point."""
    return S * (q.astype(np.float32) - Z)


# Example: quantize a small weight vector
weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12])
q, S, Z = quantize_asymmetric(weights)
r_hat = dequantize(q, S, Z)
print(f"Scale: {S:.4f}, Zero-point: {Z}")
print(f"Quantized: {q}")
print(f"Reconstructed: {np.round(r_hat, 3)}")
print(f"Max error: {np.max(np.abs(r_hat - weights)):.4f}")
```
Running this gives Scale ≈ 0.0125 (3.20 / 255) and Zero-point = −42 (round(−128 + 86.06)). Max error ≈ 0.006, far less than K-means with only 4 centroids, because here we have 256 evenly spaced levels across the full range.
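Let's derive the formulas for $S$ and $Z$ from first principles using the constraint that the real range $[r_{\min}, r_{\max}]$ must map exactly to the integer range $[q_{\min}, q_{\max}]$.

Starting from $r = S(q - Z)$, evaluate at both endpoints:

$$r_{\max} = S\,(q_{\max} - Z), \qquad r_{\min} = S\,(q_{\min} - Z)$$

Subtract the second from the first:

$$r_{\max} - r_{\min} = S\,(q_{\max} - q_{\min}) \;\;\Longrightarrow\;\; S = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}$$

For $Z$, use the second equation:

$$Z = \operatorname{round}\!\left(q_{\min} - \frac{r_{\min}}{S}\right)$$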
We round Z because it must be an integer.
The weight matrix from the slides has values ranging from $r_{\min} = -1.08$ to $r_{\max} = 2.12$. We're quantizing to 2-bit signed integers, so $q_{\min} = -2$, $q_{\max} = 1$.
Scale: S = (2.12 − (−1.08)) / (1 − (−2)) = 3.20 / 3 = 1.067.
Zero-point: Z = round(−2 − (−1.08) / 1.067) = round(−2 − (−1.012)) = round(−2 + 1.012) = round(−0.988) = −1.
Let's verify a few weights. Take w = 2.09: q = round(2.09 / 1.067 + (−1)) = round(1.958 − 1) = round(0.958) = 1. Dequantized: r̂ = 1.067 × (1 − (−1)) = 1.067 × 2 = 2.134. Error = 2.134 − 2.09 = 0.044.
Take w = −1.08: q = round(−1.08 / 1.067 + (−1)) = round(−1.012 − 1) = round(−2.012) = −2. Dequantized: r̂ = 1.067 × (−2 − (−1)) = 1.067 × (−1) = −1.067. Error = −1.067 − (−1.08) = 0.013. Very small!
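The 2-bit worked example can be replayed in a few lines. This sketch uses plain Python and full floating-point precision for S, so the dequantized values differ from the hand calculation in the third decimal place (the hand calculation rounds S to 1.067):

```python
r_min, r_max = -1.08, 2.12
q_min, q_max = -2, 1                      # 2-bit signed integers

S = (r_max - r_min) / (q_max - q_min)     # about 1.0667
Z = round(q_min - r_min / S)              # -1


def quantize(r: float) -> int:
    """Real value -> 2-bit integer code."""
    return max(q_min, min(q_max, round(r / S + Z)))


def dequantize(q: int) -> float:
    """2-bit integer code -> approximate real value."""
    return S * (q - Z)


for w in (2.09, -1.08):
    q = quantize(w)
    print(f"w={w:+.2f}  q={q:+d}  r_hat={dequantize(q):+.3f}")
```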
Now a worked example with INT8 symmetric quantization. Suppose a layer has weights with absolute maximum $|r|_{\max} = 2.12$.
Symmetric INT8 uses the $q$ range [−127, 127] or sometimes [−128, 127]. Let's use [−127, 127] for perfect symmetry.
$S = |r|_{\max} / (2^7 - 1)$ = 2.12 / 127 ≈ 0.01669. $Z = 0$.
Quantize the weight vector [2.09, −0.98, 1.48, 0.09, 0.05, −0.14, −1.08, 2.12]:
| Float r | r / S | q (INT8) | Dequant r̂ | Abs. error |
|---|---|---|---|---|
| 2.09 | 125.2 | 125 | 2.087 | 0.003 |
| −0.98 | −58.7 | −59 | −0.985 | 0.005 |
| 1.48 | 88.7 | 89 | 1.486 | 0.006 |
| 0.09 | 5.4 | 5 | 0.083 | 0.007 |
| 0.05 | 3.0 | 3 | 0.050 | 0.000 |
| −0.14 | −8.4 | −8 | −0.134 | 0.006 |
| −1.08 | −64.7 | −65 | −1.085 | 0.005 |
| 2.12 | 127.0 | 127 | 2.120 | 0.000 |
The largest error in the table is about 0.007, within the worst-case rounding bound of S/2 ≈ 0.008. That bound is about 0.4% of $|r|_{\max}$, which is excellent fidelity at INT8.
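The symmetric case can be checked with a short NumPy sketch. It assumes the restricted range [−127, 127] used in the worked example, and `quantize_symmetric` is an illustrative name:

```python
import numpy as np


def quantize_symmetric(r, n_bits=8):
    """Symmetric linear quantization (Z = 0). Returns (q, S)."""
    q_max = 2 ** (n_bits - 1) - 1              # 127 for INT8
    S = np.abs(r).max() / q_max                # scale from the absolute maximum
    q = np.clip(np.round(r / S), -q_max, q_max).astype(np.int8)
    return q, S


weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12])
q, S = quantize_symmetric(weights)
r_hat = S * q.astype(np.float64)               # dequantize: r = S * q

print(f"Scale: {S:.5f}")
print(f"Quantized: {q}")
print(f"Max error: {np.abs(r_hat - weights).max():.4f}")
```

Because Z is zero, dequantization is a single multiply, and the rounding error of any unclipped value stays below S/2.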
Adjust the real value range and bitwidth. The canvas shows the quantization grid, computed S and Z, and the per-value rounding error histogram.
The payoff of linear quantization is that you can do the entire forward pass in integer arithmetic — no floating-point operations until the very last rescale step. Let's derive exactly how a quantized matmul works.
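Consider Y = W × X. Each element is a dot product y = w · x. Every value obeys $r = S(q - Z)$, so (with the sum over the dot-product index left implicit):

$$S_Y\,(q_Y - Z_Y) = S_W\,(q_W - Z_W)\cdot S_X\,(q_X - Z_X)$$

Solve for $q_Y$:

$$q_Y = \frac{S_W S_X}{S_Y}\,(q_W - Z_W)(q_X - Z_X) + Z_Y$$

Expand the product:

$$q_Y = \frac{S_W S_X}{S_Y}\,\bigl(q_W q_X - Z_W q_X - Z_X q_W + Z_W Z_X\bigr) + Z_Y$$

This looks complicated, but look at which terms can be precomputed. $Z_W Z_X$ is constant (doesn't change across inputs). $Z_W q_X$ depends on the input, not the weight values, so it must be computed at run time. $Z_X q_W$ depends only on the weights and the fixed activation zero-point, not the current input. The only term that needs a full multiply-accumulate over weights and inputs is $q_W q_X$.

If we use symmetric quantization for weights, $Z_W = 0$. The formula simplifies dramatically:

$$q_Y = \frac{S_W S_X}{S_Y}\,\bigl(q_W q_X - Z_X q_W\bigr) + Z_Y$$

Now $q_W q_X$ is a pure N-bit integer multiplication (each INT8 × INT8 product fits in INT16, and the running sum is kept in an INT32 accumulator). The $Z_X q_W$ correction term is $Z_X$ times the sum of each output row of W, so it can be precomputed before inference. The scale ratio $S_W S_X / S_Y$ is a float that gets applied once at the end: one floating-point multiplication per output element, not one per multiply-accumulate.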
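A fully-connected layer computes Y = WX + b. With quantized bias ($Z_b = 0$, $S_b = S_W S_X$), and using symmetric W ($Z_W = 0$):

$$q_Y = \frac{S_W S_X}{S_Y}\,\bigl(q_W q_X + q_b - Z_X q_W\bigr) + Z_Y = \frac{S_W S_X}{S_Y}\,\bigl(q_W q_X + q_{\text{bias}}\bigr) + Z_Y$$

where $q_{\text{bias}} = q_b - Z_X q_W$ is precomputed at model-load time. Everything in the parentheses is 32-bit integer accumulation. The scale multiply and $+Z_Y$ at the end bring the output back to INT8 range.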
```python
import numpy as np


def quantized_matmul(W_int8, X_int8, S_W, S_X, S_Y, Z_X, Z_Y, bias_int32):
    """
    Integer-arithmetic matmul. W is symmetric (Z_W = 0).
    W_int8: [out, in], X_int8: [in, batch].
    bias_int32: q_b with S_b = S_W * S_X and Z_b = 0, broadcastable to [out, batch].
    Returns q_Y as int8.
    """
    # Step 1: INT8 x INT8 matmul, accumulated in int32
    acc = W_int8.astype(np.int32) @ X_int8.astype(np.int32)   # [out, batch]

    # Step 2: subtract Z_X * (row sums of W); this can be precomputed per output
    sum_W = W_int8.astype(np.int32).sum(axis=1, keepdims=True)
    acc -= Z_X * sum_W

    # Step 3: add the int32 bias q_b (steps 2 and 3 together form q_bias)
    acc += bias_int32

    # Step 4: rescale by S_W * S_X / S_Y (float), add Z_Y, round, clamp to INT8
    scale_ratio = (S_W * S_X) / S_Y
    q_Y = np.round(acc * scale_ratio + Z_Y)
    q_Y = np.clip(q_Y, -128, 127).astype(np.int8)
    return q_Y
```
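As a sanity check on the derivation, the sketch below compares the integer pipeline against an ordinary float matmul on random data. Several things here are assumptions made for illustration: the data is synthetic, there is no bias, the helper `affine_params` is hypothetical, and the output scale is taken from the float result itself, whereas a real deployment would obtain it from calibration data ahead of time.

```python
import numpy as np


def affine_params(r, q_min=-128, q_max=127):
    """Scale and zero-point for a range that is widened to include 0.0."""
    r_min, r_max = min(r.min(), 0.0), max(r.max(), 0.0)
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(np.round(q_min - r_min / S))
    return S, Z


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(4, 16))    # float weights [out, in]
X = rng.uniform(0.0, 4.0, size=(16, 3))   # non-negative activations [in, batch]
Y = W @ X                                 # float reference

# Symmetric weights (Z_W = 0); asymmetric activations and outputs
S_W = np.abs(W).max() / 127
q_W = np.clip(np.round(W / S_W), -127, 127).astype(np.int32)
S_X, Z_X = affine_params(X)
S_Y, Z_Y = affine_params(Y)               # assumption: output range known in advance
q_X = np.clip(np.round(X / S_X + Z_X), -128, 127).astype(np.int32)

# Integer accumulation: q_W q_X - Z_X * (row sums of q_W)
acc = q_W @ q_X - Z_X * q_W.sum(axis=1, keepdims=True)

# Single float rescale per output element, then back to INT8
q_Y = np.clip(np.round(acc * (S_W * S_X / S_Y) + Z_Y), -128, 127).astype(np.int8)

Y_hat = S_Y * (q_Y.astype(np.float64) - Z_Y)   # dequantize for comparison
max_err = np.abs(Y_hat - Y).max()
print(f"output range: [{Y.min():.2f}, {Y.max():.2f}], max abs error: {max_err:.3f}")
```

The error that remains comes from three rounding steps (weights, activations, outputs), not from the integer arithmetic, which is exact.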
This is the payoff chapter. The explorer below gives you direct control over everything we've built: the weight distribution, the quantization format, the bitwidth, and symmetric vs. asymmetric mode. Watch how the quantization grid, the reconstruction error, and the MSE change as you explore the parameter space.
Try the outlier experiment: enable "Add outlier" to inject one weight at ±8.0 — far outside the normal range. See how it blows up the per-tensor scale and ruins all other weights. Then switch to per-channel (per-element here) mode to see the fix.
The top panel shows float weights (blue) and quantized+dequantized values (orange) on the same axis. The bottom panel shows the error bar for each weight. MSE updates live.
| Format | Bits | Sign | Exp | Mantissa | Max Value | Best For |
|---|---|---|---|---|---|---|
| FP32 | 32 | 1 | 8 | 23 | ~$3.4\times10^{38}$ | Training (reference) |
| FP16 | 16 | 1 | 5 | 10 | 65,504 | Training (beware overflow) |
| BF16 | 16 | 1 | 8 | 7 | ~$3.4\times10^{38}$ | Training (drop-in FP32) |
| INT8 | 8 | sign bit | — | — | 127 (with scale) | Inference (energy win) |
| INT4 | 4 | sign bit | — | — | 7 (with scale) | LLM inference (aggressive) |
| Quantity | Formula | Notes |
|---|---|---|
| Scale S | $S = (r_{\max} - r_{\min}) / (q_{\max} - q_{\min})$ | Maps float range to int range |
| Zero-point Z | $Z = \operatorname{round}(q_{\min} - r_{\min} / S)$ | Integer; ensures 0.0 is exact |
| Quantize | $q = \operatorname{clip}(\operatorname{round}(r/S + Z),\ q_{\min},\ q_{\max})$ | Float → int |
| Dequantize | $\hat{r} = S \times (q - Z)$ | Int → float (approximation) |
| Symmetric S | $S = \lvert r \rvert_{\max} / (2^{n-1} - 1)$, $Z = 0$ | Simpler matmul; wastes range if skewed |
| Matmul output | $q_Y = (S_W S_X / S_Y)(q_W q_X - Z_X q_W) + Z_Y$ | $Z_W = 0$ (symmetric weights) |
| Property | K-Means Quant | Linear Quant |
|---|---|---|
| Grid type | Non-uniform (cluster-optimal) | Uniform (evenly spaced) |
| Storage | b-bit indices + float codebook | b-bit integers + S, Z |
| Compute | Float arithmetic (after lookup) | Integer arithmetic |
| Hardware fit | Needs codebook lookup at runtime | Runs on INT8 tensor cores |
| Fine-tuning | Centroid gradient SGD | Quantization-Aware Training (QAT) |
| Best for | Storage-only compression (Deep Compression) | Inference acceleration (TFLite, TRT) |
This lesson covered the mechanics of quantization — what the mapping is and why it works. Lecture 6 covers how to calibrate and recover accuracy: