Quantization I: Representing Networks in Fewer Bits

Chapter 0: The 28 GB Problem

LLaMA-2 7B has seven billion parameters. At the default training precision, 32-bit floats at four bytes per parameter, the weights alone occupy 28 gigabytes. The newest iPhone has 8 GB of RAM. Even a consumer GPU with 24 GB cannot hold the model.

Now consider this: an 8-bit integer takes one byte. If you could represent every weight as an integer instead of a float, the model would shrink to 7 GB — still large, but now accessible to a high-end phone. Cut to 4 bits and you're at 3.5 GB. At 2 bits, 1.75 GB. The savings are enormous. The question is whether the math can survive the reduction.
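The storage arithmetic above is easy to reproduce. The sketch below is a minimal illustration, assuming 1 GB = 10⁹ bytes and ignoring any per-tensor metadata such as scales or codebooks; the helper name `model_size_gb` is made up for this example.

```python
def model_size_gb(n_params: float, bits_per_param: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring scales and codebooks."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model at several bitwidths
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2} bits: {model_size_gb(7e9, bits):6.2f} GB")
```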

The quantization bargain: You're buying memory and speed with a small amount of accuracy. In 45nm CMOS, an 8-bit integer ADD costs about 30× less energy than a 32-bit float ADD (0.03 pJ vs 0.9 pJ). The goal is to lose the minimum accuracy for the maximum compression.

Here's the energy hierarchy from Horowitz (ISSCC 2014) that motivates everything in this lecture. The numbers are for 45nm CMOS at 0.9V:

Operation	Energy (pJ)	Relative to INT8 ADD
8-bit INT ADD	0.03	1×
32-bit INT ADD	0.1	3×
16-bit FP ADD	0.4	13×
32-bit FP ADD	0.9	30×
8-bit INT MULT	0.2	7×
32-bit INT MULT	3.1	103×
16-bit FP MULT	1.1	37×
32-bit FP MULT	3.7	123×

Switching from FP32 to INT8 saves about 30× energy per addition. On a mobile device doing millions of operations per second, that difference is the gap between an app that's usable for 2 hours and one that drains the battery in 20 minutes.

What quantization actually does: It constrains the continuous distribution of weight values to a finite discrete set. Instead of 2³² possible float values, each weight can only take one of 256 (INT8) or 16 (INT4) values. The art is in choosing which discrete values to use — and how to recover from the approximation.

This lesson covers two approaches and points ahead to a third. K-means quantization finds a codebook by clustering: you choose k centroids and every weight snaps to the nearest one. Linear (affine) quantization uses a uniform grid defined by a scale and zero-point; this is the approach that enables true integer arithmetic during inference. Binary and ternary quantization (the extreme case, only ±1 or {−1, 0, +1}) is covered in Lecture 6.

Ch 1–2: Data Types

Integers, fixed-point, IEEE floats (FP32/FP16/BF16) — bit layouts and the range/precision tradeoff

↓

Ch 3–4: K-Means Quant

Weight clustering with a codebook; Deep Compression storage math; centroid fine-tuning

↓

Ch 5–7: Linear Quant

Affine mapping r = S(q − Z); derive S and Z; symmetric vs asymmetric; quantized matmul

↓

Ch 8: Showcase

Full quantization explorer: drag weights, choose format, watch error, see integer arithmetic

↓

Ch 9: Connections

Cheat sheet of all formats and formulas; bridge to PTQ/QAT (L6)

Model Size vs Precision — see the storage savings in real time

Drag the bitwidth slider and watch how a 7B-parameter model shrinks. The green bar shows the iPhone 15 Pro's available RAM (8 GB).


A 7B-parameter model in FP32 uses 28 GB. What is the approximate size in INT4 (4-bit integer)?

14 GB: INT4 is half the bits of INT8, so half of FP32.
7 GB: INT4 is a quarter of the bits of FP32.
3.5 GB: INT4 = 4 bits = 1/8 of FP32's 32 bits, so 28 GB / 8 = 3.5 GB.
1.75 GB: INT4 quantization also removes redundant parameters.

Chapter 1: Numeric Data Types

Before you can quantize a weight, you need to understand what you're starting from — and what you're quantizing to. Every number in a neural network is stored in a specific format, and the format determines two competing properties: dynamic range (how large or small a number can be) and precision (how finely you can distinguish between nearby numbers).

Integers

An unsigned integer with n bits can represent values from 0 to 2ⁿ − 1. An 8-bit unsigned integer (uint8) covers [0, 255]. A signed integer using two's complement covers [−2ⁿ⁻¹, 2ⁿ⁻¹ − 1]. For 8-bit (INT8): [−128, 127].

Two's complement is the standard. The value of an n-bit two's complement number with bits b_{n−1} b_{n−2} … b_0 is:

value = −b_{n−1} · 2^{n−1} + ∑_{i=0}^{n−2} b_i · 2^i

Example: the 8-bit pattern 11001111 = −1×2⁷ + 1×2⁶ + 1×2³ + 1×2² + 1×2¹ + 1×2⁰ = −128 + 64 + 8 + 4 + 2 + 1 = −49.
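The formula can be checked with a few lines of Python. This is a small sketch; the function name `twos_complement_value` is chosen here for illustration, and the input is assumed to be a string of '0'/'1' characters with the most significant bit first.

```python
def twos_complement_value(bits: str) -> int:
    """Evaluate an n-bit two's complement pattern given as a '0'/'1' string (MSB first)."""
    n = len(bits)
    # The top bit carries a negative weight of 2^(n-1)
    value = -int(bits[0]) * 2 ** (n - 1)
    # The remaining bits carry the usual positive weights 2^i
    for i, b in enumerate(reversed(bits[1:])):
        value += int(b) * 2 ** i
    return value

print(twos_complement_value("11001111"))
```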

Fixed-Point Numbers

Fixed-point takes a regular integer and places an implicit binary point somewhere in the bit pattern. You're using the same hardware integer instructions, but you agree in advance to interpret the result as scaled by some factor 2^−f where f is the number of fractional bits.

Example: an 8-bit number 00110001 = 49 as an integer. As fixed-point with 4 fractional bits, we interpret it as 49 × 2⁻⁴ = 49 × 0.0625 = 3.0625. Fixed-point DSPs use this constantly — it's fast, deterministic, and requires no special hardware. The downside: the scale is fixed, so you can't represent both very large and very small values simultaneously.
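A minimal sketch of the fixed-point interpretation, assuming a signed 8-bit container that saturates on overflow. The helper names `fixed_to_real` and `real_to_fixed` are illustrative, not from any particular DSP toolchain.

```python
def fixed_to_real(raw: int, frac_bits: int) -> float:
    """Interpret an integer as fixed-point with `frac_bits` fractional bits."""
    return raw * 2.0 ** -frac_bits

def real_to_fixed(x: float, frac_bits: int, total_bits: int = 8) -> int:
    """Round a real number to the nearest fixed-point code, saturating at the integer limits."""
    raw = round(x * 2 ** frac_bits)
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return max(lo, min(hi, raw))

print(fixed_to_real(0b00110001, frac_bits=4))   # the worked example: 49 scaled by 2^-4
print(real_to_fixed(100.0, frac_bits=4))        # too large for 4 integer bits, so it saturates
```

The second call shows the fixed-scale limitation: with 4 fractional bits, a signed 8-bit code can only reach a little under 8.0.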

The Key Tradeoff: Range vs. Precision

The fundamental tension: More bits in the exponent field → wider dynamic range (you can represent both tiny and huge numbers). More bits in the mantissa/fraction field → finer precision (you can distinguish nearby values). You have a fixed bit budget. Every bit you give to the exponent costs a mantissa bit, and vice versa. This tradeoff drives the design of every floating-point format from FP32 to BF16 to FP8.

Format	Sign	Exponent	Mantissa	Total	Notes
FP32	1	8	23	32	Training standard. Full range + precision.
FP16	1	5	10	16	Smaller range than FP32 (max ≈ 65504). Overflow risk.
BF16	1	8	7	16	Same exponent as FP32 → same range. Less precision.
INT8	1 (sign bit)	—	—	8	Uniform grid. Needs scale to represent real values.
INT4	1 (sign bit)	—	—	4	Only 16 values. Extreme compression, challenging accuracy.

Why does BF16 matter? Because it was designed specifically to drop-in replace FP32 in training. It has the same 8-bit exponent, meaning it can represent values from ≈10⁻³⁸ to ≈10³⁸ — the same range as FP32. You just lose precision (7 mantissa bits instead of 23). For the stochastic nature of gradient updates, that precision loss is usually acceptable.

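FP16, on the other hand, cuts the exponent to 5 bits, which limits the normal range to [≈6×10⁻⁵, 65504]. Large activations or loss values can exceed 65504 and overflow to infinity, while small gradient values can fall below the representable range and flush to zero. The "loss scaling" trick in mixed-precision training exists because of this narrow range: the loss is multiplied by a constant before backpropagation so that small gradients stay representable, and the gradients are divided by the same constant before the weight update.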

Common misconception — BF16 is not just "smaller FP16": BF16 and FP16 are both 16-bit formats, but they make completely different tradeoffs. FP16 sacrifices range to gain precision (10 mantissa bits vs 7). BF16 sacrifices precision to preserve range (8 exponent bits, matching FP32). For training stability, BF16 is usually safer. For certain inference tasks where you need fine-grained small-number representation (e.g., attention softmax probabilities), FP16 can be better.
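The range difference is easy to observe with NumPy. NumPy has no native BF16 type, so the sketch below emulates BF16 by keeping only the top 16 bits of the FP32 pattern; that is truncation rather than round-to-nearest, an assumption made to keep the example short. The helper name `to_bf16_truncate` is hypothetical.

```python
import numpy as np

fp16 = np.finfo(np.float16)
print("FP16 max:", float(fp16.max))              # largest finite FP16 value
print("FP16 smallest normal:", float(fp16.tiny))

with np.errstate(all="ignore"):                  # silence overflow/underflow warnings
    big = np.array([70000.0], dtype=np.float32).astype(np.float16)
    small = np.array([1e-8], dtype=np.float32).astype(np.float16)
print("70000 in FP16:", big[0])                  # overflows
print("1e-8 in FP16:", small[0])                 # underflows

def to_bf16_truncate(x):
    """Emulate BF16 by zeroing the low 16 bits of the FP32 pattern (truncation, no rounding)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

big_bf16 = to_bf16_truncate(np.array([70000.0]))
print("70000 in emulated BF16:", big_bf16[0])    # finite, but coarsely rounded
```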

BF16 uses 8 exponent bits, the same as FP32. Why was this design decision made for training deep networks?

Because 8-bit exponents are faster to compute than 5-bit exponents in hardware.
Because gradients and activations during training span a wide dynamic range; FP16's 5-bit exponent can overflow on large gradients, while BF16's 8-bit exponent covers the same range as FP32.
Because BF16 was designed for inference on edge devices, where wide range is needed.
Because 7 mantissa bits gives higher precision than 10 mantissa bits.

Chapter 2: Float Bit Layout Explorer

The IEEE 754 floating-point standard encodes a number as three fields: a sign bit, an exponent, and a mantissa (also called the fraction or significand). For FP32 the formula is:

value = (−1)^sign × (1 + mantissa) × 2^{exponent − bias}

The "1 +" is the hidden bit — IEEE 754 assumes the leading 1 of a normalized number is implicit, so you don't need to store it. The bias for FP32 is 127 (= 2⁸⁻¹ − 1). This lets the 8-bit exponent represent both negative exponents (for tiny numbers) and positive exponents (for large numbers), centered around zero.

Worked Example: Encoding 0.265625 in FP32

Step 1: Write the number in binary scientific notation.

0.265625 = 1.0625 × 2⁻² = (1 + 0.0625) × 2⁻²

Step 2: Identify the fields. Sign = 0 (positive). Exponent = −2 + 127 = 125 = 01111101. Mantissa = 0.0625 = 2⁻⁴ = 00010000000000000000000.

Step 3: Concatenate. The full 32-bit pattern is: 0 01111101 00010000000000000000000.
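The three fields can be read back from the actual FP32 encoding using Python's standard `struct` module. This is a verification sketch; the helper name `fp32_fields` is chosen for this example.

```python
import struct

def fp32_fields(x: float):
    """Return (sign, exponent field, mantissa field) of x encoded as IEEE 754 FP32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits, still biased by 127
    mantissa = bits & 0x7FFFFF          # 23 stored fraction bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fp32_fields(0.265625)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
```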

Worked Example: Decoding an FP16 Number

FP16 has 5 exponent bits and 10 mantissa bits. The bias is 2⁵⁻¹ − 1 = 15. Given the bit pattern 1100011100000000:

• Sign = 1 (negative).

• Exponent = 10001₂ = 17. Actual exponent = 17 − 15 = 2.

• Mantissa = 1100000000₂ = 2⁻¹ + 2⁻² = 0.5 + 0.25 = 0.75.

• Value = −(1 + 0.75) × 2² = −1.75 × 4 = −7.0.

Subnormal numbers: When the exponent field is all zeros, the hidden "1+" disappears and the formula becomes (−1)^sign × (0 + mantissa) × 2^{1 − bias}. Subnormals allow gradual underflow: they let you represent numbers smaller than the smallest normal number, at the cost of precision. The smallest positive FP32 subnormal is 2⁻¹⁴⁹; the smallest positive normal is 2⁻¹²⁶. Special cases: all-one exponent with zero mantissa = ±∞; all-one exponent with nonzero mantissa = NaN.
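The FP16 decoding steps, together with the subnormal and special-case rules, fit in one short function. The sketch below decodes a 16-character bit string by hand and cross-checks the result against NumPy's own `float16` interpretation; the function name `decode_fp16` is illustrative.

```python
import numpy as np

def decode_fp16(bits: str) -> float:
    """Decode a 16-bit IEEE 754 half-precision pattern given as a '0'/'1' string."""
    bits = bits.replace(" ", "")
    sign = int(bits[0])
    exp_field = int(bits[1:6], 2)          # 5 exponent bits, bias 15
    frac = int(bits[6:], 2) / 2 ** 10      # 10 fraction bits as a value in [0, 1)
    if exp_field == 0:                     # subnormal: no hidden 1, exponent fixed at 1 - bias
        magnitude = frac * 2.0 ** (1 - 15)
    elif exp_field == 31:                  # all-ones exponent: infinity or NaN
        magnitude = float("inf") if frac == 0 else float("nan")
    else:                                  # normal number with hidden leading 1
        magnitude = (1 + frac) * 2.0 ** (exp_field - 15)
    return -magnitude if sign else magnitude

pattern = "1100011100000000"
by_hand = decode_fp16(pattern)
by_numpy = float(np.array([int(pattern, 2)], dtype=np.uint16).view(np.float16)[0])
print(by_hand, by_numpy)
```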

Float Bit Layout Explorer — toggle format and enter a decimal value

Select a format and type a decimal value. The canvas shows the exact bit fields, encoded value, and the representable grid spacing near that value. The grid spacing tells you the precision you have.


The FP16 bit pattern 0 10000 1100000000 (sign=0, exp=10000, mantissa=1100000000). What decimal value does this encode? (FP16 bias = 15)

+3.0: not quite. Exponent = 16 − 15 = 1 and mantissa = 0.75, so the value is 1.75 × 2¹ = 3.5, not 3.0.
+3.5: sign = 0 (+), exponent = 16 − 15 = 1, mantissa = 2⁻¹ + 2⁻² = 0.75, value = (1 + 0.75) × 2¹ = 1.75 × 2 = 3.5.
+7.0: because exp=2 and mantissa=0.75 gives 1.75 × 4 = 7.
+1.75: just read the mantissa directly.

Chapter 3: K-Means Weight Quantization

Here is the central observation of K-means quantization: the weight values of a trained neural network are not uniformly distributed. They cluster. Typical convolutional layers have bimodal or multimodal distributions — groups of weights near certain recurring values like −1, 0, +0.5, +1.5. If you can identify those clusters and replace every weight in a cluster with a single representative value, you can store each weight as a short cluster index instead of a full 32-bit float.

The approach is weight clustering, introduced in Deep Compression (Han et al., ICLR 2016). You run k-means on all the weights of a layer to find k centroids, then replace every weight with the index of its nearest centroid. During storage you keep only the indices (short integers) and the centroid values (a small codebook of k floats).

The codebook metaphor: Think of the centroids as the only allowed paint colors. You have, say, 16 colors. Every pixel in your image gets replaced by the nearest available color. The painting is stored as pixel indices (4 bits each) plus a 16-entry color table. The quality depends on how well the 16 colors were chosen, and k-means picks a set that (locally) minimizes the squared distance between each value and its assigned color.

Worked Example from Deep Compression

Consider a 4×4 weight matrix with 16 weights (shown in the canvas). K-means with k=4 finds four centroids. Let's say they are: c₀ = −1.00, c₁ = 0.00, c₂ = 1.50, c₃ = 2.00.

The weight 2.09 snaps to centroid 3 (2.00). The weight −0.98 snaps to centroid 0 (−1.00). And so on. Now instead of storing 16 float32 values (64 bytes), you store:

16 cluster indices at 2 bits each = 32 bits = 4 bytes
4 centroids at 32 bits each = 128 bits = 16 bytes
Total: 20 bytes vs. 64 bytes → 3.2× compression

The quantization error for this particular example is small: for the weight 2.09 it is 2.09 − 2.00 = 0.09, and the other weights are off by comparable amounts. The model can tolerate these errors because neural networks are over-parameterized and robust to small perturbations.
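Here is a rough sketch of codebook quantization on a handful of weights. It runs plain Lloyd's k-means on scalars with linearly spaced initial centroids; the 8-weight sample, the function name `kmeans_1d`, and the iteration count are assumptions made for illustration, not the exact setup of Deep Compression.

```python
import numpy as np

def kmeans_1d(weights, k, n_iters=20):
    """Lloyd's algorithm on scalar weights, with linearly spaced initial centroids."""
    w = np.asarray(weights, dtype=np.float32).ravel()
    centroids = np.linspace(w.min(), w.max(), k).astype(np.float32)
    for _ in range(n_iters):
        # Assignment step: nearest centroid for every weight
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), centroids

weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12])
idx, codebook = kmeans_1d(weights, k=4)
w_hat = codebook[idx]                      # decode: look each index up in the codebook
print("indices: ", idx)
print("codebook:", np.round(codebook, 3))
print("max abs error:", np.abs(w_hat - weights).max())
```

Only `idx` (2 bits per weight when k = 4) and `codebook` need to be stored; `w_hat` is rebuilt by lookup whenever the weights are needed.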

K-means quantization only saves storage, not compute: At inference time, you must look up each index in the codebook to get a float centroid, then do the matrix multiplication in floating-point. The indices are integers but the math is still floating-point. This is the key limitation vs. linear quantization (Chapter 5), which enables true integer arithmetic during computation — the 30× energy win from the table in Chapter 0.

Fine-Tuning the Centroids

After snapping every weight to its centroid, you have quantization error that slightly degrades accuracy. You can recover it by fine-tuning the centroids. Here's the trick: keep the cluster assignments fixed. Run a forward pass to get the loss. Backpropagate to get gradients with respect to each weight. But instead of updating individual weights, group the gradients by cluster and add up all the gradients that belong to centroid k. That sum is the gradient for centroid k. Update each centroid by its summed gradient.

Δc_k = −η · ∑_{i: cluster(i)=k} ∂L / ∂w_i

This is elegant: you're doing standard SGD, but with shared parameters. All weights in the same cluster move together because they are the same parameter (the centroid). After fine-tuning, the centroids settle into positions that minimize loss, not just reconstruction error — they can drift away from the original k-means positions if that's what the loss surface demands.
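The grouped-gradient update can be written with a single `np.bincount` call. This is a sketch under the assumption that per-weight gradients have already been computed by backpropagation; the gradient values below are made up, and `centroid_step` is a hypothetical helper name.

```python
import numpy as np

def centroid_step(codebook, idx, grad_w, lr):
    """One SGD step on the centroids: sum per-weight gradients within each cluster."""
    grad_c = np.bincount(idx.ravel(), weights=grad_w.ravel(), minlength=len(codebook))
    return codebook - lr * grad_c

codebook = np.array([-1.0, 0.0, 1.5, 2.0])
idx = np.array([3, 0, 3, 1])                 # cluster assignment of each weight (kept fixed)
grad_w = np.array([0.1, -0.2, 0.3, 0.4])     # dL/dw for each weight (illustrative values)
new_codebook = centroid_step(codebook, idx, grad_w, lr=0.1)
print(new_codebook)
```

Centroid 3 receives the sum of two gradients (0.1 + 0.3), and centroid 2, which has no members in this toy example, does not move.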

K-Means Weight Clustering — run k-means on a random weight distribution

Each dot is a weight value. Choose k centroids (the colored vertical lines). Click "Run K-Means" to find the optimal cluster assignments. The compression ratio and reconstruction error update live.


After K-means weight quantization with k=4 centroids on a 16-weight layer, how does inference work? Which part uses integer arithmetic?

The indices are multiplied together as integers, replacing the float multiplications entirely.
Both storage and computation use integers; that's the whole point of quantization.
The weights are stored as 2-bit indices, but at runtime they're decoded back to float32 centroids via a lookup table; all math stays floating-point.

Chapter 4: Deep Compression: The Storage Math

Let's derive exactly how much compression K-means quantization provides and verify it on real numbers. This is the calculation from Han et al., ICLR 2016.

The General Formula

Suppose a layer has M weight parameters, each stored as a 32-bit float. Original storage = 32M bits. After K-means quantization with k centroids (where k = 2^b for b-bit indices):

Indices: M weights × b bits per index = Mb bits
Codebook: k centroids × 32 bits per float = 32k = 32 · 2^b bits
Total: Mb + 32 · 2^b bits

When M is large (M >> 2^b, i.e., many more weights than centroids), the codebook is negligible and the compression ratio is approximately:

compression ratio ≈ 32M / (Mb) = 32/b

So 2-bit quantization (k=4 centroids) gives ≈ 16× compression; 4-bit (k=16) gives ≈ 8×; 8-bit (k=256) gives ≈ 4×.

Worked Numbers: The 4×4 Layer from Chapter 3

Our example: M = 16 weights, k = 4 centroids (b = 2 bits).

Original: 16 × 32 = 512 bits = 64 bytes
Indices: 16 × 2 = 32 bits = 4 bytes
Codebook: 4 × 32 = 128 bits = 16 bytes
Total: 4 + 16 = 20 bytes
Compression: 64 / 20 = 3.2×

With M=16 the codebook is still significant (16 bytes out of 20 total). As layers get larger — say M = 1,000,000 weights with b=4 bits — the codebook (16 × 4 = 64 bytes) is negligible, and compression is ~8×.
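The storage formula is simple enough to tabulate directly. The sketch below assumes k = 2^b centroids stored as 32-bit floats, as in the text; the function names are chosen for this example.

```python
def kmeans_storage_bits(n_weights: int, index_bits: int, float_bits: int = 32) -> int:
    """Bits needed for the b-bit indices plus a codebook of 2^b float centroids."""
    indices = n_weights * index_bits
    codebook = (2 ** index_bits) * float_bits
    return indices + codebook

def compression_ratio(n_weights: int, index_bits: int, float_bits: int = 32) -> float:
    """Original dense float storage divided by the quantized storage."""
    return n_weights * float_bits / kmeans_storage_bits(n_weights, index_bits, float_bits)

for m, b in [(16, 2), (100_000, 4), (1_000_000, 4)]:
    print(f"M={m:>9,} b={b}: {compression_ratio(m, b):.2f}x")
```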

AlexNet Results from Deep Compression

Deep Compression combined pruning (removing ~90% of weights) + quantization (4–8 bits) + Huffman coding. Results on real models:

Network	Original	Compressed	Ratio	Accuracy
AlexNet	240 MB	6.9 MB	35×	+0.03% (improved!)
VGGNet	550 MB	11.3 MB	49×	+0.41% (improved!)
SqueezeNet + DC	4.8 MB	0.47 MB	510× vs AlexNet	AlexNet parity

One plausible explanation for the accuracy improvements is that fine-tuning the centroids acts as a form of regularization: clustering forces weights to share values, which can slightly reduce overfitting on some tasks.

Why Huffman coding adds more on top: After K-means quantization, the cluster indices are not uniformly distributed. Weights near zero (index 1 = centroid ≈ 0) are much more frequent than weights near large values. Huffman coding exploits this: frequent symbols get shorter codes. The result is an additional 20–30% compression on top of K-means alone. Combined, Deep Compression achieves 35–49× on AlexNet/VGGNet vs. the original dense FP32 weights.
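To see why a skewed index distribution helps, you can compare the fixed index width with the Shannon entropy of the indices, which is a lower bound on the average length of any lossless symbol code (a Huffman code gets within one bit of it). The distribution below is invented for illustration and is not taken from the paper.

```python
import numpy as np

def index_entropy_bits(idx, k):
    """Shannon entropy (bits per index) of the empirical cluster-index distribution."""
    counts = np.bincount(idx, minlength=k).astype(np.float64)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical 2-bit indices: 70% of weights fall in the near-zero cluster (index 1)
skewed = np.array([1] * 70 + [0] * 10 + [2] * 10 + [3] * 10)
uniform = np.array([0, 1, 2, 3] * 25)

print("uniform:", index_entropy_bits(uniform, 4), "bits per index")
print("skewed: ", round(index_entropy_bits(skewed, 4), 3), "bits per index")
```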

A layer has M = 100,000 weights currently stored as FP32. You apply K-means quantization with k = 16 centroids (4-bit indices). Approximately how large is the compressed representation, and what is the compression ratio?

~6.25 KB, 8× compression. Indices: 100k × 4 bits = 400 kbits = 50 KB. Codebook: 16 × 32 = 512 bits ≈ 0.064 KB. Total ≈ 50 KB. Original: 100k × 32 = 400 KB. Ratio ≈ 8×. ~50 KB, 8× compression. Indices: 100k × 4 bits = 400 kbits = 50 KB. Codebook: 16 × 32 bits ≈ 0.064 KB (negligible). Total ≈ 50 KB. Original: 400 KB. Ratio = 400/50 = 8×. ~25 KB, 16× compression. Because k=16 means 16× smaller. ~100 KB, 4× compression. Because 4-bit = 1/4 of 16-bit.

Chapter 5: Linear / Affine Quantization

K-means quantization is flexible but limited: it only saves storage, not compute. Linear quantization (also called affine quantization, introduced in Jacob et al., CVPR 2018) solves both problems. Every weight is an integer. Every computation during inference is integer arithmetic. The 30× energy win becomes real.

The core idea is an affine (linear) mapping between real values r and integer quantized values q:

r = S · (q − Z)

where S is the scale factor (a positive real number, stored as float32) and Z is the zero-point (an integer of the same bitwidth as q). To go from real to quantized (quantization), you invert the mapping:

q = round(r / S + Z)

then clamp q to [q_min, q_max]. For INT8: q ∈ [−128, 127].

What S and Z Do

The scale S sets the spacing between quantization levels — smaller S means finer grid, less rounding error, but smaller representable range. You can think of S as the physical size of one "tick mark" on your integer ruler.

The zero-point Z is an integer offset that ensures the real number 0.0 is exactly representable by some integer q. This matters for ReLU activations (which are often zero) and for padding operations. If Z ≠ 0, the map is asymmetric — the range [r_min, r_max] doesn't have to be symmetric around zero.

Misconception — quantization isn't just truncation: A naive approach would be: "INT8 stores integers, so just chop the decimals off." That's pure truncation — it ignores the actual range of the weights. A weight of 1.37 truncates to 1 in integer format. But what about a weight of 155.7? That overflows INT8's range [−128, 127]. And a weight of 0.003? It maps to 0, losing all information. The scale factor S stretches or compresses the entire weight distribution to fit inside the integer range. Without it, most of the INT8 range is wasted.

Symmetric vs. Asymmetric Quantization

In symmetric quantization, you set Z = 0 and force the real range to be symmetric around zero: [−|r|_max, |r|_max]. The formula simplifies to:

r = S · q (Z = 0)

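S = |r|_max / (2ⁿ⁻¹ − 1)

This is the restricted-range form, with q ∈ [−(2ⁿ⁻¹ − 1), 2ⁿ⁻¹ − 1], which is the one used in the INT8 worked example in Chapter 6. The full-range variant divides by 2ⁿ⁻¹ instead and clamps the largest positive value to 2ⁿ⁻¹ − 1.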

Symmetric quantization is simpler and the Z=0 assumption lets you simplify the quantized matmul significantly (see Chapter 7). The downside: if the distribution is actually asymmetric (e.g., activations after a ReLU, which are never negative), you waste half the integer range on values that don't appear.

Asymmetric quantization lets you fit [r_min, r_max] exactly into [q_min, q_max] regardless of symmetry. More accurate for skewed distributions. But Z ≠ 0 introduces extra terms in the matmul derivation.
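A minimal sketch of symmetric quantization, assuming the restricted integer range [−127, 127] for INT8 so that the grid is exactly symmetric. The function name `quantize_symmetric` and the sample weights are illustrative.

```python
import numpy as np

def quantize_symmetric(r, n_bits=8):
    """Symmetric linear quantization with Z = 0. Returns (q, S)."""
    q_max = 2 ** (n_bits - 1) - 1                 # 127 for INT8 (restricted range)
    S = float(np.abs(r).max()) / q_max            # scale from the largest magnitude
    q = np.clip(np.round(r / S), -q_max, q_max).astype(np.int8)
    return q, S

weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05, -0.14, -1.08, 2.12])
q, S = quantize_symmetric(weights)
print("S =", round(S, 5), "q =", q)

# On all-positive data (ReLU-like), the negative half of the grid is never used
relu_like = np.abs(weights)
q_pos, _ = quantize_symmetric(relu_like)
print("smallest code used on positive data:", q_pos.min())
```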

Per-Tensor vs. Per-Channel Quantization

In per-tensor quantization, a single S and Z applies to all weights in a layer's tensor. Simple but risky: if one channel has much larger magnitudes than others, the shared scale stretches to accommodate it, making all other channels coarsely quantized.

In per-channel quantization, each output channel of a convolutional or linear layer has its own S and Z. This is almost always better in practice — it lets each channel adapt its scale to its own magnitude range, dramatically reducing quantization error. The hardware cost is small (you only need to store N additional floats for a layer with N output channels).
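The effect of a shared scale is easy to measure. The sketch below quantizes and immediately dequantizes a random weight matrix ("fake quantization"), once with one scale for the whole tensor and once with one scale per output row. The matrix shape, the 50× blow-up of row 0, and the helper name `fake_quantize` are assumptions chosen for the demonstration.

```python
import numpy as np

def fake_quantize(w, axis=None, n_bits=8):
    """Symmetric quantize then dequantize. axis=None: per-tensor; axis=1: one scale per row."""
    q_max = 2 ** (n_bits - 1) - 1
    S = np.abs(w).max(axis=axis, keepdims=True) / q_max
    q = np.clip(np.round(w / S), -q_max, q_max)
    return q * S

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(4, 64))   # 4 output channels, 64 inputs each
W[0] *= 50.0                             # channel 0 has much larger magnitudes

# Measure the error only on the three "normal" channels
err_tensor = np.mean((fake_quantize(W) - W)[1:] ** 2)
err_channel = np.mean((fake_quantize(W, axis=1) - W)[1:] ** 2)
print(f"per-tensor MSE:  {err_tensor:.2e}")
print(f"per-channel MSE: {err_channel:.2e}")
```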

python
import numpy as np

def quantize_asymmetric(r, n_bits=8):
    """Asymmetric linear quantization. Returns (q, S, Z)."""
    q_min = -2**(n_bits-1)       # -128 for INT8
    q_max =  2**(n_bits-1) - 1   # +127 for INT8
    r_min, r_max = r.min(), r.max()
    S = (r_max - r_min) / (q_max - q_min)    # scale
    Z = np.round(q_min - r_min / S).astype(int)  # zero-point
    q = np.clip(np.round(r / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

def dequantize(q, S, Z):
    """Reconstruct float from integer + scale + zero-point."""
    return S * (q.astype(np.float32) - Z)

# Example: quantize a small weight vector
weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05,
                    -0.14, -1.08, 2.12])
q, S, Z = quantize_asymmetric(weights)
r_hat = dequantize(q, S, Z)
print(f"Scale: {S:.4f}, Zero-point: {Z}")
print(f"Quantized: {q}")
print(f"Reconstructed: {np.round(r_hat, 3)}")
print(f"Max error: {np.max(np.abs(r_hat - weights)):.4f}")

Running this gives: Scale ≈ 0.0125, Zero-point = −42. Max error ≈ 0.006, far less than K-means with only 4 centroids, because here we have 256 evenly-spaced levels across the full range.

For a weight layer with values ranging from −1.08 to +2.12, asymmetric INT8 quantization gives S = (2.12 − (−1.08)) / (127 − (−128)) = 3.20 / 255 ≈ 0.01255. What does the zero-point Z represent physically?

Z is the number of quantization levels reserved for negative values.
Z is the floating-point value that maps to the integer 0.
Z is the integer value that represents the real number 0.0. It ensures the float 0 is exactly representable, which matters for zero-padded convolutions and ReLU outputs.
Z is always 0 in asymmetric quantization by definition.

Chapter 6: Deriving Scale & Zero-Point

Let's derive the formulas for S and Z from first principles using the constraint that the real range [r_min, r_max] must map exactly to the integer range [q_min, q_max].

Starting from r = S(q − Z), evaluate at both endpoints:

r_max = S(q_max − Z)

r_min = S(q_min − Z)

Subtract the second from the first:

r_max − r_min = S(q_max − q_min)

S = (r_max − r_min) / (q_max − q_min)

For Z, use the second equation:

r_min = S(q_min − Z) ⇒ Z = q_min − r_min / S

Z = round(q_min − r_min / S)

We round Z because it must be an integer.

Worked Numbers from Deep Compression Example

The weight matrix from the slides has values ranging from r_min = −1.08 to r_max = 2.12. We're quantizing to 2-bit signed integers, so q_min = −2, q_max = 1.

Scale: S = (2.12 − (−1.08)) / (1 − (−2)) = 3.20 / 3 = 1.067.

Zero-point: Z = round(−2 − (−1.08) / 1.067) = round(−2 − (−1.012)) = round(−2 + 1.012) = round(−0.988) = −1.

Let's verify a few weights. Take w = 2.09: q = round(2.09 / 1.067 + (−1)) = round(1.958 − 1) = round(0.958) = 1. Dequantized: r̂ = 1.067 × (1 − (−1)) = 1.067 × 2 = 2.134. Error = 2.134 − 2.09 = 0.044.

Take w = −1.08: q = round(−1.08 / 1.067 + (−1)) = round(−1.012 − 1) = round(−2.012) = −2. Dequantized: r̂ = 1.067 × (−2 − (−1)) = 1.067 × (−1) = −1.067. Error = −1.067 − (−1.08) = 0.013. Very small!
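The 2-bit numbers above can be reproduced in a few lines. This sketch applies the S and Z formulas from this chapter; the helper names are illustrative, and the extra weight 1.48 is included as an assumption to show a value that does not sit near a grid point.

```python
def affine_params(r_min, r_max, n_bits):
    """Scale and zero-point that map [r_min, r_max] onto the signed n-bit integer range."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z, q_min, q_max

def quantize(w, S, Z, q_min, q_max):
    """q = clamp(round(w / S + Z))."""
    return max(q_min, min(q_max, int(round(w / S + Z))))

S, Z, q_min, q_max = affine_params(-1.08, 2.12, n_bits=2)
print(f"S = {S:.3f}, Z = {Z}")
for w in (2.09, -1.08, 1.48):
    q = quantize(w, S, Z, q_min, q_max)
    w_hat = S * (q - Z)
    print(f"w = {w:+.2f} -> q = {q:+d} -> w_hat = {w_hat:+.3f} (error {abs(w_hat - w):.3f})")
```

The two endpoint-adjacent weights land close to grid points, but with only four levels a mid-range weight such as 1.48 can be off by several tenths.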

INT8 Symmetric Worked Example

Now a worked example with INT8 symmetric quantization. Suppose a layer has weights with absolute maximum |r|_max = 2.12.

Symmetric INT8 uses q range [−127, 127] (or sometimes [−128, 127] — let's use [−127, 127] for perfect symmetry).

S = |r|_max / (2⁷ − 1) = 2.12 / 127 ≈ 0.01669. Z = 0.

Quantize the weight vector [2.09, −0.98, 1.48, 0.09, 0.05, −0.14, −1.08, 2.12]:

q = round(r / 0.01669)

Float r	r / S	q (INT8)	Dequant r̂	Error
2.09	125.2	125	2.087	0.003
−0.98	−58.7	−59	−0.985	0.005
1.48	88.7	89	1.486	0.006
0.09	5.4	5	0.083	0.007
0.05	3.0	3	0.050	0.000
−0.14	−8.4	−8	−0.134	0.006
−1.08	−64.7	−65	−1.085	0.005
2.12	127.0	127	2.120	0.000

The rounding error is bounded by S/2 ≈ 0.008, and the largest error in the table is about 0.007. That is roughly 0.4% of |r|_max, which is excellent fidelity at INT8.

Outliers blow up per-tensor quant: Suppose one weight in the layer is 50.0 while all others are in [−1, 1]. The per-tensor scale becomes S = 50 / 127 ≈ 0.394. Now every weight in [−1, 1] gets quantized to q ∈ [−3, 3] — only 7 out of 256 INT8 values are used! The quantization error is enormous. This is why per-channel quantization matters, and why techniques like GPTQ and AWQ (Lecture 6) explicitly handle weight outliers. Activation outliers in LLMs are an even bigger problem — a single outlier dimension can dominate the scale and ruin all other dimensions.
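The outlier effect can be reproduced numerically. The sketch below draws 1000 hypothetical weights uniformly from [−1, 1], appends a single outlier at 50.0, and counts how many distinct INT8 codes the ordinary weights end up using. The data and the helper name are assumptions for illustration.

```python
import numpy as np

def quantize_symmetric(r, q_max=127):
    """Per-tensor symmetric quantization. Returns (q, S)."""
    S = float(np.abs(r).max()) / q_max
    q = np.clip(np.round(r / S), -q_max, q_max).astype(np.int32)
    return q, S

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=1000)
w_outlier = np.append(w, 50.0)             # one extreme weight joins the tensor

q_clean, S_clean = quantize_symmetric(w)
q_out, S_out = quantize_symmetric(w_outlier)

levels_clean = np.unique(q_clean).size
levels_out = np.unique(q_out[:-1]).size    # codes used by the ordinary weights only
mse_clean = np.mean((q_clean * S_clean - w) ** 2)
mse_out = np.mean((q_out[:-1] * S_out - w) ** 2)
print(f"no outlier:   S = {S_clean:.4f}, {levels_clean} levels used, MSE = {mse_clean:.2e}")
print(f"with outlier: S = {S_out:.4f}, {levels_out} levels used, MSE = {mse_out:.2e}")
```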

Linear Quantization Mapper — drag range and bitwidth to see the grid

Adjust the real value range and bitwidth. The canvas shows the quantization grid, computed S and Z, and the per-value rounding error histogram.


You have a weight tensor with r_min = −3.0 and r_max = 3.0. You apply asymmetric INT8 quantization (q_min = −128, q_max = 127). What is S and Z?

S = 6/255 ≈ 0.0235, Z = 0. Because the range is symmetric, Z is always 0.
S = 6/255 ≈ 0.0235, Z = 0. From the formulas: S = (3 − (−3)) / (127 − (−128)) = 6/255, and with that exact S, 3.0 / S = 127.5, so Z = round(−128 + 127.5) = round(−0.5). This is an exact tie: round-half-to-even (the rule NumPy's np.round uses) gives 0, while round-half-away-from-zero would give −1.
S = 3/127 ≈ 0.0236, Z = 128. Because the range is symmetric, you need Z = 128 to center it.
S = 1/42 ≈ 0.024, Z = −3. The scale is 1 over the range divided by bitwidth.

Chapter 7: Quantized Matrix Multiplication

The payoff of linear quantization is that you can do the entire forward pass in integer arithmetic — no floating-point operations until the very last rescale step. Let's derive exactly how a quantized matmul works.

The Full Derivation

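Consider Y = W × X. Each output element is a dot product, a sum of terms w · x over the inner dimension; the derivation below follows a single term, and the same algebra applies inside the sum. Every value obeys r = S(q − Z), so: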

S_Y(q_Y − Z_Y) = S_W(q_W − Z_W) · S_X(q_X − Z_X)

Solve for q_Y:

q_Y = (S_WS_X / S_Y) · (q_W − Z_W)(q_X − Z_X) + Z_Y

Expand the product:

q_Y = (S_WS_X / S_Y) · (q_Wq_X − Z_Wq_X − Z_Xq_W + Z_WZ_X) + Z_Y

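This looks complicated, but look at what each term costs. Z_WZ_X is a constant (it doesn't change across inputs). Z_Xq_W depends only on the weights, so it can be precomputed before inference. Z_Wq_X depends on the current input, so it has to be computed at runtime, but it needs only a sum of the inputs scaled by Z_W, not a full multiply-accumulate. The only term that requires a multiplication per weight-input pair is q_Wq_X.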
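The expansion can be sanity-checked on a random integer dot product. In this sketch the vector length, the random codes, and the zero-points are arbitrary choices; note that for a dot product of length N the constant term becomes N · Z_W · Z_X.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
q_W = rng.integers(-128, 128, size=N).astype(np.int64)   # INT8-range weight codes
q_X = rng.integers(-128, 128, size=N).astype(np.int64)   # INT8-range input codes
Z_W, Z_X = 3, -5                                          # arbitrary zero-points

# Left side: subtract the zero-points first, then multiply and sum
lhs = int(np.sum((q_W - Z_W) * (q_X - Z_X)))

# Right side: the four expanded terms
rhs = int(q_W @ q_X            # the only term needing N multiplications
          - Z_W * q_X.sum()    # needs the runtime sum of the inputs
          - Z_X * q_W.sum()    # depends only on the weights: precomputable
          + N * Z_W * Z_X)     # a constant
print(lhs, rhs)
```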

The Symmetric Shortcut: Z_W = 0

If we use symmetric quantization for weights, Z_W = 0. The formula simplifies dramatically:

q_Y = (S_WS_X / S_Y) · (q_Wq_X − Z_Xq_W) + Z_Y

Now q_Wq_X is a pure N-bit integer multiplication (each INT8 × INT8 product fits in INT16; the running sum is kept in an INT32 accumulator). The Z_Xq_W correction term can be precomputed once per output row of W before inference. The scale ratio S_WS_X/S_Y is a float that gets applied once at the end: one floating-point multiplication per output element, not one per multiply-accumulate.

The scale ratio is typically less than 1: Jacob et al. report that S_WS_X/S_Y was empirically always in (0, 1). This means you can represent it as M₀ × 2⁻ⁿ where M₀ ∈ [0.5, 1) and n is a non-negative integer. Multiplying by M₀ can be done as a fixed-point integer multiply (no float needed), and the 2⁻ⁿ factor is just a right bit-shift. This is how TensorFlow Lite achieves fully integer-only inference: even the rescale step uses integers.
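The decomposition into a fixed-point multiplier and a shift can be sketched with Python integers. This is an illustration of the idea, not TensorFlow Lite's actual routine: the 31-bit multiplier width, the round-half-up shift, the helper names, and the example ratio 0.0072 are all assumptions.

```python
import math

def quantize_multiplier(M: float, bits: int = 31):
    """Split 0 < M < 1 into (M0_int, n) such that M is approximately M0_int * 2**-(bits + n)."""
    if not 0.0 < M < 1.0:
        raise ValueError("expected a scale ratio in (0, 1)")
    M0, exp = math.frexp(M)              # M = M0 * 2**exp with 0.5 <= M0 < 1
    M0_int = round(M0 * (1 << bits))     # M0 as a fixed-point integer
    if M0_int == (1 << bits):            # M0 rounded up to 1.0: renormalize
        M0_int //= 2
        exp += 1
    return M0_int, -exp

def rescale(acc: int, M0_int: int, n: int, bits: int = 31) -> int:
    """Integer-only approximation of round(acc * M): multiply, add half, shift right."""
    shift = bits + n
    return (acc * M0_int + (1 << (shift - 1))) >> shift

M = 0.0072                               # hypothetical value of S_W * S_X / S_Y
M0_int, n = quantize_multiplier(M)
for acc in (12345, -6789, 0):
    print(acc, "->", rescale(acc, M0_int, n), "vs float", acc * M)
```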

Full FC Layer with Bias

A fully-connected layer computes Y = WX + b. With quantized bias (Z_b = 0, S_b = S_WS_X), and using symmetric W (Z_W = 0):

q_Y = (S_WS_X/S_Y) · (q_Wq_X + q_bias) + Z_Y

where q_bias = q_b − Z_Xq_W is precomputed at model-load time. Everything in the parentheses is 32-bit integer accumulation. The scale multiply and +Z_Y at the end brings the output back to INT8 range.

python
import numpy as np

def quantized_matmul(W_int8, X_int8, S_W, S_X, S_Y, Z_X, Z_Y, bias_int32):
    """
    Integer-arithmetic matmul. W is symmetric (Z_W=0).
    W_int8: [out, in], X_int8: [in, batch]
    bias_int32: [out, 1], the bias quantized with scale S_W*S_X and Z_b=0
    Returns q_Y as int8.
    """
    # Step 1: N-bit integer matmul, accumulate in int32
    acc = W_int8.astype(np.int32) @ X_int8.astype(np.int32)  # [out, batch]

    # Step 2: subtract Z_X * (row sum of W). This depends only on the weights,
    # so a deployment would precompute it once and fold it into the bias.
    sum_W = W_int8.astype(np.int32).sum(axis=1, keepdims=True)  # [out, 1]
    acc -= int(Z_X) * sum_W

    # Step 3: add the int32 bias (broadcast across the batch dimension)
    acc += bias_int32

    # Step 4: rescale by S_W*S_X/S_Y (float here for clarity), add Z_Y,
    # round, and clamp back to the INT8 range
    scale_ratio = (S_W * S_X) / S_Y
    q_Y = np.round(acc * scale_ratio + Z_Y)
    q_Y = np.clip(q_Y, -128, 127).astype(np.int8)
    return q_Y
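To see the pieces working together, the following self-contained sketch quantizes a small random FC layer, folds the Z_X correction into the bias as described above, runs the integer accumulation, and compares the dequantized output with the float result. The layer sizes, the random data, and calibrating S_Y and Z_Y from the float output itself are simplifying assumptions; a real pipeline would estimate the output range from calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(4, 8)).astype(np.float32)    # [out, in]
X = rng.uniform(-1.0, 2.0, size=(8, 3)).astype(np.float32)  # [in, batch]
b = rng.normal(0.0, 0.1, size=(4, 1)).astype(np.float32)    # [out, 1]
Y = W @ X + b                                               # float reference

def affine_params(r, q_min=-128, q_max=127):
    """Asymmetric scale and zero-point for the observed range of r."""
    S = float(r.max() - r.min()) / (q_max - q_min)
    Z = int(np.round(q_min - float(r.min()) / S))
    return S, Z

# Weights: symmetric (Z_W = 0). Inputs and outputs: asymmetric.
S_W = float(np.abs(W).max()) / 127
q_W = np.clip(np.round(W / S_W), -127, 127).astype(np.int32)
S_X, Z_X = affine_params(X)
q_X = np.clip(np.round(X / S_X + Z_X), -128, 127).astype(np.int32)
S_Y, Z_Y = affine_params(Y)

# Bias uses scale S_W * S_X and Z_b = 0; the Z_X * q_W correction is folded in once
q_b = np.round(b / (S_W * S_X)).astype(np.int32)
q_bias = q_b - Z_X * q_W.sum(axis=1, keepdims=True)

acc = q_W @ q_X + q_bias                                    # integer-only accumulation
q_Y = np.clip(np.round(acc * (S_W * S_X / S_Y) + Z_Y), -128, 127).astype(np.int8)

Y_hat = S_Y * (q_Y.astype(np.float32) - Z_Y)                # dequantize for comparison
max_err = float(np.abs(Y_hat - Y).max())
print("max abs error vs float:", max_err)
```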

In integer-only inference with symmetric weight quantization (Z_W = 0), which operations are done entirely in integer arithmetic, and which require a floating-point step?

Everything is floating-point. The integer weights are just storage; computation always uses floats.
Everything is integer. The entire forward pass, including scale factors, uses integer arithmetic.
The core accumulation (q_W × q_X, add bias, add Z_X correction) is all integer. Only the final rescale (multiply by S_W·S_X/S_Y) involves a floating-point multiplication, but this can itself be converted to a fixed-point integer multiply + bit shift.
Only the bias addition uses integers. All multiplications require floating-point.

Chapter 8: Showcase: Full Quantization Explorer

This is the payoff chapter. The explorer below gives you direct control over everything we've built: the weight distribution, the quantization format, the bitwidth, and symmetric vs. asymmetric mode. Watch how the quantization grid, the reconstruction error, and the MSE change as you explore the parameter space.

Try the outlier experiment: enable "Add outlier" to inject one weight at ±8.0 — far outside the normal range. See how it blows up the per-tensor scale and ruins all other weights. Then switch to per-channel (per-element here) mode to see the fix.

Full Quantization Explorer — weights, format, bitwidth, errors, MSE

The top panel shows float weights (blue) and quantized+dequantized values (orange) on the same axis. The bottom panel shows the error bar for each weight. MSE updates live.


The key experiment to run: Start at 8 bits asymmetric, no outlier. Note the MSE — it's tiny. Now add the outlier and watch MSE explode even at 8 bits. Switch to "Symmetric": Z=0 can't adapt to asymmetric distributions as well. Now remove the outlier and try 4 bits vs 2 bits — you'll see the "staircase" error pattern where coarse quantization creates clusters of identical reconstructed values. This is the core intuition for why INT4 needs careful per-channel calibration to work well.

Chapter 9: Connections & Cheat Sheet

Numeric Format Cheat Sheet

Format	Bits	Sign	Exp	Mantissa	Max Value	Best For
FP32	32	1	8	23	~3.4×10³⁸	Training (reference)
FP16	16	1	5	10	65,504	Training (beware overflow)
BF16	16	1	8	7	~3.4×10³⁸	Training (drop-in FP32)
INT8	8	sign bit	—	—	127 (with scale)	Inference (energy win)
INT4	4	sign bit	—	—	7 (with scale)	LLM inference (aggressive)

Quantization Formula Cheat Sheet

Quantity	Formula	Notes
Scale S	S = (r_max − r_min) / (q_max − q_min)	Maps float range to int range
Zero-point Z	Z = round(q_min − r_min / S)	Integer; ensures 0.0 is exact
Quantize	q = clip(round(r/S + Z), q_min, q_max)	Float → int
Dequantize	r̂ = S × (q − Z)	Int → float (approximation)
Symmetric S	S = |r|_max / (2ⁿ⁻¹ − 1), Z = 0	Simpler matmul; wastes range if skewed
Matmul output	q_Y = (S_WS_X/S_Y)(q_Wq_X − Z_Xq_W) + Z_Y	Z_W=0 (symmetric weights)

K-Means vs. Linear Quantization

Property	K-Means Quant	Linear Quant
Grid type	Non-uniform (cluster-optimal)	Uniform (evenly spaced)
Storage	b-bit indices + float codebook	b-bit integers + S, Z
Compute	Float arithmetic (after lookup)	Integer arithmetic
Hardware fit	Needs codebook lookup at runtime	Runs on INT8 tensor cores
Fine-tuning	Centroid gradient SGD	Quantization-Aware Training (QAT)
Best for	Storage-only compression (Deep Compression)	Inference acceleration (TFLite, TRT)

What's Next — Quantization II (Lecture 6)

This lesson covered the mechanics of quantization — what the mapping is and why it works. Lecture 6 covers how to calibrate and recover accuracy:

Post-Training Quantization (PTQ) — calibrate S and Z using a small representative dataset after training. No retraining. Works well for INT8; degrades at INT4.
Quantization-Aware Training (QAT) — simulate quantization during training with "fake quant" nodes (straight-through estimator for gradients). Recovers most accuracy even at INT4.
GPTQ — second-order PTQ for LLMs: solve a per-layer weight reconstruction problem using the Hessian. Achieves INT4 with minimal perplexity loss.
AWQ (Activation-aware Weight Quantization): identify "salient" weight channels (those with large activation magnitudes) and scale them to use the full integer range before quantization. Handles the outlier problem from Chapter 6.

Related Lessons

TinyML L1

Efficiency metrics: where quantization fits in the efficiency landscape

TinyML L2

NN building blocks: MAC counts that quantization reduces in energy

TinyML L3

Pruning I: complementary to quantization in Deep Compression pipeline

CS229S L1

Systems for ML: hardware context for integer tensor cores

Closing thought: Quantization is not about discarding information carelessly — it is about choosing the right basis to represent what actually matters. A 7B-parameter model in FP32 contains 32 bits per weight, but only a handful of those bits carry signal that survives gradient descent. The scale factor S and zero-point Z are the minimal translation layer that lets integer hardware see the same model that float hardware trained. When you internalize that, the whole field of post-training quantization, GPTQ, and AWQ becomes a single question: how do you choose S and Z so the model forgets as little as possible?

References

Han, S., Mao, H., Dally, W. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR, 2016.
Jacob, B. et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR, 2018.
Horowitz, M. "Computing's Energy Problem (and What We Can Do About it)." IEEE ISSCC, 2014.
Song Han. "MIT 6.5940: TinyML and Efficient Deep Learning Computing." Lecture 5 — Quantization Part I. 2024.