TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 5

Quantization I: Representing Networks in Fewer Bits

A 7B-parameter model weighs 28 GB in FP32. Quantize it to INT4 and it shrinks to 3.5 GB — small enough to live on a phone. But you're throwing away 28 of every 32 bits. This lesson shows exactly how you can do that without throwing away the model.

Prerequisites: TinyML L1 (Efficiency & Metrics) — parameter count, MACs, memory bandwidth. Basic familiarity with floating-point representation helpful but not required (we derive it here). No quantization background needed.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The 28 GB Problem

LLaMA-2 7B has seven billion parameters. At the default training precision — 32-bit floats, four bytes per parameter — the model alone occupies 28 gigabytes. The newest iPhone has 8 GB of RAM. The newest consumer GPU with 24 GB can barely fit it.

Now consider this: an 8-bit integer takes one byte. If you could represent every weight as an integer instead of a float, the model would shrink to 7 GB — still large, but now accessible to a high-end phone. Cut to 4 bits and you're at 3.5 GB. At 2 bits, 1.75 GB. The savings are enormous. The question is whether the math can survive the reduction.

The quantization bargain: You're buying memory and speed with a small amount of accuracy. INT8 inference is typically 30× cheaper in energy than FP32 (0.03 pJ vs 0.9 pJ per ADD in 45nm CMOS). The goal is to lose the minimum accuracy for the maximum compression.

Here's the energy hierarchy from Horowitz (ISSCC 2014) that motivates everything in this lecture. The numbers are for 45nm CMOS at 0.9V:

OperationEnergy (pJ)Relative to INT8 ADD
8-bit INT ADD0.03
32-bit INT ADD0.1
16-bit FP ADD0.413×
32-bit FP ADD0.930×
8-bit INT MULT0.2
32-bit INT MULT3.1103×
16-bit FP MULT1.137×
32-bit FP MULT3.7123×

Switching from FP32 to INT8 saves about 30× energy per addition. On a mobile device doing millions of operations per second, that difference is the gap between an app that's usable for 2 hours and one that drains the battery in 20 minutes.

What quantization actually does: It constrains the continuous distribution of weight values to a finite discrete set. Instead of 232 possible float values, each weight can only take one of 256 (INT8) or 16 (INT4) values. The art is in choosing which discrete values to use — and how to recover from the approximation.

This lesson covers three approaches. K-means quantization finds the best codebook by clustering — you choose k centroids and every weight snaps to the nearest one. Linear (affine) quantization uses a uniform grid defined by a scale and zero-point — this is the approach that enables true integer arithmetic during inference. Binary and ternary quantization (the extreme case, only ±1 or {-1, 0, +1}) is covered in Lecture 6.

Ch 1–2: Data Types
Integers, fixed-point, IEEE floats (FP32/FP16/BF16) — bit layouts and the range/precision tradeoff
Ch 3–4: K-Means Quant
Weight clustering with a codebook; Deep Compression storage math; centroid fine-tuning
Ch 5–7: Linear Quant
Affine mapping r = S(q − Z); derive S and Z; symmetric vs asymmetric; quantized matmul
Ch 8: Showcase
Full quantization explorer: drag weights, choose format, watch error, see integer arithmetic
Ch 9: Connections
Cheat sheet of all formats and formulas; bridge to PTQ/QAT (L6)
Model Size vs Precision — see the storage savings in real time

Drag the bitwidth slider and watch how a 7B-parameter model shrinks. The green bar shows the iPhone 15 Pro's available RAM (8 GB).

Bits per parameter32 bits
A 7B-parameter model in FP32 uses 28 GB. What is the approximate size in INT4 (4-bit integer)?

Chapter 1: Numeric Data Types

Before you can quantize a weight, you need to understand what you're starting from — and what you're quantizing to. Every number in a neural network is stored in a specific format, and the format determines two competing properties: dynamic range (how large or small a number can be) and precision (how finely you can distinguish between nearby numbers).

Integers

An unsigned integer with n bits can represent values from 0 to 2n − 1. An 8-bit unsigned integer (uint8) covers [0, 255]. A signed integer using two's complement covers [−2n−1, 2n−1 − 1]. For 8-bit (INT8): [−128, 127].

Two's complement is the standard. The value of an n-bit two's complement number with bits bn−1bn−2…b0 is:

value = −bn−1 · 2n−1 + ∑i=0n−2 bi · 2i

Example: the 8-bit pattern 11001111 = −1×27 + 1×26 + 1×23 + 1×22 + 1×21 + 1×20 = −128 + 64 + 8 + 4 + 2 + 1 = −49.

Fixed-Point Numbers

Fixed-point takes a regular integer and places an implicit decimal point somewhere in the bit pattern. You're using the same hardware integer instructions, but you agree in advance to interpret the result as scaled by some factor 2−f where f is the number of fractional bits.

Example: an 8-bit number 00110001 = 49 as an integer. As fixed-point with 4 fractional bits, we interpret it as 49 × 2−4 = 49 × 0.0625 = 3.0625. Fixed-point DSPs use this constantly — it's fast, deterministic, and requires no special hardware. The downside: the scale is fixed, so you can't represent both very large and very small values simultaneously.

The Key Tradeoff: Range vs. Precision

The fundamental tension: More bits in the exponent field → wider dynamic range (you can represent both tiny and huge numbers). More bits in the mantissa/fraction field → finer precision (you can distinguish nearby values). You have a fixed bit budget. Every bit you give to the exponent costs a mantissa bit, and vice versa. This tradeoff drives the design of every floating-point format from FP32 to BF16 to FP8.
FormatSignExponentMantissaTotalNotes
FP32182332Training standard. Full range + precision.
FP16151016Smaller range than FP32 (max ≈ 65504). Overflow risk.
BF1618716Same exponent as FP32 → same range. Less precision.
INT81 (sign bit)8Uniform grid. Needs scale to represent real values.
INT41 (sign bit)4Only 16 values. Extreme compression, challenging accuracy.

Why does BF16 matter? Because it was designed specifically to drop-in replace FP32 in training. It has the same 8-bit exponent, meaning it can represent values from ≈10−38 to ≈1038 — the same range as FP32. You just lose precision (7 mantissa bits instead of 23). For the stochastic nature of gradient updates, that precision loss is usually acceptable.

FP16, on the other hand, cuts the exponent to 5 bits, which limits the range to [≈6×10−5, 65504]. Gradient magnitudes during training can easily exceed 65504, causing overflow — the dreaded "loss scale" trick in mixed-precision training exists because of this.

Common misconception — BF16 is not just "smaller FP16": BF16 and FP16 are both 16-bit formats, but they make completely different tradeoffs. FP16 sacrifices range to gain precision (10 mantissa bits vs 7). BF16 sacrifices precision to preserve range (8 exponent bits, matching FP32). For training stability, BF16 is usually safer. For certain inference tasks where you need fine-grained small-number representation (e.g., attention softmax probabilities), FP16 can be better.
BF16 uses 8 exponent bits, the same as FP32. Why was this design decision made for training deep networks?

Chapter 2: Float Bit Layout Explorer

The IEEE 754 floating-point standard encodes a number as three fields: a sign bit, an exponent, and a mantissa (also called the fraction or significand). For FP32 the formula is:

value = (−1)sign × (1 + mantissa) × 2exponent − bias

The "1 +" is the hidden bit — IEEE 754 assumes the leading 1 of a normalized number is implicit, so you don't need to store it. The bias for FP32 is 127 (= 28−1 − 1). This lets the 8-bit exponent represent both negative exponents (for tiny numbers) and positive exponents (for large numbers), centered around zero.

Worked Example: Encoding 0.265625 in FP32

Step 1: Write the number in binary scientific notation.

0.265625 = 1.0625 × 2−2 = (1 + 0.0625) × 2−2

Step 2: Identify the fields. Sign = 0 (positive). Exponent = −2 + 127 = 125 = 01111101. Mantissa = 0.0625 = 2−4 = 00010000000000000000000.

Step 3: Concatenate. The full 32-bit pattern is: 0 01111101 00010000000000000000000.

Worked Example: Decoding an FP16 Number

FP16 has 5 exponent bits and 10 mantissa bits. The bias is 25−1 − 1 = 15. Given the bit pattern 1100011100000000:

• Sign = 1 (negative). • Exponent = 100012 = 17. Actual exponent = 17 − 15 = 2.

• Mantissa = 11000000002 = 2−1 + 2−2 = 0.5 + 0.25 = 0.75.

• Value = −(1 + 0.75) × 22 = −1.75 × 4 = −7.0.

Subnormal numbers: When the exponent field is all zeros, the hidden "1+" disappears and the formula becomes (−1)sign × (0 + mantissa) × 21−bias. Subnormals allow gradual underflow — they let you represent numbers smaller than the smallest normal number, at the cost of precision. The smallest positive FP32 subnormal is 2−149; the smallest positive normal is 2−126. Special cases: all-one exponent with zero mantissa = ±∞; all-one exponent with nonzero mantissa = NaN.
Float Bit Layout Explorer — toggle format and enter a decimal value

Select a format and type a decimal value. The canvas shows the exact bit fields, encoded value, and the representable grid spacing near that value. The grid spacing tells you the precision you have.

Format Value
The FP16 bit pattern 0 10000 1100000000 (sign=0, exp=10000, mantissa=1100000000). What decimal value does this encode? (FP16 bias = 15)

Chapter 3: K-Means Weight Quantization

Here is the central observation of K-means quantization: the weight values of a trained neural network are not uniformly distributed. They cluster. Typical convolutional layers have bimodal or multimodal distributions — groups of weights near certain recurring values like −1, 0, +0.5, +1.5. If you can identify those clusters and replace every weight in a cluster with a single representative value, you can store each weight as a short cluster index instead of a full 32-bit float.

The approach is weight clustering, introduced in Deep Compression (Han et al., ICLR 2016). You run k-means on all the weights of a layer to find k centroids, then replace every weight with the index of its nearest centroid. During storage you keep only the indices (short integers) and the centroid values (a small codebook of k floats).

The codebook metaphor: Think of the centroids as the only allowed paint colors. You have, say, 16 colors. Every pixel in your image gets replaced by the nearest available color. The painting is stored as pixel indices (4 bits each) plus a 16-entry color table. The quality depends on how well the 16 colors were chosen — and k-means finds the optimal set.

Worked Example from Deep Compression

Consider a 4×4 weight matrix with 16 weights (shown in the canvas). K-means with k=4 finds four centroids. Let's say they are: c0 = −1.00, c1 = 0.00, c2 = 1.50, c3 = 2.00.

The weight 2.09 snaps to centroid 3 (2.00). The weight −0.98 snaps to centroid 0 (−1.00). And so on. Now instead of storing 16 float32 values (64 bytes), you store:

The quantization error for this particular example is small — the worst case is 2.09 − 2.00 = 0.09, and most others are comparable. The model can tolerate these errors because neural networks are over-parameterized and robust to small perturbations.

K-means quantization only saves storage, not compute: At inference time, you must look up each index in the codebook to get a float centroid, then do the matrix multiplication in floating-point. The indices are integers but the math is still floating-point. This is the key limitation vs. linear quantization (Chapter 5), which enables true integer arithmetic during computation — the 30× energy win from the table in Chapter 0.

Fine-Tuning the Centroids

After snapping every weight to its centroid, you have quantization error that slightly degrades accuracy. You can recover it by fine-tuning the centroids. Here's the trick: keep the cluster assignments fixed. Run a forward pass to get the loss. Backpropagate to get gradients with respect to each weight. But instead of updating individual weights, group the gradients by cluster and add up all the gradients that belong to centroid k. That sum is the gradient for centroid k. Update each centroid by its summed gradient.

Δck = −η · ∑i: cluster(i)=k ∂L / ∂wi

This is elegant: you're doing standard SGD, but with shared parameters. All weights in the same cluster move together because they are the same parameter (the centroid). After fine-tuning, the centroids settle into positions that minimize loss, not just reconstruction error — they can drift away from the original k-means positions if that's what the loss surface demands.

K-Means Weight Clustering — run k-means on a random weight distribution

Each dot is a weight value. Choose k centroids (the colored vertical lines). Click "Run K-Means" to find the optimal cluster assignments. The compression ratio and reconstruction error update live.

k (num centroids)4
After K-means weight quantization with k=4 centroids on a 16-weight layer, how does inference work? Which part uses integer arithmetic?

Chapter 4: Deep Compression: The Storage Math

Let's derive exactly how much compression K-means quantization provides and verify it on real numbers. This is the calculation from Han et al., ICLR 2016.

The General Formula

Suppose a layer has M weight parameters, each stored as a 32-bit float. Original storage = 32M bits. After K-means quantization with k centroids (where k = 2b for b-bit indices):

When M is large (M >> 2b, i.e., many more weights than centroids), the codebook is negligible and the compression ratio is approximately:

compression ratio ≈ 32M / (Mb) = 32/b

So 2-bit quantization (k=4 centroids) gives ≈ 16× compression; 4-bit (k=16) gives ≈ 8×; 8-bit (k=256) gives ≈ 4×.

Worked Numbers: The 4×4 Layer from Chapter 3

Our example: M = 16 weights, k = 4 centroids (b = 2 bits).

With M=16 the codebook is still significant (16 bytes out of 20 total). As layers get larger — say M = 1,000,000 weights with b=4 bits — the codebook (16 × 4 = 64 bytes) is negligible, and compression is ~8×.

AlexNet Results from Deep Compression

Deep Compression combined pruning (removing ~90% of weights) + quantization (4–8 bits) + Huffman coding. Results on real models:

NetworkOriginalCompressedRatioAccuracy
AlexNet240 MB6.9 MB35×+0.03% (improved!)
VGGNet550 MB11.3 MB49×+0.41% (improved!)
SqueezeNet + DC4.8 MB0.47 MB510× vs AlexNetAlexNet parity

The accuracy improvements happen because fine-tuning the centroids acts as a form of regularization — clustering forces weights to share values, which can slightly reduce overfitting on some tasks.

Why Huffman coding adds more on top: After K-means quantization, the cluster indices are not uniformly distributed. Weights near zero (index 1 = centroid ≈ 0) are much more frequent than weights near large values. Huffman coding exploits this: frequent symbols get shorter codes. The result is an additional 20–30% compression on top of K-means alone. Combined, Deep Compression achieves 35–49× on AlexNet/VGGNet vs. the original dense FP32 weights.
A layer has M = 100,000 weights currently stored as FP32. You apply K-means quantization with k = 16 centroids (4-bit indices). Approximately how large is the compressed representation, and what is the compression ratio?

Chapter 5: Linear / Affine Quantization

K-means quantization is flexible but limited: it only saves storage, not compute. Linear quantization (also called affine quantization, introduced in Jacob et al., CVPR 2018) solves both problems. Every weight is an integer. Every computation during inference is integer arithmetic. The 30× energy win becomes real.

The core idea is an affine (linear) mapping between real values r and integer quantized values q:

r = S · (q − Z)

where S is the scale factor (a positive real number, stored as float32) and Z is the zero-point (an integer of the same bitwidth as q). To go from real to quantized (quantization), you invert the mapping:

q = round(r / S + Z)

then clamp q to [qmin, qmax]. For INT8: q ∈ [−128, 127].

What S and Z Do

The scale S sets the spacing between quantization levels — smaller S means finer grid, less rounding error, but smaller representable range. You can think of S as the physical size of one "tick mark" on your integer ruler.

The zero-point Z is an integer offset that ensures the real number 0.0 is exactly representable by some integer q. This matters for ReLU activations (which are often zero) and for padding operations. If Z ≠ 0, the map is asymmetric — the range [rmin, rmax] doesn't have to be symmetric around zero.

Misconception — quantization isn't just truncation: A naive approach would be: "INT8 stores integers, so just chop the decimals off." That's pure truncation — it ignores the actual range of the weights. A weight of 1.37 truncates to 1 in integer format. But what about a weight of 155.7? That overflows INT8's range [−128, 127]. And a weight of 0.003? It maps to 0, losing all information. The scale factor S stretches or compresses the entire weight distribution to fit inside the integer range. Without it, most of the INT8 range is wasted.

Symmetric vs. Asymmetric Quantization

In symmetric quantization, you set Z = 0 and force the real range to be symmetric around zero: [−|r|max, |r|max]. The formula simplifies to:

r = S · q     (Z = 0)
S = |r|max / 2n−1

Symmetric quantization is simpler and the Z=0 assumption lets you simplify the quantized matmul significantly (see Chapter 7). The downside: if the weight distribution is actually asymmetric (e.g., mostly positive due to ReLU), you waste half the integer range on values that don't appear.

Asymmetric quantization lets you fit [rmin, rmax] exactly into [qmin, qmax] regardless of symmetry. More accurate for skewed distributions. But Z ≠ 0 introduces extra terms in the matmul derivation.

Per-Tensor vs. Per-Channel Quantization

In per-tensor quantization, a single S and Z applies to all weights in a layer's tensor. Simple but risky: if one channel has much larger magnitudes than others, the shared scale stretches to accommodate it, making all other channels coarsely quantized.

In per-channel quantization, each output channel of a convolutional or linear layer has its own S and Z. This is almost always better in practice — it lets each channel adapt its scale to its own magnitude range, dramatically reducing quantization error. The hardware cost is small (you only need to store N additional floats for a layer with N output channels).

python
import numpy as np

def quantize_asymmetric(r, n_bits=8):
    """Asymmetric linear quantization. Returns (q, S, Z)."""
    q_min = -2**(n_bits-1)       # -128 for INT8
    q_max =  2**(n_bits-1) - 1   # +127 for INT8
    r_min, r_max = r.min(), r.max()
    S = (r_max - r_min) / (q_max - q_min)    # scale
    Z = np.round(q_min - r_min / S).astype(int)  # zero-point
    q = np.clip(np.round(r / S + Z), q_min, q_max).astype(np.int8)
    return q, S, Z

def dequantize(q, S, Z):
    """Reconstruct float from integer + scale + zero-point."""
    return S * (q.astype(np.float32) - Z)

# Example: quantize a small weight vector
weights = np.array([2.09, -0.98, 1.48, 0.09, 0.05,
                    -0.14, -1.08, 2.12])
q, S, Z = quantize_asymmetric(weights)
r_hat = dequantize(q, S, Z)
print(f"Scale: {S:.4f}, Zero-point: {Z}")
print(f"Quantized: {q}")
print(f"Reconstructed: {np.round(r_hat, 3)}")
print(f"Max error: {np.max(np.abs(r_hat - weights)):.4f}")

Running this gives: Scale ≈ 0.0126, Zero-point ≈ −39. Max error ≈ 0.006 — far less than K-means with only 4 centroids, because here we have 256 evenly-spaced levels across the full range.

For a weight layer with values ranging from −1.08 to +2.12, asymmetric INT8 quantization gives S = (2.12 − (−1.08)) / (127 − (−128)) = 3.20 / 255 ≈ 0.01255. What does the zero-point Z represent physically?

Chapter 6: Deriving Scale & Zero-Point

Let's derive the formulas for S and Z from first principles using the constraint that the real range [rmin, rmax] must map exactly to the integer range [qmin, qmax].

Starting from r = S(q − Z), evaluate at both endpoints:

rmax = S(qmax − Z)
rmin = S(qmin − Z)

Subtract the second from the first:

rmax − rmin = S(qmax − qmin)
S = (rmax − rmin) / (qmax − qmin)

For Z, use the second equation:

rmin = S(qmin − Z)  ⇒  Z = qmin − rmin / S
Z = round(qmin − rmin / S)

We round Z because it must be an integer.

Worked Numbers from Deep Compression Example

The weight matrix from the slides has values ranging from rmin = −1.08 to rmax = 2.12. We're quantizing to 2-bit signed integers, so qmin = −2, qmax = 1.

Scale: S = (2.12 − (−1.08)) / (1 − (−2)) = 3.20 / 3 = 1.067.

Zero-point: Z = round(−2 − (−1.08) / 1.067) = round(−2 − (−1.012)) = round(−2 + 1.012) = round(−0.988) = −1.

Let's verify a few weights. Take w = 2.09: q = round(2.09 / 1.067 + (−1)) = round(1.958 − 1) = round(0.958) = 1. Dequantized: r̂ = 1.067 × (1 − (−1)) = 1.067 × 2 = 2.134. Error = 2.134 − 2.09 = 0.044.

Take w = −1.08: q = round(−1.08 / 1.067 + (−1)) = round(−1.012 − 1) = round(−2.012) = −2. Dequantized: r̂ = 1.067 × (−2 − (−1)) = 1.067 × (−1) = −1.067. Error = −1.067 − (−1.08) = 0.013. Very small!

INT8 Symmetric Worked Example

Now a worked example with INT8 symmetric quantization. Suppose a layer has weights with absolute maximum |r|max = 2.12.

Symmetric INT8 uses q range [−127, 127] (or sometimes [−128, 127] — let's use [−127, 127] for perfect symmetry).

S = |r|max / 27 = 2.12 / 127 ≈ 0.01669. Z = 0.

Quantize the weight vector [2.09, −0.98, 1.48, 0.09, 0.05, −0.14, −1.08, 2.12]:

q = round(r / 0.01669)
Float rr / Sq (INT8)Dequant r̂Error
2.09125.21252.0860.004
−0.98−58.7−59−0.9850.005
1.4888.7891.4850.005
0.095.450.0830.007
2.12127.01272.1190.001
−1.08−64.7−65−1.0850.005

Maximum error across all weights: ~0.008. That's 0.4% of the value range — excellent fidelity at INT8.

Outliers blow up per-tensor quant: Suppose one weight in the layer is 50.0 while all others are in [−1, 1]. The per-tensor scale becomes S = 50 / 127 ≈ 0.394. Now every weight in [−1, 1] gets quantized to q ∈ [−3, 3] — only 7 out of 256 INT8 values are used! The quantization error is enormous. This is why per-channel quantization matters, and why techniques like GPTQ and AWQ (Lecture 6) explicitly handle weight outliers. Activation outliers in LLMs are an even bigger problem — a single outlier dimension can dominate the scale and ruin all other dimensions.
Linear Quantization Mapper — drag range and bitwidth to see the grid

Adjust the real value range and bitwidth. The canvas shows the quantization grid, computed S and Z, and the per-value rounding error histogram.

r_min−2.0
r_max2.0
Bits4
Mode
You have a weight tensor with r_min = −3.0 and r_max = 3.0. You apply asymmetric INT8 quantization (q_min = −128, q_max = 127). What is S and Z?

Chapter 7: Quantized Matrix Multiplication

The payoff of linear quantization is that you can do the entire forward pass in integer arithmetic — no floating-point operations until the very last rescale step. Let's derive exactly how a quantized matmul works.

The Full Derivation

Consider Y = W × X. Each element y = w · x (a dot product). Every value obeys r = S(q − Z), so:

SY(qY − ZY) = SW(qW − ZW) · SX(qX − ZX)

Solve for qY:

qY = (SWSX / SY) · (qW − ZW)(qX − ZX) + ZY

Expand the product:

qY = (SWSX / SY) · (qWqX − ZWqX − ZXqW + ZWZX) + ZY

This looks complicated, but look at which terms can be precomputed. ZWZX is constant (doesn't change across inputs). ZWqX depends only on the input, not the weights. ZXqW depends only on the weights, not the current input. The only per-output term is qWqX.

The Symmetric Shortcut: ZW = 0

If we use symmetric quantization for weights, ZW = 0. The formula simplifies dramatically:

qY = (SWSX / SY) · (qWqX − ZXqW) + ZY

Now qWqX is a pure N-bit integer multiplication (INT8 × INT8 → INT16 or INT32 accumulator). The ZXqW correction term can be precomputed per column of W before inference. The scale ratio SWSX/SY is a float that gets applied once at the end — one floating-point multiplication per output element, not per inner product.

The scale ratio is always less than 1: In practice, SWSX/SY always falls in (0, 1). This means you can represent it as M0 × 2−n where M0 ∈ [0.5, 1). Multiplying by M0 can be done as a fixed-point integer multiply (no float needed), and the 2−n factor is just a right bit-shift. This is how TensorFlow Lite achieves fully integer-only inference — even the rescale step uses integers.

Full FC Layer with Bias

A fully-connected layer computes Y = WX + b. With quantized bias (Zb = 0, Sb = SWSX), and using symmetric W (ZW = 0):

qY = (SWSX/SY) · (qWqX + qbias) + ZY

where qbias = qb − ZXqW is precomputed at model-load time. Everything in the parentheses is 32-bit integer accumulation. The scale multiply and +ZY at the end brings the output back to INT8 range.

python
import numpy as np

def quantized_matmul(W_int8, X_int8, S_W, S_X, S_Y, Z_X, Z_Y, bias_int32):
    """
    Integer-arithmetic matmul. W is symmetric (Z_W=0).
    W_int8: [out, in], X_int8: [in, batch]
    Returns q_Y as int8.
    """
    # Step 1: N-bit integer matmul, accumulate in int32
    acc = W_int8.astype(np.int32) @ X_int8.astype(np.int32)  # [out, batch]

    # Step 2: subtract Z_X * sum_of_weights (precomputed per output)
    sum_W = W_int8.sum(axis=1, keepdims=True).astype(np.int32)
    acc -= Z_X * sum_W

    # Step 3: add bias (precomputed in int32 space)
    acc += bias_int32

    # Step 4: rescale — multiply by S_W*S_X/S_Y (float, then round to int)
    scale_ratio = (S_W * S_X) / S_Y
    q_Y = np.round(acc * scale_ratio + Z_Y)
    q_Y = np.clip(q_Y, -128, 127).astype(np.int8)
    return q_Y
In integer-only inference with symmetric weight quantization (Z_W = 0), which operations are done entirely in integer arithmetic, and which require a floating-point step?

Chapter 8: Showcase: Full Quantization Explorer

This is the payoff chapter. The explorer below gives you direct control over everything we've built: the weight distribution, the quantization format, the bitwidth, and symmetric vs. asymmetric mode. Watch how the quantization grid, the reconstruction error, and the MSE change as you explore the parameter space.

Try the outlier experiment: enable "Add outlier" to inject one weight at ±8.0 — far outside the normal range. See how it blows up the per-tensor scale and ruins all other weights. Then switch to per-channel (per-element here) mode to see the fix.

Full Quantization Explorer — weights, format, bitwidth, errors, MSE

The top panel shows float weights (blue) and quantized+dequantized values (orange) on the same axis. The bottom panel shows the error bar for each weight. MSE updates live.

Bitwidth8 bits
Format
The key experiment to run: Start at 8 bits asymmetric, no outlier. Note the MSE — it's tiny. Now add the outlier and watch MSE explode even at 8 bits. Switch to "Symmetric": Z=0 can't adapt to asymmetric distributions as well. Now remove the outlier and try 4 bits vs 2 bits — you'll see the "staircase" error pattern where coarse quantization creates clusters of identical reconstructed values. This is the core intuition for why INT4 needs careful per-channel calibration to work well.

Chapter 9: Connections & Cheat Sheet

Numeric Format Cheat Sheet

FormatBitsSignExpMantissaMax ValueBest For
FP32321823~3.4×1038Training (reference)
FP1616151065,504Training (beware overflow)
BF1616187~3.4×1038Training (drop-in FP32)
INT88sign bit127 (with scale)Inference (energy win)
INT44sign bit7 (with scale)LLM inference (aggressive)

Quantization Formula Cheat Sheet

QuantityFormulaNotes
Scale SS = (rmax − rmin) / (qmax − qmin)Maps float range to int range
Zero-point ZZ = round(qmin − rmin / S)Integer; ensures 0.0 is exact
Quantizeq = clip(round(r/S + Z), qmin, qmax)Float → int
Dequantizer̂ = S × (q − Z)Int → float (approximation)
Symmetric SS = |r|max / 2n−1, Z = 0Simpler matmul; wastes range if skewed
Matmul outputqY = (SWSX/SY)(qWqX − ZXqW) + ZYZW=0 (symmetric weights)

K-Means vs. Linear Quantization

PropertyK-Means QuantLinear Quant
Grid typeNon-uniform (cluster-optimal)Uniform (evenly spaced)
Storageb-bit indices + float codebookb-bit integers + S, Z
ComputeFloat arithmetic (after lookup)Integer arithmetic
Hardware fitNeeds codebook lookup at runtimeRuns on INT8 tensor cores
Fine-tuningCentroid gradient SGDQuantization-Aware Training (QAT)
Best forStorage-only compression (Deep Compression)Inference acceleration (TFLite, TRT)

What's Next — Quantization II (Lecture 6)

This lesson covered the mechanics of quantization — what the mapping is and why it works. Lecture 6 covers how to calibrate and recover accuracy:

Related Lessons

Closing thought: Quantization is not about discarding information carelessly — it is about choosing the right basis to represent what actually matters. A 7B-parameter model in FP32 contains 32 bits per weight, but only a handful of those bits carry signal that survives gradient descent. The scale factor S and zero-point Z are the minimal translation layer that lets integer hardware see the same model that float hardware trained. When you internalize that, the whole field of post-training quantization, GPTQ, and AWQ becomes a single question: how do you choose S and Z so the model forgets as little as possible?

References

  1. Han, S., Mao, H., Dally, W. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR, 2016. arXiv
  2. Jacob, B. et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." CVPR, 2018. arXiv
  3. Horowitz, M. "Computing's Energy Problem (and What We Can Do About it)." IEEE ISSCC, 2014.
  4. Song Han. "MIT 6.5940: TinyML and Efficient Deep Learning Computing." Lecture 5 — Quantization Part I. 2024.