TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 4

Pruning II: Lottery Tickets, Automation & Hardware

L3 left two open wounds: we never proved which sparse subnetwork to keep, and our sparse model was no faster in practice. This lesson heals both. The Lottery Ticket Hypothesis gives a precise recipe for finding the right subnetwork — and weight rewinding explains why the mask alone is not enough. AMC turns per-layer sparsity allocation into a reinforcement-learning policy. NetAdapt does it greedily with a real latency table. Then we descend into hardware: why unstructured sparsity is nearly free on GPUs, how CSR encoding really works, what EIE's processing elements actually compute, and why NVIDIA's 2:4 pattern was the minimum regularity needed to guarantee 2× throughput. MIT 6.5940 by Song Han.

Prerequisites: TinyML L3 (Pruning I) — granularity, criteria, iterative prune-finetune loop. Basic familiarity with neural network training. No RL background needed for AMC (explained from scratch).

Chapters

Live Canvases

Derived

From First Principles

Chapter 0: Two Gaps from L3

By the end of Pruning I you could prune AlexNet to 10% of its weights, finetune, and recover near-original accuracy. That is genuinely impressive. But if you tried to deploy that pruned network on a phone and timed inference, you'd find something embarrassing: it ran at roughly the same speed as the full network. And if someone asked you why those 10% survived rather than some other 10%, you wouldn't have a satisfying answer.

These are the two gaps. Gap 1 is theoretical: is there something special about the specific subnetwork that survives magnitude pruning + finetuning, or could any random 10% work just as well? Gap 2 is practical: why doesn't a 90%-sparse network run 10× faster, and what has to change before sparsity translates to actual throughput?

Gap 1 (the "which subnetwork?" question): Magnitude pruning + finetuning finds a good sparse network — but that network starts from the finetuned weights, not the original initialization. The Lottery Ticket Hypothesis (Frankle & Carlin, 2019) asks a sharper question: is there a subnetwork with the original initialization that could have trained to the same accuracy by itself? The answer is yes — and the recipe for finding it is the topic of chapters 1–2.

Gap 2 (the "no speedup" problem): A GPU is a dense matrix engine. It executes a GEMM as if every element is non-zero. Zeroing 90% of weights doesn't reduce the number of multiply-accumulate cycles — the hardware still fetches the same tensor and multiplies by zero. Speedup requires either structured sparsity (whole channels removed → smaller dense GEMM) or hardware that can skip zeros. Chapters 6–8 explain exactly when and why each approach works.

The lesson also covers the practical question of how much to prune each layer — not by hand, but automatically. Chapter 3 shows the manual sensitivity-analysis approach. Chapters 4–5 show two automated alternatives: AMC (an RL agent) and NetAdapt (a greedy hardware-in-the-loop search).

Ch 1–2: Lottery Ticket

Winning tickets exist; IMP finds them; weight rewinding is critical; random reinit fails

↓

Ch 3–5: Automation

Sensitivity analysis → AMC (DDPG per-layer policy) → NetAdapt (greedy latency-table loop)

↓

Ch 6–8: Hardware

Why unstructured is slow → CSR encoding → EIE (zero-skipping ASIC) → 2:4 tensor core

↓

Ch 9: Connections

Cheat sheet; bridge to Quantization (L5); links to related Gleams

After magnitude pruning + finetuning, you have a 90%-sparse network with good accuracy. What is the main reason it still runs at roughly the same latency on a standard GPU?

The zeros are not actually stored in memory, so memory traffic is the same. Finetuning reintroduces the pruned weights, so the network is no longer sparse. GPU GEMM kernels execute dense multiply-accumulates regardless of zero values — the tensor shape hasn't changed. The finetuning step increases the remaining weights to compensate, making computation heavier.

Chapter 1: The Lottery Ticket Hypothesis

In 2019, Frankle and Carlin ran an experiment that seemed almost too clean to be true. They took a dense network, pruned it with iterative magnitude pruning, and then — instead of finetuning from the pruned weights — they reset every surviving weight back to its value at the beginning of training (its original random initialization). Then they trained just those weights, keeping the rest frozen at zero.

The result: this sparse subnetwork, trained from scratch with its original initialization, matched the accuracy of the full dense network. With far fewer parameters. In far fewer training iterations.

The Lottery Ticket Hypothesis: A randomly initialized, dense neural network contains a sparse subnetwork — a "winning ticket" — such that, when trained in isolation from its original initialization, the winning ticket can match the full network's accuracy in at most the same number of training steps.

The word "lottery" is deliberate. When you randomly initialize a large network, you are buying many lottery tickets at once. Most tickets are losers — their random starting values don't sit in the right region of weight space to converge well when trained sparsely. But a few are winners. The full training process finds one of them (the surviving weights after pruning happen to be one such winning ticket). The hypothesis says this lucky subset could have trained alone from the start, if only you'd known which one it was.

The hypothesis makes a precise claim about initialization. It is not just saying "a small network can achieve the same accuracy as a large one." It says the right small network is already embedded inside the random initialization of the large one — and the way to find it is to train large, prune, then look at which weights survived.

Critical misconception — the mask alone is not enough: You might think: "Take the pruned mask, apply it to a fresh random initialization, and train." That doesn't work. The experiment is specific — the surviving weights must be reset to their own original values (the particular random numbers they had at initialization), not to a new random sample. This is weight rewinding, and Chapter 2 explains why it matters so much.

Why does this matter beyond intellectual curiosity? Three reasons. First, it suggests that the capacity we actually need is much less than what we train with — training wide-then-prune is the practical path to finding a small, well-initialized subnetwork. Second, it gives a theoretical grounding for why iterative magnitude pruning works so well: it is a procedure that identifies winning tickets. Third, it opens a research direction of finding winning tickets earlier (at initialization) without needing to train the full network first.

Lottery Ticket Experiment — accuracy curves across training

Three training runs at the same sparsity level: (1) the full dense network, (2) the winning ticket — sparse mask with original initialization, (3) same mask but random re-initialization. Watch how the winning ticket tracks the dense network while random reinit diverges early.

Sparsity level50%

The Lottery Ticket Hypothesis says the winning ticket achieves high accuracy when trained from its original initialization. What happens if you instead train the same mask structure but with a fresh random initialization?

It performs identically — only the mask structure matters, not the specific weight values. It performs better — fresh random values remove any bias from the original training run. It performs significantly worse — accuracy degrades, often approaching random chance at high sparsity. It performs slightly worse due to numerical precision differences.

Chapter 2: Weight Rewinding & Iterative Magnitude Pruning

To actually find a winning ticket, Frankle and Carlin used a procedure called Iterative Magnitude Pruning (IMP) combined with weight rewinding. Let's walk through it step by step, with actual numbers.

The procedure has a loop. Start with a randomly initialized dense network. Train to convergence. Prune the p% of weights with the smallest absolute values (set them to zero permanently — the mask is now fixed). Then rewind: reset every surviving weight to the value it had at initialization (step 0). Train again, with the same mask. Repeat: prune more, rewind, train. After k rounds of pruning, each removing p% of the remaining weights, the surviving fraction is (1-p/100)^k.

After k rounds at p% per round: sparsity = 1 − (1 − p/100)^k

Example: p=20%, k=5 rounds → (0.8)⁵ = 0.328 remaining = 67.2% sparse

The "rewind" step is the key. The original Frankle-Carlin paper rewound to step 0 (the initial random values). Later work by Frankle et al. (2020) refined this to "early rewinding" — rewinding to a checkpoint taken after a small number of warm-up steps (e.g., after 1-5% of training), not step 0. For large networks, early rewinding works better because the network needs a few steps to escape the chaotic early phase of SGD before the surviving weights "know" which direction to converge.

Why random reinit fails — the signal in the initialization: The original initialization values are not arbitrary noise for the winning ticket. They encode a specific starting point on the loss landscape that, together with the mask, sits in a basin that converges to a good minimum. A fresh random initialization for the same mask structure is almost certainly outside that basin. The loss surface is highly non-convex; the particular random numbers matter for which attractor you end up in.

Here is the IMP algorithm as runnable PyTorch code. The key moves are: save the initial state dict before any training, apply magnitude pruning with torch.nn.utils.prune, and restore the initial values via load_state_dict at the rewinding step.

python
import torch, copy
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model, train_fn, n_rounds, prune_pct):
    # Save original (step-0) initialization
    init_state = copy.deepcopy(model.state_dict())

    masks = {}  # will accumulate binary masks per parameter

    for round_idx in range(n_rounds):
        # 1. Train to convergence (or fixed steps)
        train_fn(model)

        # 2. Compute magnitude-based pruning mask
        for name, param in model.named_parameters():
            if 'weight' in name:
                flat = param.data.abs().flatten()
                threshold = torch.quantile(flat, prune_pct / 100.0)
                masks[name] = (param.data.abs() >= threshold).float()

        # 3. REWIND — restore weights to step-0 initialization
        model.load_state_dict(init_state, strict=False)

        # 4. Apply the accumulated mask (zero out pruned weights)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.data.mul_(masks[name])

        pct_remaining = 100 * (1 - prune_pct/100) ** (round_idx + 1)
        print(f"Round {round_idx+1}: {pct_remaining:.1f}% weights remain")

    return model, masks

# Example: 5 rounds at 20% pruning each → 32.8% remain
model, masks = iterative_magnitude_pruning(my_model, train_fn, n_rounds=5, prune_pct=20)

In IMP with weight rewinding, after the pruning step you reset surviving weights to their original initialization values. Why not simply finetune from the current (post-training) weight values instead?

Finetuning from current values always diverges due to vanishing gradients in sparse networks. Finetuning from current values produces a good compressed model — it just isn't a winning ticket. The lottery ticket property (matching full network accuracy when trained from scratch) only holds with the original initialization. Rewinding is cheaper computationally — it avoids storing gradient checkpoints. Finetuning from current values causes weight values to collapse to zero.

Chapter 3: Sensitivity Analysis — Reading a Layer's Tolerance

Before we can ask a machine to set per-layer sparsity budgets (chapters 4–5), we should understand what makes a layer sensitive to pruning in the first place. The answer turns out to be visible in a simple sweep.

Pick one layer L_i. Prune it at ratios r ∈ {10%, 20%, …, 90%} while leaving all other layers untouched. For each ratio, measure the accuracy drop ΔAcc_i^r. Plot accuracy vs pruning ratio for L_i. Repeat for every layer. The resulting family of curves is the model's sensitivity profile.

What the curves reveal: Some layers show steep accuracy drops even at 20% pruning — these are sensitive layers (typically the first convolutional layer, which processes raw pixels with few filters and captures irreplaceable low-level features). Others remain flat until 70% or 80% — these are redundant layers (typically the large fully-connected layers whose thousands of neurons overlap heavily). The profile tells you exactly how much tolerance each layer has.

The practical rule: pick a global accuracy-drop threshold T (e.g., "allow at most 2% drop"). For each layer, read off the maximum sparsity that keeps its accuracy drop below T. Assign that as the layer's pruning ratio. Layers that can tolerate 80% pruning get 80%; layers that can only tolerate 20% get 20%. Total parameter savings is the weighted average.

This is better than uniform pruning (pruning every layer to the same ratio) but still has a flaw: it ignores interactions between layers. Pruning L₃ changes which activations L₄ sees; the sensitivity profile assumes each layer is independent. AMC, coming next, addresses this by treating the whole network jointly.

Sensitivity curves — layer-by-layer accuracy vs sparsity

Six representative layers of VGG-11 on CIFAR-10. Drag the threshold T slider to see which layers survive at each accuracy budget. The intersection points give per-layer sparsity ratios.

Accuracy drop budget T8%

Sensitivity analysis assigns the first convolutional layer a sparsity of 20% while assigning the last fully-connected layer 80% sparsity. What property of these layers explains this difference?

The first conv layer has more parameters, so it can be pruned more aggressively. The last FC layer uses larger weight values, so more can be removed below the magnitude threshold. The first conv processes raw pixels and captures irreplaceable low-level features with few filters; the last FC has massive redundancy — thousands of neurons encoding highly overlapping information. The FC layer is already sparse from dropout regularization during training.

Chapter 4: AMC — AutoML for Model Compression

Sensitivity analysis is laborious (one sweep per layer per model) and suboptimal (ignores layer interactions). AMC (He et al., ECCV 2018) replaces the human with a reinforcement learning agent that learns a compression policy: given the current layer's statistics, output a sparsity ratio. The agent sees all layers sequentially, building up the compressed network, and gets a reward at the end based on accuracy under a hardware budget constraint.

The RL formulation is precise:

State s_t — the embedding of the current layer: [layer index t, kernel size, channel counts C_in/C_out, feature map size H/W, remaining FLOP budget, current compression ratio, stride, padding]. This gives the agent full context about what it's deciding.
Action a_t — a continuous number in [0, 1) representing the sparsity ratio for layer t. Continuous action space is crucial — discrete buckets like {25%, 50%, 75%} miss the optimal fractions.
Agent — DDPG (Deep Deterministic Policy Gradient), chosen specifically because it supports continuous actions. A simpler agent like DQN only handles discrete actions and would miss the optimal compression ratios.
Reward — a hardware-constrained reward function: R = −Error if the compressed model satisfies the FLOP/latency constraint, or R = −∞ if it violates the constraint. This forces the agent to discover policies that hit the budget, not just maximize accuracy blindly.

R = −Error if FLOP(compressed) ≤ FLOP_budget

R = −∞ if FLOP(compressed) > FLOP_budget

In practice, a softer version is used: R = −Error × log(FLOP_compressed). This penalizes high FLOPs continuously rather than with a hard cutoff, which gives the agent a smoother gradient signal to learn from.

What AMC learns: After training, the AMC agent discovers something humans never explicitly programmed: 1×1 convolutions (pointwise, projecting channel dimensions) have less redundancy than 3×3 convolutions — and the policy automatically prunes them less. This pattern emerges from the reward signal alone. The agent also learns to exploit the remaining FLOP budget dynamically: if early layers consumed little budget, later layers can be pruned harder.

AMC vs sensitivity analysis: Sensitivity analysis sweeps each layer independently and picks a threshold. It ignores cross-layer effects. AMC observes all layers sequentially — when it picks a sparsity for layer t, it already knows what happened to layers 1…t-1 and can adjust. The RL formulation naturally captures these interactions because the state includes the running compression ratio and remaining budget. On MobileNet, AMC achieves 2× speedup at the same accuracy as sensitivity-analysis-tuned hand compression.

AMC agent — per-layer sparsity allocation to hit a FLOP budget

Eight ResNet-style layers. The agent allocates different sparsity to each layer (bars). Drag the FLOP budget slider — the agent dynamically re-allocates across layers to stay under budget. Reward = accuracy (simulated) under the constraint.

FLOP budget (%)50%

AMC uses a DDPG agent rather than a simpler DQN agent for the compression policy. What property of the compression problem makes DDPG the right choice?

DDPG is faster to train because it uses fewer environment interactions. DQN requires labeled datasets, which are unavailable for compression problems. The action (sparsity ratio) is a continuous value in [0,1) — DQN only handles discrete action spaces and would miss the exact optimal fractions that DDPG can output. DDPG supports multi-step lookahead, which is required for planning across all layers simultaneously.

Chapter 5: NetAdapt — Greedy Hardware-in-the-Loop Pruning

AMC is a powerful general policy learner, but it requires training the RL agent — which itself takes significant compute. NetAdapt (Yang et al., ECCV 2018) takes a different approach: instead of learning a policy, it runs a greedy search guided by real latency measurements from the target hardware.

The algorithm is built around a simple idea: at each iteration, we want to reduce total latency by exactly ΔR milliseconds. We try pruning each layer individually to achieve that reduction (using a pre-built latency lookup table to avoid re-measuring hardware every time). For each candidate pruned layer, we do a short fine-tune (10k steps — cheap) and measure accuracy. We pick the layer that gives the highest accuracy after fine-tuning. Then we repeat until total latency meets the constraint.

Start: full model, latency L

Target: reach latency L* = L − ΔR in each iteration

↓

For each layer k: find smallest channels ≥ 0 such that latency drops by ≥ ΔR (LUT lookup)

Short finetune (10k steps). Record accuracy Acc_k

↓

Pick layer k* = argmax Acc_k

Commit that pruning. New latency = L − ΔR

↻ repeat until L ≤ L_target

Long fine-tune

Full fine-tuning to recover accuracy after all pruning steps

The latency lookup table (LUT): Measuring real hardware latency for every candidate pruning requires actually running the model on the target device — that's too slow. Instead, NetAdapt pre-profiles a grid of layer configurations (e.g., all channel counts from 4 to 512 in steps of 4, for each layer type). These measurements go into a lookup table. During the greedy search, latency queries are just table lookups — microseconds instead of seconds. The LUT is the key piece that makes this approach practical.

The greedy nature means NetAdapt doesn't globally optimize. It makes locally optimal decisions (best layer to prune at each step) without lookahead. But empirically it performs well — often within 1-2% accuracy of exhaustive search — because layer sensitivities are roughly independent at each greedy step.

Compared to AMC: AMC learns a global policy (RL training required, models all layer interactions), NetAdapt does iterative local search (no RL training, uses real hardware measurements, slower wall-clock due to many fine-tuning passes). NetAdapt produces a series of models at different latency budgets (one per iteration), which is useful for a family of devices with different speed targets.

NetAdapt greedy trim — step-by-step latency reduction

Five layers, each with a latency cost from a lookup table. At each step, the greedy algorithm picks the layer giving the best accuracy/latency tradeoff and trims it. Click "Next Step" to advance.

Step 0 / 4 — Full model

NetAdapt uses a pre-built latency lookup table (LUT). What problem does the LUT solve?

It stores accuracy values so the algorithm doesn't need to fine-tune after each pruning step. It allows NetAdapt to be run without access to a GPU. It converts theoretical FLOP counts to accuracy predictions automatically. Measuring real hardware latency for every candidate pruning configuration is too slow — the LUT pre-profiles all configurations so greedy search queries are instant table lookups.

Chapter 6: Sparsity & Real Speedup — Why Zeros Don't Help Themselves

You have pruned 90% of a weight matrix to zero. You expect a 10× speedup. Instead, you get essentially no speedup on a GPU. Let's understand exactly why — and what it takes to change that.

A GPU executes matrix multiplication as a GEMM (General Matrix Multiply). The CUDA kernel receives two matrix pointers and a shape. It tiles the matrices into warps, loads 32 floats at a time into registers, and executes fused multiply-add (FMA) instructions. The key point: the kernel does not inspect individual weight values before executing FMAs. It executes 32-wide SIMD operations regardless. Multiplying by zero is just as costly as multiplying by 0.7. The zero bits are fetched from DRAM (same bandwidth), loaded into registers (same latency), and multiplied in the FMA unit (same cycle count).

Unstructured sparsity on GPU = zero speedup: Zeroing individual weights does not change the tensor shape. The GEMM kernel still processes an (M×K) × (K×N) multiplication with the same M, K, N dimensions. All that changes is that some of the K partial sums happen to add zero. The hardware cannot tell. You need either (a) a different tensor shape (structured pruning — remove entire channels so K shrinks) or (b) hardware that explicitly skips zeros (EIE for fine-grained, 2:4 tensor cores for structured-sparse).

Here is how to think about the three regimes:

Sparsity Type	Shape change?	GPU speedup?	Accuracy impact	Hardware needed
Unstructured (random zeros)	No	None on GPU	Minimal at high sparsity	EIE-class custom ASIC
2:4 N:M structured	Compressed representation	Up to 2× on Ampere	~0.1% drop	Ampere/Hopper sparse tensor cores
Channel pruning (structured)	Yes — smaller dense matrix	Proportional to channels removed	Moderate	Any hardware

The speedup from channel pruning is proportional: remove 50% of output channels, the GEMM shrinks by 50% in that dimension, you get roughly 2× speedup on any hardware. Remove 90% of individual weights (unstructured), get 0× speedup on a GPU (but potentially large savings on EIE which skips zeros in hardware).

The same argument applies to memory. A 90%-sparse weight tensor stored as a dense float32 array uses exactly the same bytes as the original — the zeros are still there, taking 4 bytes each. Only compressed storage formats (CSR, 2:4 compressed) actually reduce the memory footprint.

Dense vs unstructured-sparse vs 2:4 vs channel — throughput comparison

Simulated relative throughput for a GEMM on GPU hardware. Toggle sparsity type to see actual speedup each approach delivers. The unstructured bar stays flat at 1.0× — only structured formats benefit.

A team fine-grained-prunes a ResNet-50 layer to 95% sparsity and measures inference latency on an A100 GPU without any special configuration. What do they most likely observe?

~20× speedup because 95% of multiply-accumulates are skipped. Near-zero speedup — the GEMM kernel still processes the same tensor shape with the same number of FMA operations. ~2× speedup because the Ampere sparse tensor cores detect zero weights automatically. ~5× speedup because DRAM bandwidth is reduced by 95%.

Chapter 7: EIE — Skipping Zeros in Silicon

If GPUs can't exploit fine-grained sparsity, we need different hardware. EIE (Efficient Inference Engine, Han et al., ISCA 2016 — the 5th most-cited ISCA paper in 50 years) was designed from scratch to do exactly this. It exploits two types of sparsity simultaneously: weight sparsity (90% of weights are zero after pruning + weight sharing) and activation sparsity (70% of activations are zero after ReLU).

First, EIE stores weights in a compressed format. Consider a weight column with mostly zeros. Instead of storing 4 bytes per weight for all N weights, we only store the non-zero values and their column indices. This is essentially the Compressed Sparse Column (CSC) format. Let's work through the exact encoding on a concrete example.

Take a 4×4 weight matrix for a single processing element:

W = [[0, 0.3, 0, −0.7], [0, 0, 0.5, 0], [0.2, 0, 0, 0.4], [0, −0.1, 0, 0]]

Dense storage: 4×4 = 16 values × 4 bytes = 64 bytes.

CSR (Compressed Sparse Row) encoding of the same matrix:

values = [0.3, −0.7, 0.5, 0.2, 0.4, −0.1] — the 6 non-zero values (4 bytes each = 24 bytes)
col_indices = [1, 3, 2, 0, 3, 1] — which column each value is in (1 byte each for small matrices = 6 bytes)
row_pointers = [0, 2, 3, 5, 6] — prefix sums: row 0 has entries 0..1, row 1 has entry 2, row 2 has entries 3..4, row 3 has entry 5 (5 bytes)

Total CSR: 24 + 6 + 5 = 35 bytes vs 64 dense — a 1.83× memory reduction. For the actual EIE weight matrices with 90%+ sparsity, this ratio becomes 10× or more. EIE additionally quantizes non-zero values to 4 bits (decoded to 16-bit for arithmetic), bringing memory footprint down another 2×.

Activation sparsity — the free lunch from ReLU: After a ReLU layer, roughly 70% of activation values are exactly zero (negative pre-activations were clamped). When computing the next layer's GEMM, multiplying a weight by a zero activation produces zero — a computation that can be entirely skipped. EIE's processing elements (PEs) include a leading non-zero detector: they scan the input activation vector and jump directly to the next non-zero entry. This is what "W*0=0, 0*A=0" means in the EIE dataflow — both weight zeros and activation zeros are handled by skipping.

EIE's 64 PEs run in parallel. Each PE owns a vertical stripe of the weight matrix (a set of columns in CSC format). When an input activation a_j is non-zero, all PEs simultaneously look up their column j in compressed storage and accumulate the corresponding partial sums. When a_j = 0, all PEs skip it together — the leading non-zero detector in the input broadcast logic fires and advances to the next non-zero index. The result: 33× effective FLOP reduction for AlexNet-FC6 (9% weight density × 35% activation density = 3.15% of multiplications actually executed).

EIE stores the weight matrix in CSR/CSC format. A 1024×1024 weight matrix has 95% sparsity (only 5% non-zeros = 52,428 non-zero entries). Compute the memory for dense float32 storage vs CSR storage. Assume 4 bytes per float32 value, 2 bytes per column index, 4 bytes per row pointer.

Dense: 4 MB. CSR: 3.5 MB. Savings: 12%. Dense: 4 MB. CSR: 0.5 MB. Savings: 87.5%. Dense: 4 MB. CSR: ~0.41 MB (values: 209KB + col_idx: 105KB + row_ptrs: 4KB). Savings: ~90%. Dense: 4 MB. CSR: 0.2 MB. Savings: 95%.

Chapter 8: Showcase: 2:4 Sparsity Explorer

NVIDIA's Ampere architecture (A100, RTX 30 series) introduced 2:4 structured sparsity: exactly 2 out of every 4 consecutive weights must be non-zero. This specific pattern is the minimum structure needed to build hardware that guarantees 2× throughput while keeping fine-grained sparsity (not entire channels).

Why 2:4 specifically? The design choice balances two competing forces. More zeros means more potential speedup, but irregular patterns require more indexing overhead. NVIDIA's solution: enforce the constraint on groups of exactly 4. With 4-wide SIMD lanes and 2-bit indices (enough to address 4 positions), the metadata fits neatly: each group of 4 weights compresses to 2 values + 2 bits × 2 = 4 bits of metadata. No wasted bits. Hardware can be designed around this fixed width.

The compressed storage format works as follows. Take a weight row of length C. Divide into C/4 groups of 4. In each group, keep the 2 largest absolute-value weights, zero the other 2. Store only the 2 non-zero values (half the storage) plus a 4-bit metadata "selector" that records which 2 of the 4 positions were kept.

Full row: [w₀, w₁, w₂, w₃] → keep top-2 by |w|

Example: [0.3, −0.7, 0.1, 0.5] → keep w₁=−0.7, w₃=0.5 → values=[−0.7, 0.5], meta=0b1010

Storage savings: R×C floats → R×(C/2) floats + R×(C/4)×4 bits metadata

For FP16 (2 bytes): C=16 weights = 32 bytes → 8 values × 2 bytes + 4 groups × 4 bits = 16 + 2 = 18 bytes (44% savings)

During GEMM, the sparse tensor core receives the compressed A matrix and the dense B matrix. For each group of 4, the 4-bit metadata selects 2 elements from B's corresponding row. Only 2 multiply-accumulates fire per group of 4 — guaranteed 2× throughput. This is what makes 2:4 sparsity different from unstructured: the hardware knows exactly which 2 of every 4 are non-zero and can schedule accordingly without any dynamic zero-detection overhead.

python
import torch

def apply_2_4_sparsity(weight):
    # weight: any shape. Works on last dimension in groups of 4.
    orig_shape = weight.shape
    w = weight.reshape(−1, 4)          # reshape to (-1, 4) groups

    # For each group of 4, keep only the 2 largest by absolute value
    topk_idx = w.abs().topk(2, dim=1).indices   # shape: (-1, 2)

    mask = torch.zeros_like(w, dtype=torch.bool)
    mask.scatter_(1, topk_idx, True)

    w_sparse = w * mask.float()              # zero out the bottom 2
    return w_sparse.reshape(orig_shape), mask.reshape(orig_shape)

# Example
W = torch.tensor([0.3, −0.7, 0.1, 0.5,  0.6, 0.2, −0.9, 0.1])
W_sparse, mask = apply_2_4_sparsity(W)
# W_sparse: [0, -0.7, 0, 0.5,  0.6, 0, -0.9, 0]
# mask:     [F,  T,   F,  T,    T,  F,   T,   F]

2:4 Sparsity Explorer — storage format and throughput

An 8-wide weight vector divided into two groups of 4. Drag the weight values — the canvas shows which 2 survive per group (top-2 by |w|), the 4-bit selector metadata, the compressed storage layout, and the resulting throughput vs dense.

Avg weight magnitude5

For a FP16 weight matrix of shape 1024×1024 stored in 2:4 compressed format, compute the memory used. Assume: non-zero values stored as FP16 (2 bytes each), metadata stored as 4 bits per group of 4.

1 MB (exactly half the 2 MB dense representation, ignoring metadata). 2.5 MB (50% more due to metadata overhead). ~1.065 MB: values = 1024×512×2 = 1,048,576 bytes (1 MB) + metadata = 1024×256×4 bits = 131,072 bytes (128 KB). Total ≈ 1.125 MB vs 2 MB dense. 512 KB (4× compression from 50% sparsity × 2-bit metadata compression).

Chapter 9: Connections & Cheat Sheet

Complete Cheat Sheet: Pruning II

Lottery Ticket Hypothesis — Recipe

Initialize network θ₀ (random). Save this state.
Train to convergence → θ_T.
Prune p% of weights by magnitude → mask m.
Rewind: reset surviving weights to θ₀ (or to θ_k for early rewinding, k ≪ T).
Train the masked subnetwork → if accuracy ≥ full network: winning ticket found.
Repeat from step 3 with higher sparsity target.

Key insight: the mask alone is not the winning ticket. The mask + original initialization together form the ticket.

AMC vs NetAdapt — Decision Guide

Property	AMC	NetAdapt
Algorithm	DDPG (RL agent)	Greedy local search
Search objective	Learned policy (global)	Greedy best-accuracy-per-step (local)
Hardware feedback	FLOP/latency model or LUT	Real latency LUT on target device
Agent training cost	High (RL training required)	None
Fine-tuning cost per search	Low (short eval)	High (10k steps × #layers × #iterations)
Output	Single compressed model	Series of models at different budgets
Layer interaction	Captured (state includes running budget)	Ignored (greedy)
Best for	Maximum accuracy at fixed budget	Device families; real latency required

Sparsity Format × Hardware Support

Sparsity Format	Granularity	Hardware	Speedup	Storage
Dense (baseline)	—	Any GEMM	1×	M×K×dtype
Unstructured (fine-grained)	Individual weights	EIE (ASIC), sparse GEMM libs	0× on GPU, up to 10× on EIE	CSR/CSC: nnz × (val + idx) + ptrs
2:4 structured (N:M)	Groups of 4	NVIDIA Ampere/Hopper sparse tensor cores	~2× (1.5× measured on BERT)	50% values + ~6% metadata
Channel pruning	Entire filters	Any dense hardware	Proportional to channels removed	Proportional reduction
Depth pruning	Entire layers	Any dense hardware	Proportional to layers removed	Proportional reduction

Bridge to Quantization (Lecture 5)

Pruning reduces the number of weights. Quantization reduces the precision of each weight — from float32 (4 bytes) to INT8 (1 byte) or even INT4 (0.5 bytes). The two are orthogonal and compounding: EIE already uses 4-bit weight quantization on top of sparsity. A pruned + quantized network can be dramatically smaller than either technique alone. L5 covers the fundamentals: quantization grid, rounding error, the INT8 symmetric quantization formula, calibration, and QAT (Quantization-Aware Training).

What to know cold: (1) IMP recipe with weight rewinding — 4 steps. (2) AMC state/action/agent/reward — each element has a reason. (3) Why unstructured sparsity doesn't speed up GPU GEMMs. (4) CSR byte count for a given sparsity level. (5) 2:4 storage: R×C/2 values + R×C/4×4 bits metadata.

Related lessons on this site

TinyML L3: Pruning I — granularity, criteria (OBD, magnitude, APoZ), iterative prune-finetune loop, sensitivity analysis foundations.
TinyML L2: NN Building Blocks — the FLOP and param formulas that AMC and NetAdapt optimize over.
TinyML L1: Efficiency Metrics — roofline model and energy hierarchy that explain why DRAM bandwidth bottlenecks CSR-encoded sparse matrices.
Attention Variants — transformer sparsity (token dropping, sparse attention patterns) — a different kind of structural sparsity enabled by the same hardware principles.

Feynman test: Can you explain why the Lottery Ticket Hypothesis implies that large networks are better training vehicles than small ones, even if you ultimately want a small network? The right answer: large networks contain many potential winning tickets; training wide-then-prune is the search procedure that finds one. A small network trained directly from scratch may never stumble onto the right initialization for its restricted capacity. The over-parameterized network is a lottery machine — you need many tickets to reliably win.

You have a 90%-sparse network compressed with EIE's CSR format. A colleague proposes switching to NVIDIA 2:4 sparsity to "get hardware support." What tradeoff are they making?

2:4 gives exactly the same speedup as CSR/EIE on NVIDIA hardware. 2:4 is always better — it runs on any GPU and gives more speedup than EIE. 2:4 constrains sparsity to exactly 50% (not 90%), so you get less compression and potentially higher accuracy. It runs on mainstream GPU hardware (no custom ASIC). EIE achieves higher sparsity at the cost of needing a custom accelerator. 2:4 sparsity requires more memory due to metadata overhead, eliminating compression benefits.