L3 left two open wounds: we never proved which sparse subnetwork to keep, and our sparse model was no faster in practice. This lesson heals both. The Lottery Ticket Hypothesis gives a precise recipe for finding the right subnetwork — and weight rewinding explains why the mask alone is not enough. AMC turns per-layer sparsity allocation into a reinforcement-learning policy. NetAdapt does it greedily with a real latency table. Then we descend into hardware: why unstructured sparsity is nearly free on GPUs, how CSR encoding really works, what EIE's processing elements actually compute, and why NVIDIA's 2:4 pattern was the minimum regularity needed to guarantee 2× throughput. MIT 6.5940 by Song Han.
By the end of Pruning I you could prune AlexNet to 10% of its weights, finetune, and recover near-original accuracy. That is genuinely impressive. But if you tried to deploy that pruned network on a phone and timed inference, you'd find something embarrassing: it ran at roughly the same speed as the full network. And if someone asked you why those 10% survived rather than some other 10%, you wouldn't have a satisfying answer.
These are the two gaps. Gap 1 is theoretical: is there something special about the specific subnetwork that survives magnitude pruning + finetuning, or could any random 10% work just as well? Gap 2 is practical: why doesn't a 90%-sparse network run 10× faster, and what has to change before sparsity translates to actual throughput?
The lesson also covers the practical question of how much to prune each layer — not by hand, but automatically. Chapter 3 shows the manual sensitivity-analysis approach. Chapters 4–5 show two automated alternatives: AMC (an RL agent) and NetAdapt (a greedy hardware-in-the-loop search).
In 2019, Frankle and Carlin ran an experiment that seemed almost too clean to be true. They took a dense network, pruned it with iterative magnitude pruning, and then — instead of finetuning from the pruned weights — they reset every surviving weight back to its value at the beginning of training (its original random initialization). Then they trained just those weights, keeping the rest frozen at zero.
The result: this sparse subnetwork, trained from scratch with its original initialization, matched the accuracy of the full dense network. With far fewer parameters. In far fewer training iterations.
The word "lottery" is deliberate. When you randomly initialize a large network, you are buying many lottery tickets at once. Most tickets are losers — their random starting values don't sit in the right region of weight space to converge well when trained sparsely. But a few are winners. The full training process finds one of them (the surviving weights after pruning happen to be one such winning ticket). The hypothesis says this lucky subset could have trained alone from the start, if only you'd known which one it was.
The hypothesis makes a precise claim about initialization. It is not just saying "a small network can achieve the same accuracy as a large one." It says the right small network is already embedded inside the random initialization of the large one — and the way to find it is to train large, prune, then look at which weights survived.
Why does this matter beyond intellectual curiosity? Three reasons. First, it suggests that the capacity we actually need is much less than what we train with — training wide-then-prune is the practical path to finding a small, well-initialized subnetwork. Second, it gives a theoretical grounding for why iterative magnitude pruning works so well: it is a procedure that identifies winning tickets. Third, it opens a research direction of finding winning tickets earlier (at initialization) without needing to train the full network first.
Three training runs at the same sparsity level: (1) the full dense network, (2) the winning ticket — sparse mask with original initialization, (3) same mask but random re-initialization. Watch how the winning ticket tracks the dense network while random reinit diverges early.
To actually find a winning ticket, Frankle and Carlin used a procedure called Iterative Magnitude Pruning (IMP) combined with weight rewinding. Let's walk through it step by step, with actual numbers.
The procedure has a loop. Start with a randomly initialized dense network. Train to convergence. Prune the p% of weights with the smallest absolute values (set them to zero permanently — the mask is now fixed). Then rewind: reset every surviving weight to the value it had at initialization (step 0). Train again, with the same mask. Repeat: prune more, rewind, train. After k rounds of pruning, each removing p% of the remaining weights, the surviving fraction is (1-p/100)k.
The "rewind" step is the key. The original Frankle-Carlin paper rewound to step 0 (the initial random values). Later work by Frankle et al. (2020) refined this to "early rewinding" — rewinding to a checkpoint taken after a small number of warm-up steps (e.g., after 1-5% of training), not step 0. For large networks, early rewinding works better because the network needs a few steps to escape the chaotic early phase of SGD before the surviving weights "know" which direction to converge.
Here is the IMP algorithm as runnable PyTorch code. The key moves are: save the initial state dict before any training, apply magnitude pruning with torch.nn.utils.prune, and restore the initial values via load_state_dict at the rewinding step.
python import torch, copy import torch.nn.utils.prune as prune def iterative_magnitude_pruning(model, train_fn, n_rounds, prune_pct): # Save original (step-0) initialization init_state = copy.deepcopy(model.state_dict()) masks = {} # will accumulate binary masks per parameter for round_idx in range(n_rounds): # 1. Train to convergence (or fixed steps) train_fn(model) # 2. Compute magnitude-based pruning mask for name, param in model.named_parameters(): if 'weight' in name: flat = param.data.abs().flatten() threshold = torch.quantile(flat, prune_pct / 100.0) masks[name] = (param.data.abs() >= threshold).float() # 3. REWIND — restore weights to step-0 initialization model.load_state_dict(init_state, strict=False) # 4. Apply the accumulated mask (zero out pruned weights) with torch.no_grad(): for name, param in model.named_parameters(): if name in masks: param.data.mul_(masks[name]) pct_remaining = 100 * (1 - prune_pct/100) ** (round_idx + 1) print(f"Round {round_idx+1}: {pct_remaining:.1f}% weights remain") return model, masks # Example: 5 rounds at 20% pruning each → 32.8% remain model, masks = iterative_magnitude_pruning(my_model, train_fn, n_rounds=5, prune_pct=20)
Before we can ask a machine to set per-layer sparsity budgets (chapters 4–5), we should understand what makes a layer sensitive to pruning in the first place. The answer turns out to be visible in a simple sweep.
Pick one layer Li. Prune it at ratios r ∈ {10%, 20%, …, 90%} while leaving all other layers untouched. For each ratio, measure the accuracy drop ΔAccir. Plot accuracy vs pruning ratio for Li. Repeat for every layer. The resulting family of curves is the model's sensitivity profile.
The practical rule: pick a global accuracy-drop threshold T (e.g., "allow at most 2% drop"). For each layer, read off the maximum sparsity that keeps its accuracy drop below T. Assign that as the layer's pruning ratio. Layers that can tolerate 80% pruning get 80%; layers that can only tolerate 20% get 20%. Total parameter savings is the weighted average.
This is better than uniform pruning (pruning every layer to the same ratio) but still has a flaw: it ignores interactions between layers. Pruning L3 changes which activations L4 sees; the sensitivity profile assumes each layer is independent. AMC, coming next, addresses this by treating the whole network jointly.
Six representative layers of VGG-11 on CIFAR-10. Drag the threshold T slider to see which layers survive at each accuracy budget. The intersection points give per-layer sparsity ratios.
Sensitivity analysis is laborious (one sweep per layer per model) and suboptimal (ignores layer interactions). AMC (He et al., ECCV 2018) replaces the human with a reinforcement learning agent that learns a compression policy: given the current layer's statistics, output a sparsity ratio. The agent sees all layers sequentially, building up the compressed network, and gets a reward at the end based on accuracy under a hardware budget constraint.
The RL formulation is precise:
In practice, a softer version is used: R = −Error × log(FLOPcompressed). This penalizes high FLOPs continuously rather than with a hard cutoff, which gives the agent a smoother gradient signal to learn from.
Eight ResNet-style layers. The agent allocates different sparsity to each layer (bars). Drag the FLOP budget slider — the agent dynamically re-allocates across layers to stay under budget. Reward = accuracy (simulated) under the constraint.
AMC is a powerful general policy learner, but it requires training the RL agent — which itself takes significant compute. NetAdapt (Yang et al., ECCV 2018) takes a different approach: instead of learning a policy, it runs a greedy search guided by real latency measurements from the target hardware.
The algorithm is built around a simple idea: at each iteration, we want to reduce total latency by exactly ΔR milliseconds. We try pruning each layer individually to achieve that reduction (using a pre-built latency lookup table to avoid re-measuring hardware every time). For each candidate pruned layer, we do a short fine-tune (10k steps — cheap) and measure accuracy. We pick the layer that gives the highest accuracy after fine-tuning. Then we repeat until total latency meets the constraint.
The greedy nature means NetAdapt doesn't globally optimize. It makes locally optimal decisions (best layer to prune at each step) without lookahead. But empirically it performs well — often within 1-2% accuracy of exhaustive search — because layer sensitivities are roughly independent at each greedy step.
Compared to AMC: AMC learns a global policy (RL training required, models all layer interactions), NetAdapt does iterative local search (no RL training, uses real hardware measurements, slower wall-clock due to many fine-tuning passes). NetAdapt produces a series of models at different latency budgets (one per iteration), which is useful for a family of devices with different speed targets.
Five layers, each with a latency cost from a lookup table. At each step, the greedy algorithm picks the layer giving the best accuracy/latency tradeoff and trims it. Click "Next Step" to advance.
You have pruned 90% of a weight matrix to zero. You expect a 10× speedup. Instead, you get essentially no speedup on a GPU. Let's understand exactly why — and what it takes to change that.
A GPU executes matrix multiplication as a GEMM (General Matrix Multiply). The CUDA kernel receives two matrix pointers and a shape. It tiles the matrices into warps, loads 32 floats at a time into registers, and executes fused multiply-add (FMA) instructions. The key point: the kernel does not inspect individual weight values before executing FMAs. It executes 32-wide SIMD operations regardless. Multiplying by zero is just as costly as multiplying by 0.7. The zero bits are fetched from DRAM (same bandwidth), loaded into registers (same latency), and multiplied in the FMA unit (same cycle count).
Here is how to think about the three regimes:
| Sparsity Type | Shape change? | GPU speedup? | Accuracy impact | Hardware needed |
|---|---|---|---|---|
| Unstructured (random zeros) | No | None on GPU | Minimal at high sparsity | EIE-class custom ASIC |
| 2:4 N:M structured | Compressed representation | Up to 2× on Ampere | ~0.1% drop | Ampere/Hopper sparse tensor cores |
| Channel pruning (structured) | Yes — smaller dense matrix | Proportional to channels removed | Moderate | Any hardware |
The speedup from channel pruning is proportional: remove 50% of output channels, the GEMM shrinks by 50% in that dimension, you get roughly 2× speedup on any hardware. Remove 90% of individual weights (unstructured), get 0× speedup on a GPU (but potentially large savings on EIE which skips zeros in hardware).
The same argument applies to memory. A 90%-sparse weight tensor stored as a dense float32 array uses exactly the same bytes as the original — the zeros are still there, taking 4 bytes each. Only compressed storage formats (CSR, 2:4 compressed) actually reduce the memory footprint.
Simulated relative throughput for a GEMM on GPU hardware. Toggle sparsity type to see actual speedup each approach delivers. The unstructured bar stays flat at 1.0× — only structured formats benefit.
If GPUs can't exploit fine-grained sparsity, we need different hardware. EIE (Efficient Inference Engine, Han et al., ISCA 2016 — the 5th most-cited ISCA paper in 50 years) was designed from scratch to do exactly this. It exploits two types of sparsity simultaneously: weight sparsity (90% of weights are zero after pruning + weight sharing) and activation sparsity (70% of activations are zero after ReLU).
First, EIE stores weights in a compressed format. Consider a weight column with mostly zeros. Instead of storing 4 bytes per weight for all N weights, we only store the non-zero values and their column indices. This is essentially the Compressed Sparse Column (CSC) format. Let's work through the exact encoding on a concrete example.
Take a 4×4 weight matrix for a single processing element:
Dense storage: 4×4 = 16 values × 4 bytes = 64 bytes.
CSR (Compressed Sparse Row) encoding of the same matrix:
values = [0.3, −0.7, 0.5, 0.2, 0.4, −0.1] — the 6 non-zero values (4 bytes each = 24 bytes)col_indices = [1, 3, 2, 0, 3, 1] — which column each value is in (1 byte each for small matrices = 6 bytes)row_pointers = [0, 2, 3, 5, 6] — prefix sums: row 0 has entries 0..1, row 1 has entry 2, row 2 has entries 3..4, row 3 has entry 5 (5 bytes)Total CSR: 24 + 6 + 5 = 35 bytes vs 64 dense — a 1.83× memory reduction. For the actual EIE weight matrices with 90%+ sparsity, this ratio becomes 10× or more. EIE additionally quantizes non-zero values to 4 bits (decoded to 16-bit for arithmetic), bringing memory footprint down another 2×.
EIE's 64 PEs run in parallel. Each PE owns a vertical stripe of the weight matrix (a set of columns in CSC format). When an input activation aj is non-zero, all PEs simultaneously look up their column j in compressed storage and accumulate the corresponding partial sums. When aj = 0, all PEs skip it together — the leading non-zero detector in the input broadcast logic fires and advances to the next non-zero index. The result: 33× effective FLOP reduction for AlexNet-FC6 (9% weight density × 35% activation density = 3.15% of multiplications actually executed).
NVIDIA's Ampere architecture (A100, RTX 30 series) introduced 2:4 structured sparsity: exactly 2 out of every 4 consecutive weights must be non-zero. This specific pattern is the minimum structure needed to build hardware that guarantees 2× throughput while keeping fine-grained sparsity (not entire channels).
Why 2:4 specifically? The design choice balances two competing forces. More zeros means more potential speedup, but irregular patterns require more indexing overhead. NVIDIA's solution: enforce the constraint on groups of exactly 4. With 4-wide SIMD lanes and 2-bit indices (enough to address 4 positions), the metadata fits neatly: each group of 4 weights compresses to 2 values + 2 bits × 2 = 4 bits of metadata. No wasted bits. Hardware can be designed around this fixed width.
The compressed storage format works as follows. Take a weight row of length C. Divide into C/4 groups of 4. In each group, keep the 2 largest absolute-value weights, zero the other 2. Store only the 2 non-zero values (half the storage) plus a 4-bit metadata "selector" that records which 2 of the 4 positions were kept.
During GEMM, the sparse tensor core receives the compressed A matrix and the dense B matrix. For each group of 4, the 4-bit metadata selects 2 elements from B's corresponding row. Only 2 multiply-accumulates fire per group of 4 — guaranteed 2× throughput. This is what makes 2:4 sparsity different from unstructured: the hardware knows exactly which 2 of every 4 are non-zero and can schedule accordingly without any dynamic zero-detection overhead.
python import torch def apply_2_4_sparsity(weight): # weight: any shape. Works on last dimension in groups of 4. orig_shape = weight.shape w = weight.reshape(−1, 4) # reshape to (-1, 4) groups # For each group of 4, keep only the 2 largest by absolute value topk_idx = w.abs().topk(2, dim=1).indices # shape: (-1, 2) mask = torch.zeros_like(w, dtype=torch.bool) mask.scatter_(1, topk_idx, True) w_sparse = w * mask.float() # zero out the bottom 2 return w_sparse.reshape(orig_shape), mask.reshape(orig_shape) # Example W = torch.tensor([0.3, −0.7, 0.1, 0.5, 0.6, 0.2, −0.9, 0.1]) W_sparse, mask = apply_2_4_sparsity(W) # W_sparse: [0, -0.7, 0, 0.5, 0.6, 0, -0.9, 0] # mask: [F, T, F, T, T, F, T, F]
An 8-wide weight vector divided into two groups of 4. Drag the weight values — the canvas shows which 2 survive per group (top-2 by |w|), the 4-bit selector metadata, the compressed storage layout, and the resulting throughput vs dense.
Key insight: the mask alone is not the winning ticket. The mask + original initialization together form the ticket.
| Property | AMC | NetAdapt |
|---|---|---|
| Algorithm | DDPG (RL agent) | Greedy local search |
| Search objective | Learned policy (global) | Greedy best-accuracy-per-step (local) |
| Hardware feedback | FLOP/latency model or LUT | Real latency LUT on target device |
| Agent training cost | High (RL training required) | None |
| Fine-tuning cost per search | Low (short eval) | High (10k steps × #layers × #iterations) |
| Output | Single compressed model | Series of models at different budgets |
| Layer interaction | Captured (state includes running budget) | Ignored (greedy) |
| Best for | Maximum accuracy at fixed budget | Device families; real latency required |
| Sparsity Format | Granularity | Hardware | Speedup | Storage |
|---|---|---|---|---|
| Dense (baseline) | — | Any GEMM | 1× | M×K×dtype |
| Unstructured (fine-grained) | Individual weights | EIE (ASIC), sparse GEMM libs | 0× on GPU, up to 10× on EIE | CSR/CSC: nnz × (val + idx) + ptrs |
| 2:4 structured (N:M) | Groups of 4 | NVIDIA Ampere/Hopper sparse tensor cores | ~2× (1.5× measured on BERT) | 50% values + ~6% metadata |
| Channel pruning | Entire filters | Any dense hardware | Proportional to channels removed | Proportional reduction |
| Depth pruning | Entire layers | Any dense hardware | Proportional to layers removed | Proportional reduction |
Pruning reduces the number of weights. Quantization reduces the precision of each weight — from float32 (4 bytes) to INT8 (1 byte) or even INT4 (0.5 bytes). The two are orthogonal and compounding: EIE already uses 4-bit weight quantization on top of sparsity. A pruned + quantized network can be dramatically smaller than either technique alone. L5 covers the fundamentals: quantization grid, rounding error, the INT8 symmetric quantization formula, calibration, and QAT (Quantization-Aware Training).