TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 3

Pruning I: Making Networks Sparse

AlexNet has 61 million parameters — but 90% of them can be removed with almost no accuracy loss. Which 10% actually matter, and how do you find them? This lesson derives the pruning problem from first principles, walks every granularity from individual weights to whole channels, builds five criteria for deciding what to cut (including the Optimal Brain Damage second-order derivation), and shows the iterative prune–finetune loop that recovers accuracy at extreme sparsity. MIT 6.5940 by Song Han.

Prerequisites: TinyML L1 (efficiency metrics) + TinyML L2 (layer cost formulas). Basic calculus (Taylor series). No new math beyond that.
10 chapters · 5 live canvases · derived from first principles

Chapter 0: The 90% You Don't Need

Take AlexNet. 61 million parameters. The weight tensor of a single 4096 × 4096 fully-connected layer (fc7) alone holds 16.8 million values. If you print them out and plot a histogram, you'll see something striking: the vast majority of those values hover near zero. Not exactly zero, but close enough that removing them barely changes what the network computes.

That observation, made empirically by Song Han and colleagues in 2015, led to a result that still surprises people: you can prune 90% of AlexNet's connections and, after a brief finetuning pass, restore accuracy to within 0.5% of the original. You can do the same to VGG-16 at 12× compression. The 10% of weights you keep are doing almost all the work. The other 90% are passengers.

This is the core insight behind neural network pruning: trained networks are massively over-parameterized, and most of the excess can be removed without hurting the task they were trained for. The brain does the same thing: a young child (around age 2 to 4) has ~15,000 synapses per neuron; by adolescence that drops to ~7,000 as unused connections are eliminated. The brain prunes based on activity; we prune based on importance criteria.

What pruning actually does: It produces a sparse network — a network where many weight values are exactly zero and can be skipped. Whether that sparsity translates to speedup depends on granularity: zeroing individual weights (fine-grained/unstructured) requires sparse-matrix hardware to accelerate; zeroing entire channels (structured) produces a smaller dense network that runs faster on any hardware. The central tension of this lesson is that finer-grained pruning compresses more, but coarser pruning accelerates more.

Here is the weight-magnitude picture for a toy fully-connected layer, representative of what you see in practice. Notice how most weights cluster near zero while a few have large magnitude. A naive threshold wipes out the near-zero mass while preserving the large ones. The canvas below lets you visualize this matrix and sweep a sparsity threshold.

Weight matrix heat map: magnitude distribution

A simulated 8×8 weight matrix. Darker cells = larger |w|. The threshold (drag the slider) zeroes out weights below it. Watch how many cells are zeroed at 50% vs 90% sparsity, while the few large weights survive.

Sparsity threshold: 50%

The tour this lesson takes: (1) formalize what pruning is as an optimization problem; (2) survey the spectrum of granularities — from individual weights up to whole channels — and what each buys you on real hardware; (3) study five criteria for scoring which weights to remove, including the elegant second-order method from Optimal Brain Damage (LeCun, 1989); (4) understand the prune–finetune–repeat loop that recovers accuracy at extreme sparsity; and (5) see how different layers tolerate wildly different sparsity levels.

Ch 1: The problem. argmin loss s.t. ‖W‖₀ ≤ N: the NP-hard combinatorial formulation and why we relax it
Ch 2–3: Granularity. Fine-grained → N:M → vector → kernel → channel/filter: compression ratio vs HW speedup tradeoff
Ch 4–6: Criteria. Magnitude (L1/L2), scaling factor (BN γ), APoZ, and the OBD Hessian saliency derivation
Ch 7–8: Training. One-shot vs iterative; the persistent-mask gotcha; per-layer sensitivity; sparsity explorer
Ch 9: Connections. Granularity × criteria cheat sheet; bridge to Pruning II (lottery ticket, AMC, system support)

AlexNet is pruned from 61M to 6.7M parameters (9× compression). The accuracy drop after retraining is negligible. What does this tell us about the trained network?

Chapter 1: The Pruning Problem (Formal)

Let's be precise about what we're actually trying to do. You have a trained network with weights W and an objective (loss) function L(x; W). You want to find a pruned set of weights W_P that minimizes the same objective while having few non-zero entries:

$$\arg\min_{W_P} L(x; W_P) \quad \text{subject to} \quad \|W_P\|_0 \le N$$

Here $\|W_P\|_0$ counts the number of non-zero elements in W_P: the L0 "norm" (it's not actually a norm, but it counts non-zeros). N is your target sparsity budget: how many connections the pruned network is allowed to have.

This formulation is clean but the solution is not. Exactly minimizing this is an NP-hard combinatorial search — you'd have to evaluate every possible subset of weights of size N and pick the best one. For a 61M-parameter network, that's more subsets than atoms in the observable universe.
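To get a feel for the size of that search space, the sketch below counts the candidate subsets using the log-gamma function. It assumes the 61M → 6.7M AlexNet figures quoted in this lesson; the value of roughly 10^80 atoms in the observable universe is a commonly cited estimate, used here only for scale.

```python
import math

def log10_num_subsets(n: int, k: int) -> float:
    """log10 of C(n, k), the number of ways to keep k weights out of n."""
    ln_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_choose / math.log(10)

n_weights = 61_000_000   # AlexNet parameter count
n_keep = 6_700_000       # non-zero budget N after pruning
digits = log10_num_subsets(n_weights, n_keep)

ATOMS_LOG10 = 80         # assumed order of magnitude, for comparison only
print(f"number of subsets is about 10^{digits:,.0f}")
print(f"exceeds 10^{ATOMS_LOG10}: {digits > ATOMS_LOG10}")
```

The exponent alone runs into the millions, so exhaustive search is out of the question and some importance heuristic is unavoidable.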

Why we don't solve it exactly: Finding the globally optimal pruned network is NP-hard. Instead, every practical pruning algorithm is a heuristic approximation: it uses a proxy criterion (magnitude, Hessian, scaling factor, etc.) to score each weight's importance and then removes the least-important ones. The trick is choosing a proxy that correlates well with the true objective change δL.

The practical approach is a three-step decomposition:

  1. Score every weight (or group of weights) with an importance metric — a proxy for how much the loss would increase if we removed it.
  2. Rank and threshold — remove the k% least important weights.
  3. Finetune — run a few epochs of gradient descent with the surviving weights to recover accuracy.

Two design choices dominate the rest of this lesson: what you prune (granularity — individual weights? entire channels?) and why you prune it (criterion — magnitude? curvature? activation statistics?). These are independent decisions: you can use magnitude as a criterion for either fine-grained or channel pruning.

The biological connection. The brain prunes synapses that fire rarely and strengthens synapses that fire together. Neural pruning mirrors this: a weight that rarely contributes to the output (small magnitude, or small curvature × magnitude product) is a safe candidate for removal. Han et al. found that about 90% of the connections in AlexNet fall below a threshold and can be removed, with retraining, while leaving task accuracy essentially unchanged; the remaining 10% carry nearly all the signal.

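Worked numbers: the compression table from Han et al. 2015 (blank cells are values not given here):

| Network | Before | After | Compression | MAC reduction |
|---|---|---|---|---|
| AlexNet | 61M | 6.7M | 9× | |
| VGG-16 | 138M | 10.3M | 12× | |
| GoogLeNet | 7M | 2.0M | 3.5× | |
| ResNet-50 | 26M | 7.47M | 3.4× | 6.3× |
| SqueezeNet | 1M | 0.38M | 3.2× | 3.5× |

Note: parameter compression and MAC reduction are different quantities and need not match. Where both are listed (ResNet-50, SqueezeNet), MACs shrink by more than parameters do. Neither ratio is a latency speedup for unstructured pruning, because zero weights still occupy compute cycles on dense GPU kernels. Real speedup requires sparse hardware support or structured pruning.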

Key asymmetry: For fine-grained pruning on standard hardware, neither the parameter compression ratio nor the theoretical MAC reduction turns into a matching latency reduction. If you need actual latency reduction, you need either: (a) structured pruning (channels, filters) that produces a smaller dense network, or (b) hardware with sparse execution (NVIDIA Ampere 2:4 sparsity, or custom ASICs like EIE).
The pruning objective is “argmin L(x; W_P) subject to ‖W_P‖₀ ≤ N”. Why can't we solve this exactly for a 61M-parameter network?

Chapter 2: Granularity I — Fine-Grained & N:M Sparsity

Before you decide which weights to prune, you must decide at what scale to prune. This is granularity: how big is the unit you remove? The spectrum runs from individual scalar weights at one end, to entire filters (all weights in an output channel) at the other.

Fine-grained pruning (also called unstructured pruning) removes individual weights. For a 2D weight matrix, you get a pattern that looks random — a few surviving non-zeros scattered irregularly across the tensor. This is the most flexible option: you can remove exactly the weights with the smallest importance score, regardless of where they live. The result is a sparse matrix.

The problem: regular hardware runs on dense matrix operations (GEMM). Unless it is converted to a compressed sparse format, a pruned weight tensor occupies the same memory as its dense counterpart, and the zeros are still fed through the multiply-accumulate pipeline. On a standard GPU, zeroing out 90% of a matrix doesn't make inference 10× faster; it might be the same speed or slower, because the hardware still executes dense vector operations and just multiplies by 0. To actually accelerate fine-grained sparse inference, you need a compressed storage format together with kernels or hardware that skip the zeros (for example a custom sparse accelerator such as EIE).

Misconception alert: “Pruning 90% of the weights makes the network about 10× faster.” It does not: unstructured pruning reduces parameters but NOT latency unless sparse kernels or sparse hardware are used. This is one of the most common mistakes in efficient ML. On a standard GPU, a 90%-sparse network runs at essentially the same latency as the dense original. The compression benefit is in model storage and, when weights are kept in a compressed format, in memory bandwidth when loading them, not in arithmetic throughput.

N:M sparsity is a structured-irregular middle ground. For every contiguous M elements in the weight tensor, exactly N are kept and M−N are zeroed. The classic case is 2:4 sparsity: in every group of 4 weights, exactly 2 are non-zero (50% sparsity). NVIDIA's Ampere architecture (A100) supports 2:4 sparsity natively in its Sparse Tensor Cores: weights are stored in a compressed format (non-zero values + 2-bit indices), and the hardware skips zero multiplications automatically, delivering up to 2× throughput. Accuracy tests across BERT, ResNet, and ViT show that 2:4 sparsity typically recovers to within 0.5% of the dense baseline after sparse-aware training.
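As a minimal illustration of the N:M rule, the sketch below builds a 2:4 mask for a flat list of weights by keeping the two largest magnitudes in every group of four. The function name `nm_mask` and the sample row are made up for this example; real toolchains operate on tensors and also handle the compressed storage format.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| in this group (ties go to the earlier position).
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask

row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.0]
mask = nm_mask(row)
pruned = [w * keep_flag for w, keep_flag in zip(row, mask)]
print(mask)  # [1, 0, 0, 1, 1, 0, 1, 0]
```

Every group of four ends up with exactly two survivors, which is the regularity the sparse hardware path relies on.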

Granularity visualizer — toggle between pruning modes

A 6×8 weight matrix. Switch modes to see which cells are zeroed (gray) and what pattern emerges. Note how fine-grained produces random scatter, N:M produces a regular local pattern, and channel pruning removes entire rows.

Why 2:4 was chosen: A 2-bit index per non-zero (4 possible positions in a group of 4) keeps the metadata overhead small while retaining much of the flexibility of unstructured pruning. At 2:4 each compressed block stores 2 values + 2 × 2 bits = 2 values + 4 bits of overhead, versus the 4 values in the dense form. The value storage is exactly halved, so the total footprint is slightly more than half once the indices are counted. The hardware can then schedule sparse-dense matmuls with a dedicated "Sparse GEMM" kernel that reads the compressed matrix and uses the indices to select the matching elements of the dense operand, achieving up to 2× throughput at an accuracy cost that is typically close to that of fine-grained 50% pruning.
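The storage arithmetic can be checked directly. The sketch below counts bits for a dense tensor and for its 2:4-compressed form, counting the 2-bit indices alongside the kept values; the 16-bit value width is an assumption (FP16 weights), and the function name is illustrative.

```python
def two_four_footprint_bits(num_weights: int, value_bits: int = 16) -> tuple[int, int]:
    """Return (dense_bits, compressed_bits) for a 2:4-sparse weight tensor."""
    if num_weights % 4 != 0:
        raise ValueError("num_weights must be a multiple of 4")
    groups = num_weights // 4
    dense_bits = num_weights * value_bits
    # Each group keeps 2 values plus one 2-bit position index per kept value.
    compressed_bits = groups * (2 * value_bits + 2 * 2)
    return dense_bits, compressed_bits

dense, compressed = two_four_footprint_bits(4096 * 4096, value_bits=16)
print(f"compressed / dense = {compressed / dense:.4f}")  # 0.5625 for 16-bit values
```

The values themselves take half the space; with the index metadata included, the ratio sits a little above one half, and the gap widens as the value width shrinks.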

An engineer prunes 80% of a ResNet-50's weights using fine-grained magnitude pruning, then measures inference latency on an A100 GPU without enabling sparse tensor cores. What do they observe?

Chapter 3: Granularity II — Structured Pruning

Structured pruning removes entire groups of weights that correspond to a unit the hardware naturally processes: a row, a column, a kernel (one 3×3 slice), or a channel (all kernels feeding one output). The result is a smaller dense network — no sparse formats needed, no special hardware, just the same GPU GEMM running on a tensor with fewer dimensions.

For a convolutional layer, the weight tensor has shape (C_out, C_in, k_h, k_w). The granularity hierarchy:

| Granularity | Unit removed | Result | HW-friendly? | Compression ratio |
|---|---|---|---|---|
| Fine-grained | Individual weight | Sparse tensor | No (needs sparse HW) | Highest |
| Pattern (N:M) | M−N weights in each contiguous group of M | Structured sparse | Yes (A100+) | High (fixed 50% for 2:4) |
| Vector | A row of a kernel | Irregular sparse | Partial | Medium |
| Kernel | One k×k filter slice | Irregular sparse | Partial | Medium |
| Channel | Entire output channel | Smaller dense tensor | Yes (always) | Lower |
| Filter | One whole filter (all C_in × k × k weights of an output channel) | Smaller dense tensor | Yes (always) | Lower |

Channel pruning is the most popular structured method. If a layer has Cout = 512 output channels and you prune 50% of them, you get Cout = 256. The next layer's input channels must also shrink from 512 to 256 — you remove the corresponding input-channel slices in that layer too. The network literally becomes smaller: every tensor in the pruned region is narrower. No zero-skipping needed; plain dense convolution runs faster because the tensor is smaller.

Channel pruning as "width reduction": Channel pruning yields the same architecture as a narrower network, but it chooses which channels to keep by importance and starts from the trained weights instead of fixing the widths up front. The same idea appears in the Llama 2 70B MLPerf submission: width pruning reduced intermediate FFN dimensions from 28,672 to 14,336 (2×), and depth pruning removed whole transformer layers (80 → 32). Together: 2.5× speedup, 99% accuracy retained.

Worked numbers — channel pruning a conv layer:

Layer spec: C_in=256, C_out=512, k=3×3, output 14×14.

Params = C_out × C_in × k² = 512 × 256 × 9 = 1,179,648
MACs = C_out × C_in × k² × H × W = 512 × 256 × 9 × 196 = 231,211,008

After 50% channel pruning (C_out → 256):

Params' = 256 × 256 × 9 = 589,824  (2× reduction)
MACs' = 256 × 256 × 9 × 196 = 115,605,504  (2× reduction)

For channel pruning at sparsity s: params and MACs both scale as (1−s), because the output tensor shrinks by (1−s) in the channel dimension and the next layer also shrinks its input side. This is the key advantage: real latency reduction without sparse-execution hardware.
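The worked numbers above can be reproduced with a few lines. This is a sketch under the lesson's cost formulas (bias terms ignored); the following 512 → 512 layer is a hypothetical neighbour added only to show the input-side shrink.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and MACs of a k x k conv layer (bias ignored)."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out
    return params, macs

def prune_out_channels(c_out, sparsity):
    """Number of output channels left after removing a fraction `sparsity`."""
    return c_out - int(round(c_out * sparsity))

base_params, base_macs = conv_cost(256, 512, 3, 14, 14)
kept = prune_out_channels(512, 0.5)
new_params, new_macs = conv_cost(256, kept, 3, 14, 14)
print(base_params, base_macs)   # 1179648 231211008
print(new_params, new_macs)     # 589824 115605504

# Hypothetical next layer (512 -> 512, 3x3, 14x14 output): its input side shrinks too.
next_before, _ = conv_cost(512, 512, 3, 14, 14)
next_after, _ = conv_cost(kept, 512, 3, 14, 14)
print(next_after / next_before)  # 0.5
```

Both the pruned layer and its successor become smaller dense layers, which is why the saving shows up on any hardware.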

Channel pruning's downside: You have much less flexibility. Instead of removing any individual weight (fine-grained), you must remove an entire channel at once, even if only a few weights in that channel are actually small. The minimum removal unit is all C_in × k × k = 256 × 9 = 2,304 weights at once for a 3×3 conv (for the channel example above). This is why channel pruning achieves lower compression ratios than fine-grained pruning at the same accuracy drop.
A ResNet-50 layer is channel-pruned at 40% sparsity (remove 40% of output channels). If it originally had C_out=256, C_in=128, k=3, output 28×28, what are the new parameter count and the speedup ratio?

Chapter 4: Criteria I — Magnitude-Based Pruning

You've decided on a granularity. Now you need to score every weight (or group of weights) with an importance value. The simplest and most widely-used criterion is magnitude: larger absolute value = more important. The intuition is direct — a weight of 0.001 contributes almost nothing to the output, while a weight of 5.0 strongly amplifies or suppresses its input. Remove the small ones.

For element-wise (fine-grained) pruning, the importance of a single weight $w_i$ is:

$$\mathrm{Importance}(w_i) = |w_i|$$

You sort all weights by |w|, and remove the bottom-k%. This is the approach in Han et al. 2015 — what they called "learning connections" (find a threshold, zero out everything below it, retrain with a frozen mask).

For row-wise (structured) pruning, you need a single scalar importance for a whole group of weights. Two natural choices:

L1-norm: $\mathrm{Importance}(S) = \sum_{i \in S} |w_i|$
L2-norm: $\mathrm{Importance}(S) = \left(\sum_{i \in S} w_i^2\right)^{1/2} = \|W^{(S)}\|_2$

Worked example with numbers: Take a 2×4 weight matrix:

# Weight matrix (2 rows, 4 cols)
W = [[3, -2, 0, 1],
     [-5, 0, 1, -0.2]]

# L1-norm row importance
row0_L1 = |3| + |-2| + |0| + |1| = 6
row1_L1 = |-5| + |0| + |1| + |-0.2| = 6.2

# L2-norm row importance
row0_L2 = sqrt(9 + 4 + 0 + 1) = sqrt(14) ≈ 3.74
row1_L2 = sqrt(25 + 0 + 1 + 0.04) = sqrt(26.04) ≈ 5.10

# Both norms agree: row1 is more important than row0.
# At 50% row sparsity, row0 would be pruned.
Magnitude pruning — weight histogram with threshold sweep

A simulated weight distribution (Gaussian, matching real layer statistics). Drag the sparsity slider to sweep the threshold. Bars below threshold are shown in red (pruned). The accuracy proxy curve shows how accuracy typically degrades.

Sparsity: 50%

Magnitude ≠ importance in general. Magnitude is a heuristic proxy. A weight of 0.01 in a high-curvature region of the loss surface may be far more important than a weight of 5.0 in a flat region. The magnitude criterion works well on average for large networks, but it can systematically mislead near saddle points or when parameter scales differ across layers. Chapter 6 (OBD) shows the principled fix.

Per-layer magnitude pruning in PyTorch:

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask: 1=keep, 0=prune. Keeps top (1-sparsity) fraction."""
    flat = weight.abs().flatten()
    k = int(sparsity * flat.numel())  # number to prune
    if k == 0:
        return torch.ones_like(weight)  # nothing to prune; kthvalue needs k >= 1
    threshold, _ = torch.kthvalue(flat, k)  # k-th smallest magnitude
    # Strict '>' prunes the k smallest; ties at the threshold are pruned as well.
    mask = (weight.abs() > threshold).float()
    return mask

# Example: 4x4 weight matrix, 50% sparsity
W = torch.randn(4, 4)
mask = magnitude_prune(W, sparsity=0.5)
W_pruned = W * mask  # zero out pruned weights
print(f"Nonzero weights: {int(mask.sum())}/16")  # → 8
The persistent-mask gotcha: During finetuning, the optimizer will update all weights, including the ones you just pruned to zero. After each gradient step, you must re-apply the mask: weight.data *= mask. If you forget this, gradient descent will gradually re-inflate the pruned weights, undoing your sparsity. This is why production pruning frameworks (torch.nn.utils.prune) register a forward pre-hook that recomputes the masked weight before every forward pass.
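The effect is easy to see on a toy problem. The sketch below runs plain SGD on an assumed quadratic loss, once with the mask re-applied after every step and once without; the weights, targets, and helper names are all hypothetical.

```python
def sgd_finetune(weights, mask, targets, lr=0.1, steps=20, reapply_mask=True):
    """Toy finetuning on L = 0.5 * sum((w - t)^2); returns the final weights."""
    w = [wi * mi for wi, mi in zip(weights, mask)]        # start from the pruned weights
    for _ in range(steps):
        grads = [wi - ti for wi, ti in zip(w, targets)]   # dL/dw for the toy loss
        w = [wi - lr * gi for wi, gi in zip(w, grads)]    # plain SGD step
        if reapply_mask:
            w = [wi * mi for wi, mi in zip(w, mask)]      # force pruned weights back to zero
    return w

def sparsity(w):
    """Fraction of exactly-zero entries."""
    return sum(1 for wi in w if wi == 0.0) / len(w)

weights = [2.0, 0.05, -1.5, 0.01]
mask = [1, 0, 1, 0]                # 50% magnitude pruning keeps the two largest
targets = [2.1, 0.3, -1.4, -0.2]   # hypothetical optimum of the dense problem

with_mask = sgd_finetune(weights, mask, targets, reapply_mask=True)
without_mask = sgd_finetune(weights, mask, targets, reapply_mask=False)
print(sparsity(with_mask))     # 0.5
print(sparsity(without_mask))  # 0.0
```

Without the mask, the very first gradient step moves the pruned weights off zero, and the network is dense again.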
You apply magnitude pruning at 90% sparsity to a weight tensor. You then run 10 epochs of finetuning WITHOUT re-applying the mask after each optimizer step. What happens to the sparsity after finetuning?

Chapter 5: Criteria II — Scaling Factors & APoZ

Magnitude-of-weights is a natural criterion when weights are all at the same scale. But two other signals carry more direct information about channel importance: the BatchNorm scaling factor (γ) and the average percentage of zero activations (APoZ). Each exploits information that is simply not available in the weight magnitude alone.

Scaling-based pruning — BN γ as channel importance

In a convolutional network with BatchNorm, each output channel j has a learned scaling factor $\gamma_j$. The BatchNorm output is:

$$z_j = \gamma_j \cdot \frac{x_j - \mu_j}{\sigma_j} + \beta_j$$

If γj is very small, channel j barely contributes to the next layer's input — regardless of what the convolutional weights themselves look like. A γ near zero says: "this channel's output is being globally suppressed by training." That's a much more reliable importance signal than raw weight magnitude, because it reflects the network's learned decision about which channels matter.

The Network Slimming technique (Liu et al. ICCV 2017) adds an L1 sparsity penalty on all γ values during training:

$$L_{\text{total}} = L_{\text{task}} + \lambda \sum_j |\gamma_j|$$

This regularization nudges unimportant channel scaling factors toward zero. After training, pruning is trivial: sort channels by |γj|, remove the bottom-k%, and retrain.

BN scaling is a free importance signal: If your network already has BatchNorm (almost every modern CNN does), the γ values are trained alongside the weights. No extra computation needed. The pruning criterion is a byproduct of normal training — you just read out γ after the fact.

| Channel | γ value | Action |
|---|---|---|
| Filter 0 | 1.17 | Keep |
| Filter 1 | 0.10 | Prune |
| Filter 2 | 0.29 | Prune |
| Filter 3 | 0.82 | Keep |
| Filter N-1 | 0.56 | Keep |
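The selection step can be sketched in a few lines: rank channels by |γ| and drop the smallest fraction. The γ values are the five example scaling factors listed above; the function name and the 40% pruning fraction are assumptions chosen so the result matches the Keep/Prune column.

```python
def select_channels_by_gamma(gammas, prune_fraction):
    """Return (keep, prune) channel indices, pruning the smallest |gamma| first."""
    n_prune = int(len(gammas) * prune_fraction)
    order = sorted(range(len(gammas)), key=lambda j: abs(gammas[j]))
    prune_idx = sorted(order[:n_prune])
    keep_idx = sorted(order[n_prune:])
    return keep_idx, prune_idx

gammas = [1.17, 0.10, 0.29, 0.82, 0.56]   # example scaling factors from the table
keep_idx, prune_idx = select_channels_by_gamma(gammas, prune_fraction=0.4)
print("keep:", keep_idx)    # keep: [0, 3, 4]
print("prune:", prune_idx)  # prune: [1, 2]
```

In a real network the kept indices would then be used to slice this layer's filters, its BatchNorm parameters, and the next layer's input channels.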

APoZ — percentage of zero activations

A different approach looks not at weights but at activations. ReLU networks produce many zeros in activation maps — any negative pre-activation becomes exactly zero. If a neuron (or channel) produces zero for the vast majority of inputs, it is contributing almost nothing to downstream computation. APoZ — Average Percentage of Zero activations — quantifies this.

$$\mathrm{APoZ}_j = \frac{1}{N \times H \times W} \sum_{i,h,w} \phi\big( A_j(x_i)_{h,w} = 0 \big)$$

where ϕ is an indicator that equals 1 when the activation is exactly zero, N is the number of evaluation samples, and H×W is the spatial dimension of the feature map.

APoZ is computed on a calibration dataset (typically a few thousand training examples) by running a forward pass and recording which channels are zero. A channel with APoZ = 95% is dead for 95% of inputs — an obvious pruning candidate.

APoZ is data-dependent. Unlike magnitude or γ, APoZ changes with the input distribution. A channel critical for one type of input (say, detecting fur texture) may show high APoZ on a dataset of vehicles. This is a feature (you can adapt pruning to a specific deployment domain) and a risk (over-pruning domain-specific features). Always compute APoZ on a representative calibration set.

import torch

def compute_apoz(model, dataloader, layer_name):
    """Compute Average Percentage of Zeros for each channel in a layer.

    layer_name should point at a post-ReLU module (e.g. an nn.ReLU);
    a conv layer's raw output is almost never exactly zero.
    """
    zero_counts = None
    total = 0
    def hook(module, inp, output):
        nonlocal zero_counts, total
        # output: (N, C, H, W)
        zeros = (output == 0).float().sum(dim=[0, 2, 3])  # sum over N,H,W → (C,)
        count = output.shape[0] * output.shape[2] * output.shape[3]
        zero_counts = zeros if zero_counts is None else zero_counts + zeros
        total += count
    h = model.get_submodule(layer_name).register_forward_hook(hook)
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            for x, _ in dataloader:  # assumes batches of (inputs, labels)
                model(x)
    finally:
        h.remove()                   # always detach the hook
        model.train(was_training)    # restore the previous train/eval mode
    return zero_counts / total  # APoZ per channel, in [0, 1]
You are channel-pruning a network using BN γ values. Channel A has |γ_A| = 0.05; Channel B has |γ_B| = 2.3. Before pruning, what should you do to make sure your threshold generalizes across all layers?

Chapter 6: Criteria III — Second-Order Pruning (OBD)

Magnitude tells you how big a weight is. But what you really want to know is: how much does the loss change if I remove this weight? For a well-trained network, removing weight wi means setting δwi = wi (deleting it is equivalent to perturbing it by its current value). The change in loss is δL.

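Step 1: Taylor expansion of δL. Define δL = L(x; W − δW) − L(x; W) and expand around the current weights W:

$$\delta L = -\sum_i g_i\,\delta w_i + \frac{1}{2}\sum_i h_{ii}\,\delta w_i^2 + \frac{1}{2}\sum_{i \ne j} h_{ij}\,\delta w_i\,\delta w_j + O(\|\delta W\|^3)$$

where $g_i = \partial L / \partial w_i$ (gradient) and $h_{ij} = \partial^2 L / (\partial w_i\,\partial w_j)$ (Hessian entry). The first-order term carries a minus sign because the perturbation is subtracted; the second-order terms are unaffected by the sign.

Step 2: Three OBD assumptions.

  1. Quadratic objective: Drop the cubic and higher terms → $\delta L = -\sum_i g_i\,\delta w_i + \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^2$ + cross terms.
  2. Training has converged: At a local minimum, gradients are approximately zero ($g_i \approx 0$ for all i). Drop the first-order term → $\delta L = \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^2$ + cross terms.
  3. Independent errors: Assume pruning one weight doesn't affect the contribution of others (cross terms ≈ 0). Drop off-diagonal Hessian → $\delta L = \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^2$.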

Step 3: The OBD saliency. Now set δwi = wi (pruning weight i means removing it, i.e., the perturbation equals the weight value) and δwj = 0 for all j ≠ i:

$$\delta L_i \approx \tfrac{1}{2}\, h_{ii}\, w_i^2$$

This is the OBD saliency — the estimated loss increase from removing weight i. Weights with small saliency are safe to prune. The key difference from magnitude: saliency weighs the weight by its curvature hii. A small weight in a high-curvature region (hii large) can be crucial; a large weight in a flat region (hii small) can be safely removed.

OBD saliency vs magnitude — 2D loss surface toy example

Two-weight model. Contours show the loss surface. Current weights: w1=0.1, w2=3.0. Note that w1 sits in a steep-walled valley (high curvature), while w2 lies along a nearly flat direction. Magnitude prunes w1; OBD correctly prunes w2. Drag w2 to explore.

w2 curvature (h22): 0.01
w1 curvature (h11): 200

Worked numerical example: Two weights, w1 = 0.1 and w2 = 3.0.

$$S_1 = \tfrac{1}{2}\, h_{11}\, w_1^2 = \tfrac{1}{2} \times 200 \times 0.01 = 1.0$$
$$S_2 = \tfrac{1}{2}\, h_{22}\, w_2^2 = \tfrac{1}{2} \times 0.01 \times 9.0 = 0.045$$

OBD decision: Prune w2 (saliency 0.045 < 1.0, so it costs only about 0.045 loss units). Magnitude decision: Prune w1 (|w1| = 0.1 < |w2| = 3.0). OBD avoids the expensive mistake: w1 sits in a steep-walled valley of the loss surface (h11=200 means moving w1 by 0.1 costs 1.0 loss units), while w2=3.0 lies along a nearly flat direction (h22=0.01 means removing it barely changes the loss).
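The same comparison in code, as a minimal sketch: it computes the OBD saliency for both weights, picks the pruning candidate under each criterion, and checks the estimate against an assumed toy loss that is exactly quadratic with a diagonal Hessian (the case in which the OBD approximation is exact). All function names are illustrative.

```python
def obd_saliency(w, h_diag):
    """OBD saliency 0.5 * h_ii * w_i^2 for each weight."""
    return [0.5 * h * wi * wi for wi, h in zip(w, h_diag)]

def toy_loss(w, w_star, h_diag):
    """Assumed quadratic loss with its minimum at w_star and diagonal Hessian h_diag."""
    return sum(0.5 * h * (wi - ws) ** 2 for wi, ws, h in zip(w, w_star, h_diag))

w = [0.1, 3.0]
h_diag = [200.0, 0.01]

saliency = obd_saliency(w, h_diag)
obd_choice = min(range(len(w)), key=lambda i: saliency[i])
magnitude_choice = min(range(len(w)), key=lambda i: abs(w[i]))
print(saliency)                      # approximately [1.0, 0.045]
print(obd_choice, magnitude_choice)  # 1 0

# Actual loss increase on the toy quadratic when each weight is zeroed in turn.
true_increase = []
for i in range(len(w)):
    pruned = [0.0 if j == i else wj for j, wj in enumerate(w)]
    true_increase.append(toy_loss(pruned, w, h_diag))
print(true_increase)
```

On a real network the Hessian diagonal has to be estimated and the cross terms are not zero, so the saliency is an approximation and not the exact loss change it is here.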

The catch with OBD: Computing hii for a network with millions of parameters requires a second pass through the data and storing the full diagonal Hessian. The off-diagonal Hessian (cross terms we dropped) can be significant — Optimal Brain Surgeon (OBS, 1993) keeps all cross terms and is more accurate but requires O(n²) memory for n parameters. SparseGPT (2023) uses an efficient Hessian approximation to prune LLMs in one shot without retraining, using the OBS framework.

Taylor expansion pruning (1st order): A simpler compromise. Keep the gradient term and approximate saliency as |gi × wi|. This is equivalent to the first-order Taylor series and is much cheaper than computing second derivatives. It works well for small pruning steps (when the network isn't far from the original optimum). Used in Molchanov et al. CVPR 2019.

In the OBD derivation, why is the first-order gradient term dropped (g_i = 0)?

Chapter 7: The Prune–Finetune Loop

You've picked a granularity and a criterion. Now comes the most practical question: do you prune the entire network at once (one-shot), or do you alternate between pruning and finetuning in small increments (iterative)?

The intuition for why iterative wins: imagine you're hiking and you want to remove 90% of your gear. If you dump 90% immediately, you've made a huge random change to how you carry load — you might drop critical items and the remaining 10% might not form a coherent system. But if you remove 10% at a time, evaluating and re-packing after each step, you can carefully select what to discard and the remaining gear stays coherent.

In network terms: one-shot pruning at 90% sparsity simultaneously removes many interdependent weights. The loss surface shifts sharply; the surviving weights were optimized for a different parameter configuration and finetuning may not recover accuracy. Iterative pruning at 10% per step makes a small change, finetunes to re-adapt the surviving weights, then repeats. Each finetuning step keeps the network near a good minimum before the next prune.

Misconception: “Finetuning after one-shot pruning gives the same result as iterative pruning, given the same total finetuning compute.” False. The quality of the final sparse model depends on the path taken through the sparsity-accuracy landscape. One-shot takes a large step off the manifold of good solutions; iterative stays near it. The accuracy recovery from finetuning is limited by how far the pruning moved you from a good minimum.

The standard loop:

# Iterative magnitude pruning with PyTorch
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# YourModel, train_one_epoch and dataloader are placeholders you supply.
model = YourModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
target_sparsity = 0.90
steps = 9            # cumulative sparsity: 10% → 20% → ... → 90%
step_size = target_sparsity / steps  # 0.10 of the original weights per step
finetune_epochs = 2  # placeholder value; tune for your task

for step in range(steps):
    prev_sparsity = step * step_size
    current_sparsity = (step + 1) * step_size
    # Repeated prune calls act only on weights that are still unpruned, so
    # express this step as a fraction of the remaining weights.
    amount = step_size / (1.0 - prev_sparsity)
    # --- Prune each linear/conv layer ---
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(module, name='weight', amount=amount)
    # --- Finetune for a few epochs ---
    for epoch in range(finetune_epochs):
        train_one_epoch(model, optimizer, dataloader)
    print(f"Step {step+1}: sparsity={current_sparsity:.0%}")

# Remove the pruning hooks and make sparsity permanent
for name, module in model.named_modules():
    if hasattr(module, 'weight_mask'):
        prune.remove(module, 'weight')
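One subtlety of iterative schedules is worth checking by hand. When a pruning call is repeated on the same parameter, PyTorch's pruning utilities apply each new `amount` to the weights that are still unpruned, so pruning "10% per step" compounds instead of adding up. The sketch below works through that arithmetic in plain Python; the helper names are illustrative.

```python
def cumulative_sparsity(amount_per_step, steps):
    """Sparsity after pruning `amount_per_step` of the REMAINING weights each step."""
    remaining = 1.0
    for _ in range(steps):
        remaining *= (1.0 - amount_per_step)
    return 1.0 - remaining

def per_step_amounts(target_sparsity, steps):
    """Fractions of the remaining weights to prune so sparsity rises linearly to the target."""
    amounts = []
    for step in range(steps):
        prev = target_sparsity * step / steps
        curr = target_sparsity * (step + 1) / steps
        amounts.append((curr - prev) / (1.0 - prev))
    return amounts

naive = cumulative_sparsity(0.10, 9)
print(f"{naive:.3f}")  # 0.613, not 0.900

amounts = per_step_amounts(0.90, 9)
remaining = 1.0
for a in amounts:
    remaining *= (1.0 - a)
print(f"{1.0 - remaining:.3f}")  # 0.900
```

The per-step fraction grows from 0.1 at the first step to 0.5 at the last, because each later step removes the same number of weights from a shrinking pool.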

Per-layer sensitivity

Not all layers are created equal. The first convolutional layer in a CNN processes raw pixels — its small number of learned edge detectors is hard to replace. If you prune 80% of them, the whole network sees degraded features. The last fully-connected layer, by contrast, often has massive redundancy and can tolerate 80–90% sparsity with negligible accuracy loss.

Sensitivity analysis: for each layer independently, measure the accuracy drop as a function of sparsity. Plot it as a curve (accuracy vs sparsity per layer). This tells you the maximum safe sparsity for each layer and motivates non-uniform sparsity: sensitive layers get low sparsity (10–30%); robust layers get high sparsity (70–90%). The overall compression is a weighted average.
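To make the "weighted average" concrete, the sketch below combines per-layer sparsities into one network-level figure, weighting each layer by its parameter count. The five-layer network and its sparsity choices are hypothetical.

```python
def overall_sparsity(layers):
    """Parameter-weighted average sparsity; `layers` is a list of (num_params, sparsity)."""
    total = sum(n for n, _ in layers)
    removed = sum(n * s for n, s in layers)
    return removed / total

# Hypothetical five-layer CNN: (parameter count, chosen sparsity)
layers = [
    (1_728, 0.10),       # first conv: sensitive, prune lightly
    (73_728, 0.30),
    (294_912, 0.50),
    (1_048_576, 0.80),   # large FC layer: robust, prune heavily
    (40_960, 0.90),
]
result = overall_sparsity(layers)
print(f"overall sparsity: {result:.1%}")
```

Because the large, robust layers dominate the parameter count, the network-level sparsity lands close to their setting even though the first layer is barely touched.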

Accuracy vs sparsity — one-shot vs iterative

Drag the sparsity slider. The orange curve shows one-shot pruning; the teal curve shows iterative pruning + finetuning. Notice how iterative recovers far more accuracy at the same final sparsity. Click “Run iterative step” to animate each prune–finetune cycle.

Target sparsity: 60%
You want to prune a network to 90% sparsity. Which approach typically gives higher final accuracy?

Chapter 8: Showcase — Sparsity Explorer

This showcase puts together everything from this lesson: granularity, criterion, and the per-layer sensitivity landscape. You're the ML engineer. You have a small CNN with five layers. Each layer has a different sensitivity to pruning — the first conv is fragile, the FC layers are robust. Your tools are: choose the granularity (fine or channel), choose the criterion (magnitude or OBD proxy), choose per-layer sparsity, and see how accuracy and MAC savings evolve.

Per-layer sensitivity bars

Each bar shows accuracy drop vs sparsity for that layer (measured independently). Taller bars = more sensitive = set lower target sparsity. The tool recommends a safe uniform vs adaptive budget allocation.

Channel pruning — tensor shrink visualizer

Watch the conv weight tensor shrink as you increase channel sparsity. The gray cells disappear, and the next layer's input dimension (left side) shrinks too. Slide to see how params and MACs scale linearly with retained channels.

Channel sparsity: 30%

Worked MAC savings at different sparsity targets:

| Sparsity | Remaining params | Remaining MACs | Notes |
|---|---|---|---|
| 0% | 1,179,648 | 231.2M | Baseline (C_in=256, C_out=512, 3×3, 14×14) |
| 30% | 825,754 | 161.8M | Safe for most layers; 1.43× speedup |
| 50% | 589,824 | 115.6M | Aggressive but recoverable with finetuning; 2× speedup |
| 70% | 353,894 | 69.4M | High sparsity; only for robust later layers; 3.3× speedup |
| 90% | 117,965 | 23.1M | Extreme; only final FC layers can typically tolerate this |

Channel pruning gives proportional scaling: 50% channel sparsity ⇒ 50% param reduction ⇒ 50% MAC reduction ⇒ roughly 50% lower latency, i.e. about a 2× speedup (varies with memory bandwidth and other overheads).
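The table rows follow from one scaling rule, sketched below for the baseline layer. The "ideal speedup" here is simply the MAC ratio 1 / (1 − s); it is an upper-bound estimate that ignores memory-bandwidth and other overheads.

```python
BASE_PARAMS = 512 * 256 * 3 * 3      # 1,179,648 parameters in the baseline layer
BASE_MACS = BASE_PARAMS * 14 * 14    # 231,211,008 MACs at 14x14 output

def channel_pruning_row(sparsity):
    """Remaining params, remaining MACs (millions) and ideal speedup at a channel sparsity."""
    keep = 1.0 - sparsity
    return round(BASE_PARAMS * keep), BASE_MACS * keep / 1e6, 1.0 / keep

for s in (0.0, 0.3, 0.5, 0.7, 0.9):
    params, macs_m, speedup = channel_pruning_row(s)
    print(f"{s:.0%}  {params:>9,}  {macs_m:6.1f}M  {speedup:.2f}x")
```

Fractional sparsities give non-integer channel counts in this idealized formula; an actual pruner rounds to whole channels, so real parameter counts differ slightly.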

The full picture: The optimal pruning strategy for a real deployment combines: (1) sensitivity analysis to assign per-layer sparsity budgets; (2) a good criterion (OBD/Taylor for precision, magnitude for speed); (3) iterative pruning with finetuning after each step; (4) the right granularity for your target hardware (channel for mobile/MCU, N:M for A100 GPU). These choices interact: the Pruning II lesson (Lecture 4) covers AutoML for Model Compression (AMC), which uses reinforcement learning to search for per-layer sparsity budgets automatically.
Layer 1 (first conv) has 80% accuracy drop at 50% sparsity; Layer 4 (last FC) has 2% drop at 80% sparsity. You have a total sparsity budget of 60% across the network. Which non-uniform allocation is better?

Chapter 9: Connections & Cheat Sheet

You now have the core vocabulary of pruning. Here's the full picture in one table, followed by what comes next.

Granularity × Criteria Cheat Sheet

| Granularity | Best criteria | Hardware target | When to use |
|---|---|---|---|
| Fine-grained | OBD saliency, magnitude | Custom sparse accelerators (e.g. EIE) | Maximum compression ratio; edge accelerators with sparse support |
| N:M (2:4) | Magnitude within groups | NVIDIA Ampere (A100, RTX 30+) | GPU deployment needing real speedup at 50% sparsity |
| Vector/kernel | L1/L2 group norm | Partially regular HW | Rare; middle ground between fine and channel |
| Channel/filter | BN γ, L1-norm of filter, Taylor criterion | Any GPU/CPU/MCU | Latency-critical deployment; no sparse HW available |

| Criterion | Complexity | When it excels | Limitation |
|---|---|---|---|
| Magnitude (absolute value of w) | O(n) | Large networks; fast iterations; works well on average | Ignores curvature; can misprioritize in steep regions |
| L1/L2 group norm | O(n) | Structured/channel pruning; natural for filter ranking | Same blind spot as magnitude |
| BN γ | O(C), free | Networks with BatchNorm; training-time regularization | Only for channel pruning; requires retraining with L1-γ reg |
| APoZ | O(n × data) | Detecting dead neurons; domain-specific pruning | Data-dependent; must run calibration set forward pass |
| OBD saliency | O(n) + Hessian diag | When you want principled pruning near a minimum | Diagonal Hessian approximation; off-diagonal errors |

Bridge to Pruning II (Lecture 4)

This lesson covered the foundations. Pruning II dives into three open frontiers: the lottery ticket hypothesis, AMC, and system support for sparsity.

Feynman test: Can you explain to a non-ML engineer why removing 90% of a network's weights doesn't destroy it? The right analogy: a symphony orchestra has 80 musicians but most pieces could be performed accurately by 30. The others are redundant — they reinforce notes that would be heard anyway. Training a neural network creates similar redundancy: many weights encode the same information in slightly different ways. Pruning finds the 10% that actually carry the signal.
You want to deploy a pruned ResNet-50 on a mobile phone CPU (ARM NEON, no sparse tensor core support). Which pruning granularity and approach gives you the best wall-clock speedup?