Pruning I: Making Networks Sparse

Chapter 0: The 90% You Don't Need

Take AlexNet. 61 million parameters. The weight tensor of its second fully-connected layer alone holds 4096 × 4096 = 16.8 million values. If you plot a histogram of them, you'll see something striking: the vast majority of those values hover near zero. They are not exactly zero, but close enough that removing them barely changes what the network computes.

That observation, made empirically by Song Han and colleagues in 2015, led to a result that still surprises people: you can prune 90% of AlexNet's connections and, after a brief finetuning pass, restore accuracy to within 0.5% of the original. You can do the same to VGG-16 at 12× compression. The 10% of weights you keep are doing almost all the work. The other 90% are passengers.

This is the core insight behind neural network pruning: trained networks are massively over-parameterized, and most of the excess can be removed without hurting the task they were trained for. The brain does the same thing: a two-to-four-year-old child has ~15,000 synapses per neuron, and by adolescence that drops to ~7,000 as unused connections are eliminated. The brain prunes based on activity; we prune based on importance criteria.

What pruning actually does: It produces a sparse network — a network where many weight values are exactly zero and can be skipped. Whether that sparsity translates to speedup depends on granularity: zeroing individual weights (fine-grained/unstructured) requires sparse-matrix hardware to accelerate; zeroing entire channels (structured) produces a smaller dense network that runs faster on any hardware. The central tension of this lesson is that finer-grained pruning compresses more, but coarser pruning accelerates more.

Here is a weight-magnitude heat map for a toy 8×8 fully-connected layer, representative of what you see in practice. Notice how most weights cluster near zero while a few have large magnitude. A naive threshold wipes out the near-zero mass while preserving the large ones. The canvas below lets you visualize this matrix and sweep a sparsity threshold.

Weight matrix heat map — magnitude distribution

A simulated 8×8 weight matrix. Darker cells = larger |w|. The threshold (drag the slider) zeroes out weights below it. Watch how many cells are zeroed at 50% vs 90% sparsity, while the few large weights survive.

Sparsity threshold: 50%

The tour this lesson takes: (1) formalize what pruning is as an optimization problem; (2) survey the spectrum of granularities — from individual weights up to whole channels — and what each buys you on real hardware; (3) study five criteria for scoring which weights to remove, including the elegant second-order method from Optimal Brain Damage (LeCun, 1989); (4) understand the prune–finetune–repeat loop that recovers accuracy at extreme sparsity; and (5) see how different layers tolerate wildly different sparsity levels.

Ch 1 (The problem): argmin loss s.t. ‖W‖₀ ≤ N; the NP-hard combinatorial formulation and why we relax it
Ch 2–3 (Granularity): Fine-grained → N:M → vector → kernel → channel/filter; compression ratio vs HW speedup tradeoff
Ch 4–6 (Criteria): Magnitude (L1/L2), scaling factor (BN γ), APoZ, and the OBD Hessian saliency derivation
Ch 7–8 (Training): One-shot vs iterative; the persistent-mask gotcha; per-layer sensitivity; sparsity explorer
Ch 9 (Connections): Granularity × criteria cheat sheet; bridge to Pruning II (lottery ticket, AMC, system support)

AlexNet is pruned from 61M to 6.7M parameters (9× compression). The accuracy drop after retraining is negligible. What does this tell us about the trained network?

The network was badly trained and had many useless layers.
The pruned 54M parameters were actually harmful and removing them helped.
The trained network is massively over-parameterized: the task only requires ~10% of the connections.
Accuracy is not sensitive to the number of parameters at all.

Chapter 1: The Pruning Problem (Formal)

Let's be precise about what we're actually trying to do. You have a trained network with weights W and an objective (loss) function L(x; W). You want to find a pruned set of weights W_P that minimizes the same objective while having few non-zero entries:

argmin_{W_P} L(x; W_P) subject to ‖W_P‖₀ ≤ N

Here ‖W_P‖₀ counts the number of non-zero elements in W_P. It is called the L0 "norm", although it is not actually a norm. N is your budget of non-zero weights: how many connections the pruned network is allowed to keep. A target sparsity s on a network with |W| weights corresponds to N = (1 − s)·|W|.

This formulation is clean but the solution is not. Exactly minimizing this is an NP-hard combinatorial search — you'd have to evaluate every possible subset of weights of size N and pick the best one. For a 61M-parameter network, that's more subsets than atoms in the observable universe.
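To get a feel for the size of that search space, the number of ways to choose which N weights survive can be estimated with log-gamma arithmetic. This is a minimal sketch, assuming the AlexNet figures quoted in this lesson (61M weights, about 6.7M kept); the helper name log10_binomial is ours, not part of any library.

```python
import math

def log10_binomial(n: int, k: int) -> float:
    """Return log10 of C(n, k), computed with log-gamma to avoid huge integers."""
    ln_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_c / math.log(10)

total_weights = 61_000_000   # AlexNet parameter count quoted above
kept_weights = 6_700_000     # illustrative budget N (roughly 9x compression)

digits = log10_binomial(total_weights, kept_weights)
print(f"C(61M, 6.7M) is about 10^{digits:,.0f}")
# For comparison, the observable universe holds roughly 10^80 atoms.
```

The exponent itself runs into the millions, so exhaustive search is out of the question long before NP-hardness is even invoked.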

Why we don't solve it exactly: Finding the globally optimal pruned network is NP-hard. Instead, every practical pruning algorithm is a heuristic approximation: it uses a proxy criterion (magnitude, Hessian, scaling factor, etc.) to score each weight's importance and then removes the least-important ones. The trick is choosing a proxy that correlates well with the true objective change δL.

The practical approach is a three-step decomposition:

Score every weight (or group of weights) with an importance metric — a proxy for how much the loss would increase if we removed it.
Rank and threshold — remove the k% least important weights.
Finetune — run a few epochs of gradient descent with the surviving weights to recover accuracy.

Two design choices dominate the rest of this lesson: what you prune (granularity — individual weights? entire channels?) and why you prune it (criterion — magnitude? curvature? activation statistics?). These are independent decisions: you can use magnitude as a criterion for either fine-grained or channel pruning.

The biological connection. The brain prunes synapses that fire rarely and strengthens synapses that fire together. Neural pruning mirrors this: a weight that rarely contributes to the output (small magnitude, or small curvature × magnitude product) is a safe candidate for removal. Song Han's lab found that 90% of connections in AlexNet are below a threshold that, if removed, leaves the task unchanged — the remaining 10% carry nearly all the signal.

Worked numbers — the compression table from Han et al. 2015:

Network	Before	After	Compression	MAC reduction
AlexNet	61M	6.7M	9×	3×
VGG-16	138M	10.3M	12×	5×
GoogLeNet	7M	2.0M	3.5×	5×
ResNet-50	26M	7.47M	3.4×	6.3×
SqueezeNet	1M	0.38M	3.2×	3.5×

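Note: compression ratio and MAC reduction differ because parameters and MACs are distributed differently across layers. In AlexNet and VGG-16 most parameters sit in the fully-connected layers, which prune heavily but account for few MACs, so compression exceeds MAC reduction. For GoogLeNet, ResNet-50, and SqueezeNet the table shows the opposite. Either way, MAC reduction is a theoretical count: on dense GPU kernels zero weights still occupy compute cycles, so real speedup requires sparse hardware support or structured pruning.

Key asymmetry: For fine-grained pruning on standard hardware, neither the parameter compression ratio nor the MAC reduction ratio is a latency number. If you need actual latency reduction, you need either: (a) structured pruning (channels, filters) that produces a smaller dense network, or (b) hardware with sparse execution (NVIDIA Ampere 2:4 sparsity, or custom ASICs like EIE).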
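The gap between the two table columns can be reproduced with simple per-layer bookkeeping. The sketch below assumes that pruning a fraction of a layer's weights removes the same fraction of that layer's MACs; the two-layer network and its numbers are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (name, params, MACs per inference, fraction of weights kept after pruning)
Layer = Tuple[str, int, int, float]

def compression_and_mac_reduction(layers: List[Layer]) -> Tuple[float, float]:
    """Return (parameter compression ratio, theoretical MAC reduction ratio)."""
    params_before = sum(p for _, p, _, _ in layers)
    macs_before = sum(m for _, _, m, _ in layers)
    params_after = sum(p * keep for _, p, _, keep in layers)
    macs_after = sum(m * keep for _, _, m, keep in layers)
    return params_before / params_after, macs_before / macs_after

# Hypothetical network: a conv layer (few params, many MACs, pruned lightly)
# and an FC layer (many params, few MACs, pruned heavily).
toy = [
    ("conv", 1_000_000, 200_000_000, 0.40),
    ("fc", 16_000_000, 16_000_000, 0.08),
]
comp, mac = compression_and_mac_reduction(toy)
print(f"compression {comp:.1f}x, MAC reduction {mac:.1f}x")  # 10.1x, 2.7x
```

Both numbers are counts, not measurements: they say nothing about wall-clock time on a dense GPU kernel.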

The pruning objective is “argmin L(x; W_P) subject to ‖W_P‖₀ ≤ N”. Why can't we solve this exactly for a 61M-parameter network?

The loss function L is non-differentiable with respect to the binary pruning mask.
It requires evaluating all (61M choose N) subsets, a combinatorial search that is NP-hard.
The L0 norm is not convex, so gradient descent diverges.
The pruned weights change the gradient of the surviving weights, invalidating the objective.

Chapter 2: Granularity I — Fine-Grained & N:M Sparsity

Before you decide which weights to prune, you must decide at what scale to prune. This is granularity: how big is the unit you remove? The spectrum runs from individual scalar weights at one end, to entire filters (all weights in an output channel) at the other.

Fine-grained pruning (also called unstructured pruning) removes individual weights. For a 2D weight matrix, you get a pattern that looks random — a few surviving non-zeros scattered irregularly across the tensor. This is the most flexible option: you can remove exactly the weights with the smallest importance score, regardless of where they live. The result is a sparse matrix.

The problem: regular hardware runs on dense matrix operations (GEMM). Stored in dense form, a pruned weight tensor occupies the same memory as the original, and the zeros are still fetched and multiplied unless the kernel explicitly skips them. On a standard GPU, zeroing out 90% of a matrix doesn't make inference 10× faster; it might be the same speed or slower, because the hardware still executes dense vector operations and just multiplies by 0. To actually accelerate fine-grained sparse inference, you need:

Sparse tensor cores (NVIDIA A100 supports 2:4 sparsity with up to ~2× throughput)
Custom ASICs (EIE by Han et al., which stores weights in a compressed sparse column format with relative indices)
A compressed storage format, with its encoding overhead: you must store the non-zero values plus their indices

Misconception alert: “Pruning 90% of the weights makes inference 10× faster.” This is one of the most common mistakes in efficient ML. Unstructured pruning reduces parameters but NOT latency without sparse-kernel hardware support: on a standard GPU, a 90%-sparse network runs at essentially the same latency as the dense original. The compression benefit is in model storage and, when the weights are kept in a compressed format, in memory bandwidth when loading them, not in arithmetic throughput.

N:M sparsity is a structured-irregular middle ground. For every contiguous M elements in the weight tensor, exactly N are kept and M−N are zeroed. The classic case is 2:4 sparsity: in every group of 4 weights, exactly 2 are non-zero (50% sparsity). NVIDIA's Ampere architecture (A100) supports 2:4 sparsity natively in its Sparse Tensor Cores: weights are stored in a compressed format (non-zero values + 2-bit indices), and the hardware skips zero multiplications automatically, delivering up to 2× throughput. Accuracy tests across BERT, ResNet, and ViT show that 2:4 sparsity typically recovers to within 0.5% of the dense baseline after sparse-aware training.
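The N:M rule described above can be sketched as a mask computation. This is an illustrative helper (nm_mask is our name, not a PyTorch or NVIDIA API) that only produces the sparsity pattern by magnitude; it does not build the compressed format or invoke Sparse Tensor Core kernels.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-|w| entries in every group of m
    consecutive elements along the last dimension."""
    if weight.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = weight.abs().reshape(-1, m)          # one row per group of m
    keep_idx = groups.topk(n, dim=1).indices      # positions of the n largest
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep_idx, 1.0)               # mark kept positions with 1
    return mask.reshape(weight.shape)

W = torch.tensor([[0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]])
mask = nm_mask(W)        # 2:4 by default
print(mask)              # tensor([[1., 0., 0., 1., 0., 1., 1., 0.]])
print(W * mask)          # the two largest magnitudes survive in each group
```

Every group of four ends up with exactly two non-zeros, which is what makes the pattern regular enough for hardware to exploit.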

Granularity visualizer — toggle between pruning modes

A 6×8 weight matrix. Switch modes to see which cells are zeroed (gray) and what pattern emerges. Note how fine-grained produces random scatter, N:M produces a regular local pattern, and channel pruning removes entire rows.

Why 2:4 was chosen: A 2-bit index per non-zero (4 possible positions in a group of 4) is a small overhead that still leaves considerable freedom in which weights survive. At 2:4 each compressed block stores 2 values + 2 × 2 bits = 2 values + 4 bits of index metadata, versus the 4 values in the dense form. The value storage is exactly halved, and the total footprint is slightly more than half once the indices are counted. The hardware can then schedule sparse-dense matmuls with a dedicated "Sparse GEMM" kernel that reads the compressed matrix and uses the indices to select the matching dense operands, achieving up to 2× throughput. The price relative to fine-grained 50% pruning is the constraint that every group of 4 keeps exactly 2 weights, which in practice costs little accuracy.
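The storage arithmetic can be made explicit by counting the index bits alongside the values. This is a back-of-the-envelope sketch, assuming each kept value carries ceil(log2(M)) index bits as described above; the function name is ours and the bit widths are examples.

```python
def compressed_fraction(value_bits: int, n: int = 2, m: int = 4) -> float:
    """Size of an n:m compressed group relative to the dense group,
    assuming ceil(log2(m)) index bits per kept value."""
    index_bits = (m - 1).bit_length()          # 2 bits when m = 4
    dense = m * value_bits
    compressed = n * (value_bits + index_bits)
    return compressed / dense

for name, bits in [("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {compressed_fraction(bits):.4f} of dense size")
# FP16: 0.5625 of dense size
# INT8: 0.6250 of dense size
```

The values themselves shrink to half; the index metadata adds a few percent on top, and proportionally more at lower precision.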

An engineer prunes 80% of a ResNet-50's weights using fine-grained magnitude pruning, then measures inference latency on an A100 GPU without enabling sparse tensor cores. What do they observe?

~5× speedup because 80% fewer multiply-accumulates are needed.
~2× speedup because parameter loading from DRAM is halved.
Essentially no speedup: the GPU still executes dense GEMM operations over the full tensor shape.
A slowdown, because zeroing operations add overhead to the CUDA kernel.

Chapter 3: Granularity II — Structured Pruning

Structured pruning removes entire groups of weights that correspond to a unit the hardware naturally processes: a row, a column, a kernel (one 3×3 slice), or a channel (all kernels feeding one output). The result is a smaller dense network — no sparse formats needed, no special hardware, just the same GPU GEMM running on a tensor with fewer dimensions.

For a convolutional layer, the weight tensor has shape (C_out, C_in, k_h, k_w). The granularity hierarchy:

Granularity	Unit removed	Result	HW-friendly?	Compression ratio
Fine-grained	Individual w	Sparse tensor	No (needs sparse HW)	Highest
Pattern (N:M)	Contiguous groups	Structured sparse	Yes (A100+)	High (fixed 50%)
Vector	A row of kernel	Irregular sparse	Partial	Medium
Kernel	One k×k filter slice	Irregular sparse	Partial	Medium
Channel	Entire output channel	Smaller dense tensor	Yes (always)	Lower
Filter	All weights producing one output channel (C_in × k × k)	Smaller dense tensor	Yes (always)	Lower

Channel pruning is the most popular structured method. If a layer has C_out = 512 output channels and you prune 50% of them, you get C_out = 256. The next layer's input channels must also shrink from 512 to 256 — you remove the corresponding input-channel slices in that layer too. The network literally becomes smaller: every tensor in the pruned region is narrower. No zero-skipping needed; plain dense convolution runs faster because the tensor is smaller.

Channel pruning as "width reduction": Channel pruning yields the same architecture as a narrower network trained directly; the difference is that it uses the trained weights to decide which channels to keep rather than fixing widths in advance. The Llama 2 70B MLPerf submission used exactly this: width pruning reduced intermediate FFN dimensions from 28,672 to 14,336 (2×), and depth pruning removed whole transformer layers (80 → 32). Together: 2.5× speedup, 99% accuracy retained.

Worked numbers — channel pruning a conv layer:

Layer spec: C_in=256, C_out=512, k=3×3, output 14×14.

Params = C_out × C_in × k² = 512 × 256 × 9 = 1,179,648

MACs = C_out × C_in × k² × H × W = 512 × 256 × 9 × 196 = 231,211,008

After 50% channel pruning (C_out → 256):

Params' = 256 × 256 × 9 = 589,824 (2× reduction)

MACs' = 256 × 256 × 9 × 196 = 115,605,504 (2× reduction)

For channel pruning at sparsity s: params and MACs both scale as (1−s), because the output tensor shrinks by (1−s) in the channel dimension and the next layer also shrinks its input side. This is the key advantage: real latency reduction without sparse-execution hardware.
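The worked numbers above can be reproduced with a few lines of arithmetic. This is a small sketch for a bias-free convolution; the helper names conv_cost and prune_output_channels are ours.

```python
def conv_cost(c_in: int, c_out: int, k: int, h_out: int, w_out: int):
    """Return (params, MACs) of a bias-free k×k conv layer."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out
    return params, macs

def prune_output_channels(c_out: int, sparsity: float) -> int:
    """Number of output channels kept after pruning (rounded to an integer)."""
    return round(c_out * (1.0 - sparsity))

base = conv_cost(256, 512, 3, 14, 14)
kept = prune_output_channels(512, 0.5)
pruned = conv_cost(256, kept, 3, 14, 14)
print(base)    # (1179648, 231211008)
print(pruned)  # (589824, 115605504)
```

Because the channel count is the only factor that changes, params and MACs fall by the same ratio.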

Channel pruning's downside: You have much less flexibility. Instead of removing any individual weight (fine-grained), you must remove an entire channel at once — even if only a few weights in that channel are actually small. The minimum removal unit is all C_in×k×k = 9×256 = 2,304 weights at once for a 3×3 conv (for the channel example above). This is why channel pruning achieves lower compression ratios than fine-grained pruning at the same accuracy drop.

A ResNet-50 layer is channel-pruned at 40% sparsity (remove 40% of output channels). If it originally had C_out=256, C_in=128, k=3, output 28×28, what are the new parameter count and the speedup ratio?

Params stay the same; only MACs reduce by 40%.
Both params and MACs reduce by 40%: from 294,912 → 176,947 params, ~1.67× speedup.
Only params reduce by 40%; MACs are unchanged (hardware runs same tensor ops).
Params reduce by 40% but MACs reduce by 40%² = 16% because both output and input channels shrink.

Chapter 4: Criteria I — Magnitude-Based Pruning

You've decided on a granularity. Now you need to score every weight (or group of weights) with an importance value. The simplest and most widely-used criterion is magnitude: larger absolute value = more important. The intuition is direct — a weight of 0.001 contributes almost nothing to the output, while a weight of 5.0 strongly amplifies or suppresses its input. Remove the small ones.

For element-wise (fine-grained) pruning, the importance of a single weight w_i is:

Importance(w_i) = |w_i|

You sort all weights by |w|, and remove the bottom-k%. This is the approach in Han et al. 2015 — what they called "learning connections" (find a threshold, zero out everything below it, retrain with a frozen mask).

For row-wise (structured) pruning, you need a single scalar importance for a whole group of weights. Two natural choices:

L1-norm: Importance(S) = ∑_{i ∈ S} |w_i|

L2-norm: Importance(S) = (∑_{i ∈ S} w_i²)^(1/2) = ‖W^(S)‖₂

Worked example with numbers: Take a 2×4 weight matrix:

# Weight matrix (2 rows, 4 cols)
import torch

W = torch.tensor([[3.0, -2.0, 0.0, 1.0],
                  [-5.0, 0.0, 1.0, -0.2]])

# L1-norm row importance: sum of |w| per row
# row0: |3| + |-2| + |0| + |1| = 6
# row1: |-5| + |0| + |1| + |-0.2| = 6.2
row_l1 = W.abs().sum(dim=1)

# L2-norm row importance: square root of the sum of w² per row
# row0: sqrt(9 + 4 + 0 + 1) = sqrt(14) ≈ 3.74
# row1: sqrt(25 + 0 + 1 + 0.04) = sqrt(26.04) ≈ 5.10
row_l2 = W.pow(2).sum(dim=1).sqrt()

# Both norms agree: row1 is more important than row0.
# At 50% row sparsity, row0 would be pruned.
prune_row = int(row_l1.argmin())  # index 0

Magnitude pruning — weight histogram with threshold sweep

A simulated weight distribution (Gaussian, matching real layer statistics). Drag the sparsity slider to sweep the threshold. Bars below threshold are shown in red (pruned). The accuracy proxy curve shows how accuracy typically degrades.

Sparsity: 50%

Magnitude ≠ importance in general. Magnitude is a heuristic proxy. A weight of 0.01 in a high-curvature region of the loss surface may be far more important than a weight of 5.0 in a flat region. The magnitude criterion works well on average for large networks, but it can systematically mislead near saddle points or when parameter scales differ across layers. Chapter 6 (OBD) shows the principled fix.

Per-layer magnitude pruning in PyTorch:

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask: 1=keep, 0=prune. Prunes the `sparsity` fraction with smallest |w|."""
    flat = weight.abs().flatten()
    k = int(sparsity * flat.numel())  # number of weights to prune
    if k == 0:
        return torch.ones_like(weight)  # nothing to prune (kthvalue needs k >= 1)
    threshold = torch.kthvalue(flat, k).values  # k-th smallest |w|
    # Strict '>' also prunes exact ties at the threshold (rare for float weights)
    return (weight.abs() > threshold).to(weight.dtype)

# Example: 4x4 weight matrix, 50% sparsity
W = torch.randn(4, 4)
mask = magnitude_prune(W, sparsity=0.5)
W_pruned = W * mask  # zero out pruned weights
print(f"Nonzero weights: {int(mask.sum())}/16")  # → 8 (barring exact ties)

The persistent-mask gotcha: During finetuning, the optimizer will update all weights, including the ones you just pruned to zero. After each gradient step, you must re-apply the mask: weight.data *= mask. If you forget this, gradient descent will gradually re-inflate the pruned weights, undoing your sparsity. This is why production pruning utilities (torch.nn.utils.prune) keep the dense tensor as weight_orig alongside a weight_mask buffer and register a forward pre-hook that recomputes weight = weight_orig * weight_mask before every forward pass.
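The manual version of this pattern can be sketched on a single layer. The data below is random stand-in data and the hyperparameters are arbitrary; the point is only the mask re-application after each optimizer step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 4)

# Magnitude mask at 50% sparsity: keep the larger half of |w|
flat = layer.weight.detach().abs().flatten()
threshold = flat.kthvalue(flat.numel() // 2).values
mask = (layer.weight.detach().abs() > threshold).float()
with torch.no_grad():
    layer.weight.mul_(mask)                     # prune

optimizer = torch.optim.SGD(layer.parameters(), lr=0.1, momentum=0.9)
x, y = torch.randn(32, 8), torch.randn(32, 4)   # stand-in data

for _ in range(20):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        layer.weight.mul_(mask)                 # re-apply mask after every step

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity after finetuning: {sparsity:.2f}")
```

Removing the second mul_ call lets the momentum-driven updates move the pruned entries away from zero, which is the failure mode described above.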

You apply magnitude pruning at 90% sparsity to a weight tensor. You then run 10 epochs of finetuning WITHOUT re-applying the mask after each optimizer step. What happens to the sparsity after finetuning?

Sparsity increases further because the optimizer naturally drives small weights to zero.
Sparsity stays at 90% because the pruned weights have gradient = 0 (they're already at zero).
Sparsity decreases: the optimizer will push pruned zeros to non-zero values to minimize loss.
Sparsity is irrelevant during finetuning; only accuracy matters.

Chapter 5: Criteria II — Scaling Factors & APoZ

Magnitude-of-weights is a natural criterion when weights are all at the same scale. But two other signals carry more direct information about channel importance: the BatchNorm scaling factor (γ) and the average percentage of zero activations (APoZ). Each exploits information that is simply not available in the weight magnitude alone.

Scaling-based pruning — BN γ as channel importance

In a convolutional network with BatchNorm, each output channel j has a learned scaling factor γ_j. The BatchNorm output is:

z_j = γ_j · (x_j − μ_j) / σ_j + β_j

If γ_j is very small, channel j barely contributes to the next layer's input — regardless of what the convolutional weights themselves look like. A γ near zero says: "this channel's output is being globally suppressed by training." That's a much more reliable importance signal than raw weight magnitude, because it reflects the network's learned decision about which channels matter.

The Network Slimming technique (Liu et al. ICCV 2017) adds an L1 sparsity penalty on all γ values during training:

L_total = L_task + λ ∑_j |γ_j|

This regularization nudges unimportant channel scaling factors toward zero. After training, pruning is trivial: sort channels by |γ_j|, remove the bottom-k%, and retrain.
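The penalty and the ranking step can be sketched as follows. This is an illustrative outline and not the reference Network Slimming code: the tiny model, the stand-in loss, the value of λ, and the γ values (taken from the example table in this chapter) are all assumptions.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model: nn.Module) -> torch.Tensor:
    """Sum of |gamma| over all BatchNorm2d layers (the sparsity term)."""
    gammas = [m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.BatchNorm2d)]
    return torch.stack(gammas).sum()

def channels_to_prune(bn: nn.BatchNorm2d, fraction: float) -> torch.Tensor:
    """Indices of the `fraction` of channels with the smallest |gamma|."""
    k = int(fraction * bn.num_features)
    return bn.weight.detach().abs().argsort()[:k]

model = nn.Sequential(nn.Conv2d(3, 5, 3, padding=1), nn.BatchNorm2d(5), nn.ReLU())
with torch.no_grad():   # pretend training produced these scaling factors
    model[1].weight.copy_(torch.tensor([1.17, 0.10, 0.29, 0.82, 0.56]))

x = torch.randn(4, 3, 16, 16)
task_loss = model(x).mean()              # stand-in for the real task loss
lam = 1e-4                               # hypothetical penalty strength λ
total_loss = task_loss + lam * bn_l1_penalty(model)
total_loss.backward()                    # gradients now include the L1 term

pruned = channels_to_prune(model[1], fraction=0.4)
print(pruned)                            # tensor([1, 2])
```

Actually removing the channels also requires slicing the preceding conv's output filters and the next layer's input channels, which is omitted here.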

BN scaling is a free importance signal: If your network already has BatchNorm (almost every modern CNN does), the γ values are trained alongside the weights. No extra computation needed. The pruning criterion is a byproduct of normal training — you just read out γ after the fact.

Channel	γ value	Action
Filter 0	1.17	Keep
Filter 1	0.10	Prune
Filter 2	0.29	Prune
Filter 3	0.82	Keep
Filter N-1	0.56	Keep

APoZ — percentage of zero activations

A different approach looks not at weights but at activations. ReLU networks produce many zeros in activation maps — any negative pre-activation becomes exactly zero. If a neuron (or channel) produces zero for the vast majority of inputs, it is contributing almost nothing to downstream computation. APoZ — Average Percentage of Zero activations — quantifies this.

APoZ_j = (1 / (N × H × W)) ∑_{i,h,w} ϕ( A_j(x_i)_{h,w} = 0 )

where ϕ is an indicator that equals 1 when the activation is exactly zero, N is the number of evaluation samples, and H×W is the spatial dimension of the feature map.

APoZ is computed on a calibration dataset (typically a few thousand training examples) by running a forward pass and recording how often each channel's activations are zero. A channel with APoZ = 95% outputs zero at 95% of its activation positions across the calibration set, which makes it an obvious pruning candidate.

APoZ is data-dependent. Unlike magnitude or γ, APoZ changes with the input distribution. A channel critical for one type of input (say, detecting fur texture) may show high APoZ on a dataset of vehicles. This is a feature (you can adapt pruning to a specific deployment domain) and a risk (over-pruning domain-specific features). Always compute APoZ on a representative calibration set.

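import torch

def compute_apoz(model, dataloader, layer_name):
    """Compute Average Percentage of Zeros for each channel in a layer.

    layer_name must name a module whose output is post-ReLU (e.g. the ReLU
    itself); a conv or BatchNorm output is almost never exactly zero.
    Assumes the dataloader yields (input, label) pairs on the model's device.
    """
    zero_counts = None
    total = 0
    def hook(module, inp, output):
        nonlocal zero_counts, total
        # output: (N, C, H, W)
        zeros = (output == 0).float().sum(dim=[0, 2, 3])  # sum over N,H,W → (C,)
        count = output.shape[0] * output.shape[2] * output.shape[3]
        zero_counts = zeros if zero_counts is None else zero_counts + zeros
        total += count
    h = model.get_submodule(layer_name).register_forward_hook(hook)
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            for x, _ in dataloader:
                model(x)
    finally:
        h.remove()                 # always detach the hook
        model.train(was_training)  # restore the original train/eval mode
    return zero_counts / total  # APoZ per channel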

You are channel-pruning a network using BN γ values. Channel A has |γ_A| = 0.05; Channel B has |γ_B| = 2.3. Before pruning, what should you do to make sure your threshold generalizes across all layers?

Prune any channel with |γ| < 0.5 globally, since this threshold works everywhere.
Compare |γ| within each layer only, since different layers have different γ distributions.
Divide each γ by the layer's maximum |γ| to normalize, then apply a relative threshold.
Apply L1 sparsity regularization on γ during training so they are already on a common scale by pruning time.

Chapter 6: Criteria III — Second-Order Pruning (OBD)

Magnitude tells you how big a weight is. But what you really want to know is: how much does the loss change if I remove this weight? For a well-trained network, removing weight w_i means setting δw_i = w_i (deleting it is equivalent to perturbing it by its current value). The change in loss is δL.

Step 1: Taylor expansion of δL. Write the pruned weights as W − δW and expand δL = L(x; W − δW) − L(x; W) around the current weights W:

δL = −∑_i g_i δw_i + ½ ∑_i h_ii δw_i² + ½ ∑_{i≠j} h_ij δw_i δw_j + O(‖δW‖³)

where g_i = ∂L/∂w_i (gradient) and h_ij = ∂²L/(∂w_i∂w_j) (Hessian entry). The minus sign on the gradient term comes from subtracting δW; the second-order terms are unaffected by it.

Step 2: Three OBD assumptions.

Quadratic objective: Drop the cubic and higher terms → δL = −∑ g_iδw_i + ½∑ h_iiδw_i² + cross terms.
Training has converged: At a local minimum, gradients are approximately zero (g_i ≈ 0 for all i). Drop the first-order term → δL = ½∑ h_iiδw_i² + cross terms.
Independent errors: Assume pruning one weight doesn't affect the contribution of others (cross terms ≈ 0). Drop off-diagonal Hessian → δL = ½∑ h_iiδw_i².

Step 3: The OBD saliency. Now set δw_i = w_i (pruning weight i means removing it, i.e., the perturbation equals the weight value) and δw_j = 0 for all j ≠ i:

δL_i ≈ ½ h_ii w_i²

This is the OBD saliency — the estimated loss increase from removing weight i. Weights with small saliency are safe to prune. The key difference from magnitude: saliency weighs the weight by its curvature h_ii. A small weight in a high-curvature region (h_ii large) can be crucial; a large weight in a flat region (h_ii small) can be safely removed.

OBD saliency vs magnitude — 2D loss surface toy example

Two-weight model. Contours show the loss surface. Current weights: w₁=0.1, w₂=3.0. Note that w₁ sits in a steep, narrow valley (high curvature), while w₂ lies along a nearly flat direction (low curvature). Magnitude prunes w₁; OBD correctly prunes w₂. Drag w₂ to explore.

w₂ curvature (h₂₂): 0.01

w₁ curvature (h₁₁): 200

Worked numerical example: Two weights, w₁ = 0.1 and w₂ = 3.0.

S₁ = ½ h₁₁ w₁² = ½ × 200 × 0.01 = 1.0

S₂ = ½ h₂₂ w₂² = ½ × 0.01 × 9.0 = 0.045

OBD decision: Prune w₂ (saliency 0.045 < 1.0, so removing it costs only 0.045 loss units). Magnitude decision: Prune w₁ (|w₁| = 0.1 < |w₂| = 3.0). OBD avoids the expensive mistake: w₁ sits in a steep, narrow valley of the loss surface (h₁₁=200 means moving w₁ by 0.1 costs 1.0 loss units), while w₂=3.0 lies along a nearly flat direction (h₂₂=0.01 means removing it barely changes the loss).
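The two saliencies can be checked against an explicit loss. The sketch below assumes a purely diagonal quadratic loss whose minimum sits at the current weights, which is exactly the setting in which the OBD approximation holds without error; toy_loss is a made-up function for this check.

```python
def obd_saliency(h_diag: float, w: float) -> float:
    """Estimated loss increase from deleting a weight: 0.5 * h_ii * w_i^2."""
    return 0.5 * h_diag * w * w

def toy_loss(w1: float, w2: float) -> float:
    """Diagonal quadratic loss with its minimum at (0.1, 3.0) and
    curvatures h11 = 200, h22 = 0.01."""
    return 0.5 * 200.0 * (w1 - 0.1) ** 2 + 0.5 * 0.01 * (w2 - 3.0) ** 2

w1, w2 = 0.1, 3.0
s1, s2 = obd_saliency(200.0, w1), obd_saliency(0.01, w2)
print(s1, toy_loss(0.0, w2))   # both ≈ 1.0: cost of deleting w1
print(s2, toy_loss(w1, 0.0))   # both ≈ 0.045: cost of deleting w2
```

On a real network the loss is neither quadratic nor diagonal, so the saliency is an estimate and not an exact loss change.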

The catch with OBD: Computing h_ii for a network with millions of parameters requires a second pass through the data and storing the full diagonal Hessian. The off-diagonal Hessian (cross terms we dropped) can be significant — Optimal Brain Surgeon (OBS, 1993) keeps all cross terms and is more accurate but requires O(n²) memory for n parameters. SparseGPT (2023) uses an efficient Hessian approximation to prune LLMs in one shot without retraining, using the OBS framework.

Taylor expansion pruning (1st order): A simpler compromise. Keep the gradient term and approximate saliency as |g_i × w_i|. This is equivalent to the first-order Taylor series and is much cheaper than computing second derivatives. It works well for small pruning steps (when the network isn't far from the original optimum). Used in Molchanov et al. CVPR 2019.
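A first-order score of this form needs only one backward pass. The following is a minimal sketch on a single linear layer with random stand-in data; it implements the |g_i × w_i| score as stated above and is not a reproduction of any published implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(6, 3)
x, y = torch.randn(16, 6), torch.randn(16, 3)   # stand-in batch

loss = nn.functional.mse_loss(layer(x), y)
loss.backward()                                  # fills layer.weight.grad

# First-order saliency per weight: |g_i * w_i|
saliency = (layer.weight.grad * layer.weight.detach()).abs()

# Prune the 50% of weights with the smallest saliency
k = saliency.numel() // 2
threshold = saliency.flatten().kthvalue(k).values
mask = (saliency > threshold).float()
print(mask)
```

In practice the score is usually accumulated over several minibatches, since a single batch gives a noisy gradient.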

In the OBD derivation, why is the first-order gradient term dropped (g_i = 0)?

The gradient is always zero for neural networks at any point during training.
At a training minimum, the network has converged so the gradient with respect to each weight is approximately zero.
We drop it to make the math simpler, not because it's actually small.
The gradient is zero for pruned weights only (those set to zero have no gradient).

Chapter 7: The Prune–Finetune Loop

You've picked a granularity and a criterion. Now comes the most practical question: do you prune the entire network at once (one-shot), or do you alternate between pruning and finetuning in small increments (iterative)?

The intuition for why iterative wins: imagine you're hiking and you want to remove 90% of your gear. If you dump 90% immediately, you've made a huge random change to how you carry load — you might drop critical items and the remaining 10% might not form a coherent system. But if you remove 10% at a time, evaluating and re-packing after each step, you can carefully select what to discard and the remaining gear stays coherent.

In network terms: one-shot pruning at 90% sparsity simultaneously removes many interdependent weights. The loss surface shifts sharply; the surviving weights were optimized for a different parameter configuration and finetuning may not recover accuracy. Iterative pruning at 10% per step makes a small change, finetunes to re-adapt the surviving weights, then repeats. Each finetuning step keeps the network near a good minimum before the next prune.

Misconception: “Finetuning after one-shot pruning gives the same result as iterative pruning, given the same total finetuning compute.” False. The quality of the final sparse model depends on the path taken through the sparsity-accuracy landscape. One-shot takes a large step off the manifold of good solutions; iterative stays near it. The accuracy recovery from finetuning is limited by how far the pruning moved you from a good minimum.

The standard loop:

# Iterative magnitude pruning with PyTorch
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# YourModel, dataloader, train_one_epoch and finetune_epochs are placeholders
# for your own model, data pipeline, training routine and schedule.
model = YourModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
target_sparsity = 0.90
steps = 9            # prune 10% → 20% → ... → 90%
step_size = target_sparsity / steps  # 0.10 of the original weights per step

for step in range(steps):
    prev_sparsity = step * step_size
    current_sparsity = (step + 1) * step_size
    # Repeated prune calls only rank the weights that are still unpruned, so
    # express the 10% step as a fraction of the remaining weights.
    amount = step_size / (1.0 - prev_sparsity)
    # --- Prune each linear/conv layer ---
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            prune.l1_unstructured(module, name='weight', amount=amount)
    # --- Finetune for a few epochs ---
    for epoch in range(finetune_epochs):
        train_one_epoch(model, optimizer, dataloader)
    print(f"Step {step+1}: sparsity={current_sparsity:.0%}")

# Remove the pruning hooks and make sparsity permanent
for name, module in model.named_modules():
    if hasattr(module, 'weight_mask'):
        prune.remove(module, 'weight')
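One detail of the schedule is worth checking numerically. When a layer is pruned repeatedly with torch.nn.utils.prune, each new unstructured mask is computed over the weights that are still unpruned, so a constant per-call fraction compounds instead of adding up. The sketch below is plain arithmetic with helper names of our own.

```python
def compounded_sparsity(amount: float, steps: int) -> float:
    """Sparsity after `steps` rounds that each remove `amount` of the remaining weights."""
    return 1.0 - (1.0 - amount) ** steps

def per_step_amounts(target: float, steps: int) -> list:
    """Fractions of the remaining weights to prune at each step so that
    overall sparsity grows linearly to `target`."""
    amounts = []
    for s in range(steps):
        prev = target * s / steps
        cur = target * (s + 1) / steps
        amounts.append((cur - prev) / (1.0 - prev))
    return amounts

print(f"{compounded_sparsity(0.10, 9):.3f}")   # 0.613
schedule = per_step_amounts(0.90, 9)
print([round(a, 3) for a in schedule])
# [0.1, 0.111, 0.125, 0.143, 0.167, 0.2, 0.25, 0.333, 0.5]
```

Nine rounds at a flat 10% of the remaining weights reach about 61% sparsity, whereas the increasing schedule lands on 90%.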

Per-layer sensitivity

Not all layers are created equal. The first convolutional layer in a CNN processes raw pixels — its small number of learned edge detectors is hard to replace. If you prune 80% of them, the whole network sees degraded features. The last fully-connected layer, by contrast, often has massive redundancy and can tolerate 80–90% sparsity with negligible accuracy loss.

Sensitivity analysis: for each layer independently, measure the accuracy drop as a function of sparsity. Plot it as a curve (accuracy vs sparsity per layer). This tells you the maximum safe sparsity for each layer and motivates non-uniform sparsity: sensitive layers get low sparsity (10–30%); robust layers get high sparsity (70–90%). The overall compression is a weighted average.
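Turning sensitivity curves into per-layer budgets can be as simple as a threshold on the measured accuracy drop. The sketch below assumes the curves have already been measured and are monotone in sparsity; the layer names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# layer -> list of (sparsity, accuracy drop in percentage points)
Curves = Dict[str, List[Tuple[float, float]]]

def max_safe_sparsity(curves: Curves, tolerance: float) -> Dict[str, float]:
    """For each layer, the highest measured sparsity whose accuracy drop
    stays within `tolerance`; 0.0 if no measured point qualifies."""
    budget = {}
    for layer, points in curves.items():
        safe = [s for s, drop in points if drop <= tolerance]
        budget[layer] = max(safe) if safe else 0.0
    return budget

# Hypothetical measurements, for illustration only
curves = {
    "conv1": [(0.1, 0.3), (0.3, 1.5), (0.5, 8.0), (0.7, 30.0)],
    "fc2": [(0.3, 0.0), (0.5, 0.1), (0.7, 0.4), (0.9, 1.2)],
}
budget = max_safe_sparsity(curves, tolerance=0.5)
print(budget)   # {'conv1': 0.1, 'fc2': 0.7}
```

Per-layer curves are measured with the other layers left dense, so the combined drop after pruning every layer is usually larger than any single curve suggests and still has to be verified end to end.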

Accuracy vs sparsity — one-shot vs iterative

Drag the sparsity slider. The orange curve shows one-shot pruning; the teal curve shows iterative pruning + finetuning. Notice how iterative recovers far more accuracy at the same final sparsity. Click “Run iterative step” to animate each prune–finetune cycle.

Target sparsity: 60%

You want to prune a network to 90% sparsity. Which approach typically gives higher final accuracy?

One-shot: prune to 90% immediately, then finetune for 10 epochs.
Iterative: prune 10% at a time over 9 steps, finetuning after each step.
It doesn't matter: given equal total finetuning compute, both approaches converge to the same result.
One-shot: fewer finetuning passes means less overfitting to the pruned structure.

Chapter 8: Showcase — Sparsity Explorer

This showcase puts together everything from this lesson: granularity, criterion, and the per-layer sensitivity landscape. You're the ML engineer. You have a small CNN with five layers. Each layer has a different sensitivity to pruning — the first conv is fragile, the FC layers are robust. Your tools are: choose the granularity (fine or channel), choose the criterion (magnitude or OBD proxy), choose per-layer sparsity, and see how accuracy and MAC savings evolve.

Per-layer sensitivity bars

Each bar shows accuracy drop vs sparsity for that layer (measured independently). Taller bars = more sensitive = set lower target sparsity. The tool recommends a safe uniform vs adaptive budget allocation.

Channel pruning — tensor shrink visualizer

Watch the conv weight tensor shrink as you increase channel sparsity. The gray cells disappear, and the next layer's input dimension (left side) shrinks too. Slide to see how params and MACs scale linearly with retained channels.

Channel sparsity: 30%

Worked MAC savings at different sparsity targets:

Sparsity	Remaining params	Remaining MACs	Notes
0%	1,179,648	231.2M	Baseline (C_in=256, C_out=512, 3×3, 14×14)
30%	825,754	161.8M	Safe for most layers; 1.43× speedup
50%	589,824	115.6M	Aggressive but recoverable with finetuning; 2× speedup
70%	353,894	69.4M	High sparsity; only for robust later layers; 3.3× speedup
90%	117,965	23.1M	Extreme; only final FC layers can typically tolerate this

Channel pruning gives exact proportional scaling in params and MACs: 50% channel sparsity ⇒ 50% param reduction ⇒ 50% MAC reduction ⇒ up to ~2× wall-clock speedup (the realized speedup is usually lower and varies with memory bandwidth and other overheads).
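The table rows can be reproduced from the standard conv-layer formulas. The sketch below assumes only output channels of this one layer are pruned, ignores biases, and allows fractional channel counts so the numbers match the idealized table. The function names are hypothetical.

```python
from typing import Tuple


def conv_cost(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> Tuple[int, int]:
    """Parameter and MAC counts for a dense k x k conv layer (bias ignored)."""
    params = c_in * c_out * k * k
    macs = params * h_out * w_out
    return params, macs


def pruned_conv_cost(
    c_in: int, c_out: int, k: int, h_out: int, w_out: int, sparsity: float
) -> Tuple[float, float, float]:
    """Remaining params, remaining MACs, and the ideal (MAC-bound) speedup."""
    params, macs = conv_cost(c_in, c_out, k, h_out, w_out)
    keep = 1.0 - sparsity
    return params * keep, macs * keep, 1.0 / keep


for s in (0.0, 0.3, 0.5, 0.7, 0.9):
    p, m, speedup = pruned_conv_cost(256, 512, 3, 14, 14, s)
    print(f"{s:.0%}  {p:>11,.0f}  {m / 1e6:6.1f}M  {speedup:.2f}x")
```

If the preceding layer is channel-pruned as well, C_in shrinks too and the savings for this layer compound.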

The full picture: The optimal pruning strategy for a real deployment combines: (1) sensitivity analysis to assign per-layer sparsity budgets; (2) a good criterion (OBD/Taylor for precision, magnitude for speed); (3) iterative pruning with finetuning after each step; (4) the right granularity for your target hardware (channel for mobile/MCU, N:M for Ampere GPUs such as the A100). These choices are not independent. The Pruning II lesson (Lecture 4) covers AMC (AutoML for Model Compression), which uses reinforcement learning to search for the per-layer sparsity budget automatically.

Layer 1 (first conv) has an 80% accuracy drop at 50% sparsity; Layer 4 (last FC) has a 2% drop at 80% sparsity. You have a total sparsity budget of 60% across the network. Which allocation is better?

Apply 60% uniformly to all layers: simplest and reproducible.
Apply 80% to Layer 1 (largest param count) and less elsewhere to maximize compression.
Apply ~10% to Layer 1 (sensitive) and ~80–90% to Layer 4 (robust): allocate sparsity inversely to sensitivity.
Apply 60% only to Layer 4: it's robust, so it absorbs all the compression budget.

Chapter 9: Connections & Cheat Sheet

You now have the core vocabulary of pruning. Here's the full picture in one table, followed by what comes next.

Granularity × Criteria Cheat Sheet

Granularity	Best criteria	Hardware target	When to use
Fine-grained	OBD saliency, magnitude	Custom ASIC (EIE); A100 sparse tensor cores only via the 2:4 pattern	Maximum compression ratio; edge accelerators with sparse support
N:M (2:4)	Magnitude within groups	NVIDIA Ampere (A100, RTX 30+)	GPU deployment needing real speedup at 50% sparsity
Vector/kernel	L1/L2 group norm	Partially regular HW	Rare — middle ground between fine and channel
Channel/filter	BN γ, L1-norm of filter, Taylor criterion	Any GPU/CPU/MCU	Latency-critical deployment; no sparse HW available
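To make the N:M row concrete, here is a minimal sketch of building a 2:4 mask with the magnitude-within-groups criterion from the table. `n_m_mask` is a hypothetical helper. Real deployments would go through the vendor's sparsity tooling, which also handles the compressed storage format.

```python
from typing import List


def n_m_mask(row: List[float], n: int = 2, m: int = 4) -> List[int]:
    """Keep the n largest-magnitude weights in every consecutive group of m."""
    if len(row) % m:
        raise ValueError("row length must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest |w| inside this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


row = [0.1, -0.9, 0.4, 0.05, 0.7, 0.6, -0.8, 0.0]
mask = n_m_mask(row)
sparse_row = [w * k for w, k in zip(row, mask)]
print(mask)
print(sparse_row)
```

Every group of four ends up with exactly two zeros, so sparsity is fixed at 50% and the pattern stays regular enough for hardware to exploit.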

Criterion	Complexity	When it excels	Limitation
Magnitude (|W|)	O(n)	Large networks; fast iterations; works well on average	Ignores curvature; can misprioritize in steep regions
L1/L2 group norm	O(n)	Structured/channel pruning; natural for filter ranking	Same blind spot as magnitude
BN γ	O(C) free	Networks with BatchNorm; training-time regularization	Only for channel pruning; requires retraining with L1-γ reg
APoZ	O(n × data)	Detecting dead neurons; domain-specific pruning	Data-dependent; must run calibration set forward pass
OBD saliency	O(n) + Hessian diag	When you want principled pruning near a minimum	Diagonal Hessian approximation; off-diagonal errors
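The "ignores curvature" limitation is easiest to see on a tiny example where magnitude and OBD disagree. The sketch below uses the OBD saliency ½·h_ii·w_i² from this lesson. The weights and diagonal Hessian entries are made-up numbers, and the function names are hypothetical.

```python
from typing import List


def magnitude_scores(w: List[float]) -> List[float]:
    """Importance = |w_i|."""
    return [abs(x) for x in w]


def obd_saliencies(w: List[float], h_diag: List[float]) -> List[float]:
    """OBD estimate of the loss increase from zeroing w_i: 0.5 * h_ii * w_i^2."""
    return [0.5 * h * x * x for x, h in zip(w, h_diag)]


def prune_order(scores: List[float]) -> List[int]:
    """Weight indices from least to most important."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


w = [0.50, 0.10, -0.30]
h_diag = [0.02, 4.00, 1.00]  # hypothetical diagonal Hessian entries

print("magnitude prunes first:", prune_order(magnitude_scores(w)))
print("OBD prunes first:      ", prune_order(obd_saliencies(w, h_diag)))
```

Here the largest weight sits in a flat direction of the loss (small h_ii), so OBD removes it first, while magnitude would remove the small weight that sits in a steep direction.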

What this lesson covered

The pruning formulation: argmin over W_P of L(x; W_P) s.t. ‖W_P‖₀ ≤ N, where N caps the number of nonzero weights. NP-hard, so solved by heuristics.
Granularity spectrum: fine-grained (max compression, needs sparse HW) → channel (less compression, always accelerates).
N:M 2:4 sparsity: NVIDIA Ampere native support, up to 2× throughput at 50% sparsity.
Magnitude criterion: sort by |W|, threshold, persistent mask during finetune.
BN γ criterion: training-time L1 reg on scaling factors, free importance signal.
APoZ: activation-based, data-dependent, catches dead neurons magnitude misses.
OBD derivation: δL ≈ ½ h_ii w_i² after 3 assumptions (quadratic, converged, independent).
Iterative pruning beats one-shot by keeping the network near good minima throughout.
Per-layer sensitivity: first conv sensitive, last FC robust → non-uniform sparsity budgets.

Bridge to Pruning II (Lecture 4)

This lesson covered the foundations. Pruning II dives into three open frontiers:

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019): sparse sub-networks ("winning tickets") that, when trained in isolation from scratch with the right random initialization, can match the full network's accuracy. This reframes pruning as sub-network discovery, not just weight removal.
AMC, AutoML for Model Compression (He et al., ECCV 2018): use reinforcement learning to automatically determine the per-layer sparsity budget, replacing the hand-crafted sensitivity analysis with a learned policy. Input: layer spec. Output: optimal sparsity. 2× better latency-accuracy tradeoff than manual tuning.
System support: how to represent sparse models efficiently in memory (compressed sparse row (CSR), compressed sparse column (CSC), 2:4 compressed format), and what the hardware actually does with them (NVIDIA sparse tensor cores, EIE, SpArch).
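As a small preview of the storage formats mentioned above, here is a minimal sketch of compressed sparse row (CSR) storage and a matrix-vector product that touches only the nonzeros. The helper names are hypothetical and the code is illustrative, not an optimized kernel.

```python
from typing import List, Tuple


def to_csr(dense: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Return (values, col_idx, row_ptr) for a dense row-major matrix."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(
    values: List[float], col_idx: List[int], row_ptr: List[int], x: List[float]
) -> List[float]:
    """y = A @ x, visiting only the stored nonzeros of A."""
    y: List[float] = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
values, col_idx, row_ptr = to_csr(dense)
print(values, col_idx, row_ptr)
print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0]))
```

The index arrays are pure overhead, which is one reason unstructured sparsity needs to be high before it saves memory or time on general-purpose hardware.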

Related lessons on this site

TinyML L1: Why Efficiency & Metrics — the roofline model, energy costs, and why DRAM access (640 pJ) dominates computation (0.9 pJ). Pruning reduces model size and DRAM loads.
TinyML L2: NN Building Blocks & Compute — layer-level param and MAC formulas used throughout this lesson. Channel pruning shrinks C_out; the formulas give exact savings.
LoRA & PEFT — a complementary compression direction: instead of removing weights, add a small low-rank adapter. Often combined with pruning for maximum compression.
Quantization (TinyML L5) — reduce bitwidth rather than count. Orthogonal to pruning: you can quantize a pruned network for compounding savings.

Feynman test: Can you explain to a non-ML engineer why removing 90% of a network's weights doesn't destroy it? The right analogy: a symphony orchestra has 80 musicians but most pieces could be performed accurately by 30. The others are redundant — they reinforce notes that would be heard anyway. Training a neural network creates similar redundancy: many weights encode the same information in slightly different ways. Pruning finds the 10% that actually carry the signal.

You want to deploy a pruned ResNet-50 on a mobile phone CPU (ARM NEON, no sparse tensor core support). Which pruning granularity and approach gives you the best wall-clock speedup?

Fine-grained magnitude pruning at 80% sparsity: highest parameter reduction.
N:M 2:4 sparsity: hardware-accelerated on Ampere GPUs.
Channel pruning: produces a smaller dense network that runs faster on any hardware.
OBD-based fine-grained pruning: a more principled criterion gives better accuracy.