You have a convolutional net that's accurate — but it won't fit on your phone. To shrink it, you must first know WHERE its cost is. This lesson derives exact parameter and MAC formulas for every major layer type (Linear, Conv2d, grouped conv, depthwise separable), shows the MobileNet FLOP reduction trick from first principles, covers BatchNorm folding, the Transformer cost breakdown, and builds five interactive tools so you can measure any architecture before you touch a single weight. MIT 6.5940 by Song Han.
You're handed a ResNet-50. It scores 76% top-1 on ImageNet — solid accuracy. Your target device is a Cortex-M4 microcontroller with 512 KB of SRAM and a requirement to run inference in under 100 ms. You open up the model. There are 25 million parameters. The model at FP32 weighs 100 MB. The chip has 0.5 MB. Something has to give — but what?
The naive answer is "compress everything equally." But that's like telling a plumber to replace all the pipes when only one is leaking. Different layers have radically different cost profiles. The first conv layer in ResNet-50 has barely any parameters — but it processes the full 224×224 input and accounts for a massive fraction of activation memory. The final FC layer has millions of parameters — but processes a single vector, so MACs are low. Compress them the same way and you'll waste effort on cheap layers while leaving expensive ones untouched.
This lesson gives you the X-ray vision to see inside any architecture. By the end you'll be able to look at a layer spec — "3×3 Conv, C_in=64, C_out=128, output 56×56" — and instantly compute its parameter count, MAC count, activation size, and arithmetic intensity. That measurement is step zero before any compression technique in this course.
We'll cover every major layer: Linear (fully-connected), Conv2d, grouped convolution, depthwise separable convolution (the MobileNet trick), pooling, BatchNorm (and how to fold it into Conv at inference), activation functions, residual/skip connections, and the Transformer block (QKV projections, attention, FFN). Every formula will be derived, not memorized, and plugged with real numbers.
The formulas are few and simple. The insight is learning which term matters for each hardware target. On a server GPU: MACs are the bottleneck for batch inference. On a mobile CPU: memory bandwidth is the bottleneck for single-sample inference. On a microcontroller: SRAM capacity (activation memory) is the bottleneck, often tighter than either MACs or parameter storage. The same architecture can be optimal in one setting and completely infeasible in another — not because it changed, but because the bottleneck shifted.
By the end of this lesson you will know: (1) every layer type's exact formula; (2) where params, MACs, and activations peak in a CNN and a Transformer; (3) what each efficiency technique targets; (4) how to use Python hooks or manual calculation to profile any model. These are the skills every efficient ML practitioner needs before touching compression code.
Here is a map of what you'll know after each chapter:
The Linear layer (also called fully-connected, or FC) is the simplest building block. Every output neuron connects to every input neuron. If the input has Cin features and the output has Cout features, then every output y_j is computed as: y_j = Σ_i W_ji·x_i + b_j, for j = 1…Cout.
The weight tensor W has shape (Cout, Cin). The bias b has shape (Cout,). Count the elements: Params = Cin·Cout + Cout.
For the MAC count, each output neuron requires Cin multiply-accumulate operations (one per input × weight pair), and there are Cout output neurons: MACs = Cin·Cout.
Concrete example, BERT's final projection: Cin=768, Cout=768. Params = 768×768 + 768 = 590,592; MACs = 768×768 = 589,824.
Notice params ≈ MACs for a Linear layer — one MAC per parameter, because each weight is used exactly once per forward pass (no spatial reuse). This is the key difference from Conv, where each weight gets reused across spatial positions.
AlexNet's two 4096-input FC layers (4096→4096 and 4096→1000) together hold 4096²+4096×1000 = 20,873,216 params, 34% of AlexNet's total 61M, yet do only 20.9M MACs, 2.9% of AlexNet's 724M total. Params are expensive on storage (83.5 MB at FP32) but cheap on compute. Quantizing the weights 4× from FP32 to INT8 saves about 62.6 MB while leaving the MAC count unchanged. This is why quantization disproportionately benefits models with large FC layers.
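The Linear-layer counts and the storage arithmetic can be checked in a few lines. This is a minimal sketch: `linear_cost` and `weight_megabytes` are hypothetical helper names introduced here, biases are ignored for the AlexNet figure (as in the text), and 1 MB is taken as 10^6 bytes.

```python
def linear_cost(c_in, c_out, bias=True):
    """Return (params, macs) for a fully-connected layer at batch size 1."""
    params = c_in * c_out + (c_out if bias else 0)
    macs = c_in * c_out  # one multiply-accumulate per weight
    return params, macs


def weight_megabytes(num_weights, bits):
    """Storage for num_weights values at the given bit width, in MB (10^6 bytes)."""
    return num_weights * bits / 8 / 1e6


# 768 -> 768 projection
print(linear_cost(768, 768))  # (590592, 589824)

# AlexNet's 4096->4096 and 4096->1000 layers, weights only
fc_weights = (linear_cost(4096, 4096, bias=False)[0]
              + linear_cost(4096, 1000, bias=False)[0])
fp32_mb = weight_megabytes(fc_weights, 32)
int8_mb = weight_megabytes(fc_weights, 8)
print(f"{fc_weights:,} weights: {fp32_mb:.1f} MB at FP32, {int8_mb:.1f} MB at INT8")
```

The MAC count is the same in both cases; only the bytes per weight change.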
Drag the sliders to set layer dimensions. The canvas shows the weight matrix shape and live counts.
A 2D convolution layer takes an input feature map of shape (Cin, Hin, Win) and produces an output of shape (Cout, Hout, Wout). The kernel has spatial size Kh × Kw. There is one kernel per (input channel, output channel) pair — so the weight tensor has shape (Cout, Cin, Kh, Kw).
Count the parameters: Cout output channels, each needing a kernel of size Cin × Kh × Kw, plus one bias per output channel: Params = Cout·Cin·Kh·Kw + Cout.
For MACs: each of the Hout × Wout output positions requires computing Cout output values, each needing Cin × Kh × Kw multiply-accumulates: MACs = Cin·Cout·Kh·Kw·Hout·Wout.
The crucial insight: params don't depend on spatial size, but MACs do. The same weight tensor is reused at every output position. A layer with output spatial Hout×Wout = 56×56 = 3,136 performs 3,136× more MACs than a layer with output 1×1 — with identical parameter count.
Given input height Hin, kernel Kh, padding p, and stride s, the output height is: Hout = ⌊(Hin + 2p − Kh)/s⌋ + 1. The width Wout follows the same formula with Win and Kw.
Common cases: "same" padding (output = input size): p = (K−1)/2 with s=1. "Valid" padding (no padding, p=0): Hout = Hin − K + 1. Stride-2 halving: p=1, s=2, K=3 → Hout = Hin/2. For the AlexNet first conv: Hin=224, K=11, p=2, s=4 → Hout = ⌊(224+4−11)/4⌋+1 = ⌊217/4⌋+1 = 54+1 = 55. ✓
This formula matters for counting MACs: if you miscompute Hout, your MAC count is wrong. The padding and stride choices also affect how many times each input pixel contributes to outputs — relevant for understanding why strided conv reduces activations without reducing params.
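Since a wrong Hout silently corrupts every MAC count downstream, it helps to compute it rather than eyeball it. A small sketch of the output-size formula follows; `conv_out_size` is a hypothetical helper name, and it handles one spatial axis at a time.

```python
def conv_out_size(size_in, kernel, padding=0, stride=1):
    """Output length along one spatial axis: floor((in + 2p - k) / s) + 1."""
    return (size_in + 2 * padding - kernel) // stride + 1


print(conv_out_size(224, 11, padding=2, stride=4))  # 55  (AlexNet Conv1)
print(conv_out_size(56, 3, padding=1, stride=1))    # 56  ("same" padding)
print(conv_out_size(56, 3, padding=1, stride=2))    # 28  (stride-2 halving)
print(conv_out_size(56, 3))                         # 54  ("valid" padding)
```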
Now apply the formula to AlexNet's first layer: 3→96 channels, 11×11 kernel, output 55×55 (stride 4 applied to 224×224 input): Params = 96×3×11×11 + 96 = 34,944; MACs = 3×96×11×11×55×55 = 105,415,200.
That first layer alone costs 105M MACs, 14.6% of AlexNet's total 724M MACs, yet holds only about 35K of AlexNet's 61M parameters (0.06%). Classic early-layer profile: cheap on params, expensive on compute due to large spatial maps.
```python
def conv2d_cost(c_in, c_out, kh, kw, h_out, w_out):
    """Compute parameter count and MACs for a Conv2d layer."""
    params = c_out * c_in * kh * kw + c_out  # weights + bias
    macs = c_in * c_out * kh * kw * h_out * w_out
    return params, macs


# AlexNet layer 1: 3→96, 11×11 kernel, 55×55 output
p, m = conv2d_cost(3, 96, 11, 11, 55, 55)
print(f"Params: {p:,} MACs: {m:,}")
# Params: 34,944 MACs: 105,415,200

# 3×3 conv 64→128 on 56×56 (common ResNet block)
p, m = conv2d_cost(64, 128, 3, 3, 56, 56)
print(f"Params: {p:,} MACs: {m:,}")
# Params: 73,856 MACs: 231,211,008
```
Here is the complete AlexNet MAC and activation profile, derived from the formula above. Study the per-layer numbers — they reveal patterns you'll see in every CNN:
| Layer | Output shape | Params | MACs | Activation (elem) |
|---|---|---|---|---|
| Conv1 3→96, 11×11, s4 | 96×55×55 | 34,848 | 105,415,200 | 290,400 |
| MaxPool 3×3, s2 | 96×27×27 | 0 | ~0 | 69,984 |
| Conv2 96→256, 5×5, g2 | 256×27×27 | 307,200 | 223,948,800 | 186,624 |
| MaxPool 3×3, s2 | 256×13×13 | 0 | ~0 | 43,264 |
| Conv3 256→384, 3×3 | 384×13×13 | 884,736 | 149,520,384 | 64,896 |
| Conv4 384→384, 3×3, g2 | 384×13×13 | 663,552 | 112,140,288 | 64,896 |
| Conv5 384→256, 3×3, g2 | 256×13×13 | 442,368 | 74,760,192 | 43,264 |
| FC1 9216→4096 | 4096 | 37,748,736 | 37,748,736 | 4,096 |
| FC2 4096→4096 | 4096 | 16,777,216 | 16,777,216 | 4,096 |
| FC3 4096→1000 | 1000 | 4,096,000 | 4,096,000 | 1,000 |
| Total | - | 60,954,656 ≈ 61M | 724,406,816 ≈ 724M | 932,264 total (includes the 150,528-element input and the 9,216-element final pool output, not listed) |
Key observations: (1) FC1 alone holds 62% of the params; FC1+FC2 together hold 89%. (2) 92% of MACs are in the five conv layers. (3) The largest single activation is Conv1's output (290,400 elements). While Conv1 runs, the input (150,528 elements, not shown in the table) and that output must both be live, so together they set the peak: 440,928 elements, which is the activation memory AlexNet requires simultaneously.
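The table's totals and shares can be reproduced from the layer specs alone. The sketch below is weights-only (no biases), uses hypothetical helper names `conv_cost` and `fc_cost`, and assumes square outputs with the grouping shown in the table.

```python
def conv_cost(c_in, c_out, k, out_hw, groups=1):
    """Weights-only (params, MACs) for a KxK conv with an out_hw x out_hw output."""
    params = c_out * (c_in // groups) * k * k
    return params, params * out_hw * out_hw


def fc_cost(c_in, c_out):
    """Weights-only (params, MACs) for a fully-connected layer at batch size 1."""
    return c_in * c_out, c_in * c_out


alexnet = {
    "Conv1": conv_cost(3, 96, 11, 55),
    "Conv2": conv_cost(96, 256, 5, 27, groups=2),
    "Conv3": conv_cost(256, 384, 3, 13),
    "Conv4": conv_cost(384, 384, 3, 13, groups=2),
    "Conv5": conv_cost(384, 256, 3, 13, groups=2),
    "FC1": fc_cost(9216, 4096),
    "FC2": fc_cost(4096, 4096),
    "FC3": fc_cost(4096, 1000),
}

total_params = sum(p for p, _ in alexnet.values())
total_macs = sum(m for _, m in alexnet.values())
conv_macs = sum(m for name, (_, m) in alexnet.items() if name.startswith("Conv"))

for name, (p, m) in alexnet.items():
    print(f"{name:<6} {p:>12,} {m:>13,}  "
          f"{100 * p / total_params:5.1f}% params  {100 * m / total_macs:5.1f}% MACs")
print(f"Total  {total_params:>12,} {total_macs:>13,}")
print(f"Conv share of MACs: {100 * conv_macs / total_macs:.1f}%")
```

Printing the per-layer shares makes the split visible at a glance: parameters pile up in the FC rows, MACs in the conv rows.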
The receptive field of an output neuron is the region of the original input that influences its value. A single 3×3 conv layer sees a 3×3 patch. Stack two 3×3 conv layers and the output sees a 5×5 patch of the original input. Stack three and it sees 7×7. The formula for L layers with kernel size k (and stride 1) is: RF = L×(k−1) + 1.
For L=3, k=3: RF = 3×2+1 = 7×7. That's the same receptive field as a single 7×7 conv — but with dramatically fewer parameters. A single 7×7 conv with C channels has 7²=49 weights per (in,out) channel pair; three stacked 3×3 convs have 3×3²=27 — 45% fewer params, with an identical view of the input. This is why VGG replaced large kernels with stacked 3×3s.
Receptive field matters for efficiency because large RF is needed for scene understanding — but achieving it with a single large kernel is parameter-inefficient. Stacking small kernels also introduces more non-linearity (one ReLU per layer), which improves representational power. The tradeoff: more layers = more activation memory, since each intermediate activation map must be kept live for the residual connection or gradient computation.
VGG vs single large kernel: VGG-16 has 13 conv layers, all 3×3. Ignoring pooling, 13 stacked 3×3 layers give RF = 13×2+1 = 27, which a single layer could only match with a 27×27 kernel. At constant width C, the 3×3 stack has 13×3²×C² = 117C² weights; a single 27×27 layer would need 27²×C² = 729C² weights, 6.2× more for the same RF. Plus each 3×3 layer introduces a nonlinearity, giving VGG's stacked 3×3s better discriminative power per parameter. This "stack small kernels" insight is now standard across most CNN architectures. Dilated (atrous) convolutions offer another approach: use K=3 with dilation rate r to sample input at positions separated by r pixels, achieving RF = K+(K−1)(r−1) without any increase in params or compute.
Add conv layers and watch the receptive field (orange square) grow in the input grid. Formula: RF = L×(k−1)+1.
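The receptive-field formula and the weight comparison are easy to tabulate. A minimal sketch, assuming stride 1 throughout and constant channel width; `receptive_field` and `stacked_weights_per_channel_pair` are hypothetical helper names.

```python
def receptive_field(num_layers, kernel=3, dilation=1):
    """Receptive field (one side) of num_layers stacked stride-1 convs.

    A dilated kernel spans k_eff = k + (k - 1) * (dilation - 1) input pixels,
    and each stride-1 layer grows the field by k_eff - 1.
    """
    k_eff = kernel + (kernel - 1) * (dilation - 1)
    return num_layers * (k_eff - 1) + 1


def stacked_weights_per_channel_pair(num_layers, kernel=3):
    """Weights per (in, out) channel pair for the whole stack."""
    return num_layers * kernel * kernel


for layers in (1, 2, 3):
    rf = receptive_field(layers)
    stack = stacked_weights_per_channel_pair(layers)
    single = rf * rf  # one big kernel with the same receptive field
    print(f"{layers} x 3x3: RF {rf}x{rf}, {stack} weights "
          f"vs {single} for a single {rf}x{rf} kernel")

print(receptive_field(1, kernel=3, dilation=2))  # 5
```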
Standard convolution treats all input channels jointly for all output channels. What if we could break that coupling? Grouped convolution divides the Cin input channels into g equal groups and applies a separate, narrower convolution within each group. Each group sees only Cin/g input channels and produces Cout/g output channels: Params = Cout·(Cin/g)·Kh·Kw = Cout·Cin·Kh·Kw/g; MACs = Cin·Cout·Kh·Kw·Hout·Wout/g.
Both params and MACs shrink by exactly g×. AlexNet used g=2 to split across two GPUs in 2012 — it was an engineering hack that became a principled design choice.
Worked example — AlexNet Conv2, g=2: Cin=96, Cout=256, K=5×5. Standard conv params would be 256×96×25 = 614,400. With g=2: params = 614,400/2 = 307,200. MACs similarly halved: 96×256×25×27×27/2 = 223,948,800 (vs 447,897,600 for standard). The g=2 grouping effectively creates two parallel conv towers processing 48 input channels each and producing 128 output channels each — they never interact except at the next layer.
Take grouped convolution to the extreme: set g = Cin = Cout = C. Now each channel gets its own independent K×K spatial filter. This is a depthwise convolution. It captures spatial structure within each channel but never mixes channels. To mix channels afterward, apply a 1×1 pointwise convolution. Together they form depthwise separable convolution.
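Sweeping g from 1 up to C shows that depthwise convolution is simply the far end of the grouped-conv dial. This is a weights-only sketch; `grouped_conv_cost` is a hypothetical helper name, and the C=128, K=3, 56×56 setting is just an illustrative choice.

```python
def grouped_conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Weights-only (params, MACs) for a KxK conv split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    params = c_out * (c_in // groups) * k * k
    return params, params * h_out * w_out


C, K, H = 128, 3, 56
for g in (1, 2, 4, 32, C):
    params, macs = grouped_conv_cost(C, C, K, H, H, groups=g)
    label = "depthwise" if g == C else f"g={g}"
    print(f"{label:<10} params={params:>8,}  MACs={macs:>12,}")

# AlexNet Conv2 with g=2
print(grouped_conv_cost(96, 256, 5, 27, 27, groups=2))  # (307200, 223948800)
```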
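Step 1: Depthwise conv (C channels, K×K kernel, output Hout×Wout): Params = C·K²; MACs = C·K²·Hout·Wout.
Step 2: Pointwise conv (Cin=C → Cout, 1×1 kernel): Params = C·Cout; MACs = C·Cout·Hout·Wout.
Total depthwise separable (Cin=Cout=C for simplicity): MACs = C·K²·Hout·Wout + C²·Hout·Wout = C·(K² + C)·Hout·Wout.
Standard conv (same dimensions): MACs = C²·K²·Hout·Wout.
The FLOP reduction ratio: C·(K² + C) / (C²·K²) = 1/C + 1/K².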
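For typical values: K=3, C=128 → ratio = 1/128 + 1/9 ≈ 0.119, roughly 8.4× fewer MACs. For K=3, C=256 → ratio ≈ 0.115, about 8.7×. Once C is much larger than K², the 1/K² term dominates, so the saving approaches K² = 9× for 3×3 kernels. The 1/C term only matters when channels are few (C comparable to or smaller than K²).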
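To see which term of the ratio matters at a given width, tabulate 1/C and 1/K² side by side. A small sketch with hypothetical helper names; the second function recomputes the ratio from raw per-position MAC counts as a cross-check.

```python
def dw_sep_mac_ratio(c_out, k):
    """MACs(depthwise separable) / MACs(standard conv) = 1/c_out + 1/k**2."""
    return 1 / c_out + 1 / (k * k)


def mac_ratio_from_counts(c_in, c_out, k):
    """Same ratio computed from per-output-position MAC counts."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    standard = c_in * c_out * k * k
    return (depthwise + pointwise) / standard


for c in (4, 16, 128, 256, 1024):
    ratio = dw_sep_mac_ratio(c, 3)
    print(f"C={c:<5} 1/C={1 / c:.4f}  1/K^2={1 / 9:.4f}  "
          f"ratio={ratio:.4f}  saving={1 / ratio:.1f}x")
```

For very narrow layers the 1/C term is the larger one; for typical widths the saving sits just under K².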
MobileNetV1 (Howard et al. 2017) replaces the standard 3×3 convs of a VGG-like network (all but the first layer) with depthwise separable convs. Result: about 27× fewer MACs than VGG-16 (569M vs 15.3B), with only about a 1% top-1 accuracy drop on ImageNet. This single architectural change, one formula substitution, made real-time vision on smartphones feasible. MobileNetV2 adds inverted residuals and MobileNetV3 adds hard-swish activations for even better efficiency.
```python
def depthwise_sep_cost(c, k, h_out, w_out, c_out=None):
    """Depthwise separable conv: DW + PW."""
    if c_out is None:
        c_out = c
    # Depthwise: one k×k filter per channel
    params_dw = c * k * k
    macs_dw = c * k * k * h_out * w_out
    # Pointwise: 1×1 conv mixing channels
    params_pw = c_out * c
    macs_pw = c * c_out * h_out * w_out
    return params_dw + params_pw, macs_dw + macs_pw


def std_conv_cost(c_in, c_out, k, h_out, w_out):
    return c_out * c_in * k * k, c_in * c_out * k * k * h_out * w_out


c, k, h = 128, 3, 56
p_std, m_std = std_conv_cost(c, c, k, h, h)
p_ds, m_ds = depthwise_sep_cost(c, k, h, h)
print(f"Standard: params={p_std:,} MACs={m_std:,}")
print(f"DS: params={p_ds:,} MACs={m_ds:,}")
print(f"Reduction: {m_std/m_ds:.1f}× MACs {p_std/p_ds:.1f}× params")
# Standard: params=147,456 MACs=462,422,016
# DS: params=17,536 MACs=54,992,896
# Reduction: 8.4× MACs 8.4× params
```
Drag the sliders to set Cin, Cout, kernel size K, and output spatial size H. The canvas computes and compares parameters and MACs side-by-side for standard conv, grouped conv (g=4), and depthwise separable conv. Watch how the relative costs shift as you change the parameters.
The three architectures represent three points on the parameter-accuracy tradeoff frontier:
Pooling downsamples the spatial dimensions of a feature map. Unlike conv, it has no learnable parameters — it applies a fixed operation (max or average) over a K×K window, typically with stride = K so windows don't overlap. For an input of shape (C, H, W) and pool size K, the output is (C, H/K, W/K), reducing the spatial area by K².
Pooling is cheap — no weights to store or load — but it destroys spatial information irreversibly. A 2×2 max pool after a 256×56×56 feature map produces 256×28×28, cutting activation size by 4× and making all subsequent layers cheaper.
Global Average Pooling (GAP) is the modern alternative to large FC layers for classification. Instead of flattening the final feature map (e.g., 512×7×7 = 25,088 values → 25,088-dimensional FC), GAP computes the average of each channel's spatial map → one value per channel (512 values). The subsequent FC is then 512→1000 (512K params) rather than 25,088→1000 (25.1M params). GAP was popularized by NiN and used in all MobileNets. Cost: C additions per spatial position, no parameters. The spatial structure is lost but the channel-level summary is preserved.
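The GAP saving is a one-line calculation. A sketch, with `classifier_params` as a hypothetical helper name and biases ignored:

```python
def classifier_params(channels, height, width, num_classes, use_gap):
    """Weights in the final FC layer, with or without global average pooling first."""
    features = channels if use_gap else channels * height * width
    return features * num_classes


flat = classifier_params(512, 7, 7, 1000, use_gap=False)
gap = classifier_params(512, 7, 7, 1000, use_gap=True)
print(f"flatten + FC: {flat:,}   GAP + FC: {gap:,}   ({flat // gap}x fewer weights)")
```

The reduction factor is exactly the spatial area H×W of the final feature map, 49 for a 7×7 map.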
Strided convolution vs pooling: Rather than conv + pool, modern architectures often use a strided conv (stride=2) to downsample. A stride-2 3×3 conv produces the same output spatial size as conv + 2×2 pool, but the conv layer has learned parameters and can learn how to downsample rather than applying a fixed max/avg. ResNet uses stride-2 convolutions; VGG uses max pooling. The tradeoff: strided conv has more params but is potentially more expressive and eliminates a separate memory pass for the pool layer.
Every layer's output flows through an activation function — a non-linearity that gives deep networks their expressive power. Without activations, a 100-layer network would collapse to a single linear transformation. The cost: one additional operation per activation value, but zero learnable parameters for most activations.
ReLU (Rectified Linear Unit) is the default: y = max(0, x). One comparison per activation, no parameters, no exp() calls. For a 128×56×56 feature map, that's 401,408 comparisons — trivial. ReLU6 (y = min(max(0,x), 6)) is used in MobileNet to prevent very large activations, which aids fixed-point quantization (values stay in [0,6]).
Swish (y = x/(1+e^(−x))) and GELU (used in Transformers) are more expensive: each needs an exp() or a similar transcendental evaluation, roughly 4–8× the cost of ReLU. Hard Swish (y = x·ReLU6(x+3)/6, which is 0 for x ≤ −3, x·(x+3)/6 for −3 < x < 3, and x for x ≥ 3) approximates Swish with only an add, a clamp, and multiplies, with no exp(). It is used in MobileNetV3 for hardware efficiency.
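A quick numerical comparison shows how closely Hard Swish tracks Swish. This is a pure-Python sketch for scalar inputs; the evaluation grid on [−8, 8] is an arbitrary choice made here, not something from the lesson.

```python
import math


def swish(x):
    """x * sigmoid(x): needs one exp() per activation."""
    return x / (1.0 + math.exp(-x))


def relu6(x):
    return min(max(0.0, x), 6.0)


def hard_swish(x):
    """x * ReLU6(x + 3) / 6: an add, a clamp, a multiply, and a scale by 1/6."""
    return x * relu6(x + 3.0) / 6.0


grid = [i / 100 for i in range(-800, 801)]
max_err = max(abs(swish(x) - hard_swish(x)) for x in grid)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  hard_swish={hard_swish(x):+.4f}")
print(f"max |swish - hard_swish| on [-8, 8]: {max_err:.3f}")
```

The largest gap sits at the clamp boundaries x = ±3, where the piecewise function switches branches.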
Residual connections (introduced in ResNet, He et al. 2016) add the input of a block directly to its output: y = F(x) + x. The function F is typically two or three conv layers. This lets the network learn residuals — small corrections to the identity — rather than full transformations, making deep networks trainable.
Cost of residual connections: zero additional parameters and zero additional MACs (just an element-wise add). But they require that input and output have the same shape. When the spatial size or channel count changes (e.g., stride-2 downsampling), a 1×1 conv projection shortcut is used to match dimensions — adding params = Cout×Cin and MACs = Cout×Cin×Hout×Wout.
Projection shortcut worked example (ResNet-50-style bottleneck block): the block expands 64→256 channels and downsamples at stride 2 (28×28 output), so input and output shapes differ. A 1×1 projection conv is needed: Cin=64, Cout=256, K=1, H=28, W=28. Params = 256×64 = 16,384; MACs = 256×64×28×28 = 12,845,056 ≈ 12.8M.
For comparison, the 3×3 conv in the same block (64→64): MACs = 64×64×9×28×28 = 28,901,376 ≈ 29M. The projection shortcut costs 44% as much as that 3×3 conv, so it is not free. But it only occurs once per stage (4 times in ResNet-50), so its aggregate cost is small relative to the full network's 4B MACs.
Batch Normalization normalizes the activations within a mini-batch to have zero mean and unit variance, then applies a learned scale γ and shift β per channel. During training for a conv feature map of shape (N, C, H, W), BN computes the mean and variance across (N, H, W) for each of the C channels: μ_c = mean of x over (N, H, W), σ_c = √(variance of x over (N, H, W) + ε), and y = γ_c·(x − μ_c)/σ_c + β_c.
This adds 2C trainable parameters (γ and β, one scalar per channel) — tiny. But at inference time, μ and σ are fixed (running statistics from training). The operation becomes a simple linear transform: multiply by (γ/σ) and add (β − γμ/σ). And a linear transform after a convolution is itself a convolution — so BatchNorm can be folded into the preceding conv layer.
LayerNorm (used in Transformers) normalizes across the feature dimension (d) for each token independently, rather than across the batch. For a token vector of size d: 2d parameters (γ, β), computed per token. Unlike BatchNorm, LayerNorm statistics change per-sample, so it cannot be folded the same way. Its inference cost is about 5 ops per element (subtract mean, divide by std, scale, shift, plus the mean/var computation). For a Transformer with N=512 tokens and d=768: one LayerNorm costs 512×768×5 ≈ 2M ops, or about 4M for the two in a block, which is negligible compared to the FFN's 2.4B MACs.
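After training, the conv + BN sequence is: y = γ·((W∗x + b) − μ)/σ + β, with per-output-channel γ, β, μ, and σ.
Rearranging into a single convolution with new weights W' and bias b': W' = (γ/σ)·W and b' = (γ/σ)·(b − μ) + β, so that y = W'∗x + b'.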
The result: zero additional MACs at inference. The BN layer simply disappears into the conv's weights. You save both the normalization arithmetic AND the memory reads for γ, β, μ, σ. Most inference engines (TensorRT, ONNX Runtime, TFLite) do this automatically as a graph optimization pass.
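The fold can be verified numerically on a toy layer. The sketch below is pure Python and evaluates the conv at a single spatial position (which is all that is needed, since the fold is per output channel). The weights and statistics are made-up numbers, the function names are hypothetical, and σ is taken as √(var + ε) with ε = 1e-5.

```python
import math


def conv_at_position(weights, bias, patch):
    """Conv output channels at one position; `patch` is the flattened input window."""
    return [sum(w * x for w, x in zip(row, patch)) + b
            for row, b in zip(weights, bias)]


def batchnorm_inference(values, gamma, beta, mean, var, eps=1e-5):
    """Per-channel BN using fixed running statistics."""
    return [g * (v - mu) / math.sqrt(s2 + eps) + bt
            for v, g, bt, mu, s2 in zip(values, gamma, beta, mean, var)]


def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (W', b') such that conv(W', b') equals BN(conv(W, b))."""
    folded_w, folded_b = [], []
    for row, b, g, bt, mu, s2 in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(s2 + eps)
        folded_w.append([scale * w for w in row])
        folded_b.append(scale * (b - mu) + bt)
    return folded_w, folded_b


# Toy layer: 2 output channels, 3 inputs per window (illustrative values only)
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]
patch = [1.0, 2.0, -1.0]

reference = batchnorm_inference(conv_at_position(W, b, patch), gamma, beta, mean, var)
W_f, b_f = fold_bn(W, b, gamma, beta, mean, var)
folded = conv_at_position(W_f, b_f, patch)
print("conv then BN:", reference)
print("folded conv :", folded)
```

The two outputs agree up to floating-point rounding, and the folded layer has exactly the same shape and MAC count as the original conv.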
Toggle to see the Conv+BN pipeline (training) vs the folded-BN single conv (inference). Green checkmarks = operations eliminated.
The Transformer block has two sub-components: Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (FFN), each preceded by LayerNorm. For a sequence of N tokens with model dimension d (e.g. BERT-base: N=512, d=768), let's derive where every MAC and every parameter lives.
Three Linear projections map input X (shape N×d) to Queries, Keys, and Values, each of shape N×d. Three weight matrices, each d×d: Params_QKV = 3·d²; MACs_QKV = 3·N·d².
For BERT-base (d=768, N=512): ParamsQKV = 3×768²=1,769,472; MACs = 3×512×768²=905,969,664 ≈ 906M per block.
Attention scores are Q × Kᵀ (shape N×N), requiring N² × d multiply-accumulates. Then comes a softmax over each row of N scores, followed by the weighted sum with V (another N² × d MACs): MACs_attn = 2·N²·d, with no parameters.
For N=512, d=768: MACsattn = 2×512²×768 = 402,653,184 ≈ 403M. Notice: these MACs scale as N² — double the sequence length, quadruple the attention cost. This is the quadratic scaling problem of standard Transformers.
Another d×d linear map merges all attention heads: Params = d², MACs = N×d². For BERT-base: Params = 589,824; MACs = 512×768² = 301,989,888 ≈ 302M.
Combining the attention-related components (QKV projections, attention, and output projection) for the full multi-head attention sub-layer with d=768, N=512: Params = 4·d²; MACs = 4·N·d² + 2·N²·d.
At N=512, d=768: MACs = 4×512×768² + 2×512²×768 = 1,207,959,552 + 402,653,184 = 1,610,612,736 ≈ 1.6B MACs.
Two linear layers: d → 4d → d (the 4× expansion is standard), with a nonlinearity in between (GELU in BERT): Params_FFN = 2·d·4d = 8·d²; MACs_FFN = 8·N·d².
For BERT-base (d=768, N=512): ParamsFFN = 8×768² = 4,718,592 ≈ 4.7M; MACsFFN = 8×512×768² = 2,415,919,104 ≈ 2.4B per block.
Multi-head attention (MHA) vs single-head: Standard Transformers split attention into h heads (h=12 for BERT-base). Each head uses dimension d/h=64. The QKV weight matrices are still each d×d (just logically partitioned into h blocks). So multi-head attention has the same parameter count and MACs as single-head — the split is conceptual, not a cost change. The benefit is multiple attention patterns (different heads learn different relationships) without additional compute.
Drag the sequence length slider and watch attention MACs (∝ N²) overtake FFN MACs (∝ N). Model dimension d is fixed at 768.
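The same sweep can be run offline. This sketch computes weights-only params and MACs per component from the formulas in this chapter; `transformer_block_costs` is a hypothetical helper name, and biases, LayerNorm, and softmax are ignored.

```python
def transformer_block_costs(n_tokens, d_model, ffn_mult=4):
    """Weights-only params and MACs per component of one Transformer block."""
    n, d = n_tokens, d_model
    params = {
        "qkv": 3 * d * d,
        "attention": 0,
        "out_proj": d * d,
        "ffn": 2 * ffn_mult * d * d,
    }
    macs = {
        "qkv": 3 * n * d * d,
        "attention": 2 * n * n * d,  # Q·K^T plus the weighted sum with V
        "out_proj": n * d * d,
        "ffn": 2 * ffn_mult * n * d * d,
    }
    return params, macs


for n in (128, 512, 2048, 3072, 8192):
    _, macs = transformer_block_costs(n, 768)
    total = sum(macs.values())
    print(f"N={n:<5} attention {100 * macs['attention'] / total:4.1f}%   "
          f"FFN {100 * macs['ffn'] / total:4.1f}%")
```

Under these formulas, attention MACs (2·N²·d) equal FFN MACs (8·N·d²) at N = 4d, which is N = 3072 for d = 768.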
Formulas are one thing; seeing them in a real architecture is another. Let's profile a simplified CNN (ResNet-style) and a Transformer block side-by-side, computing params, MACs, and activation memory per layer. The pattern that emerges is the foundation of every efficiency decision in the rest of this course.
Consider a simplified CNN: Input 3×224×224 → Conv 3→64, 7×7, stride 4 → Conv 64→128, 3×3 → Conv 128→256, 3×3, stride 2 → Conv 256→512, 3×3, stride 2 → GlobalAvgPool → FC 512→1000. Spatial sizes shrink as we go deeper.
The output spatial size at each layer: 224→56 (stride 4), 56→56, 56→28 (via stride 2 in Conv3), 28→14, 14→1 (GlobalAvgPool). Channel counts grow to compensate: 3→64→128→256→512. This is the canonical ResNet-style "halve spatial, double channels" design.
| Layer | Output shape | Params | MACs | Act. size (FP16) |
|---|---|---|---|---|
| Conv1 7×7 3→64 s4 | 64×56×56 | 9,408 | 29.5M | 401 KB |
| Conv2 3×3 64→128 | 128×56×56 | 73,728 | 231M | 802 KB |
| Conv3 3×3 128→256 s2 | 256×28×28 | 294,912 | 231M | 401 KB |
| Conv4 3×3 256→512 s2 | 512×14×14 | 1,179,648 | 231M | 201 KB |
| FC 512→1000 | 1000 | 512,000 | 512K | 2 KB |
The pattern is clear: early layers — huge activation maps (401–802 KB), few params (9K–74K), big spatial MACs. Late conv layers — many params (1.2M for Conv4), smaller spatial maps (14×14), same total MACs (spatial shrinks as channels expand). FC layer — half-million params but trivial MACs (no spatial dimension).
Now consider the activation memory peak. The first conv takes a 3×224×224 input (301 KB at FP16) and produces a 64×56×56 output (401 KB). During computation, both must be live simultaneously: that's 702 KB just for that one layer. On a 256 KB MCU, this is impossible without a patch-based (tiled) inference approach. This is exactly the MCUNet problem: MobileNetV2 reduces parameters dramatically but its peak activation memory is still >1 MB.
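The input-plus-output working set can be computed for every layer of the simplified CNN, not just the first. The sketch below assumes a plain sequential network in which only one layer's input and output are live at a time (no skip connections, no in-place tricks), FP16 activations, and 1 KB = 1000 bytes; the function names are hypothetical.

```python
def tensor_bytes(shape, bytes_per_elem=2):
    """Bytes for a (C, H, W) activation tensor."""
    c, h, w = shape
    return c * h * w * bytes_per_elem


def per_layer_working_set(shapes, bytes_per_elem=2):
    """Input + output bytes for each layer of a plain sequential network.

    shapes[0] is the network input; shapes[i] is the output of layer i.
    """
    sizes = [tensor_bytes(s, bytes_per_elem) for s in shapes]
    return [a + b for a, b in zip(sizes, sizes[1:])]


# Input followed by the outputs of Conv1..Conv4 from the table above
shapes = [(3, 224, 224), (64, 56, 56), (128, 56, 56), (256, 28, 28), (512, 14, 14)]
working_sets = per_layer_working_set(shapes)

for i, ws in enumerate(working_sets, start=1):
    print(f"Conv{i}: {ws / 1000:.0f} KB")
print(f"Peak: {max(working_sets) / 1000:.0f} KB")
```

Under these assumptions the peak is not at Conv1 but at Conv2 and Conv3, where a 128×56×56 tensor is one side of the working set. Either way, the early high-resolution layers set the SRAM requirement.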
A BERT-base Transformer has 12 identical blocks. Each block's cost (N=512):
| Component | Params | MACs (N=512) | % of block MACs |
|---|---|---|---|
| QKV projections (3×d²) | 1,769,472 | 906M | 22% |
| Attention (2N²d) | 0 | 403M | 10% |
| Output projection (d²) | 589,824 | 302M | 7% |
| FFN total (8d²) | 4,718,592 | 2,416M | 60% |
| LayerNorm (×2, 2d params) | 3,072 | ~4M ops | <1% |
Key takeaway: FFN holds 67% of block parameters and 60% of MACs at N=512. Attention holds 0% of parameters but grows to dominate MACs as N increases. This is why GPT-style compression research typically targets the FFN (weight sparsity, MoE routing) for params, and FlashAttention, sparse attention, or linear attention for long-context cost.
For the CNN side, compare full models. ResNet-50: 25M params, 4B MACs, 8 MB activation memory at inference (FP32). MobileNetV2 at ~72% top-1 accuracy: 3.4M params, 300M MACs, which is about 13× fewer MACs and 7× fewer params. The accuracy drop is only ~4 points. This is the efficiency-accuracy frontier: depthwise separable convolutions buy roughly 8–9× per layer on the MAC axis, and a narrower overall design extends that to about 13× for the full model, at a small accuracy cost. Later techniques (NAS, knowledge distillation) push further.
Click a metric to highlight it. Notice how params peak late, activations peak early, and MACs are distributed across all conv layers.
This insight, that MACs stay roughly constant across stages when channels double and spatial size halves, is why "stage-wise" network design is so common. ResNet, EfficientNet, and MobileNet all use it. It means the network's total compute is distributed roughly evenly across all stages, rather than concentrated in one place. For efficiency engineers, the consequence is that no single stage dominates compute: cutting MACs substantially means touching every stage, even though params remain concentrated in late layers and activation memory in early ones.
You now have the measurement vocabulary for every layer type in modern deep learning. Here is the complete reference — every formula, plugged with real numbers, and mapped to the efficiency techniques that target each cost type.
| Layer | Params | MACs | Example (numbers) |
|---|---|---|---|
| Linear Cin→Cout | Cin·Cout + Cout | Cin·Cout | 768→768: 590K params, 590K MACs |
| Conv2d Cin→Cout, K×K, H×W out | Cout·Cin·K² + Cout | Cin·Cout·K²·H·W | 64→128, 3×3, 56²: 74K params, 231M MACs |
| Grouped Conv g groups | Cout·Cin·K²/g | Cin·Cout·K²·H·W/g | g=4: 4× fewer params and MACs |
| Depthwise Conv C channels, K×K | C·K² | C·K²·H·W | C=128, K=3, 56²: 1,152 params, 3.6M MACs |
| Pointwise Conv C→Cout | Cout·C | C·Cout·H·W | 128→128, 56²: 16K params, 51M MACs |
| Depthwise Sep (DW+PW) | C·K² + Cout·C | C·K²·HW + C·Cout·HW | Reduction ≈ 1/K²+1/C vs standard |
| Pooling K×K, stride K | 0 | C·(H/K)·(W/K)·K² | No params; MACs ~ spatial comparisons |
| BatchNorm C channels | 2C (γ,β) | ≈4C·H·W train; 0 inference (folded) | 256 ch: 512 params; folded at inference |
| Transformer QKV d-dim | 3·d² | 3·N·d² | d=768, N=512: 1.8M params, 906M MACs |
| Transformer Attn N×d | 0 (no weights) | 2·N²·d | N=512, d=768: 0 params, 403M MACs |
| Transformer FFN d→4d→d | 8·d² | 8·N·d² | d=768, N=512: 4.7M params, 2.4B MACs |
| Cost type | Technique (upcoming lessons) | Target layer | Lesson |
|---|---|---|---|
| Parameter count / model size | Pruning, Quantization, Low-rank decomposition | Late conv, FC, FFN | TinyML L4–L6 |
| MAC count (compute) | Depthwise sep conv, NAS, operator fusion | Mid-network wide convs | TinyML L3, L8 |
| Activation memory (peak) | Gradient checkpointing, in-place ops, TinyTL, MCUNet tiling | Early layers (large H×W) | TinyML L10 |
| N² attention cost (long context) | FlashAttention, linear attention, sparse attn, GQA | Transformer attention block | TinyML L12 |
| Inference memory bandwidth | Quantization (INT8/INT4/FP8), weight compression | All weight-loading ops | TinyML L5–L6 |
| Energy / power budget | Mixed-precision, hardware-aware NAS, co-design | Full model | TinyML L9, L11 |
Each row of this table traces back to a formula you now know. Pruning targets params: you know exactly how many each layer has. Quantization targets model size (params × bitwidth): you can compute the MB savings. Depthwise separation targets MACs: you derived the 8–9× reduction. MCUNet targets peak activation memory: you know which layers dominate (early, large H×W). FlashAttention targets the N×N attention matrix behind the 2·N²·d term, avoiding the memory traffic of materializing it: you derived why that term grows quadratically. Every efficient ML paper starts with a cost measurement, using exactly these formulas.
```python
"""Per-layer profiler: compute params, MACs, activation size for a model spec."""


def profile_model(layers):
    """layers: list of dicts with type and dimension specs."""
    total_params, total_macs, peak_act = 0, 0, 0
    print(f"{'Layer':<30} {'Params':>10} {'MACs':>12} {'Act(KB)':>8}")
    print("-" * 62)
    for layer in layers:
        t = layer['type']
        if t == 'linear':
            p = layer['ci'] * layer['co'] + layer['co']
            m = layer['ci'] * layer['co']
            a = layer['co'] * 2  # output activation, FP16 bytes
        elif t == 'conv':
            p = layer['co'] * layer['ci'] * layer['k']**2 + layer['co']
            m = layer['ci'] * layer['co'] * layer['k']**2 * layer['h'] * layer['w']
            a = layer['co'] * layer['h'] * layer['w'] * 2
        elif t == 'dw_sep':
            p = layer['c'] * layer['k']**2 + layer['co'] * layer['c']
            m = (layer['c'] * layer['k']**2 + layer['c'] * layer['co']) * layer['h'] * layer['w']
            a = layer['co'] * layer['h'] * layer['w'] * 2
        else:
            raise ValueError(f"unknown layer type: {t!r}")
        total_params += p
        total_macs += m
        peak_act = max(peak_act, a)  # largest single output tensor
        print(f"{layer['name']:<30} {p:>10,} {m:>12,} {a/1024:>7.1f}")
    print(f"\nTotal params: {total_params:,} Total MACs: {total_macs:,} "
          f"Peak act: {peak_act/1024:.1f} KB")


# Example: simplified MobileNet-style backbone
profile_model([
    {'name': 'Conv1 3→32', 'type': 'conv', 'ci': 3, 'co': 32, 'k': 3, 'h': 112, 'w': 112},
    {'name': 'DWSep 32→64', 'type': 'dw_sep', 'c': 32, 'co': 64, 'k': 3, 'h': 56, 'w': 56},
    {'name': 'DWSep 64→128', 'type': 'dw_sep', 'c': 64, 'co': 128, 'k': 3, 'h': 28, 'w': 28},
    {'name': 'DWSep 128→256', 'type': 'dw_sep', 'c': 128, 'co': 256, 'k': 3, 'h': 14, 'w': 14},
    {'name': 'FC 256→1000', 'type': 'linear', 'ci': 256, 'co': 1000},
])
```
```python
"""Per-layer profiler with PyTorch hook-based MAC counting.
Works on any nn.Module: registers forward hooks on Linear and Conv2d layers."""
import torch
import torch.nn as nn


def count_layer_macs(module, inputs, output):
    """Forward hook that accumulates this call's MAC count in module._macs."""
    x = inputs[0]
    if isinstance(module, nn.Linear):
        # MACs = (batch × any extra leading dims) × C_in × C_out
        macs = x.numel() * module.out_features
    elif isinstance(module, nn.Conv2d):
        # MACs = N × (C_in / groups) × C_out × Kh × Kw × H_out × W_out
        n, c_out, h_out, w_out = output.shape
        kh, kw = module.kernel_size
        macs = n * (module.in_channels // module.groups) * c_out * kh * kw * h_out * w_out
    else:
        macs = 0
    module._macs = getattr(module, '_macs', 0) + macs


def profile_model_torch(model, input_shape=(1, 3, 224, 224)):
    """Profile a PyTorch model: params and MACs per Linear/Conv2d layer."""
    hooks = []
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            m._macs = 0  # reset so repeated profiling runs do not accumulate
            hooks.append(m.register_forward_hook(count_layer_macs))
    with torch.no_grad():
        model(torch.zeros(*input_shape))
    for h in hooks:
        h.remove()

    print(f"{'Layer':<35} {'Params':>10} {'MACs':>12}")
    print("-" * 60)
    total_p, total_m = 0, 0
    for name, m in model.named_modules():
        if hasattr(m, '_macs'):
            p = sum(param.numel() for param in m.parameters())
            total_p += p
            total_m += m._macs
            print(f"{name:<35} {p:>10,} {m._macs:>12,}")
    print(f"\nTotal: {total_p:,} params {total_m:,} MACs")


# Example: profile a MobileNet-style backbone
class DWSep(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin, bias=False)
        self.pw = nn.Conv2d(cin, cout, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))


backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    DWSep(32, 64),
    DWSep(64, 128, stride=2),
    DWSep(128, 256, stride=2),
    DWSep(256, 512, stride=2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 1000),
)
profile_model_torch(backbone)
```
BERT-large total parameters (ignoring embeddings): per block = 4d² (attention projections) + 8d² (FFN) = 12d². Times 24 blocks: 24×12×1024² = 301,989,888 ≈ 302M. Plus embeddings (30,522 vocab × 1024 = 31M). Total ≈ 333M — matches the published 340M (small discrepancy from biases and LN).
For comparison, GPT-2 medium (d=1024, 24 layers, same block structure as BERT-large but causal) has a similar parameter count (roughly 350M, with its larger vocabulary) and similar MACs per forward pass. The difference is inference mode: BERT processes all N tokens in parallel (one pass), while GPT-2 generates autoregressively (one token at a time, with the KV cache). At generation time, GPT-2's batch=1 cost per token is dominated by loading the weights from DRAM (AI ≈ 0.5 MACs/byte at FP16), making it memory-bandwidth-bound rather than compute-bound. Same formulas, completely different bottleneck.
BERT-large total MACs (N=512): per block = (4Nd² + 2N²d) + 8Nd² = 12Nd² + 2N²d. Block MACs = 12×512×1024² + 2×512²×1024 = 6,442,450,944 + 536,870,912 = 6,979,321,856 ≈ 7B. Times 24 blocks = 167.5B MACs for one forward pass at N=512. This is why BERT is expensive to run: 167.5B MACs × 2 FLOPs/MAC ≈ 335 GFLOPs, vs about 8 GFLOPs (4B MACs) for ResNet-50. LLM inference at long contexts is even more expensive, driven by the N² term growing with context window.
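Both BERT-large totals follow from two per-block formulas. A minimal sketch, with `encoder_totals` as a hypothetical helper name; it counts weights only and leaves out embeddings, biases, and LayerNorm.

```python
def encoder_totals(num_blocks, n_tokens, d_model):
    """Weights-only params and MACs for a stack of identical Transformer blocks."""
    params_per_block = 12 * d_model ** 2
    macs_per_block = 12 * n_tokens * d_model ** 2 + 2 * n_tokens ** 2 * d_model
    return num_blocks * params_per_block, num_blocks * macs_per_block


params, macs = encoder_totals(num_blocks=24, n_tokens=512, d_model=1024)
print(f"params: {params:,}")
print(f"MACs:   {macs:,}  (~{2 * macs / 1e9:.0f} GFLOPs)")
```

Changing `n_tokens` shows how quickly the 2·N²·d term takes over at long context.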
When evaluating a new architecture or a compressed model, run this three-step mental audit:
The efficiency techniques you'll learn in subsequent lessons map directly to this table. Structured pruning (L3) removes entire filters from Conv/Linear layers, directly reducing the Cout dimension and shrinking both params and MACs. Quantization (L5) reduces bits per weight, shrinking model size without touching the MAC formula, while cutting the bytes loaded per weight and enabling INT8 SIMD parallelism that can give up to ~4× throughput over FP32. Knowledge distillation (L7) trains a smaller student to mimic a large teacher; for example, shrinking the width from BERT-large's d=1024 to d=256 at equal depth gives a 16× reduction in per-block params and in the projection and FFN MACs (both scale with d²). Neural Architecture Search (L8) automates the tradeoff, searching the space of Cin, Cout, K, and layer count to minimize MACs subject to an accuracy constraint.
One more mental model: arithmetic intensity (AI) = MACs / bytes loaded. For a Linear layer (batch=1, FP16 weights): AI = Cin·Cout MACs / (Cin·Cout·2 bytes) = 0.5 MACs/byte (1 FLOP/byte) → deeply memory-bound. For a 3×3 Conv 64→128 on 56×56, counting weight bytes only: AI = 73,728×3,136 MACs / (73,728×2 bytes) = 1,568 MACs/byte; even with the FP16 input and output activations included it is about 171 MACs/byte → compute-bound on any modern GPU. This is why convolutions run efficiently on GPUs (high AI) but LLM autoregressive decoding does not (batch=1 Linear → AI = 0.5 MACs/byte).
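Arithmetic intensity is simple enough to compute for any layer spec. The sketch below assumes FP16 (2 bytes per value), that each weight and activation is moved exactly once, and a stride-1, same-padded conv so input and output share the same H×W. The function names are hypothetical.

```python
BYTES_PER_VALUE = 2  # FP16


def arithmetic_intensity(macs, bytes_moved):
    """MACs performed per byte moved from memory."""
    return macs / bytes_moved


def linear_ai(c_in, c_out, batch=1):
    """Weights-only intensity: each weight is loaded once and used `batch` times."""
    return arithmetic_intensity(batch * c_in * c_out, c_in * c_out * BYTES_PER_VALUE)


def conv_ai(c_in, c_out, k, h, w, count_activations=False):
    """Intensity of a stride-1, same-padded KxK conv (input and output both h x w)."""
    macs = c_in * c_out * k * k * h * w
    bytes_moved = c_in * c_out * k * k * BYTES_PER_VALUE
    if count_activations:
        bytes_moved += (c_in + c_out) * h * w * BYTES_PER_VALUE
    return arithmetic_intensity(macs, bytes_moved)


print(linear_ai(4096, 4096))            # 0.5
print(linear_ai(4096, 4096, batch=64))  # 32.0
print(conv_ai(64, 128, 3, 56, 56))      # 1568.0
print(round(conv_ai(64, 128, 3, 56, 56, count_activations=True), 2))
```

Batching is the lever for Linear layers: every extra sample reuses the same loaded weights, so intensity grows linearly with batch size.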
"What I cannot create, I do not understand." — Richard Feynman. If you can compute the parameter count and MAC cost of any layer from its shape alone, you understand that layer. Everything else in efficient deep learning is a consequence of those two numbers.