TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 2

NN Building Blocks & Where Compute Lives

You have a convolutional net that's accurate — but it won't fit on your phone. To shrink it, you must first know WHERE its cost is. This lesson derives exact parameter and MAC formulas for every major layer type (Linear, Conv2d, grouped conv, depthwise separable), shows the MobileNet FLOP reduction trick from first principles, covers BatchNorm folding, the Transformer cost breakdown, and builds five interactive tools so you can measure any architecture before you touch a single weight. MIT 6.5940 by Song Han.

Prerequisites: TinyML L1 (MACs, params, activation memory) or equivalent. No new math required beyond multiplication.
10 Chapters · 5 Live Canvases · Derived From First Principles

Chapter 0: You Can't Optimize What You Can't Measure

You're handed a ResNet-50. It scores 76% top-1 on ImageNet — solid accuracy. Your target device is a Cortex-M4 microcontroller with 512 KB of SRAM and a requirement to run inference in under 100 ms. You open up the model. There are 25 million parameters. The model at FP32 weighs 100 MB. The chip has 0.5 MB. Something has to give — but what?

The naive answer is "compress everything equally." But that's like telling a plumber to replace all the pipes when only one is leaking. Different layers have radically different cost profiles. The first conv layer in ResNet-50 has barely any parameters — but it processes the full 224×224 input and accounts for a massive fraction of activation memory. The final FC layer has millions of parameters — but processes a single vector, so MACs are low. Compress them the same way and you'll waste effort on cheap layers while leaving expensive ones untouched.

This lesson gives you the X-ray vision to see inside any architecture. By the end you'll be able to look at a layer spec — "3×3 Conv, C_in=64, C_out=128, output 56×56" — and instantly compute its parameter count, MAC count, activation size, and arithmetic intensity. That measurement is step zero before any compression technique in this course.

The key asymmetry you'll discover: In a CNN, early layers have enormous activation maps and modest parameter counts; late layers have huge parameter counts and tiny activation maps. In a Transformer, the FFN holds most parameters but attention dominates compute at long sequences. Every efficiency technique targets a specific cost type — pruning targets params, quantization targets model size and memory bandwidth, knowledge distillation targets accuracy/param tradeoff, and operator fusion targets activation memory. Know the cost; pick the right tool.

We'll cover every major layer: Linear (fully-connected), Conv2d, grouped convolution, depthwise separable convolution (the MobileNet trick), pooling, BatchNorm (and how to fold it into Conv at inference), activation functions, residual/skip connections, and the Transformer block (QKV projections, attention, FFN). Every formula will be derived, not memorized, and plugged with real numbers.

The formulas are few and simple. The insight is learning which term matters for each hardware target. On a server GPU: MACs are the bottleneck for batch inference. On a mobile CPU: memory bandwidth is the bottleneck for single-sample inference. On a microcontroller: SRAM capacity (activation memory) is the bottleneck, often tighter than either MACs or parameter storage. The same architecture can be optimal in one setting and completely infeasible in another — not because it changed, but because the bottleneck shifted.

By the end of this lesson you will know: (1) every layer type's exact formula; (2) where params, MACs, and activations peak in a CNN and a Transformer; (3) what each efficiency technique targets; (4) how to use Python hooks or manual calculation to profile any model. These are the skills every efficient ML practitioner needs before touching compression code.

Here is a map of what you'll know after each chapter:

Ch 1, Linear: Params = C_in·C_out; MACs same; reuse = 1×; quantization target
Ch 2, Conv2d: Params = C_out·C_in·K²; MACs adds H·W spatial reuse; AlexNet full breakdown
Ch 3, Receptive Field: RF = L·(k−1)+1; stack 3×3s for efficiency; residual memory cost
Ch 4, Depthwise Separable: Ratio = 1/K² + 1/C; MobileNet 8–9× reduction; memory-bandwidth caveat
Ch 6–7, Normalization + Transformer: BN folding → 0 inference cost; FFN params vs attention N²·d MACs
Ch 8–9, Real networks: CNN/Transformer per-layer cost maps; compression technique targeting

A ResNet-50's first conv layer (3→64 channels, 7×7 kernel, output 112×112) has very few parameters but huge activation memory. Its last FC layer (2048→1000) has millions of parameters but tiny activation memory. Which compression technique is most naturally matched to shrinking the last FC layer's weight storage?

Chapter 1: Linear Layer: Params & MACs

The Linear layer (also called fully-connected, or FC) is the simplest building block. Every output neuron connects to every input neuron. If the input has Cin features and the output has Cout features, then every output yj is computed as:

yj = ∑i Wji · xi + bj

The weight tensor W has shape (Cout, Cin). The bias b has shape (Cout,). Count the elements:

Parameters = Cin · Cout + Cout = Cout(Cin + 1)

For the MAC count, each output neuron requires Cin multiply-accumulate operations (one per input × weight pair), and there are Cout output neurons:

MACs = Cin · Cout   (bias adds are negligible)

Concrete example — BERT's final projection: Cin=768, Cout=768.

Params = 768 × 768 + 768 = 590,592 ≈ 590K    MACs = 768 × 768 = 589,824

Notice params ≈ MACs for a Linear layer — one MAC per parameter, because each weight is used exactly once per forward pass (no spatial reuse). This is the key difference from Conv, where each weight gets reused across spatial positions.

AlexNet's last two FC layers (4096→4096 and 4096→1000) together hold 4096² + 4096×1000 = 20,873,216 params, 34% of AlexNet's total 61M, yet do only 20.9M MACs, 2.9% of AlexNet's 724M total. Params are expensive on storage (83.5 MB at FP32, 4 bytes each) but cheap on compute. Quantizing the weights 4× from FP32 to INT8 shrinks them to 20.9 MB, saving about 62.6 MB while leaving the MAC count unchanged. This is why quantization disproportionately benefits models with large FC layers.

The reuse gap. A Linear layer has arithmetic intensity ≈ 1 MAC/param. A Conv layer with 56×56 output has 56×56 = 3,136 MACs/param. This is why convolution is so much more compute-efficient than FC for spatial data — the same weight serves thousands of positions.
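
The Linear-layer formulas above can be checked with a few lines of Python. This is a minimal sketch: the helper names `linear_cost` and `weight_bytes` are illustrative and not part of the lecture, and storage is counted for weights only at 1 MB = 10⁶ bytes.

```python
def linear_cost(c_in, c_out, bias=True):
    """Return (params, macs) for a fully-connected layer on one input vector."""
    params = c_in * c_out + (c_out if bias else 0)
    macs = c_in * c_out  # one multiply-accumulate per weight
    return params, macs


def weight_bytes(params, bits):
    """Storage in bytes for `params` values at the given bit width."""
    return params * bits // 8


p, m = linear_cost(768, 768)
print(f"768->768: params={p:,}  MACs={m:,}")

# AlexNet's last two FC layers, weights only (bias ignored)
fc_weights = (linear_cost(4096, 4096, bias=False)[0]
              + linear_cost(4096, 1000, bias=False)[0])
fp32 = weight_bytes(fc_weights, 32)
int8 = weight_bytes(fc_weights, 8)
print(f"FC weights: {fc_weights:,}")
print(f"FP32: {fp32 / 1e6:.1f} MB  INT8: {int8 / 1e6:.1f} MB  "
      f"saved: {(fp32 - int8) / 1e6:.1f} MB")
```

The MAC count is the same before and after quantization; only the bytes per weight change.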

[Interactive canvas: Linear layer parameter and MAC counter. Sliders set C_in and C_out (both default to 768); the canvas shows the weight matrix shape and live counts.]

A Linear layer maps 2048 inputs to 1000 outputs (like AlexNet's final classifier). Ignoring bias, how many parameters does it have?

Chapter 2: Conv2d: The Workhorse

A 2D convolution layer takes an input feature map of shape (Cin, Hin, Win) and produces an output of shape (Cout, Hout, Wout). The kernel has spatial size Kh × Kw. There is one kernel per (input channel, output channel) pair — so the weight tensor has shape (Cout, Cin, Kh, Kw).

Count the parameters: Cout output channels, each needing a kernel of size Cin × Kh × Kw, plus one bias per output channel:

Params = Cout · Cin · Kh · Kw + Cout

For MACs: each of the Hout × Wout output positions requires computing Cout output values, each needing Cin × Kh × Kw multiply-accumulates:

MACs = Cin · Cout · Kh · Kw · Hout · Wout

The crucial insight: params don't depend on spatial size, but MACs do. The same weight tensor is reused at every output position. A layer with output spatial Hout×Wout = 56×56 = 3,136 performs 3,136× more MACs than a layer with output 1×1 — with identical parameter count.

Output spatial size formula

Given input height Hin, kernel Kh, padding p, and stride s, the output height is:

Hout = ⌊(Hin + 2p − Kh) / s⌋ + 1

Common cases: "same" padding (output = input size): p = (K−1)/2 with s=1. "Valid" padding (no padding, p=0): Hout = Hin − K + 1. Stride-2 halving: p=1, s=2, K=3 → Hout = Hin/2. For the AlexNet first conv: Hin=224, K=11, p=2, s=4 → Hout = ⌊(224+4−11)/4⌋+1 = ⌊217/4⌋+1 = 54+1 = 55. ✓

This formula matters for counting MACs: if you miscompute Hout, your MAC count is wrong. The padding and stride choices also affect how many times each input pixel contributes to outputs — relevant for understanding why strided conv reduces activations without reducing params.
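
The output-size formula is easy to get wrong by one, so it helps to encode it once. A small sketch, assuming square inputs and symmetric padding; `conv_out_size` is an illustrative name, not a library function.

```python
def conv_out_size(h_in, k, padding=0, stride=1):
    """Output height (or width) of a conv or pool layer: floor((H + 2p - K) / s) + 1."""
    return (h_in + 2 * padding - k) // stride + 1


print(conv_out_size(224, 11, padding=2, stride=4))  # AlexNet Conv1 -> 55
print(conv_out_size(56, 3, padding=1, stride=1))    # "same" padding -> 56
print(conv_out_size(56, 3, padding=1, stride=2))    # stride-2 halving -> 28
print(conv_out_size(55, 3, padding=0, stride=2))    # 3x3 max pool, stride 2 -> 27
```

The same function covers pooling layers, since they follow the same sliding-window geometry.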

Worked example — standard 3×3 conv, 64→128 channels, output 56×56:
Params (ignoring bias) = 128 × 64 × 3 × 3 = 73,728
MACs = 64 × 128 × 3 × 3 × 56 × 56 = 73,728 × 3,136 = 231,211,008 ≈ 231M MACs
Reuse factor = 3,136× — every weight is used 3,136 times. This is why CNNs are compute-efficient.

Now apply the formula to AlexNet's first layer: 3→96 channels, 11×11 kernel, output 55×55 (stride 4 applied to 224×224 input):

Params = 96 × 3 × 11 × 11 = 34,848   MACs = 3 × 96 × 11 × 11 × 55 × 55 = 105,415,200

That first layer alone costs 105M MACs — 14.5% of AlexNet's total 724M MACs — yet holds only 34K of AlexNet's 61M parameters (0.06%). Classic early-layer profile: cheap on params, expensive on compute due to large spatial maps.

python
def conv2d_cost(c_in, c_out, kh, kw, h_out, w_out):
    """Compute parameter count and MACs for a Conv2d layer."""
    params = c_out * c_in * kh * kw + c_out  # weights + bias
    macs = c_in * c_out * kh * kw * h_out * w_out
    return params, macs

# AlexNet layer 1: 3→96, 11×11 kernel, 55×55 output
p, m = conv2d_cost(3, 96, 11, 11, 55, 55)
print(f"Params: {p:,}  MACs: {m:,}")
# Params: 34,944  MACs: 105,415,200

# 3×3 conv 64→128 on 56×56 (common ResNet block)
p, m = conv2d_cost(64, 128, 3, 3, 56, 56)
print(f"Params: {p:,}  MACs: {m:,}")
# Params: 73,856  MACs: 231,211,008

Here is the complete AlexNet MAC and activation profile, derived from the formula above. Study the per-layer numbers — they reveal patterns you'll see in every CNN:

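| Layer | Output shape | Params | MACs | Activation (elements) |
|---|---|---|---|---|
| Conv1 3→96, 11×11, s4 | 96×55×55 | 34,848 | 105,415,200 | 290,400 |
| MaxPool 3×3, s2 | 96×27×27 | 0 | ~0 | 69,984 |
| Conv2 96→256, 5×5, g2 | 256×27×27 | 307,200 | 223,948,800 | 186,624 |
| MaxPool 3×3, s2 | 256×13×13 | 0 | ~0 | 43,264 |
| Conv3 256→384, 3×3 | 384×13×13 | 884,736 | 149,520,384 | 64,896 |
| Conv4 384→384, 3×3, g2 | 384×13×13 | 663,552 | 112,140,288 | 64,896 |
| Conv5 384→256, 3×3, g2 | 256×13×13 | 442,368 | 74,760,192 | 43,264 |
| FC1 9216→4096 | 4096 | 37,748,736 | 37,748,736 | 4,096 |
| FC2 4096→4096 | 4096 | 16,777,216 | 16,777,216 | 4,096 |
| FC3 4096→1000 | 1000 | 4,096,000 | 4,096,000 | 1,000 |
| Total | | 60,954,656 ≈ 61M | 724,406,816 ≈ 724M | 932,264 |

The activation total also counts the 3×224×224 input (150,528 elements) and the 256×6×6 output of the final max pool before FC1 (9,216 elements), which have no rows of their own.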

Key observations: (1) 62% of params are in FC1 alone, and the three FC layers together hold 96%. (2) 92% of MACs are in the five conv layers. (3) The largest single activation is Conv1's output (290,400 elements). The input (150,528 elements, not shown as a row) must be live at the same time, so input + first output = 440,928 elements is the peak activation memory AlexNet requires simultaneously.
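
The per-layer shares can be recomputed directly from the layer specs. The sketch below assumes the AlexNet configuration listed in the table (biases ignored, square feature maps, FC layers treated as 1×1 convs on a 1×1 map); the `ALEXNET` list and `layer_cost` helper are illustrative names.

```python
# (name, c_in, c_out, k, h_out, groups); FC layers use k=1, h_out=1
ALEXNET = [
    ("Conv1", 3, 96, 11, 55, 1),
    ("Conv2", 96, 256, 5, 27, 2),
    ("Conv3", 256, 384, 3, 13, 1),
    ("Conv4", 384, 384, 3, 13, 2),
    ("Conv5", 384, 256, 3, 13, 2),
    ("FC1", 9216, 4096, 1, 1, 1),
    ("FC2", 4096, 4096, 1, 1, 1),
    ("FC3", 4096, 1000, 1, 1, 1),
]


def layer_cost(c_in, c_out, k, h_out, groups):
    """Return (params, macs) for one layer, bias ignored."""
    params = c_out * c_in * k * k // groups
    macs = params * h_out * h_out  # every weight is reused at each output position
    return params, macs


rows = [(name, *layer_cost(*spec)) for name, *spec in ALEXNET]
total_params = sum(p for _, p, _ in rows)
total_macs = sum(m for _, _, m in rows)

for name, p, m in rows:
    print(f"{name:6s} params={p:>11,} ({p / total_params:5.1%})  "
          f"MACs={m:>12,} ({m / total_macs:5.1%})")
print(f"Total  params={total_params:,}  MACs={total_macs:,}")
```

Printing the shares next to the raw counts makes the params-versus-MACs asymmetry visible layer by layer.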

A 3×3 Conv2d layer has C_in=256, C_out=256, and output spatial size 13×13 (like AlexNet layer 3). Ignoring bias, how many parameters does it have — and how many MACs?
AlexNet full parameter breakdown — by derivation:
Layer 1: 3→96, 11×11, params = 96×3×121 = 34,848
Layer 2: 96→256, 5×5, g=2, params = 256×96×25/2 = 307,200
Layer 3: 256→384, 3×3, params = 384×256×9 = 884,736
Layer 4: 384→384, 3×3, g=2, params = 384×384×9/2 = 663,552
Layer 5: 384→256, 3×3, g=2, params = 256×384×9/2 = 442,368
FC1: 256×6×6→4096, params = 9,216×4096 = 37,748,736
FC2: 4096→4096, params = 4096×4096 = 16,777,216
FC3: 4096→1000, params = 4096×1000 = 4,096,000
Total ≈ 61M, of which 62% sits in FC1 alone and about 96% in the three FC layers. This is why pruning and quantization of FC layers gives large model size reduction with limited accuracy loss.

Chapter 3: Receptive Field & Spatial Reach

The receptive field of an output neuron is the region of the original input that influences its value. A single 3×3 conv layer sees a 3×3 patch. Stack two 3×3 conv layers and the output sees a 5×5 patch of the original input. Stack three and it sees 7×7. The formula for L layers with kernel size k (and stride 1) is:

RF = L · (k − 1) + 1

For L=3, k=3: RF = 3×2+1 = 7×7. That's the same receptive field as a single 7×7 conv — but with dramatically fewer parameters. A single 7×7 conv with C channels has 7²=49 weights per (in,out) channel pair; three stacked 3×3 convs have 3×3²=27 — 45% fewer params, with an identical view of the input. This is why VGG replaced large kernels with stacked 3×3s.

Receptive field matters for efficiency because large RF is needed for scene understanding — but achieving it with a single large kernel is parameter-inefficient. Stacking small kernels also introduces more non-linearity (one ReLU per layer), which improves representational power. The tradeoff: more layers = more activation memory, since each intermediate activation map must be kept live for the residual connection or gradient computation.

Stride multiplies RF growth. A stride-2 conv doubles the effective RF reach in the next layer. Modern networks use strided convolutions or pooling to rapidly expand RF while reducing spatial resolution — shrinking activation memory at the cost of spatial precision.

VGG vs single large kernel: VGG-16 contains 13 conv layers, all 3×3. Ignoring pooling and assuming a constant width of C channels, 13 stacked 3×3 layers reach RF = 13×(3−1)+1 = 27 with 13×3²×C² = 117C² weights; a single 27×27 layer would need 27²×C² = 729C² weights, about 6.2× more for the same RF. (VGG's pooling layers make the real RF larger still.) Plus each 3×3 layer introduces a nonlinearity, giving VGG's stacked 3×3s better discriminative power per parameter. This "stack small kernels" insight is now standard across most CNN architectures. Dilated (atrous) convolutions offer another approach: use K=3 with dilation rate r to sample the input at positions separated by r pixels, achieving an effective RF of K+(K−1)(r−1) without any increase in params or compute.
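
The stride-1 formula RF = L·(k−1)+1 is a special case of a running recurrence that also handles stride and dilation. The sketch below uses the standard recurrence (each layer adds (k−1)·dilation·jump, where jump is the product of earlier strides); the function name and the tuple layout are assumptions made for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation). Returns RF side length in input pixels."""
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf


print(receptive_field([(3, 1, 1)] * 3))            # three 3x3 convs -> 7
print(receptive_field([(3, 1, 1)] * 13))           # thirteen 3x3 convs -> 27
print(receptive_field([(3, 2, 1), (3, 1, 1)]))     # stride 2 first -> 7
print(receptive_field([(3, 1, 2)]))                # dilation 2 -> 5
```

A stride-2 layer followed by one 3×3 conv reaches the same 7-pixel RF as three stride-1 layers, which is the "stride multiplies RF growth" effect in numbers.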

[Interactive canvas: Receptive field growth visualizer. Add conv layers and watch the receptive field (orange square) grow in the input grid. Formula: RF = L×(k−1)+1. Sliders: Layers L (default 2), Kernel k (default 3).]

You stack 5 layers of 3×3 convolutions (stride 1, no pooling). What is the receptive field size at the final output?

Chapter 4: Grouped & Depthwise Separable Conv

Standard convolution treats all input channels jointly for all output channels. What if we could break that coupling? Grouped convolution divides the Cin input channels into g equal groups and applies a separate, narrower convolution within each group. Each group sees only Cin/g input channels and produces Cout/g output channels:

Params_grouped = C_out · C_in · K_h · K_w / g     MACs_grouped = C_in · C_out · K_h · K_w · H_out · W_out / g

Both params and MACs shrink by exactly g×. AlexNet used g=2 to split across two GPUs in 2012 — it was an engineering hack that became a principled design choice.

Worked example — AlexNet Conv2, g=2: Cin=96, Cout=256, K=5×5. Standard conv params would be 256×96×25 = 614,400. With g=2: params = 614,400/2 = 307,200. MACs similarly halved: 96×256×25×27×27/2 = 223,948,800 (vs 447,897,600 for standard). The g=2 grouping effectively creates two parallel conv towers processing 48 input channels each and producing 128 output channels each — they never interact except at the next layer.
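
The grouped-conv formulas can be sketched as one function with a `groups` argument. The name `grouped_conv_cost` is illustrative; bias is ignored and a square kernel is assumed.

```python
def grouped_conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Return (params, macs) for a k x k conv split into `groups` groups, bias ignored."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    params = c_out * (c_in // groups) * k * k
    macs = params * h_out * w_out
    return params, macs


# AlexNet Conv2: 96 -> 256, 5x5, 27x27 output
print(grouped_conv_cost(96, 256, 5, 27, 27, groups=1))
print(grouped_conv_cost(96, 256, 5, 27, 27, groups=2))
# groups = channels gives a depthwise conv
print(grouped_conv_cost(128, 128, 3, 56, 56, groups=128))
```

Setting `groups` equal to the channel count reproduces the depthwise case as the extreme of the same formula.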

When does grouping hurt accuracy? Grouped convolutions prevent cross-group information mixing. With g=4, channels 0–31 never influence channels 32–63 at this layer. In practice, this hurts accuracy less than expected because later pointwise (1×1) convolutions can re-mix all channels. ShuffleNet exploits this explicitly — after a grouped conv, it "shuffles" the channels across groups before the next layer, giving cross-group communication at nearly zero cost.

Depthwise Separable Convolution — the MobileNet trick

Take grouped convolution to the extreme: set g = Cin = Cout = C. Now each channel gets its own independent K×K spatial filter. This is a depthwise convolution. It captures spatial structure within each channel but never mixes channels. To mix channels afterward, apply a 1×1 pointwise convolution. Together they form depthwise separable convolution.

Step 1: Depthwise conv (C channels, K×K kernel, output H_out×W_out):

Params_DW = C · K²   MACs_DW = C · K² · H_out · W_out

Step 2: Pointwise conv (C_in = C → C_out, 1×1 kernel):

Params_PW = C_out · C · 1 · 1 = C_out · C   MACs_PW = C · C_out · H_out · W_out

Total depthwise separable (C_in = C_out = C for simplicity, output H×W):

MACs_DS = C · K² · H · W + C² · H · W = C · H · W · (K² + C)

Standard conv (same dimensions):

MACs_std = C² · K² · H · W

The FLOP reduction ratio:

ratio = MACs_DS / MACs_std = (K² + C) / (C · K²) = 1/C + 1/K²

For typical values: K=3, C=128 → ratio = 1/128 + 1/9 ≈ 0.119, roughly 8.4× fewer MACs. For K=3, C=256 → ratio ≈ 0.115, about 8.7×. Once C is much larger than K², the 1/K² term dominates and the reduction saturates just below K² = 9×; the 1/C term only matters for narrow layers where C is comparable to K².
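
Tabulating the ratio over a range of channel counts shows how it behaves as C grows. A minimal sketch, assuming C_in = C_out = C; `ds_ratio` is an illustrative helper name.

```python
def ds_ratio(c, k):
    """MACs_DS / MACs_std for C_in = C_out = c and a k x k kernel."""
    return 1 / c + 1 / (k * k)


for c in (8, 32, 128, 512, 2048):
    r = ds_ratio(c, 3)
    print(f"C={c:5d}  ratio={r:.4f}  reduction={1 / r:.2f}x")
```

For K=3 the reduction factor climbs with C but can never reach 9×, because the 1/K² term is a floor on the ratio.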

Worked example — standard vs depthwise separable, C=128, K=3, 56×56:
Standard: MACs = 128 × 128 × 9 × 56 × 56 = 462,422,016 ≈ 462M
Depthwise: MACsDW = 128 × 9 × 3136 = 3,612,672
Pointwise: MACsPW = 128 × 128 × 3136 = 51,380,224
DS total: 54,992,896 ≈ 55M — 8.4× reduction. MobileNets achieve near-ResNet accuracy at this cost.

MobileNetV1 (Howard et al. 2017) replaces every standard 3×3 conv after the first layer of a VGG-like network with a depthwise separable conv. Result: about 27× fewer MACs than VGG-16 (569M vs 15.3B) for only about a 1% top-1 accuracy drop on ImageNet. This single architectural change, one formula substitution, made real-time vision on smartphones feasible. MobileNetV2 adds inverted residuals and MobileNetV3 adds hard-swish activations for even better efficiency.

Misconception: depthwise conv is always faster in wall-clock time. Depthwise conv has fewer FLOPs, but its arithmetic intensity is very low: each input activation feeds only K² MACs (versus K²·C_out in a standard conv), because there is no cross-channel reuse. On hardware with highly optimized GEMM (GPUs, NPUs), a standard conv packs data efficiently into matrix multiplications. A depthwise conv does not. In practice, a depthwise layer can be memory-bandwidth-bound even though it does fewer FLOPs. Dedicated hardware support (like Google Edge TPU's custom depthwise kernel) is needed to fully realize the speedup.
python
def depthwise_sep_cost(c, k, h_out, w_out, c_out=None):
    """Depthwise separable conv: DW + PW."""
    if c_out is None: c_out = c
    # Depthwise: one k×k filter per channel
    params_dw = c * k * k
    macs_dw = c * k * k * h_out * w_out
    # Pointwise: 1×1 conv mixing channels
    params_pw = c_out * c
    macs_pw = c * c_out * h_out * w_out
    return params_dw + params_pw, macs_dw + macs_pw

def std_conv_cost(c_in, c_out, k, h_out, w_out):
    return c_out * c_in * k * k, c_in * c_out * k * k * h_out * w_out

c, k, h = 128, 3, 56
p_std, m_std = std_conv_cost(c, c, k, h, h)
p_ds, m_ds = depthwise_sep_cost(c, k, h, h)
print(f"Standard:  params={p_std:,}  MACs={m_std:,}")
print(f"DS:        params={p_ds:,}   MACs={m_ds:,}")
print(f"Reduction: {m_std/m_ds:.1f}× MACs  {p_std/p_ds:.1f}× params")
# Standard:  params=147,456  MACs=462,422,016
# DS:        params=17,536   MACs=54,992,896
# Reduction: 8.4× MACs  8.4× params
The FLOP reduction ratio of depthwise separable vs standard conv is 1/C + 1/K². For K=3 and C=512, approximately what is the reduction factor (i.e., how many times fewer MACs)?

Chapter 5: Showcase: Conv Cost Comparator

Drag the sliders to set Cin, Cout, kernel size K, and output spatial size H. The canvas computes and compares parameters and MACs side-by-side for standard conv, grouped conv (g=4), and depthwise separable conv. Watch how the relative costs shift as you change the parameters.

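The three architectures represent three points on the parameter-accuracy tradeoff frontier: standard conv mixes every input channel into every output channel at full cost, grouped conv (g=4) cuts params and MACs by 4× by mixing only within groups, and depthwise separable conv splits spatial filtering from channel mixing.

What to try: Set C=32 (narrow network) vs C=512 (wide). For narrow channels the 1/C term is still significant, so depthwise separation is less dramatic (about 7× at C=32, K=3). At C=512 the 1/C term is negligible and the reduction approaches the K² = 9× ceiling (about 8.8×). This is why depthwise separable works best in wide networks. Also try K=1 (pointwise conv): depthwise separable no longer saves anything (ratio = 1/C + 1 > 1), since there's no spatial operation to separate, while grouped conv still cuts cost by g×.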

[Interactive canvas: Standard vs Grouped (g=4) vs Depthwise Separable, live comparison. Sliders: C_in = C_out (default 128), Kernel K (default 3), Output size H=W (default 56).]

You move from standard conv to depthwise separable with K=3, C_in=C_out=64. The showcase canvas shows approximately 8× MAC reduction. If you then double the channel count to C=128 (keeping K=3), the reduction factor will:

Chapter 6: Pooling, Normalization & BN Folding

Pooling

Pooling downsamples the spatial dimensions of a feature map. Unlike conv, it has no learnable parameters — it applies a fixed operation (max or average) over a K×K window, typically with stride = K so windows don't overlap. For an input of shape (C, H, W) and pool size K, the output is (C, H/K, W/K), reducing the spatial area by K².

Params_pool = 0    Ops ≈ C · (H/K) · (W/K) · K² comparisons (max) or adds (avg)

Pooling is cheap — no weights to store or load — but it destroys spatial information irreversibly. A 2×2 max pool after a 256×56×56 feature map produces 256×28×28, cutting activation size by 4× and making all subsequent layers cheaper.

Global Average Pooling (GAP) is the modern alternative to large FC layers for classification. Instead of flattening the final feature map (e.g., 512×7×7 = 25,088 values → 25,088-dimensional FC), GAP computes the average of each channel's spatial map → one value per channel (512 values). The subsequent FC is then 512→1000 (512K params) rather than 25,088→1000 (25.1M params). GAP was popularized by NiN and used in all MobileNets. Cost: C additions per spatial position, no parameters. The spatial structure is lost but the channel-level summary is preserved.
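
The GAP saving is a one-line calculation. A small sketch, with `classifier_params` as an illustrative helper that counts only the weights of the final FC layer (no bias):

```python
def classifier_params(channels, h, w, num_classes, use_gap):
    """Weights of the final FC layer after either flattening or global average pooling."""
    features = channels if use_gap else channels * h * w
    return features * num_classes


flat = classifier_params(512, 7, 7, 1000, use_gap=False)
gap = classifier_params(512, 7, 7, 1000, use_gap=True)
print(f"flatten + FC: {flat:,}  GAP + FC: {gap:,}  ratio: {flat // gap}x")
```

The ratio is exactly the spatial area of the final feature map (7×7 = 49 here), since GAP removes that factor from the FC input size.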

Strided convolution vs pooling: Rather than conv + pool, modern architectures often use a strided conv (stride=2) to downsample. A stride-2 3×3 conv produces the same output spatial size as conv + 2×2 pool, but the conv layer has learned parameters and can learn how to downsample rather than applying a fixed max/avg. ResNet uses stride-2 convolutions; VGG uses max pooling. The tradeoff: strided conv has more params but is potentially more expressive and eliminates a separate memory pass for the pool layer.

Activation Functions

Every layer's output flows through an activation function — a non-linearity that gives deep networks their expressive power. Without activations, a 100-layer network would collapse to a single linear transformation. The cost: one additional operation per activation value, but zero learnable parameters for most activations.

ReLU (Rectified Linear Unit) is the default: y = max(0, x). One comparison per activation, no parameters, no exp() calls. For a 128×56×56 feature map, that's 401,408 comparisons — trivial. ReLU6 (y = min(max(0,x), 6)) is used in MobileNet to prevent very large activations, which aids fixed-point quantization (values stay in [0,6]).

Swish (y = x/(1+e^(−x))) and GELU (used in Transformers) are more expensive: each needs a transcendental function such as exp(), roughly 4–8× the cost of ReLU. Hard Swish (y = x·ReLU6(x+3)/6, which equals x·(x+3)/6 for x ∈ [−3,3], 0 below −3, and x above 3) approximates Swish using only adds, multiplies, and clamps, with no exp(). It is used in MobileNetV3 for hardware efficiency.

Activation cost is rarely the bottleneck — a ReLU over 401K activations is negligible compared to the 231M MACs of the preceding conv. But on quantized inference engines, activations are also quantized (INT8 range), and the choice of activation affects the quantization range and precision. ReLU6 and Hard Swish clip to a known range, enabling tighter quantization without calibration.
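
To see how closely Hard Swish tracks Swish without calling exp(), the scalar definitions can be compared side by side. This is a plain-Python sketch for single values, not an optimized kernel; the function names are illustrative.

```python
import math


def relu6(x):
    """Clamp x to the range [0, 6]."""
    return min(max(0.0, x), 6.0)


def swish(x):
    """x * sigmoid(x); needs one exp() per value."""
    return x / (1.0 + math.exp(-x))


def hard_swish(x):
    """x * ReLU6(x + 3) / 6: piecewise polynomial, no exp() needed."""
    return x * relu6(x + 3.0) / 6.0


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  hard_swish={hard_swish(x):+.4f}")
```

Outside [−3, 3] Hard Swish is exactly 0 or x, which is what gives quantized engines a known output range to work with.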

Residual / Skip Connections

Residual connections (introduced in ResNet, He et al. 2016) add the input of a block directly to its output: y = F(x) + x. The function F is typically two or three conv layers. This lets the network learn residuals — small corrections to the identity — rather than full transformations, making deep networks trainable.

Cost of residual connections: zero additional parameters and zero additional MACs (just an element-wise add). But they require that input and output have the same shape. When the spatial size or channel count changes (e.g., stride-2 downsampling), a 1×1 conv projection shortcut is used to match dimensions — adding params = Cout×Cin and MACs = Cout×Cin×Hout×Wout.

Projection shortcut worked example (a ResNet-50-style bottleneck block): Suppose the block expands 64→256 channels and produces a 28×28 output. A 1×1 projection conv is needed on the shortcut: C_in=64, C_out=256, K=1, H=28, W=28.

Params_proj = 256 × 64 × 1 × 1 = 16,384    MACs_proj = 256 × 64 × 28 × 28 = 12,845,056 ≈ 12.8M

For comparison, the 3×3 conv in the same block (64→64): MACs = 64×64×9×28×28 = 28,901,376 ≈ 29M. The projection shortcut costs 44% as much as that 3×3 conv, so it is not free! But it only occurs once per stage change (typically 4 times in ResNet-50), so its aggregate cost is small relative to the full network's 4B MACs.

Activation memory penalty of residuals. A residual connection requires keeping the input x alive in memory until after F(x) completes — you can't free it early. For a ResNet block with 256×28×28 input at FP16, that's 401 KB held in SRAM for the duration of the block. Deep networks with many residuals accumulate this: on an MCU, this can make activation memory the primary bottleneck even more than the conv outputs themselves.
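
Both residual costs, the projection shortcut's MACs and the skip buffer's memory, follow from formulas already introduced. A sketch with illustrative helper names, using the 64→256, 28×28 example and FP16 (2 bytes per element):

```python
def projection_shortcut_cost(c_in, c_out, h_out, w_out):
    """1x1 conv on the shortcut path: (params, macs), bias ignored."""
    params = c_out * c_in
    return params, params * h_out * w_out


def tensor_bytes(c, h, w, bytes_per_elem=2):
    """Memory for one C x H x W activation tensor (default FP16)."""
    return c * h * w * bytes_per_elem


p, m = projection_shortcut_cost(64, 256, 28, 28)
main_3x3 = 64 * 64 * 9 * 28 * 28  # MACs of the block's 3x3 conv (64 -> 64)
print(f"shortcut: params={p:,}  MACs={m:,}  ({m / main_3x3:.0%} of the 3x3 conv)")
print(f"skip buffer 256x28x28 @ FP16: {tensor_bytes(256, 28, 28) / 1000:.0f} KB")
```

The skip buffer costs no MACs at all, yet it is the term that an MCU's SRAM budget feels first.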

Batch Normalization (training)

Batch Normalization normalizes the activations within a mini-batch to have zero mean and unit variance, then applies a learned scale γ and shift β per channel. During training for a conv feature map of shape (N, C, H, W), BN computes the mean and variance across (N, H, W) for each of the C channels:

μ_c = mean over (N,H,W)    σ_c² = variance over (N,H,W)
y_c = γ_c · (x_c − μ_c) / √(σ_c² + ε) + β_c

This adds 2C trainable parameters (γ and β, one scalar per channel) — tiny. But at inference time, μ and σ are fixed (running statistics from training). The operation becomes a simple linear transform: multiply by (γ/σ) and add (β − γμ/σ). And a linear transform after a convolution is itself a convolution — so BatchNorm can be folded into the preceding conv layer.

LayerNorm (used in Transformers) normalizes across the feature dimension (d) for each token independently, rather than across the batch. For a token vector of size d: 2d parameters (γ, β), computed per token. Unlike BatchNorm, LayerNorm statistics change per-sample — it cannot be folded the same way. Its inference cost is 5 ops per element (subtract mean, divide std, scale, shift + the mean/var computation). For a Transformer with N=512 tokens and d=768: LayerNorm costs 512×768×5 ≈ 2M ops per block — negligible compared to the FFN's 2.4B MACs.

BN vs LN at a glance:
BatchNorm: normalizes over (N,H,W), parameters = 2C, folds into conv at inference → 0 inference cost.
LayerNorm: normalizes over d (feature dim), parameters = 2d, does NOT fold into linear, always present at inference.
GroupNorm: normalizes over (H,W) and the channels within each group, per sample, so it works with batch size 1; used in detection/segmentation.
InstanceNorm: normalizes over (H,W) per sample per channel — used in style transfer.

BatchNorm Folding (the inference speedup)

After training, the conv + BN sequence is:

y = γ · (W * x + b − μ) / σ + β

Rearranging into a single convolution with new weights W' and bias b':

W'_j = W_j · (γ_j / σ_j)    b'_j = (b_j − μ_j) · (γ_j / σ_j) + β_j    where σ_j = √(running_var_j + ε)

The result: zero additional MACs at inference. The BN layer simply disappears into the conv's weights. You save both the normalization arithmetic AND the memory reads for γ, β, μ, σ. Most inference engines (TensorRT, ONNX Runtime, TFLite) do this automatically as a graph optimization pass.

BatchNorm folding requires frozen statistics. It only works at inference because μ and σ are running averages frozen after training. During training (or fine-tuning), BN must remain separate because μ and σ update with each batch. Always export with eval() mode in PyTorch before folding.
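
The folding identity can be verified numerically on a toy layer. The sketch below is plain Python with no framework: it treats a 1×1 conv at a single spatial position as a small linear map, which is enough because folding is applied per output channel and does not depend on kernel size. All names and the toy values are illustrative assumptions, and `eps=1e-5` is just a typical choice.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BN statistics into a layer's weights and bias.

    weight: list of rows, one row of input weights per output channel.
    """
    folded_w, folded_b = [], []
    for w_row, b, g, bt, mu, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_row])
        folded_b.append((b - mu) * scale + bt)
    return folded_w, folded_b


def linear(weight, bias, x):
    """y_j = sum_i W_ji * x_i + b_j"""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN using frozen running statistics."""
    return [g * (yi - mu) / math.sqrt(v + eps) + bt
            for yi, g, bt, mu, v in zip(y, gamma, beta, mean, var)]


# Toy 1x1 conv at one spatial position: 3 input channels -> 2 output channels
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]
x = [1.0, 2.0, -1.0]

reference = batchnorm_eval(linear(W, b, x), gamma, beta, mean, var)
Wf, bf = fold_bn(W, b, gamma, beta, mean, var)
folded = linear(Wf, bf, x)
print(reference, folded)
assert all(abs(r - f) < 1e-9 for r, f in zip(reference, folded))
```

The folded layer produces the same outputs with one linear pass, so the four BN vectors no longer need to be stored or read at inference.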

[Interactive canvas: BatchNorm folding, before vs after (ops at inference). Toggle between the Conv+BN pipeline (training) and the folded-BN single conv (inference). Green checkmarks mark eliminated operations.]

A Conv2d + BatchNorm block has 256 output channels. BatchNorm has 4 parameter tensors: γ (256), β (256), running_mean (256), running_var (256) — total 1,024 values. After folding BN into conv at inference, how many of these BN parameters remain as separate computations?

Chapter 7: The Transformer Block

The Transformer block has two sub-components: Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (FFN), each preceded by LayerNorm. For a sequence of N tokens with model dimension d (e.g. BERT-base: N=512, d=768), let's derive where every MAC and every parameter lives.

QKV Projections

Three Linear projections map input X (shape N×d) to Queries, Keys, and Values — each of shape N×d. Three weight matrices, each d×d:

Params_QKV = 3 × d²    MACs_QKV = 3 × N × d²

For BERT-base (d=768, N=512): ParamsQKV = 3×768²=1,769,472; MACs = 3×512×768²=905,969,664 ≈ 906M per block.

Attention computation

Attention scores are Q × Kᵀ (shape N×N), requiring N² × d multiply-accumulates. Then comes a softmax over each row of N scores, followed by the weighted sum of V (another N² × d MACs):

MACs_attn ≈ 2 × N² × d    Params_attn = 0  (no weights in the matrix multiply)

For N=512, d=768: MACsattn = 2×512²×768 = 402,653,184 ≈ 403M. Notice: these MACs scale as N² — double the sequence length, quadruple the attention cost. This is the quadratic scaling problem of standard Transformers.

Output projection

Another d×d linear map merges all attention heads: Params = d², MACs = N×d². For BERT-base: Params = 589,824; MACs = 512×768² = 301,989,888 ≈ 302M.

Combining all attention-related components (QKV projections + attention + output projection) for the full multi-head attention sub-layer with d=768, N=512:

Params_attn block = 3d² + d² = 4d² = 4 × 768² = 2,359,296 ≈ 2.4M
MACs_attn block = 3Nd² + 2N²d + Nd² = 4Nd² + 2N²d

At N=512, d=768: MACs = 4×512×768² + 2×512²×768 = 1,207,959,552 + 402,653,184 = 1,610,612,736 ≈ 1.6B MACs.

Feed-Forward Network (FFN)

Two linear layers: d → 4d → d (the 4× expansion is standard), with a nonlinearity between them (GELU in BERT):

Params_FFN = d × 4d + 4d × d = 8d²    MACs_FFN = 2 × N × 4d² = 8 × N × d²

For BERT-base (d=768, N=512): ParamsFFN = 8×768² = 4,718,592 ≈ 4.7M; MACsFFN = 8×512×768² = 2,415,919,104 ≈ 2.4B per block.

Multi-head attention (MHA) vs single-head: Standard Transformers split attention into h heads (h=12 for BERT-base). Each head uses dimension d/h=64. The QKV weight matrices are still each d×d (just logically partitioned into h blocks). So multi-head attention has the same parameter count and MACs as single-head — the split is conceptual, not a cost change. The benefit is multiple attention patterns (different heads learn different relationships) without additional compute.

The key asymmetry in Transformers: FFN holds most of the parameters (~67% of a block: 8d² vs 4d² for attention projections). But at long sequences, attention dominates MACs because its score and weighted-sum terms scale as N² while the FFN scales as N. For BERT at N=512, FFN MACs ≈ 2.4B vs ≈ 1.6B for the whole attention sub-layer, of which only 403M is the N² term (the rest is 906M for QKV and 302M for the output projection). Flip this: at N=4096, the N² term alone grows to 2×4096²×768 ≈ 25.8B while FFN grows linearly to 8×4096×768² ≈ 19.3B, so attention now dominates. Setting 2N²d = 8Nd² puts the crossover for the N² term at N = 4d = 3072. This crossover is why long-context efficiency research (FlashAttention, linear attention) focuses on the N² term.
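
The per-component formulas from this chapter fit in one function, which makes the sequence-length sweep easy to reproduce. A sketch under the chapter's assumptions (no biases, no LayerNorm, 4× FFN expansion); `transformer_block_cost` and its dictionary keys are illustrative names.

```python
def transformer_block_cost(n, d, ffn_mult=4):
    """Per-block (params, macs) dictionaries for sequence length n and model dim d."""
    params = {
        "qkv": 3 * d * d,
        "attn": 0,  # Q.K^T and the weighted sum of V have no weights
        "out_proj": d * d,
        "ffn": 2 * ffn_mult * d * d,
    }
    macs = {
        "qkv": 3 * n * d * d,
        "attn": 2 * n * n * d,
        "out_proj": n * d * d,
        "ffn": 2 * ffn_mult * n * d * d,
    }
    return params, macs


for n in (512, 1536, 3072, 4096):
    _, macs = transformer_block_cost(n, 768)
    attn_sublayer = macs["qkv"] + macs["attn"] + macs["out_proj"]
    print(f"N={n:5d}  N^2 term={macs['attn'] / 1e9:6.2f}B  "
          f"attention sub-layer={attn_sublayer / 1e9:6.2f}B  "
          f"FFN={macs['ffn'] / 1e9:6.2f}B")
```

Two crossovers show up in the sweep: the whole attention sub-layer matches the FFN at N = 2d, and the N² term alone matches it at N = 4d.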

[Interactive canvas: Transformer block cost breakdown, FFN vs attention. Drag the sequence length slider (default N=512) and watch attention MACs (∝ N²) overtake FFN MACs (∝ N). Model dimension d is fixed at 768.]

For a single Transformer block with d=1024 and N=2048, which component has the most parameters (ignoring LayerNorm and biases)?

Chapter 8: Where Cost Lives in Real Networks

Formulas are one thing; seeing them in a real architecture is another. Let's profile a simplified CNN (ResNet-style) and a Transformer block side-by-side, computing params, MACs, and activation memory per layer. The pattern that emerges is the foundation of every efficiency decision in the rest of this course.

The CNN asymmetry

Consider a simplified CNN: Input 3×224×224 → Conv 3→64, 7×7, stride 4 → Conv 64→128, 3×3 → Conv 128→256, 3×3, stride 2 → Conv 256→512, 3×3, stride 2 → GlobalAvgPool → FC 512→1000. Spatial sizes shrink as we go deeper.

The output spatial size at each layer: 224→56 (stride 4), 56→56, 56→28 (via stride 2 in Conv3), 28→14, 14→1 (GlobalAvgPool). Channel counts grow to compensate: 3→64→128→256→512. This is the canonical ResNet-style "halve spatial, double channels" design.
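
The spatial sizes above follow from the standard conv output-size formula. The sketch below walks the four conv layers; the padding values (3 for the 7×7 layer, 1 for the 3×3 layers) are an assumption, since the text gives only kernel sizes and strides.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Standard conv output size: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding); padding is assumed, not given in the text
conv_layers = [("Conv1", 7, 4, 3), ("Conv2", 3, 1, 1), ("Conv3", 3, 2, 1), ("Conv4", 3, 2, 1)]

size = 224
sizes = []
for name, kernel, stride, padding in conv_layers:
    size = conv_out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name}: {size}x{size}")
```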

| Layer | Output shape | Params | MACs | Act. size (FP16) |
|---|---|---|---|---|
| Conv1 7×7 3→64 s4 | 64×56×56 | 9,408 | 29.5M | 401 KB |
| Conv2 3×3 64→128 | 128×56×56 | 73,728 | 231M | 802 KB |
| Conv3 3×3 128→256 s2 | 256×28×28 | 294,912 | 231M | 401 KB |
| Conv4 3×3 256→512 s2 | 512×14×14 | 1,179,648 | 231M | 201 KB |
| FC 512→1000 | 1000 | 512,000 | 512K | 2 KB |

The pattern is clear. Early layers: huge activation maps (401–802 KB), few params (9K–74K), and MACs that come almost entirely from the large spatial extent. Late conv layers: many params (1.2M for Conv4), smaller spatial maps (14×14), and the same total MACs as the mid-network layers (spatial size shrinks as channels expand). FC layer: half a million params but trivial MACs (no spatial dimension).
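
The conv rows of the table can be recomputed from the layer shapes alone. This is a small sketch that assumes bias-free convolutions and FP16 activations, as the table does; `conv_cost` is a name chosen for this example.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, bytes_per_value=2):
    """Return (params, MACs, output activation bytes) for a bias-free Conv2d."""
    params = c_out * c_in * k * k
    macs = params * h_out * w_out            # every weight is used once per output position
    act_bytes = c_out * h_out * w_out * bytes_per_value
    return params, macs, act_bytes

cnn = [  # (name, c_in, c_out, kernel, output H = W)
    ("Conv1", 3, 64, 7, 56),
    ("Conv2", 64, 128, 3, 56),
    ("Conv3", 128, 256, 3, 28),
    ("Conv4", 256, 512, 3, 14),
]
rows = {}
for name, c_in, c_out, k, hw in cnn:
    rows[name] = conv_cost(c_in, c_out, k, hw, hw)
    params, macs, act = rows[name]
    print(f"{name}: {params:>9,} params  {macs / 1e6:6.1f}M MACs  {act / 1000:6.1f} KB")
```

Writing MACs as params × H × W makes the asymmetry visible: params depend only on channels and kernel size, while MACs also carry the spatial factor.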

Now consider the activation memory peak. The first conv takes a 3×224×224 input (301 KB at FP16) and produces a 64×56×56 output (401 KB). During computation, both must be live simultaneously: that's 702 KB just for that one layer. On a 256 KB MCU, this is impossible without a patch-based (tiled) inference approach. This is exactly the MCUNet problem: MobileNetV2 reduces parameters dramatically but its peak activation memory is still >1 MB.
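
The "input and output both live" rule can be applied to every layer, not just the first. The sketch below is a simplification: it assumes a layer's working set is exactly its input tensor plus its output tensor at FP16, with no scratch buffers and no in-place reuse. `tensor_bytes` and `working_sets` are names made up for this example.

```python
def tensor_bytes(c, h, w, bytes_per_value=2):
    """Size of a C x H x W activation tensor (FP16 by default)."""
    return c * h * w * bytes_per_value

# Activation shapes through the simplified CNN: network input, then each conv output
shapes = [(3, 224, 224), (64, 56, 56), (128, 56, 56), (256, 28, 28), (512, 14, 14)]

# Assumption: working set of layer i = its input + its output, nothing else
working_sets = [
    tensor_bytes(*shapes[i]) + tensor_bytes(*shapes[i + 1])
    for i in range(len(shapes) - 1)
]
for i, ws in enumerate(working_sets, start=1):
    print(f"Conv{i}: input + output = {ws / 1000:.0f} KB")
peak = max(working_sets)
print(f"peak working set: {peak / 1000:.0f} KB")
```

Under this simple model the first layer is not the worst case: any layer whose input and output are both large can set the peak, which is why the whole early part of the network matters on an MCU.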

Compression targets differ by layer: To save activation memory (bottleneck on MCUs) → target early layers or reduce input resolution. To save parameter storage → target late conv and FC layers. To save inference latency on compute-bound hardware → target the layers with most MACs (often the mid-network convolutions with large channels AND spatial dimensions).

The Transformer per-layer breakdown (BERT-base, d=768)

A BERT-base Transformer has 12 identical blocks. Each block's cost (N=512):

| Component | Params | MACs (N=512) | % of block MACs |
|---|---|---|---|
| QKV projections (3×d²) | 1,769,472 | 906M | 22% |
| Attention (2N²d) | 0 | 403M | 10% |
| Output projection (d²) | 589,824 | 302M | 7% |
| FFN total (8d²) | 4,718,592 | 2,416M | 60% |
| LayerNorm (×2, 2d params) | 3,072 | ~8M | <1% |
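
The percentage column can be reproduced for any d and N. This is a minimal sketch that ignores LayerNorm and biases (they contribute under 1% here); `block_breakdown` is an illustrative name.

```python
def block_breakdown(d, n):
    """MACs per component of one Transformer block (LayerNorm and biases ignored)."""
    return {
        "QKV projections": 3 * n * d * d,
        "Attention (N^2 term)": 2 * n * n * d,
        "Output projection": n * d * d,
        "FFN": 8 * n * d * d,
    }

macs = block_breakdown(d=768, n=512)
total = sum(macs.values())
for name, value in macs.items():
    print(f"{name:<22} {value / 1e6:8.0f}M  {100 * value / total:5.1f}%")
```

Calling `block_breakdown` with a larger `n` shows the attention row growing at the expense of the other three.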

Key takeaway: FFN holds 67% of block parameters and 60% of MACs at N=512. The attention score computation (QKᵀ and the weighted sum over V) holds no parameters at all but grows to dominate MACs as N increases. This is why GPT-style compression research typically focuses on FFN sparsity (weight pruning, MoE routing) for params, and FlashAttention / linear attention for long-context compute.

For a whole-model CNN comparison: ResNet-50 has 25M params, 4B MACs, and 8 MB activation memory at inference (FP32). A MobileNetV2 with ~70% top-1 accuracy has 3.4M params and 300M MACs: about 13× fewer MACs and 7× fewer params, for an accuracy drop of only ~6 points. This is the efficiency-accuracy frontier: depthwise separable convolutions buy roughly an order of magnitude on the MAC axis at a small accuracy cost. Later techniques (NAS, knowledge distillation) push further.

Per-layer cost breakdown of a CNN — params vs MACs vs activation memory

Click a metric to highlight it. Notice how params peak late, activations peak early, and MACs are distributed across all conv layers.

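This insight, that MACs stay roughly constant across stages when channels double and spatial size halves, is why "stage-wise" network design is so common. ResNet, EfficientNet, and MobileNet all use it. Halving H and W cuts the number of output positions by 4×, while doubling both C_in and C_out multiplies the per-position cost by 4×, so the two cancel. The network's total compute is therefore distributed roughly evenly across stages rather than concentrated in one place. For efficiency engineers, this is convenient on the MAC axis: no single stage dominates compute, so a uniform MAC-reduction ratio across stages will not unbalance the pipeline. Params and activation memory remain uneven, which is why those two costs still call for layer-specific targets.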

In the CNN table above, Conv2 has 73,728 parameters but Conv4 has 1,179,648 parameters, 16× more. Yet both have approximately 231M MACs. Why do they have the same MAC count despite very different parameter counts?

Chapter 9: Connections & Cheat Sheet

You now have the measurement vocabulary for every layer type in modern deep learning. Here is the complete reference — every formula, plugged with real numbers, and mapped to the efficiency techniques that target each cost type.

Complete Parameter & MAC Formula Reference

| Layer | Params | MACs | Example (numbers) |
|---|---|---|---|
| Linear C_in→C_out | C_in·C_out + C_out | C_in·C_out | 768→768: 590K params, 590K MACs |
| Conv2d C_in→C_out, K×K, H×W out | C_out·C_in·K² | C_in·C_out·K²·H·W | 64→128, 3×3, 56²: 74K params, 231M MACs |
| Grouped Conv, g groups | C_out·C_in·K²/g | C_in·C_out·K²·H·W/g | g=4: 4× fewer params and MACs |
| Depthwise Conv, C channels, K×K | C·K² | C·K²·H·W | C=128, K=3, 56²: 1,152 params, 3.6M MACs |
| Pointwise Conv C→C_out | C_out·C | C·C_out·H·W | 128→128, 56²: 16K params, 51M MACs |
| Depthwise Sep (DW+PW) | C·K² + C_out·C | C·K²·H·W + C·C_out·H·W | Cost ratio vs standard ≈ 1/K² + 1/C_out |
| Pooling K×K, stride K | 0 | C·(H/K)·(W/K)·K² | No params; MACs ≈ spatial comparisons |
| BatchNorm, C channels | 2C (γ, β) | ≈4C·H·W train; 0 inference (folded) | 256 ch: 512 params; folded at inference |
| Transformer QKV, d-dim | 3·d² | 3·N·d² | d=768, N=512: 1.8M params, 906M MACs |
| Transformer Attn, N×d | 0 (no weights) | 2·N²·d | N=512, d=768: 0 params, 403M MACs |
| Transformer FFN d→4d→d | 8·d² | 8·N·d² | d=768, N=512: 4.7M params, 2.4B MACs |
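
The grouped and depthwise separable rows can be checked against the standard conv formula directly. The sketch below uses only the table's formulas (bias-free); the function names are illustrative. It also confirms where each term of the cost ratio comes from: the depthwise step contributes 1/C_out and the pointwise step contributes 1/K².

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    return c_in * c_out * k * k * h * w

def dw_separable_macs(c_in, c_out, k, h, w):
    depthwise = c_in * k * k * h * w       # one K x K filter per input channel
    pointwise = c_in * c_out * h * w       # 1x1 conv that mixes channels
    return depthwise + pointwise

def grouped_conv_macs(c_in, c_out, k, h, w, groups):
    """Each output channel sees only c_in / groups input channels."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k * h * w

c_in, c_out, k, h, w = 128, 128, 3, 56, 56
ratio = dw_separable_macs(c_in, c_out, k, h, w) / standard_conv_macs(c_in, c_out, k, h, w)
print(f"measured ratio {ratio:.4f}  vs  1/C_out + 1/K^2 = {1 / c_out + 1 / k**2:.4f}")
print("g=4 reduction:", standard_conv_macs(c_in, c_out, k, h, w) // grouped_conv_macs(c_in, c_out, k, h, w, 4))
```

Setting `groups` equal to the channel count turns the grouped conv into the depthwise conv from the table.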

Which efficiency technique attacks which cost?

| Cost type | Technique (upcoming lessons) | Target layer | Lesson |
|---|---|---|---|
| Parameter count / model size | Pruning, Quantization, Low-rank decomposition | Late conv, FC, FFN | TinyML L4–L6 |
| MAC count (compute) | Depthwise sep conv, NAS, operator fusion | Mid-network wide convs | TinyML L3, L8 |
| Activation memory (peak) | Gradient checkpointing, in-place ops, TinyTL, MCUNet tiling | Early layers (large H×W) | TinyML L10 |
| N² attention cost (long context) | FlashAttention, linear attention, sparse attn, GQA | Transformer attention block | TinyML L12 |
| Inference memory bandwidth | Quantization (INT8/INT4/FP8), weight compression | All weight-loading ops | TinyML L5–L6 |
| Energy / power budget | Mixed-precision, hardware-aware NAS, co-design | Full model | TinyML L9, L11 |

Each row of this table is motivated by a formula you now know. Pruning targets params: you know exactly how many each layer has. Quantization targets model size (params × bitwidth): you can compute the MB savings. Depthwise separation targets MACs: you derived the 8–9× reduction. MCUNet targets peak activation memory: you know which layers dominate (early, large H×W). FlashAttention targets the N²·d attention term: you derived why it grows quadratically. Every efficient ML paper starts with a cost measurement, using exactly these formulas.

python
"""Per-layer profiler: compute params, MACs, activation size for a model spec."""

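def profile_model(layers):
    """layers: list of dicts with 'name', 'type' and dimension specs.

    For conv layers, 'h' and 'w' are the OUTPUT spatial size.
    """
    total_params, total_macs, peak_act = 0, 0, 0
    print(f"{'Layer':<30} {'Params':>10} {'MACs':>12} {'Act(KB)':>8}")
    print("-"*62)
    for layer in layers:
        t = layer['type']
        if t == 'linear':
            p = layer['ci'] * layer['co'] + layer['co']   # weights + bias
            m = layer['ci'] * layer['co']
            a = layer['co'] * 2  # FP16 bytes
        elif t == 'conv':
            p = layer['co']*layer['ci']*layer['k']**2 + layer['co']   # weights + bias
            m = layer['ci']*layer['co']*layer['k']**2*layer['h']*layer['w']
            a = layer['co']*layer['h']*layer['w']*2
        elif t == 'dw_sep':
            # depthwise K×K per channel, then 1×1 pointwise (no biases)
            p = layer['c']*layer['k']**2 + layer['co']*layer['c']
            m = (layer['c']*layer['k']**2 + layer['c']*layer['co'])*layer['h']*layer['w']
            a = layer['co']*layer['h']*layer['w']*2
        else:
            raise ValueError(f"unknown layer type: {t!r}")
        total_params += p
        total_macs += m
        # Largest single OUTPUT tensor; the true peak also includes the layer's input
        peak_act = max(peak_act, a)
        print(f"{layer['name']:<30} {p:>10,} {m:>12,} {a/1000:>7.1f}")  # KB = 1000 bytes, as in the tables
    print(f"\nTotal params: {total_params:,}  Total MACs: {total_macs:,}  Peak act: {peak_act/1000:.1f} KB")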

# Example: simplified MobileNet-style backbone
profile_model([
    {'name':'Conv1 3→32',    'type':'conv',   'ci':3,   'co':32,  'k':3, 'h':112, 'w':112},
    {'name':'DWSep 32→64',   'type':'dw_sep', 'c':32,  'co':64,  'k':3, 'h':56,  'w':56},
    {'name':'DWSep 64→128',  'type':'dw_sep', 'c':64,  'co':128, 'k':3, 'h':28,  'w':28},
    {'name':'DWSep 128→256', 'type':'dw_sep', 'c':128, 'co':256, 'k':3, 'h':14,  'w':14},
    {'name':'FC 256→1000',   'type':'linear', 'ci':256, 'co':1000},
])
python
"""Full per-layer profiler with PyTorch hook-based MAC counting.
Works on any nn.Module — registers forward hooks to intercept layer shapes."""
import torch
import torch.nn as nn

def count_layer_macs(module, inputs, output):
    """Forward hook that accumulates this call's MAC count in module._macs."""
    macs = 0
    if isinstance(module, nn.Linear):
        # MACs = (batch × any extra leading dims) × C_in × C_out
        macs = inputs[0].numel() * module.out_features
    elif isinstance(module, nn.Conv2d):
        # MACs = batch × (C_in / groups) × C_out × K_h × K_w × H_out × W_out
        c_in = inputs[0].shape[1]
        n, c_out, h_out, w_out = output.shape
        kh, kw = module.kernel_size
        macs = n * (c_in // module.groups) * c_out * kh * kw * h_out * w_out
    module._macs = getattr(module, '_macs', 0) + macs

def profile_model_torch(model, input_shape=(1, 3, 224, 224)):
    """Profile a PyTorch model: params and MACs per Linear/Conv2d layer."""
    profiled = [m for m in model.modules() if isinstance(m, (nn.Linear, nn.Conv2d))]
    for m in profiled:
        m._macs = 0  # reset so repeated profiling does not accumulate
    hooks = [m.register_forward_hook(count_layer_macs) for m in profiled]

    x = torch.zeros(*input_shape)
    try:
        with torch.no_grad():
            model(x)
    finally:
        for h in hooks:
            h.remove()  # always detach the hooks, even if the forward pass fails

    print(f"{'Layer':<35} {'Params':>10} {'MACs':>12}")
    print("-"*60)
    total_p, total_m = 0, 0
    for name, m in model.named_modules():
        if hasattr(m, '_macs'):
            p = sum(t.numel() for t in m.parameters())
            total_p += p
            total_m += m._macs
            print(f"{name:<35} {p:>10,} {m._macs:>12,}")
    print(f"\nTotal: {total_p:,} params  {total_m:,} MACs")

# Example: profile a MobileNet-style backbone (stem conv + 4 depthwise separable blocks)
class DWSep(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin, bias=False)
        self.pw = nn.Conv2d(cin, cout, 1, bias=False)
    def forward(self, x): return self.pw(self.dw(x))

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    DWSep(32, 64), DWSep(64, 128, stride=2),
    DWSep(128, 256, stride=2), DWSep(256, 512, stride=2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(512, 1000)
)
profile_model_torch(backbone)
Related lessons: TinyML L1 — Why Efficiency & Metrics covers MACs, FLOPs, model size, and activation memory in depth. Transformer Architecture covers attention mechanisms in detail. Attention Variants covers FlashAttention, linear attention, and sparse attention. Attention in Transformers walks through QKV mechanics. Next in this series: TinyML L3 — Pruning.
The 1×1 conv (pointwise) is just a Linear layer applied at every spatial position. A 1×1 conv with C_in=128 and C_out=256 on a 56×56 feature map: params = 256×128 = 32,768. MACs = 256×128×56×56 = 102,760,448 ≈ 103M. Same params as a bias-free Linear(128, 256) layer, but 56² = 3,136× more MACs because it's applied at every spatial position. This is why a pointwise conv after a depthwise conv is cheap on params but still carries spatial MACs. For wide networks (large C), nearly the entire cost of a depthwise separable block sits in the 1×1 pointwise step: its share of the block's MACs is C_out/(C_out + K²), about 97% for C_out=256 and K=3.
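
To make "a Linear layer applied at every spatial position" concrete, here is a dependency-free sketch that builds a 1×1 conv out of a per-pixel linear map and counts the MACs as it goes. It is written with nested lists for readability, not speed, and both function names are invented for this example.

```python
def linear(x, weight):
    """y = W x for one vector. weight has C_out rows of C_in values (no bias)."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in weight]

def conv1x1(fmap, weight):
    """1x1 conv on fmap[c_in][y][x]: apply the same linear() at every spatial position."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0] * w for _ in range(h)] for _ in weight]
    macs = 0
    for y in range(h):
        for x in range(w):
            pixel = [fmap[c][y][x] for c in range(c_in)]     # channel vector at (y, x)
            for c_out, value in enumerate(linear(pixel, weight)):
                out[c_out][y][x] = value
            macs += c_in * len(weight)                       # one Linear's worth of MACs
    return out, macs

# Tiny example: 2 input channels, 3 output channels, a 1x2 feature map
weight = [[1, 2], [3, 4], [0, 1]]
fmap = [[[1, 2]], [[3, 4]]]
out, macs = conv1x1(fmap, weight)
print(out, macs)
```

The weight matrix is shared across positions, so the parameter count stays at C_in·C_out while the MAC counter grows by that amount once per pixel.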

Quick formula check — BERT-large (d=1024, 24 blocks, N=512)

BERT-large total parameters: per block = 4d² (attention projections) + 8d² (FFN) = 12d². Times 24 blocks: 24×12×1024² = 301,989,888 ≈ 302M. Adding the token embeddings (30,522 vocab × 1024 ≈ 31M) gives a total of ≈ 333M, close to the published 340M (the small discrepancy comes from biases, LayerNorm, and other terms this count ignores).
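
The same count as a reusable function. This sketch includes only what the paragraph above counts (block weight matrices and the token embedding table), so it is an approximation by construction; the function names are illustrative.

```python
def block_params(d):
    """Weights per Transformer block: 4*d^2 attention projections + 8*d^2 FFN."""
    return 4 * d * d + 8 * d * d

def bert_like_params(d, num_blocks, vocab_size):
    """Return (block weights, token embedding weights). Biases and LayerNorm are ignored."""
    return num_blocks * block_params(d), vocab_size * d

blocks, embeddings = bert_like_params(d=1024, num_blocks=24, vocab_size=30522)
print(f"blocks {blocks:,}  embeddings {embeddings:,}  total {blocks + embeddings:,}")
```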

For comparison, GPT-2 medium (d=1024, 24 layers, the same block structure as BERT-large but causal) has the same ≈302M block parameters and similar MACs per forward pass. The difference is inference mode: BERT processes all N tokens in parallel (one pass), while GPT-2 generates autoregressively (one token at a time, with the KV cache). At generation time with batch=1, every weight is loaded from DRAM to perform a single MAC, so the arithmetic intensity is only ≈ 0.5 MACs/byte at FP16 (1 FLOP/byte), which makes decoding memory-bandwidth-bound rather than compute-bound. Same formulas, completely different bottleneck.

BERT-large total MACs (N=512): per block = (4Nd² + 2N²d) + 8Nd² = 12Nd² + 2N²d. Block MACs = 12×512×1024² + 2×512²×1024 = 6,442,450,944 + 536,870,912 = 6,979,321,856 ≈ 7B. Times 24 blocks = 167.5B MACs for one forward pass at N=512. This is why BERT is expensive to run: 167.5B MACs × 2 FLOPs/MAC = 335 GFLOPs, vs roughly 8 GFLOPs (4B MACs) for ResNet-50. LLM inference at long contexts is even more expensive, driven by the N² term growing with the context window.
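
A short sketch of the forward-pass total, using the per-block formula 12·N·d² + 2·N²·d from the paragraph above and the 2 FLOPs per MAC convention. LayerNorm, softmax, and embedding lookups are left out, and the function names are made up for this example.

```python
def block_macs(d, n):
    """12*N*d^2 for the linear layers plus 2*N^2*d for the attention score term."""
    return 12 * n * d * d + 2 * n * n * d

def forward_macs(d, n, num_blocks):
    """MACs for one forward pass through num_blocks identical blocks."""
    return num_blocks * block_macs(d, n)

total = forward_macs(d=1024, n=512, num_blocks=24)
gflops = 2 * total / 1e9
print(f"{total:,} MACs = {gflops:.0f} GFLOPs at 2 FLOPs per MAC")
```

Changing `n` in the call shows how quickly the total grows once the quadratic term starts to matter.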

How to use this knowledge in practice

When evaluating a new architecture or a compressed model, run this three-step mental audit:

Step 1: Count
Compute total params, total MACs, and peak activation memory. Flag any layer that's disproportionately expensive in each dimension.
Step 2: Classify
Is the bottleneck params (late conv/FC/FFN)? MACs (early/mid conv, or long-context attention)? Activation memory (early layers with large H×W on MCUs)?
Step 3: Match
Apply the technique that targets the bottleneck. Don't quantize the first conv layer to save params (it has almost none). Don't try FlashAttention on a short-sequence BERT (N²·d is not the bottleneck). Match tool to cost type.
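
Step 1 of the audit can be sketched as a small helper that reports, for each cost dimension, which layer is largest and what share of the total it holds. The numbers are the rows of the simplified CNN from Chapter 8; using "share of the summed total" as the flag is a simplification chosen for this example, and `audit` is a hypothetical helper, not part of any library.

```python
# (params, MACs, output activation bytes at FP16) per layer of the simplified CNN
LAYERS = {
    "Conv1": (9_408, 29_503_488, 401_408),
    "Conv2": (73_728, 231_211_008, 802_816),
    "Conv3": (294_912, 231_211_008, 401_408),
    "Conv4": (1_179_648, 231_211_008, 200_704),
    "FC": (512_000, 512_000, 2_000),
}
METRICS = ("params", "MACs", "activation bytes")

def audit(layers):
    """For each metric, return (largest layer, its share of the summed total)."""
    report = {}
    for idx, metric in enumerate(METRICS):
        total = sum(stats[idx] for stats in layers.values())
        top = max(layers, key=lambda name: layers[name][idx])
        report[metric] = (top, layers[top][idx] / total)
    return report

report = audit(LAYERS)
for metric, (layer, share) in report.items():
    print(f"{metric:<17} largest in {layer} ({100 * share:.0f}% of total)")
```

Steps 2 and 3 then follow from the report: the dimension with the most lopsided share tells you which family of techniques to reach for first.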

The efficiency techniques you'll learn in subsequent lessons map directly to the cost table above. Structured pruning (L3) removes entire filters from Conv/Linear layers, directly reducing the C_out dimension and shrinking both params and MACs. Quantization (L5) reduces bits per weight: the MAC formula is untouched, but model size and bytes loaded per weight shrink, and INT8 SIMD parallelism can give ~4× throughput over FP32. Knowledge distillation (L7) trains a smaller student, often with a reduced width d: going from BERT-large (d=1024) to a student with d=256 at the same depth gives a 16× reduction in the d²-proportional params and MACs (d shrinks 4×, so d² shrinks 16×). Neural Architecture Search (L8) automates the tradeoff, searching the space of C_in, C_out, K, and layer count to minimize MACs subject to an accuracy constraint.
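
Two of these savings are pure arithmetic and worth checking by hand. The sketch below assumes 1 MB = 10⁶ bytes and counts weight storage only; it uses the 25M-parameter ResNet-50 figure from this lesson, and both helper names are illustrative.

```python
def model_size_mb(num_params, bits_per_weight):
    """Weight storage in MB (1 MB = 10^6 bytes), ignoring any quantization metadata."""
    return num_params * bits_per_weight / 8 / 1e6

def width_scaling(d_old, d_new):
    """Factor by which d^2-proportional params and MACs shrink when width goes d_old -> d_new."""
    return (d_old / d_new) ** 2

for bits in (32, 16, 8, 4):
    print(f"25M params at {bits:>2} bits: {model_size_mb(25_000_000, bits):6.1f} MB")
print("width 1024 -> 256 shrinks d^2 terms by", width_scaling(1024, 256), "x")
```

Quantization moves along the bits axis while leaving the parameter count alone; width reduction changes the count itself, and the two multiply when combined.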

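One more mental model: arithmetic intensity (AI) = MACs / bytes loaded. For a Linear layer (batch=1, FP16), counting weight traffic only: AI = C_in·C_out MACs / (C_in·C_out × 2 bytes) = 0.5 MACs/byte (1 FLOP/byte), which is deeply memory-bound. For the 3×3 Conv 64→128 on 56×56, again counting weights only: AI = 73,728 × 3,136 MACs / (73,728 × 2 bytes) = 1,568 MACs/byte (3,136 FLOPs/byte). Even after adding the input and output activation traffic the figure stays above 100 MACs/byte, which puts it on the compute-bound side for typical GPUs. This is why convolutions run efficiently on GPUs (high AI) but LLM autoregressive decoding does not (batch=1 Linear, AI ≈ 0.5 MACs/byte).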
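
Here is the same calculation as code, in MACs per byte (multiply by 2 for FLOPs per byte). The weights-only variant mirrors the approximation in the text. The variant with activations additionally assumes one read of the input and one write of the output at FP16, with a 56×56 input (stride 1); that is an idealized traffic model, not a measurement.

```python
def arithmetic_intensity(macs, weight_values, activation_values=0, bytes_per_value=2):
    """MACs per byte moved, for the given number of weight and activation values."""
    return macs / ((weight_values + activation_values) * bytes_per_value)

# Linear 768 -> 768 at batch 1, weights only
linear_ai = arithmetic_intensity(768 * 768, 768 * 768)

# 3x3 conv 64 -> 128 with a 56x56 output
conv_macs = 64 * 128 * 9 * 56 * 56
conv_weights = 64 * 128 * 9
conv_ai_weights_only = arithmetic_intensity(conv_macs, conv_weights)

# Assumption: also count the 64x56x56 input and the 128x56x56 output once each
conv_activations = 64 * 56 * 56 + 128 * 56 * 56
conv_ai_with_acts = arithmetic_intensity(conv_macs, conv_weights, conv_activations)

print(f"Linear (batch=1):     {linear_ai:.1f} MACs/byte")
print(f"Conv, weights only:   {conv_ai_weights_only:.0f} MACs/byte")
print(f"Conv, + activations:  {conv_ai_with_acts:.0f} MACs/byte")
```

The gap between the last two lines shows how much the weights-only shortcut flatters the conv, yet the ordering between conv and batch-1 Linear survives by more than two orders of magnitude.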

"What I cannot create, I do not understand." — Richard Feynman. If you can compute the parameter count and MAC cost of any layer from its shape alone, you understand that layer. Everything else in efficient deep learning is a consequence of those two numbers.
A 3×3 standard conv with 256 input and 256 output channels is replaced by depthwise separable (DW+PW). The FLOP ratio formula is 1/K²+1/C = 1/9+1/256. Which part of this formula comes from the depthwise step and which from the pointwise?