NN Building Blocks & Where Compute Lives

Chapter 0: You Can't Optimize What You Can't Measure

You're handed a ResNet-50. It scores 76% top-1 on ImageNet — solid accuracy. Your target device is a Cortex-M4 microcontroller with 512 KB of SRAM and a requirement to run inference in under 100 ms. You open up the model. There are 25 million parameters. The model at FP32 weighs 100 MB. The chip has 0.5 MB. Something has to give — but what?

The naive answer is "compress everything equally." But that's like telling a plumber to replace all the pipes when only one is leaking. Different layers have radically different cost profiles. The first conv layer in ResNet-50 has barely any parameters — but it processes the full 224×224 input and accounts for a massive fraction of activation memory. The final FC layer has millions of parameters — but processes a single vector, so MACs are low. Compress them the same way and you'll waste effort on cheap layers while leaving expensive ones untouched.

This lesson gives you the X-ray vision to see inside any architecture. By the end you'll be able to look at a layer spec — "3×3 Conv, C_in=64, C_out=128, output 56×56" — and instantly compute its parameter count, MAC count, activation size, and arithmetic intensity. That measurement is step zero before any compression technique in this course.

The key asymmetry you'll discover: In a CNN, early layers have enormous activation maps and modest parameter counts; late layers have huge parameter counts and tiny activation maps. In a Transformer, the FFN holds most parameters but attention dominates compute at long sequences. Every efficiency technique targets a specific cost type — pruning targets params, quantization targets model size and memory bandwidth, knowledge distillation targets accuracy/param tradeoff, and operator fusion targets activation memory. Know the cost; pick the right tool.

We'll cover every major layer: Linear (fully-connected), Conv2d, grouped convolution, depthwise separable convolution (the MobileNet trick), pooling, BatchNorm (and how to fold it into Conv at inference), activation functions, residual/skip connections, and the Transformer block (QKV projections, attention, FFN). Every formula will be derived, not memorized, and plugged with real numbers.

The formulas are few and simple. The insight is learning which term matters for each hardware target. On a server GPU: MACs are the bottleneck for batch inference. On a mobile CPU: memory bandwidth is the bottleneck for single-sample inference. On a microcontroller: SRAM capacity (activation memory) is the bottleneck, often tighter than either MACs or parameter storage. The same architecture can be optimal in one setting and completely infeasible in another — not because it changed, but because the bottleneck shifted.

By the end of this lesson you will know: (1) every layer type's exact formula; (2) where params, MACs, and activations peak in a CNN and a Transformer; (3) what each efficiency technique targets; (4) how to use Python hooks or manual calculation to profile any model. These are the skills every efficient ML practitioner needs before touching compression code.

Here is a map of what you'll know after each chapter:

Ch 1: Linear

Params = C_in·C_out; MACs same; reuse = 1×; quantization target

↓

Ch 2: Conv2d

Params = C_out·C_in·K²; MACs adds H·W spatial reuse; AlexNet full breakdown

↓

Ch 3: Receptive Field

RF = L·(k−1)+1; stack 3×3s for efficiency; residual memory cost

↓

Ch 4: Depthwise Separable

Ratio = 1/K² + 1/C; MobileNet 8–9× reduction; memory-bandwidth caveat

↓

Ch 6–7: Normalization + Transformer

BN folding → 0 inference cost; FFN params vs attention N²·d MACs

↓

Ch 8–9: Real networks

CNN/Transformer per-layer cost maps; compression technique targeting

A ResNet-50's first conv layer (3→64 channels, 7×7 kernel, output 112×112) has very few parameters but huge activation memory. Its last FC layer (2048→1000) has millions of parameters but tiny activation memory. Which compression technique is most naturally matched to shrinking the last FC layer's weight storage?

Activation checkpointing: recompute activations to save memory
Channel pruning on the first conv: it's the bottleneck
Weight pruning or quantization: target the large parameter tensor directly
Depthwise separable convolution: split the spatial and channel operations

Chapter 1: Linear Layer: Params & MACs

The Linear layer (also called fully-connected, or FC) is the simplest building block. Every output neuron connects to every input neuron. If the input has C_in features and the output has C_out features, then every output y_j is computed as:

y_j = ∑_i W_ji · x_i + b_j

The weight tensor W has shape (C_out, C_in). The bias b has shape (C_out,). Count the elements:

Parameters = C_in · C_out + C_out = C_out(C_in + 1)

For the MAC count, each output neuron requires C_in multiply-accumulate operations (one per input × weight pair), and there are C_out output neurons:

MACs = C_in · C_out (bias adds are negligible)

Concrete example — BERT's final projection: C_in=768, C_out=768.

Params = 768 × 768 + 768 = 590,592 ≈ 590K
MACs = 768 × 768 = 589,824

Notice params ≈ MACs for a Linear layer — one MAC per parameter, because each weight is used exactly once per forward pass (no spatial reuse). This is the key difference from Conv, where each weight gets reused across spatial positions.

AlexNet's last two FC layers (4096→4096 and 4096→1000) together hold 4096² + 4096×1000 = 20,873,216 params, 34% of AlexNet's total 61M, yet do only 20.9M MACs, 2.9% of AlexNet's 724M total. Params are expensive on storage (83.5 MB at FP32, at 4 bytes each) but cheap on compute. A 4× weight quantization from FP32→INT8 shrinks them to 20.9 MB, saving about 62.6 MB while leaving MACs unchanged. This is why quantization disproportionately benefits models with large FC layers.

The reuse gap. A Linear layer has arithmetic intensity ≈ 1 MAC/param. A Conv layer with 56×56 output has 56×56 = 3,136 MACs/param. This is why convolution is so much more compute-efficient than FC for spatial data — the same weight serves thousands of positions.
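The Linear-layer formulas and the storage arithmetic can be checked with a few lines of Python. This is a minimal sketch: `linear_cost` and `weight_bytes` are illustrative helper names rather than part of any library, and MACs are counted for a single input vector.

```python
def linear_cost(c_in, c_out, bias=True):
    """Return (params, macs) for a Linear layer applied to one input vector."""
    params = c_in * c_out + (c_out if bias else 0)
    macs = c_in * c_out  # one multiply-accumulate per weight, no reuse
    return params, macs

def weight_bytes(n_params, bits):
    """Storage in bytes for n_params values stored at `bits` bits each."""
    return n_params * bits // 8

p, m = linear_cost(768, 768)
print(f"768->768: params={p:,}  MACs={m:,}")

# Weights of the 4096->4096 and 4096->1000 layers (bias ignored)
fc_weights = (linear_cost(4096, 4096, bias=False)[0]
              + linear_cost(4096, 1000, bias=False)[0])
for bits in (32, 16, 8):
    print(f"{bits:>2}-bit storage: {weight_bytes(fc_weights, bits) / 1e6:.1f} MB")
```

Changing `bits` changes only the storage figure; the MAC count returned by `linear_cost` does not depend on precision.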

[Interactive: Linear layer parameter and MAC counter. Sliders set C_in and C_out (both default to 768); the canvas shows the weight matrix shape and live counts.]

A Linear layer maps 2048 inputs to 1000 outputs (like ResNet-50's final classifier). Ignoring bias, how many parameters does it have?

3,048: C_in + C_out
2,048,000: C_in × C_out
4,096,000: (C_in × C_out) × 2
1,000: just the output dimension

Chapter 2: Conv2d: The Workhorse

A 2D convolution layer takes an input feature map of shape (C_in, H_in, W_in) and produces an output of shape (C_out, H_out, W_out). The kernel has spatial size K_h × K_w. There is one kernel per (input channel, output channel) pair — so the weight tensor has shape (C_out, C_in, K_h, K_w).

Count the parameters: C_out output channels, each needing a kernel of size C_in × K_h × K_w, plus one bias per output channel:

Params = C_out · C_in · K_h · K_w + C_out

For MACs: each of the H_out × W_out output positions requires computing C_out output values, each needing C_in × K_h × K_w multiply-accumulates:

MACs = C_in · C_out · K_h · K_w · H_out · W_out

The crucial insight: params don't depend on spatial size, but MACs do. The same weight tensor is reused at every output position. A layer with output spatial H_out×W_out = 56×56 = 3,136 performs 3,136× more MACs than a layer with output 1×1 — with identical parameter count.

Output spatial size formula

Given input height H_in, kernel K_h, padding p, and stride s, the output height is:

H_out = ⌊(H_in + 2p − K_h) / s⌋ + 1

Common cases: "same" padding (output = input size): p = (K−1)/2 with s=1. "Valid" padding (no padding, p=0): H_out = H_in − K + 1. Stride-2 halving: p=1, s=2, K=3 → H_out = H_in/2. For the AlexNet first conv: H_in=224, K=11, p=2, s=4 → H_out = ⌊(224+4−11)/4⌋+1 = ⌊217/4⌋+1 = 54+1 = 55. ✓

This formula matters for counting MACs: if you miscompute H_out, your MAC count is wrong. The padding and stride choices also affect how many times each input pixel contributes to outputs — relevant for understanding why strided conv reduces activations without reducing params.
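The output-size formula is easy to get wrong by one, so it helps to encode it once and reuse it. The sketch below assumes symmetric padding and no dilation; `conv_out_size` is an illustrative name, not a library function.

```python
def conv_out_size(h_in, k, stride=1, padding=0):
    """Output height (or width): floor((h_in + 2*padding - k) / stride) + 1."""
    return (h_in + 2 * padding - k) // stride + 1

print(conv_out_size(224, 11, stride=4, padding=2))  # AlexNet Conv1 -> 55
print(conv_out_size(56, 3, stride=1, padding=1))    # "same" padding -> 56
print(conv_out_size(56, 3, stride=1, padding=0))    # "valid" padding -> 54
print(conv_out_size(56, 3, stride=2, padding=1))    # stride-2 halving -> 28
print(conv_out_size(55, 3, stride=2))               # 3x3, stride-2 max pool -> 27
```

The same function covers pooling layers, since they follow the same window arithmetic.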

Worked example — standard 3×3 conv, 64→128 channels, output 56×56:
Params (ignoring bias) = 128 × 64 × 3 × 3 = 73,728
MACs = 64 × 128 × 3 × 3 × 56 × 56 = 73,728 × 3,136 = 231,211,008 ≈ 231M MACs
Reuse factor = 3,136× — every weight is used 3,136 times. This is why CNNs are compute-efficient.

Now apply the formula to AlexNet's first layer: 3→96 channels, 11×11 kernel, output 55×55 (stride 4 applied to 224×224 input):

Params = 96 × 3 × 11 × 11 = 34,848
MACs = 3 × 96 × 11 × 11 × 55 × 55 = 105,415,200

That first layer alone costs 105M MACs — 14.5% of AlexNet's total 724M MACs — yet holds only 34K of AlexNet's 61M parameters (0.06%). Classic early-layer profile: cheap on params, expensive on compute due to large spatial maps.

python
def conv2d_cost(c_in, c_out, kh, kw, h_out, w_out):
    """Compute parameter count and MACs for a Conv2d layer."""
    params = c_out * c_in * kh * kw + c_out  # weights + bias
    macs = c_in * c_out * kh * kw * h_out * w_out
    return params, macs

# AlexNet layer 1: 3→96, 11×11 kernel, 55×55 output
p, m = conv2d_cost(3, 96, 11, 11, 55, 55)
print(f"Params: {p:,}  MACs: {m:,}")
# Params: 34,944  MACs: 105,415,200

# 3×3 conv 64→128 on 56×56 (common ResNet block)
p, m = conv2d_cost(64, 128, 3, 3, 56, 56)
print(f"Params: {p:,}  MACs: {m:,}")
# Params: 73,856  MACs: 231,211,008

Here is the complete AlexNet MAC and activation profile, derived from the formula above. Study the per-layer numbers — they reveal patterns you'll see in every CNN:

Layer	Output shape	Params	MACs	Activation (elem)
Conv1 3→96, 11×11, s4	96×55×55	34,848	105,415,200	290,400
MaxPool 3×3, s2	96×27×27	0	~0	69,984
Conv2 96→256, 5×5, g2	256×27×27	307,200	223,948,800	186,624
MaxPool 3×3, s2	256×13×13	0	~0	43,264
Conv3 256→384, 3×3	384×13×13	884,736	149,520,384	64,896
Conv4 384→384, 3×3, g2	384×13×13	663,552	112,140,288	64,896
Conv5 384→256, 3×3, g2	256×13×13	442,368	74,760,192	43,264
FC1 9216→4096	4096	37,748,736	37,748,736	4,096
FC2 4096→4096	4096	16,777,216	16,777,216	4,096
FC3 4096→1000	1000	4,096,000	4,096,000	1,000
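Total	—	60,954,656 ≈ 61M	724,406,816 ≈ 724M	932,264 total (includes the 150,528-element input and the 9,216-element final max-pool output, neither listed above)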

Key observations: (1) FC1 alone holds 62% of the params, and FC1+FC2 together hold 89%. (2) The five conv layers account for 92% of the MACs. (3) The largest single activation is Conv1's output (290,400 elements). Layer-by-layer execution needs a layer's input and output live at the same time, so the peak is Conv1's input (150,528 elements, not shown in the table) plus its output: 440,928 elements.
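The Params and MACs columns can be regenerated from the layer specs alone, which is a useful sanity check on any hand-built cost table. The sketch below takes the channel counts, kernel sizes, groups, and output sizes from the table and ignores biases; the tuple layout is just a convenient assumption.

```python
# (name, c_in, c_out, kernel, groups, output height = width)
convs = [
    ("Conv1", 3, 96, 11, 1, 55),
    ("Conv2", 96, 256, 5, 2, 27),
    ("Conv3", 256, 384, 3, 1, 13),
    ("Conv4", 384, 384, 3, 2, 13),
    ("Conv5", 384, 256, 3, 2, 13),
]
fcs = [("FC1", 9216, 4096), ("FC2", 4096, 4096), ("FC3", 4096, 1000)]

rows = []
for name, c_in, c_out, k, g, h in convs:
    params = c_out * (c_in // g) * k * k       # weights only
    rows.append((name, params, params * h * h))  # each weight reused h*h times
for name, c_in, c_out in fcs:
    rows.append((name, c_in * c_out, c_in * c_out))  # one MAC per weight

total_params = sum(p for _, p, _ in rows)
total_macs = sum(m for _, _, m in rows)
for name, p, m in rows:
    print(f"{name:<6} params {p:>11,} ({p / total_params:6.1%})"
          f"   MACs {m:>12,} ({m / total_macs:6.1%})")
print(f"Total  params {total_params:>11,}   MACs {total_macs:>12,}")
```

The percentage columns make the split between parameter-heavy FC layers and MAC-heavy conv layers visible at a glance.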

A 3×3 Conv2d layer has C_in=256, C_out=256, and output spatial size 13×13 (close to AlexNet's Conv3, which is 256→384). Ignoring bias, how many parameters does it have, and how many MACs?

Params=589,824; MACs=589,824 (same because output is 1×1)
Params=589,824; MACs=7,667,712 (=589,824×13)
Params=589,824; MACs=99,680,256 (=589,824×169)
Params=2,359,296; MACs=99,680,256 (included spatial dims in params)

Chapter 3: Receptive Field & Spatial Reach

The receptive field of an output neuron is the region of the original input that influences its value. A single 3×3 conv layer sees a 3×3 patch. Stack two 3×3 conv layers and the output sees a 5×5 patch of the original input. Stack three and it sees 7×7. The formula for L layers with kernel size k (and stride 1) is:

RF = L · (k − 1) + 1

For L=3, k=3: RF = 3×2+1 = 7×7. That's the same receptive field as a single 7×7 conv — but with dramatically fewer parameters. A single 7×7 conv with C channels has 7²=49 weights per (in,out) channel pair; three stacked 3×3 convs have 3×3²=27 — 45% fewer params, with an identical view of the input. This is why VGG replaced large kernels with stacked 3×3s.

Receptive field matters for efficiency because large RF is needed for scene understanding — but achieving it with a single large kernel is parameter-inefficient. Stacking small kernels also introduces more non-linearity (one ReLU per layer), which improves representational power. The tradeoff: more layers = more activation memory, since each intermediate activation map must be kept live for the residual connection or gradient computation.

Stride multiplies RF growth. A stride-2 conv doubles the effective RF reach in the next layer. Modern networks use strided convolutions or pooling to rapidly expand RF while reducing spatial resolution — shrinking activation memory at the cost of spatial precision.

VGG vs single large kernel: VGG-16 has 13 conv layers, all 3×3. By the formula above, 13 stacked 3×3 layers at stride 1 reach RF = 13×2+1 = 27, and VGG's interleaved pooling pushes the real figure higher still. Taking C channels throughout, the 3×3 stack costs 13×3²×C² = 117C² params; a single 27×27 layer would need 27²×C² = 729C² params, about 6.2× more for the same RF. Plus each 3×3 layer introduces a nonlinearity, giving VGG's stacked 3×3s better discriminative power per parameter. This "stack small kernels" insight is now standard across most CNN architectures. Dilated (atrous) convolutions offer another approach: use K=3 with dilation rate r to sample input at positions separated by r pixels, achieving RF = K+(K−1)(r−1) with no increase in params or MACs.
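The stride-1 formula, the stride effect, and the dilation formula are all special cases of one recurrence: each layer widens the receptive field by (effective kernel − 1) times the current spacing between neighbouring positions, measured in input pixels. A minimal sketch, assuming each layer is described by a hypothetical (kernel, stride, dilation) tuple:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation), ordered input to output.
    Returns the receptive field, in input pixels, of one output position."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent positions
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1   # extent of the dilated kernel
        rf += (k_eff - 1) * jump  # growth is scaled by the accumulated stride
        jump *= s
    return rf

print(receptive_field([(3, 1, 1)] * 3))           # three 3x3 convs -> 7
print(receptive_field([(7, 1, 1)]))               # one 7x7 conv -> 7
print(receptive_field([(3, 1, 2)]))               # one 3x3 conv, dilation 2 -> 5
print(receptive_field([(3, 2, 1), (3, 1, 1)]))    # stride 2 then 3x3 -> 7
print(receptive_field([(3, 1, 1), (3, 1, 1)]))    # same depth, stride 1 -> 5
```

The last two lines show the stride effect: after a stride-2 layer, the next 3×3 layer adds 4 input pixels to the receptive field instead of 2.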

[Interactive: Receptive field growth visualizer. Sliders set the number of layers L (default 2) and kernel size k (default 3); the receptive field, RF = L×(k−1)+1, is drawn as an orange square on the input grid.]

You stack 5 layers of 3×3 convolutions (stride 1, no pooling). What is the receptive field size at the final output?

9×9: three 3×3 kernels tiled
7×7: after 4 layers
11×11: using RF = L×(k−1)+1 = 5×2+1
15×15: kernels compound multiplicatively

Chapter 4: Grouped & Depthwise Separable Conv

Standard convolution treats all input channels jointly for all output channels. What if we could break that coupling? Grouped convolution divides the C_in input channels into g equal groups and applies a separate, narrower convolution within each group. Each group sees only C_in/g input channels and produces C_out/g output channels:

Params_grouped = C_out · C_in · K_h · K_w / g
MACs_grouped = C_in · C_out · K_h · K_w · H_out · W_out / g

Both params and MACs shrink by exactly g×. AlexNet used g=2 to split across two GPUs in 2012 — it was an engineering hack that became a principled design choice.

Worked example — AlexNet Conv2, g=2: C_in=96, C_out=256, K=5×5. Standard conv params would be 256×96×25 = 614,400. With g=2: params = 614,400/2 = 307,200. MACs similarly halved: 96×256×25×27×27/2 = 223,948,800 (vs 447,897,600 for standard). The g=2 grouping effectively creates two parallel conv towers processing 48 input channels each and producing 128 output channels each — they never interact except at the next layer.

When does grouping hurt accuracy? Grouped convolutions prevent cross-group information mixing. With g=4 on a 128-channel layer, for example, input channels 0–31 never influence output channels 32–63 at this layer. In practice, this hurts accuracy less than expected because later pointwise (1×1) convolutions can re-mix all channels. ShuffleNet exploits this explicitly: after a grouped conv, it "shuffles" the channels across groups before the next layer, giving cross-group communication at nearly zero cost.
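The grouped-conv formulas can be sketched as a small extension of the standard conv cost, with each group seeing only C_in/g input channels. `grouped_conv_cost` is an illustrative helper (weights only, bias ignored) and assumes the group count divides both channel counts.

```python
def grouped_conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Return (params, macs) for a grouped k x k conv, ignoring bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    params = c_out * (c_in // groups) * k * k  # each filter spans c_in/groups channels
    macs = params * h_out * w_out              # every weight reused at each position
    return params, macs

# AlexNet Conv2: 96 -> 256, 5x5 kernel, 27x27 output
for g in (1, 2):
    p, m = grouped_conv_cost(96, 256, 5, 27, 27, groups=g)
    print(f"g={g}: params={p:,}  MACs={m:,}")
# g=1: params=614,400  MACs=447,897,600
# g=2: params=307,200  MACs=223,948,800
```

Setting `groups` equal to the channel count gives the depthwise case, where each filter spans a single channel.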

Depthwise Separable Convolution — the MobileNet trick

Take grouped convolution to the extreme: set g = C_in = C_out = C. Now each channel gets its own independent K×K spatial filter. This is a depthwise convolution. It captures spatial structure within each channel but never mixes channels. To mix channels afterward, apply a 1×1 pointwise convolution. Together they form depthwise separable convolution.

Step 1 — Depthwise conv (C channels, K×K kernel, output H_out×W_out):

Params_DW = C · K²
MACs_DW = C · K² · H_out · W_out

Step 2 — Pointwise conv (C_in=C → C_out, 1×1 kernel):

Params_PW = C_out · C · 1 · 1 = C_out · C
MACs_PW = C · C_out · H_out · W_out

Total depthwise separable (C_in=C_out=C for simplicity):

MACs_DS = C · K² · H · W + C² · H · W = C · H · W · (K² + C)

Standard conv (same dimensions):

MACs_std = C² · K² · H · W

The FLOP reduction ratio:

ratio = MACs_DS / MACs_std = (K² + C) / (C · K²) = 1/C + 1/K²

For typical values: K=3, C=128 → ratio = 1/128 + 1/9 ≈ 0.119, about 8.4× fewer MACs. For K=3, C=256 → ratio ≈ 0.115, about 8.7×. Once C is much larger than K², the 1/C term is negligible and the ratio approaches 1/K²; the 1/C term only matters for narrow layers, where C is comparable to K².

Worked example — standard vs depthwise separable, C=128, K=3, 56×56:
Standard: MACs = 128 × 128 × 9 × 56 × 56 = 462,422,016 ≈ 462M
Depthwise: MACs_DW = 128 × 9 × 3136 = 3,612,672
Pointwise: MACs_PW = 128 × 128 × 3136 = 51,380,224
DS total: 54,992,896 ≈ 55M — 8.4× reduction. MobileNets achieve near-ResNet accuracy at this cost.

MobileNetV1 (Howard et al. 2017) is a VGG-like network in which every standard 3×3 conv except the first layer is replaced with a depthwise separable conv. Result: about 27× fewer MACs than VGG-16 (569M vs 15.3B), with only about a 1% top-1 accuracy drop on ImageNet. This single architectural change, one formula substitution, made real-time vision on smartphones feasible. MobileNetV2 adds inverted residuals, and MobileNetV3 adds hard-swish activations, for even better efficiency.

Misconception: depthwise conv is always faster in wall-clock time. Depthwise conv has fewer FLOPs, but its arithmetic intensity is very low: each input value loaded feeds only K² MACs, versus K²·C_out in a standard conv, because there is no cross-channel reuse. On hardware with highly-optimized GEMM (GPUs, NPUs), a standard conv packs data efficiently into matrix multiplications. A depthwise conv does not. In practice, a depthwise layer can be memory-bandwidth-bound even though it does fewer FLOPs. Dedicated hardware support (like Google Edge TPU's custom depthwise kernel) is needed to fully realize the speedup.

python
def depthwise_sep_cost(c, k, h_out, w_out, c_out=None):
    """Depthwise separable conv: DW + PW."""
    if c_out is None: c_out = c
    # Depthwise: one k×k filter per channel
    params_dw = c * k * k
    macs_dw = c * k * k * h_out * w_out
    # Pointwise: 1×1 conv mixing channels
    params_pw = c_out * c
    macs_pw = c * c_out * h_out * w_out
    return params_dw + params_pw, macs_dw + macs_pw

def std_conv_cost(c_in, c_out, k, h_out, w_out):
    return c_out * c_in * k * k, c_in * c_out * k * k * h_out * w_out

c, k, h = 128, 3, 56
p_std, m_std = std_conv_cost(c, c, k, h, h)
p_ds, m_ds = depthwise_sep_cost(c, k, h, h)
print(f"Standard:  params={p_std:,}  MACs={m_std:,}")
print(f"DS:        params={p_ds:,}   MACs={m_ds:,}")
print(f"Reduction: {m_std/m_ds:.1f}× MACs  {p_std/p_ds:.1f}× params")
# Standard:  params=147,456  MACs=462,422,016
# DS:        params=17,536   MACs=54,992,896
# Reduction: 8.4× MACs  8.4× params

The FLOP reduction ratio of depthwise separable vs standard conv is 1/C + 1/K². For K=3 and C=512, approximately what is the reduction factor (i.e., how many times fewer MACs)?

About 3×: the kernel dominates
About 5×: arithmetic average of C and K² savings
About 9×: dominated by 1/K² = 1/9, since 1/C ≈ 0
About 512×: equal to the channel count reduction

Chapter 5: Showcase: Conv Cost Comparator

Drag the sliders to set C_in, C_out, kernel size K, and output spatial size H. The canvas computes and compares parameters and MACs side-by-side for standard conv, grouped conv (g=4), and depthwise separable conv. Watch how the relative costs shift as you change the parameters.

The three architectures represent three points on the parameter-accuracy tradeoff frontier:

Standard conv: maximum cross-channel interaction at every spatial position. Best accuracy per layer but most expensive. Used in ResNets, VGG, early stages of EfficientNet.
Grouped conv (g=4): 4× cheaper. Each output channel sees only 1/4 of the input channels. Used in ResNeXt (g=32) and ShuffleNet (g=3–8). Needs channel shuffle or 1×1 convs to recover cross-group communication.
Depthwise separable: 8–9× cheaper for K=3. Complete spatial-channel factorization. Used in MobileNet, EfficientNet-B0, and Xception. Requires pointwise conv to restore expressiveness.

What to try: Set C=32 (narrow network) vs C=512 (wide). For narrow channels the 1/C term is still noticeable (1/32 ≈ 0.03 on top of 1/9 ≈ 0.11), so depthwise separation saves only about 7×. At C=512 the 1/C term all but vanishes and the saving is nearly 9×, the 1/K² limit. This is why depthwise separable works best in wide networks. Also try K=1 (pointwise conv): depthwise separable no longer saves anything, since there's no spatial operation to separate, while grouped conv is still 4× cheaper.
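For readers without the interactive canvas, the same comparison can be run as a short script. This is a sketch using the MAC formulas from Chapter 4 with C_in = C_out = C and a square output; `conv_macs` and `ds_macs` are illustrative names.

```python
def conv_macs(c, k, h, groups=1):
    """MACs of a k x k conv with c input and c output channels on an h x h output."""
    return c * (c // groups) * k * k * h * h

def ds_macs(c, k, h):
    """MACs of a depthwise k x k conv followed by a 1x1 pointwise conv."""
    return c * k * k * h * h + c * c * h * h

for c in (32, 128, 512):
    std = conv_macs(c, 3, 56)
    grp = conv_macs(c, 3, 56, groups=4)
    ds = ds_macs(c, 3, 56)
    print(f"C={c:3d}: grouped {std / grp:.1f}x   depthwise-sep {std / ds:.1f}x")
# C= 32: grouped 4.0x   depthwise-sep 7.0x
# C=128: grouped 4.0x   depthwise-sep 8.4x
# C=512: grouped 4.0x   depthwise-sep 8.8x
```

The grouped saving is fixed at g regardless of width, while the depthwise separable saving creeps toward K² = 9 as C grows.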

[Interactive: Standard vs Grouped (g=4) vs Depthwise Separable, live comparison. Sliders set C_in = C_out (default 128), kernel K (default 3), and output size H=W (default 56).]

You move from standard conv to depthwise separable with K=3, C_in=C_out=64. The showcase canvas shows approximately 8× MAC reduction. If you then double the channel count to C=128 (keeping K=3), the reduction factor will:

Stay the same: the ratio only depends on K, not C
Increase slightly: the 1/C term shrinks, so the reduction moves toward the K² = 9× limit
Decrease: more channels means more mixing needed
Double: the savings scale with C

Chapter 6: Pooling, Normalization & BN Folding

Pooling

Pooling downsamples the spatial dimensions of a feature map. Unlike conv, it has no learnable parameters — it applies a fixed operation (max or average) over a K×K window, typically with stride = K so windows don't overlap. For an input of shape (C, H, W) and pool size K, the output is (C, H/K, W/K), reducing the spatial area by K².

Params_pool = 0
Ops_pool ≈ C · (H/K) · (W/K) · K² comparisons (max) or adds (avg); these are not multiply-accumulates

Pooling is cheap — no weights to store or load — but it destroys spatial information irreversibly. A 2×2 max pool after a 256×56×56 feature map produces 256×28×28, cutting activation size by 4× and making all subsequent layers cheaper.

Global Average Pooling (GAP) is the modern alternative to large FC layers for classification. Instead of flattening the final feature map (e.g., 512×7×7 = 25,088 values → 25,088-dimensional FC), GAP computes the average of each channel's spatial map → one value per channel (512 values). The subsequent FC is then 512→1000 (512K params) rather than 25,088→1000 (25.1M params). GAP was popularized by NiN and used in all MobileNets. Cost: C additions per spatial position, no parameters. The spatial structure is lost but the channel-level summary is preserved.
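The effect of pooling on shapes, and of GAP on classifier size, can be sketched with two small helpers. Both names are illustrative, the pool uses no padding, and the classifier count covers weights only.

```python
def pool_out_shape(c, h, w, k, stride=None):
    """Output shape of a k x k pool with no padding; stride defaults to k."""
    stride = k if stride is None else stride
    return c, (h - k) // stride + 1, (w - k) // stride + 1

def classifier_params(c, h, w, n_classes, use_gap):
    """FC weight count after either flattening (C*H*W features) or GAP (C features)."""
    features = c if use_gap else c * h * w
    return features * n_classes

print(pool_out_shape(256, 56, 56, k=2))                       # (256, 28, 28)
print(f"{classifier_params(512, 7, 7, 1000, use_gap=False):,}")  # 25,088,000
print(f"{classifier_params(512, 7, 7, 1000, use_gap=True):,}")   # 512,000
```

With a 7×7 final map, GAP shrinks the classifier by exactly the spatial area, 49×.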

Strided convolution vs pooling: Rather than conv + pool, modern architectures often use a strided conv (stride=2) to downsample. A stride-2 3×3 conv produces the same output spatial size as conv + 2×2 pool, but the conv layer has learned parameters and can learn how to downsample rather than applying a fixed max/avg. ResNet uses stride-2 convolutions; VGG uses max pooling. The tradeoff: strided conv has more params but is potentially more expressive and eliminates a separate memory pass for the pool layer.

Activation Functions

Every layer's output flows through an activation function — a non-linearity that gives deep networks their expressive power. Without activations, a 100-layer network would collapse to a single linear transformation. The cost: one additional operation per activation value, but zero learnable parameters for most activations.

ReLU (Rectified Linear Unit) is the default: y = max(0, x). One comparison per activation, no parameters, no exp() calls. For a 128×56×56 feature map, that's 401,408 comparisons — trivial. ReLU6 (y = min(max(0,x), 6)) is used in MobileNet to prevent very large activations, which aids fixed-point quantization (values stay in [0,6]).

Swish (y = x/(1+e^−x)) and GELU (used in Transformers) are more expensive: each requires a transcendental function such as exp(), roughly 4–8× the cost of ReLU. Hard Swish (y = x·ReLU6(x+3)/6, which is 0 for x ≤ −3 and x for x ≥ 3) approximates Swish using only multiplies, adds, and clamps, with no exp(); it is used in MobileNetV3 for hardware efficiency.

Activation cost is rarely the bottleneck: a ReLU over 401K activations is negligible compared to the 231M MACs of the preceding conv. But on quantized inference engines, activations are also quantized (INT8 range), and the choice of activation affects the quantization range and precision. ReLU6 clips its output to the known range [0, 6], so its quantization range can be fixed without calibration. Hard Swish is built from the same bounded ReLU6 gate and avoids exp(), though its output still grows with x for positive inputs.
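The activation definitions above fit in a few lines of scalar Python, which also makes it easy to see how closely Hard Swish tracks Swish. This is a sketch for illustration only; real inference engines apply these element-wise over whole tensors, and the sampling grid below is an arbitrary choice.

```python
import math

def relu(x):
    return max(0.0, x)

def relu6(x):
    return min(max(0.0, x), 6.0)

def swish(x):
    return x / (1.0 + math.exp(-x))   # needs one exp() per value

def hard_swish(x):
    return x * relu6(x + 3.0) / 6.0   # multiply, add, clamp only

# Sample [-6, 6] in steps of 0.1 and measure the worst approximation error
xs = [i / 10 for i in range(-60, 61)]
max_err = max(abs(swish(x) - hard_swish(x)) for x in xs)
print(f"max |swish - hard_swish| on [-6, 6]: {max_err:.3f}")
print(hard_swish(-4.0), hard_swish(4.0))  # zero below -3, identity above 3
```

The largest gap sits near x = ±3, where the piecewise-linear gate meets its clamps.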

Residual / Skip Connections

Residual connections (introduced in ResNet, He et al. 2016) add the input of a block directly to its output: y = F(x) + x. The function F is typically two or three conv layers. This lets the network learn residuals — small corrections to the identity — rather than full transformations, making deep networks trainable.

Cost of residual connections: zero additional parameters and zero additional MACs (just an element-wise add). But they require that input and output have the same shape. When the spatial size or channel count changes (e.g., stride-2 downsampling), a 1×1 conv projection shortcut is used to match dimensions — adding params = C_out×C_in and MACs = C_out×C_in×H_out×W_out.

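Projection shortcut worked example (ResNet-50-style bottleneck block, illustrative dimensions): The block goes from 64 to 256 channels at stride 2 (28×28 output). A 1×1 projection conv is needed: C_in=64, C_out=256, K=1, H=28, W=28. (In the published ResNet-50, the first 64→256 projection runs at stride 1 on a 56×56 map; the formula is the same.)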

Params_proj = 256 × 64 × 1 × 1 = 16,384
MACs_proj = 256 × 64 × 28 × 28 = 12,845,056 ≈ 12.8M

For comparison, the 3×3 conv in the same block (64→64): MACs = 64×64×9×28×28 = 28,901,376 ≈ 29M. The projection shortcut costs 44% as many MACs as that 3×3 conv, so it is not free. But it occurs only once per stage (4 times in ResNet-50), so its aggregate cost is small relative to the full network's roughly 4B MACs.

Activation memory penalty of residuals. A residual connection requires keeping the input x alive in memory until after F(x) completes — you can't free it early. For a ResNet block with 256×28×28 input at FP16, that's 401 KB held in SRAM for the duration of the block. Deep networks with many residuals accumulate this: on an MCU, this can make activation memory the primary bottleneck even more than the conv outputs themselves.
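Both residual costs, the projection shortcut's MACs and the bytes pinned by the skip path, reduce to one-line formulas. The sketch below reuses the dimensions from the worked example; `projection_cost` and `tensor_bytes` are illustrative names.

```python
def projection_cost(c_in, c_out, h_out, w_out):
    """Return (params, macs) for a 1x1 projection shortcut, ignoring bias."""
    params = c_out * c_in
    return params, params * h_out * w_out

def tensor_bytes(c, h, w, bits=16):
    """Bytes needed to hold a C x H x W activation at the given precision."""
    return c * h * w * bits // 8

proj_params, proj_macs = projection_cost(64, 256, 28, 28)
conv3x3_macs = 64 * 64 * 3 * 3 * 28 * 28
print(f"projection: params={proj_params:,}  MACs={proj_macs:,}")
print(f"relative to the 3x3 conv: {proj_macs / conv3x3_macs:.0%}")

# Skip-path input that must stay live until the block's add completes
print(f"held input: {tensor_bytes(256, 28, 28) / 1000:.0f} KB at FP16")
```

Dropping `bits` to 8 halves the held-input figure, which is one reason INT8 activations matter on SRAM-limited devices.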

Batch Normalization (training)

Batch Normalization normalizes the activations within a mini-batch to have zero mean and unit variance, then applies a learned scale γ and shift β per channel. During training for a conv feature map of shape (N, C, H, W), BN computes the mean and variance across (N, H, W) for each of the C channels:

μ_c = mean over (N,H,W)
σ_c² = variance over (N,H,W)

y_c = γ_c · (x_c − μ_c) / √(σ_c² + ε) + β_c

This adds 2C trainable parameters (γ and β, one scalar per channel) — tiny. But at inference time, μ and σ are fixed (running statistics from training). The operation becomes a simple linear transform: multiply by (γ/σ) and add (β − γμ/σ). And a linear transform after a convolution is itself a convolution — so BatchNorm can be folded into the preceding conv layer.

LayerNorm (used in Transformers) normalizes across the feature dimension (d) for each token independently, rather than across the batch. For a token vector of size d: 2d parameters (γ, β), computed per token. Unlike BatchNorm, LayerNorm statistics change per-sample — it cannot be folded the same way. Its inference cost is 5 ops per element (subtract mean, divide std, scale, shift + the mean/var computation). For a Transformer with N=512 tokens and d=768: LayerNorm costs 512×768×5 ≈ 2M ops per block — negligible compared to the FFN's 2.4B MACs.

BN vs LN at a glance:
BatchNorm: normalizes over (N,H,W), parameters = 2C, folds into conv at inference → 0 inference cost.
LayerNorm: normalizes over d (feature dim), parameters = 2d, does NOT fold into linear, always present at inference.
GroupNorm: normalizes over (H,W) and the channels within each group, per sample; works with batch size 1, used in detection/segmentation.
InstanceNorm: normalizes over (H,W) per sample per channel — used in style transfer.

BatchNorm Folding (the inference speedup)

After training, the conv + BN sequence is:

y = γ · (W * x + b − μ) / σ + β

Rearranging into a single convolution with new weights W' and bias b':

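W'_j = W_j · (γ_j / σ_j)
b'_j = (b_j − μ_j) · (γ_j / σ_j) + β_j

Here j indexes the output channel and σ_j = √(running_var_j + ε), the same denominator BN uses at inference.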

The result: zero additional MACs at inference. The BN layer simply disappears into the conv's weights. You save both the normalization arithmetic AND the memory reads for γ, β, μ, σ. Most inference engines (TensorRT, ONNX Runtime, TFLite) do this automatically as a graph optimization pass.

BatchNorm folding requires frozen statistics. It only works at inference because μ and σ are running averages frozen after training. During training (or fine-tuning), BN must remain separate because μ and σ update with each batch. Always export with eval() mode in PyTorch before folding.
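The folding identity can be verified numerically without any framework. The sketch below is a toy check in pure Python, not how an inference engine implements the pass: each output channel's kernel is flattened to a list of C_in·K·K weights, the conv is evaluated at a single output position (a dot product with one input patch), and all numbers are made up. It assumes the common convention that the BN denominator is sqrt(running_var + eps).

```python
import math

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold frozen BN statistics into conv weights and bias, one output channel at a time."""
    folded_w, folded_b = [], []
    for w_j, b_j, g, bt, mu, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)               # gamma_j / sigma_j
        folded_w.append([w * scale for w in w_j])    # W'_j = W_j * scale
        folded_b.append((b_j - mu) * scale + bt)     # b'_j = (b_j - mu_j) * scale + beta_j
    return folded_w, folded_b

def conv_at_one_position(weights, bias, patch):
    """Conv output for one spatial position: per-channel dot product plus bias."""
    return [sum(w * x for w, x in zip(w_j, patch)) + b_j
            for w_j, b_j in zip(weights, bias)]

def batchnorm_inference(y, gamma, beta, mean, var, eps=1e-5):
    return [g * (y_j - mu) / math.sqrt(v + eps) + bt
            for y_j, g, bt, mu, v in zip(y, gamma, beta, mean, var)]

# Toy layer: 2 output channels, 3 flattened weights each (hypothetical values)
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
bias = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.3]
mean, var = [0.4, -0.1], [2.0, 0.5]
patch = [1.0, 2.0, 3.0]

reference = batchnorm_inference(conv_at_one_position(weights, bias, patch),
                                gamma, beta, mean, var)
w_f, b_f = fold_bn(weights, bias, gamma, beta, mean, var)
folded = conv_at_one_position(w_f, b_f, patch)
max_diff = max(abs(a - b) for a, b in zip(reference, folded))
print(f"max difference between conv+BN and folded conv: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, and the folded path performs no BN arithmetic at all.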

[Interactive: BatchNorm folding, before vs after (ops at inference). A toggle switches between the Conv+BN pipeline (training) and the folded single conv (inference); green checkmarks mark the operations eliminated.]

A Conv2d + BatchNorm block has 256 output channels. BatchNorm has 4 parameter tensors: γ (256), β (256), running_mean (256), running_var (256) — total 1,024 values. After folding BN into conv at inference, how many of these BN parameters remain as separate computations?

1,024: all remain, they are just computed once
512: only γ and β are absorbed, mean/var remain
256: only the bias term remains
0: all four tensors are absorbed into the conv weights and bias

Chapter 7: The Transformer Block

The Transformer block has two sub-components: Multi-Head Self-Attention (MHSA) and a Feed-Forward Network (FFN), each preceded by LayerNorm. For a sequence of N tokens with model dimension d (e.g. BERT-base: N=512, d=768), let's derive where every MAC and every parameter lives.

QKV Projections

Three Linear projections map input X (shape N×d) to Queries, Keys, and Values — each of shape N×d. Three weight matrices, each d×d:

Params_QKV = 3 × d²
MACs_QKV = 3 × N × d²

For BERT-base (d=768, N=512): Params_QKV = 3×768²=1,769,472; MACs = 3×512×768²=905,969,664 ≈ 906M per block.

Attention computation

Attention scores are Q × K^T (shape N×N), requiring N² × d multiply-accumulates. Then comes a softmax over each row of N scores, then the weighted sum of V (another N² × d MACs):

MACs_attn ≈ 2 × N² × d
Params_attn = 0 (no weights in the matrix multiply)

For N=512, d=768: MACs_attn = 2×512²×768 = 402,653,184 ≈ 403M. Notice: these MACs scale as N² — double the sequence length, quadruple the attention cost. This is the quadratic scaling problem of standard Transformers.

Output projection

Another d×d linear map merges all attention heads: Params = d², MACs = N×d². For BERT-base: Params = 589,824; MACs = 512×768² = 301,989,888 ≈ 302M.

Combining the attention-related components (QKV projections, attention computation, output projection) for one block, all heads together, with d=768, N=512:

Params_{attn block} = 3d² + d² = 4d² = 4 × 768² = 2,359,296 ≈ 2.4M

MACs_{attn block} = 3Nd² + 2N²d + Nd² = 4Nd² + 2N²d

At N=512, d=768: MACs = 4×512×768² + 2×512²×768 = 1,207,959,552 + 402,653,184 = 1,610,612,736 ≈ 1.6B MACs.

Feed-Forward Network (FFN)

Two linear layers: d → 4d → d (the 4× expansion is standard), with a nonlinearity between them (GELU in BERT):

Params_FFN = d × 4d + 4d × d = 8d²
MACs_FFN = 2 × N × 4d² = 8 × N × d²

For BERT-base (d=768, N=512): Params_FFN = 8×768² = 4,718,592 ≈ 4.7M; MACs_FFN = 8×512×768² = 2,415,919,104 ≈ 2.4B per block.

Multi-head attention (MHA) vs single-head: Standard Transformers split attention into h heads (h=12 for BERT-base). Each head uses dimension d/h=64. The QKV weight matrices are still each d×d (just logically partitioned into h blocks). So multi-head attention has the same parameter count and MACs as single-head — the split is conceptual, not a cost change. The benefit is multiple attention patterns (different heads learn different relationships) without additional compute.

The key asymmetry in Transformers: FFN holds most of the parameters (about two-thirds of a block: 8d² vs 4d² for the attention projections). But at long sequences, attention dominates MACs because its score and weighted-sum matmuls scale as N² while FFN scales as N. For BERT at N=512, FFN MACs ≈ 2.4B vs ≈ 1.6B for the whole attention sub-block, of which only 403M is the N² term (the rest is the 906M QKV and 302M output projections). Flip this: at N=4096, the N² term alone grows to 2×4096²×768 ≈ 25.8B while FFN grows linearly to 8×4096×768² ≈ 19.3B, so attention now dominates. The N² term overtakes the FFN where 2N²d = 8Nd², that is, at N = 4d = 3072 for d=768. This crossover is why long-context efficiency research (FlashAttention, linear attention) focuses on the N² term.
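The per-component formulas from this chapter can be collected into one function and swept over sequence length to locate the crossover. This is a sketch that ignores LayerNorm, biases, and softmax; the function name and dictionary keys are illustrative.

```python
def transformer_block_cost(n, d, ffn_mult=4):
    """Return (params, macs) dicts for one Transformer block, weights only."""
    params = {
        "qkv": 3 * d * d,
        "scores_and_mix": 0,            # Q.K^T and attn.V have no weights
        "out_proj": d * d,
        "ffn": 2 * ffn_mult * d * d,
    }
    macs = {
        "qkv": 3 * n * d * d,
        "scores_and_mix": 2 * n * n * d,  # the only term quadratic in n
        "out_proj": n * d * d,
        "ffn": 2 * ffn_mult * n * d * d,
    }
    return params, macs

d = 768
for n in (512, 1536, 3072, 4096):
    _, macs = transformer_block_cost(n, d)
    attn_block = macs["qkv"] + macs["scores_and_mix"] + macs["out_proj"]
    print(f"N={n:4d}: N^2 term {macs['scores_and_mix'] / 1e9:5.1f}B"
          f"   attention sub-block {attn_block / 1e9:5.1f}B"
          f"   FFN {macs['ffn'] / 1e9:5.1f}B")
```

Two crossovers show up in the sweep: the whole attention sub-block matches the FFN at N = 2d, and the N² term alone matches it at N = 4d.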

[Interactive: Transformer block cost breakdown, FFN vs attention. A sequence-length slider (default N=512) shows attention MACs (∝ N²) overtaking FFN MACs (∝ N); model dimension d is fixed at 768.]

For a single Transformer block with d=1024 and N=2048, which component has the most parameters (ignoring LayerNorm and biases)?

A. QKV projections: 3 matrices of size d×d
B. Attention output projection: d×d matrix
C. FFN: two matrices d×4d and 4d×d = 8d² total
D. Attention score matrix QK^T: it's N×N which is 2048×2048

Chapter 8: Where Cost Lives in Real Networks

Formulas are one thing; seeing them in a real architecture is another. Let's profile a simplified CNN (ResNet-style) and a Transformer block side-by-side, computing params, MACs, and activation memory per layer. The pattern that emerges is the foundation of every efficiency decision in the rest of this course.

The CNN asymmetry

Consider a simplified CNN: Input 3×224×224 → Conv 3→64, 7×7, stride 4 → Conv 64→128, 3×3 → Conv 128→256, 3×3, stride 2 → Conv 256→512, 3×3, stride 2 → GlobalAvgPool → FC 512→1000. Spatial sizes shrink as we go deeper.

The output spatial size at each layer: 224→56 (stride 4), 56→56, 56→28 (via stride 2 in Conv3), 28→14, 14→1 (GlobalAvgPool). Channel counts grow to compensate: 3→64→128→256→512. This is the canonical ResNet-style "halve spatial, double channels" design.
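The spatial sizes follow from the standard convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1. The text does not state the paddings, so the sketch below assumes padding 3 for the 7×7 layer and padding 1 for the 3×3 layers; those values are assumptions chosen to reproduce the sizes listed above.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output height/width of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding); the paddings are assumed, not given in the text
layers = [(7, 4, 3), (3, 1, 1), (3, 2, 1), (3, 2, 1)]
size = 224
sizes = []
for kernel, stride, padding in layers:
    size = conv_out_size(size, kernel, stride, padding)
    sizes.append(size)
print(sizes)  # [56, 56, 28, 14]
```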

Layer	Output shape	Params	MACs	Act. size (FP16)
Conv1 7×7 3→64 s4	64×56×56	9,408	29.5M	401 KB
Conv2 3×3 64→128	128×56×56	73,728	231M	802 KB
Conv3 3×3 128→256 s2	256×28×28	294,912	231M	401 KB
Conv4 3×3 256→512 s2	512×14×14	1,179,648	231M	201 KB
FC 512→1000	1000	512,000	512K	2 KB

The pattern is clear: early layers — huge activation maps (401–802 KB), few params (9K–74K), big spatial MACs. Late conv layers — many params (1.2M for Conv4), smaller spatial maps (14×14), same total MACs (spatial shrinks as channels expand). FC layer — half-million params but trivial MACs (no spatial dimension).

Now consider the activation memory peak. The first conv takes a 3×224×224 input (301 KB at FP16) and produces a 64×56×56 output (401 KB). During computation, both must be live simultaneously: that's 702 KB just for that one layer. On a 256 KB MCU, this is impossible without a patch-based (tiled) inference approach. This is exactly the MCUNet problem: MobileNetV2 reduces parameters dramatically but its peak activation memory is still >1 MB.
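The live-memory arithmetic for the first layer, written out. This is a simplified sketch: it counts only the layer's input and output tensors at FP16 and ignores weights and scratch buffers; the 256 KB budget is taken as 256 × 1024 bytes.

```python
def tensor_bytes(c: int, h: int, w: int, bytes_per_elem: int = 2) -> int:
    """Size of a C x H x W activation tensor (FP16 by default)."""
    return c * h * w * bytes_per_elem

conv1_in = tensor_bytes(3, 224, 224)     # 301,056 bytes
conv1_out = tensor_bytes(64, 56, 56)     # 401,408 bytes
live = conv1_in + conv1_out              # both tensors are alive during the layer
sram_budget = 256 * 1024

print(f"live = {live:,} bytes, fits in SRAM: {live <= sram_budget}")
```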

Compression targets differ by layer: To save activation memory (bottleneck on MCUs) → target early layers or reduce input resolution. To save parameter storage → target late conv and FC layers. To save inference latency on compute-bound hardware → target the layers with most MACs (often the mid-network convolutions with large channels AND spatial dimensions).

The Transformer per-layer breakdown (BERT-base, d=768)

A BERT-base Transformer has 12 identical blocks. Each block's cost (N=512):

Component	Params	MACs (N=512)	% of block MACs
QKV projections (3×d²)	1,769,472	906M	22%
Attention (2N²d)	0	403M	10%
Output projection (d²)	589,824	302M	7%
FFN total (8d²)	4,718,592	2,416M	60%
LayerNorm (×2, 2d params)	3,072	~8M	<1%
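The percentage column can be recomputed from the formulas. The sketch below leaves out the small LayerNorm term, so the shares come out as exact fractions that round to the table's values; `block_mac_shares` is an illustrative name.

```python
def block_mac_shares(d: int, n: int) -> dict[str, float]:
    """Fraction of one Transformer block's MACs per component (LayerNorm ignored)."""
    parts = {
        "qkv": 3 * n * d * d,
        "attention": 2 * n * n * d,
        "out_proj": n * d * d,
        "ffn": 8 * n * d * d,
    }
    total = sum(parts.values())
    return {name: macs / total for name, macs in parts.items()}

for name, share in block_mac_shares(d=768, n=512).items():
    print(f"{name:<10} {share:6.1%}")
```

Raising `n` while keeping `d` fixed shifts share from the three projection-style rows to the attention row, which is the only one that grows as N².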

ResNet-50 full model: 25M params, 4B MACs, 8 MB activation memory at inference (FP32). A MobileNetV2 with ~70% top-1 accuracy: 3.4M params, 300M MACs, which is about 13× fewer MACs and 7× fewer params. The accuracy drop is only ~6 points. This is the efficiency-accuracy frontier: depthwise separable convolutions (8–9× fewer MACs per 3×3 layer) buy you roughly an order of magnitude on the MAC axis at a small accuracy cost. Later techniques (NAS, knowledge distillation) push further.

Key takeaway: FFN holds 67% of block parameters and 60% of MACs at N=512. The attention score computation holds 0% of parameters but grows to dominate MACs as N increases. This is why GPT-style compression research typically focuses on FFN sparsity (for example, MoE routing) for params, and on FlashAttention / linear attention for long-context compute.

Interactive figure: per-layer cost breakdown of a CNN (params vs MACs vs activation memory). Params peak late, activations peak early, and MACs are distributed across all conv layers.

This insight, that MACs stay roughly constant across stages when channels double and spatial size halves, is why "stage-wise" network design is so common. ResNet, EfficientNet, and MobileNet all use it. It means the network's total compute is distributed roughly evenly across all stages, rather than concentrated in one place. For efficiency engineers, this is convenient on the MAC axis: no single stage dominates compute, so a uniform MAC-reduction ratio across stages will not unbalance the pipeline. Params and activation memory remain unevenly distributed, so those costs still need per-layer treatment.
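The balance is easy to verify numerically: doubling both C_in and C_out multiplies MACs by 4, and halving both H and W divides them by 4. A minimal sketch using the three 3×3 layers from the table above:

```python
def conv_macs(c_in: int, c_out: int, k: int, h: int, w: int) -> int:
    """MACs for a standard K x K convolution with an H x W output."""
    return c_in * c_out * k * k * h * w

# (C_in, C_out, output side) for Conv2, Conv3, Conv4
stages = [(64, 128, 56), (128, 256, 28), (256, 512, 14)]
macs = [conv_macs(ci, co, 3, side, side) for ci, co, side in stages]
print(macs)  # [231211008, 231211008, 231211008]
```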

In the CNN table above, Conv2 has 73,728 parameters but Conv4 has 1,179,648 parameters, 16× more. Yet both have approximately 231M MACs. Why do they have the same MAC count despite very different parameter counts?

A. It's a coincidence specific to this architecture
B. As spatial size shrinks by 2× (stride 2), channels double, so C²×H×W stays roughly constant
C. Later layers use larger kernels to compensate for fewer spatial positions
D. Params and MACs are always equal in conv layers

Chapter 9: Connections & Cheat Sheet

You now have the measurement vocabulary for every layer type in modern deep learning. Here is the complete reference — every formula, plugged with real numbers, and mapped to the efficiency techniques that target each cost type.

Complete Parameter & MAC Formula Reference

Layer	Params	MACs	Example (numbers)
Linear C_in→C_out	C_in·C_out + C_out	C_in·C_out	768→768: 590K params, 590K MACs
Conv2d C_in→C_out, K×K, H×W out	C_out·C_in·K²	C_in·C_out·K²·H·W	64→128, 3×3, 56²: 74K params, 231M MACs
Grouped Conv g groups	C_out·C_in·K²/g	C_in·C_out·K²·H·W/g	g=4: 4× fewer params and MACs
Depthwise Conv C channels, K×K	C·K²	C·K²·H·W	C=128, K=3, 56²: 1,152 params, 3.6M MACs
Pointwise Conv C→C_out	C_out·C	C·C_out·H·W	128→128, 56²: 16K params, 51M MACs
Depthwise Sep (DW+PW)	C·K² + C_out·C	C·K²·HW + C·C_out·HW	Cost ratio ≈ 1/K² + 1/C_out vs standard
Pooling K×K, stride K	0	C·(H/K)·(W/K)·K²	No params; MACs ~ spatial comparisons
BatchNorm C channels	2C (γ,β)	≈4C·H·W train; 0 inference (folded)	256 ch: 512 params; folded at inference
Transformer QKV d-dim	3·d²	3·N·d²	d=768, N=512: 1.8M params, 906M MACs
Transformer Attn N×d	0 (no weights)	2·N²·d	N=512, d=768: 0 params, 403M MACs
Transformer FFN d→4d→d	8·d²	8·N·d²	d=768, N=512: 4.7M params, 2.4B MACs
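The depthwise separable row is the one whose ratio is least obvious from the table, so here is a small sketch that derives it from the two MAC formulas. The spatial factor H·W cancels, so it is omitted; `dw_sep_ratio` is an illustrative name.

```python
def dw_sep_ratio(c_in: int, c_out: int, k: int) -> float:
    """MACs of depthwise + pointwise divided by MACs of a standard K x K conv."""
    standard = c_in * c_out * k * k
    separable = c_in * k * k + c_in * c_out   # depthwise + pointwise
    return separable / standard

ratio = dw_sep_ratio(c_in=128, c_out=128, k=3)
print(f"ratio = {ratio:.4f}, reduction = {1 / ratio:.1f}x")  # ratio = 0.1189, reduction = 8.4x
```

Algebraically the ratio is 1/C_out + 1/K², so for wide layers it approaches 1/K² (1/9 for a 3×3 kernel).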

Which efficiency technique attacks which cost?

Cost type	Technique (upcoming lessons)	Target layer	Lesson
Parameter count / model size	Pruning, Quantization, Low-rank decomposition	Late conv, FC, FFN	TinyML L4–L6
MAC count (compute)	Depthwise sep conv, NAS, structured pruning	Mid-network wide convs	TinyML L3, L8
Activation memory (peak)	Operator fusion, gradient checkpointing, in-place ops, TinyTL, MCUNet tiling	Early layers (large H×W)	TinyML L10
N² attention cost (long context)	FlashAttention, linear attention, sparse attn, GQA	Transformer attention block	TinyML L12
Inference memory bandwidth	Quantization (INT8/INT4/FP8), weight compression	All weight-loading ops	TinyML L5–L6
Energy / power budget	Mixed-precision, hardware-aware NAS, co-design	Full model	TinyML L9, L11

Each row of this table is motivated by a formula you now know. Pruning targets params: you know exactly how many each layer has. Quantization targets model size (params × bitwidth): you can compute the MB savings. Depthwise separation targets MACs: you derived the 8–9× reduction. MCUNet targets peak activation memory: you know which layers dominate (early, large H×W). FlashAttention targets the N²·d attention term: you derived why it grows quadratically. Every efficient ML paper starts with a cost measurement, using exactly these formulas.
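As one example of turning a formula into a savings estimate, model size is params × bitwidth. The sketch below uses the 25M-parameter ResNet-50 figure from this lesson and decimal megabytes; it ignores any per-tensor scale or zero-point overhead that a real quantization scheme would add.

```python
def model_size_mb(params: int, bits: int) -> float:
    """Weight storage in decimal MB for a given bitwidth."""
    return params * bits / 8 / 1e6

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_mb(25_000_000, bits):6.1f} MB")
```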

python
"""Per-layer profiler: compute params, MACs, activation size for a model spec."""

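def profile_model(layers):
    """layers: list of dicts with type and dimension specs.

    Returns (total_params, total_macs, peak_act_bytes). Activation size is the
    layer's output tensor at FP16 (2 bytes per element)."""
    total_params, total_macs, peak_act = 0, 0, 0
    print(f"{'Layer':<30} {'Params':>10} {'MACs':>12} {'Act(KB)':>8}")
    print("-" * 62)
    for layer in layers:
        t = layer['type']
        if t == 'linear':
            p = layer['ci'] * layer['co'] + layer['co']   # weights + bias
            m = layer['ci'] * layer['co']
            a = layer['co'] * 2                           # FP16 bytes
        elif t == 'conv':
            # weights + bias (the cheat-sheet formula omits the bias term)
            p = layer['co'] * layer['ci'] * layer['k']**2 + layer['co']
            m = layer['ci'] * layer['co'] * layer['k']**2 * layer['h'] * layer['w']
            a = layer['co'] * layer['h'] * layer['w'] * 2
        elif t == 'dw_sep':
            # depthwise (C·K²) + pointwise (C·C_out), no biases
            p = layer['c'] * layer['k']**2 + layer['co'] * layer['c']
            m = (layer['c'] * layer['k']**2 + layer['c'] * layer['co']) * layer['h'] * layer['w']
            a = layer['co'] * layer['h'] * layer['w'] * 2
        else:
            raise ValueError(f"unknown layer type: {t!r}")
        total_params += p
        total_macs += m
        peak_act = max(peak_act, a)
        # decimal KB (bytes / 1000), matching the tables above
        print(f"{layer['name']:<30} {p:>10,} {m:>12,} {a/1000:>7.1f}")
    print(f"\nTotal params: {total_params:,}  Total MACs: {total_macs:,}  Peak act: {peak_act/1000:.1f} KB")
    return total_params, total_macs, peak_act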

# Example: simplified MobileNet-style backbone
profile_model([
    {'name':'Conv1 3→32',    'type':'conv',   'ci':3,   'co':32,  'k':3, 'h':112, 'w':112},
    {'name':'DWSep 32→64',   'type':'dw_sep', 'c':32,  'co':64,  'k':3, 'h':56,  'w':56},
    {'name':'DWSep 64→128',  'type':'dw_sep', 'c':64,  'co':128, 'k':3, 'h':28,  'w':28},
    {'name':'DWSep 128→256', 'type':'dw_sep', 'c':128, 'co':256, 'k':3, 'h':14,  'w':14},
    {'name':'FC 256→1000',   'type':'linear', 'ci':256, 'co':1000},
])

python
"""Full per-layer profiler with PyTorch hook-based MAC counting.
Works on any nn.Module — registers forward hooks to intercept layer shapes."""
import torch
import torch.nn as nn

def count_layer_macs(module, inputs, output):
    """Forward hook that accumulates the layer's MAC count on module._macs."""
    macs = 0
    if isinstance(module, nn.Linear):
        # MACs = (batch × C_in) × C_out; numel() also covers extra leading dims
        macs = inputs[0].numel() * module.out_features
    elif isinstance(module, nn.Conv2d):
        # MACs = batch × (C_in / groups) × C_out × K_h × K_w × H_out × W_out
        n, c_in, _, _ = inputs[0].shape
        _, c_out, h_out, w_out = output.shape
        kh, kw = module.kernel_size
        macs = n * (c_in // module.groups) * c_out * kh * kw * h_out * w_out
    module._macs = getattr(module, '_macs', 0) + macs

def profile_model_torch(model, input_shape=(1, 3, 224, 224)):
    """Profile a PyTorch model: params and MACs per Linear/Conv2d layer."""
    hooks = []
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            m._macs = 0  # reset so repeated profiling runs do not double-count
            hooks.append(m.register_forward_hook(count_layer_macs))

    was_training = model.training
    model.eval()  # inference behaviour for BatchNorm/Dropout during the dry run
    try:
        with torch.no_grad():
            model(torch.zeros(*input_shape))
    finally:
        # always detach the hooks and restore the original train/eval mode
        for h in hooks:
            h.remove()
        model.train(was_training)

    print(f"{'Layer':<35} {'Params':>10} {'MACs':>12}")
    print("-" * 60)
    total_p, total_m = 0, 0
    for name, m in model.named_modules():
        if hasattr(m, '_macs'):
            p = sum(t.numel() for t in m.parameters())
            total_p += p
            total_m += m._macs
            print(f"{name:<35} {p:>10,} {m._macs:>12,}")
    print(f"\nTotal: {total_p:,} params  {total_m:,} MACs")

# Example: profile a 4-block MobileNet-style backbone
class DWSep(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(cin, cin, 3, stride=stride, padding=1, groups=cin, bias=False)
        self.pw = nn.Conv2d(cin, cout, 1, bias=False)
    def forward(self, x): return self.pw(self.dw(x))

backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
    DWSep(32, 64), DWSep(64, 128, stride=2),
    DWSep(128, 256, stride=2), DWSep(256, 512, stride=2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(512, 1000)
)
profile_model_torch(backbone)

Related lessons: TinyML L1 — Why Efficiency & Metrics covers MACs, FLOPs, model size, and activation memory in depth. Transformer Architecture covers attention mechanisms in detail. Attention Variants covers FlashAttention, linear attention, and sparse attention. Attention in Transformers walks through QKV mechanics. Next in this series: TinyML L3 — Pruning.

The 1×1 conv (pointwise) is just a Linear layer applied spatially. A 1×1 conv with C_in=128 and C_out=256 on a 56×56 feature map: params = 256×128 = 32,768. MACs = 256×128×56×56 = 102,760,448 ≈ 103M. Same params as a Linear(128, 256) layer, but 56²=3,136× more MACs because it's applied at every spatial position. This is why a pointwise conv after a depthwise conv is cheap on params but still carries spatial MACs: for wide networks (large C), nearly the entire cost of a depthwise separable block is in the 1×1 pointwise step.
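To make "a Linear layer applied spatially" concrete, here is a dependency-free toy sketch that runs the same weight matrix at every pixel and counts MACs as it goes. The tiny 2-channel, 2×2 feature map and the helper names are made up for illustration.

```python
def linear(weight, vec):
    """weight: C_out rows of C_in values; vec: C_in values."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def pointwise_conv(weight, fmap):
    """1x1 conv: fmap[c][y][x] -> out[o][y][x], calling `linear` at every pixel."""
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0] * width for _ in range(height)] for _ in weight]
    macs = 0
    for y in range(height):
        for x in range(width):
            pixel = [fmap[c][y][x] for c in range(c_in)]
            for o, value in enumerate(linear(weight, pixel)):
                out[o][y][x] = value
            macs += len(weight) * c_in        # one Linear's worth of MACs per pixel
    return out, macs

weight = [[1, 2], [3, 4], [5, 6]]              # C_out=3, C_in=2
fmap = [[[1, 0], [2, 1]], [[0, 1], [1, 1]]]    # C_in=2, H=2, W=2
out, macs = pointwise_conv(weight, fmap)
print(macs)                                    # 24 = C_out * C_in * H * W
print(256 * 128 * 56 * 56)                     # 102760448, the example in the text
```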

Quick formula check — BERT-large (d=1024, 24 blocks, N=512)

BERT-large total parameters (ignoring embeddings): per block = 4d² (attention projections) + 8d² (FFN) = 12d². Times 24 blocks: 24×12×1024² = 301,989,888 ≈ 302M. Plus embeddings (30,522 vocab × 1024 = 31M). Total ≈ 333M — matches the published 340M (small discrepancy from biases and LN).

For comparison, GPT-2 medium (d=1024, 24 layers, same architecture as BERT-large but causal): similar param count (~340M) and similar MACs per forward pass. The difference is inference mode: BERT processes all N tokens in parallel (one pass), while GPT-2 generates autoregressively (one token at a time, with the KV cache). At generation time, GPT-2's batch=1 MACs per token are dominated by the FFN weights loaded from DRAM (AI ≈ 0.5 MACs/byte at FP16), making it memory-bandwidth-bound rather than compute-bound. Same formulas, completely different bottleneck.

BERT-large total MACs (N=512): per block = (4Nd² + 2N²d) + 8Nd² = 12Nd² + 2N²d. Block MACs = 12×512×1024² + 2×512²×1024 = 6,442,450,944 + 536,870,912 = 6,979,321,856 ≈ 7B. Times 24 blocks = 167.5B MACs for one forward pass at N=512. This is why BERT is expensive to run: 167.5B MACs × 2 FLOPs/MAC = 335 GFLOPs, vs ResNet-50's 4B MACs (≈ 8 GFLOPs). LLM inference at long contexts is even more expensive, driven by the N² term growing with context window.
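Both BERT-large checks (parameters and MACs) fit in one short function. This is a sketch under the same simplifications as the text: biases, LayerNorm, and position/segment embeddings are ignored, and `transformer_totals` is an illustrative name.

```python
def transformer_totals(d: int, blocks: int, n: int, vocab: int) -> dict[str, int]:
    """Parameter and MAC totals for a stack of identical Transformer blocks."""
    block_params = 4 * d * d + 8 * d * d              # attention projections + FFN
    block_macs = 12 * n * d * d + 2 * n * n * d       # projections + FFN, then N^2 term
    return {
        "block_params": blocks * block_params,
        "embedding_params": vocab * d,
        "macs": blocks * block_macs,
    }

t = transformer_totals(d=1024, blocks=24, n=512, vocab=30_522)
print(f"params: {t['block_params'] + t['embedding_params']:,}")   # 333,244,416
print(f"MACs:   {t['macs']:,}")                                   # 167,503,724,544
print(f"GFLOPs: {2 * t['macs'] / 1e9:.0f}")                       # 335
```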

How to use this knowledge in practice

When evaluating a new architecture or a compressed model, run this three-step mental audit:

Step 1: Count

Compute total params, total MACs, and peak activation memory. Flag any layer that's disproportionately expensive in each dimension.

↓

Step 2: Classify

Is the bottleneck params (late conv/FC/FFN)? MACs (early/mid conv, or long-context attention)? Activation memory (early layers with large H×W on MCUs)?

↓

Step 3: Match

Apply the technique that targets the bottleneck. Don't quantize the first conv layer to save params (it has almost none). Don't try FlashAttention on a short-sequence BERT (N²·d is not the bottleneck). Match tool to cost type.

The efficiency techniques you'll learn in subsequent lessons map directly to these cost types. Structured pruning (L3) removes entire filters from Conv/Linear layers, directly reducing the C_out dimension and shrinking both params and MACs. Quantization (L5) reduces bits per weight, shrinking model size without touching the MAC formula, but reducing the bytes loaded per weight and enabling INT8 SIMD parallelism that gives ~4× throughput. Knowledge distillation (L7) trains a smaller student to mimic a larger teacher, which lets you reduce the network width d: replacing BERT-large (d=1024) with a narrower student (d=256) gives a 16× per-block param reduction (d² shrinks 16×) and a 16× reduction in the Nd² MAC terms. Neural Architecture Search (L8) automates the tradeoff, searching the space of C_in, C_out, K, and layer count to minimize MACs subject to an accuracy constraint.

One more mental model: arithmetic intensity = MACs / bytes loaded (counting only FP16 weight bytes here). For a Linear layer (batch=1): AI = C_in·C_out MACs / (C_in·C_out·2 bytes) = 0.5 MACs/byte (1 FLOP/byte) → deeply memory-bound. For a 3×3 Conv 64→128 on 56×56: AI = 73,728×3,136 MACs / (73,728×2 bytes) = 1,568 MACs/byte (3,136 FLOPs/byte) → compute-bound on any modern GPU. This is why convolutions run efficiently on GPUs (high AI) but LLM autoregressive decoding does not (batch=1 Linear → AI = 0.5 MACs/byte).
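The two arithmetic-intensity figures can be computed directly. The sketch below reports MACs per byte (multiply by 2 for FLOPs per byte) and, as a simplifying assumption, counts only FP16 weight traffic, not activation reads and writes.

```python
def arithmetic_intensity(macs: int, weight_count: int, bytes_per_weight: int = 2) -> float:
    """MACs performed per byte of weights loaded."""
    return macs / (weight_count * bytes_per_weight)

# Linear 768 -> 768 at batch=1: every weight is used exactly once
linear_ai = arithmetic_intensity(macs=768 * 768, weight_count=768 * 768)

# 3x3 conv 64 -> 128 on a 56x56 output: every weight is reused at 3,136 positions
conv_weights = 128 * 64 * 3 * 3
conv_ai = arithmetic_intensity(macs=conv_weights * 56 * 56, weight_count=conv_weights)

print(linear_ai, conv_ai)  # 0.5 1568.0
```

The gap between the two numbers is exactly the weight-reuse factor: 1 for the batch=1 Linear layer, H·W = 3,136 for the convolution.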

"What I cannot create, I do not understand." — Richard Feynman. If you can compute the parameter count and MAC cost of any layer from its shape alone, you understand that layer. Everything else in efficient deep learning is a consequence of those two numbers.

A 3×3 standard conv with 256 input and 256 output channels is replaced by depthwise separable (DW+PW). The FLOP ratio formula is 1/K²+1/C = 1/9+1/256. Which part of this formula comes from the depthwise step and which from the pointwise?

A. 1/K² from pointwise (1×1 kernel reduces spatial ops); 1/C from depthwise (fewer channels)
B. Both terms come from depthwise; pointwise adds no savings
C. 1/K² from depthwise (spatial filtering only, no cross-channel MACs); 1/C from pointwise (1×1 kernel, saves K² factor of spatial kernel)
D. The ratio depends only on architecture, not on K or C