Why Efficiency? The TinyML Problem & Metrics

Chapter 0: The 50,000× Gap

You want to run a large language model on your phone. Not the cloud — on the device, offline, with no round-trip latency. The model is Llama-2-7B: 7 billion parameters, trained on 2 trillion tokens, capable of remarkable things. You open the spec sheet and run the arithmetic:

7,000,000,000 params × 2 bytes/param (FP16) = 14,000,000,000 bytes = 14 GB

Your phone has 6 GB of RAM — shared with the OS, the camera app, everything else. The model alone is 2.3× the entire available memory. Strike one.

You try a microcontroller instead — an Arduino Nano 33 BLE Sense, the kind of chip in a smart thermostat or a hearing aid. It has 256 KB of SRAM. To run even a tiny 1M-parameter model at INT8 (1 byte per param) you need 1 MB — four times more than the chip's entire memory.

14 GB ÷ 256 KB = 14,000,000 KB ÷ 256 KB = 54,688× too large

This is the TinyML problem in one number: a 50,000× gap between what modern AI demands and what edge hardware supplies. The question this entire course asks — and answers — is: how do we bridge it?

The gap is not just about memory. Even if you could somehow fit the weights, one forward pass of a 7B model costs roughly 7 billion multiply-accumulate operations per token, or about 14 billion arithmetic operations. A microcontroller executes ~10 million operations per second. That is 1,400 seconds (23 minutes) to generate a single token. Efficiency is not optional. It is the only path forward.
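The two gaps above can be reproduced in a few lines. This is a back-of-envelope sketch using the round figures quoted in the text; the constant names are illustrative, and the 10 million operations per second rate is the text's order-of-magnitude assumption rather than a measured benchmark.

```python
# Back-of-envelope version of the memory and compute gaps quoted above.
# All constants are the round figures from the text (decimal units).
PARAMS = 7_000_000_000          # Llama-2-7B parameter count
BYTES_PER_PARAM_FP16 = 2
MCU_SRAM_BYTES = 256 * 1000     # 256 KB, decimal to match the text's arithmetic
MCU_OPS_PER_SEC = 10_000_000    # assumed ~10 M operations per second

model_bytes = PARAMS * BYTES_PER_PARAM_FP16
memory_gap = model_bytes / MCU_SRAM_BYTES

# One forward pass uses every weight once: ~1 MAC (2 operations) per parameter.
ops_per_token = 2 * PARAMS
seconds_per_token = ops_per_token / MCU_OPS_PER_SEC

print(f"Model size: {model_bytes / 1e9:.0f} GB")
print(f"Memory gap: {memory_gap:,.0f}x")
print(f"Time per token: {seconds_per_token:,.0f} s ({seconds_per_token / 60:.0f} min)")
```

Changing `BYTES_PER_PARAM_FP16` or `PARAMS` shows how far quantization or a smaller model would have to go before the ratio becomes manageable.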

The answer involves a toolkit: pruning (remove weights that barely matter), quantization (use INT4 instead of FP32), neural architecture search (design lean architectures), knowledge distillation (compress a big model into a small one). Each technique in this toolkit has a precise cost-benefit story told in the efficiency metrics we build in this lesson.

Before you can compress, you need to measure. Before you measure, you need units. This lesson gives you those units — and the intuition behind each one.

Model size vs hardware memory — the diverging gap (2017-2023)

Bars show model sizes (blue) vs GPU memory available on flagship chips (orange). Models grow ~4× every two years while hardware grows ~2×.

A 7B-parameter model stored at FP16 (2 bytes per parameter) and a microcontroller with 256 KB SRAM. Approximately how many times larger is the model than the chip's memory?

About 54×: the model is 54 MB, the chip has 1 MB
About 5,400×: the model is 1.4 GB, the chip has 256 KB
About 54,000×: the model is 14 GB, the chip has 256 KB
About 540,000×: including OS and other overhead

Chapter 1: Moore's Law vs Model Growth

In 1965, Gordon Moore observed that the number of transistors on a chip doubled roughly every two years. This became Moore's Law — a self-fulfilling prophecy that the semiconductor industry used as a roadmap for decades. More transistors meant more compute, more memory, faster chips. The free lunch of hardware improvement.

Deep learning arrived and ate that lunch — and then demanded dessert. From 2017 to 2022:

GPU memory capacity grew ~2× every two years (Moore's Law pace): V100 had 16 GB (2017), A100 had 40 GB (2020), A100 80GB (2021).
Model sizes grew at least ~4× every two years, and the largest language models grew far faster: GPT (2018) 0.11B → BERT (2018) 0.34B → GPT-2 (2019) 1.5B → GPT-3 (2020) 175B → MT-NLG (2021) 530B.

The math is brutal: hardware doubles, models quadruple. Every two years the gap widens by a factor of 2. Over six years (2017-2023), that compounds to an 8× mismatch just from the growth rate difference — on top of the absolute gap that already existed at the start.
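The compounding argument is easy to check numerically. The sketch below assumes the two growth rates stated above (4× and 2× per two-year period); `gap_growth` is a hypothetical helper name, not something from the course material.

```python
def gap_growth(years, model_rate=4.0, hw_rate=2.0, period_years=2.0):
    """Factor by which the model-to-hardware size ratio widens over `years`."""
    periods = years / period_years
    return (model_rate / hw_rate) ** periods

for years in (2, 6, 10):
    print(f"After {years} years the ratio has grown {gap_growth(years):.0f}x")
```

Because the ratio of the two rates is what compounds, any faster model growth rate makes the curve steeper without changing the shape of the argument.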

Why do models grow faster than hardware? Scaling laws show that model performance improves as a power law with model size. The payoff for doubling parameters is real and measurable: lower perplexity, better reasoning, emergent capabilities. Researchers have strong incentive to scale. Hardware is constrained by physics, yield rates, and economics. Software ambition outpaces silicon.

This is not a temporary mismatch that will self-correct. The trend has continued past 2022: Llama 3.1 405B (2024), GPT-4 (parameter count undisclosed; unofficial estimates run to 1-2T), Gemini Ultra (size undisclosed). Hardware has improved but not at the pace needed. The gap is structural, and closing it efficiently is the job of the field this course covers.

What "efficient deep learning" actually means

The goal is not to make a worse model that fits. The goal is to deploy the best possible model within the given constraints. "Efficient" means maximizing accuracy per unit of resources — where resources include memory, compute (MACs), energy (Joules), and latency (milliseconds).

Training (cloud)

Large model, full precision, lots of data. Optimize for accuracy. Cost: $1M+ for large runs. No real-time constraint.

↓ model compression pipeline

Efficient Model

Pruned, quantized, distilled. Same task, far fewer resources. The art: minimize accuracy loss.

↓ deploy to target hardware

Edge Device

Phone (6 GB RAM, 20 TOPS), microcontroller (256 KB, 10 MOPS), or custom ASIC. Real-time, battery-powered, private.

Before any of these compression techniques can be applied correctly, you need to know exactly what you are compressing. That means measuring the model with precise metrics. The rest of this lesson builds those metrics from scratch.

Model growth vs hardware memory (2017–2023): hardware grows at Moore's Law pace (2× every two years) while models grow 4× every two years.

GPU memory roughly doubles every 2 years (Moore's Law). Model sizes have been growing ~4× every 2 years. After 4 years, by how much has the ratio of model-size-to-GPU-memory grown?

2×: models grew 8×, hardware grew 4×
3×: approximately
4×: models grew 16× (4×4), hardware grew 4× (2×2), ratio grew 16/4 = 4
8×: models grow in both size and complexity

Chapter 2: Parameters & Model Size

The first efficiency metric: how many learnable numbers does your model contain? This is the #parameters — the count of all weights and biases across every layer. Parameters live in memory between forward passes. They must be loaded from storage, held in RAM (or SRAM on a microcontroller), and updated during training.

Counting parameters: fully-connected layer

A fully-connected (linear) layer takes an input vector of size c_in and produces an output vector of size c_out. Every output neuron connects to every input neuron. The weight matrix W has shape (c_out, c_in), plus a bias vector of size c_out.

#params_FC = c_in × c_out + c_out = c_out(c_in + 1)

Worked example: A hidden layer with c_in = 768 (e.g., BERT's hidden dimension) and c_out = 3072 (the FFN expansion):

768 × 3072 + 3072 = 2,359,296 + 3,072 = 2,362,368 params

In a transformer, an FFN pair like this (768 → 3072, then 3072 → 768) appears in every block. BERT-base has 12 such blocks, so the FFN layers alone hold about 12 × 2 × 2.36M ≈ 56.7M parameters, roughly half of BERT-base's ~110M.

From parameter count to bytes: the bitwidth factor

Model size is the memory footprint of the weights. It depends on two things: how many parameters, and how many bits each one uses. This is the bitwidth (or precision).

Model Size (bytes) = #params × bitwidth / 8

Format	Bits/param	7B model size	Notes
FP32	32	28 GB	Training default (gradients, optimizer states need this)
BF16 / FP16	16	14 GB	Inference default for large models; same range as FP32 (BF16)
INT8	8	7 GB	Post-training quantization; ~1% accuracy loss typically
INT4	4	3.5 GB	Fits in 4 GB GPU; widely used (GPTQ, AWQ, bitsandbytes)
INT2	2	1.75 GB	Aggressive; significant accuracy loss; active research area

INT4 quantization is why you can run a 7B model on a consumer GPU: 3.5 GB fits in an NVIDIA RTX 3080 (10 GB VRAM) with room for activations and KV cache. The 4× compression from FP16 to INT4 is entirely from halving the bitwidth twice — no architectural changes needed.

Misconception: fewer parameters always means smaller model. Bitwidth matters too. A 1B-parameter model in FP32 is 4 GB. A 7B-parameter model in INT4 is 3.5 GB. The quantized 7B is smaller than the full-precision 1B. Always track both parameter count AND bitwidth when comparing model sizes.
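The comparison above, and the 7B column of the table, follow directly from the size formula. A minimal sketch, using decimal gigabytes to match the table; `model_size_gb` is an illustrative helper name.

```python
def model_size_gb(n_params, bits_per_param):
    """Weight footprint in decimal gigabytes: params x bits / 8 bytes."""
    return n_params * bits_per_param / 8 / 1e9

small_fp32 = model_size_gb(1e9, 32)   # 1B parameters at FP32
large_int4 = model_size_gb(7e9, 4)    # 7B parameters at INT4
print(f"1B @ FP32: {small_fp32:.1f} GB")
print(f"7B @ INT4: {large_int4:.1f} GB")

# Reproduce the 7B column of the bitwidth table.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"7B @ {name}: {model_size_gb(7e9, bits):.2f} GB")
```

Real quantized checkpoints are slightly larger than this estimate because they also store scales and zero points, which the formula ignores.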

python
def count_fc_params(c_in, c_out, bias=True):
    # Weight matrix: c_out × c_in
    # Bias vector: c_out (optional)
    params = c_in * c_out
    if bias:
        params += c_out
    return params

def model_size_mb(n_params, bits_per_param=16):
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / (1024 ** 2)  # convert to MB

# Example: BERT-base (12-layer, hidden=768)
ffn_per_block = count_fc_params(768, 3072) + count_fc_params(3072, 768)
total_ffn = ffn_per_block * 12  # 12 transformer blocks
print(f"FFN params: {total_ffn:,}")            # 56,669,184 (biases included)
print(f"Size at FP16: {model_size_mb(total_ffn, 16):.1f} MB")  # ≈ 108 MB
print(f"Size at INT4: {model_size_mb(total_ffn, 4):.1f} MB")   # ≈ 27 MB

A transformer has 32 layers. Each layer has two FC layers: one from hidden=4096 to intermediate=16384, and one back from 16384 to 4096. Ignoring biases, how many parameters are in all the FFN weights?

About 2 billion parameters
About 4.3 billion: 32 × 2 × (4096 × 16384) = 32 × 2 × 67,108,864 ≈ 4.3B
About 1 billion parameters
About 8.6 billion: counting biases doubles the total

Chapter 3: MACs, FLOPs & OPs

Parameters measure what is stored. But running inference requires computation — and the compute cost is measured differently. The core unit is the MAC: multiply-accumulate operation.

MAC: output += weight × input

One MAC = one multiply + one add. That is exactly 2 floating-point operations (FLOPs). The conversion is always: 1 MAC = 2 FLOPs. Hardware often reports performance in FLOPS (floating-point operations per second), so you will frequently need to convert.

Why MACs, not FLOPs? Neural network inference is dominated by matrix multiplications, which fuse multiply and add into a single hardware instruction. Modern accelerators (GPU tensor cores, TPU MXUs, Apple ANE) count throughput in MACs/second (or equivalently, their performance is rated in FLOPS = 2 × MACs/sec). Counting MACs directly is closer to what hardware actually does.

MACs of a fully-connected layer

For a linear layer with input size c_in and output size c_out, computing one output element y_j = Σ_i w_ji·x_i + b_j requires c_in MACs (the dot product). There are c_out such output elements.

MACs_FC = c_in × c_out

Worked example: The BERT FFN expansion from 768 to 3072:

MACs = 768 × 3072 = 2,359,296 MACs = 4,718,592 FLOPs

Notice: for FC layers, #MACs = #params (ignoring bias). This is not a coincidence — every weight participates in exactly one multiply-accumulate per forward pass (for a single input).

MACs of a 2D convolution layer

Convolutions are where the MAC count diverges from the parameter count due to weight sharing. A 2D conv has input feature map (c_in, H, W), output feature map (c_out, H_out, W_out), and kernel size K×K. Each output location requires one full filter application:

MACs_Conv = c_out × c_in × K × K × H_out × W_out

The parameters in the conv are: c_out × c_in × K × K. But the MACs multiply that by H_out × W_out — the number of spatial positions where the kernel is applied. The same filter is reused at every position (weight sharing), so MACs ≫ params for large feature maps.

Worked example: ResNet-50's first conv: c_in=3, c_out=64, K=7, input 224×224, stride=2 → H_out=W_out=112:

params = 64 × 3 × 7 × 7 = 9,408

MACs = 64 × 3 × 7 × 7 × 112 × 112 = 9,408 × 12,544 = 118,013,952

That single conv layer has 118M MACs (236M FLOPs) — 12,544× more MACs than parameters. This is why large spatial feature maps are so expensive: the parameter count looks small but the compute is massive.

Misconception: FLOPs = latency. They are not the same thing. FLOPs count arithmetic operations; latency measures wall-clock time. A model can have fewer FLOPs and still be slower if it accesses memory inefficiently: a GPU might spend 80% of its time waiting for data from memory rather than computing. The roofline model (Chapter 7) explains when you are compute-bound vs memory-bound.
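One rough way to see this is a two-term latency estimate: a layer takes at least as long as the slower of its compute time and its memory-transfer time. The sketch below uses that simplification; the two layers and the device numbers (100 TFLOPS, 1 TB/s) are made-up illustrations, not measurements of any real model or chip.

```python
def estimate_latency_s(flops, bytes_moved, peak_flops, bandwidth):
    """Lower-bound latency: the slower of compute and memory transfer."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

PEAK_FLOPS = 100e12   # hypothetical accelerator: 100 TFLOPS
BANDWIDTH = 1e12      # hypothetical accelerator: 1 TB/s

# Layer A: more arithmetic, little data movement.
lat_a = estimate_latency_s(flops=2e9, bytes_moved=10e6,
                           peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH)
# Layer B: half the FLOPs, but ten times the data movement.
lat_b = estimate_latency_s(flops=1e9, bytes_moved=100e6,
                           peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH)

print(f"Layer A: {lat_a * 1e6:.0f} us")
print(f"Layer B: {lat_b * 1e6:.0f} us")
```

Under these assumptions layer B has half the FLOPs of layer A yet takes longer, because its time is set by bytes moved rather than by arithmetic.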

python
def macs_conv2d(c_in, c_out, k, h_out, w_out):
    # Each output location: c_in × k × k MACs
    # Total output locations: h_out × w_out
    # Total filters: c_out
    return c_out * c_in * k * k * h_out * w_out

def macs_fc(c_in, c_out):
    return c_in * c_out

# ResNet-50 layer 1: 3→64, 7×7 kernel, output 112×112
macs = macs_conv2d(3, 64, 7, 112, 112)
params = 3 * 64 * 7 * 7
print(f"Params: {params:,}")             # 9,408
print(f"MACs:   {macs:,}")               # 118,013,952
print(f"Ratio MACs/params: {macs/params:,.0f}×")  # 12,544×

# 1 MAC = 2 FLOPs (multiply + add)
flops = macs * 2
print(f"FLOPs:  {flops/1e6:.1f} MFLOPs")  # 236 MFLOPs

A conv layer has c_in=128, c_out=256, kernel size K=3, and output spatial size 56×56. How many MACs does one forward pass through this layer require?

About 32K MACs: just 128×256
About 294K MACs: 128×256×9
About 92M MACs: 128×256×9×56×56 / 10
About 924M MACs: 256×128×3×3×56×56 = 924,844,032

Chapter 4: Activation Memory

Model size counts parameters — the weights the model was trained to have. But during inference (and especially training), the network generates a second category of memory usage: activations. Activations are the intermediate feature maps produced by each layer as the input flows forward through the network.

Activations are not stored between forward passes — they are computed fresh each time. But they must be live in memory simultaneously while a given layer is executing. On a microcontroller, with only 256 KB of SRAM shared between everything, activations often dominate memory usage — far exceeding the weights of the layer computing them.

Computing activation memory: conv layer

A 2D conv layer with output feature map (c_out, H_out, W_out) must store that entire output tensor in memory. Its size (for batch size 1):

Activation Size = c_out × H_out × W_out × (bitwidth / 8)

Worked example: ResNet-50's first conv output (64 channels, 112×112):

64 × 112 × 112 × 2 bytes (FP16) = 1,605,632 bytes = 1.5 MB

Compare this to the weight size: 9,408 params × 2 bytes ≈ 18 KB. The activation is about 85× larger than the layer's parameters. This ratio (large output feature maps with relatively small filters) is the typical case in the early layers of CNNs.

Critical misconception: parameters are the memory bottleneck on MCUs. For tiny microcontrollers, activations are often the binding constraint, not parameters. Consider a network with 100K parameters at INT8 = 100 KB. That fits in 256 KB of SRAM with about 156 KB to spare. But if any layer's input and output feature maps together need more than that remaining 156 KB, the network cannot run, even though its weights are fine. MCU-friendly architectures (like MCUNet) co-design parameter count AND activation sizes simultaneously.

Peak activation memory

During inference, you need memory for the current layer's input AND output simultaneously (so the output doesn't overwrite the input before the computation finishes). The peak activation memory is the maximum memory needed at any point during the forward pass:

Peak Activation ≈ max_{layer l}(input_l size + output_l size)

For sequential networks like ResNets, this is usually the first few layers where spatial resolution is high. For transformers, it is the attention matrices: storing the full Q·K^T matrix for a sequence length L and h attention heads requires L² × h × 2 bytes in FP16, which at L=4096 and h=32 is 4096² × 32 × 2 ≈ 1 GB for a single layer's attention scores.

python
def activation_size_bytes(c_out, h_out, w_out, bits=16, batch=1):
    return batch * c_out * h_out * w_out * bits // 8

# ResNet-50 layer 1 output
act = activation_size_bytes(64, 112, 112, bits=16)
weights = 3 * 64 * 7 * 7 * 2  # FP16 weights
print(f"Activations: {act/1024:.1f} KB")       # 1568.0 KB ≈ 1.5 MB
print(f"Weights:     {weights/1024:.1f} KB")    # 18.4 KB
print(f"Ratio: {act/weights:.0f}×")             # 85×

# Transformer attention memory at sequence length L
L, h, d_head = 4096, 32, 128
# QK^T matrix: (h, L, L) — one attention score per head per query-key pair
attn_mem = h * L * L * 2  # FP16
print(f"Attention scores at L={L}: {attn_mem/1e9:.2f} GB")  # ~1 GB
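The peak-activation formula from this chapter can be sketched as a scan over consecutive tensors. This assumes a purely sequential network at batch size 1 with no skip connections and no in-place operators; the layer shapes are a hypothetical tiny CNN chosen only for illustration.

```python
def tensor_bytes(channels, height, width, bits=8):
    """Size of one (C, H, W) feature map at batch size 1."""
    return channels * height * width * bits // 8

def peak_activation_bytes(shapes, bits=8):
    """shapes[i] is the (C, H, W) tensor flowing between consecutive layers.

    Each layer needs its input and output alive at once, so the peak is
    the largest sum of two consecutive tensors.
    """
    sizes = [tensor_bytes(c, h, w, bits) for c, h, w in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))

# Hypothetical tiny CNN: input image followed by three conv outputs (INT8).
shapes = [(3, 96, 96), (16, 48, 48), (32, 24, 24), (64, 12, 12)]
peak = peak_activation_bytes(shapes, bits=8)
print(f"Peak activation memory: {peak / 1024:.1f} KB")
```

In this example the peak comes from the first layer, where the input image and the first feature map must coexist, which matches the observation that early high-resolution layers set the activation budget.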

A MCU has 512 KB SRAM. A conv layer has 200K parameters (INT8 = 200 KB) and produces a feature map of size 128 channels × 64 × 64. Can this layer run on the MCU at INT8?

Yes: the weights are only 200 KB, well within 512 KB
No: the activation (128×64×64×1 byte = 524,288 bytes = 512 KB) plus weights (200 KB) = 712 KB, which exceeds 512 KB
Yes: activations are stored on disk, not SRAM
Cannot determine without knowing the layer type

Chapter 5: The Energy Hierarchy

On a battery-powered edge device, energy is the ultimate constraint. A Raspberry Pi running ResNet inference at 1 fps uses enough power to drain a CR2032 coin cell in about 90 minutes. A hearing aid runs on the same battery for weeks. The difference is not just chip efficiency — it is where the chip spends its energy.

The energy cost of a computation depends not just on what arithmetic is performed, but on where the data comes from. The memory hierarchy in a processor has fundamentally different energy costs at each level:

The key insight: memory access dominates energy, not arithmetic. A 32-bit floating-point multiply costs about 3.7 pJ on a 45 nm process. A 32-bit DRAM read costs about 640 pJ, over 170× more energy. Computing faster is useless if the chip is spending most of its energy fetching data from DRAM. Efficient ML is largely about making the memory access pattern cheaper.

Memory Level	Access Energy (32-bit)	Relative to Reg	Capacity (typical)
Register	~0.1 pJ	1×	~KB (tens of registers)
L1 Cache (SRAM)	~0.5 pJ	5×	32–256 KB
L2 Cache (SRAM)	~2 pJ	20×	256 KB–4 MB
On-chip SRAM (ML)	~5 pJ	50×	4–32 MB (GPU shared mem)
Off-chip DRAM	~640 pJ	6,400×	4–80 GB (GPU HBM)

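The headline figures (3.7 pJ multiply, 5 pJ SRAM read, 640 pJ DRAM read) are the widely cited 45 nm CMOS estimates from Mark Horowitz's ISSCC 2014 keynote, which Song Han's work on model compression draws on. The takeaway is unambiguous: if you read a weight from DRAM, you spend 640 pJ. The multiply with that weight costs 3.7 pJ. The memory access is 173× more expensive than the computation.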

Why this makes on-chip SRAM crucial

If you can keep the weights in on-chip SRAM (5 pJ/access) instead of DRAM (640 pJ/access), you reduce memory-access energy by 128×. This is the fundamental reason why model compression matters for energy efficiency: smaller models fit in cache, eliminating DRAM traffic. A model that is 4× smaller may cut energy by far more than 4× because the entire model stays in SRAM.

This also explains why quantization (INT8 vs FP32) saves more energy than just the 4× size reduction suggests. INT8 weights fit in 4× less memory, so 4× more of them fit in cache. The cache hit rate improves, fewer DRAM accesses occur, and each access fetches a shorter word — the energy savings compound.

python
# Energy estimation: DRAM access vs computation
E_DRAM_pJ = 640    # energy for 32-bit DRAM access
E_SRAM_pJ = 5     # on-chip SRAM
E_MAC_pJ  = 3.7   # multiply-accumulate (32-bit)

# ResNet-50 first layer: 118M MACs, 9,408 parameters
n_macs = 118_013_952
n_params = 9_408  # filter weights, read once per forward pass

# Case 1: weights in DRAM
energy_compute = n_macs * E_MAC_pJ
energy_mem_dram = n_params * E_DRAM_pJ  # each weight loaded from DRAM
print(f"Compute energy: {energy_compute/1e6:.1f} µJ")      # 436.7 µJ
print(f"DRAM load energy: {energy_mem_dram/1e6:.4f} µJ")  # 6.0211 µJ
# But in a conv, each weight is REUSED h_out × w_out = 12,544 times.
energy_mem_dram_total = n_params * 12544 * E_DRAM_pJ  # if not cached
print(f"DRAM if not cached: {energy_mem_dram_total/1e9:.1f} mJ")  # 75.5 mJ
print(f"That's {energy_mem_dram_total/energy_compute:.0f}× the compute energy!")  # 173×

The code reveals why filter caching is critical in conv layers: the same 9,408 weights are used at 12,544 different spatial positions. If you reload them from DRAM each time, memory energy dwarfs compute energy by about 173×. If you cache the filter in on-chip SRAM for the duration of the spatial sweep, the DRAM cost falls to a one-time load of about 6 µJ and weight-access energy drops to the same order as compute.
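To put a number on the cached case, the sketch below compares the two weight-access strategies for the same layer. It is a simplified model under stated assumptions: one weight read per MAC, the per-access energies from the table above, and no accounting for input or output activation traffic.

```python
E_DRAM_PJ = 640.0   # 32-bit DRAM access
E_SRAM_PJ = 5.0     # 32-bit on-chip SRAM access
E_MAC_PJ = 3.7      # 32-bit multiply, used as the per-MAC cost in the text

n_params = 9_408
n_macs = 118_013_952   # assumes one weight read per MAC

compute_pj = n_macs * E_MAC_PJ
uncached_pj = n_macs * E_DRAM_PJ                        # every read goes to DRAM
cached_pj = n_params * E_DRAM_PJ + n_macs * E_SRAM_PJ   # load once, reuse from SRAM

print(f"Compute:         {compute_pj / 1e6:10.1f} uJ")
print(f"Weights, DRAM:   {uncached_pj / 1e6:10.1f} uJ")
print(f"Weights, cached: {cached_pj / 1e6:10.1f} uJ")
print(f"Caching saves {uncached_pj / cached_pj:.0f}x on weight-access energy")
```

With caching, the weight-access energy lands in the same range as the compute energy instead of two orders of magnitude above it, which is the regime an efficient accelerator aims for.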

Memory hierarchy energy ladder: energy cost per 32-bit access (pJ) at each memory level, shown on a log scale with the ratio relative to a register access.

A model has 10M parameters at FP32. You quantize to INT8 (4× fewer bytes). Which energy benefit is GREATER: the 4× reduction in bytes transferred, or the improvement from keeping more weights in cache?

The 4× fewer bytes transferred: it directly reduces DRAM bandwidth
They are exactly equal: quantization saves exactly 4×
Improved cache hit rate: avoiding DRAM (640 pJ) in favor of SRAM (5 pJ) saves 128× per access, which can compound beyond 4×
Neither: quantization affects accuracy, not energy

Chapter 6: Latency vs Throughput

Two numbers describe how fast a system processes data: latency and throughput. They sound like they measure the same thing, but they don't — and confusing them leads to wrong optimization decisions.

Latency is the time from input arrival to output delivery for a single request: milliseconds per query. It is what matters for interactive applications — if you ask a question and the answer takes 10 seconds, you notice. Voice recognition, autonomous driving, and AR all require low latency.

Throughput is the total number of queries processed per unit time: queries per second (QPS) or, for text generation, tokens per second. A system can have high throughput and high latency simultaneously if it processes many requests in parallel — each individual request waits longer, but the system handles more volume overall.

Throughput = Batch Size / Latency

This formula hides the tension: increasing batch size increases throughput but also increases latency (each request waits for the batch to fill before processing begins). The tradeoff is real and unavoidable. Cloud inference providers exploit this by batching requests from multiple users — a request that might take 50 ms alone takes 80 ms batched, but the system handles 10× more traffic.
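The cloud-batching example works out as follows. The text gives the two latencies and the 10× traffic figure; the batch size of 16 is an assumption chosen here because it makes those three numbers consistent, not a figure from the source.

```python
def throughput_qps(batch_size, latency_s):
    """Queries per second when `batch_size` requests finish every `latency_s`."""
    return batch_size / latency_s

single = throughput_qps(batch_size=1, latency_s=0.050)    # 50 ms, unbatched
batched = throughput_qps(batch_size=16, latency_s=0.080)  # 80 ms, assumed batch of 16

print(f"Unbatched: {single:.0f} QPS at 50 ms")
print(f"Batched:   {batched:.0f} QPS at 80 ms")
print(f"Throughput gain: {batched / single:.0f}x, latency cost: {0.080 / 0.050:.1f}x")
```

The asymmetry is the point: each user pays 1.6× in latency while the system gains 10× in throughput, which is why serving stacks batch aggressively until a latency target is hit.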

Batching and arithmetic intensity

Batching does more than just amortize overhead. It fundamentally changes the arithmetic intensity of the computation — and whether the workload is compute-bound or memory-bound.

For a matrix multiplication with weight matrix W of shape (c_in, c_out) and a batch of B inputs:

Arithmetic work: B × c_in × c_out MACs — scales linearly with B
Weight memory reads: c_in × c_out — independent of B (weights are reused across batch)

Arithmetic Intensity = MACs / bytes = (B × c_in × c_out) / (c_in × c_out × 2) = B / 2

At batch size B=1, intensity = 0.5 MACs/byte, which is deeply memory-bound (the A100 needs ~78 MACs/byte, i.e. ~156 FLOPs/byte, to be compute-bound). At B=512, intensity = 256 MACs/byte, now compute-bound. This estimate counts only weight traffic, so it holds while B is small relative to c_in and c_out. Batching moves the workload from the memory-bound to the compute-bound regime, recovering the GPU's full compute utilization.

LLM autoregressive decode is always memory-bound at batch size 1. During token generation, the model processes one token at a time (batch=1). The arithmetic intensity for each transformer layer's weight matmul is B/2 = 0.5 — far below the roofline ridge point. The model spends most of its time loading weights from HBM, not computing. This is why speculative decoding, continuous batching, and quantization (smaller weights = faster load) matter so much for LLM serving latency.

python
# Arithmetic intensity for FC layer, varying batch size
c_in, c_out = 4096, 4096
bytes_per_param = 2  # FP16

# A100 ridge point: 312 TFLOPS / 2 TB/s = 156 FLOPs/byte = 78 MACs/byte
RIDGE_MACS_PER_BYTE = 78

for B in [1, 4, 16, 64, 256, 1024]:
    macs = B * c_in * c_out
    weight_bytes = c_in * c_out * bytes_per_param  # loaded once per batch
    intensity = macs / weight_bytes  # MACs per byte (weight traffic only)
    bound = "memory" if intensity < RIDGE_MACS_PER_BYTE else "compute"
    print(f"B={B:5d}: intensity={intensity:6.1f} MACs/byte → {bound}-bound")

# Output:
# B=    1: intensity=   0.5 MACs/byte → memory-bound
# B=    4: intensity=   2.0 MACs/byte → memory-bound
# B=   16: intensity=   8.0 MACs/byte → memory-bound
# B=   64: intensity=  32.0 MACs/byte → memory-bound
# B=  256: intensity= 128.0 MACs/byte → compute-bound
# B= 1024: intensity= 512.0 MACs/byte → compute-bound
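For batch-1 LLM decode, the memory-bound regime gives a simple latency floor: every weight must be read once per generated token, so time per token is at least weight bytes divided by memory bandwidth. The sketch below assumes roughly 2 TB/s of HBM bandwidth (about an A100 80GB) and ignores KV-cache reads, activations, and kernel overheads, so real latencies will be higher.

```python
def decode_ms_per_token(n_params, bits_per_param, bandwidth_bytes_per_s):
    """Lower bound: every weight is read once per generated token."""
    weight_bytes = n_params * bits_per_param / 8
    return weight_bytes / bandwidth_bytes_per_s * 1e3

HBM_BW = 2e12  # assumed ~2 TB/s of HBM bandwidth

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    ms = decode_ms_per_token(7e9, bits, HBM_BW)
    print(f"7B @ {name}: >= {ms:.2f} ms/token (<= {1e3 / ms:.0f} tokens/s)")
```

The floor scales linearly with bitwidth, which is why weight quantization speeds up single-user decoding even when the arithmetic itself is no faster.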

During LLM inference, a user is generating text one token at a time (batch size = 1). An engineer proposes reducing model size by 4× via INT4 quantization. What is the PRIMARY benefit for latency?

4× fewer FLOPs: INT4 arithmetic is faster than FP16
4× less memory used: the model fits in a smaller GPU
4× faster weight loading from HBM: at batch=1 the bottleneck is memory bandwidth, not compute
Latency is unchanged: INT4 affects throughput, not latency

Chapter 7: Showcase: The Roofline Model

You have a neural network layer. You have a GPU. Will the layer run fast? The answer depends on one number: arithmetic intensity — how many arithmetic operations are performed per byte of data read from memory.

Arithmetic Intensity (AI) = #FLOPs / bytes transferred from memory = 2 × #MACs / bytes

The roofline model makes the bound explicit. A processor has two limits:

Peak compute throughput (FLOPS): the maximum arithmetic operations per second — determined by hardware (e.g., A100 FP16 = 312 TFLOPS)
Peak memory bandwidth (BW): the maximum bytes per second the chip can load from memory (A100 HBM = 2 TB/s)

The ridge point is the arithmetic intensity at which both limits are saturated simultaneously:

Ridge Point = Peak FLOPS / Peak BW = 312 TFLOPS / 2 TB/s = 156 FLOPs/byte (A100)

If your workload's arithmetic intensity is below 156 FLOPs/byte, you are memory-bound — you compute so fast that the chip sits idle waiting for data. Performance is limited by BW: achievable_FLOPS = AI × BW. If intensity is above 156, you are compute-bound — you process data fast enough to fully utilize the compute units. Performance is limited by peak FLOPS.

The roofline gives you the ceiling. Achievable performance = min(Peak FLOPS, AI × BW). This is two lines on a log-log plot: a sloped line (the bandwidth limit) that rises until it meets a horizontal ceiling (peak compute). Every operation lands on or below that roof. Plotting your operation on this chart instantly tells you what to optimize: if you are under the sloped part (memory-bound), adding compute is useless. You need to increase AI through tiling, caching, or quantization.
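The roofline formula is a one-liner in code. The sketch below uses the A100 figures quoted above (312 TFLOPS FP16, 2 TB/s); the sample intensities are illustrative points along the roof, and `roofline_flops` is a hypothetical helper name.

```python
def roofline_flops(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s = min(peak compute, intensity x bandwidth)."""
    return min(peak_flops, intensity * bandwidth)

A100_PEAK = 312e12   # FP16 peak, FLOP/s
A100_BW = 2e12       # HBM bandwidth, bytes/s
ridge = A100_PEAK / A100_BW

for ai in (0.25, 1, 32, 156, 512):
    attainable = roofline_flops(ai, A100_PEAK, A100_BW)
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"AI={ai:>6} FLOPs/byte -> {attainable / 1e12:6.1f} TFLOPS ({regime})")
```

Swapping in another device's peak compute and bandwidth moves the ridge point, so the same operation can be memory-bound on one chip and compute-bound on another.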

Real operations on the A100 roofline

Operation	Arithmetic Intensity	Regime (A100 FP16)
Elementwise ReLU	~0.25 FLOPs/byte	Memory-bound (624× below ridge)
LLM decode (B=1)	~1 FLOPs/byte	Memory-bound (156× below ridge)
Softmax	~3 FLOPs/byte	Memory-bound
Batched LLM decode (B=32)	~32 FLOPs/byte	Memory-bound (about 5× below ridge)
Large matmul (B=512)	~512 FLOPs/byte	Compute-bound
Conv (large batch)	~1000 FLOPs/byte	Deeply compute-bound

Roofline model, A100 FP16: below the ridge point an operation is memory-bound (you need faster memory or higher intensity); above it, compute-bound (you need faster math or fewer FLOPs).

A GPU has peak compute of 100 TFLOPS and memory bandwidth of 2 TB/s. The ridge point is 50 FLOPs/byte. An operation has arithmetic intensity of 10 FLOPs/byte. What is its achievable throughput?

100 TFLOPS: it runs at peak compute
20 TFLOPS: memory-bound, 10 FLOPs/byte × 2 TB/s = 20 TFLOPS
50 TFLOPS: halfway between memory and compute limits
200 TFLOPS: it needs to go above the ridge point

Chapter 8: Interactive Conv MAC Calculator

This chapter brings together every metric from this lesson into a single interactive calculator. Adjust any parameter of a convolutional layer and see all efficiency metrics update live: parameter count, model size at multiple precisions, MAC count, activation size, and arithmetic intensity.

This is the tool you would use at the start of a TinyML design project: set your target hardware constraints (MCU SRAM, power budget), then explore architectures that fit those constraints before writing any code.

Conv Layer Efficiency Calculator

Default configuration: C_in = 16, C_out = 16, kernel K = 3×3, output H = W = 56. Values flagged in red exceed typical MCU constraints (256 KB SRAM for activations, 512 KB for weights).
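A text-only stand-in for the calculator can be sketched as a single function that applies the formulas from Chapters 2 to 4. This is a minimal version under stated assumptions: batch size 1, no bias, and a simplified arithmetic intensity that counts only weight reads and output writes (input reads depend on stride and padding, which are not modeled). The function name and dictionary keys are illustrative.

```python
def conv_report(c_in, c_out, k, h_out, w_out, bits=8):
    """Per-layer efficiency metrics for a 2D conv (batch 1, bias ignored)."""
    bytes_per = bits / 8
    params = c_out * c_in * k * k
    macs = params * h_out * w_out
    weight_bytes = params * bytes_per
    out_act_bytes = c_out * h_out * w_out * bytes_per
    return {
        "params": params,
        "weight_bytes": weight_bytes,
        "macs": macs,
        "flops": 2 * macs,
        "output_activation_bytes": out_act_bytes,
        # Simplified intensity: counts weight reads and output writes only.
        "flops_per_byte": 2 * macs / (weight_bytes + out_act_bytes),
    }

# The default slider configuration: 16 -> 16 channels, 3x3 kernel, 56x56 output.
report = conv_report(c_in=16, c_out=16, k=3, h_out=56, w_out=56, bits=8)
for key, value in report.items():
    print(f"{key:>24}: {value:,.1f}")
```

Comparing `weight_bytes` and `output_activation_bytes` against a target SRAM budget gives the same pass or fail signal the interactive version showed in red.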

You are designing a conv layer for an MCU with 256 KB SRAM (activations must fit). You have c_in=32, c_out=64, K=3, and spatial output 32×32. Does the activation fit? (INT8 = 1 byte/element)

Yes: 64 × 32 × 32 × 1 byte = 65,536 bytes = 64 KB, which fits in 256 KB
No: the activation is 128 KB, which exceeds the 64 KB limit
Yes, but only if you also store the input; combined memory is fine
Cannot determine without knowing the kernel size

Chapter 9: Connections & Cheat Sheet

You now have the complete vocabulary of efficient deep learning. This chapter consolidates every metric into a single reference, shows how they connect to the four major compression techniques, and points to the next lessons in this series.

The complete metrics cheat sheet

Metric	Formula	What it measures	Typical units
#Parameters	FC: c_in×c_out; Conv: c_out×c_in×K²	Learnable weights (storage)	M, B (billions)
Model Size	#params × bitwidth / 8	Memory footprint of weights	MB, GB
Peak Activations	max over layers (input size + output size)	Intermediate feature map memory	KB, MB
MACs	FC: c_in×c_out; Conv: c_out×c_in×K²×H_out×W_out	Multiply-accumulate ops (compute)	M, G, T MACs
FLOPs	2 × MACs	Floating-point ops (hardware reports)	MFLOP, GFLOP, TFLOP
Arithmetic Intensity	FLOPs / bytes_from_memory	Compute efficiency vs memory	FLOPs/byte
Latency	Wall-clock time per query	Response time	ms, s
Throughput	Batch Size / Latency	Requests per second	QPS, tokens/s
Energy	Σ (accesses × pJ/access)	Battery impact	µJ, mJ per inference

How each compression technique targets these metrics

Pruning

Removes weights → ↓ #params → ↓ model size → ↓ MACs. Structured pruning (whole channels) → ↓ activation size. Unstructured pruning requires sparse-aware hardware or kernels for full speedup. CS 6.5940 Lecture 3-5.

↓

Quantization

↓ bitwidth → ↓ model size → ↓ energy (smaller DRAM reads) → ↑ arithmetic intensity (denser packing) → may move from memory-bound to compute-bound. INT8/INT4 quantization. CS 6.5940 Lecture 6-8.

↓

Neural Architecture Search

Finds architectures with low MACs and activations for given accuracy. MCUNet co-optimizes params + peak activations. CS 6.5940 Lecture 9-11.

↓

Knowledge Distillation

Trains a small student to mimic a large teacher. Targets: same task, ↓ params, ↓ MACs. Does not change bitwidth or spatial sizes directly. CS 6.5940 Lecture 12-13.

Related lessons on this site

The CS336 Resource Accounting lesson covers FLOPs, memory, and the roofline model from the training perspective: CS336 Lec 2 — PyTorch & Resource Accounting. The GPU internals lesson goes deep on arithmetic intensity and Flash Attention: CS336 Lec 5 — GPUs. For inference latency and the KV cache: CS336 Lec 10 — Inference.

The CS336 Kernels lesson derives the exact roofline analysis for elementwise operations and tiling: CS336 Lec 6 — Kernels & Triton.

What you can now do. Given any neural network layer, you can compute: its parameter count and model size at any bitwidth, its MAC count and FLOPs, its activation memory footprint, its arithmetic intensity, and whether it will be memory-bound or compute-bound on a given accelerator. You can estimate energy using the memory hierarchy numbers and identify the binding constraint (compute, memory bandwidth, or SRAM capacity). This is the vocabulary every paper in this field uses.

"What I cannot create, I do not understand." — Richard Feynman. Build a function that takes (c_in, c_out, K, H_out, W_out, bitwidth) and returns a complete efficiency report. If you can write it from scratch, you understand every metric in this lesson.

A TinyML engineer needs to deploy a classification model to a device with 512 KB SRAM. She has a model with 400K parameters at INT8 (400 KB). The largest activation tensor is 150 KB. Can the model run? What is the binding constraint?

No: weights (400 KB) + peak activation (150 KB) = 550 KB > 512 KB, so it does NOT fit. The weights alone would fit; peak activation memory is what pushes the total over the limit.
Yes: 400 KB fits in 512 KB, activation is loaded separately
No: 400K parameters at INT8 already exceeds 512 KB
Yes: activations are computed on-the-fly and don't need SRAM