A 7B-parameter LLM weighs ~14 GB at FP16. Your phone has 6 GB of RAM. A microcontroller has 256 KB of SRAM — a 50,000× gap. This lesson builds the vocabulary the entire field uses: MACs, FLOPs, model size, activation memory, latency vs throughput, energy hierarchies, and the roofline model. Every metric derived with real numbers. MIT 6.5940 by Song Han.
You want to run a large language model on your phone. Not in the cloud, but on the device, offline, with no round-trip latency. The model is Llama-2-7B: 7 billion parameters, trained on 2 trillion tokens, capable of remarkable things. You open the spec sheet and run the arithmetic: 7 billion parameters × 2 bytes per parameter (FP16) = 14 GB.
Your phone has 6 GB of RAM — shared with the OS, the camera app, everything else. The model alone is 2.3× the entire available memory. Strike one.
You try a microcontroller instead: an Arduino Nano 33 BLE Sense, the kind of chip in a smart thermostat or a hearing aid. It has 256 KB of SRAM. To run even a tiny 1M-parameter model at INT8 (1 byte per param) you need 1 MB, about four times the chip's entire memory.
This is the TinyML problem in one number: a 50,000× gap between what modern AI demands and what edge hardware supplies. The question this entire course asks — and answers — is: how do we bridge it?
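The three ratios quoted above can be checked in a few lines. This is a minimal sketch that assumes decimal units (1 GB = 10⁹ bytes, 1 KB = 10³ bytes); the helper name `weights_bytes` is illustrative and not part of the course material.

```python
def weights_bytes(n_params, bits_per_param):
    """Bytes needed to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8


GB, KB = 1e9, 1e3  # decimal units, matching the round numbers in the text

llama_fp16 = weights_bytes(7e9, 16)   # Llama-2-7B at FP16
phone_ram = 6 * GB
mcu_sram = 256 * KB

print(f"Llama-2-7B at FP16: {llama_fp16 / GB:.0f} GB")
print(f"Model / phone RAM:  {llama_fp16 / phone_ram:.1f}x")
print(f"Model / MCU SRAM:   {llama_fp16 / mcu_sram:,.0f}x")
print(f"1M params at INT8 / MCU SRAM: {weights_bytes(1e6, 8) / mcu_sram:.1f}x")
```

The model-to-SRAM ratio lands a little above 50,000, which is where the round "50,000×" figure comes from.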
The answer involves a toolkit: pruning (remove weights that barely matter), quantization (use INT4 instead of FP32), neural architecture search (design lean architectures), knowledge distillation (compress a big model into a small one). Each technique in this toolkit has a precise cost-benefit story told in the efficiency metrics we build in this lesson.
Before you can compress, you need to measure. Before you measure, you need units. This lesson gives you those units — and the intuition behind each one.
Figure: model sizes versus GPU memory available on flagship chips. Models grow ~4× every two years while hardware grows ~2×.
In 1965, Gordon Moore observed that the number of transistors on a chip doubled roughly every two years. This became Moore's Law — a self-fulfilling prophecy that the semiconductor industry used as a roadmap for decades. More transistors meant more compute, more memory, faster chips. The free lunch of hardware improvement.
Deep learning arrived and ate that lunch — and then demanded dessert. From 2017 to 2022:
The math is brutal: hardware doubles, models quadruple. Every two years the gap widens by a factor of 2. Over six years that is three doublings, which compounds to a 2³ = 8× mismatch from the growth-rate difference alone, on top of the absolute gap that already existed at the start.
This is not a temporary mismatch that will self-correct. The trend has continued past 2022: Llama 3.1 405B (2024), GPT-4 (parameter count undisclosed; unofficial estimates run to 1-2T), Gemini Ultra. Hardware has improved, but not at the pace needed. The gap is structural, and closing it efficiently is the job of the field this course covers.
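The compounding can be made explicit with a short sketch. It assumes the rough trend rates quoted in the text (models 4×, hardware 2×, every two years); the function name `gap_growth` is hypothetical.

```python
def gap_growth(years, model_factor=4.0, hw_factor=2.0, period_years=2.0):
    """Factor by which the model/hardware gap widens over `years`.

    Assumes models grow by `model_factor` and hardware by `hw_factor`
    every `period_years` (trend rates taken from the text, not measured here).
    """
    periods = years / period_years
    return (model_factor / hw_factor) ** periods


for years in (2, 4, 6):
    print(f"{years} years: gap widens {gap_growth(years):.0f}x")
```

Changing either growth factor shows how sensitive the long-run gap is to the assumed rates.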
The goal is not to make a worse model that fits. The goal is to deploy the best possible model within the given constraints. "Efficient" means maximizing accuracy per unit of resources — where resources include memory, compute (MACs), energy (Joules), and latency (milliseconds).
Before any of these compression techniques can be applied correctly, you need to know exactly what you are compressing. That means measuring the model with precise metrics. The rest of this lesson builds those metrics from scratch.
Figure: the model-versus-hardware gap by year. Hardware grows at Moore's Law pace (2× every two years) while models grow 4× every two years.
The first efficiency metric: how many learnable numbers does your model contain? This is the #parameters — the count of all weights and biases across every layer. Parameters live in memory between forward passes. They must be loaded from storage, held in RAM (or SRAM on a microcontroller), and updated during training.
A fully-connected (linear) layer takes an input vector of size c_in and produces an output vector of size c_out. Every output neuron connects to every input neuron. The weight matrix W has shape (c_out, c_in), plus a bias vector of size c_out, so #params = c_in × c_out + c_out.
Worked example: A hidden layer with c_in = 768 (e.g., BERT's hidden dimension) and c_out = 3072 (the FFN expansion): 768 × 3072 + 3072 = 2,362,368 parameters.
In a language model, these FFN layers appear in every transformer block, once as the expansion (768 → 3072) and once as the projection back (3072 → 768, with 2,360,064 parameters). GPT-2 small uses exactly these dimensions across 12 blocks, so the FFN weights alone are 12 × (2,362,368 + 2,360,064) ≈ 56.7M parameters, close to half of the model's roughly 124M.
Model size is the memory footprint of the weights. It depends on two things: how many parameters, and how many bits each one uses. This is the bitwidth (or precision).
| Format | Bits/param | 7B model size | Notes |
|---|---|---|---|
| FP32 | 32 | 28 GB | Training default (gradients, optimizer states need this) |
| BF16 / FP16 | 16 | 14 GB | Inference default for large models; same range as FP32 (BF16) |
| INT8 | 8 | 7 GB | Post-training quantization; ~1% accuracy loss typically |
| INT4 | 4 | 3.5 GB | Fits in 4 GB GPU; widely used (GPTQ, AWQ, bitsandbytes) |
| INT2 | 2 | 1.75 GB | Aggressive; significant accuracy loss; active research area |
INT4 quantization is why you can run a 7B model on a consumer GPU: 3.5 GB fits in an NVIDIA RTX 3080 (10 GB VRAM) with room for activations and KV cache. The 4× compression from FP16 to INT4 is entirely from halving the bitwidth twice — no architectural changes needed.
```python
def count_fc_params(c_in, c_out, bias=True):
    # Weight matrix: c_out × c_in
    # Bias vector: c_out (optional)
    params = c_in * c_out
    if bias:
        params += c_out
    return params


def model_size_mb(n_params, bits_per_param=16):
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / (1024 ** 2)  # bytes → MB (1 MB = 1024² bytes here)


# Example: BERT-base (12 layers, hidden=768)
ffn_per_block = count_fc_params(768, 3072) + count_fc_params(3072, 768)
total_ffn = ffn_per_block * 12  # 12 transformer blocks

print(f"FFN params: {total_ffn:,}")                            # 56,669,184 (biases included)
print(f"Size at FP16: {model_size_mb(total_ffn, 16):.1f} MB")  # ≈ 108.1 MB
print(f"Size at INT4: {model_size_mb(total_ffn, 4):.1f} MB")   # ≈ 27.0 MB
```
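The 7B column of the bitwidth table can be reproduced the same way. The sketch below assumes decimal gigabytes (1 GB = 10⁹ bytes) and counts weights only; real INT4 formats also store per-group scales, so actual files are somewhat larger. The names `BITS` and `model_size_gb` are illustrative.

```python
# Bits per parameter for each storage format in the table
BITS = {"FP32": 32, "BF16/FP16": 16, "INT8": 8, "INT4": 4, "INT2": 2}


def model_size_gb(n_params, bits_per_param):
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


sizes = {fmt: model_size_gb(7e9, bits) for fmt, bits in BITS.items()}
for fmt, gb in sizes.items():
    print(f"{fmt:>10}: {gb:5.2f} GB")
```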
Parameters measure what is stored. But running inference requires computation — and the compute cost is measured differently. The core unit is the MAC: multiply-accumulate operation.
One MAC = one multiply + one add. That is exactly 2 floating-point operations (FLOPs). The conversion is always: 1 MAC = 2 FLOPs. Hardware often reports performance in FLOPS (floating-point operations per second), so you will frequently need to convert.
For a linear layer with input size c_in and output size c_out, computing one output element y_j = Σ_i w_ji·x_i + b_j requires c_in MACs (the dot product). There are c_out such output elements, so #MACs = c_in × c_out.
Worked example: The BERT FFN expansion from 768 to 3072: 768 × 3072 = 2,359,296 MACs, or 4,718,592 FLOPs.
Notice: for FC layers, #MACs = #params (ignoring bias). This is not a coincidence — every weight participates in exactly one multiply-accumulate per forward pass (for a single input).
Convolutions are where the MAC count diverges from the parameter count due to weight sharing. A 2D conv has input feature map (c_in, H, W), output feature map (c_out, H_out, W_out), and kernel size K×K. Each output element requires one full filter application of c_in × K × K MACs, so #MACs = c_out × c_in × K² × H_out × W_out.
The parameters in the conv are: c_out × c_in × K × K. But the MACs multiply that by H_out × W_out, the number of spatial positions where the kernel is applied. The same filter is reused at every position (weight sharing), so MACs ≫ params for large feature maps.
Worked example: ResNet-50's first conv: c_in=3, c_out=64, K=7, input 224×224, stride=2 → H_out=W_out=112. Params = 64 × 3 × 7 × 7 = 9,408; MACs = 9,408 × 112 × 112 = 118,013,952.
That single conv layer has 118M MACs (236M FLOPs) — 12,544× more MACs than parameters. This is why large spatial feature maps are so expensive: the parameter count looks small but the compute is massive.
```python
def macs_conv2d(c_in, c_out, k, h_out, w_out):
    # Each output location: c_in × k × k MACs
    # Total output locations: h_out × w_out
    # Total filters: c_out
    return c_out * c_in * k * k * h_out * w_out


def macs_fc(c_in, c_out):
    return c_in * c_out


# ResNet-50 layer 1: 3→64, 7×7 kernel, output 112×112
macs = macs_conv2d(3, 64, 7, 112, 112)
params = 3 * 64 * 7 * 7
print(f"Params: {params:,}")                     # 9,408
print(f"MACs: {macs:,}")                         # 118,013,952
print(f"Ratio MACs/params: {macs/params:.0f}×")  # 12544×

# 1 MAC = 2 FLOPs (multiply + add)
flops = macs * 2
print(f"FLOPs: {flops/1e6:.1f} MFLOPs")          # 236.0 MFLOPs
```
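The worked example takes H_out = W_out = 112 as given. A small helper shows where that number comes from, using the standard output-size formula for a convolution without dilation. Note that `padding=3` is an assumption here: the text does not state the padding, and 3 is the value that yields the quoted 112×112 output. The helper name `conv_out_size` is illustrative.

```python
def conv_out_size(size_in, kernel, stride=1, padding=0):
    """Output height or width of a 2D convolution (no dilation)."""
    return (size_in + 2 * padding - kernel) // stride + 1


# ResNet-50 first conv: 224x224 input, 7x7 kernel, stride 2.
# padding=3 is assumed (not stated in the text).
h_out = conv_out_size(224, kernel=7, stride=2, padding=3)
macs = 64 * 3 * 7 * 7 * h_out * h_out

print(f"H_out = W_out = {h_out}")
print(f"MACs: {macs:,}")
```

Because MACs scale with H_out × W_out, halving the output resolution cuts the MAC count of a layer by 4× without touching its parameter count.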
Model size counts parameters — the weights the model was trained to have. But during inference (and especially training), the network generates a second category of memory usage: activations. Activations are the intermediate feature maps produced by each layer as the input flows forward through the network.
Activations are not stored between forward passes — they are computed fresh each time. But they must be live in memory simultaneously while a given layer is executing. On a microcontroller, with only 256 KB of SRAM shared between everything, activations often dominate memory usage — far exceeding the weights of the layer computing them.
A 2D conv layer with output feature map (c_out, H_out, W_out) must store that entire output tensor in memory. Its size (for batch size 1) is c_out × H_out × W_out × bytes per element.
Worked example: ResNet-50's first conv output (64 channels, 112×112) at FP16: 64 × 112 × 112 × 2 bytes = 1,605,632 bytes, about 1,568 KB or 1.5 MB.
Compare this to the weight size: 9,408 params × 2 bytes ≈ 18 KB. The activation is about 85× larger than the layer's parameters. This ratio, with large output feature maps and relatively small filters, is the typical case in the early layers of CNNs.
During inference, you need memory for the current layer's input AND output simultaneously (so the output doesn't overwrite the input before the computation finishes). The peak activation memory is the maximum memory needed at any point during the forward pass: the maximum over all layers of (input bytes + output bytes).
For feed-forward CNNs like ResNets, the peak usually falls in the first few layers, where spatial resolution is high. For transformers, it is the attention matrices: storing the full Q·Kᵀ matrix for a sequence length L and h attention heads requires L² × h × 2 bytes at FP16, which at L=4096 and h=32 is 4096² × 32 × 2 bytes ≈ 1 GB for the attention scores of a single layer.
```python
def activation_size_bytes(c_out, h_out, w_out, bits=16, batch=1):
    return batch * c_out * h_out * w_out * bits // 8


# ResNet-50 layer 1 output
act = activation_size_bytes(64, 112, 112, bits=16)
weights = 3 * 64 * 7 * 7 * 2  # FP16 weights, in bytes
print(f"Activations: {act/1024:.1f} KB")  # 1568.0 KB ≈ 1.5 MB
print(f"Weights: {weights/1024:.1f} KB")  # 18.4 KB
print(f"Ratio: {act/weights:.0f}×")       # 85×

# Transformer attention memory at sequence length L
L, h = 4096, 32
# QK^T matrix: (h, L, L), one attention score per head per query-key pair
attn_mem = h * L * L * 2  # FP16
print(f"Attention scores at L={L}: {attn_mem/1e9:.2f} GB")  # 1.07 GB
```
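The peak-activation rule (input plus output of the currently executing layer) can be sketched for a simple chain of layers. The shapes below are a hypothetical ResNet-style front end chosen for illustration; the sketch ignores residual branches and any scratch buffers a real runtime would allocate.

```python
def peak_activation_bytes(shapes, bytes_per_elem=2):
    """Peak of (input + output) activation bytes over a sequential network.

    `shapes` lists tensor shapes in execution order: shapes[0] is the
    network input and shapes[i + 1] is the output of layer i.
    Returns (peak_bytes, index_of_layer_at_peak).
    """
    def nbytes(shape):
        n = 1
        for d in shape:
            n *= d
        return n * bytes_per_elem

    per_layer = [nbytes(a) + nbytes(b) for a, b in zip(shapes, shapes[1:])]
    peak = max(per_layer)
    return peak, per_layer.index(peak)


# Hypothetical front end of a ResNet-style CNN at FP16, batch size 1
shapes = [
    (3, 224, 224),   # input image
    (64, 112, 112),  # after the 7x7 stride-2 conv
    (64, 56, 56),    # after a stride-2 pooling layer
]
peak, layer = peak_activation_bytes(shapes)
print(f"Peak: {peak / 1024:.0f} KB at layer {layer}")
```

Even this tiny front end peaks at close to 2 MB, several times the 256 KB of SRAM on the microcontroller discussed earlier.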
On a battery-powered edge device, energy is the ultimate constraint. A Raspberry Pi running ResNet inference at 1 fps uses enough power to drain a CR2032 coin cell in about 90 minutes. A hearing aid runs on the same battery for weeks. The difference is not just chip efficiency — it is where the chip spends its energy.
The energy cost of a computation depends not just on what arithmetic is performed, but on where the data comes from. The memory hierarchy in a processor has fundamentally different energy costs at each level:
| Memory Level | Access Energy (32-bit) | Relative to Reg | Capacity (typical) |
|---|---|---|---|
| Register | ~0.1 pJ | 1× | ~KB (tens of registers) |
| L1 Cache (SRAM) | ~0.5 pJ | 5× | 32–256 KB |
| L2 Cache (SRAM) | ~2 pJ | 20× | 256 KB–4 MB |
| On-chip SRAM (ML) | ~5 pJ | 50× | 4–32 MB (GPU shared mem) |
| Off-chip DRAM | ~640 pJ | 6,400× | 4–80 GB (GPU HBM) |
Treat these as order-of-magnitude figures for a 45 nm CMOS process; the DRAM, SRAM, and multiply costs are the widely cited estimates from Mark Horowitz's ISSCC 2014 keynote, which Song Han's lectures reproduce. The takeaway is unambiguous: if you read a weight from DRAM, you spend 640 pJ. The 32-bit floating-point multiply that uses that weight costs about 3.7 pJ. The memory access is 173× more expensive than the computation.
If you can keep the weights in on-chip SRAM (5 pJ/access) instead of DRAM (640 pJ/access), you reduce the energy per weight access by 128×. This is the fundamental reason why model compression matters for energy efficiency: smaller models fit in on-chip memory, eliminating DRAM traffic. A model that is 4× smaller may save far more than 4× in energy if shrinking it lets the entire model stay in SRAM.
This also explains why quantization (INT8 vs FP32) saves more energy than just the 4× size reduction suggests. INT8 weights fit in 4× less memory, so 4× more of them fit in cache. The cache hit rate improves, fewer DRAM accesses occur, and each access fetches a shorter word — the energy savings compound.
```python
# Energy estimation: DRAM access vs computation
E_DRAM_pJ = 640  # energy for one 32-bit DRAM access
E_SRAM_pJ = 5    # one 32-bit on-chip SRAM access
E_MAC_pJ = 3.7   # arithmetic cost per 32-bit MAC (value used in the text)

# ResNet-50 first layer: 118M MACs, 9,408 parameters
n_macs = 118_013_952
n_params = 9_408  # filter weights, read once per forward pass

# Case 1: each weight is loaded from DRAM exactly once
energy_compute = n_macs * E_MAC_pJ
energy_mem_dram = n_params * E_DRAM_pJ
print(f"Compute energy: {energy_compute/1e6:.1f} µJ")     # 436.7 µJ
print(f"DRAM load energy: {energy_mem_dram/1e6:.4f} µJ")  # 6.0211 µJ

# Case 2: in a conv, each weight is REUSED h_out × w_out = 12,544 times.
# Without caching, every one of those uses goes back to DRAM.
energy_mem_dram_total = n_params * 12544 * E_DRAM_pJ
print(f"DRAM if not cached: {energy_mem_dram_total/1e9:.1f} mJ")  # 75.5 mJ
print(f"That's {energy_mem_dram_total/energy_compute:.0f}× the compute energy!")  # 173×
```
The code reveals why filter caching is critical in conv layers: the same 9,408 weights are used at 12,544 different spatial positions. If you reload them from DRAM each time, memory energy dwarfs compute energy by about 173×. If you cache the filter on chip for the duration of the spatial sweep, the DRAM cost collapses to a one-time load of about 6 µJ, roughly 1% of the compute energy.
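The cached case can be sketched as well. This is a rough model under stated assumptions: each weight is fetched from DRAM once, then read from on-chip SRAM on every reuse; activation traffic is ignored; and the per-access energies are the table's values. The function name `conv_energy_uj` is hypothetical.

```python
E_DRAM_PJ = 640.0  # 32-bit DRAM access
E_SRAM_PJ = 5.0    # 32-bit on-chip SRAM access
E_MAC_PJ = 3.7     # arithmetic cost per MAC, as used in the text


def conv_energy_uj(n_params, reuse, cached):
    """Rough (compute, weight-memory) energy in µJ for one conv layer.

    Counts only weight reads and MACs; activation traffic is ignored.
    cached=True:  each weight is read once from DRAM, then `reuse` times from SRAM.
    cached=False: every use of a weight goes to DRAM.
    """
    n_macs = n_params * reuse
    compute = n_macs * E_MAC_PJ
    if cached:
        memory = n_params * E_DRAM_PJ + n_macs * E_SRAM_PJ
    else:
        memory = n_macs * E_DRAM_PJ
    return compute / 1e6, memory / 1e6  # pJ -> µJ


for cached in (False, True):
    compute, memory = conv_energy_uj(9_408, 112 * 112, cached)
    print(f"cached={cached}: compute {compute:.0f} µJ, weight memory {memory:.0f} µJ")
```

Under these assumptions, caching cuts weight-memory energy by more than 100×, although SRAM reads at 5 pJ each remain comparable to the arithmetic itself. Pushing further requires reuse from registers, which this sketch does not model.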
Figure: energy per 32-bit access at each memory level, in pJ, on a log scale, with the ratio relative to a register access.
Two numbers describe how fast a system processes data: latency and throughput. They sound like they measure the same thing, but they don't — and confusing them leads to wrong optimization decisions.
Latency is the time from input arrival to output delivery for a single request: milliseconds per query. It is what matters for interactive applications — if you ask a question and the answer takes 10 seconds, you notice. Voice recognition, autonomous driving, and AR all require low latency.
Throughput is the total number of queries processed per unit time: queries per second (QPS) or, for text generation, tokens per second. A system can have high throughput and high latency simultaneously if it processes many requests in parallel — each individual request waits longer, but the system handles more volume overall.
Throughput is batch size divided by latency, and that simple formula hides a tension: increasing batch size increases throughput but also increases latency (each request waits for the batch to fill before processing begins). The tradeoff is real and unavoidable. Cloud inference providers exploit this by batching requests from multiple users: a request that might take 50 ms alone takes 80 ms batched, but the system handles 10× more traffic.
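The 50 ms versus 80 ms example can be made concrete with a short sketch. The batch size of 16 is an assumed value (the text gives only the latencies and the 10× figure); it is the batch size at which those numbers are consistent. The function name `throughput_qps` is illustrative.

```python
def throughput_qps(batch_size, latency_ms):
    """Queries per second when `batch_size` requests complete every `latency_ms`."""
    return batch_size / (latency_ms / 1000.0)


single = throughput_qps(1, 50)    # one request at a time, 50 ms each
batched = throughput_qps(16, 80)  # batch size 16 is an assumed value

print(f"Unbatched: {single:.0f} QPS at 50 ms latency")
print(f"Batched:   {batched:.0f} QPS at 80 ms latency ({batched / single:.0f}x throughput)")
```

Each user waits 60% longer, while the server handles ten times the traffic on the same hardware.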
Batching does more than just amortize overhead. It fundamentally changes the arithmetic intensity of the computation — and whether the workload is compute-bound or memory-bound.
For a matrix multiplication with weight matrix W of shape (c_out, c_in) and a batch of B inputs, counting only weight traffic: intensity = (B × c_in × c_out MACs) / (c_in × c_out × bytes per param) = B / bytes per param. At FP16 that is B/2 MACs per byte.
At batch size B=1, intensity = 0.5 MACs/byte, which is deeply memory-bound (the A100's ridge point of about 156 FLOPs/byte corresponds to about 78 MACs/byte). At B=512, intensity = 256 MACs/byte, which is compute-bound. Batching moves the workload from the memory-bound to the compute-bound regime, recovering the GPU's full compute utilization.
```python
# Arithmetic intensity for an FC layer, varying batch size
c_in, c_out = 4096, 4096
bytes_per_param = 2  # FP16
RIDGE_MACS_PER_BYTE = 78  # A100 ridge: ~156 FLOPs/byte = ~78 MACs/byte

for B in [1, 4, 16, 64, 256, 1024]:
    macs = B * c_in * c_out
    weight_bytes = c_in * c_out * bytes_per_param  # loaded once per batch
    intensity = macs / weight_bytes  # MACs per byte
    bound = "memory" if intensity < RIDGE_MACS_PER_BYTE else "compute"
    print(f"B={B:5d}: intensity={intensity:6.1f} MACs/byte → {bound}-bound")

# Output:
# B=    1: intensity=   0.5 MACs/byte → memory-bound
# B=    4: intensity=   2.0 MACs/byte → memory-bound
# B=   16: intensity=   8.0 MACs/byte → memory-bound
# B=   64: intensity=  32.0 MACs/byte → memory-bound
# B=  256: intensity= 128.0 MACs/byte → compute-bound
# B= 1024: intensity= 512.0 MACs/byte → compute-bound
```
You have a neural network layer. You have a GPU. Will the layer run fast? The answer depends on one number: arithmetic intensity — how many arithmetic operations are performed per byte of data read from memory.
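The roofline model makes the bound explicit. A processor has two limits: its peak compute rate (FLOPS) and its peak memory bandwidth (BW, in bytes per second).
The ridge point is the arithmetic intensity at which both limits are saturated simultaneously: ridge = peak FLOPS / BW. For an A100 at FP16, with roughly 312 TFLOPS of tensor-core peak and about 2 TB/s of HBM bandwidth, that is 312 / 2 ≈ 156 FLOPs/byte.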
If your workload's arithmetic intensity is below 156 FLOPs/byte, you are memory-bound — you compute so fast that the chip sits idle waiting for data. Performance is limited by BW: achievable_FLOPS = AI × BW. If intensity is above 156, you are compute-bound — you process data fast enough to fully utilize the compute units. Performance is limited by peak FLOPS.
| Operation | Arithmetic Intensity | Regime (A100 FP16) |
|---|---|---|
| Elementwise ReLU | ~0.25 FLOPs/byte | Memory-bound (624× below ridge) |
| LLM decode (B=1) | ~1 FLOPs/byte | Memory-bound (156× below ridge) |
| Softmax | ~3 FLOPs/byte | Memory-bound |
| LLM prefill (B=32) | ~32 FLOPs/byte | Memory-bound (approaching ridge) |
| Large matmul (B=512) | ~512 FLOPs/byte | Compute-bound |
| Conv (large batch) | ~1000 FLOPs/byte | Deeply compute-bound |
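The table's regimes follow directly from the roofline formula, attainable FLOPS = min(peak FLOPS, AI × BW). The sketch below assumes approximate A100 FP16 figures (about 312 TFLOPS dense tensor-core peak, about 2 TB/s memory bandwidth); the function names and the `ops` dictionary are illustrative, with intensities copied from the table.

```python
def ridge_point(peak_tflops=312.0, bw_tb_s=2.0):
    """Arithmetic intensity (FLOPs/byte) where the compute and memory limits meet."""
    return peak_tflops / bw_tb_s


def attainable_tflops(intensity, peak_tflops=312.0, bw_tb_s=2.0):
    """Roofline: min(peak compute, intensity x bandwidth).

    `intensity` is in FLOPs/byte; TB/s x FLOPs/byte gives TFLOPS.
    Defaults are approximate A100 FP16 figures (assumed, not measured).
    """
    return min(peak_tflops, intensity * bw_tb_s)


# Arithmetic intensities (FLOPs/byte) taken from the table above
ops = {
    "Elementwise ReLU": 0.25,
    "LLM decode (B=1)": 1,
    "Softmax": 3,
    "LLM prefill (B=32)": 32,
    "Large matmul (B=512)": 512,
    "Conv (large batch)": 1000,
}

for name, ai in ops.items():
    regime = "compute" if ai >= ridge_point() else "memory"
    perf = attainable_tflops(ai)
    print(f"{name:22s} AI={ai:7.2f} -> {perf:6.1f} TFLOPS ({regime}-bound)")
```

At 1 FLOP/byte, the chip can deliver at most about 2 TFLOPS out of a nominal 312, which is why single-request LLM decoding leaves most of the GPU idle.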
Figure: roofline plot of attainable performance versus arithmetic intensity. Below the ridge point an operation is memory-bound (you need faster memory or higher intensity). Above it, the operation is compute-bound (you need faster math or fewer FLOPs).
This chapter brings together every metric from this lesson into a single interactive calculator. Adjust any parameter of a convolutional layer and see all efficiency metrics update live: parameter count, model size at multiple precisions, MAC count, activation size, and arithmetic intensity.
This is the tool you would use at the start of a TinyML design project: set your target hardware constraints (MCU SRAM, power budget), then explore architectures that fit those constraints before writing any code.
Interactive calculator: configure the convolutional layer and all metrics update live. Values shown in red exceed typical MCU constraints (256 KB SRAM for activations, 512 KB for weights).
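A text-only stand-in for the calculator can be sketched in a few lines. It combines the formulas from this lesson for a single conv layer at batch size 1, ignoring bias. The function name, the dictionary keys, and the intensity model (weights, input, and output each cross memory exactly once) are assumptions made for illustration; the budgets default to the MCU limits quoted above.

```python
KB = 1024


def conv_layer_metrics(c_in, c_out, k, h_in, w_in, stride=1, padding=0, bits=8,
                       act_budget=256 * KB, weight_budget=512 * KB):
    """Efficiency metrics for one 2D conv layer (batch size 1, bias ignored)."""
    h_out = (h_in + 2 * padding - k) // stride + 1
    w_out = (w_in + 2 * padding - k) // stride + 1

    params = c_out * c_in * k * k
    macs = params * h_out * w_out

    weight_bytes = params * bits // 8
    in_bytes = c_in * h_in * w_in * bits // 8
    out_bytes = c_out * h_out * w_out * bits // 8
    act_bytes = in_bytes + out_bytes  # input and output live at the same time

    return {
        "output_shape": (c_out, h_out, w_out),
        "params": params,
        "macs": macs,
        "flops": 2 * macs,
        "weight_bytes": weight_bytes,
        "activation_bytes": act_bytes,
        # Assumes weights, input, and output each cross memory exactly once.
        "intensity_flops_per_byte": 2 * macs / (weight_bytes + act_bytes),
        "fits_weights": weight_bytes <= weight_budget,
        "fits_activations": act_bytes <= act_budget,
    }


# Hypothetical MCU-sized layer: 3x3 conv, 16 -> 32 channels, 64x64 input, INT8
m = conv_layer_metrics(16, 32, 3, 64, 64, stride=1, padding=1, bits=8)
for key, value in m.items():
    print(f"{key:26s} {value}")
```

Running the same function on ResNet-50's first conv at FP16 (`conv_layer_metrics(3, 64, 7, 224, 224, stride=2, padding=3, bits=16)`, with padding 3 assumed) shows the weights fitting easily while the activations overshoot the 256 KB budget several times over.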
You now have the complete vocabulary of efficient deep learning. This chapter consolidates every metric into a single reference, shows how they connect to the four major compression techniques, and points to the next lessons in this series.
| Metric | Formula | What it measures | Typical units |
|---|---|---|---|
| #Parameters | FC: c_in×c_out; Conv: c_out×c_in×K² | Learnable weights (storage) | M, B (billions) |
| Model Size | #params × bitwidth / 8 | Memory footprint of weights | MB, GB |
| Peak Activations | max over layers of (input + output elements) × bytes per element | Intermediate feature map memory | KB, MB |
| MACs | FC: c_in×c_out; Conv: c_out×c_in×K²×H_out×W_out | Multiply-accumulate ops (compute) | M, G, T MACs |
| FLOPs | 2 × MACs | Floating-point ops (hardware reports) | MFLOP, GFLOP, TFLOP |
| Arithmetic Intensity | FLOPs / bytes_from_memory | Compute efficiency vs memory | FLOPs/byte |
| Latency | Wall-clock time per query | Response time | ms, s |
| Throughput | Batch Size / Latency | Requests per second | QPS, tokens/s |
| Energy | Σ (accesses × pJ/access) | Battery impact | µJ, mJ per inference |
The CS336 Resource Accounting lesson covers FLOPs, memory, and the roofline model from the training perspective: CS336 Lec 2 — PyTorch & Resource Accounting. The GPU internals lesson goes deep on arithmetic intensity and Flash Attention: CS336 Lec 5 — GPUs. For inference latency and the KV cache: CS336 Lec 10 — Inference.
The CS336 Kernels lesson derives the exact roofline analysis for elementwise operations and tiling: CS336 Lec 6 — Kernels & Triton.