TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 1

Why Efficiency? The TinyML Problem & Metrics

A 7B-parameter LLM weighs ~14 GB at FP16. Your phone has 6 GB of RAM. A microcontroller has 256 KB of SRAM — a 50,000× gap. This lesson builds the vocabulary the entire field uses: MACs, FLOPs, model size, activation memory, latency vs throughput, energy hierarchies, and the roofline model. Every metric derived with real numbers. MIT 6.5940 by Song Han.

Prerequisites: basic neural network intuition (what a weight is, what a forward pass does). No calculus required for this lesson.
10 Chapters · 5 Live Canvases · Derived From First Principles

Chapter 0: The 50,000× Gap

You want to run a large language model on your phone. Not the cloud — on the device, offline, with no round-trip latency. The model is Llama-2-7B: 7 billion parameters, trained on 2 trillion tokens, capable of remarkable things. You open the spec sheet and run the arithmetic:

7,000,000,000 params × 2 bytes/param (FP16) = 14,000,000,000 bytes = 14 GB

Your phone has 6 GB of RAM — shared with the OS, the camera app, everything else. The model alone is 2.3× the entire available memory. Strike one.

You try a microcontroller instead — an Arduino Nano 33 BLE Sense, the kind of chip in a smart thermostat or a hearing aid. It has 256 KB of SRAM. To run even a tiny 1M-parameter model at INT8 (1 byte per param) you need 1 MB — four times more than the chip's entire memory.

14 GB ÷ 256 KB = 14,000,000 KB ÷ 256 KB = 54,688× too large

This is the TinyML problem in one number: a 50,000× gap between what modern AI demands and what edge hardware supplies. The question this entire course asks — and answers — is: how do we bridge it?

The gap is not just about memory. Even if you could somehow fit the weights, a forward pass through a 7B model costs roughly 7 billion multiply-accumulate operations per token (one per weight), or about 14 billion arithmetic operations. A microcontroller executes ~10 million operations per second. That is 1,400 seconds, about 23 minutes, to generate a single word. Efficiency is not optional. It is the only path forward.
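The arithmetic in this chapter can be reproduced in a few lines. This is a minimal sketch using the figures quoted above (decimal units, as in the text); the constant and function names are illustrative, and the 10 MOPS microcontroller rate is the rough number from the text rather than a measurement.

```python
# Back-of-envelope figures quoted in this chapter (decimal units).
PARAMS = 7_000_000_000           # Llama-2-7B parameter count
BYTES_PER_PARAM_FP16 = 2
PHONE_RAM_BYTES = 6 * 10**9      # 6 GB phone
MCU_SRAM_BYTES = 256 * 10**3     # 256 KB microcontroller
OPS_PER_TOKEN = 14 * 10**9       # ~14 billion operations per token, as quoted
MCU_OPS_PER_SEC = 10 * 10**6     # ~10 million operations per second (rough)


def gap(model_bytes, device_bytes):
    """How many times larger the model is than the device memory."""
    return model_bytes / device_bytes


model_bytes = PARAMS * BYTES_PER_PARAM_FP16
phone_gap = gap(model_bytes, PHONE_RAM_BYTES)
mcu_gap = gap(model_bytes, MCU_SRAM_BYTES)
seconds_per_token = OPS_PER_TOKEN / MCU_OPS_PER_SEC

print(f"Model size: {model_bytes / 1e9:.0f} GB")
print(f"Phone gap:  {phone_gap:.1f}x")
print(f"MCU gap:    {mcu_gap:,.1f}x")
print(f"MCU time per token: {seconds_per_token:,.0f} s")
```

Swapping in a different bitwidth or device memory shows how far each compression technique has to move these ratios.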

The answer involves a toolkit: pruning (remove weights that barely matter), quantization (use INT4 instead of FP32), neural architecture search (design lean architectures), knowledge distillation (compress a big model into a small one). Each technique in this toolkit has a precise cost-benefit story told in the efficiency metrics we build in this lesson.

Before you can compress, you need to measure. Before you measure, you need units. This lesson gives you those units — and the intuition behind each one.

Model size vs hardware memory — the diverging gap (2017-2023)

Bars show model sizes (blue) vs GPU memory available on flagship chips (orange). Notice models grow ~4× every two years while hardware grows ~2×. Toggle the view to see the same data on a log scale.

Consider a 7B-parameter model stored at FP16 (2 bytes per parameter) and a microcontroller with 256 KB of SRAM. Approximately how many times larger is the model than the chip's memory?

Chapter 1: Moore's Law vs Model Growth

In 1965, Gordon Moore observed that the number of transistors on a chip doubled roughly every two years. This became Moore's Law — a self-fulfilling prophecy that the semiconductor industry used as a roadmap for decades. More transistors meant more compute, more memory, faster chips. The free lunch of hardware improvement.

Deep learning arrived and ate that lunch — and then demanded dessert. From 2017 to 2022:

The math is brutal: hardware doubles, models quadruple. Every two years the gap widens by a factor of 2. Over six years (2017-2023), that compounds to an 8× mismatch just from the growth rate difference — on top of the absolute gap that already existed at the start.
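The compounding can be checked directly. The sketch below assumes the idealized rates stated above (hardware 2× and models 4× per two-year period); the `growth` helper is illustrative, not a fitted trend.

```python
def growth(factor_per_period, years, period_years=2):
    """Total growth after `years`, given a multiplicative factor per period."""
    return factor_per_period ** (years / period_years)


YEARS = 6  # 2017 to 2023
hardware = growth(2, YEARS)   # memory doubles every two years
models = growth(4, YEARS)     # model size quadruples every two years
gap_growth = models / hardware

print(f"After {YEARS} years: hardware {hardware:.0f}x, models {models:.0f}x")
print(f"The gap widens by {gap_growth:.0f}x")
```

Changing `YEARS` shows how quickly the mismatch compounds over other horizons.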

Why do models grow faster than hardware? Scaling laws show that model performance improves as a power law with model size. The payoff for doubling parameters is real and measurable: lower perplexity, better reasoning, emergent capabilities. Researchers have strong incentive to scale. Hardware is constrained by physics, yield rates, and economics. Software ambition outpaces silicon.

This is not a temporary mismatch that will self-correct. The trend has continued past 2022: Llama 3.1 405B (2024), GPT-4 (unofficially estimated at 1-2T parameters), Gemini Ultra. Hardware has improved, but not at the pace needed. The gap is structural, and closing it efficiently is the job of the field this course covers.

What "efficient deep learning" actually means

The goal is not to make a worse model that fits. The goal is to deploy the best possible model within the given constraints. "Efficient" means maximizing accuracy per unit of resources — where resources include memory, compute (MACs), energy (Joules), and latency (milliseconds).

Training (cloud)
Large model, full precision, lots of data. Optimize for accuracy. Cost: $1M+ for large runs. No real-time constraint.
↓ model compression pipeline
Efficient Model
Pruned, quantized, distilled. Same task, far fewer resources. The art: minimize accuracy loss.
↓ deploy to target hardware
Edge Device
Phone (6 GB RAM, 20 TOPS), microcontroller (256 KB, 10 MOPS), or custom ASIC. Real-time, battery-powered, private.

Before any of these compression techniques can be applied correctly, you need to know exactly what you are compressing. That means measuring the model with precise metrics. The rest of this lesson builds those metrics from scratch.

Model growth vs hardware memory (2017–2023) — animated

Drag the year slider to watch the gap evolve. Notice that hardware grows at Moore's Law pace (2×/2yr) while models grow 4×/2yr.

GPU memory roughly doubles every 2 years (Moore's Law). Model sizes have been growing ~4× every 2 years. After 4 years, by how much has the ratio of model-size-to-GPU-memory grown?

Chapter 2: Parameters & Model Size

The first efficiency metric: how many learnable numbers does your model contain? This is the #parameters — the count of all weights and biases across every layer. Parameters live in memory between forward passes. They must be loaded from storage, held in RAM (or SRAM on a microcontroller), and updated during training.

Counting parameters: fully-connected layer

A fully-connected (linear) layer takes an input vector of size c_in and produces an output vector of size c_out. Every output neuron connects to every input neuron. The weight matrix W has shape (c_out, c_in), plus a bias vector of size c_out.

#params_FC = c_in × c_out + c_out = c_out × (c_in + 1)

Worked example: A hidden layer with c_in = 768 (e.g., BERT's hidden dimension) and c_out = 3072 (the FFN expansion):

768 × 3072 + 3072 = 2,359,296 + 3,072 = 2,362,368 params

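In a language model, these FFN layers appear in every transformer block. GPT-2 small (hidden size 768) has 12 blocks, each with an expansion layer (768 → 3072) and a projection back (3072 → 768). The FFN weights alone total 12 × (2,362,368 + 2,360,064) = 56,669,184 ≈ 57M parameters, close to half of GPT-2 small's 124M.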

From parameter count to bytes: the bitwidth factor

Model size is the memory footprint of the weights. It depends on two things: how many parameters, and how many bits each one uses. This is the bitwidth (or precision).

Model Size (bytes) = #params × bitwidth / 8
| Format | Bits/param | 7B model size | Notes |
|---|---|---|---|
| FP32 | 32 | 28 GB | Training default (gradients, optimizer states need this) |
| BF16 / FP16 | 16 | 14 GB | Inference default for large models; same range as FP32 (BF16) |
| INT8 | 8 | 7 GB | Post-training quantization; ~1% accuracy loss typically |
| INT4 | 4 | 3.5 GB | Fits in 4 GB GPU; widely used (GPTQ, AWQ, bitsandbytes) |
| INT2 | 2 | 1.75 GB | Aggressive; significant accuracy loss; active research area |

INT4 quantization is why you can run a 7B model on a consumer GPU: 3.5 GB fits in an NVIDIA RTX 3080 (10 GB VRAM) with room for activations and KV cache. The 4× compression from FP16 to INT4 is entirely from halving the bitwidth twice — no architectural changes needed.

Misconception: fewer parameters always means smaller model. Bitwidth matters too. A 1B-parameter model in FP32 is 4 GB. A 7B-parameter model in INT4 is 3.5 GB. The quantized 7B is smaller than the full-precision 1B. Always track both parameter count AND bitwidth when comparing model sizes.
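The comparison above is a one-line calculation per model. A minimal sketch, assuming decimal gigabytes (1 GB = 10^9 bytes) as the table does; `model_size_gb` is an illustrative helper name.

```python
def model_size_gb(n_params, bits_per_param):
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


small_fp32 = model_size_gb(1e9, 32)   # 1B parameters at FP32
large_int4 = model_size_gb(7e9, 4)    # 7B parameters at INT4

print(f"1B @ FP32: {small_fp32:.1f} GB")
print(f"7B @ INT4: {large_int4:.1f} GB")
print("Quantized 7B is smaller:", large_int4 < small_fp32)
```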
python
def count_fc_params(c_in, c_out, bias=True):
    # Weight matrix: c_out × c_in
    # Bias vector: c_out (optional)
    params = c_in * c_out
    if bias:
        params += c_out
    return params

def model_size_mb(n_params, bits_per_param=16):
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / (1024 ** 2)  # convert to MB

# Example: BERT-base (12-layer, hidden=768)
ffn_per_block = count_fc_params(768, 3072) + count_fc_params(3072, 768)
total_ffn = ffn_per_block * 12  # 12 transformer blocks
print(f"FFN params: {total_ffn:,}")            # 56,669,184 (biases included)
print(f"Size at FP16: {model_size_mb(total_ffn, 16):.1f} MB")  # ≈ 108 MB
print(f"Size at INT4: {model_size_mb(total_ffn, 4):.1f} MB")   # ≈ 27 MB
A transformer has 32 layers. Each layer has two FC layers: one from hidden=4096 to intermediate=16384, and one back from 16384 to 4096. Ignoring biases, how many parameters are in all the FFN weights?

Chapter 3: MACs, FLOPs & OPs

Parameters measure what is stored. But running inference requires computation — and the compute cost is measured differently. The core unit is the MAC: multiply-accumulate operation.

MAC: output += weight × input

One MAC = one multiply + one add. That is exactly 2 floating-point operations (FLOPs). The conversion is always: 1 MAC = 2 FLOPs. Hardware often reports performance in FLOPS (floating-point operations per second), so you will frequently need to convert.

Why MACs, not FLOPs? Neural network inference is dominated by matrix multiplications, which fuse multiply and add into a single hardware instruction. Modern accelerators (GPU tensor cores, TPU MXUs, Apple ANE) count throughput in MACs/second (or equivalently, their performance is rated in FLOPS = 2 × MACs/sec). Counting MACs directly is closer to what hardware actually does.
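Because hardware is rated in FLOPS while models are usually counted in MACs, the conversion comes up constantly. The sketch below shows the conversion and the ideal (peak-rate) runtime it implies; the 1 GMAC workload and the 2 TFLOPS device are hypothetical numbers chosen for easy arithmetic, and real runtimes are longer because peak rates are rarely sustained.

```python
def macs_to_flops(macs):
    """1 MAC = 1 multiply + 1 add = 2 FLOPs."""
    return 2 * macs


def ideal_seconds(macs, peak_flops_per_sec):
    """Lower bound on runtime if the hardware ran at its rated peak."""
    return macs_to_flops(macs) / peak_flops_per_sec


# Hypothetical example: a 1 GMAC model on a device rated at 2 TFLOPS.
example_macs = 1_000_000_000
example_peak = 2e12

print(f"{macs_to_flops(example_macs) / 1e9:.0f} GFLOPs")
print(f"Ideal time: {ideal_seconds(example_macs, example_peak) * 1e3:.1f} ms")
```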

MACs of a fully-connected layer

For a linear layer with input size c_in and output size c_out, computing one output element y_j = Σ_i w_ji · x_i + b_j requires c_in MACs (the dot product). There are c_out such output elements.

MACs_FC = c_in × c_out

Worked example: The BERT FFN expansion from 768 to 3072:

MACs = 768 × 3072 = 2,359,296 MACs = 4,718,592 FLOPs

Notice: for FC layers, #MACs = #params (ignoring bias). This is not a coincidence — every weight participates in exactly one multiply-accumulate per forward pass (for a single input).

MACs of a 2D convolution layer

Convolutions are where the MAC count diverges from the parameter count due to weight sharing. A 2D conv has input feature map (c_in, H, W), output feature map (c_out, H_out, W_out), and kernel size K×K. Each output location requires one full filter application:

MACs_Conv = c_out × c_in × K × K × H_out × W_out

The parameters in the conv are: c_out × c_in × K × K. But the MACs multiply that by H_out × W_out, the number of spatial positions where the kernel is applied. The same filter is reused at every position (weight sharing), so MACs ≫ params for large feature maps.

Worked example: ResNet-50's first conv: c_in=3, c_out=64, K=7, input 224×224, stride=2 → H_out=W_out=112:

params = 64 × 3 × 7 × 7 = 9,408
MACs = 64 × 3 × 7 × 7 × 112 × 112 = 9,408 × 12,544 = 118,013,952

That single conv layer has 118M MACs (236M FLOPs) — 12,544× more MACs than parameters. This is why large spatial feature maps are so expensive: the parameter count looks small but the compute is massive.
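The worked example takes the 112×112 output size as given. It follows from the standard convolution output-size formula, sketched below. The padding value of 3 is an assumption here (it is the usual setting for this layer, but the text does not state it), and `conv_out_size` is an illustrative helper name.

```python
def conv_out_size(size_in, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size_in + 2 * padding - kernel) // stride + 1


# ResNet-50 first conv: 224 input, 7x7 kernel, stride 2.
# padding=3 is assumed; it is not stated in the text.
h_out = conv_out_size(224, kernel=7, stride=2, padding=3)
first_conv_macs = 64 * 3 * 7 * 7 * h_out * h_out

print(f"H_out = W_out = {h_out}")
print(f"MACs = {first_conv_macs:,}")
```

Halving the output resolution (for example with a larger stride) cuts the MAC count by 4× without touching the parameter count.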

Misconception: FLOPs = latency. They are not the same thing. FLOPs count arithmetic operations; latency measures wall-clock time. A model can have fewer FLOPs and still be slower if it accesses memory inefficiently. A GPU might spend 80% of its time waiting for data from memory rather than computing. The roofline model (Chapter 7) explains when you are compute-bound vs memory-bound.
python
def macs_conv2d(c_in, c_out, k, h_out, w_out):
    # Each output location: c_in × k × k MACs
    # Total output locations: h_out × w_out
    # Total filters: c_out
    return c_out * c_in * k * k * h_out * w_out

def macs_fc(c_in, c_out):
    return c_in * c_out

# ResNet-50 layer 1: 3→64, 7×7 kernel, output 112×112
macs = macs_conv2d(3, 64, 7, 112, 112)
params = 3 * 64 * 7 * 7
print(f"Params: {params:,}")             # 9,408
print(f"MACs:   {macs:,}")               # 118,013,952
print(f"Ratio MACs/params: {macs/params:,.0f}×")  # 12,544×

# 1 MAC = 2 FLOPs (multiply + add)
flops = macs * 2
print(f"FLOPs:  {flops/1e6:.1f} MFLOPs")  # 236 MFLOPs
A conv layer has c_in=128, c_out=256, kernel size K=3, and output spatial size 56×56. How many MACs does one forward pass through this layer require?

Chapter 4: Activation Memory

Model size counts parameters — the weights the model was trained to have. But during inference (and especially training), the network generates a second category of memory usage: activations. Activations are the intermediate feature maps produced by each layer as the input flows forward through the network.

Activations are not stored between forward passes — they are computed fresh each time. But they must be live in memory simultaneously while a given layer is executing. On a microcontroller, with only 256 KB of SRAM shared between everything, activations often dominate memory usage — far exceeding the weights of the layer computing them.

Computing activation memory: conv layer

A 2D conv layer with output feature map (c_out, H_out, W_out) must store that entire output tensor in memory. Its size (for batch size 1):

Activation Size = c_out × H_out × W_out × (bitwidth / 8)

Worked example: ResNet-50's first conv output (64 channels, 112×112):

64 × 112 × 112 × 2 bytes (FP16) = 1,605,632 bytes = 1.5 MB

Compare this to the weight size: 9,408 params × 2 bytes = 18,816 bytes ≈ 18 KB. The activation is about 85× larger than the layer's parameters. This ratio (large output feature maps with relatively small filters) is the typical case in the early layers of CNNs.

Critical misconception: parameters are the memory bottleneck on MCUs. For tiny microcontrollers, activations are often the binding constraint, not parameters. Consider a network with 100K parameters at INT8 = 100 KB. That fits in 256 KB of SRAM with about 156 KB to spare. But if any layer's live feature maps need more than that remaining 156 KB, the network cannot run, even though its weights fit. MCU-friendly architectures (like MCUNet) co-design parameter count AND activation sizes simultaneously.

Peak activation memory

During inference, you need memory for the current layer's input AND output simultaneously (so the output doesn't overwrite the input before the computation finishes). The peak activation memory is the maximum memory needed at any point during the forward pass:

Peak Activation ≈ max over layers l of (input_l size + output_l size)

For sequential networks like ResNets, this is usually the first few layers where spatial resolution is high. For transformers, it is the attention matrices: storing the full QKᵀ score matrix for a sequence length L and h attention heads requires L² × h × 2 bytes at FP16, which at L=4096 and h=32 is 4096² × 32 × 2 ≈ 1 GB, just for the attention scores of a single layer.
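The peak-activation rule can be sketched as a scan over consecutive feature-map shapes. This assumes a purely sequential network (no skip connections holding extra tensors alive) and batch size 1; the toy shapes are hypothetical and the function names are illustrative.

```python
def tensor_bytes(channels, height, width, bits=8):
    """Size of one feature map for batch size 1."""
    return channels * height * width * bits // 8


def peak_activation_bytes(shapes, bits=8):
    """shapes: feature-map shapes [(C, H, W), ...] from network input to output.
    Layer l reads shapes[l] and writes shapes[l + 1]; both are live at once."""
    sizes = [tensor_bytes(c, h, w, bits) for c, h, w in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


# Hypothetical toy CNN at INT8: input image followed by three conv outputs.
toy_shapes = [(3, 96, 96), (16, 48, 48), (32, 24, 24), (64, 12, 12)]
peak = peak_activation_bytes(toy_shapes)

print(f"Peak activation: {peak / 1024:.1f} KB")
```

In this toy network the peak occurs at the first layer, where the high-resolution input and its output are both live, which matches the pattern described above for CNNs.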

python
def activation_size_bytes(c_out, h_out, w_out, bits=16, batch=1):
    return batch * c_out * h_out * w_out * bits // 8

# ResNet-50 layer 1 output
act = activation_size_bytes(64, 112, 112, bits=16)
weights = 3 * 64 * 7 * 7 * 2  # FP16 weights
print(f"Activations: {act/1024:.1f} KB")       # 1,568 KB = 1.5 MB
print(f"Weights:     {weights/1024:.1f} KB")    # 18 KB
print(f"Ratio: {act/weights:.0f}×")             # 85×

# Transformer attention memory at sequence length L
L, h, d_head = 4096, 32, 128
# QK^T matrix: (h, L, L) — one attention score per head per query-key pair
attn_mem = h * L * L * 2  # FP16
print(f"Attention scores at L={L}: {attn_mem/1e9:.2f} GB")  # ~1 GB
An MCU has 512 KB SRAM. A conv layer has 200K parameters (INT8 = 200 KB) and produces a feature map of size 128 channels × 64 × 64. Can this layer run on the MCU at INT8?

Chapter 5: The Energy Hierarchy

On a battery-powered edge device, energy is the ultimate constraint. A Raspberry Pi running ResNet inference at 1 fps uses enough power to drain a CR2032 coin cell in about 90 minutes. A hearing aid runs on the same battery for weeks. The difference is not just chip efficiency — it is where the chip spends its energy.

The energy cost of a computation depends not just on what arithmetic is performed, but on where the data comes from. The memory hierarchy in a processor has fundamentally different energy costs at each level:

The key insight: memory access dominates energy, not arithmetic. A 32-bit multiply costs about 3.7 pJ (45 nm CMOS figures). A 32-bit DRAM read costs about 640 pJ, over 170× more energy. Computing faster is useless if the chip spends most of its energy fetching data from DRAM. Efficient ML is largely about making the memory access pattern cheaper.
| Memory Level | Access Energy (32-bit) | Relative to Reg | Capacity (typical) |
|---|---|---|---|
| Register | ~0.1 pJ | 1× | ~KB (tens of registers) |
| L1 Cache (SRAM) | ~0.5 pJ | 5× | 32–256 KB |
| L2 Cache (SRAM) | ~2 pJ | 20× | 256 KB–4 MB |
| On-chip SRAM (ML) | ~5 pJ | 50× | 4–32 MB (GPU shared mem) |
| Off-chip DRAM | ~640 pJ | 6,400× | 4–80 GB (GPU HBM) |

The multiply, SRAM, and DRAM figures are the 45 nm CMOS estimates from Mark Horowitz's ISSCC 2014 talk, as cited in Song Han's papers. The takeaway is unambiguous: if you read a weight from DRAM, you spend 640 pJ. The multiply with that weight costs 3.7 pJ. The memory access is 173× more expensive than the computation.

Why this makes on-chip SRAM crucial

If you can keep the weights in on-chip SRAM (5 pJ/access) instead of DRAM (640 pJ/access), you reduce energy by 128×. This is the fundamental reason why model compression matters for energy efficiency: smaller models fit in cache, eliminating DRAM traffic. A model that is 4× smaller can cut energy by far more than 4× if the entire model now stays in SRAM.

This also explains why quantization (INT8 vs FP32) saves more energy than just the 4× size reduction suggests. INT8 weights fit in 4× less memory, so 4× more of them fit in cache. The cache hit rate improves, fewer DRAM accesses occur, and each access fetches a shorter word — the energy savings compound.

python
# Energy estimation: DRAM access vs computation
E_DRAM_pJ = 640    # energy for 32-bit DRAM access
E_SRAM_pJ = 5     # on-chip SRAM
E_MAC_pJ  = 3.7   # multiply-accumulate (32-bit)

# ResNet-50 first layer: 118M MACs, 9,408 parameters
n_macs = 118_013_952
n_params = 9_408  # filter weights, read once per forward pass

# Case 1: weights in DRAM
energy_compute = n_macs * E_MAC_pJ
energy_mem_dram = n_params * E_DRAM_pJ  # each weight loaded from DRAM
print(f"Compute energy: {energy_compute/1e6:.1f} µJ")      # 436 µJ
print(f"DRAM load energy: {energy_mem_dram/1e6:.4f} µJ")  # 6.0211 µJ
# Wait — for a conv, each weight is REUSED h_out×w_out = 12,544 times!
energy_mem_dram_total = n_params * 12544 * E_DRAM_pJ  # if not cached
print(f"DRAM if not cached: {energy_mem_dram_total/1e9:.1f} mJ")  # 75 mJ
print(f"That's {energy_mem_dram_total/energy_compute:.0f}× the compute energy!")

The code reveals why filter caching is critical in conv layers: the same 9,408 weights are used at 12,544 different spatial positions. If you reload them from DRAM each time, memory energy dwarfs compute energy by 170×. If you cache the filter in on-chip SRAM for the duration of the spatial sweep, the energy profile inverts.
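To see what caching buys, the two weight-traffic scenarios can be put side by side. This is a deliberately simplified sketch: it uses the per-access energies quoted in this chapter, treats 3.7 pJ as the per-MAC cost as the text does, and ignores activation traffic and register-level reuse. Under these assumptions, caching brings weight-access energy down from roughly 173× the compute energy to the same order of magnitude as compute.

```python
E_DRAM_PJ = 640   # 32-bit DRAM access
E_SRAM_PJ = 5     # 32-bit on-chip SRAM access
E_MAC_PJ = 3.7    # 32-bit multiply, used as the per-MAC cost as in the text

filter_params = 9_408
positions = 112 * 112                  # spatial positions the filter visits
total_macs = filter_params * positions

compute_pj = total_macs * E_MAC_PJ
# No caching: every weight use is a DRAM read.
uncached_pj = total_macs * E_DRAM_PJ
# Cached: read each weight from DRAM once, then serve every use from SRAM.
cached_pj = filter_params * E_DRAM_PJ + total_macs * E_SRAM_PJ

print(f"Compute:          {compute_pj / 1e6:10.1f} uJ")
print(f"Weights uncached: {uncached_pj / 1e6:10.1f} uJ")
print(f"Weights cached:   {cached_pj / 1e6:10.1f} uJ")
print(f"Caching cuts weight-access energy by {uncached_pj / cached_pj:.0f}x")
```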

Memory hierarchy energy ladder

Click a memory level to see its energy cost per 32-bit access in pJ. The bar heights are log-scaled. Hover to see the ratio vs a register access.

A model has 10M parameters at FP32. You quantize to INT8 (4× fewer bytes). Which energy benefit is GREATER: the 4× reduction in bytes transferred, or the improvement from keeping more weights in cache?

Chapter 6: Latency vs Throughput

Two numbers describe how fast a system processes data: latency and throughput. They sound like they measure the same thing, but they don't — and confusing them leads to wrong optimization decisions.

Latency is the time from input arrival to output delivery for a single request: milliseconds per query. It is what matters for interactive applications — if you ask a question and the answer takes 10 seconds, you notice. Voice recognition, autonomous driving, and AR all require low latency.

Throughput is the total number of queries processed per unit time: queries per second (QPS) or, for text generation, tokens per second. A system can have high throughput and high latency simultaneously if it processes many requests in parallel — each individual request waits longer, but the system handles more volume overall.

Throughput = Batch Size / Latency

This formula hides the tension: increasing batch size increases throughput but also increases latency (each request waits for the batch to fill before processing begins). The tradeoff is real and unavoidable. Cloud inference providers exploit this by batching requests from multiple users — a request that might take 50 ms alone takes 80 ms batched, but the system handles 10× more traffic.
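The batching example can be made concrete. The 50 ms and 80 ms latencies come from the text; the batch size of 16 is an assumption chosen so that the 10× figure works out, and `throughput_qps` is an illustrative helper.

```python
def throughput_qps(batch_size, latency_s):
    """Queries per second when `batch_size` requests finish every `latency_s`."""
    return batch_size / latency_s


single = throughput_qps(1, 0.050)     # one request at a time, 50 ms each
batched = throughput_qps(16, 0.080)   # batch of 16 (assumed), 80 ms per batch

print(f"Unbatched: {single:.0f} QPS at 50 ms latency")
print(f"Batched:   {batched:.0f} QPS at 80 ms latency")
print(f"Throughput gain: {batched / single:.0f}x for {0.080 / 0.050:.1f}x the latency")
```

Each user waits 1.6× longer, but the server handles ten times as many requests per second.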

Batching and arithmetic intensity

Batching does more than just amortize overhead. It fundamentally changes the arithmetic intensity of the computation — and whether the workload is compute-bound or memory-bound.

For a matrix multiplication with weight matrix W of shape (c_in, c_out) and a batch of B inputs, counting only the FP16 weight bytes (2 per parameter):

Arithmetic Intensity = MACs / bytes = (B × c_in × c_out) / (c_in × c_out × 2) = B / 2

At batch size B=1, intensity = 0.5 MACs/byte, which is deeply memory-bound: with the A100 figures used in Chapter 7 (312 TFLOPS, 2 TB/s), the ridge point is 156 FLOPs/byte, or about 78 MACs/byte. At B=512, intensity = 256 MACs/byte, which is compute-bound. Batching moves the workload from the memory-bound to the compute-bound regime, recovering the GPU's full compute utilization.

LLM autoregressive decode is always memory-bound at batch size 1. During token generation, the model processes one token at a time (batch=1). The arithmetic intensity for each transformer layer's weight matmul is B/2 = 0.5 — far below the roofline ridge point. The model spends most of its time loading weights from HBM, not computing. This is why speculative decoding, continuous batching, and quantization (smaller weights = faster load) matter so much for LLM serving latency.
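When decode is bandwidth-bound, a lower limit on per-token latency is simply the weight bytes divided by memory bandwidth. The sketch below applies that to a 7B model using the 2 TB/s A100 bandwidth figure from this lesson. It ignores KV-cache reads, activations, and kernel overhead, so real latencies are higher; `decode_ms_per_token` is an illustrative name.

```python
def decode_ms_per_token(n_params, bits_per_param, bandwidth_bytes_per_s):
    """Bandwidth-bound lower limit: every weight is read once per token."""
    weight_bytes = n_params * bits_per_param / 8
    return weight_bytes / bandwidth_bytes_per_s * 1e3


HBM_BW = 2e12  # 2 TB/s, the A100 figure used in this lesson

fp16_ms = decode_ms_per_token(7e9, 16, HBM_BW)
int4_ms = decode_ms_per_token(7e9, 4, HBM_BW)

print(f"FP16: at least {fp16_ms:.2f} ms/token")
print(f"INT4: at least {int4_ms:.2f} ms/token")
```

The bound scales linearly with bytes per weight, which is why shrinking the weights speeds up batch-1 decoding even when the MAC count is unchanged.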
python
# Arithmetic intensity for an FC layer, varying batch size
c_in, c_out = 4096, 4096
bytes_per_param = 2  # FP16
# A100 ridge from Chapter 7: 312 TFLOPS / 2 TB/s = 156 FLOPs/byte = 78 MACs/byte
ridge_macs_per_byte = 78

for B in [1, 4, 16, 64, 256, 1024]:
    macs = B * c_in * c_out
    weight_bytes = c_in * c_out * bytes_per_param  # loaded once per batch
    intensity = macs / weight_bytes  # MACs per byte
    bound = "memory" if intensity < ridge_macs_per_byte else "compute"
    print(f"B={B:5d}: intensity={intensity:6.1f} MACs/byte → {bound}-bound")

# Output:
# B=    1: intensity=   0.5 MACs/byte → memory-bound
# B=    4: intensity=   2.0 MACs/byte → memory-bound
# B=   16: intensity=   8.0 MACs/byte → memory-bound
# B=   64: intensity=  32.0 MACs/byte → memory-bound
# B=  256: intensity= 128.0 MACs/byte → compute-bound
# B= 1024: intensity= 512.0 MACs/byte → compute-bound
During LLM inference, a user is generating text one token at a time (batch size = 1). An engineer proposes reducing model size by 4× via INT4 quantization. What is the PRIMARY benefit for latency?

Chapter 7: Showcase: The Roofline Model

You have a neural network layer. You have a GPU. Will the layer run fast? The answer depends on one number: arithmetic intensity — how many arithmetic operations are performed per byte of data read from memory.

Arithmetic Intensity (AI) = #FLOPs / bytes transferred from memory = 2 × #MACs / bytes transferred from memory

The roofline model makes the bound explicit. A processor has two limits:

  1. Peak compute throughput (FLOPS): the maximum arithmetic operations per second — determined by hardware (e.g., A100 FP16 = 312 TFLOPS)
  2. Peak memory bandwidth (BW): the maximum bytes per second the chip can load from memory (A100 HBM = 2 TB/s)

The ridge point is the arithmetic intensity at which both limits are saturated simultaneously:

Ridge Point = Peak FLOPS / Peak BW = 312 TFLOPS / 2 TB/s = 156 FLOPs/byte (A100)

If your workload's arithmetic intensity is below 156 FLOPs/byte, you are memory-bound — you compute so fast that the chip sits idle waiting for data. Performance is limited by BW: achievable_FLOPS = AI × BW. If intensity is above 156, you are compute-bound — you process data fast enough to fully utilize the compute units. Performance is limited by peak FLOPS.

The roofline gives you the ceiling. Achievable performance = min(Peak FLOPS, AI × BW). This is two lines on a log-log plot: a horizontal ceiling (peak compute) and a sloped line (bandwidth limit). Every real workload lands on or below both lines. Plotting your operation on this chart instantly tells you what to optimize: if you are under the sloped part of the roof (memory-bound), optimizing compute is useless. You need to increase AI through tiling, caching, or quantization.
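The roofline formula is short enough to write down directly. A minimal sketch using the A100 figures quoted in this chapter; the `roofline` function and its return format are illustrative choices.

```python
def roofline(ai_flops_per_byte, peak_flops, bandwidth_bytes_per_s):
    """Return (achievable FLOPS, regime) for a given arithmetic intensity."""
    ridge = peak_flops / bandwidth_bytes_per_s
    achievable = min(peak_flops, ai_flops_per_byte * bandwidth_bytes_per_s)
    regime = "memory-bound" if ai_flops_per_byte < ridge else "compute-bound"
    return achievable, regime


A100_PEAK = 312e12   # FP16 FLOPS, from the text
A100_BW = 2e12       # bytes/s, from the text

for ai in (1, 32, 156, 512):
    flops, regime = roofline(ai, A100_PEAK, A100_BW)
    print(f"AI={ai:4d} FLOPs/byte: {flops / 1e12:6.1f} TFLOPS ({regime})")
```

At 1 FLOP/byte the chip can sustain only 2 of its 312 TFLOPS, which is the situation batch-1 LLM decoding is in.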

Real operations on the A100 roofline

| Operation | Arithmetic Intensity | Regime (A100 FP16) |
|---|---|---|
| Elementwise ReLU | ~0.25 FLOPs/byte | Memory-bound (624× below ridge) |
| LLM decode (B=1) | ~1 FLOPs/byte | Memory-bound (156× below ridge) |
| Softmax | ~3 FLOPs/byte | Memory-bound |
| LLM prefill (B=32) | ~32 FLOPs/byte | Memory-bound (approaching ridge) |
| Large matmul (B=512) | ~512 FLOPs/byte | Compute-bound |
| Conv (large batch) | ~1000 FLOPs/byte | Deeply compute-bound |

Interactive Roofline Model — A100 FP16

Drag the arithmetic intensity slider to see where your operation lands. Below the ridge point = memory-bound (you need faster memory or higher intensity). Above = compute-bound (you need faster math or fewer FLOPs).

A GPU has peak compute of 100 TFLOPS and memory bandwidth of 2 TB/s. The ridge point is 50 FLOPs/byte. An operation has arithmetic intensity of 10 FLOPs/byte. What is its achievable throughput?

Chapter 8: Interactive Conv MAC Calculator

This chapter brings together every metric from this lesson into a single interactive calculator. Adjust any parameter of a convolutional layer and see all efficiency metrics update live: parameter count, model size at multiple precisions, MAC count, activation size, and arithmetic intensity.

This is the tool you would use at the start of a TinyML design project: set your target hardware constraints (MCU SRAM, power budget), then explore architectures that fit those constraints before writing any code.
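A non-interactive version of such a calculator might look like the sketch below. It is one possible design, not the page's actual implementation: it ignores biases, assumes stride 1 with "same" padding when the input size is not given, and estimates arithmetic intensity by assuming weights, input, and output each cross the memory interface exactly once. The name `conv_report` and the dictionary keys are illustrative.

```python
def conv_report(c_in, c_out, k, h_out, w_out, bits=8, h_in=None, w_in=None):
    """Efficiency metrics for one conv layer at batch size 1 (biases ignored).
    If h_in/w_in are omitted, stride 1 with 'same' padding is assumed."""
    h_in = h_out if h_in is None else h_in
    w_in = w_out if w_in is None else w_in
    bytes_per_elem = bits / 8

    params = c_out * c_in * k * k
    macs = params * h_out * w_out
    weight_bytes = params * bytes_per_elem
    in_bytes = c_in * h_in * w_in * bytes_per_elem
    out_bytes = c_out * h_out * w_out * bytes_per_elem
    # Simplification: each tensor crosses the memory interface exactly once.
    moved = weight_bytes + in_bytes + out_bytes

    return {
        "params": params,
        "macs": macs,
        "flops": 2 * macs,
        "weight_bytes": weight_bytes,
        "activation_bytes": in_bytes + out_bytes,
        "intensity_flops_per_byte": 2 * macs / moved,
    }


report = conv_report(c_in=16, c_out=16, k=3, h_out=56, w_out=56, bits=8)
for name, value in report.items():
    print(f"{name:>26}: {value:,.1f}")
```

Comparing `activation_bytes` and `weight_bytes` against the target SRAM budget is the same check the calculator's red highlighting performs.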

Conv Layer Efficiency Calculator

Drag sliders to configure the convolutional layer. All metrics update live. Red values exceed typical MCU constraints (256 KB SRAM for activations, 512 KB for weights).

Default settings: C_in = 16, C_out = 16, kernel K = 3×3, output H = W = 56.

You are designing a conv layer for an MCU with 256 KB SRAM (activations must fit). You have c_in=32, c_out=64, K=3, and spatial output 32×32. Does the activation fit? (INT8 = 1 byte/element)

Chapter 9: Connections & Cheat Sheet

You now have the complete vocabulary of efficient deep learning. This chapter consolidates every metric into a single reference, shows how they connect to the four major compression techniques, and points to the next lessons in this series.

The complete metrics cheat sheet

| Metric | Formula | What it measures | Typical units |
|---|---|---|---|
| #Parameters | FC: c_in×c_out; Conv: c_out×c_in×K² | Learnable weights (storage) | M, B (billions) |
| Model Size | #params × bitwidth / 8 | Memory footprint of weights | MB, GB |
| Peak Activations | max(c_out × H_out × W_out) × bytes | Intermediate feature map memory | KB, MB |
| MACs | FC: c_in×c_out; Conv: c_out×c_in×K²×H_out×W_out | Multiply-accumulate ops (compute) | M, G, T MACs |
| FLOPs | 2 × MACs | Floating-point ops (hardware reports) | MFLOP, GFLOP, TFLOP |
| Arithmetic Intensity | FLOPs / bytes_from_memory | Compute efficiency vs memory | FLOPs/byte |
| Latency | Wall-clock time per query | Response time | ms, s |
| Throughput | Batch Size / Latency | Requests per second | QPS, tokens/s |
| Energy | Σ (accesses × pJ/access) | Battery impact | µJ, mJ per inference |

How each compression technique targets these metrics

Pruning
Removes weights → ↓ #params → ↓ model size → ↓ MACs. Structured pruning (whole channels) → ↓ activation size. Requires sparse-aware hardware for full speedup. MIT 6.5940 Lecture 3-5.
Quantization
↓ bitwidth → ↓ model size → ↓ energy (smaller DRAM reads) → ↑ arithmetic intensity (denser packing) → may move from memory-bound to compute-bound. INT8/INT4 quantization. MIT 6.5940 Lecture 6-8.
Neural Architecture Search
Finds architectures with low MACs and activations for given accuracy. MCUNet co-optimizes params + peak activations. MIT 6.5940 Lecture 9-11.
Knowledge Distillation
Trains a small student to mimic a large teacher. Targets: same task, ↓ params, ↓ MACs. Does not change bitwidth or spatial sizes directly. MIT 6.5940 Lecture 12-13.

Related lessons on this site

The CS336 Resource Accounting lesson covers FLOPs, memory, and the roofline model from the training perspective: CS336 Lec 2 — PyTorch & Resource Accounting. The GPU internals lesson goes deep on arithmetic intensity and Flash Attention: CS336 Lec 5 — GPUs. For inference latency and the KV cache: CS336 Lec 10 — Inference.

The CS336 Kernels lesson derives the exact roofline analysis for elementwise operations and tiling: CS336 Lec 6 — Kernels & Triton.

What you can now do. Given any neural network layer, you can compute: its parameter count and model size at any bitwidth, its MAC count and FLOPs, its activation memory footprint, its arithmetic intensity, and whether it will be memory-bound or compute-bound on a given accelerator. You can estimate energy using the memory hierarchy numbers and identify the binding constraint (compute, memory bandwidth, or SRAM capacity). This is the vocabulary every paper in this field uses.
"What I cannot create, I do not understand." — Richard Feynman. Build a function that takes (c_in, c_out, K, H_out, W_out, bitwidth) and returns a complete efficiency report. If you can write it from scratch, you understand every metric in this lesson.
A TinyML engineer needs to deploy a classification model to a device with 512 KB SRAM. She has a model with 400K parameters at INT8 (400 KB). The largest activation tensor is 150 KB. Can the model run? What is the binding constraint?