TinyML & Efficient Deep Learning Workbook

Chapter 0: Efficiency Metrics

Before you can shrink a model you need to measure it. Three numbers dominate: parameter count (storage), MACs (compute), and peak activation memory (runtime SRAM). Getting these wrong means shipping a model that silently overflows a microcontroller.

Model size in bits:
bits = #params × bits-per-param
bytes = bits ⁄ 8

FLOPs vs MACs:
1 MAC (multiply-accumulate) = 2 FLOPs (one multiply + one add)
FLOPs ≈ 2 × MACs

Roofline intensity:
arithmetic intensity I = FLOPs ⁄ bytes_moved
achievable perf = min(I × mem_BW, peak_FLOP_s)

Energy hierarchy rule of thumb. A DRAM access costs ~200× more energy than a register operation. SRAM is ~6× cheaper than DRAM. Minimising off-chip memory traffic matters far more than reducing raw FLOPs on most edge hardware.

Exercise 0.1: Model Size in Bytes Derive

A MobileNetV2 variant has 3.4 million parameters stored in FP32 (32 bits each). What is the model size in megabytes (MB)? (1 MB = 10⁶ bytes)

Show derivation

bytes = 3,400,000 × 32 ⁄ 8 = 3,400,000 × 4 = 13,600,000 bytes = 13.6 MB

Each FP32 parameter occupies 4 bytes. 3.4 M × 4 = 13.6 MB. This is why INT8 quantization (1 byte per param) reduces storage to 3.4 MB — a 4× saving.

Exercise 0.2: FLOPs from MACs Derive

A layer performs 150 million multiply-accumulate operations (MACs). How many FLOPs is that?

MFLOPs

Show derivation

FLOPs = 2 × MACs = 2 × 150 M = 300 MFLOPs

Each MAC fuses a multiply and an add into one operation. When reporting FLOPs (floating-point operations), we count both, so 1 MAC = 2 FLOPs. Many papers report MACs; others report FLOPs — always check which one before comparing numbers.

Exercise 0.3: Arithmetic Intensity Derive

A matrix multiply moves 400 KB of data and performs 3.2 GFLOPs. What is the arithmetic intensity in FLOPs/byte?

FLOPs/byte

Show derivation

I = FLOPs ⁄ bytes = 3.2 × 10⁹ ⁄ (400 × 10³) = 8,000 FLOPs/byte

8000 FLOPs/byte is deep in the compute-bound region on any modern chip (most GPU memory rooflines are at ~200 FLOPs/byte for FP32). Large matrix multiplies are almost always compute-bound — this is why batch size matters for GPU utilisation.

Exercise 0.4: Activation Memory for One Layer Derive

A Conv2D layer outputs a feature map of shape (batch=1, C=64, H=56, W=56) stored as FP32. How many KB of activation memory does it consume?

Show derivation

elements = 1 × 64 × 56 × 56 = 200,704
bytes = 200,704 × 4 = 802,816 bytes ≈ 784 KB

802,816 / 1024 = 784 KB. On a microcontroller with 512 KB SRAM total, this single layer would already overflow! Peak activation memory — not parameter storage — is the binding constraint on MCUs.

Exercise 0.5: Roofline Bottleneck Derive

A microcontroller has peak compute 200 MFLOPs/s and memory bandwidth 50 MB/s. The ridge point (balance point) is at what arithmetic intensity (FLOPs/byte)?

FLOPs/byte

Show derivation

ridge = peak_compute ⁄ mem_BW = 200 × 10⁶ FLOPs/s ⁄ 50 × 10⁶ bytes/s = 4 FLOPs/byte

Operations with I < 4 FLOPs/byte are memory-bound; those with I > 4 are compute-bound. Most depthwise convolutions on MCUs fall below the ridge point, making memory bandwidth the bottleneck, not raw compute.

Chapter 1: NN Building Blocks

To shrink a network you must first count exactly what is inside it. Every Linear and Conv2D layer has a deterministic parameter count and MAC count derivable from its shape alone.

Linear layer: in → out
params = in × out + out (bias)
MACs = in × out (per sample)

Conv2D: kernel K×K, C_in inputs, C_out outputs, output H_o×W_o
params = K² × C_in × C_out + C_out
MACs = K² × C_in × C_out × H_o × W_o

Depthwise separable:
DW params = K² × C + C
PW params = C × C' + C'
MACs ≈ (K² + C') × C × H_o × W_o

MobileNet insight. Depthwise separable convolution replaces one 3×3 Conv2D with a 3×3 depthwise (per-channel) followed by a 1×1 pointwise. The FLOP reduction factor is approximately C⁄(K² + C) ≈ 1/9 for K=3 and large C — roughly 8× fewer MACs for typical channel counts.

Exercise 1.1: Linear Layer Parameter Count Derive

A fully-connected layer maps 512 inputs to 256 outputs, with bias. How many trainable parameters does it have?

params

Show derivation

params = 512 × 256 + 256 = 131,072 + 256 = 131,328

The weight matrix is 512×256 = 131,072 entries. The bias vector adds one scalar per output neuron: 256 more. Total: 131,328. In practice many recent architectures omit bias in Linear layers that precede a BatchNorm — then params = 512 × 256 = 131,072.

Exercise 1.2: Conv2D Parameter Count Derive

A Conv2D layer has kernel 3×3, 32 input channels, 64 output channels, with bias. How many parameters?

params

Show derivation

params = 3² × 32 × 64 + 64 = 9 × 2048 + 64 = 18,432 + 64 = 18,496

Each of the 64 output filters is a 3×3×32 kernel = 288 weights. 64 filters × 288 = 18,432 weights. Add 64 biases. Contrast with the depthwise version next exercise: 3²×32 = 288 DW params + 32×64 PW params = 2336 total — 8× fewer.

Exercise 1.3: Conv2D MACs Derive

The 3×3 Conv2D from Ex 1.2 (32 in, 64 out) processes an output feature map of 28×28. How many MACs (in millions)?

MMACs

Show derivation

MACs = K² × C_in × C_out × H_o × W_o
= 9 × 32 × 64 × 28 × 28 = 9 × 32 × 64 × 784 = 14,450,688 ≈ 14.45 MMACs

Wait — 14.45 M, not 451. Each output pixel requires K²×C_in MACs to sum over the receptive field. Then we do this for all C_out output channels and all H_o×W_o output positions. Answer: 14.45 MMACs (enter 14.45).

Exercise 1.4: Depthwise Separable FLOP Reduction Derive

Replace the 3×3 Conv2D (32→64, output 28×28) with a depthwise separable block (3×3 DW + 1×1 PW). Ratio = standard MACs ⁄ separable MACs. Round to 1 decimal. (Ignore biases.)

× reduction

Show derivation

Standard MACs = 9 × 32 × 64 × 28² = 14,450,688
DW MACs = 9 × 32 × 28² = 225,792
PW MACs = 1 × 32 × 64 × 28² = 1,605,632
Separable total = 1,831,424
Ratio = 14,450,688 ⁄ 1,831,424 ≈ 7.89 ≈ 8.0×

The theoretical reduction is 1 ⁄ (1⁄C_out + 1⁄K²) = 1/(1/64 + 1/9) ≈ 8.0. This matches MobileNet's empirical ~8× FLOP saving over standard convolutions.

Exercise 1.5: Linear Layer MACs Per Token Derive

A Transformer FFN first projects from d=512 to 4d=2048 (no bias). Per input token, how many MACs does this first projection require?

KMACs/token

Show derivation

MACs = 512 × 2048 = 1,048,576 ≈ 1,048 KMACs per token

For a Linear(d, 4d) layer, each output neuron does d multiplications and d additions (d MACs). With 4d output neurons the total is d × 4d = 4d² MACs. Here: 4 × 512² = 1,048,576 MACs = 1,048 KMACs per token.

Chapter 2: Pruning

Pruning removes weights that contribute least to the output. The key quantity is saliency — a score predicting how much the loss increases if a weight is zeroed. Optimal Brain Damage (OBD) approximates saliency using the second derivative of the loss.

Sparsity ratio:
s = #zeros ⁄ #total_params

OBD saliency (diagonal Hessian approx):
S_i = (∂²L ⁄ ∂w_i²) × w_i² ⁄ 2

N:M sparsity (e.g. 2:4):
Keep N non-zero out of every M consecutive weights
Sparsity = 1 − N⁄M

CSR compression (unstructured sparse matrix):
stored values = nnz (non-zero count)
bytes ≈ nnz × (value_bytes + index_bytes)

2:4 sparsity on NVIDIA Ampere. NVIDIA's Sparse Tensor Core hardware requires exactly 2 non-zero values in every group of 4 consecutive weights (50% sparsity). The hardware can then execute at 2× the dense throughput because it skips the zero multiplications. This constraint makes 2:4 far more hardware-efficient than arbitrary unstructured sparsity.

Exercise 2.1: Sparsity Ratio Derive

A layer has 500,000 parameters. After magnitude pruning 350,000 are set to zero. What is the sparsity ratio as a percentage?

Show derivation

s = 350,000 ⁄ 500,000 = 0.70 = 70%

70% sparsity means 30% of weights remain active. This is sometimes written as "30% density." For magnitude pruning, we sort |w_i| and zero the bottom 70% by absolute value — a simple but surprisingly effective baseline.

Exercise 2.2: OBD Saliency Score Derive

A weight w = 0.4. The diagonal Hessian entry H_ii = ∂²L/∂w_i² = 5.0. What is the OBD saliency S_i?

(unitless)

Show derivation

S_i = H_ii × w_i² ⁄ 2 = 5.0 × 0.16 ⁄ 2 = 0.80 ⁄ 2 = 0.40

The OBD formula is a second-order Taylor expansion of the loss increase when w_i is zeroed: ΔL ≈ H_ii w_i²/2. Prune weights with smallest S_i — they cause the least loss increase when removed.

Exercise 2.3: 2:4 Sparsity Fraction Derive

NVIDIA 2:4 structured sparsity keeps 2 weights per group of 4. What sparsity fraction (as %) does this produce?

Show derivation

zeros per group = M − N = 4 − 2 = 2
sparsity = 2 ⁄ 4 = 50%

2:4 is exactly 50% sparse by construction. Every group of 4 consecutive weights has exactly 2 zeros and 2 non-zeros. The hardware stores only the 2 non-zero values plus a 2-bit index mask per group — enabling 2× throughput on Ampere Sparse Tensor Cores.

Exercise 2.4: CSR Memory Saving Derive

A weight matrix has 1,000,000 parameters at FP32 (4 bytes each). After 90% pruning only 100,000 are non-zero. In CSR format each non-zero uses 4 bytes (value) + 4 bytes (column index). What is the CSR size in MB? (1 MB = 10⁶ bytes)

Show derivation

CSR bytes = 100,000 × (4 + 4) = 800,000 bytes = 0.8 MB
Dense bytes = 1,000,000 × 4 = 4.0 MB
Saving: 4.0 MB → 0.8 MB (5× smaller)

CSR breaks even at 50% sparsity for FP32+int32 (same overhead as dense). Above 50% sparsity it saves space. At 90% sparsity: 5× compression. The catch: random memory access patterns in CSR kill hardware utilisation — which is why structured sparsity (2:4, block sparsity) is preferred in practice.

Exercise 2.5: Remaining Non-Zero Params Derive

A 10-layer network with 50,000 parameters per layer is pruned to 40% sparsity globally. How many non-zero parameters remain in total?

params

Show derivation

total = 10 × 50,000 = 500,000
non-zero = 500,000 × (1 − 0.40) = 500,000 × 0.60 = 300,000

At 40% sparsity, 60% of weights survive. 300,000 non-zero parameters remain. Whether this buys actual speedup depends on the hardware: unstructured sparsity rarely accelerates inference on MCUs without dedicated sparse arithmetic units.

Chapter 3: Quantization I — INT8 Basics

Quantization maps floating-point values to a smaller integer representation. The mapping is defined by a scale factor S and a zero-point Z computed from the observed min/max of the tensor. Getting S and Z right controls the round-trip dequantization error.

Asymmetric INT8 (min/max range):
S = (max − min) ⁄ (2^b − 1)    (b = 8 ⇒ range 0..255)
Z = round(0 − min ⁄ S)    clamped to [0, 255]

Quantize:   q = clamp( round(x ⁄ S) + Z, 0, 255 )
Dequantize: x̂ = S × (q − Z)
Quantization error:  ε = x̂ − x

K-means compression ratio:
r = original_bits ⁄ b_codebook    (b_codebook = log₂(K))

Zero-point encodes the float 0. Z tells the integer that maps to exactly 0.0 in float space. For ReLU outputs (always ≥ 0) a symmetric scheme can use Z=0 — but for arbitrary activations the asymmetric scheme with non-zero Z squeezes more precision out of the 8-bit range.

Exercise 3.1: Compute Scale S Derive

An activation tensor has min = −2.0 and max = 6.0. Compute the INT8 asymmetric scale S (to 4 decimal places).

float/int

Show derivation

S = (6.0 − (−2.0)) ⁄ (255 − 0) = 8.0 ⁄ 255 ≈ 0.03137

S is the float value of one INT8 step. The denominator is 2⁸−1 = 255 because unsigned INT8 spans [0, 255] = 256 levels. Answer: 8/255 ≈ 0.0314.

Exercise 3.2: Compute Zero-Point Z Derive

Using min = −2.0, S ≈ 0.03137, compute the integer zero-point Z (round to nearest integer).

integer

Show derivation

Z = round(0 − min ⁄ S) = round(0 − (−2.0) ⁄ 0.03137)
= round(2.0 ⁄ 0.03137) = round(63.75) = 64

Z = 64 means integer 64 represents the float value 0.0. Plugging back: x̂ = S×(64−64) = 0.0. Good: zero maps to zero (important for ReLU and zero-padding correctness).

Exercise 3.3: Quantize a Value Derive

Using S = 0.03137, Z = 64, quantize the float value x = 1.5 to an INT8 integer (round to nearest, clamp to [0, 255]).

integer

Show derivation

q = clamp(round(x ⁄ S) + Z, 0, 255)
= clamp(round(1.5 ⁄ 0.03137) + 64, 0, 255)
= clamp(round(47.82) + 64, 0, 255)
= clamp(48 + 64, 0, 255) = 112

Sanity check: dequantize q=112 back: x̂ = 0.03137 × (112 − 64) = 0.03137 × 48 = 1.506. The round-trip error is 1.506 − 1.5 = 0.006, which is within one half-step of S.

Exercise 3.4: Dequantization Error Derive

Dequantize q = 112 using S = 0.03137, Z = 64. Then compute the absolute error vs the original x = 1.5. Give the error to 4 decimal places.

float error

Show derivation

x̂ = S × (q − Z) = 0.03137 × (112 − 64) = 0.03137 × 48 = 1.50576
ε = |x̂ − x| = |1.50576 − 1.5| = 0.00576 ≈ 0.0058

The maximum possible quantization error is S/2 ≈ 0.0157. Actual error 0.006 is less than half-step — expected for a value that rounds cleanly. For INT8 across a range of 8.0, the step size 0.0314 is small enough that most modern networks tolerate the error with <1% accuracy drop.

Exercise 3.5: K-Means Compression Ratio Derive

Weight codebook compression: original FP32 weights (32 bits each) are replaced by 16-cluster K-means centroids (4 bits per index). What is the compression ratio (original bits ⁄ compressed bits per weight)?

× compression

Show derivation

b_codebook = log₂(16) = 4 bits per weight index
ratio = 32 ⁄ 4 = 8×

With K=16 clusters, each weight needs only log₂(16)=4 bits to store its cluster ID. The 16 centroid values are stored separately (small overhead for large layers). This gives 8× weight compression vs FP32. Note: activation memory is unchanged — only weight storage shrinks.

Chapter 4: Quantization II — Advanced

Per-tensor quantization uses one S and Z for the whole weight matrix. Per-channel quantization assigns a separate S and Z per output channel, cutting quantization error dramatically for layers where channels have very different magnitude distributions (common in the later layers of deep networks).

Per-tensor: one S, one Z for entire tensor
Per-channel (output channels): one S_c, Z_c per output channel c
Extra storage: C_out extra scale values (negligible vs weights)

Straight-Through Estimator (STE):
∂L⁄∂w ≈ ∂L⁄∂q    (treat quantize as identity during backprop)

SmoothQuant migration factor:
X' = X ⁄ diag(s)   W' = diag(s) × W
s_j = max|X_j|^α ⁄ max|W_j|^1−α   α ≈ 0.5

Why per-channel matters for weights. Weight channels in a Conv layer can have very different ranges. A channel whose weights span [−0.01, 0.01] quantized with a scale fit to another channel spanning [−2, 2] would lose almost all precision. Per-channel quantization gives each channel its own scale, recovering that precision at essentially zero extra inference cost.

Exercise 4.1: Per-Channel Scale Count Derive

A Conv2D has 128 output channels. Per-channel quantization stores one FP32 scale per channel. How many extra bytes does this add compared to per-tensor (one scale)?

bytes

Show derivation

extra scales = 128 − 1 = 127 FP32 values
extra bytes = 127 × 4 = 508 bytes

508 bytes of overhead to store per-channel scales. Compare to the weight matrix: 3×3×C_in×128 weights at 1 byte each (INT8) — easily hundreds of KB. The overhead is tiny but the accuracy gain can be large (often 1–2% top-1 on ImageNet vs per-tensor).

Exercise 4.2: STE Numeric Example Derive

In quantization-aware training with STE, the upstream gradient ∂L⁄∂q = 3.0. According to STE, what value do you use for ∂L⁄∂w in the weight update?

(gradient)

Show derivation

STE: ∂L⁄∂w = ∂L⁄∂q = 3.0

The quantize function has zero gradient almost everywhere (piecewise constant staircase) and undefined gradient at steps. STE replaces the true zero gradient with the identity: pretend quantize is a pass-through during backprop. This allows weights to receive gradient signal and improve across training epochs even though we quantize during the forward pass.

Exercise 4.3: SmoothQuant Migration Factor Derive

A Linear layer has max activation magnitude max|X_j| = 8.0 and max weight magnitude max|W_j| = 0.5 for channel j. With α = 0.5, compute the SmoothQuant migration factor s_j.

(scale)

Show derivation

s_j = max|X_j|^α ⁄ max|W_j|^1−α
= 8.0^0.5 ⁄ 0.5^0.5
= 2.828 ⁄ 0.707 ≈ 4.0

s_j = 4.0 means: divide activation channel j by 4 (making max|X'_j| = 2.0) and multiply weight channel j by 4 (making max|W'_j| = 2.0). Now both tensors have the same range — equally hard to quantize. SmoothQuant migrates quantization difficulty from outlier-heavy activations to weights, which quantize more cleanly.

Exercise 4.4: INT4 vs INT8 Memory Saving Derive

A model has 7 billion FP16 parameters (2 bytes each). Quantizing to INT4 (0.5 bytes each). What is the INT4 model size in GB? (1 GB = 10⁹ bytes)

Show derivation

INT4 bytes = 7 × 10⁹ × 0.5 = 3.5 × 10⁹ bytes = 3.5 GB

FP16: 7B × 2 = 14 GB — doesn't fit in a 12 GB GPU. INT4: 7B × 0.5 = 3.5 GB — fits on a consumer GPU. This is exactly why GPTQ and QLoRA target 4-bit weights for 7B LLMs: a 4× reduction vs FP16 makes them accessible on commodity hardware.

Exercise 4.5: Quantization Levels for b Bits Derive

How many distinct quantization levels does an unsigned b-bit integer have? For b = 4, what is the answer?

levels

Show derivation

levels = 2^b = 2⁴ = 16

4-bit unsigned integers span 0..15 = 16 distinct values. The scale S = range / (2^b−1) = range/15, which is 17× coarser than INT8's 1/255. That's why INT4 is noticeably more lossy and usually requires QAT or GPTQ second-order correction to maintain accuracy on LLMs.

Chapter 5: Neural Architecture Search

Neural Architecture Search (NAS) automates the choice of network topology. The search space defines all candidate architectures. DARTS relaxes the discrete choice to a continuous soft weighting; ProxylessNAS adds a latency term directly to the loss.

Search-space cardinality:
|S| = O^E (O op choices per edge, E edges in DAG)

DARTS mixed op (softmax over ops):
̅o^(i,j)(x) = ∑_k exp(α_k) ⁄ ∑_m exp(α_m) × o_k(x)

ProxylessNAS expected latency:
E[lat] = ∑_k p_k × lat_k
p_k = softmax(α)_k

DARTS trains architecture weights α jointly with network weights W. By replacing discrete "which op" with a continuous weighted mixture, gradients can flow through the architecture choice. After training, the argmax of α at each edge selects the final discrete op — no expensive combinatorial search needed.

Exercise 5.1: Search Space Size Derive

A NAS search space has 7 operation choices per edge and 14 edges in the DAG. How many candidate architectures are in the search space? Give the exponent (log base 7: it's 7¹⁴ — enter just the base-10 log to 1 decimal place).

log₁₀(|S|)

Show derivation

|S| = 7¹⁴
log₁₀(7¹⁴) = 14 × log₁₀(7) = 14 × 0.845 = 11.83

7¹⁴ ≈ 6.78 × 10¹¹ — roughly 700 billion architectures. This is why brute-force evaluation (train each architecture to convergence and measure accuracy) is completely infeasible. DARTS collapses this to a single training run with a mixed supernetwork.

Exercise 5.2: DARTS Softmax Weight Derive

Three operations have architecture logits α = [2.0, 1.0, 0.0]. Compute the softmax probability p₀ for operation 0 (to 3 decimal places).

probability

Show derivation

e^2.0 = 7.389, e^1.0 = 2.718, e^0.0 = 1.000
sum = 11.107
p₀ = 7.389 ⁄ 11.107 ≈ 0.665

66.5% weight on operation 0. After DARTS training converges, if p₀ dominates like this the final discrete architecture picks op 0 at this edge via argmax. The architecture α parameters are trained by gradient descent on a validation loss, while network weights W are trained on the training loss.

Exercise 5.3: ProxylessNAS Expected Latency Derive

Three operations have latencies [10, 5, 2] ms and probabilities [0.6, 0.3, 0.1]. Compute the expected latency E[lat] in ms.

Show derivation

E[lat] = 0.6 × 10 + 0.3 × 5 + 0.1 × 2
= 6.0 + 1.5 + 0.2 = 7.7 ms

ProxylessNAS adds a differentiable latency loss R(α) = E[lat] to the total loss: L = L_acc + λ × E[lat]. Gradients ∂E[lat]/∂α_k = p_k(1−p_k) × lat_k flow back through the softmax, pushing the search toward faster ops automatically during training.

Exercise 5.4: MnasNet Reward Function Derive

MnasNet uses reward R = ACC × (T_target ⁄ T)^β. For ACC = 0.75, T = 60 ms, T_target = 80 ms, β = −0.07. Compute R to 3 decimal places.

reward

Show derivation

T_target⁄T = 80⁄60 = 1.333
(1.333)^−0.07 = e^{−0.07 × ln(1.333)} = e^{−0.07 × 0.2877} = e^−0.02014 ≈ 0.9800
R = 0.75 × 0.980 ≈ 0.735... ≈ 0.735

Here T = 60 ms is faster than T_target = 80 ms, so the speed bonus >1... wait: T_t/T = 80/60 = 1.33 > 1, and β = −0.07 < 0, so (1.33)^−0.07 = 0.98 < 1 — being faster than target is mildly penalised to prevent trading accuracy for marginal latency gains. Answer ≈ 0.750 × 0.980 = 0.735. (Enter ~0.735.)

Exercise 5.5: NAS vs Hand-Design Training Cost Derive

Brute-force NAS evaluates 1,000 architectures each taking 4 GPU-hours. DARTS trains one supernetwork in 4 GPU-days. How many times cheaper is DARTS? (1 day = 24 hours)

× cheaper

Show derivation

Brute force: 1000 × 4 = 4,000 GPU-hours
DARTS: 4 × 24 = 96 GPU-hours
Ratio = 4,000 ⁄ 96 ≈ 41.7×

DARTS is ~42× cheaper in this scenario. Original NAS (Zoph & Le 2017) used 800 GPUs × 28 days = 22,400 GPU-days; DARTS (Liu et al. 2018) reduced this to 4 GPU-days — a 5,600× improvement. This is what made NAS practical for research labs without massive compute budgets.

Chapter 6: Knowledge Distillation

Knowledge distillation trains a small student network to mimic a large teacher. The student minimises a weighted combination of the hard-label cross-entropy and the soft-label KL divergence from the teacher's output distribution. A temperature T softens the teacher's logits, amplifying the information in low-probability classes.

KD loss:
L = (1 − λ) × L_CE(y, p_s) + λ × T² × KL(p_t^T || p_s^T)

Soft probabilities at temperature T:
p_i^T = exp(z_i ⁄ T) ⁄ ∑_j exp(z_j ⁄ T)

T² rescaling:
As T ↑, soft probs flatten → gradients shrink by 1⁄T²
Multiply KD loss term by T² to restore gradient magnitude

Why T²? When T is large the soft targets p_i^T all approach 1/C (uniform). The KL gradients scale as 1/T². Multiplying the KD loss by T² cancels this attenuation — keeping the effective learning signal constant regardless of temperature choice.

Exercise 6.1: Soft Probability at Temperature T Derive

Teacher logits for 3 classes: z = [4.0, 2.0, 1.0]. With temperature T = 4, compute the soft probability p₀^T for class 0 (to 3 decimal places).

probability

Show derivation

Scaled logits: 4/4=1.0, 2/4=0.5, 1/4=0.25
e^1.0=2.718, e^0.5=1.649, e^0.25=1.284
sum = 5.651
p₀^T = 2.718 / 5.651 ≈ 0.481

At T=1 (hard), p₀ ≈ 0.87 (very confident class 0). At T=4 (soft), p₀ ≈ 0.481 — the distribution is far more spread out. The student now learns that classes 1 and 2 are "not completely wrong", encoding richer structure than a one-hot label ever could. Answer: ~0.481 (accept ~0.443–0.481 depending on rounding).

Exercise 6.2: T² Gradient Rescaling Ratio Derive

Distillation is run at T = 5. By what factor do KD gradients shrink without the T² correction, relative to T = 1?

× shrinkage

Show derivation

gradient scale ∝ 1⁄T² = 1⁄25
shrinkage factor = T² = 5² = 25

At T=5 the KD gradients are 25× smaller than at T=1. Without the T² correction the hard-label CE term completely dominates the loss. Hinton et al. 2015 showed that multiplying the KD term by T² recovers balanced training — this is why the T² factor is always included in the KD loss formula.

Exercise 6.3: KD Loss Weighting Derive

In a KD training step: L_CE = 2.0, KL = 0.4, T = 3, λ = 0.7. Compute total KD loss L.

loss

Show derivation

L = (1 − λ) L_CE + λ T² KL
= 0.3 × 2.0 + 0.7 × 9 × 0.4
= 0.6 + 2.52 = 3.12

With λ=0.7 (strong KD signal) and T²=9, the soft-label term dominates: 2.52 vs 0.60. This is typical early in distillation training when the student needs to match the teacher's overall output structure before fitting hard labels. Total: 3.12 (accept 3.12 ± 0.05).

Exercise 6.4: Parameter Compression Ratio Derive

A ResNet-50 teacher has 25.6 M parameters. A MobileNetV2 student has 3.4 M. What is the parameter compression ratio (teacher / student)?

× smaller

Show derivation

ratio = 25.6 ⁄ 3.4 ≈ 7.53 ≈ 7.5×

A 7.5× parameter reduction. With distillation the MobileNetV2 student can approach ResNet-50 accuracy (within ~1–2% top-1 on ImageNet) while being 7.5× smaller. Without distillation — training MobileNetV2 directly on labels — it typically scores 2–3% lower. The soft teacher targets provide the accuracy boost.

Exercise 6.5: Temperature and Entropy Derive

At T → ∞, what do the soft probabilities p_i^T converge to for a C-class problem? (Express as a fraction in terms of C.)

p_i = 1 / __ (enter C for C=10)

Show derivation

As T → ∞: z_i⁄T → 0 for all i
exp(z_i⁄T) → 1
p_i^T → 1⁄C (uniform distribution)

At T=∞ every class gets equal probability 1/C — maximum entropy, no information. For C=10 that's 0.1 each. This is why extremely high T is bad: the soft targets collapse to uninformative uniform noise. In practice T = 3–7 is optimal for most tasks, balancing information content with gradient stability.

Chapter 7: MCUNet & TinyEngine

Microcontrollers have typically 256 KB–1 MB SRAM. The binding constraint is peak activation memory — the maximum memory required to hold the input and output of a single layer simultaneously. MCUNet co-designs the neural network (TinyNAS) and the inference engine (TinyEngine) to stay inside this budget.

Peak SRAM for one layer (two-buffer model):
SRAM_peak = bytes(input_feature_map) + bytes(output_feature_map)

Patch-based inference (patch size P×P):
SRAM reduction ≈ 1⁄P² (for a single Conv layer)

im2col memory blow-up for a K×K Conv:
im2col buffer ≈ K² × C_in × H_o × W_o values

In-place depth-wise tiling working set:
ws = K × W × C × bytes_per_val + small constants

Patch inference cuts SRAM by 1/P². Instead of processing the full feature map at once, TinyEngine divides the output into P×P patches and processes one patch at a time. Each patch requires only 1/P² of the full activation buffer. The tradeoff: repeated border computations increase MACs by a small constant.

Exercise 7.1: Peak Two-Buffer SRAM Derive

A Conv2D layer: input (1×32×56×56) INT8 (1 byte/value), output (1×64×56×56) INT8. What is the peak two-buffer SRAM in KB? (1 KB = 1024 bytes)

Show derivation

input bytes = 32 × 56 × 56 = 100,352
output bytes = 64 × 56 × 56 = 200,704
total = 301,056 bytes / 1024 ≈ 294 KB

294 KB just for this one layer's I/O buffers. An STM32F7 has 512 KB SRAM. After subtracting stack, heap, and other overhead there may be only 400 KB available. This single layer uses 74% of that — leaving almost nothing for other layers. This is the MCU memory wall in practice.

Exercise 7.2: Patch Inference SRAM Reduction Derive

Patch-based inference uses P=4 (4×4 patches). By what factor does SRAM reduce (approximately)?

× reduction

Show derivation

reduction ≈ P² = 4² = 16×

Dividing the 56×56 output into 4×4 patches means each patch is 14×14. We only need to hold 14×14 (not 56×56) of activations in memory at once — 16× less. The input patch with halo (border needed for K=3 receptive field) is slightly larger than 14×14 but still far smaller than the full map.

Exercise 7.3: im2col Buffer Size Derive

A 3×3 Conv2D: C_in=32, output 14×14. How many INT8 values does the im2col buffer hold?

values

Show derivation

im2col = K² × C_in × H_o × W_o
= 9 × 32 × 14 × 14 = 9 × 32 × 196 = 56,448

im2col "unrolls" each K×K×C_in receptive field into one column of a matrix. The resulting matrix has K²C_in = 288 rows and H_oW_o = 196 columns = 56,448 values. This is K² = 9× larger than the original input. TinyEngine avoids allocating this buffer by computing im2col inline during the matrix multiply.

Exercise 7.4: Flash vs SRAM for Weights Derive

MCUNet model: 1.0 M INT8 parameters (weights) stored in Flash, and peak activation SRAM = 256 KB. Total Flash needed for weights in KB? (1 byte/param)

Show derivation

Flash = 1,000,000 bytes / 1024 = 976.6 KB ≈ 976 KB ≈ 1 MB

~1 MB Flash for weights, 256 KB SRAM for activations. The Arduino Nano 33 BLE Sense has 1 MB Flash and 256 KB SRAM — this model fits exactly! MCUNet was designed for this class of device. Flash is non-volatile (survives power cycles) so weights live there; SRAM is volatile and only holds runtime activations.

Exercise 7.5: Depthwise Tiling Working Set Derive

A depthwise Conv (K=3, W=56, C=64) processes one row at a time. The tiling working set is K rows of the input: K×W×C bytes (INT8). How many KB?

Show derivation

bytes = 3 × 56 × 64 = 10,752 bytes / 1024 ≈ 10.5 KB

10.5 KB is a tiny working set — easily fits in L1 cache or even tight MCU SRAM. By processing row-by-row, TinyEngine never needs to materialise the full 56×56×64 = 200 KB activation map at once. This row-tiling strategy is the core innovation in TinyEngine's depthwise kernel.

Chapter 8: Efficient LLMs

Large language models are bottlenecked by two things: attention's quadratic FLOPs in the sequence length, and KV-cache memory at inference time. GQA, LoRA, and QLoRA directly attack these two bottlenecks.

Attention FLOPs per layer (dense):
FLOPs ≈ 4 × N² × d  (N = seq len, d = head dim × H)

KV-cache bytes (per token generated, FP16):
bytes = 2 × N_layers × N_heads × d_head × 2
(factor 2 for K and V; factor 2 for FP16 bytes)

GQA head reduction:
KV-cache ÷ G    (G = #KV groups, H⁄G heads share each KV)

LoRA parameter count:
extra params = 2 × d × r  (vs d × d full fine-tune)

QLoRA NF4 memory:
frozen model at 4-bit + adapter at BF16
memory ≈ params × 0.5 + adapter_params × 2

KV-cache grows with batch × sequence length. Serving a 70B model with batch=16 and context=4096 tokens: KV-cache = 2 × 80 layers × 8 heads × 128 d_head × 4096 tokens × 16 batch × 2 bytes ≈ 52 GB — more than the model weights. This is why GQA and sliding-window attention are critical for production LLM serving.

Exercise 8.1: Attention FLOPs Derive

One Transformer attention layer: N = 1024 tokens, d = 512 (total hidden dim). Using FLOPs ≈ 4N²d, compute attention FLOPs in GFLOPs.

GFLOPs

Show derivation

FLOPs = 4 × 1024² × 512
= 4 × 1,048,576 × 512
= 2,147,483,648 ≈ 2.15 GFLOPs

2.15 GFLOPs per attention layer. If we double the sequence to N=2048: FLOPs × 4 = 8.59 GFLOPs. This is the O(N²) quadratic scaling — the fundamental motivation for linear attention, sparse attention (Longformer), and state-space models (Mamba).

Exercise 8.2: KV-Cache Memory Derive

A 32-layer model, 32 heads, d_head = 128, generates 512 tokens at batch = 1, stored as FP16 (2 bytes). How many MB is the KV-cache? (1 MB = 10⁶ bytes)

Show derivation

bytes = 2 × L × H × d_h × N × bytes_per_val
= 2 × 32 × 32 × 128 × 512 × 2
= 2 × 32 × 32 × 128 × 1024 = 134,217,728 bytes ≈ 134.2 MB

134 MB for 512 tokens batch=1. Scale to batch=32 and N=4096: ×(32×8) = 256× more = ~34 GB. That's why KV-cache is the bottleneck for LLM serving throughput, not just weight memory.

Exercise 8.3: GQA KV-Cache Reduction Derive

Llama-3-70B uses 64 query heads (H=64) and 8 KV heads (G=8, so each KV head serves H/G=8 query heads). By what factor does GQA reduce KV-cache vs MHA with 64 KV heads?

× reduction

Show derivation

MHA KV heads = H = 64
GQA KV heads = G = 8
reduction = 64 ⁄ 8 = 8×

GQA stores only G=8 KV vectors per position instead of H=64. The 8× reduction in KV-cache directly translates to 8× more throughput at fixed memory, or 8× longer contexts at fixed memory. This is why every modern efficient LLM (Llama 2/3, Mistral, Gemma) uses GQA.

Exercise 8.4: LoRA Parameter Count Derive

Full fine-tuning of a 4096×4096 weight matrix requires 4096² = 16.77 M parameters. LoRA with r=16 injects two low-rank matrices A (4096×16) and B (16×4096). How many trainable LoRA parameters?

params

Show derivation

A: 4096 × 16 = 65,536
B: 16 × 4096 = 65,536
total = 131,072

131,072 LoRA params vs 16,777,216 full-finetune params — a 128× reduction (2r/d = 2×16/4096 = 1/128 of full). For a 7B model applied to all Linear layers this ratio makes the difference between needing 112 GB vs <1 GB of gradient+optimizer memory during fine-tuning.

Exercise 8.5: QLoRA Memory Budget Derive

QLoRA fine-tunes a 7B-parameter model. Frozen model weights: 7B at NF4 (0.5 bytes each). LoRA adapter: 20 M params at BF16 (2 bytes each). Compute total memory in GB.

Show derivation

frozen = 7 × 10⁹ × 0.5 = 3.5 × 10⁹ bytes = 3.5 GB
adapter = 20 × 10⁶ × 2 = 40 × 10⁶ bytes = 0.04 GB
total ≈ 3.54 GB

3.54 GB for a 7B model fine-tune! Full BF16 fine-tuning of 7B needs 14 GB just for weights (plus 2× for optimizer states = 42 GB). QLoRA fits on a single 8 GB consumer GPU. This democratisation of LLM fine-tuning is why QLoRA (Dettmers et al. 2023) became one of the most impactful efficiency papers of that year.

Chapter 9: Capstone — End-to-End System Design

These exercises combine multiple constraints simultaneously: fit a model on a microcontroller with 256 KB SRAM and 1 MB Flash, or serve a quantized LLM on a limited GPU. You must reason about size, compute, and memory at the same time.

MCU design checklist. (1) Count Flash: params × bytes/param ≤ Flash budget. (2) Count peak SRAM: max layer I/O buffer ≤ SRAM budget. (3) Count MACs: total MACs within latency target given MCU throughput. If any constraint fails, apply the right technique: quantize to save Flash/SRAM, prune to save MACs, patch-inference to save SRAM.

Exercise 9.1: MCU Flash Feasibility Derive

An MCU has 1 MB Flash. A model has 800,000 INT8 parameters (1 byte each) plus 50 KB of firmware overhead. Does it fit? Enter the total Flash usage in KB.

KB used

Show derivation

weights = 800,000 bytes / 1024 = 781.25 KB
total = 781.25 + 50 = 831.25 KB ≈ 831 KB
1 MB = 1024 KB → fits! (831 < 1024)

831 KB < 1024 KB — the model fits in Flash with 193 KB to spare. If it didn't fit: (a) switch to INT4 (halve Flash), (b) prune 20% of weights, or (c) remove the last few layers and accept slight accuracy loss. Knowing which knob to turn requires understanding all constraints simultaneously.

Exercise 9.2: SRAM Budget with Patch Inference Derive

The bottleneck layer needs 320 KB peak SRAM (two-buffer). The MCU has 256 KB. With patch size P=2, what is the reduced SRAM requirement in KB?

Show derivation

reduced = 320 ⁄ P² = 320 ⁄ 4 = 80 KB

80 KB < 256 KB — the bottleneck layer now fits comfortably. P=2 patches divide the output into 2×2 = 4 tiles; only one tile's activations live in SRAM at a time. The ~3% MAC overhead from border recomputation is well worth the 4× SRAM savings.

Exercise 9.3: End-to-End Latency on an MCU Derive

A model runs 30 MMACs per inference. The MCU sustains 60 MMACops/s. What is the minimum inference time in ms?

Show derivation

time = MACs ⁄ throughput = 30 × 10⁶ ⁄ 60 × 10⁶ = 0.5 s = 500 ms

500 ms per inference. For a wake-word detector this is too slow (needs <100 ms). Options: (a) prune 5× to 6 MMACs → 100 ms, (b) use DSP SIMD to get 4× throughput → 125 ms, (c) redesign with a smaller backbone (shufflenet / squeezenet). In practice MCUNet targets ~200 MMACs on faster chips (100+ MMACops/s) for sub-200ms inference.

Exercise 9.4: GPU Memory for Serving a Quantized LLM Derive

Serving a 13B INT4 model (0.5 bytes/param) + KV-cache for batch=4, seq=2048 tokens, 40 layers, 40 heads, d_head=128, FP16. KV-cache bytes = 2×40×40×128×2048×4×2. Total memory in GB?

Show derivation

weights = 13 × 10⁹ × 0.5 = 6.5 GB
KV = 2×40×40×128×2048×4×2 = 6,710,886,400 bytes
≈ 6.71 GB ≈ 6.4 GB (using 10⁹ denominator)
total = 6.5 + 6.71 = 13.21 GB (accept 13.2–13.5 GB)

~13.4 GB total — fits on a single RTX 3090 (24 GB) with room for activations and overhead. Without INT4 quantization the weights alone would be 26 GB (FP16) — requiring two GPUs. This is the real-world value of combined INT4 quantization + GQA (which would halve the KV-cache further).

Exercise 9.5: Combined Savings Stack Derive

A model starts at 50 MMACs and 6 MB FP32. Apply: depthwise separable (8× fewer MACs), INT8 quantization (4× smaller), 50% magnitude pruning (2× fewer effective ops). Final effective MMACs and MB?

MMACs (final)

Show derivation

MACs: 50 ⁄ 8 (DW) ⁄ 2 (pruning) = 3.125 MMACs
Size: 6 MB ⁄ 4 (INT8) = 1.5 MB

From 50 MMACs / 6 MB to 3.125 MMACs / 1.5 MB — a 16× compute reduction and 4× size reduction by combining orthogonal techniques. This multiplicative stacking is the key insight: each technique targets a different source of inefficiency (architecture, precision, connectivity), so their benefits compound rather than overlap.