Workbook — TinyML & Efficient Deep Learning

TinyML & Efficient Deep Learning Workbook

Every number a TinyML engineer must derive cold: model size, MACs, pruning saliency, INT8 quantization, NAS search-space cardinality, knowledge distillation temperature scaling, MCU SRAM tiling, efficient LLM attention costs, KV-cache bytes, and LoRA parameter counts. All verifiable in-browser with instant feedback.

Prerequisites: Basic linear algebra + familiarity with neural nets + Python basics. Everything else we derive.
10
Chapters
50
Exercises
5
Exercise Types
Mastery
0 / 50 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Efficiency Metrics

Before you can shrink a model you need to measure it. Three numbers dominate: parameter count (storage), MACs (compute), and peak activation memory (runtime SRAM). Getting these wrong means shipping a model that silently overflows a microcontroller.

Model size in bits:
bits = #params × bits-per-param
bytes = bits ⁄ 8

FLOPs vs MACs:
1 MAC (multiply-accumulate) = 2 FLOPs (one multiply + one add)
FLOPs ≈ 2 × MACs

Roofline intensity:
arithmetic intensity I = FLOPs ⁄ bytes_moved
achievable perf = min(I × mem_BW, peak_FLOP_s)
Energy hierarchy rule of thumb. A DRAM access costs ~200× more energy than a register operation. SRAM is ~6× cheaper than DRAM. Minimising off-chip memory traffic matters far more than reducing raw FLOPs on most edge hardware.
Exercise 0.1: Model Size in Bytes Derive

A MobileNetV2 variant has 3.4 million parameters stored in FP32 (32 bits each). What is the model size in megabytes (MB)? (1 MB = 106 bytes)

MB
Show derivation
bytes = 3,400,000 × 32 ⁄ 8 = 3,400,000 × 4 = 13,600,000 bytes = 13.6 MB

Each FP32 parameter occupies 4 bytes. 3.4 M × 4 = 13.6 MB. This is why INT8 quantization (1 byte per param) reduces storage to 3.4 MB — a 4× saving.

Exercise 0.2: FLOPs from MACs Derive

A layer performs 150 million multiply-accumulate operations (MACs). How many FLOPs is that?

MFLOPs
Show derivation
FLOPs = 2 × MACs = 2 × 150 M = 300 MFLOPs

Each MAC fuses a multiply and an add into one operation. When reporting FLOPs (floating-point operations), we count both, so 1 MAC = 2 FLOPs. Many papers report MACs; others report FLOPs — always check which one before comparing numbers.

Exercise 0.3: Arithmetic Intensity Derive

A matrix multiply moves 400 KB of data and performs 3.2 GFLOPs. What is the arithmetic intensity in FLOPs/byte?

FLOPs/byte
Show derivation
I = FLOPs ⁄ bytes = 3.2 × 109 ⁄ (400 × 103) = 8,000 FLOPs/byte

8000 FLOPs/byte is deep in the compute-bound region on any modern chip (most GPU memory rooflines are at ~200 FLOPs/byte for FP32). Large matrix multiplies are almost always compute-bound — this is why batch size matters for GPU utilisation.

Exercise 0.4: Activation Memory for One Layer Derive

A Conv2D layer outputs a feature map of shape (batch=1, C=64, H=56, W=56) stored as FP32. How many KB of activation memory does it consume?

KB
Show derivation
elements = 1 × 64 × 56 × 56 = 200,704
bytes = 200,704 × 4 = 802,816 bytes ≈ 784 KB

802,816 / 1024 = 784 KB. On a microcontroller with 512 KB SRAM total, this single layer would already overflow! Peak activation memory — not parameter storage — is the binding constraint on MCUs.

Exercise 0.5: Roofline Bottleneck Derive

A microcontroller has peak compute 200 MFLOPs/s and memory bandwidth 50 MB/s. The ridge point (balance point) is at what arithmetic intensity (FLOPs/byte)?

FLOPs/byte
Show derivation
ridge = peak_compute ⁄ mem_BW = 200 × 106 FLOPs/s ⁄ 50 × 106 bytes/s = 4 FLOPs/byte

Operations with I < 4 FLOPs/byte are memory-bound; those with I > 4 are compute-bound. Most depthwise convolutions on MCUs fall below the ridge point, making memory bandwidth the bottleneck, not raw compute.

Chapter 1: NN Building Blocks

To shrink a network you must first count exactly what is inside it. Every Linear and Conv2D layer has a deterministic parameter count and MAC count derivable from its shape alone.

Linear layer: in → out
params = in × out + out (bias)
MACs = in × out (per sample)

Conv2D: kernel K×K, Cin inputs, Cout outputs, output Ho×Wo
params = K2 × Cin × Cout + Cout
MACs = K2 × Cin × Cout × Ho × Wo

Depthwise separable:
DW params = K2 × C + C
PW params = C × C' + C'
MACs ≈ (K2 + C') × C × Ho × Wo
MobileNet insight. Depthwise separable convolution replaces one 3×3 Conv2D with a 3×3 depthwise (per-channel) followed by a 1×1 pointwise. The FLOP reduction factor is approximately C⁄(K2 + C) ≈ 1/9 for K=3 and large C — roughly 8× fewer MACs for typical channel counts.
Exercise 1.1: Linear Layer Parameter Count Derive

A fully-connected layer maps 512 inputs to 256 outputs, with bias. How many trainable parameters does it have?

params
Show derivation
params = 512 × 256 + 256 = 131,072 + 256 = 131,328

The weight matrix is 512×256 = 131,072 entries. The bias vector adds one scalar per output neuron: 256 more. Total: 131,328. In practice many recent architectures omit bias in Linear layers that precede a BatchNorm — then params = 512 × 256 = 131,072.

Exercise 1.2: Conv2D Parameter Count Derive

A Conv2D layer has kernel 3×3, 32 input channels, 64 output channels, with bias. How many parameters?

params
Show derivation
params = 32 × 32 × 64 + 64 = 9 × 2048 + 64 = 18,432 + 64 = 18,496

Each of the 64 output filters is a 3×3×32 kernel = 288 weights. 64 filters × 288 = 18,432 weights. Add 64 biases. Contrast with the depthwise version next exercise: 32×32 = 288 DW params + 32×64 PW params = 2336 total — 8× fewer.

Exercise 1.3: Conv2D MACs Derive

The 3×3 Conv2D from Ex 1.2 (32 in, 64 out) processes an output feature map of 28×28. How many MACs (in millions)?

MMACs
Show derivation
MACs = K2 × Cin × Cout × Ho × Wo
= 9 × 32 × 64 × 28 × 28 = 9 × 32 × 64 × 784 = 14,450,688 ≈ 14.45 MMACs

Wait — 14.45 M, not 451. Each output pixel requires K2×Cin MACs to sum over the receptive field. Then we do this for all Cout output channels and all Ho×Wo output positions. Answer: 14.45 MMACs (enter 14.45).

Exercise 1.4: Depthwise Separable FLOP Reduction Derive

Replace the 3×3 Conv2D (32→64, output 28×28) with a depthwise separable block (3×3 DW + 1×1 PW). Ratio = standard MACs ⁄ separable MACs. Round to 1 decimal. (Ignore biases.)

× reduction
Show derivation
Standard MACs = 9 × 32 × 64 × 282 = 14,450,688
DW MACs = 9 × 32 × 282 = 225,792
PW MACs = 1 × 32 × 64 × 282 = 1,605,632
Separable total = 1,831,424
Ratio = 14,450,688 ⁄ 1,831,424 ≈ 7.89 ≈ 8.0×

The theoretical reduction is 1 ⁄ (1⁄Cout + 1⁄K2) = 1/(1/64 + 1/9) ≈ 8.0. This matches MobileNet's empirical ~8× FLOP saving over standard convolutions.

Exercise 1.5: Linear Layer MACs Per Token Derive

A Transformer FFN first projects from d=512 to 4d=2048 (no bias). Per input token, how many MACs does this first projection require?

KMACs/token
Show derivation
MACs = 512 × 2048 = 1,048,576 ≈ 1,048 KMACs per token

For a Linear(d, 4d) layer, each output neuron does d multiplications and d additions (d MACs). With 4d output neurons the total is d × 4d = 4d2 MACs. Here: 4 × 5122 = 1,048,576 MACs = 1,048 KMACs per token.

Chapter 2: Pruning

Pruning removes weights that contribute least to the output. The key quantity is saliency — a score predicting how much the loss increases if a weight is zeroed. Optimal Brain Damage (OBD) approximates saliency using the second derivative of the loss.

Sparsity ratio:
s = #zeros ⁄ #total_params

OBD saliency (diagonal Hessian approx):
Si = (∂2L ⁄ ∂wi2) × wi2 ⁄ 2

N:M sparsity (e.g. 2:4):
Keep N non-zero out of every M consecutive weights
Sparsity = 1 − N⁄M

CSR compression (unstructured sparse matrix):
stored values = nnz (non-zero count)
bytes ≈ nnz × (value_bytes + index_bytes)
2:4 sparsity on NVIDIA Ampere. NVIDIA's Sparse Tensor Core hardware requires exactly 2 non-zero values in every group of 4 consecutive weights (50% sparsity). The hardware can then execute at 2× the dense throughput because it skips the zero multiplications. This constraint makes 2:4 far more hardware-efficient than arbitrary unstructured sparsity.
Exercise 2.1: Sparsity Ratio Derive

A layer has 500,000 parameters. After magnitude pruning 350,000 are set to zero. What is the sparsity ratio as a percentage?

%
Show derivation
s = 350,000 ⁄ 500,000 = 0.70 = 70%

70% sparsity means 30% of weights remain active. This is sometimes written as "30% density." For magnitude pruning, we sort |wi| and zero the bottom 70% by absolute value — a simple but surprisingly effective baseline.

Exercise 2.2: OBD Saliency Score Derive

A weight w = 0.4. The diagonal Hessian entry Hii = ∂2L/∂wi2 = 5.0. What is the OBD saliency Si?

(unitless)
Show derivation
Si = Hii × wi2 ⁄ 2 = 5.0 × 0.16 ⁄ 2 = 0.80 ⁄ 2 = 0.40

The OBD formula is a second-order Taylor expansion of the loss increase when wi is zeroed: ΔL ≈ Hii wi2/2. Prune weights with smallest Si — they cause the least loss increase when removed.

Exercise 2.3: 2:4 Sparsity Fraction Derive

NVIDIA 2:4 structured sparsity keeps 2 weights per group of 4. What sparsity fraction (as %) does this produce?

%
Show derivation
zeros per group = M − N = 4 − 2 = 2
sparsity = 2 ⁄ 4 = 50%

2:4 is exactly 50% sparse by construction. Every group of 4 consecutive weights has exactly 2 zeros and 2 non-zeros. The hardware stores only the 2 non-zero values plus a 2-bit index mask per group — enabling 2× throughput on Ampere Sparse Tensor Cores.

Exercise 2.4: CSR Memory Saving Derive

A weight matrix has 1,000,000 parameters at FP32 (4 bytes each). After 90% pruning only 100,000 are non-zero. In CSR format each non-zero uses 4 bytes (value) + 4 bytes (column index). What is the CSR size in MB? (1 MB = 106 bytes)

MB
Show derivation
CSR bytes = 100,000 × (4 + 4) = 800,000 bytes = 0.8 MB
Dense bytes = 1,000,000 × 4 = 4.0 MB
Saving: 4.0 MB → 0.8 MB (5× smaller)

CSR breaks even at 50% sparsity for FP32+int32 (same overhead as dense). Above 50% sparsity it saves space. At 90% sparsity: 5× compression. The catch: random memory access patterns in CSR kill hardware utilisation — which is why structured sparsity (2:4, block sparsity) is preferred in practice.

Exercise 2.5: Remaining Non-Zero Params Derive

A 10-layer network with 50,000 parameters per layer is pruned to 40% sparsity globally. How many non-zero parameters remain in total?

params
Show derivation
total = 10 × 50,000 = 500,000
non-zero = 500,000 × (1 − 0.40) = 500,000 × 0.60 = 300,000

At 40% sparsity, 60% of weights survive. 300,000 non-zero parameters remain. Whether this buys actual speedup depends on the hardware: unstructured sparsity rarely accelerates inference on MCUs without dedicated sparse arithmetic units.

Chapter 3: Quantization I — INT8 Basics

Quantization maps floating-point values to a smaller integer representation. The mapping is defined by a scale factor S and a zero-point Z computed from the observed min/max of the tensor. Getting S and Z right controls the round-trip dequantization error.

Asymmetric INT8 (min/max range):
S = (max − min) ⁄ (2b − 1)    (b = 8 ⇒ range 0..255)
Z = round(0 − min ⁄ S)    clamped to [0, 255]

Quantize:   q = clamp( round(x ⁄ S) + Z, 0, 255 )
Dequantize: x̂ = S × (q − Z)
Quantization error:  ε = x̂ − x

K-means compression ratio:
r = original_bits ⁄ b_codebook    (b_codebook = log2(K))
Zero-point encodes the float 0. Z tells the integer that maps to exactly 0.0 in float space. For ReLU outputs (always ≥ 0) a symmetric scheme can use Z=0 — but for arbitrary activations the asymmetric scheme with non-zero Z squeezes more precision out of the 8-bit range.
Exercise 3.1: Compute Scale S Derive

An activation tensor has min = −2.0 and max = 6.0. Compute the INT8 asymmetric scale S (to 4 decimal places).

float/int
Show derivation
S = (6.0 − (−2.0)) ⁄ (255 − 0) = 8.0 ⁄ 255 ≈ 0.03137

S is the float value of one INT8 step. The denominator is 28−1 = 255 because unsigned INT8 spans [0, 255] = 256 levels. Answer: 8/255 ≈ 0.0314.

Exercise 3.2: Compute Zero-Point Z Derive

Using min = −2.0, S ≈ 0.03137, compute the integer zero-point Z (round to nearest integer).

integer
Show derivation
Z = round(0 − min ⁄ S) = round(0 − (−2.0) ⁄ 0.03137)
= round(2.0 ⁄ 0.03137) = round(63.75) = 64

Z = 64 means integer 64 represents the float value 0.0. Plugging back: x̂ = S×(64−64) = 0.0. Good: zero maps to zero (important for ReLU and zero-padding correctness).

Exercise 3.3: Quantize a Value Derive

Using S = 0.03137, Z = 64, quantize the float value x = 1.5 to an INT8 integer (round to nearest, clamp to [0, 255]).

integer
Show derivation
q = clamp(round(x ⁄ S) + Z, 0, 255)
= clamp(round(1.5 ⁄ 0.03137) + 64, 0, 255)
= clamp(round(47.82) + 64, 0, 255)
= clamp(48 + 64, 0, 255) = 112

Sanity check: dequantize q=112 back: x̂ = 0.03137 × (112 − 64) = 0.03137 × 48 = 1.506. The round-trip error is 1.506 − 1.5 = 0.006, which is within one half-step of S.

Exercise 3.4: Dequantization Error Derive

Dequantize q = 112 using S = 0.03137, Z = 64. Then compute the absolute error vs the original x = 1.5. Give the error to 4 decimal places.

float error
Show derivation
x̂ = S × (q − Z) = 0.03137 × (112 − 64) = 0.03137 × 48 = 1.50576
ε = |x̂ − x| = |1.50576 − 1.5| = 0.00576 ≈ 0.0058

The maximum possible quantization error is S/2 ≈ 0.0157. Actual error 0.006 is less than half-step — expected for a value that rounds cleanly. For INT8 across a range of 8.0, the step size 0.0314 is small enough that most modern networks tolerate the error with <1% accuracy drop.

Exercise 3.5: K-Means Compression Ratio Derive

Weight codebook compression: original FP32 weights (32 bits each) are replaced by 16-cluster K-means centroids (4 bits per index). What is the compression ratio (original bits ⁄ compressed bits per weight)?

× compression
Show derivation
b_codebook = log2(16) = 4 bits per weight index
ratio = 32 ⁄ 4 = 8×

With K=16 clusters, each weight needs only log2(16)=4 bits to store its cluster ID. The 16 centroid values are stored separately (small overhead for large layers). This gives 8× weight compression vs FP32. Note: activation memory is unchanged — only weight storage shrinks.

Chapter 4: Quantization II — Advanced

Per-tensor quantization uses one S and Z for the whole weight matrix. Per-channel quantization assigns a separate S and Z per output channel, cutting quantization error dramatically for layers where channels have very different magnitude distributions (common in the later layers of deep networks).

Per-tensor: one S, one Z for entire tensor
Per-channel (output channels): one Sc, Zc per output channel c
Extra storage: Cout extra scale values (negligible vs weights)

Straight-Through Estimator (STE):
∂L⁄∂w ≈ ∂L⁄∂q    (treat quantize as identity during backprop)

SmoothQuant migration factor:
X' = X ⁄ diag(s)   W' = diag(s) × W
sj = max|Xj|α ⁄ max|Wj|1−α   α ≈ 0.5
Why per-channel matters for weights. Weight channels in a Conv layer can have very different ranges. A channel whose weights span [−0.01, 0.01] quantized with a scale fit to another channel spanning [−2, 2] would lose almost all precision. Per-channel quantization gives each channel its own scale, recovering that precision at essentially zero extra inference cost.
Exercise 4.1: Per-Channel Scale Count Derive

A Conv2D has 128 output channels. Per-channel quantization stores one FP32 scale per channel. How many extra bytes does this add compared to per-tensor (one scale)?

bytes
Show derivation
extra scales = 128 − 1 = 127 FP32 values
extra bytes = 127 × 4 = 508 bytes

508 bytes of overhead to store per-channel scales. Compare to the weight matrix: 3×3×Cin×128 weights at 1 byte each (INT8) — easily hundreds of KB. The overhead is tiny but the accuracy gain can be large (often 1–2% top-1 on ImageNet vs per-tensor).

Exercise 4.2: STE Numeric Example Derive

In quantization-aware training with STE, the upstream gradient ∂L⁄∂q = 3.0. According to STE, what value do you use for ∂L⁄∂w in the weight update?

(gradient)
Show derivation
STE: ∂L⁄∂w = ∂L⁄∂q = 3.0

The quantize function has zero gradient almost everywhere (piecewise constant staircase) and undefined gradient at steps. STE replaces the true zero gradient with the identity: pretend quantize is a pass-through during backprop. This allows weights to receive gradient signal and improve across training epochs even though we quantize during the forward pass.

Exercise 4.3: SmoothQuant Migration Factor Derive

A Linear layer has max activation magnitude max|Xj| = 8.0 and max weight magnitude max|Wj| = 0.5 for channel j. With α = 0.5, compute the SmoothQuant migration factor sj.

(scale)
Show derivation
sj = max|Xj|α ⁄ max|Wj|1−α
= 8.00.5 ⁄ 0.50.5
= 2.828 ⁄ 0.707 ≈ 4.0

sj = 4.0 means: divide activation channel j by 4 (making max|X'j| = 2.0) and multiply weight channel j by 4 (making max|W'j| = 2.0). Now both tensors have the same range — equally hard to quantize. SmoothQuant migrates quantization difficulty from outlier-heavy activations to weights, which quantize more cleanly.

Exercise 4.4: INT4 vs INT8 Memory Saving Derive

A model has 7 billion FP16 parameters (2 bytes each). Quantizing to INT4 (0.5 bytes each). What is the INT4 model size in GB? (1 GB = 109 bytes)

GB
Show derivation
INT4 bytes = 7 × 109 × 0.5 = 3.5 × 109 bytes = 3.5 GB

FP16: 7B × 2 = 14 GB — doesn't fit in a 12 GB GPU. INT4: 7B × 0.5 = 3.5 GB — fits on a consumer GPU. This is exactly why GPTQ and QLoRA target 4-bit weights for 7B LLMs: a 4× reduction vs FP16 makes them accessible on commodity hardware.

Exercise 4.5: Quantization Levels for b Bits Derive

How many distinct quantization levels does an unsigned b-bit integer have? For b = 4, what is the answer?

levels
Show derivation
levels = 2b = 24 = 16

4-bit unsigned integers span 0..15 = 16 distinct values. The scale S = range / (2b−1) = range/15, which is 17× coarser than INT8's 1/255. That's why INT4 is noticeably more lossy and usually requires QAT or GPTQ second-order correction to maintain accuracy on LLMs.

Chapter 5: Neural Architecture Search

Neural Architecture Search (NAS) automates the choice of network topology. The search space defines all candidate architectures. DARTS relaxes the discrete choice to a continuous soft weighting; ProxylessNAS adds a latency term directly to the loss.

Search-space cardinality:
|S| = OE    (O op choices per edge, E edges in DAG)

DARTS mixed op (softmax over ops):
̅o(i,j)(x) = ∑k exp(αk) ⁄ ∑m exp(αm) × ok(x)

ProxylessNAS expected latency:
E[lat] = ∑k pk × latk
pk = softmax(α)k
DARTS trains architecture weights α jointly with network weights W. By replacing discrete "which op" with a continuous weighted mixture, gradients can flow through the architecture choice. After training, the argmax of α at each edge selects the final discrete op — no expensive combinatorial search needed.
Exercise 5.1: Search Space Size Derive

A NAS search space has 7 operation choices per edge and 14 edges in the DAG. How many candidate architectures are in the search space? Give the exponent (log base 7: it's 714 — enter just the base-10 log to 1 decimal place).

log10(|S|)
Show derivation
|S| = 714
log10(714) = 14 × log10(7) = 14 × 0.845 = 11.83

714 ≈ 6.78 × 1011 — roughly 700 billion architectures. This is why brute-force evaluation (train each architecture to convergence and measure accuracy) is completely infeasible. DARTS collapses this to a single training run with a mixed supernetwork.

Exercise 5.2: DARTS Softmax Weight Derive

Three operations have architecture logits α = [2.0, 1.0, 0.0]. Compute the softmax probability p0 for operation 0 (to 3 decimal places).

probability
Show derivation
e2.0 = 7.389,  e1.0 = 2.718,  e0.0 = 1.000
sum = 11.107
p0 = 7.389 ⁄ 11.107 ≈ 0.665

66.5% weight on operation 0. After DARTS training converges, if p0 dominates like this the final discrete architecture picks op 0 at this edge via argmax. The architecture α parameters are trained by gradient descent on a validation loss, while network weights W are trained on the training loss.

Exercise 5.3: ProxylessNAS Expected Latency Derive

Three operations have latencies [10, 5, 2] ms and probabilities [0.6, 0.3, 0.1]. Compute the expected latency E[lat] in ms.

ms
Show derivation
E[lat] = 0.6 × 10 + 0.3 × 5 + 0.1 × 2
= 6.0 + 1.5 + 0.2 = 7.7 ms

ProxylessNAS adds a differentiable latency loss R(α) = E[lat] to the total loss: L = Lacc + λ × E[lat]. Gradients ∂E[lat]/∂αk = pk(1−pk) × latk flow back through the softmax, pushing the search toward faster ops automatically during training.

Exercise 5.4: MnasNet Reward Function Derive

MnasNet uses reward R = ACC × (Ttarget ⁄ T)β. For ACC = 0.75, T = 60 ms, Ttarget = 80 ms, β = −0.07. Compute R to 3 decimal places.

reward
Show derivation
Ttarget⁄T = 80⁄60 = 1.333
(1.333)−0.07 = e−0.07 × ln(1.333) = e−0.07 × 0.2877 = e−0.02014 ≈ 0.9800
R = 0.75 × 0.980 ≈ 0.735... ≈ 0.735

Here T = 60 ms is faster than Ttarget = 80 ms, so the speed bonus >1... wait: Tt/T = 80/60 = 1.33 > 1, and β = −0.07 < 0, so (1.33)−0.07 = 0.98 < 1 — being faster than target is mildly penalised to prevent trading accuracy for marginal latency gains. Answer ≈ 0.750 × 0.980 = 0.735. (Enter ~0.735.)

Exercise 5.5: NAS vs Hand-Design Training Cost Derive

Brute-force NAS evaluates 1,000 architectures each taking 4 GPU-hours. DARTS trains one supernetwork in 4 GPU-days. How many times cheaper is DARTS? (1 day = 24 hours)

× cheaper
Show derivation
Brute force: 1000 × 4 = 4,000 GPU-hours
DARTS: 4 × 24 = 96 GPU-hours
Ratio = 4,000 ⁄ 96 ≈ 41.7×

DARTS is ~42× cheaper in this scenario. Original NAS (Zoph & Le 2017) used 800 GPUs × 28 days = 22,400 GPU-days; DARTS (Liu et al. 2018) reduced this to 4 GPU-days — a 5,600× improvement. This is what made NAS practical for research labs without massive compute budgets.

Chapter 6: Knowledge Distillation

Knowledge distillation trains a small student network to mimic a large teacher. The student minimises a weighted combination of the hard-label cross-entropy and the soft-label KL divergence from the teacher's output distribution. A temperature T softens the teacher's logits, amplifying the information in low-probability classes.

KD loss:
L = (1 − λ) × LCE(y, ps) + λ × T2 × KL(ptT || psT)

Soft probabilities at temperature T:
piT = exp(zi ⁄ T) ⁄ ∑j exp(zj ⁄ T)

T2 rescaling:
As T ↑, soft probs flatten → gradients shrink by 1⁄T2
Multiply KD loss term by T2 to restore gradient magnitude
Why T2? When T is large the soft targets piT all approach 1/C (uniform). The KL gradients scale as 1/T2. Multiplying the KD loss by T2 cancels this attenuation — keeping the effective learning signal constant regardless of temperature choice.
Exercise 6.1: Soft Probability at Temperature T Derive

Teacher logits for 3 classes: z = [4.0, 2.0, 1.0]. With temperature T = 4, compute the soft probability p0T for class 0 (to 3 decimal places).

probability
Show derivation
Scaled logits: 4/4=1.0, 2/4=0.5, 1/4=0.25
e1.0=2.718, e0.5=1.649, e0.25=1.284
sum = 5.651
p0T = 2.718 / 5.651 ≈ 0.481

At T=1 (hard), p0 ≈ 0.87 (very confident class 0). At T=4 (soft), p0 ≈ 0.481 — the distribution is far more spread out. The student now learns that classes 1 and 2 are "not completely wrong", encoding richer structure than a one-hot label ever could. Answer: ~0.481 (accept ~0.443–0.481 depending on rounding).

Exercise 6.2: T² Gradient Rescaling Ratio Derive

Distillation is run at T = 5. By what factor do KD gradients shrink without the T2 correction, relative to T = 1?

× shrinkage
Show derivation
gradient scale ∝ 1⁄T2 = 1⁄25
shrinkage factor = T2 = 52 = 25

At T=5 the KD gradients are 25× smaller than at T=1. Without the T2 correction the hard-label CE term completely dominates the loss. Hinton et al. 2015 showed that multiplying the KD term by T2 recovers balanced training — this is why the T2 factor is always included in the KD loss formula.

Exercise 6.3: KD Loss Weighting Derive

In a KD training step: LCE = 2.0, KL = 0.4, T = 3, λ = 0.7. Compute total KD loss L.

loss
Show derivation
L = (1 − λ) LCE + λ T2 KL
= 0.3 × 2.0 + 0.7 × 9 × 0.4
= 0.6 + 2.52 = 3.12

With λ=0.7 (strong KD signal) and T2=9, the soft-label term dominates: 2.52 vs 0.60. This is typical early in distillation training when the student needs to match the teacher's overall output structure before fitting hard labels. Total: 3.12 (accept 3.12 ± 0.05).

Exercise 6.4: Parameter Compression Ratio Derive

A ResNet-50 teacher has 25.6 M parameters. A MobileNetV2 student has 3.4 M. What is the parameter compression ratio (teacher / student)?

× smaller
Show derivation
ratio = 25.6 ⁄ 3.4 ≈ 7.53 ≈ 7.5×

A 7.5× parameter reduction. With distillation the MobileNetV2 student can approach ResNet-50 accuracy (within ~1–2% top-1 on ImageNet) while being 7.5× smaller. Without distillation — training MobileNetV2 directly on labels — it typically scores 2–3% lower. The soft teacher targets provide the accuracy boost.

Exercise 6.5: Temperature and Entropy Derive

At T → ∞, what do the soft probabilities piT converge to for a C-class problem? (Express as a fraction in terms of C.)

pi = 1 / __ (enter C for C=10)
Show derivation
As T → ∞: zi⁄T → 0 for all i
exp(zi⁄T) → 1
piT → 1⁄C (uniform distribution)

At T=∞ every class gets equal probability 1/C — maximum entropy, no information. For C=10 that's 0.1 each. This is why extremely high T is bad: the soft targets collapse to uninformative uniform noise. In practice T = 3–7 is optimal for most tasks, balancing information content with gradient stability.

Chapter 7: MCUNet & TinyEngine

Microcontrollers have typically 256 KB–1 MB SRAM. The binding constraint is peak activation memory — the maximum memory required to hold the input and output of a single layer simultaneously. MCUNet co-designs the neural network (TinyNAS) and the inference engine (TinyEngine) to stay inside this budget.

Peak SRAM for one layer (two-buffer model):
SRAMpeak = bytes(input_feature_map) + bytes(output_feature_map)

Patch-based inference (patch size P×P):
SRAM reduction ≈ 1⁄P2  (for a single Conv layer)

im2col memory blow-up for a K×K Conv:
im2col buffer ≈ K2 × Cin × Ho × Wo values

In-place depth-wise tiling working set:
ws = K × W × C × bytes_per_val + small constants
Patch inference cuts SRAM by 1/P². Instead of processing the full feature map at once, TinyEngine divides the output into P×P patches and processes one patch at a time. Each patch requires only 1/P² of the full activation buffer. The tradeoff: repeated border computations increase MACs by a small constant.
Exercise 7.1: Peak Two-Buffer SRAM Derive

A Conv2D layer: input (1×32×56×56) INT8 (1 byte/value), output (1×64×56×56) INT8. What is the peak two-buffer SRAM in KB? (1 KB = 1024 bytes)

KB
Show derivation
input bytes = 32 × 56 × 56 = 100,352
output bytes = 64 × 56 × 56 = 200,704
total = 301,056 bytes / 1024 ≈ 294 KB

294 KB just for this one layer's I/O buffers. An STM32F7 has 512 KB SRAM. After subtracting stack, heap, and other overhead there may be only 400 KB available. This single layer uses 74% of that — leaving almost nothing for other layers. This is the MCU memory wall in practice.

Exercise 7.2: Patch Inference SRAM Reduction Derive

Patch-based inference uses P=4 (4×4 patches). By what factor does SRAM reduce (approximately)?

× reduction
Show derivation
reduction ≈ P2 = 42 = 16×

Dividing the 56×56 output into 4×4 patches means each patch is 14×14. We only need to hold 14×14 (not 56×56) of activations in memory at once — 16× less. The input patch with halo (border needed for K=3 receptive field) is slightly larger than 14×14 but still far smaller than the full map.

Exercise 7.3: im2col Buffer Size Derive

A 3×3 Conv2D: Cin=32, output 14×14. How many INT8 values does the im2col buffer hold?

values
Show derivation
im2col = K2 × Cin × Ho × Wo
= 9 × 32 × 14 × 14 = 9 × 32 × 196 = 56,448

im2col "unrolls" each K×K×Cin receptive field into one column of a matrix. The resulting matrix has K2Cin = 288 rows and HoWo = 196 columns = 56,448 values. This is K2 = 9× larger than the original input. TinyEngine avoids allocating this buffer by computing im2col inline during the matrix multiply.

Exercise 7.4: Flash vs SRAM for Weights Derive

MCUNet model: 1.0 M INT8 parameters (weights) stored in Flash, and peak activation SRAM = 256 KB. Total Flash needed for weights in KB? (1 byte/param)

KB
Show derivation
Flash = 1,000,000 bytes / 1024 = 976.6 KB ≈ 976 KB ≈ 1 MB

~1 MB Flash for weights, 256 KB SRAM for activations. The Arduino Nano 33 BLE Sense has 1 MB Flash and 256 KB SRAM — this model fits exactly! MCUNet was designed for this class of device. Flash is non-volatile (survives power cycles) so weights live there; SRAM is volatile and only holds runtime activations.

Exercise 7.5: Depthwise Tiling Working Set Derive

A depthwise Conv (K=3, W=56, C=64) processes one row at a time. The tiling working set is K rows of the input: K×W×C bytes (INT8). How many KB?

KB
Show derivation
bytes = 3 × 56 × 64 = 10,752 bytes / 1024 ≈ 10.5 KB

10.5 KB is a tiny working set — easily fits in L1 cache or even tight MCU SRAM. By processing row-by-row, TinyEngine never needs to materialise the full 56×56×64 = 200 KB activation map at once. This row-tiling strategy is the core innovation in TinyEngine's depthwise kernel.

Chapter 8: Efficient LLMs

Large language models are bottlenecked by two things: attention's quadratic FLOPs in the sequence length, and KV-cache memory at inference time. GQA, LoRA, and QLoRA directly attack these two bottlenecks.

Attention FLOPs per layer (dense):
FLOPs ≈ 4 × N2 × d  (N = seq len, d = head dim × H)

KV-cache bytes (per token generated, FP16):
bytes = 2 × Nlayers × Nheads × dhead × 2
(factor 2 for K and V; factor 2 for FP16 bytes)

GQA head reduction:
KV-cache ÷ G    (G = #KV groups, H⁄G heads share each KV)

LoRA parameter count:
extra params = 2 × d × r  (vs d × d full fine-tune)

QLoRA NF4 memory:
frozen model at 4-bit + adapter at BF16
memory ≈ params × 0.5 + adapter_params × 2
KV-cache grows with batch × sequence length. Serving a 70B model with batch=16 and context=4096 tokens: KV-cache = 2 × 80 layers × 8 heads × 128 d_head × 4096 tokens × 16 batch × 2 bytes ≈ 52 GB — more than the model weights. This is why GQA and sliding-window attention are critical for production LLM serving.
Exercise 8.1: Attention FLOPs Derive

One Transformer attention layer: N = 1024 tokens, d = 512 (total hidden dim). Using FLOPs ≈ 4N2d, compute attention FLOPs in GFLOPs.

GFLOPs
Show derivation
FLOPs = 4 × 10242 × 512
= 4 × 1,048,576 × 512
= 2,147,483,648 ≈ 2.15 GFLOPs

2.15 GFLOPs per attention layer. If we double the sequence to N=2048: FLOPs × 4 = 8.59 GFLOPs. This is the O(N2) quadratic scaling — the fundamental motivation for linear attention, sparse attention (Longformer), and state-space models (Mamba).

Exercise 8.2: KV-Cache Memory Derive

A 32-layer model, 32 heads, dhead = 128, generates 512 tokens at batch = 1, stored as FP16 (2 bytes). How many MB is the KV-cache? (1 MB = 106 bytes)

MB
Show derivation
bytes = 2 × L × H × dh × N × bytes_per_val
= 2 × 32 × 32 × 128 × 512 × 2
= 2 × 32 × 32 × 128 × 1024 = 134,217,728 bytes ≈ 134.2 MB

134 MB for 512 tokens batch=1. Scale to batch=32 and N=4096: ×(32×8) = 256× more = ~34 GB. That's why KV-cache is the bottleneck for LLM serving throughput, not just weight memory.

Exercise 8.3: GQA KV-Cache Reduction Derive

Llama-3-70B uses 64 query heads (H=64) and 8 KV heads (G=8, so each KV head serves H/G=8 query heads). By what factor does GQA reduce KV-cache vs MHA with 64 KV heads?

× reduction
Show derivation
MHA KV heads = H = 64
GQA KV heads = G = 8
reduction = 64 ⁄ 8 = 8×

GQA stores only G=8 KV vectors per position instead of H=64. The 8× reduction in KV-cache directly translates to 8× more throughput at fixed memory, or 8× longer contexts at fixed memory. This is why every modern efficient LLM (Llama 2/3, Mistral, Gemma) uses GQA.

Exercise 8.4: LoRA Parameter Count Derive

Full fine-tuning of a 4096×4096 weight matrix requires 40962 = 16.77 M parameters. LoRA with r=16 injects two low-rank matrices A (4096×16) and B (16×4096). How many trainable LoRA parameters?

params
Show derivation
A: 4096 × 16 = 65,536
B: 16 × 4096 = 65,536
total = 131,072

131,072 LoRA params vs 16,777,216 full-finetune params — a 128× reduction (2r/d = 2×16/4096 = 1/128 of full). For a 7B model applied to all Linear layers this ratio makes the difference between needing 112 GB vs <1 GB of gradient+optimizer memory during fine-tuning.

Exercise 8.5: QLoRA Memory Budget Derive

QLoRA fine-tunes a 7B-parameter model. Frozen model weights: 7B at NF4 (0.5 bytes each). LoRA adapter: 20 M params at BF16 (2 bytes each). Compute total memory in GB.

GB
Show derivation
frozen = 7 × 109 × 0.5 = 3.5 × 109 bytes = 3.5 GB
adapter = 20 × 106 × 2 = 40 × 106 bytes = 0.04 GB
total ≈ 3.54 GB

3.54 GB for a 7B model fine-tune! Full BF16 fine-tuning of 7B needs 14 GB just for weights (plus 2× for optimizer states = 42 GB). QLoRA fits on a single 8 GB consumer GPU. This democratisation of LLM fine-tuning is why QLoRA (Dettmers et al. 2023) became one of the most impactful efficiency papers of that year.

Chapter 9: Capstone — End-to-End System Design

These exercises combine multiple constraints simultaneously: fit a model on a microcontroller with 256 KB SRAM and 1 MB Flash, or serve a quantized LLM on a limited GPU. You must reason about size, compute, and memory at the same time.

MCU design checklist. (1) Count Flash: params × bytes/param ≤ Flash budget. (2) Count peak SRAM: max layer I/O buffer ≤ SRAM budget. (3) Count MACs: total MACs within latency target given MCU throughput. If any constraint fails, apply the right technique: quantize to save Flash/SRAM, prune to save MACs, patch-inference to save SRAM.
Exercise 9.1: MCU Flash Feasibility Derive

An MCU has 1 MB Flash. A model has 800,000 INT8 parameters (1 byte each) plus 50 KB of firmware overhead. Does it fit? Enter the total Flash usage in KB.

KB used
Show derivation
weights = 800,000 bytes / 1024 = 781.25 KB
total = 781.25 + 50 = 831.25 KB ≈ 831 KB
1 MB = 1024 KB → fits! (831 < 1024)

831 KB < 1024 KB — the model fits in Flash with 193 KB to spare. If it didn't fit: (a) switch to INT4 (halve Flash), (b) prune 20% of weights, or (c) remove the last few layers and accept slight accuracy loss. Knowing which knob to turn requires understanding all constraints simultaneously.

Exercise 9.2: SRAM Budget with Patch Inference Derive

The bottleneck layer needs 320 KB peak SRAM (two-buffer). The MCU has 256 KB. With patch size P=2, what is the reduced SRAM requirement in KB?

KB
Show derivation
reduced = 320 ⁄ P2 = 320 ⁄ 4 = 80 KB

80 KB < 256 KB — the bottleneck layer now fits comfortably. P=2 patches divide the output into 2×2 = 4 tiles; only one tile's activations live in SRAM at a time. The ~3% MAC overhead from border recomputation is well worth the 4× SRAM savings.

Exercise 9.3: End-to-End Latency on an MCU Derive

A model runs 30 MMACs per inference. The MCU sustains 60 MMACops/s. What is the minimum inference time in ms?

ms
Show derivation
time = MACs ⁄ throughput = 30 × 106 ⁄ 60 × 106 = 0.5 s = 500 ms

500 ms per inference. For a wake-word detector this is too slow (needs <100 ms). Options: (a) prune 5× to 6 MMACs → 100 ms, (b) use DSP SIMD to get 4× throughput → 125 ms, (c) redesign with a smaller backbone (shufflenet / squeezenet). In practice MCUNet targets ~200 MMACs on faster chips (100+ MMACops/s) for sub-200ms inference.

Exercise 9.4: GPU Memory for Serving a Quantized LLM Derive

Serving a 13B INT4 model (0.5 bytes/param) + KV-cache for batch=4, seq=2048 tokens, 40 layers, 40 heads, dhead=128, FP16. KV-cache bytes = 2×40×40×128×2048×4×2. Total memory in GB?

GB
Show derivation
weights = 13 × 109 × 0.5 = 6.5 GB
KV = 2×40×40×128×2048×4×2 = 6,710,886,400 bytes
≈ 6.71 GB ≈ 6.4 GB (using 109 denominator)
total = 6.5 + 6.71 = 13.21 GB (accept 13.2–13.5 GB)

~13.4 GB total — fits on a single RTX 3090 (24 GB) with room for activations and overhead. Without INT4 quantization the weights alone would be 26 GB (FP16) — requiring two GPUs. This is the real-world value of combined INT4 quantization + GQA (which would halve the KV-cache further).

Exercise 9.5: Combined Savings Stack Derive

A model starts at 50 MMACs and 6 MB FP32. Apply: depthwise separable (8× fewer MACs), INT8 quantization (4× smaller), 50% magnitude pruning (2× fewer effective ops). Final effective MMACs and MB?

MMACs (final)
Show derivation
MACs: 50 ⁄ 8 (DW) ⁄ 2 (pruning) = 3.125 MMACs
Size: 6 MB ⁄ 4 (INT8) = 1.5 MB

From 50 MMACs / 6 MB to 3.125 MMACs / 1.5 MB — a 16× compute reduction and 4× size reduction by combining orthogonal techniques. This multiplicative stacking is the key insight: each technique targets a different source of inefficiency (architecture, precision, connectivity), so their benefits compound rather than overlap.