Every number a TinyML engineer must derive cold: model size, MACs, pruning saliency, INT8 quantization, NAS search-space cardinality, knowledge distillation temperature scaling, MCU SRAM tiling, efficient LLM attention costs, KV-cache bytes, and LoRA parameter counts. All verifiable in-browser with instant feedback.
Before you can shrink a model you need to measure it. Three numbers dominate: parameter count (storage), MACs (compute), and peak activation memory (runtime SRAM). Getting these wrong means shipping a model that silently overflows a microcontroller.
A MobileNetV2 variant has 3.4 million parameters stored in FP32 (32 bits each). What is the model size in megabytes (MB)? (1 MB = 10⁶ bytes)
Each FP32 parameter occupies 4 bytes. 3.4 M × 4 = 13.6 MB. This is why INT8 quantization (1 byte per param) reduces storage to 3.4 MB — a 4× saving.
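The size calculation is easy to script. The following is a minimal Python sketch; the helper name `model_size_mb` is an illustrative choice, and it assumes the exercise's convention of 1 MB = 10⁶ bytes.

```python
def model_size_mb(num_params: int, bits_per_param: int) -> float:
    """Storage size in MB, with 1 MB = 10**6 bytes as in the exercise."""
    return num_params * bits_per_param / 8 / 1e6


fp32_mb = model_size_mb(3_400_000, 32)  # FP32: 4 bytes per parameter
int8_mb = model_size_mb(3_400_000, 8)   # INT8: 1 byte per parameter
print(f"FP32: {fp32_mb:.1f} MB, INT8: {int8_mb:.1f} MB, saving: {fp32_mb / int8_mb:.0f}x")
```

Passing the bit width rather than a byte count keeps the same helper usable for sub-byte formats such as 4-bit weights.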
A layer performs 150 million multiply-accumulate operations (MACs). How many FLOPs is that?
Each MAC fuses a multiply and an add into one operation. When reporting FLOPs (floating-point operations), we count both, so 1 MAC = 2 FLOPs. Many papers report MACs; others report FLOPs — always check which one before comparing numbers.
A matrix multiply moves 400 KB of data and performs 3.2 GFLOPs. What is the arithmetic intensity in FLOPs/byte?
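Arithmetic intensity = FLOPs ⁄ bytes moved = 3.2 × 10⁹ ⁄ (400 × 10³) = 8,000 FLOPs/byte. That is deep in the compute-bound region on any modern chip: GPU ridge points sit far lower, on the order of tens to a few hundred FLOPs/byte depending on precision. Large matrix multiplies are almost always compute-bound, which is why batch size matters for GPU utilisation.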
A Conv2D layer outputs a feature map of shape (batch=1, C=64, H=56, W=56) stored as FP32. How many KB of activation memory does it consume?
1 × 64 × 56 × 56 = 200,704 values × 4 bytes = 802,816 bytes, and 802,816 / 1024 = 784 KB. On a microcontroller with 512 KB SRAM total, this single layer would already overflow. Peak activation memory, not parameter storage, is the binding constraint on MCUs.
A microcontroller has peak compute 200 MFLOPs/s and memory bandwidth 50 MB/s. The ridge point (balance point) is at what arithmetic intensity (FLOPs/byte)?
Ridge point = peak compute ⁄ memory bandwidth = 200 MFLOPs/s ⁄ 50 MB/s = 4 FLOPs/byte. Operations with I < 4 FLOPs/byte are memory-bound; those with I > 4 are compute-bound. Most depthwise convolutions on MCUs fall below the ridge point, making memory bandwidth the bottleneck, not raw compute.
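The roofline reasoning from the last two exercises can be sketched in a few lines of Python. The function names are illustrative, and units are assumed to be plain FLOPs, bytes, FLOPs/s, and bytes/s.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Intensity at which compute and memory bandwidth are balanced."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def classify(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity > ridge else "memory-bound"


mcu_ridge = ridge_point(200e6, 50e6)                   # MCU from the exercise
matmul_intensity = arithmetic_intensity(3.2e9, 400e3)  # matmul from the earlier exercise
print(mcu_ridge, matmul_intensity, classify(matmul_intensity, mcu_ridge))
```

An operation with an intensity of, say, 1 FLOP/byte would be classified as memory-bound on the same MCU.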
To shrink a network you must first count exactly what is inside it. Every Linear and Conv2D layer has a deterministic parameter count and MAC count derivable from its shape alone.
A fully-connected layer maps 512 inputs to 256 outputs, with bias. How many trainable parameters does it have?
The weight matrix is 512×256 = 131,072 entries. The bias vector adds one scalar per output neuron: 256 more. Total: 131,328. In practice many recent architectures omit bias in Linear layers that precede a BatchNorm — then params = 512 × 256 = 131,072.
A Conv2D layer has kernel 3×3, 32 input channels, 64 output channels, with bias. How many parameters?
Each of the 64 output filters is a 3×3×32 kernel = 288 weights. 64 filters × 288 = 18,432 weights. Add 64 biases for a total of 18,496. Contrast with the depthwise separable version from the later exercise: 3×3×32 = 288 DW params + 32×64 = 2,048 PW params = 2,336 total (ignoring biases), roughly 8× fewer.
The 3×3 Conv2D from Ex 1.2 (32 in, 64 out) processes an output feature map of 28×28. How many MACs (in millions)?
MACs = K² × C_in × C_out × H_o × W_o = 9 × 32 × 64 × 28 × 28 = 14,450,688. Each output pixel requires K² × C_in MACs to sum over the receptive field; this is repeated for all C_out output channels and all H_o × W_o output positions. Answer: 14.45 MMACs (enter 14.45).
Replace the 3×3 Conv2D (32→64, output 28×28) with a depthwise separable block (3×3 DW + 1×1 PW). Ratio = standard MACs ⁄ separable MACs. Round to 1 decimal. (Ignore biases.)
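Standard: 14,450,688 MACs. Separable: 9 × 32 × 28 × 28 = 225,792 (DW) + 32 × 64 × 28 × 28 = 1,605,632 (PW) = 1,831,424 MACs. The theoretical reduction is 1 ⁄ (1⁄C_out + 1⁄K²) = 1/(1/64 + 1/9) ≈ 7.9, in line with the roughly 8× saving MobileNet reports over standard convolutions.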
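The parameter and MAC counts for the convolution exercises can be checked with a short Python sketch. The helper names are illustrative; biases are ignored in the MAC counts, as in the exercise.

```python
def conv2d_params(k: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Weights of a standard KxK convolution, plus one bias per output channel."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_macs(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """K*K*C_in MACs per output value, for every channel and output position."""
    return k * k * c_in * c_out * h_out * w_out


def separable_macs(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    depthwise = k * k * c_in * h_out * w_out   # one KxK filter per input channel
    pointwise = c_in * c_out * h_out * w_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


standard = conv2d_macs(3, 32, 64, 28, 28)
separable = separable_macs(3, 32, 64, 28, 28)
print(conv2d_params(3, 32, 64), standard, separable, round(standard / separable, 1))
```

The ratio depends only on K and C_out, so the output resolution cancels out.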
A Transformer FFN first projects from d=512 to 4d=2048 (no bias). Per input token, how many MACs does this first projection require?
For a Linear(d, 4d) layer, each output neuron does d multiplications and d additions (d MACs). With 4d output neurons the total is d × 4d = 4d² MACs. Here: 4 × 512² = 1,048,576 MACs ≈ 1.05 MMACs per token.
Pruning removes weights that contribute least to the output. The key quantity is saliency — a score predicting how much the loss increases if a weight is zeroed. Optimal Brain Damage (OBD) approximates saliency using the second derivative of the loss.
A layer has 500,000 parameters. After magnitude pruning 350,000 are set to zero. What is the sparsity ratio as a percentage?
70% sparsity means 30% of weights remain active. This is sometimes written as "30% density." For magnitude pruning, we sort |w_i| and zero the bottom 70% by absolute value, a simple but surprisingly effective baseline.
A weight w_i = 0.4. The diagonal Hessian entry H_ii = ∂²L/∂w_i² = 5.0. What is the OBD saliency S_i?
The OBD formula is a second-order Taylor expansion of the loss increase when w_i is zeroed: S_i = ΔL ≈ H_ii w_i² / 2. Here S_i = 5.0 × 0.4² / 2 = 0.4. Prune weights with the smallest S_i: they cause the least loss increase when removed.
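As a rough illustration of how the saliency ranks weights, here is a minimal Python sketch. Only the first (weight, Hessian) pair comes from the exercise; the other two pairs are made-up values chosen to show that a large weight in a flat direction can be cheaper to prune than a smaller weight in a sharply curved one.

```python
def obd_saliency(w: float, h_ii: float) -> float:
    """Second-order estimate of the loss increase when weight w is zeroed."""
    return 0.5 * h_ii * w ** 2


# (weight, diagonal Hessian entry); the last two pairs are hypothetical
candidates = [(0.4, 5.0), (0.1, 2.0), (-0.9, 0.5)]

# Lowest saliency first: these are the weights OBD would prune first
ranked = sorted(candidates, key=lambda wh: obd_saliency(*wh))
for w, h in ranked:
    print(f"w={w:+.1f}  H_ii={h:.1f}  S={obd_saliency(w, h):.4f}")
```

Pure magnitude pruning would keep the weight −0.9 ahead of 0.4; the Hessian term reverses that order here.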
NVIDIA 2:4 structured sparsity keeps 2 weights per group of 4. What sparsity fraction (as %) does this produce?
2:4 is exactly 50% sparse by construction. Every group of 4 consecutive weights has exactly 2 zeros and 2 non-zeros. The hardware stores only the 2 non-zero values plus a 2-bit position index for each of them, which enables up to 2× throughput on Ampere Sparse Tensor Cores.
A weight matrix has 1,000,000 parameters at FP32 (4 bytes each). After 90% pruning only 100,000 are non-zero. In CSR format each non-zero uses 4 bytes (value) + 4 bytes (column index). What is the CSR size in MB? (1 MB = 10⁶ bytes)
CSR size = 100,000 × (4 + 4) bytes = 800,000 bytes = 0.8 MB (ignoring the small row-pointer array), versus 4 MB dense: 5× compression at 90% sparsity. CSR breaks even at 50% sparsity for FP32 + int32 (same size as dense) and saves space above that. The catch: random memory access patterns in CSR kill hardware utilisation, which is why structured sparsity (2:4, block sparsity) is preferred in practice.
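The storage trade-off can be sketched in Python as follows. Like the exercise, this assumes one value plus one column index per non-zero and ignores the row-pointer array; the function names are illustrative.

```python
def dense_bytes(num_params: int, value_bytes: int = 4) -> int:
    return num_params * value_bytes


def csr_bytes(num_nonzero: int, value_bytes: int = 4, index_bytes: int = 4) -> int:
    """Value + column index per non-zero; row pointers are ignored."""
    return num_nonzero * (value_bytes + index_bytes)


def break_even_sparsity(value_bytes: int = 4, index_bytes: int = 4) -> float:
    """Sparsity above which this CSR layout is smaller than dense storage."""
    return 1.0 - value_bytes / (value_bytes + index_bytes)


dense = dense_bytes(1_000_000)
sparse = csr_bytes(100_000)
print(dense / 1e6, sparse / 1e6, dense / sparse, break_even_sparsity())
```

Narrower indices lower the break-even point: with 2-byte indices it drops to about 33% sparsity.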
A 10-layer network with 50,000 parameters per layer is pruned to 40% sparsity globally. How many non-zero parameters remain in total?
At 40% sparsity, 60% of weights survive. 300,000 non-zero parameters remain. Whether this buys actual speedup depends on the hardware: unstructured sparsity rarely accelerates inference on MCUs without dedicated sparse arithmetic units.
Quantization maps floating-point values to a smaller integer representation. The mapping is defined by a scale factor S and a zero-point Z computed from the observed min/max of the tensor. Getting S and Z right controls the round-trip dequantization error.
An activation tensor has min = −2.0 and max = 6.0. Compute the INT8 asymmetric scale S (to 4 decimal places).
S is the float value of one INT8 step. The denominator is 2⁸ − 1 = 255 because unsigned INT8 spans [0, 255]: 256 levels, hence 255 steps. S = (6.0 − (−2.0)) ⁄ 255 = 8/255 ≈ 0.0314.
Using min = −2.0, S ≈ 0.03137, compute the integer zero-point Z (round to nearest integer).
Z = round(−min ⁄ S) = round(2.0 ⁄ 0.03137) = round(63.75) = 64. Z = 64 means integer 64 represents the float value 0.0. Plugging back: x̂ = S × (64 − 64) = 0.0. Good: zero maps exactly to zero (important for ReLU and zero-padding correctness).
Using S = 0.03137, Z = 64, quantize the float value x = 1.5 to an INT8 integer (round to nearest, clamp to [0, 255]).
q = round(x ⁄ S) + Z = round(1.5 ⁄ 0.03137) + 64 = round(47.8) + 64 = 112. Sanity check: dequantize q = 112 back: x̂ = 0.03137 × (112 − 64) = 0.03137 × 48 = 1.506. The round-trip error is 1.506 − 1.5 = 0.006, which is within half a step (S/2 ≈ 0.0157).
Dequantize q = 112 using S = 0.03137, Z = 64. Then compute the absolute error vs the original x = 1.5. Give the error to 4 decimal places.
x̂ = 0.03137 × 48 ≈ 1.5058, so the absolute error is about 0.0058. The maximum possible rounding error for an in-range value is S/2 ≈ 0.0157, so the actual error is well inside the bound. For INT8 across a range of 8.0, the step size 0.0314 is small enough that most modern networks tolerate the error with <1% accuracy drop.
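The four quantization exercises chain together naturally. The sketch below, in plain Python, assumes the usual asymmetric (affine) convention S = (max − min)/(2^b − 1), Z = round(−min/S), q = clamp(round(x/S) + Z); the function names are illustrative rather than taken from any particular library.

```python
def asymmetric_qparams(x_min: float, x_max: float, bits: int = 8):
    """Scale and zero-point mapping [x_min, x_max] onto [0, 2**bits - 1]."""
    scale = (x_max - x_min) / (2 ** bits - 1)
    zero_point = round(-x_min / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int, bits: int = 8) -> int:
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))  # clamp to the unsigned range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return scale * (q - zero_point)


scale, zp = asymmetric_qparams(-2.0, 6.0)
q = quantize(1.5, scale, zp)
x_hat = dequantize(q, scale, zp)
print(f"S={scale:.5f} Z={zp} q={q} x_hat={x_hat:.4f} error={abs(x_hat - 1.5):.4f}")
```

Because the sketch uses the unrounded scale 8/255 instead of 0.03137, the last digit of the error can differ slightly from the hand calculation. Note also that Python's `round` rounds exact halves to the nearest even integer, which does not matter for these inputs.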
Weight codebook compression: original FP32 weights (32 bits each) are replaced by 16-cluster K-means centroids (4 bits per index). What is the compression ratio (original bits ⁄ compressed bits per weight)?
With K=16 clusters, each weight needs only log2(16)=4 bits to store its cluster ID. The 16 centroid values are stored separately (small overhead for large layers). This gives 8× weight compression vs FP32. Note: activation memory is unchanged — only weight storage shrinks.
Per-tensor quantization uses one S and Z for the whole weight matrix. Per-channel quantization assigns a separate S and Z per output channel, cutting quantization error dramatically for layers where channels have very different magnitude distributions (common in the later layers of deep networks).
A Conv2D has 128 output channels. Per-channel quantization stores one FP32 scale per channel. How many extra bytes does this add compared to per-tensor (one scale)?
128 scales × 4 bytes = 512 bytes, minus the 4 bytes that a single per-tensor scale already needs: 508 bytes of overhead to store per-channel scales. Compare to the weight matrix: 3×3×C_in×128 weights at 1 byte each (INT8), easily hundreds of KB. The overhead is tiny but the accuracy gain can be large (often 1–2% top-1 on ImageNet vs per-tensor).
In quantization-aware training with STE, the upstream gradient ∂L⁄∂q = 3.0. According to STE, what value do you use for ∂L⁄∂w in the weight update?
STE passes the upstream gradient through unchanged, so ∂L⁄∂w = ∂L⁄∂q = 3.0. The quantize function has zero gradient almost everywhere (piecewise constant staircase) and undefined gradient at steps. STE replaces the true zero gradient with the identity: pretend quantize is a pass-through during backprop. This allows weights to receive gradient signal and improve across training epochs even though we quantize during the forward pass.
A Linear layer has max activation magnitude max|X_j| = 8.0 and max weight magnitude max|W_j| = 0.5 for channel j. With α = 0.5, compute the SmoothQuant migration factor s_j.
s_j = max|X_j|^α ⁄ max|W_j|^(1−α) = 8.0^0.5 ⁄ 0.5^0.5 = 2.828 ⁄ 0.707 = 4.0. This means: divide activation channel j by 4 (making max|X'_j| = 2.0) and multiply weight channel j by 4 (making max|W'_j| = 2.0). Now both tensors have the same range and are equally hard to quantize. SmoothQuant migrates quantization difficulty from outlier-heavy activations to weights, which quantize more cleanly.
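A small Python sketch of the per-channel smoothing factor, assuming the formula s_j = max|X_j|^α ⁄ max|W_j|^(1−α) used in this exercise; the function name is illustrative.

```python
def smoothquant_factor(max_abs_x: float, max_abs_w: float, alpha: float = 0.5) -> float:
    """Per-channel migration factor s_j."""
    return max_abs_x ** alpha / max_abs_w ** (1.0 - alpha)


s = smoothquant_factor(8.0, 0.5, alpha=0.5)
x_range_after = 8.0 / s   # activations are divided by s_j
w_range_after = 0.5 * s   # weights are multiplied by s_j
print(s, x_range_after, w_range_after)
```

With α = 0.5 the two ranges meet in the middle; a larger α moves more of the difficulty onto the weights, and α = 0 leaves the activations untouched whenever max|W_j| = 1.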
A model has 7 billion FP16 parameters (2 bytes each). Quantizing to INT4 (0.5 bytes each). What is the INT4 model size in GB? (1 GB = 10⁹ bytes)
FP16: 7B × 2 = 14 GB — doesn't fit in a 12 GB GPU. INT4: 7B × 0.5 = 3.5 GB — fits on a consumer GPU. This is exactly why GPTQ and QLoRA target 4-bit weights for 7B LLMs: a 4× reduction vs FP16 makes them accessible on commodity hardware.
How many distinct quantization levels does an unsigned b-bit integer have? For b = 4, what is the answer?
4-bit unsigned integers span 0..15 = 2⁴ = 16 distinct values. The scale S = range / (2^b − 1) = range/15, which is 17× coarser than INT8's range/255. That's why INT4 is noticeably more lossy and usually requires QAT or GPTQ second-order correction to maintain accuracy on LLMs.
Neural Architecture Search (NAS) automates the choice of network topology. The search space defines all candidate architectures. DARTS relaxes the discrete choice to a continuous soft weighting; ProxylessNAS adds a latency term directly to the loss.
A NAS search space has 7 operation choices per edge and 14 edges in the DAG. How many candidate architectures are in the search space? The count is 7¹⁴; enter its base-10 logarithm to 1 decimal place.
7¹⁴ ≈ 6.78 × 10¹¹, roughly 700 billion architectures, so log10(7¹⁴) = 14 × log10(7) ≈ 11.8. This is why brute-force evaluation (train each architecture to convergence and measure accuracy) is completely infeasible. DARTS collapses this to a single training run with a mixed supernetwork.
Three operations have architecture logits α = [2.0, 1.0, 0.0]. Compute the softmax probability p_0 for operation 0 (to 3 decimal places).
p_0 = e² ⁄ (e² + e¹ + e⁰) = 7.389 ⁄ 11.107 ≈ 0.665, i.e. 66.5% weight on operation 0. After DARTS training converges, if p_0 dominates like this the final discrete architecture picks op 0 at this edge via argmax. The architecture α parameters are trained by gradient descent on a validation loss, while network weights W are trained on the training loss.
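Python's arbitrary-precision integers make the cardinality easy to check exactly. A minimal sketch, with an illustrative function name:

```python
import math


def search_space_size(ops_per_edge: int, num_edges: int) -> int:
    """Each edge independently picks one operation."""
    return ops_per_edge ** num_edges


size = search_space_size(7, 14)
print(f"{size:,} architectures, log10 = {math.log10(size):.1f}")
```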
Three operations have latencies [10, 5, 2] ms and probabilities [0.6, 0.3, 0.1]. Compute the expected latency E[lat] in ms.
E[lat] = 0.6 × 10 + 0.3 × 5 + 0.1 × 2 = 7.7 ms. ProxylessNAS adds a differentiable latency loss R(α) = E[lat] to the total loss: L = L_acc + λ × E[lat]. Differentiating through the softmax gives ∂E[lat]/∂α_k = p_k × (lat_k − E[lat]), so gradient descent lowers the logits of ops slower than the current expectation and raises those of faster ops, pushing the search toward faster ops automatically during training.
MnasNet uses reward R = ACC × (T_target ⁄ T)^β. For ACC = 0.75, T = 60 ms, T_target = 80 ms, β = −0.07. Compute R to 3 decimal places.
Here T = 60 ms is faster than T_target = 80 ms, so T_target ⁄ T = 80/60 = 1.33 > 1. Because β = −0.07 < 0, the factor is (1.33)^(−0.07) ≈ 0.980 < 1: with the formula as written, being faster than target is mildly penalised, which discourages trading accuracy for marginal latency gains. Answer ≈ 0.750 × 0.980 = 0.735. (Enter ~0.735.)
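The softmax weighting and the expected-latency term from the last two exercises can be reproduced with the Python standard library. This is a sketch with illustrative function names, not the ProxylessNAS implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(probs, latencies_ms):
    """Probability-weighted latency of the candidate ops on one edge."""
    return sum(p * lat for p, lat in zip(probs, latencies_ms))


p = softmax([2.0, 1.0, 0.0])
print(f"p_0 = {p[0]:.3f}")
print(f"E[lat] = {expected_latency([0.6, 0.3, 0.1], [10, 5, 2]):.1f} ms")
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.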
Brute-force NAS evaluates 1,000 architectures each taking 4 GPU-hours. DARTS trains one supernetwork in 4 GPU-days. How many times cheaper is DARTS? (1 day = 24 hours)
DARTS is ~42× cheaper in this scenario. Original NAS (Zoph & Le 2017) used 800 GPUs × 28 days = 22,400 GPU-days; DARTS (Liu et al. 2018) reduced this to 4 GPU-days — a 5,600× improvement. This is what made NAS practical for research labs without massive compute budgets.
Knowledge distillation trains a small student network to mimic a large teacher. The student minimises a weighted combination of the hard-label cross-entropy and the soft-label KL divergence from the teacher's output distribution. A temperature T softens the teacher's logits, amplifying the information in low-probability classes.
Teacher logits for 3 classes: z = [4.0, 2.0, 1.0]. With temperature T = 4, compute the soft probability p_0^(T) for class 0 (to 3 decimal places).
At T=1 (hard), p_0 ≈ 0.84 (very confident class 0). At T=4 (soft), z/T = [1.0, 0.5, 0.25] and p_0 = e^1.0 ⁄ (e^1.0 + e^0.5 + e^0.25) ≈ 0.481: the distribution is far more spread out. The student now learns that classes 1 and 2 are "not completely wrong", encoding richer structure than a one-hot label ever could. Answer: ~0.481.
Distillation is run at T = 5. By what factor do KD gradients shrink without the T² correction, relative to T = 1?
Soft-target gradients scale as 1/T², so at T=5 the KD gradients are 25× smaller than at T=1. Without the T² correction the hard-label CE term completely dominates the loss. Hinton et al. 2015 recommend multiplying the KD term by T² to keep the two terms balanced, which is why the T² factor appears in the standard KD loss formula.
In a KD training step: L_CE = 2.0, KL = 0.4, T = 3, λ = 0.7. Compute total KD loss L.
L = (1 − λ) × L_CE + λ × T² × KL = 0.3 × 2.0 + 0.7 × 9 × 0.4 = 0.60 + 2.52 = 3.12. With λ=0.7 (strong KD signal) and T²=9, the soft-label term dominates: 2.52 vs 0.60. This is typical early in distillation training when the student needs to match the teacher's overall output structure before fitting hard labels. Total: 3.12 (accept 3.12 ± 0.05).
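The temperature softmax and the combined loss can be sketched in plain Python. The weighting L = (1 − λ) × L_CE + λ × T² × KL is inferred from the 0.60 and 2.52 terms quoted above and should be treated as an assumption; the function names are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T; a higher T gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(ce, kl, temperature, lam):
    """Assumed weighting: (1 - lam) * CE + lam * T^2 * KL."""
    return (1.0 - lam) * ce + lam * temperature ** 2 * kl


teacher_logits = [4.0, 2.0, 1.0]
print(f"T=1: {softmax_with_temperature(teacher_logits, 1.0)[0]:.3f}")
print(f"T=4: {softmax_with_temperature(teacher_logits, 4.0)[0]:.3f}")
print(f"loss: {kd_loss(ce=2.0, kl=0.4, temperature=3.0, lam=0.7):.2f}")
```

Raising T much further drives every probability toward 1/C, which matches the uniform limit discussed in the last exercise of this section.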
A ResNet-50 teacher has 25.6 M parameters. A MobileNetV2 student has 3.4 M. What is the parameter compression ratio (teacher / student)?
A 7.5× parameter reduction. With distillation the MobileNetV2 student can approach ResNet-50 accuracy (within ~1–2% top-1 on ImageNet) while being 7.5× smaller. Without distillation — training MobileNetV2 directly on labels — it typically scores 2–3% lower. The soft teacher targets provide the accuracy boost.
At T → ∞, what do the soft probabilities p_i^(T) converge to for a C-class problem? (Express as a fraction in terms of C.)
At T=∞ every class gets equal probability 1/C — maximum entropy, no information. For C=10 that's 0.1 each. This is why extremely high T is bad: the soft targets collapse to uninformative uniform noise. In practice T = 3–7 is optimal for most tasks, balancing information content with gradient stability.
Microcontrollers have typically 256 KB–1 MB SRAM. The binding constraint is peak activation memory — the maximum memory required to hold the input and output of a single layer simultaneously. MCUNet co-designs the neural network (TinyNAS) and the inference engine (TinyEngine) to stay inside this budget.
A Conv2D layer: input (1×32×56×56) INT8 (1 byte/value), output (1×64×56×56) INT8. What is the peak two-buffer SRAM in KB? (1 KB = 1024 bytes)
Input: 32 × 56 × 56 = 100,352 bytes; output: 64 × 56 × 56 = 200,704 bytes; together 301,056 bytes = 294 KB just for this one layer's I/O buffers. An STM32F7 has 512 KB SRAM. After subtracting stack, heap, and other overhead there may be only 400 KB available. This single layer uses 74% of that, leaving little for anything else. This is the MCU memory wall in practice.
Patch-based inference uses P=4 (4×4 patches). By what factor does SRAM reduce (approximately)?
Dividing the 56×56 output into 4×4 patches means each patch is 14×14. We only need to hold 14×14 (not 56×56) of activations in memory at once — 16× less. The input patch with halo (border needed for K=3 receptive field) is slightly larger than 14×14 but still far smaller than the full map.
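The two-buffer estimate and the approximate patch saving can be sketched in Python. The helpers are illustrative; the patch estimate deliberately ignores the halo, so it is a lower bound on the real per-patch footprint.

```python
import math


def two_buffer_kb(in_shape, out_shape, bytes_per_value=1):
    """Peak SRAM in KB (1 KB = 1024 bytes) to hold a layer's input and output."""
    total_values = math.prod(in_shape) + math.prod(out_shape)
    return total_values * bytes_per_value / 1024


def patched_kb(full_kb, patches_per_side):
    """Approximate footprint with P x P patches, ignoring halo overhead."""
    return full_kb / (patches_per_side ** 2)


full = two_buffer_kb((1, 32, 56, 56), (1, 64, 56, 56))
print(f"full map: {full:.0f} KB, with P=4: about {patched_kb(full, 4):.1f} KB")
```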
A 3×3 Conv2D: C_in = 32, output 14×14. How many INT8 values does the im2col buffer hold?
im2col "unrolls" each K×K×C_in receptive field into one column of a matrix. The resulting matrix has K² × C_in = 288 rows and H_o × W_o = 196 columns, i.e. 288 × 196 = 56,448 values. This is K² = 9× larger than the original input. TinyEngine avoids allocating this buffer by computing im2col inline during the matrix multiply.
MCUNet model: 1.0 M INT8 parameters (weights) stored in Flash, and peak activation SRAM = 256 KB. Total Flash needed for weights in KB? (1 byte/param)
1,000,000 bytes ⁄ 1024 ≈ 977 KB: ~1 MB Flash for weights, 256 KB SRAM for activations. The Arduino Nano 33 BLE Sense has 1 MB Flash and 256 KB SRAM, so this model only just fits, with under 50 KB of Flash left for everything else. MCUNet was designed for this class of device. Flash is non-volatile (survives power cycles) so weights live there; SRAM is volatile and only holds runtime activations.
A depthwise Conv (K=3, W=56, C=64) processes one row at a time. The tiling working set is K rows of the input: K×W×C bytes (INT8). How many KB?
3 × 56 × 64 = 10,752 bytes = 10.5 KB is a tiny working set that easily fits in L1 cache or even tight MCU SRAM. By processing row-by-row, TinyEngine never needs to materialise the full 56×56×64 = 200,704-byte (196 KB) activation map at once. This row-tiling strategy is the core innovation in TinyEngine's depthwise kernel.
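The im2col buffer size and the row-tiling working set from the exercises above reduce to two small formulas. A Python sketch with illustrative helper names, assuming INT8 (1 byte per value) and 1 KB = 1024 bytes:

```python
def im2col_values(k: int, c_in: int, h_out: int, w_out: int) -> int:
    """Rows = K*K*C_in (one receptive field), columns = output positions."""
    return (k * k * c_in) * (h_out * w_out)


def row_tile_kb(k: int, width: int, channels: int, bytes_per_value: int = 1) -> float:
    """Working set when only K input rows are resident at a time."""
    return k * width * channels * bytes_per_value / 1024


def full_map_kb(height: int, width: int, channels: int, bytes_per_value: int = 1) -> float:
    return height * width * channels * bytes_per_value / 1024


print(im2col_values(3, 32, 14, 14))
print(row_tile_kb(3, 56, 64), full_map_kb(56, 56, 64))
```

Comparing the last two numbers shows the row-tiled working set is K ⁄ H = 3 ⁄ 56 of the full activation map.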
Large language models are bottlenecked by two things: attention's quadratic FLOPs in the sequence length, and KV-cache memory at inference time. GQA, LoRA, and QLoRA directly attack these two bottlenecks.
One Transformer attention layer: N = 1024 tokens, d = 512 (total hidden dim). Using FLOPs ≈ 4N²d, compute attention FLOPs in GFLOPs.
4 × 1024² × 512 = 2,147,483,648 FLOPs ≈ 2.15 GFLOPs per attention layer. If we double the sequence to N=2048: FLOPs × 4 = 8.59 GFLOPs. This is the O(N²) quadratic scaling, the fundamental motivation for linear attention, sparse attention (Longformer), and state-space models (Mamba).
A 32-layer model, 32 heads, d_head = 128, generates 512 tokens at batch = 1, stored as FP16 (2 bytes). How many MB is the KV-cache? (1 MB = 10⁶ bytes)
KV-cache bytes = 2 (K and V) × layers × heads × d_head × tokens × batch × bytes per value = 2 × 32 × 32 × 128 × 512 × 1 × 2 = 268,435,456 bytes ≈ 268 MB for 512 tokens at batch=1. Scale to batch=32 and N=4096: ×(32 × 8) = 256× more ≈ 69 GB. That's why KV-cache is the bottleneck for LLM serving throughput, not just weight memory.
Llama-3-70B uses 64 query heads (H=64) and 8 KV heads (G=8, so each KV head serves H/G=8 query heads). By what factor does GQA reduce KV-cache vs MHA with 64 KV heads?
GQA stores only G=8 KV vectors per position instead of H=64. The 8× reduction in KV-cache leaves room for 8× larger batches or 8× longer contexts at a fixed cache budget. This is why many recent LLMs (for example Llama 2 70B, Llama 3, and Mistral 7B) use GQA.
Full fine-tuning of a 4096×4096 weight matrix requires 4096² = 16.77 M parameters. LoRA with r=16 injects two low-rank matrices A (4096×16) and B (16×4096). How many trainable LoRA parameters?
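KV-cache size follows one product, with a leading factor of 2 for the separate K and V tensors (the same factorisation the integration exercise on LLM serving uses). The Python sketch below uses an illustrative function name; the layer count, head dimension, and token count in the GQA comparison are placeholder values, since only the head counts H=64 and G=8 come from the exercise.

```python
def kv_cache_bytes(layers, kv_heads, d_head, tokens, batch=1, bytes_per_value=2):
    """2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes."""
    return 2 * layers * kv_heads * d_head * tokens * batch * bytes_per_value


# Multi-head attention: one KV head per query head (H = 64)
mha = kv_cache_bytes(layers=32, kv_heads=64, d_head=128, tokens=512)
# Grouped-query attention: G = 8 KV heads shared by the 64 query heads
gqa = kv_cache_bytes(layers=32, kv_heads=8, d_head=128, tokens=512)

print(f"MHA: {mha / 1e6:.0f} MB, GQA: {gqa / 1e6:.0f} MB, reduction: {mha // gqa}x")
```

The reduction factor is simply H ⁄ G and does not depend on the other dimensions.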
131,072 LoRA params vs 16,777,216 full-finetune params — a 128× reduction (2r/d = 2×16/4096 = 1/128 of full). For a 7B model applied to all Linear layers this ratio makes the difference between needing 112 GB vs <1 GB of gradient+optimizer memory during fine-tuning.
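The LoRA parameter count generalises to any layer shape and rank. A minimal Python sketch with illustrative names:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A is d_in x r and B is r x d_out, so the adapter adds r * (d_in + d_out)."""
    return rank * (d_in + d_out)


full = 4096 * 4096
adapter = lora_params(4096, 4096, rank=16)
print(f"full: {full:,}  LoRA: {adapter:,}  reduction: {full / adapter:.0f}x")
```

Because the count is linear in r, doubling the rank doubles the adapter size and halves the reduction factor.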
QLoRA fine-tunes a 7B-parameter model. Frozen model weights: 7B at NF4 (0.5 bytes each). LoRA adapter: 20 M params at BF16 (2 bytes each). Compute total memory in GB.
3.54 GB for a 7B model fine-tune! Full BF16 fine-tuning of 7B needs 14 GB just for weights (plus 2× for optimizer states = 42 GB). QLoRA fits on a single 8 GB consumer GPU. This democratisation of LLM fine-tuning is why QLoRA (Dettmers et al. 2023) became one of the most impactful efficiency papers of that year.
These exercises combine multiple constraints simultaneously: fit a model on a microcontroller with 256 KB SRAM and 1 MB Flash, or serve a quantized LLM on a limited GPU. You must reason about size, compute, and memory at the same time.
An MCU has 1 MB Flash. A model has 800,000 INT8 parameters (1 byte each) plus 50 KB of firmware overhead. Does it fit? Enter the total Flash usage in KB.
831 KB < 1024 KB — the model fits in Flash with 193 KB to spare. If it didn't fit: (a) switch to INT4 (halve Flash), (b) prune 20% of weights, or (c) remove the last few layers and accept slight accuracy loss. Knowing which knob to turn requires understanding all constraints simultaneously.
The bottleneck layer needs 320 KB peak SRAM (two-buffer). The MCU has 256 KB. With patch size P=2, what is the reduced SRAM requirement in KB?
80 KB < 256 KB — the bottleneck layer now fits comfortably. P=2 patches divide the output into 2×2 = 4 tiles; only one tile's activations live in SRAM at a time. The ~3% MAC overhead from border recomputation is well worth the 4× SRAM savings.
A model runs 30 MMACs per inference. The MCU sustains 60 MMACs/s. What is the minimum inference time in ms?
30 MMACs ⁄ 60 MMACs/s = 0.5 s = 500 ms per inference. For a wake-word detector this is too slow (needs <100 ms). Options: (a) prune 5× to 6 MMACs → 100 ms, (b) use DSP SIMD to get 4× throughput → 125 ms, (c) redesign with a smaller backbone (ShuffleNet / SqueezeNet). The same arithmetic sets the budget for larger models: a ~200 MMAC network would need roughly 1,000 MMACs/s of sustained throughput to finish in under 200 ms.
Serving a 13B INT4 model (0.5 bytes/param) + KV-cache for batch=4, seq=2048 tokens, 40 layers, 40 heads, d_head = 128, FP16. KV-cache bytes = 2×40×40×128×2048×4×2. Total memory in GB?
Weights: 13B × 0.5 = 6.5 GB. KV-cache: 2×40×40×128×2048×4×2 = 6,710,886,400 bytes ≈ 6.7 GB. Total ≈ 13.2 GB, which fits on a single RTX 3090 (24 GB) with room for activations and overhead. Without INT4 quantization the weights alone would be 26 GB (FP16), requiring two GPUs. This is the real-world value of INT4 quantization; adding GQA would shrink the KV-cache further, by the factor H ⁄ G.
A model starts at 50 MMACs and 6 MB FP32. Apply: depthwise separable (8× fewer MACs), INT8 quantization (4× smaller), 50% magnitude pruning (2× fewer effective ops). Final effective MMACs and MB?
From 50 MMACs / 6 MB to 3.125 MMACs / 1.5 MB — a 16× compute reduction and 4× size reduction by combining orthogonal techniques. This multiplicative stacking is the key insight: each technique targets a different source of inefficiency (architecture, precision, connectivity), so their benefits compound rather than overlap.