MCUNet: Deep Learning on Microcontrollers

Chapter 0: The Memory Wall

Imagine you are asked to run image classification on an Arduino Nano 33 BLE Sense. The chip has a Cortex-M4 CPU running at 64 MHz, 256 KB of SRAM, and 1 MB of Flash. There is no operating system, no DRAM, no GPU. The total usable memory is smaller than a single uncompressed photograph.

Now consider what a "small" neural network actually costs. MobileNetV2, the go-to efficient model for mobile phones, has 3.4 million parameters. In INT8 that is 3.4 MB — already 3.4× your Flash budget. But parameters aren't the critical constraint. The peak activation memory — the SRAM needed while running the forward pass — reaches 1,372 KB for MobileNetV2. That is 5.4× your entire SRAM.

The gap isn't small. Cloud AI servers have 32 GB of DRAM for activations. Smartphones have 4 GB. A microcontroller such as the STM32F746 has 320 KB of SRAM (the Arduino above has even less, 256 KB). That is a factor of roughly 10,000 between phone and MCU, and over 100,000 between server and MCU. Mainstream efficient architectures were designed assuming at least megabytes of working memory. MCUs break that assumption.

The key problem: Existing efficient models (MobileNet, NASNet) were designed for the phone-to-GPU regime. Their smallest subnetwork still doesn't fit in MCU SRAM. We can't just shrink them — we need to rethink the design space from scratch under a fundamentally different memory constraint.

Hardware Memory Ladder — The 10,000× Gap

Memory available for neural network activations across hardware tiers. Note the logarithmic scale — each step is orders of magnitude.

The canvas above shows the memory gap on a log scale. The steps are far from equal: server to phone is only about 8× (32 GB vs 4 GB), while phone to MCU is roughly 10,000×, about four orders of magnitude. The MCU sits in a different regime from the hardware that neural network research has focused on.

This chapter sets up the central problem. The next chapters analyze why models fail on MCUs (it's mostly the activation memory, not the weights), and then build MCUNet's solution from the ground up.

A MobileNetV2 model in INT8 has 3.4 MB of parameters and 1,372 KB peak activation memory. An MCU has 256 KB SRAM and 1 MB Flash. Which constraint is harder to satisfy for MobileNetV2?

Both are violated: 3.4 MB > 1 MB Flash AND 1,372 KB > 256 KB SRAM. But SRAM is more severely violated (5.4× over budget vs 3.4× for Flash).
Only Flash is violated: 3.4 MB > 1 MB. SRAM is fine because weights are stored in Flash.
Neither is violated if you use INT4 quantization instead of INT8.
Only SRAM is violated: activations don't fit. Parameters can stream from Flash.

Chapter 1: SRAM vs Flash — Two Constraints, Two Roles

A microcontroller has two fundamentally different types of memory, and they impose two separate constraints on neural network deployment. Confusing them is the most common mistake when reasoning about MCU limitations.

Flash (also called ROM or program memory) is read-only at runtime. It is non-volatile — it persists when power is off. Flash holds your compiled program, your constant data, and crucially: the model weights. For a typical MCU, Flash is 1–2 MB. Flash is the model-size constraint. An INT8 model with W weights needs W bytes of Flash. You check this once per model deployment and it never changes.

SRAM (Static RAM) is read-write working memory. It holds your program's stack, heap, temporary variables, and crucially: the layer activations during inference. SRAM is the peak activation memory constraint. Unlike Flash, SRAM usage is dynamic — it changes layer by layer as the forward pass executes. The question isn't total activation size across all layers; it's the maximum SRAM needed at any single point in the forward pass.

Why "peak" matters: You don't need to store all activations simultaneously. While computing layer L, you need only: (1) the input activations to layer L, (2) the output activations from layer L, plus (3) a small buffer for the kernel weights (which can be streamed from Flash a few rows at a time). Once you move to layer L+1, the input to layer L can be freed. So peak SRAM = max over all layers of (input_activation_size + output_activation_size).

Let's formalize. For a convolutional layer with input feature map H×W×C_in and output feature map H'×W'×C_out, and assuming INT8 (1 byte per value):

SRAM_{layer L} = H × W × C_in + H' × W' × C_out (bytes, INT8)

Peak SRAM = max_{L ∈ {0..N}} SRAM_{layer L}

Flash needed = ∑_L (kernel_h × kernel_w × C_in × C_out) (total weight bytes)

The weights don't contribute to SRAM because they can be fetched from Flash row-by-row during computation. Only the activation tensors (input and output of the current layer) must live in SRAM simultaneously.
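The formulas above translate directly into a few lines of Python. This is a minimal sketch assuming INT8 tensors, no bias terms, and a standard (non-depthwise) convolution; the function names `conv_sram_bytes` and `conv_flash_bytes` are illustrative and not taken from any library.

```python
def conv_sram_bytes(h_in, w_in, c_in, h_out, w_out, c_out, bytes_per_value=1):
    """SRAM for one layer: input activation plus output activation."""
    return (h_in * w_in * c_in + h_out * w_out * c_out) * bytes_per_value


def conv_flash_bytes(kernel, c_in, c_out, bytes_per_value=1):
    """Flash for one standard conv layer: kernel_h * kernel_w * C_in * C_out weights."""
    return kernel * kernel * c_in * c_out * bytes_per_value


# A 3x3 conv mapping a 32x32x16 feature map to 32x32x32.
sram = conv_sram_bytes(32, 32, 16, 32, 32, 32)
flash = conv_flash_bytes(3, 16, 32)
print(f"SRAM:  {sram} bytes ({sram / 1024:.1f} KB)")    # 49152 bytes (48.0 KB)
print(f"Flash: {flash} bytes ({flash / 1024:.1f} KB)")  # 4608 bytes (4.5 KB)
```

Note how the activations (48 KB) dwarf the weights (4.5 KB) for this layer, which is the pattern the rest of the lesson builds on.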

Memory Type	Read/Write?	Holds	Constraint on NN	Typical Size (MCU)
Flash	Read-only at runtime	Weights, program code	Total model size	1–2 MB
SRAM	Read + Write	Activations, buffers, stack	Peak activation memory	256–512 KB

Misconception to kill early: "If I quantize the model from FP32 to INT8, I get 4× smaller weights, so that should fix the memory problem." INT8 quantization does shrink the weights 4× (helping Flash), and quantizing activations shrinks each activation value 4× as well. But quantization is a one-time constant factor that leaves the tensor shapes untouched, and the figures in this lesson already assume INT8: MobileNetV2's 1,372 KB peak is the post-quantization number. Even after quantization, it's SRAM (activations) that kills you on an MCU, not Flash (weights).

A conv layer processes a 64×64×32 input feature map and produces a 64×64×64 output. What is its SRAM footprint (in KB) assuming INT8 quantization?

64×64×32 = 131,072 bytes (128 KB): just the input
64×64×64 = 262,144 bytes (256 KB): just the output
(64×64×32) + (64×64×64) = 393,216 bytes (384 KB): input + output
3×3×32×64 = 18,432 bytes (18 KB): the kernel weights

Chapter 2: Peak SRAM Analysis — Layer by Layer

Let's work through a concrete example from scratch. Consider a small convolutional network with 5 layers processing a 96×96 grayscale image. We'll compute the SRAM footprint for every layer and find the bottleneck.

The network architecture:

Layer	Input (H×W×C)	Output (H×W×C)	Stride
Conv0 (3×3, s=2)	96×96×1	48×48×16	2
Conv1 (3×3, s=1)	48×48×16	48×48×32	1
Conv2 (3×3, s=2)	48×48×32	24×24×64	2
Conv3 (3×3, s=1)	24×24×64	24×24×64	1
Conv4 (3×3, s=2)	24×24×64	12×12×128	2

Now compute SRAM for each layer (INT8, bytes = H×W×C for each tensor):

Conv0: 96×96×1 + 48×48×16 = 9,216 + 36,864 = 46,080 bytes (45 KB)

Conv1: 48×48×16 + 48×48×32 = 36,864 + 73,728 = 110,592 bytes (108 KB)

Conv2: 48×48×32 + 24×24×64 = 73,728 + 36,864 = 110,592 bytes (108 KB)

Conv3: 24×24×64 + 24×24×64 = 36,864 + 36,864 = 73,728 bytes (72 KB)

Conv4: 24×24×64 + 12×12×128 = 36,864 + 18,432 = 55,296 bytes (54 KB)

The peak is Conv1 and Conv2, tied at 108 KB. With 256 KB of SRAM on an Arduino Nano, this tiny 5-layer net fits using about 42% of the budget, but only because we used a 96×96 input instead of the standard 224×224. Scale it up to 224×224 and every number increases by (224/96)² ≈ 5.4×: Conv1 alone now needs 588 KB, more than double the MCU's total SRAM.

The scaling intuition: SRAM grows as input resolution squared. Doubling resolution (96→192) quadruples peak SRAM. This is why standard mobile inputs of 224×224 are impossible on MCUs, and why TinyNAS explicitly searches over input resolution as a first-class design variable.
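The quadratic scaling can be checked by deriving every tensor shape from the input resolution. The sketch below assumes the 5-layer example network from this chapter, 'same' padding, INT8 activations, and resolutions that divide evenly by each stride; `LAYERS` and `peak_sram_bytes` are illustrative names.

```python
# (out_channels, stride) for each 3x3 conv in the 5-layer example network.
LAYERS = [(16, 2), (32, 1), (64, 2), (64, 1), (128, 2)]


def peak_sram_bytes(resolution, in_channels=1, layers=LAYERS):
    """Peak of (input + output) activation bytes over all layers, INT8."""
    size, channels, peak = resolution, in_channels, 0
    for out_channels, stride in layers:
        out_size = size // stride  # 'same' padding; assumes size divisible by stride
        layer_bytes = size * size * channels + out_size * out_size * out_channels
        peak = max(peak, layer_bytes)
        size, channels = out_size, out_channels
    return peak


for res in (96, 192, 224):
    print(f"{res}x{res}: {peak_sram_bytes(res) / 1024:.0f} KB")
# 96x96: 108 KB
# 192x192: 432 KB
# 224x224: 588 KB
```

Doubling the resolution from 96 to 192 multiplies the peak by exactly 4, and 224×224 already exceeds a 256 KB budget by more than 2×.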

Layer-by-Layer SRAM Profile

[Interactive figure: per-layer SRAM (input activation + output activation) as a bar chart, with the peak (bottleneck) layer in red. Controls: input resolution (default 96) and channel-width multiplier (default 1.0×).]

In the 5-layer network above, which layer first reaches the peak (Conv2 later ties it), and why?

Conv0: it processes the largest spatial resolution (96×96) so it has the most pixels.
Conv1: it has a large spatial resolution (48×48) AND a large output channel count (32), so both input and output activations are large.
Conv4: it has the most parameters (128 output channels) so it needs the most memory.
Conv3: same input and output size, so the memory usage is symmetric and maximum.

Chapter 3: Why Standard Models Fail

Now that we can compute peak SRAM analytically, let's understand concretely why every existing efficient model fails on MCUs — and why the failure is fundamental, not fixable by minor tweaks.

Take MobileNetV2. It was designed for smartphones and achieves 72% ImageNet top-1 with only 3.4M parameters (3.4 MB in INT8). On a Pixel phone with 4 GB DRAM, this is extremely efficient. Let's compute its peak SRAM. The early stage of MobileNetV2:

Block	Input	Output	SRAM (INT8)
Conv-stem (s=2)	224×224×3	112×112×32	150,528 + 401,408 = 551,936 bytes = 539 KB
InvRes-1 (t=1, s=1)	112×112×32	112×112×16	401,408 + 200,704 = 602,112 bytes = 588 KB
InvRes-2 expand (t=6)	112×112×16	112×112×96	200,704 + 1,204,224 = 1,404,928 bytes = 1,372 KB

InvRes-2's 1×1 expansion layer alone requires 1,372 KB of SRAM, 5.4× the entire 256 KB SRAM of an Arduino. This is the inverted residual bottleneck design at its worst: MobileNetV2 expands channels by 6× within each block to increase representational capacity, but that expansion creates enormous intermediate activations.
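The per-block numbers can be recomputed from the tensor shapes alone. This is a small sketch that assumes INT8 (1 byte per value), 1 KB = 1,024 bytes, and the 256 KB Arduino budget; the variable names are illustrative.

```python
SRAM_BUDGET_KB = 256  # Arduino Nano 33 BLE Sense

# (name, input shape, output shape) with shapes given as (H, W, C).
early_blocks = [
    ("Conv-stem", (224, 224, 3), (112, 112, 32)),
    ("InvRes-1", (112, 112, 32), (112, 112, 16)),
    ("InvRes-2 expand", (112, 112, 16), (112, 112, 96)),
]


def tensor_bytes(shape):
    """Bytes for an INT8 activation tensor of the given (H, W, C) shape."""
    h, w, c = shape
    return h * w * c


sram_kb = {}
for name, inp, out in early_blocks:
    sram_kb[name] = (tensor_bytes(inp) + tensor_bytes(out)) / 1024
    ratio = sram_kb[name] / SRAM_BUDGET_KB
    print(f"{name}: {sram_kb[name]:.0f} KB ({ratio:.1f}x budget)")
# Conv-stem: 539 KB (2.1x budget)
# InvRes-1: 588 KB (2.3x budget)
# InvRes-2 expand: 1372 KB (5.4x budget)
```

Every one of the first three blocks is over budget on its own, and the expansion layer is the worst by a wide margin.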

The irony of MobileNet on MCUs: The inverted residual design that makes MobileNetV2 efficient on phones (low param count, high accuracy) makes it catastrophically expensive on MCUs. The 6× channel expansion designed to pack more computation into fewer parameters causes exactly the peak-SRAM explosion that MCUs cannot handle. The phone-friendly design choice is the MCU-hostile choice.

What about quantizing more aggressively? Lower-bit weights only help model size (Flash). Lower-bit activations do shrink SRAM, but only by a constant factor, because the tensor shapes don't change: a 112×112×96 tensor is 1,204,224 values regardless of bitwidth. At INT8 that single tensor is 1,176 KB (about 1.15 MB); at INT4 it is 588 KB; even at INT2 it is 294 KB, still above 256 KB before counting the layer's input, and INT2 activations cause catastrophic accuracy loss.
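To see how far lower bitwidths can take a single activation tensor, the sketch below sizes the 112×112×96 expansion output at several precisions. It assumes values are packed densely with no alignment padding, which real kernels may not achieve for sub-byte types; `activation_kb` is an illustrative helper.

```python
def activation_kb(h, w, c, bits):
    """Size of an h x w x c activation tensor in KB at the given bitwidth."""
    total_bits = h * w * c * bits
    return total_bits / 8 / 1024


for bits in (32, 8, 4, 2):
    print(f"{bits}-bit: {activation_kb(112, 112, 96, bits):.0f} KB")
# 32-bit: 4704 KB
# 8-bit: 1176 KB
# 4-bit: 588 KB
# 2-bit: 294 KB
```

Each halving of the bitwidth halves the footprint, but the tensor shape sets the starting point, and that is what an MCU-aware architecture has to change.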

What about pruning? Structured pruning reduces channel counts, which helps. But getting MobileNetV2 under 256 KB of SRAM would require pruning roughly 80% of the channels in the early expansion layers, which would severely damage accuracy. Pruning alone cannot bridge a 5.4× gap while preserving useful representations.

The conclusion: we need an entirely different architecture that is designed with MCU SRAM constraints as a first-class objective, not a post-hoc compression target. This is what TinyNAS provides.

MobileNetV2's early inverted residual blocks use a 6× channel expansion to build representational capacity. Why does this design, which is efficient for phones, become a problem specifically on MCUs?

The 6× expansion requires 6× more floating-point multiply-accumulate operations, exceeding the MCU's compute budget.
The 6× expansion requires 6× more weight parameters, exceeding Flash storage.
The 6× channel expansion creates large intermediate activation tensors that must fit in SRAM simultaneously with the input tensor, causing peak SRAM to spike to more than 5× the MCU's total SRAM budget.
MCUs don't support depthwise-separable convolutions, so the expansion cannot be executed.

Chapter 4: TinyNAS — Designing the Search Space

Neural Architecture Search (NAS) automates the hunt for efficient architectures. You define a search space of possible architectures, a search strategy to explore it, and a performance estimator to rank candidates without full training. But here's the key insight the MCUNet paper makes: the search space itself must be designed for MCU constraints, not just the search within it.

The standard NAS approach for mobile models (ProxylessNAS, MnasNet) defines a search space around MobileNetV2-like blocks at (resolution R=224, width multiplier W=1.0). Even small architectures in this space need far more SRAM than an MCU has: a 224×224×3 input feeding a 32-channel stride-2 stem already takes 539 KB in INT8. You could run NAS forever inside this space and never find anything that fits a 256 KB MCU, because the search space itself is wrong.

TinyNAS introduces a two-stage approach:

Stage 1: Optimize the Search Space

Pick the right (resolution R*, width multiplier W*) pair for the memory budget. This selects which sub-space to search within.

↓

Stage 2: Search Within the Space

Run resource-constrained NAS (one-shot weight sharing) inside the optimized sub-space, under explicit SRAM+Flash limits.

↓

Output: MCU-Deployable Architecture

An architecture that fits the SRAM+Flash budget and achieves maximum accuracy within those constraints.

Why does Stage 1 matter so much? Because the peak SRAM and total Flash of a network scale predictably with R and W. Given a target MCU (say, STM32F746: 320 KB SRAM, 1 MB Flash), Stage 1 finds the (R*, W*) combination that puts the search space in a "sweet spot" where most of the sub-networks in the space satisfy the memory budget and are large enough to be accurate.

The key metric is the FLOPs distribution of the satisfying sub-networks. Higher FLOPs (within budget) means more expressive models, which means better accuracy. Stage 1 picks the (R, W) pair that maximizes the 80th percentile FLOPs among sub-networks that satisfy the SRAM+Flash constraints.

The FLOPs proxy: We can't train all sub-networks to evaluate accuracy — that would defeat the purpose of search. Instead, TinyNAS uses a proxy: higher FLOPs within the memory budget correlates strongly with higher accuracy. Maximizing FLOPs at the 80th percentile of the satisfying distribution gives a good search space without any training.

Worked example from the paper. For a 320 KB SRAM, 1 MB Flash budget:

w0.3-r160 (width 0.3×, resolution 160): median FLOPs = 32.5M. Small, fits easily, but low capacity.
w0.5-r144: median FLOPs = 47M. Better capacity, 80th percentile of satisfying models ≈ 50M FLOPs.
w0.6-r112: median FLOPs = 41M. Wider but lower resolution — less capacity than w0.5-r144.

Stage 1 selects w0.5-r144 (or w0.4-r144, depending on exact Flash budget) as the optimal search space for this MCU. All the NAS in Stage 2 will happen within this sub-space.
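The Stage 1 selection rule can be sketched in a few lines. The FLOPs lists below are made-up illustrative values, not data from the paper, and `percentile` uses a simple nearest-rank definition; the paper's exact sampling and statistics may differ.

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p% of n
    return ordered[rank - 1]


def pick_search_space(flops_by_space, p=80):
    """Return the space whose feasible sub-networks have the highest p-th percentile FLOPs."""
    return max(flops_by_space, key=lambda name: percentile(flops_by_space[name], p))


# Hypothetical FLOPs (in millions) of sampled sub-networks that already
# satisfy the SRAM + Flash budget. Illustrative values only.
feasible_flops = {
    "w0.3-r160": [24, 28, 31, 33, 35, 36, 38, 40, 41, 43],
    "w0.5-r144": [35, 39, 43, 46, 48, 49, 50, 51, 53, 56],
    "w0.6-r112": [30, 34, 37, 40, 41, 42, 44, 45, 47, 49],
}

for name, flops in feasible_flops.items():
    print(f"{name}: p80 = {percentile(flops, 80)}M FLOPs")
print("selected:", pick_search_space(feasible_flops))
# w0.3-r160: p80 = 40M FLOPs
# w0.5-r144: p80 = 51M FLOPs
# w0.6-r112: p80 = 45M FLOPs
# selected: w0.5-r144
```

No training appears anywhere in this step: the only inputs are FLOPs counts of sub-networks that pass the analytic memory check.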

TinyNAS Feasible Region: Resolution × Width

[Interactive figure: grid of (resolution, width-multiplier) search spaces. Green cells are spaces where at least 80% of sub-networks fit the budget; a star marks the feasible cell with the highest FLOPs. Controls: SRAM budget (default 320 KB) and Flash budget (default 1.0 MB).]

TinyNAS Stage 1 selects the search space by maximizing which metric among sub-networks that satisfy the memory constraint?

Lowest peak SRAM: pick the search space where the smallest sub-networks fit in memory.
Highest FLOPs at the 80th percentile: pick the search space where the satisfying models tend to be most computationally expressive, as a proxy for accuracy.
Highest accuracy on ImageNet after 5 training epochs.
Smallest number of parameters: minimize Flash usage to leave more budget for SRAM.

Chapter 5: TinyNAS — Resource-Constrained Search

With the search space optimized in Stage 1, TinyNAS Stage 2 runs the actual architecture search within that space. The goal: find the architecture that maximizes accuracy subject to explicit SRAM and Flash constraints.

Stage 2 uses one-shot NAS with weight sharing. Instead of training each candidate architecture from scratch (which would be computationally impossible), TinyNAS trains a single supernet that contains all candidate architectures as sub-networks sharing weights. The supernet covers choices in:

Kernel size: 3×3 or 5×5 per layer
Expansion ratio: 3 or 6 (channel expansion within inverted residual blocks)
Number of blocks per stage: 1, 2, 3, or 4

During training, random sub-networks are sampled from the supernet at each step. All sampled sub-networks share the same weights for their shared operations. This progressive fine-tuning of many overlapping sub-networks gives each sub-network approximately trained weights without requiring separate training.
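The sampling step can be sketched as follows. The choice sets come from the list above; the number of stages (4) and the dictionary layout are assumptions made for illustration, and the actual weight update is only indicated by a comment.

```python
import random

# Per-block and per-stage choices from the supernet description above.
KERNEL_SIZES = (3, 5)
EXPANSION_RATIOS = (3, 6)
BLOCKS_PER_STAGE = (1, 2, 3, 4)


def sample_subnetwork(num_stages, rng):
    """Draw one random architecture: a list of stages, each a list of block configs."""
    arch = []
    for _ in range(num_stages):
        depth = rng.choice(BLOCKS_PER_STAGE)
        stage = [
            {"kernel": rng.choice(KERNEL_SIZES), "expand": rng.choice(EXPANSION_RATIOS)}
            for _ in range(depth)
        ]
        arch.append(stage)
    return arch


rng = random.Random(0)  # seeded so the sketch is reproducible
for step in range(3):
    arch = sample_subnetwork(num_stages=4, rng=rng)
    # In real supernet training, the sampled sub-network's forward and
    # backward pass would update the shared weights at this point.
    print(step, [len(stage) for stage in arch])
```

Because every sampled architecture indexes into the same weight tensors, each training step improves many overlapping sub-networks at once.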

After supernet training, evaluation proceeds as follows. For any candidate architecture c = (kernel choices, expansion ratios, block counts):

Peak SRAM(c) = max_L(in_L(c) + out_L(c)) computed analytically from c

Flash(c) = ∑_L weights_L(c) computed analytically from c

This analytic cost model means we can check SRAM and Flash constraints without running the network at all — just from the architecture specification. Candidates that violate constraints are immediately rejected. The remaining candidates are ranked by their accuracy on a held-out validation set, using the shared weights.
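A minimal sketch of this filter-then-rank step is shown below. The candidates, their byte counts, and the `val_acc` values are hypothetical placeholders; in practice the accuracy would come from evaluating each sub-network with the shared supernet weights.

```python
def peak_sram(layers):
    """layers: list of (in_bytes, out_bytes, weight_bytes) per layer, INT8."""
    return max(in_b + out_b for in_b, out_b, _ in layers)


def flash(layers):
    """Total weight bytes across all layers."""
    return sum(weight_b for _, _, weight_b in layers)


def best_feasible(candidates, sram_budget, flash_budget):
    """Drop candidates that violate either budget, then take the most accurate one."""
    feasible = [
        c for c in candidates
        if peak_sram(c["layers"]) <= sram_budget and flash(c["layers"]) <= flash_budget
    ]
    return max(feasible, key=lambda c: c["val_acc"]) if feasible else None


# Hypothetical candidates, two layers each, for illustration only.
candidates = [
    {"name": "A", "val_acc": 0.62, "layers": [(100_000, 150_000, 300_000), (150_000, 80_000, 400_000)]},
    {"name": "B", "val_acc": 0.66, "layers": [(100_000, 260_000, 300_000), (260_000, 80_000, 500_000)]},
    {"name": "C", "val_acc": 0.64, "layers": [(90_000, 200_000, 350_000), (200_000, 90_000, 450_000)]},
]

winner = best_feasible(candidates, sram_budget=320 * 1024, flash_budget=1024 * 1024)
print(winner["name"])  # C
```

Candidate B has the highest accuracy but its 360,000-byte peak exceeds the 320 KB budget, so it is rejected before accuracy is ever compared.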

Key result: TinyNAS better utilizes available memory. On MobileNetV2 (the baseline), the peak SRAM for the first two stages averages ~200 KB with max spikes to 300 KB. TinyNAS finds networks with a more uniform memory distribution — average ≈180 KB, max ≈190 KB — no layer wastes memory, no layer creates a spike. This lets TinyNAS fit a larger model in the same SRAM budget. The best accuracy per memory byte goes to the network with the flattest SRAM profile across layers.

Flash budget check with real numbers. Suppose the search finds a network with these layers (kernel 3×3, INT8):

Conv-stem: 3×3×3×16 = 432 bytes
8 InvRes blocks, avg 4 layers each, avg 3×3×32×32 per layer = 9,216 bytes each → 8×4×9,216 = 294,912 bytes (~288 KB)
Head (1×1 conv + FC): 1×1×96×10 = 960 bytes
Total Flash ≈ 289 KB < 1 MB ✓

In one-shot NAS with weight sharing, how does TinyNAS evaluate whether a candidate architecture satisfies the SRAM and Flash constraints?

By running the candidate architecture on the actual MCU hardware and measuring memory usage.
By training the candidate from scratch and monitoring peak SRAM during training.
By computing SRAM and Flash analytically from the architecture specification (tensor shapes, layer configs): no execution needed, just arithmetic on the design.
By using the supernet's validation accuracy as a proxy for whether the network will fit on hardware.

Chapter 6: Imbalanced Memory Distribution

Even with TinyNAS finding a better architecture, there remains a structural problem that affects all CNNs on MCUs: the imbalanced memory distribution. The first few layers need far more SRAM than any later layer, so they alone set the peak.

Why? Early CNN layers process large spatial feature maps (high H×W) with relatively few channels. Later layers have smaller spatial maps but more channels. Because SRAM scales as H×W×C, and H×W shrinks rapidly (each stride-2 conv halves both dimensions, cutting H×W to 1/4), the early layers dominate peak SRAM.
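The decay can be made concrete with a toy schedule. The sketch assumes each stage halves H and W and doubles the channel count, a common but not universal pattern, and starts from a 112×112×16 feature map; `stage_activation_kb` is an illustrative name.

```python
def stage_activation_kb(resolution, channels, num_stages):
    """Activation size per stage when each stage halves H and W and doubles C (INT8)."""
    sizes = []
    for _ in range(num_stages):
        sizes.append(resolution * resolution * channels / 1024)
        resolution //= 2
        channels *= 2
    return sizes


for stage, kb in enumerate(stage_activation_kb(112, 16, 4)):
    print(f"stage {stage}: {kb:.1f} KB")
# stage 0: 196.0 KB
# stage 1: 98.0 KB
# stage 2: 49.0 KB
# stage 3: 24.5 KB
```

H×W drops by 4× per stage while C only doubles, so activation memory halves at every stage and the first stage dominates.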

Computed for MobileNetV2 with a 224×224 input, compared against the STM32F746 budget (320 KB SRAM):

Block	Approx SRAM (KB)	vs MCU budget
Block 0 (conv-stem)	539	1.7× over budget
Block 1–2 (first inverted residuals)	~1,372 peak	4.3× over budget
Blocks 3–6	80–200	within budget
Blocks 7–17	8–80	well within budget

The picture is stark: blocks 1–2 need more than 4× the MCU's SRAM budget, while blocks 7–17 never exceed 80 KB. The tail of the network is not the problem. The early blocks are.

The imbalance insight: If we could somehow reduce SRAM for the first 2–3 blocks, the overall peak SRAM would drop dramatically — because those early blocks ARE the peak. Everything else is already well within budget. This localization of the problem is what makes the MCUNetV2 solution elegant: we don't need to redesign the whole network, just the early blocks.

This imbalance appears across all standard architectures, not just MobileNetV2. Any CNN with a large input image will have this property because:

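SRAM_early = H₁ × W₁ × C_in + H₁ × W₁ × C_expand, where H₁ = H₀/2 and W₁ = W₀/2 after the stride-2 stem

For MobileNetV2 with a 224×224 input: H₁ = W₁ = 112, C_in = 16, and C_expand = 96 after 6× expansion, so SRAM_early = 112×112×16 + 112×112×96 = 1,404,928 bytes = 1,372 KB. The expanded tensor alone is 112×112×96 ≈ 1.15 MB. Compressing weights cannot change this; only changing the tensor shapes or the execution schedule can.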

Per-Block SRAM Profile — The Imbalanced Distribution

SRAM usage across MobileNetV2 blocks. The red dashed line is the MCU budget. Early blocks spike far above it; later blocks are well under. Toggle between MobileNetV2 and MCUNet to see how TinyNAS flattens the distribution.

Why do early CNN layers dominate peak SRAM even though they have fewer channels than later layers?

Early layers use larger kernels (7×7 vs 3×3) which require more memory for the convolution operation.
Early layers have more parameters because they process the raw input with high precision.
Early layers process large spatial feature maps (H×W has not been downsampled yet), so activation memory scales as large_H × large_W × C, and the high spatial resolution outweighs the lower channel count.
Early layers use the inverted residual design with 6× channel expansion, which is only present at the beginning of the network.

Chapter 7: Patch-Based Inference — MCUNetV2

MCUNetV2 (Lin et al., NeurIPS 2021) introduces a simple but powerful idea to break the early-layer SRAM bottleneck: instead of processing the entire feature map at once, process it one patch at a time.

Here is the standard inference flow for an early conv layer (call it L1) followed by another (L2):

Per-Layer Inference (Standard)

Feed full 224×224 input → L1 outputs 112×112 feature map (all in SRAM) → L2 processes all 112×112 → done. Every layer holds its full input and output, so for MobileNetV2 peak SRAM = max over layers of (input + output) = 1,372 KB.

Now consider patch-based inference with a 2×2 patch grid (4 patches):

Per-Patch Inference

Divide the input into a 2×2 grid of 112×112 patches (with padding for border). For each patch: feed 112×112 patch through L1 → 56×56 partial output → feed through L2 → accumulate 28×28 output tile. Only one patch's activations are in SRAM at a time. Peak SRAM ≈ 1,372/4 ≈ 343 KB — much better.

More patches → lower peak SRAM. With 9 patches (3×3 grid): SRAM ≈ 1,372/9 ≈ 152 KB, under the 256 KB budget.

But patches aren't free — the halo problem: Convolutional layers have receptive fields. A 3×3 conv at layer L+1 needs a 1-pixel border around each output patch's corresponding input region. With multiple conv layers stacked, the halo (the extra input needed per patch) grows proportionally to the number of layers and kernel sizes. Each halo pixel is computed redundantly across patches — some computation is repeated. MCUNetV2 calls this "halo overhead."

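Each stride-1 3×3 convolution adds 1 pixel of halo per side (2 pixels per dimension), so N stacked layers need ≈ N pixels per side. For 4 stacked layers with a 2×2 patch grid on a 224×224 input: halo width = 4 pixels per side, 8 per dimension. Each patch is 112×112 = 12,544 pixels; with a halo on every side it grows to (112+8)×(112+8) = 120×120 = 14,400 pixels, which is 1,856 extra pixels or about 15% overhead. This is an upper bound, since patch edges that lie on the image border need only ordinary padding. Empirically, MobileNetV2 with 2×2 patches incurs only ~10% extra MACs, a small price for 4.9× lower SRAM.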
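The halo can be derived by walking the layer stack backward from an output tile: an unpadded conv with kernel k and stride s needs (out − 1)·s + k input pixels per dimension. The sketch below applies that standard receptive-field arithmetic; it treats every patch side as interior and measures extra input area, not MACs, so it is a rough upper-bound estimate. The function names are illustrative.

```python
def required_input_size(output_size, layers):
    """Input width needed to produce `output_size` outputs through a stack of
    unpadded convs. layers: list of (kernel, stride), first layer first."""
    size = output_size
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size


def halo_overhead(patch_size, layers):
    """Fractional extra input area a patch must read beyond patch_size ** 2."""
    stride_product = 1
    for _, stride in layers:
        stride_product *= stride
    # Output tile that a patch_size-wide input region nominally maps to.
    output_tile = patch_size // stride_product
    needed = required_input_size(output_tile, layers)
    return needed * needed / (patch_size * patch_size) - 1


four_convs = [(3, 1)] * 4  # four stride-1 3x3 convs
print(required_input_size(112, four_convs))      # 120
print(f"{halo_overhead(112, four_convs):.1%}")   # 14.8%
```

Shrinking the receptive field of the per-patch stage shrinks `required_input_size`, which is the lever that network redistribution pulls.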

MCUNetV2 solves the halo overhead problem with network redistribution: instead of running patch-based inference on a standard network, it redesigns the early-stage architecture to have smaller receptive fields, reducing the halo. The NAS search jointly optimizes architecture AND the patch inference schedule — finding networks that naturally have low-halo early stages and benefit most from patch-based inference.

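Measured results on STM32F746 (320 KB SRAM), INT8. The MobileNetV2 baseline in this table is evidently a scaled-down configuration, since the full 224×224 model analyzed earlier peaks at 1,372 KB: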

Model	Inference Mode	Peak SRAM	Reduction
MobileNetV2	Per-layer	315 KB	—
MobileNetV2	Per-patch 2×2	64 KB	4.9×
MCUNet	Per-layer	113 KB	2.8× vs MbV2
MCUNetV2	Per-patch 2×2	30 KB	10.5× vs MbV2

Patch-based inference reduces peak SRAM by processing one patch at a time. What is the main computational cost of this approach, and approximately how large is it for a 2×2 patch split with 4 early conv layers?

The cost is extra Flash storage for patch offsets: about 4 KB per patch boundary.
The halo overhead: each patch needs a border region from adjacent patches to compute valid convolution outputs, leading to redundant computation at patch boundaries. For 4 conv layers in a 2×2 split, this overhead is roughly 10–15% extra MACs.
The cost is a 4× increase in latency because each patch is processed sequentially on a single-core CPU.
The main cost is synchronization overhead between patches, which requires a full SRAM flush between patches.

Chapter 8: Showcase — MCUNet Memory Explorer

This showcase integrates everything from Chapters 1–7 into one interactive simulation. Configure a network and memory budget, watch the per-layer SRAM profile update live, and see how patch-based inference slashes the peak.

MCUNet Memory Explorer

[Interactive figure: per-layer SRAM profile with the MCU budget as a red dashed line and the bottleneck layer highlighted. Controls: input resolution (default 128), width multiplier (default 0.5×), patch grid (default 1×1), and SRAM budget (default 256 KB).]

Use the presets to walk through the MCUNet story:

MobileNetV2 preset — see the massive early-layer spikes, far above the 256 KB line.
MCUNet preset — TinyNAS found 128px, 0.5× as the optimal space. Bars flatten significantly.
MCUNetV2 preset — enable 2×2 patches. The early spikes are divided by ~4. Peak drops below 256 KB.
Try the SRAM budget slider — find the minimum budget each preset needs to fit entirely within.

The full MCUNet story in numbers: MobileNetV2 at 224px needs 1,372 KB peak SRAM → 5.4× over 256 KB MCU budget. MCUNet (TinyNAS) at 128px, 0.5× needs ~113 KB → fits with room to spare. MCUNetV2 with 2×2 patches at 144px needs ~30 KB → about 8.5× headroom, enabling larger, more accurate networks at the same memory budget. At the same SRAM constraint, MCUNetV2 achieves +4.6% ImageNet top-1 over MCUNet and +12% over off-the-shelf MobileNetV2-int8.

Chapter 9: Connections & Cheat Sheet

MCU Cheat Sheet

Concept	Formula / Rule	Binding constraint?
Flash (model size)	∑_L (k²·C_in·C_out) bytes INT8	Yes — total weights must fit
Peak SRAM	max_L (H·W·C_in + H'·W'·C_out)	Yes — the critical binding constraint
SRAM scales as	R² (doubling res = 4× SRAM)	Why resolution must be small on MCUs
Inverted residual expansion	6× channels → 6× larger activation tensor	Source of MobileNetV2's MCU failure
TinyNAS Stage 1	Pick (R, W) = argmax FLOPs at 80th pctile of satisfying sub-networks	Ensures search space is feasible
TinyNAS Stage 2	One-shot NAS with weight sharing under analytic SRAM+Flash cost model	Finds best arch within feasible space
Patch inference SRAM	≈ per-layer SRAM / (P×P) + halo overhead (~10–14%)	Trades small compute for large SRAM savings
MCUNetV2 result	30 KB peak SRAM @ 144px, 0.5×, 2×2 patches	10.5× smaller than MobileNetV2

The Two-Stage TinyNAS Summary

Input: MCU spec (SRAM_budget, Flash_budget)

e.g., STM32F746: 320 KB SRAM, 1 MB Flash

↓ Stage 1

Search Space Optimization

Enumerate (R, W) pairs. For each, compute the FLOPs CDF of satisfying sub-networks. Pick (R*, W*) = argmax FLOPs at 80th percentile. No training needed.

↓ Stage 2

Resource-Constrained NAS

Train supernet at (R*, W*). Sample sub-networks, evaluate analytically for SRAM+Flash, rank by accuracy using shared weights. Return best feasible arch.

↓ MCUNetV2

Patch-Based Inference

Jointly search architecture + patch schedule. Early layers run patch-by-patch to slash peak SRAM by 4–10×. Network redistribution minimizes halo overhead.

Flash Budget Check (Code)

```python
def flash_budget(layers, bitwidth=8):
    """layers: list of (k, c_in, c_out) tuples. Returns Flash in bytes."""
    total_params = 0
    for k, c_in, c_out in layers:
        total_params += k * k * c_in * c_out
    # Round up to whole bytes so sub-byte widths (INT4, INT2) don't truncate to 0.
    return (total_params * bitwidth + 7) // 8

def peak_sram(layer_specs, bitwidth=8):
    """layer_specs: list of (H_in, W_in, C_in, H_out, W_out, C_out).
    Returns (peak_sram_bytes, bottleneck_layer_idx)."""
    peak = 0
    bottleneck = 0
    for i, (Hi, Wi, Ci, Ho, Wo, Co) in enumerate(layer_specs):
        # Input and output activations are live at the same time.
        sram = ((Hi * Wi * Ci + Ho * Wo * Co) * bitwidth + 7) // 8
        if sram > peak:  # strict '>' keeps the first layer on ties
            peak, bottleneck = sram, i
    return peak, bottleneck

# Example: small MCUNet-like network
layers = [
    (96, 96, 1,  48, 48, 16),   # conv0, s=2
    (48, 48, 16, 48, 48, 32),   # conv1, s=1 (peak, tied with conv2)
    (48, 48, 32, 24, 24, 64),   # conv2, s=2
    (24, 24, 64, 24, 24, 64),   # conv3, s=1
    (24, 24, 64, 12, 12, 128),  # conv4, s=2
]

peak, idx = peak_sram(layers)
print(f"Peak SRAM: {peak/1024:.1f} KB at layer {idx}")
# Peak SRAM: 108.0 KB at layer 1

# Flash for the Chapter 5 example: stem, 8 x 4 block layers, head
weights = [(3, 3, 16)] + [(3, 32, 32)] * 32 + [(1, 96, 10)]
print(f"Flash: {flash_budget(weights)/1024:.1f} KB")
# Flash: 289.4 KB

def patch_inference_sram(peak_layer_sram, num_patches, halo_overhead=0.12):
    """Rough estimate of peak SRAM after patch-based inference.
    Assumes SRAM divides evenly across patches, inflated by the halo fraction."""
    return peak_layer_sram * (1 + halo_overhead) / num_patches

# 2x2 patches (4 patches) on MobileNetV2 peak of 1,372 KB
mbv2_peak = 1372 * 1024  # bytes
patched = patch_inference_sram(mbv2_peak, num_patches=4)
print(f"After 2x2 patches: {patched/1024:.0f} KB")
# After 2x2 patches: 384 KB, still over 256 KB

patched_3x3 = patch_inference_sram(mbv2_peak, num_patches=9)
print(f"After 3x3 patches: {patched_3x3/1024:.0f} KB")
# After 3x3 patches: 171 KB, fits in 256 KB
```

What's Next: TinyEngine (Lecture 11)

MCUNet finds the right architecture via TinyNAS. TinyEngine (L11) is the companion compiler/runtime that executes that architecture efficiently on MCU hardware. TinyEngine implements:

Operator fusion — merge conv + BN + ReLU into one pass over the data, eliminating intermediate activation storage
In-place depth-wise convolution — reuse the input buffer for the output when the shapes match
Im2col-free convolution — avoid the memory-intensive tensor rearrangement step standard libraries use
Patch-based execution — schedules the patch-based inference discovered by MCUNetV2 search

Together, TinyNAS + TinyEngine form the full MCUNet co-design loop: the algorithm produces an MCU-fitting architecture, and the runtime executes it with minimal wasted memory or compute.

Related Gleams

TinyML L1 — Efficiency Metrics: activation memory, MACs, roofline — the vocabulary for everything in this lesson
TinyML L7 — NAS I: Searching Architectures: the NAS framework (search space, strategy, estimator) that TinyNAS extends
TinyML L8 — NAS II: Hardware-Aware & OFA: ProxylessNAS, Once-for-All — the mobile NAS techniques MCUNet builds on
TinyML L5 — Quantization I: INT8 quantization reduces Flash usage but SRAM activation analysis holds in INT8

"We want to enable every microcontroller to run deep learning. That means not just fitting the weights — but fitting the activations, in 256 KB, with no OS, no DRAM, and no compromise on accuracy." — Song Han, MIT 6.5940

MCUNetV2 achieves 30 KB peak SRAM at 144px resolution with 2×2 patch-based inference. MobileNetV2 at 224px needs 1,372 KB. What are the TWO main techniques that together explain this ~46× reduction?

INT4 quantization (4× SRAM reduction) and model pruning (11.5× reduction) applied sequentially.
TinyEngine operator fusion (2× reduction) and knowledge distillation from a teacher network (23× reduction).
Reducing input resolution from 224 to 144 (≈ 2.4× reduction in SRAM) and pruning 80% of channels.
TinyNAS architecture co-design: lower resolution (144 vs 224, ≈ 2.4×) + lower width (0.5×, ≈ 2×) shrinks activation tensors by roughly 5× vs full MobileNetV2, PLUS patch-based inference (2×2 = 4 patches) divides peak SRAM by ~4. Combined: ~19× from these scaling factors, with the rest coming from the searched architecture and MCUNetV2's network redistribution, reaching ~46×.