You have 256 KB of RAM and 1 MB of storage. A single uncompressed 224×224 RGB image is about 147 KB. A MobileNetV2 activation peak is 1,372 KB, more than five times your entire memory. Every technique from pruning to quantization to NAS was designed for phones and servers, not for this brutal constraint. MCUNet is a co-design system that finally cracks it: TinyNAS automatically finds the right search space, and patch-based inference slashes peak memory by up to 8× by running the heavy early layers one patch at a time.
Imagine you are asked to run image classification on an Arduino Nano 33 BLE Sense. The chip has a Cortex-M4 CPU running at 64 MHz, 256 KB of SRAM, and 1 MB of Flash. There is no operating system, no DRAM, no GPU. The total usable memory is smaller than a single uncompressed photograph.
Now consider what a "small" neural network actually costs. MobileNetV2, the go-to efficient model for mobile phones, has 3.4 million parameters. In INT8 that is 3.4 MB — already 3.4× your Flash budget. But parameters aren't the critical constraint. The peak activation memory — the SRAM needed while running the forward pass — reaches 1,372 KB for MobileNetV2. That is 5.4× your entire SRAM.
The gap isn't small. Cloud AI servers have 32 GB of DRAM for activations. Smartphones have 4 GB. A microcontroller has 320 KB. That is a factor of roughly 10,000 between phone and MCU, and over 100,000 between server and MCU. Every neural architecture ever designed assumed you had at least megabytes of working memory. MCUs shatter that assumption.
Memory available for neural network activations across hardware tiers. Note the logarithmic scale.
The figure shows the memory gap on a log scale. The tiers are not evenly spaced: server to phone is only about 8× (32 GB vs 4 GB), while phone to MCU is roughly 13,000× (4 GB vs 320 KB), about four orders of magnitude. The MCU sits in a different universe from the hardware that neural network research has focused on.
This chapter sets up the central problem. The next chapters analyze why models fail on MCUs (it's mostly the activation memory, not the weights), and then build MCUNet's solution from the ground up.
A microcontroller has two fundamentally different types of memory, and they impose two separate constraints on neural network deployment. Confusing them is the most common mistake when reasoning about MCU limitations.
Flash (also called ROM or program memory) is read-only at runtime. It is non-volatile — it persists when power is off. Flash holds your compiled program, your constant data, and crucially: the model weights. For a typical MCU, Flash is 1–2 MB. Flash is the model-size constraint. An INT8 model with W weights needs W bytes of Flash. You check this once per model deployment and it never changes.
SRAM (Static RAM) is read-write working memory. It holds your program's stack, heap, temporary variables, and crucially: the layer activations during inference. SRAM is the peak activation memory constraint. Unlike Flash, SRAM usage is dynamic — it changes layer by layer as the forward pass executes. The question isn't total activation size across all layers; it's the maximum SRAM needed at any single point in the forward pass.
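Let's formalize. For a convolutional layer with input feature map H×W×C_in and output feature map H'×W'×C_out, and assuming INT8 (1 byte per value), the layer needs H·W·C_in + H'·W'·C_out bytes of SRAM. The network's peak SRAM is the maximum of that sum over all layers.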
The weights don't contribute to SRAM because they can be fetched from Flash row-by-row during computation. Only the activation tensors (input and output of the current layer) must live in SRAM simultaneously.
| Memory Type | Read/Write? | Holds | Constraint on NN | Typical Size (MCU) |
|---|---|---|---|---|
| Flash | Read-only at runtime | Weights, program code | Total model size | 1–2 MB |
| SRAM | Read + Write | Activations, buffers, stack | Peak activation memory | 256–512 KB |
Let's work through a concrete example from scratch. Consider a small convolutional network with 5 layers processing a 96×96 grayscale image. We'll compute the SRAM footprint for every layer and find the bottleneck.
The network architecture:
| Layer | Input (H×W×C) | Output (H×W×C) | Stride |
|---|---|---|---|
| Conv0 (3×3, s=2) | 96×96×1 | 48×48×16 | 2 |
| Conv1 (3×3, s=1) | 48×48×16 | 48×48×32 | 1 |
| Conv2 (3×3, s=2) | 48×48×32 | 24×24×64 | 2 |
| Conv3 (3×3, s=1) | 24×24×64 | 24×24×64 | 1 |
| Conv4 (3×3, s=2) | 24×24×64 | 12×12×128 | 2 |
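Now compute SRAM for each layer (INT8, bytes = H×W×C for each tensor):
| Layer | Input bytes | Output bytes | SRAM (input + output) |
|---|---|---|---|
| Conv0 | 9,216 | 36,864 | 46,080 = 45 KB |
| Conv1 | 36,864 | 73,728 | 110,592 = 108 KB |
| Conv2 | 73,728 | 36,864 | 110,592 = 108 KB |
| Conv3 | 36,864 | 36,864 | 73,728 = 72 KB |
| Conv4 | 36,864 | 18,432 | 55,296 = 54 KB |
The peak is Conv1 and Conv2, tied at 108 KB. With 256 KB of SRAM on an Arduino Nano, this tiny 5-layer net fits, but only because we used a 96×96 input instead of the standard 224×224. Scale it up to 224×224 and every number increases by (224/96)² ≈ 5.4×: Conv1 alone now needs 588 KB, more than double the MCU's total SRAM.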
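The arithmetic in this example can be reproduced with a short sketch. The helper names (`example_net`, `layer_sram_bytes`, `sram_profile_kb`) are illustrative and not part of MCUNet; the layer list simply re-encodes the table above, assuming INT8 activations and an input size divisible by 8.

```python
def example_net(res):
    """Return (name, in_shape, out_shape) for the 5-layer example at res x res.

    Shapes are (H, W, C). Assumes res is divisible by 8 so the three
    stride-2 layers halve the spatial size exactly.
    """
    a, b, c = res // 2, res // 4, res // 8
    return [
        ("Conv0", (res, res, 1), (a, a, 16)),
        ("Conv1", (a, a, 16), (a, a, 32)),
        ("Conv2", (a, a, 32), (b, b, 64)),
        ("Conv3", (b, b, 64), (b, b, 64)),
        ("Conv4", (b, b, 64), (c, c, 128)),
    ]


def layer_sram_bytes(in_shape, out_shape):
    """INT8 activations: input tensor plus output tensor, one byte per value."""
    hi, wi, ci = in_shape
    ho, wo, co = out_shape
    return hi * wi * ci + ho * wo * co


def sram_profile_kb(res):
    """Per-layer SRAM in KB (1 KB = 1024 bytes) for the example network."""
    return {name: layer_sram_bytes(i, o) / 1024 for name, i, o in example_net(res)}


for res in (96, 224):
    profile = sram_profile_kb(res)
    peak_layer = max(profile, key=profile.get)  # first layer that reaches the peak
    print(res, {k: round(v) for k, v in profile.items()}, "peak:", peak_layer)
```

At 96×96 the profile is 45, 108, 108, 72, 54 KB; at 224×224 Conv1 and Conv2 both reach 588 KB, which is the (224/96)² scaling applied to 108 KB.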
Per-layer SRAM = input activation + output activation. Drag the resolution and channel-width sliders to see how peak SRAM shifts. The red bar is the peak (bottleneck) layer.
Now that we can compute peak SRAM analytically, let's understand concretely why every existing efficient model fails on MCUs — and why the failure is fundamental, not fixable by minor tweaks.
Take MobileNetV2. It was designed for smartphones and achieves 72% ImageNet top-1 with only 3.4M parameters (3.4 MB in INT8). On a Pixel phone with 4 GB DRAM, this is extremely efficient. Let's compute its peak SRAM. The early stage of MobileNetV2:
| Block | Input | Output | SRAM (INT8) |
|---|---|---|---|
| Conv-stem (s=2) | 224×224×3 | 112×112×32 | 150,528 + 401,408 = 539 KB |
| InvRes-1 (t=1, s=1) | 112×112×32 | 112×112×16 | 401,408 + 200,704 = 588 KB |
| InvRes-2 expand (t=6) | 112×112×16 | 112×112×96 | 200,704 + 1,204,224 = 1,372 KB |
InvRes-2's 1×1 expansion layer alone requires 1,372 KB of SRAM, 5.4× the entire 256 KB SRAM of an Arduino. This is the inverted residual design at its worst: MobileNetV2 expands channels by 6× within each block to increase representational capacity, but that expansion creates enormous intermediate activations.
What about quantizing more aggressively? Quantizing the weights further only helps model size (Flash), not SRAM. Quantizing the activations does shrink SRAM, but only linearly with bit width, because the tensor shapes don't change: a 112×112×96 tensor is 1,204,224 values regardless of bitwidth. At INT8 that one tensor is about 1.15 MB; at INT4 it is 588 KB; even INT2 leaves about 294 KB, still above 256 KB, and INT2 quantization of activations causes catastrophic accuracy loss.
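A quick check of how the size of that one expanded tensor changes with activation bit width. This is a sketch that assumes values are densely bit-packed with no padding; `tensor_kb` is an illustrative helper name.

```python
def tensor_kb(h, w, c, bits):
    """Size in KB (1024 bytes) of an h x w x c activation tensor at `bits` bits per value."""
    return h * w * c * bits / 8 / 1024


for bits in (8, 4, 2):
    print(f"INT{bits}: {tensor_kb(112, 112, 96, bits):.0f} KB")
# INT8: 1176 KB
# INT4: 588 KB
# INT2: 294 KB
```

Halving the bit width halves the footprint, but the element count stays fixed, so even 2-bit activations leave this single tensor above a 256 KB budget.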
What about pruning? Structured pruning reduces channel counts, which helps. But to get MobileNetV2 under 256 KB SRAM would require pruning ≥80% of channels in the early expansion layers, destroying accuracy entirely. Pruning alone cannot bridge a 5.4× gap while preserving useful representations.
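A rough way to see where the 80% figure comes from, under the simplifying assumption that every channel count in the bottleneck block is scaled by the same keep fraction $p$, so that its activation memory scales linearly with $p$:

$$
p \cdot 1372\ \text{KB} \le 256\ \text{KB}
\quad\Longrightarrow\quad
p \le \frac{256}{1372} \approx 0.187
$$

Under that assumption only about 19% of the channels can stay, so roughly 81% would have to be pruned.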
The conclusion: we need an entirely different architecture that is designed with MCU SRAM constraints as a first-class objective, not a post-hoc compression target. This is what TinyNAS provides.
Neural Architecture Search (NAS) automates the hunt for efficient architectures. You define a search space of possible architectures, a search strategy to explore it, and a performance estimator to rank candidates without full training. But here's the key insight the MCUNet paper makes: the search space itself must be designed for MCU constraints, not just the search within it.
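The standard NAS approach for mobile models (ProxylessNAS, MnasNet) defines a search space around MobileNetV2-like blocks at (resolution R=224, width multiplier W=1.0). The smallest architecture reachable within this space still needs far more than 256 KB of activation memory, because a 224×224 input fixes the size of the early feature maps no matter which blocks follow. You could run NAS forever inside this space and never find anything that fits a 256 KB MCU: the search space itself is wrong.
TinyNAS introduces a two-stage approach. Stage 1 optimizes the search space itself by choosing an input resolution R and width multiplier W for the target MCU. Stage 2 runs the architecture search inside that space.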
Why does Stage 1 matter so much? Because the peak SRAM and total Flash of a network scale predictably with R and W. Given a target MCU (say, STM32F746: 320 KB SRAM, 1 MB Flash), Stage 1 finds the (R*, W*) combination that puts the search space in a "sweet spot" where most of the sub-networks in the space satisfy the memory budget and are large enough to be accurate.
The key metric is the FLOPs distribution of the satisfying sub-networks. Higher FLOPs (within budget) means more expressive models, which means better accuracy. Stage 1 picks the (R, W) pair that maximizes the 80th percentile FLOPs among sub-networks that satisfy the SRAM+Flash constraints.
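The selection rule described above can be sketched in a few lines. Everything in this sketch is an assumption made for illustration: the per-stage channel widths, the choice lists, and the toy cost model in `costs` are hypothetical and are not the cost model or search space from the MCUNet paper. Only the selection rule (score each (R, W) space by the 80th-percentile compute of the sampled sub-networks that fit both budgets, then take the best space) follows the text.

```python
import random

BASE_CHANNELS = [16, 24, 40, 80]   # hypothetical per-stage widths at W = 1.0
EXPANSIONS = [3, 4, 6]             # hypothetical expansion-ratio choices
KERNELS = [3, 5, 7]                # hypothetical kernel-size choices


def sample_subnet(rng):
    """One random sub-network: an (expansion, kernel) choice per stage."""
    return [(rng.choice(EXPANSIONS), rng.choice(KERNELS)) for _ in BASE_CHANNELS]


def costs(subnet, res, width):
    """Toy analytic model: (peak_sram_bytes, flash_bytes, macs) for INT8."""
    size = res // 2                      # spatial size after a stride-2 stem
    c_in = max(8, int(16 * width))
    peak = flash = macs = 0
    for (expand, k), base in zip(subnet, BASE_CHANNELS):
        c_out = max(8, int(base * width))
        c_mid = expand * c_in
        # Block input plus expanded tensor, at the block's input resolution.
        peak = max(peak, size * size * (c_in + c_mid))
        # 1x1 expand + kxk depthwise + 1x1 project weights.
        params = c_in * c_mid + k * k * c_mid + c_mid * c_out
        flash += params
        size //= 2                       # each stage downsamples by 2
        macs += params * size * size     # crude: every weight used once per output pixel
        c_in = c_out
    return peak, flash, macs


def score_space(res, width, sram_budget, flash_budget, n=200, seed=0):
    """80th-percentile MACs among sampled sub-networks that fit both budgets."""
    rng = random.Random(seed)
    fitting = []
    for _ in range(n):
        peak, flash, macs = costs(sample_subnet(rng), res, width)
        if peak <= sram_budget and flash <= flash_budget:
            fitting.append(macs)
    if not fitting:
        return 0                         # nothing in this space fits the MCU
    fitting.sort()
    return fitting[int(0.8 * (len(fitting) - 1))]


def pick_search_space(resolutions, widths, sram_budget, flash_budget):
    """Return the (R, W) pair with the highest score."""
    spaces = [(r, w) for r in resolutions for w in widths]
    return max(spaces, key=lambda rw: score_space(*rw, sram_budget, flash_budget))


best = pick_search_space([96, 128, 144, 176, 224], [0.3, 0.5, 0.75, 1.0],
                         sram_budget=320 * 1024, flash_budget=1024 * 1024)
print("selected (R, W):", best)
```

With this toy model the (224, 1.0) space scores zero under a 320 KB budget because no sampled sub-network fits. A fuller version would probably also require a minimum fraction of sub-networks to fit, so that a space is not selected on the strength of a few outliers.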
Worked example from the paper. For a 320 KB SRAM, 1 MB Flash budget, Stage 1 selects w0.5-r144 (or w0.4-r144, depending on exact Flash budget) as the optimal search space for this MCU. All the NAS in Stage 2 will happen within this sub-space.
Each cell shows whether a (resolution, width-multiplier) search space is feasible for the given SRAM budget. Green = 80%+ of sub-networks fit. The star marks the optimal cell (max FLOPs while feasible). Adjust the SRAM budget to see how the feasible window shifts.
With the search space optimized in Stage 1, TinyNAS Stage 2 runs the actual architecture search within that space. The goal: find the architecture that maximizes accuracy subject to explicit SRAM and Flash constraints.
Stage 2 uses one-shot NAS with weight sharing. Instead of training each candidate architecture from scratch (which would be computationally impossible), TinyNAS trains a single supernet that contains all candidate architectures as sub-networks sharing weights. The supernet covers choices in kernel size, expansion ratio, and the number of blocks per stage.
During training, random sub-networks are sampled from the supernet at each step. All sampled sub-networks share the same weights for their shared operations. This progressive fine-tuning of many overlapping sub-networks gives each sub-network approximately trained weights without requiring separate training.
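The sampling loop can be illustrated without a deep learning framework. In this toy sketch the choice lists, the stage count, and the key scheme in `shared_keys` are assumptions for illustration; a real supernet shares tensors in a more fine-grained way. The point is only that two sub-networks making the same choice at the same position update the same shared weights.

```python
import random

CHOICES = {
    "kernel": [3, 5, 7],       # hypothetical kernel-size options
    "expansion": [3, 4, 6],    # hypothetical expansion-ratio options
    "depth": [2, 3, 4],        # hypothetical number of blocks per stage
}
NUM_STAGES = 4


def sample_subnet(rng):
    """Draw one architecture: a depth per stage and a (kernel, expansion) per block."""
    arch = []
    for _ in range(NUM_STAGES):
        depth = rng.choice(CHOICES["depth"])
        arch.append([(rng.choice(CHOICES["kernel"]), rng.choice(CHOICES["expansion"]))
                     for _ in range(depth)])
    return arch


def shared_keys(arch):
    """Names of the supernet weight groups this sub-network touches."""
    return {(s, b, k, e)
            for s, stage in enumerate(arch)
            for b, (k, e) in enumerate(stage)}


rng = random.Random(0)
update_counts = {}
for step in range(1000):
    # One "training step": only the sampled sub-network's weight groups are updated.
    for key in shared_keys(sample_subnet(rng)):
        update_counts[key] = update_counts.get(key, 0) + 1  # stand-in for a gradient step

print(len(update_counts), "shared weight groups received updates")
```

Each weight group ends up trained by many different sub-networks, which is why every sub-network inherits approximately trained weights.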
After supernet training, evaluation proceeds as follows. For any candidate architecture c = (kernel choices, expansion ratios, block counts), the search first estimates its peak SRAM and Flash footprint directly from the layer shapes.
This analytic cost model means we can check SRAM and Flash constraints without running the network at all — just from the architecture specification. Candidates that violate constraints are immediately rejected. The remaining candidates are ranked by their accuracy on a held-out validation set, using the shared weights.
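A minimal sketch of that reject-then-rank step. The candidate names, their layer lists, and the validation accuracies in `fake_val_acc` are invented for illustration; in the real flow the accuracy would come from evaluating each sub-network with the supernet's shared weights.

```python
def peak_sram(layers):
    """layers: list of (H_in, W_in, C_in, H_out, W_out, C_out, k). INT8 bytes."""
    return max(hi * wi * ci + ho * wo * co for hi, wi, ci, ho, wo, co, _ in layers)


def flash(layers):
    """Total INT8 weight bytes, ignoring biases."""
    return sum(k * k * ci * co for _, _, ci, _, _, co, k in layers)


def search(candidates, sram_budget, flash_budget, estimate_accuracy):
    """Drop candidates that violate either budget, then rank the rest by accuracy."""
    feasible = [name for name, layers in candidates.items()
                if peak_sram(layers) <= sram_budget and flash(layers) <= flash_budget]
    return sorted(feasible, key=estimate_accuracy, reverse=True)


# Hypothetical two-layer candidates, for illustration only.
candidates = {
    "small": [(96, 96, 1, 48, 48, 16, 3), (48, 48, 16, 24, 24, 32, 3)],
    "wide": [(96, 96, 1, 48, 48, 32, 3), (48, 48, 32, 48, 48, 64, 3)],
    "large_input": [(224, 224, 1, 112, 112, 16, 3), (112, 112, 16, 112, 112, 32, 3)],
}
fake_val_acc = {"small": 0.61, "wide": 0.68, "large_input": 0.74}  # made-up numbers

ranked = search(candidates, 256 * 1024, 1024 * 1024, fake_val_acc.get)
print(ranked)  # ['wide', 'small']
```

`large_input` has the highest made-up accuracy but is rejected before accuracy is even consulted, because its peak SRAM of 588 KB exceeds the 256 KB budget.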
Flash budget check with real numbers. Suppose the search finds a network with these layers (kernel 3×3, INT8):
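As a stand-in for the layer list, the sketch below reuses the 5-layer example network from the SRAM walkthrough (all kernels 3×3, INT8) and checks it against a 1 MB Flash budget. Using that network here is an assumption, and biases and program code are ignored.

```python
# (name, kernel, C_in, C_out), taken from the 5-layer example table
LAYERS = [
    ("Conv0", 3, 1, 16),
    ("Conv1", 3, 16, 32),
    ("Conv2", 3, 32, 64),
    ("Conv3", 3, 64, 64),
    ("Conv4", 3, 64, 128),
]
FLASH_BUDGET = 1024 * 1024  # 1 MB


def flash_bytes(layers, bits=8):
    """Weight storage in bytes: k*k*C_in*C_out values per layer, rounded up to whole bytes."""
    params = sum(k * k * c_in * c_out for _, k, c_in, c_out in layers)
    return (params * bits + 7) // 8


total = flash_bytes(LAYERS)
print(f"{total} bytes = {total / 1024:.1f} KB, fits: {total <= FLASH_BUDGET}")
# 133776 bytes = 130.6 KB, fits: True
```

The weights take about 13% of the Flash budget, which matches the earlier observation that activations, not weights, are usually the tighter constraint.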
Even with TinyNAS finding a better architecture, there remains a structural problem that affects all CNNs on MCUs: the imbalanced memory distribution. The first few layers each need far more SRAM than any later layer, so they alone set the peak.
Why? Early CNN layers process large spatial feature maps (high H×W) with relatively few channels. Later layers have smaller spatial maps but more channels. Because SRAM scales as H×W×C, and H×W shrinks rapidly (each stride-2 conv halves both dimensions, cutting H×W to 1/4), the early layers dominate peak SRAM.
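The effect is easy to quantify if we assume the common pattern where each stride-2 stage doubles the channel count (an assumption; the text only says that later layers have more channels):

$$
\frac{H}{2}\cdot\frac{W}{2}\cdot 2C \;=\; \frac{H\,W\,C}{2}
$$

Each downsampling stage then halves the activation tensor, which matches the example network, where 48×48×32 = 73,728 values becomes 24×24×64 = 36,864 values.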
Estimated for MobileNetV2 processing a 224×224 input, compared against the budget of an STM32F746 (320 KB SRAM):
| Block | Approx SRAM (KB) | vs MCU budget |
|---|---|---|
| Block 0 (conv-stem) | 539 | 1.7× over budget |
| Block 1–2 (first inverted residuals) | ~1,372 peak | 4.3× over budget |
| Blocks 3–6 | 80–200 | within budget |
| Blocks 7–17 | 8–80 | well within budget |
The picture is stark: blocks 1–2 peak at more than 4× the MCU's budget, while blocks 7–17 each stay under 80 KB. The tail of the network is not the problem. The head is.
This imbalance appears across all standard architectures, not just MobileNetV2. Any CNN with a large input image will have this property because its first layers run at the highest spatial resolution, so any channel expansion there multiplies an already large H×W.
For MobileNetV2 with input 224×224: H₀ = 112 after stem, C_expand = 96 after 6× expansion → the expanded tensor alone is 112×112×96 = 1,204,224 bytes ≈ 1.15 MB. Post-hoc weight pruning or quantization cannot change this; the layer shapes themselves have to change.
SRAM usage across MobileNetV2 blocks. The red dashed line is the MCU budget. Early blocks spike far above it; later blocks are well under. Toggle between MobileNetV2 and MCUNet to see how TinyNAS flattens the distribution.
MCUNetV2 (Lin et al., NeurIPS 2021) introduces a simple but powerful idea to break the early-layer SRAM bottleneck: instead of processing the entire feature map at once, process it one patch at a time.
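In the standard inference flow for an early conv layer (call it L1) followed by another (L2), L1 runs over the whole feature map and writes its full output to SRAM before L2 starts, so L1's entire input and output must be resident together.
Now consider patch-based inference with a 2×2 patch grid (4 patches): L1 and L2 run on one spatial patch at a time, so only about a quarter of each early feature map is resident at once.
More patches → lower peak SRAM. With 9 patches (3×3 grid) and ignoring the overlap between patches: SRAM ≈ 1,372/9 ≈ 152 KB, under the 256 KB budget.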
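The two flows can be compared on a toy example in pure Python. This is a sketch under stated assumptions, not MCUNetV2's implementation: the "layers" are unpadded single-channel 3×3 window sums, the output is assumed to divide evenly into the patch grid, and the reported peak counts only L1's input tile and output tile (mirroring the per-layer formula), not the stored input image or the assembled output.

```python
def conv3x3_valid(x):
    """Toy 3x3 layer: each output is the sum of a 3x3 input window (no padding)."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(3) for dj in range(3))
             for j in range(w - 2)]
            for i in range(h - 2)]


def per_layer(x):
    """Standard flow: the whole intermediate map exists at once."""
    mid = conv3x3_valid(x)
    out = conv3x3_valid(mid)
    peak = len(x) * len(x[0]) + len(mid) * len(mid[0])   # L1 input + L1 output
    return out, peak


def per_patch(x, grid=2):
    """Patch flow: run both layers on one spatial tile at a time."""
    halo = 4  # two unpadded 3x3 layers each consume 2 rows and 2 columns
    out_h, out_w = len(x) - halo, len(x[0]) - halo
    th, tw = out_h // grid, out_w // grid   # assumes the output divides evenly
    out = [[0] * out_w for _ in range(out_h)]
    peak = 0
    for r0 in range(0, out_h, th):
        for c0 in range(0, out_w, tw):
            # Input tile = output tile region plus the overlap its neighbours also read.
            tile = [row[c0:c0 + tw + halo] for row in x[r0:r0 + th + halo]]
            mid = conv3x3_valid(tile)
            res = conv3x3_valid(mid)
            peak = max(peak, len(tile) * len(tile[0]) + len(mid) * len(mid[0]))
            for i, row in enumerate(res):
                out[r0 + i][c0:c0 + tw] = row
    return out, peak


image = [[(7 * i + 3 * j) % 11 for j in range(12)] for i in range(12)]
full, full_peak = per_layer(image)
patched, patch_peak = per_patch(image, grid=2)
print(patched == full, full_peak, patch_peak)  # True 244 100
```

The outputs are identical, and the working set drops from 244 to 100 values. The saving is less than the ideal 4× because on a 12×12 toy image the overlap is large relative to each tile; on realistic feature maps the overlap is a much smaller fraction.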
Patches must overlap, because a 3×3 convolution at a patch border needs pixels from the neighbouring patch; this overlap is the halo. For N stacked stride-1 3×3 convolutions the halo is about N pixels on each side. For 4 stacked layers with a 2×2 patch grid on a 224×224 input: halo width = 4 pixels per side. Each patch is 112×112 = 12,544 pixels; with the halo on all four sides it grows to at most (112+8)×(112+8) = 120×120 = 14,400 pixels, or about 15% overhead. Empirically, MobileNetV2 with 2×2 patches incurs only ~10% extra MACs, a small price for 4.9× lower SRAM.
MCUNetV2 reduces the halo overhead with receptive field redistribution: instead of running patch-based inference on a standard network, it redesigns the early-stage architecture to have smaller receptive fields, reducing the halo. The NAS search jointly optimizes the architecture and the patch inference schedule, finding networks that naturally have low-halo early stages and benefit most from patch-based inference.
Measured results on STM32F746 (320 KB SRAM), INT8:
| Model | Inference Mode | Peak SRAM | Reduction |
|---|---|---|---|
| MobileNetV2 | Per-layer | 315 KB | — |
| MobileNetV2 | Per-patch 2×2 | 64 KB | 4.9× |
| MCUNet | Per-layer | 113 KB | 2.8× vs MbV2 |
| MCUNetV2 | Per-patch 2×2 | 30 KB | 10.5× vs MbV2 |
This showcase integrates everything from Chapters 1–7 into one interactive simulation. Configure a network and memory budget, watch the per-layer SRAM profile update live, and see how patch-based inference slashes the peak.
Adjust the input resolution, width multiplier, and patch grid. Watch the SRAM profile change. The red dashed line is the MCU budget. The highlighted bar is the current bottleneck layer.
| Concept | Formula / Rule | Binding constraint? |
|---|---|---|
| Flash (model size) | ∑_L (k²·C_in·C_out) bytes INT8 | Yes — total weights must fit |
| Peak SRAM | max_L (H·W·C_in + H'·W'·C_out) | Yes — the critical binding constraint |
| SRAM scales as | R² (doubling res = 4× SRAM) | Why resolution must be small on MCUs |
| Inverted residual expansion | 6× channels → 6× larger activation tensor | Source of MobileNetV2's MCU failure |
| TinyNAS Stage 1 | Pick (R*, W*) = argmax FLOPs at 80th pctile of satisfying sub-networks | Ensures search space is feasible |
| TinyNAS Stage 2 | One-shot NAS with weight sharing under analytic SRAM+Flash cost model | Finds best arch within feasible space |
| Patch inference SRAM | ≈ per-layer SRAM / (P×P) + halo overhead (~10–15%) | Trades small compute for large SRAM savings |
| MCUNetV2 result | 30 KB peak SRAM @ 144px, 0.5×, 2×2 patches | 10.5× smaller than MobileNetV2 |
```python
def flash_budget(layers, bitwidth=8):
    """layers: list of (k, Cin, Cout) tuples. Returns Flash in bytes."""
    total_params = 0
    for k, c_in, c_out in layers:
        total_params += k * k * c_in * c_out
    # Multiply before dividing so sub-byte widths (e.g. 4-bit) don't round to zero.
    return (total_params * bitwidth + 7) // 8


def peak_sram(layer_specs, bitwidth=8):
    """layer_specs: list of (H_in, W_in, C_in, H_out, W_out, C_out).
    Returns (peak_sram_bytes, bottleneck_layer_idx)."""
    peak = 0
    bottleneck = 0
    for i, (Hi, Wi, Ci, Ho, Wo, Co) in enumerate(layer_specs):
        # Input plus output activations, rounded up to whole bytes.
        sram = ((Hi * Wi * Ci + Ho * Wo * Co) * bitwidth + 7) // 8
        if sram > peak:
            peak, bottleneck = sram, i
    return peak, bottleneck


# Example: the small 5-layer network from the SRAM walkthrough
layers = [
    (96, 96, 1, 48, 48, 16),    # conv0, s=2
    (48, 48, 16, 48, 48, 32),   # conv1, s=1  <- peak (tied with conv2)
    (48, 48, 32, 24, 24, 64),   # conv2, s=2
    (24, 24, 64, 24, 24, 64),   # conv3, s=1
    (24, 24, 64, 12, 12, 128),  # conv4, s=2
]
peak, idx = peak_sram(layers)
print(f"Peak SRAM: {peak/1024:.1f} KB at layer {idx}")
# Peak SRAM: 108.0 KB at layer 1


def patch_inference_sram(peak_layer_sram, num_patches, halo_overhead=0.12):
    """Estimate peak SRAM after patch-based inference."""
    return peak_layer_sram * (1 + halo_overhead) / num_patches


# 2x2 patches (4 patches) on MobileNetV2 peak of 1,372 KB
mbv2_peak = 1372 * 1024  # bytes
patched = patch_inference_sram(mbv2_peak, num_patches=4)
print(f"After 2x2 patches: {patched/1024:.0f} KB")
# After 2x2 patches: 384 KB, still over 256 KB, so try 3x3 patches
patched_3x3 = patch_inference_sram(mbv2_peak, num_patches=9)
print(f"After 3x3 patches: {patched_3x3/1024:.0f} KB")
# After 3x3 patches: 171 KB, fits in 256 KB
```
MCUNet finds the right architecture via TinyNAS. TinyEngine (L11) is the companion compiler/runtime that executes that architecture efficiently on MCU hardware.
Together, TinyNAS + TinyEngine form the full MCUNet co-design loop: the algorithm produces an MCU-fitting architecture, and the runtime executes it with minimal wasted memory or compute.
"We want to enable every microcontroller to run deep learning. That means not just fitting the weights — but fitting the activations, in 256 KB, with no OS, no DRAM, and no compromise on accuracy." — Song Han, MIT 6.5940