TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 10

MCUNet: Deep Learning on Microcontrollers

You have 256 KB of RAM and 1 MB of storage. A single uncompressed 224×224 RGB image is 150 KB. A MobileNetV2 activation peak is 1,372 KB — five times your entire memory. Every technique from pruning to quantization to NAS was designed for phones and servers, not for this brutal constraint. MCUNet is a co-design system that finally cracks it: TinyNAS automatically finds the right search space, and patch-based inference slashes peak memory 8× by running the heavy early layers one patch at a time.

Prerequisites: TinyML L1 (Efficiency Metrics), TinyML L7–L8 (NAS) — activation memory, MACs, search spaces helpful.
10 Chapters · 5 Live Canvases · Derived From First Principles

Chapter 0: The Memory Wall

Imagine you are asked to run image classification on an Arduino Nano 33 BLE Sense. The chip has a Cortex-M4 CPU running at 64 MHz, 256 KB of SRAM, and 1 MB of Flash. There is no operating system, no DRAM, no GPU. The total usable memory is smaller than a single uncompressed photograph.

Now consider what a "small" neural network actually costs. MobileNetV2, the go-to efficient model for mobile phones, has 3.4 million parameters. In INT8 that is 3.4 MB — already 3.4× your Flash budget. But parameters aren't the critical constraint. The peak activation memory — the SRAM needed while running the forward pass — reaches 1,372 KB for MobileNetV2. That is 5.4× your entire SRAM.

The gap isn't small. Cloud AI servers have 32 GB of DRAM for activations. Smartphones have 4 GB. A microcontroller has 320 KB. That is a factor of roughly 10,000 between phone and MCU, and over 100,000 between server and MCU. Every neural architecture ever designed assumed you had at least megabytes of working memory. MCUs shatter that assumption.

The key problem: Existing efficient models (MobileNet, NASNet) were designed for the phone-to-GPU regime. Their smallest subnetwork still doesn't fit in MCU SRAM. We can't just shrink them — we need to rethink the design space from scratch under a fundamentally different memory constraint.
Hardware Memory Ladder — The 10,000× Gap

Memory available for neural network activations across hardware tiers. Note the logarithmic scale — each step is orders of magnitude.

The canvas above shows the memory gap on a log scale. The steps are not equal: server to phone is roughly 10× (32 GB vs 4 GB), while phone to MCU is roughly 10,000× (4 GB vs 320 KB). The MCU sits in a different universe from the hardware that neural network research has focused on.

This chapter sets up the central problem. The next chapters analyze why models fail on MCUs (it's mostly the activation memory, not the weights), and then build MCUNet's solution from the ground up.

A MobileNetV2 model in INT8 has 3.4 MB of parameters and 1,372 KB peak activation memory. An MCU has 256 KB SRAM and 1 MB Flash. Which constraint is harder to satisfy for MobileNetV2?

Chapter 1: SRAM vs Flash — Two Constraints, Two Roles

A microcontroller has two fundamentally different types of memory, and they impose two separate constraints on neural network deployment. Confusing them is the most common mistake when reasoning about MCU limitations.

Flash (also called ROM or program memory) is read-only at runtime. It is non-volatile — it persists when power is off. Flash holds your compiled program, your constant data, and crucially: the model weights. For a typical MCU, Flash is 1–2 MB. Flash is the model-size constraint. An INT8 model with W weights needs W bytes of Flash. You check this once per model deployment and it never changes.

SRAM (Static RAM) is read-write working memory. It holds your program's stack, heap, temporary variables, and crucially: the layer activations during inference. SRAM is the peak activation memory constraint. Unlike Flash, SRAM usage is dynamic — it changes layer by layer as the forward pass executes. The question isn't total activation size across all layers; it's the maximum SRAM needed at any single point in the forward pass.

Why "peak" matters: You don't need to store all activations simultaneously. While computing layer L, you need only: (1) the input activations to layer L, (2) the output activations from layer L, plus (3) a small buffer for the kernel weights (which can be streamed from Flash a few rows at a time). Once you move to layer L+1, the input to layer L can be freed. So peak SRAM = max over all layers of (input_activation_size + output_activation_size).

Let's formalize. For a convolutional layer with input feature map H×W×Cin and output feature map H'×W'×Cout, and assuming INT8 (1 byte per value):

SRAM_L = H × W × C_in + H' × W' × C_out   (bytes, INT8)
Peak SRAM = max_{L ∈ {0..N}} SRAM_L
Flash needed = ∑_L (kernel_h × kernel_w × C_in × C_out)   (total weight bytes)

The weights don't contribute to SRAM because they can be fetched from Flash row-by-row during computation. Only the activation tensors (input and output of the current layer) must live in SRAM simultaneously.
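To make the difference between total and peak activation memory concrete, here is a minimal sketch that applies the per-layer formula to a short chain of layers. The three-layer shapes are made up for illustration and do not come from any real model.

```python
def tensor_bytes(h, w, c, bytes_per_value=1):
    """Size of an H x W x C activation tensor (INT8 = 1 byte per value)."""
    return h * w * c * bytes_per_value

# Hypothetical 3-layer chain: activation shapes (H, W, C) at each boundary.
# shapes[i] is the input to layer i, shapes[i + 1] is its output.
shapes = [(32, 32, 8), (32, 32, 16), (16, 16, 32), (8, 8, 64)]
sizes = [tensor_bytes(*s) for s in shapes]

# While layer i runs, only its input and output must be resident in SRAM.
per_layer = [sizes[i] + sizes[i + 1] for i in range(len(sizes) - 1)]

total_if_kept = sum(sizes)  # naive: keep every activation alive at once
peak = max(per_layer)       # what the forward pass actually needs

for i, live in enumerate(per_layer):
    print(f"layer {i}: {live} bytes live")
print(f"sum of all activations: {total_if_kept} bytes")
print(f"peak SRAM:              {peak} bytes")
```

The sum overstates the requirement because each input buffer can be released as soon as the next layer has consumed it.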

| Memory Type | Read/Write? | Holds | Constraint on NN | Typical Size (MCU) |
| --- | --- | --- | --- | --- |
| Flash | Read-only at runtime | Weights, program code | Total model size | 1–2 MB |
| SRAM | Read + Write | Activations, buffers, stack | Peak activation memory | 256–512 KB |

Misconception to kill early: "If I quantize the model from FP32 to INT8, I get 4× smaller weights, so that should fix the memory problem." INT8 quantization does shrink every stored value 4×, weights in Flash and activations in SRAM alike, but it does not change any tensor shape, and the activation figures in this lecture already assume INT8. MobileNetV2's 1,372 KB peak is the INT8 number, still 5.4× over a 256 KB budget. On an MCU, it's SRAM (activations) that kills you, not Flash (weights).
A conv layer processes a 64×64×32 input feature map and produces a 64×64×64 output. What is its SRAM footprint (in KB) assuming INT8 quantization?

Chapter 2: Peak SRAM Analysis — Layer by Layer

Let's work through a concrete example from scratch. Consider a small convolutional network with 5 layers processing a 96×96 grayscale image. We'll compute the SRAM footprint for every layer and find the bottleneck.

The network architecture:

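| Layer | Input (H×W×C) | Output (H×W×C) | Stride |
| --- | --- | --- | --- |
| Conv0 (3×3, s=2) | 96×96×1 | 48×48×16 | 2 |
| Conv1 (3×3, s=1) | 48×48×16 | 48×48×32 | 1 |
| Conv2 (3×3, s=2) | 48×48×32 | 24×24×64 | 2 |
| Conv3 (3×3, s=1) | 24×24×64 | 24×24×64 | 1 |
| Conv4 (3×3, s=2) | 24×24×64 | 12×12×128 | 2 |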

Now compute SRAM for each layer (INT8, bytes = H×W×C for each tensor):

Conv0: 96×96×1 + 48×48×16 = 9,216 + 36,864 = 46,080 bytes (45 KB)
Conv1: 48×48×16 + 48×48×32 = 36,864 + 73,728 = 110,592 bytes (108 KB)
Conv2: 48×48×32 + 24×24×64 = 73,728 + 36,864 = 110,592 bytes (108 KB)
Conv3: 24×24×64 + 24×24×64 = 36,864 + 36,864 = 73,728 bytes (72 KB)
Conv4: 24×24×64 + 12×12×128 = 36,864 + 18,432 = 55,296 bytes (54 KB)

The peak is Conv1 and Conv2, tied at 108 KB. With 256 KB of SRAM on an Arduino Nano, this tiny 5-layer net fits, using about 42% of SRAM for activations, but only because we used a 96×96 input instead of the standard 224×224. Scale it up to 224×224 and every number increases by (224/96)² ≈ 5.4×: Conv1 alone now needs 112×112×16 + 112×112×32 = 602,112 bytes (588 KB), more than double the MCU's total SRAM.

The scaling intuition: SRAM grows as input resolution squared. Doubling resolution (96→192) quadruples peak SRAM. This is why standard mobile inputs of 224×224 are impossible on MCUs, and why TinyNAS explicitly searches over input resolution as a first-class design variable.
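The quadratic scaling can be checked with a small sketch that rebuilds the five-layer network at any input resolution. It assumes each 3×3 convolution uses "same" padding, so the output side is ceil(input / stride); this matches the shapes in the table above but is otherwise an assumption.

```python
def conv_out(size, stride):
    """Output side length of a conv with 'same' padding: ceil(size / stride)."""
    return (size + stride - 1) // stride

# (stride, out_channels) for the five layers of the example network.
LAYERS = [(2, 16), (1, 32), (2, 64), (1, 64), (2, 128)]

def peak_sram_bytes(resolution, in_channels=1):
    """Peak of (input + output) activation bytes across layers, INT8."""
    size, ch, peak = resolution, in_channels, 0
    for stride, out_ch in LAYERS:
        out_size = conv_out(size, stride)
        live = size * size * ch + out_size * out_size * out_ch
        peak = max(peak, live)
        size, ch = out_size, out_ch
    return peak

for r in (96, 192, 224):
    print(f"{r}x{r}: peak = {peak_sram_bytes(r) / 1024:.0f} KB")
```

Doubling the resolution from 96 to 192 gives exactly four times the peak, and 224×224 lands at 588 KB, well past a 256 KB budget.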
Layer-by-Layer SRAM Profile

Per-layer SRAM = input activation + output activation. Drag the resolution and channel-width sliders to see how peak SRAM shifts. The red bar is the peak (bottleneck) layer. Slider defaults: resolution 96, width × 1.0.
In the 5-layer network above, which layer is the bottleneck and why?

Chapter 3: Why Standard Models Fail

Now that we can compute peak SRAM analytically, let's understand concretely why every existing efficient model fails on MCUs — and why the failure is fundamental, not fixable by minor tweaks.

Take MobileNetV2. It was designed for smartphones and achieves 72% ImageNet top-1 with only 3.4M parameters (3.4 MB in INT8). On a Pixel phone with 4 GB DRAM, this is extremely efficient. Let's compute its peak SRAM. The early stage of MobileNetV2:

| Block | Input | Output | SRAM (INT8) |
| --- | --- | --- | --- |
| Conv-stem (s=2) | 224×224×3 | 112×112×32 | 150,528 + 401,408 = 551,936 B (539 KB) |
| InvRes-1 (t=1, s=1) | 112×112×32 | 112×112×16 | 401,408 + 200,704 = 602,112 B (588 KB) |
| InvRes-2 expand (t=6) | 112×112×16 | 112×112×96 | 200,704 + 1,204,224 = 1,404,928 B (1,372 KB) |

InvRes-2's 1×1 expansion layer alone requires 1,372 KB of SRAM, 5.4× the entire 256 KB SRAM of an Arduino. This is the inverted residual bottleneck design at its worst: MobileNetV2 expands channels by 6× within each block to increase representational capacity, but that expansion creates enormous intermediate activations.
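The early-stage figures can be reproduced directly from the tensor shapes. The sketch below uses 1 KB = 1024 bytes and INT8 activations; the block names are just labels for this example.

```python
KB = 1024

# (name, input shape, output shape) as (H, W, C), taken from the table above.
early_stage = [
    ("conv-stem",       (224, 224, 3),  (112, 112, 32)),
    ("invres-1",        (112, 112, 32), (112, 112, 16)),
    ("invres-2 expand", (112, 112, 16), (112, 112, 96)),
]

def nbytes(shape):
    h, w, c = shape
    return h * w * c  # INT8: one byte per value

results = {}
for name, inp, out in early_stage:
    live = nbytes(inp) + nbytes(out)
    results[name] = live
    print(f"{name:16s} {live:>9,d} B = {live / KB:7.1f} KB "
          f"({live / (256 * KB):.1f}x of 256 KB)")
```

The last row comes out at exactly 1,372 KB, which is the peak quoted throughout this lecture.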

The irony of MobileNet on MCUs: The inverted residual design that makes MobileNetV2 efficient on phones (low param count, high accuracy) makes it catastrophically expensive on MCUs. The 6× channel expansion designed to pack more computation into fewer parameters causes exactly the peak-SRAM explosion that MCUs cannot handle. The phone-friendly design choice is the MCU-hostile choice.

What about quantizing more aggressively? Lower bitwidths shrink each stored value, but the activation tensor shapes don't change: a 112×112×96 tensor is 1,204,224 values regardless of bitwidth. At INT8 that one tensor is about 1.15 MB (1,176 KB); at INT4 it is 588 KB; even at INT2 it is still 294 KB, above the 256 KB budget, and INT2 quantization of activations causes catastrophic accuracy loss.
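The bitwidth arithmetic is easy to verify. This sketch counts only the expanded 112×112×96 output tensor (not its input) and ignores any packing overhead that a real sub-byte kernel might add.

```python
values = 112 * 112 * 96  # elements in the expanded activation tensor

def tensor_kb(num_values, bits):
    """Storage in KB (1 KB = 1024 bytes), rounded up to whole bytes."""
    total_bits = num_values * bits
    return ((total_bits + 7) // 8) / 1024

sizes = {bits: tensor_kb(values, bits) for bits in (32, 8, 4, 2)}
for bits, kb in sizes.items():
    verdict = "fits" if kb <= 256 else "does not fit"
    print(f"{bits:>2}-bit: {kb:7.0f} KB -> {verdict} in 256 KB")
```

None of the four bitwidths brings this single tensor under 256 KB, so the shape itself has to change.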

What about pruning? Structured pruning reduces channel counts, which helps. But to get MobileNetV2 under 256 KB SRAM would require pruning ≥80% of channels in the early expansion layers, which would destroy accuracy. Pruning alone cannot bridge a 5.4× gap while preserving useful representations.

The conclusion: we need an entirely different architecture that is designed with MCU SRAM constraints as a first-class objective, not a post-hoc compression target. This is what TinyNAS provides.

MobileNetV2's early inverted residual blocks use a 6× channel expansion to build representational capacity. Why does this design, which is efficient for phones, become a problem specifically on MCUs?

Chapter 4: TinyNAS — Designing the Search Space

Neural Architecture Search (NAS) automates the hunt for efficient architectures. You define a search space of possible architectures, a search strategy to explore it, and a performance estimator to rank candidates without full training. But here's the key insight the MCUNet paper makes: the search space itself must be designed for MCU constraints, not just the search within it.

The standard NAS approach for mobile models (ProxylessNAS, MnasNet) defines a search space around MobileNetV2-like blocks at (resolution R=224, width multiplier W=1.0). Architectures in this space are sized for phones with gigabytes of DRAM: their peak activation memory is on the order of a megabyte (MobileNetV2 itself peaks at 1,372 KB). You could run NAS forever inside this space and never find anything that fits a 256 KB MCU. The search space itself is wrong.

TinyNAS introduces a two-stage approach:

Stage 1: Optimize the Search Space
Pick the right (resolution R*, width multiplier W*) pair for the memory budget. This selects which sub-space to search within.
Stage 2: Search Within the Space
Run resource-constrained NAS (one-shot weight sharing) inside the optimized sub-space, under explicit SRAM+Flash limits.
Output: MCU-Deployable Architecture
An architecture that fits the SRAM+Flash budget and achieves maximum accuracy within those constraints.

Why does Stage 1 matter so much? Because the peak SRAM and total Flash of a network scale predictably with R and W. Given a target MCU (say, STM32F746: 320 KB SRAM, 1 MB Flash), Stage 1 finds the (R*, W*) combination that puts the search space in a "sweet spot" where most of the sub-networks in the space satisfy the memory budget and are large enough to be accurate.

The key metric is the FLOPs distribution of the satisfying sub-networks. Higher FLOPs (within budget) means more expressive models, which means better accuracy. Stage 1 picks the (R, W) pair that maximizes the 80th percentile FLOPs among sub-networks that satisfy the SRAM+Flash constraints.

The FLOPs proxy: We can't train all sub-networks to evaluate accuracy — that would defeat the purpose of search. Instead, TinyNAS uses a proxy: higher FLOPs within the memory budget correlates strongly with higher accuracy. Maximizing FLOPs at the 80th percentile of the satisfying distribution gives a good search space without any training.
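The selection rule can be sketched in a few lines. Everything in the data below is hypothetical: the three spaces, their names, and the (MMACs, SRAM KB, Flash KB) triples are invented to exercise the logic, and the nearest-rank percentile is one reasonable choice rather than the paper's exact procedure.

```python
def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

def pick_search_space(spaces, sram_budget_kb, flash_budget_kb, p=80):
    """Return (name, score) of the space whose feasible sub-networks
    have the highest p-th percentile compute."""
    best_name, best_score = None, None
    for name, subnets in spaces.items():
        feasible = [mmacs for mmacs, sram_kb, flash_kb in subnets
                    if sram_kb <= sram_budget_kb and flash_kb <= flash_budget_kb]
        if not feasible:
            continue  # nothing in this space fits the MCU
        score = percentile_nearest_rank(feasible, p)
        if best_score is None or score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Hypothetical sampled sub-networks per space: (MMACs, peak SRAM KB, Flash KB).
spaces = {
    "small":  [(20, 100, 300), (25, 120, 350), (30, 140, 400), (35, 160, 450), (40, 180, 500)],
    "medium": [(40, 200, 600), (50, 240, 700), (60, 280, 800), (70, 310, 900), (80, 350, 1000)],
    "large":  [(90, 400, 1100), (110, 500, 1300), (130, 600, 1500), (150, 700, 1700), (170, 800, 1900)],
}

choice = pick_search_space(spaces, sram_budget_kb=320, flash_budget_kb=1024)
print(choice)
```

With these toy numbers the "large" space is skipped because nothing in it fits, and "medium" wins because its feasible sub-networks carry more compute than those of "small". No training is involved at any point.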

Worked example from the paper. For a 320 KB SRAM, 1 MB Flash budget, Stage 1 selects w0.5-r144 (width multiplier 0.5, resolution 144), or w0.4-r144 depending on the exact Flash budget, as the optimal search space for this MCU. All the NAS in Stage 2 will happen within this sub-space.

TinyNAS Feasible Region — Resolution × Width

Each cell shows whether a (resolution, width-multiplier) search space is feasible for the given SRAM budget. Green = 80%+ of sub-networks fit. The star marks the optimal cell (max FLOPs while feasible). Adjust the SRAM budget to see how the feasible window shifts. Slider defaults: SRAM budget 320 KB, Flash budget 1.0 MB.
TinyNAS Stage 1 selects the search space by maximizing which metric among sub-networks that satisfy the memory constraint?

Chapter 5: TinyNAS — Resource-Constrained Search

With the search space optimized in Stage 1, TinyNAS Stage 2 runs the actual architecture search within that space. The goal: find the architecture that maximizes accuracy subject to explicit SRAM and Flash constraints.

Stage 2 uses one-shot NAS with weight sharing. Instead of training each candidate architecture from scratch (which would be computationally impossible), TinyNAS trains a single supernet that contains all candidate architectures as sub-networks sharing weights. The supernet covers choices in kernel size, expansion ratio, and number of blocks per stage.

During training, random sub-networks are sampled from the supernet at each step. All sampled sub-networks share the same weights for their shared operations. This progressive fine-tuning of many overlapping sub-networks gives each sub-network approximately trained weights without requiring separate training.

After supernet training, evaluation proceeds as follows. For any candidate architecture c = (kernel choices, expansion ratios, block counts):

Peak SRAM(c) = max_L (in_L(c) + out_L(c))   (computed analytically from c)
Flash(c) = ∑_L weights_L(c)   (computed analytically from c)

This analytic cost model means we can check SRAM and Flash constraints without running the network at all — just from the architecture specification. Candidates that violate constraints are immediately rejected. The remaining candidates are ranked by their accuracy on a held-out validation set, using the shared weights.
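The filter-then-rank step might look like the sketch below. The `Candidate` class, the three candidates, and their accuracy values are hypothetical placeholders; in a real search the accuracy would come from evaluating each sub-network with the shared supernet weights.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    layers: list      # per layer: (H_in, W_in, C_in, H_out, W_out, C_out, k)
    val_acc: float    # hypothetical accuracy measured with shared weights

def peak_sram(c):
    """max over layers of (input + output) activation bytes, INT8."""
    return max(hi * wi * ci + ho * wo * co
               for hi, wi, ci, ho, wo, co, _ in c.layers)

def flash(c):
    """Total weight bytes, INT8, for standard k x k convolutions."""
    return sum(k * k * ci * co for _, _, ci, _, _, co, k in c.layers)

def best_feasible(candidates, sram_budget, flash_budget):
    """Reject candidates that violate either budget, then rank by accuracy."""
    feasible = [c for c in candidates
                if peak_sram(c) <= sram_budget and flash(c) <= flash_budget]
    return max(feasible, key=lambda c: c.val_acc, default=None)

candidates = [
    Candidate("A", [(48, 48, 16, 48, 48, 32, 3), (48, 48, 32, 24, 24, 64, 3)], 0.60),
    Candidate("B", [(48, 48, 16, 48, 48, 128, 3), (48, 48, 128, 24, 24, 64, 3)], 0.70),
    Candidate("C", [(48, 48, 16, 48, 48, 32, 5), (48, 48, 32, 24, 24, 64, 5)], 0.63),
]

best = best_feasible(candidates, sram_budget=256 * 1024, flash_budget=1024 * 1024)
print(best.name, peak_sram(best), flash(best))
```

Candidate B has the highest accuracy but is rejected without ever being run, because its 324 KB peak exceeds the 256 KB SRAM budget.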

Key result: TinyNAS better utilizes available memory. On MobileNetV2 (the baseline), the peak SRAM for the first two stages averages ~200 KB with max spikes to 300 KB. TinyNAS finds networks with a more uniform memory distribution — average ≈180 KB, max ≈190 KB — no layer wastes memory, no layer creates a spike. This lets TinyNAS fit a larger model in the same SRAM budget. The best accuracy per memory byte goes to the network with the flattest SRAM profile across layers.

Flash budget check with real numbers. Suppose the search finds a network with these layers (kernel 3×3, INT8):

In one-shot NAS with weight sharing, how does TinyNAS evaluate whether a candidate architecture satisfies the SRAM and Flash constraints?

Chapter 6: Imbalanced Memory Distribution

Even with TinyNAS finding a better architecture, there remains a structural problem that affects all CNNs on MCUs: the imbalanced memory distribution. The first few layers consume disproportionately more SRAM than all subsequent layers combined.

Why? Early CNN layers process large spatial feature maps (high H×W) with relatively few channels. Later layers have smaller spatial maps but more channels. Because SRAM scales as H×W×C, and H×W shrinks rapidly (each stride-2 conv halves both dimensions, cutting H×W to 1/4), the early layers dominate peak SRAM.
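A quick numeric illustration: assume a hypothetical backbone where every stage halves H and W and doubles C (a common pattern, but not the exact MobileNetV2 schedule). Spatial area falls 4× per stage while channels rise only 2×, so each activation is half the size of the one before it.

```python
# Hypothetical backbone: every stage halves H and W and doubles C.
h = w = 112
c = 16
sizes = []
for stage in range(5):
    size = h * w * c  # INT8 bytes for the H x W x C activation
    sizes.append(size)
    print(f"stage {stage}: {h:>3}x{w:<3}x{c:<4} = {size / 1024:6.1f} KB")
    h, w, c = h // 2, w // 2, c * 2
```

The first stage holds as much activation data as all later stages combined, plus a little more.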

Measured on MobileNetV2 processing a 224×224 input on an STM32F746 (320 KB SRAM):

| Block | Approx SRAM (KB) | vs MCU budget |
| --- | --- | --- |
| Block 0 (conv-stem) | 539 | 1.7× over budget |
| Blocks 1–2 (first inverted residuals) | ~1,372 peak | 4.3× over budget |
| Blocks 3–6 | 80–200 | within budget |
| Blocks 7–17 | 8–80 | well within budget |

The picture is stark: blocks 1–2 need more than 4× the MCU's SRAM budget, while every block from 7 to 17 stays under 80 KB. The tail of the network is not the problem. The head is.

The imbalance insight: If we could somehow reduce SRAM for the first 2–3 blocks, the overall peak SRAM would drop dramatically — because those early blocks ARE the peak. Everything else is already well within budget. This localization of the problem is what makes the MCUNetV2 solution elegant: we don't need to redesign the whole network, just the early blocks.

This imbalance appears across all standard architectures, not just MobileNetV2. Any CNN with a large input image will have this property because:

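SRAM_early = H_0 × W_0 × C_in + H_0 × W_0 × C_expand

Here H_0 × W_0 is the feature-map size after the stem, C_in is the block's input channel count, and C_expand is the channel count after expansion. For MobileNetV2 with input 224×224: H_0 = W_0 = 112 after the stem, C_in = 16, and C_expand = 96 after 6× expansion. The expanded tensor alone is 112×112×96 ≈ 1.15 MB, and together with its 112×112×16 input the layer needs 1,372 KB. Quantization does not change these shapes, and pruning would have to remove most of the expanded channels, so the fix has to come from the architecture.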

Per-Block SRAM Profile — The Imbalanced Distribution

SRAM usage across MobileNetV2 blocks. The red dashed line is the MCU budget. Early blocks spike far above it; later blocks are well under. Toggle between MobileNetV2 and MCUNet to see how TinyNAS flattens the distribution.

Why do early CNN layers dominate peak SRAM even though they have fewer channels than later layers?

Chapter 7: Patch-Based Inference — MCUNetV2

MCUNetV2 (Lin et al., NeurIPS 2021) introduces a simple but powerful idea to break the early-layer SRAM bottleneck: instead of processing the entire feature map at once, process it one patch at a time.

Here is the standard inference flow for an early conv layer (call it L1) followed by another (L2):

Per-Layer Inference (Standard)
Feed full 224×224 input → L1 outputs 112×112 feature map (all in SRAM) → L2 processes all 112×112 → done. Peak SRAM = input(L1) + output(L1) = 1,372 KB.

Now consider patch-based inference with a 2×2 patch grid (4 patches):

Per-Patch Inference
Divide the input into a 2×2 grid of 112×112 patches (with padding for border). For each patch: feed 112×112 patch through L1 → 56×56 partial output → feed through L2 → accumulate 28×28 output tile. Only one patch's activations are in SRAM at a time. Peak SRAM ≈ 1,372/4 ≈ 343 KB — much better.

More patches → lower peak SRAM. With 9 patches (3×3 grid): SRAM ≈ 1,372/9 ≈ 152 KB, under the 256 KB budget.
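The idealized relationship between grid size and peak SRAM can be sketched as follows. It deliberately ignores halo overhead and assumes the patched early stage is the only bottleneck, so treat the numbers as a lower bound rather than a prediction.

```python
import math

def ideal_patch_peak_kb(per_layer_peak_kb, grid):
    """Peak SRAM for a grid x grid patch split, ignoring halo overhead."""
    return per_layer_peak_kb / (grid * grid)

def smallest_grid_that_fits(per_layer_peak_kb, budget_kb):
    """Smallest grid side P such that peak / P^2 <= budget."""
    return math.ceil(math.sqrt(per_layer_peak_kb / budget_kb))

for grid in (1, 2, 3, 4):
    print(f"{grid}x{grid}: {ideal_patch_peak_kb(1372, grid):.0f} KB")

print("smallest grid for 256 KB:", smallest_grid_that_fits(1372, 256))
```

For the 1,372 KB MobileNetV2 peak, a 2×2 grid is not enough for a 256 KB budget and a 3×3 grid is the first that fits.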

But patches aren't free — the halo problem: Convolutional layers have receptive fields. A 3×3 conv at layer L+1 needs a 1-pixel border around each output patch's corresponding input region. With multiple conv layers stacked, the halo (the extra input needed per patch) grows proportionally to the number of layers and kernel sizes. Each halo pixel is computed redundantly across patches — some computation is repeated. MCUNetV2 calls this "halo overhead."

The halo for N layers of 3×3 convolutions is ≈ 2N pixels on each side. For 4 stacked layers with a 2×2 patch grid on a 224×224 input: halo width = 2×4 = 8 pixels per side. Each patch is 112×112 = 12,544 pixels. In a 2×2 grid every patch touches a neighbor on two of its four sides, so it grows to (112+8)×(112+8) = 120×120 = 14,400 pixels, about 15% more input per patch. Empirically, MobileNetV2 with 2×2 patches incurs only ~10% extra MACs, a small price for 4.9× lower SRAM.
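As a rough check on the halo arithmetic, the sketch below counts how many extra input pixels are read when each patch is grown by a fixed halo on every side that borders another patch. It measures input pixels only, which is a proxy and not the same thing as the MAC overhead measured on a real network; the function name and the 8-pixel halo are taken as given for illustration.

```python
def halo_overhead(image_size, grid, halo):
    """Fraction of extra input pixels read across all patches when each
    patch is extended by `halo` pixels, clipped at the image border."""
    bounds = [round(i * image_size / grid) for i in range(grid + 1)]
    total = 0
    for r in range(grid):
        for c in range(grid):
            top = max(0, bounds[r] - halo)
            bottom = min(image_size, bounds[r + 1] + halo)
            left = max(0, bounds[c] - halo)
            right = min(image_size, bounds[c + 1] + halo)
            total += (bottom - top) * (right - left)
    return total / (image_size * image_size) - 1.0

for grid in (1, 2, 3):
    print(f"{grid}x{grid} grid, 8 px halo: "
          f"{halo_overhead(224, grid, 8) * 100:.1f}% extra input pixels")
```

Finer grids pay more: interior patches in a 3×3 grid have neighbors on all four sides, so the redundant fraction grows as the patches shrink.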

MCUNetV2 solves the halo overhead problem with network redistribution: instead of running patch-based inference on a standard network, it redesigns the early-stage architecture to have smaller receptive fields, reducing the halo. The NAS search jointly optimizes architecture AND the patch inference schedule — finding networks that naturally have low-halo early stages and benefit most from patch-based inference.

Measured results on STM32F746 (320 KB SRAM), INT8:

| Model | Inference Mode | Peak SRAM | Reduction |
| --- | --- | --- | --- |
| MobileNetV2 | Per-layer | 315 KB | baseline |
| MobileNetV2 | Per-patch 2×2 | 64 KB | 4.9× |
| MCUNet | Per-layer | 113 KB | 2.8× vs MbV2 |
| MCUNetV2 | Per-patch 2×2 | 30 KB | 10.5× vs MbV2 |

Patch-based inference reduces peak SRAM by processing one patch at a time. What is the main computational cost of this approach, and approximately how large is it for a 2×2 patch split with 4 early conv layers?

Chapter 8: Showcase — MCUNet Memory Explorer

This showcase integrates everything from Chapters 1–7 into one interactive simulation. Configure a network and memory budget, watch the per-layer SRAM profile update live, and see how patch-based inference slashes the peak.

MCUNet Memory Explorer

Adjust the input resolution, width multiplier, and patch grid. Watch the SRAM profile change. The red dashed line is the MCU budget. The highlighted bar is the current bottleneck layer. Control defaults: input resolution 128, width × 0.5, patch grid 1×1, SRAM budget 256 KB.

Use the presets to walk through the MCUNet story:

  1. MobileNetV2 preset — see the massive early-layer spikes, far above the 256 KB line.
  2. MCUNet preset — TinyNAS found 128px, 0.5× as the optimal space. Bars flatten significantly.
  3. MCUNetV2 preset — enable 2×2 patches. The early spikes are divided by ~4. Peak drops below 256 KB.
  4. Try the SRAM budget slider — find the minimum budget each preset needs to fit entirely within.
The full MCUNet story in numbers: MobileNetV2 at 224px needs 1,372 KB peak SRAM → 5.4× over 256 KB MCU budget. MCUNet (TinyNAS) at 128px, 0.5× needs ~113 KB → fits with room to spare. MCUNetV2 with 2×2 patches at 144px needs ~30 KB → 8.5× headroom, enabling larger, more accurate networks at the same memory budget. At the same SRAM constraint, MCUNetV2 achieves +4.6% ImageNet top-1 over MCUNet and +12% over off-the-shelf MobileNetV2-int8.

Chapter 9: Connections & Cheat Sheet

MCU Cheat Sheet

| Concept | Formula / Rule | Binding constraint? / Notes |
| --- | --- | --- |
| Flash (model size) | ∑_L (k²·C_in·C_out) bytes INT8 | Yes: total weights must fit |
| Peak SRAM | max_L (H·W·C_in + H'·W'·C_out) | Yes: the critical binding constraint |
| SRAM scales as | R² (doubling res = 4× SRAM) | Why resolution must be small on MCUs |
| Inverted residual expansion | 6× channels → 6× larger activation tensor | Source of MobileNetV2's MCU failure |
| TinyNAS Stage 1 | Pick (R*, W*) = argmax FLOPs at 80th pctile of satisfying sub-networks | Ensures search space is feasible |
| TinyNAS Stage 2 | One-shot NAS with weight sharing under analytic SRAM+Flash cost model | Finds best arch within feasible space |
| Patch inference SRAM | ≈ per-layer SRAM / (P×P) + halo overhead (~10–15%) | Trades small compute for large SRAM savings |
| MCUNetV2 result | 30 KB peak SRAM @ 144px, 0.5×, 2×2 patches | 10.5× smaller than MobileNetV2 |

The Two-Stage TinyNAS Summary

Input: MCU spec (SRAM_budget, Flash_budget)
e.g., STM32F746: 320 KB SRAM, 1 MB Flash
↓ Stage 1
Search Space Optimization
Enumerate (R, W) pairs. For each, compute the FLOPs CDF of satisfying sub-networks. Pick (R*, W*) = argmax FLOPs at 80th percentile. No training needed.
↓ Stage 2
Resource-Constrained NAS
Train supernet at (R*, W*). Sample sub-networks, evaluate analytically for SRAM+Flash, rank by accuracy using shared weights. Return best feasible arch.
↓ MCUNetV2
Patch-Based Inference
Jointly search architecture + patch schedule. Early layers run patch-by-patch to slash peak SRAM by 4–10×. Network redistribution minimizes halo overhead.

Flash Budget Check (Code)

```python
def flash_budget(layers, bitwidth=8):
    """layers: list of (k, C_in, C_out) tuples. Returns Flash in bytes."""
    total_params = 0
    for k, c_in, c_out in layers:
        total_params += k * k * c_in * c_out
    # Round bits up to whole bytes so sub-byte widths (e.g. INT4) don't collapse to 0.
    return (total_params * bitwidth + 7) // 8


def peak_sram(layer_specs, bitwidth=8):
    """layer_specs: list of (H_in, W_in, C_in, H_out, W_out, C_out).
    Returns (peak_sram_bytes, bottleneck_layer_idx)."""
    peak = 0
    bottleneck = 0
    for i, (Hi, Wi, Ci, Ho, Wo, Co) in enumerate(layer_specs):
        # Input and output activations of the running layer are live together.
        values = Hi * Wi * Ci + Ho * Wo * Co
        sram = (values * bitwidth + 7) // 8
        if sram > peak:  # strict '>' keeps the first layer on ties
            peak, bottleneck = sram, i
    return peak, bottleneck


# Example: small MCUNet-like network (the 5-layer net from Chapter 2)
layers = [
    (96, 96, 1,  48, 48, 16),   # conv0, s=2
    (48, 48, 16, 48, 48, 32),   # conv1, s=1 (peak, tied with conv2)
    (48, 48, 32, 24, 24, 64),   # conv2, s=2
    (24, 24, 64, 24, 24, 64),   # conv3, s=1
    (24, 24, 64, 12, 12, 128),  # conv4, s=2
]

peak, idx = peak_sram(layers)
print(f"Peak SRAM: {peak/1024:.1f} KB at layer {idx}")
# Peak SRAM: 108.0 KB at layer 1


def patch_inference_sram(peak_layer_sram, num_patches, halo_overhead=0.12):
    """Estimate peak SRAM after patch-based inference.
    halo_overhead is a rough fractional allowance for redundant halo pixels."""
    return peak_layer_sram * (1 + halo_overhead) / num_patches


# 2x2 patches (4 patches) on MobileNetV2 peak of 1,372 KB
mbv2_peak = 1372 * 1024  # bytes
patched = patch_inference_sram(mbv2_peak, num_patches=4)
print(f"After 2x2 patches: {patched/1024:.0f} KB")
# After 2x2 patches: 384 KB, still over 256 KB, so try 3x3 patches

patched_3x3 = patch_inference_sram(mbv2_peak, num_patches=9)
print(f"After 3x3 patches: {patched_3x3/1024:.0f} KB")
# After 3x3 patches: 171 KB, fits in 256 KB
```

What's Next: TinyEngine (Lecture 11)

MCUNet finds the right architecture via TinyNAS. TinyEngine (L11) is the companion compiler/runtime that executes that architecture efficiently on MCU hardware. TinyEngine implements:

Together, TinyNAS + TinyEngine form the full MCUNet co-design loop: the algorithm produces an MCU-fitting architecture, and the runtime executes it with zero wasted memory or compute.


"We want to enable every microcontroller to run deep learning. That means not just fitting the weights — but fitting the activations, in 256 KB, with no OS, no DRAM, and no compromise on accuracy." — Song Han, MIT 6.5940
MCUNetV2 achieves 30 KB peak SRAM at 144px resolution with 2×2 patch-based inference. MobileNetV2 at 224px needs 1,372 KB. What are the TWO main techniques that together explain this ~46× reduction?