MCUNet found an architecture that fits in 320 KB of SRAM. But fitting and running fast are two different things. The same model can be 10× slower on a stock runtime library than on TinyEngine — with zero accuracy loss. This lesson derives every kernel trick that makes the difference: loop reordering for cache locality, tiling to fit the working set in SRAM, im2col vs. direct convolution, NCHW/NHWC layout for vectorization, in-place depthwise convolution to halve the activation buffer, and code generation to eliminate interpreter overhead.
You just ran MCUNet's TinyNAS search. The output: a MobileNetV2-style architecture with a 128×128 input, 0.5 width multiplier, and a peak SRAM of 113 KB — comfortably under the STM32F746's 320 KB. You flash the model. You run inference. It takes 1.2 seconds per image.
For visual wake-word detection, you need <200 ms. Your model fits; it just doesn't run fast enough. You haven't touched the architecture, the weights, or the quantization. You change only the inference engine — the runtime library that executes the model — and the time drops to 93 ms. A 12.9× speedup with zero accuracy change.
Two distinct problems conspire to make stock runtimes slow on MCUs:
Problem 1 — The kernel is cache-unfriendly. A Cortex-M7 has 8 KB of L1 data cache. A naive matrix multiply reads memory in a pattern that thrashes this cache. Each row of B is loaded, evicted, and reloaded N² times. Reordering three loop indices — a trivial 3-line code change — can give a 12× speedup purely from better cache behavior.
Problem 2 — The library itself is too big. TensorFlow Lite Micro (TFLM) is a general interpreter: it reads an ops list at runtime, dispatches to function pointers, allocates buffers dynamically. This runtime machinery occupies 300–400 KB of Flash, leaving no room for weights. TinyEngine generates code — it compiles the model's specific layer sequence into a C function, eliminating all dispatch overhead and cutting the binary to <100 KB.
Data access latency on a Cortex-M7 MCU vs a laptop. Notice how deep the cliff is — DRAM is 200× slower than L1 cache. Every cache miss is a stall.
The canvas above shows why cache behavior dominates MCU performance. An MCU has only 8 KB of L1 cache. If your working set exceeds this, data must come from SRAM (100 ns) or Flash (150+ ns). Loop tiling's entire purpose is to keep the working set inside L1.
Matrix multiplication is the inner loop of every dense layer and every convolution (once rewritten as GEMM). The naive triple loop looks like this:
c // Naive: i, j, k order for (i = 0; i < M; i++) for (j = 0; j < N; j++) { float acc = 0; for (k = 0; k < K; k++) acc += A[i][k] * B[k][j]; // B accessed column-wise: BAD C[i][j] = acc; }
This looks innocent, but the inner loop k accesses B[k][j] — that's a column of B. In row-major layout (how C/C++ store 2D arrays), consecutive elements of a row are adjacent in memory. But a column means jumping by N elements between accesses. For a 512×512 matrix, each step in k jumps 512 floats = 2048 bytes. With an 8 KB L1 cache, the working row of B is evicted after every 4 steps. For N=512: 512³ ÷ 4 = 33 million cache misses in the inner loop alone.
j, accessing B[k][j] as a row — sequential, cache-friendly. The same arithmetic, the same FLOPs. Measured speedup on Intel Xeon: 12×. On ARM Cortex-M7 (8 KB L1): even higher because the cache is smaller and misses hurt more.c // Loop-reordered: i, k, j order (B accessed row-wise: GOOD) for (i = 0; i < M; i++) for (k = 0; k < K; k++) { float Aik = A[i][k]; // hoist scalar out of inner loop for (j = 0; j < N; j++) C[i][j] += Aik * B[k][j]; // B[k][j]: row j — sequential! }
Why does hoisting Aik out of the inner loop matter? Because A[i][k] is now accessed once per k iteration instead of N times. Compilers sometimes do this automatically, but explicit hoisting guarantees it. On real MCUs, the scalar fits in a register — zero memory accesses for it in the inner loop.
The principle generalizes: when two indices both contribute to memory access, order the loops so the innermost index drives sequential (stride-1) memory access. For NHWC layout tensors, this means the channel (C) index should be innermost; for NCHW tensors, the spatial (W) index.
B[K][N] stores elements as: B[0][0], B[0][1], ..., B[0][N-1], B[1][0], ... The naive loop accesses B[k][j] with k in the inner loop. Why is this a cache miss pattern and what is the stride between consecutive inner-loop accesses?Loop reordering fixed column-access patterns, but there's a subtler problem: even with row-major access, if N is large (say 512), a single row of B is 512×4 = 2 KB. The loop over i accesses M=512 rows of A and M=512 rows of C. The total working set for a block of k iterations is M×K + K×N + M×N floats = hundreds of KB — far exceeding the 8 KB L1 cache. Once B's row is evicted, re-loading it is a cache miss.
Loop tiling (also called loop blocking) partitions the iteration space into tiles that fit in the cache. Instead of iterating all N elements of j in sequence, you iterate in tiles of size Tj. Each tile accesses only Tj columns of B and Tj columns of C — a working set of K×Tj + M×Tj floats.
c #define BLK 32 for (ti = 0; ti < M; ti += BLK) for (tk = 0; tk < K; tk += BLK) for (tj = 0; tj < N; tj += BLK) // inner micro-kernel: BLK×BLK tile stays in L1 for (i = ti; i < ti+BLK; i++) for (k = tk; k < tk+BLK; k++) { float Aik = A[i][k]; for (j = tj; j < tj+BLK; j++) C[i][j] += Aik * B[k][j]; }
With BLK=32: each B tile is 32×32×4 = 4 KB; A tile is 4 KB; C tile is 4 KB. Total 12 KB — exceeds 8 KB slightly. In practice BLK=16 (3 KB/tile, 9 KB total — ok for 8 KB) or careful register-level allocation. Measured speedup over naive: 19× on Intel Xeon; 8–12× on ARM Cortex-M.
For MCUs, there's an additional benefit: tiling can be set so the entire working set fits in SRAM without needing to bring more data from Flash. Since Flash has ~150 ns access latency vs SRAM's 0.5 ns, this is a 300× latency improvement for each cache miss that is eliminated.
Drag the tile-size slider to see how much of the working set fits in the L1 cache. The blue box = tile being computed; the red region = data spilling out of cache.
Tiling gets data into the right memory tier. But inside the micro-kernel, there's still overhead: every iteration of the inner loop pays for an increment (j++), a comparison (j < tj+BLK), and a branch. For a tight inner loop that runs billions of times, these 3 extra instructions per MAC add up. Loop unrolling replicates the loop body N times and steps the loop counter by N — eliminating 7/8 of the branching overhead for 8-way unrolling.
c // Unrolled 4× over k — k steps by 4 each iteration for (k = 0; k < K; k += 4) { acc += A[i][k] * B[k][j]; acc += A[i][k+1] * B[k+1][j]; // replicated body ×4 acc += A[i][k+2] * B[k+2][j]; acc += A[i][k+3] * B[k+3][j]; }
Loop unrolling alone gives ~2.85× speedup. But modern CPUs and MCUs go further: SIMD (Single Instruction, Multiple Data) applies one instruction to a vector of values simultaneously. ARM Cortex-M7 supports NEON intrinsics — 128-bit vector registers that hold 4 FP32 or 8 INT16 values. One vmulq_f32 instruction multiplies four pairs in one cycle instead of one pair.
c (ARM NEON) // SISD: 1 multiply per instruction for (k = 0; k < K; k++) acc += A[i][k] * B[k][j]; // SIMD with NEON: 4 multiplies per instruction float32x4_t vacc = vdupq_n_f32(0); for (k = 0; k < K; k += 4) { float32x4_t vA = vld1q_f32(&A[i][k]); // load 4 floats from A float32x4_t vB = vld1q_f32(&B_T[j][k]); // load 4 floats from B^T vacc = vmlaq_f32(vacc, vA, vB); // 4 fused multiply-adds } // horizontal sum: acc = vacc[0]+vacc[1]+vacc[2]+vacc[3]
vmull_s8 (multiply 8-wide INT8) achieves 16 MACs per cycle. This is the primary reason TinyEngine with INT8 runs 4–8× faster than TinyEngine with FP32 on ARM Cortex-M — not just smaller data, but more MACs per cycle.Everything in the previous chapters optimized matrix multiplication. But a convolution layer is not a matrix multiply — it's a sliding window operation. The most common approach on GPUs is to convert it into a matrix multiply via im2col ("image to column"). Understanding im2col's memory cost reveals why TinyEngine avoids it on MCUs.
Consider a convolution: input X of shape H×W×Cin, kernel K of shape Kh×Kw×Cin×Cout, output Y of shape H'×W'×Cout. For each output position (h,w), the convolution computes a dot product between the kernel and a Kh×Kw×Cin patch of input. im2col "unrolls" these patches into columns of a matrix:
The appeal: once X is unrolled into the im2col matrix, you can use any optimized GEMM routine (BLAS, cuBLAS, etc.). On GPUs this is a huge win because BLAS is highly optimized and massively parallel.
Direct convolution avoids this entirely. It computes the sliding window directly in a 6-deep nested loop, never materializing the im2col matrix:
c // Direct convolution — zero extra memory for (n = 0; n < N_batch; n++) for (oh = 0; oh < H_out; oh++) for (ow = 0; ow < W_out; ow++) for (oc = 0; oc < C_out; oc++) { int acc = 0; for (kh = 0; kh < K_h; kh++) for (kw = 0; kw < K_w; kw++) for (ic = 0; ic < C_in; ic++) acc += X[n][oh*s+kh][ow*s+kw][ic] * W[oc][kh][kw][ic]; Y[n][oh][ow][oc] = acc; }
Direct convolution still benefits from loop reordering (put ic innermost for NHWC layout), tiling (fit a tile of the input and kernel in L1), and unrolling/SIMD on the ic dimension. TinyEngine implements direct conv as its default for all layers, supplemented by layer-specific hand-tuned kernels for point-wise (1×1) and depthwise convolutions.
Drag sliders to see how the im2col matrix size scales with K (kernel size) and C_in (input channels). The red bar shows SRAM needed — the green bar shows total MCU SRAM (320 KB).
A 4D activation tensor N×C×H×W can be stored in two standard layouts. The choice determines which indices are contiguous in memory — and therefore which loops are cache-friendly.
NCHW (PyTorch default): elements are stored as: all pixels of channel 0 (row 0 col 0, row 0 col 1, ..., row H-1 col W-1), then all pixels of channel 1, etc. Consecutive memory = consecutive spatial positions within one channel.
NHWC (TensorFlow default, TinyEngine default for point-wise conv): elements are stored as: all channels at position (row 0, col 0), then all channels at (row 0, col 1), etc. Consecutive memory = all channels at one spatial position.
TinyEngine uses NHWC for point-wise convolution and NCHW for depthwise convolution. This is not a single compromise — it's a split layout strategy. The model's activations are stored in NHWC, but before running a depthwise layer, TinyEngine reorders the data to NCHW. The transpose cost (one pass over the activation) is paid back many times over by the improved kernel speed.
A 3×3×4 activation (H=3, W=3, C=4). Click to animate the access sequence for point-wise (inner loop = C) and depthwise (inner loop = spatial) convolutions. Green = cache hit, red = cache miss.
MobileNetV2 uses inverted residual blocks. Each block has three layers: a 1×1 pointwise expand (channels × expansion factor, typically 6×), a 3×3 depthwise conv, and a 1×1 pointwise project (back to bottleneck channels). The peak SRAM within one block is dominated by the expanded tensor after the first 1×1 layer.
Consider a typical MobileNetV2 early block: bottleneck channels = 16, expansion = 6×, spatial = 56×56. After the pointwise expand, the tensor is 56×56×96 (96 = 16×6). To run the depthwise conv, you need both the input tensor (56×56×96) and the output tensor (56×56×96) in SRAM simultaneously — 2 × 56 × 56 × 96 = 602,112 bytes = 588 KB. That's nearly 2× the entire 320 KB SRAM.
This leads to in-place depthwise convolution: process channels one at a time, writing each output channel back into the input buffer. You only need a single-channel temporary buffer to hold the 3×3 window of input while you're computing one output position.
For C=96, H=W=56: in-place SRAM = (96+1) × 56 × 56 = 304,304 bytes = 297 KB. That's a reduction from 588 KB to 297 KB — nearly half. For larger expansion factors the saving grows: at 6× expansion (C=96), a depthwise conv that would have needed 588 KB now needs only 297 KB, just fitting in 320 KB SRAM.
The implementation requires careful ordering: to write the output of channel k into the input's channel-k slot, you must have finished reading all input values in channel k's 3×3 convolution window. Since the 3×3 kernel has stride 1, and you process positions top-to-bottom left-to-right, the position (h,w) is safe to overwrite after the kernel pass that computes output at (h + K_h/2, w + K_w/2) — i.e., you're always 1 row of kernel "ahead" of your overwrite. A one-row temporary buffer ensures correctness.
Adjust expansion ratio and spatial size. The chart shows standard vs in-place SRAM usage against the 320 KB MCU budget.
After a convolution layer, a neural network almost always applies batch normalization and then an activation function (ReLU or ReLU6). Three sequential operations. In a naive framework, each is a separate kernel call: conv writes an intermediate activation tensor, BN reads it and writes another intermediate tensor, ReLU reads that and writes the final output. Two extra write-then-read cycles for every layer.
On an MCU, every write to SRAM takes ~0.5 ns per byte. A 56×56×96 intermediate tensor is 300 KB. Writing it and reading it back costs 300 KB × 2 (write + read) × 0.5 ns/byte = 300 μs per layer. In a 52-layer MobileNetV2, this is 15.6 ms just in redundant memory traffic — not counting any computation.
Batch normalization at inference is simple: each output channel c has a learned scale γ[c] and bias β[c] (folded in from the running mean/variance during model export). The fused conv-BN-ReLU computes:
In code, this means the inner accumulation loop simply multiplies by γ[c] and adds β[c] before the ReLU clamp, with no intermediate buffer:
c — fused conv+BN+ReLU for (oc = 0; oc < C_out; oc++) { int32_t acc = bias[oc]; for (kh..kw..ic) // inner conv loops acc += X[oh+kh][ow+kw][ic] * W[oc][kh][kw][ic]; // BN: scale + shift (γ, β are folded into INT8 scale factor) acc = acc * scale[oc] + shift[oc]; // ReLU6 clamp: output range [0, 6] in quantized domain Y[oh][ow][oc] = clamp(acc, 0, 255); // write ONCE to SRAM }
This chapter brings together all the optimizations. Each bar in the profiler represents the time contribution of one layer of MCUNet-tiny, broken down by which bottleneck limits it: memory-bound (cache misses dominate) or compute-bound (ALU utilization dominates). Toggle optimizations on/off to see how each one changes the profile — and the total inference time.
Check/uncheck optimizations to see their impact on per-layer inference time. The roofline bar (right) shows where each layer sits: memory-bound layers benefit from layout/in-place fixes; compute-bound layers benefit from SIMD and tiling.
A few patterns to observe as you toggle:
| Technique | What it fixes | Key number | MCU-specific note |
|---|---|---|---|
| Loop reordering | Column-wise (stride-N) cache access | 12× speedup (i,k,j vs i,j,k) | Critical for NCHW matmul on MCU with 8 KB L1 |
| Loop tiling | Working set exceeds L1 cache | 19× cumulative; tile size T ≤ sqrt(L1/(3×elem_size)) | MCU L1 = 8 KB → T≤26 for FP32, T≤52 for INT8 |
| Loop unrolling | Branch/pointer overhead in inner loop | 2.85× speedup; code size ×N | Tradeoff: Flash is small on MCUs — don't over-unroll |
| SIMD (NEON) | 1 MAC per cycle → 16 MACs/cycle (INT8) | 4× FP32, 16× INT8 per lane | ARM Cortex-M7/M55 has 128-bit NEON. Use vmull_s8 |
| Direct conv | im2col K²×C SRAM blow-up | K²×C× duplication → zero with direct | im2col is infeasible on MCUs. Direct conv is the default. |
| NHWC for PW / NCHW for DW | Wrong layout for the conv type | PW inner loop = C (want NHWC); DW inner loop = spatial (want NCHW) | TinyEngine splits layout: stores NHWC, transposes for DW |
| In-place depthwise | 2×C×H×W SRAM for DW conv | Saves ~50% SRAM: 2×C → (1+1/C)×C ≈ C | Needs 1-row temp buffer; requires NCHW channel-independent order |
| Operator fusion | Intermediate buffer writes (BN, ReLU) | Eliminates 2 buffer write/reads per conv-BN-ReLU group | Values stay in registers; only applicable when ops are point-wise |
| Code generation | Interpreter overhead; large runtime binary | TFLM: 300+ KB Flash; TinyEngine: <100 KB; no dispatch overhead | Generates layer-specific C code per model — no general dispatch |
TensorFlow Lite Micro is a general-purpose interpreter: it reads a flatbuffer model file at runtime, iterates over an ops list, and dispatches each op to a registered function via function pointers. This generality costs Flash space (300–400 KB for the runtime alone) and runtime overhead (op dispatch, buffer allocation, type checking).
TinyEngine's approach is different: given a specific model, it generates a C function that hardcodes the exact sequence of layer calls, with parameters inlined as constants. The generated code looks like:
c — TinyEngine generated code (model-specific, no interpreter) void run_mcunet(const uint8_t* input) { // Layer 0: pointwise expand, C_in=32, C_out=192, HW=28 tinyengine_conv_pw_nhwc(input, WEIGHTS_L0, buf0, 28, 28, 32, 192); // Layer 1: in-place depthwise, K=3, C=192, HW=28 tinyengine_dw_inplace_nchw(buf0, WEIGHTS_L1, 28, 28, 192); // Layer 2: pointwise project, C_in=192, C_out=32, HW=28 tinyengine_conv_pw_nhwc(buf0, WEIGHTS_L2, buf1, 28, 28, 192, 32); // ... 49 more layers }
No runtime model parsing, no op dispatch, no buffer allocator. The kernel functions themselves are also code-generated with dimensions as compile-time constants, enabling the compiler to fully unroll inner loops and allocate registers optimally.
TinyEngine does not operate in isolation. It is the "execution" half of MCUNet's co-design loop. TinyNAS (L10) finds the architecture; TinyEngine executes it. The optimizations here — in-place depthwise, NCHW for depthwise — directly informed TinyNAS's search space: the NAS must produce architectures that use depthwise convolution (to benefit from in-place) and inverted residuals (to benefit from the NHWC/NCHW split).
Quantization (L8–L9) is the third co-design axis. INT8 quantization makes SIMD 4× more efficient (16 values per 128-bit NEON register vs 4 for FP32) and reduces the working set size for tiling. In-place depthwise's savings are also 4× larger in byte terms with INT8 vs FP32.
Every technique here reappears at LLM scale, just with different constants. Flash Attention (L12) is operator fusion applied to the QK^T matmul and softmax, keeping the attention matrix in SRAM tiles. vLLM's paged attention is loop tiling over the KV cache. INT4/INT8 LLM quantization exploits SIMD via tensor-core instructions. The code-generation philosophy lives in kernel libraries like Triton and TensorRT, which JIT-compile model-specific CUDA kernels. The physics is the same: memory hierarchy, SIMD width, and working-set size constrain everything.
"What I cannot create, I do not understand. What I cannot run in 93 milliseconds on 320 KB of SRAM, I have not truly optimized."
— paraphrase of Richard Feynman, adapted for TinyML