TinyEngine: Inference Systems for Tiny Devices

Chapter 0: The Runtime Gap

You just ran MCUNet's TinyNAS search. The output: a MobileNetV2-style architecture with a 128×128 input, 0.5 width multiplier, and a peak SRAM of 113 KB — comfortably under the STM32F746's 320 KB. You flash the model. You run inference. It takes 1.2 seconds per image.

For visual wake-word detection, you need <200 ms. Your model fits; it just doesn't run fast enough. You haven't touched the architecture, the weights, or the quantization. You change only the inference engine — the runtime library that executes the model — and the time drops to 93 ms. A 12.9× speedup with zero accuracy change.

The key insight: FLOPs (floating-point operations) measure how much arithmetic a model requires. But wall-clock time is determined by how efficiently the hardware executes that arithmetic. The same FLOPs can take 10× longer if the data is in the wrong memory tier (DRAM vs cache), the loop order causes repeated cache misses, or the runtime framework brings megabytes of interpreter overhead that doesn't fit in Flash.

Two distinct problems conspire to make stock runtimes slow on MCUs:

Problem 1 — The kernel is cache-unfriendly. A Cortex-M7 has 8 KB of L1 data cache. A naive matrix multiply reads memory in a pattern that thrashes this cache. Each row of B is loaded, evicted, and reloaded N² times. Reordering three loop indices — a trivial 3-line code change — can give a 12× speedup purely from better cache behavior.

Problem 2 — The library itself is too big. TensorFlow Lite Micro (TFLM) is a general interpreter: it reads an ops list at runtime, dispatches to function pointers, allocates buffers dynamically. This runtime machinery occupies 300–400 KB of Flash, leaving no room for weights. TinyEngine generates code — it compiles the model's specific layer sequence into a C function, eliminating all dispatch overhead and cutting the binary to <100 KB.

Roofline intuition: Every layer of a neural network is either compute-bound (the FLOPs are the bottleneck — you max out the ALU before memory can keep up) or memory-bound (data movement is the bottleneck — memory can't feed the ALU fast enough). Loop tiling and SIMD attack compute-bound layers. Data layout and in-place tricks attack memory-bound layers. You must know which regime each layer is in before choosing an optimization.

The Memory Hierarchy Latency Cliff

Data access latency on a Cortex-M7 MCU vs a laptop. Notice how deep the cliff is — DRAM is 200× slower than L1 cache. Every cache miss is a stall.

The canvas above shows why cache behavior dominates MCU performance. An MCU has only 8 KB of L1 cache. If your working set exceeds this, data must come from SRAM (100 ns) or Flash (150+ ns). Loop tiling's entire purpose is to keep the working set inside L1.

TinyEngine achieves a 12.9× speedup over TensorFlow Lite Micro on MCUNet-tiny. The model, weights, and quantization are identical in both cases. What two categories of improvement account for this speedup?

(A) Better quantization and smaller activations. (B) Cache-friendly kernel implementations (loop reordering, tiling, SIMD) and elimination of interpreter overhead via code generation. (C) Running fewer FLOPs by skipping zero weights and using structured pruning. (D) Using FP16 instead of INT8 to exploit the MCU's native floating-point unit.

Chapter 1: Loop Reordering — Free Speedup from Cache Locality

Matrix multiplication is the inner loop of every dense layer and every convolution (once rewritten as GEMM). The naive triple loop looks like this:

c
// Naive: i, j, k order
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++) {
    float acc = 0;
    for (k = 0; k < K; k++)
      acc += A[i][k] * B[k][j];   // B accessed column-wise: BAD
    C[i][j] = acc;
  }

This looks innocent, but the inner loop k accesses B[k][j] — that's a column of B. In row-major layout (how C/C++ store 2D arrays), consecutive elements of a row are adjacent in memory. But a column means jumping by N elements between accesses. For a 512×512 matrix, each step in k jumps 512 floats = 2048 bytes. With an 8 KB L1 cache, the working row of B is evicted after every 4 steps. For N=512: 512³ ÷ 4 = 33 million cache misses in the inner loop alone.

The fix is embarrassingly simple: swap j and k in the loop order. Now the inner loop runs j, accessing B[k][j] as a row — sequential, cache-friendly. The same arithmetic, the same FLOPs. Measured speedup on Intel Xeon: 12×. On ARM Cortex-M7 (8 KB L1): even higher because the cache is smaller and misses hurt more.

c
// Loop-reordered: i, k, j order (B accessed row-wise: GOOD)
for (i = 0; i < M; i++)
  for (k = 0; k < K; k++) {
    float Aik = A[i][k];           // hoist scalar out of inner loop
    for (j = 0; j < N; j++)
      C[i][j] += Aik * B[k][j];   // B[k][j]: row j — sequential!
  }

Why does hoisting Aik out of the inner loop matter? Because A[i][k] is now accessed once per k iteration instead of N times. Compilers sometimes do this automatically, but explicit hoisting guarantees it. On real MCUs, the scalar fits in a register — zero memory accesses for it in the inner loop.

The principle generalizes: when two indices both contribute to memory access, order the loops so the innermost index drives sequential (stride-1) memory access. For NHWC layout tensors, this means the channel (C) index should be innermost; for NCHW tensors, the spatial (W) index.

In row-major (C-style) storage, array B[K][N] stores elements as: B[0][0], B[0][1], ..., B[0][N-1], B[1][0], ... The naive loop accesses B[k][j] with k in the inner loop. Why is this a cache miss pattern and what is the stride between consecutive inner-loop accesses?

Stride-1: B[k][j] and B[k+1][j] differ by 1 element — this is fine. Stride-N: B[k][j] and B[k+1][j] differ by N elements (one full row apart) — only bad if N > cache line size. Stride-N: B[k][j] and B[k+1][j] differ by N elements — for N=512, that's 2 KB per step, evicting the previous row before it can be reused. This is the cache miss pattern. The stride varies because B is stored as a pointer-to-pointer, not a contiguous array.

Chapter 2: Loop Tiling — Fitting the Working Set in Cache

Loop reordering fixed column-access patterns, but there's a subtler problem: even with row-major access, if N is large (say 512), a single row of B is 512×4 = 2 KB. The loop over i accesses M=512 rows of A and M=512 rows of C. The total working set for a block of k iterations is M×K + K×N + M×N floats = hundreds of KB — far exceeding the 8 KB L1 cache. Once B's row is evicted, re-loading it is a cache miss.

Loop tiling (also called loop blocking) partitions the iteration space into tiles that fit in the cache. Instead of iterating all N elements of j in sequence, you iterate in tiles of size T_j. Each tile accesses only T_j columns of B and T_j columns of C — a working set of K×T_j + M×T_j floats.

Deriving the tile size from cache capacity. For an 8 KB L1 cache and FP32 (4 bytes/value): working set ≤ 8192 / 4 = 2048 floats. If you tile only j: working set for inner loops (i, k, j) = T_i×T_k (for A tile) + T_k×T_j (for B tile) + T_i×T_j (for C tile). Setting T_i=T_k=T_j=T: 3T² ≤ 2048 → T ≤ 26. Real implementations use T=16 or T=32 (power of 2 for alignment), with multi-level tiling for L2 as well.

c
#define BLK 32
for (ti = 0; ti < M; ti += BLK)
  for (tk = 0; tk < K; tk += BLK)
    for (tj = 0; tj < N; tj += BLK)
      // inner micro-kernel: BLK×BLK tile stays in L1
      for (i = ti; i < ti+BLK; i++)
        for (k = tk; k < tk+BLK; k++) {
          float Aik = A[i][k];
          for (j = tj; j < tj+BLK; j++)
            C[i][j] += Aik * B[k][j];
        }

With BLK=32: each B tile is 32×32×4 = 4 KB; A tile is 4 KB; C tile is 4 KB. Total 12 KB — exceeds 8 KB slightly. In practice BLK=16 (3 KB/tile, 9 KB total — ok for 8 KB) or careful register-level allocation. Measured speedup over naive: 19× on Intel Xeon; 8–12× on ARM Cortex-M.

For MCUs, there's an additional benefit: tiling can be set so the entire working set fits in SRAM without needing to bring more data from Flash. Since Flash has ~150 ns access latency vs SRAM's 0.5 ns, this is a 300× latency improvement for each cache miss that is eliminated.

Loop Tiling Visualizer — Tile Size vs Cache Utilization

Drag the tile-size slider to see how much of the working set fits in the L1 cache. The blue box = tile being computed; the red region = data spilling out of cache.

Tile size T 32

Matrix dim N 256

An MCU has 8 KB of L1 cache. You're tiling a matrix multiply for 3 tiles (A, B, C tiles) with tile size T×T in FP32 (4 bytes). What is the largest integer T such that all three tiles fit simultaneously in cache?

T=8: working set = 3×8²×4 = 768 bytes — fits, but this is too small to be useful. T=26: 3×26²×4 = 8,112 bytes ≤ 8,192 bytes (8 KB). T=27 gives 8,748 bytes > 8 KB — just over budget. T=32: 3×32²×4 = 12,288 bytes. This fits in 12 KB L2 but not 8 KB L1. T=64: cache size doesn't matter because loop tiling removes all misses regardless of tile size.

Chapter 3: Loop Unrolling & SIMD — One Instruction, Eight MACs

Tiling gets data into the right memory tier. But inside the micro-kernel, there's still overhead: every iteration of the inner loop pays for an increment (j++), a comparison (j < tj+BLK), and a branch. For a tight inner loop that runs billions of times, these 3 extra instructions per MAC add up. Loop unrolling replicates the loop body N times and steps the loop counter by N — eliminating 7/8 of the branching overhead for 8-way unrolling.

c
// Unrolled 4× over k — k steps by 4 each iteration
for (k = 0; k < K; k += 4) {
  acc += A[i][k]   * B[k][j];
  acc += A[i][k+1] * B[k+1][j];   // replicated body ×4
  acc += A[i][k+2] * B[k+2][j];
  acc += A[i][k+3] * B[k+3][j];
}

Loop unrolling alone gives ~2.85× speedup. But modern CPUs and MCUs go further: SIMD (Single Instruction, Multiple Data) applies one instruction to a vector of values simultaneously. ARM Cortex-M7 supports NEON intrinsics — 128-bit vector registers that hold 4 FP32 or 8 INT16 values. One vmulq_f32 instruction multiplies four pairs in one cycle instead of one pair.

c (ARM NEON)
// SISD: 1 multiply per instruction
for (k = 0; k < K; k++)
  acc += A[i][k] * B[k][j];

// SIMD with NEON: 4 multiplies per instruction
float32x4_t vacc = vdupq_n_f32(0);
for (k = 0; k < K; k += 4) {
  float32x4_t vA = vld1q_f32(&A[i][k]);     // load 4 floats from A
  float32x4_t vB = vld1q_f32(&B_T[j][k]);   // load 4 floats from B^T
  vacc = vmlaq_f32(vacc, vA, vB);            // 4 fused multiply-adds
}
// horizontal sum: acc = vacc[0]+vacc[1]+vacc[2]+vacc[3]

Measured cumulative speedups (Intel Xeon, loop-reordering as baseline): Loop reordering alone: 12×. + Loop tiling (BLK=32): 19×. + Loop unrolling (×8): 54×. + SIMD (SSE/NEON ×4): ~100–200×. These multiply, not add. The MCU gains are smaller (fewer pipeline stages, smaller SIMD width) but the ratios are similar.

Misconception — SIMD is only for floats: On MCUs running INT8 quantized models (which is standard for TinyML), SIMD is even more powerful. NEON supports 8-way INT16 or 16-way INT8 SIMD in 128-bit registers. A single vmull_s8 (multiply 8-wide INT8) achieves 16 MACs per cycle. This is the primary reason TinyEngine with INT8 runs 4–8× faster than TinyEngine with FP32 on ARM Cortex-M — not just smaller data, but more MACs per cycle.

ARM Cortex-M7 has 128-bit NEON SIMD registers. A loop processes INT8 values (1 byte each). How many INT8 elements fit in one NEON register, and by what factor does SIMD reduce the number of loop iterations compared to scalar INT8 processing?

4 elements — NEON is designed for 32-bit floats only. 8 elements — 128 / 16 = 8. Reduces iterations by 8×. 16 elements — 128 / 8 = 16. Reduces iterations by 16×. 32 elements — NEON uses 4-bit packing for INT8.

Chapter 4: im2col vs Direct Convolution

Everything in the previous chapters optimized matrix multiplication. But a convolution layer is not a matrix multiply — it's a sliding window operation. The most common approach on GPUs is to convert it into a matrix multiply via im2col ("image to column"). Understanding im2col's memory cost reveals why TinyEngine avoids it on MCUs.

Consider a convolution: input X of shape H×W×C_in, kernel K of shape K_h×K_w×C_in×C_out, output Y of shape H'×W'×C_out. For each output position (h,w), the convolution computes a dot product between the kernel and a K_h×K_w×C_in patch of input. im2col "unrolls" these patches into columns of a matrix:

im2col(X): shape [H'×W', K_h×K_w×C_in]

Kernels reshaped: shape [K_h×K_w×C_in, C_out]

GEMM(im2col(X), kernels) = output Y, shape [H'×W', C_out]

The appeal: once X is unrolled into the im2col matrix, you can use any optimized GEMM routine (BLAS, cuBLAS, etc.). On GPUs this is a huge win because BLAS is highly optimized and massively parallel.

The fatal flaw on MCUs — memory duplication. A single input pixel might appear in up to K_h×K_w = 9 different patches (for a 3×3 kernel). The im2col matrix duplicates each input element up to K² times. For a typical early-layer input: H=56, W=56, C=32, K=3: original input = 56×56×32 = 100,352 bytes. im2col matrix = 56×56 × 3×3×32 = 903,168 bytes — 9× larger. On an MCU with 320 KB SRAM, this is impossible. A 3×3 conv would need 883 KB just for im2col — 2.8× the entire SRAM.

Direct convolution avoids this entirely. It computes the sliding window directly in a 6-deep nested loop, never materializing the im2col matrix:

c
// Direct convolution — zero extra memory
for (n = 0; n < N_batch; n++)
  for (oh = 0; oh < H_out; oh++)
    for (ow = 0; ow < W_out; ow++)
      for (oc = 0; oc < C_out; oc++) {
        int acc = 0;
        for (kh = 0; kh < K_h; kh++)
          for (kw = 0; kw < K_w; kw++)
            for (ic = 0; ic < C_in; ic++)
              acc += X[n][oh*s+kh][ow*s+kw][ic]
                   * W[oc][kh][kw][ic];
        Y[n][oh][ow][oc] = acc;
      }

Direct convolution still benefits from loop reordering (put ic innermost for NHWC layout), tiling (fit a tile of the input and kernel in L1), and unrolling/SIMD on the ic dimension. TinyEngine implements direct conv as its default for all layers, supplemented by layer-specific hand-tuned kernels for point-wise (1×1) and depthwise convolutions.

im2col Memory Explosion vs Direct Conv

Drag sliders to see how the im2col matrix size scales with K (kernel size) and C_in (input channels). The red bar shows SRAM needed — the green bar shows total MCU SRAM (320 KB).

Kernel K 3

C_in channels 32

Spatial H=W 56

A conv layer has input 56×56×64 (NHWC, INT8). A 3×3 kernel, stride 1, same padding gives output 56×56×128. Compute: (a) input activation bytes, (b) im2col matrix bytes, (c) the duplication factor. Does the im2col matrix fit in a 320 KB MCU?

(a) 200 KB input; (b) 200 KB im2col — no duplication because stride=1 means no overlap. (a) 200 KB input; (b) 56×56×(3×3×64) = 1,806 KB im2col; 9× factor. Does not fit (1806 KB >> 320 KB). (a) 56×56×64 = 200 KB; (b) 56×56×(3×3×64) = 1,806 KB; 9× duplication. No — 1,806 KB >> 320 KB total SRAM. (a) 56×56×64×4 = 800 KB FP32; (b) 7,224 KB im2col FP32. In INT8 both are 4× smaller but still don't fit.

Chapter 5: NCHW vs NHWC — Layout is a Performance Choice

A 4D activation tensor N×C×H×W can be stored in two standard layouts. The choice determines which indices are contiguous in memory — and therefore which loops are cache-friendly.

NCHW (PyTorch default): elements are stored as: all pixels of channel 0 (row 0 col 0, row 0 col 1, ..., row H-1 col W-1), then all pixels of channel 1, etc. Consecutive memory = consecutive spatial positions within one channel.

NHWC (TensorFlow default, TinyEngine default for point-wise conv): elements are stored as: all channels at position (row 0, col 0), then all channels at (row 0, col 1), etc. Consecutive memory = all channels at one spatial position.

Which layout wins depends on the convolution type:
Point-wise conv (1×1): Each output pixel = dot product of all C input channels at that position. The inner loop iterates over channels. NHWC puts all channels at one position contiguously → inner loop is stride-1. NCHW scatters channels across H×W stride → inner loop jumps by H×W = expensive.
Depthwise conv (C independent 3×3 filters): Each output pixel = 3×3 spatial window within one channel. The inner loop iterates over the spatial window. NCHW puts all pixels of one channel contiguously → inner loop is stride-1. NHWC interleaves channels between spatial positions → inner loop jumps by C = expensive.

TinyEngine uses NHWC for point-wise convolution and NCHW for depthwise convolution. This is not a single compromise — it's a split layout strategy. The model's activations are stored in NHWC, but before running a depthwise layer, TinyEngine reorders the data to NCHW. The transpose cost (one pass over the activation) is paid back many times over by the improved kernel speed.

NCHW vs NHWC Memory Access Pattern

A 3×3×4 activation (H=3, W=3, C=4). Click to animate the access sequence for point-wise (inner loop = C) and depthwise (inner loop = spatial) convolutions. Green = cache hit, red = cache miss.

A depthwise convolution loops over: (outer) channels C, (inner) spatial window positions within the channel. Which layout is cache-friendly for this access pattern and why?

NHWC: all channels at one spatial position are contiguous, so the channel loop is stride-1. NCHW: all spatial positions of one channel are contiguous. The inner spatial loop is stride-1, and the outer channel loop moves to the next contiguous channel block — cache-friendly. Neither layout makes a difference for depthwise — depthwise conv has no inner channel loop so layout doesn't affect locality. NHWC is always better because it matches PyTorch's default and PyTorch is well-optimized.

Chapter 6: In-place Depthwise Convolution

MobileNetV2 uses inverted residual blocks. Each block has three layers: a 1×1 pointwise expand (channels × expansion factor, typically 6×), a 3×3 depthwise conv, and a 1×1 pointwise project (back to bottleneck channels). The peak SRAM within one block is dominated by the expanded tensor after the first 1×1 layer.

Consider a typical MobileNetV2 early block: bottleneck channels = 16, expansion = 6×, spatial = 56×56. After the pointwise expand, the tensor is 56×56×96 (96 = 16×6). To run the depthwise conv, you need both the input tensor (56×56×96) and the output tensor (56×56×96) in SRAM simultaneously — 2 × 56 × 56 × 96 = 602,112 bytes = 588 KB. That's nearly 2× the entire 320 KB SRAM.

The observation that enables in-place: Depthwise convolution is channel-independent. Channel k of the output depends only on channel k of the input (and the 3×3 kernel for channel k). There is zero dependency between different channels. So you can compute channel k of the output, overwrite channel k of the input, and then compute channel k+1 — the output of channel k never needs to coexist with the input of channel k+1.

This leads to in-place depthwise convolution: process channels one at a time, writing each output channel back into the input buffer. You only need a single-channel temporary buffer to hold the 3×3 window of input while you're computing one output position.

Standard depthwise SRAM = 2 × C × H × W = 2 × 96 × 56 × 56 = 602,112 bytes

In-place depthwise SRAM = (1 + ¹⁄_C) × C × H × W ≈ (C+1) × H × W

For C=96, H=W=56: in-place SRAM = (96+1) × 56 × 56 = 304,304 bytes = 297 KB. That's a reduction from 588 KB to 297 KB — nearly half. For larger expansion factors the saving grows: at 6× expansion (C=96), a depthwise conv that would have needed 588 KB now needs only 297 KB, just fitting in 320 KB SRAM.

The implementation requires careful ordering: to write the output of channel k into the input's channel-k slot, you must have finished reading all input values in channel k's 3×3 convolution window. Since the 3×3 kernel has stride 1, and you process positions top-to-bottom left-to-right, the position (h,w) is safe to overwrite after the kernel pass that computes output at (h + K_h/2, w + K_w/2) — i.e., you're always 1 row of kernel "ahead" of your overwrite. A one-row temporary buffer ensures correctness.

In-place Depthwise — SRAM Savings Visualizer

Adjust expansion ratio and spatial size. The chart shows standard vs in-place SRAM usage against the 320 KB MCU budget.

Expansion ratio 6×

Bottleneck C 16

Spatial H=W 56

A MobileNetV2 block has bottleneck channels B=32, expansion factor t=6, spatial 28×28. (a) What is the expanded channel count? (b) What is the standard depthwise SRAM? (c) What is the in-place depthwise SRAM (in KB)?

(a) C=192; (b) 2×192×28×28 = 301,056 bytes = 294 KB; (c) 193×28×28 = 151,312 bytes = 148 KB. (a) C=192; (b) 192×28×28 = 150,528 bytes — only need to store the input. (c) Same as (b). (a) C = B×t = 32×6 = 192; (b) 2×192×28×28 = 301,056 bytes ≈ 294 KB; (c) (192+1)×28×28 = 151,312 bytes ≈ 148 KB. Saving: 146 KB. (a) C=32; (b) 2×32×28×28 = 50,176 bytes; (c) 33×28×28 = 25,872 bytes. (The expansion isn't applied before depthwise.)

Chapter 7: Operator Fusion — Eliminating Intermediate Writes

After a convolution layer, a neural network almost always applies batch normalization and then an activation function (ReLU or ReLU6). Three sequential operations. In a naive framework, each is a separate kernel call: conv writes an intermediate activation tensor, BN reads it and writes another intermediate tensor, ReLU reads that and writes the final output. Two extra write-then-read cycles for every layer.

On an MCU, every write to SRAM takes ~0.5 ns per byte. A 56×56×96 intermediate tensor is 300 KB. Writing it and reading it back costs 300 KB × 2 (write + read) × 0.5 ns/byte = 300 μs per layer. In a 52-layer MobileNetV2, this is 15.6 ms just in redundant memory traffic — not counting any computation.

Operator fusion merges conv + BN + ReLU into a single fused kernel. After computing each output value of the convolution, it immediately applies the BN scale-and-shift and the ReLU clamp — before writing the result to SRAM. The intermediate values live in CPU registers, never in SRAM. Memory traffic reduces by 2/3 for this group of operations.

Batch normalization at inference is simple: each output channel c has a learned scale γ[c] and bias β[c] (folded in from the running mean/variance during model export). The fused conv-BN-ReLU computes:

output[n,h,w,c] = ReLU6( conv(X)[n,h,w,c] × γ[c] + β[c] )

In code, this means the inner accumulation loop simply multiplies by γ[c] and adds β[c] before the ReLU clamp, with no intermediate buffer:

c — fused conv+BN+ReLU
for (oc = 0; oc < C_out; oc++) {
  int32_t acc = bias[oc];
  for (kh..kw..ic)                    // inner conv loops
    acc += X[oh+kh][ow+kw][ic] * W[oc][kh][kw][ic];
  // BN: scale + shift (γ, β are folded into INT8 scale factor)
  acc = acc * scale[oc] + shift[oc];
  // ReLU6 clamp: output range [0, 6] in quantized domain
  Y[oh][ow][oc] = clamp(acc, 0, 255);  // write ONCE to SRAM
}

What fusion saves: For a 52-layer MobileNetV2 deployed with conv+BN+ReLU groups: unfused = 3 separate kernel calls × 52 layers = 156 kernel launches + 104 intermediate tensor writes (2 per group × 52). Fused = 52 kernel calls, 0 intermediate writes. Memory traffic for intermediates: ~15.6 ms eliminated. Kernel launch overhead: ~2× reduction. Combined speedup contribution: 1.5–2× for this component alone.

A conv-BN-ReLU fusion eliminates intermediate activation buffers by keeping intermediate values in registers. This works because conv, BN, and ReLU are all "point-wise" in the output dimension — each output value depends on no other output value. Which of the following operators CANNOT be fused this way?

ReLU — it depends on the previous layer's output. Batch normalization at inference — it has per-channel scale and shift. Bias addition — just an add per output value. Global average pooling — the output value at position (0,0,c) depends on ALL spatial positions of channel c, requiring the full spatial feature map to be read before any output can be written.

Chapter 8: Showcase: TinyEngine Profiler

This chapter brings together all the optimizations. Each bar in the profiler represents the time contribution of one layer of MCUNet-tiny, broken down by which bottleneck limits it: memory-bound (cache misses dominate) or compute-bound (ALU utilization dominates). Toggle optimizations on/off to see how each one changes the profile — and the total inference time.

TinyEngine Layer Profiler — Toggle Optimizations

Check/uncheck optimizations to see their impact on per-layer inference time. The roofline bar (right) shows where each layer sits: memory-bound layers benefit from layout/in-place fixes; compute-bound layers benefit from SIMD and tiling.

Loop reorder Tiling (T=32) SIMD INT8 In-place DW Op fusion

A few patterns to observe as you toggle:

Loop reordering helps most on large pointwise conv layers (high C, moderate H/W) — these are most cache-miss-limited in the naive loop order.
Tiling compounds with reordering — the combined speedup exceeds the product because tiling enables register allocation that the compiler can exploit.
SIMD INT8 has the largest single-optimization impact on compute-bound layers (channels > 64, small spatial).
In-place depthwise doesn't reduce time directly — it reduces SRAM so that layers don't need to spill to slower memory tiers.
Operator fusion shrinks the "overhead" bars (launch cost + intermediate write) that otherwise accumulate across all 52 layers.

Chapter 9: Connections & Cheat Sheet

Systems Cheat Sheet

Technique	What it fixes	Key number	MCU-specific note
Loop reordering	Column-wise (stride-N) cache access	12× speedup (i,k,j vs i,j,k)	Critical for NCHW matmul on MCU with 8 KB L1
Loop tiling	Working set exceeds L1 cache	19× cumulative; tile size T ≤ sqrt(L1/(3×elem_size))	MCU L1 = 8 KB → T≤26 for FP32, T≤52 for INT8
Loop unrolling	Branch/pointer overhead in inner loop	2.85× speedup; code size ×N	Tradeoff: Flash is small on MCUs — don't over-unroll
SIMD (NEON)	1 MAC per cycle → 16 MACs/cycle (INT8)	4× FP32, 16× INT8 per lane	ARM Cortex-M7/M55 has 128-bit NEON. Use `vmull_s8`
Direct conv	im2col K²×C SRAM blow-up	K²×C× duplication → zero with direct	im2col is infeasible on MCUs. Direct conv is the default.
NHWC for PW / NCHW for DW	Wrong layout for the conv type	PW inner loop = C (want NHWC); DW inner loop = spatial (want NCHW)	TinyEngine splits layout: stores NHWC, transposes for DW
In-place depthwise	2×C×H×W SRAM for DW conv	Saves ~50% SRAM: 2×C → (1+1/C)×C ≈ C	Needs 1-row temp buffer; requires NCHW channel-independent order
Operator fusion	Intermediate buffer writes (BN, ReLU)	Eliminates 2 buffer write/reads per conv-BN-ReLU group	Values stay in registers; only applicable when ops are point-wise
Code generation	Interpreter overhead; large runtime binary	TFLM: 300+ KB Flash; TinyEngine: <100 KB; no dispatch overhead	Generates layer-specific C code per model — no general dispatch

Code Generation vs Interpretation

TensorFlow Lite Micro is a general-purpose interpreter: it reads a flatbuffer model file at runtime, iterates over an ops list, and dispatches each op to a registered function via function pointers. This generality costs Flash space (300–400 KB for the runtime alone) and runtime overhead (op dispatch, buffer allocation, type checking).

TinyEngine's approach is different: given a specific model, it generates a C function that hardcodes the exact sequence of layer calls, with parameters inlined as constants. The generated code looks like:

c — TinyEngine generated code (model-specific, no interpreter)
void run_mcunet(const uint8_t* input) {
  // Layer 0: pointwise expand, C_in=32, C_out=192, HW=28
  tinyengine_conv_pw_nhwc(input, WEIGHTS_L0, buf0, 28, 28, 32, 192);
  // Layer 1: in-place depthwise, K=3, C=192, HW=28
  tinyengine_dw_inplace_nchw(buf0, WEIGHTS_L1, 28, 28, 192);
  // Layer 2: pointwise project, C_in=192, C_out=32, HW=28
  tinyengine_conv_pw_nhwc(buf0, WEIGHTS_L2, buf1, 28, 28, 192, 32);
  // ... 49 more layers
}

No runtime model parsing, no op dispatch, no buffer allocator. The kernel functions themselves are also code-generated with dimensions as compile-time constants, enabling the compiler to fully unroll inner loops and allocate registers optimally.

Co-design with NAS and Quantization

TinyEngine does not operate in isolation. It is the "execution" half of MCUNet's co-design loop. TinyNAS (L10) finds the architecture; TinyEngine executes it. The optimizations here — in-place depthwise, NCHW for depthwise — directly informed TinyNAS's search space: the NAS must produce architectures that use depthwise convolution (to benefit from in-place) and inverted residuals (to benefit from the NHWC/NCHW split).

Quantization (L8–L9) is the third co-design axis. INT8 quantization makes SIMD 4× more efficient (16 values per 128-bit NEON register vs 4 for FP32) and reduces the working set size for tiling. In-place depthwise's savings are also 4× larger in byte terms with INT8 vs FP32.

Bridge to LLM Efficiency (L12+)

Every technique here reappears at LLM scale, just with different constants. Flash Attention (L12) is operator fusion applied to the QK^T matmul and softmax, keeping the attention matrix in SRAM tiles. vLLM's paged attention is loop tiling over the KV cache. INT4/INT8 LLM quantization exploits SIMD via tensor-core instructions. The code-generation philosophy lives in kernel libraries like Triton and TensorRT, which JIT-compile model-specific CUDA kernels. The physics is the same: memory hierarchy, SIMD width, and working-set size constrain everything.

Related Gleams to continue:
• TinyML L10: MCUNet — the architecture that TinyEngine runs (TinyNAS + patch-based inference)
• TinyML L2: Pruning — reducing FLOPs so the engine has less work to optimize
• CS229S: Efficient Architectures — FlashAttention and GPU-side operator fusion
• Day in the Life: ML Inference Engineer — production deployment applying these techniques

"What I cannot create, I do not understand. What I cannot run in 93 milliseconds on 320 KB of SRAM, I have not truly optimized."
— paraphrase of Richard Feynman, adapted for TinyML