TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 11

TinyEngine: Inference Systems for Tiny Devices

MCUNet found an architecture that fits in 320 KB of SRAM. But fitting and running fast are two different things. The same model can be 10× slower on a stock runtime library than on TinyEngine — with zero accuracy loss. This lesson derives every kernel trick that makes the difference: loop reordering for cache locality, tiling to fit the working set in SRAM, im2col vs. direct convolution, NCHW/NHWC layout for vectorization, in-place depthwise convolution to halve the activation buffer, and code generation to eliminate interpreter overhead.

Prerequisites: TinyML L1 (Efficiency Metrics), TinyML L10 (MCUNet) — SRAM/Flash constraints, conv layers, activation memory.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Runtime Gap

You just ran MCUNet's TinyNAS search. The output: a MobileNetV2-style architecture with a 128×128 input, 0.5 width multiplier, and a peak SRAM of 113 KB — comfortably under the STM32F746's 320 KB. You flash the model. You run inference. It takes 1.2 seconds per image.

For visual wake-word detection, you need <200 ms. Your model fits; it just doesn't run fast enough. You haven't touched the architecture, the weights, or the quantization. You change only the inference engine — the runtime library that executes the model — and the time drops to 93 ms. A 12.9× speedup with zero accuracy change.

The key insight: FLOPs (floating-point operations) measure how much arithmetic a model requires. But wall-clock time is determined by how efficiently the hardware executes that arithmetic. The same FLOPs can take 10× longer if the data is in the wrong memory tier (DRAM vs cache), the loop order causes repeated cache misses, or the runtime framework brings megabytes of interpreter overhead that doesn't fit in Flash.

Two distinct problems conspire to make stock runtimes slow on MCUs:

Problem 1 — The kernel is cache-unfriendly. A Cortex-M7 has 8 KB of L1 data cache. A naive matrix multiply reads memory in a pattern that thrashes this cache. Each row of B is loaded, evicted, and reloaded N² times. Reordering three loop indices — a trivial 3-line code change — can give a 12× speedup purely from better cache behavior.

Problem 2 — The library itself is too big. TensorFlow Lite Micro (TFLM) is a general interpreter: it reads an ops list at runtime, dispatches to function pointers, allocates buffers dynamically. This runtime machinery occupies 300–400 KB of Flash, leaving no room for weights. TinyEngine generates code — it compiles the model's specific layer sequence into a C function, eliminating all dispatch overhead and cutting the binary to <100 KB.

Roofline intuition: Every layer of a neural network is either compute-bound (the FLOPs are the bottleneck — you max out the ALU before memory can keep up) or memory-bound (data movement is the bottleneck — memory can't feed the ALU fast enough). Loop tiling and SIMD attack compute-bound layers. Data layout and in-place tricks attack memory-bound layers. You must know which regime each layer is in before choosing an optimization.
The Memory Hierarchy Latency Cliff

Data access latency on a Cortex-M7 MCU vs a laptop. Notice how deep the cliff is — DRAM is 200× slower than L1 cache. Every cache miss is a stall.

The canvas above shows why cache behavior dominates MCU performance. An MCU has only 8 KB of L1 cache. If your working set exceeds this, data must come from SRAM (100 ns) or Flash (150+ ns). Loop tiling's entire purpose is to keep the working set inside L1.

TinyEngine achieves a 12.9× speedup over TensorFlow Lite Micro on MCUNet-tiny. The model, weights, and quantization are identical in both cases. What two categories of improvement account for this speedup?

Chapter 1: Loop Reordering — Free Speedup from Cache Locality

Matrix multiplication is the inner loop of every dense layer and every convolution (once rewritten as GEMM). The naive triple loop looks like this:

c
// Naive: i, j, k order
for (i = 0; i < M; i++)
  for (j = 0; j < N; j++) {
    float acc = 0;
    for (k = 0; k < K; k++)
      acc += A[i][k] * B[k][j];   // B accessed column-wise: BAD
    C[i][j] = acc;
  }

This looks innocent, but the inner loop k accesses B[k][j] — that's a column of B. In row-major layout (how C/C++ store 2D arrays), consecutive elements of a row are adjacent in memory. But a column means jumping by N elements between accesses. For a 512×512 matrix, each step in k jumps 512 floats = 2048 bytes. With an 8 KB L1 cache, the working row of B is evicted after every 4 steps. For N=512: 512³ ÷ 4 = 33 million cache misses in the inner loop alone.

The fix is embarrassingly simple: swap j and k in the loop order. Now the inner loop runs j, accessing B[k][j] as a row — sequential, cache-friendly. The same arithmetic, the same FLOPs. Measured speedup on Intel Xeon: 12×. On ARM Cortex-M7 (8 KB L1): even higher because the cache is smaller and misses hurt more.
c
// Loop-reordered: i, k, j order (B accessed row-wise: GOOD)
for (i = 0; i < M; i++)
  for (k = 0; k < K; k++) {
    float Aik = A[i][k];           // hoist scalar out of inner loop
    for (j = 0; j < N; j++)
      C[i][j] += Aik * B[k][j];   // B[k][j]: row j — sequential!
  }

Why does hoisting Aik out of the inner loop matter? Because A[i][k] is now accessed once per k iteration instead of N times. Compilers sometimes do this automatically, but explicit hoisting guarantees it. On real MCUs, the scalar fits in a register — zero memory accesses for it in the inner loop.

The principle generalizes: when two indices both contribute to memory access, order the loops so the innermost index drives sequential (stride-1) memory access. For NHWC layout tensors, this means the channel (C) index should be innermost; for NCHW tensors, the spatial (W) index.

In row-major (C-style) storage, array B[K][N] stores elements as: B[0][0], B[0][1], ..., B[0][N-1], B[1][0], ... The naive loop accesses B[k][j] with k in the inner loop. Why is this a cache miss pattern and what is the stride between consecutive inner-loop accesses?

Chapter 2: Loop Tiling — Fitting the Working Set in Cache

Loop reordering fixed column-access patterns, but there's a subtler problem: even with row-major access, if N is large (say 512), a single row of B is 512×4 = 2 KB. The loop over i accesses M=512 rows of A and M=512 rows of C. The total working set for a block of k iterations is M×K + K×N + M×N floats = hundreds of KB — far exceeding the 8 KB L1 cache. Once B's row is evicted, re-loading it is a cache miss.

Loop tiling (also called loop blocking) partitions the iteration space into tiles that fit in the cache. Instead of iterating all N elements of j in sequence, you iterate in tiles of size Tj. Each tile accesses only Tj columns of B and Tj columns of C — a working set of K×Tj + M×Tj floats.

Deriving the tile size from cache capacity. For an 8 KB L1 cache and FP32 (4 bytes/value): working set ≤ 8192 / 4 = 2048 floats. If you tile only j: working set for inner loops (i, k, j) = Ti×Tk (for A tile) + Tk×Tj (for B tile) + Ti×Tj (for C tile). Setting Ti=Tk=Tj=T: 3T² ≤ 2048 → T ≤ 26. Real implementations use T=16 or T=32 (power of 2 for alignment), with multi-level tiling for L2 as well.
c
#define BLK 32
for (ti = 0; ti < M; ti += BLK)
  for (tk = 0; tk < K; tk += BLK)
    for (tj = 0; tj < N; tj += BLK)
      // inner micro-kernel: BLK×BLK tile stays in L1
      for (i = ti; i < ti+BLK; i++)
        for (k = tk; k < tk+BLK; k++) {
          float Aik = A[i][k];
          for (j = tj; j < tj+BLK; j++)
            C[i][j] += Aik * B[k][j];
        }

With BLK=32: each B tile is 32×32×4 = 4 KB; A tile is 4 KB; C tile is 4 KB. Total 12 KB — exceeds 8 KB slightly. In practice BLK=16 (3 KB/tile, 9 KB total — ok for 8 KB) or careful register-level allocation. Measured speedup over naive: 19× on Intel Xeon; 8–12× on ARM Cortex-M.

For MCUs, there's an additional benefit: tiling can be set so the entire working set fits in SRAM without needing to bring more data from Flash. Since Flash has ~150 ns access latency vs SRAM's 0.5 ns, this is a 300× latency improvement for each cache miss that is eliminated.

Loop Tiling Visualizer — Tile Size vs Cache Utilization

Drag the tile-size slider to see how much of the working set fits in the L1 cache. The blue box = tile being computed; the red region = data spilling out of cache.

Tile size T 32
Matrix dim N 256
An MCU has 8 KB of L1 cache. You're tiling a matrix multiply for 3 tiles (A, B, C tiles) with tile size T×T in FP32 (4 bytes). What is the largest integer T such that all three tiles fit simultaneously in cache?

Chapter 3: Loop Unrolling & SIMD — One Instruction, Eight MACs

Tiling gets data into the right memory tier. But inside the micro-kernel, there's still overhead: every iteration of the inner loop pays for an increment (j++), a comparison (j < tj+BLK), and a branch. For a tight inner loop that runs billions of times, these 3 extra instructions per MAC add up. Loop unrolling replicates the loop body N times and steps the loop counter by N — eliminating 7/8 of the branching overhead for 8-way unrolling.

c
// Unrolled 4× over k — k steps by 4 each iteration
for (k = 0; k < K; k += 4) {
  acc += A[i][k]   * B[k][j];
  acc += A[i][k+1] * B[k+1][j];   // replicated body ×4
  acc += A[i][k+2] * B[k+2][j];
  acc += A[i][k+3] * B[k+3][j];
}

Loop unrolling alone gives ~2.85× speedup. But modern CPUs and MCUs go further: SIMD (Single Instruction, Multiple Data) applies one instruction to a vector of values simultaneously. ARM Cortex-M7 supports NEON intrinsics — 128-bit vector registers that hold 4 FP32 or 8 INT16 values. One vmulq_f32 instruction multiplies four pairs in one cycle instead of one pair.

c (ARM NEON)
// SISD: 1 multiply per instruction
for (k = 0; k < K; k++)
  acc += A[i][k] * B[k][j];

// SIMD with NEON: 4 multiplies per instruction
float32x4_t vacc = vdupq_n_f32(0);
for (k = 0; k < K; k += 4) {
  float32x4_t vA = vld1q_f32(&A[i][k]);     // load 4 floats from A
  float32x4_t vB = vld1q_f32(&B_T[j][k]);   // load 4 floats from B^T
  vacc = vmlaq_f32(vacc, vA, vB);            // 4 fused multiply-adds
}
// horizontal sum: acc = vacc[0]+vacc[1]+vacc[2]+vacc[3]
Measured cumulative speedups (Intel Xeon, loop-reordering as baseline): Loop reordering alone: 12×. + Loop tiling (BLK=32): 19×. + Loop unrolling (×8): 54×. + SIMD (SSE/NEON ×4): ~100–200×. These multiply, not add. The MCU gains are smaller (fewer pipeline stages, smaller SIMD width) but the ratios are similar.
Misconception — SIMD is only for floats: On MCUs running INT8 quantized models (which is standard for TinyML), SIMD is even more powerful. NEON supports 8-way INT16 or 16-way INT8 SIMD in 128-bit registers. A single vmull_s8 (multiply 8-wide INT8) achieves 16 MACs per cycle. This is the primary reason TinyEngine with INT8 runs 4–8× faster than TinyEngine with FP32 on ARM Cortex-M — not just smaller data, but more MACs per cycle.
ARM Cortex-M7 has 128-bit NEON SIMD registers. A loop processes INT8 values (1 byte each). How many INT8 elements fit in one NEON register, and by what factor does SIMD reduce the number of loop iterations compared to scalar INT8 processing?

Chapter 4: im2col vs Direct Convolution

Everything in the previous chapters optimized matrix multiplication. But a convolution layer is not a matrix multiply — it's a sliding window operation. The most common approach on GPUs is to convert it into a matrix multiply via im2col ("image to column"). Understanding im2col's memory cost reveals why TinyEngine avoids it on MCUs.

Consider a convolution: input X of shape H×W×Cin, kernel K of shape Kh×Kw×Cin×Cout, output Y of shape H'×W'×Cout. For each output position (h,w), the convolution computes a dot product between the kernel and a Kh×Kw×Cin patch of input. im2col "unrolls" these patches into columns of a matrix:

im2col(X): shape [H'×W', Kh×Kw×Cin]
Kernels reshaped: shape [Kh×Kw×Cin, Cout]
GEMM(im2col(X), kernels) = output Y, shape [H'×W', Cout]

The appeal: once X is unrolled into the im2col matrix, you can use any optimized GEMM routine (BLAS, cuBLAS, etc.). On GPUs this is a huge win because BLAS is highly optimized and massively parallel.

The fatal flaw on MCUs — memory duplication. A single input pixel might appear in up to Kh×Kw = 9 different patches (for a 3×3 kernel). The im2col matrix duplicates each input element up to K² times. For a typical early-layer input: H=56, W=56, C=32, K=3: original input = 56×56×32 = 100,352 bytes. im2col matrix = 56×56 × 3×3×32 = 903,168 bytes — 9× larger. On an MCU with 320 KB SRAM, this is impossible. A 3×3 conv would need 883 KB just for im2col — 2.8× the entire SRAM.

Direct convolution avoids this entirely. It computes the sliding window directly in a 6-deep nested loop, never materializing the im2col matrix:

c
// Direct convolution — zero extra memory
for (n = 0; n < N_batch; n++)
  for (oh = 0; oh < H_out; oh++)
    for (ow = 0; ow < W_out; ow++)
      for (oc = 0; oc < C_out; oc++) {
        int acc = 0;
        for (kh = 0; kh < K_h; kh++)
          for (kw = 0; kw < K_w; kw++)
            for (ic = 0; ic < C_in; ic++)
              acc += X[n][oh*s+kh][ow*s+kw][ic]
                   * W[oc][kh][kw][ic];
        Y[n][oh][ow][oc] = acc;
      }

Direct convolution still benefits from loop reordering (put ic innermost for NHWC layout), tiling (fit a tile of the input and kernel in L1), and unrolling/SIMD on the ic dimension. TinyEngine implements direct conv as its default for all layers, supplemented by layer-specific hand-tuned kernels for point-wise (1×1) and depthwise convolutions.

im2col Memory Explosion vs Direct Conv

Drag sliders to see how the im2col matrix size scales with K (kernel size) and C_in (input channels). The red bar shows SRAM needed — the green bar shows total MCU SRAM (320 KB).

Kernel K 3
C_in channels 32
Spatial H=W 56
A conv layer has input 56×56×64 (NHWC, INT8). A 3×3 kernel, stride 1, same padding gives output 56×56×128. Compute: (a) input activation bytes, (b) im2col matrix bytes, (c) the duplication factor. Does the im2col matrix fit in a 320 KB MCU?

Chapter 5: NCHW vs NHWC — Layout is a Performance Choice

A 4D activation tensor N×C×H×W can be stored in two standard layouts. The choice determines which indices are contiguous in memory — and therefore which loops are cache-friendly.

NCHW (PyTorch default): elements are stored as: all pixels of channel 0 (row 0 col 0, row 0 col 1, ..., row H-1 col W-1), then all pixels of channel 1, etc. Consecutive memory = consecutive spatial positions within one channel.

NHWC (TensorFlow default, TinyEngine default for point-wise conv): elements are stored as: all channels at position (row 0, col 0), then all channels at (row 0, col 1), etc. Consecutive memory = all channels at one spatial position.

Which layout wins depends on the convolution type:
Point-wise conv (1×1): Each output pixel = dot product of all C input channels at that position. The inner loop iterates over channels. NHWC puts all channels at one position contiguously → inner loop is stride-1. NCHW scatters channels across H×W stride → inner loop jumps by H×W = expensive.
Depthwise conv (C independent 3×3 filters): Each output pixel = 3×3 spatial window within one channel. The inner loop iterates over the spatial window. NCHW puts all pixels of one channel contiguously → inner loop is stride-1. NHWC interleaves channels between spatial positions → inner loop jumps by C = expensive.

TinyEngine uses NHWC for point-wise convolution and NCHW for depthwise convolution. This is not a single compromise — it's a split layout strategy. The model's activations are stored in NHWC, but before running a depthwise layer, TinyEngine reorders the data to NCHW. The transpose cost (one pass over the activation) is paid back many times over by the improved kernel speed.

NCHW vs NHWC Memory Access Pattern

A 3×3×4 activation (H=3, W=3, C=4). Click to animate the access sequence for point-wise (inner loop = C) and depthwise (inner loop = spatial) convolutions. Green = cache hit, red = cache miss.

A depthwise convolution loops over: (outer) channels C, (inner) spatial window positions within the channel. Which layout is cache-friendly for this access pattern and why?

Chapter 6: In-place Depthwise Convolution

MobileNetV2 uses inverted residual blocks. Each block has three layers: a 1×1 pointwise expand (channels × expansion factor, typically 6×), a 3×3 depthwise conv, and a 1×1 pointwise project (back to bottleneck channels). The peak SRAM within one block is dominated by the expanded tensor after the first 1×1 layer.

Consider a typical MobileNetV2 early block: bottleneck channels = 16, expansion = 6×, spatial = 56×56. After the pointwise expand, the tensor is 56×56×96 (96 = 16×6). To run the depthwise conv, you need both the input tensor (56×56×96) and the output tensor (56×56×96) in SRAM simultaneously — 2 × 56 × 56 × 96 = 602,112 bytes = 588 KB. That's nearly 2× the entire 320 KB SRAM.

The observation that enables in-place: Depthwise convolution is channel-independent. Channel k of the output depends only on channel k of the input (and the 3×3 kernel for channel k). There is zero dependency between different channels. So you can compute channel k of the output, overwrite channel k of the input, and then compute channel k+1 — the output of channel k never needs to coexist with the input of channel k+1.

This leads to in-place depthwise convolution: process channels one at a time, writing each output channel back into the input buffer. You only need a single-channel temporary buffer to hold the 3×3 window of input while you're computing one output position.

Standard depthwise SRAM = 2 × C × H × W = 2 × 96 × 56 × 56 = 602,112 bytes
In-place depthwise SRAM = (1 + 1C) × C × H × W ≈ (C+1) × H × W

For C=96, H=W=56: in-place SRAM = (96+1) × 56 × 56 = 304,304 bytes = 297 KB. That's a reduction from 588 KB to 297 KB — nearly half. For larger expansion factors the saving grows: at 6× expansion (C=96), a depthwise conv that would have needed 588 KB now needs only 297 KB, just fitting in 320 KB SRAM.

The implementation requires careful ordering: to write the output of channel k into the input's channel-k slot, you must have finished reading all input values in channel k's 3×3 convolution window. Since the 3×3 kernel has stride 1, and you process positions top-to-bottom left-to-right, the position (h,w) is safe to overwrite after the kernel pass that computes output at (h + K_h/2, w + K_w/2) — i.e., you're always 1 row of kernel "ahead" of your overwrite. A one-row temporary buffer ensures correctness.

In-place Depthwise — SRAM Savings Visualizer

Adjust expansion ratio and spatial size. The chart shows standard vs in-place SRAM usage against the 320 KB MCU budget.

Expansion ratio
Bottleneck C 16
Spatial H=W 56
A MobileNetV2 block has bottleneck channels B=32, expansion factor t=6, spatial 28×28. (a) What is the expanded channel count? (b) What is the standard depthwise SRAM? (c) What is the in-place depthwise SRAM (in KB)?

Chapter 7: Operator Fusion — Eliminating Intermediate Writes

After a convolution layer, a neural network almost always applies batch normalization and then an activation function (ReLU or ReLU6). Three sequential operations. In a naive framework, each is a separate kernel call: conv writes an intermediate activation tensor, BN reads it and writes another intermediate tensor, ReLU reads that and writes the final output. Two extra write-then-read cycles for every layer.

On an MCU, every write to SRAM takes ~0.5 ns per byte. A 56×56×96 intermediate tensor is 300 KB. Writing it and reading it back costs 300 KB × 2 (write + read) × 0.5 ns/byte = 300 μs per layer. In a 52-layer MobileNetV2, this is 15.6 ms just in redundant memory traffic — not counting any computation.

Operator fusion merges conv + BN + ReLU into a single fused kernel. After computing each output value of the convolution, it immediately applies the BN scale-and-shift and the ReLU clamp — before writing the result to SRAM. The intermediate values live in CPU registers, never in SRAM. Memory traffic reduces by 2/3 for this group of operations.

Batch normalization at inference is simple: each output channel c has a learned scale γ[c] and bias β[c] (folded in from the running mean/variance during model export). The fused conv-BN-ReLU computes:

output[n,h,w,c] = ReLU6( conv(X)[n,h,w,c] × γ[c] + β[c] )

In code, this means the inner accumulation loop simply multiplies by γ[c] and adds β[c] before the ReLU clamp, with no intermediate buffer:

c — fused conv+BN+ReLU
for (oc = 0; oc < C_out; oc++) {
  int32_t acc = bias[oc];
  for (kh..kw..ic)                    // inner conv loops
    acc += X[oh+kh][ow+kw][ic] * W[oc][kh][kw][ic];
  // BN: scale + shift (γ, β are folded into INT8 scale factor)
  acc = acc * scale[oc] + shift[oc];
  // ReLU6 clamp: output range [0, 6] in quantized domain
  Y[oh][ow][oc] = clamp(acc, 0, 255);  // write ONCE to SRAM
}
What fusion saves: For a 52-layer MobileNetV2 deployed with conv+BN+ReLU groups: unfused = 3 separate kernel calls × 52 layers = 156 kernel launches + 104 intermediate tensor writes (2 per group × 52). Fused = 52 kernel calls, 0 intermediate writes. Memory traffic for intermediates: ~15.6 ms eliminated. Kernel launch overhead: ~2× reduction. Combined speedup contribution: 1.5–2× for this component alone.
A conv-BN-ReLU fusion eliminates intermediate activation buffers by keeping intermediate values in registers. This works because conv, BN, and ReLU are all "point-wise" in the output dimension — each output value depends on no other output value. Which of the following operators CANNOT be fused this way?

Chapter 8: Showcase: TinyEngine Profiler

This chapter brings together all the optimizations. Each bar in the profiler represents the time contribution of one layer of MCUNet-tiny, broken down by which bottleneck limits it: memory-bound (cache misses dominate) or compute-bound (ALU utilization dominates). Toggle optimizations on/off to see how each one changes the profile — and the total inference time.

TinyEngine Layer Profiler — Toggle Optimizations

Check/uncheck optimizations to see their impact on per-layer inference time. The roofline bar (right) shows where each layer sits: memory-bound layers benefit from layout/in-place fixes; compute-bound layers benefit from SIMD and tiling.

A few patterns to observe as you toggle:

Chapter 9: Connections & Cheat Sheet

Systems Cheat Sheet

TechniqueWhat it fixesKey numberMCU-specific note
Loop reorderingColumn-wise (stride-N) cache access12× speedup (i,k,j vs i,j,k)Critical for NCHW matmul on MCU with 8 KB L1
Loop tilingWorking set exceeds L1 cache19× cumulative; tile size T ≤ sqrt(L1/(3×elem_size))MCU L1 = 8 KB → T≤26 for FP32, T≤52 for INT8
Loop unrollingBranch/pointer overhead in inner loop2.85× speedup; code size ×NTradeoff: Flash is small on MCUs — don't over-unroll
SIMD (NEON)1 MAC per cycle → 16 MACs/cycle (INT8)4× FP32, 16× INT8 per laneARM Cortex-M7/M55 has 128-bit NEON. Use vmull_s8
Direct convim2col K²×C SRAM blow-upK²×C× duplication → zero with directim2col is infeasible on MCUs. Direct conv is the default.
NHWC for PW / NCHW for DWWrong layout for the conv typePW inner loop = C (want NHWC); DW inner loop = spatial (want NCHW)TinyEngine splits layout: stores NHWC, transposes for DW
In-place depthwise2×C×H×W SRAM for DW convSaves ~50% SRAM: 2×C → (1+1/C)×C ≈ CNeeds 1-row temp buffer; requires NCHW channel-independent order
Operator fusionIntermediate buffer writes (BN, ReLU)Eliminates 2 buffer write/reads per conv-BN-ReLU groupValues stay in registers; only applicable when ops are point-wise
Code generationInterpreter overhead; large runtime binaryTFLM: 300+ KB Flash; TinyEngine: <100 KB; no dispatch overheadGenerates layer-specific C code per model — no general dispatch

Code Generation vs Interpretation

TensorFlow Lite Micro is a general-purpose interpreter: it reads a flatbuffer model file at runtime, iterates over an ops list, and dispatches each op to a registered function via function pointers. This generality costs Flash space (300–400 KB for the runtime alone) and runtime overhead (op dispatch, buffer allocation, type checking).

TinyEngine's approach is different: given a specific model, it generates a C function that hardcodes the exact sequence of layer calls, with parameters inlined as constants. The generated code looks like:

c — TinyEngine generated code (model-specific, no interpreter)
void run_mcunet(const uint8_t* input) {
  // Layer 0: pointwise expand, C_in=32, C_out=192, HW=28
  tinyengine_conv_pw_nhwc(input, WEIGHTS_L0, buf0, 28, 28, 32, 192);
  // Layer 1: in-place depthwise, K=3, C=192, HW=28
  tinyengine_dw_inplace_nchw(buf0, WEIGHTS_L1, 28, 28, 192);
  // Layer 2: pointwise project, C_in=192, C_out=32, HW=28
  tinyengine_conv_pw_nhwc(buf0, WEIGHTS_L2, buf1, 28, 28, 192, 32);
  // ... 49 more layers
}

No runtime model parsing, no op dispatch, no buffer allocator. The kernel functions themselves are also code-generated with dimensions as compile-time constants, enabling the compiler to fully unroll inner loops and allocate registers optimally.

Co-design with NAS and Quantization

TinyEngine does not operate in isolation. It is the "execution" half of MCUNet's co-design loop. TinyNAS (L10) finds the architecture; TinyEngine executes it. The optimizations here — in-place depthwise, NCHW for depthwise — directly informed TinyNAS's search space: the NAS must produce architectures that use depthwise convolution (to benefit from in-place) and inverted residuals (to benefit from the NHWC/NCHW split).

Quantization (L8–L9) is the third co-design axis. INT8 quantization makes SIMD 4× more efficient (16 values per 128-bit NEON register vs 4 for FP32) and reduces the working set size for tiling. In-place depthwise's savings are also 4× larger in byte terms with INT8 vs FP32.

Bridge to LLM Efficiency (L12+)

Every technique here reappears at LLM scale, just with different constants. Flash Attention (L12) is operator fusion applied to the QK^T matmul and softmax, keeping the attention matrix in SRAM tiles. vLLM's paged attention is loop tiling over the KV cache. INT4/INT8 LLM quantization exploits SIMD via tensor-core instructions. The code-generation philosophy lives in kernel libraries like Triton and TensorRT, which JIT-compile model-specific CUDA kernels. The physics is the same: memory hierarchy, SIMD width, and working-set size constrain everything.

Related Gleams to continue:
TinyML L10: MCUNet — the architecture that TinyEngine runs (TinyNAS + patch-based inference)
TinyML L2: Pruning — reducing FLOPs so the engine has less work to optimize
CS229S: Efficient Architectures — FlashAttention and GPU-side operator fusion
Day in the Life: ML Inference Engineer — production deployment applying these techniques
"What I cannot create, I do not understand. What I cannot run in 93 milliseconds on 320 KB of SRAM, I have not truly optimized."
— paraphrase of Richard Feynman, adapted for TinyML