TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 8

NAS II: Hardware-Aware & Once-for-All Networks

L7's NAS found architectures that minimize FLOPs — but FLOPs don't run on hardware, operations do. Two MABs with identical FLOPs can differ 4× in measured latency. This lesson closes that gap: hardware-aware search, differentiable latency (ProxylessNAS), and a network so elastic it can serve 10¹⁹ device configurations without retraining (Once-for-All).

Prerequisites: TinyML L7 (NAS I) — DARTS, supernet, weight sharing, search space. MAC counting from L2 helpful.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: FLOPs ≠ Latency

L7's NAS worked. You ran a search, you got an architecture with 260M FLOPs, you trained it, it hits 75% top-1 on ImageNet. Now your PM asks: "Can we deploy this on a Snapdragon 888 phone with a 30ms latency budget?" You run the model. It takes 174ms. Three minutes later she says: "Also, one of our clients uses Nvidia 1080Ti for cloud inference — different latency budget."

Here is the brutal truth: the search optimized the wrong thing. FLOPs count arithmetic operations. Latency measures wall-clock time on a specific piece of silicon. These are not the same thing, and the gap is not small. Two architectures with identical FLOPs can differ by a factor of 4× in measured latency on the same device — because latency depends on memory access patterns, parallelism, cache behavior, and operator-level hardware support, none of which FLOPs capture.

The proxy problem: Previous NAS (NASNet: 48K GPU-hours; DARTS: 4 GPU-days) searched on proxy objectives — CIFAR-10, low depth, few epochs, FLOPs. The resulting architectures were optimal for the proxy, not for your actual target hardware. ProxylessNAS (Cai et al., ICLR 2019) was the first to search directly on ImageNet with the target device latency in the loop.

The canvas below shows real data from ProxylessNAS (Titan xp GPU). Two ways to scale a transformer: increase the number of layers (depth scaling) or increase the hidden dimension (width scaling). Both trajectories cross FLOPs from 159M to 326M — nearly identical FLOPs. But their latency trajectories diverge drastically.

FLOPs vs Latency — same FLOPs, wildly different latency

Two architectures at each FLOPs point: one scales depth (adds layers), one scales hidden dimension (wider layers). On an Nvidia Titan xp GPU, depth scaling causes latency to grow linearly — hidden-dim scaling keeps latency flat because wider matrix multiplications saturate GPU SIMD throughput.

Why does depth scaling hurt so much? Layers execute sequentially — layer N must finish before layer N+1 begins. GPUs are throughput machines: they want one massive parallel computation, not many sequential small ones. Hidden-dim scaling makes each matrix multiply wider, which the GPU can exploit with tensor cores running thousands of multiply-accumulates in a single cycle. At 259M FLOPs, the depth-scaled variant takes 174ms; the width-scaled variant takes 51.7ms. That is a 3.4× latency difference at identical FLOPs.

This is the central insight of hardware-aware NAS: the search objective must include measured latency on the target device. Not FLOPs, not MACs, not parameter count — actual milliseconds on actual silicon. The rest of this lesson is the engineering story of how to make that measurement scalable and differentiable.

Two versions of the same problem: We need to run NAS for hundreds of different hardware targets — phones, GPUs, FPGAs, MCUs, cloud accelerators. Measuring latency on the real device during search is expensive. Measuring it after search means we can only search for one target at a time. This lesson covers two solutions: (1) a latency lookup table that makes measurement cheap enough to use in the search loop, and (2) Once-for-All, which decouples training from search entirely so you train once and deploy everywhere.
Two transformer variants have 259M FLOPs. Variant A scales hidden dimension; Variant B scales depth. On an Nvidia GPU, which is faster and why?

Chapter 1: Measuring Hardware Latency

The obvious solution: just measure latency on the real device during search. Send each candidate architecture to the phone, run it, record the milliseconds, use that as the score. MNASNet (Tan et al., CVPR 2019) did exactly this. The result: 40,000 GPU-hours to search for a single mobile architecture. Why so expensive? Because each candidate requires a separate round-trip to the device, and you need thousands of candidates to explore the space well.

Can we do better? ProxylessNAS proposes two approaches that make latency estimation cheap enough to run inside the search loop.

Method 1: Layer-wise Lookup Table
Profile each candidate operation (conv3, conv5, DW3, skip, …) on the target device ONCE. Store [op_type → measured_latency_ms]. Total network latency = Σ_i lat(op_i). Additive assumption: layers don't overlap.
↓ or, for non-additive architectures
Method 2: Latency Prediction Model
Build a small neural net that takes [kernel_size, #layers, hidden_dim, #heads, …] as input and outputs a predicted latency. Trained on ~100–1,000 (arch, measured_latency) pairs. Generalizes to architectures not profiled.
↓ either way
Search can use latency as a fast, differentiable signal
Lookup: O(1) per candidate. Predictor: O(1) neural net forward pass. Both are 1,000× cheaper than measuring on-device.
Why the lookup table works: Modern networks are composed of near-independent layers with no overlapping execution on most hardware. The total latency is well-approximated by the sum of per-layer latencies. ProxylessNAS measured this empirically on ARM CPU, GPU, and specialized accelerators — the additive model predicts real latency with R² > 0.99 on all three.

The lookup table for a single hardware target looks like this:

python
# Profiled once on Snapdragon 888, resolution 224×224
LATENCY_TABLE = {
    'conv_3x3_c64':  2.3,   # ms
    'conv_5x5_c64':  4.1,
    'dw_3x3_c64':   0.8,   # depthwise: much cheaper
    'dw_5x5_c64':   1.2,
    'identity':      0.05,  # skip connection
    'max_pool_3x3': 0.3,
}

def score_architecture(arch):
    """arch = list of op names, one per layer."""
    total_lat = 0.0
    for op in arch:
        total_lat += LATENCY_TABLE[op]
    return total_lat  # ms

# Example: 12-layer network
arch = ['dw_3x3_c64', 'conv_3x3_c64', 'dw_5x5_c64',
        'identity', 'conv_3x3_c64', 'dw_3x3_c64',
        'conv_5x5_c64', 'identity', 'dw_3x3_c64',
        'conv_3x3_c64', 'dw_5x5_c64', 'identity']

latency = score_architecture(arch)
# = 0.8+2.3+1.2+0.05+2.3+0.8+4.1+0.05+0.8+2.3+1.2+0.05 = 15.95 ms
# One lookup: ~microseconds. On-device measurement: ~seconds. ~10,000× faster.

The key insight: you profile each operation once per device, offline. During the search — which runs thousands or millions of candidate evaluations — every latency estimate is a single dictionary lookup. The per-search overhead is essentially zero.

New-target cost: To support a new hardware platform, you need to re-profile each operation on that device — typically 20–100 operations × a few seconds per measurement = a few minutes of profiling. Then the existing search runs immediately on the new lookup table. No retraining. No new search from scratch.
A network has 12 layers. Re-running NAS with on-device measurement takes 8 hours per target. Profiling a latency lookup table takes 5 minutes per target. Once the table is built, each of the 50,000 candidates in the search costs one lookup (microseconds). What is the total time to support 10 hardware targets with the lookup-table approach?

Chapter 2: Multi-Objective Search

Now that latency is cheap to estimate, we need to decide how to incorporate it into the search objective. The naive approach — minimize latency while ignoring accuracy — returns a trivially small network that does nothing useful. The correct framing is multi-objective optimization: find architectures on the Pareto frontier between accuracy and latency, then select the best point for your hardware budget.

There are two common formulations. The first is a hard constraint: maximize accuracy subject to latency ≤ T. This is clean but brittle — a tiny constraint violation disqualifies otherwise excellent architectures. The second (used in ProxylessNAS and MNASNet) is a soft penalty:

L(θ, α) = LCE(θ, α) + λ · E[latency(α)]

Here θ are the network weights (optimized for task loss on the training set), α are the architecture parameters (optimized for the combined loss on the validation set), and λ is a Lagrange-multiplier-like coefficient controlling the accuracy-latency tradeoff. Increasing λ pushes the search toward lower-latency architectures; decreasing λ toward higher-accuracy ones.

λ is not just a hyperparameter — it's a dial for your deployment scenario: A self-driving car might use λ ≈ 0 (accuracy is everything, latency is handled by throwing hardware at it). A smartwatch uses λ ≈ 10 (battery life and latency dominate, a few points of accuracy are acceptable). ProxylessNAS searches with a single λ and produces one architecture. If you need five devices, you run five separate searches.

The canvas below shows the Pareto front between accuracy and latency for a family of architectures found by OFA on ImageNet. Each point is a sub-network with different depth/width/kernel configuration. Dragging the λ slider moves the "target" point along the front — showing which architecture the search would select.

Multi-objective Pareto Front — drag λ to move the search target

The orange dot is the architecture selected at the current λ. Device budget lines show where phone / GPU / MCU constraints would cut the search space.

λ (latency penalty) 0.50

An alternative to the weighted sum is to treat the constraint as exact: find the best accuracy achievable under latency ≤ T. Evolutionary search handles this naturally — eliminate any candidate that violates the constraint and evolve only among feasible ones. OFA uses this exact approach at deploy time (Chapter 8). ProxylessNAS uses the differentiable weighted-sum during search.

In the loss L = L_CE + λ·E[latency], what happens to the selected architecture as λ → ∞?

Chapter 3: ProxylessNAS: Differentiable Latency

DARTS (L7) showed that we can make architecture search differentiable by replacing discrete operation choices with a soft mixture. For each edge in the computation graph, instead of picking one operation, we run a weighted sum of all candidate operations:

o̅(x) = ∑i pi · oi(x),    where pi = softmax(α)i = eαi / ∑j eαj

ProxylessNAS extends this to make latency differentiable too. Since the lookup table gives us a fixed latency for each operation, and the operation is selected with probability p_i, the expected latency for one mixed-op layer is:

E[latencylayer] = ∑i pi · lati

This is a smooth function of the architecture parameters α, because softmax is smooth. So the gradient of the latency term flows back through the softmax to α, and the search can minimize latency by gradient descent — the same optimizer that minimizes task loss.

Worked example. Layer has 5 candidate ops: conv3×3 (12ms), conv5×5 (24ms), conv7×7 (41ms), DW3×3 (6ms), skip (1ms). Raw α = [2.1, 0.4, −0.8, 1.5, −1.2]. Compute: exp(α) = [8.17, 1.49, 0.449, 4.48, 0.301]; sum = 14.89; p = [0.549, 0.100, 0.030, 0.301, 0.020]. E[latency] = 0.549×12 + 0.100×24 + 0.030×41 + 0.301×6 + 0.020×1 = 6.59 + 2.40 + 1.23 + 1.81 + 0.02 = 12.05 ms. The gradient ∂E[lat]/∂α_i = p_i·(lat_i − E[lat]) — if op i is slower than average, increasing α_i increases expected latency (positive gradient); gradient descent pushes α_i down, reducing its probability.

The canvas below makes this concrete. Each bar represents one candidate operation at a layer. The bar height is the softmax probability (controlled by sliders, which simulate the architecture parameters α). The bar label shows the probability, operation latency, and weighted contribution. The bottom bar shows the expected latency — it updates live as you adjust the probabilities.

ProxylessNAS: Differentiable Expected Latency

Drag the α sliders. The softmax probabilities redistribute. The expected latency = Σ pᵢ·latᵢ updates instantly. This is the quantity that gradient descent minimizes via ∂L/∂α.

The total network expected latency is the sum over all layers:

E[latencynetwork] = ∑layers lops i pi(l) · lati(l)

The full training loss becomes L = L_CE + λ · E[latency_network], and both terms are differentiated in the same backward pass. Architecture parameters α are updated on the validation set (preventing overfitting of the architecture to training data), while network weights θ are updated on the training set.

python
# ProxylessNAS loss with differentiable latency
def proxyless_loss(logits, targets, arch_params, latency_table, lambda_lat):
    # Task loss (cross-entropy on validation batch)
    ce_loss = F.cross_entropy(logits, targets)

    # Expected latency — differentiable through softmax
    expected_lat = 0.0
    for layer_idx, alpha in enumerate(arch_params):
        probs = F.softmax(alpha, dim=0)            # shape: [num_ops]
        lats = latency_table[layer_idx]              # preloaded per-op latencies
        expected_lat += (probs * lats).sum()         # Σ pᵢ·latᵢ

    # Combined loss — gradient flows through both terms
    loss = ce_loss + lambda_lat * expected_lat
    return loss, ce_loss.item(), expected_lat.item()

# Gradient of expected latency w.r.t. α_i:
# ∂E[lat]/∂α_i = p_i · (lat_i − E[lat])   (standard softmax Jacobian identity)
# → ops slower than average get negative gradient → their α decreases → probability drops
In ProxylessNAS, after training, you need to select ONE operation per layer (not a mixture). You pick the argmax of α. Which operation wins at a layer with α = [−0.5, 2.3, 0.8, −1.1, 1.4] for ops {conv3, conv5, conv7, DW3, skip}?

Chapter 4: Binarized Paths & Memory

DARTS ran all candidate operations simultaneously in a weighted mixture during the forward pass. That's beautiful for gradient flow, but catastrophic for memory: if you have N candidate operations per edge, you need to store N activation tensors simultaneously. For ImageNet with N=8 operations and large feature maps, this hits >100GB of GPU memory — far beyond any single GPU.

ProxylessNAS's solution: binarize the architecture parameters during the forward pass. Instead of activating all N operations with real-valued probabilities p_i, you sample a binary mask: one operation is activated (probability = 1), all others are deactivated (probability = 0). Only the single active operation's activation tensor exists in memory.

DARTS (all paths active):

o̅(x) = p1·o1(x) + p2·o2(x) + … + pN·oN(x)

Memory: O(N) activations. Gradient: exact.

ProxylessNAS (binarized):

o̅(x) = ok(x), where k ~ Bernoulli(p1, …, pN)

Memory: O(1) activations. Gradient: STE-approximated.

Misconception: "Binarization means we can't compute gradients." The binarize step is non-differentiable (it's a discrete sample). ProxylessNAS uses the Straight-Through Estimator (STE) in the backward pass: treat binarize as the identity function and pass gradients through unchanged. This is exactly the trick Quantization-Aware Training (L6) uses for the round() function. The gradient is biased but provides a useful learning signal: if the sampled operation performed well, its probability increases; if it performed poorly, it decreases.

The binarization scheme works like this in practice:

python
import torch, torch.nn.functional as F

class MixedOp(torch.nn.Module):
    def __init__(self, ops):
        super().__init__()
        self.ops = torch.nn.ModuleList(ops)
        self.alpha = torch.nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)

        # ── Binarize: sample one path (memory: O(1)) ──
        # In training, sample from Bernoulli with probabilities probs
        idx = torch.multinomial(probs, 1).item()

        # Straight-Through Estimator for backward pass:
        # forward: run only op[idx] (discrete, cheap)
        # backward: pretend softmax was applied (continuous gradients flow)
        out = self.ops[idx](x)

        # Rescale by probability for gradient signal
        # This connects p_idx in the backward graph
        out = out * probs[idx] / probs[idx].detach()

        return out

# Memory comparison (ImageNet, 224×224, 8 candidate ops):
# DARTS:       8 × H × W × C × 4 bytes ≈ 8 × 56 × 56 × 256 × 4 = 0.51 GB  (×N layers → OOM)
# ProxylessNAS: 1 × H × W × C × 4 bytes ≈ 64 MB per layer  → feasible on single GPU

The result: ProxylessNAS can search directly on ImageNet (the target task) with large architecture spaces on a single GPU. Previous NAS with DARTS required either CIFAR-10 as a proxy or 100+GB of GPU memory. That's the "proxyless" in the name — no proxy task, no memory limitation workaround needed.

A ProxylessNAS mixed-op has N=8 candidate operations. During a forward pass, the binarization samples operation k=3 (conv5×5). In the backward pass, which operations receive gradient updates?

Chapter 5: The Retraining Problem

ProxylessNAS solved one problem: how to search for one architecture on one target device efficiently. It takes about 8 GPU-hours (NASNet: 48,000 GPU-hours; DARTS: 96 GPU-hours). But now your deployment team comes back with a harder request:

"We need to support the Samsung S20, S21, and S22 phones. Plus the LG G8. Plus Google Pixel 2. Plus two FPGA variants. Plus an MCU for the IoT product line. Oh, and the GPU team wants a cloud version too."

That's 8 hardware targets. ProxylessNAS searches for one architecture per target, and then each found architecture must be retrained from scratch (the search supernet weights are not the final weights — the winning discrete architecture is initialized randomly and trained fully). Each full ImageNet training run: 40,000 GPU-hours. Total cost: 8 × 40,000 = 320,000 GPU-hours. That's economically insane.

The double cost: ProxylessNAS (and MNASNet, DARTS) have two expensive steps: (1) search — find the architecture; (2) retrain — train that architecture from scratch. The search cost scales with target count, and the retrain cost scales linearly too. For K targets, total cost ≈ K × (search + retrain) = K × 40K GPU-hours. This is unsustainable past 2–3 targets.

The fundamental issue: these methods tightly couple training and search. Every time you want a new target architecture, you must run both steps again. The question Once-for-All asks is: can we decouple them?

Old paradigm (coupled)
For each hardware target: (1) search → architecture A; (2) retrain A from scratch → deployed model. Cost: O(K) × (search + retrain).
↓ Once-for-All idea
New paradigm (decoupled)
(1) Train ONE supernet that contains ALL sub-networks simultaneously; (2) For each target: search (no training!) → extract sub-network → deploy directly with inherited weights. Cost: O(1) train + O(K) × fast-search.
A company needs to deploy a model to 20 hardware targets. ProxylessNAS costs 8 GPU-hours search + 40,000 GPU-hours retraining per target. What is the total cost? And why does Once-for-All fundamentally change this scaling?

Chapter 6: Once-for-All: Elastic Supernet

Once-for-All (OFA, Cai et al., ICLR 2020) trains a single "elastic" supernet that simultaneously supports an enormous range of sub-network configurations. The supernet is parameterized by four elastic dimensions: depth (how many layers per block), kernel size (3, 5, or 7 for each conv), width (channel multiplier per layer), and resolution (input image size). Any combination of valid choices in these dimensions defines one valid sub-network — and all of them share weights with the supernet.

How many sub-networks does OFA contain? Let's count. OFA uses a MobileNetV3-like backbone with 5 stages. Each stage has a variable number of blocks: depth ∈ {2, 3, 4}. Each block has a variable kernel: {3, 5, 7}. Width multiplier per stage ∈ {0.25, 0.5, 0.75, 1.0}. Input resolution ∈ {128, 160, 192, 224}. The combinatorial count:

#sub-nets = 4res × (4width)5 × (3depth)5 × (3kernel)20 ≈ 1019

That is 10,000,000,000,000,000,000 sub-networks from one trained supernet. Each one shares weights — you don't retrain, you just extract and deploy. The accuracy predictor (trained separately, cheaply) tells you which configuration maximizes accuracy under a given latency constraint.

The magic of weight sharing: A sub-network with depth=2, kernel=3, width=0.5 uses the same first-two-layers weights as the full 4-layer, kernel=7, width=1.0 network — they're literally the same parameters in memory. The OFA training procedure forces these shared weights to simultaneously work well for all their sub-network configurations. This is the hard part — and it's exactly what progressive shrinking solves.

From the user's perspective at deploy time:

Given: target device (latency budget T, SRAM budget S)
Query the latency lookup table profiled for this device (5 minutes to build).
Search: accuracy predictor + evolutionary search (~1 GPU-hour)
Find the architecture configuration (d, k, w, r) that maximizes predicted accuracy while lat(d,k,w,r) ≤ T and memory(d,k,w,r) ≤ S.
Deploy: extract sub-network + load weights (minutes)
Slice the OFA weights at positions corresponding to the chosen (d,k,w). No retraining. Latency and accuracy match predictions.

OFA achieved 80.0% top-1 on ImageNet on Samsung Note10 with 26ms latency — beating MobileNetV3 by 3.8% at similar latency. It also produced models for MCUs as small as the Cortex-M4 (256kB SRAM / 1MB Flash).

An OFA supernet has depth ∈ {2, 3, 4} per stage over 5 stages; kernel ∈ {3, 5, 7} per layer over 20 layers; and 4 width choices per stage. Ignoring resolution, how many sub-networks does it contain?

Chapter 7: Progressive Shrinking

The naive approach to training the OFA supernet: start from scratch and randomly sample sub-networks on each mini-batch, training all configurations simultaneously. This fails catastrophically. The large sub-networks (full depth, full width, big kernels) want large weight values to express complex features. The small sub-networks (shallow, narrow, small kernels) want small weight values compatible with limited capacity. These two pressures fight each other — the result is a supernet where both extremes perform worse than a dedicated network trained for each.

Progressive shrinking is OFA's curriculum solution. Instead of training all configurations from the start, it adds one elastic dimension at a time, in order from largest to smallest, letting each dimension settle before introducing the next disturbance.

Stage 1: Full supernet
Train the complete network (max depth D, max kernel K=7, max width W=1.0, full resolution R=224) to convergence. This establishes high-quality weights for the largest sub-network.
↓ freeze full-net weights; add elastic kernel
Stage 2: Elastic kernel
Keep depth/width fixed. Sample kernels k ∈ {3,5,7} randomly per layer. Smaller kernels use a learned center-crop transformation: W_k3 = T_25×9 · W_k7_flattened — a learned 9×9 matrix projects 7×7 weights to 3×3. Distill from the full-kernel teacher (match feature maps).
↓ add elastic depth
Stage 3: Elastic depth
Keep width fixed, kernels already elastic. Sample depth d ∈ {2,3,4} per block. Later layers in each block can be skipped. The earlier layers see gradient from both deep and shallow configurations — they learn to be good general feature extractors. Distill from full-depth teacher.
↓ add elastic width
Stage 4: Elastic width
Sample channel multiplier ∈ {0.25, 0.5, 0.75, 1.0} per layer. Channels are sorted by importance (L1 norm), so the top-k% channels are always the most informative ones. Width-shrunk variants use the top-k channels and see loss from both full and reduced configurations.
↓ add elastic resolution
Stage 5: Elastic resolution
Randomly sample input resolution ∈ {128, 160, 192, 224} per batch. The network learns position- and scale-invariant features. BN statistics are tracked separately per resolution (or recalculated at deployment).
The kernel-shrinking trick: How do you extract 3×3 weights from a trained 7×7 kernel? OFA does NOT simply center-crop the 7×7 weight matrix. Instead, it learns a 25×9 transformation matrix T for each layer (25 = 5×5 "middle" from 7×7 flattened; 9 = 3×3 flattened). Then W_3x3 = T · W_7x7_center25_flattened. The transformation matrix T is learned during Stage 2 training. This lets the 3×3 sub-networks access a combination of the 7×7 weights rather than just the center 9 values — a richer representation.

The canvas below animates the four shrinking stages. Click "Next Stage" to advance through the progression. Notice how each stage reduces the network's effective capacity while keeping the same underlying weight storage.

Progressive Shrinking — one elastic dimension at a time

Click "Next Stage" to advance through the OFA training curriculum. Each stage adds one elastic dimension — the network is always initialized from the previous stage's weights.

Stage 0: Full | Stage 1: Elastic Kernel | Stage 2: Elastic Depth | Stage 3: Elastic Width

python
# OFA progressive shrinking training schedule (pseudocode)
def ofa_progressive_training(supernet, train_loader, distill_teacher):
    # Stage 1: Train full supernet
    for epoch in range(180):
        train_full_net(supernet, train_loader)

    # Stage 2: Elastic kernel (k=7,5,3)
    for epoch in range(25):
        k = random.choice([3, 5, 7])    # sample kernel per layer
        subnet = supernet.get_subnet(kernel=k, depth='full', width=1.0)
        loss = task_loss(subnet) + distill_loss(subnet, distill_teacher)
        loss.backward()

    # Stage 3: Elastic depth (d=2,3,4)
    for epoch in range(25):
        d = random.choice([2, 3, 4])
        k = random.choice([3, 5, 7])
        subnet = supernet.get_subnet(kernel=k, depth=d, width=1.0)
        loss = task_loss(subnet) + distill_loss(subnet, distill_teacher)
        loss.backward()

    # Stage 4: Elastic width (w=0.25,0.5,0.75,1.0)
    for epoch in range(25):
        w = random.choice([0.25, 0.5, 0.75, 1.0])
        d = random.choice([2, 3, 4])
        k = random.choice([3, 5, 7])
        subnet = supernet.get_subnet(kernel=k, depth=d, width=w)
        # Channel sorting: top-k channels by L1 norm always selected
        loss = task_loss(subnet) + distill_loss(subnet, distill_teacher)
        loss.backward()
    # Total: ~1,200 GPU-hours on ImageNet. Per-target cost after this: ~0.
Why does OFA add elastic dimensions progressively rather than training all configurations from scratch simultaneously?

Chapter 8: Showcase: OFA Deploy Search

This is the payoff chapter. OFA is trained. The supernet contains 10¹⁹ sub-networks. A new hardware target arrives. You don't retrain anything. You search.

The search is an evolutionary algorithm guided by two oracle functions: the latency lookup table (gives exact latency for any (d, k, w, r) configuration in O(1)) and the accuracy predictor (a small MLP trained on ~1,000 labeled (arch, measured accuracy) pairs that predicts top-1 accuracy from the architecture encoding in O(1)). The evolutionary search samples sub-networks, scores them, keeps the Pareto-front survivors, mutates/crosses them, and repeats for ~20 evolution rounds — the whole thing takes less than one GPU-hour.

Accuracy predictor cost: Building the predictor requires ~1,000 labeled (sub-network config, accuracy) pairs. Each accuracy measurement: run the sub-net with inherited OFA weights on 1,000 validation images — about 1 GPU-minute. Total labeling cost: ~1,000 GPU-minutes = ~16 GPU-hours. This is the only training-like cost at deploy time. The predictor is a 3-layer MLP — trivial to train and run.

The canvas below simulates OFA deploy-time search. Drag the latency budget slider for a new hardware target. The canvas instantly identifies the best sub-network configuration: the point on the OFA Pareto front with the highest accuracy that fits within the budget. In practice this happens in under an hour with evolutionary search — here it's instantaneous because we've precomputed the Pareto front.

OFA Deploy-Time Search — set budget, get best sub-network instantly

Drag the latency budget. The highlighted orange dot is the best OFA sub-network for that budget. All configurations share the same supernet weights — zero retraining required.

Latency budget (ms) 30

Try budget = 10 (GPU), 30 (phone), 50 (MCU). Each gives a different architecture from the same supernet.

Let's trace through what OFA reported in the paper for Samsung Note10 (30ms budget): the search finds a sub-net with kernel sizes [7, 5, 7, 5, 3] across 5 stages, depth [4, 3, 4, 4, 4], width multiplier 1.0, resolution 224 — achieving 80.0% top-1. Compare: MobileNetV3-Large (same latency): 75.2%. ProxylessNAS GPU: 75.1% at 5.1ms GPU latency. OFA achieves higher accuracy because it searches over a wider and deeper space — the full OFA search space has 10¹⁹ configurations while ProxylessNAS searches in a smaller space with each run.

OFA for MCUs (TinyNAS preview): The same paradigm extends to microcontrollers, but the constraint changes from latency-ms to SRAM activation footprint (kB) and Flash model size (MB). An STM32H743 (512kB SRAM / 2MB Flash) can run an OFA-derived sub-net with resolution=48, depth=2, kernel=3, width=0.25 — achieving 46.6% top-1 on ImageNet at 10 TOPS/W. L10 will cover MCUNet, which fully automates this SRAM-constrained search. This is why OFA is the foundation — it gives you the sub-networks; TinyNAS/MCUNet gives you the SRAM-aware search strategy.

The full OFA deploy pipeline, end to end:

python
from ofa.nas.accuracy_predictor import AccuracyPredictor
from ofa.nas.efficiency_predictor import Latency_table

# One-time: build latency table for this device (~5 min per device)
lat_table = Latency_table.build(device='samsung_note10')

# One-time: train accuracy predictor (~16 GPU-hours per supernet)
acc_pred = AccuracyPredictor.load('ofa_mbv3_accuracy_predictor.pt')

# Per-target: evolutionary search (~1 GPU-hour)
def deploy_search(budget_ms):
    population = sample_random_subnets(1000)
    for _ in range(20):  # evolution rounds
        # Score: predicted accuracy, constrained to lat ≤ budget
        scores = []
        for net in population:
            lat = lat_table.predict(net)
            if lat > budget_ms: continue  # hard constraint
            acc = acc_pred.predict(net)
            scores.append((acc, net))
        scores.sort(reverse=True)
        # Keep top-50, mutate and crossover for next generation
        population = mutate_and_crossover(scores[:50])

    best_arch = scores[0][1]
    # Extract sub-network weights from OFA supernet — no retraining
    model = ofa_supernet.get_subnet(best_arch)
    return model  # ready to deploy

Chapter 9: Connections & Cheat Sheet

We've covered hardware-aware NAS from problem to solution. Here's the compressed knowledge map of this lesson — everything you need to reproduce any part of it from memory.

The Central Insight Chain: FLOPs ≠ latency (same FLOPs, 4× different latency from depth vs width scaling) → need hardware measurement → direct measurement too slow (40K GPU-hrs/target) → latency lookup table (measure each op once, score any arch in O(1)) → make latency differentiable (E[lat] = Σ pᵢ·latᵢ, gradient flows through softmax) → search one target cheaply (ProxylessNAS, 8 GPU-hrs) → retraining still needed per target → decouple training from search (OFA, 1 supernet containing 10¹⁹ sub-nets, trained once with progressive shrinking) → deploy-time search is fast + free of retraining.
MethodKey IdeaSearch CostTargetsRetrain?
MNASNetOn-device RL search, measured latency40K GPU-hrs/targetOne at a timeYes
ProxylessNASDifferentiable latency (lookup table + Σpᵢ·latᵢ), binarized paths~8 GPU-hrs/targetOne at a timeYes
Once-for-AllElastic supernet (10¹⁹ sub-nets), progressive shrinking, accuracy predictor~1,200 GPU-hrs once + <1 hr/targetUnlimited, same supernetNo
MCUNet (L10)OFA + SRAM/Flash memory constraint in search objectiveSame as OFAMCUs specificallyNo

Formula Cheat Sheet

ProxylessNAS loss: L = LCE(θ, α) + λ · ∑li pi(l) · lati(l)

where p_i = softmax(α)_i, lat_i comes from the lookup table, λ controls accuracy-latency tradeoff. Gradient of latency term: ∂E[lat]/∂α_i = p_i(lat_i − E[lat]).

OFA sub-network count: |D|S × |K|L × |W|S × |R| ≈ 35 × 320 × 45 × 4 ≈ 1019

S = stages (5), L = layers (20), D = depth choices {2,3,4}, K = kernel choices {3,5,7}, W = width choices {0.25,0.5,0.75,1.0}, R = resolution choices {128,160,192,224}.

OFA kernel shrinking: W3×3 = T25×9 · W7×7,center25,flat

T is a learned transformation matrix. The 5×5 center of W_7×7 (25 values) is projected to 9 values (3×3 kernel). This is learned, not a fixed crop.

What's Next

L9: Knowledge Distillation
OFA uses knowledge distillation internally — the full supernet is the teacher; smaller sub-nets are the students. L9 covers KD from first principles: teacher logit matching, feature-level distillation, temperature scaling. Connects back: OFA's progressive shrinking is a structured form of self-distillation.
L10: MCUNet — TinyML on Microcontrollers
OFA searches under latency constraints. MCUNet adds SRAM activation footprint and Flash model-size constraints — essential for Cortex-M4/M7 devices with 256kB–512kB RAM. Introduces TinyNAS (the SRAM-aware search space) and TinyEngine (memory-efficient inference engine). The search backend is OFA.

Related Lessons

TinyML L7: NAS I — Searching for Efficient Networks — DARTS, RL, evolutionary search, supernets. This lesson continues where L7 ends.

TinyML L4: Pruning II — NetAdapt & AMC — NetAdapt uses a hardware-in-the-loop lookup table exactly like ProxylessNAS, but for pruning ratios instead of operation types. Same lookup-table concept, different application.

TinyML L6: Quantization II — QAT — The Straight-Through Estimator (STE) used in ProxylessNAS binarization is identical to the STE used in quantization-aware training for rounding. Same mathematical trick, two domains.

"The goal is not to design a neural network. The goal is to design a process that produces the right neural network for every piece of hardware that will ever exist." — Song Han, MIT 6.5940