L7's NAS found architectures that minimize FLOPs — but FLOPs don't run on hardware, operations do. Two MABs with identical FLOPs can differ 4× in measured latency. This lesson closes that gap: hardware-aware search, differentiable latency (ProxylessNAS), and a network so elastic it can serve 10¹⁹ device configurations without retraining (Once-for-All).
L7's NAS worked. You ran a search, you got an architecture with 260M FLOPs, you trained it, it hits 75% top-1 on ImageNet. Now your PM asks: "Can we deploy this on a Snapdragon 888 phone with a 30ms latency budget?" You run the model. It takes 174ms. Three minutes later she says: "Also, one of our clients uses Nvidia 1080Ti for cloud inference — different latency budget."
Here is the brutal truth: the search optimized the wrong thing. FLOPs count arithmetic operations. Latency measures wall-clock time on a specific piece of silicon. These are not the same thing, and the gap is not small. Two architectures with identical FLOPs can differ by a factor of 4× in measured latency on the same device — because latency depends on memory access patterns, parallelism, cache behavior, and operator-level hardware support, none of which FLOPs capture.
The canvas below shows real data from ProxylessNAS (Titan xp GPU). Two ways to scale a transformer: increase the number of layers (depth scaling) or increase the hidden dimension (width scaling). Both trajectories cross FLOPs from 159M to 326M — nearly identical FLOPs. But their latency trajectories diverge drastically.
Two architectures at each FLOPs point: one scales depth (adds layers), one scales hidden dimension (wider layers). On an Nvidia Titan xp GPU, depth scaling causes latency to grow linearly — hidden-dim scaling keeps latency flat because wider matrix multiplications saturate GPU SIMD throughput.
Why does depth scaling hurt so much? Layers execute sequentially — layer N must finish before layer N+1 begins. GPUs are throughput machines: they want one massive parallel computation, not many sequential small ones. Hidden-dim scaling makes each matrix multiply wider, which the GPU can exploit with tensor cores running thousands of multiply-accumulates in a single cycle. At 259M FLOPs, the depth-scaled variant takes 174ms; the width-scaled variant takes 51.7ms. That is a 3.4× latency difference at identical FLOPs.
This is the central insight of hardware-aware NAS: the search objective must include measured latency on the target device. Not FLOPs, not MACs, not parameter count — actual milliseconds on actual silicon. The rest of this lesson is the engineering story of how to make that measurement scalable and differentiable.
The obvious solution: just measure latency on the real device during search. Send each candidate architecture to the phone, run it, record the milliseconds, use that as the score. MNASNet (Tan et al., CVPR 2019) did exactly this. The result: 40,000 GPU-hours to search for a single mobile architecture. Why so expensive? Because each candidate requires a separate round-trip to the device, and you need thousands of candidates to explore the space well.
Can we do better? ProxylessNAS proposes two approaches that make latency estimation cheap enough to run inside the search loop.
The lookup table for a single hardware target looks like this:
python # Profiled once on Snapdragon 888, resolution 224×224 LATENCY_TABLE = { 'conv_3x3_c64': 2.3, # ms 'conv_5x5_c64': 4.1, 'dw_3x3_c64': 0.8, # depthwise: much cheaper 'dw_5x5_c64': 1.2, 'identity': 0.05, # skip connection 'max_pool_3x3': 0.3, } def score_architecture(arch): """arch = list of op names, one per layer.""" total_lat = 0.0 for op in arch: total_lat += LATENCY_TABLE[op] return total_lat # ms # Example: 12-layer network arch = ['dw_3x3_c64', 'conv_3x3_c64', 'dw_5x5_c64', 'identity', 'conv_3x3_c64', 'dw_3x3_c64', 'conv_5x5_c64', 'identity', 'dw_3x3_c64', 'conv_3x3_c64', 'dw_5x5_c64', 'identity'] latency = score_architecture(arch) # = 0.8+2.3+1.2+0.05+2.3+0.8+4.1+0.05+0.8+2.3+1.2+0.05 = 15.95 ms # One lookup: ~microseconds. On-device measurement: ~seconds. ~10,000× faster.
The key insight: you profile each operation once per device, offline. During the search — which runs thousands or millions of candidate evaluations — every latency estimate is a single dictionary lookup. The per-search overhead is essentially zero.
Now that latency is cheap to estimate, we need to decide how to incorporate it into the search objective. The naive approach — minimize latency while ignoring accuracy — returns a trivially small network that does nothing useful. The correct framing is multi-objective optimization: find architectures on the Pareto frontier between accuracy and latency, then select the best point for your hardware budget.
There are two common formulations. The first is a hard constraint: maximize accuracy subject to latency ≤ T. This is clean but brittle — a tiny constraint violation disqualifies otherwise excellent architectures. The second (used in ProxylessNAS and MNASNet) is a soft penalty:
Here θ are the network weights (optimized for task loss on the training set), α are the architecture parameters (optimized for the combined loss on the validation set), and λ is a Lagrange-multiplier-like coefficient controlling the accuracy-latency tradeoff. Increasing λ pushes the search toward lower-latency architectures; decreasing λ toward higher-accuracy ones.
The canvas below shows the Pareto front between accuracy and latency for a family of architectures found by OFA on ImageNet. Each point is a sub-network with different depth/width/kernel configuration. Dragging the λ slider moves the "target" point along the front — showing which architecture the search would select.
The orange dot is the architecture selected at the current λ. Device budget lines show where phone / GPU / MCU constraints would cut the search space.
An alternative to the weighted sum is to treat the constraint as exact: find the best accuracy achievable under latency ≤ T. Evolutionary search handles this naturally — eliminate any candidate that violates the constraint and evolve only among feasible ones. OFA uses this exact approach at deploy time (Chapter 8). ProxylessNAS uses the differentiable weighted-sum during search.
DARTS (L7) showed that we can make architecture search differentiable by replacing discrete operation choices with a soft mixture. For each edge in the computation graph, instead of picking one operation, we run a weighted sum of all candidate operations:
ProxylessNAS extends this to make latency differentiable too. Since the lookup table gives us a fixed latency for each operation, and the operation is selected with probability p_i, the expected latency for one mixed-op layer is:
This is a smooth function of the architecture parameters α, because softmax is smooth. So the gradient of the latency term flows back through the softmax to α, and the search can minimize latency by gradient descent — the same optimizer that minimizes task loss.
The canvas below makes this concrete. Each bar represents one candidate operation at a layer. The bar height is the softmax probability (controlled by sliders, which simulate the architecture parameters α). The bar label shows the probability, operation latency, and weighted contribution. The bottom bar shows the expected latency — it updates live as you adjust the probabilities.
Drag the α sliders. The softmax probabilities redistribute. The expected latency = Σ pᵢ·latᵢ updates instantly. This is the quantity that gradient descent minimizes via ∂L/∂α.
The total network expected latency is the sum over all layers:
The full training loss becomes L = L_CE + λ · E[latency_network], and both terms are differentiated in the same backward pass. Architecture parameters α are updated on the validation set (preventing overfitting of the architecture to training data), while network weights θ are updated on the training set.
python # ProxylessNAS loss with differentiable latency def proxyless_loss(logits, targets, arch_params, latency_table, lambda_lat): # Task loss (cross-entropy on validation batch) ce_loss = F.cross_entropy(logits, targets) # Expected latency — differentiable through softmax expected_lat = 0.0 for layer_idx, alpha in enumerate(arch_params): probs = F.softmax(alpha, dim=0) # shape: [num_ops] lats = latency_table[layer_idx] # preloaded per-op latencies expected_lat += (probs * lats).sum() # Σ pᵢ·latᵢ # Combined loss — gradient flows through both terms loss = ce_loss + lambda_lat * expected_lat return loss, ce_loss.item(), expected_lat.item() # Gradient of expected latency w.r.t. α_i: # ∂E[lat]/∂α_i = p_i · (lat_i − E[lat]) (standard softmax Jacobian identity) # → ops slower than average get negative gradient → their α decreases → probability drops
DARTS ran all candidate operations simultaneously in a weighted mixture during the forward pass. That's beautiful for gradient flow, but catastrophic for memory: if you have N candidate operations per edge, you need to store N activation tensors simultaneously. For ImageNet with N=8 operations and large feature maps, this hits >100GB of GPU memory — far beyond any single GPU.
ProxylessNAS's solution: binarize the architecture parameters during the forward pass. Instead of activating all N operations with real-valued probabilities p_i, you sample a binary mask: one operation is activated (probability = 1), all others are deactivated (probability = 0). Only the single active operation's activation tensor exists in memory.
DARTS (all paths active):
Memory: O(N) activations. Gradient: exact.
ProxylessNAS (binarized):
Memory: O(1) activations. Gradient: STE-approximated.
The binarization scheme works like this in practice:
python import torch, torch.nn.functional as F class MixedOp(torch.nn.Module): def __init__(self, ops): super().__init__() self.ops = torch.nn.ModuleList(ops) self.alpha = torch.nn.Parameter(torch.zeros(len(ops))) def forward(self, x): probs = F.softmax(self.alpha, dim=0) # ── Binarize: sample one path (memory: O(1)) ── # In training, sample from Bernoulli with probabilities probs idx = torch.multinomial(probs, 1).item() # Straight-Through Estimator for backward pass: # forward: run only op[idx] (discrete, cheap) # backward: pretend softmax was applied (continuous gradients flow) out = self.ops[idx](x) # Rescale by probability for gradient signal # This connects p_idx in the backward graph out = out * probs[idx] / probs[idx].detach() return out # Memory comparison (ImageNet, 224×224, 8 candidate ops): # DARTS: 8 × H × W × C × 4 bytes ≈ 8 × 56 × 56 × 256 × 4 = 0.51 GB (×N layers → OOM) # ProxylessNAS: 1 × H × W × C × 4 bytes ≈ 64 MB per layer → feasible on single GPU
The result: ProxylessNAS can search directly on ImageNet (the target task) with large architecture spaces on a single GPU. Previous NAS with DARTS required either CIFAR-10 as a proxy or 100+GB of GPU memory. That's the "proxyless" in the name — no proxy task, no memory limitation workaround needed.
ProxylessNAS solved one problem: how to search for one architecture on one target device efficiently. It takes about 8 GPU-hours (NASNet: 48,000 GPU-hours; DARTS: 96 GPU-hours). But now your deployment team comes back with a harder request:
"We need to support the Samsung S20, S21, and S22 phones. Plus the LG G8. Plus Google Pixel 2. Plus two FPGA variants. Plus an MCU for the IoT product line. Oh, and the GPU team wants a cloud version too."
That's 8 hardware targets. ProxylessNAS searches for one architecture per target, and then each found architecture must be retrained from scratch (the search supernet weights are not the final weights — the winning discrete architecture is initialized randomly and trained fully). Each full ImageNet training run: 40,000 GPU-hours. Total cost: 8 × 40,000 = 320,000 GPU-hours. That's economically insane.
The fundamental issue: these methods tightly couple training and search. Every time you want a new target architecture, you must run both steps again. The question Once-for-All asks is: can we decouple them?
Once-for-All (OFA, Cai et al., ICLR 2020) trains a single "elastic" supernet that simultaneously supports an enormous range of sub-network configurations. The supernet is parameterized by four elastic dimensions: depth (how many layers per block), kernel size (3, 5, or 7 for each conv), width (channel multiplier per layer), and resolution (input image size). Any combination of valid choices in these dimensions defines one valid sub-network — and all of them share weights with the supernet.
How many sub-networks does OFA contain? Let's count. OFA uses a MobileNetV3-like backbone with 5 stages. Each stage has a variable number of blocks: depth ∈ {2, 3, 4}. Each block has a variable kernel: {3, 5, 7}. Width multiplier per stage ∈ {0.25, 0.5, 0.75, 1.0}. Input resolution ∈ {128, 160, 192, 224}. The combinatorial count:
That is 10,000,000,000,000,000,000 sub-networks from one trained supernet. Each one shares weights — you don't retrain, you just extract and deploy. The accuracy predictor (trained separately, cheaply) tells you which configuration maximizes accuracy under a given latency constraint.
From the user's perspective at deploy time:
OFA achieved 80.0% top-1 on ImageNet on Samsung Note10 with 26ms latency — beating MobileNetV3 by 3.8% at similar latency. It also produced models for MCUs as small as the Cortex-M4 (256kB SRAM / 1MB Flash).
The naive approach to training the OFA supernet: start from scratch and randomly sample sub-networks on each mini-batch, training all configurations simultaneously. This fails catastrophically. The large sub-networks (full depth, full width, big kernels) want large weight values to express complex features. The small sub-networks (shallow, narrow, small kernels) want small weight values compatible with limited capacity. These two pressures fight each other — the result is a supernet where both extremes perform worse than a dedicated network trained for each.
Progressive shrinking is OFA's curriculum solution. Instead of training all configurations from the start, it adds one elastic dimension at a time, in order from largest to smallest, letting each dimension settle before introducing the next disturbance.
The canvas below animates the four shrinking stages. Click "Next Stage" to advance through the progression. Notice how each stage reduces the network's effective capacity while keeping the same underlying weight storage.
Click "Next Stage" to advance through the OFA training curriculum. Each stage adds one elastic dimension — the network is always initialized from the previous stage's weights.
Stage 0: Full | Stage 1: Elastic Kernel | Stage 2: Elastic Depth | Stage 3: Elastic Width
python # OFA progressive shrinking training schedule (pseudocode) def ofa_progressive_training(supernet, train_loader, distill_teacher): # Stage 1: Train full supernet for epoch in range(180): train_full_net(supernet, train_loader) # Stage 2: Elastic kernel (k=7,5,3) for epoch in range(25): k = random.choice([3, 5, 7]) # sample kernel per layer subnet = supernet.get_subnet(kernel=k, depth='full', width=1.0) loss = task_loss(subnet) + distill_loss(subnet, distill_teacher) loss.backward() # Stage 3: Elastic depth (d=2,3,4) for epoch in range(25): d = random.choice([2, 3, 4]) k = random.choice([3, 5, 7]) subnet = supernet.get_subnet(kernel=k, depth=d, width=1.0) loss = task_loss(subnet) + distill_loss(subnet, distill_teacher) loss.backward() # Stage 4: Elastic width (w=0.25,0.5,0.75,1.0) for epoch in range(25): w = random.choice([0.25, 0.5, 0.75, 1.0]) d = random.choice([2, 3, 4]) k = random.choice([3, 5, 7]) subnet = supernet.get_subnet(kernel=k, depth=d, width=w) # Channel sorting: top-k channels by L1 norm always selected loss = task_loss(subnet) + distill_loss(subnet, distill_teacher) loss.backward() # Total: ~1,200 GPU-hours on ImageNet. Per-target cost after this: ~0.
This is the payoff chapter. OFA is trained. The supernet contains 10¹⁹ sub-networks. A new hardware target arrives. You don't retrain anything. You search.
The search is an evolutionary algorithm guided by two oracle functions: the latency lookup table (gives exact latency for any (d, k, w, r) configuration in O(1)) and the accuracy predictor (a small MLP trained on ~1,000 labeled (arch, measured accuracy) pairs that predicts top-1 accuracy from the architecture encoding in O(1)). The evolutionary search samples sub-networks, scores them, keeps the Pareto-front survivors, mutates/crosses them, and repeats for ~20 evolution rounds — the whole thing takes less than one GPU-hour.
The canvas below simulates OFA deploy-time search. Drag the latency budget slider for a new hardware target. The canvas instantly identifies the best sub-network configuration: the point on the OFA Pareto front with the highest accuracy that fits within the budget. In practice this happens in under an hour with evolutionary search — here it's instantaneous because we've precomputed the Pareto front.
Drag the latency budget. The highlighted orange dot is the best OFA sub-network for that budget. All configurations share the same supernet weights — zero retraining required.
Try budget = 10 (GPU), 30 (phone), 50 (MCU). Each gives a different architecture from the same supernet.
Let's trace through what OFA reported in the paper for Samsung Note10 (30ms budget): the search finds a sub-net with kernel sizes [7, 5, 7, 5, 3] across 5 stages, depth [4, 3, 4, 4, 4], width multiplier 1.0, resolution 224 — achieving 80.0% top-1. Compare: MobileNetV3-Large (same latency): 75.2%. ProxylessNAS GPU: 75.1% at 5.1ms GPU latency. OFA achieves higher accuracy because it searches over a wider and deeper space — the full OFA search space has 10¹⁹ configurations while ProxylessNAS searches in a smaller space with each run.
The full OFA deploy pipeline, end to end:
python from ofa.nas.accuracy_predictor import AccuracyPredictor from ofa.nas.efficiency_predictor import Latency_table # One-time: build latency table for this device (~5 min per device) lat_table = Latency_table.build(device='samsung_note10') # One-time: train accuracy predictor (~16 GPU-hours per supernet) acc_pred = AccuracyPredictor.load('ofa_mbv3_accuracy_predictor.pt') # Per-target: evolutionary search (~1 GPU-hour) def deploy_search(budget_ms): population = sample_random_subnets(1000) for _ in range(20): # evolution rounds # Score: predicted accuracy, constrained to lat ≤ budget scores = [] for net in population: lat = lat_table.predict(net) if lat > budget_ms: continue # hard constraint acc = acc_pred.predict(net) scores.append((acc, net)) scores.sort(reverse=True) # Keep top-50, mutate and crossover for next generation population = mutate_and_crossover(scores[:50]) best_arch = scores[0][1] # Extract sub-network weights from OFA supernet — no retraining model = ofa_supernet.get_subnet(best_arch) return model # ready to deploy
We've covered hardware-aware NAS from problem to solution. Here's the compressed knowledge map of this lesson — everything you need to reproduce any part of it from memory.
| Method | Key Idea | Search Cost | Targets | Retrain? |
|---|---|---|---|---|
| MNASNet | On-device RL search, measured latency | 40K GPU-hrs/target | One at a time | Yes |
| ProxylessNAS | Differentiable latency (lookup table + Σpᵢ·latᵢ), binarized paths | ~8 GPU-hrs/target | One at a time | Yes |
| Once-for-All | Elastic supernet (10¹⁹ sub-nets), progressive shrinking, accuracy predictor | ~1,200 GPU-hrs once + <1 hr/target | Unlimited, same supernet | No |
| MCUNet (L10) | OFA + SRAM/Flash memory constraint in search objective | Same as OFA | MCUs specifically | No |
where p_i = softmax(α)_i, lat_i comes from the lookup table, λ controls accuracy-latency tradeoff. Gradient of latency term: ∂E[lat]/∂α_i = p_i(lat_i − E[lat]).
S = stages (5), L = layers (20), D = depth choices {2,3,4}, K = kernel choices {3,5,7}, W = width choices {0.25,0.5,0.75,1.0}, R = resolution choices {128,160,192,224}.
T is a learned transformation matrix. The 5×5 center of W_7×7 (25 values) is projected to 9 values (3×3 kernel). This is learned, not a fixed crop.
TinyML L7: NAS I — Searching for Efficient Networks — DARTS, RL, evolutionary search, supernets. This lesson continues where L7 ends.
TinyML L4: Pruning II — NetAdapt & AMC — NetAdapt uses a hardware-in-the-loop lookup table exactly like ProxylessNAS, but for pruning ratios instead of operation types. Same lookup-table concept, different application.
TinyML L6: Quantization II — QAT — The Straight-Through Estimator (STE) used in ProxylessNAS binarization is identical to the STE used in quantization-aware training for rounding. Same mathematical trick, two domains.
"The goal is not to design a neural network. The goal is to design a process that produces the right neural network for every piece of hardware that will ever exist." — Song Han, MIT 6.5940