TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 7

NAS I: Searching for Efficient Networks

MobileNet took a team of Google engineers months to design. ResNet took Kaiming He's team years to iterate on. What if an algorithm could search billions of architecture candidates and return one optimized for your exact hardware budget — automatically? This lesson is the blueprint for that algorithm.

Prerequisites: TinyML L2 (NN Building Blocks) — depthwise-separable conv, inverted bottleneck, MAC counting. Basic reinforcement learning concepts helpful but not required.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Design Bottleneck

It is 2015. You are an ML engineer at Google. Your task: design a neural network for a mobile phone camera — one that achieves high accuracy on ImageNet but runs in real time on a Snapdragon CPU. You have five design dimensions to tune: the number of layers, the number of channels per layer, the kernel size at each layer, the block type (residual, bottleneck, inverted bottleneck), and the input resolution.

Each candidate design takes 4–6 GPU-hours to train and evaluate. You can try maybe a hundred configurations in a month. Meanwhile, the combinatorial search space you're navigating has on the order of 1019 valid configurations. You will never cover it by hand.

This is the problem NAS solves. The human designs the rules of the game — which operations are legal, how cells connect, what budget the final architecture must respect. The algorithm searches over the resulting space automatically, guided by an objective (accuracy, latency, energy, or some combination). The result: NASNet, EfficientNet, MobileNetV3, and dozens of architectures that outperform anything a human would design given the same time budget.

Why NAS is not magic: The algorithm doesn't invent new operations. It picks from the operations you gave it. Garbage in, garbage out — if your search space doesn't contain the right building blocks, no search strategy can find them. The design of the search space is at least as important as the search algorithm itself.

Before diving into search strategies, let's feel the scale of the problem. You are an architect. You have 6 candidate operations per layer (3×3 conv, 5×5 conv, 3×3 depthwise, 5×5 depthwise, identity, max-pool). You have 20 layers. That's 620 ≈ 3.6 × 1015 possible architectures — and that's not even counting width choices. The canvas below lets you feel how fast this explodes.

Search-Space Size Explorer — watch the combinatorial count explode

Adjust the number of layers and candidate operations per layer. The log-scale bar shows how the total search-space size grows — and why exhaustive search is impossible.

Layers10
Ops / layer5
A search space has 5 candidate operations at each of 20 layers. How many distinct architectures exist, and why is exhaustive evaluation infeasible?

Chapter 1: What is NAS?

Neural Architecture Search (NAS) is the process of automatically finding the best neural network architecture from a candidate set, optimizing some objective (accuracy, latency, energy, memory). Every NAS algorithm has three components working together: a search space (the universe of architectures that can be expressed), a search strategy (how we navigate that universe), and a performance estimation strategy (how we quickly evaluate any candidate without full training).

Think of it like this: the search space is the map. The search strategy is the navigation algorithm (random walk vs GPS vs intuition). The performance estimator is a compass that gives you a noisy reading of "how good is this point?" without requiring you to walk the whole path to find out.

The NAS loop: (1) Sample an architecture from the search space. (2) Estimate its performance (validation accuracy, latency, or both). (3) Use the performance signal to guide the next sample — toward better regions of the space. Repeat until budget exhausted. Return the best architecture found.
Search Space
All architectures that can be expressed: which operations, how connected, how wide, how deep
↓ feeds candidates into
Search Strategy
How to navigate: random, RL controller, evolutionary, gradient-based (DARTS)
↓ uses feedback from
Performance Estimation
How to cheaply score a candidate: train fewer epochs, use weight sharing, use a predictor
↻ reward/fitness updates strategy
Best Architecture Found
Retrain from scratch → deploy on target hardware

The field spent 2016–2022 pushing each of these three components. Early NAS used random search for the strategy and full training for estimation — 450 GPU-days per search run (the original NASNet paper). Modern NAS uses gradient-based search (DARTS, 4 GPU-days) with weight sharing (one supernet, zero retraining per candidate). That's a 100× reduction in search cost.

Historical milestone: The original NAS paper (Zoph and Le, ICLR 2017) required 450 GPU-days on 800 GPUs to find NASNet. DARTS (Liu et al., ICLR 2019) finds comparable architectures in 4 GPU-days on a single GPU — a 100× speedup from better search strategy + estimation alone.
Which component of NAS is responsible for reducing 450 GPU-days (original NAS) to 4 GPU-days (DARTS)?

Chapter 2: Search Space

The search space defines what architectures the NAS algorithm can even consider. A poorly designed search space — one that excludes the right building blocks — means the best possible find is mediocre. There are two main paradigms: cell-based search spaces and network-level (macro) search spaces.

Cell-Based Search Space

Instead of searching the entire network end-to-end, you search for a small repeating module called a cell. The cell is then stacked N times to form the full network. This dramatically shrinks the search space: instead of choosing operations for 50+ layers individually, you choose operations for a 5-block cell and the stacking rules handle the rest.

NASNet (Zoph et al., CVPR 2018) introduced two cell types: a Normal Cell that preserves feature map resolution and a Reduction Cell that halves the spatial dimensions. You stack Normal Cells in groups, with Reduction Cells between groups. The RNN controller only needs to design these two cells — everything else (how to stack them, how many stages) is fixed by the human designer.

Misconception: "Cell-based search = searching less." Cell-based search is actually MORE constrained than macro search — you can only find architectures that consist of repeated versions of a single learned cell. This is a strong inductive bias. It works well for classification on CIFAR and ImageNet, but may miss architectures where different stages look completely different (which the best manually-designed networks often do).

Each cell contains B blocks. Each block takes two inputs (hidden states from earlier in the cell), applies an operation to each, and combines them. The search choices per block are: which two inputs to use, which operation to apply to each, and how to combine the results (add or concatenate). With B=5 blocks and |O| candidate operations, the combinatorial count grows as (2×2×M×M×N)B — which we'll count precisely in the next chapter.

Network-Level Search Space

Macro NAS searches the whole network topology: depth (how many layers per stage), width (how many channels), kernel sizes, resolution, and even connectivity patterns (which layers receive skip connections from which earlier layers). ProxylessNAS and EfficientNet use this approach. Instead of a repeated cell, each layer independently chooses from a menu of operations (MBConv 3×3, MBConv 5×5, MBConv 7×7 with expansion ratios 3 or 6, identity, etc.).

For TinyML specifically, the search space must also include memory constraints. A GPU has 32 GB. An MCU like the STM32F746 has 320 KB of SRAM. Memory is not just a soft constraint — it is a hard wall. MCUNet addresses this by jointly searching the space (which width/depth combinations allow models to fit in SRAM) before searching within the space (which specific architecture has best accuracy).

Search space design is NAS's hardest problem. RegNet (Radosavovic et al., CVPR 2020) analyzed the statistics of good architectures and designed a quantized linear search space where width grows approximately linearly with depth — this eliminated poorly structured architectures before any search. Better space design directly improved final accuracy even with identical search algorithms.
What is the key advantage of cell-based search over network-level search?

Chapter 3: Space Size & Cell Counting

Let's count exactly how many architectures live in NASNet's cell-based search space. This is not an academic exercise — it motivates why we need smart search strategies and why exhaustive evaluation is hopeless.

NASNet Cell Structure

Recall the NASNet cell: it contains B = 5 blocks. Each block performs these 5 choices:

  1. Which hidden state to feed as the first input (choose from 2 candidates at the start, growing as blocks complete)
  2. Which hidden state to feed as the second input
  3. Which operation to apply to the first input (choose from M = 5 operations)
  4. Which operation to apply to the second input (choose from M = 5 operations)
  5. Which combination method to merge the two transformed states (N = 2: add or concatenate)

Assuming 2 candidate inputs for both input choices, M = 5 operations, N = 2 combination methods, per block the number of choices is: 2 × 2 × 5 × 5 × 2 = 200. Over B = 5 blocks: 2005? No — that overcounts because blocks are ordered and each new block can also use outputs of prior blocks. The simplified bound from the lecture is:

|S| = (2 × 2 × M × M × N)B = (2 × 2 × 5 × 5 × 2)5 = 2005 ≈ 3.2 × 1011

More precisely, the NASNet paper (with M = 13 operations including sep-conv 3×3, sep-conv 5×5, sep-conv 7×7, dilated sep-conv, max-pool 3×3, avg-pool 3×3, identity, etc.) counts the space as approximately 1.3 × 1010 unique cells (for the Normal Cell alone). Both the Normal and Reduction Cells must be discovered — so the joint space is even larger.

Worked Example: M=5, N=2, B=5

Let's plug in concrete numbers from the slide: M = 5 candidate operations, N = 2 combination methods, B = 5 blocks.

|S| = (4 × M2 × N)B = (4 × 25 × 2)5 = 2005 = 3.2 × 1011

Training each candidate for just 1 GPU-hour gives 3.2 × 1011 GPU-hours = 36.5 million GPU-years. Even with 800 GPUs running in parallel: 45,600 years. This is why random search (which can sample densely from a good region) and RL/gradient methods (which can move toward better regions) are necessary.

The scaling lesson: Each additional block multiplies the space size by 200. Going from B=4 to B=5 makes the space 200× larger. This is why cell-based search spaces specify a fixed B. You're not searching "how many blocks" — B is part of the design spec, not a variable. The search is over the operations and connections within a fixed block count.
python
# Search space size for cell-based NAS (NASNet-style)
def cell_space_size(M: int, N: int = 2, B: int = 5) -> float:
    """
    M: number of candidate operations per input
    N: combination methods (add=1, concat=2)
    B: number of blocks per cell
    Each block: 2 input choices × 2 input choices × M ops × M ops × N combine
    But input choices grow as blocks complete (for blocks 1..B, choices are 2+b-1)
    Simplified (lecture formula): (2*2*M*M*N)^B
    """
    return (4 * M**2 * N) ** B

# Lecture values
size = cell_space_size(M=5, N=2, B=5)
print(f"M=5, N=2, B=5: {size:.2e} candidates")  # 3.20e+11

size_nasnet = cell_space_size(M=13, N=2, B=5)
print(f"NASNet M=13: {size_nasnet:.2e} candidates")  # ~1.3e+13

# At 1 GPU-hour per candidate with 800 GPUs:
gpu_years = size / (800 * 8760)
print(f"GPU-years with 800 GPUs: {gpu_years:.0f}")  # 45,662 years
In the NASNet-style formula |S| = (4 × M² × N)B, what does M represent and why does it appear squared?

Chapter 4: RL Controller

The first successful NAS algorithm (Zoph and Le, ICLR 2017) used reinforcement learning to search the space. The idea is beautiful: treat architecture design as a sequential decision-making problem. An RNN controller samples an architecture token by token — "first layer: 3×3 conv, second layer: 5×5 depthwise, third layer: identity..." — like a language model generating a sentence. The architecture is the "sentence." The validation accuracy of the trained network is the "reward."

The RL Controller Loop

The controller is a recurrent neural network. At each step, it emits a categorical distribution over the possible choices at that position (operation type, kernel size, skip connection target, etc.) via a softmax. It samples from this distribution to generate one discrete architectural token. After all tokens are generated, the sampled architecture is trained for a few epochs and evaluated on a validation set.

Controller RNN
Outputs logits → softmax → sample discrete tokens (ops, connections)
↓ generate architecture at
Train Candidate
Train the sampled architecture for E epochs on training set
↓ evaluate on validation set
Reward Rt
Rt = validation accuracy of architecture at
↓ update controller via REINFORCE
Policy Update
θ J = E[Rtθ log πθ(at)]
↻ repeat for thousands of episodes

The REINFORCE Update — Derived

The controller has parameters θ. It defines a policy πθ(a) — the probability of sampling architecture a. We want to maximize the expected reward J(θ) = Ea~πθ[R(a)].

We can't compute this expectation exactly (the space has 1011+ architectures). But the REINFORCE gradient trick lets us estimate the gradient unbiasedly from samples:

θ J(θ) = Ea ~ πθ[ R(a) · ∇θ log πθ(a) ]

In words: if architecture a got reward R(a), push the controller's parameters in the direction that would have made a more likely — scaled by R(a). Architectures with high reward get positively reinforced. Architectures with low reward get negatively reinforced.

Worked Step: One REINFORCE Update

Suppose the controller samples architecture a = [3×3 conv, 5×5 dwconv, skip]. The RNN's log-probability of generating this sequence is log πθ(a) = log(0.3) + log(0.4) + log(0.5) = −4.12. The architecture achieves R = 74.2% validation accuracy on a held-out set. The reward baseline (exponential moving average of past rewards) is b = 72.0%.

We subtract the baseline to reduce variance:

gradient signal = (R - b) · ∇θ log πθ(a) = (74.2 - 72.0) · ∇θ(-4.12) = 2.2 · ∇θ(-4.12)

The positive (R − b) = 2.2 means: this architecture was better than average. The gradient update pushes θ to make this architecture's tokens more likely in the future. If the reward were below baseline, the signal would be negative — suppressing those choices.

Misconception: "RL NAS converges to the global optimum." REINFORCE gives unbiased gradient estimates, but the variance is enormous. The original NASNet needed 450 GPU-days because thousands of samples are needed before the gradient estimates reliably point toward better architectures. The controller is also limited to architectures it has seen — it can't reason about unseen combinations. Evolutionary and gradient methods often converge faster.
python
# Sketch: RL controller sampling + REINFORCE update
import torch
import torch.nn as nn

class NASController(nn.Module):
    def __init__(self, num_ops=5, num_layers=10, hidden=64):
        super().__init__()
        self.rnn = nn.LSTMCell(num_ops, hidden)
        self.heads = nn.ModuleList([
            nn.Linear(hidden, num_ops) for _ in range(num_layers)
        ])
        self.baseline = 0.0

    def sample_architecture(self):
        # returns (architecture_tokens, log_prob_sum)
        tokens, log_probs = [], []
        h = c = torch.zeros(1, self.rnn.hidden_size)
        x = torch.zeros(1, self.rnn.input_size)
        for head in self.heads:
            h, c = self.rnn(x, (h, c))
            logits = head(h)                      # (1, num_ops)
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()                  # discrete choice
            tokens.append(tok.item())
            log_probs.append(dist.log_prob(tok))
            x = nn.functional.one_hot(tok, self.rnn.input_size).float()
        return tokens, torch.stack(log_probs).sum()

    def reinforce_update(self, log_prob_sum, reward, lr=1e-3, ema=0.9):
        self.baseline = ema * self.baseline + (1 - ema) * reward
        loss = -log_prob_sum * (reward - self.baseline)
        # maximize reward ⟺ minimize -reward * log π(a)
        loss.backward()
RL Controller Loop Animation — watch sampled architectures improve over episodes

Each episode: the controller samples an architecture (mapped to a 2D accuracy-vs-cost point), evaluates it, and updates. The policy improves — samples gradually concentrate near the Pareto front. Click "Run Episode" to step forward one at a time, or "Run 20" to fast-forward.

Episode 0 — Click "Run Episode" to start.
In the REINFORCE update, what role does the reward baseline b play?

Chapter 5: Evolutionary Search

The second major NAS search strategy mimics biological evolution. You maintain a population of candidate architectures. You evaluate each on a fitness function (usually validation accuracy, sometimes combined with latency). You keep the fittest individuals, mutate some of them (randomly change one operation or one layer depth), and crossover others (recombine choices from two parents). The population evolves over generations toward better architectures.

Why Evolution Works for Architecture Search

Evolution is effective when the fitness landscape has structure — when similar architectures tend to have similar performance. Architecture space does have this property: an architecture that uses 5×5 depthwise conv in layer 3 is likely to perform similarly to one that uses 3×3 depthwise conv in layer 3, because they share the same blueprint everywhere else. Small mutations produce small performance changes, not catastrophic ones. This smooth landscape means evolved populations can navigate toward good regions.

Mutation

Mutation on depth: Change Stage 1 from 3 layers to 2 layers, keeping all operations intact. This tests whether the network can achieve the same accuracy with fewer parameters at that stage.

Mutation on operator: Keep all layer counts fixed but change one operation from MB6 3×3 to MB6 5×5 in Stage 2 Layer 1. This tests whether a larger receptive field at that specific position improves accuracy.

In Once-for-All (Cai et al., ICLR 2020), a supernet is trained once and mutations are applied to sub-network architectures sampled from the supernet. Mutation cost = zero retraining, just re-evaluate the mutated sub-network using the shared weights.

Crossover

Crossover takes two parent architectures and randomly combines their layer choices. Parent 1 uses MB6 3×3 in layers {1,3,5}; Parent 2 uses MB6 5×5 in layers {2,4,6}. For each layer, randomly pick from Parent 1 or Parent 2 to form the child. The child inherits the best genes (hopefully) from both parents.

Fitness function beyond accuracy: For hardware-aware NAS, the fitness function is multi-objective. OFA defines F(accuracy, latency) where you keep an architecture only if it satisfies the latency constraint AND has the highest accuracy among constraint-satisfying candidates in this generation. This naturally evolves a Pareto-optimal population — fast architectures are accurate, and accurate architectures are fast.

Comparison: RL vs Evolution

AspectRL ControllerEvolutionary Search
StateController parameters θ (gradient updates)Population of architectures (discrete)
Learning signalReward R − baseline via REINFORCEFitness (accuracy + HW metrics)
ExplorationProbabilistic sampling from controllerMutation + crossover
ParallelismSample many architectures per batchEvaluate entire population in parallel
Hardware awarenessAdd latency to reward (MnasNet)Add latency to fitness function directly
Search costHigh (each evaluation = training run)High, but lower w/ supernet
python
# Evolutionary NAS sketch — mutation + tournament selection
import random

def mutate(arch, ops_per_layer, p_mutate=0.1):
    """Randomly change each layer's op with probability p_mutate."""
    return [
        random.choice(ops_per_layer) if random.random() < p_mutate else op
        for op in arch
    ]

def crossover(parent1, parent2):
    """For each layer, randomly pick from parent1 or parent2."""
    return [
        p1 if random.random() < 0.5 else p2
        for p1, p2 in zip(parent1, parent2)
    ]

def evolutionary_nas(search_space, fitness_fn, pop_size=50, generations=30):
    ops = search_space['ops']
    # Initialize random population
    population = [[random.choice(ops) for _ in range(search_space['layers'])]
                  for _ in range(pop_size)]
    for gen in range(generations):
        scores = [fitness_fn(a) for a in population]
        # Keep top 50%, generate children via mutation + crossover
        top_k = sorted(zip(scores, population), reverse=True)[:pop_size//2]
        survivors = [arch for _, arch in top_k]
        children = [mutate(crossover(*random.sample(survivors, 2)), ops)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness_fn)
Why does evolutionary search work well for neural architecture search, even though it has no gradient information?

Chapter 6: DARTS: Gradient-Based NAS

RL and evolutionary search are both black-box methods: they treat the search space as discrete, treat performance evaluation as an oracle, and can't compute gradients through architecture choices. DARTS (Differentiable Architecture Search, Liu et al., ICLR 2019) fixes this by making the architecture search itself differentiable.

The Continuous Relaxation

In a standard cell, each edge (i→j) carries exactly one operation (e.g., 3×3 conv). DARTS replaces that single choice with a mixture of all operations, weighted by learned scalars. The key equation:

ō(i,j)(x) = ∑o ∈ O   softmax(i,j))o  ·  o(x)

Here O is the set of candidate operations. α(i,j) is a vector of real-valued scalars (one per operation) — the architecture parameters. The softmax converts them to weights that sum to 1. The "mixed operation" ō simultaneously runs ALL operations on x and produces a weighted average of their outputs.

Worked Example: 3 Operations

Suppose O = {3×3 conv, 5×5 conv, skip}. The architecture parameters for edge (0,1) are α = [2.1, 0.3, −0.8]. Let's compute the DARTS mixed operation output for input x:

softmax(α) = softmax([2.1, 0.3, -0.8]) = [e2.1, e0.3, e-0.8] / (e2.1 + e0.3 + e-0.8)
= [8.17, 1.35, 0.45] / 9.97 = [0.820, 0.135, 0.045]
ō(x) = 0.820 · conv3x3(x) + 0.135 · conv5x5(x) + 0.045 · x

The gradient of the validation loss with respect to α flows through this weighted sum. If conv3×3 outputs lead to lower loss (better validation accuracy), its α value increases, increasing its weight in future mixed operations. The α values converge toward the best operation on each edge.

Discretization

After the search phase, each edge is discretized: pick the operation with the highest α weight. In the example above, that's conv3×3 (weight 0.820). The supernet collapses to a single discrete architecture. This is then retrained from scratch to get the final accuracy.

DARTS's known weakness: The discretization step is noisy. The supernet is optimized for the continuous mixture, not for any single discrete architecture. After discretization, performance can drop significantly — an operation that dominated the mixture (weight 0.82) may not be the best when used alone. Several follow-up works (PC-DARTS, DARTS+) reduce this gap with regularization and early stopping on α.

Bilevel Optimization

DARTS has two sets of parameters: network weights w (the convolution filters) and architecture parameters α. They're optimized jointly via bilevel optimization:

minα Lval(w*(α), α)   s.t.   w*(α) = argminw Ltrain(w, α)

In practice this is approximated by alternating: one gradient step on w using training data, one gradient step on α using validation data. This avoids the expensive inner optimization loop.

python
# DARTS mixed operation — the core idea
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS mixed operation: weighted sum over all candidate ops."""
    def __init__(self, ops: list[nn.Module]):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # Architecture parameters α — initialized near uniform
        self.alphas = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        weights = F.softmax(self.alphas, dim=0)  # shape: (|O|,)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self) -> nn.Module:
        """Return the operation with the highest α weight."""
        best_idx = self.alphas.argmax().item()
        return self.ops[best_idx]

# During DARTS search (alternating bilevel optimization):
# Step 1: one step on w using train_batch, α fixed
# Step 2: one step on α using val_batch, w fixed
for train_batch, val_batch in zip(train_loader, val_loader):
    w_optim.zero_grad()
    loss_train = model(*train_batch)
    loss_train.backward()
    w_optim.step()  # update weights w, α unchanged

    alpha_optim.zero_grad()
    loss_val = model(*val_batch)
    loss_val.backward()
    alpha_optim.step()  # update α, w unchanged
DARTS Continuous Relaxation — drag α logits, watch the mixed-op weights shift

Three candidate operations on a single edge: 3×3 Conv, 5×5 Conv, Skip. Drag the sliders to set the α logits. Watch the softmax weights update live. Click "Discretize" to snap to the argmax.

α0 (3×3 Conv)2.1
α1 (5×5 Conv)0.3
α2 (Skip)-0.8
Drag sliders to adjust α logits.
In DARTS, why does training the architecture parameters α on the validation set (not the training set) matter?

Chapter 7: Performance Estimation

The bottleneck in any NAS algorithm isn't the search strategy — it's how you evaluate each candidate. Training a network to convergence on ImageNet takes 90 epochs × 1.2M images × 4 GPU-hours = days per candidate. With 1011 candidates in the search space, you need much faster estimation.

Strategy 1: Lower-Fidelity Evaluation

Train each candidate for fewer epochs on a smaller dataset (e.g., CIFAR-10 as a proxy for ImageNet), or at lower resolution, or with fewer channels. The relative ranking of architectures is usually preserved even under these degraded conditions — an architecture that wins at 5 epochs often wins at 90 epochs. This reduces cost by 10–20×.

Strategy 2: Weight Sharing (One-Shot / Supernet)

The most impactful innovation in NAS. Instead of training each candidate from scratch, train one giant supernet whose weights are shared across all sub-network architectures. A sub-network (a specific architecture) is just a choice of which edges/operations to activate in the supernet. All sub-networks sharing edges also share the weights on those edges.

SMASH (Brock et al.) and ENAS (Pham et al., 2018) were early weight-sharing methods. Once-for-All (OFA, Cai et al., ICLR 2020) extends this: train one supernet once for 300 epochs, then search over sub-networks for free by just evaluating them using the shared weights — no retraining at all.

The weight-sharing accuracy gap: Weights shared in the supernet were optimized for the average of all sub-architectures, not for any single one. So the ranking of sub-networks by shared-weight accuracy is only a noisy proxy for their true standalone accuracy. Small networks and large networks sharing the same weights are both sub-optimal. OFA addresses this via progressive shrinking: first train full net, then progressively fine-tune smaller subsets.

Strategy 3: Accuracy Predictors

Train a regression model (usually a small MLP or GNN) to predict a new architecture's accuracy from its encoding. The predictor is trained on ~500 architecture-accuracy pairs from early in the NAS run. Once trained, it scores millions of new candidates in milliseconds. This is the approach used by NAT, BRP-NAS, and Semi-Supervised NAS.

The Cost Comparison

MethodCost per CandidateTotal Search Cost
Train from scratch (full)4–6 GPU-hours450 GPU-days (NASNet)
Lower fidelity (10 epochs)~20 GPU-min~45 GPU-days
Weight sharing (ENAS)~2 GPU-min (shared weights)~0.5 GPU-days
DARTS (continuous)N/A — gradient through space4 GPU-days
Accuracy predictorMilliseconds<1 GPU-day (incl. predictor training)
Supernet Weight Sharing — one network, many sub-paths

A 3-stage supernet with 3 operation choices per stage. Highlighted sub-path = one candidate architecture using shared weights. Toggle between "Supernet" view and candidate architectures. See how the same weights are reused across different sub-networks.

Showing supernet. Click "Next Architecture" to highlight a sub-network path.
Why is weight-sharing accuracy a "noisy proxy" for true standalone accuracy?

Chapter 8: Showcase: NAS Strategy Explorer

This showcase puts all three NAS search strategies side-by-side: Random Search, Evolutionary, and DARTS (simulated as gradient ascent in accuracy space). A 2D accuracy-vs-MACs landscape represents the search space. The goal: find the Pareto-optimal front of high-accuracy, low-MACs architectures. Watch how each strategy navigates — and how quickly each converges.

NAS Strategy Comparison — navigate a simulated accuracy-vs-MACs landscape

Three strategies search the same landscape in parallel. Each dot is a sampled architecture. Stars mark the best found so far. Click "Run 10 Steps" to advance all three, or step individual strategies. The Pareto front (best accuracy per cost level) advances as search progresses.

Click "Run 10 Steps" to begin search.
What to watch for: Random search explores broadly but needs many samples to find good architectures. Evolutionary search converges faster once it finds a good region — mutation keeps it near proven winners. Simulated DARTS focuses quickly on promising areas but may get stuck in a local optimum. In practice, the landscape is 1011-dimensional, not 2D — these dynamics become more pronounced.
Why does evolutionary search tend to converge faster than random search on practical NAS problems, given that neither has gradient information?

Chapter 9: Connections & Cheat Sheet

NAS is not one algorithm — it's a framework with three independently-improving components. The best modern NAS systems (OFA, EfficientNet, ProxylessNAS) mix and match these components to achieve 100× faster search than the 2017 baseline.

NAS Component Cheat Sheet

ComponentOptionsKey Trade-off
Search SpaceCell-based (NASNet) | Network-level (ProxylessNAS, OFA) | TinyML (MCUNet)Expressiveness vs tractability. Cell-based: smaller space, strong prior. Network-level: richer but harder to search.
Search StrategyRandom | RL (NASNet, MnasNet) | Evolutionary (OFA, AmoebaNet) | Gradient (DARTS, PC-DARTS)RL: principled but high variance, expensive. Evolution: robust, parallelizable, hardware-aware. DARTS: fast, differentiable, but discretization gap.
Performance EstimationFull training | Low-fidelity | Weight sharing (ENAS, OFA) | Accuracy predictorFull training: true accuracy. Weight sharing: 100–1000× faster but noisy. Predictor: milliseconds but requires training data.

RL vs Evolution vs DARTS — When to Use Which

CriterionRL (REINFORCE)EvolutionaryDARTS
Search costHigh (450 GPU-days)Medium (with supernet: low)Low (4 GPU-days)
Hardware awarenessReward termFitness termLatency loss term
Multi-objectiveScalarized rewardPareto front directlyGradient w/ latency regularizer
Discrete/continuousDiscrete (sampling)Discrete (mutation)Continuous → discrete (discretize)
Theoretical guaranteeConverges to local optNone (heuristic)Local opt (bilevel)

Bridge to NAS II: Hardware-Aware and Once-for-All

This lecture covered NAS fundamentals. The next frontier is hardware-aware NAS: making the search space and objective directly aware of target-hardware latency, not just FLOPs. ProxylessNAS models latency as a differentiable function of architecture parameters. MnasNet adds a latency term to the RL reward. OFA trains a single supernet and searches for specialized sub-networks for each hardware target — a phone, a Raspberry Pi, an MCU — without retraining.

NAS II (MIT 6.5940 Lecture 8) covers: hardware-aware NAS, Once-for-All progressive shrinking, zero-shot NAS (scoring architectures without ANY training using gradient statistics), and neural-hardware co-design (searching architecture and accelerator simultaneously).

Why NAS still requires human judgment: The search space must be designed by hand. The hardware cost model must be specified. The accuracy metric (top-1 on ImageNet? BLEU? mAP?) must be chosen. NAS automates the search within a human-defined framework — it doesn't replace the engineer, it amplifies them. A well-designed search space found by random search beats a poorly-designed search space found by the best algorithm.

Related Gleams

"The art of NAS is not in the search algorithm — it's in designing a search space so well that even random search finds great architectures." — paraphrase of insights from Real et al., AAAI 2019
A practitioner wants to find an efficient network for a new IoT microcontroller with 512 KB RAM and 1 MB flash. They have limited GPU budget (2 GPU-days). Which NAS component choices are most appropriate?