NAS I: Searching for Efficient Networks

Chapter 0: The Design Bottleneck

It is 2015. You are an ML engineer at Google. Your task: design a neural network for a mobile phone camera — one that achieves high accuracy on ImageNet but runs in real time on a Snapdragon CPU. You have five design dimensions to tune: the number of layers, the number of channels per layer, the kernel size at each layer, the block type (residual, bottleneck, inverted bottleneck), and the input resolution.

Each candidate design takes 4–6 GPU-hours to train and evaluate. You can try maybe a hundred configurations in a month. Meanwhile, the combinatorial search space you're navigating has on the order of 10¹⁹ valid configurations. You will never cover it by hand.

This is the problem NAS solves. The human designs the rules of the game — which operations are legal, how cells connect, what budget the final architecture must respect. The algorithm searches over the resulting space automatically, guided by an objective (accuracy, latency, energy, or some combination). The result: NASNet, EfficientNet, MobileNetV3, and dozens of architectures that outperform anything a human would design given the same time budget.

Why NAS is not magic: The algorithm doesn't invent new operations. It picks from the operations you gave it. Garbage in, garbage out — if your search space doesn't contain the right building blocks, no search strategy can find them. The design of the search space is at least as important as the search algorithm itself.

Before diving into search strategies, let's feel the scale of the problem. You are an architect. You have 6 candidate operations per layer (3×3 conv, 5×5 conv, 3×3 depthwise, 5×5 depthwise, identity, max-pool). You have 20 layers. That's 6²⁰ ≈ 3.6 × 10¹⁵ possible architectures — and that's not even counting width choices. The canvas below lets you feel how fast this explodes.

Search-Space Size Explorer — watch the combinatorial count explode

Adjust the number of layers and candidate operations per layer. The log-scale bar shows how the total search-space size grows — and why exhaustive search is impossible.

Layers10

Ops / layer5

A search space has 5 candidate operations at each of 20 layers. How many distinct architectures exist, and why is exhaustive evaluation infeasible?

5 × 20 = 100 architectures — easily exhaustive with 100 training runs. 5²⁰ ≈ 10¹⁴ architectures, but we can prune 99% by accuracy, leaving ~10¹² feasible ones. 5²⁰ ≈ 10¹⁴ architectures — training each takes hours, so exhaustive search would take billions of GPU-years. The space is infinite because you can always add more layers, so no count is meaningful.

Chapter 1: What is NAS?

Neural Architecture Search (NAS) is the process of automatically finding the best neural network architecture from a candidate set, optimizing some objective (accuracy, latency, energy, memory). Every NAS algorithm has three components working together: a search space (the universe of architectures that can be expressed), a search strategy (how we navigate that universe), and a performance estimation strategy (how we quickly evaluate any candidate without full training).

Think of it like this: the search space is the map. The search strategy is the navigation algorithm (random walk vs GPS vs intuition). The performance estimator is a compass that gives you a noisy reading of "how good is this point?" without requiring you to walk the whole path to find out.

The NAS loop: (1) Sample an architecture from the search space. (2) Estimate its performance (validation accuracy, latency, or both). (3) Use the performance signal to guide the next sample — toward better regions of the space. Repeat until budget exhausted. Return the best architecture found.

Search Space

All architectures that can be expressed: which operations, how connected, how wide, how deep

↓ feeds candidates into

Search Strategy

How to navigate: random, RL controller, evolutionary, gradient-based (DARTS)

↓ uses feedback from

Performance Estimation

How to cheaply score a candidate: train fewer epochs, use weight sharing, use a predictor

↻ reward/fitness updates strategy

Best Architecture Found

Retrain from scratch → deploy on target hardware

The field spent 2016–2022 pushing each of these three components. Early NAS used random search for the strategy and full training for estimation — 450 GPU-days per search run (the original NASNet paper). Modern NAS uses gradient-based search (DARTS, 4 GPU-days) with weight sharing (one supernet, zero retraining per candidate). That's a 100× reduction in search cost.

Historical milestone: The original NAS paper (Zoph and Le, ICLR 2017) required 450 GPU-days on 800 GPUs to find NASNet. DARTS (Liu et al., ICLR 2019) finds comparable architectures in 4 GPU-days on a single GPU — a 100× speedup from better search strategy + estimation alone.

Which component of NAS is responsible for reducing 450 GPU-days (original NAS) to 4 GPU-days (DARTS)?

Expanding the search space to include more operations, providing more candidates to choose from. Using a larger neural network model as the controller to make better decisions. Better search strategy (gradient-based instead of RL) and better performance estimation (weight sharing instead of training from scratch). Using more GPU hardware in parallel, distributing the search across thousands of machines.

Chapter 2: Search Space

The search space defines what architectures the NAS algorithm can even consider. A poorly designed search space — one that excludes the right building blocks — means the best possible find is mediocre. There are two main paradigms: cell-based search spaces and network-level (macro) search spaces.

Cell-Based Search Space

Instead of searching the entire network end-to-end, you search for a small repeating module called a cell. The cell is then stacked N times to form the full network. This dramatically shrinks the search space: instead of choosing operations for 50+ layers individually, you choose operations for a 5-block cell and the stacking rules handle the rest.

NASNet (Zoph et al., CVPR 2018) introduced two cell types: a Normal Cell that preserves feature map resolution and a Reduction Cell that halves the spatial dimensions. You stack Normal Cells in groups, with Reduction Cells between groups. The RNN controller only needs to design these two cells — everything else (how to stack them, how many stages) is fixed by the human designer.

Misconception: "Cell-based search = searching less." Cell-based search is actually MORE constrained than macro search — you can only find architectures that consist of repeated versions of a single learned cell. This is a strong inductive bias. It works well for classification on CIFAR and ImageNet, but may miss architectures where different stages look completely different (which the best manually-designed networks often do).

Each cell contains B blocks. Each block takes two inputs (hidden states from earlier in the cell), applies an operation to each, and combines them. The search choices per block are: which two inputs to use, which operation to apply to each, and how to combine the results (add or concatenate). With B=5 blocks and |O| candidate operations, the combinatorial count grows as (2×2×M×M×N)^B — which we'll count precisely in the next chapter.

Network-Level Search Space

Macro NAS searches the whole network topology: depth (how many layers per stage), width (how many channels), kernel sizes, resolution, and even connectivity patterns (which layers receive skip connections from which earlier layers). ProxylessNAS and EfficientNet use this approach. Instead of a repeated cell, each layer independently chooses from a menu of operations (MBConv 3×3, MBConv 5×5, MBConv 7×7 with expansion ratios 3 or 6, identity, etc.).

For TinyML specifically, the search space must also include memory constraints. A GPU has 32 GB. An MCU like the STM32F746 has 320 KB of SRAM. Memory is not just a soft constraint — it is a hard wall. MCUNet addresses this by jointly searching the space (which width/depth combinations allow models to fit in SRAM) before searching within the space (which specific architecture has best accuracy).

Search space design is NAS's hardest problem. RegNet (Radosavovic et al., CVPR 2020) analyzed the statistics of good architectures and designed a quantized linear search space where width grows approximately linearly with depth — this eliminated poorly structured architectures before any search. Better space design directly improved final accuracy even with identical search algorithms.

What is the key advantage of cell-based search over network-level search?

Cell-based search finds architectures that always outperform macro-searched networks on every task. By searching only a small repeating cell instead of the full network, the search space shrinks dramatically while the full-network architecture is recovered by stacking. Cell-based search requires no performance estimation because cells are too small to need accuracy evaluation. Cell-based search automatically discovers the correct number of stages and cells to stack.

Chapter 3: Space Size & Cell Counting

Let's count exactly how many architectures live in NASNet's cell-based search space. This is not an academic exercise — it motivates why we need smart search strategies and why exhaustive evaluation is hopeless.

NASNet Cell Structure

Recall the NASNet cell: it contains B = 5 blocks. Each block performs these 5 choices:

Which hidden state to feed as the first input (choose from 2 candidates at the start, growing as blocks complete)
Which hidden state to feed as the second input
Which operation to apply to the first input (choose from M = 5 operations)
Which operation to apply to the second input (choose from M = 5 operations)
Which combination method to merge the two transformed states (N = 2: add or concatenate)

Assuming 2 candidate inputs for both input choices, M = 5 operations, N = 2 combination methods, per block the number of choices is: 2 × 2 × 5 × 5 × 2 = 200. Over B = 5 blocks: 200⁵? No — that overcounts because blocks are ordered and each new block can also use outputs of prior blocks. The simplified bound from the lecture is:

|S| = (2 × 2 × M × M × N)^B = (2 × 2 × 5 × 5 × 2)⁵ = 200⁵ ≈ 3.2 × 10¹¹

More precisely, the NASNet paper (with M = 13 operations including sep-conv 3×3, sep-conv 5×5, sep-conv 7×7, dilated sep-conv, max-pool 3×3, avg-pool 3×3, identity, etc.) counts the space as approximately 1.3 × 10¹⁰ unique cells (for the Normal Cell alone). Both the Normal and Reduction Cells must be discovered — so the joint space is even larger.

Worked Example: M=5, N=2, B=5

Let's plug in concrete numbers from the slide: M = 5 candidate operations, N = 2 combination methods, B = 5 blocks.

|S| = (4 × M² × N)^B = (4 × 25 × 2)⁵ = 200⁵ = 3.2 × 10¹¹

Training each candidate for just 1 GPU-hour gives 3.2 × 10¹¹ GPU-hours = 36.5 million GPU-years. Even with 800 GPUs running in parallel: 45,600 years. This is why random search (which can sample densely from a good region) and RL/gradient methods (which can move toward better regions) are necessary.

The scaling lesson: Each additional block multiplies the space size by 200. Going from B=4 to B=5 makes the space 200× larger. This is why cell-based search spaces specify a fixed B. You're not searching "how many blocks" — B is part of the design spec, not a variable. The search is over the operations and connections within a fixed block count.

python
# Search space size for cell-based NAS (NASNet-style)
def cell_space_size(M: int, N: int = 2, B: int = 5) -> float:
    """
    M: number of candidate operations per input
    N: combination methods (add=1, concat=2)
    B: number of blocks per cell
    Each block: 2 input choices × 2 input choices × M ops × M ops × N combine
    But input choices grow as blocks complete (for blocks 1..B, choices are 2+b-1)
    Simplified (lecture formula): (2*2*M*M*N)^B
    """
    return (4 * M**2 * N) ** B

# Lecture values
size = cell_space_size(M=5, N=2, B=5)
print(f"M=5, N=2, B=5: {size:.2e} candidates")  # 3.20e+11

size_nasnet = cell_space_size(M=13, N=2, B=5)
print(f"NASNet M=13: {size_nasnet:.2e} candidates")  # ~1.3e+13

# At 1 GPU-hour per candidate with 800 GPUs:
gpu_years = size / (800 * 8760)
print(f"GPU-years with 800 GPUs: {gpu_years:.0f}")  # 45,662 years

In the NASNet-style formula |S| = (4 × M² × N)^B, what does M represent and why does it appear squared?

M is the number of blocks; it's squared because each block has a Normal and Reduction variant. M is the number of combination methods; it's squared because you combine two operations per block. M is the number of candidate operations; it's squared because each block applies an operation to EACH of its two inputs independently, giving M choices for input 1 and M choices for input 2. M is the number of hidden states; it's squared because you choose two hidden states per block.

Chapter 4: RL Controller

The first successful NAS algorithm (Zoph and Le, ICLR 2017) used reinforcement learning to search the space. The idea is beautiful: treat architecture design as a sequential decision-making problem. An RNN controller samples an architecture token by token — "first layer: 3×3 conv, second layer: 5×5 depthwise, third layer: identity..." — like a language model generating a sentence. The architecture is the "sentence." The validation accuracy of the trained network is the "reward."

The RL Controller Loop

The controller is a recurrent neural network. At each step, it emits a categorical distribution over the possible choices at that position (operation type, kernel size, skip connection target, etc.) via a softmax. It samples from this distribution to generate one discrete architectural token. After all tokens are generated, the sampled architecture is trained for a few epochs and evaluated on a validation set.

Controller RNN

Outputs logits → softmax → sample discrete tokens (ops, connections)

↓ generate architecture a_t

Train Candidate

Train the sampled architecture for E epochs on training set

↓ evaluate on validation set

Reward R_t

R_t = validation accuracy of architecture a_t

↓ update controller via REINFORCE

Policy Update

∇_θ J = E[R_t ∇_θ log π_θ(a_t)]

↻ repeat for thousands of episodes

The REINFORCE Update — Derived

The controller has parameters θ. It defines a policy π_θ(a) — the probability of sampling architecture a. We want to maximize the expected reward J(θ) = E_{a~π_θ}[R(a)].

We can't compute this expectation exactly (the space has 10¹¹+ architectures). But the REINFORCE gradient trick lets us estimate the gradient unbiasedly from samples:

∇_θ J(θ) = E_{a ~ π_θ}[ R(a) · ∇_θ log π_θ(a) ]

In words: if architecture a got reward R(a), push the controller's parameters in the direction that would have made a more likely — scaled by R(a). Architectures with high reward get positively reinforced. Architectures with low reward get negatively reinforced.

Worked Step: One REINFORCE Update

Suppose the controller samples architecture a = [3×3 conv, 5×5 dwconv, skip]. The RNN's log-probability of generating this sequence is log π_θ(a) = log(0.3) + log(0.4) + log(0.5) = −4.12. The architecture achieves R = 74.2% validation accuracy on a held-out set. The reward baseline (exponential moving average of past rewards) is b = 72.0%.

We subtract the baseline to reduce variance:

gradient signal = (R - b) · ∇_θ log π_θ(a) = (74.2 - 72.0) · ∇_θ(-4.12) = 2.2 · ∇_θ(-4.12)

The positive (R − b) = 2.2 means: this architecture was better than average. The gradient update pushes θ to make this architecture's tokens more likely in the future. If the reward were below baseline, the signal would be negative — suppressing those choices.

Misconception: "RL NAS converges to the global optimum." REINFORCE gives unbiased gradient estimates, but the variance is enormous. The original NASNet needed 450 GPU-days because thousands of samples are needed before the gradient estimates reliably point toward better architectures. The controller is also limited to architectures it has seen — it can't reason about unseen combinations. Evolutionary and gradient methods often converge faster.

python
# Sketch: RL controller sampling + REINFORCE update
import torch
import torch.nn as nn

class NASController(nn.Module):
    def __init__(self, num_ops=5, num_layers=10, hidden=64):
        super().__init__()
        self.rnn = nn.LSTMCell(num_ops, hidden)
        self.heads = nn.ModuleList([
            nn.Linear(hidden, num_ops) for _ in range(num_layers)
        ])
        self.baseline = 0.0

    def sample_architecture(self):
        # returns (architecture_tokens, log_prob_sum)
        tokens, log_probs = [], []
        h = c = torch.zeros(1, self.rnn.hidden_size)
        x = torch.zeros(1, self.rnn.input_size)
        for head in self.heads:
            h, c = self.rnn(x, (h, c))
            logits = head(h)                      # (1, num_ops)
            dist = torch.distributions.Categorical(logits=logits)
            tok = dist.sample()                  # discrete choice
            tokens.append(tok.item())
            log_probs.append(dist.log_prob(tok))
            x = nn.functional.one_hot(tok, self.rnn.input_size).float()
        return tokens, torch.stack(log_probs).sum()

    def reinforce_update(self, log_prob_sum, reward, lr=1e-3, ema=0.9):
        self.baseline = ema * self.baseline + (1 - ema) * reward
        loss = -log_prob_sum * (reward - self.baseline)
        # maximize reward ⟺ minimize -reward * log π(a)
        loss.backward()

RL Controller Loop Animation — watch sampled architectures improve over episodes

Each episode: the controller samples an architecture (mapped to a 2D accuracy-vs-cost point), evaluates it, and updates. The policy improves — samples gradually concentrate near the Pareto front. Click "Run Episode" to step forward one at a time, or "Run 20" to fast-forward.

Episode 0 — Click "Run Episode" to start.

In the REINFORCE update, what role does the reward baseline b play?

It sets the maximum allowed reward, preventing the controller from over-fitting to single architectures. It reduces variance: only architectures that beat the baseline (R > b) get reinforced; weaker architectures get suppressed. The gradient signal (R − b) is an advantage estimate. It normalizes the reward to the range [0, 1] so the gradient step size is consistent. It is the reward from the previous episode, used to compute temporal differences like in Q-learning.

Chapter 5: Evolutionary Search

The second major NAS search strategy mimics biological evolution. You maintain a population of candidate architectures. You evaluate each on a fitness function (usually validation accuracy, sometimes combined with latency). You keep the fittest individuals, mutate some of them (randomly change one operation or one layer depth), and crossover others (recombine choices from two parents). The population evolves over generations toward better architectures.

Why Evolution Works for Architecture Search

Evolution is effective when the fitness landscape has structure — when similar architectures tend to have similar performance. Architecture space does have this property: an architecture that uses 5×5 depthwise conv in layer 3 is likely to perform similarly to one that uses 3×3 depthwise conv in layer 3, because they share the same blueprint everywhere else. Small mutations produce small performance changes, not catastrophic ones. This smooth landscape means evolved populations can navigate toward good regions.

Mutation

Mutation on depth: Change Stage 1 from 3 layers to 2 layers, keeping all operations intact. This tests whether the network can achieve the same accuracy with fewer parameters at that stage.

Mutation on operator: Keep all layer counts fixed but change one operation from MB6 3×3 to MB6 5×5 in Stage 2 Layer 1. This tests whether a larger receptive field at that specific position improves accuracy.

In Once-for-All (Cai et al., ICLR 2020), a supernet is trained once and mutations are applied to sub-network architectures sampled from the supernet. Mutation cost = zero retraining, just re-evaluate the mutated sub-network using the shared weights.

Crossover

Crossover takes two parent architectures and randomly combines their layer choices. Parent 1 uses MB6 3×3 in layers {1,3,5}; Parent 2 uses MB6 5×5 in layers {2,4,6}. For each layer, randomly pick from Parent 1 or Parent 2 to form the child. The child inherits the best genes (hopefully) from both parents.

Fitness function beyond accuracy: For hardware-aware NAS, the fitness function is multi-objective. OFA defines F(accuracy, latency) where you keep an architecture only if it satisfies the latency constraint AND has the highest accuracy among constraint-satisfying candidates in this generation. This naturally evolves a Pareto-optimal population — fast architectures are accurate, and accurate architectures are fast.

Comparison: RL vs Evolution

Aspect	RL Controller	Evolutionary Search
State	Controller parameters θ (gradient updates)	Population of architectures (discrete)
Learning signal	Reward R − baseline via REINFORCE	Fitness (accuracy + HW metrics)
Exploration	Probabilistic sampling from controller	Mutation + crossover
Parallelism	Sample many architectures per batch	Evaluate entire population in parallel
Hardware awareness	Add latency to reward (MnasNet)	Add latency to fitness function directly
Search cost	High (each evaluation = training run)	High, but lower w/ supernet

python
# Evolutionary NAS sketch — mutation + tournament selection
import random

def mutate(arch, ops_per_layer, p_mutate=0.1):
    """Randomly change each layer's op with probability p_mutate."""
    return [
        random.choice(ops_per_layer) if random.random() < p_mutate else op
        for op in arch
    ]

def crossover(parent1, parent2):
    """For each layer, randomly pick from parent1 or parent2."""
    return [
        p1 if random.random() < 0.5 else p2
        for p1, p2 in zip(parent1, parent2)
    ]

def evolutionary_nas(search_space, fitness_fn, pop_size=50, generations=30):
    ops = search_space['ops']
    # Initialize random population
    population = [[random.choice(ops) for _ in range(search_space['layers'])]
                  for _ in range(pop_size)]
    for gen in range(generations):
        scores = [fitness_fn(a) for a in population]
        # Keep top 50%, generate children via mutation + crossover
        top_k = sorted(zip(scores, population), reverse=True)[:pop_size//2]
        survivors = [arch for _, arch in top_k]
        children = [mutate(crossover(*random.sample(survivors, 2)), ops)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness_fn)

Why does evolutionary search work well for neural architecture search, even though it has no gradient information?

It doesn't — evolutionary search always performs worse than gradient-based methods and is only used when gradients are unavailable. It works because architectures can be represented as real-valued vectors, and genetic algorithms are equivalent to stochastic gradient descent on those vectors. Architecture fitness landscapes tend to be smooth — nearby architectures (differing in one operation or layer depth) have similar performance — so mutation produces valid exploratory steps, and crossover recombines proven sub-structures. Evolutionary search evaluates the entire search space in one generation, while RL only evaluates one architecture at a time.

Chapter 6: DARTS: Gradient-Based NAS

RL and evolutionary search are both black-box methods: they treat the search space as discrete, treat performance evaluation as an oracle, and can't compute gradients through architecture choices. DARTS (Differentiable Architecture Search, Liu et al., ICLR 2019) fixes this by making the architecture search itself differentiable.

The Continuous Relaxation

In a standard cell, each edge (i→j) carries exactly one operation (e.g., 3×3 conv). DARTS replaces that single choice with a mixture of all operations, weighted by learned scalars. The key equation:

ō^(i,j)(x) = ∑_{o ∈ O} softmax(α^(i,j))_o · o(x)

Here O is the set of candidate operations. α^(i,j) is a vector of real-valued scalars (one per operation) — the architecture parameters. The softmax converts them to weights that sum to 1. The "mixed operation" ō simultaneously runs ALL operations on x and produces a weighted average of their outputs.

Worked Example: 3 Operations

Suppose O = {3×3 conv, 5×5 conv, skip}. The architecture parameters for edge (0,1) are α = [2.1, 0.3, −0.8]. Let's compute the DARTS mixed operation output for input x:

softmax(α) = softmax([2.1, 0.3, -0.8]) = [e^2.1, e^0.3, e^-0.8] / (e^2.1 + e^0.3 + e^-0.8)

= [8.17, 1.35, 0.45] / 9.97 = [0.820, 0.135, 0.045]

ō(x) = 0.820 · conv3x3(x) + 0.135 · conv5x5(x) + 0.045 · x

The gradient of the validation loss with respect to α flows through this weighted sum. If conv3×3 outputs lead to lower loss (better validation accuracy), its α value increases, increasing its weight in future mixed operations. The α values converge toward the best operation on each edge.

Discretization

After the search phase, each edge is discretized: pick the operation with the highest α weight. In the example above, that's conv3×3 (weight 0.820). The supernet collapses to a single discrete architecture. This is then retrained from scratch to get the final accuracy.

DARTS's known weakness: The discretization step is noisy. The supernet is optimized for the continuous mixture, not for any single discrete architecture. After discretization, performance can drop significantly — an operation that dominated the mixture (weight 0.82) may not be the best when used alone. Several follow-up works (PC-DARTS, DARTS+) reduce this gap with regularization and early stopping on α.

Bilevel Optimization

DARTS has two sets of parameters: network weights w (the convolution filters) and architecture parameters α. They're optimized jointly via bilevel optimization:

min_α L_val(w*(α), α) s.t. w*(α) = argmin_w L_train(w, α)

In practice this is approximated by alternating: one gradient step on w using training data, one gradient step on α using validation data. This avoids the expensive inner optimization loop.

python
# DARTS mixed operation — the core idea
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS mixed operation: weighted sum over all candidate ops."""
    def __init__(self, ops: list[nn.Module]):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # Architecture parameters α — initialized near uniform
        self.alphas = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        weights = F.softmax(self.alphas, dim=0)  # shape: (|O|,)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self) -> nn.Module:
        """Return the operation with the highest α weight."""
        best_idx = self.alphas.argmax().item()
        return self.ops[best_idx]

# During DARTS search (alternating bilevel optimization):
# Step 1: one step on w using train_batch, α fixed
# Step 2: one step on α using val_batch, w fixed
for train_batch, val_batch in zip(train_loader, val_loader):
    w_optim.zero_grad()
    loss_train = model(*train_batch)
    loss_train.backward()
    w_optim.step()  # update weights w, α unchanged

    alpha_optim.zero_grad()
    loss_val = model(*val_batch)
    loss_val.backward()
    alpha_optim.step()  # update α, w unchanged

DARTS Continuous Relaxation — drag α logits, watch the mixed-op weights shift

Three candidate operations on a single edge: 3×3 Conv, 5×5 Conv, Skip. Drag the sliders to set the α logits. Watch the softmax weights update live. Click "Discretize" to snap to the argmax.

α₀ (3×3 Conv)2.1

α₁ (5×5 Conv)0.3

α₂ (Skip)-0.8

Drag sliders to adjust α logits.

In DARTS, why does training the architecture parameters α on the validation set (not the training set) matter?

It doesn't matter — training on validation vs training data produces the same α values because both sets come from the same distribution. The validation set is larger, so it provides more gradient signal for the architecture parameters. Training α on validation ensures the found architecture generalizes, not just memorizes — otherwise α would select operations that overfit the training set, which may not transfer. The validation set has different operations available, so using it changes which α values can be optimized.

Chapter 7: Performance Estimation

The bottleneck in any NAS algorithm isn't the search strategy — it's how you evaluate each candidate. Training a network to convergence on ImageNet takes 90 epochs × 1.2M images × 4 GPU-hours = days per candidate. With 10¹¹ candidates in the search space, you need much faster estimation.

Strategy 1: Lower-Fidelity Evaluation

Train each candidate for fewer epochs on a smaller dataset (e.g., CIFAR-10 as a proxy for ImageNet), or at lower resolution, or with fewer channels. The relative ranking of architectures is usually preserved even under these degraded conditions — an architecture that wins at 5 epochs often wins at 90 epochs. This reduces cost by 10–20×.

Strategy 2: Weight Sharing (One-Shot / Supernet)

The most impactful innovation in NAS. Instead of training each candidate from scratch, train one giant supernet whose weights are shared across all sub-network architectures. A sub-network (a specific architecture) is just a choice of which edges/operations to activate in the supernet. All sub-networks sharing edges also share the weights on those edges.

SMASH (Brock et al.) and ENAS (Pham et al., 2018) were early weight-sharing methods. Once-for-All (OFA, Cai et al., ICLR 2020) extends this: train one supernet once for 300 epochs, then search over sub-networks for free by just evaluating them using the shared weights — no retraining at all.

The weight-sharing accuracy gap: Weights shared in the supernet were optimized for the average of all sub-architectures, not for any single one. So the ranking of sub-networks by shared-weight accuracy is only a noisy proxy for their true standalone accuracy. Small networks and large networks sharing the same weights are both sub-optimal. OFA addresses this via progressive shrinking: first train full net, then progressively fine-tune smaller subsets.

Strategy 3: Accuracy Predictors

Train a regression model (usually a small MLP or GNN) to predict a new architecture's accuracy from its encoding. The predictor is trained on ~500 architecture-accuracy pairs from early in the NAS run. Once trained, it scores millions of new candidates in milliseconds. This is the approach used by NAT, BRP-NAS, and Semi-Supervised NAS.

The Cost Comparison

Method	Cost per Candidate	Total Search Cost
Train from scratch (full)	4–6 GPU-hours	450 GPU-days (NASNet)
Lower fidelity (10 epochs)	~20 GPU-min	~45 GPU-days
Weight sharing (ENAS)	~2 GPU-min (shared weights)	~0.5 GPU-days
DARTS (continuous)	N/A — gradient through space	4 GPU-days
Accuracy predictor	Milliseconds	<1 GPU-day (incl. predictor training)

Supernet Weight Sharing — one network, many sub-paths

A 3-stage supernet with 3 operation choices per stage. Highlighted sub-path = one candidate architecture using shared weights. Toggle between "Supernet" view and candidate architectures. See how the same weights are reused across different sub-networks.

Showing supernet. Click "Next Architecture" to highlight a sub-network path.

Why is weight-sharing accuracy a "noisy proxy" for true standalone accuracy?

Because the supernet uses INT8 weights while standalone networks use FP32. Because weight-sharing only works for small architectures and breaks down for large ones. Because shared weights are updated using gradients from all sub-architectures simultaneously, so each sub-architecture's weights are distorted by the other architectures' gradient updates. Conflicting gradient directions prevent any single architecture from converging to its optimal weights. Because shared weights are optimized for the average mixture of sub-architectures, not for any single sub-architecture's specific computation pattern. A standalone network retrained from scratch will find weights better tuned to its specific operations.

Chapter 8: Showcase: NAS Strategy Explorer

This showcase puts all three NAS search strategies side-by-side: Random Search, Evolutionary, and DARTS (simulated as gradient ascent in accuracy space). A 2D accuracy-vs-MACs landscape represents the search space. The goal: find the Pareto-optimal front of high-accuracy, low-MACs architectures. Watch how each strategy navigates — and how quickly each converges.

NAS Strategy Comparison — navigate a simulated accuracy-vs-MACs landscape

Three strategies search the same landscape in parallel. Each dot is a sampled architecture. Stars mark the best found so far. Click "Run 10 Steps" to advance all three, or step individual strategies. The Pareto front (best accuracy per cost level) advances as search progresses.

Click "Run 10 Steps" to begin search.

What to watch for: Random search explores broadly but needs many samples to find good architectures. Evolutionary search converges faster once it finds a good region — mutation keeps it near proven winners. Simulated DARTS focuses quickly on promising areas but may get stuck in a local optimum. In practice, the landscape is 10¹¹-dimensional, not 2D — these dynamics become more pronounced.

Why does evolutionary search tend to converge faster than random search on practical NAS problems, given that neither has gradient information?

Random search is inefficient because it always starts from a random architecture, while evolutionary search starts from the best known architecture. Evolutionary search uses crossover to combine multiple search dimensions simultaneously, while random search can only change one parameter at a time. Evolutionary search exploits the landscape's smoothness: mutation generates architectures near proven winners (locally similar = locally better), while random search wastes evaluations on regions far from any good architecture. Selection pressure concentrates effort in good regions. Evolutionary search has gradient information in the form of fitness gradients computed via finite differences.

Chapter 9: Connections & Cheat Sheet

NAS is not one algorithm — it's a framework with three independently-improving components. The best modern NAS systems (OFA, EfficientNet, ProxylessNAS) mix and match these components to achieve 100× faster search than the 2017 baseline.

NAS Component Cheat Sheet

Component	Options	Key Trade-off
Search Space	Cell-based (NASNet) \| Network-level (ProxylessNAS, OFA) \| TinyML (MCUNet)	Expressiveness vs tractability. Cell-based: smaller space, strong prior. Network-level: richer but harder to search.
Search Strategy	Random \| RL (NASNet, MnasNet) \| Evolutionary (OFA, AmoebaNet) \| Gradient (DARTS, PC-DARTS)	RL: principled but high variance, expensive. Evolution: robust, parallelizable, hardware-aware. DARTS: fast, differentiable, but discretization gap.
Performance Estimation	Full training \| Low-fidelity \| Weight sharing (ENAS, OFA) \| Accuracy predictor	Full training: true accuracy. Weight sharing: 100–1000× faster but noisy. Predictor: milliseconds but requires training data.

RL vs Evolution vs DARTS — When to Use Which

Criterion	RL (REINFORCE)	Evolutionary	DARTS
Search cost	High (450 GPU-days)	Medium (with supernet: low)	Low (4 GPU-days)
Hardware awareness	Reward term	Fitness term	Latency loss term
Multi-objective	Scalarized reward	Pareto front directly	Gradient w/ latency regularizer
Discrete/continuous	Discrete (sampling)	Discrete (mutation)	Continuous → discrete (discretize)
Theoretical guarantee	Converges to local opt	None (heuristic)	Local opt (bilevel)

Bridge to NAS II: Hardware-Aware and Once-for-All

This lecture covered NAS fundamentals. The next frontier is hardware-aware NAS: making the search space and objective directly aware of target-hardware latency, not just FLOPs. ProxylessNAS models latency as a differentiable function of architecture parameters. MnasNet adds a latency term to the RL reward. OFA trains a single supernet and searches for specialized sub-networks for each hardware target — a phone, a Raspberry Pi, an MCU — without retraining.

NAS II (MIT 6.5940 Lecture 8) covers: hardware-aware NAS, Once-for-All progressive shrinking, zero-shot NAS (scoring architectures without ANY training using gradient statistics), and neural-hardware co-design (searching architecture and accelerator simultaneously).

Why NAS still requires human judgment: The search space must be designed by hand. The hardware cost model must be specified. The accuracy metric (top-1 on ImageNet? BLEU? mAP?) must be chosen. NAS automates the search within a human-defined framework — it doesn't replace the engineer, it amplifies them. A well-designed search space found by random search beats a poorly-designed search space found by the best algorithm.

Related Gleams

← TinyML L2: NN Building Blocks & Compute — the ops NAS picks from (MBConv, depthwise-sep, bottleneck)
← TinyML L4: Pruning II — AMC uses RL (DDPG) to search sparsity ratios per layer, closely related to NAS RL
← TinyML L5: Quantization I — NAS and quantization are often co-searched (mixed-precision NAS)
Policy Gradients — REINFORCE and the policy gradient theorem, the theoretical foundation for RL-based NAS

"The art of NAS is not in the search algorithm — it's in designing a search space so well that even random search finds great architectures." — paraphrase of insights from Real et al., AAAI 2019

A practitioner wants to find an efficient network for a new IoT microcontroller with 512 KB RAM and 1 MB flash. They have limited GPU budget (2 GPU-days). Which NAS component choices are most appropriate?

RL controller + full training from scratch — gives the most principled search. Cell-based search space + DARTS + lower-fidelity evaluation — DARTS is fast and cell search is expressive. Network-level search space with memory constraint (MCUNet-style) + evolutionary search with OFA supernet + hardware latency/memory fitness function — directly targets MCU constraints, OFA supernet makes candidate evaluation nearly free. Random search over a very large space — random search is unbiased and works well for any budget.