TinyML & Efficient Deep Learning · MIT 6.5940 · Lectures 19–20

Distributed Training: Data, Tensor & Pipeline Parallelism

A 70B model's weights alone weigh 140 GB in FP16 — and training needs weights + gradients + Adam states, which totals over 1 TB. No single GPU holds that. You must distribute computation across hundreds of GPUs while keeping them synchronized. This lesson derives every communication primitive from scratch, shows exactly how ZeRO shards optimizer states, gradients, and weights to recover memory, explains how Megatron splits a single matmul across GPUs, and derives the pipeline bubble formula that quantifies the idle-time tax of staging layers. Every claim comes with numbers.

Prerequisites: TinyML L12 (Efficient Transformers) — attention FLOPs, KV cache. TinyML L2 (Pruning) — parameter counting. Basic PyTorch training loop.
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Memory Wall

Let's start with a concrete problem. You want to train LLaMA-2 70B, the open-source model released by Meta. Its parameter count is 70 × 10⁹. In FP16 (2 bytes/param) that's 140 GB just for weights. An NVIDIA H100 — the most capable GPU available as of 2024 — has 80 GB of HBM. The model weights alone don't fit on a single GPU.

But that's only the beginning. Training requires three more memory buckets beyond weights. Gradients for every parameter: another 140 GB in FP16. Optimizer states for Adam: a FP32 master copy of weights (4 bytes/param = 280 GB), first moment (4 bytes/param), and second moment (4 bytes/param) — totaling 12 bytes per parameter = 840 GB. Grand total: 140 + 140 + 840 = 1,120 GB to train LLaMA-2-70B in full precision. That's 14 fully-loaded H100s just to hold the state, and you haven't run a single forward pass yet.

Even for a "small" 7B model: weights (14 GB) + gradients (14 GB) + Adam states (84 GB) = 112 GB. Still doesn't fit on one 80 GB GPU. And if it did fit, training 7B on a trillion tokens would take a single GPU 355 years (GPT-3's training bill was 3.1 million GPU-hours). Distributed training isn't optional — it's the only path.

Memory budget anatomy for training (Adam, FP16 mixed precision): For a model with Ψ parameters — Weights: 2Ψ bytes (FP16). Gradients: 2Ψ bytes (FP16). Adam states: 12Ψ bytes (FP32 param copy + momentum + variance). Total: 16Ψ bytes. For 7B: 16 × 7×10⁹ = 112 GB. For 70B: 1,120 GB. Even on an 80 GB A100, the trainable limit without distribution is 80/16 = 5B parameters.

There are three ways to distribute a training job. Data parallelism: every GPU holds a full model copy but processes a different mini-batch shard; gradients are synchronized across GPUs. Tensor parallelism: a single layer's weight matrix is split across multiple GPUs that each compute a partial result and merge it. Pipeline parallelism: different layers run on different GPUs in sequence, like an assembly line. In practice, training large models uses all three simultaneously — called 3D parallelism.

Training Memory Wall — 7B vs 70B Model

Each stacked bar shows the memory breakdown (weights + gradients + Adam states) for a model. The red dashed line is one GPU's memory (80 GB H100). Drag the slider to see how GPU count affects per-GPU memory under naive data parallelism (no sharding).

GPUs (naive DP) 1
You're training a 7B model with Adam optimizer and FP16 mixed precision. How much memory does training require per GPU under naive data parallelism (each GPU holds the full model)?

Chapter 1: Communication Primitives

Before we can understand how GPUs synchronize during training, we need to establish the vocabulary of collective communication. These are the fundamental operations — implemented in NCCL (NVIDIA's GPU communication library) — that underpin every distributed training system.

The simplest primitive is Send / Recv: transfer a tensor from one process to another. This is one-to-one communication — the building block for everything else. More interesting are one-to-many operations. Broadcast sends an identical copy from one GPU to all others. Scatter splits a tensor and sends each chunk to a different GPU. Gather collects chunks from all GPUs into one destination. Reduce is like Gather but applies an aggregation (usually summation or averaging) during collection — so [3][5][2][4] reduces to [14] at the destination.

The most important primitives for training are many-to-many operations. All-Reduce performs Reduce across all GPUs and then broadcasts the result back — every GPU ends up with the same aggregated tensor. All-Gather is like Gather but every GPU ends up holding all chunks. Reduce-Scatter is the inverse: reduce all chunks, then scatter so each GPU holds one shard of the reduced result. These three — All-Reduce, All-Gather, and Reduce-Scatter — are the workhorses of distributed training.

Key insight: All-Reduce = Reduce-Scatter + All-Gather. This decomposition is not just theoretical — it's how ring all-reduce is actually implemented. First, reduce-scatter: each GPU ends up with the fully-reduced version of its shard. Second, all-gather: each GPU broadcasts its shard so everyone gets the full result. This two-phase structure is why ZeRO-3 can replace All-Reduce with a custom reduce-scatter + all-gather sequence that shards the data in between.

In the classic Parameter Server architecture: step 1 is a Broadcast (server sends weights to all workers), and step 4 is a Reduce (workers push gradients, server aggregates). The problem: the server's bandwidth scales as O(N) with the number of workers — it's the bottleneck. At N=128 workers, the server must receive 128× the gradient traffic. This is why modern distributed training uses All-Reduce instead: no central bottleneck, symmetric load across all nodes.

OperationWho sendsWho receivesResultUse in training
Broadcast1 GPUAll GPUsAll have same copyParam server → workers
ReduceAll GPUs1 GPU1 GPU has sumWorkers → param server
All-ReduceAll GPUsAll GPUsAll have same sumGradient sync in DDP
All-GatherAll GPUsAll GPUsAll have full tensorZeRO-3: gather shards before forward
Reduce-ScatterAll GPUsAll GPUsEach holds 1 shardZeRO-3: scatter grads
In data parallelism, each GPU computes gradients on its own mini-batch shard. Which communication primitive is used to synchronize gradients so every GPU's model gets updated with the full-batch gradient?

Chapter 2: Ring All-Reduce

The naive all-reduce is sequential: GPU 0 reduces with GPU 1, then with GPU 2, and so on. This takes O(N) steps and O(N) bandwidth at the aggregator — the same bottleneck as a parameter server. Can we do better?

The ring all-reduce arranges all N GPUs in a logical ring: 0 → 1 → 2 → ... → N-1 → 0. Each GPU sends data to its right neighbor and receives from its left neighbor simultaneously. Split the gradient tensor into N equal chunks. In Phase 1 (Reduce-Scatter), over N-1 steps, each GPU passes its chunk to the next GPU, accumulating (summing) as it goes. After N-1 steps, each GPU holds the fully-reduced version of exactly one chunk — which is 1/N of the total data. In Phase 2 (All-Gather), over another N-1 steps, each GPU sends its reduced chunk around the ring. After 2(N-1) total steps, every GPU holds every reduced chunk — the same final result as a centralized all-reduce.

The bandwidth math is elegant. At each step, each GPU sends and receives exactly (data size / N). The per-GPU bandwidth is therefore constant regardless of N — O(1) peak load per node. Over 2(N-1) steps, each GPU sends a total of 2(N-1)/N × data_size bytes. As N → ∞, this approaches 2 × data_size. For large N, the ring all-reduce communication cost is:

Costring = 2 × (N−1)/N × D ≈ 2D  (for large N)

Where D is the total gradient data size. This is bandwidth-optimal: you cannot synchronize N copies of a tensor with less than 2D total communication (you need to both reduce and distribute). The ring algorithm achieves this lower bound. For 7B model gradients in FP16: D = 14 GB. Ring all-reduce total bytes per GPU ≈ 2 × 14 GB = 28 GB — and this can be fully pipelined with backward computation (PyTorch DDP does this by starting the all-reduce as soon as gradients are computed for each layer, before the backward pass finishes).

Misconception: more GPUs means more communication per GPU. With a ring all-reduce, each GPU's bandwidth cost converges to 2D as N grows — it does NOT grow with N. This is why ring all-reduce scales well: adding a 100th GPU doesn't make the communication 100× heavier per GPU. The total system communication grows (more GPUs transmit), but individual GPU bandwidth stays bounded. The parameter server has O(N) at the server; the ring distributes this uniformly.
Ring All-Reduce: Step-by-Step Animation

Sliders control the number of GPUs in the ring and which step to visualize. Each colored segment represents a partial sum chunk. After the reduce-scatter phase (N-1 steps), each node has one fully-reduced chunk. After all-gather (another N-1 steps), all nodes have all chunks.

Number of GPUs (N) 4
Step 0
You're running ring all-reduce over N=8 GPUs with a gradient tensor of size 4 GB. How much data does each GPU send in total across all steps?

Chapter 3: Data Parallelism

Data parallelism is the most natural way to scale training. The idea: give every GPU a complete, identical copy of the model. Split the training batch across GPUs — if you have N GPUs and a batch of B, each GPU sees B/N samples. Each GPU runs a full forward and backward pass on its shard, computing local gradients. Then, all-reduce the gradients across GPUs. Now every GPU has the same gradient — as if the full batch B had been processed on a single GPU. Every GPU performs the same weight update. Repeat.

In PyTorch, this is DistributedDataParallel (DDP). DDP registers gradient hooks so that as soon as a layer's backward pass finishes, its gradients are launched into the all-reduce immediately — before the rest of the backward pass completes. This computation-communication overlap hides most of the all-reduce latency behind the backward pass, making DDP nearly as fast as single-GPU training up to moderate scales.

The math is clean. With N GPUs, each processes B/N samples per step. The effective batch size is still B (since the averaged gradients are equivalent to full-batch). Throughput scales as O(N): N GPUs can process N× as many tokens per second. However, there is a subtle issue: large-batch training degrades convergence quality. The "linear scaling rule" (linear-scale the learning rate with batch size) works up to batch size ~8K; beyond that, large-batch training requires warm-up, LARS/LAMB optimizers, and often produces lower final accuracy than small-batch training. Training GPT-3 required 175B parameters × 300B tokens — even with 1,024 A100s, that's over 34 days. Data parallelism handles throughput, not model scale.

Misconception: data parallelism reduces per-GPU memory. It does NOT. Every GPU holds the full model — all weights, all gradients, all optimizer states. At 16 bytes/param, a 7B model still needs 112 GB per GPU. Adding 1,000 GPUs doesn't change this. Data parallelism improves throughput (more data per second) but does nothing for per-GPU memory. ZeRO, tensor parallelism, and pipeline parallelism address the memory problem.
python
# PyTorch DDP — launch with: torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')       # NCCL for GPU-to-GPU
rank = dist.get_rank()                           # this GPU's ID (0..N-1)
device = torch.device(f'cuda:{rank}')

model = MyModel().to(device)
model = DDP(model, device_ids=[rank])             # wraps model, hooks gradients

# Training loop — DDP handles all-reduce transparently
for batch in dataloader:                           # sampler splits batches by rank
    loss = forward(model, batch)
    loss.backward()                               # triggers all-reduce as we go
    optimizer.step()                               # same gradient on all GPUs
    optimizer.zero_grad()
# gradient synchronization cost: ~2×(N-1)/N × 14GB = ~27.6GB for 7B, N=8
You run DDP on 8 GPUs training a 7B model. You observe that each GPU is using 112 GB of memory. A colleague suggests "just add more GPUs — that'll reduce memory pressure." Are they correct?

Chapter 4: ZeRO & FSDP

Data parallelism wastes memory on redundancy: N GPUs each hold N identical copies of the same weights, gradients, and optimizer states. ZeRO (Zero Redundancy Optimizer), from DeepSpeed (Microsoft Research), eliminates this redundancy by sharding across the N data-parallel GPUs — while maintaining the same computation pattern.

ZeRO has three progressive stages. ZeRO-1 shards only the optimizer states. Each GPU holds full weights and full gradients, but only holds 1/N of the optimizer states (momentum + variance). Memory: (2 + 2 + 12/N) bytes/param. For N=64: (2 + 2 + 0.1875) = 4.1875 bytes/param. A 7B model: 7B × 4.19 = 29.3 GB per GPU. ZeRO-2 also shards gradients. Each GPU holds full weights but only 1/N of gradients and 1/N of optimizer states. Memory: (2 + 2/N + 12/N) bytes/param = (2 + 14/N). For N=64: 2.22 bytes/param. ZeRO-3 shards everything — weights, gradients, and optimizer states. Memory: (2 + 2 + 12)/N = 16/N bytes/param. For N=64: 0.25 bytes/param → 7B × 0.25 = 1.75 GB per GPU. You can train a 320B model on 64 A100s!

Baseline: (2 + 2 + 12)Ψ = 16Ψ bytes
ZeRO-1: (2 + 2 + 12/N)Ψ ≈ 4Ψ bytes (large N)
ZeRO-2: (2 + 2/N + 12/N)Ψ ≈ 2Ψ bytes (large N)
ZeRO-3: (2/N + 2/N + 12/N)Ψ = 16Ψ/N bytes

But ZeRO-3 has a catch: when a GPU needs to compute the forward or backward pass through a layer, it must first gather the full layer parameters from all other GPUs (an All-Gather), then compute, then discard those weights. Gradients are similarly scattered with a Reduce-Scatter after the backward pass. The communication volume for ZeRO-3 is equivalent to a standard all-reduce: the extra cost is small relative to the memory savings, but it does add latency on each layer.

In PyTorch, ZeRO-3 is implemented as FSDP (Fully Sharded Data Parallel). FSDP wraps each module (or a group of modules) — when a forward pass enters a wrapped module, FSDP runs All-Gather to reconstruct the full parameters; when the backward pass exits, FSDP runs Reduce-Scatter to shard the gradients back. The parameters are discarded immediately after use to reclaim memory.

python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
import functools

# Wrap each transformer layer independently (shard at layer granularity)
auto_wrap = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock}
)
model = FSDP(model, auto_wrap_policy=auto_wrap,
               mixed_precision=MixedPrecision(
                   param_dtype=torch.float16,      # FP16 for fwd pass
                   reduce_dtype=torch.float32))    # FP32 for gradients

# Training loop is identical to DDP — FSDP handles all-gather/reduce-scatter
for batch in dataloader:
    loss = forward(model, batch)
    loss.backward()
    optimizer.step()  # optimizer updates only the local shard
ZeRO memory math worked example. 7B model, N=64 GPUs. Baseline: 16×7B = 112 GB per GPU. ZeRO-1: (2+2+12/64)×7B = (4+0.1875)×7B = 29.3 GB → 3.8× reduction. ZeRO-2: (2+14/64)×7B = (2.219)×7B = 15.5 GB → 7.2× reduction. ZeRO-3: (16/64)×7B = 0.25×7B = 1.75 GB → 64× reduction. With ZeRO-3, 64 A100s can train a model of size 64×80 GB / 0.25 bytes/param = 20.5B params at full precision, or 64×80/0.0625 = 82B params in INT8 training.
ZeRO Stages: Per-GPU Memory vs. Number of GPUs

Drag the GPU count slider. Each line shows per-GPU memory for Baseline DP, ZeRO-1, ZeRO-2, and ZeRO-3 for a 7B model. The H100 80 GB line shows the single-GPU limit. ZeRO-3 is the only strategy that crosses below the GPU line with few GPUs.

Number of GPUs 8
You're training a 70B model with ZeRO-3 on N=128 GPUs. How much memory does the model state (weights + gradients + Adam) occupy per GPU? (16 bytes/param total)

Chapter 5: Pipeline Parallelism

Data parallelism improves throughput but not per-GPU memory unless combined with ZeRO. Tensor parallelism reduces per-GPU memory but requires fast intra-node interconnect. A third strategy is pipeline parallelism: split the model by layer across GPUs. GPU 0 runs layers 0–15, GPU 1 runs layers 16–31, and so on. The model passes activations between GPUs like an assembly line.

The memory benefit is immediate and exact. A 70B model with 80 transformer layers distributed across 8 GPUs: each GPU holds 10 layers. 70B/8 = 8.75B params worth of weights. At 2 bytes/param (FP16), that's 17.5 GB per GPU — well within an 80 GB H100. Pipeline parallelism splits the model by layer depth, so memory scales as 1/P where P is the number of pipeline stages.

But there's a painful inefficiency: the pipeline bubble. In a naive pipeline, GPU 1 can't start until GPU 0 finishes its forward pass. GPU 2 waits for GPU 1. During this ramp-up phase, only one GPU is working at a time — the others are idle. The same happens in the ramp-down (drain) phase of the backward pass. With P stages and a single micro-batch, the bubble fraction is:

Bubble fraction (naive) = (P−1)/P

For P=4 stages, bubble fraction = 3/4 = 75% idle time. GPipe (Google, 2019) introduced the fix: micro-batching. Split the mini-batch into m micro-batches and feed them continuously. While stage 3 finishes micro-batch 1's forward, stage 2 can process micro-batch 2. The bubble fraction with m micro-batches is:

Bubble fraction (GPipe) = (P−1) / (m + P−1)

For P=4 stages and m=8 micro-batches: bubble = 3/(8+3) = 3/11 ≈ 27.3% idle. For m=16: 3/19 ≈ 15.8%. For m→∞, bubble → 0. Practical sweet spot: m ≈ 4P gives bubble fraction below 20%. Let's verify: P=4, m=8. Total time slots = m + (P-1) = 8 + 3 = 11. Useful slots = m = 8. Idle slots = 3 (both ramp-up and drain). Bubble = 3/11 = 27.3%. Matches the formula.

Misconception: micro-batching eliminates the pipeline bubble. It only shrinks it. The bubble is always (P-1) idle slots at startup plus (P-1) more at drain. With m micro-batches total, the active fraction grows to m/(m+P-1). But the (P-1) idle slots never disappear — they just become a smaller fraction of the total. The only way to eliminate the bubble is to have infinitely many micro-batches, which is impractical. In production, P=4–8 stages with m=16–64 micro-batches is typical.
Pipeline Bubble Visualizer

Each row is a pipeline stage (GPU). Each cell is a time slot. Blue = forward pass, orange = backward pass, gray = bubble (idle). Drag sliders to change stages and micro-batches. Watch the bubble fraction value update.

Pipeline stages (P) 4
Micro-batches (m) 4
A 32-layer model is split across P=4 pipeline stages with m=12 micro-batches. What is the pipeline bubble fraction?

Chapter 6: Tensor Parallelism

Pipeline parallelism splits between layers — each GPU processes a different set of layers. Tensor parallelism splits within a single layer — multiple GPUs each compute a different part of the same matrix multiplication. This is the approach pioneered by Megatron-LM (NVIDIA, 2019).

Consider an FFN with two linear layers: X ∈ [seq, d_model] → A ∈ [d_model, 4d] → ReLU → B ∈ [4d, d_model] → Z. For d_model=4096 and d_ff=16384, matrix A alone is 4096×16384 = 67M params = 128 MB in FP16. Now split A by columns across T=8 GPUs: each GPU holds A_i ∈ [d_model, d_ff/T] = [4096, 2048] = 8M params = 16 MB. Each GPU receives the full input X (same on all GPUs) and computes a partial hidden state H_i = ReLU(X @ A_i) locally. Then split B by rows: each GPU holds B_i ∈ [d_ff/T, d_model]. Each GPU computes partial output Z_i = H_i @ B_i. The final output Z = ΣZ_i requires one All-Reduce to sum partial results.

GPU i: Hi = ReLU(X @ Ai),   Zi = Hi @ Bi
Z = All-Reduce(Z1 + Z2 + … + ZT)

Memory saving: each GPU holds 1/T of A and 1/T of B. The All-Reduce tensor has shape [seq_len, d_model] — for seq_len=4096, d_model=4096, FP16: 32 MB. On NVLink at 600 GB/s, this takes ~0.05 ms. The forward pass time for one FFN on one GPU is roughly (seq × d_ff) matmul time — on an A100 at 312 TFLOPS, this is (4096 × 4096 × 16384 × 2) / 312e12 ≈ 1.7 ms. The All-Reduce is ~3% of compute time. Tensor parallelism is efficient when NVLink is available (intra-node), and becomes communication-bound over InfiniBand or Ethernet (inter-node).

Tensor parallelism needs synchronous communication on the forward critical path. Unlike data parallelism where gradient all-reduce is asynchronous (background), tensor parallelism's All-Reduce must complete before the next layer can start. This is why tensor parallelism degree T is typically limited to the number of GPUs within one node (e.g., T=8 for an 8-GPU DGX). Beyond one node, the 600 GB/s NVLink drops to 25 GB/s InfiniBand — the All-Reduce becomes 24× slower and dominates the step time.
python
# Megatron-style tensor parallelism for one FFN block
import torch
import torch.distributed as dist

class TensorParallelFFN(nn.Module):
    def __init__(self, d_model, d_ff, tp_size, tp_rank):
        super().__init__()
        # Column-parallel: each GPU gets d_ff/T output columns
        self.W1 = nn.Parameter(torch.empty(d_model, d_ff // tp_size))
        # Row-parallel: each GPU gets d_ff/T input rows
        self.W2 = nn.Parameter(torch.empty(d_ff // tp_size, d_model))
        self.tp_group = ...  # dist process group for TP ranks

    def forward(self, x):
        # x: [batch, seq, d_model] — replicated across TP group
        h = torch.nn.functional.relu(x @ self.W1)    # [batch,seq,d_ff/T]
        z = h @ self.W2                               # [batch,seq,d_model] partial
        dist.all_reduce(z, group=self.tp_group)       # sum partial outputs
        return z                                      # full d_model output
Tensor parallelism degree T=4 is applied to an attention layer with d_model=4096, num_heads=32, head_dim=128. The Q/K/V projection weights (3 matrices, each d_model × d_model) are split column-parallel. How many parameters does each GPU hold for the Q/K/V projection?

Chapter 7: 3D Parallelism

Data parallelism scales throughput. ZeRO/FSDP reduces memory redundancy. Pipeline parallelism splits layers across GPUs. Tensor parallelism splits individual weight matrices. Large-scale training systems use all three simultaneously — called 3D parallelism. The total GPU count is the product of three degrees: N = D × T × P, where D is data-parallel degree, T is tensor-parallel degree, and P is pipeline stages.

Consider training a 530B parameter model on 2,048 A100s (as done by BigScience BLOOM). Configuration: T=8 (one node, NVLink), P=12 (pipeline stages, inter-node), D=2048/(8×12)=21.3 ≈ 21 data-parallel groups. In practice: T=8, P=12, D=21, total = 8×12×21 = 2,016 GPUs. Each pipeline stage holds 530B/12 ≈ 44B params. Each TP-P group then has 44B/8 ≈ 5.5B params per GPU. At 16 bytes/param for training (with ZeRO-1 for optimizer states): 5.5B × (2+2+12/21) ≈ 5.5B × 4.57 ≈ 25 GB per GPU — fits in 80 GB.

The key rule for ordering the parallelism dimensions: tensor parallelism innermost (uses fast NVLink), pipeline parallelism middle (uses InfiniBand but is point-to-point), data parallelism outermost (uses all-reduce but only once per step). Within a single node of 8 GPUs: T=8. Across nodes within a rack: P stages. Across racks: D data parallel. This layering minimizes the use of the slowest interconnect.

Communication pattern in 3D parallelism: Tensor parallelism — 2 All-Reduces per transformer layer, over NVLink (600 GB/s), synchronous on forward/backward critical path. Pipeline parallelism — point-to-point send/recv of activations between adjacent stages, over InfiniBand (200 Gb/s = 25 GB/s); 1 communication per stage boundary per micro-batch. Data parallelism — 1 All-Reduce per step over all data-parallel ranks; can be pipelined with backward; volume ≈ 2×(D-1)/D × model_weights_on_this_TP-P_shard.

Beyond 3D parallelism, sequence parallelism (splitting the token dimension for very long sequences), expert parallelism (for mixture-of-experts models like Mixtral), and activation checkpointing interact with the 3D scheme. Activation checkpointing trades compute for memory: discard intermediate activations during the forward pass, recompute them from the checkpoint during backward. This reduces activation memory from O(L×seq×d) to O(sqrt(L)×seq×d) at the cost of roughly 33% extra compute.

3D Parallelism GPU Assignment Grid

Each cell represents one GPU. Color indicates the data-parallel group. The grid rows are tensor-parallel groups (within nodes). Columns of colored groups form pipeline stages. Drag sliders to configure the 3D setup. Total GPUs = D×T×P shown in title.

Data Parallel (D) 4
Tensor Parallel (T) 4
Pipeline Stages (P) 4
A training cluster uses 3D parallelism with T=4 (tensor), P=4 (pipeline), D=16 (data). How many GPUs are required total, and how many pipeline stages are there?

Chapter 8: Showcase: Distributed Training Lab

This showcase brings together the five key numerical results of this lesson: the memory wall, ring all-reduce cost, ZeRO memory savings, pipeline bubble fraction, and 3D parallelism memory-per-GPU. You can configure a model size, parallelism strategy, and hardware, and see all the numbers update live.

Distributed Training Full Calculator

Configure your model and parallelism strategy. The canvas shows per-GPU memory breakdown under your chosen strategy, plus the pipeline bubble fraction and ring all-reduce communication cost. All numbers are derived, not looked up.

Model size (B params) 70B
Data parallel N 8
Pipeline stages P 4
Micro-batches m 8
ZeRO stage Baseline
Verified numbers for reference: 7B model, N=8 GPUs, ZeRO-3, P=4, m=8 → per-GPU model state = 16×7B/8 = 14 GB. Pipeline bubble = 3/11 = 27.3%. Ring all-reduce of gradients on DP group: 2×(8-1)/8 × (14 GB/8) [ZeRO-3 shard] = 1.75 × 1.75 GB = ~3.1 GB per DP step per GPU. 70B model, N=64, ZeRO-3 → per-GPU = 16×70B/64 = 17.5 GB. Add ZeRO-3 comms overhead of ~50% vs baseline DP → still 3× better total system throughput than single-GPU due to parallelism.
You're configuring training for a 13B model. You have 64 A100s (80 GB each). You want to use ZeRO-3 for DP, pipeline parallelism with P=4, and no tensor parallelism. Without activations, what is the per-GPU model-state memory?

Chapter 9: Connections & Cheat Sheet

You now have the complete distributed training toolkit. Let's consolidate everything into a decision framework — then connect to the broader landscape.

Parallelism Strategy Comparison

StrategyWhat it splitsMemory/GPUCommunicationInterconnect neededWhen to use
Data Parallel (DDP)Batch across GPUsFull modelAll-Reduce gradients, 2×D bytes/stepAny (async OK)Model fits on 1 GPU; want throughput
ZeRO-1Optimizer states sharded~4 bytes/param+All-Gather optim statesInfiniBandModerate memory savings; easy to enable
ZeRO-2Optim + gradients sharded~2 bytes/param+Reduce-Scatter gradsInfiniBandLarger models; 7B on 4×A100
ZeRO-3 / FSDPEverything sharded16Ψ/N bytesAll-Gather params + Reduce-Scatter grads per layerInfiniBand70B+ on 8–64 GPUs
Pipeline ParallelLayers across GPUs1/P modelActivations between stagesInfiniBand OKVery deep models; ≤20% bubble with m≥4P
Tensor ParallelWeight matrices within layers1/T modelAll-Reduce per layer (sync!)NVLink requiredVery large layers; T≤8 (one node)
3D ParallelAll three combined16Ψ/(D×T×P) approxAll of the aboveNVLink + InfiniBand100B+ models on 1000+ GPUs

The Memory Budget Formula

Before adding complexity, always start with this formula. For a model with Ψ parameters trained with Adam in FP16 mixed precision, on N GPUs with ZeRO-3, pipeline degree P, tensor degree T:

Memory per GPU ≈ 16Ψ / (N × T) bytes (model state)
+ activations per stage ≈ seq × d × num_layers_per_stage × 2 bytes
+ gradients in ZeRO-3: already included in 16Ψ/N
Pipeline divides layers not params per layer, so the 1/P comes from fewer layers, not from the 16 factor directly
Quick sanity check table. 70B model (16×70B = 1,120 GB baseline): With N=8 GPUs, ZeRO-3: 1120/8 = 140 GB per GPU — still doesn't fit in 80 GB. With N=16: 70 GB — barely fits, no headroom for activations. With N=32: 35 GB + activations. With N=64 + P=4 (pipeline splits layers, not ZeRO multiply): ZeRO-3 → 1120/64 = 17.5 GB per GPU for model state. That fits 4 sequences on 80 GB GPUs. This is why 70B training runs typically use 64–256 A100s.

Memory-Saving Techniques Not Covered

Activation checkpointing (gradient checkpointing): discard all intermediate activations during forward; recompute them from checkpoints during backward. Memory savings: 2×–4× reduction in activation memory. Cost: ~33% more compute. Mixed precision: store weights in BF16, compute in FP32, store gradients in FP32 — reduces weight memory 2×. Gradient accumulation: run multiple micro-batches and sum gradients before doing one optimizer step — simulates larger batch without extra memory. These three techniques are usually the first adjustments before adding model parallelism.

Connection to On-Device Training (L17)

The lesson that follows this one covers the other extreme: training not on a cluster of 1,000+ GPUs, but on device — a smartphone or MCU with 1–100 MB of RAM. The memory budgets flip: where distributed training worries about fitting 1 TB+ of optimizer state, on-device training worries about fitting a 1MB model update. Techniques like LoRA and federated learning bridge the two extremes. Both ends share the same fundamental constraint: model state must fit in available memory, and communication is the bottleneck.

Related Lessons

The deepest insight in distributed training: every strategy is fundamentally a trade between memory, communication, and compute utilization. ZeRO-3 wins on memory but adds synchronous communication overhead per layer. Tensor parallelism wins on per-GPU memory with low bubble but needs NVLink. Pipeline parallelism is communication-light (point-to-point activations only) but loses utilization to the bubble. The art of large-scale training is choosing the three degrees (D, T, P) to maximize GPU utilization while keeping all GPUs busy and all memory below the hardware limit.
You're designing a distributed training configuration for a 7B model on 8 × A100 80GB GPUs. Your model fits on a single GPU (112 GB > 80 GB, so it actually doesn't). To fit with training overhead, which is the minimum parallelism strategy needed?