A 70B model needs 140 GB just for weights — a single A100 holds 80 GB. Training is too slow on one GPU anyway. This lesson derives every tool you need: collective communication primitives (all-reduce, reduce-scatter, all-gather), data parallelism with ring all-reduce cost formula, ZeRO stages 1–3, tensor parallelism (column/row parallel matmuls), and the memory/bandwidth/batch-size tradeoffs. Pipelines and sequence parallelism are next lecture.
You want to train Llama 3 405B. Its weights alone occupy 405 × 109 × 2 bytes (bf16) = 810 GB. A top-of-the-line NVIDIA A100 has 80 GB of HBM. The model does not fit — not even close. And even if memory weren't the issue, training on a single GPU would take years: GPT-3 at 300B tokens needs roughly 3 × 1023 FLOPs, and an A100 delivers about 312 TFLOPS. That's over 900 GPU-days — for a single training run.
The answer is multi-GPU parallelism: split both the memory and the compute across many devices. But "split" is not one thing. You can split the data (each GPU trains on a different subset of the batch), split the model layers (each GPU holds different layers), or split the weight matrices themselves (each GPU holds different columns or rows of a weight). These three ideas — data, pipeline, and tensor parallelism — are the three axes of what practitioners call 3D parallelism.
This lesson covers axes one and three. Axis two (pipeline parallelism) is the next lecture. By the end, you will be able to compute exactly how much communication each approach requires, derive the ring all-reduce cost formula, and understand why production runs combine all three axes simultaneously.
Throughout this lesson we work with concrete numbers: a 70B parameter model (Llama 2 scale), bf16 precision, AdamW optimizer, A100 80 GB GPUs. By the end you will know exactly how many GPUs you need to fit this model, and what the bandwidth cost is at each step.
How much memory does a 70B model need? Slide to explore training memory across GPU counts. The dashed line is 80 GB per GPU.
Before we can reason about the cost of communication, we need to understand the network. Modern multi-GPU systems have a strict hierarchy of interconnects, each with very different bandwidths and latencies. Getting the topology wrong means paying 10–100× in communication cost.
Within a single machine (intra-node), NVIDIA GPUs are connected via NVLink — a point-to-point interconnect that runs at up to 600 GB/s bidirectional on H100 NVLink 4.0 (each A100 NVLink 3.0 peer-to-peer link runs at 600 GB/s total, 300 GB/s each direction across all its NVLink connections). This is roughly 10–50× faster than PCIe (which tops at ~32 GB/s for PCIe Gen 4). Within a DGX node you typically have 8 GPUs connected in an all-to-all NVLink topology, meaning every GPU can communicate with every other at full bandwidth simultaneously.
Across machines (inter-node), you're limited to whatever the data center network provides. Modern HPC clusters use InfiniBand HDR at 200 Gb/s (25 GB/s) per port, or Ethernet at 100–400 Gb/s per link. Even top-tier inter-node bandwidth is ~10–20× slower than NVLink. This bandwidth asymmetry is fundamental: it forces a different choice of parallelism strategy depending on whether you're communicating within a node or across nodes.
TPUs take a different approach: their "toroidal mesh" topology connects every chip to its neighbors in a 2D or 3D grid, with dedicated high-bandwidth ICI (inter-chip interconnect) running at ~600 Gb/s per direction. TPUs are optimized for collective communication patterns (all-reduce, reduce-scatter) at the hardware level, which is why they favor data-parallel and fully-sharded training. GPU clusters favor pipeline + tensor parallelism because NVLink is fast but not all-to-all at scale beyond 8 GPUs.
| Interconnect | Bandwidth (per GPU) | Topology | Best for |
|---|---|---|---|
| NVLink 3.0 (A100) | 600 GB/s total | All-to-all within 8 GPUs | Tensor parallel, FSDP |
| NVLink 4.0 (H100) | 900 GB/s total | All-to-all within 8 GPUs | Tensor parallel, FSDP |
| PCIe Gen 4 | ~32 GB/s | Star through CPU | Small DP, CPU offload |
| InfiniBand HDR | 25 GB/s | Fat-tree / Dragonfly | Pipeline parallel, DP |
| TPU ICI | ~150 GB/s each direction | Toroidal mesh | Data parallel, ZeRO |
All distributed ML training boils down to six collective communication primitives. A collective is an operation where every GPU in a group participates, and the result depends on contributions from all of them. Unlike point-to-point communication (GPU 0 sends to GPU 1), collectives have a defined mathematical structure that lets the network runtime use optimized algorithms.
Let's define all six with P = 4 GPUs, each holding a vector of M elements. Call GPU i's vector vi.
Pick an operation to see data movement across 4 GPUs. Each colored block represents one chunk of data.
Data parallelism (DP) is the simplest form of parallelism and the most commonly used. The idea: replicate the entire model on every GPU, split each mini-batch across GPUs, run forward and backward independently on each shard, then synchronize gradients with an all-reduce before the optimizer step.
Formally, SGD with a global batch of size B across M GPUs: each GPU i computes gradients on B/M examples. The update rule is:
The sum is computed by all-reduce. Every GPU starts with its local gradient ∇i and ends with the average ∇ = (∇0 + ∇1 + ... + ∇M-1) / M. Then every GPU applies the same optimizer step with the same gradient — so all replicas stay synchronized.
The saving grace: gradient all-reduce can be overlapped with the backward pass. As backprop computes gradients for the last layer, those gradients can be immediately sent. By the time backprop reaches the first layer, the last layer's gradients are already synchronized. PyTorch's DistributedDataParallel (DDP) does this automatically via gradient hooks.
Data parallelism has good scaling properties for compute: doubling GPUs doubles throughput. The catch is memory: every GPU must store the full model — weights, gradients, and optimizer state. For a 70B model with AdamW in mixed precision: 2 (weights) + 2 (gradients) + 4 (fp32 master) + 4 (adam m1) + 4 (adam m2) = 16 bytes per parameter × 70B = 1.12 TB per GPU. An A100 has 80 GB. Data parallelism alone fails catastrophically on memory.
Step through one training iteration with 4 GPUs. Watch the batch split, gradients accumulate, and all-reduce synchronize.
Data parallelism wastes memory: every GPU stores the full model, gradients, and optimizer state. ZeRO (Zero Redundancy Optimizer) eliminates this redundancy by sharding the expensive parts across GPUs — while keeping the same communication cost as naive DDP.
ZeRO has three stages, each sharding more aggressively:
FullyShardedDataParallel (FSDP) implements ZeRO stage 3.The key insight — proven in the ZeRO paper — is that all-reduce is mathematically equivalent to reduce-scatter + all-gather, and in the bandwidth-limited regime both cost the same. So ZeRO stages 1 and 2 are free memory wins: you pay the same communication bandwidth but store much less on each GPU.
| Strategy | Comm cost / step | Bytes / param (70B, 8 GPU) | Max params on 8×80GB |
|---|---|---|---|
| Naive DDP | 2P all-reduce | 16 bytes | 40B |
| ZeRO Stage 1 | 2P (RS + AG) | 2+2+(8/8) = 5 bytes | 128B |
| ZeRO Stage 2 | 2P (RS + AG) | 2+(2/8)+(8/8) = 3.25 bytes | ~197B |
| ZeRO Stage 3 | 3P (AG+AG+RS) | (2+2+8)/8 = 1.5 bytes | ~427B |
Naive all-reduce — reduce to one GPU, then broadcast — is terrible: the bottleneck GPU receives P-1 vectors and transmits one, making bandwidth scale linearly with P. The ring all-reduce algorithm fixes this by distributing work evenly across all P GPUs in a ring topology.
The algorithm has two phases, each requiring P-1 steps:
Phase 1 — Reduce-Scatter: Arrange P GPUs in a ring. Each GPU holds a vector of M elements, split into P chunks of M/P each. In each of P-1 steps, every GPU sends one chunk to its right neighbor and receives one chunk from its left neighbor, accumulates (sums). After P-1 steps, each GPU holds one fully-reduced chunk of size M/P.
Phase 2 — All-Gather: Each GPU holds one correct chunk. In P-1 more steps, each GPU sends its chunk rightward. After P-1 steps, every GPU has all P chunks — the full reduced vector.
Now let's derive the bandwidth cost per GPU. In phase 1, each GPU sends M/P bytes in each of P-1 steps: (P-1) × M/P = (P-1)/P × M bytes sent, same received. Phase 2: same. Total per GPU:
For large P, (P-1)/P → 1, so the cost approaches 2M — twice the message size, independent of P. This is remarkable: adding more GPUs doesn't increase the communication cost per GPU. The ring is bandwidth-optimal.
Time to complete the all-reduce, given bandwidth B GB/s per link:
Example: 70B parameters × 2 bytes (bf16 gradients) = 140 GB. Ring all-reduce over P=32 GPUs on InfiniBand HDR (25 GB/s): T = 2 × 31/32 × 140 / 25 = 10.85 seconds. That's the communication cost floor per training step — which must be hidden by overlapping with computation.
Drag sliders to set model parameters and bandwidth. See how all-reduce time scales with P. Note it quickly plateaus.
Data parallelism replicates the model — memory doesn't scale. ZeRO shards state — memory scales but you're still data-parallel at the compute level. Tensor parallelism (also called model parallelism along the width axis) takes a different approach: split the weight matrices themselves across GPUs, so each GPU only ever holds a fraction of the parameters.
The key mathematical insight: matrix multiplication can be decomposed into sub-matrix multiplications. Given Y = X W, where X is the input (B × din) and W is the weight (din × dout), you can partition W along either its columns or rows:
Column-parallel: Split W into P column blocks W = [W1 | W2 | ... | WP], where each Wi is din × (dout/P). GPU i computes Yi = X Wi without any communication (X is replicated or gathered). The result Yi is a partial output of shape B × (dout/P).
Row-parallel: Split W into P row blocks W = [W1; W2; ...; WP], each of shape (din/P) × dout. GPU i gets input shard Xi (shape B × din/P) and computes partial Yi = Xi Wi. The final output Y = Y0 + Y1 + ... + YP-1 requires an all-reduce over P GPUs.
The communication cost per transformer block is one all-reduce over the activations (shape B × s × h), not over the parameters. For batch B=1, sequence s=4096, hidden h=4096, in bf16: 1 × 4096 × 4096 × 2 = 32 MB per all-reduce. With two all-reduces per block (one for attention, one for FFN) and 96 blocks (Llama 3 405B): 2 × 96 × 32 MB = 6.1 GB of all-reduce per forward pass. That's 6.1 / 600 GB/s = 10 ms on NVLink. Compare to the attention + FFN compute time of ~100 ms → 10% overhead. Acceptable on NVLink; unacceptable on InfiniBand.
See how Y = X·W is split across P GPUs. Toggle between column-parallel (no comm) and row-parallel (all-reduce at end).
Let's do the accounting concretely. A transformer model has four memory consumers: parameters, gradients, optimizer state, and activations. Each parallelism strategy handles these differently. Understanding which gets sharded — and by how much — tells you exactly how many GPUs you need to fit a model.
For a 70B model with AdamW mixed precision (fp32 master + bf16 weights + bf16 grads + fp32 moments):
| Component | Bytes/param | 70B total | Per-GPU (P=8) |
|---|---|---|---|
| bf16 weights | 2 | 140 GB | 17.5 GB (TP) or 140 GB (DP) |
| bf16 gradients | 2 | 140 GB | 17.5 GB (ZeRO-2+) or 140 GB (naive) |
| fp32 master weights | 4 | 280 GB | 35 GB (ZeRO-1+) or 280 GB (naive) |
| fp32 Adam m1 | 4 | 280 GB | 35 GB (ZeRO-1+) or 280 GB (naive) |
| fp32 Adam m2 | 4 | 280 GB | 35 GB (ZeRO-1+) or 280 GB (naive) |
| Total | 16 | 1.12 TB | ? |
Activations are the fourth consumer and are often overlooked. For a transformer layer with batch B, sequence s, and hidden dimension h, the activation memory required to store all intermediate values for backprop is approximately:
The 10 terms come from: attention scores (s×s), value projections, FFN intermediate values, layer norm inputs, and dropout masks. For Llama 3 70B with h=8192, s=4096, B=1, 32 layers, bf16: 10 × 4096 × 1 × 8192 × 2 × 32 = 21.5 GB. This is in addition to the parameter memory above.
Set model size, GPU count, and parallelism strategy to see total memory per GPU. The dashed line is 80 GB (A100 limit).
Production LLM training runs never use just one form of parallelism. The Llama 3 405B paper, DeepSeek V3, and Yi all combine data parallelism, tensor parallelism, and pipeline parallelism simultaneously. This is called 3D parallelism — three orthogonal axes of splitting.
The rules of thumb from the Megatron-LM paper (Narayanan et al. 2021) are:
Step 1: Tensor parallel within each node. Use TP degree up to 8 (one per GPU per node). This uses NVLink and is cheap. TP splits weight matrices, reducing memory by TP degree.
Step 2: Pipeline parallel across nodes. After TP fills up the intra-node bandwidth, scale to multiple nodes by assigning stages of the model to different nodes. PP communication is point-to-point activation passing — much smaller than gradient all-reduce. But PP requires large batch sizes to hide the "pipeline bubble."
Step 3: Data parallel for the rest. Once the model fits (after TP + PP), scale out compute by replicating the TP+PP group across more GPU sets with data parallelism. Use ZeRO stage 1 or 2 to reduce the memory overhead of DDP without increasing communication cost.
Set model size and cluster shape. See how TP, PP, and DP combine to fit the model and what the communication costs are. Red = model doesn't fit.
This lesson covered the two most bandwidth-critical forms of LLM parallelism. Here's the full picture of what we covered and where it fits:
| Technique | What's sharded | Comm per step | Memory scaling | Batch size? |
|---|---|---|---|---|
| Naive DDP | Batch only | 2P (all-reduce) | None | Linear |
| ZeRO Stage 1 | Optimizer state | 2P (RS+AG) | Partial (opt) | Linear |
| ZeRO Stage 2 | Opt + Grads | 2P (RS+AG) | Partial | Linear |
| ZeRO Stage 3 / FSDP | Opt + Grads + Params | 3P | Linear | Linear |
| Tensor Parallel | Weight matrices | 8×bsh per layer (NVLink) | Linear | None |
| Pipeline Parallel | Layers | bsh per microbatch | Linear | Needs large! |
What's coming in Lec 8 (Parallelism II): Pipeline parallelism in depth — micro-batches, the bubble formula (nstages-1)/nmicro, zero-bubble pipelining (splitting backward into activation + weight gradient passes), and sequence parallelism (splitting the LayerNorm and dropout along the sequence dimension to reduce activation memory). Sequence parallelism is what makes activation memory truly linear in GPU count.
"The question is not whether to parallelize, but how to partition the problem across the three axes — memory, compute, and communication — and that requires understanding all three at once."
— paraphrased from Tatsu Hashimoto, CS336 Lecture 7