TinyML & Efficient Deep Learning · MIT 6.5940 · Lecture 14

Efficient LLM Post-Training: LoRA, QLoRA & PEFT

You have a 7B-parameter base model. You want to specialize it for your domain (medical QA, code generation, customer support) without buying a data center. Full fine-tuning needs 70 GB just for gradients and optimizer states, and you end up with a full 14 GB copy per task. Can you adapt with less than 1% of the parameters on a single consumer GPU? Yes. This lesson derives exactly how, from low-rank algebra to 4-bit quantization to merging adapters at zero inference cost.

Prerequisites: TinyML L6 (Quantization II) — PTQ, QAT, NF4 concept. TinyML L12 (Efficient Transformers) — Transformer weight matrices, KV cache. TinyML L13 (LLM Deployment) — 14 GB / 3.5 GB memory math.
10 Chapters · 5 Live Canvases · Derived From First Principles

Chapter 0: The Fine-Tuning Wall

You work at a hospital. GPT-4 is brilliant at general conversation but makes medication errors on rare drug names. You want to fine-tune a 7B open model on your clinical notes. Simple enough — except the math says no.

A 7B-parameter model in FP16 occupies 14 GB just for the weights: 7 × 10⁹ × 2 bytes = 14 × 10⁹ bytes. But to train it, you need more. You need the gradients: another 14 GB. And the optimizer states: Adam stores two buffers per parameter (running estimates of the gradient's first and second moments), in FP32, adding 7 × 10⁹ × 2 × 4 bytes = 56 GB for the optimizer alone. Total: 14 + 14 + 56 = 84 GB, before counting activations. That is more than an A100-80GB holds, even for a 7B model.
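The arithmetic above can be wrapped in a few lines of Python so it is easy to rerun for other model sizes. This is a minimal sketch: the function name and its defaults (FP16 weights and gradients, two FP32 Adam buffers) are illustrative assumptions taken from the paragraph above, and activations are ignored.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Weights + gradients + two Adam buffers, in GB (10^9 bytes). Activations are ignored."""
    weights = n_params * weight_bytes
    grads = n_params * grad_bytes
    adam = n_params * 2 * adam_state_bytes   # two buffers per parameter
    gb = 1e9
    return {
        "weights": weights / gb,
        "gradients": grads / gb,
        "adam": adam / gb,
        "total": (weights + grads + adam) / gb,
    }

budget = full_finetune_memory_gb(7e9)
print(budget)
```

Calling the same function with `65e9` gives the 65B figures used later in the lesson.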

Then there's the deployment problem. You need five specialized versions: clinical QA, radiology reports, billing codes, nursing notes, patient communication. That's five separate 14 GB model copies. 70 GB of disk space and five GPUs to serve them simultaneously.

The two problems PEFT solves: (1) Training cost — you can't keep gradients and optimizer states for all 7B parameters on consumer hardware. (2) Storage/serving cost — one full copy per specialized task is prohibitive. Parameter-Efficient Fine-Tuning (PEFT) solves both: train <1% of parameters, store only the tiny diff per task, share the frozen base in memory.

The key insight is that fine-tuning doesn't require changing all parameters. A pretrained LLM already knows language, grammar, and most facts. Specialization is a small delta on top of that base. The question is: what is the minimum parameterization that captures this delta?

Full Fine-Tuning Memory: 7B Model

The stacked bar shows where memory goes during full fine-tuning. Toggle the model size. The dashed line marks an RTX 4090 (24 GB VRAM). Watch the gap.

A 7B FP16 model needs how much total GPU memory for full fine-tuning with Adam optimizer (FP32 optimizer states)?

Chapter 1: Alignment: SFT & RLHF

Before we get to efficient fine-tuning, we need to understand why fine-tuning happens at all. A pretrained language model predicts the next token on internet text. It does not follow instructions, stay safe, or give helpful answers by default. Post-training is the process of aligning the model's behavior with human intent.

Supervised Fine-Tuning (SFT) is the simplest form: collect (prompt, ideal response) pairs, then train the model with the standard next-token prediction loss on those responses only. The model learns "when given this kind of prompt, produce this kind of output." Llama-2's SFT dataset includes both helpfulness and safety demonstration pairs.

L_SFT = − Σ_i log P(u_i | u_0, …, u_{i−1}; θ)
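As a rough illustration of this loss, the NumPy sketch below sums the negative log-probability of each target token, but only at response positions. The function name, the mask convention (1 for response tokens, 0 for prompt tokens), and the toy shapes are assumptions made for the example, not part of any particular training library.

```python
import numpy as np

def sft_loss(logits, targets, response_mask):
    """Sum of -log P(target) over positions where response_mask is 1.

    logits: [T, V] scores for the next token at each position
    targets: [T] index of the token that actually came next
    response_mask: [T] 1 for response tokens, 0 for prompt tokens
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * response_mask).sum())

T, V = 6, 10
logits = np.zeros((T, V))               # a model that predicts uniformly over 10 tokens
targets = np.array([1, 2, 3, 4, 5, 6])
mask = np.array([0, 0, 0, 1, 1, 1])     # first three positions belong to the prompt
loss = sft_loss(logits, targets, mask)
```

With uniform predictions each response token costs log(V), so the loss here is three times log(10); prompt positions contribute nothing.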

SFT works, but it only covers what's in your demonstration dataset. It doesn't capture what humans actually prefer among multiple valid responses. That's the gap RLHF (Reinforcement Learning from Human Feedback) fills. Humans rank pairs of model responses, a reward model learns these preferences, then the language model is fine-tuned to maximize the reward — with a KL penalty to prevent it from "hacking" the reward model by drifting far from the pretrained distribution.

DPO shortcut: RLHF involves three models: a trained reward model, a frozen reference model, and the policy being fine-tuned. Direct Preference Optimization (DPO, Rafailov et al. 2023) collapses this to a single-stage classification loss. The key insight: the optimal RLHF policy can be expressed in closed form in terms of the reference model, so you can optimize the policy directly from preference pairs without ever training a reward model. The reference model is still needed to compute the loss. The paper reports quality comparable to PPO-based RLHF with much simpler training.
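A minimal sketch of the per-pair DPO loss, assuming you already have the summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference. The function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * margin), where the margin compares policy-vs-reference log-ratios."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) == log(1 + exp(-z)); adequate for moderate z in this sketch
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference expressed yet
loss_at_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has raised the chosen response and lowered the rejected one
loss_after_learning = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Note that no reward model appears anywhere, but the reference log-probabilities do: the frozen reference is still part of the computation.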

In all three cases (SFT, RLHF, DPO), the underlying operation is updating the model weights via gradient descent. And that's where the fine-tuning wall from Chapter 0 bites us. SFT on a 7B model still requires 84 GB with full fine-tuning (FP16 weights and gradients, FP32 Adam states). RLHF is worse: you're keeping three models in memory at once. The efficiency techniques in the rest of this lesson apply to all these post-training regimes.

DPO is described as "direct preference optimization." What makes it more efficient than RLHF, and what does it still require?

Chapter 2: Early PEFT Methods

The PEFT research community attacked the fine-tuning wall from several directions before LoRA became dominant. Understanding these methods clarifies why LoRA won — each earlier approach had a critical weakness.

BitFit (Zaken et al., 2021) is the simplest idea: freeze all weights, train only the bias terms. BERT-base has 110M parameters, but only 0.1M biases — that's 1,000× fewer trainable parameters. For small to medium datasets, BitFit is surprisingly competitive with full fine-tuning. For larger datasets it falls behind. Reason: bias terms have limited capacity. They shift activations but can't rotate the representation space.

Adapter (Houlsby et al., 2019) inserts small bottleneck modules inside each Transformer layer: one after the multi-head attention projection and one after each feed-forward layer. Each adapter: down-project from d to r (r ≪ d), apply non-linearity, up-project back to d, add skip connection. Only adapter weights are trained; all original weights are frozen. For 1000 tasks with a 7B model: full fine-tuning = 1000 × 14 GB = 14 TB of storage; adapters = 1000 × ~14 MB = 14 GB. A thousand-fold reduction.
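The bottleneck structure described above can be sketched in a few lines of NumPy. This is an illustrative simplification: biases are omitted, ReLU stands in for the non-linearity, and the zero-initialized up-projection (which makes the adapter start as an identity map) is an assumption chosen for the example rather than the paper's exact initialization.

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, then add the skip connection."""
    z = np.maximum(h @ W_down, 0.0)   # [batch, r]
    return h + z @ W_up               # [batch, d]

d, r = 4096, 16                       # illustrative sizes, r much smaller than d
rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.01, size=(d, r))
W_up = np.zeros((r, d))               # zero up-projection: adapter output equals its input at start
h = rng.normal(size=(2, d))

out = adapter_forward(h, W_down, W_up)
adapter_params = W_down.size + W_up.size   # 2 * d * r trainable values per adapter
```

The two extra matrix multiplies and the non-linearity between them run on every forward pass, which is the cost the next paragraph discusses.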

Adapter's fatal flaw: the adapter module sits inside the forward pass, sequentially between transformer sub-layers. You can't merge it away at inference time — it adds extra matrix multiplications on every forward pass. On a GPU serving thousands of requests, this ~10% latency overhead becomes unacceptable. The inference latency of a single forward pass in GPT-2 medium increased measurably even with tiny adapter dimensions.

Prompt Tuning / Prefix Tuning (Lester 2021, Li & Liang 2021) prepend learned "soft prompt" tokens to the input instead of changing weights. Prompt tuning adds learnable tokens to the first layer only. Prefix tuning adds them to every layer's key/value matrices. Both are elegant but share a problem: they consume context window. If you prepend 100 learned tokens to a model with a 2048-token window, you lose ~5% of usable context, and inference is slower because the attention spans those extra tokens.

All early PEFT methods share a training efficiency win: fewer trainable params, smaller optimizer states. But each pays elsewhere. Adapters and prompt/prefix tuning add inference overhead, and BitFit, which adds none, has limited capacity. The open question entering 2022 was: can we build a PEFT method with enough capacity that adds no inference overhead at all?

Adapter modules add ~10% inference latency. What architectural property prevents you from "folding" an adapter into the frozen weight matrix (as you can fold BatchNorm into Conv)?

Chapter 3: LoRA: Low-Rank Algebra

LoRA (Hu et al., ICLR 2022) solves the inference-latency problem with one algebraic trick: the fine-tuning update is low-rank, and low-rank matrices can be added to the original weight before inference begins — no new forward-pass overhead at all.

Here is the core idea. A pretrained weight matrix W has shape d×k (for a key/query/value projection in a Transformer, d = k = 4096 in LLaMA-2-7B). When you fine-tune, you update W ← W + ΔW. Full fine-tuning computes and stores the entire d×k update ΔW — all 4096 × 4096 = 16.8 million numbers.

LoRA hypothesizes that ΔW has low intrinsic rank: the fine-tuning update can be captured by a matrix whose rank r is much smaller than d. Formally, instead of learning ΔW directly, LoRA learns two small matrices:

ΔW = B · A,      B ∈ ℝ^{d×r},  A ∈ ℝ^{r×k},  r ≪ min(d, k)

The forward pass becomes:

h = xW + (α/r) · xBA      (the α/r scaling is applied to the BA term)

Initialization: A is initialized with random Gaussian values; B is initialized to zeros. This ensures ΔW = B·A = 0 at the start of training — the model behaves identically to the pretrained base at initialization, so fine-tuning starts from a stable baseline.

Merge at inference — zero latency overhead: After training, LoRA computes W' = W + (α/r)·B·A once offline, then uses W' everywhere. The runtime model has the same weight-matrix shapes as the original. No extra matmuls, no changed architecture, no extra memory at inference time. This is what adapters couldn't do.
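The zero-init and merge properties are easy to check numerically. The sketch below uses the same row-vector convention as the formulas above (W is d×k, B is d×r, A is r×k); the small sizes and the random "trained" B are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16        # small illustrative sizes
scale = alpha / r

W = rng.normal(size=(d, k))           # frozen pretrained weight
A = rng.normal(size=(r, k))           # random init
B = np.zeros((d, r))                  # zero init, so B @ A = 0
x = rng.normal(size=(4, d))

# At initialization the adapted layer matches the base layer exactly
h_init = x @ W + scale * (x @ B @ A)

# Stand-in for training: pretend B has moved away from zero
B = rng.normal(scale=0.1, size=(d, r))
h_adapter = x @ W + scale * (x @ B @ A)   # two-branch forward pass

# Merge once offline, then use a single matmul at inference
W_merged = W + scale * (B @ A)
h_merged = x @ W_merged
```

`h_adapter` and `h_merged` agree up to floating-point rounding, and `W_merged` has the same shape as `W`, so the serving code does not change.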

The scaling factor α/r (where α is a hyperparameter, often set to the rank r or 2r) controls the learning rate of the LoRA update relative to the frozen weights. A larger α makes the adapter update more aggressive.

LoRA Low-Rank Decomposition — Matrix Visualization

The d×k update matrix ΔW (left) is parameterized as B(d×r)·A(r×k). Drag rank r. Watch how parameter count collapses and how the low-rank structure captures the update's principal directions.

Why is B initialized to zeros (not random) in LoRA, while A is initialized randomly?

Chapter 4: LoRA: Parameter Math

Let's work out the exact numbers. A single linear layer in LLaMA-2-7B (e.g., the Q-projection) has d = 4096, k = 4096. Full fine-tuning trains d × k = 16,777,216 parameters for this layer — and stores that many gradient values and 2× that many Adam states in FP32.

LoRA at rank r = 8 trains: A is r × k = 8 × 4096 = 32,768 values; B is d × r = 4096 × 8 = 32,768 values; total = 65,536 parameters. That's a 256× reduction for this layer alone.

Params_LoRA = r × (d + k)      Params_full = d × k
Savings = (d × k) ÷ (r × (d + k)) = (4096 × 4096) ÷ (8 × 8192) = 16.77M ÷ 65.5K = 256×

In a full LLaMA-2-7B model with 32 layers, LoRA is typically applied to the Q, K, V, and O projection matrices (4 per layer × 32 layers = 128 matrices). Total LoRA parameters at r = 8: 128 × 65,536 = 8,388,608 ≈ 8.4M — less than 0.12% of the 7B total. The entire adapter fits in 16 MB (FP16).
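These counts can be reproduced with two small helper functions. The function names are illustrative assumptions; the numbers are the ones used in the text (d = k = 4096, r = 8, 32 layers, four projections per layer).

```python
def lora_params(d, k, r):
    """Trainable values in one LoRA pair: B is d x r, A is r x k."""
    return r * (d + k)

def lora_model_params(d, k, r, n_layers, matrices_per_layer):
    """Total LoRA parameters when every adapted matrix has the same shape."""
    return lora_params(d, k, r) * n_layers * matrices_per_layer

per_matrix = lora_params(4096, 4096, 8)
total = lora_model_params(4096, 4096, 8, n_layers=32, matrices_per_layer=4)
reduction = (4096 * 4096) / per_matrix      # full-rank update vs LoRA, per matrix
fraction_of_7b = total / 7e9
fp16_bytes = total * 2
print(per_matrix, total, reduction, fraction_of_7b, fp16_bytes)
```

Changing `r` in these calls shows how parameter count grows linearly with rank while the full-rank cost stays fixed.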

Why is ΔW low-rank? The hypothesis (backed by empirical evidence across many tasks and model sizes) is that language model adaptation lives in a low-dimensional subspace of the full parameter space. The model "knows" how to be helpful, factual, and stylistic — it just needs a small rotation in representation space to apply those capabilities to your domain. That rotation is naturally low-rank. Aghajanyan et al. (2021) showed that fine-tuning trajectories of large models have intrinsic dimensionality far below the parameter count.
python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W with trainable low-rank update B @ A."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight (random placeholder here), stored (d_out, d_in) like nn.Linear
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable LoRA adapter
        self.A = nn.Parameter(torch.empty(rank, d_in))   # r x k
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # d x r, zeros so the update starts at 0
        nn.init.kaiming_uniform_(self.A, a=5**0.5)        # random (Kaiming-uniform) init for A
        self.scale = alpha / rank                         # α/r scaling factor
        self.merged = False

    def forward(self, x):
        # x @ W.T + (x @ A.T) @ B.T * scale
        # At init: B=0, so output = x @ W.T exactly
        base = x @ self.W.T
        if self.merged:
            return base                                   # adapter already folded into W
        lora = x @ self.A.T @ self.B.T * self.scale
        return base + lora

    def merge(self):
        """Fold adapter into W so inference needs no extra matmuls."""
        if self.merged:
            return
        with torch.no_grad():
            self.W += self.scale * (self.B @ self.A)
        self.A = self.B = None   # free adapter memory
        self.merged = True
For d=4096, k=4096 (Q-projection in LLaMA-2-7B), rank r=16: how many LoRA parameters, and what is the parameter reduction vs full fine-tuning?

Chapter 5: QLoRA: 4-bit Fine-Tuning

LoRA dramatically reduces the trainable parameters — but the frozen base model still occupies 14 GB in FP16. On a 24 GB RTX 4090, that leaves only 10 GB for gradients, optimizer states, activations, and the LoRA adapters. For a 7B model it's tight. For a 65B model it's completely impossible.

QLoRA (Dettmers et al., NeurIPS 2023) combines LoRA with aggressive quantization of the frozen base. The key insight: the frozen weights don't need to be in FP16 during training. They're never updated — they're just read to compute forward-pass activations. If you store them in 4-bit, the 7B base occupies only 3.5 GB instead of 14 GB.

The QLoRA training loop works like this:

  1. Store: frozen base weights in NF4 (4-bit) on GPU, 3.5 GB for 7B.
  2. Dequantize (forward pass): each layer's NF4 weights → BF16 temporarily for the matrix multiply.
  3. Add LoRA (multiply): BF16 LoRA adapters (B @ A × α/r) added to the dequantized output.
  4. Update only A, B (backprop): gradients and Adam states exist only for LoRA params (<1% of 7B).

Critical misconception: QLoRA does NOT need a straight-through estimator (STE), because it never computes gradients with respect to the quantized weights. The frozen weights never change, so their quantization error is a fixed constant baked into the base model. During backprop, activation gradients flow through the dequantized (BF16) weights on their way to the LoRA adapters, and only A and B receive weight gradients. The frozen base acts as a read-only lookup table, not a quantization operation inside the trained path.
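The loop can be sketched end to end in NumPy. Everything here is a simplified stand-in: the quantizer is a per-tensor 16-level uniform grid rather than the real NF4 codebook, the loss is a toy squared error, the gradient is written by hand, and only B is updated (at B = 0 the gradient for A is zero anyway). The point is only to show which tensors change during a training step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 32, 32, 4, 8
scale = alpha / r

# Frozen base, stored as 4-bit-style integer codes plus one scale.
# Stand-in quantizer: uniform 16-level grid, NOT the actual NF4 levels.
W = rng.normal(scale=0.1, size=(d_out, d_in))
absmax = np.abs(W).max()
codes = np.clip(np.round(W / absmax * 7), -8, 7).astype(np.int8)
codes_snapshot = codes.copy()

def dequantize(codes, absmax):
    return codes.astype(np.float64) * (absmax / 7)

A = rng.normal(scale=0.1, size=(r, d_in))
B = np.zeros((d_out, r))
x = rng.normal(size=(8, d_in))
target = rng.normal(size=(8, d_out))

def forward(B):
    W_deq = dequantize(codes, absmax)                 # temporary higher-precision copy
    return x @ W_deq.T + scale * (x @ A.T) @ B.T      # frozen path + LoRA path

def loss_fn(h):
    return 0.5 * float(np.mean((h - target) ** 2))

h = forward(B)
loss_before = loss_fn(h)
grad_h = (h - target) / h.size                        # dL/dh for the mean squared error
grad_B = scale * grad_h.T @ (x @ A.T)                 # gradient exists only for the adapter
B = B - 0.5 * grad_B                                  # one plain SGD step
loss_after = loss_fn(forward(B))
```

The integer codes are read on every forward pass but never written, which is why no gradient with respect to them is ever needed.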

Three innovations make QLoRA work:

  1. NF4 quantization — a new 4-bit data type optimized for normally-distributed weights (explained in Ch.6)
  2. Double quantization — quantize the quantization constants themselves, saving ~0.37 bits/param more
  3. Paged optimizers — NVIDIA's unified memory manages optimizer state spikes by swapping to CPU RAM when GPU memory overflows, preventing OOM crashes during long-context samples

Result: A 65B model (normally requires 780 GB for full fine-tuning) can be fine-tuned on a single 48 GB A40 GPU with QLoRA. The resulting model (Guanaco) achieves 99.3% of ChatGPT's performance on the Vicuna benchmark.

In QLoRA, gradients flow through the frozen base model weights. Why doesn't the 4-bit quantization of those frozen weights break gradient accuracy?

Chapter 6: NF4 & Double Quantization

Standard INT4 assigns equally-spaced values between -8 and 7. But LLM weight values are not uniformly distributed: they follow a bell-shaped (approximately normal) distribution, concentrated near zero. If you use INT4's uniform grid, many of your quantization levels sit in the sparse tails where few weights live, while the dense region near zero gets too few levels. That wastes representational budget.

NormalFloat (NF4) constructs its 16 levels by placing them at the quantiles of the standard normal distribution — so each level represents an equal fraction of the probability mass. More levels near zero (where most weights live), fewer levels at the extremes. Same 4 bits, lower average quantization error.

The exact NF4 values (16 levels):

[−1.0, −0.696, −0.525, −0.395, −0.284, −0.185, −0.091, 0.0,
0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0]

Notice they're denser near zero and sparser at the extremes — exactly matching where normal-distributed weights actually live. Before quantizing a weight tensor, it's scaled to [-1, 1] using the absolute maximum, then each value is mapped to its nearest NF4 level.
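A quick experiment makes the comparison concrete. The sketch below draws synthetic normally distributed "weights" (an assumption standing in for real LLM weights), applies absmax scaling per group of 64, and measures the mean absolute rounding error against the NF4 levels listed above and against a symmetric INT4 grid. The helper name and the choice of error metric are illustrative.

```python
import numpy as np

NF4 = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nearest_level_error(w_norm, levels):
    """Mean absolute error after snapping each value to its nearest level."""
    idx = np.abs(w_norm[..., None] - levels).argmin(axis=-1)
    return float(np.abs(w_norm - levels[idx]).mean())

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 64))                        # 4096 groups of 64 synthetic weights
w_norm = w / np.abs(w).max(axis=1, keepdims=True)      # absmax scaling to [-1, 1]

int4_levels = np.arange(-8, 8) / 7.0                   # integer codes -8..7 with scale absmax/7
nf4_err = nearest_level_error(w_norm, NF4)
int4_err = nearest_level_error(w_norm, int4_levels)
print(f"NF4 error:  {nf4_err:.4f}")
print(f"INT4 error: {int4_err:.4f}")
```

On this synthetic data the NF4 error comes out lower, because most values fall in the region near zero where NF4 spaces its levels more tightly.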

NF4 vs INT4: Quantization Level Placement

A normal-distributed weight histogram (gray bars). Red ticks = INT4 uniform levels. Teal ticks = NF4 levels. Toggle to see which placement has lower average quantization error.

Double Quantization (DQ): NF4 uses one FP32 scaling constant per group of 64 weights (i.e. 32 bits / 64 weights = 0.5 bits/param overhead). Double quantization quantizes these scale constants to 8-bit floats, with one second-level FP32 constant shared by each block of 256 scale constants. This reduces the overhead from 0.5 bits/param to ~0.127 bits/param, saving an additional 0.37 bits per parameter, worth ~3 GB on a 65B model.
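The overhead figures follow from simple bookkeeping, sketched here in Python. The function names are illustrative; the block sizes and bit widths are the ones stated in the paragraph above.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Bits per weight spent on one scale constant per block."""
    return scale_bits / block_size

def double_quant_overhead_bits(block_size=64, scale_bits=8,
                               second_block=256, second_scale_bits=32):
    """8-bit first-level scales plus one 32-bit constant per 256 first-level scales."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block)
    return first_level + second_level

plain = scale_overhead_bits()
dq = double_quant_overhead_bits()
saved_bits = plain - dq
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9     # bits -> bytes -> GB for 65B weights
print(plain, dq, saved_bits, saved_gb_65b)
```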

python
# NF4 quantization (simplified implementation)
import numpy as np

NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0
], dtype=np.float32)  # normal-distribution quantiles, normalized to [-1, 1]

def quantize_nf4(w, group_size=64):
    """Quantize weight tensor w to NF4 codes in groups of group_size.

    w.size must be divisible by group_size.
    """
    w_flat = w.reshape(-1, group_size)
    abs_max = np.abs(w_flat).max(axis=1, keepdims=True)  # [N/group_size, 1]
    w_norm = w_flat / (abs_max + 1e-8)                     # scale to [-1, 1]
    # Find nearest NF4 level for each value
    indices = np.argmin(np.abs(w_norm[:,:,None] - NF4_LEVELS[None,None,:]), axis=2)
    # Codes are 0..15 (4 bits of information) but stored one per uint8 here;
    # a real kernel would pack two codes into each byte.
    return indices.astype(np.uint8), abs_max.astype(np.float32)

def dequantize_nf4(indices, abs_max):
    """Reconstruct flattened float32 weights from NF4 codes and per-group scales."""
    w_norm = NF4_LEVELS[indices]                # map indices back to float levels
    return (w_norm * abs_max).reshape(-1)     # rescale and flatten
NF4 places its 16 quantization levels at the quantiles of a normal distribution. Why does this reduce average quantization error compared to INT4's uniform levels for LLM weights?

Chapter 7: Memory Budget Lab

Let's compare the three training regimes side by side with exact numbers. For LLaMA-2-7B (7 × 10⁹ params) fine-tuning a single epoch on your task dataset:

| Component | Full FT (FP16/FP32) | LoRA (r=8) | QLoRA (r=8, NF4) |
|---|---|---|---|
| Model weights | 14 GB (FP16) | 14 GB (frozen FP16) | 3.5 GB (NF4 4-bit) |
| Trainable params | 7B (all) | ~8.4M (0.12%) | ~8.4M (0.12%) |
| Gradients | 14 GB (FP16, all layers) | ~17 MB (adapters only) | ~17 MB (adapters only) |
| Adam states | 56 GB (FP32, 2× params) | ~67 MB (2× adapters, FP32) | ~67 MB (2× adapters, FP32) |
| Total (approx) | ~84 GB | ~14 GB + activations | ~3.5 GB + activations |
| Single GPU feasible? | No (needs 2× A100-80GB) | Tight on a 24 GB GPU (RTX 4090) | Yes: RTX 3090 (24 GB) |

For LLaMA-65B (65 × 10⁹ params):

| Regime | Weights | Opt. states | Total (incl. gradients) | Hardware needed |
|---|---|---|---|---|
| Full fine-tuning | 130 GB | 520 GB | ~780 GB | 10× A100-80GB |
| LoRA r=8 | 130 GB frozen | ~0.3 GB | ~131 GB | 2× A100-80GB |
| QLoRA r=8 (NF4) | 32.5 GB (4-bit) | ~0.3 GB | ~36 GB | 1× A40 (48 GB) |

The QLoRA 65B result is remarkable: a model that previously required 10 GPUs becomes fine-tunable on one. The LoRA adapter for the 65B model is still small. Assuming LLaMA-65B's hidden size of 8192 and 80 layers, r × (d + k) × 80 layers × 4 matrices × 2 bytes (BF16) = 8 × 16,384 × 320 × 2 ≈ 84 MB. The entire specialization of a 65B model fits on a USB thumb drive.
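To compare regimes for other sizes, the fixed memory costs can be computed with one helper. This is a sketch under the lesson's assumptions (FP16 gradients and adapter weights, two FP32 Adam buffers per trainable parameter, activations and quantization constants ignored); the function and argument names are illustrative.

```python
def regime_memory_gb(n_params, adapter_params=0, base_bits=16, train_base=False):
    """Fixed training memory in GB: base weights, adapter weights, gradients, Adam states."""
    base_weights = n_params * base_bits / 8
    trainable = n_params if train_base else adapter_params
    adapter_weights = 0 if train_base else adapter_params * 2   # FP16/BF16 adapters
    grads = trainable * 2                                        # FP16 gradients
    adam = trainable * 2 * 4                                     # two FP32 buffers
    return (base_weights + adapter_weights + grads + adam) / 1e9

adapter_7b = 8_388_608                                           # r=8 on Q, K, V, O of a 7B model
full_7b = regime_memory_gb(7e9, train_base=True)
lora_7b = regime_memory_gb(7e9, adapter_params=adapter_7b)
qlora_7b = regime_memory_gb(7e9, adapter_params=adapter_7b, base_bits=4)
full_65b = regime_memory_gb(65e9, train_base=True)
print(full_7b, lora_7b, qlora_7b, full_65b)
```

In both adapter regimes the adapter-related terms sum to roughly 0.1 GB, so almost all of the remaining fixed cost is the frozen base.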
Fine-Tuning Memory Breakdown — Full vs LoRA vs QLoRA

Stacked bar chart. Each segment = one memory component. Drag to 7B or 65B. Watch how QLoRA compresses the base weights from the dominant cost to nearly nothing.

QLoRA's main memory win over vanilla LoRA is not from the adapter itself. Where does the 4× memory saving come from?

Chapter 8: Multi-Adapter Serving & BitDelta

You fine-tuned 1000 specialized variants of your 7B model with LoRA — one per customer, enterprise, or domain. At deployment time, you have 1000 sets of adapters. The base model is shared, immutable, and loaded once in GPU memory. When a request arrives, you swap in the appropriate adapter, run the forward pass, and return the result.

This is multi-adapter serving: one frozen base on one GPU, many swappable LoRA adapters (16 MB each). Instead of 1000 × 14 GB = 14 TB of model storage and 1000 GPUs, you need 14 GB (one base) + 1000 × 16 MB = 14 GB + 16 GB = 30 GB total, roughly a 470× storage reduction. Per task, the marginal cost drops from 14 GB to 16 MB, an 875× reduction.
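The storage comparison is a two-line calculation once the shared base is counted alongside the adapters. The function names and the decimal MB-to-GB conversion are assumptions made for this sketch.

```python
def full_copy_storage_gb(n_tasks, model_gb=14.0):
    """One complete FP16 model per task."""
    return n_tasks * model_gb

def shared_base_storage_gb(n_tasks, base_gb=14.0, adapter_mb=16.0):
    """One shared base plus one small adapter per task (1 GB = 1000 MB)."""
    return base_gb + n_tasks * adapter_mb / 1000

full = full_copy_storage_gb(1000)
shared = shared_base_storage_gb(1000)
total_ratio = full / shared
per_task_ratio = 14.0 * 1000 / 16.0     # 14 GB copy vs 16 MB adapter
print(full, shared, total_ratio, per_task_ratio)
```

The total ratio keeps improving as the number of tasks grows, approaching the per-task ratio once the adapters outweigh the single shared base.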

Adapter batching: Advanced serving systems like S-LoRA (Sheng et al., 2023) batch requests with different adapters into the same GPU matmul using segmented sparse computations. The base layer runs as one fused kernel; the adapter deltas are batched per-adapter. This gives near-full GPU utilization even when serving many different fine-tuned variants simultaneously.
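The idea of sharing the base matmul across a mixed batch can be illustrated in NumPy. This is only a conceptual sketch: a Python loop over adapter groups stands in for the custom GPU kernels a real serving system would use, and all sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n_adapters = 32, 32, 4, 3
W = rng.normal(size=(d, k))                    # shared frozen base
A = rng.normal(size=(n_adapters, r, k))        # one (B, A) pair per adapter
B = rng.normal(size=(n_adapters, d, r))
x = rng.normal(size=(6, d))                    # six requests in one batch
adapter_ids = np.array([0, 2, 1, 0, 2, 2])     # adapter chosen by each request

def multi_adapter_forward(x, adapter_ids, scale=1.0):
    out = x @ W                                # base matmul runs once for the whole batch
    for a in np.unique(adapter_ids):           # then one low-rank update per adapter group
        rows = np.where(adapter_ids == a)[0]
        out[rows] += scale * (x[rows] @ B[a] @ A[a])
    return out

out = multi_adapter_forward(x, adapter_ids)

# Reference: treat every request as if it had its own merged weight matrix
reference = np.stack([x[i] @ (W + B[a] @ A[a]) for i, a in enumerate(adapter_ids)])
```

The batched result matches the per-request reference, while the expensive d×k multiply is shared by all six requests.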

BitDelta (Liu et al., NeurIPS 2024) takes this idea further: can we compress the fine-tuning delta (ΔW = W_fine − W_pretrained) even more aggressively than LoRA? The intuition is that fine-tuning adds less information than pretraining, so the delta should be more compressible. BitDelta compresses ΔW to 1 bit per weight with a per-tensor scaling factor. The sign matrix (±1) stays fixed; the scale γ is initialized as below and then calibrated by distillation so that the compressed model's outputs match the fine-tuned model's.

ΔW ≈ γ · sign(ΔW)     γ = mean(|ΔW|)
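The formula above translates directly into NumPy. This sketch covers only the initial 1-bit compression with γ = mean(|ΔW|); any later calibration of γ is left out, and the synthetic "fine-tuned" weights are an assumption for illustration.

```python
import numpy as np

def bitdelta_compress(W_fine, W_base):
    """Return the sign matrix and per-tensor scale for the delta W_fine - W_base."""
    delta = W_fine - W_base
    gamma = float(np.abs(delta).mean())
    signs = np.where(delta >= 0, 1, -1).astype(np.int8)
    return signs, gamma

def bitdelta_reconstruct(W_base, signs, gamma):
    return W_base + gamma * signs

rng = np.random.default_rng(0)
W_base = rng.normal(size=(256, 256))
W_fine = W_base + rng.normal(scale=0.01, size=(256, 256))   # synthetic small delta

signs, gamma = bitdelta_compress(W_fine, W_base)
W_hat = bitdelta_reconstruct(W_base, signs, gamma)

err_bitdelta = np.linalg.norm(W_fine - W_hat)      # error with the 1-bit delta applied
err_base_only = np.linalg.norm(W_fine - W_base)    # error if the delta is dropped entirely
packed = np.packbits(signs > 0)                    # 8 signs per byte: 1 bit per weight
```

On this synthetic delta the 1-bit reconstruction is closer to the fine-tuned weights than the base alone, and `packed` occupies one eighth of a byte per weight.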

Storage: a 7B model fine-tune with BitDelta requires only 7B bits = 875 MB, versus 14 GB in FP16. For multi-tenant serving, this means the per-task delta is 16× smaller than a full FP16 copy. It is still larger than a 16 MB rank-8 LoRA adapter, but it applies to models that were fully fine-tuned, where no adapter exists. This gives a "the more you serve, the more you save" effect: a fused binary GEMM kernel multiplies the 1-bit delta with the batch, and since the kernel is more memory-bandwidth efficient than FP16 matmul, throughput increases as batch size grows.

Multi-Adapter Serving: One Base + N Adapters

Drag the number of specialized adapters. Compare total storage (LoRA vs full-copy vs BitDelta). Click a task to see which adapter is "loaded."

You serve 500 specialized 7B models for 500 enterprise customers. In FP16 full-copy serving, storage = 500 × 14 GB = 7 TB. With LoRA (r=8, 16 MB adapters) + one shared base, what is the total storage?

Chapter 9: Connections & PEFT Cheat Sheet

You've now seen the complete PEFT landscape. Here is how the methods compare when you actually need to choose one.

| Method | Trainable params | Inference overhead | Best for | Key limitation |
|---|---|---|---|---|
| BitFit | ~0.1% (bias only) | None | Small datasets, quick adaptation | Limited capacity, large-dataset gap |
| Adapter | 0.5–3% (bottleneck) | ~10% latency | NLP tasks (BERT era) | Sequential overhead, can't merge |
| Prefix/Prompt Tuning | 0.01–0.1% | ~5% (longer context) | Generation tasks, large models | Consumes context window |
| LoRA | 0.01–1% (r=4–64) | None (after merge) | Any task, inference-critical | Rank selection, merging strategy |
| QLoRA | 0.01–1% (adapters) | None (after merge) | Consumer hardware, 65B models | Per-layer dequantization slows training |
| BitDelta | N/A (post-training) | ~None (fused kernel) | Multi-tenant serving | Requires fine-tuned model first |

Rank Selection Guide

There is no single right rank. The empirical guidance from the LoRA paper and subsequent ablations:

Rank r does NOT equal quality. Higher r = more trainable params = more capacity to overfit on small datasets. At r=256, LoRA approximates full fine-tuning — but you also get full fine-tuning's overfitting risks. For a 100-sample domain adaptation: r=4 is safer than r=256.

What comes next: L15 Long-Context LLMs. L14 showed how to specialize a model cheaply. L15 addresses context length: attention cost is O(N²) in sequence length N, so doubling the context quadruples the cost. Long-context techniques (RoPE extension, YaRN, LongLoRA, ring attention) work on top of the LoRA-fine-tuned models you now know how to build.
"Full fine-tuning says: update everything. LoRA says: the update is a small perturbation in a low-dimensional subspace — parameterize only that. QLoRA says: compress even the frozen base while we're at it. BitDelta says: once fine-tuned, the delta itself compresses to a single bit per weight. Each step is another way of saying the same thing: adaptation is cheap information." — MIT 6.5940 course notes