Overview & Tokenization — Language Models from Scratch (CS336 L1)

Chapter 0: Why Would Anyone Build One From Scratch?

You can get a world-class language model to write you a sonnet in three seconds by typing into a box. So why would a sane person spend a quarter implementing a tokenizer, a Transformer, an optimizer, and a training loop by hand? The honest answer is the same reason a chef learns to break down a chicken even though the supermarket sells breasts in plastic: the abstractions are leaky, and the interesting work lives underneath them.

Here is the trend the course opens with. Watch how researchers have related to the technology over time:

~8 years ago: researchers implemented and trained their own models. They touched every weight.
~6 years ago: researchers downloaded a pretrained model (say, BERT) and fine-tuned it. They touched the top.
Today: most researchers just prompt a proprietary model (GPT-4, Claude, Gemini). They touch the surface.

Each step up this ladder buys productivity — you get more done per hour. But it costs understanding, and unlike the abstractions in a programming language or an operating system (which almost never leak), these ones leak constantly. When your fine-tune mysteriously degrades, or your prompt works at 7B and breaks at 70B, the explanation lives several rungs below where you're standing.

The abstraction ladder — productivity up, understanding down

Slide through the three eras. The productivity bar rises as you climb; the stack you actually understand shrinks. CS336 deliberately drops you to the bottom rung — you will rebuild the whole stack.

eraprompt

The course's thesis is a single phrase: understanding via building. You cannot reproduce a frontier model in a class — GPT-4 reportedly has ~1.8 trillion parameters and cost ~$100M to train; xAI built a cluster of 200,000 H100s; the Stargate project commits $500B over four years. Those numbers are out of reach. But you can rebuild the machinery at small scale, and three kinds of knowledge come out of that exercise:

Mechanics

How things work: what a Transformer is, how parallelism uses GPUs. Transfers fully.

Mindset

Squeeze the hardware; take scale seriously (scaling laws). Transfers fully.

Intuitions

Which data/modeling choices help accuracy. Transfers only partially — small ≠ large.

That last caveat is important and honest: more is different. A model under 1B parameters (what we train in this class) is not a tiny photograph of a 1T model. The fraction of compute spent in attention versus the MLP shifts with scale; some capabilities emerge only past a size threshold and are simply absent below it. So mechanics and mindset transfer cleanly; intuitions you must hold loosely.

The bitter lesson, read correctly. The famous "bitter lesson" is often misquoted as "scale is all that matters, algorithms don't." The course insists on the right reading: algorithms that scale are what matter. Captured in one equation:

accuracy = efficiency × resources

At large scale you cannot afford to be wasteful, so efficiency matters more, not less. One study found a 44× gain in algorithmic efficiency on ImageNet between 2012 and 2019 — pure algorithm, same accuracy, a fraction of the compute.

This reframes the entire course as one optimization problem: given a fixed compute and data budget, what is the best model you can build? Maximize efficiency. Every design decision in the next sixteen lectures — tokenization, architecture tweaks, kernels, parallelism, data filtering — is ultimately a move to squeeze more accuracy out of the same resources.

Common misconception: "I should take this class to learn the hottest techniques (RAG, agents, multimodality) or to get great results on my application." The course is blunt that this is the wrong reason. If you want results on an application, prompt or fine-tune an existing model. CS336 is for people with an "obsessive need to understand how things work" and who want to build research-engineering muscle. The payoff is depth, not a product.

The course slogan is "accuracy = efficiency × resources." Why does this make efficiency more important at large scale, not less?

At large scale, resources are infinite, so only efficiency varies. Efficiency only matters for small models; large models rely purely on scale. At large scale you're spending enormous, expensive resources, so a wasteful algorithm burns far more compute — a 2× efficiency gain is worth millions of dollars and weeks of cluster time. Efficiency and resources are the same thing.

Chapter 1: The Whole Pipeline at a Glance

Before we zoom into tokenization, let's see the whole machine, because every later lecture is a chapter of this one diagram. The course frames the entire problem with a concrete prompt: given a Common Crawl dump and 32 H100 GPUs for two weeks, what should you do? Answering that requires five stages, and they map almost one-to-one onto the rest of this lesson series.

The CS336 pipeline — click a stage to see what it owns

Five stages turn raw internet text + GPUs into a useful assistant. Click each box. The stage we build today — tokenization — is the very first transformation your text undergoes.

Click a stage above.

Here is the same map in words, with the assignment each stage corresponds to — this is your roadmap for the lessons that follow:

Stage	What it covers	The efficiency angle
Basics	Tokenization, Transformer architecture, training loop (optimizer, LR schedule).	Get one full pipeline working end-to-end at all.
Systems	GPU kernels (Triton), parallelism across many GPUs, inference (prefill/decode).	Squeeze the most out of the hardware you have.
Scaling laws	Predict large-model loss/hyperparameters from cheap small-model experiments.	Don't waste compute tuning at full scale.
Data	Curate, transform, filter, and deduplicate the training corpus.	Don't waste updates on bad or repeated data.
Alignment	Supervised fine-tuning + learning from feedback (DPO, PPO, GRPO).	A well-aligned model can be smaller for the same usefulness.

Why efficiency drives even tokenization. Working directly with raw bytes is mathematically elegant — no vocabulary to design, nothing language-specific. But it is compute-inefficient with today's architectures, because attention cost grows quadratically with sequence length and bytes make sequences long. So we compress text into fewer, denser tokens. Tokenization exists almost entirely as a concession to efficiency — a theme you'll see again in every stage.

Notice the framing in the last column: every single stage is a war on waste. That's the through-line. Today we're at the very start of "Basics," doing the first thing that ever happens to a piece of text. So let's ask the obvious beginner question and refuse to hand-wave it.

The course says working with raw bytes is "elegant but compute-inefficient." What specifically makes long byte sequences expensive for a Transformer?

Bytes can't represent emoji. Attention cost grows quadratically with sequence length, so longer (byte-level) sequences cost dramatically more compute and memory. Bytes require a larger vocabulary than words. GPUs cannot do integer arithmetic.

Chapter 2: What Exactly Is a Tokenizer?

A neural network does not understand text. It understands numbers — specifically, it consumes sequences of integers and looks each one up in a big table of vectors (the embedding matrix). So before any modeling can happen, we need a deterministic procedure that converts a human string into a list of integers, and another that converts those integers back. That pair of procedures is a tokenizer.

Concretely, a tokenizer is any object implementing two methods:

python
from abc import ABC

class Tokenizer(ABC):
    """Strings <-> sequences of integers (tokens)."""
    def encode(self, string: str) -> list[int]:
        raise NotImplementedError
    def decode(self, indices: list[int]) -> str:
        raise NotImplementedError

Two definitions we will use constantly:

Vocabulary size — the number of distinct integers the tokenizer can emit. The model needs one embedding row and one output-softmax column per vocabulary entry, so vocabulary size directly sizes two big matrices.
Compression ratio — how many raw UTF-8 bytes of the original text each token represents on average:
compression ratio = (number of bytes) ÷ (number of tokens)
Higher is better: it means each token packs in more text, so the sequence the Transformer has to process is shorter and cheaper.

A non-negotiable property: round-tripping. decode(encode(s)) == s for every string s. If you can't perfectly reconstruct the input, you've thrown away information before the model ever ran. Every tokenizer we build will be checked with an assert that the round-trip holds.

Live tokenizer playground — watch text become tokens

Type anything. We segment it with a GPT-2-style rule (each colored chunk is one token) and report the byte count, token count, and compression ratio. Try "hello hello" (note the two "hello"s tokenize differently — one has a leading space). Try a number like "12345". Try an emoji.

bytes: — tokens: — compression: —

You probably noticed three things, and they're exactly the observations the lecture highlights from the real GPT-2 tokenizer:

A word and its preceding space belong to the same token (e.g. " world" including the space).
The same word looks different at the start of text versus in the middle ("hello" vs " hello") — because of that space.
Numbers get chopped into chunks of a few digits, not kept whole.

None of those rules were designed by hand. They emerged from a training procedure run over a giant corpus — which is the whole point of Byte-Pair Encoding, and where we're headed. But to appreciate why BPE is clever, we first have to watch the three obvious approaches fail. Let's start with the most naive one.

A tokenizer turns a 40-byte string into 10 tokens. What is its compression ratio, and is higher or lower better?

0.25; lower is better because it means more tokens. 4.0; lower is better because tokens are simpler. 4.0 (40 bytes ÷ 10 tokens); higher is better — each token packs more text, so the sequence is shorter and cheaper for the Transformer. 10.0; higher is better.

Chapter 3: Attempt #1 — One Token per Character

The most literal idea: a string is a sequence of Unicode characters, so let each character be one token. Python hands us this for free. The function ord(c) maps a character to its integer code point; chr(i) maps back.

python
assert ord("a") == 97
assert ord("🌍") == 127757
assert chr(97) == "a"

class CharacterTokenizer(Tokenizer):
    """Represent a string as a sequence of Unicode code points."""
    def encode(self, string):  # 'str' -> list[int]
        return list(map(ord, string))
    def decode(self, indices): # list[int] -> 'str'
        return "".join(map(chr, indices))

It round-trips perfectly — decode(encode(s)) == s always holds, because ord and chr are exact inverses. So far so good. Let's run it on the lecture's test string and look at the actual numbers.

Characters → code points

Each box is one character with its ord() code point underneath. Notice the wild range: ASCII letters sit near 100, but 🌍 is 127757 and the Chinese characters are in the tens of thousands. Edit the text to explore.

characters (tokens): — UTF-8 bytes: — compression: —

The compression ratio looks decent here — multi-byte characters mean bytes > characters — but two fatal problems sink this approach:

Problem 1: the vocabulary is enormous. There are roughly 150,000 Unicode characters. That's 150,000 embedding rows and 150,000 softmax outputs — a huge, expensive table, most of which you'll barely use.

Problem 2: most of it is wasted. The overwhelming majority of those characters (rare CJK glyphs, obscure symbols, 🌍) appear almost never in real text. You're paying for 150K slots to get value from a few thousand. That's inefficient use of the vocabulary — and remember, efficiency is the whole game.

So pure characters give a bloated, mostly-idle vocabulary. The natural reaction is: "fine, go smaller — what's the smallest possible alphabet?" That pushes us all the way down to raw bytes.

Why is a character-level tokenizer's ~150,000-entry vocabulary considered inefficient?

Because 150,000 is too small to cover all languages. Because characters can't round-trip. The vocabulary sizes two huge matrices (embeddings + output softmax), yet the vast majority of those characters are rare and barely used — you pay for capacity you don't exercise. Because ord and chr are slow.

Chapter 4: Attempt #2 — One Token per Byte

Every Unicode string can be serialized into bytes — integers from 0 to 255 — using an encoding. The dominant one is UTF-8. Common characters (all of ASCII) take a single byte; rarer characters take two, three, or four. For example:

python
assert bytes("a", encoding="utf-8") == b"a"            # 1 byte:  [97]
assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d"  # 4 bytes: [240,159,140,141]

class ByteTokenizer(Tokenizer):
    """Represent a string as a sequence of bytes."""
    def encode(self, string):
        return list(map(int, string.encode("utf-8")))
    def decode(self, indices):
        return bytes(indices).decode("utf-8")

This fixes Problem 1 beautifully: the vocabulary is exactly 256. Tiny, complete, language-agnostic — it can represent any text in any language or script, including emoji, with zero out-of-vocabulary cases. It's the most elegant tokenizer imaginable. So why doesn't everyone use it?

Characters explode into UTF-8 bytes

Top row: the original characters. Bottom: the UTF-8 bytes each one expands into (hex). ASCII letters stay one byte; 🌍 becomes four bytes, the Chinese characters three each. Count the bottom row — that's your token count, and it's much longer than the text.

vocab size: 256 tokens (= bytes): — compression: —

Here's the killer. For a byte tokenizer, the number of tokens equals the number of bytes, by definition. So the compression ratio is:

compression ratio = (number of bytes) ÷ (number of tokens) = (number of bytes) ÷ (number of bytes) = 1.0

A compression ratio of exactly 1 means no compression at all — the worst possible. Sequences are as long as they can be. And recall from Chapter 1: a Transformer's attention cost scales quadratically with sequence length. Doubling the sequence quadruples the attention compute. A long byte sequence is a compute disaster, and the context window (how much text the model can see at once) gets eaten up by individual bytes.

The trade we've discovered. Characters: small sequences but a giant vocabulary. Bytes: a tiny vocabulary but huge sequences. These two failures point in opposite directions, which is the clue. We want the best of both: a modest vocabulary and short sequences. Neither extreme gives us that. Maybe splitting on whole words does?

The byte tokenizer has a perfect vocabulary of 256 and round-trips any text. What's the fatal flaw?

It can't represent emoji. Its compression ratio is exactly 1 — no compression — so sequences are maximally long, and attention cost grows quadratically with length. Its vocabulary is too large. It doesn't round-trip.

Chapter 5: Attempt #3 — One Token per Word

Classical NLP split text into words. A word usually carries a unit of meaning, so this should give short sequences (one token per word) and a vocabulary of "real" linguistic units. The first cut is a simple regular expression:

python
import regex
string = "I'll say supercalifragilisticexpialidocious!"
segments = regex.findall(r"\w+|.", string)  # keep runs of word-chars together
# ['I', "'", 'll', ' ', 'say', ' ', 'supercali...docious', '!']

The actual GPT-2 tokenizer uses a much more careful pattern. It is worth dissecting every piece, because this exact regex reappears later as BPE's "pre-tokenizer":

the GPT-2 pre-tokenization regex
'(?:[sdmt]|ll|ve|re)   # contractions: 's 'd 'm 't 'll 've 're  ("don't", "we'll")
 ?\p{L}+             # an optional space + a run of LETTERS  (" world", "Hello")
 ?\p{N}+             # an optional space + a run of NUMBERS  (" 2025")
 ?[^\s\p{L}\p{N}]+   # optional space + a run of PUNCTUATION/symbols ("!!!", " ---")
\s+(?!\S)         # trailing whitespace at end of a line/string
\s+                 # any other run of whitespace

Read it slowly. \p{L} means "any Unicode letter," \p{N} "any number." The leading ? (an optional space) is why a word carries its preceding space into the same token — the observation we saw in Chapter 2. This is a deliberate, hand-built rule that bakes the "space belongs to the next word" convention right into the segmentation.

Watch the GPT-2 regex carve text into segments

Each chip is one segment the regex produced. Blue = letters, green = digits, orange = punctuation, grey = whitespace. The little dot · marks a leading space captured inside a token. Edit and watch.

segments: —

The compression looks great — one token per word is short. But word tokenization carries three crippling problems:

The vocabulary is huge and open-ended. Real corpora contain millions of distinct word-forms (typos, names, URLs, "supercali...docious"). Like characters, this bloats the embedding/softmax tables.
Rare words are barely learned. A word seen three times in the whole corpus gets a near-useless embedding. Most of the vocabulary is statistical dead weight.
No fixed vocabulary size, and the UNK problem. You can't enumerate all words in advance, so any new word at test time becomes a special UNK ("unknown") token — ugly, lossy, and it corrupts perplexity calculations (you can't even score a sentence containing a novel word properly).

The lesson from three failures. Characters & words both fail on a giant, open vocabulary. Bytes fail on sequence length. The ideal sits in between: start from the safety of bytes (vocabulary of 256, no UNK ever, perfect round-trip) but then merge common byte-runs into single tokens to shorten sequences — growing the vocabulary only where the data says it pays off. That data-driven middle path is Byte-Pair Encoding.

What is the "UNK problem" with word-level tokenization?

Words are too short to be useful tokens. The regex can't handle punctuation. You can't list every possible word in advance, so any word unseen during training becomes a special UNK token — information is lost and perplexity scoring breaks. Words have a compression ratio of exactly 1.

Chapter 6: Byte-Pair Encoding — The Idea

Byte-Pair Encoding (BPE) was invented by Philip Gage in 1994 as a data compression trick, then adapted to NLP for neural machine translation by Sennrich et al. in 2016, and adopted by GPT-2. Its central idea is one sentence:

The BPE idea: Don't hand-design the vocabulary — train it. Start with every byte as its own token, then repeatedly find the most frequent adjacent pair of tokens in the corpus and merge it into a single new token. Common sequences (like "th", then "the") become single tokens; rare sequences stay broken into many. The data decides.

Why is this the best of both worlds? You begin from bytes, so you inherit their virtues — vocabulary starts at 256, every string round-trips, there is never an UNK (worst case, a novel word falls back to its individual bytes). But each merge shortens the sequence for common text, climbing the compression ratio. You stop after a chosen number of merges, which gives you a fixed vocabulary size you control: 256 + (number of merges).

Let's hand-trace it on the lecture's tiny example so the mechanism is concrete. We'll train on the string "the cat in the hat" with 3 merges. To keep the trace readable we'll trace it at the character level (real BPE works on bytes, but for pure ASCII the two are identical — each character is one byte).

Start. Split into individual symbols (each is a token):

t·h·e·⎵·c·a·t·⎵·i·n·⎵·t·h·e·⎵·h·a·t (⎵ = space, 18 tokens)

Merge 1 — count every adjacent pair. The pair (t, h) appears twice ("the cat ... the hat"); (h, e) appears twice; (e, ⎵) appears twice; everything else once. There's a tie at count 2. Ties are broken deterministically (e.g. the first pair encountered / lexicographic) — say we take (t, h). Create new token 256 = "th" and replace:

[256]·e·⎵·c·a·t·⎵·i·n·⎵·[256]·e·⎵·h·a·t (16 tokens)

Merge 2 — recount on the new sequence. Now (256, e) = "th"+"e" appears twice. It wins. Create 257 = "the":

[257]·⎵·c·a·t·⎵·i·n·⎵·[257]·⎵·h·a·t (14 tokens)

Merge 3 — recount again. Now (257, ⎵) = "the"+" " appears twice. Create 258 = "the ":

[258]·c·a·t·⎵·i·n·⎵·[258]·h·a·t (12 tokens)

In three merges we went from 18 tokens to 12 — and we discovered that "the " is a meaningful unit, purely from counting. No linguist told it that "the" is a word. The merges we learned, in order, are the tokenizer's entire trained parameters:

Step	Pair merged	New token id	Bytes it stands for
1	(t, h)	256	`"th"`
2	(256, e)	257	`"the"`
3	(257, ⎵)	258	`"the "`

Two tables are all a BPE tokenizer is. (1) a vocab mapping each token id → the bytes it represents (ids 0–255 are the raw bytes; 256+ are merges), and (2) an ordered list of merges mapping a pair of ids → the new id. To encode new text, you re-apply the merges in the same order. To decode, you look up each id's bytes and concatenate. That's the whole tokenizer — two dictionaries.

That's the algorithm by hand. Now let's make it live: train BPE on a corpus of your choosing and watch the sequence shrink merge by merge.

In BPE training, after you merge the most frequent pair, what must you do before choosing the next merge?

Restart from the original bytes. Re-count adjacent pairs on the new sequence (which now contains the merged token), because merging changes which pairs are adjacent. Remove the merged token from the vocabulary. Pick the least frequent pair next, for balance.

Chapter 7: Showcase — Train BPE One Merge at a Time

This is the payoff. Pick a corpus, then step through training. At each step you'll see the current token sequence (chips), the top adjacent pairs by count (the winner highlighted), and the live vocabulary size, sequence length, and compression ratio. Slide to any number of merges, or step one at a time to watch a single pair collapse.

Interactive BPE trainer

merges applied0

vocab size: 256 sequence length: — compression: —

Things to try, and what they teach:

Use the "repetitive text" preset. Highly repetitive text compresses fast — a few merges swallow whole phrases, and the compression ratio shoots up. This is exactly why deduplicated, structured data tokenizes efficiently.
Watch the early merges. They're almost always frequent short pieces — " t", "he", "in", "th" — the building blocks of common words. BPE rediscovers morphology from raw counts.
Push merges high. The compression-ratio curve rises quickly then flattens: the first merges grab the biggest wins (most frequent pairs), and each later merge buys less. Diminishing returns — which is why real vocabularies stop in the tens of thousands, not millions.

What you're really watching. The "training" of a tokenizer is nothing but greedy frequency counting — there is no neural network, no gradient, no loss here. It's a heuristic over corpus statistics. That's both its charm (dead simple, fast, deterministic) and its limit (greedy: it never reconsiders an early merge, so the vocabulary it finds isn't provably optimal — just good enough, reliably).

Check yourself (no quiz — the sim is the test): Set merges to 0 and read the sequence length — that's the raw byte count, compression 1.0. Now step until the ratio passes 1.5. Can you name the first three merges and say why each was chosen? If yes, you understand BPE training.

Chapter 8: Every Line of the Code

Now we implement the whole thing, exactly as the lecture does, so you could reproduce it from memory. Three pieces: a merge helper, a train_bpe function, and the BPETokenizer that uses the trained tables.

Piece 1 — the `merge` helper

Replace every occurrence of an adjacent pair with a single new index. This is the one primitive used by both training and encoding.

python
def merge(indices, pair, new_index):
    """Return `indices` with every adjacent `pair` replaced by `new_index`."""
    new_indices = []
    i = 0
    while i < len(indices):
        # if the pair starts here, emit the merged token and skip 2
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i+1] == pair[1]:
            new_indices.append(new_index)
            i += 2
        else:                       # otherwise copy this token and step 1
            new_indices.append(indices[i])
            i += 1
    return new_indices

Data flow: in → a list[int] of length L; out → a list[int] of length L minus (number of merges performed). The while loop walks left to right; the i += 2 is what consumes both halves of a matched pair so they're not double-counted.

Piece 2 — `train_bpe`

The trainer from Chapters 6–7, in full. It returns the two tables: vocab (id → bytes) and merges (pair → id).

python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class BPETokenizerParams:
    vocab:  dict[int, bytes]               # id -> bytes
    merges: dict[tuple[int,int], int]    # (id1,id2) -> new id

def train_bpe(string, num_merges):
    # 1. start: each byte is a token; vocab has the 256 raw bytes
    indices = list(map(int, string.encode("utf-8")))
    merges = {}
    vocab  = {x: bytes([x]) for x in range(256)}

    for i in range(num_merges):
        # 2. count every adjacent pair
        counts = defaultdict(int)
        for a, b in zip(indices, indices[1:]):
            counts[(a, b)] += 1

        # 3. pick the most frequent pair
        pair = max(counts, key=counts.get)
        a, b = pair

        # 4. mint a new id, record it in both tables, rewrite the sequence
        new_index = 256 + i
        merges[pair]      = new_index
        vocab[new_index]  = vocab[a] + vocab[b]
        indices = merge(indices, pair, new_index)

    return BPETokenizerParams(vocab=vocab, merges=merges)

Piece 3 — the `BPETokenizer`

Encoding re-applies the learned merges in order; decoding looks up bytes and concatenates.

python
class BPETokenizer(Tokenizer):
    def __init__(self, params): self.params = params

    def encode(self, string):
        indices = list(map(int, string.encode("utf-8")))
        # NOTE: slow — re-applies EVERY merge over the whole sequence
        for pair, new_index in self.params.merges.items():
            indices = merge(indices, pair, new_index)
        return indices

    def decode(self, indices):
        bytes_list = list(map(self.params.vocab.get, indices))
        return b"".join(bytes_list).decode("utf-8")

def get_compression_ratio(string, indices):
    num_bytes  = len(bytes(string, encoding="utf-8"))
    num_tokens = len(indices)
    return num_bytes / num_tokens

Why the encode is "very slow" (the lecture flags this). It loops over every merge and scans the entire sequence each time — O(merges × length). On a real vocabulary of 50,000 merges that's catastrophic. In Assignment 1 you fix exactly this, plus three more gaps:

Naive version here	Assignment 1 upgrade
Loops over all 50k merges every encode.	Only apply merges that can fire on this text (priority by merge rank).
No special tokens.	Detect & preserve tokens like `<\|endoftext\|>` as atomic units.
Merges can cross word boundaries.	Pre-tokenize with the GPT-2 regex first, then run BPE within each segment (so "the" never merges across a space-word boundary it shouldn't).
Pure-Python, single-threaded.	Optimize for speed (the corpus is huge).

And in practice you don't reimplement this for production — you'd reach for OpenAI's tiktoken:

python
import tiktoken
enc = tiktoken.get_encoding("gpt2")        # or "cl100k_base" for GPT-3.5/4
ids = enc.encode("Hello, 🌍! 你好!")        # -> [15496, 11, 995, ...]
assert enc.decode(ids) == "Hello, 🌍! 你好!"   # round-trips

Why does Assignment 1 add a pre-tokenization step (running the GPT-2 regex) before BPE merges?

To make the vocabulary larger. Because BPE can't handle Unicode. So merges happen within word-like segments, preventing tokens from greedily merging across boundaries (e.g. swallowing spaces in odd ways) and matching how GPT-2 actually tokenizes. To remove the need for a decode method.

Chapter 9: Connections & Cheat Sheet

We motivated the whole course (understanding via building; efficiency = the game), mapped its five stages, and then built the first gear from zero: four tokenizers, three failures, and BPE as the data-driven compromise. Here is the comparison that summarizes the whole journey.

Tokenizer	Vocab size	Compression	Verdict
Character	~150,000	good	Vocab too big, mostly wasted.
Byte	256	1.0 (none)	Sequences far too long (quadratic attention).
Word	millions (open)	great	Open vocab + UNK problem.
BPE	256 + #merges (you choose)	tunable, ~3–5×	Best of both; the standard.

Cheat sheet.

Tokenizer = (encode: str→list[int], decode: list[int]→str), must round-trip.
Compression ratio = bytes / tokens. Higher = shorter sequences = cheaper. Bytes give exactly 1.
BPE train: start from 256 bytes → repeat {count adjacent pairs, merge the most frequent into a new id} for k merges → vocab size = 256 + k.
BPE tables: vocab (id→bytes) + ordered merges (pair→id). Encode = re-apply merges in order; decode = concat bytes.
Why BPE wins: bytes' safety (no UNK, round-trips, vocab=256 floor) + words' brevity (merge common runs), with a vocab size you control.
GPT-2 pre-tokenization regex keeps a word's leading space; numbers split into digit-chunks; run BPE inside each segment.

The tokenizer-free future. The lecture is candid: tokenization is "a necessary evil." Several lines of work try to drop it and feed raw bytes to the model — ByT5, MegaByte, the Byte Latent Transformer (BLT), and token-free methods (TFree). They're promising and elegant (no vocabulary to design, no quirks) but have not yet been scaled to the frontier, because of exactly the sequence-length problem we hit in Chapter 4. One day, maybe, we'll just do it from bytes.

Where to go next on this site:

Next CS336 lecture: PyTorch & Resource Accounting — once text is tokens, how many FLOPs and bytes of memory does training actually cost?
What consumes the tokens: The Transformer and GPT.
The quadratic-attention pressure that makes compression matter: Attention Variants.
Why these design choices are bets, not proofs: Bias, Variance & Double Descent.

In one sentence: why is BPE described as "the best of both worlds" between byte and word tokenization?

Because it uses a neural network to learn the vocabulary. Because it has the largest possible vocabulary. Because it never compresses, guaranteeing round-trips. It starts from bytes (small vocab floor, no UNK, always round-trips) and merges common runs into single tokens (short sequences like words), with a vocabulary size you control.

One line to remember the whole lesson: a language model never sees text — it sees integers, and BPE is the trained, byte-grounded dictionary that turns one into the other as compactly as the data allows.

The Whole Machine & Its First Gear: Tokenization