Why anyone would build a language model from scratch when you can just call an API — and then the very first thing that happens to your text before a neural network ever sees it. We build four tokenizers, watch three of them fail, and grow the fourth (BPE) one merge at a time.
You can get a world-class language model to write you a sonnet in three seconds by typing into a box. So why would a sane person spend a quarter implementing a tokenizer, a Transformer, an optimizer, and a training loop by hand? The honest answer is the same reason a chef learns to break down a chicken even though the supermarket sells breasts in plastic: the abstractions are leaky, and the interesting work lives underneath them.
Here is the trend the course opens with. Watch how researchers have related to the technology over time:
Each step up this ladder buys productivity — you get more done per hour. But it costs understanding, and unlike the abstractions in a programming language or an operating system (which almost never leak), these ones leak constantly. When your fine-tune mysteriously degrades, or your prompt works at 7B and breaks at 70B, the explanation lives several rungs below where you're standing.
Slide through the three eras. The productivity bar rises as you climb; the stack you actually understand shrinks. CS336 deliberately drops you to the bottom rung — you will rebuild the whole stack.
The course's thesis is a single phrase: understanding via building. You cannot reproduce a frontier model in a class — GPT-4 reportedly has ~1.8 trillion parameters and cost ~$100M to train; xAI built a cluster of 200,000 H100s; the Stargate project commits $500B over four years. Those numbers are out of reach. But you can rebuild the machinery at small scale, and three kinds of knowledge come out of that exercise:
That last caveat is important and honest: more is different. A model under 1B parameters (what we train in this class) is not a tiny photograph of a 1T model. The fraction of compute spent in attention versus the MLP shifts with scale; some capabilities emerge only past a size threshold and are simply absent below it. So mechanics and mindset transfer cleanly; intuitions you must hold loosely.
This reframes the entire course as one optimization problem: given a fixed compute and data budget, what is the best model you can build? Maximize efficiency. Every design decision in the next sixteen lectures — tokenization, architecture tweaks, kernels, parallelism, data filtering — is ultimately a move to squeeze more accuracy out of the same resources.
Before we zoom into tokenization, let's see the whole machine, because every later lecture is a chapter of this one diagram. The course frames the entire problem with a concrete prompt: given a Common Crawl dump and 32 H100 GPUs for two weeks, what should you do? Answering that requires five stages, and they map almost one-to-one onto the rest of this lesson series.
Five stages turn raw internet text + GPUs into a useful assistant. Click each box. The stage we build today — tokenization — is the very first transformation your text undergoes.
Here is the same map in words, with the assignment each stage corresponds to — this is your roadmap for the lessons that follow:
| Stage | What it covers | The efficiency angle |
|---|---|---|
| Basics | Tokenization, Transformer architecture, training loop (optimizer, LR schedule). | Get one full pipeline working end-to-end at all. |
| Systems | GPU kernels (Triton), parallelism across many GPUs, inference (prefill/decode). | Squeeze the most out of the hardware you have. |
| Scaling laws | Predict large-model loss/hyperparameters from cheap small-model experiments. | Don't waste compute tuning at full scale. |
| Data | Curate, transform, filter, and deduplicate the training corpus. | Don't waste updates on bad or repeated data. |
| Alignment | Supervised fine-tuning + learning from feedback (DPO, PPO, GRPO). | A well-aligned model can be smaller for the same usefulness. |
Notice the framing in the last column: every single stage is a war on waste. That's the through-line. Today we're at the very start of "Basics," doing the first thing that ever happens to a piece of text. So let's ask the obvious beginner question and refuse to hand-wave it.
A neural network does not understand text. It understands numbers — specifically, it consumes sequences of integers and looks each one up in a big table of vectors (the embedding matrix). So before any modeling can happen, we need a deterministic procedure that converts a human string into a list of integers, and another that converts those integers back. That pair of procedures is a tokenizer.
Concretely, a tokenizer is any object implementing two methods:
python from abc import ABC class Tokenizer(ABC): """Strings <-> sequences of integers (tokens).""" def encode(self, string: str) -> list[int]: raise NotImplementedError def decode(self, indices: list[int]) -> str: raise NotImplementedError
Two definitions we will use constantly:
decode(encode(s)) == s for every string s. If you can't perfectly reconstruct the input, you've thrown away information before the model ever ran. Every tokenizer we build will be checked with an assert that the round-trip holds.Type anything. We segment it with a GPT-2-style rule (each colored chunk is one token) and report the byte count, token count, and compression ratio. Try "hello hello" (note the two "hello"s tokenize differently — one has a leading space). Try a number like "12345". Try an emoji.
You probably noticed three things, and they're exactly the observations the lecture highlights from the real GPT-2 tokenizer:
" world" including the space)."hello" vs " hello") — because of that space.None of those rules were designed by hand. They emerged from a training procedure run over a giant corpus — which is the whole point of Byte-Pair Encoding, and where we're headed. But to appreciate why BPE is clever, we first have to watch the three obvious approaches fail. Let's start with the most naive one.
The most literal idea: a string is a sequence of Unicode characters, so let each character be one token. Python hands us this for free. The function ord(c) maps a character to its integer code point; chr(i) maps back.
python assert ord("a") == 97 assert ord("🌍") == 127757 assert chr(97) == "a" class CharacterTokenizer(Tokenizer): """Represent a string as a sequence of Unicode code points.""" def encode(self, string): # 'str' -> list[int] return list(map(ord, string)) def decode(self, indices): # list[int] -> 'str' return "".join(map(chr, indices))
It round-trips perfectly — decode(encode(s)) == s always holds, because ord and chr are exact inverses. So far so good. Let's run it on the lecture's test string and look at the actual numbers.
Each box is one character with its ord() code point underneath. Notice the wild range: ASCII letters sit near 100, but 🌍 is 127757 and the Chinese characters are in the tens of thousands. Edit the text to explore.
The compression ratio looks decent here — multi-byte characters mean bytes > characters — but two fatal problems sink this approach:
So pure characters give a bloated, mostly-idle vocabulary. The natural reaction is: "fine, go smaller — what's the smallest possible alphabet?" That pushes us all the way down to raw bytes.
Every Unicode string can be serialized into bytes — integers from 0 to 255 — using an encoding. The dominant one is UTF-8. Common characters (all of ASCII) take a single byte; rarer characters take two, three, or four. For example:
python assert bytes("a", encoding="utf-8") == b"a" # 1 byte: [97] assert bytes("🌍", encoding="utf-8") == b"\xf0\x9f\x8c\x8d" # 4 bytes: [240,159,140,141] class ByteTokenizer(Tokenizer): """Represent a string as a sequence of bytes.""" def encode(self, string): return list(map(int, string.encode("utf-8"))) def decode(self, indices): return bytes(indices).decode("utf-8")
This fixes Problem 1 beautifully: the vocabulary is exactly 256. Tiny, complete, language-agnostic — it can represent any text in any language or script, including emoji, with zero out-of-vocabulary cases. It's the most elegant tokenizer imaginable. So why doesn't everyone use it?
Top row: the original characters. Bottom: the UTF-8 bytes each one expands into (hex). ASCII letters stay one byte; 🌍 becomes four bytes, the Chinese characters three each. Count the bottom row — that's your token count, and it's much longer than the text.
Here's the killer. For a byte tokenizer, the number of tokens equals the number of bytes, by definition. So the compression ratio is:
A compression ratio of exactly 1 means no compression at all — the worst possible. Sequences are as long as they can be. And recall from Chapter 1: a Transformer's attention cost scales quadratically with sequence length. Doubling the sequence quadruples the attention compute. A long byte sequence is a compute disaster, and the context window (how much text the model can see at once) gets eaten up by individual bytes.
Classical NLP split text into words. A word usually carries a unit of meaning, so this should give short sequences (one token per word) and a vocabulary of "real" linguistic units. The first cut is a simple regular expression:
python import regex string = "I'll say supercalifragilisticexpialidocious!" segments = regex.findall(r"\w+|.", string) # keep runs of word-chars together # ['I', "'", 'll', ' ', 'say', ' ', 'supercali...docious', '!']
The actual GPT-2 tokenizer uses a much more careful pattern. It is worth dissecting every piece, because this exact regex reappears later as BPE's "pre-tokenizer":
the GPT-2 pre-tokenization regex '(?:[sdmt]|ll|ve|re) # contractions: 's 'd 'm 't 'll 've 're ("don't", "we'll") ?\p{L}+ # an optional space + a run of LETTERS (" world", "Hello") ?\p{N}+ # an optional space + a run of NUMBERS (" 2025") ?[^\s\p{L}\p{N}]+ # optional space + a run of PUNCTUATION/symbols ("!!!", " ---") \s+(?!\S) # trailing whitespace at end of a line/string \s+ # any other run of whitespace
Read it slowly. \p{L} means "any Unicode letter," \p{N} "any number." The leading ? (an optional space) is why a word carries its preceding space into the same token — the observation we saw in Chapter 2. This is a deliberate, hand-built rule that bakes the "space belongs to the next word" convention right into the segmentation.
Each chip is one segment the regex produced. Blue = letters, green = digits, orange = punctuation, grey = whitespace. The little dot · marks a leading space captured inside a token. Edit and watch.
The compression looks great — one token per word is short. But word tokenization carries three crippling problems:
Byte-Pair Encoding (BPE) was invented by Philip Gage in 1994 as a data compression trick, then adapted to NLP for neural machine translation by Sennrich et al. in 2016, and adopted by GPT-2. Its central idea is one sentence:
"th", then "the") become single tokens; rare sequences stay broken into many. The data decides.Why is this the best of both worlds? You begin from bytes, so you inherit their virtues — vocabulary starts at 256, every string round-trips, there is never an UNK (worst case, a novel word falls back to its individual bytes). But each merge shortens the sequence for common text, climbing the compression ratio. You stop after a chosen number of merges, which gives you a fixed vocabulary size you control: 256 + (number of merges).
Let's hand-trace it on the lecture's tiny example so the mechanism is concrete. We'll train on the string "the cat in the hat" with 3 merges. To keep the trace readable we'll trace it at the character level (real BPE works on bytes, but for pure ASCII the two are identical — each character is one byte).
Start. Split into individual symbols (each is a token):
Merge 1 — count every adjacent pair. The pair (t, h) appears twice ("the cat ... the hat"); (h, e) appears twice; (e, ⎵) appears twice; everything else once. There's a tie at count 2. Ties are broken deterministically (e.g. the first pair encountered / lexicographic) — say we take (t, h). Create new token 256 = "th" and replace:
Merge 2 — recount on the new sequence. Now (256, e) = "th"+"e" appears twice. It wins. Create 257 = "the":
Merge 3 — recount again. Now (257, ⎵) = "the"+" " appears twice. Create 258 = "the ":
In three merges we went from 18 tokens to 12 — and we discovered that "the " is a meaningful unit, purely from counting. No linguist told it that "the" is a word. The merges we learned, in order, are the tokenizer's entire trained parameters:
| Step | Pair merged | New token id | Bytes it stands for |
|---|---|---|---|
| 1 | (t, h) | 256 | "th" |
| 2 | (256, e) | 257 | "the" |
| 3 | (257, ⎵) | 258 | "the " |
That's the algorithm by hand. Now let's make it live: train BPE on a corpus of your choosing and watch the sequence shrink merge by merge.
This is the payoff. Pick a corpus, then step through training. At each step you'll see the current token sequence (chips), the top adjacent pairs by count (the winner highlighted), and the live vocabulary size, sequence length, and compression ratio. Slide to any number of merges, or step one at a time to watch a single pair collapse.
Things to try, and what they teach:
" t", "he", "in", "th" — the building blocks of common words. BPE rediscovers morphology from raw counts.Now we implement the whole thing, exactly as the lecture does, so you could reproduce it from memory. Three pieces: a merge helper, a train_bpe function, and the BPETokenizer that uses the trained tables.
merge helperReplace every occurrence of an adjacent pair with a single new index. This is the one primitive used by both training and encoding.
python def merge(indices, pair, new_index): """Return `indices` with every adjacent `pair` replaced by `new_index`.""" new_indices = [] i = 0 while i < len(indices): # if the pair starts here, emit the merged token and skip 2 if i + 1 < len(indices) and indices[i] == pair[0] and indices[i+1] == pair[1]: new_indices.append(new_index) i += 2 else: # otherwise copy this token and step 1 new_indices.append(indices[i]) i += 1 return new_indices
Data flow: in → a list[int] of length L; out → a list[int] of length L minus (number of merges performed). The while loop walks left to right; the i += 2 is what consumes both halves of a matched pair so they're not double-counted.
train_bpeThe trainer from Chapters 6–7, in full. It returns the two tables: vocab (id → bytes) and merges (pair → id).
python from collections import defaultdict from dataclasses import dataclass @dataclass(frozen=True) class BPETokenizerParams: vocab: dict[int, bytes] # id -> bytes merges: dict[tuple[int,int], int] # (id1,id2) -> new id def train_bpe(string, num_merges): # 1. start: each byte is a token; vocab has the 256 raw bytes indices = list(map(int, string.encode("utf-8"))) merges = {} vocab = {x: bytes([x]) for x in range(256)} for i in range(num_merges): # 2. count every adjacent pair counts = defaultdict(int) for a, b in zip(indices, indices[1:]): counts[(a, b)] += 1 # 3. pick the most frequent pair pair = max(counts, key=counts.get) a, b = pair # 4. mint a new id, record it in both tables, rewrite the sequence new_index = 256 + i merges[pair] = new_index vocab[new_index] = vocab[a] + vocab[b] indices = merge(indices, pair, new_index) return BPETokenizerParams(vocab=vocab, merges=merges)
BPETokenizerEncoding re-applies the learned merges in order; decoding looks up bytes and concatenates.
python class BPETokenizer(Tokenizer): def __init__(self, params): self.params = params def encode(self, string): indices = list(map(int, string.encode("utf-8"))) # NOTE: slow — re-applies EVERY merge over the whole sequence for pair, new_index in self.params.merges.items(): indices = merge(indices, pair, new_index) return indices def decode(self, indices): bytes_list = list(map(self.params.vocab.get, indices)) return b"".join(bytes_list).decode("utf-8") def get_compression_ratio(string, indices): num_bytes = len(bytes(string, encoding="utf-8")) num_tokens = len(indices) return num_bytes / num_tokens
| Naive version here | Assignment 1 upgrade |
|---|---|
| Loops over all 50k merges every encode. | Only apply merges that can fire on this text (priority by merge rank). |
| No special tokens. | Detect & preserve tokens like <|endoftext|> as atomic units. |
| Merges can cross word boundaries. | Pre-tokenize with the GPT-2 regex first, then run BPE within each segment (so "the" never merges across a space-word boundary it shouldn't). |
| Pure-Python, single-threaded. | Optimize for speed (the corpus is huge). |
And in practice you don't reimplement this for production — you'd reach for OpenAI's tiktoken:
python import tiktoken enc = tiktoken.get_encoding("gpt2") # or "cl100k_base" for GPT-3.5/4 ids = enc.encode("Hello, 🌍! 你好!") # -> [15496, 11, 995, ...] assert enc.decode(ids) == "Hello, 🌍! 你好!" # round-trips
We motivated the whole course (understanding via building; efficiency = the game), mapped its five stages, and then built the first gear from zero: four tokenizers, three failures, and BPE as the data-driven compromise. Here is the comparison that summarizes the whole journey.
| Tokenizer | Vocab size | Compression | Verdict |
|---|---|---|---|
| Character | ~150,000 | good | Vocab too big, mostly wasted. |
| Byte | 256 | 1.0 (none) | Sequences far too long (quadratic attention). |
| Word | millions (open) | great | Open vocab + UNK problem. |
| BPE | 256 + #merges (you choose) | tunable, ~3–5× | Best of both; the standard. |
vocab (id→bytes) + ordered merges (pair→id). Encode = re-apply merges in order; decode = concat bytes.The tokenizer-free future. The lecture is candid: tokenization is "a necessary evil." Several lines of work try to drop it and feed raw bytes to the model — ByT5, MegaByte, the Byte Latent Transformer (BLT), and token-free methods (TFree). They're promising and elegant (no vocabulary to design, no quirks) but have not yet been scaled to the frontier, because of exactly the sequence-length problem we hit in Chapter 4. One day, maybe, we'll just do it from bytes.
Where to go next on this site: