You have 1.4 trillion tokens from Common Crawl. Most of it is spam, bot-generated drivel, and near-identical copies of the same page. To get a model that actually learns from the data, you need to keep the signal and throw out the noise — at web scale, in linear time. This lesson derives every algorithm: n-gram perplexity (KenLM), fastText classifiers, importance resampling (DSIR), Bloom filters, MinHash, and Locality-Sensitive Hashing. You will compute Jaccard similarity by hand, derive the MinHash collision probability, trace the LSH S-curve math, and size a Bloom filter from first principles. By the end, you can build the dedup+filter stack that powers FineWeb, Dolma, and DCLM.
You have just downloaded a Common Crawl snapshot. One and a half trillion tokens of raw text. Enough to fill a million novels. You fire up a tokenizer and start reading page one.
It is a product description for a gas mask. The same sentence — "by combining fantastic ideas, interesting arrangements, and follow the current trends in the field" — repeated 61,036 times. Not 61,036 pages. 61,036 repetitions on what is technically a single document in C4. Your model will memorize that sentence. Every gradient step will slightly tilt the weights toward regurgitating it on demand.
Page two: a navigation menu followed by a cookie banner, followed by the same navigation menu in French, followed by a sitemap in XML, followed by the same article published again with a slightly different URL. Page three: a Terms of Service document. The MIT license. Another Terms of Service. Another MIT license. These are exact duplicates but they are everywhere, so your model will learn to produce legalese on demand whether you want it to or not.
This lecture is about two cleaning stages that happen after HTML extraction (Lecture 13) and before your model sees a single token:
Together they explain why FineWeb starts with 95 CC snapshots (~90T raw tokens) but ships 15T tokens — and why DCLM starts with 240T and ships 3.8T. That 1–16% survival rate is not waste. It is signal extraction.
Drag the sliders to see how much survives. Compare to real pipelines: C4 retains ~11%, FineWeb ~17%, DCLM ~1.6%. The exact numbers depend on how aggressively each stage filters.
Every filtering algorithm answers the same question: given some target data T (text you want your model to sound like) and a huge pile of raw data R (everything from Common Crawl), find a subset of R that resembles T.
That framing immediately tells you what you need: (1) a way to measure resemblance, (2) a way to do it fast enough to process trillions of tokens. Three families of algorithms solve this, each estimating the resemblance differently.
A language model perplexity measures how surprised the model is by a document. Low perplexity means the text looks like what the model was trained on. High perplexity means the text is unusual — either genuinely weird, or high-quality niche text.
Train a fast n-gram model (specifically a trigram with Kneser-Ney smoothing) on Wikipedia or another quality target. Then score every document in Common Crawl. Documents that score low perplexity look like Wikipedia. Documents that score high look different.
For a trigram model: p("cat" | "the fat") = count("the fat cat") / count("the fat"). The problem is data sparsity — most trigrams never appear, so the count is zero, which sends the log to −∞. Kneser-Ney smoothing solves this by redistributing probability mass to lower-order n-grams based on how many distinct contexts each word appears in (not just raw counts).
python import kenlm, math model = kenlm.Model("en.arpa.bin") # trained on Wikipedia def perplexity(text): text = "<s> " + text + " </s>" score = model.score(text) # sum of log10 p(w_i | context) n = len(list(model.full_scores(text))) return math.exp(-score * math.log(10) / n) perplexity("Stanford University was founded in 1885") # ~100 (reasonable) perplexity("asdf asdf asdf asdf asdf") # ~50000 (garbage) perplexity("the the the the the the the the") # ~2 (repetitive but low ppl!)
CCNet used this approach: train KenLM on Wikipedia, sort all CC paragraphs by perplexity, keep the top third (lowest perplexity). LLaMA adopted CCNet's filtered output directly.
Instead of measuring perplexity, train a binary classifier: positive = quality documents (Wikipedia pages), negative = random CC pages. Then score = P(quality | document).
fastText is a deliberately simple architecture: embed each token, average the embeddings, apply a linear head. With the hashing trick for n-grams, the whole model fits in a few hundred MB and runs at millions of documents per second.
For quality filtering K=2 (good/bad), so this reduces to a linear classifier over averaged word embeddings. Hash the bigrams to avoid a runaway vocabulary: hash("the cat") % 10_000_000 gives an index into the embedding table. Collision rate is negligible at 10M bins.
python import mmh3 # MurmurHash3 — fast, not cryptographic bigrams = ["the cat", "cat in", "in the", "the hat"] hashed = [mmh3.hash(bg) % 10_000_000 for bg in bigrams] # [3142891, 7601234, 1892345, 9012456] (example values) # These index into the 10M-row embedding table
Data Selection via Importance Resampling is the most principled approach. Fit a distribution p to target data T and a distribution q to raw data R. The importance weight for document x is p(x)/q(x). Documents that look much more like T than R get high weights; documents that look equally likely under both get weight ≈ 1; documents rare in T but common in R get low weights.
The trick: rather than fitting a neural LM (expensive), fit hashed unigram distributions. Hash each word to one of 10K bins, estimate p and q as histograms. Fast to fit, fast to score. DSIR slightly outperforms fastText on GLUE benchmarks despite similar cost — because it explicitly captures distributional diversity rather than just a binary quality signal.
Each dot is a (corpus quality score, relative compute cost) pair. Drag the "quality target" slider to see which algorithms meet the bar. KenLM is cheapest; BERT-based classifiers are highest quality but 100× slower.
The same algorithmic toolkit — KenLM, fastText, DSIR — gets applied to different filtering tasks depending on what you want to remove. Three major applications: language identification, quality filtering, and toxicity filtering.
If you're training an English-first model, most of Common Crawl is noise. The web is multilingual: roughly 60% English, 10% German/French/Russian combined, and the remaining 30% spread across hundreds of languages. A language ID classifier runs first, before any quality filter.
The standard tool is the fastText lid.176.bin model: a fastText classifier trained on Wikipedia, Tatoeba, and SETimes data. It supports 176 languages and runs at millions of documents per second. It outputs a probability per language.
A practical decision: Dolma sets p(English) ≥ 0.5, which is lenient. RefinedWeb sets 0.65, which is stricter. The tradeoff: higher threshold = cleaner English corpus but more false-negative loss of genuinely multilingual quality content (e.g., a page that mixes English and citations in another language).
Quality filtering is the most consequential decision in the pipeline. There is no universal "quality" label — it is always defined relative to a target distribution.
| Pipeline | Method | Positives (target) | Key threshold |
|---|---|---|---|
| GPT-3 | Linear classifier on word features | WebText2, Wikipedia, Books | Stochastic: keep with prob ∝ score^9 |
| LLaMA/RedPajama | fastText classifier | Pages linked by Wikipedia | Hard threshold on classifier score |
| phi-1 | Random forest on codegen embeddings | GPT-4 rated "educational" code | Binary keep/drop |
| DCLM | fastText on OpenHermes positives | High-quality instruction data | Top-k% by score |
| FineWeb | Heuristic rules (Gopher rules) | N/A — rule-based | Multiple rule thresholds |
The GPT-3 stochastic keep rule is elegant: rather than a hard threshold at score 0.7 (which abruptly discards documents just below the line), it keeps document x with probability proportional to score(x)^9. Documents near zero are almost never kept; near one are almost always kept; but there is no hard cliff. This smoothness avoids the "threshold gambling" effect where tiny changes in the threshold cut 5% of the corpus.
python import numpy as np def keep_gpt3(score: float) -> bool: # Pareto(alpha=9) trick: keep if U > 1 - score # Equivalent to: keep with prob proportional to score^9 return np.random.pareto(9) > 1 - score # score=0.9: keep ~90% of the time # score=0.5: keep ~50% but concentrated near high end # score=0.1: keep very rarely
Many pipelines skip the classifier entirely and use deterministic rules. The Gopher rules (from DeepMind's paper) check properties like: fraction of words that are alphabetic must exceed 80%; mean word length between 3 and 10 characters; no more than 10% lines ending in ellipsis (a spam signal); document must be at least 50 words. FineWeb runs these rules before anything else.
python def passes_gopher(text: str) -> bool: words = text.split() if len(words) < 50: return False # too short alpha_frac = sum(1 for w in words if w.isalpha()) / len(words) if alpha_frac < 0.8: return False # too many numbers/symbols avg_word_len = sum(len(w) for w in words) / len(words) if avg_word_len < 3 or avg_word_len > 10: return False # garbled or concatenated lines = text.split("\n") ellipsis_frac = sum(1 for l in lines if l.endswith("...")) / max(1, len(lines)) if ellipsis_frac > 0.1: return False # spam signal return True
Deduplication is fundamentally a comparison problem: is document A the same as document B? Checking every pair is O(N²) — with N=100 billion documents, that means 10²² comparisons. You need an O(N) algorithm. Hash functions are the foundation.
A hash function h maps any input (a string of arbitrary length) to a fixed-size integer output (the hash value). The key properties:
There is a fundamental tradeoff: cryptographic hash functions (SHA-256, SHA-512) are collision-resistant by design — finding two inputs with the same hash requires astronomical compute. But they are slow: ~150 MB/s. Non-cryptographic hash functions (MurmurHash3, CityHash, xxHash) have no collision-resistance guarantees but run at 10–50 GB/s — 100× faster. For deduplication, you want speed, not cryptographic guarantees. A rare accidental collision (two different documents hashing to the same value) is not catastrophic.
python import mmh3 # MurmurHash3 h1 = mmh3.hash("hello") # → some 32-bit int, e.g. -1151537390 h2 = mmh3.hash("hello") # → same value (deterministic) h3 = mmh3.hash("hellp") # → completely different (avalanche) # With seed: h(x, seed=1) gives a different but still deterministic hash # This is how we simulate multiple independent hash functions h4 = mmh3.hash("hello", seed=0) # "first" hash function h5 = mmh3.hash("hello", seed=1) # "second" hash function
Exact deduplication using hashes is straightforward. Hash every document, group by hash, keep one from each group. In a MapReduce framework: Map phase emits (hash, document) pairs; Reduce phase keeps one per hash group.
python import itertools, mmh3 items = ["Hello!", "hello", "hello there", "hello", "hi", "bye"] # Sort by hash, then groupby hash, keep first from each group by_hash = itertools.groupby(sorted(items, key=mmh3.hash), key=mmh3.hash) deduped = [next(group) for h, group in by_hash] # → ["Hello!", "hello there", "hi", "bye"] (one "hello" kept)
C4's 3-sentence span deduplication is a variant: instead of hashing full documents, hash every 3-sentence span. Any span appearing more than once is removed from its document. The upside: catches boilerplate in the middle of otherwise-good pages. The downside: removing a span from the middle can make the surrounding document incoherent.
You want to test whether a document has been seen before. A simple hash table stores the actual hash (64 bits per document). With 10 billion documents that is 80 GB just for the index. Can you do better?
A Bloom filter is a probabilistic data structure that answers "is this in the set?" using far less memory. The price: it can say "yes" when the answer is actually "no" (a false positive). But it never says "no" when the answer is "yes" — there are no false negatives.
Allocate an array of m bits, all initialized to 0. To insert item x: compute k independent hash functions h₁(x), h₂(x), ..., hₖ(x), each returning a value in [0, m). Set those k bit positions to 1. To query item x: check those same k positions. If all are 1, return "probably yes." If any is 0, return "definitely no."
python from bitarray import bitarray import mmh3 def build_bloom(items, m, k): table = bitarray(m) table.setall(0) for item in items: for seed in range(k): h = mmh3.hash(item, seed) % m table[h] = 1 return table def query_bloom(table, item, m, k): return all(table[mmh3.hash(item, s) % m] for s in range(k)) items = ["the", "cat", "in", "the", "hat"] table = build_bloom(items, m=1000, k=7) query_bloom(table, "the", 1000, 7) # → True (definitely in set) query_bloom(table, "dog", 1000, 7) # → False or True (True = false positive)
Assume the k hash functions are independent and each maps uniformly to [0, m). After inserting n items:
A false positive requires ALL k positions checked for a test item to be set. These positions are also independent (k different hash functions), so:
Worked example. m = 1000 bits, n = 100 items inserted, k = 7 hash functions.
Optimal k. Given fixed m and n, what k minimizes the false-positive rate? Take the derivative of log(FP) with respect to k and set to zero:
At optimal k, every bit is set with probability exactly 0.5 — the filter is half-full. For m/n = 10 bits per item: kopt = 10 × 0.693 ≈ 7, FP = 0.57 ≈ 0.8%.
Sizing for a target FP rate. Dolma uses a Bloom filter with target FP = 10−15. Solving: (0.5)k_opt = 10−15 → k_opt = 15 / log₁₀(2) ≈ 50. Then m/n = k_opt / ln(2) ≈ 72 bits per item. For n = 10 billion paragraphs: m ≈ 720 billion bits = 90 GB. Much less than storing actual hashes (640 GB for 64-bit hashes at 10B items).
Adjust m (bits) and k (hash functions). The formula (1−e−kn/m)k is plotted live. The optimal k = (m/n)×ln(2) is marked in orange.
Exact deduplication handles mirror copies of the same page. But the real problem is near-duplicates — documents that differ by a few words, a different header, some formatting changes. The same news article republished on 500 aggregator sites. The same Terms of Service with the company name swapped. Exact hashing misses all of these.
You need a similarity measure that handles this fuzzy case. The standard is Jaccard similarity: represent each document as a set of "shingles" (n-grams of words or characters), and measure the fraction of the union they share.
Worked example. Two documents share the first 3 paragraphs but differ in the last one. After shingling with word 5-grams:
Typical threshold: documents with Jaccard ≥ 0.8 are near-duplicates. The problem: computing Jaccard exactly requires comparing every pair of document shingle-sets. O(N²) again.
MinHash is a randomized technique that converts a set into a compact "signature" such that the collision probability between two signatures equals their Jaccard similarity.
Pick a random hash function h. Compute h(s) for every shingle s in the document. The MinHash value is the minimum across all shingles:
Why does P[MinHash(A) = MinHash(B)] = Jaccard(A, B)? Think of the hash function as inducing a random ordering of all possible shingles. The "winner" (minimum hash value) is some shingle s*. Now ask: is s* in both A and B?
Use n independent hash functions (different seeds) and average the match fraction:
python import mmh3 def minhash(shingle_set, seed): return min(mmh3.hash(s, seed) for s in shingle_set) A = {"1", "2", "3", "4"} B = {"1", "2", "3", "5"} # True Jaccard: |{1,2,3}|/|{1,2,3,4,5}| = 3/5 = 0.6 n = 200 # number of hash functions matches = [minhash(A, seed) == minhash(B, seed) for seed in range(n)] estimated_jaccard = sum(matches) / n # ≈ 0.6 with std dev ≈ sqrt(0.6*0.4/200) ≈ 0.035
With n=100 hashes, the standard deviation of the estimate is √(J(1−J)/n) ≈ 0.05. With n=200, it halves to ~0.035. The tradeoff: more hashes = more accuracy but also more storage (200 × 4 bytes = 800 bytes per document for the signature).
Set the true Jaccard similarity of two document shingle-sets, then watch how accurately MinHash estimates it as you add more hash functions. The orange band shows ±1 std dev of the estimate.
You have n=200 MinHash values per document. Two documents with Jaccard=0.6 will have their MinHash signatures match on ~60% of the 200 positions. But you want to declare them duplicates only if Jaccard ≥ 0.8. How do you sharpen the threshold?
With a single MinHash function, P[collision] = Jaccard — a linear, undiscriminating curve. Documents with Jaccard=0.3 collide 30% of the time. Documents with Jaccard=0.9 collide 90%. There is no sharp cutoff.
Locality-Sensitive Hashing (LSH) uses the and-or structure of bands to convert this gentle slope into a sharp S-curve.
Divide your n MinHash values into b bands, each of r rows. Two documents collide (are declared potential duplicates) if and only if at least one band is identical in all r rows.
Where does this formula come from?
The threshold is the similarity s* where the collision probability equals 0.5. Solve:
Worked example from the dedup paper. The "Deduplicating Training Data Makes Language Models Better" paper (Lee et al. 2022) uses n=9000, b=20, r=450:
The "sharp" part: with r=450, the curve goes from P=0.05 to P=0.95 across a Jaccard range of only ~0.004. Almost a step function.
Tuning b and r:
python def lsh_collision_prob(s, b, r): # P[collision | Jaccard = s] with b bands of r rows return 1 - (1 - s**r)**b # b=5, r=10 (n=50 total hashes) lsh_collision_prob(0.5, b=5, r=10) # → 0.028 (5% Jaccard = rarely flagged) lsh_collision_prob(0.8, b=5, r=10) # → 0.672 (80% Jaccard = usually flagged) lsh_collision_prob(0.9, b=5, r=10) # → 0.987 (90% Jaccard = almost always) # threshold where prob = 0.5: threshold = (math.log(2) / b) ** (1 / r) # ≈ 0.812 for b=5, r=10
Adjust bands b and rows r. Watch the S-curve sharpen. The orange vertical line marks the threshold (where P=0.5). Total hashes n = b × r is shown. Compare: single MinHash (r=1, b=n) gives a diagonal line — no sharpening.
Let's build the complete pipeline. You start with raw extracted text from Common Crawl. By the time it reaches your model, it has survived language ID, quality filtering, exact deduplication, and near-deduplication. This showcase lets you tune each stage and watch the corpus transform.
The simulator runs on a synthetic corpus of 200 "documents" — each represented as a point in a 2D quality/similarity space. Quality (x-axis) is the fastText classifier score. Similarity (y-axis) pairs documents by near-duplicate status. Watch how each filter stage changes what survives.
Each dot is a document. Color = survived all stages so far (green) or filtered (red/grey). Tune each stage. The bar at bottom shows tokens remaining and estimated model quality.
| Pipeline | Language | Quality | Exact dedup | Near-dedup | Retention |
|---|---|---|---|---|---|
| C4 | langdetect≥0.99 | Heuristics (Gopher-style) | 3-sentence spans | — | ~11% |
| CCNet/LLaMA | fastText LID | KenLM perplexity top-1/3 | SHA-1 paragraphs | MinHash doc-level | ~8% |
| FineWeb | lid.176 p≥0.65 | Gopher + custom rules | URL-level | MinHash b=9, r=13 | ~17% |
| Dolma | lid.176 p≥0.5 | Gopher rules | Bloom filter paragraphs | MinHash docs | ~12% |
| DCLM | lid.176 p≥0.5 | fastText OpenHermes | URL hash | MinHash b=8, r=3 | ~1.6% |
DCLM's aggressive 1.6% retention is deliberate: the fastText classifier trained on high-quality instruction data sets a very high bar. The resulting 3.8T token corpus, despite being tiny relative to its raw input, trains models that outperform C4 and RefinedWeb on standard benchmarks — validating the "quality beats quantity" thesis.
Quality and deduplication are not the only reasons to remove documents. Two more concerns require their own pipeline stages: safety (removing content that teaches harmful behavior) and contamination (removing content that leaks evaluation data into training).
Training on toxic content does not just make the model capable of producing toxic text — it shifts the base distribution so that toxic completions become more probable even in non-toxic contexts. You want to filter before pretraining, not rely entirely on RLHF post-training to patch it.
Dolma's approach uses two fastText classifiers trained on the Jigsaw Toxic Comments dataset (2018): Wikipedia talk page comments labeled across six toxicity categories.
python import fasttext model = fasttext.load_model("jigsaw_fasttext_bigrams_nsfw_final.bin") model.predict(["I love strawberries"]) # → (__label__non_nsfw, 0.999) model.predict(["I hate strawberries"]) # → (__label__non_nsfw, 0.987) — hate ≠ toxic here model.predict(["Stupid piece of ... die!"]) # → (__label__nsfw, 0.931)
Personally Identifiable Information (PII) is a legal and ethical concern. Models trained on raw web text learn social security numbers, email addresses, phone numbers, and names paired with private contexts. The model can then regurgitate them on demand.
PII removal is typically done with regex patterns (email, phone, SSN, credit card patterns) or trained NER (Named Entity Recognition) models that identify personal names in context. Dolma runs regex-based PII removal before the quality filter.
python import re EMAIL_RE = rr'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b' PHONE_RE = rr'\b(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b' SSN_RE = rr'\b\d{3}-\d{2}-\d{4}\b' def remove_pii(text): text = re.sub(EMAIL_RE, "[EMAIL]", text) text = re.sub(PHONE_RE, "[PHONE]", text) text = re.sub(SSN_RE, "[SSN]", text) return text
Decontamination is the removal of evaluation benchmark examples from training data. If a model is trained on documents containing MMLU questions and their correct answers, its benchmark score is inflated — it memorized the test answers rather than reasoning toward them.
The standard approach: for each evaluation example, generate all n-grams of length 13 from the question text. Search the training corpus for documents containing any of these n-grams. Remove matching documents (or the matching spans). 13-grams are long enough to be specific but short enough to catch paraphrased repetitions.
python def get_ngrams(text, n=13): words = text.lower().split() return {" ".join(words[i:i+n]) for i in range(len(words)-n+1)} eval_ngrams = set() for example in eval_set: eval_ngrams.update(get_ngrams(example.question)) def is_contaminated(doc): doc_ngrams = get_ngrams(doc.text) return bool(doc_ngrams & eval_ngrams) # any overlap → contaminated
You now have every tool in the standard LLM data cleaning toolkit. Here is the full recipe, ordered as it runs in production:
| Algorithm | What it does | Key parameter | Complexity |
|---|---|---|---|
| KenLM (n-gram LM) | Perplexity of doc under target LM | n-gram order, perplexity threshold | O(L) per doc |
| fastText classifier | P(quality | doc) binary | Classifier threshold | O(L) per doc |
| DSIR | pT(x)/pR(x) importance weight | Target T distribution | O(L) per doc |
| Bloom filter | Set membership (approx) | m bits, k hash functions | O(k) per query |
| MurmurHash exact dedup | Remove exact copies | Item granularity (doc/para/span) | O(L) hash + O(1) lookup |
| MinHash | Estimate Jaccard similarity | n hash functions | O(S × n) where S=shingles |
| LSH banding | Find near-duplicate candidates | b bands, r rows, threshold s*=(ln2/b)^(1/r) | O(N × n) total |
You now have a clean, deduplicated corpus. The next question is: how much of this corpus should you actually train on, and in what proportions? That is the data mix problem covered in Lecture 13. After that comes the architecture question (which this lesson series covers in Lectures 1–12) and then alignment: turning a base model into one that follows instructions and behaves safely (Lecture 15 and beyond).
| Lesson | What it covers | Connection to this lesson |
|---|---|---|
| CS336 L13 | Data sources, HTML extraction, data mix | The pipeline stage BEFORE filtering |
| CS336 L15 (upcoming) | Alignment, RLHF, safety post-training | What handles toxicity that filtering misses |
| RAG | Retrieval-Augmented Generation | Uses similar dedup/similarity search at query time |
| Vector Databases | HNSW, IVF for approximate nearest neighbor | LSH is a simpler ANN algorithm — same goal, different mechanism |
"Now you have the tools (the mechanics), just have to spend time with data (the intuitions)." — Percy Liang, CS336 Lecture 14