Data II: Filtering & Deduplication — Language Modeling from Scratch (CS336 L14)

Chapter 0: The Junk Problem

You have just downloaded a Common Crawl snapshot. One and a half trillion tokens of raw text. Enough to fill a million novels. You fire up a tokenizer and start reading page one.

It is a product description for a gas mask. The same sentence — "by combining fantastic ideas, interesting arrangements, and follow the current trends in the field" — repeated 61,036 times. Not 61,036 pages. 61,036 repetitions on what is technically a single document in C4. Your model will memorize that sentence. Every gradient step will slightly tilt the weights toward regurgitating it on demand.

Page two: a navigation menu followed by a cookie banner, followed by the same navigation menu in French, followed by a sitemap in XML, followed by the same article published again with a slightly different URL. Page three: a Terms of Service document. The MIT license. Another Terms of Service. Another MIT license. These are exact duplicates but they are everywhere, so your model will learn to produce legalese on demand whether you want it to or not.

The fundamental tension. You cannot inspect 1.5 trillion tokens manually. You need an automated signal that predicts whether a given document is worth training on — but the signal must generalize. A filter calibrated too loosely leaves the junk. A filter calibrated too aggressively removes domain-specific writing that looks unusual but is perfectly valuable (think: medical records, legal briefs, ancient texts). Every threshold is a bet.

This lecture is about two cleaning stages that happen after HTML extraction (Lecture 13) and before your model sees a single token:

Quality Filtering

Score each document on "goodness" using perplexity models, classifiers, or heuristic rules. Discard low-scorers. Retain ~20–60% of what extraction gives you.

↓

Deduplication

Find and remove near-identical documents. A single web page can appear under thousands of URLs. Remove all but one. This cuts another 30–50%.

Together they explain why FineWeb starts with 95 CC snapshots (~90T raw tokens) but ships 15T tokens — and why DCLM starts with 240T and ships 3.8T. That 1–16% survival rate is not waste. It is signal extraction.

Corpus survival funnel — raw tokens through each stage

Drag the sliders to see how much survives. Compare to real pipelines: C4 retains ~11%, FineWeb ~17%, DCLM ~1.6%. The exact numbers depend on how aggressively each stage filters.

Language ID (keep target lang) 40%

Quality filter (heuristic + classifier) 30%

Exact dedup (remove copies) 80%

Near-dedup MinHash (remove near-copies) 60%

A product description appears 61,036 times in a training corpus. Why is this specifically harmful beyond just wasting compute?

It increases training loss, making the model converge more slowly. It increases the vocabulary size, slowing tokenization. The model memorizes the repeated text and learns to regurgitate it, and the overrepresented phrase distorts the learned distribution away from the intended training objective. Duplicates cause gradient explosions due to repeated identical gradients.

Chapter 1: Filtering Algorithms

Every filtering algorithm answers the same question: given some target data T (text you want your model to sound like) and a huge pile of raw data R (everything from Common Crawl), find a subset of R that resembles T.

That framing immediately tells you what you need: (1) a way to measure resemblance, (2) a way to do it fast enough to process trillions of tokens. Three families of algorithms solve this, each estimating the resemblance differently.

The general framework. Every filtering algorithm has two steps: (1) fit some model using T and R to get a scoring function score(x), and (2) keep documents x with score(x) above some threshold. The three paradigms differ only in what the model is and what the score means.

Algorithm 1: KenLM — n-gram Perplexity

A language model perplexity measures how surprised the model is by a document. Low perplexity means the text looks like what the model was trained on. High perplexity means the text is unusual — either genuinely weird, or high-quality niche text.

Train a fast n-gram model (specifically a trigram with Kneser-Ney smoothing) on Wikipedia or another quality target. Then score every document in Common Crawl. Documents that score low perplexity look like Wikipedia. Documents that score high look different.

score(x) = PPL_T(x) = exp(− (1/N) ∑_i log p_T(w_i | w_i−2, w_i−1))

For a trigram model: p("cat" | "the fat") = count("the fat cat") / count("the fat"). The problem is data sparsity — most trigrams never appear, so the count is zero, which sends the log to −∞. Kneser-Ney smoothing solves this by redistributing probability mass to lower-order n-grams based on how many distinct contexts each word appears in (not just raw counts).

python
import kenlm, math

model = kenlm.Model("en.arpa.bin")  # trained on Wikipedia

def perplexity(text):
    text = "<s> " + text + " </s>"
    score = model.score(text)               # sum of log10 p(w_i | context)
    n = len(list(model.full_scores(text)))
    return math.exp(-score * math.log(10) / n)

perplexity("Stanford University was founded in 1885")     # ~100 (reasonable)
perplexity("asdf asdf asdf asdf asdf")                    # ~50000 (garbage)
perplexity("the the the the the the the the")            # ~2 (repetitive but low ppl!)

The repetition trap. "the the the the the the" has LOW perplexity because "the" is the most common English word and an n-gram model will confidently predict it repeatedly. This is why perplexity filtering alone is insufficient — you also need heuristic rules to catch repetitive junk.

CCNet used this approach: train KenLM on Wikipedia, sort all CC paragraphs by perplexity, keep the top third (lowest perplexity). LLaMA adopted CCNet's filtered output directly.

Algorithm 2: fastText Classifier

Instead of measuring perplexity, train a binary classifier: positive = quality documents (Wikipedia pages), negative = random CC pages. Then score = P(quality | document).

fastText is a deliberately simple architecture: embed each token, average the embeddings, apply a linear head. With the hashing trick for n-grams, the whole model fits in a few hundred MB and runs at millions of documents per second.

y = softmax(U · mean(W_i)) where W ∈ R^V×H, U ∈ R^H×K

For quality filtering K=2 (good/bad), so this reduces to a linear classifier over averaged word embeddings. Hash the bigrams to avoid a runaway vocabulary: hash("the cat") % 10_000_000 gives an index into the embedding table. Collision rate is negligible at 10M bins.

python
import mmh3   # MurmurHash3 — fast, not cryptographic

bigrams = ["the cat", "cat in", "in the", "the hat"]
hashed  = [mmh3.hash(bg) % 10_000_000 for bg in bigrams]
# [3142891, 7601234, 1892345, 9012456]  (example values)
# These index into the 10M-row embedding table

Algorithm 3: DSIR — Importance Resampling

Data Selection via Importance Resampling is the most principled approach. Fit a distribution p to target data T and a distribution q to raw data R. The importance weight for document x is p(x)/q(x). Documents that look much more like T than R get high weights; documents that look equally likely under both get weight ≈ 1; documents rare in T but common in R get low weights.

w(x) = p(x) / q(x) resample R proportionally to w

The trick: rather than fitting a neural LM (expensive), fit hashed unigram distributions. Hash each word to one of 10K bins, estimate p and q as histograms. Fast to fit, fast to score. DSIR slightly outperforms fastText on GLUE benchmarks despite similar cost — because it explicitly captures distributional diversity rather than just a binary quality signal.

Three filtering paradigms compared — quality vs. compute tradeoff

Each dot is a (corpus quality score, relative compute cost) pair. Drag the "quality target" slider to see which algorithms meet the bar. KenLM is cheapest; BERT-based classifiers are highest quality but 100× slower.

Quality bar (minimum score required) 50

fastText for quality filtering uses K=2 classes. What does this mean architecturally, and what is the effective computation?

K=2 means two hidden layers are used, making it deeper than standard fastText. K=2 means two separate models are trained in parallel. K=2 means the output head U is H×2, which reduces to a single linear decision boundary — a linear classifier over averaged n-gram embeddings. K=2 means the vocabulary is restricted to 2-grams (bigrams) only.

Chapter 2: Filtering Applications

The same algorithmic toolkit — KenLM, fastText, DSIR — gets applied to different filtering tasks depending on what you want to remove. Three major applications: language identification, quality filtering, and toxicity filtering.

Language Identification

If you're training an English-first model, most of Common Crawl is noise. The web is multilingual: roughly 60% English, 10% German/French/Russian combined, and the remaining 30% spread across hundreds of languages. A language ID classifier runs first, before any quality filter.

The standard tool is the fastText lid.176.bin model: a fastText classifier trained on Wikipedia, Tatoeba, and SETimes data. It supports 176 languages and runs at millions of documents per second. It outputs a probability per language.

The edge cases are harder than they look. Short text (<20 words) is unreliable — "Hello!" is labeled English but "Bonjour!" is trivially French only because of training data density. Code-switching ("Feliz Navidad / I wanna wish you a Merry Christmas") gets misclassified. Low-resource dialects of English (Scottish, AAVE) can be incorrectly filtered. Dolma's threshold: keep if p(English) ≥ 0.5.

A practical decision: Dolma sets p(English) ≥ 0.5, which is lenient. RefinedWeb sets 0.65, which is stricter. The tradeoff: higher threshold = cleaner English corpus but more false-negative loss of genuinely multilingual quality content (e.g., a page that mixes English and citations in another language).

Quality Filtering

Quality filtering is the most consequential decision in the pipeline. There is no universal "quality" label — it is always defined relative to a target distribution.

Pipeline	Method	Positives (target)	Key threshold
GPT-3	Linear classifier on word features	WebText2, Wikipedia, Books	Stochastic: keep with prob ∝ score^9
LLaMA/RedPajama	fastText classifier	Pages linked by Wikipedia	Hard threshold on classifier score
phi-1	Random forest on codegen embeddings	GPT-4 rated "educational" code	Binary keep/drop
DCLM	fastText on OpenHermes positives	High-quality instruction data	Top-k% by score
FineWeb	Heuristic rules (Gopher rules)	N/A — rule-based	Multiple rule thresholds

The GPT-3 stochastic keep rule is elegant: rather than a hard threshold at score 0.7 (which abruptly discards documents just below the line), it keeps document x with probability proportional to score(x)^9. Documents near zero are almost never kept; near one are almost always kept; but there is no hard cliff. This smoothness avoids the "threshold gambling" effect where tiny changes in the threshold cut 5% of the corpus.

python
import numpy as np

def keep_gpt3(score: float) -> bool:
    # Pareto(alpha=9) trick: keep if U > 1 - score
    # Equivalent to: keep with prob proportional to score^9
    return np.random.pareto(9) > 1 - score

# score=0.9: keep ~90% of the time
# score=0.5: keep ~50% but concentrated near high end
# score=0.1: keep very rarely

Heuristic Rules (Gopher / Dolma)

Many pipelines skip the classifier entirely and use deterministic rules. The Gopher rules (from DeepMind's paper) check properties like: fraction of words that are alphabetic must exceed 80%; mean word length between 3 and 10 characters; no more than 10% lines ending in ellipsis (a spam signal); document must be at least 50 words. FineWeb runs these rules before anything else.

python
def passes_gopher(text: str) -> bool:
    words = text.split()
    if len(words) < 50: return False                    # too short
    alpha_frac = sum(1 for w in words if w.isalpha()) / len(words)
    if alpha_frac < 0.8: return False                   # too many numbers/symbols
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len < 3 or avg_word_len > 10: return False  # garbled or concatenated
    lines = text.split("\n")
    ellipsis_frac = sum(1 for l in lines if l.endswith("...")) / max(1, len(lines))
    if ellipsis_frac > 0.1: return False                  # spam signal
    return True

GPT-3 uses a "stochastic keep" rule (Pareto-based) rather than a hard score threshold. What is the advantage of this approach?

Stochastic rules are faster to compute than thresholds. It allows the model to learn which threshold to use during training. It guarantees the final corpus has exactly the right proportion of high-quality documents. It avoids a hard cliff where documents just below a fixed threshold are fully discarded, instead keeping higher-scoring documents more often in a smooth, differentiable way.

Chapter 3: Hash Functions

Deduplication is fundamentally a comparison problem: is document A the same as document B? Checking every pair is O(N²) — with N=100 billion documents, that means 10²² comparisons. You need an O(N) algorithm. Hash functions are the foundation.

A hash function h maps any input (a string of arbitrary length) to a fixed-size integer output (the hash value). The key properties:

Deterministic: h("hello") always returns the same integer.
Fast: Computing h takes O(len(input)) time, not O(N).
Approximately collision-free: h(x) = h(y) for x ≠ y is rare.
Avalanche effect: One changed bit flips ~half the output bits.

There is a fundamental tradeoff: cryptographic hash functions (SHA-256, SHA-512) are collision-resistant by design — finding two inputs with the same hash requires astronomical compute. But they are slow: ~150 MB/s. Non-cryptographic hash functions (MurmurHash3, CityHash, xxHash) have no collision-resistance guarantees but run at 10–50 GB/s — 100× faster. For deduplication, you want speed, not cryptographic guarantees. A rare accidental collision (two different documents hashing to the same value) is not catastrophic.

python
import mmh3  # MurmurHash3

h1 = mmh3.hash("hello")         # → some 32-bit int, e.g. -1151537390
h2 = mmh3.hash("hello")         # → same value (deterministic)
h3 = mmh3.hash("hellp")         # → completely different (avalanche)

# With seed: h(x, seed=1) gives a different but still deterministic hash
# This is how we simulate multiple independent hash functions
h4 = mmh3.hash("hello", seed=0)  # "first" hash function
h5 = mmh3.hash("hello", seed=1)  # "second" hash function

Exact deduplication using hashes is straightforward. Hash every document, group by hash, keep one from each group. In a MapReduce framework: Map phase emits (hash, document) pairs; Reduce phase keeps one per hash group.

python
import itertools, mmh3

items = ["Hello!", "hello", "hello there", "hello", "hi", "bye"]

# Sort by hash, then groupby hash, keep first from each group
by_hash = itertools.groupby(sorted(items, key=mmh3.hash), key=mmh3.hash)
deduped = [next(group) for h, group in by_hash]
# → ["Hello!", "hello there", "hi", "bye"]  (one "hello" kept)

C4's 3-sentence span deduplication is a variant: instead of hashing full documents, hash every 3-sentence span. Any span appearing more than once is removed from its document. The upside: catches boilerplate in the middle of otherwise-good pages. The downside: removing a span from the middle can make the surrounding document incoherent.

Hash tables run the world. The same principle — hash to a bucket, compare within the bucket — is how Python dicts work, how database indexes work, and how your browser caches work. Deduplication at scale is just a hash table with a very large key space.

Why use MurmurHash instead of SHA-256 for large-scale document deduplication?

SHA-256 has too many collisions for this use case. MurmurHash is ~100× faster (10–50 GB/s vs ~150 MB/s), and deduplication does not require collision resistance — an occasional false match between two different documents is acceptable. MurmurHash produces longer hash values, reducing collisions. SHA-256 requires a GPU; MurmurHash runs on CPU.

Chapter 4: Bloom Filters

You want to test whether a document has been seen before. A simple hash table stores the actual hash (64 bits per document). With 10 billion documents that is 80 GB just for the index. Can you do better?

A Bloom filter is a probabilistic data structure that answers "is this in the set?" using far less memory. The price: it can say "yes" when the answer is actually "no" (a false positive). But it never says "no" when the answer is "yes" — there are no false negatives.

How It Works

Allocate an array of m bits, all initialized to 0. To insert item x: compute k independent hash functions h₁(x), h₂(x), ..., hₖ(x), each returning a value in [0, m). Set those k bit positions to 1. To query item x: check those same k positions. If all are 1, return "probably yes." If any is 0, return "definitely no."

python
from bitarray import bitarray
import mmh3

def build_bloom(items, m, k):
    table = bitarray(m)
    table.setall(0)
    for item in items:
        for seed in range(k):
            h = mmh3.hash(item, seed) % m
            table[h] = 1
    return table

def query_bloom(table, item, m, k):
    return all(table[mmh3.hash(item, s) % m] for s in range(k))

items = ["the", "cat", "in", "the", "hat"]
table = build_bloom(items, m=1000, k=7)
query_bloom(table, "the", 1000, 7)   # → True (definitely in set)
query_bloom(table, "dog", 1000, 7)   # → False or True (True = false positive)

Deriving the False-Positive Rate

Assume the k hash functions are independent and each maps uniformly to [0, m). After inserting n items:

P[bit i = 1] = 1 − (1 − 1/m)^kn ≈ 1 − e^−kn/m

A false positive requires ALL k positions checked for a test item to be set. These positions are also independent (k different hash functions), so:

P[false positive] = (1 − e^−kn/m)^k

Worked example. m = 1000 bits, n = 100 items inserted, k = 7 hash functions.

Exponent: kn/m = 7 × 100 / 1000 = 0.7
Each bit set: 1 − e^−0.7 = 1 − 0.497 = 0.503
FP rate: 0.503⁷ = 0.0081 ≈ 0.8%

Optimal k. Given fixed m and n, what k minimizes the false-positive rate? Take the derivative of log(FP) with respect to k and set to zero:

k_opt = (m/n) · ln(2) → FP_opt = (0.5)^k_opt

At optimal k, every bit is set with probability exactly 0.5 — the filter is half-full. For m/n = 10 bits per item: k_opt = 10 × 0.693 ≈ 7, FP = 0.5⁷ ≈ 0.8%.

Sizing for a target FP rate. Dolma uses a Bloom filter with target FP = 10⁻¹⁵. Solving: (0.5)^k_opt = 10⁻¹⁵ → k_opt = 15 / log₁₀(2) ≈ 50. Then m/n = k_opt / ln(2) ≈ 72 bits per item. For n = 10 billion paragraphs: m ≈ 720 billion bits = 90 GB. Much less than storing actual hashes (640 GB for 64-bit hashes at 10B items).

Memory savings with Bloom filters. A standard hash set storing 64-bit hashes at 10B items needs ~80 GB. A Bloom filter achieving FP=10⁻¹⁵ needs ~90 GB. The savings shrink at ultra-low FP rates. But at FP=1% (perfectly fine for deduplication), a Bloom filter needs only ~9.6 bits/item — a 7× reduction over 64-bit hashes.

Bloom filter simulator — insert items, visualize bit array, watch FP rate

Adjust m (bits) and k (hash functions). The formula (1−e^−kn/m)^k is plotted live. The optimal k = (m/n)×ln(2) is marked in orange.

m — bits in filter 100

k — number of hash functions 7

n — items inserted 15

A Bloom filter says "this document was NOT seen before" (query returns false). What can you conclude with certainty?

The document was definitely not seen before, with high probability. The document might have been seen; Bloom filters have false negatives. The document was DEFINITELY not inserted — Bloom filters have no false negatives. A "no" answer is always correct. Nothing certain — both false positives and false negatives are possible.

Chapter 5: MinHash & Jaccard Similarity

Exact deduplication handles mirror copies of the same page. But the real problem is near-duplicates — documents that differ by a few words, a different header, some formatting changes. The same news article republished on 500 aggregator sites. The same Terms of Service with the company name swapped. Exact hashing misses all of these.

You need a similarity measure that handles this fuzzy case. The standard is Jaccard similarity: represent each document as a set of "shingles" (n-grams of words or characters), and measure the fraction of the union they share.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Worked example. Two documents share the first 3 paragraphs but differ in the last one. After shingling with word 5-grams:

A = {s₁, s₂, s₃, s₄, s₅, s₆, s₇, s₈} (8 unique shingles)
B = {s₁, s₂, s₃, s₄, s₅, s₆, s₉, s₁₀} (shares 6, has 2 unique)
|A ∩ B| = 6, |A ∪ B| = 10
Jaccard = 6/10 = 0.6

Typical threshold: documents with Jaccard ≥ 0.8 are near-duplicates. The problem: computing Jaccard exactly requires comparing every pair of document shingle-sets. O(N²) again.

MinHash: Estimating Jaccard in O(1) per pair

MinHash is a randomized technique that converts a set into a compact "signature" such that the collision probability between two signatures equals their Jaccard similarity.

Pick a random hash function h. Compute h(s) for every shingle s in the document. The MinHash value is the minimum across all shingles:

MinHash(A) = min_{s ∈ A} h(s)

Why does P[MinHash(A) = MinHash(B)] = Jaccard(A, B)? Think of the hash function as inducing a random ordering of all possible shingles. The "winner" (minimum hash value) is some shingle s*. Now ask: is s* in both A and B?

If s* is in A ∩ B: both A and B have the same minimum. They match.
If s* is in A only or B only (in A △ B): A and B have different minimums. They don't match.
The probability that the minimum element is in A ∩ B rather than the rest of A ∪ B is exactly |A ∩ B| / |A ∪ B| = Jaccard(A, B).

Use n independent hash functions (different seeds) and average the match fraction:

python
import mmh3

def minhash(shingle_set, seed):
    return min(mmh3.hash(s, seed) for s in shingle_set)

A = {"1", "2", "3", "4"}
B = {"1", "2", "3", "5"}
# True Jaccard: |{1,2,3}|/|{1,2,3,4,5}| = 3/5 = 0.6

n = 200   # number of hash functions
matches = [minhash(A, seed) == minhash(B, seed) for seed in range(n)]
estimated_jaccard = sum(matches) / n   # ≈ 0.6 with std dev ≈ sqrt(0.6*0.4/200) ≈ 0.035

With n=100 hashes, the standard deviation of the estimate is √(J(1−J)/n) ≈ 0.05. With n=200, it halves to ~0.035. The tradeoff: more hashes = more accuracy but also more storage (200 × 4 bytes = 800 bytes per document for the signature).

MinHash is the key insight. You just compressed a document (potentially megabytes of shingles) into a 200-integer signature, such that the probability of two signatures matching equals their Jaccard similarity. But you still need to compare all pairs of signatures — still O(N²). That is what LSH (next chapter) solves.

MinHash estimator — convergence as number of hash functions grows

Set the true Jaccard similarity of two document shingle-sets, then watch how accurately MinHash estimates it as you add more hash functions. The orange band shows ±1 std dev of the estimate.

True Jaccard similarity 0.60

Number of hash functions (n) 50

Sets A = {a, b, c, d} and B = {b, c, d, e, f}. What is the exact Jaccard similarity? And why is P[MinHash(A) = MinHash(B)] equal to that value?

Jaccard = 3/5 = 0.6. MinHash probabilities match because hash functions are bijective. Jaccard = 3/4 = 0.75. MinHash works because the same elements always get the same hash. Jaccard = 3/6 = 0.5. MinHash: the minimum element over A∪B={a,b,c,d,e,f} falls in A∩B={b,c,d} with prob 3/6 = 0.5, so P[match] = 0.5 = Jaccard. Jaccard = 2/6 = 0.33. MinHash estimates by averaging collisions over hash functions.

Chapter 6: LSH: Sharpen the Threshold

You have n=200 MinHash values per document. Two documents with Jaccard=0.6 will have their MinHash signatures match on ~60% of the 200 positions. But you want to declare them duplicates only if Jaccard ≥ 0.8. How do you sharpen the threshold?

With a single MinHash function, P[collision] = Jaccard — a linear, undiscriminating curve. Documents with Jaccard=0.3 collide 30% of the time. Documents with Jaccard=0.9 collide 90%. There is no sharp cutoff.

Locality-Sensitive Hashing (LSH) uses the and-or structure of bands to convert this gentle slope into a sharp S-curve.

The Banding Trick

Divide your n MinHash values into b bands, each of r rows. Two documents collide (are declared potential duplicates) if and only if at least one band is identical in all r rows.

P[collision | Jaccard = s] = 1 − (1 − s^r)^b

Where does this formula come from?

P[a single row matches] = s (by MinHash property)
P[all r rows in one band match] = s^r (independent rows)
P[a given band does NOT match] = 1 − s^r
P[NO band matches] = (1 − s^r)^b
P[at least one band matches] = 1 − (1 − s^r)^b

Deriving the Threshold

The threshold is the similarity s* where the collision probability equals 0.5. Solve:

1 − (1 − (s*)^r)^b = 0.5 ⇒ (s*)^r = 1 − 0.5^1/b ≈ ln(2)/b ⇒ s* = (ln(2)/b)^1/r

Worked example from the dedup paper. The "Deduplicating Training Data Makes Language Models Better" paper (Lee et al. 2022) uses n=9000, b=20, r=450:

s* = (ln(2)/20)^1/450 = (0.0347)^0.00222
= exp(0.00222 × ln(0.0347)) = exp(0.00222 × (−3.36)) = exp(−0.00746) ≈ 0.9926
So documents are flagged as near-duplicates only if Jaccard ≥ 0.993 — an extremely tight threshold.

The "sharp" part: with r=450, the curve goes from P=0.05 to P=0.95 across a Jaccard range of only ~0.004. Almost a step function.

Tuning b and r:

Increase r (hold b fixed): sharpens the S-curve. Smaller transition region. Threshold stays approximately the same.
Increase b (hold r fixed): moves curve left. Lower threshold — catches more near-duplicates, but more false positives.
Increase both (increase n = b×r): sharper curve AND can tune threshold independently.

python
def lsh_collision_prob(s, b, r):
    # P[collision | Jaccard = s] with b bands of r rows
    return 1 - (1 - s**r)**b

# b=5, r=10 (n=50 total hashes)
lsh_collision_prob(0.5, b=5, r=10)   # → 0.028  (5% Jaccard = rarely flagged)
lsh_collision_prob(0.8, b=5, r=10)   # → 0.672  (80% Jaccard = usually flagged)
lsh_collision_prob(0.9, b=5, r=10)   # → 0.987  (90% Jaccard = almost always)

# threshold where prob = 0.5:
threshold = (math.log(2) / b) ** (1 / r)   # ≈ 0.812 for b=5, r=10

LSH S-curve — collision probability vs. Jaccard similarity

Adjust bands b and rows r. Watch the S-curve sharpen. The orange vertical line marks the threshold (where P=0.5). Total hashes n = b × r is shown. Compare: single MinHash (r=1, b=n) gives a diagonal line — no sharpening.

b — number of bands 10

r — rows per band 10

In LSH banding: you increase r from 10 to 20 while holding b=10 fixed. What happens to the S-curve and why?

The curve moves left, flagging more near-duplicates at lower Jaccard. The curve becomes flatter, reducing selectivity. The curve becomes sharper (steeper transition), and the threshold moves slightly right, because P[all r rows match] = s^r falls faster for large r when s<1, requiring higher Jaccard to get a band match. No change — the threshold is determined only by b, not r.

Chapter 7: Showcase: Full Dedup + Filter Pipeline

Let's build the complete pipeline. You start with raw extracted text from Common Crawl. By the time it reaches your model, it has survived language ID, quality filtering, exact deduplication, and near-deduplication. This showcase lets you tune each stage and watch the corpus transform.

The simulator runs on a synthetic corpus of 200 "documents" — each represented as a point in a 2D quality/similarity space. Quality (x-axis) is the fastText classifier score. Similarity (y-axis) pairs documents by near-duplicate status. Watch how each filter stage changes what survives.

Full pipeline simulator — 200 synthetic documents through all 4 stages

Each dot is a document. Color = survived all stages so far (green) or filtered (red/grey). Tune each stage. The bar at bottom shows tokens remaining and estimated model quality.

Language ID threshold (keep above) 0.50

Quality classifier threshold 0.35

Near-dedup Jaccard threshold (flag above) 0.80

Stage: all

What Real Pipelines Do

Pipeline	Language	Quality	Exact dedup	Near-dedup	Retention
C4	langdetect≥0.99	Heuristics (Gopher-style)	3-sentence spans	—	~11%
CCNet/LLaMA	fastText LID	KenLM perplexity top-1/3	SHA-1 paragraphs	MinHash doc-level	~8%
FineWeb	lid.176 p≥0.65	Gopher + custom rules	URL-level	MinHash b=9, r=13	~17%
Dolma	lid.176 p≥0.5	Gopher rules	Bloom filter paragraphs	MinHash docs	~12%
DCLM	lid.176 p≥0.5	fastText OpenHermes	URL hash	MinHash b=8, r=3	~1.6%

DCLM's aggressive 1.6% retention is deliberate: the fastText classifier trained on high-quality instruction data sets a very high bar. The resulting 3.8T token corpus, despite being tiny relative to its raw input, trains models that outperform C4 and RefinedWeb on standard benchmarks — validating the "quality beats quantity" thesis.

The FineWeb insight. HuggingFace found that Gopher rules alone (applied to 95 CC dumps) outperformed all ML classifier approaches in their ablations. The intuition: Gopher rules reject the truly incoherent text (too many symbols, too short, repetitive), while ML classifiers over-filter diverse-but-valid writing that happens not to sound like Wikipedia. FineWeb keeps 17% — 3× more than DCLM — and achieves competitive benchmark scores.

Chapter 8: Toxicity, PII & Decontamination

Quality and deduplication are not the only reasons to remove documents. Two more concerns require their own pipeline stages: safety (removing content that teaches harmful behavior) and contamination (removing content that leaks evaluation data into training).

Toxicity Filtering

Training on toxic content does not just make the model capable of producing toxic text — it shifts the base distribution so that toxic completions become more probable even in non-toxic contexts. You want to filter before pretraining, not rely entirely on RLHF post-training to patch it.

Dolma's approach uses two fastText classifiers trained on the Jigsaw Toxic Comments dataset (2018): Wikipedia talk page comments labeled across six toxicity categories.

NSFW classifier: positive = {obscene}, negative = all else. Removes sexual content.
Hate classifier: positive = {toxic, threat, insult, identity_hate}, negative = all else. Removes hateful speech.

python
import fasttext

model = fasttext.load_model("jigsaw_fasttext_bigrams_nsfw_final.bin")
model.predict(["I love strawberries"])    # → (__label__non_nsfw, 0.999)
model.predict(["I hate strawberries"])    # → (__label__non_nsfw, 0.987) — hate ≠ toxic here
model.predict(["Stupid piece of ... die!"]) # → (__label__nsfw, 0.931)

False positives are costly. A classifier trained to detect toxic content will over-remove legitimate discussions of difficult topics: academic papers on hate speech, news articles about violence, survivor testimony. The tradeoff is threshold-sensitive. Dolma uses a relatively high threshold (low sensitivity) to preserve more data at the cost of some toxic content remaining, trusting post-training to handle the rest.

PII Removal

Personally Identifiable Information (PII) is a legal and ethical concern. Models trained on raw web text learn social security numbers, email addresses, phone numbers, and names paired with private contexts. The model can then regurgitate them on demand.

PII removal is typically done with regex patterns (email, phone, SSN, credit card patterns) or trained NER (Named Entity Recognition) models that identify personal names in context. Dolma runs regex-based PII removal before the quality filter.

python
import re

EMAIL_RE = rr'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
PHONE_RE = rr'\b(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
SSN_RE   = rr'\b\d{3}-\d{2}-\d{4}\b'

def remove_pii(text):
    text = re.sub(EMAIL_RE, "[EMAIL]", text)
    text = re.sub(PHONE_RE, "[PHONE]", text)
    text = re.sub(SSN_RE,   "[SSN]",   text)
    return text

Decontamination

Decontamination is the removal of evaluation benchmark examples from training data. If a model is trained on documents containing MMLU questions and their correct answers, its benchmark score is inflated — it memorized the test answers rather than reasoning toward them.

The standard approach: for each evaluation example, generate all n-grams of length 13 from the question text. Search the training corpus for documents containing any of these n-grams. Remove matching documents (or the matching spans). 13-grams are long enough to be specific but short enough to catch paraphrased repetitions.

python
def get_ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i+n]) for i in range(len(words)-n+1)}

eval_ngrams = set()
for example in eval_set:
    eval_ngrams.update(get_ngrams(example.question))

def is_contaminated(doc):
    doc_ngrams = get_ngrams(doc.text)
    return bool(doc_ngrams & eval_ngrams)   # any overlap → contaminated

How much does it matter? Eleuther AI found that decontaminating The Pile against their evaluation suite changed benchmark scores by less than 1% in absolute terms. But for models specifically optimized on benchmarks (as many commercial models are), contamination can inflate numbers substantially. The honest thing to do is to always decontaminate and report it.

Why is 13-gram matching used for decontamination rather than, say, exact document matching?

13-grams are faster to compute than document-level hashes. An evaluation question may appear inside a larger document (e.g., a study guide or test-prep page) — 13-gram matching catches these partial inclusions that exact document matching would miss. 13 is the standard n-gram length for all NLP deduplication tasks. 13-grams are less likely to produce false positives than shorter n-grams.

Chapter 9: Connections & Data Cleaning Recipe

You now have every tool in the standard LLM data cleaning toolkit. Here is the full recipe, ordered as it runs in production:

1. Language ID

fastText lid.176. Keep target language p ≥ 0.5–0.65. Run first — cheapest filter, removes ~50% of raw CC.

↓

2. Heuristic rules

Gopher rules: min 50 words, 80% alphabetic, mean word length 3–10, <10% ellipsis lines. Remove HTML artifacts, excessive symbols, cookie consent text not caught by extraction.

↓

3. Quality classifier (optional)

fastText trained on quality positives (Wikipedia / OpenHermes / pages linked from Wikipedia). Threshold or stochastic keep. DCLM: aggressive. FineWeb: skips this stage.

↓

4. PII removal

Regex patterns for emails, phones, SSNs. Optional NER for personal names in sensitive context.

↓

5. Toxicity filter

fastText NSFW + hate classifiers (Dolma Jigsaw models). Remove or flag. Threshold tuning is a policy decision.

↓

6. Exact dedup

Hash paragraphs or 3-sentence spans (MurmurHash). Group, keep one. Bloom filter for streaming membership testing.

↓

7. Near-dedup (MinHash + LSH)

Shingle docs into 5-grams. Compute 200 MinHash signatures. Band into b bands of r rows. Documents sharing a band bucket → candidate pairs → remove the lower-quality copy.

↓

8. Decontamination

Build 13-gram set from all eval benchmarks. Remove training docs that overlap. Report which benchmarks were decontaminated.

Algorithm Cheat Sheet

Algorithm	What it does	Key parameter	Complexity
KenLM (n-gram LM)	Perplexity of doc under target LM	n-gram order, perplexity threshold	O(L) per doc
fastText classifier	P(quality \| doc) binary	Classifier threshold	O(L) per doc
DSIR	p_T(x)/p_R(x) importance weight	Target T distribution	O(L) per doc
Bloom filter	Set membership (approx)	m bits, k hash functions	O(k) per query
MurmurHash exact dedup	Remove exact copies	Item granularity (doc/para/span)	O(L) hash + O(1) lookup
MinHash	Estimate Jaccard similarity	n hash functions	O(S × n) where S=shingles
LSH banding	Find near-duplicate candidates	b bands, r rows, threshold s*=(ln2/b)^(1/r)	O(N × n) total

What Comes Next

You now have a clean, deduplicated corpus. The next question is: how much of this corpus should you actually train on, and in what proportions? That is the data mix problem covered in Lecture 13. After that comes the architecture question (which this lesson series covers in Lectures 1–12) and then alignment: turning a base model into one that follows instructions and behaves safely (Lecture 15 and beyond).

Lesson	What it covers	Connection to this lesson
CS336 L13	Data sources, HTML extraction, data mix	The pipeline stage BEFORE filtering
CS336 L15 (upcoming)	Alignment, RLHF, safety post-training	What handles toxicity that filtering misses
RAG	Retrieval-Augmented Generation	Uses similar dedup/similarity search at query time
Vector Databases	HNSW, IVF for approximate nearest neighbor	LSH is a simpler ANN algorithm — same goal, different mechanism

Feynman test. Can you explain to someone who has never programmed why a Bloom filter can have false positives but never false negatives? Can you derive the MinHash collision probability from the characteristic matrix? Can you tell someone how to choose b and r in LSH to hit a target similarity threshold? If yes, you have genuinely understood this lecture.

"Now you have the tools (the mechanics), just have to spend time with data (the intuitions)." — Percy Liang, CS336 Lecture 14