Language Modeling from Scratch · CS336 · Lecture 14

Data II: Filtering & Deduplication

You have 1.4 trillion tokens from Common Crawl. Most of it is spam, bot-generated drivel, and near-identical copies of the same page. To get a model that actually learns from the data, you need to keep the signal and throw out the noise — at web scale, in linear time. This lesson derives every algorithm: n-gram perplexity (KenLM), fastText classifiers, importance resampling (DSIR), Bloom filters, MinHash, and Locality-Sensitive Hashing. You will compute Jaccard similarity by hand, derive the MinHash collision probability, trace the LSH S-curve math, and size a Bloom filter from first principles. By the end, you can build the dedup+filter stack that powers FineWeb, Dolma, and DCLM.

Prerequisites: CS336 Lec 13 (data pipeline, Common Crawl, HTML extraction). Basic probability (P[A], independence).
10
Chapters
5
Live Canvases
Derived
From First Principles

Chapter 0: The Junk Problem

You have just downloaded a Common Crawl snapshot. One and a half trillion tokens of raw text. Enough to fill a million novels. You fire up a tokenizer and start reading page one.

It is a product description for a gas mask. The same sentence — "by combining fantastic ideas, interesting arrangements, and follow the current trends in the field" — repeated 61,036 times. Not 61,036 pages. 61,036 repetitions on what is technically a single document in C4. Your model will memorize that sentence. Every gradient step will slightly tilt the weights toward regurgitating it on demand.

Page two: a navigation menu followed by a cookie banner, followed by the same navigation menu in French, followed by a sitemap in XML, followed by the same article published again with a slightly different URL. Page three: a Terms of Service document. The MIT license. Another Terms of Service. Another MIT license. These are exact duplicates but they are everywhere, so your model will learn to produce legalese on demand whether you want it to or not.

The fundamental tension. You cannot inspect 1.5 trillion tokens manually. You need an automated signal that predicts whether a given document is worth training on — but the signal must generalize. A filter calibrated too loosely leaves the junk. A filter calibrated too aggressively removes domain-specific writing that looks unusual but is perfectly valuable (think: medical records, legal briefs, ancient texts). Every threshold is a bet.

This lecture is about two cleaning stages that happen after HTML extraction (Lecture 13) and before your model sees a single token:

Quality Filtering
Score each document on "goodness" using perplexity models, classifiers, or heuristic rules. Discard low-scorers. Retain ~20–60% of what extraction gives you.
Deduplication
Find and remove near-identical documents. A single web page can appear under thousands of URLs. Remove all but one. This cuts another 30–50%.

Together they explain why FineWeb starts with 95 CC snapshots (~90T raw tokens) but ships 15T tokens — and why DCLM starts with 240T and ships 3.8T. That 1–16% survival rate is not waste. It is signal extraction.

Corpus survival funnel — raw tokens through each stage

Drag the sliders to see how much survives. Compare to real pipelines: C4 retains ~11%, FineWeb ~17%, DCLM ~1.6%. The exact numbers depend on how aggressively each stage filters.

Language ID (keep target lang) 40%
Quality filter (heuristic + classifier) 30%
Exact dedup (remove copies) 80%
Near-dedup MinHash (remove near-copies) 60%
A product description appears 61,036 times in a training corpus. Why is this specifically harmful beyond just wasting compute?

Chapter 1: Filtering Algorithms

Every filtering algorithm answers the same question: given some target data T (text you want your model to sound like) and a huge pile of raw data R (everything from Common Crawl), find a subset of R that resembles T.

That framing immediately tells you what you need: (1) a way to measure resemblance, (2) a way to do it fast enough to process trillions of tokens. Three families of algorithms solve this, each estimating the resemblance differently.

The general framework. Every filtering algorithm has two steps: (1) fit some model using T and R to get a scoring function score(x), and (2) keep documents x with score(x) above some threshold. The three paradigms differ only in what the model is and what the score means.

Algorithm 1: KenLM — n-gram Perplexity

A language model perplexity measures how surprised the model is by a document. Low perplexity means the text looks like what the model was trained on. High perplexity means the text is unusual — either genuinely weird, or high-quality niche text.

Train a fast n-gram model (specifically a trigram with Kneser-Ney smoothing) on Wikipedia or another quality target. Then score every document in Common Crawl. Documents that score low perplexity look like Wikipedia. Documents that score high look different.

score(x) = PPLT(x) = exp(− (1/N) ∑i log pT(wi | wi−2, wi−1))

For a trigram model: p("cat" | "the fat") = count("the fat cat") / count("the fat"). The problem is data sparsity — most trigrams never appear, so the count is zero, which sends the log to −∞. Kneser-Ney smoothing solves this by redistributing probability mass to lower-order n-grams based on how many distinct contexts each word appears in (not just raw counts).

python
import kenlm, math

model = kenlm.Model("en.arpa.bin")  # trained on Wikipedia

def perplexity(text):
    text = "<s> " + text + " </s>"
    score = model.score(text)               # sum of log10 p(w_i | context)
    n = len(list(model.full_scores(text)))
    return math.exp(-score * math.log(10) / n)

perplexity("Stanford University was founded in 1885")     # ~100 (reasonable)
perplexity("asdf asdf asdf asdf asdf")                    # ~50000 (garbage)
perplexity("the the the the the the the the")            # ~2 (repetitive but low ppl!)
The repetition trap. "the the the the the the" has LOW perplexity because "the" is the most common English word and an n-gram model will confidently predict it repeatedly. This is why perplexity filtering alone is insufficient — you also need heuristic rules to catch repetitive junk.

CCNet used this approach: train KenLM on Wikipedia, sort all CC paragraphs by perplexity, keep the top third (lowest perplexity). LLaMA adopted CCNet's filtered output directly.

Algorithm 2: fastText Classifier

Instead of measuring perplexity, train a binary classifier: positive = quality documents (Wikipedia pages), negative = random CC pages. Then score = P(quality | document).

fastText is a deliberately simple architecture: embed each token, average the embeddings, apply a linear head. With the hashing trick for n-grams, the whole model fits in a few hundred MB and runs at millions of documents per second.

y = softmax(U · mean(Wi))    where W ∈ RV×H, U ∈ RH×K

For quality filtering K=2 (good/bad), so this reduces to a linear classifier over averaged word embeddings. Hash the bigrams to avoid a runaway vocabulary: hash("the cat") % 10_000_000 gives an index into the embedding table. Collision rate is negligible at 10M bins.

python
import mmh3   # MurmurHash3 — fast, not cryptographic

bigrams = ["the cat", "cat in", "in the", "the hat"]
hashed  = [mmh3.hash(bg) % 10_000_000 for bg in bigrams]
# [3142891, 7601234, 1892345, 9012456]  (example values)
# These index into the 10M-row embedding table

Algorithm 3: DSIR — Importance Resampling

Data Selection via Importance Resampling is the most principled approach. Fit a distribution p to target data T and a distribution q to raw data R. The importance weight for document x is p(x)/q(x). Documents that look much more like T than R get high weights; documents that look equally likely under both get weight ≈ 1; documents rare in T but common in R get low weights.

w(x) = p(x) / q(x)    resample R proportionally to w

The trick: rather than fitting a neural LM (expensive), fit hashed unigram distributions. Hash each word to one of 10K bins, estimate p and q as histograms. Fast to fit, fast to score. DSIR slightly outperforms fastText on GLUE benchmarks despite similar cost — because it explicitly captures distributional diversity rather than just a binary quality signal.

Three filtering paradigms compared — quality vs. compute tradeoff

Each dot is a (corpus quality score, relative compute cost) pair. Drag the "quality target" slider to see which algorithms meet the bar. KenLM is cheapest; BERT-based classifiers are highest quality but 100× slower.

Quality bar (minimum score required) 50
fastText for quality filtering uses K=2 classes. What does this mean architecturally, and what is the effective computation?

Chapter 2: Filtering Applications

The same algorithmic toolkit — KenLM, fastText, DSIR — gets applied to different filtering tasks depending on what you want to remove. Three major applications: language identification, quality filtering, and toxicity filtering.

Language Identification

If you're training an English-first model, most of Common Crawl is noise. The web is multilingual: roughly 60% English, 10% German/French/Russian combined, and the remaining 30% spread across hundreds of languages. A language ID classifier runs first, before any quality filter.

The standard tool is the fastText lid.176.bin model: a fastText classifier trained on Wikipedia, Tatoeba, and SETimes data. It supports 176 languages and runs at millions of documents per second. It outputs a probability per language.

The edge cases are harder than they look. Short text (<20 words) is unreliable — "Hello!" is labeled English but "Bonjour!" is trivially French only because of training data density. Code-switching ("Feliz Navidad / I wanna wish you a Merry Christmas") gets misclassified. Low-resource dialects of English (Scottish, AAVE) can be incorrectly filtered. Dolma's threshold: keep if p(English) ≥ 0.5.

A practical decision: Dolma sets p(English) ≥ 0.5, which is lenient. RefinedWeb sets 0.65, which is stricter. The tradeoff: higher threshold = cleaner English corpus but more false-negative loss of genuinely multilingual quality content (e.g., a page that mixes English and citations in another language).

Quality Filtering

Quality filtering is the most consequential decision in the pipeline. There is no universal "quality" label — it is always defined relative to a target distribution.

PipelineMethodPositives (target)Key threshold
GPT-3Linear classifier on word featuresWebText2, Wikipedia, BooksStochastic: keep with prob ∝ score^9
LLaMA/RedPajamafastText classifierPages linked by WikipediaHard threshold on classifier score
phi-1Random forest on codegen embeddingsGPT-4 rated "educational" codeBinary keep/drop
DCLMfastText on OpenHermes positivesHigh-quality instruction dataTop-k% by score
FineWebHeuristic rules (Gopher rules)N/A — rule-basedMultiple rule thresholds

The GPT-3 stochastic keep rule is elegant: rather than a hard threshold at score 0.7 (which abruptly discards documents just below the line), it keeps document x with probability proportional to score(x)^9. Documents near zero are almost never kept; near one are almost always kept; but there is no hard cliff. This smoothness avoids the "threshold gambling" effect where tiny changes in the threshold cut 5% of the corpus.

python
import numpy as np

def keep_gpt3(score: float) -> bool:
    # Pareto(alpha=9) trick: keep if U > 1 - score
    # Equivalent to: keep with prob proportional to score^9
    return np.random.pareto(9) > 1 - score

# score=0.9: keep ~90% of the time
# score=0.5: keep ~50% but concentrated near high end
# score=0.1: keep very rarely

Heuristic Rules (Gopher / Dolma)

Many pipelines skip the classifier entirely and use deterministic rules. The Gopher rules (from DeepMind's paper) check properties like: fraction of words that are alphabetic must exceed 80%; mean word length between 3 and 10 characters; no more than 10% lines ending in ellipsis (a spam signal); document must be at least 50 words. FineWeb runs these rules before anything else.

python
def passes_gopher(text: str) -> bool:
    words = text.split()
    if len(words) < 50: return False                    # too short
    alpha_frac = sum(1 for w in words if w.isalpha()) / len(words)
    if alpha_frac < 0.8: return False                   # too many numbers/symbols
    avg_word_len = sum(len(w) for w in words) / len(words)
    if avg_word_len < 3 or avg_word_len > 10: return False  # garbled or concatenated
    lines = text.split("\n")
    ellipsis_frac = sum(1 for l in lines if l.endswith("...")) / max(1, len(lines))
    if ellipsis_frac > 0.1: return False                  # spam signal
    return True
GPT-3 uses a "stochastic keep" rule (Pareto-based) rather than a hard score threshold. What is the advantage of this approach?

Chapter 3: Hash Functions

Deduplication is fundamentally a comparison problem: is document A the same as document B? Checking every pair is O(N²) — with N=100 billion documents, that means 10²² comparisons. You need an O(N) algorithm. Hash functions are the foundation.

A hash function h maps any input (a string of arbitrary length) to a fixed-size integer output (the hash value). The key properties:

There is a fundamental tradeoff: cryptographic hash functions (SHA-256, SHA-512) are collision-resistant by design — finding two inputs with the same hash requires astronomical compute. But they are slow: ~150 MB/s. Non-cryptographic hash functions (MurmurHash3, CityHash, xxHash) have no collision-resistance guarantees but run at 10–50 GB/s — 100× faster. For deduplication, you want speed, not cryptographic guarantees. A rare accidental collision (two different documents hashing to the same value) is not catastrophic.

python
import mmh3  # MurmurHash3

h1 = mmh3.hash("hello")         # → some 32-bit int, e.g. -1151537390
h2 = mmh3.hash("hello")         # → same value (deterministic)
h3 = mmh3.hash("hellp")         # → completely different (avalanche)

# With seed: h(x, seed=1) gives a different but still deterministic hash
# This is how we simulate multiple independent hash functions
h4 = mmh3.hash("hello", seed=0)  # "first" hash function
h5 = mmh3.hash("hello", seed=1)  # "second" hash function

Exact deduplication using hashes is straightforward. Hash every document, group by hash, keep one from each group. In a MapReduce framework: Map phase emits (hash, document) pairs; Reduce phase keeps one per hash group.

python
import itertools, mmh3

items = ["Hello!", "hello", "hello there", "hello", "hi", "bye"]

# Sort by hash, then groupby hash, keep first from each group
by_hash = itertools.groupby(sorted(items, key=mmh3.hash), key=mmh3.hash)
deduped = [next(group) for h, group in by_hash]
# → ["Hello!", "hello there", "hi", "bye"]  (one "hello" kept)

C4's 3-sentence span deduplication is a variant: instead of hashing full documents, hash every 3-sentence span. Any span appearing more than once is removed from its document. The upside: catches boilerplate in the middle of otherwise-good pages. The downside: removing a span from the middle can make the surrounding document incoherent.

Hash tables run the world. The same principle — hash to a bucket, compare within the bucket — is how Python dicts work, how database indexes work, and how your browser caches work. Deduplication at scale is just a hash table with a very large key space.
Why use MurmurHash instead of SHA-256 for large-scale document deduplication?

Chapter 4: Bloom Filters

You want to test whether a document has been seen before. A simple hash table stores the actual hash (64 bits per document). With 10 billion documents that is 80 GB just for the index. Can you do better?

A Bloom filter is a probabilistic data structure that answers "is this in the set?" using far less memory. The price: it can say "yes" when the answer is actually "no" (a false positive). But it never says "no" when the answer is "yes" — there are no false negatives.

How It Works

Allocate an array of m bits, all initialized to 0. To insert item x: compute k independent hash functions h₁(x), h₂(x), ..., hₖ(x), each returning a value in [0, m). Set those k bit positions to 1. To query item x: check those same k positions. If all are 1, return "probably yes." If any is 0, return "definitely no."

python
from bitarray import bitarray
import mmh3

def build_bloom(items, m, k):
    table = bitarray(m)
    table.setall(0)
    for item in items:
        for seed in range(k):
            h = mmh3.hash(item, seed) % m
            table[h] = 1
    return table

def query_bloom(table, item, m, k):
    return all(table[mmh3.hash(item, s) % m] for s in range(k))

items = ["the", "cat", "in", "the", "hat"]
table = build_bloom(items, m=1000, k=7)
query_bloom(table, "the", 1000, 7)   # → True (definitely in set)
query_bloom(table, "dog", 1000, 7)   # → False or True (True = false positive)

Deriving the False-Positive Rate

Assume the k hash functions are independent and each maps uniformly to [0, m). After inserting n items:

P[bit i = 1] = 1 − (1 − 1/m)kn ≈ 1 − e−kn/m

A false positive requires ALL k positions checked for a test item to be set. These positions are also independent (k different hash functions), so:

P[false positive] = (1 − e−kn/m)k

Worked example. m = 1000 bits, n = 100 items inserted, k = 7 hash functions.

Optimal k. Given fixed m and n, what k minimizes the false-positive rate? Take the derivative of log(FP) with respect to k and set to zero:

kopt = (m/n) · ln(2)    →    FPopt = (0.5)kopt

At optimal k, every bit is set with probability exactly 0.5 — the filter is half-full. For m/n = 10 bits per item: kopt = 10 × 0.693 ≈ 7, FP = 0.57 ≈ 0.8%.

Sizing for a target FP rate. Dolma uses a Bloom filter with target FP = 10−15. Solving: (0.5)k_opt = 10−15 → k_opt = 15 / log₁₀(2) ≈ 50. Then m/n = k_opt / ln(2) ≈ 72 bits per item. For n = 10 billion paragraphs: m ≈ 720 billion bits = 90 GB. Much less than storing actual hashes (640 GB for 64-bit hashes at 10B items).

Memory savings with Bloom filters. A standard hash set storing 64-bit hashes at 10B items needs ~80 GB. A Bloom filter achieving FP=10−15 needs ~90 GB. The savings shrink at ultra-low FP rates. But at FP=1% (perfectly fine for deduplication), a Bloom filter needs only ~9.6 bits/item — a 7× reduction over 64-bit hashes.
Bloom filter simulator — insert items, visualize bit array, watch FP rate

Adjust m (bits) and k (hash functions). The formula (1−e−kn/m)k is plotted live. The optimal k = (m/n)×ln(2) is marked in orange.

m — bits in filter 100
k — number of hash functions 7
n — items inserted 15
A Bloom filter says "this document was NOT seen before" (query returns false). What can you conclude with certainty?

Chapter 5: MinHash & Jaccard Similarity

Exact deduplication handles mirror copies of the same page. But the real problem is near-duplicates — documents that differ by a few words, a different header, some formatting changes. The same news article republished on 500 aggregator sites. The same Terms of Service with the company name swapped. Exact hashing misses all of these.

You need a similarity measure that handles this fuzzy case. The standard is Jaccard similarity: represent each document as a set of "shingles" (n-grams of words or characters), and measure the fraction of the union they share.

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Worked example. Two documents share the first 3 paragraphs but differ in the last one. After shingling with word 5-grams:

Typical threshold: documents with Jaccard ≥ 0.8 are near-duplicates. The problem: computing Jaccard exactly requires comparing every pair of document shingle-sets. O(N²) again.

MinHash: Estimating Jaccard in O(1) per pair

MinHash is a randomized technique that converts a set into a compact "signature" such that the collision probability between two signatures equals their Jaccard similarity.

Pick a random hash function h. Compute h(s) for every shingle s in the document. The MinHash value is the minimum across all shingles:

MinHash(A) = mins ∈ A h(s)

Why does P[MinHash(A) = MinHash(B)] = Jaccard(A, B)? Think of the hash function as inducing a random ordering of all possible shingles. The "winner" (minimum hash value) is some shingle s*. Now ask: is s* in both A and B?

Use n independent hash functions (different seeds) and average the match fraction:

python
import mmh3

def minhash(shingle_set, seed):
    return min(mmh3.hash(s, seed) for s in shingle_set)

A = {"1", "2", "3", "4"}
B = {"1", "2", "3", "5"}
# True Jaccard: |{1,2,3}|/|{1,2,3,4,5}| = 3/5 = 0.6

n = 200   # number of hash functions
matches = [minhash(A, seed) == minhash(B, seed) for seed in range(n)]
estimated_jaccard = sum(matches) / n   # ≈ 0.6 with std dev ≈ sqrt(0.6*0.4/200) ≈ 0.035

With n=100 hashes, the standard deviation of the estimate is √(J(1−J)/n) ≈ 0.05. With n=200, it halves to ~0.035. The tradeoff: more hashes = more accuracy but also more storage (200 × 4 bytes = 800 bytes per document for the signature).

MinHash is the key insight. You just compressed a document (potentially megabytes of shingles) into a 200-integer signature, such that the probability of two signatures matching equals their Jaccard similarity. But you still need to compare all pairs of signatures — still O(N²). That is what LSH (next chapter) solves.
MinHash estimator — convergence as number of hash functions grows

Set the true Jaccard similarity of two document shingle-sets, then watch how accurately MinHash estimates it as you add more hash functions. The orange band shows ±1 std dev of the estimate.

True Jaccard similarity 0.60
Number of hash functions (n) 50
Sets A = {a, b, c, d} and B = {b, c, d, e, f}. What is the exact Jaccard similarity? And why is P[MinHash(A) = MinHash(B)] equal to that value?

Chapter 6: LSH: Sharpen the Threshold

You have n=200 MinHash values per document. Two documents with Jaccard=0.6 will have their MinHash signatures match on ~60% of the 200 positions. But you want to declare them duplicates only if Jaccard ≥ 0.8. How do you sharpen the threshold?

With a single MinHash function, P[collision] = Jaccard — a linear, undiscriminating curve. Documents with Jaccard=0.3 collide 30% of the time. Documents with Jaccard=0.9 collide 90%. There is no sharp cutoff.

Locality-Sensitive Hashing (LSH) uses the and-or structure of bands to convert this gentle slope into a sharp S-curve.

The Banding Trick

Divide your n MinHash values into b bands, each of r rows. Two documents collide (are declared potential duplicates) if and only if at least one band is identical in all r rows.

P[collision | Jaccard = s] = 1 − (1 − sr)b

Where does this formula come from?

Deriving the Threshold

The threshold is the similarity s* where the collision probability equals 0.5. Solve:

1 − (1 − (s*)r)b = 0.5  ⇒  (s*)r = 1 − 0.51/b ≈ ln(2)/b  ⇒  s* = (ln(2)/b)1/r

Worked example from the dedup paper. The "Deduplicating Training Data Makes Language Models Better" paper (Lee et al. 2022) uses n=9000, b=20, r=450:

The "sharp" part: with r=450, the curve goes from P=0.05 to P=0.95 across a Jaccard range of only ~0.004. Almost a step function.

Tuning b and r:

python
def lsh_collision_prob(s, b, r):
    # P[collision | Jaccard = s] with b bands of r rows
    return 1 - (1 - s**r)**b

# b=5, r=10 (n=50 total hashes)
lsh_collision_prob(0.5, b=5, r=10)   # → 0.028  (5% Jaccard = rarely flagged)
lsh_collision_prob(0.8, b=5, r=10)   # → 0.672  (80% Jaccard = usually flagged)
lsh_collision_prob(0.9, b=5, r=10)   # → 0.987  (90% Jaccard = almost always)

# threshold where prob = 0.5:
threshold = (math.log(2) / b) ** (1 / r)   # ≈ 0.812 for b=5, r=10
LSH S-curve — collision probability vs. Jaccard similarity

Adjust bands b and rows r. Watch the S-curve sharpen. The orange vertical line marks the threshold (where P=0.5). Total hashes n = b × r is shown. Compare: single MinHash (r=1, b=n) gives a diagonal line — no sharpening.

b — number of bands 10
r — rows per band 10
In LSH banding: you increase r from 10 to 20 while holding b=10 fixed. What happens to the S-curve and why?

Chapter 7: Showcase: Full Dedup + Filter Pipeline

Let's build the complete pipeline. You start with raw extracted text from Common Crawl. By the time it reaches your model, it has survived language ID, quality filtering, exact deduplication, and near-deduplication. This showcase lets you tune each stage and watch the corpus transform.

The simulator runs on a synthetic corpus of 200 "documents" — each represented as a point in a 2D quality/similarity space. Quality (x-axis) is the fastText classifier score. Similarity (y-axis) pairs documents by near-duplicate status. Watch how each filter stage changes what survives.

Full pipeline simulator — 200 synthetic documents through all 4 stages

Each dot is a document. Color = survived all stages so far (green) or filtered (red/grey). Tune each stage. The bar at bottom shows tokens remaining and estimated model quality.

Language ID threshold (keep above) 0.50
Quality classifier threshold 0.35
Near-dedup Jaccard threshold (flag above) 0.80
Stage: all

What Real Pipelines Do

PipelineLanguageQualityExact dedupNear-dedupRetention
C4langdetect≥0.99Heuristics (Gopher-style)3-sentence spans~11%
CCNet/LLaMAfastText LIDKenLM perplexity top-1/3SHA-1 paragraphsMinHash doc-level~8%
FineWeblid.176 p≥0.65Gopher + custom rulesURL-levelMinHash b=9, r=13~17%
Dolmalid.176 p≥0.5Gopher rulesBloom filter paragraphsMinHash docs~12%
DCLMlid.176 p≥0.5fastText OpenHermesURL hashMinHash b=8, r=3~1.6%

DCLM's aggressive 1.6% retention is deliberate: the fastText classifier trained on high-quality instruction data sets a very high bar. The resulting 3.8T token corpus, despite being tiny relative to its raw input, trains models that outperform C4 and RefinedWeb on standard benchmarks — validating the "quality beats quantity" thesis.

The FineWeb insight. HuggingFace found that Gopher rules alone (applied to 95 CC dumps) outperformed all ML classifier approaches in their ablations. The intuition: Gopher rules reject the truly incoherent text (too many symbols, too short, repetitive), while ML classifiers over-filter diverse-but-valid writing that happens not to sound like Wikipedia. FineWeb keeps 17% — 3× more than DCLM — and achieves competitive benchmark scores.

Chapter 8: Toxicity, PII & Decontamination

Quality and deduplication are not the only reasons to remove documents. Two more concerns require their own pipeline stages: safety (removing content that teaches harmful behavior) and contamination (removing content that leaks evaluation data into training).

Toxicity Filtering

Training on toxic content does not just make the model capable of producing toxic text — it shifts the base distribution so that toxic completions become more probable even in non-toxic contexts. You want to filter before pretraining, not rely entirely on RLHF post-training to patch it.

Dolma's approach uses two fastText classifiers trained on the Jigsaw Toxic Comments dataset (2018): Wikipedia talk page comments labeled across six toxicity categories.

python
import fasttext

model = fasttext.load_model("jigsaw_fasttext_bigrams_nsfw_final.bin")
model.predict(["I love strawberries"])    # → (__label__non_nsfw, 0.999)
model.predict(["I hate strawberries"])    # → (__label__non_nsfw, 0.987) — hate ≠ toxic here
model.predict(["Stupid piece of ... die!"]) # → (__label__nsfw, 0.931)
False positives are costly. A classifier trained to detect toxic content will over-remove legitimate discussions of difficult topics: academic papers on hate speech, news articles about violence, survivor testimony. The tradeoff is threshold-sensitive. Dolma uses a relatively high threshold (low sensitivity) to preserve more data at the cost of some toxic content remaining, trusting post-training to handle the rest.

PII Removal

Personally Identifiable Information (PII) is a legal and ethical concern. Models trained on raw web text learn social security numbers, email addresses, phone numbers, and names paired with private contexts. The model can then regurgitate them on demand.

PII removal is typically done with regex patterns (email, phone, SSN, credit card patterns) or trained NER (Named Entity Recognition) models that identify personal names in context. Dolma runs regex-based PII removal before the quality filter.

python
import re

EMAIL_RE = rr'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
PHONE_RE = rr'\b(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
SSN_RE   = rr'\b\d{3}-\d{2}-\d{4}\b'

def remove_pii(text):
    text = re.sub(EMAIL_RE, "[EMAIL]", text)
    text = re.sub(PHONE_RE, "[PHONE]", text)
    text = re.sub(SSN_RE,   "[SSN]",   text)
    return text

Decontamination

Decontamination is the removal of evaluation benchmark examples from training data. If a model is trained on documents containing MMLU questions and their correct answers, its benchmark score is inflated — it memorized the test answers rather than reasoning toward them.

The standard approach: for each evaluation example, generate all n-grams of length 13 from the question text. Search the training corpus for documents containing any of these n-grams. Remove matching documents (or the matching spans). 13-grams are long enough to be specific but short enough to catch paraphrased repetitions.

python
def get_ngrams(text, n=13):
    words = text.lower().split()
    return {" ".join(words[i:i+n]) for i in range(len(words)-n+1)}

eval_ngrams = set()
for example in eval_set:
    eval_ngrams.update(get_ngrams(example.question))

def is_contaminated(doc):
    doc_ngrams = get_ngrams(doc.text)
    return bool(doc_ngrams & eval_ngrams)   # any overlap → contaminated
How much does it matter? Eleuther AI found that decontaminating The Pile against their evaluation suite changed benchmark scores by less than 1% in absolute terms. But for models specifically optimized on benchmarks (as many commercial models are), contamination can inflate numbers substantially. The honest thing to do is to always decontaminate and report it.
Why is 13-gram matching used for decontamination rather than, say, exact document matching?

Chapter 9: Connections & Data Cleaning Recipe

You now have every tool in the standard LLM data cleaning toolkit. Here is the full recipe, ordered as it runs in production:

1. Language ID
fastText lid.176. Keep target language p ≥ 0.5–0.65. Run first — cheapest filter, removes ~50% of raw CC.
2. Heuristic rules
Gopher rules: min 50 words, 80% alphabetic, mean word length 3–10, <10% ellipsis lines. Remove HTML artifacts, excessive symbols, cookie consent text not caught by extraction.
3. Quality classifier (optional)
fastText trained on quality positives (Wikipedia / OpenHermes / pages linked from Wikipedia). Threshold or stochastic keep. DCLM: aggressive. FineWeb: skips this stage.
4. PII removal
Regex patterns for emails, phones, SSNs. Optional NER for personal names in sensitive context.
5. Toxicity filter
fastText NSFW + hate classifiers (Dolma Jigsaw models). Remove or flag. Threshold tuning is a policy decision.
6. Exact dedup
Hash paragraphs or 3-sentence spans (MurmurHash). Group, keep one. Bloom filter for streaming membership testing.
7. Near-dedup (MinHash + LSH)
Shingle docs into 5-grams. Compute 200 MinHash signatures. Band into b bands of r rows. Documents sharing a band bucket → candidate pairs → remove the lower-quality copy.
8. Decontamination
Build 13-gram set from all eval benchmarks. Remove training docs that overlap. Report which benchmarks were decontaminated.

Algorithm Cheat Sheet

AlgorithmWhat it doesKey parameterComplexity
KenLM (n-gram LM)Perplexity of doc under target LMn-gram order, perplexity thresholdO(L) per doc
fastText classifierP(quality | doc) binaryClassifier thresholdO(L) per doc
DSIRpT(x)/pR(x) importance weightTarget T distributionO(L) per doc
Bloom filterSet membership (approx)m bits, k hash functionsO(k) per query
MurmurHash exact dedupRemove exact copiesItem granularity (doc/para/span)O(L) hash + O(1) lookup
MinHashEstimate Jaccard similarityn hash functionsO(S × n) where S=shingles
LSH bandingFind near-duplicate candidatesb bands, r rows, threshold s*=(ln2/b)^(1/r)O(N × n) total

What Comes Next

You now have a clean, deduplicated corpus. The next question is: how much of this corpus should you actually train on, and in what proportions? That is the data mix problem covered in Lecture 13. After that comes the architecture question (which this lesson series covers in Lectures 1–12) and then alignment: turning a base model into one that follows instructions and behaves safely (Lecture 15 and beyond).

LessonWhat it coversConnection to this lesson
CS336 L13Data sources, HTML extraction, data mixThe pipeline stage BEFORE filtering
CS336 L15 (upcoming)Alignment, RLHF, safety post-trainingWhat handles toxicity that filtering misses
RAGRetrieval-Augmented GenerationUses similar dedup/similarity search at query time
Vector DatabasesHNSW, IVF for approximate nearest neighborLSH is a simpler ANN algorithm — same goal, different mechanism
Feynman test. Can you explain to someone who has never programmed why a Bloom filter can have false positives but never false negatives? Can you derive the MinHash collision probability from the characteristic matrix? Can you tell someone how to choose b and r in LSH to hit a target similarity threshold? If yes, you have genuinely understood this lecture.
"Now you have the tools (the mechanics), just have to spend time with data (the intuitions)." — Percy Liang, CS336 Lecture 14