Data I: Sources & The Pipeline — Language Modeling from Scratch (CS336 L13)

Chapter 0: The Wasteland Problem

You just got access to 32 H100s. Someone hands you the latest Common Crawl snapshot — 3.5 petabytes of raw HTTP responses. Everything that was publicly accessible on the web in the last month. You're ready to train a world-class language model.

Then you start looking at the actual text. Page one: a navigation menu followed by a cookie consent banner, then three paragraphs of lorem ipsum. Page two: the same article repeated 47 times with slightly different URLs. Page three: a spam page in 14 different languages at once, auto-generated to game search engines. Page four: 2,000 tokens of JavaScript error messages that got accidentally included in the HTML-to-text conversion.

This is the reality of raw web data. Percy Liang put it bluntly: Common Crawl is not a dataset, it is a raw dump of the internet, and the internet is mostly garbage. After all filtering and deduplication, a typical pipeline retains somewhere between 1% and 15% of the raw text depending on quality thresholds. The rest is discarded.

The secret ingredient. All the major LLM developers publish their architecture. Many publish training code. Almost none publish their data pipeline in detail. Percy's observation: "Open-weight models like Llama 3 have full transparency into architecture and even training procedures, but basically no information on data." The reason? Two things: competitive advantage and copyright liability. Data is where the real moat is.

What makes a data pipeline hard? Three interacting challenges:

Acquisition at scale

Web crawls produce PBs of WARC/WET files. You can't inspect them manually. You need automated quality signals.

↓

Text extraction fidelity

HTML → plain text is lossy. Aggressive stripping loses structure. Weak stripping includes boilerplate. The right tool matters.

↓

Composition decisions

What fraction of tokens should be web text vs. books vs. code vs. math? This "data mix" is arguably the most impactful hyperparameter in all of LLM training.

The pipeline has five stages: acquire (get the raw data) → extract (convert to text) → filter (remove low quality) → deduplicate (remove copies) → mix (set domain proportions). This lecture covers acquire + extract, plus a first look at mix. Lecture 14 covers filter + deduplicate.

The data pipeline funnel: bytes surviving each stage

A typical Common Crawl snapshot starts at ~3.5 PB of raw WARC files. The final number that reaches the GPU is often 1–5% of the raw input. Default settings of the interactive funnel:

Language filter (keep English) 40%

Quality filter (heuristic rules) 30%

Deduplication (remove near-dups) 50%
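
The funnel arithmetic is just a product of per-stage retention rates. A minimal sketch, assuming the three percentages above are the fraction of tokens each stage keeps (that reading, and the names `STAGES` and `surviving_fraction`, are illustrative assumptions, not part of any published pipeline):

```python
from math import prod

# Assumed reading of the funnel defaults: fraction of tokens each stage KEEPS.
STAGES = {
    "language_filter": 0.40,
    "quality_filter": 0.30,
    "deduplication": 0.50,
}


def surviving_fraction(stages: dict[str, float]) -> float:
    """Multiply per-stage retention rates to get end-to-end retention."""
    return prod(stages.values())


if __name__ == "__main__":
    print(f"{surviving_fraction(STAGES):.1%} of the input survives")
```

With these three stages alone about 6% survives. Losses during HTML-to-text extraction are not modeled here, which is one reason real pipelines often land lower.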

Why do major LLM developers (including Meta/Llama) disclose architecture and training code but almost never their full data pipeline?

The data pipeline is too simple to be worth documenting.
Data pipelines are too complex to describe in a paper.
Competitive advantage (data mix decisions differentiate models) and legal liability (copyright exposure from training data sources).
Regulators require secrecy about training data.

Chapter 1: Training Stages & The Data They Need

Modern LLM training doesn't happen in a single pass. It unfolds in three stages — and each stage wants fundamentally different data with different quality, format, and quantity tradeoffs.

Pre-training is where the model learns the basic fabric of language: grammar, facts, reasoning patterns, world knowledge. It needs enormous quantity — trillions of tokens — with moderate quality. Raw web text works here because the sheer volume trains general language understanding even if individual documents are mediocre. Llama 3 used 15T tokens; Qwen3 used 36T.

Mid-training enhances specific capabilities after the base model exists. If you want better math reasoning, you feed the model more math. Better code? More code. Long context? Long documents. This stage uses smaller amounts (millions to low billions of tokens) of higher-quality, curated data targeted at specific capability gaps.

Post-training (fine-tuning + RLHF) shapes the model's behavior: instruction following, safety, formatting preferences, tone. It uses the smallest amounts (tens of thousands to a few million examples) of extremely high-quality data, often human-written or synthetically generated.

Big data → small data. The pattern across all modern LLMs is a funnel: start with a flood of lower-quality text, then progressively sharpen to smaller amounts of higher-quality signal. OLMo 2 from AI2 made this explicit with three named stages: Dolma pretraining (3T tokens), Dolmino mid-training (high-quality curated), and Tulu post-training (instruction data). Each stage roughly trims volume by 1000× but multiplies quality requirements.

The terminology matters. A base model is the result of pre-training plus mid-training — it can predict text well but doesn't necessarily follow instructions gracefully. An instruct/chat model has been further shaped by post-training to respond helpfully to natural language instructions. When you use the Claude API or ChatGPT, you're talking to an instruct model built on top of a base model.

Data needs cascade from what capabilities you want:

Capability	Data needed	Stage	Key sources
General language	Diverse web + books	Pre	Common Crawl, Wikipedia, books
Code generation	GitHub, StackOverflow	Pre + Mid	The Stack, StarCoder data
Math reasoning	arXiv, textbooks, math forums	Mid	PeS2o, ProofPile, MATH dataset
Long context (128K+)	Long documents (books, papers)	Mid	PG-19, Proof-Pile, long web docs
Instruction following	Q&A pairs, chat logs	Post	ShareGPT, Alpaca, Flan tasks
Safety / refusal	Red-team data, preference data	Post	Human annotations, RLHF

Misconception: more pre-training data is always better. Chinchilla (2022) showed that, for a fixed compute budget, you get better results from a smaller model trained on more data. That only holds while the extra data stays high-quality. Past that point, repeating epochs on lower-quality data hurts performance more than it helps. Data quantity and data quality are in tension: you want both, and you can't buy one with the other.

A team wants their model to excel at protein structure prediction (like AlphaFold). They have the base model trained on standard web + code data. What type of training stage and data would most effectively add this capability?

Pre-training from scratch on all-biology data.
Mid-training on curated biology literature: PubMed, bioRxiv, UniProt sequences + annotations, protein structure descriptions.
Post-training (RLHF) using human preferences about protein descriptions.
No additional training needed: the base model already has this from web data.

Chapter 2: Primary Sources — A Taxonomy

Where does the actual text come from? Every major training corpus draws from a small set of canonical sources. Let's characterize each one: what it contains, how many tokens it yields, the quality level, and the legal risk.

The Web (via Common Crawl)

The single largest source by orders of magnitude. Common Crawl has been running monthly since 2008, with about 100 snapshots by 2025. Each snapshot captures ~2–4 billion web pages, ~3–4 PB of raw data. After all filtering and deduplication, a single snapshot yields on the order of tens to a few hundred billion tokens of usable English text, depending on how aggressive the filters are. The quality variance is enormous: a StackOverflow answer and a spam blog post look the same to a crawler.

Wikipedia

Started in 2001. By 2024: 62 million articles across 329 language editions, with English, Spanish, German, and French the largest. Wikipedia is extremely clean, factually dense, and encyclopedic — but also narrow in topic range (only "notable" topics get articles) and dry in style. Most models up-weight Wikipedia far beyond its raw byte count because of its high information density.

A small number of contributors write most of Wikipedia. Steven Pruitt holds the English-language record with over 5 million edits. This makes Wikipedia simultaneously very high quality (committed editors who revert vandalism quickly) and potentially biased (the views of a relatively small, homogeneous group of power editors). Wikipedia also publishes full dumps every few weeks, which introduces a data poisoning vulnerability: an attacker can inject malicious edits right before a dump, before administrators have a chance to revert them.

Books

BooksCorpus (used by BERT): 7,000 self-published books from Smashwords, ~985M words. Taken down because it violated Smashwords's terms of service. Project Gutenberg: ~75K public-domain books, collected since 1971. High quality but old (mostly works published before ~1928, which are in the US public domain). Books3: 196K books from the shadow library Bibliotik, taken down due to copyright lawsuits; Meta was sued for using LibGen. Long, coherent narrative structure makes books especially valuable for training models that reason across long contexts.

Code (GitHub)

GitHub started in 2008 and was acquired by Microsoft in 2018, by which point it hosted at least 28M public repositories. The Stack (2022) cloned 137M repos, found 51 billion files, and retained 3.1 TB of permissively licensed (MIT, Apache) unique code. Code is valuable not just for coding tasks: folklore says code data improves logical reasoning, step-by-step following, and structured output quality, even in non-code domains.

Scientific Literature

arXiv: 2.3M preprints since 1991. LaTeX source is available, which is much cleaner than PDF extraction. PubMed Central: 5M papers mandated public by NIH for federally funded work. PeS2o (Semantic Scholar): 40M papers. Dense, high-reasoning text — but extremely domain-specific.

Q&A and Forums

StackExchange: started with StackOverflow in 2008, now 100+ topic sites. Uses reputation points and upvoting. Q&A format closely mirrors instruction tuning — questions resemble user prompts, top-voted answers resemble ideal responses. Reddit (via Pushshift): billions of posts and comments 2005–2023; GPT-2's WebText was defined as "pages linked from highly-upvoted Reddit posts" as a proxy for quality.

Source vs. capability matrix: which sources unlock which capabilities

Interactive figure: each cell shows how much each source contributes to a capability (brighter = more relevant).

You're building a model that needs excellent mathematical proof-writing. Which source combination would you prioritize in mid-training?

Large Common Crawl dump: it contains math content from millions of sites.
Wikipedia + books: they're the highest quality text overall.
arXiv LaTeX source + math StackExchange + ProofPile: dense formal reasoning in clean form.
GitHub code: logical structure in code transfers to math reasoning.

Chapter 3: Common Crawl — The Internet's Archive

Common Crawl is a non-profit started in 2007 that does something simple in principle and staggering in scale: it crawls the entire public web on a roughly monthly schedule, stores the raw HTTP responses, and makes them freely available.

How the crawl works

The crawler uses Apache Nutch, starting from a seed list of hundreds of millions of URLs. It downloads each page, extracts all outgoing links, adds them to a priority queue, and repeats. A 2016 crawl took 10–12 days running on 100 machines. Each crawl tries to diversify — both revisiting frequently-updated pages and discovering new domains.

Four crawler policies govern what happens:

Selection policy: which pages to download? (prioritize by PageRank, freshness, diversity)
Politeness policy: respect robots.txt; don't hammer any single server
Re-visit policy: how often to recrawl pages that change frequently
Dedup-avoidance policy: many URLs point to essentially the same content (URL parameters, tracking codes)
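
The politeness policy is the easiest of the four to make concrete. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are made up for illustration and do not describe Common Crawl's actual crawler configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch and spaces out its requests.
allowed = parser.can_fetch("ExampleBot", "https://example.com/articles/1")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data")
delay = parser.crawl_delay("ExampleBot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

In a real crawler the check would sit in front of the download step, and the per-host delay would feed the priority queue so that no single server is hit too often.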

Two output formats

Common Crawl releases several formats for each snapshot; two matter here:

Format	Contents	Size (per snapshot)	Best for
WARC	Raw HTTP response: full HTML + headers	~3–4 PB	Custom HTML→text pipelines
WET	Pre-converted plain text (lossy)	~200–300 TB	Quick experiments

The WET files are convenient but have a critical flaw: the HTML-to-text conversion Common Crawl does is lossy in the wrong direction. It strips too much structure (losing formatting cues like headers) and sometimes too little (keeping navigation menus, footers, cookie banners). The DCLM paper (2024) showed that using WARC + a better HTML extractor (trafilatura or resiliparse) improves downstream task accuracy measurably compared to using WET directly.

Misconception: WET files are ready to use. Many early papers used Common Crawl WET directly. This is a mistake. WET conversion is mediocre: it loses semantic structure and includes boilerplate. The DCLM team showed that custom HTML extraction from WARC improved downstream benchmark accuracy by a few points, just from better text extraction, with no model changes. Format choice is a hyperparameter.

Scale in numbers

A single April 2025 snapshot contains approximately:

~3.5 billion web pages crawled
~3.5 PB of WARC files
~1.4 trillion raw tokens (before any filtering), comparable to the single 2019 snapshot that C4 started from
~200B tokens of English text after language identification
~100–150B tokens after quality filtering
~50–80B unique tokens after deduplication

Common Crawl funnel: one snapshot from raw to training-ready

One Common Crawl snapshot: ~1.4T raw tokens. Interactive figure showing how many tokens survive each filtering stage, with real-world numbers from the C4, RefinedWeb, and FineWeb pipelines for reference. Default settings:

Language ID threshold (English confidence) p > 0.65

Quality filter aggressiveness medium

Why do most serious LLM data pipelines use WARC files rather than WET files from Common Crawl?

WARC files are smaller and faster to process.
WET files are not freely available.
WARC files preserve raw HTML, allowing higher-quality custom HTML→text extraction that outperforms Common Crawl's built-in lossy WET conversion.
WET files don't contain English text.

Chapter 4: HTML → Text Extraction

Every web page is HTML. You want plain text. This sounds trivial — just strip the tags. It isn't. HTML contains two kinds of text: content (the article, post, or document you care about) and boilerplate (navigation menus, footers, cookie banners, ad slots, sidebar links). A naive tag-stripper keeps both equally. A good extractor keeps almost only content.

The challenge: there's no explicit label telling you which text is content and which is boilerplate. You have to infer it from HTML structure (semantic tags, DOM depth, text density, link density). A paragraph of article text typically sits a few div layers deep and is mostly plain text. A navigation menu sits in an anchor-heavy node with very little text per link. The heuristic: high text density + low link density = likely content.
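
The link-density heuristic can be sketched with nothing but the standard library. This is a toy illustration, not how trafilatura or jusText are implemented: the block-tag list, the thresholds `min_chars=80` and `max_link_density=0.3`, and the names `BlockCollector` and `content_blocks` are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption, not a standard list).
BLOCK_TAGS = {"nav", "header", "footer", "article", "section",
              "div", "p", "ul", "li", "h1", "h2", "h3"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts how much of each is link text."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[tuple[str, int, int]] = []  # (text, chars, link_chars)
        self._parts: list[str] = []
        self._chars = 0
        self._link_chars = 0
        self._link_depth = 0

    def flush(self) -> None:
        """Close the current block and start a new one."""
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._chars, self._link_chars))
        self._parts, self._chars, self._link_chars = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # count non-whitespace characters only
        self._parts.append(data)
        self._chars += n
        if self._link_depth:
            self._link_chars += n


def content_blocks(html: str, min_chars: int = 80,
                   max_link_density: float = 0.3) -> list[str]:
    """Keep blocks that are long enough and are not mostly link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return [text for text, chars, link_chars in collector.blocks
            if chars >= min_chars and link_chars / chars <= max_link_density]


SAMPLE = """
<nav><a href="/home">Home</a> | <a href="/about">About</a> | <a href="/contact">Contact</a></nav>
<article><p>In the span of 25 million years, nearly all major animal phyla appeared
in the fossil record, including trilobites, molluscs, and chordates.</p></article>
<footer>Privacy Policy | Terms | Contact</footer>
"""

for block in content_blocks(SAMPLE):
    print(block)
```

On the sample page only the article paragraph survives: the nav block is almost entirely link text and the footer is too short. Production extractors combine many more signals, which is why the choice of tool matters.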

The tools that matter

trafilatura: Python library that uses density-based heuristics to find the main content block. Handles most mainstream news/blog formats well. Used by RefinedWeb (Falcon), FineWeb (HuggingFace).

resiliparse: Python library implemented in C++/Cython, very fast. Similar approach to trafilatura but better at some edge cases.

jusText: older algorithm, paragraph-level classification. Used by The Pile (Pile-CC). Nemotron-CC found it returned more tokens than trafilatura by being less aggressive about removing borderline paragraphs.

The extraction–quality tradeoff. More aggressive extraction = fewer tokens but cleaner. Less aggressive = more tokens but noisier. Nemotron-CC (2024) explicitly chose jusText over trafilatura because they needed more tokens (targeting 6.3T) while still maintaining acceptable quality. FineWeb chose trafilatura because they prioritized quality. Both decisions were rational given their different goals.

A concrete example: what extraction looks like

html — raw web page (boilerplate + content mixed)
<nav class="main-menu">
  <a href="/home">Home</a> | <a href="/about">About</a> | ...15 more links...
</nav>
<div id="cookie-banner">
  We use cookies to improve your experience. Accept | Decline
</div>
<article class="post-content">
  <h1>The Cambrian Explosion: 543 Million Years of Novelty</h1>
  <p>In the span of 25 million years — a geological eyeblink —
  nearly all major animal phyla appeared in the fossil record. Trilobites,
  molluscs, chordates, arthropods: all emerged from what came before:
  single-celled Ediacaran organisms barely distinguishable from mats.</p>
  <p>Why? The oxygen hypothesis holds that rising O<sub>2</sub> levels...</p>
</article>
<footer>
  © 2023 ScienceBlog.com | Privacy Policy | Terms | Contact | Sitemap
</footer>

text — after trafilatura extraction (content only)
The Cambrian Explosion: 543 Million Years of Novelty

In the span of 25 million years — a geological eyeblink —
nearly all major animal phyla appeared in the fossil record. Trilobites,
molluscs, chordates, arthropods: all emerged from what came before:
single-celled Ediacaran organisms barely distinguishable from mats.

Why? The oxygen hypothesis holds that rising O2 levels...

python — using trafilatura for extraction
import trafilatura
from warcio.archiveiterator import ArchiveIterator

def extract_text(html_bytes: bytes) -> str | None:
    """
    Extract main content from raw HTML.
    Returns None if no content found (e.g., pure navigation page).
    Input:  raw HTML bytes from WARC file
    Output: clean plain text string, or None
    """
    # include_tables=True preserves table structure as text
    # favor_precision=True: when in doubt, discard (quality-first)
    return trafilatura.extract(
        html_bytes,
        include_tables=True,
        favor_precision=True,
        include_links=False,     # strip hyperlinks
        include_images=False,    # leave out images
    )  # may be None for navigation-only pages

def process(text: str, url: str | None) -> None:
    """Placeholder for the downstream step (filtering, dedup, sharding)."""
    print(url, len(text))

# Reading a WARC file
with open('CC-MAIN-2024-10-0001.warc.gz', 'rb') as f:
    for record in ArchiveIterator(f):
        if record.rec_type != 'response':
            continue  # skip request and metadata records
        html = record.content_stream().read()
        text = extract_text(html)
        if text and len(text) > 200:  # skip very short pages
            url = record.rec_headers.get_header('WARC-Target-URI')
            process(text, url=url)

HTML vs extracted text: what gets kept and what gets stripped

Interactive figure: a sample web page shown as raw HTML (with all boilerplate highlighted in red) and as extracted clean text. The boilerplate-to-content ratio is typical for a blog or news site.

The DCLM paper found that using WARC + trafilatura instead of WET files improved downstream benchmark accuracy. What is the most likely reason?

WARC files contain more data per page.
trafilatura uses better compression, giving faster training.
trafilatura removes boilerplate (menus, footers, ads) more accurately, resulting in cleaner training signal: less noise about cookie policies and navigation menus, more actual article content.
WET files include duplicate pages that need to be deduplicated manually.

Chapter 5: Data Mix & Domain Weights

You have clean text from many sources: 400B tokens of filtered web, 30B tokens of code, 25B tokens of books, 4B tokens of Wikipedia, 8B tokens of arXiv papers, 5B tokens of StackExchange. You want to train on 1T tokens total. Which sources do you oversample? Which do you undersample? What fraction of the training batch should come from each domain?

This is the data mix problem, and it is arguably the single most important hyperparameter decision in pretraining — more impactful than learning rate or optimizer settings in many ablations. A model trained on 50% web + 50% code will be excellent at coding and mediocre at general knowledge. A model trained on 98% web + 2% code will have the reverse profile.

How real datasets set domain weights

The Pile (2021) used 22 high-quality curated domains. The designers made explicit choices about oversampling: Wikipedia was oversampled 3× its natural frequency because of its high information density. GitHub code was oversampled 2×. The result: a model (GPT-J, GPT-NeoX) with unusually strong coding and scientific reasoning for its era.

GPT-3 used roughly: 60% Common Crawl (processed), 22% WebText2 (Reddit-linked pages), 16% books (8% each for Books1 and Books2), and 3% Wikipedia (the paper's rounded figures sum to slightly over 100%). This heavy web weighting is partly why GPT-3 was fluent and broad but not as technically precise as Codex (which was fine-tuned on code).

Model / dataset	Size	Web %	Code %	Books %	Wiki %	Other %
GPT-3 (2020)	300B tokens (trained)	~60%	n/a	~16%	~3%	~22% (WebText2)
The Pile (2021)	825 GB	~40%	~11%	~18%	~4%	~27%
LLaMA (2023)	1.2T tokens	~67%	~5%	~4%	~4%	~20%
Dolma (2024)	3T tokens	~60%	~8%	~2%	~3%	~27%
Llama 3 (2024)	15T tokens	~50%	~17%	~3%	~2%	~28%

Misconception: domain proportions = data file sizes. The natural frequency of each source — its raw byte count — is NOT the same as its training weight. Wikipedia is 20–30GB of English text: tiny compared to a 3TB web dump. But almost every model significantly oversamples Wikipedia because high-quality text is scarce and valuable. A 3× oversample rate means each Wikipedia document is seen three times per epoch while a web document is seen once.

The "data-constrained" regime: once you repeat data (epochs > 1), each additional pass helps less than the same number of fresh tokens would. The Chinchilla-era result suggests models benefit from fresh tokens even if those tokens are slightly lower quality than repeating your best data. The repetition discount formula from Lecture 11 quantifies this tradeoff:

D_eff = D_unique · r^−0.7

Here r is the repetition factor, the number of times each token is seen. At r = 3 the discount factor is 3^−0.7 ≈ 0.46, so under this formula the effective data is ≈ 0.46 × D_unique even though three times as many tokens were processed. Past r ≈ 4, further repetition adds very little.
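
To get a feel for how quickly the discount bites, the factor r^−0.7 can be tabulated directly. A small sketch of the formula exactly as written in these notes; the function name `repetition_discount` is illustrative and the 0.7 exponent is taken from the text above, not re-derived:

```python
def repetition_discount(r: float, exponent: float = 0.7) -> float:
    """Discount factor r**(-exponent), as written in the notes' formula."""
    if r < 1:
        raise ValueError("repetition factor must be >= 1")
    return r ** -exponent


if __name__ == "__main__":
    for r in (1, 2, 3, 4, 8):
        print(f"r={r}: discount factor = {repetition_discount(r):.2f}")
```

At r = 1 there is no discount, and at r = 3 the factor is about 0.46, matching the worked number above.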

python — data-mix sampler (weighted random draw across domains)
import random
from typing import Iterator

MIX = {
    "web":    {"weight": 0.60, "tokens_avail": 400e9},  # 400B tokens
    "code":   {"weight": 0.15, "tokens_avail":  30e9},
    "books":  {"weight": 0.10, "tokens_avail":  25e9},
    "wiki":   {"weight": 0.08, "tokens_avail":   4e9},  # heavily oversampled
    "arxiv":  {"weight": 0.04, "tokens_avail":   8e9},
    "stackex":{"weight": 0.03, "tokens_avail":   5e9},
}

def mix_sampler(num_draws: int) -> Iterator[str]:
    """
    Yields one domain name per draw, according to MIX weights.
    Caller fetches the next doc from that domain's iterator.
    In production: each domain has a shuffled, tokenized shard list,
    and the loop stops on a token budget rather than a draw count.
    """
    domains = list(MIX.keys())
    weights = [MIX[d]["weight"] for d in domains]
    for _ in range(num_draws):
        yield random.choices(domains, weights=weights, k=1)[0]

def repeat_factor(domain: str, total_tokens: float) -> float:
    """Passes over a domain = weight * total budget / tokens available."""
    return MIX[domain]["weight"] * total_tokens / MIX[domain]["tokens_avail"]

# repeat_factor("wiki", 1e12) = 0.08 * 1e12 / 4e9 = 20: Wikipedia repeated 20 times!

You have 4B tokens of Wikipedia and want to train on 1T tokens total with an 8% Wikipedia weight. How many times will each Wikipedia token be seen during training?

1 time (it's 4B out of 1T = 0.4%, so it's undersampled)
2 times
8 times
20 times (0.08 × 1T = 80B Wikipedia tokens needed; 80B / 4B available = 20×)

Chapter 6: Showcase: Data Mix Explorer

This is the payoff simulation. Set domain weights with the sliders and see the composition pie chart update live. The capability bars predict — based on known empirical patterns from real datasets — which capabilities the resulting mix tends to produce.

How to read this. The capability predictions are qualitative approximations from published ablation studies (The Pile, DCLM, Llama 3 technical report). They show relative expected quality, not absolute scores. Real outcomes depend on quality filtering and deduplication too. This is a mental model tool, not a prediction engine.

Data Mix Explorer: compose your training corpus

Interactive figure: domain weights auto-normalize to 100%, and a side panel shows predicted capability levels for the resulting mix. Default weights:

Web 60
Code 15
Books 10
Wiki 8
Math/Sci 4
Q&A 3
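
The auto-normalization is simple to reproduce offline. A minimal sketch, assuming that "proportionally reducing the other domains" means scaling them by a common factor; the function names `normalize` and `set_weight` and the dictionary keys are illustrative:

```python
def normalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def set_weight(weights: dict[str, float], domain: str,
               new_share: float) -> dict[str, float]:
    """Set one domain's share and shrink or grow the rest by a common factor."""
    base = normalize(weights)
    rest = 1.0 - base[domain]
    if rest <= 0:
        raise ValueError("no other domains to rescale")
    scale = (1.0 - new_share) / rest
    out = {name: w * scale for name, w in base.items()}
    out[domain] = new_share
    return out


DEFAULT = {"web": 60, "code": 15, "books": 10, "wiki": 8, "math_sci": 4, "qa": 3}

shifted = set_weight(DEFAULT, "math_sci", 0.40)
for name, share in shifted.items():
    print(f"{name:9s} {share:.1%}")
```

Raising math/science from 4% to 40% scales every other domain by 0.60 / 0.96 = 0.625, so web falls from 60% to 37.5%.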

You increase the math/science domain weight from 4% to 40%, proportionally reducing all other domains. Which capability is MOST likely to degrade noticeably?

Mathematical reasoning (the model gets confused by too much math)
Code generation (it can't generate code anymore)
General factual knowledge and conversational fluency (less diverse web text = narrower world knowledge and stiffer prose)
Safety (the model becomes less safe)

Chapter 7: Major Datasets — From BERT to Nemotron

The progression of LLM training datasets tells the story of the field's priorities: more data, cleaner data, more diverse data, then smarter-filtered data.

BERT (2019): Books + Wikipedia

BooksCorpus (985M words, 7K self-published books) + English Wikipedia. Sequences are documents (not sentences — this matters for learning long-range dependencies). Simple and clean. No web data at all.

GPT-2 WebText (2019): Reddit-curated web

Outgoing links from Reddit posts with ≥3 karma serve as a proxy for "humans found this worth reading." 8M pages, 40GB. Open replication: OpenWebText, which used Facebook's fastText for English filtering and removed near-duplicates. The Reddit curation is a clever quality signal that doesn't require a classifier.
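
The karma rule amounts to a threshold plus URL deduplication. A toy sketch; the `Submission` record and its fields are a hypothetical stand-in for whatever a Reddit dump provides, and the URLs are made up:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """Hypothetical minimal view of a Reddit post that links out to a page."""
    url: str
    karma: int


def select_urls(submissions: list[Submission], min_karma: int = 3) -> list[str]:
    """Keep outbound URLs from posts with enough karma, first occurrence only."""
    seen: set[str] = set()
    kept: list[str] = []
    for sub in submissions:
        if sub.karma >= min_karma and sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


posts = [
    Submission("https://example.com/essay", 12),
    Submission("https://example.com/spam", 0),
    Submission("https://example.com/essay", 5),   # same page linked twice
    Submission("https://example.org/tutorial", 3),
]
print(select_urls(posts))
```

The selected URLs would then be fetched and run through text extraction like any other page.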

GPT-3 (2020): 570 GB of filtered Common Crawl (~400B tokens)

Common Crawl (processed) + WebText2 + Books1/Books2 (mysterious internet-sourced books) + Wikipedia. Trained a quality classifier to distinguish {WebText, Wikipedia, Books1, Books2} from random CC text — a landmark move. Fuzzy deduplication against WebText and benchmarks. First dataset at ~hundreds-of-billions-tokens scale.

The Pile (2021): 825 GB, 22 curated domains

Grassroots effort from EleutherAI, coordinated on Discord. Key sources: Pile-CC (custom WARC extraction with jusText), PubMed Central, arXiv (LaTeX!), Books3 (196K books from Bibliotik), Project Gutenberg, StackExchange, GitHub, Enron emails. Books3 has since been taken down due to copyright lawsuits. This dataset powered GPT-J and GPT-NeoX.

C4 (2019): 806 GB, 156B tokens

Started with one CC snapshot (1.4T raw tokens). Manual heuristics to filter: keep lines ending in punctuation with ≥5 words, remove pages with <3 sentences, remove "bad word" list hits, remove pages containing '{' (eliminates code), remove lorem ipsum / terms of use pages, keep English at p≥0.99 (langdetect). The most aggressive rule-based pipeline of its era. Powered T5.

LLaMA (2023): 1.2T tokens

CommonCrawl + CCNet pipeline + C4 + GitHub + Wikipedia (20 languages, Jun–Aug 2022) + Project Gutenberg + Books3 + arXiv (inline macros expanded) + StackExchange top 28 sites sorted by score. Reproduced by Together's RedPajama v1. Cerebras SlimPajama: 627B token subset after deduplication with MinHashLSH.

RefinedWeb + FineWeb (2023–2024)

RefinedWeb (Falcon, 2023): "web data is all you need" thesis. trafilatura for WARC extraction. Gopher quality rules. Fuzzy dedup (MinHash over 5-grams). Released 600B tokens out of a 5T-token pipeline. FineWeb (HuggingFace, 2024): started as RefinedWeb replication, improved with 95 CC dumps, URL filtering, p(en) > 0.65 language threshold, additional Gopher + C4 rules, PII anonymization (email + IP addresses). Result: 15T tokens.
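
Fuzzy deduplication with MinHash is covered properly in Lecture 14, but the core idea fits in a short sketch. This is an illustration under stated assumptions, not the RefinedWeb implementation: it uses word-level 5-grams, 128 hash functions built from salted BLAKE2b, and no LSH banding, and the function names are made up for the example.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Word-level n-grams of a document (assumed granularity)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(shingle: str, seed: int) -> int:
    """One member of a family of hash functions, selected by the seed."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 128, n: int = 5) -> list[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    if not grams:
        raise ValueError("empty document")
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = ("the quick brown fox jumps over the lazy dog while the cat "
         "sleeps on the warm windowsill all afternoon long")
doc_b = doc_a + " today"  # near-duplicate
doc_c = ("completely different text about training large language models "
         "on filtered web data from common crawl snapshots")

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

Near-duplicates share most signature slots while unrelated documents share almost none, so a similarity threshold on signatures can flag copies without comparing full texts pairwise.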

Dolma (2024): 3T tokens

AI2 open-source dataset for OLMo. Multi-source: CC (Gopher + C4 rules), Reddit (Pushshift 2005–2023), PeS2o (40M Semantic Scholar papers), C4, Gutenberg, Wikipedia/Wikibooks. Explicit toxicity filtering with Jigsaw classifier. Bloom filter deduplication. No model-based quality classifier (explicitly avoids it to reduce systematic biases).
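
A Bloom filter trades a small false-positive rate for a fixed memory footprint, which is what makes it attractive for deduplication at this scale. A toy sketch of the idea; the bit-array size, the number of hash functions, and paragraph-level keys are assumptions for illustration and are not Dolma's actual settings:

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size set membership with no false negatives and rare false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def dedup_paragraphs(paragraphs: list[str], bloom: BloomFilter) -> list[str]:
    """Drop paragraphs whose normalized text has (probably) been seen before."""
    kept = []
    for para in paragraphs:
        key = " ".join(para.split()).lower()
        if key in bloom:
            continue
        bloom.add(key)
        kept.append(para)
    return kept


paras = ["Subscribe to our newsletter.", "A unique paragraph.",
         "subscribe to  our newsletter."]
print(dedup_paragraphs(paras, BloomFilter()))
```

Because membership is only probabilistic, a small fraction of unique text is dropped by mistake. At trillion-token scale that is usually an acceptable price for constant memory.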

DCLM (2024): 3.8T tokens

DataComp-LM: benchmark for data processing algorithms. Processed CC to DCLM-pool (240T tokens). DCLM-baseline: trained a fastText classifier using positive examples (OpenHermes-2.5 GPT-4-generated data + ELI5 subreddit) vs negative examples (random RefinedWeb). Applied classifier to all of DCLM-pool. Quality classifier outperforms all rule-based methods on downstream benchmarks.

Nemotron-CC (2024): 6.3T tokens

NVIDIA's response to DCLM. Motivation: FineWebEdu and DCLM filter 90% of data, leaving too few tokens for frontier training. jusText for extraction (more tokens than trafilatura). Classifier ensemble: Nemotron-340B-instruct scored documents by educational value, distilled to fast model; combined with DCLM classifier. Synthetic data: for low-quality docs, LM rephrases them; for high-quality, LM generates QA pairs. Result: 6.3T tokens (1.1T high-quality subset).

The scale progression. BERT: ~3B tokens. GPT-3: 300B. The Pile: 300B. LLaMA: 1.2T. Dolma: 3T. FineWeb: 15T. Llama 3: 15T. Qwen3: 36T. Each order-of-magnitude jump required fundamentally new thinking about filtering and pipeline quality — not just more storage. "Moar data" only works when quality scales with it.

C4 explicitly removes pages containing the character '{'. What kind of content does this filter remove, and what is the tradeoff?

It removes pages with math formulas; tradeoff is losing scientific content.
It removes code and structured data (JSON, CSS, code samples); tradeoff is that C4 has no code, which is why LLaMA needed to add GitHub separately.
It removes pages with quotations; tradeoff is losing literary text.
It removes foreign language text; tradeoff is losing multilingual content.

Chapter 8: Legal Landscape — Copyright, Licenses & Fair Use

The moment you download web data, you've copied copyrighted material. Almost everything on the internet is copyrighted — the threshold for copyright protection is extremely low. Your website is copyrighted the moment you write it. You don't need to register (unlike patents). This creates an unavoidable legal exposure for any organization training on web data.

Copyright law basics

In the United States, copyright law derives primarily from the Copyright Act of 1976. Key facts:

Copyright applies to "original works fixed in any tangible medium" — includes text, images, code, music
Copyright is automatic: no registration needed for protection
Registration is required before a creator can sue for infringement ($65 to register)
Duration: life + 70 years (individual), 95 years from publication (corporate). Works before ~1928 are in the public domain.
Copyright protects expression, not ideas: the implementation of quicksort is copyrightable, the concept of quicksort is not

Why Project Gutenberg is safe. Gutenberg only hosts books whose US copyright has expired: works published before ~1928, plus later works (published before 1964) whose copyright was never renewed. This is why it has only ~75K books, a tiny fraction of all books ever written, but with essentially no legal exposure. Every other book source involves legal risk.

Licenses: the safe path

A license is essentially "a promise not to sue." Creative Commons licenses enable free distribution of copyrighted work with various conditions. Wikipedia, OpenCourseWare, and Khan Academy content are released under Creative Commons licenses. Many model developers pay for licenses:

Google + Reddit (2024): content licensing deal for AI training
OpenAI + Shutterstock: six-year licensing partnership
OpenAI + StackExchange: partnership for training data access

For code: permissive licenses (MIT, Apache 2.0) allow reuse with minimal conditions, typically attribution and keeping the license notice. GPL is "copyleft": it arguably "infects" anything derived from it. Most code data pipelines (The Stack, StarCoder) filter to permissively licensed repositories only.

Fair use: the contested path

Fair use (Section 107 of the Copyright Act) allows use of copyrighted material without permission in certain circumstances. Four factors courts weigh:

Factor	Favors fair use when...	LLM training relevance
1. Purpose & character	Educational, transformative, non-commercial	Training is transformative; commercial use is a negative factor
2. Nature of work	Factual, published, non-creative	News articles favor fair use more than novels
3. Amount used	Small snippet, not the "heart" of work	Whole-document ingestion is a strong negative factor
4. Market effect	Does not substitute for the original market	Most contested factor: LLMs can reduce demand for original works

The active lawsuits. The New York Times sued OpenAI and Microsoft (2023), claiming ChatGPT can reproduce NYT articles near-verbatim (market substitution). Class-action suits from book authors (George R.R. Martin, Jodi Picoult) against OpenAI. These cases will define the legal framework for LLM training data for years. The current legal uncertainty is a major reason why companies don't publish their data sources.

The central argument for fair use: training an ML model is transformative in a way that mere copying is not. The model doesn't store the training text; it updates weights to capture patterns (the ideas), not the verbatim expression. The counter-argument: if the model can reproduce copyrighted text (as demonstrated with NYT articles), then at least some of the expression was memorized. Both arguments have merit. Courts will decide.

Terms of service: additional restrictions

Even if you have a license or fair use applies, a website's terms of service may prohibit scraping. Reddit's ToS prohibits commercial use of its content, yet Reddit was scraped for years (Pushshift) and now sells API access. robots.txt is advisory, not legally binding (in the US), but ignoring it breaks crawler politeness norms and occasionally contributes to ToS violation claims.
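To make the robots.txt point concrete, here is a small sketch of how a polite crawler might check a URL before fetching it, using Python's standard-library `urllib.robotparser`. The robots.txt content, the user-agent names, and the `allowed` helper are hypothetical, and the file is parsed from a string so the example runs offline; a real crawler would download it from the site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere except
# /private/, and every other crawler is disallowed entirely.
ROBOTS_TXT = """\
User-agent: example-bot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("example-bot", "https://example.com/articles/1"))
print(allowed("example-bot", "https://example.com/private/data"))
print(allowed("other-bot", "https://example.com/articles/1"))
```

Nothing enforces the result of this check. Honoring it is a choice the crawler operator makes, which is exactly why robots.txt is a norm rather than a legal barrier.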

A model trained on web data can reproduce a New York Times article almost verbatim when prompted. This is most relevant to which fair use factor?

Factor 1 (purpose and character): it shows the training wasn't transformative.
Factor 2 (nature of work): news articles are factual so they have less protection.
Factor 3 (amount used): it shows the model used too much of the original work.
Factor 4 (market effect): a user who gets the full article from ChatGPT has less reason to visit nytimes.com, directly harming the market for the original work.

Chapter 9: Connections & What's Next

You now have the full picture of the data acquisition and extraction layer. Before we connect forward, here's the landscape you've covered:

Concept	Key insight	Where it matters most
Training stages	Pre → Mid → Post: volume drops, quality rises by 1000×	Resource allocation, dataset strategy
Common Crawl	~3.5 PB/snapshot, ~1–15% usable after pipeline	Every large-scale pretraining corpus
WARC vs WET	WARC + custom extractor beats WET by multiple benchmark points	Pipeline choice for CC-based datasets
HTML extraction	trafilatura (quality) vs jusText (quantity) tradeoff	CC pipelines; DCLM, RefinedWeb, FineWeb
Data mix	Domain weights ≠ byte fractions; Wiki oversampled 20×	Every pretraining run
Legal landscape	Most web data is copyrighted; fair use contested for LLMs	Dataset release decisions, company strategy

What Lecture 14 covers (Data II)

Deduplication and quality filtering — the two stages we flagged as "Lecture 14 material." You'll see:

Exact deduplication: MD5/SHA hashes of normalized documents
Fuzzy deduplication: MinHash + LSH for near-duplicate detection (Jaccard similarity over n-grams)
Quality classifiers: fastText classifiers trained on {good data, random CC} to score quality
Heuristic filters: Gopher rules (alpha ratio, line length, etc.), C4 rules, Dolma rules
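As a preview of the first two items, here is a minimal sketch of exact deduplication by hashing and of the Jaccard similarity that MinHash approximates. It is not the Lecture 14 implementation: the normalization rule (lowercase, collapse whitespace), the choice of SHA-256, word-level 3-grams, and all function names are assumptions made for illustration. Real pipelines estimate the Jaccard value with MinHash + LSH rather than computing it exactly for every pair.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedup(docs):
    """Keep the first document seen for each SHA-256 digest of normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def ngrams(text, n=3):
    """Set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b, n=3):
    """Exact Jaccard similarity: shared n-grams divided by all distinct n-grams."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # case/whitespace variant
    "The quick brown fox jumps over the lazy cat",   # near-duplicate
]
unique = exact_dedup(docs)
similarity = jaccard(docs[0], docs[2])
print(len(unique), similarity)
```

The second document collapses into the first under exact deduplication, while the third survives it: the two share 6 of 8 distinct 3-grams (Jaccard 0.75), which is the kind of pair fuzzy deduplication is meant to catch.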

How data shapes capabilities: a summary

Figure: Data pipeline timeline, showing when each major dataset appeared. The history of LLM training data, 2019–2024; each entry shows approximate token count and the key innovation.

Cheat sheet

Data pipeline stages

Acquire: web crawl (WARC) or API/dump
Extract: HTML → text (trafilatura/resiliparse)
Filter: language ID, heuristics, classifier (Lec 14)
Deduplicate: MinHash LSH, Bloom filters (Lec 14)
Mix: set domain weights, oversample quality sources

Key numbers to remember

CC snapshot: ~3.5 PB raw, ~1–15% usable
C4: 1.4T → 156B tokens (11% retention)
FineWeb: 95 CC dumps → 15T tokens
Nemotron-CC: 240T pool → 6.3T (2.6%)
Wikipedia: 4–20B tokens, oversampled 3–20×
GitHub (The Stack): 137M repos → 3.1 TB
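The retention percentages above are simple ratios, and an oversampling factor is the ratio of a source's weight in the training mix to its natural share of the pool. The sketch below reproduces the C4 and Nemotron-CC figures from the token counts quoted above. The oversampling inputs (0.2% natural share, 4% mix weight) are hypothetical values chosen only to illustrate the calculation, and the function names are assumptions.

```python
def retention_pct(kept_tokens, raw_tokens):
    """Percentage of the raw token pool that survives the pipeline."""
    return 100.0 * kept_tokens / raw_tokens


def oversample_factor(mix_weight, natural_share):
    """How many times more often a source is sampled than its size implies."""
    return mix_weight / natural_share


# Token counts quoted in the cheat sheet.
c4 = retention_pct(156e9, 1.4e12)
nemotron_cc = retention_pct(6.3e12, 240e12)

# Hypothetical mix: a source that is 0.2% of the pool but 4% of training tokens.
wiki_factor = oversample_factor(mix_weight=0.04, natural_share=0.002)

print(f"C4: {c4:.1f}%  Nemotron-CC: {nemotron_cc:.1f}%  oversample: {wiki_factor:.0f}x")
```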

Links to related Gleams

CS336 Lec 12 — Evaluation: benchmarks, contamination (train on benchmark answers = inflated scores)
CS336 Lec 1 — Tokenization: how raw text becomes tokens for model input
RAG: retrieval-augmented generation — a different approach to incorporating fresh data at inference time

Percy's summary maxim. "Data does not fall from the sky. You have to work to get it. Live service → raw data → processed data (conversion, filtering, deduplication). Data is the key ingredient that differentiates language models. Much of this pipeline is heuristic — many opportunities to improve."

Which statement best captures the key lesson of this lecture?

More data always produces better models, so the pipeline should maximize token count at all costs.
Architecture is the primary differentiator between LLMs; data is a secondary concern.
Raw web is mostly junk; the data pipeline (acquisition, extraction, filtering, deduplication, mixing) is where model capability is actually determined, not the architecture.
Copyright law makes LLM training from web data illegal in all jurisdictions.