Language Modeling from Scratch · CS336 · Lecture 13

Data I: Sources & The Pipeline

You have 32 H100s and access to Common Crawl. Congratulations — you own a wasteland. Raw web data is ~60% junk: scraped ads, boilerplate navigation menus, duplicate pages, garbled text. The model you get is only as good as the data you curate. This lesson traces every major LLM training corpus from BERT to Nemotron: where the text came from, how HTML becomes clean tokens, why the data mix is a secret weapon, and what legal landmines await. Deduplication and quality filtering come in Lecture 14. Today: acquisition + extraction.

Prerequisites: CS336 Lec 1 (language models, tokens, pretraining). Basic probability.
10 chapters · 5 live canvases · real dataset numbers

Chapter 0: The Wasteland Problem

You just got access to 32 H100s. Someone hands you the latest Common Crawl snapshot — 3.5 petabytes of raw HTTP responses. Everything that was publicly accessible on the web in the last month. You're ready to train a world-class language model.

Then you start looking at the actual text. Page one: a navigation menu followed by a cookie consent banner, then three paragraphs of lorem ipsum. Page two: the same article repeated 47 times with slightly different URLs. Page three: a spam page in 14 different languages at once, auto-generated to game search engines. Page four: 2,000 tokens of JavaScript error messages that got accidentally included in the HTML-to-text conversion.

This is the reality of raw web data. Percy Liang put it bluntly: Common Crawl is not a dataset, it is a raw dump of the internet, and the internet is mostly garbage. After all filtering and deduplication, a typical pipeline retains somewhere between 1% and 15% of the raw text depending on quality thresholds. The rest is discarded.

The secret ingredient. All the major LLM developers publish their architecture. Many publish training code. Almost none publish their data pipeline in detail. Percy's observation: "Open-weight models like Llama 3 have full transparency into architecture and even training procedures, but basically no information on data." The reason? Two things: competitive advantage and copyright liability. Data is where the real moat is.

What makes a data pipeline hard? Three interacting challenges:

Acquisition at scale
Web crawls produce PBs of WARC/WET files. You can't inspect them manually. You need automated quality signals.
Text extraction fidelity
HTML → plain text is lossy. Aggressive stripping loses structure. Weak stripping includes boilerplate. The right tool matters.
Composition decisions
What fraction of tokens should be web text vs. books vs. code vs. math? This "data mix" is arguably the most impactful hyperparameter in all of LLM training.

The pipeline has five stages: acquire (get the raw data) → extract (convert to text) → filter (remove low quality) → deduplicate (remove copies) → mix (set domain proportions). This lecture covers acquire and extract, plus the mix in Chapter 5. Lecture 14 covers filter and deduplicate.

The data pipeline funnel — bytes surviving each stage

A typical Common Crawl snapshot starts at ~3.5 PB of raw WARC files. Drag the funnel sliders to see how many tokens survive. The final number that reaches the GPU is often 1–5% of the raw input.

Language filter (keep English) 40%
Quality filter (heuristic rules) 30%
Deduplication (remove near-dups) 50%
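The funnel arithmetic is just a product of per-stage keep rates. A minimal sketch, assuming the three slider defaults above are independent keep fractions (the names `STAGES` and `surviving_fraction` are illustrative, not from the lecture):

```python
from math import prod

# Keep fraction per stage, taken from the slider defaults above.
STAGES = {
    "language_filter": 0.40,
    "quality_filter": 0.30,
    "deduplication": 0.50,
}

def surviving_fraction(stages: dict[str, float]) -> float:
    """Fraction of the input that survives every stage, assuming independence."""
    return prod(stages.values())

if __name__ == "__main__":
    print(f"{surviving_fraction(STAGES):.1%} survives the three filters")
```

With these defaults the three filters alone keep 6% of their input, before counting anything lost during HTML extraction.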
Why do major LLM developers (including Meta/Llama) disclose architecture and training code but almost never their full data pipeline?

Chapter 1: Training Stages & The Data They Need

Modern LLM training doesn't happen in a single pass. It unfolds in three stages — and each stage wants fundamentally different data with different quality, format, and quantity tradeoffs.

Pre-training is where the model learns the basic fabric of language: grammar, facts, reasoning patterns, world knowledge. It needs enormous quantity — trillions of tokens — with moderate quality. Raw web text works here because the sheer volume trains general language understanding even if individual documents are mediocre. Llama 3 used 15T tokens; Qwen3 used 36T.

Mid-training enhances specific capabilities after the base model exists. If you want better math reasoning, you feed the model more math. Better code? More code. Long context? Long documents. This stage uses smaller amounts (millions to low billions of tokens) of higher-quality, curated data targeted at specific capability gaps.

Post-training (fine-tuning + RLHF) shapes the model's behavior: instruction following, safety, formatting preferences, tone. It uses the smallest amounts (tens of thousands to a few million examples) of extremely high-quality data, often human-written or synthetically generated.

Big data → small data. The pattern across all modern LLMs is a funnel: start with a flood of lower-quality text, then progressively sharpen to smaller amounts of higher-quality signal. OLMo 2 from AI2 made this explicit with three named stages: Dolma pretraining (3T tokens), Dolmino mid-training (high-quality curated), and Tulu post-training (instruction data). Each stage roughly trims volume by 1000× but multiplies quality requirements.

The terminology matters. A base model is the result of pre-training plus mid-training — it can predict text well but doesn't necessarily follow instructions gracefully. An instruct/chat model has been further shaped by post-training to respond helpfully to natural language instructions. When you use the Claude API or ChatGPT, you're talking to an instruct model built on top of a base model.

Data needs cascade from what capabilities you want:

Capability | Data needed | Stage | Key sources
General language | Diverse web + books | Pre | Common Crawl, Wikipedia, books
Code generation | GitHub, StackOverflow | Pre + Mid | The Stack, StarCoder data
Math reasoning | arXiv, textbooks, math forums | Mid | PeS2o, ProofPile, MATH dataset
Long context (128K+) | Long documents (books, papers) | Mid | PG-19, Proof-Pile, long web docs
Instruction following | Q&A pairs, chat logs | Post | ShareGPT, Alpaca, Flan tasks
Safety / refusal | Red-team data, preference data | Post | Human annotations, RLHF

Misconception: more pre-training data is always better. This was the consensus until 2022. Then Chinchilla showed that for a fixed compute budget, you get better results from a smaller model trained on more data — but only up to the point where the data stays high-quality. Past that point, repeating epochs on lower-quality data hurts performance more than it helps. Data quantity and data quality are in tension: you want both, and you can't buy one with the other.
A team wants their model to excel at protein structure prediction (like AlphaFold). They have the base model trained on standard web + code data. What type of training stage and data would most effectively add this capability?

Chapter 2: Primary Sources — A Taxonomy

Where does the actual text come from? Every major training corpus draws from a small set of canonical sources. Let's characterize each one: what it contains, how many tokens it yields, the quality level, and the legal risk.

The Web (via Common Crawl)

The single largest source by orders of magnitude. Common Crawl has been running monthly since 2008 — about 100 snapshots by 2025. Each snapshot captures ~2–4 billion web pages, ~3–4 PB of raw data. After all filtering and deduplication, a single snapshot yields roughly 200–400 billion tokens of usable English text. The quality variance is enormous: a StackOverflow answer and a spam blog post look the same to a crawler.

Wikipedia

Started in 2001. By 2024: 62 million articles across 329 language editions, with English, Spanish, German, and French the largest. Wikipedia is extremely clean, factually dense, and encyclopedic — but also narrow in topic range (only "notable" topics get articles) and dry in style. Most models up-weight Wikipedia far beyond its raw byte count because of its high information density.

A small number of contributors write most of Wikipedia. Steven Pruitt holds the English-language record with over 5 million edits. This makes Wikipedia simultaneously very high quality (committed editors who revert vandalism quickly) and potentially biased (the views of a relatively small, homogeneous group of power editors). Wikipedia also exports periodic dumps every few weeks — this introduces a data poisoning vulnerability: malicious edits injected right before a dump, before administrators can revert them.

Books

BooksCorpus (used by BERT): 7,000 self-published books from Smashwords, ~985M words. Taken down because it violated Smashwords's terms of service. Project Gutenberg: ~75K public-domain books since 1971. High quality but old (pre-1928 US law). Books3: 196K books from the shadow library Bibliotik — taken down due to copyright lawsuits; Meta was sued for using LibGen. Long, coherent narrative structure makes books especially valuable for training models that reason across long contexts.

Code (GitHub)

GitHub started in 2008; acquired by Microsoft in 2018. By 2018, it hosted at least 28M public repositories. The Stack (2022) cloned 137M repos, found 51 billion files, retained 3.1 TB of permissively licensed (MIT, Apache) unique code. Code is valuable not just for coding tasks: folklore says code data improves logical reasoning, step-by-step following, and structured output quality, even in non-code domains.

Scientific Literature

arXiv: 2.3M preprints since 1991. LaTeX source is available, which is much cleaner than PDF extraction. PubMed Central: 5M papers mandated public by NIH for federally funded work. PeS2o (Semantic Scholar): 40M papers. Dense, high-reasoning text — but extremely domain-specific.

Q&A and Forums

StackExchange: started with StackOverflow in 2008, now 100+ topic sites. Uses reputation points and upvoting. Q&A format closely mirrors instruction tuning — questions resemble user prompts, top-voted answers resemble ideal responses. Reddit (via Pushshift): billions of posts and comments 2005–2023; GPT-2's WebText was defined as "pages linked from highly-upvoted Reddit posts" as a proxy for quality.

Source vs. capability matrix — which sources unlock which capabilities

Each cell shows how much each source contributes to a capability (brighter = more relevant). Hover or click a source row to see its contribution profile.

You're building a model that needs excellent mathematical proof-writing. Which source combination would you prioritize in mid-training?

Chapter 3: Common Crawl — The Internet's Archive

Common Crawl is a non-profit started in 2007 that does something simple in principle and staggering in scale: it crawls the entire public web on a roughly monthly schedule, stores the raw HTTP responses, and makes them freely available.

How the crawl works

The crawler uses Apache Nutch, starting from a seed list of hundreds of millions of URLs. It downloads each page, extracts all outgoing links, adds them to a priority queue, and repeats. A 2016 crawl took 10–12 days running on 100 machines. Each crawl tries to diversify — both revisiting frequently-updated pages and discovering new domains.
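The download, extract-links, enqueue loop can be sketched on a toy in-memory link graph. This is not how Nutch schedules work: the priority used here (link depth from the seeds) is an assumption chosen for simplicity, and `crawl`, `fetch_links`, and `TOY_WEB` are illustrative names.

```python
import heapq
from typing import Callable, Iterable

def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int) -> list[str]:
    """Visit pages in priority order, where lower depth means sooner."""
    frontier = [(0, url) for url in seeds]       # (priority, url)
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}          # never enqueue a URL twice
    visited: list[str] = []
    while frontier and len(visited) < max_pages:
        depth, url = heapq.heappop(frontier)
        visited.append(url)                      # "download" the page
        for link in fetch_links(url):            # extract outgoing links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (depth + 1, link))
    return visited

# A four-page toy web: each key links to the pages in its list.
TOY_WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["d.com"],
    "c.com": ["d.com", "a.com"],
    "d.com": [],
}

if __name__ == "__main__":
    print(crawl(["a.com"], lambda u: TOY_WEB.get(u, []), max_pages=10))
```

A real crawler would replace the depth priority with signals such as page freshness or domain diversity, and would apply politeness limits per host.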

Four crawler policies govern what happens:

Two output formats

Common Crawl releases two formats for each snapshot:

Format | Contents | Size (per snapshot) | Best for
WARC | Raw HTTP response: full HTML + headers | ~3–4 PB | Custom HTML→text pipelines
WET | Pre-converted plain text (lossy) | ~200–300 TB | Quick experiments

The WET files are convenient but have a critical flaw: the HTML-to-text conversion Common Crawl does is lossy in the wrong direction. It strips too much structure (losing formatting cues like headers) and sometimes too little (keeping navigation menus, footers, cookie banners). The DCLM paper (2024) showed that using WARC + a better HTML extractor (trafilatura or resiliparse) improves downstream task accuracy measurably compared to using WET directly.

Misconception: WET files are ready to use. Many early papers used Common Crawl WET directly. This is a mistake. WET conversion is mediocre: it loses semantic structure and includes boilerplate. The DCLM team showed that their custom HTML extraction from WARC improved MMLU scores by several percentage points — just from better text extraction, with no model changes. Format choice is a hyperparameter.

Scale in numbers

A single April 2025 snapshot contains approximately:

Common Crawl funnel — one snapshot from raw to training-ready

One Common Crawl snapshot: ~1.4T raw tokens. Drag sliders to see how many survive each filtering stage. Real-world numbers from C4, RefinedWeb, and FineWeb pipelines shown for reference.

Language ID threshold (English confidence) p > 0.65
Quality filter aggressiveness medium
Why do most serious LLM data pipelines use WARC files rather than WET files from Common Crawl?

Chapter 4: HTML → Text Extraction

Every web page is HTML. You want plain text. This sounds trivial — just strip the tags. It isn't. HTML contains two kinds of text: content (the article, post, or document you care about) and boilerplate (navigation menus, footers, cookie banners, ad slots, sidebar links). A naive tag-stripper keeps both equally. A good extractor keeps almost only content.

The challenge: there's no explicit label telling you which text is content and which is boilerplate. You have to infer it from HTML structure (semantic tags, DOM depth, text density, link density). A paragraph of article text sits inside a few div layers. A navigation menu sits inside an anchor-heavy node with short text. The heuristic: high text density + low link density = likely content.
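The text-density and link-density heuristic can be illustrated with the standard library alone. This is a deliberately crude sketch, not what trafilatura or any production extractor does: it uses character count as a stand-in for text density, and the thresholds `min_chars` and `max_link_ratio` are made-up values.

```python
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    """Count text characters and link characters per top-level block element."""
    BLOCK_TAGS = {"nav", "div", "article", "footer", "header", "aside", "section"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[dict] = []
        self._depth = 0      # nesting depth inside the current top-level block
        self._in_link = 0    # greater than 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            if self._depth == 0:
                self.blocks.append({"tag": tag, "text": 0, "link": 0})
            self._depth += 1
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
        elif tag == "a" and self._in_link > 0:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self._depth > 0 and n:
            self.blocks[-1]["text"] += n
            if self._in_link:
                self.blocks[-1]["link"] += n

def likely_content(html: str, min_chars: int = 100,
                   max_link_ratio: float = 0.3) -> list[str]:
    """Return the tags of blocks with enough text and a low share of link text."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    return [
        b["tag"] for b in parser.blocks
        if b["text"] > 0 and b["text"] >= min_chars
        and b["link"] / b["text"] <= max_link_ratio
    ]

if __name__ == "__main__":
    page = ('<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
            '<article><p>' + 'word ' * 50 + '</p></article>'
            '<footer><a href="/privacy">Privacy</a></footer>')
    print(likely_content(page))
```

The navigation and footer blocks fail both tests (little text, all of it inside links), while the article block passes.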

The tools that matter

trafilatura: Python library that uses density-based heuristics to find the main content block. Handles most mainstream news/blog formats well. Used by RefinedWeb (Falcon), FineWeb (HuggingFace).

resiliparse: extraction library implemented in Cython/C++ with a Python API, very fast. Similar approach to trafilatura but better at some edge cases.

jusText: older algorithm, paragraph-level classification. Used by The Pile (Pile-CC). Nemotron-CC found it returned more tokens than trafilatura by being less aggressive about removing borderline paragraphs.

The extraction–quality tradeoff. More aggressive extraction = fewer tokens but cleaner. Less aggressive = more tokens but noisier. Nemotron-CC (2024) explicitly chose jusText over trafilatura because they needed more tokens (targeting 6.3T) while still maintaining acceptable quality. FineWeb chose trafilatura because they prioritized quality. Both decisions were rational given their different goals.

A concrete example: what extraction looks like

html — raw web page (boilerplate + content mixed)
<nav class="main-menu">
  <a href="/home">Home</a> | <a href="/about">About</a> | ...15 more links...
</nav>
<div id="cookie-banner">
  We use cookies to improve your experience. Accept | Decline
</div>
<article class="post-content">
  <h1>The Cambrian Explosion: 543 Million Years of Novelty</h1>
  <p>In the span of 25 million years — a geological eyeblink —
  nearly all major animal phyla appeared in the fossil record. Trilobites,
  molluscs, chordates, arthropods: all emerged from what came before:
  single-celled Ediacaran organisms barely distinguishable from mats.</p>
  <p>Why? The oxygen hypothesis holds that rising O<sub>2</sub> levels...</p>
</article>
<footer>
  © 2023 ScienceBlog.com | Privacy Policy | Terms | Contact | Sitemap
</footer>
text — after trafilatura extraction (content only)
The Cambrian Explosion: 543 Million Years of Novelty

In the span of 25 million years — a geological eyeblink —
nearly all major animal phyla appeared in the fossil record. Trilobites,
molluscs, chordates, arthropods: all emerged from what came before:
single-celled Ediacaran organisms barely distinguishable from mats.

Why? The oxygen hypothesis holds that rising O2 levels...
python — using trafilatura for extraction
import trafilatura

def extract_text(html_bytes: bytes) -> str | None:
    """
    Extract main content from raw HTML.
    Returns None if no content found (e.g., pure navigation page).
    Input:  raw HTML bytes from WARC file
    Output: clean plain text string, or None
    """
    # include_tables=True preserves table structure as text
    # favor_precision=True: when in doubt, discard (quality-first)
    text = trafilatura.extract(
        html_bytes,
        include_tables=True,
        favor_precision=True,
        include_links=False,     # strip hyperlinks
        include_images=False,    # strip image alt text
    )
    return text  # may be None for navigation-only pages

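# Reading a WARC file (the filename below is illustrative)
from warcio.archiveiterator import ArchiveIterator

def process(text: str, url: str) -> None:
    """Placeholder for the downstream step (filtering, dedup, tokenization)."""
    print(f"{url}\t{len(text)} chars")

with open('CC-MAIN-2024-10-0001.warc.gz', 'rb') as f:
    for record in ArchiveIterator(f):
        if record.rec_type != 'response':
            continue  # skip request and metadata records
        html = record.content_stream().read()
        text = extract_text(html)
        if text and len(text) > 200:  # skip very short pages
            process(text, url=record.rec_headers.get_header('WARC-Target-URI'))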
HTML vs extracted text — toggle to see what gets kept and what gets stripped

A sample web page. Toggle between the raw HTML view (with all boilerplate highlighted in red) and the extracted clean text view. The boilerplate-to-content ratio is typical for a blog or news site.

The DCLM paper found that using WARC + trafilatura instead of WET files improved downstream benchmark accuracy. What is the most likely reason?

Chapter 5: Data Mix & Domain Weights

You have clean text from many sources: 400B tokens of filtered web, 30B tokens of code, 25B tokens of books, 4B tokens of Wikipedia, 8B tokens of arXiv papers, 5B tokens of StackExchange. You want to train on 1T tokens total. Which sources do you oversample? Which do you undersample? What fraction of the training batch should come from each domain?

This is the data mix problem, and it is arguably the single most important hyperparameter decision in pretraining — more impactful than learning rate or optimizer settings in many ablations. A model trained on 50% web + 50% code will be excellent at coding and mediocre at general knowledge. A model trained on 98% web + 2% code will have the reverse profile.

How real datasets set domain weights

The Pile (2021) used 22 high-quality curated domains. The designers made explicit choices about oversampling: Wikipedia was oversampled 3× its natural frequency because of its high information density. GitHub code was oversampled 2×. The result: a model (GPT-J, GPT-NeoX) with unusually strong coding and scientific reasoning for its era.

GPT-3 used roughly: 60% Common Crawl (processed), 22% WebText2 (Reddit-linked pages), 8% books, 3% Wikipedia, 7% miscellaneous. This heavy web weighting is partly why GPT-3 was fluent and broad but not as technically precise as Codex (which was fine-tuned on code).

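Model | Total size | Web % | Code % | Books % | Wiki % | Other %
GPT-3 (2020) | 300B tokens | ~60% | n/a | ~8% | ~3% | ~29%
The Pile (2021) | 825 GB | ~40% | ~11% | ~18% | ~4% | ~27%
LLaMA (2023) | 1.2T tokens | ~67% | ~5% | ~4% | ~4% | ~20%
Dolma (2024) | 3T tokens | ~60% | ~8% | ~2% | ~3% | ~27%
Llama 3 (2024) | 15T tokens | ~50% | ~17% | ~3% | ~2% | ~28%
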
Misconception: domain proportions = data file sizes. The natural frequency of each source — its raw byte count — is NOT the same as its training weight. Wikipedia is 20–30GB of English text: tiny compared to a 3TB web dump. But almost every model significantly oversamples Wikipedia because high-quality text is scarce and valuable. A 3× oversample rate means each Wikipedia document is seen three times per epoch while a web document is seen once.

The "data-constrained" regime: if you repeat data (epochs > 1), performance starts to degrade. The Chinchilla-era result suggests models benefit from fresh tokens even if those tokens are slightly lower quality than repeating your best data. The Hoffmann et al. discount formula (from Lecture 11) quantifies this tradeoff:

$$D_{\text{eff}} = D_{\text{unique}} \cdot r^{-0.7}$$

Where $r$ is the repetition factor: if you see each token $r = 3$ times, the effective unique data is $D_{\text{unique}} \times 3^{-0.7} \approx 0.46 \times D_{\text{unique}}$. You've tripled the data but only gained 46% effective coverage. Past $r \approx 4$, repeating is barely better than random noise.

python — data-mix sampler (weighted random draw across domains)
import random
from typing import Iterator

MIX = {
    "web":    {"weight": 0.60, "tokens_avail": 400e9},  # 400B tokens
    "code":   {"weight": 0.15, "tokens_avail":  30e9},
    "books":  {"weight": 0.10, "tokens_avail":  25e9},
    "wiki":   {"weight": 0.08, "tokens_avail":   4e9},  # heavily oversampled (see below)
    "arxiv":  {"weight": 0.04, "tokens_avail":   8e9},
    "stackex":{"weight": 0.03, "tokens_avail":   5e9},
}

def mix_sampler(num_draws: int, seed: int = 0) -> Iterator[str]:
    """
    Yields one domain name per draw, according to MIX weights.
    Caller fetches the next doc from that domain's iterator and
    stops drawing once its own token budget is spent.
    In production: each domain has a shuffled, tokenized shard list.
    """
    rng = random.Random(seed)  # seeded so the sampled mix is reproducible
    domains = list(MIX.keys())
    weights = [MIX[d]["weight"] for d in domains]
    for _ in range(num_draws):
        yield rng.choices(domains, weights=weights, k=1)[0]

# Repeat factor for wiki: wiki_weight * total / wiki_tokens_avail
# = 0.08 * 1e12 / 4e9 = 20×: Wikipedia repeated 20 times!
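
The same arithmetic applies to every domain, not just Wikipedia. A small sketch that reuses the weights and token counts from the block above and assumes a 1T-token run (the names `DOMAINS` and `repeat_factors` are illustrative):

```python
# (weight, tokens available) per domain, copied from the mix above.
DOMAINS = {
    "web":     (0.60, 400e9),
    "code":    (0.15, 30e9),
    "books":   (0.10, 25e9),
    "wiki":    (0.08, 4e9),
    "arxiv":   (0.04, 8e9),
    "stackex": (0.03, 5e9),
}

def repeat_factors(domains: dict[str, tuple[float, float]],
                   total_tokens: float) -> dict[str, float]:
    """Average number of passes over each domain: weight * total / available."""
    return {name: weight * total_tokens / avail
            for name, (weight, avail) in domains.items()}

if __name__ == "__main__":
    for name, r in repeat_factors(DOMAINS, 1e12).items():
        print(f"{name:8s} {r:5.1f}x")
```

Under these numbers web text is seen 1.5 times on average, code and arXiv 5 times, StackExchange 6 times, and Wikipedia 20 times.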
You have 4B tokens of Wikipedia and want to train on 1T tokens total with an 8% Wikipedia weight. How many times will each Wikipedia token be seen during training?

Chapter 6: Showcase: Data Mix Explorer

This is the payoff simulation. Set domain weights with the sliders and see the composition pie chart update live. The capability bars predict — based on known empirical patterns from real datasets — which capabilities the resulting mix tends to produce.

How to read this. The capability predictions are qualitative approximations from published ablation studies (The Pile, DCLM, Llama 3 technical report). They show relative expected quality, not absolute scores. Real outcomes depend on quality filtering and deduplication too. This is a mental model tool, not a prediction engine.
Data Mix Explorer — compose your training corpus

Adjust domain weights (they auto-normalize to 100%). The right panel shows predicted capability levels based on domain weights. Hover bars for details.

Web 60
Code 15
Books 10
Wiki 8
Math/Sci 4
Q&A 3
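
The auto-normalization behind the sliders can be sketched in a few lines. This is a guess at the behavior, not the widget's actual code: pin one domain to a new weight and rescale the rest proportionally so the total stays at 100. The names `set_weight` and `DEFAULT_MIX` are illustrative.

```python
def set_weight(mix: dict[str, float], name: str, new_weight: float) -> dict[str, float]:
    """Pin `name` to `new_weight` (percent) and rescale the other domains
    proportionally so that all weights still sum to 100."""
    rest = sum(w for d, w in mix.items() if d != name)
    scale = (100.0 - new_weight) / rest
    return {d: (new_weight if d == name else w * scale) for d, w in mix.items()}

# Slider defaults from the explorer above.
DEFAULT_MIX = {"web": 60, "code": 15, "books": 10, "wiki": 8, "math_sci": 4, "qa": 3}

if __name__ == "__main__":
    for domain, weight in set_weight(DEFAULT_MIX, "math_sci", 40).items():
        print(f"{domain:9s}{weight:6.2f}")
```

Raising math/science from 4 to 40 scales every other domain by 60/96, so web drops from 60 to 37.5 and Wikipedia from 8 to 5.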
You increase the math/science domain weight from 4% to 40%, proportionally reducing all other domains. Which capability is MOST likely to degrade noticeably?

Chapter 7: Major Datasets — From BERT to Nemotron

The progression of LLM training datasets tells the story of the field's priorities: more data, cleaner data, more diverse data, then smarter-filtered data.

BERT (2019): Books + Wikipedia

BooksCorpus (985M words, 7K self-published books) + English Wikipedia. Sequences are documents (not sentences — this matters for learning long-range dependencies). Simple and clean. No web data at all.

GPT-2 WebText (2019): Reddit-curated web

Outgoing links from Reddit posts with ≥3 karma = proxy for "humans found this worth reading." 8M pages, 40GB. Open replication: OpenWebText. Used Facebook fastText for English filtering, removed near-duplicates. The Reddit curation is a clever quality signal that doesn't require a classifier.

GPT-3 (2020): 570 GB, 400B tokens

Common Crawl (processed) + WebText2 + Books1/Books2 (mysterious internet-sourced books) + Wikipedia. Trained a quality classifier to distinguish {WebText, Wikipedia, Books1, Books2} from random CC text — a landmark move. Fuzzy deduplication against WebText and benchmarks. First dataset at ~hundreds-of-billions-tokens scale.

The Pile (2021): 825 GB, 22 curated domains

Grassroots effort from EleutherAI, coordinated on Discord. Key sources: Pile-CC (custom WARC extraction with jusText), PubMed Central, arXiv (LaTeX!), Books3 (196K books from Bibliotik — later sued), Project Gutenberg, StackExchange, GitHub, Enron emails. Books3 has since been taken down due to copyright lawsuits. This dataset powered GPT-J and GPT-NeoX.

C4 (2019): 806 GB, 156B tokens

Started with one CC snapshot (1.4T raw tokens). Manual heuristics to filter: keep lines ending in punctuation with ≥5 words, remove pages with <3 sentences, remove "bad word" list hits, remove pages containing '{' (eliminates code), remove lorem ipsum / terms of use pages, keep English at p≥0.99 (langdetect). The most aggressive rule-based pipeline of its era. Powered T5.
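
Several of these rules are simple enough to sketch directly. The code below is an approximation of the rules as described above, not the original C4 implementation: the language-ID step and the terms-of-use rule are omitted, sentences are counted crudely by terminal punctuation, and `BLOCKLIST` is a one-word placeholder for the real "bad word" list.

```python
import re
from typing import Optional

TERMINAL_PUNCT = (".", "!", "?", '"')
BLOCKLIST = {"examplebadword"}  # placeholder for the real "bad word" list

def c4_style_filter(page: str, min_words: int = 5,
                    min_sentences: int = 3) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is dropped."""
    lowered = page.lower()
    # Page-level drops: code-like braces, placeholder text, blocklisted words.
    if "{" in page or "lorem ipsum" in lowered:
        return None
    if BLOCKLIST & set(re.findall(r"[a-z']+", lowered)):
        return None
    # Line-level keep rule: terminal punctuation and at least min_words words.
    kept = [
        line.strip()
        for line in page.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT) and len(line.split()) >= min_words
    ]
    text = "\n".join(kept)
    # Crude sentence count: runs of '.', '!' or '?' in the kept text.
    if len(re.findall(r"[.!?]+", text)) < min_sentences:
        return None
    return text

if __name__ == "__main__":
    page = ("Home | About | Contact\n"
            "Trilobites appeared early in the Cambrian period. "
            "They spread quickly across shallow seas.\n"
            "Why did so many animal groups appear at once?\n"
            "Read more")
    print(c4_style_filter(page))
```

The menu line and the "Read more" stub are dropped at the line level, while any page containing a curly brace is dropped entirely, which is how the rule removes code along with any prose that happens to sit next to it.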

LLaMA (2023): 1.2T tokens

CommonCrawl + CCNet pipeline + C4 + GitHub + Wikipedia (20 languages, Jun–Aug 2022) + Project Gutenberg + Books3 + arXiv (inline macros expanded) + StackExchange top 28 sites sorted by score. Reproduced by Together's RedPajama v1. Cerebras SlimPajama: 627B token subset after deduplication with MinHashLSH.

RefinedWeb + FineWeb (2023–2024)

RefinedWeb (Falcon, 2023): "web data is all you need" thesis. trafilatura for WARC extraction. Gopher quality rules. Fuzzy dedup (MinHash over 5-grams). Released 600B tokens out of a 5T-token pipeline. FineWeb (HuggingFace, 2024): started as RefinedWeb replication, improved with 95 CC dumps, URL filtering, p(en) > 0.65 language threshold, additional Gopher + C4 rules, PII anonymization (email + IP addresses). Result: 15T tokens.

Dolma (2024): 3T tokens

AI2 open-source dataset for OLMo. Multi-source: CC (Gopher + C4 rules), Reddit (Pushshift 2005–2023), PeS2o (40M Semantic Scholar papers), C4, Gutenberg, Wikipedia/Wikibooks. Explicit toxicity filtering with Jigsaw classifier. Bloom filter deduplication. No model-based quality classifier (explicitly avoids it to reduce systematic biases).

DCLM (2024): 3.8T tokens

DataComp-LM: benchmark for data processing algorithms. Processed CC to DCLM-pool (240T tokens). DCLM-baseline: trained a fastText classifier using positive examples (OpenHermes-2.5 GPT-4-generated data + ELI5 subreddit) vs negative examples (random RefinedWeb). Applied classifier to all of DCLM-pool. Quality classifier outperforms all rule-based methods on downstream benchmarks.

Nemotron-CC (2024): 6.3T tokens

NVIDIA's response to DCLM. Motivation: FineWebEdu and DCLM filter 90% of data, leaving too few tokens for frontier training. jusText for extraction (more tokens than trafilatura). Classifier ensemble: Nemotron-340B-instruct scored documents by educational value, distilled to fast model; combined with DCLM classifier. Synthetic data: for low-quality docs, LM rephrases them; for high-quality, LM generates QA pairs. Result: 6.3T tokens (1.1T high-quality subset).

The scale progression. BERT: ~3B tokens. GPT-3: 300B. The Pile: 300B. LLaMA: 1.2T. Dolma: 3T. FineWeb: 15T. Llama 3: 15T. Qwen3: 36T. Each order-of-magnitude jump required fundamentally new thinking about filtering and pipeline quality — not just more storage. "Moar data" only works when quality scales with it.
C4 explicitly removes pages containing the character '{'. What kind of content does this filter remove, and what is the tradeoff?

Chapter 8: Legal Landscape — Copyright, Licenses & Fair Use

The moment you download web data, you've copied copyrighted material. Almost everything on the internet is copyrighted — the threshold for copyright protection is extremely low. Your website is copyrighted the moment you write it. You don't need to register (unlike patents). This creates an unavoidable legal exposure for any organization training on web data.

Copyright law basics

In the United States, copyright law derives primarily from the Copyright Act of 1976. Key facts:

Why Project Gutenberg is safe. Gutenberg only hosts books whose copyright has expired — pre-1928 US publication (and pre-1978 works without renewal). This is why it's ~75K books: a tiny fraction of all books ever written, but with zero legal exposure. Every other book source involves legal risk.

Licenses: the safe path

A license is essentially "a promise not to sue." Creative Commons licenses enable free distribution of copyrighted work with various conditions. Content from Wikipedia, Open Courseware, and Khan Academy is Creative Commons licensed. Many model developers pay for licenses:

For code: permissive licenses (MIT, Apache 2.0) allow use without restriction. GPL is "copyleft" — arguably "infects" anything derived from it. Most code data pipelines (The Stack, StarCoder) filter to permissively-licensed repositories only.
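
A license allow-list filter is easy to sketch. The repository records below are hypothetical, and the allow-list contains only the two licenses named above (written as SPDX identifiers); real pipelines detect licenses from repository files and use longer lists.

```python
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0"}  # SPDX identifiers

def keep_permissive(repos: list[dict]) -> list[dict]:
    """Keep only repos whose detected license is on the allow-list.
    Repos with no detected license are dropped as well."""
    return [r for r in repos if r.get("license") in PERMISSIVE_LICENSES]

if __name__ == "__main__":
    # Hypothetical metadata records, for illustration only.
    repos = [
        {"name": "alpha", "license": "MIT"},
        {"name": "beta", "license": "GPL-3.0-only"},
        {"name": "gamma", "license": None},
        {"name": "delta", "license": "Apache-2.0"},
    ]
    print([r["name"] for r in keep_permissive(repos)])
```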

Fair use: the contested path

Fair use (Section 107 of the Copyright Act) allows use of copyrighted material without permission in certain circumstances. Four factors courts weigh:

Factor | Favors fair use when... | LLM training relevance
1. Purpose & character | Educational, transformative, non-commercial | Training is transformative; commercial use is a negative factor
2. Nature of work | Factual, published, non-creative | News articles favor fair use more than novels
3. Amount used | Small snippet, not the "heart" of work | Whole-document ingestion is a strong negative factor
4. Market effect | Does not substitute for the original market | Most contested factor: LLMs can reduce demand for original works

The active lawsuits. The New York Times sued OpenAI and Microsoft (2023), claiming ChatGPT can reproduce NYT articles near-verbatim (market substitution). Class-action suits from book authors (George R.R. Martin, Jodi Picoult) against OpenAI. These cases will define the legal framework for LLM training data for years. The current legal uncertainty is a major reason why companies don't publish their data sources.

A key insight from copyright law: training an ML model is transformative in a way that mere copying is not. The model doesn't store the training text — it updates weights. It's interested in patterns (the idea), not the verbatim expression. The counter-argument: if the model can reproduce copyrighted text (as demonstrated with NYT articles), the expression was memorized. Both arguments have merit. Courts will decide.

Terms of service: additional restrictions

Even if you have a license or fair use applies, website terms of service may prohibit scraping. Reddit's ToS prohibits commercial use of its content, yet Reddit was scraped for years (Pushshift) and now sells API access. robots.txt is advisory, not legally binding (in the US), but ignoring it breaks crawler politeness norms and occasionally contributes to ToS violation claims.
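
Checking robots.txt before fetching takes only the standard library. A minimal sketch: the rules, the user agent `ExampleBot`, and the URLs are made up, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "ExampleBot", "https://example.com/private/page"))
    print(is_allowed(rules, "ExampleBot", "https://example.com/blog/post"))
```

In a live crawler you would fetch each site's robots.txt once and cache the parsed rules per host.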

A model trained on web data can reproduce a New York Times article almost verbatim when prompted. This is most relevant to which fair use factor?

Chapter 9: Connections & What's Next

You now have the full picture of the data acquisition and extraction layer. Before we connect forward, here's the landscape you've covered:

Concept | Key insight | Where it matters most
Training stages | Pre → Mid → Post: volume drops roughly 1000× per stage while quality requirements rise | Resource allocation, dataset strategy
Common Crawl | ~3.5 PB/snapshot, ~1–15% usable after pipeline | Every large-scale pretraining corpus
WARC vs WET | WARC + custom extractor beats WET by multiple benchmark points | Pipeline choice for CC-based datasets
HTML extraction | trafilatura (quality) vs jusText (quantity) tradeoff | CC pipelines; DCLM, RefinedWeb, FineWeb
Data mix | Domain weights ≠ byte fractions; Wikipedia repeated 20× in the worked example | Every pretraining run
Legal landscape | Most web data is copyrighted; fair use contested for LLMs | Dataset release decisions, company strategy

What Lecture 14 covers (Data II)

Deduplication and quality filtering — the two stages we flagged as "Lecture 14 material." You'll see:

How data shapes capabilities: a summary

Data pipeline timeline — when each major dataset appeared

The history of LLM training data, 2019–2024. Each entry shows approximate token count and the key innovation.

Cheat sheet

Data pipeline stages

  1. Acquire: web crawl (WARC) or API/dump
  2. Extract: HTML → text (trafilatura/resiliparse)
  3. Filter: language ID, heuristics, classifier (Lec 14)
  4. Deduplicate: MinHash LSH, Bloom filters (Lec 14)
  5. Mix: set domain weights, oversample quality sources

Key numbers to remember

  • CC snapshot: ~3.5 PB raw, ~1–15% usable
  • C4: 1.4T → 156B tokens (11% retention)
  • FineWeb: 95 CC dumps → 15T tokens
  • DCLM: 240T pool → 3.8T (1.6%)
  • Nemotron-CC: 6.3T tokens (1.1T high-quality subset)
  • Wikipedia: 4–20B tokens, oversampled 3–20×
  • GitHub (The Stack): 137M repos → 3.1 TB


Percy's summary maxim. "Data does not fall from the sky. You have to work to get it. Live service → raw data → processed data (conversion, filtering, deduplication). Data is the key ingredient that differentiates language models. Much of this pipeline is heuristic — many opportunities to improve."
Which statement best captures the key lesson of this lecture?