You have 32 H100s and access to Common Crawl. Congratulations — you own a wasteland. Raw web data is ~60% junk: scraped ads, boilerplate navigation menus, duplicate pages, garbled text. The model you get is only as good as the data you curate. This lesson traces every major LLM training corpus from BERT to Nemotron: where the text came from, how HTML becomes clean tokens, why the data mix is a secret weapon, and what legal landmines await. Deduplication and quality filtering come in Lecture 14. Today: acquisition + extraction.
You just got access to 32 H100s. Someone hands you the latest Common Crawl snapshot — 3.5 petabytes of raw HTTP responses. Everything that was publicly accessible on the web in the last month. You're ready to train a world-class language model.
Then you start looking at the actual text. Page one: a navigation menu followed by a cookie consent banner, then three paragraphs of lorem ipsum. Page two: the same article repeated 47 times with slightly different URLs. Page three: a spam page in 14 different languages at once, auto-generated to game search engines. Page four: 2,000 tokens of JavaScript error messages that got accidentally included in the HTML-to-text conversion.
This is the reality of raw web data. Percy Liang put it bluntly: Common Crawl is not a dataset, it is a raw dump of the internet, and the internet is mostly garbage. After all filtering and deduplication, a typical pipeline retains somewhere between 1% and 15% of the raw text depending on quality thresholds. The rest is discarded.
What makes a data pipeline hard? Three interacting challenges:
The pipeline has five stages: acquire (get the raw data) → extract (convert to text) → filter (remove low quality) → deduplicate (remove copies) → mix (set domain proportions). This lecture covers acquire + extract. Lecture 14 covers filter + deduplicate.
A typical Common Crawl snapshot starts at ~3.5 PB of raw WARC files, and the fraction that finally reaches the GPU is often only 1–5% of the raw input. (Interactive figure: a funnel showing how many tokens survive each pipeline stage.)
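The funnel arithmetic can be sketched in a few lines. The per-stage retention rates below are illustrative assumptions, chosen so the end result lands inside the 1–5% range quoted above; they are not measurements from any specific pipeline. The 1.4T raw-token starting point is the per-snapshot figure used elsewhere in this lesson.

```python
# Hypothetical per-stage retention rates (fraction of tokens kept).
# These values are assumptions for illustration, not measured numbers.
STAGES = [
    ("extract main content", 0.40),
    ("language filter", 0.50),
    ("quality filter", 0.30),
    ("deduplicate", 0.50),
]


def funnel(raw_tokens: float, stages=STAGES) -> list[tuple[str, float]]:
    """Return (stage name, tokens surviving after that stage) for each stage."""
    survivors = []
    remaining = raw_tokens
    for name, keep in stages:
        remaining *= keep
        survivors.append((name, remaining))
    return survivors


if __name__ == "__main__":
    raw = 1.4e12  # ~1.4T raw tokens in one snapshot
    for name, tokens in funnel(raw):
        print(f"{name:<22} {tokens / 1e9:8.1f}B tokens ({tokens / raw:.1%} of raw)")
```

Because the rates multiply, a pipeline where every stage looks lenient on its own can still discard the vast majority of the input.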
Modern LLM training doesn't happen in a single pass. It unfolds in three stages — and each stage wants fundamentally different data with different quality, format, and quantity tradeoffs.
Pre-training is where the model learns the basic fabric of language: grammar, facts, reasoning patterns, world knowledge. It needs enormous quantity — trillions of tokens — with moderate quality. Raw web text works here because the sheer volume trains general language understanding even if individual documents are mediocre. Llama 3 used 15T tokens; Qwen3 used 36T.
Mid-training enhances specific capabilities after the base model exists. If you want better math reasoning, you feed the model more math. Better code? More code. Long context? Long documents. This stage uses smaller amounts (millions to low billions of tokens) of higher-quality, curated data targeted at specific capability gaps.
Post-training (fine-tuning + RLHF) shapes the model's behavior: instruction following, safety, formatting preferences, tone. It uses the smallest amounts (tens of thousands to a few million examples) of extremely high-quality data, often human-written or synthetically generated.
The terminology matters. A base model is the result of pre-training plus mid-training — it can predict text well but doesn't necessarily follow instructions gracefully. An instruct/chat model has been further shaped by post-training to respond helpfully to natural language instructions. When you use the Claude API or ChatGPT, you're talking to an instruct model built on top of a base model.
Data needs cascade from what capabilities you want:
| Capability | Data needed | Stage | Key sources |
|---|---|---|---|
| General language | Diverse web + books | Pre | Common Crawl, Wikipedia, books |
| Code generation | GitHub, StackOverflow | Pre + Mid | The Stack, StarCoder data |
| Math reasoning | arXiv, textbooks, math forums | Mid | PeS2o, Proof-Pile, MATH dataset |
| Long context (128K+) | Long documents (books, papers) | Mid | PG-19, Proof-Pile, long web docs |
| Instruction following | Q&A pairs, chat logs | Post | ShareGPT, Alpaca, Flan tasks |
| Safety / refusal | Red-team data, preference data | Post | Human annotations, RLHF |
Where does the actual text come from? Every major training corpus draws from a small set of canonical sources. Let's characterize each one: what it contains, how many tokens it yields, the quality level, and the legal risk.
The single largest source by orders of magnitude. Common Crawl has been running monthly since 2008 — about 100 snapshots by 2025. Each snapshot captures ~2–4 billion web pages, ~3–4 PB of raw data. After all filtering and deduplication, a single snapshot yields roughly 200–400 billion tokens of usable English text. The quality variance is enormous: a StackOverflow answer and a spam blog post look the same to a crawler.
Started in 2001. By 2024: 62 million articles across 329 language editions, with English, Spanish, German, and French the largest. Wikipedia is extremely clean, factually dense, and encyclopedic — but also narrow in topic range (only "notable" topics get articles) and dry in style. Most models up-weight Wikipedia far beyond its raw byte count because of its high information density.
BooksCorpus (used by BERT): 7,000 self-published books from Smashwords, ~985M words. Taken down because it violated Smashwords's terms of service. Project Gutenberg: ~75K public-domain books since 1971. High quality but old (pre-1928 US law). Books3: 196K books from the shadow library Bibliotik — taken down due to copyright lawsuits; Meta was sued for using LibGen. Long, coherent narrative structure makes books especially valuable for training models that reason across long contexts.
GitHub started in 2008; acquired by Microsoft in 2018. By 2022, at least 28M public repositories. The Stack (2022) cloned 137M repos, found 51 billion files, retained 3.1 TB of permissively licensed (MIT, Apache) unique code. Code is valuable not just for coding tasks: folklore says code data improves logical reasoning, step-by-step following, and structured output quality — even in non-code domains.
arXiv: 2.3M preprints since 1991. LaTeX source is available, which is much cleaner than PDF extraction. PubMed Central: 5M papers mandated public by NIH for federally funded work. PeS2o (Semantic Scholar): 40M papers. Dense, high-reasoning text — but extremely domain-specific.
StackExchange: started with StackOverflow in 2008, now 100+ topic sites. Uses reputation points and upvoting. Q&A format closely mirrors instruction tuning — questions resemble user prompts, top-voted answers resemble ideal responses. Reddit (via Pushshift): billions of posts and comments 2005–2023; GPT-2's WebText was defined as "pages linked from highly-upvoted Reddit posts" as a proxy for quality.
(Interactive figure: heatmap of how much each source contributes to each capability; brighter cells mean more relevant.)
Common Crawl is a non-profit started in 2007 that does something simple in principle and staggering in scale: it crawls the entire public web on a roughly monthly schedule, stores the raw HTTP responses, and makes them freely available.
The crawler uses Apache Nutch, starting from a seed list of hundreds of millions of URLs. It downloads each page, extracts all outgoing links, adds them to a priority queue, and repeats. A 2016 crawl took 10–12 days running on 100 machines. Each crawl tries to diversify — both revisiting frequently-updated pages and discovering new domains.
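The download, extract-links, enqueue loop described above can be sketched with a priority queue. This is a toy illustration, not how Apache Nutch is implemented: the in-memory link graph `TOY_WEB`, the `crawl` function, and the choice of link depth as the priority are all assumptions made for the example.

```python
import heapq

# Toy link graph standing in for the web: URL -> outgoing links (hypothetical).
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "c.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}


def crawl(seeds: list[str], fetch_links, max_pages: int) -> list[str]:
    """Priority-queue crawl: the lowest priority value is fetched first.

    Priority is link depth from the seed (an assumed policy), with
    insertion order as the tie-breaker.
    """
    frontier = [(0, i, url) for i, url in enumerate(seeds)]
    heapq.heapify(frontier)
    seen = set(seeds)
    counter = len(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        depth, _, url = heapq.heappop(frontier)
        visited.append(url)                  # stands in for "download the page"
        for link in fetch_links(url):        # extract outgoing links
            if link not in seen:             # enqueue each URL only once
                seen.add(link)
                heapq.heappush(frontier, (depth + 1, counter, link))
                counter += 1
    return visited


order = crawl(["a.example/"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(order)
```

A real crawler would swap the depth-based priority for a score that balances revisiting frequently updated pages against discovering new domains.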
Four crawler policies govern what happens:
Common Crawl releases each snapshot in several formats; the two that matter here are:
| Format | Contents | Size (per snapshot) | Best for |
|---|---|---|---|
| WARC | Raw HTTP response: full HTML + headers | ~3–4 PB | Custom HTML→text pipelines |
| WET | Pre-converted plain text (lossy) | ~200–300 TB | Quick experiments |
The WET files are convenient but have a critical flaw: the HTML-to-text conversion Common Crawl does is lossy in the wrong direction. It strips too much structure (losing formatting cues like headers) and sometimes too little (keeping navigation menus, footers, cookie banners). The DCLM paper (2024) showed that using WARC + a better HTML extractor (trafilatura or resiliparse) improves downstream task accuracy measurably compared to using WET directly.
A single April 2025 snapshot contains approximately:
One Common Crawl snapshot holds ~1.4T raw tokens. (Interactive figure: survival of those tokens through each filtering stage, with real-world numbers from the C4, RefinedWeb, and FineWeb pipelines for reference.)
Every web page is HTML. You want plain text. This sounds trivial — just strip the tags. It isn't. HTML contains two kinds of text: content (the article, post, or document you care about) and boilerplate (navigation menus, footers, cookie banners, ad slots, sidebar links). A naive tag-stripper keeps both equally. A good extractor keeps almost only content.
The challenge: there's no explicit label telling you which text is content and which is boilerplate. You have to infer it from HTML structure (semantic tags, DOM depth, text density, link density). A paragraph of article text sits inside a few div layers. A navigation menu sits inside an anchor-heavy node with very little text of its own. The heuristic: high text density + low link density = likely content.
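The link-density half of that heuristic can be sketched with only the standard library. This is a deliberately simplified illustration, not how trafilatura or any other extractor works internally; the names `DensityParser`, `text_stats`, and `looks_like_content`, and both thresholds, are assumptions for the example.

```python
from html.parser import HTMLParser


class DensityParser(HTMLParser):
    """Count visible text characters and how many of them sit inside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._anchor_depth > 0:
            self.link_chars += n


def text_stats(fragment: str) -> tuple[int, float]:
    """Return (text length, link density) for one HTML fragment."""
    parser = DensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0, 1.0  # no text at all: treat as pure boilerplate
    return parser.total_chars, parser.link_chars / parser.total_chars


def looks_like_content(fragment: str, min_chars: int = 50,
                       max_link_density: float = 0.3) -> bool:
    """Threshold values are illustrative assumptions, not tuned numbers."""
    chars, density = text_stats(fragment)
    return chars >= min_chars and density <= max_link_density
```

On a navigation fragment nearly every character sits inside an anchor, so the density approaches 1; on an article paragraph it is close to 0. Production extractors combine this signal with several others, such as tag semantics and DOM position.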
trafilatura: Python library that uses density-based heuristics to find the main content block. Handles most mainstream news/blog formats well. Used by RefinedWeb (Falcon), FineWeb (HuggingFace).
resiliparse: C++/Cython-based extraction with a Python API, very fast. Similar approach to trafilatura but better at some edge cases. Used by DCLM.
jusText: older algorithm, paragraph-level classification. Used by The Pile (Pile-CC). Nemotron-CC found it returned more tokens than trafilatura by being less aggressive about removing borderline paragraphs.
html: raw web page (boilerplate + content mixed)

```html
<nav class="main-menu">
  <a href="/home">Home</a> | <a href="/about">About</a> | ...15 more links...
</nav>
<div id="cookie-banner">
  We use cookies to improve your experience. Accept | Decline
</div>
<article class="post-content">
  <h1>The Cambrian Explosion: 543 Million Years of Novelty</h1>
  <p>In the span of 25 million years — a geological eyeblink — nearly all major animal phyla appeared in the fossil record. Trilobites, molluscs, chordates, arthropods: all emerged from what came before: single-celled Ediacaran organisms barely distinguishable from mats.</p>
  <p>Why? The oxygen hypothesis holds that rising O<sub>2</sub> levels...</p>
</article>
<footer>
  © 2023 ScienceBlog.com | Privacy Policy | Terms | Contact | Sitemap
</footer>
```

text: after trafilatura extraction (content only)

```text
The Cambrian Explosion: 543 Million Years of Novelty
In the span of 25 million years — a geological eyeblink — nearly all major animal phyla appeared in the fossil record. Trilobites, molluscs, chordates, arthropods: all emerged from what came before: single-celled Ediacaran organisms barely distinguishable from mats.
Why? The oxygen hypothesis holds that rising O2 levels...
```
python: using trafilatura for extraction

```python
# Requires Python 3.10+ for the `str | None` annotations.
import trafilatura
from warcio.archiveiterator import ArchiveIterator


def extract_text(html_bytes: bytes) -> str | None:
    """Extract main content from raw HTML.

    Input:  raw HTML bytes from a WARC response record
    Output: clean plain-text string, or None if no content was found
            (e.g., a pure navigation page)
    """
    return trafilatura.extract(
        html_bytes,
        include_tables=True,    # preserve table structure as text
        favor_precision=True,   # when in doubt, discard (quality-first)
        include_links=False,    # strip hyperlinks
        include_images=False,   # strip images
    )


def process(text: str, url: str | None) -> None:
    """Placeholder for the downstream step (filtering, tokenizing, sharding)."""
    print(url, len(text))


# Reading a WARC file
with open('CC-MAIN-2024-10-0001.warc.gz', 'rb') as f:
    for record in ArchiveIterator(f):
        if record.rec_type == 'response':
            html = record.content_stream().read()
            text = extract_text(html)
            if text and len(text) > 200:  # skip very short pages
                process(text, url=record.rec_headers.get_header('WARC-Target-URI'))
```
(Interactive figure: a sample web page shown as raw HTML with boilerplate highlighted in red, and as extracted clean text. The boilerplate-to-content ratio is typical for a blog or news site.)
You have clean text from many sources: 400B tokens of filtered web, 20B tokens of code, 30B tokens of Wikipedia, 10B tokens of books, 5B tokens of arXiv papers, 15B tokens of StackExchange. You want to train on 1T tokens total. Which sources do you oversample? Which do you undersample? What fraction of the training batch should come from each domain?
This is the data mix problem, and it is arguably the single most important hyperparameter decision in pretraining — more impactful than learning rate or optimizer settings in many ablations. A model trained on 50% web + 50% code will be excellent at coding and mediocre at general knowledge. A model trained on 98% web + 2% code will have the reverse profile.
The Pile (2021) used 22 high-quality curated domains. The designers made explicit choices about oversampling: Wikipedia was oversampled 3× its natural frequency because of its high information density. GitHub code was oversampled 2×. The result: models (GPT-J, GPT-NeoX) with unusually strong coding and scientific reasoning for their era.
GPT-3 used roughly: 60% Common Crawl (processed), 22% WebText2 (Reddit-linked pages), 8% books, 3% Wikipedia, 7% miscellaneous. This heavy web weighting is partly why GPT-3 was fluent and broad but not as technically precise as Codex (which was fine-tuned on code).
| Model / corpus | Total size | Web % | Code % | Books % | Wiki % | Other % |
|---|---|---|---|---|---|---|
| GPT-3 (2020) | 300B tokens | ~60% | — | ~8% | ~3% | ~29% |
| The Pile (2021) | 825 GB of text | ~40% | ~11% | ~18% | ~4% | ~27% |
| LLaMA (2023) | 1.2T tokens | ~67% | ~5% | ~4% | ~4% | ~20% |
| Dolma (2024) | 3T tokens | ~60% | ~8% | ~2% | ~3% | ~27% |
| Llama 3 (2024) | 15T tokens | ~50% | ~17% | ~3% | ~2% | ~28% |
The "data-constrained" regime: if you repeat data (epochs > 1), performance starts to degrade. The Chinchilla-era result suggests models benefit from fresh tokens even if those tokens are slightly lower quality than repeating your best data. The Hoffmann et al. discount formula (from Lecture 11) quantifies this tradeoff:
$$D_{\text{eff}} = D_{\text{uniq}} \cdot r^{-0.7}$$

where $r$ is the repetition factor: if you see each token $r = 3$ times, the effective unique data is $D_{\text{uniq}} \times 3^{-0.7} \approx 0.46 \times D_{\text{uniq}}$. You've tripled the data but only gained 46% effective coverage. Past $r \approx 4$, repeating is barely better than random noise.
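To see how quickly the multiplier falls, here is a minimal sketch that evaluates the $r^{-0.7}$ discount for the first few repetition factors. The function name `discount` is an assumption for illustration; the exponent 0.7 is the one quoted in the text.

```python
def discount(r: float, exponent: float = 0.7) -> float:
    """Multiplier r**(-exponent) applied to D_uniq after r repetitions."""
    if r < 1:
        raise ValueError("repetition factor r must be >= 1")
    return r ** -exponent


for r in (1, 2, 3, 4):
    print(f"r={r}: D_eff = {discount(r):.2f} x D_uniq")
```

At r = 3 this reproduces the ≈ 0.46 multiplier from the worked example, and r = 1 leaves the data undiscounted.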
python: data-mix sampler (weighted random draw across domains)

```python
import random
from typing import Iterator

# Target sampling weight and available unique tokens per domain.
MIX = {
    "web":     {"weight": 0.60, "tokens_avail": 400e9},  # 400B tokens
    "code":    {"weight": 0.15, "tokens_avail": 30e9},
    "books":   {"weight": 0.10, "tokens_avail": 25e9},
    "wiki":    {"weight": 0.08, "tokens_avail": 4e9},    # repeated 20x at 1T tokens
    "arxiv":   {"weight": 0.04, "tokens_avail": 8e9},
    "stackex": {"weight": 0.03, "tokens_avail": 5e9},
}


def mix_sampler(total_tokens: int, tokens_per_draw: int = 1) -> Iterator[str]:
    """Yield domain names according to MIX weights.

    The caller fetches the next doc from that domain's iterator.
    tokens_per_draw is how many tokens each draw consumes; set it to the
    document or sequence length so the budget is counted in tokens, not draws.
    In production: each domain has a shuffled, tokenized shard list.
    """
    domains = list(MIX)
    weights = [MIX[d]["weight"] for d in domains]
    seen = 0
    while seen < total_tokens:
        yield random.choices(domains, weights=weights, k=1)[0]
        seen += tokens_per_draw


def repeat_factor(domain: str, total_tokens: float) -> float:
    """Epochs over a domain = weight * total budget / unique tokens available."""
    entry = MIX[domain]
    return entry["weight"] * total_tokens / entry["tokens_avail"]


# repeat_factor("wiki", 1e12) = 0.08 * 1e12 / 4e9 = 20: Wikipedia repeated 20 times!
```
(Interactive figure: data-mix simulator. Domain weights, auto-normalized to 100%, drive a composition pie chart and a panel of predicted capability levels based on known empirical patterns from real datasets.)
The progression of LLM training datasets tells the story of the field's priorities: more data, cleaner data, more diverse data, then smarter-filtered data.
BERT: BooksCorpus (985M words, 7K self-published books) + English Wikipedia. Sequences are documents, not sentences, which matters for learning long-range dependencies. Simple and clean. No web data at all.
WebText (GPT-2): outgoing links from Reddit posts with ≥3 karma, used as a proxy for "humans found this worth reading." 8M pages, 40GB. Open replication: OpenWebText. Used Facebook fastText for English filtering, removed near-duplicates. The Reddit curation is a clever quality signal that doesn't require a classifier.
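The karma threshold amounts to a one-pass filter over link submissions. The sketch below assumes a hypothetical post schema (dicts with `url` and `karma` keys); skipping links that point back to Reddit and collapsing trailing-slash variants are also assumptions added for the example, not documented details of the original pipeline.

```python
from urllib.parse import urlsplit


def select_links(posts, min_karma: int = 3) -> list[str]:
    """Keep outgoing URLs from posts with at least min_karma, once each.

    posts: iterable of dicts with 'url' and 'karma' keys (hypothetical schema).
    """
    seen = set()
    kept = []
    for post in posts:
        if post["karma"] < min_karma:
            continue
        parts = urlsplit(post["url"])
        host = parts.netloc.lower()
        if host.endswith("reddit.com"):      # not an outgoing link
            continue
        key = (host, parts.path.rstrip("/"))  # treat /essay and /essay/ as one page
        if key not in seen:
            seen.add(key)
            kept.append(post["url"])
    return kept
```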
GPT-3 (2020): Common Crawl (processed) + WebText2 + Books1/Books2 (mysterious internet-sourced books) + Wikipedia. Trained a quality classifier to distinguish {WebText, Wikipedia, Books1, Books2} from random CC text, a landmark move. Fuzzy deduplication against WebText and benchmarks. First dataset at the scale of hundreds of billions of tokens.
The Pile (2021): grassroots effort from EleutherAI, coordinated on Discord. Key sources: Pile-CC (custom WARC extraction with jusText), PubMed Central, arXiv (LaTeX!), Books3 (196K books from Bibliotik), Project Gutenberg, StackExchange, GitHub, Enron emails. Books3 has since been taken down due to copyright lawsuits. This dataset powered GPT-J and GPT-NeoX.
C4: started with one CC snapshot (1.4T raw tokens). Manual heuristics to filter: keep lines ending in punctuation with ≥5 words, remove pages with <3 sentences, remove "bad word" list hits, remove pages containing '{' (eliminates code), remove lorem ipsum / terms of use pages, keep English at p≥0.99 (langdetect). The most aggressive rule-based pipeline of its era. Powered T5.
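Most of these rules are a few lines of string handling each. The sketch below is a simplified rendering of the heuristics listed above, not the original C4 code: the language-ID step is omitted, the sentence counter is a crude regex, the bad-word list is left empty, and the name `c4_clean` is an assumption for the example.

```python
import re
from typing import Optional

TERMINAL_PUNCT = (".", "!", "?", '"')
SENTENCE_END = re.compile(r"[.!?]")       # crude sentence counter (assumption)
PAGE_BLOCKLIST = ("lorem ipsum", "terms of use")


def c4_clean(page: str, bad_words: frozenset = frozenset()) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is dropped."""
    lowered = page.lower()
    if "{" in page:                                   # eliminates code
        return None
    if any(marker in lowered for marker in PAGE_BLOCKLIST):
        return None
    if bad_words & set(lowered.split()):              # "bad word" list hit
        return None
    kept = [
        line.strip()
        for line in page.splitlines()
        # keep lines that end in punctuation and have at least 5 words
        if line.strip().endswith(TERMINAL_PUNCT) and len(line.split()) >= 5
    ]
    text = "\n".join(kept)
    if len(SENTENCE_END.findall(text)) < 3:           # fewer than 3 sentences
        return None
    return text
```

The line-level rule is what removes menus and button labels, since those rarely end in terminal punctuation; the page-level rules then drop whatever is left with too little prose.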
LLaMA (2023): CommonCrawl (CCNet pipeline) + C4 + GitHub + Wikipedia (20 languages, Jun–Aug 2022) + Project Gutenberg + Books3 + arXiv (inline macros expanded) + StackExchange top 28 sites sorted by score. Reproduced by Together's RedPajama v1. Cerebras SlimPajama: 627B token subset after deduplication with MinHashLSH.
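For intuition about the MinHash part of that deduplication step, here is a minimal standard-library sketch. It covers only the signature and the similarity estimate; the LSH banding that makes it scale is left out. The shingle size (5 words), the 64 hash functions, and all function names are assumptions for illustration, not SlimPajama's settings.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-grams of a document; short documents yield a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; salts stand in for permutations."""
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents whose signatures agree in most positions share most of their shingles, so they can be flagged as near-duplicates without comparing the full texts.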
RefinedWeb (Falcon, 2023): "web data is all you need" thesis. trafilatura for WARC extraction. Gopher quality rules. Fuzzy dedup (MinHash over 5-grams). Released 600B tokens out of a 5T-token pipeline. FineWeb (HuggingFace, 2024): started as RefinedWeb replication, improved with 95 CC dumps, URL filtering, p(en) > 0.65 language threshold, additional Gopher + C4 rules, PII anonymization (email + IP addresses). Result: 15T tokens.
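The PII anonymization step (email + IP addresses) is typically regex-based. The patterns and the `<EMAIL>` / `<IP>` placeholder tokens below are assumptions for illustration, not the exact rules or replacement strings used by FineWeb, and the IP pattern covers IPv4 only.

```python
import re

# Simplified patterns: assumptions for this sketch, not a complete PII detector.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def anonymize(text: str, email_token: str = "<EMAIL>", ip_token: str = "<IP>") -> str:
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub(email_token, text)
    return IPV4_RE.sub(ip_token, text)
```

A dotted version string such as 1.2.3.4 also matches the IPv4 pattern, which is the kind of false positive a real pipeline has to decide whether to tolerate.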
Dolma (2024): AI2's open-source dataset for OLMo. Multi-source: CC (Gopher + C4 rules), Reddit (Pushshift 2005–2023), PeS2o (40M Semantic Scholar papers), C4, Gutenberg, Wikipedia/Wikibooks. Explicit toxicity filtering with Jigsaw classifier. Bloom filter deduplication. No model-based quality classifier (explicitly avoids it to reduce systematic biases).
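A Bloom filter makes exact-match deduplication cheap because it stores a fixed-size bit array instead of every string seen so far, at the cost of a small false-positive rate. The sketch below is a generic textbook Bloom filter applied to whitespace-normalized paragraphs; it is not Dolma's implementation, and the class, the `dedup_paragraphs` helper, and the default capacity are assumptions for the example.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size set membership with false positives but no false negatives."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit hash halves.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedup_paragraphs(paragraphs, capacity: int = 1_000_000) -> list[str]:
    """Keep the first occurrence of each paragraph (case/whitespace-insensitive)."""
    seen = BloomFilter(capacity)
    kept = []
    for paragraph in paragraphs:
        key = " ".join(paragraph.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    return kept
```

A false positive here means a unique paragraph is occasionally dropped, which is usually an acceptable trade for not holding every paragraph in memory.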
DCLM (2024): DataComp-LM, a benchmark for data processing algorithms. Processed CC into DCLM-pool (240T tokens). DCLM-baseline: trained a fastText classifier using positive examples (OpenHermes-2.5 GPT-4-generated data + ELI5 subreddit) vs negative examples (random RefinedWeb). Applied the classifier to all of DCLM-pool. The quality classifier outperforms all rule-based methods on downstream benchmarks.
Nemotron-CC: NVIDIA's response to DCLM. Motivation: FineWeb-Edu and DCLM filter out 90% of the data, leaving too few tokens for frontier training. jusText for extraction (more tokens than trafilatura). Classifier ensemble: Nemotron-340B-instruct scored documents by educational value and was distilled to a fast model, then combined with the DCLM classifier. Synthetic data: for low-quality docs, an LM rephrases them; for high-quality docs, an LM generates QA pairs. Result: 6.3T tokens (1.1T high-quality subset).
The moment you download web data, you've copied copyrighted material. Almost everything on the internet is copyrighted — the threshold for copyright protection is extremely low. Your website is copyrighted the moment you write it. You don't need to register (unlike patents). This creates an unavoidable legal exposure for any organization training on web data.
In the United States, copyright law derives primarily from the Copyright Act of 1976. Key facts:
A license is essentially "a promise not to sue." Creative Commons licenses enable free distribution of copyrighted work with various conditions. Content from Wikipedia, Open Courseware, and Khan Academy is released under Creative Commons licenses. Many model developers pay for licenses:
For code: permissive licenses (MIT, Apache 2.0) allow reuse with minimal conditions such as attribution. GPL is "copyleft": it arguably "infects" anything derived from it. Most code data pipelines (The Stack, StarCoder) filter to permissively licensed repositories only.
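Filtering to permissive licenses is, at its core, an allow-list check on repository metadata. The sketch below assumes a hypothetical repo schema (dicts with `name` and an SPDX-style `license` identifier, or `None` when no license was detected); the allow-list is a small illustrative sample, not the list used by The Stack.

```python
# Assumed allow-list of SPDX-style identifiers, lowercased for comparison.
PERMISSIVE = frozenset({"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"})


def filter_permissive(repos, allowed: frozenset = PERMISSIVE) -> list[dict]:
    """Keep repos whose detected license is on the allow-list.

    Repos with no detected license are dropped: with no license, no
    permission to reuse has been granted.
    """
    return [r for r in repos if (r.get("license") or "").lower() in allowed]
```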
Fair use (Section 107 of the Copyright Act) allows use of copyrighted material without permission in certain circumstances. Four factors courts weigh:
| Factor | Favors fair use when... | LLM training relevance |
|---|---|---|
| 1. Purpose & character | Educational, transformative, non-commercial | Training is transformative; commercial use is a negative factor |
| 2. Nature of work | Factual, published, non-creative | News articles favor fair use more than novels |
| 3. Amount used | Small snippet, not the "heart" of work | Whole-document ingestion is a strong negative factor |
| 4. Market effect | Does not substitute for the original market | Most contested factor: LLMs can reduce demand for original works |
A key insight from copyright law: training an ML model is transformative in a way that mere copying is not. The model doesn't store the training text — it updates weights. It's interested in patterns (the idea), not the verbatim expression. The counter-argument: if the model can reproduce copyrighted text (as demonstrated with NYT articles), the expression was memorized. Both arguments have merit. Courts will decide.
Even if you have a license or fair use applies, website terms of service may prohibit scraping. Reddit's ToS prohibits commercial use of its content, yet Reddit was scraped for years (Pushshift) and now sells API access. robots.txt is advisory, not legally binding (in the US), but ignoring it breaks crawler politeness norms and occasionally contributes to ToS violation claims.
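Honoring robots.txt is straightforward with Python's standard library. The sketch below parses an inline example file rather than fetching one over the network; the file contents, the user-agent names `MyCrawler` and `ExampleBot`, and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one named bot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it by URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rp = build_parser(ROBOTS_TXT)
print(rp.can_fetch("MyCrawler", "https://example.com/articles/1"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))    # False
print(rp.can_fetch("ExampleBot", "https://example.com/articles/1"))  # False
```

A polite crawler runs this check before every fetch and caches the parsed file per host.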
You now have the full picture of the data acquisition and extraction layer. Before we connect forward, here's the landscape you've covered:
| Concept | Key insight | Where it matters most |
|---|---|---|
| Training stages | Pre → Mid → Post: volume drops, quality rises by 1000× | Resource allocation, dataset strategy |
| Common Crawl | ~3.5 PB/snapshot, ~1–15% usable after pipeline | Every large-scale pretraining corpus |
| WARC vs WET | WARC + custom extractor beats WET by multiple benchmark points | Pipeline choice for CC-based datasets |
| HTML extraction | trafilatura (quality) vs jusText (quantity) tradeoff | CC pipelines; DCLM, RefinedWeb, FineWeb |
| Data mix | Domain weights ≠ byte fractions; Wiki oversampled 20× | Every pretraining run |
| Legal landscape | Most web data is copyrighted; fair use contested for LLMs | Dataset release decisions, company strategy |
Deduplication and quality filtering — the two stages we flagged as "Lecture 14 material." You'll see:
The history of LLM training data, 2019–2024. Each entry shows approximate token count and the key innovation.