---
abstract: |
  We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally \`\`poor" data.

  ```{=latex}
  \ignore{The low-compute wins from filtered open datasets such as DCLM-Baseline and RefinedWeb entirely disappear at larger scales, and even adding documents with randomly shuffled word order or entirely random strings does not hurt model performance.}
  ```
  ```{=latex}
  \ignore{As computational budgets outpace the number of ``high-quality'' tokens available in heavily filtered datasets, we study the viability of the unfiltered Common Crawl (CC) and low-quality data for language model pretraining. 
      When taking a subset of CC, we find that CC outperforms standard filters including DCLM-Baseline and RefinedWeb as well as looser variants when sweeping over both model size and training steps. At the same subset size, neither randomly generated strings nor entirely shuffled new documents, up to 8x the subset size, appear to hurt the best achievable performance. 
      These findings also hold even as the subset size is scaled toward all of CC. 
      While we are able to construct ``poison'' data from distribution shift to hurt model performance, an initial study suggests there is a harmless amount in CC. 
      Altogether, our experiments totalling over 30,000 H200 GPU hours suggest that sufficiently large models trained for sufficiently long can benefit from the entire Common Crawl, and that the future of pretraining data lies mainly in seeking out and constructing new datasets. }
  ```
author:
- |
  Christopher Mohri
  Department of Computer Science
  Stanford University
  `xmohri@stanford.edu`
- |
  John Duchi
  Departments of Statistics and Electrical Engineering
  Stanford University
  `jduchi@stanford.edu`
- |
  Tatsunori Hashimoto
  Department of Computer Science
  Stanford University
  `thashim@stanford.edu`
bibliography:
- neurips_2026.bib
title: A Bitter Lesson for Data Filtering
---

```{=latex}
\let\P\undefined
```
```{=latex}
\DeclareMathOperator*{\P}{\mathbb{P}}
```
```{=latex}
\DeclareMathOperator*{\E}{\mathbb E}
```
```{=latex}
\DeclareMathOperator{\tr}{tr}
```
```{=latex}
\DeclareMathOperator{\rank}{rank}
```
```{=latex}
\DeclareMathOperator{\range}{range}
```
```{=latex}
\newcommand{\sF}{{\mathscr F}}
```
```{=latex}
\newcommand{\sX}{{\mathscr X}}
```
```{=latex}
\newcommand{\sY}{{\mathscr Y}}
```
```{=latex}
\def\Rset{\mathbb{R}}
```
```{=latex}
\newcommand{\1}{\mathds{1}}
```
```{=latex}
\newcommand{\ignore}[1]{}
```
```{=latex}
\newcommand{\chris}[1]{\textcolor{orange}{{[Chris: #1]}}}
```
```{=latex}
\maketitle
```
# Introduction

The standard approach to selecting pretraining data for language models is to filter text from sources like Common Crawl (CC) [@commoncrawl]. It is widely documented that in compute-constrained regimes, where one must train on a subset of CC, different data selection strategies can have a large impact on performance. This is intuitive: all else equal, it seems natural to train on \`\`higher-quality" data. As a result, a large body of research has emerged to tackle the data selection problem, with the goal of finding the best subset for pretraining language models [@albalak2024surveydataselectionlanguage; @li2025datacomplmsearchgenerationtraining].

```{=latex}
\ignore{ or data that looks similar to a given target task, like mathematics or software engineering}
```

However, not only is large-scale filter ablation heuristic and expensive, but filtering removes data, which is at odds with scaling trends that prescribe ever-increasing amounts of data to improve model performance. For example, the heavily filtered DCLM-Baseline dataset keeps $\sim\!1\%$ of the original CC, yielding about $3.8$ trillion tokens [@li2025datacomplmsearchgenerationtraining]. While this is still enormous, it falls short of the Chinchilla-optimal token budget for a $1$ trillion parameter model, even after accounting for diminishing returns when epoching [@muennighoff2025scalingdataconstrainedlanguagemodels]. The current trend is also to over-train relative to Chinchilla-optimal, which prescribes even more tokens to allow for (relatively) smaller models that are financially feasible to serve [@sardana2025chinchillaoptimalaccountinginferencelanguage].

We begin by testing the hypothesis that data filtering is necessary at all in the large compute limit. While large-scale machine learning has moved toward task-agnostic pretraining [@raffel2023exploringlimitstransferlearning], and there is anecdotal evidence that larger computational budgets benefit from looser data filters [@goyal2024scalinglawsdatafiltering; @muennighoff2025scalingdataconstrainedlanguagemodels], removing *all* data filtering would be an extreme intervention that uses data considered to be actively harmful [@raffel2023exploringlimitstransferlearning]. Our goal in this work is to take this extreme seriously and study the limits of (low-quality) data for transformer pretraining.

We find evidence against the hypothesis that data filtering is necessary, which suggests that eventually no existing data filter is likely to improve upon training directly on Common Crawl. In our experiments, we scale down both CC and its filtered versions to keep their relative sizes intact, and then scale computational resources for pretraining on these different datasets. Our two main levers are model size (which requires more compute per training step) and the number of training steps (which eventually leads to epoching). When comparing the best achieved performance, regardless of computational cost, our main finding is that the full pool outperforms our selected filters.

Our findings are robust as we scale our experiments by two orders of magnitude: the effects from our small-pool experiments persist as long as the models are sufficiently large. Furthermore, we find a predictable relationship between pool size, training steps, and model size, which enables us to build scaling laws that predict how much compute is needed for no filter to be optimal at a particular pool size. Using these, we estimate that the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal at compute budgets as low as $10^{30}$ FLOPs.

These initial findings lead us to study the robustness of pretraining to \`\`junk" data. Surprisingly, sufficiently large models are highly robust to irrelevant or junk data and can extract useful information even from highly noisy data. We test this using randomly generated strings and documents with shuffled word orders. While performance degrades at low compute budgets, sufficiently trained large models close the gap. Remarkably, these models even benefit from shuffled-word documents, despite only the unigram distribution of the documents remaining intact.

Overall, our experiments suggest that sufficiently large models that are trained for sufficiently long can benefit from the full CC dataset. While it is possible to construct harmful data, which could for example be non-factual content that looks identical to high-quality data, we do not find large amounts of this in CC. As a result, data filtering may suffer from the bitter lesson [@sutton2019bitter] in which human-designed filters that perform well at the small scale are eventually replaced by simple, no-filter approaches that scale more gracefully with compute.

We structure the paper as follows. In Section `\ref{sec:preliminaries}`{=latex}, we provide the basic experimental setup, followed by experiments on filtering in Section `\ref{sec:filtering}`{=latex}. We then move to adding data to our CC pool in Section `\ref{sec:injection}`{=latex}, and scaling the pool size in Section `\ref{sec:scaling}`{=latex}. We finish with edge cases in Section `\ref{sec:degradation}`{=latex} and with a theoretical model in Section `\ref{sec:theory}`{=latex} that provides a post-hoc explanation of the observed phenomena.

## Related Work

**Data-constrained pretraining.** Several prior works consider the data-constrained pretraining regime. @muennighoff2025scalingdataconstrainedlanguagemodels derive scaling laws that factor data repetition into the original Chinchilla scaling laws, finding diminishing returns after around 4 epochs on the data and that adding code data and using looser perplexity-based filters mitigates data scarcity. However, the authors recommend filtering \`\`noisy datasets" and train on subsets of C4 [@raffel2023exploringlimitstransferlearning], while the current work directly trains on (parsed) Common Crawl and finds evidence in support of no filtering. @kim2025pretraininginfinitecompute study the question of algorithmic improvements in a data-constrained but compute-unlimited setting. We share a similar experimental setup (where we take subsets of a dataset, scale compute on this subset, and then scale the subset size) but differ in the object of analysis (dataset filtering).

**Loose data filters.** The closest work to ours is @goyal2024scalinglawsdatafiltering, who argue that filter thresholds should depend on the compute budget, showing evidence for vision-language models. They derive a scaling law to predict the filtering threshold as a function of compute budget, and conclude that \`\`less aggressive filtering is best" with \`\`large compute" but do not identify the parameter scaling interactions that are critical to our work, and do not show our main findings that for language models, no filter can be the best filter. @fang2025datasetsdocumentsrepetitionspracticalities tackle a related question by artificially repeating \`\`high-quality" data to match the scale of loosely filtered data. They find that the former can outperform the latter in low-compute regimes, but the high compute regime studied in this work remains fully speculative in their work. Finally, @gao2021empiricalexplorationqualityfiltering finds that filtering aggressively can hurt performance, speculating that this follows from Goodhart's law \[[-@Goodhart1984]\], and @saada2025dataqualityillusionrethinkingclassifierbased find that filtering with a quality classifier may improve downstream benchmarks but not validation losses on \`\`high-quality" data.

On the theoretical side, @cheng2024labelershavecloserlook develop theoretical models of the data cleaning process, arguing that given models that have enough fidelity to model noisy data generation schemes, it is better to not clean data, while cleaning data can yield more robust learning when models are not perfect. This prediction dovetails with our subsequent findings.

**Low quality data.** Recent works' exploration of the impact of low-quality or intentionally degraded data on model performance motivates our experiments in Section `\ref{sec:injection}`{=latex}. @allenzhu2024physicslanguagemodels33 find that \`\`junk data" significantly reduces knowledge capacity in a synthetic data setting, which aligns with our findings on sufficient model sizes. Counterintuitively, @li2025baddataleadsgood argue that pretraining on toxic data leads to better representations, which makes it easier to remove toxic behavior during the post-training phase. Investigating the limits of data structure, @sinha2021maskedlanguagemodelingdistributional train on shuffled-word data similar to our shuffled-word experiments, arguing that the success of masked language models is primarily due to modeling \`\`higher-order word co-occurrence statistics". Finally, @ru2025reallyfilterrandomnoise train models on randomly generated integers similar to our randomly generated text in Section `\ref{sec:injection}`{=latex}, and notice only a small performance drop.

# Preliminaries {#sec:preliminaries}

We begin with our problem setup. Our goal is to measure the value of a dataset in terms of best possible performance, regardless of computational cost, on metrics of interest such as perplexity and downstream benchmarks. More formally, consider a training algorithm $\mathcal{A}$ that accepts a dataset $D$ of any size, a parameter count $M$, and a number of training steps $N$, and outputs a model $\theta\in \Theta$ that is evaluated with a loss $\ell \colon \Theta \to \Rset$. Our goal is to find the best achievable performance $$\begin{aligned}
\label{def:vod}
\mathcal{L}^\star(D) := \min_{M, N} \; \ell(\mathcal{A}(D, M, N)),
\end{aligned}$$ as a function of the pretraining data.

```{=latex}
\ignore{ In the original Chinchilla study of compute-optimal language models \citep{hoffmann2022trainingcomputeoptimallargelanguage}, the objective was the same, but with the constraint that $(M, N)$ not exceed a given compute budget. The dataset was also left unspecified and its size unconstrained.We note that our formulation is similar to that of \citet{kim2025pretraininginfinitecompute}, but shifts the focus from the best algorithm with large amounts of compute to the best dataset. In typical dataset comparisons, it is natural to allocate the same compute budget to two candidates. However, this does not make sense when they can be orders of magnitude different in size, and when compute is so large that it can essentially be treated as an unconstrained hyperparameter}
```
Our formulation has an unconstrained minimum over parameter count $M$ and training steps $N$ in an attempt to extract all the \`\`juice" out of a dataset, no matter its size. Empirically, we compute this minimum by varying $M$ and $N$ over several orders of magnitude until either performance improvements start to plateau or we run out of compute.
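
As a minimal sketch of how this empirical minimum can be read off a sweep, the snippet below takes a grid of finished runs and returns the best loss together with the configuration that achieved it. The `LossGrid` container, the function name, and the numbers in `example_sweep` are hypothetical and purely illustrative; they are not results or code from our experiments.

```python
from typing import Dict, Tuple

# Hypothetical container: maps (parameter count M, training steps N) to the
# evaluation loss of the model returned by the training algorithm A(D, M, N).
LossGrid = Dict[Tuple[int, int], float]


def best_achievable_loss(grid: LossGrid) -> Tuple[float, Tuple[int, int]]:
    """Return the empirical minimum of the loss over all swept (M, N) pairs."""
    if not grid:
        raise ValueError("the sweep must contain at least one run")
    best_config = min(grid, key=grid.get)
    return grid[best_config], best_config


# Illustrative numbers only; these are NOT results from the paper.
example_sweep: LossGrid = {
    (15_000_000, 10_000): 4.10,
    (15_000_000, 50_000): 4.02,
    (330_000_000, 10_000): 3.71,
    (330_000_000, 50_000): 3.55,
}
loss, (m_params, n_steps) = best_achievable_loss(example_sweep)
print(f"L* ~= {loss:.2f} at M={m_params}, N={n_steps}")
```

In practice the grid is only a finite approximation of the unconstrained minimum, so the returned value is an upper bound on $\mathcal{L}^\star(D)$.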

```{=latex}
\ignore{Empirically, we do this by varying model size over roughly $3$ orders of magnitude and training steps until performance stops improving with many epochs. }
```
Since we do not have the compute budget to train on all of Common Crawl (let alone perform multiple epochs), our experiments are structured around randomly sampled subsets. Let $D_{cc}$ be the entire CC, $D_{cc, m} \subseteq D_{cc}$ be a randomly sampled subset of $m$ tokens, and $f(D_{cc, m}) \subseteq D_{cc, m}$ be a filtered variant of the subset. In Section `\ref{sec:filtering}`{=latex}, we compare $\mathcal{L}^\star(D_{cc, m})$ and $\mathcal{L}^\star(f(D_{cc, m}))$ for standard filtering functions $f$ such as DCLM-Baseline and RefinedWeb and our smallest subset size $m$, to test if the commonly removed documents $D_{cc, m} \setminus f(D_{cc, m})$ are indeed helpful for improving performance. In Section `\ref{sec:injection}`{=latex}, we test model robustness by injecting various \`\`junk data" $J$ to form $D_{cc, m} \cup J$, challenging the hypothesis that $\mathcal{L}^\star(D_{cc, m}) < \mathcal{L}^\star(D_{cc, m}\cup J)$ holds.

Our smaller scale experiments implicitly assume that the better of $\mathcal{L}^\star(f(D_{cc, m}))$ and $\mathcal{L}^\star(D_{cc, m})$ does not change (or at least changes predictably) with $m$, which allows us to scale down and study the function $\mathcal{L}^\star$ at reasonable compute budgets. To investigate whether this is indeed the case, and understand how performance changes as a function of $m$, $M$, and $N$, we additionally scale over the pool size $m$ in Section `\ref{sec:scaling}`{=latex}.

```{=latex}
\ignore{When comparing filtering methods at a given pool or subset size, we filter the subset directly so that the relative size is intact. We first focus on our smallest pool size in Sections~\ref{sec:filtering} and~\ref{sec:injection}, and then scale the pool size in Section~\ref{sec:scaling} to understand how the observed phenomena change.}
```
## Experiment details

We use the version of Common Crawl provided by @li2025datacomplmsearchgenerationtraining in their DCLM-Pool dataset, which is all of CC before $2023$ with text extracted from HTML via $\texttt{resiliparse}$ [@bevendorff:2018]. This dataset is $240$ trillion GPT-NeoX [@black-etal-2022-gpt] tokens and our randomly sampled subsets range from about $670$ million to $10$ billion tokens. When filtering, we use the code provided by @li2025datacomplmsearchgenerationtraining. We do not use any specialized data curricula or data weights.

Our models are Llama-style dense transformers ranging from $15$ million to $7$ billion parameters, trained with the Meta Lingua code repository [@meta_lingua]. For each model, we tune the training step count and weight decay, following prior studies, so that the data can be repeated for more epochs [@fang2025datasetsdocumentsrepetitionspracticalities; @kim2025pretraininginfinitecompute]. As is standard, we decrease the learning rate as model size grows [@brown2020languagemodelsfewshotlearners; @kaplan2020scalinglawsneurallanguage], with an initial tuning stage to determine the rate of decrease.

Our main metrics of interest are the loss (negative log-likelihood) on various datasets, since this is known to correlate with downstream performance and provides smoother measurements than common question-answering benchmarks (likely due to their small size). These datasets are the English portion of C4 [@raffel2023exploringlimitstransferlearning], FineWeb-Edu [@penedo2024finewebdatasetsdecantingweb], which is a pretraining dataset targeting educational texts, and Cosmopedia [@benallal2024cosmopedia], a dataset of synthetically generated texts. We primarily plot the average loss across these three, but the trends are the same for each individually as well. We also provide results on common benchmarks such as ARC-Easy [@clark2018thinksolvedquestionanswering] and PIQA [@bisk2019piqareasoningphysicalcommonsense] in Appendix `\ref{app:benchmarks}`{=latex}. Since our experiments use pool sizes of only up to $10$B tokens, we do not expect to suffer from test set contamination.

# Data Filtering {#sec:filtering}

In this section, we test the hypothesis that standard filtered versions of CC achieve a lower loss than the unfiltered CC. Returning to our formulation in (`\ref{def:vod}`{=latex}), when $D_{cc, m}$ is an $m$-token subset of CC and $f$ is a filtering function, we are interested in the better of $\mathcal{L}^\star(D_{cc, m})$ and $\mathcal{L}^\star(f(D_{cc, m}))$.

```{=latex}
\ignore{The goal is to identify if the commonly removed documents $D \setminus f(D)$ are indeed helpful for improving performance. }
```
While we evaluate a representative set of standard and relaxed filters, an exhaustive search over the exponential space of subsets is computationally intractable. Our objective is instead to benchmark open curation strategies against the pool and identify if models are able to extract signal from \`\`low-quality" data.

We focus on our smallest CC pool size (about $670$ million tokens) where ablations are the cheapest, and curate five filtered versions of this pool by applying the filters described below, all of which are used in @li2025datacomplmsearchgenerationtraining. The first three are individual filters applied in the initial \`\`heuristic cleaning" stage of DCLM-Baseline, and applying each of them alone gives us pretraining datasets that are larger and more loosely filtered than standard. The fourth gives the end result of the \`\`heuristic cleaning" stage, and the last gives the result of the full filtering pipeline.

```{=latex}
\ignore{, and we omit it from our plots because it would result in only about $10$ million tokens and performs much worse}
```
![Comparison of models trained on the 670M-token CC pool and five filtered subsets. For sufficiently large models (330M+), the unfiltered pool (black) outperforms all five filters (colors) after sufficiently many optimization steps (x-axis, tokens under multiple epochs).](figures/small_filtering_plot.png){#fig:filters-600m-pool width="1\\linewidth"}

**English filter.** This filter first obtains an English score for a document using a fasttext classifier [@joulin2016bagtricksefficienttext] and then applies a threshold to this score. According to our tokenizer, $28.2 \%$ of the data is left after applying this filter.

**Repetition filter.** This filter originates from the data curation stage of the Gopher model, with the motivation that \`\`excessive repetition is often linked with uninformative content" [@rae2022scalinglanguagemodelsmethods]. It splits documents into segments of various granularities, such as lines, paragraphs, or n-grams, and applies a threshold on the duplicate fraction of these segments. According to our tokenizer, $45.3 \%$ of the data is left after applying this filter.

```{=latex}
\ignore{\textbf{Page length filter.} This filter removes documents that are less than $50$ or greater than $100,000$ words.}
```
**Stop word filter.** This filter ensures that a document contains at least $2$ occurrences of English stop words from the following list: \`\`the", \`\`be", \`\`to", \`\`of", \`\`and", \`\`that", \`\`have", and \`\`with". According to our tokenizer, $50.4 \%$ of the data is left after applying this filter.
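
The rule above can be sketched in a few lines. This is a minimal illustration, not the DCLM implementation: the function name is hypothetical, and the choice to lowercase the text, split on non-letter characters, and count every occurrence (rather than distinct stop words) is an assumption made for the example.

```python
import re

# Stop-word list quoted in the text above.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_stop_word_filter(document: str, min_occurrences: int = 2) -> bool:
    """Keep a document only if it contains enough stop-word occurrences.

    Assumption: words are lowercased and split on non-letter characters, and
    repeated stop words each count; the real pipeline may tokenize differently.
    """
    words = re.findall(r"[a-z]+", document.lower())
    count = sum(1 for word in words if word in STOP_WORDS)
    return count >= min_occurrences


print(passes_stop_word_filter("The capital of France is Paris"))
print(passes_stop_word_filter("buy cheap watches online"))
```

Under these assumptions the first document is kept (it contains "the" and "of") and the second is removed, which illustrates how short or non-English documents tend to be dropped by this heuristic.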

```{=latex}
\begin{wrapfigure}{tr}{0.4\textwidth} % {r} for right side, width as needed
  
  \includegraphics[width=0.38\textwidth]{figures/pareto_frontier_plot.pdf}
  \caption{Pareto frontier of Figure~\ref{fig:filters-600m-pool}, showing that in high-compute regimes the pool becomes optimal.}
  \label{fig:pareto-670m-pool}
\end{wrapfigure}
```
**RefinedWeb.** This consists of the filters above along with other similar filters, in an attempt to reproduce the RefinedWeb dataset [@penedo2023refinedweb]. According to our tokenizer, $13 \%$ of the data is left after applying this filter.

**DCLM-Baseline.** This dataset applies deduplication and quality-based filtering with a fasttext classifier to RefinedWeb. According to our tokenizer, $2.1\%$ of the original pool data is left after applying this filter. We address questions of severe data scarcity in Appendix `\ref{app:benchmarks}`{=latex}.

In Figure `\ref{fig:filters-600m-pool}`{=latex}, we show the average loss for each dataset as compute is varied with both model size and training steps. Each point corresponds to a separate training run, with its own warmup and cosine decay learning rate schedule.

```{=latex}
\ignore{ We first observe that generally, the loss starts to increase with more epochs, as observed by \citet[Figure 2]{kim2025pretraininginfinitecompute}, likely due to overfitting. }
```
Overall, the pool (CC) reaches the best loss of $3.37$ on the $1$B model, and its loss has not visibly plateaued as model size increases. Outperforming the filtered datasets requires both a sufficiently large model *and* a sufficiently large training step count. On the $15$M model, the loss is still decreasing at a training budget of $100$B total tokens, so we have not trained it to the point where the loss starts to increase again; even so, it does not appear as though the pool will ever outperform any of the first four filtered datasets at this model size. As we transition to the larger models, we observe crossing points on the loss curves between the pool and filtered versions, and these crossing points appear earlier as model size increases.

In Figure `\ref{fig:pareto-670m-pool}`{=latex}, we take the same runs from Figure `\ref{fig:filters-600m-pool}`{=latex} and derive a compute-performance Pareto frontier. We calculate the compute for a run with the standard $6NM$ approximation [@kaplan2020scalinglawsneurallanguage], where $N$ is the number of total training tokens and $M$ is the number of model parameters. As compute is increased, the pool transitions from the worst-performing dataset to the best. Perhaps surprisingly, not all datasets enjoy a point on the overall Pareto frontier: at every given compute level, there are at least two better-performing datasets than the repetition filtered dataset.
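
For concreteness, the frontier construction can be sketched as follows. The `Run` record, the dataset labels, and the example values are hypothetical; only the $6NM$ compute approximation is taken from the text.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    dataset: str   # illustrative label, e.g. "pool" or "filtered"
    params: float  # M, number of model parameters
    tokens: float  # N, total training tokens (including repeated epochs)
    loss: float    # evaluation loss of the finished run


def compute_flops(run: Run) -> float:
    """Training compute under the standard C = 6 * N * M approximation."""
    return 6.0 * run.tokens * run.params


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Runs that are not beaten by any run using at most the same compute."""
    frontier: List[Run] = []
    best_loss = float("inf")
    # Sweep from cheapest to most expensive; keep runs that set a new best loss.
    for run in sorted(runs, key=lambda r: (compute_flops(r), r.loss)):
        if run.loss < best_loss:
            frontier.append(run)
            best_loss = run.loss
    return frontier


# Illustrative values only; these are NOT results from the paper.
runs = [
    Run("filtered", 1e8, 1e9, 4.0),
    Run("pool", 1e8, 1e9, 4.2),
    Run("pool", 1e9, 1e10, 3.4),
    Run("filtered", 1e9, 1e10, 3.5),
]
for run in pareto_frontier(runs):
    print(run.dataset, f"{compute_flops(run):.1e}", run.loss)
```

A dataset with no run on the returned list never attains the best loss at any compute level in the sweep, which is the situation described above for the repetition-filtered dataset.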

Overall, these experiments suggest that pretraining is surprisingly resilient. Even at our scale, we see that the pool eventually beats the performance of all the filtered variants. This can be counterintuitive, since we might expect some junk data to hurt model performance. To further explore this phenomenon, we create artificial low quality data to probe the limits of pretraining robustness in the next section.

```{=latex}
\ignore{To add to our confidence that large-scale pretraining is robust to low-quality data, we introduce targeted data injection in the following section.}
```
# Data Injection {#sec:injection}

We now test the limits of model robustness by deliberately injecting low-quality data. We investigate the hypothesis that the best achievable performance strictly degrades when curated \`\`junk" distributions are added to the pretraining pool. More formally, if $D_{cc, m}$ is a subset of CC and $J$ represents the injected low-quality dataset, we are interested in the better of $\mathcal{L}^\star(D_{cc, m})$ and $\mathcal{L}^\star(D_{cc, m} \cup J)$. Our first variant of $J$ is designed to be devoid of any useful signal, and the second is designed to have some useful signal but of extremely low quality (examples in Figure `\ref{fig:junk_examples}`{=latex}).

**Randomly generated strings.** We define a vocabulary of 10,000 words by uniformly sampling 3 to 8 characters from the lowercase English alphabet (a-z). We then sample uniformly from these words and concatenate them with a space character to form documents.

**Additional shuffled pool documents.** We take additional CC documents that are not included in our CC subset and randomly shuffle the order of the words in each document.
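
Both constructions are simple to reproduce. The sketch below is illustrative rather than the exact generation code: the function names are hypothetical, and it assumes that word lengths are drawn uniformly from 3 to 8, that duplicate vocabulary entries are allowed, and that "words" in a CC document are whitespace-separated tokens.

```python
import random
import string
from typing import List


def make_random_vocabulary(rng: random.Random, size: int = 10_000) -> List[str]:
    """Words of 3 to 8 uniformly sampled lowercase letters.

    Assumption: the length itself is uniform on 3..8 and duplicates may occur;
    the text does not pin down either detail.
    """
    return [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))
        for _ in range(size)
    ]


def random_string_document(
    rng: random.Random, vocabulary: List[str], num_words: int
) -> str:
    """Sample words uniformly from the vocabulary and join them with spaces."""
    return " ".join(rng.choices(vocabulary, k=num_words))


def shuffle_document(rng: random.Random, document: str) -> str:
    """Randomly permute the whitespace-separated words of one document."""
    words = document.split()
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(0)
vocabulary = make_random_vocabulary(rng)
print(random_string_document(rng, vocabulary, num_words=8))
print(shuffle_document(rng, "The capital of France is Paris"))
```

Shuffling preserves the multiset of words in each document, so the unigram statistics survive while word order is destroyed.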

<figure id="fig:injection-600m-pool">
<p><img src="figures/scaling_laws_v149_random.png" /> <img src="figures/scaling_laws_v149_shuffled.png" /></p>
<figcaption>670M-token CC pool versus junk-injected versions. Plots show a surprising robustness to random data (top) for large models with consistent gains from low-quality (shuffled) data (bottom) with sufficiently many epoched training steps (x-axis). </figcaption>
</figure>

In Figure `\ref{fig:injection-600m-pool}`{=latex}, we compare the two new datasets against the CC pool when varying model size and training step count. We include varying amounts of injected junk data, up to 8 times the pool size in the shuffled-words case, which leaves only about 10% of untouched CC documents. In both cases, it is immediately clear that the injected data has not reduced the model to chance-level performance, which would result in a cross-entropy loss (negative log-likelihood) of $-\log(1/V) = \log V$, where $V$ is the vocabulary size, or approximately 10.8 with our tokenizer.
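
The two numbers in this paragraph follow from short calculations, sketched below. The vocabulary size of 50,000 is an assumption chosen to match the order of magnitude of the GPT-NeoX tokenizer; the exact value is not stated here. The untouched fraction assumes junk and pool are measured in the same units.

```python
import math


def chance_level_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over the vocabulary."""
    return -math.log(1.0 / vocab_size)


def untouched_fraction(junk_ratio: float) -> float:
    """Share of the mixture that is original pool data when junk amounting to
    junk_ratio times the pool size is added."""
    return 1.0 / (1.0 + junk_ratio)


# Assumption: roughly 50,000 tokens in the vocabulary (not stated in the text).
print(round(chance_level_loss(50_000), 1))
# +800% junk means a junk-to-pool ratio of 8.
print(round(untouched_fraction(8.0), 3))
```

With a ratio of 8, the pool makes up one ninth of the mixture, consistent with the "about 10%" figure quoted above.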

For both dataset variants, a sufficiently large model is required to match the pool performance. With the 15M model, there is a separation in the loss curves, regardless of the ratio of injected documents. As we transition to larger models, this gap closes. On the 330M model, we even see that all of the shuffled datasets---except the +800% shuffled dataset---surpass the pool performance after around 11B training tokens. We have not trained the +800% shuffled dataset past 100B tokens, but we expect it will also surpass pool performance since its loss has not visibly plateaued. We also expect it to cross this threshold even earlier on the larger 1B model because of its faster-decreasing loss. In the case of the randomly generated strings, the random datasets appear to more closely match the performance of the pool, but overall, the gaps are still closing with model size.

Our intuition for the differences between these datasets is that the shuffled words are more \`\`confusing" for a smaller model, whereas the randomly generated strings are more clearly drawn from a different distribution. As we scale model size, and thus perhaps the ability to differentiate between the two distributions, there is more signal to extract from the shuffled data as it contains additional unseen pool documents with the unigram distribution intact. If, for example, we shuffled the sentence \`\`The capital of France is Paris", we would still see \`\`France" and \`\`Paris" together, helping the model understand that there may be some connection between the two. We attribute the improved performance with +20% random to either a potential regularization effect or an unintended similarity to natural text, which generally features words of similar lengths separated by space characters.

<figure id="fig:junk_examples">
<figure id="fig:sub_random_example">

</figure>
<p></p>
<figure id="fig:sub_shuffled_example">

</figure>
<figcaption>Examples of “low-quality” documents injected into CC pool. <strong>Left</strong>: documents with shuffled word order. <strong>Right</strong>: documents with randomly generated strings.</figcaption>
</figure>

# Scaling Pool Size {#sec:scaling}

Do our experiments have implications for large-scale pretraining where the pool is all of CC? While suggestive, our 670M-token pool sample is quite far from the available internet stock of 200-500 trillion tokens [@villalobos2024rundatalimitsllm], and scale effects could significantly change our conclusions.

To address these concerns, we turn to scaling studies that show our effects are consistent across scale by varying our pool size and model sizes across $2$ orders of magnitude, and build up to a prediction of the compute threshold where the CC pool in DCLM-Pool (240T tokens) outperforms RefinedWeb. Due to the computational costs of these runs, we focus solely on the comparison between CC and $f=$ RefinedWeb, with the goal of making a prediction on the better of $\mathcal{L}^\star(D_{cc})$ and $\mathcal{L}^\star(f(D_{cc}))$.

<figure id="fig:crossing-points">
<p><img src="figures/1B_avg_nll_clean_headers.png" /> <img src="figures/c4_cross_final.png" /></p>
<figcaption><strong>Top:</strong> <span class="math inline">1</span>B model performance as we vary the pool size; the total needed steps for pool to outperform RefinedWeb grows rapidly. <strong>Bottom:</strong> Crossing point as a function of pool size for various model sizes. Markers each represent a crossing point (e.g. top panel), with text showing the epoch count. Epochs above the largest observed crossing point (121.6 epochs) are shaded to indicate unreliability at extreme epoch counts. Dashed lines show second-order polynomial fits used to interpolate data and show growth trends.</figcaption>
</figure>

Understanding how pool size affects performance requires us to map out the joint space of pool size $m$, model parameters $M$ and step count $N$. As a simplifying first step, we represent step count as a function of the other two variables, $$N^\star(M, m) := \min \left\{ N \colon \ell(\mathcal{A}(D_{cc, m}, M, N)) < \min_{N'} \ell(\mathcal{A}(f(D_{cc, m}), M, N'))\right\},$$ where we have taken the minimal winning $N$ (if one exists) as the output of the function. Given our intuition and experimental evidence that performance improves with larger models when sweeping over step count (see Figures `\ref{fig:filters-600m-pool}`{=latex} and `\ref{fig:injection-600m-pool}`{=latex}), this serves as a succinct representation of our three-variable space.
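
On a finite sweep, the crossing point can be computed directly from the two loss curves at a fixed model size and pool size. The following is a minimal sketch; the function name and the example losses are hypothetical and not taken from our runs.

```python
from typing import Dict, Optional


def crossing_point(
    pool_losses: Dict[int, float],
    filtered_losses: Dict[int, float],
) -> Optional[int]:
    """Smallest swept step count N at which the pool beats the filter's best loss.

    Both arguments map a training step count to the loss of a separate run at
    a fixed model size M and pool size m. Returns None if no swept N wins.
    """
    best_filtered = min(filtered_losses.values())
    winning = [n for n, loss in pool_losses.items() if loss < best_filtered]
    return min(winning) if winning else None


# Illustrative values only; these are NOT results from the paper.
pool = {1_000: 3.90, 2_000: 3.60, 4_000: 3.40, 8_000: 3.45}
filtered = {1_000: 3.70, 2_000: 3.50, 4_000: 3.55}
print(crossing_point(pool, filtered))
```

Returning `None` corresponds to the case in the definition where no winning $N$ exists, which matters for small models where the curves may never cross.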

Our step count function $N^\star$ has predictable behavior in both of its arguments. When we fix $M=1$B and increase the pool, we make two important observations. First, we see that the point at which the pool performance becomes better than RefinedWeb ($N^\star$) grows rapidly (top half of Figure `\ref{fig:crossing-points}`{=latex}), and this growth is super-linear in the pool size (roughly 10 epochs are needed for the 10B-token pool, compared to roughly three epochs for the 2B-token pool and one epoch for the 670M-token pool). Our second observation is that the validation losses are nonmonotone even with tuned weight decay regularization, suggesting that in extreme epochs (100+), the two loss curves may cease to cross.
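
To make the super-linearity concrete, the rounded epoch counts quoted above can be converted into training tokens at the crossing point (epochs times pool size) and compared on a log-log scale. This is a back-of-the-envelope sketch using only those rounded values; the resulting slope is illustrative and is not one of our fitted scaling laws.

```python
import math

# Rounded values quoted in the text: pool size in tokens -> epochs at crossing.
epochs_at_crossing = {670e6: 1.0, 2e9: 3.0, 10e9: 10.0}

# Training tokens at the crossing point: N* = epochs * pool size.
tokens_at_crossing = {m: e * m for m, e in epochs_at_crossing.items()}


def log_log_slope(x0: float, y0: float, x1: float, y1: float) -> float:
    """Slope of the line through two points in log-log space."""
    return (math.log(y1) - math.log(y0)) / (math.log(x1) - math.log(x0))


pools = sorted(tokens_at_crossing)
slope = log_log_slope(
    pools[0], tokens_at_crossing[pools[0]],
    pools[-1], tokens_at_crossing[pools[-1]],
)
print(f"log-log slope of N* against m: {slope:.2f}")
```

A slope above 1 means that the tokens needed for the pool to win grow faster than the pool itself, which is equivalent to the epoch count rising with pool size.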

We now also vary model size $M$ to understand the joint scaling behavior as model size grows with pool size. The bottom half of Figure `\ref{fig:crossing-points}`{=latex} shows a sweep over $N^\star(M,m)$, with each panel varying $M$ and the x-axis varying $m$. On the leftmost plot with the 80M model, we can clearly see that crossing points cease to exist even within our evaluated pool sizes: while there is a crossing point for the smallest 670M-token pool, there is no longer a crossing point for the largest 10B-token pool, as indicated by the dark orange marker. As high-epoch regimes can become nonmonotone, we mark those regions in orange in the plot to indicate that they are unlikely to have any crossing points. As we scale up $M$, however, the epoch counts needed for the pool to win rapidly decrease.

```{=latex}
\begin{wrapfigure}{r}{0.4\textwidth} % {r} for right side, width as needed
  
  \includegraphics[width=0.4\textwidth]{figures/meta_scaling_law_compute_combined_240T.pdf}
  \caption{Scaling laws for optimality of no data filtering. Two scaling laws with token-per-parameter scaling (in orange) and epoch constraints (in blue) both give highly linear scaling and predict similar budgets (1e+30 FLOPs).}
  \label{fig:extrapolation}
\end{wrapfigure}
```
With these observations and our experimental measurements of $N^\star$, we can answer our question of what happens when we scale our pool sizes to the current CC pool size (240T tokens in DCLM-Pool). Are compute levels in the near future likely to reach a point where the entire CC pool is better than RefinedWeb? We follow a simple procedure to build a compute scaling law on top of our $N^\star$ function (Figure `\ref{fig:crossing-points}`{=latex}), fitting two types of scaling laws to be robust to misspecification. In our first approach, we start by specifying a token-to-non-embedding-parameter ratio (600:1, following DeepSeek V4). For each model size, this ratio immediately specifies the number of training steps $(N^\star)$ as well as the compute level $(C=6MN^\star)$. We can then estimate the pool size corresponding to this $N^\star$ for each model (using a fitted quadratic to interpolate among our observed data points, as described in Appendix `\ref{subsec:scaling_fits}`{=latex}) and build a scaling law against $C$. In our second approach, we instead specify an epoch count (4, based on @muennighoff2025scalingdataconstrainedlanguagemodels). The epoch count specifies a linear constraint that intersects $N^\star$ for each model at a single point (cf. the orange 120-epoch line in Figure `\ref{fig:crossing-points}`{=latex}). This point specifies the pool size and compute level, which we can then also use to build a scaling law.

In contrast to the training steps $N^\star$, our compute scaling laws are highly linear (Figure `\ref{fig:extrapolation}`{=latex}), with $R^2$ above $0.99$, and both give similar predictions, near 1e+30 FLOPs for the crossing point. This compute level is high: the best current estimates of frontier pretraining compute are near 5e+26 FLOPs [@grok4_modelcard]. It is nonetheless far from an outlandish amount of near-future compute, with existing forecasts predicting 1e+29 FLOP training runs by 2030 [@epochepri2025aipower].
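
A log-log linear fit of this kind, together with its $R^2$ and an extrapolation, can be sketched as follows. This is an illustration under stated assumptions, not our fitting code: the helper names `fit_power_law` and `extrapolate` are hypothetical, and the data points are synthetic placeholders that lie exactly on a known power law, not measurements from our runs.

```python
import numpy as np


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the linear fit to (log10 x, log10 y).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lx, ly = np.log10(x), np.log10(y)
    slope, intercept = np.polyfit(lx, ly, 1)
    resid = ly - (slope * lx + intercept)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((ly - ly.mean()) ** 2)
    return 10.0 ** intercept, slope, r2


def extrapolate(a, b, x_new):
    """Evaluate the fitted power law at a new point."""
    return a * x_new ** b


# Synthetic placeholder points on y = 2 * x**1.5 (NOT data from the paper).
x = np.array([1e2, 1e3, 1e4, 1e5])
y = 2.0 * x ** 1.5

a, b, r2 = fit_power_law(x, y)
y_far = extrapolate(a, b, 1e6)  # prediction outside the fitted range
```

In the setting above, one would take `x` to be the pool size and `y` the compute level at the crossing point (or the reverse), and evaluate the fit at the 240T-token pool size.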

# Model Degradation {#sec:degradation}

In all of our experiments so far, we have seen that regardless of the distribution, more data helps if we are free to train a sufficiently large model for sufficiently long. We should not expect this to be a universal property in machine learning, as a large body of research has been dedicated to the problem of domain adaptation and learning under distribution shift [@mansour2023domainadaptationlearningbounds; @pmlr-v206-awasthi23b]. Instead, we hypothesize that language models are highly resistant to covariate shifts, and it is \`\`incorrectly labeled" data or data with shifts in the conditional distribution from a target metric that can be detrimental. For example, we expect that a model trained on sufficient instances of \`\`The capital of France is Copenhagen" will learn the wrong capital of France.

```{=latex}
\ignore{ On the other hand, when it is easy to tell that an input is from a different distribution with just a few tokens as in Section~\ref{sec:injection}, we do not expect degradation.}
```
::: {#tab:mmlu_averages}
  **Dataset**              **Support**   **Refute**   **Related**   **Unrelated**
  ----------------------- ------------- ------------ ------------- ---------------
  MMLU/world_religions        5.89          0.00         13.22          7.50
  MMLU/astronomy              2.03          0.14         10.14          17.41
  MMLU/college_biology        2.67          0.17         11.07          13.40
  MMLU/medical_genetics       2.80          0.23         14.30          11.23

  : Average GPT5-mini judgements on keyword-matched CC data for select MMLU categories.
:::

```{=latex}
\begin{wrapfigure}{r}{0.4\textwidth}
    
    \includegraphics[width=\linewidth]{figures/first_tokens_330M.pdf}
    \caption{330M model: loss of $670$M pool subset versus $+200\%$ dataset. }
    \label{fig:first-token-losses}
\end{wrapfigure}
```
While CC is too large to exhaustively search through and contains non-factual content such as conspiracy theories, we argue that such actively harmful content is relatively low frequency. We support this with a brief corpus analysis of MMLU-related documents in CC [@hendrycks2021measuringmassivemultitasklanguage]. We first match keywords, and then we prompt GPT5-mini to ask whether the document supports, refutes, is related to, or is unrelated to the question and answer. We target MMLU subjects such as world religions, where there are very rare keywords. We present our analysis in Table `\ref{tab:mmlu_averages}`{=latex}. While our search mostly found documents that are unrelated, or related but neither supporting nor refuting, the average number of supporting documents is at least an order of magnitude larger than the average number of refuting documents. In Appendix `\ref{app:theory}`{=latex}, we develop some theory to analyze when filtering should help, in terms of how factual or correctly labeled a dataset is.

We now move to a case of distribution shift from our experiments with shuffled word order documents in Section `\ref{sec:injection}`{=latex}. Our metric was the average validation loss across the entire sequence, but we may expect to suffer from a distribution shift in the loss on the initial tokens of a document, because we changed the distribution away from the natural distribution of first tokens that appear in CC. In the case of predicting the very first token, it is impossible to detect whether a document is shuffled, since the model has access only to the empty prefix.

```{=latex}
\ignore{Therefore, if our metric is validation loss on CC, we expect a drop in performance. }
```
In Figure `\ref{fig:first-token-losses}`{=latex}, we compare the average CC validation loss of models trained on the CC pool and on the $+200\%$ shuffled dataset when we restrict the loss to initial segments of the document. As we transition from the full average to the loss on only the very first token, the $+200\%$ dataset loses its advantage over the pool. We do not expect this behavior to change with larger models. However, as most use cases of language models involve more than just a few tokens, we do not anticipate that this is a meaningful degradation.
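
The restriction to initial segments amounts to averaging per-token losses over only the first few positions. A minimal sketch, assuming per-token losses are available as a (documents, positions) array; the function name `prefix_loss` and the toy values are hypothetical.

```python
import numpy as np


def prefix_loss(token_losses: np.ndarray, k: int) -> float:
    """Mean loss over the first k token positions of every document."""
    return float(token_losses[:, :k].mean())


# Hypothetical per-token losses for 2 documents of 4 tokens each.
losses = np.array([
    [8.0, 4.0, 2.0, 2.0],
    [6.0, 4.0, 3.0, 3.0],
])

first_token = prefix_loss(losses, 1)  # loss on the very first token only
full_mean = prefix_loss(losses, 4)    # average over the whole sequence
```

Sweeping `k` from the sequence length down to 1 moves from the full average to the first-token loss.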

# Theoretical models {#sec:theory}

We might ask whether the results we have identified are predictable: ought we expect them? We present two theoretical models, one here and one in the appendices, that exhibit the behaviors we see, suggesting the types of behavior we identify might hold more broadly.

Heuristically, we might hypothesize that once a (transformer) model is large enough, it can pass \`\`bad" data through components that do not interact with components representing \`\`good" data, and when a model is not large enough, this cannot occur. Our experiments are consistent with this: large models absorb unfiltered data without penalty, while smaller models cannot. In low-rank matrix factorization---the simplest 1 hidden layer (linear) neural network---we see exactly this behavior at the population level.

To make this more precise, consider predicting vector-valued outputs $y$ (tokens) using a rank $r$ linear transformation of an input $x$. Assume the pairs $(x, y) \in \Rset^d \times \Rset^m$ come from one of $k$ tasks, where task $i$ occurs with probability $p_i > 0$ and generates $Y =
u_{\star,i}\, v_{\star,i}^\top X_i + \xi$ for independent mean-zero noise $\xi \in
\Rset^m$, where $\Sigma_i = \E[X_i X_i^\top]$ satisfy $\tr(\Sigma_i
\Sigma_{i'}) = 0$ for $i \neq i'$, so that tasks have orthogonal inputs: one may exactly separate them. The next proposition, whose proof is in Appendix `\ref{app:theory}`{=latex}, follows.

::: proposition
**Proposition 1** (Rank Necessity under Orthogonal Inputs). *`\label{thm:rank-necessity}`{=latex} Let the conditions above hold and $M_{\star,i} = u_{\star,i}\, v_{\star,i}^\top$, and define $M_\star = \sum_{i=1}^k M_{\star,i}$ and $\Sigma = \sum_{i=1}^k p_i\Sigma_i$. If $\sigma_1 \ge \cdots \ge
  \sigma_\rho > 0$ are the $\rho \leq k$ positive singular values of $M_\star\Sigma^{1/2}$, then for any model rank $r$ $$\min_{\substack{U \in \Rset^{m \times r}\\ V \in \Rset^{d \times r}}} \E\!\big[\|Y - UV^\top X\|^2\big] = \sum_{j=r+1}^{\rho} \sigma_j^2 + \E\!\big[\|\xi\|^2\big],$$ where the sum evaluates to $0$ if $r \ge \rho$.*
:::

The result makes clear that, given a large enough model rank $r$ (it suffices that $r \ge k$, since $\rho \le k$), a matrix factorization can optimally represent the prediction problem. On the other hand, without enough capacity ($r < \rho$), model performance necessarily degrades with interference of the tasks in $Y$-space, as the singular values of $M_\star \Sigma^{1/2}$ capture. Moreover, at least at this population level, (regularized) gradient-based methods are guaranteed to find the optimal matrices $U$ and $V$, because the objective $\E[\|Y - UV^\top X\|^2]$ has no non-strict saddle points when $r \ge k$ [@BaldiHo88; @ZhuLiTaWa18], and gradient descent converges to local minimizers with probability 1 [@LeeSiJoRe16]. In a fairly precise sense, then, this simple matrix factorization model exhibits much of the behavior we see in experiments: with enough capacity, noise (tasks) can be immediately absorbed, while smaller models suffer, and first-order methods are sufficient for optimal fitting.
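
The proposition can be checked numerically on a small instance. The sketch below is illustrative only: it assumes tasks whose inputs live on disjoint coordinate blocks (one simple way to satisfy $\tr(\Sigma_i \Sigma_{i'}) = 0$), uses arbitrary dimensions and task probabilities, and drops the constant noise term $\E[\|\xi\|^2]$. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
k, block, m = 3, 2, 4              # tasks, input dims per task, output dim
d = k * block
p = np.array([0.5, 0.3, 0.2])      # task probabilities (arbitrary choice)

Sigmas, M_tasks = [], []
for i in range(k):
    idx = slice(i * block, (i + 1) * block)
    S = np.zeros((d, d))
    S[idx, idx] = np.diag(rng.uniform(0.5, 2.0, size=block))  # Sigma_i
    v = np.zeros(d)
    v[idx] = rng.normal(size=block)                           # v_{*,i}
    u = rng.normal(size=m)                                    # u_{*,i}
    Sigmas.append(S)
    M_tasks.append(np.outer(u, v))                            # M_{*,i}

M_star = sum(M_tasks)
Sigma = sum(pi * S for pi, S in zip(p, Sigmas))
Sigma_half = np.diag(np.sqrt(np.diag(Sigma)))  # Sigma is diagonal here
A = M_star @ Sigma_half
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def excess_loss(M):
    """Population loss minus the noise term: sum_i p_i E||(M_i - M) X_i||^2."""
    return sum(
        pi * np.trace((Mi - M) @ S @ (Mi - M).T)
        for pi, Mi, S in zip(p, M_tasks, Sigmas)
    )


def best_rank_r(r):
    """Rank-r minimizer built from the truncated SVD of A."""
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]
    return A_r @ np.linalg.pinv(Sigma_half)


# Predicted tail sums of squared singular values vs. achieved excess loss.
predicted = [float(np.sum(s[r:] ** 2)) for r in range(k + 1)]
achieved = [float(excess_loss(best_rank_r(r))) for r in range(k + 1)]
```

For each rank the two lists should agree up to floating-point error, and the excess loss should reach zero once the rank covers all tasks.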

```{=latex}
\ignore{If in addition the
$\{u_{\star,i}\}$ are mutually orthogonal, the singular values
decouple to
$\sigma_i = \|u_{\star,i}\|\sqrt{p_i\,v_{\star,i}^\top\Sigma_i\,v_{\star,i}}$,
and the rank-$r$ model only retains the $r$~tasks with the
largest signal strength, discarding the rest.}
```
# Discussion

While we have identified ways that scaling compute appears to make filtering immaterial, there are several limitations that lead to natural next steps for research in this direction.

**Deviations from vanilla pretraining.** Our setting is restricted to pretraining on dense transformer models, without any data curricula, data weights, or post-training. There may be more unstable architectures such as Mixture of Experts models (MoEs), or phenomena in later stages of training, that require more careful choices with the pretraining data.

**Duplicate documents.** The expected fraction of duplicate documents increases with subset size. At our subset size, it is likely much smaller than in the entire CC. We do not expect that our general conclusions would change, especially as we epoch the data, but this is a variable that likely does not play a large role in our experiments.

**Compute.** The compute required for raw Common Crawl to outperform our tested filters is large, up to around 1e+30 FLOPs with our projections in Figure `\ref{fig:extrapolation}`{=latex}. When compute is a bottleneck, we expect filtering to still be important.

**AI-generated content.** We expect the fraction of AI-generated content in CC to increase, with likely a small amount in our pre-2023 DCLM-Pool dataset. It is unclear whether this will be detrimental.

**Factuality.** We have conducted an initial study into CC factuality or correctness with Table `\ref{tab:mmlu_averages}`{=latex}, but there are likely some rare edge cases where models trained on the full pool learn inaccuracies.

```{=latex}
\newpage
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\appendix
```
# Experimental Details

Across all experiments, we use a context length of $1024$ tokens, a batch size of $2^{19} = 524{,}288$ tokens, and a $500$-step warmup. All runs have a weight decay tuned in $[0.1, 0.5]$, and oftentimes more for filter baselines. We use a learning rate of $10^{-2}$ for the $15$M model, $10^{-3}$ for the $7$B model, and $5 \times 10^{-3}$ for the models in between (80M, 330M, and 1B). This is in line with many scaling ladders that decay the learning rate with model size. We obtained these values with an initial tuning stage for each of the model sizes. Throughout the plots we vary the training steps as powers of $2$. We evaluate a model $5$ times during training and report the best checkpoint (which is almost always the last one, except for rare cases when the training steps are very large compared to data size).
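
The relationship between step counts, tokens, epochs, and compute under these settings can be made concrete with a few lines of arithmetic. This is a sketch, not our training configuration file: the names are hypothetical, and the FLOP estimate uses the common approximation of 6 FLOPs per parameter per training token.

```python
CONTEXT_LENGTH = 1024          # tokens per sequence
BATCH_TOKENS = 2 ** 19         # 524,288 tokens per step
WARMUP_STEPS = 500

# Learning rates by model size, as stated above.
LEARNING_RATE = {"15M": 1e-2, "80M": 5e-3, "330M": 5e-3, "1B": 5e-3, "7B": 1e-3}


def sequences_per_batch() -> int:
    """Number of full-length sequences in one batch."""
    return BATCH_TOKENS // CONTEXT_LENGTH


def tokens_seen(steps: int) -> int:
    """Total training tokens after a given number of steps."""
    return steps * BATCH_TOKENS


def epochs(steps: int, pool_tokens: float) -> float:
    """Number of passes over a pool of the given size."""
    return tokens_seen(steps) / pool_tokens


def approx_train_flops(params: float, steps: int) -> float:
    """Assumed approximation: 6 * parameters * training tokens."""
    return 6.0 * params * tokens_seen(steps)


# Example: 2**11 steps on a 670M-token pool.
example_epochs = epochs(2 ** 11, 670_000_000)
```

Because step counts are swept as powers of 2, the corresponding epoch counts on a fixed pool also double at each setting.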

All experiments were conducted on H200 GPUs. Each run used only data parallelism on a single 8-GPU node, with run times varying from less than an hour to 2-3 days. All experiments combined exceeded 20,000 H200 GPU hours.

## Fitting of scaling laws {#subsec:scaling_fits}

Several of our plots fit scaling laws to our empirically obtained measurements. In the bottom of Figure `\ref{fig:crossing-points}`{=latex}, we fit a second-degree polynomial to the log-log plot due to the super-linear and eventually infinite behavior. The hollow points on the plot are obtained from training runs at the given pool size, but with step counts prior to the crossing point. In those cases, we fit a power law to the (decaying) loss and extrapolate the first training step or token count at which the pool surpasses the best RefinedWeb loss (achieved or extrapolated). The cases where no crossing is ever predicted are marked with an orange \`\`x" on the plot, and only appear on the 80M model size plot. In Figure `\ref{fig:extrapolation}`{=latex}, we use a standard power law.
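
The second-degree fit on the log-log plot, and its use to read off the pool size for a given step count, can be sketched as below. This is an illustration under assumptions rather than our exact procedure: the helper names are hypothetical, the points are synthetic and lie exactly on a known quadratic, and the inversion simply picks the smallest real root inside the observed range.

```python
import numpy as np


def fit_loglog_quadratic(pool_tokens, crossing_steps):
    """Fit log10(steps) = c2*x**2 + c1*x + c0 with x = log10(pool size)."""
    x = np.log10(np.asarray(pool_tokens, dtype=float))
    y = np.log10(np.asarray(crossing_steps, dtype=float))
    return np.polyfit(x, y, 2)  # coefficients, highest degree first


def pool_size_for_steps(coeffs, target_steps, x_min, x_max):
    """Invert the fit: pool size whose fitted crossing point is target_steps.

    Only real roots with log10(pool size) in [x_min, x_max] are accepted;
    returns None if there is no such root.
    """
    c2, c1, c0 = coeffs
    roots = np.roots([c2, c1, c0 - np.log10(target_steps)])
    valid = [r.real for r in roots
             if abs(r.imag) < 1e-9 and x_min <= r.real <= x_max]
    return 10.0 ** min(valid) if valid else None


# Synthetic points on log10(N) = 0.1*x**2 - x + 4 (NOT data from the paper).
pools = np.array([1e8, 1e9, 1e10])
steps = 10.0 ** np.array([2.4, 3.1, 4.0])

coeffs = fit_loglog_quadratic(pools, steps)
recovered_pool = pool_size_for_steps(coeffs, 10.0 ** 3.1, 8.0, 10.0)
```

Restricting the roots to the observed range avoids reading the quadratic outside the region where it was fitted.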

# Benchmarks {#app:benchmarks}

In this section, we provide downstream benchmark results in addition to the validation loss metrics from the main text. We use PIQA, ARC-Easy, and SocialIQA [@sap2019socialiqacommonsensereasoningsocial], as these are easy enough to provide signal at our scale.

In Figure `\ref{fig:filters-600m-pool-benchmarks}`{=latex}, we provide plots analogous to those in Figure `\ref{fig:filters-600m-pool}`{=latex} but for the benchmarks, and Figure `\ref{fig:pareto-600m-pool-benchmarks}`{=latex} is similarly analogous to Figure `\ref{fig:pareto-670m-pool}`{=latex}. We do the same for the injected datasets: Figure `\ref{fig:random-600m-pool-benchmarks}`{=latex} shows the datasets with random injection and Figure `\ref{fig:shuffled-600m-pool-benchmarks}`{=latex} shows the datasets with shuffled word order. These plots are in general much noisier than the perplexity-based ones in the main text, likely due to the relatively small number of questions in the benchmarks. However, the trends are roughly the same.

![Ablation of $670$M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching).](figures/small_filtering_plot_bench.png){#fig:filters-600m-pool-benchmarks width="1\\linewidth"}

![Pareto frontier of compute vs. benchmark performance for CC pool and filtered datasets. The frontier is formed with the same runs as in Figure `\ref{fig:filters-600m-pool-benchmarks}`{=latex}.](figures/pareto_frontier_plot_bench.png){#fig:pareto-600m-pool-benchmarks width="1\\linewidth"}

Finally, we address the potential confounding in Section `\ref{sec:filtering}`{=latex} when we used the DCLM-Baseline filter on the 670M Common Crawl pool, which retains roughly 2% of the data and potentially results in severe data scarcity with respect to model size. While we did train a very small 15M parameter model in that setting, and note that no matter the subset size, DCLM-Baseline will always be about 2 orders of magnitude smaller than the pool, we provide an experiment here where we instead use 100M DCLM-Baseline tokens. This increases the size by roughly an order of magnitude. Figure `\ref{fig:dclm10x}`{=latex} adds this artificially-increased DCLM-Baseline to Figure `\ref{fig:filters-600m-pool}`{=latex}, and Figure `\ref{fig:pareto-dclm-10x}`{=latex} adds it to the Pareto curve of Figure `\ref{fig:pareto-670m-pool}`{=latex}. Even though the dataset now has more tokens than in the RefinedWeb subset, the pool and looser variants still outperform it with sufficient model size and training.

![Ablation of $670$M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens. ](figures/small_filtering_plot_dclm100m.png){#fig:dclm10x width="1\\linewidth"}

![Pareto frontier of compute vs. average negative log-likelihood for CC pool and filtered datasets. The frontier is formed with the same runs as in Figure `\ref{fig:filters-600m-pool}`{=latex}. The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens.](figures/pareto_frontier_plot_dclm100m.png){#fig:pareto-dclm-10x width=".5\\linewidth"}

![+Random benchmarks. ](figures/scaling_laws_v149_random_bench.png){#fig:random-600m-pool-benchmarks width="1\\linewidth"}

![+Shuffled benchmarks.](figures/scaling_laws_v149_shuffled_bench.png){#fig:shuffled-600m-pool-benchmarks width="1\\linewidth"}

# Proof {#app:theory}

We now restate Proposition `\ref{thm:rank-necessity}`{=latex} from Section `\ref{sec:theory}`{=latex} and give its proof.

::: proposition
**Proposition 2** (Rank Necessity under Orthogonal Inputs). *Assume the tasks have pairwise orthogonal input spaces ($X_i \perp X_j$ almost surely for $i \neq j$). Let $M_\star = \sum_{i=1}^k M_{\star,i}$ and $\Sigma = \sum_{i=1}^k p_i\Sigma_i$. If $\sigma_1 \ge \cdots \ge \sigma_\rho > 0$ are the $\rho \leq k$ positive singular values of $M_\star\Sigma^{1/2}$, then for any model rank $r$: $$\min_{\substack{U \in \Rset^{m \times r}\\ V \in \Rset^{d \times r}}} \E\!\big[\|Y - UV^\top X\|^2\big] = \sum_{j=r+1}^{\rho} \sigma_j^2 + \E\!\big[\|\xi\|^2\big],$$ where the sum evaluates to $0$ if $r \ge \rho$.*
:::

::: proof
*Proof.* Let $M = UV^\top$.

```{=latex}
\ignore{We may replace $v_{\star, i}$ with its orthogonal projection onto $\range(\Sigma_i)$, which does not affect $Y_i = M_{\star, i} X_i + \xi_i$. }
```
We first decouple the noise $\xi$: $$\begin{aligned}
  \E\big[\|Y - MX\|^2\big]
  &= \sum_{i=1}^k p_i\,\E\big[\|M_{\star,i}\,X_i + \xi_i - M\,X_i\|^2\big] \\
  &= \underbrace{\sum_{i=1}^k p_i\,\E\big[\|(M_{\star,i} - M)X_i\|^2\big]}_{=: g(M)}
    + \E\big[\|\xi\|^2\big],
\end{aligned}$$ where the cross-term $2\E[\xi_i^\top (M_{\star,i} - M)X_i]$ vanishes because the noise is independent of the input and has zero mean. Since the noise term does not depend on $M$, it suffices to minimize $g(M)$ over matrices of rank at most $r$. We rewrite the multi-task objective as a single-target objective: $$\begin{aligned}
  g(M) &= \sum_{i=1}^k p_i\,\tr\!\big((M_{\star,i} - M)\,\Sigma_i\,(M_{\star,i} - M)^\top\big) \\ 
  &= \sum_{i=1}^k p_i\,\tr\!\Big(\big(M_{\star} - M - \sum_{j\neq i}M_{\star, j}\big)\Sigma_i\big(M_{\star} - M - \sum_{j\neq i}M_{\star, j}\big)^\top\Big) \\
  &= \tr\!\big((M_{\star} - M)\,\Sigma\,(M_{\star} - M)^\top\big) \\
  &= \big\|(M_\star - M)\,\Sigma^{1/2}\big\|_F^2,
\end{aligned}$$ where the cross terms in the second step vanish by $M_{\star, j}\Sigma_i = u_{\star, j} v_{\star, j}^\top \E[X_i X_i^\top] = 0$ for $i\neq j$ since $v_{\star, j}^\top X_i = 0$.

Let $A = M_\star\Sigma^{1/2}$ have positive singular values $\sigma_1 \ge \cdots \ge \sigma_\rho$. The rank-constrained minimization reduces to $$\min_{\rank(M) \le r} g(M) = \min_{\rank(M)\le r} \|A - M\Sigma^{1/2}\|_F^2.$$ For any $M$ with $\rank(M) \le r$, the matrix $B = M\Sigma^{1/2}$ has rank at most $r$. By the Eckart--Young--Mirsky theorem, the squared Frobenius distance between $A$ and any matrix $B$ of rank at most $r$ is at least $\sum_{j=r+1}^\rho \sigma_j^2$. This lower bound is exactly achievable: let $A_r$ be the rank-$r$ truncated SVD of $A$. Because $A$ is formed by right-multiplying $M_\star$ by $\Sigma^{1/2}$, its row space, and therefore the row space of $A_r$, lies entirely within $\range(\Sigma^{1/2})$. Thus, setting $M = A_r(\Sigma^{1/2})^\dagger$ yields a valid matrix with $\rank(M) \le r$ that satisfies $M\Sigma^{1/2} = A_r$. The minimum achievable excess loss is therefore $\sum_{j=r+1}^\rho \sigma_j^2$, which vanishes if and only if $r \ge \rho$. ◻
:::

## Theoretical conditions for filter improvement

The other phenomenon we have found is the robustness of models and the lack of benefit from filtering. To understand this, we hypothesize that a sufficiently trained large model's conditional distributions can be defined by a similarity measure $s \colon \mathcal{X} \times \mathcal{X} \to \Rset_+$ over test inputs $x \in \mathcal{X}$ and train inputs $x_i \in \mathcal{X}$ from a training dataset $D=\{(x_i, y_i)\}_i$: $$\P_D(y \mid x) := \sum_{i\in D} \frac{s(x, x_i) }{\sum_{j\in D} s(x, x_j )}\1_{y_i = y}.$$ The conditional distribution is the fraction of examples in $D$ with the same label $y$, weighted by $s$. According to the definition, we can affect the model's prediction at a given test input $x$ by including a similar $x'$ in the training dataset $D$.

```{=latex}
\ignore{ If the corresponding label $y'$ does not match a target $y^\star$, the cross-entropy loss $-\log \P_D(y^\star\mid x)$ on the test example $x$ will increase. If we include unrelated documents with $s(x, x') = 0$, the loss is unaffected. }
```
```{=latex}
\ignore{To do this, we will first study a very simple function $d$, which will work via thresholding. Intuitively, we average all the $y_i$ in the data for which $x_i$ is close enough to $x$.}
```
Let us consider the error (in KL divergence) of this predictor $\P_D$ compared to a predictor using filtered data $\P_{\phi\circ D}$ with $\phi \circ D \subseteq D$. We use the notation $D|_y$ to denote the restriction of $D$ to examples $(x_i, y_i)$ with $y_i=y$ and $D|_{\neq y}$ to denote the restriction of $D$ to examples $(x_i, y_i)$ with $y_i\neq y$. Expectations are defined with respect to an $s(x, \cdot)$-weighted dataset. We find a simple characterization of the error difference.

::: fact
**Fact 3** (Characterization of Filter Improvement). *`\label{fact:improvement}`{=latex} Given a Dirac target conditional $\P_t(\cdot \mid x)$ with all mass on $y^\star$, the improvement of $\P_{\phi\circ D}$ with respect to $\P_D$ in KL divergence to $\P_t$ is $$KL( \P_t \mid\mid \P_{D}) - KL(\P_{t} \mid\mid  \P_{\phi\circ D}) =  -\log \paren*{\P_D(y^\star|x) + (1-\P_D(y^\star|x)) \frac{\E_{D|_{\neq y^\star}} \bracket{\phi(X, Y)} }{\E_{ D|_{y^\star}} \bracket{\phi(X, Y)} }}.$$*
:::

Fact `\ref{fact:improvement}`{=latex} shows that two terms appear in the difference: the prevalence of the label $y^\star$ in the original dataset, $\P_{D}(y^\star\mid x)$, and a measurement of the filter performance via the ratio of the false positive rate to the true positive rate, $\E_{D|_{\neq y^\star}} \bracket{\phi(X, Y)} /\E_{ D|_{y^\star}} \bracket{\phi(X, Y)}$. If the prevalence is already high, there is little improvement possible; otherwise, $\phi$ must distinguish well between correct and incorrect labels on the $s$-weighted dataset. In the case of CC, Table `\ref{tab:mmlu_averages}`{=latex} suggests that the prevalence on select MMLU subjects is already high. In cases of strong filtering (e.g., removing 99% of the data, including all $x'$ with high $s(x, x')$), it is plausible to expect a true positive rate of approximately 0, leading to a worsening.
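
The identity can be verified directly on a toy dataset. The sketch below assumes a fixed test input $x$, so each training example reduces to a triple of similarity weight $s(x, x_i)$, label $y_i$, and filter decision $\phi(x_i, y_i)$; the helper names and all numeric values are hypothetical.

```python
import math


def weighted_prob(data, y):
    """P_D(y | x): s-weighted fraction of examples carrying label y."""
    total = sum(s for s, _, _ in data)
    return sum(s for s, label, _ in data if label == y) / total


def filter_rate(data, keep_label):
    """s-weighted mean of phi over the examples selected by keep_label."""
    subset = [(s, phi) for s, label, phi in data if keep_label(label)]
    return sum(s * phi for s, phi in subset) / sum(s for s, _ in subset)


# (similarity weight, label, phi) triples; toy values for illustration only.
data = [
    (1.0, "a", 1),
    (2.0, "a", 1),
    (1.0, "a", 0),
    (1.0, "b", 1),
    (3.0, "b", 0),
]
y_star = "a"

p_d = weighted_prob(data, y_star)
filtered = [row for row in data if row[2] == 1]
p_phi_d = weighted_prob(filtered, y_star)

# For a Dirac target, KL(P_t || P) = -log P(y_star | x).
lhs = -math.log(p_d) + math.log(p_phi_d)

tpr = filter_rate(data, lambda label: label == y_star)  # true positive rate
fpr = filter_rate(data, lambda label: label != y_star)  # false positive rate
rhs = -math.log(p_d + (1.0 - p_d) * fpr / tpr)
```

Here the filter keeps correct labels more often than incorrect ones, so the improvement is positive; swapping the two rates would make it negative.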

::: proof
*Proof.* In the following, we drop the first argument to $s(\cdot, \cdot)$ to simplify notation. We first simplify the difference using the definition of KL: $$\begin{aligned}
        & KL( \P_{t} \mid\mid \P_{D}) - KL(\P_{t} \mid\mid  \P_{\phi\circ D})  
        =  \sum_{y\in\sY} \P_{t}(y\mid x) \log \frac{\P_{\phi \circ D}(y\mid x)}{\P_D(y\mid x)}. 
    
\end{aligned}$$ We analyze the likelihood ratio: $$\begin{aligned}
    \frac{\P_D(y\mid x)}{\P_{\phi \circ D}(y\mid x)} & = \frac{\sum_{i\in D, y_i = y} \frac{s(x_i)}{\sum_{j\in D} s( x_j)}}{\sum_{i\in D, y_i = y} \frac{s(x_i)\phi(x_i, y_i)}{\sum_{j\in D} s( x_j)\phi(x_j, y_j)}} \\
    & = \frac{\sum_{j\in D} s( x_j)\phi(x_j, y_j)}{\sum_{j\in D} s( x_j)}\frac{\sum_{i\in D, y_i = y} s(x_i)}{\sum_{i\in D, y_i = y}s(x_i)\phi(x_i, y_i)} \\
    & = \frac{\E_{D} \bracket{\phi(X, Y)} }{\E_{D_{|y}} \bracket{\phi(X, Y)} } \\
    & = \frac{\P_{D}(y|x)\E_{D_{|y}} \bracket{\phi(X, Y)} + (1-\P_{D}(y|x)) \E_{D_{|\neq y}} \bracket{\phi(X, Y)} }{\E_{D_{|y}} \bracket{\phi(X, Y)} } \\
    & = \P_{D}(y|x) + \left(1-\P_{D}(y|x)\right) \frac{\E_{D|_{\neq y}} \bracket{\phi(X, Y)} }{\E_{ D_{|y}} \bracket{\phi(X, Y)} }. 
\end{aligned}$$ Plugging this back in, we find that the difference is $$\begin{aligned}
     - \sum_{y\in\sY} \P_{t}(y\mid x) \log \paren*{\P_D(y|x) + (1-\P_D(y|x)) \frac{\E_{D|_{\neq y}} \bracket{\phi(X, Y)} }{\E_{ D_{|y}} \bracket{\phi(X, Y)} }},
    
\end{aligned}$$ of which the statement of the fact is the special case where $\P_t$ places all of its mass on $y^\star$. ◻
:::

```{=latex}
\newpage
```
# NeurIPS Paper Checklist {#neurips-paper-checklist .unnumbered}

1.  **Claims**

2.  Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

3.  Answer: `\answerYes{}`{=latex}

4.  Justification: Our central claim is that sufficiently trained large models can benefit from all of Common Crawl. We provide extensive scaling studies and experiments to support this.

5.  Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the abstract and introduction do not include the claims made in the paper.

    -   The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A `\answerNo{}`{=latex} or `\answerNA{}`{=latex} answer to this question will not be perceived well by the reviewers.

    -   The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    -   It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

6.  **Limitations**

7.  Question: Does the paper discuss the limitations of the work performed by the authors?

8.  Answer: `\answerYes{}`{=latex}

9.  Justification: We provide a clear limitations section in the main text which includes discussions of the pretraining setting and compute required.

10. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper has no limitation while the answer `\answerNo{}`{=latex} means that the paper has limitations, but those are not discussed in the paper.

    -   The authors are encouraged to create a separate \`\`Limitations" section in their paper.

    -   The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    -   The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    -   The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    -   The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    -   If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    -   While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

11. **Theory assumptions and proofs**

12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

13. Answer: `\answerYes{}`{=latex}

14. Justification: Each theoretical statement is introduced with the necessary notation and assumptions in the main text, and we provide the proofs in the supplemental material.

15. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include theoretical results.

    -   All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    -   All assumptions should be clearly stated or referenced in the statement of any theorems.

    -   The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    -   Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    -   Theorems and Lemmas that the proof relies upon should be properly referenced.

16. **Experimental result reproducibility**

17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

18. Answer: `\answerYes{}`{=latex}

19. Justification: We describe the experimental setup in detail in the preliminaries section, along with the already open-source code that we use. Further experimental details are provided in the first section of the supplementary material.

20. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include experiments.

    -   If the paper includes experiments, a `\answerNo{}`{=latex} answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    -   If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    -   Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), release of a model checkpoint, or other means that are appropriate to the research performed.

    -   While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:

        1.  If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

        2.  If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

        3.  If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

        4.  We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

21. **Open access to data and code**

22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

23. Answer: `\answerYes{}`{=latex}

24. Justification: The data, such as the DCLM-Pool dataset, is already public. The code is simply the public Meta Lingua code repository along with the public DCLM code repository, which already contain their own respective setup instructions.

25. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include experiments requiring code.

    -   Please see the NeurIPS code and data submission guidelines (<https://neurips.cc/public/guides/CodeSubmissionPolicy>) for more details.

    -   While we encourage the release of code and data, we understand that this might not be possible, so `\answerNo{}`{=latex} is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    -   The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (<https://neurips.cc/public/guides/CodeSubmissionPolicy>) for more details.

    -   The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    -   The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    -   At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    -   Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

26. **Experimental setting/details**

27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?

28. Answer: `\answerYes{}`{=latex}

29. Justification: We describe the experimental setting in the preliminaries section of the main text along with further details and hyperparameter choices such as learning rate and weight decay in the first section of the supplementary material.

30. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include experiments.

    -   The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    -   The full details can be provided either with the code, in appendix, or as supplemental material.

31. **Experiment statistical significance**

32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

33. Answer: `\answerNo{}`{=latex}

34. Justification: Error bars are not reported because the repeated runs they require would be too computationally expensive in our transformer pretraining setting.

35. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include experiments.

    -   The authors should answer `\answerYes{}`{=latex} if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    -   The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    -   The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).

    -   The assumptions made should be given (e.g., Normally distributed errors).

    -   It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    -   It is OK to report 1-sigma error bars, but one should state it. If the hypothesis of Normality of errors is not verified, the authors should preferably report a 2-sigma error bar rather than state that they have a 95% CI.

    -   For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).

    -   If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
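
The distinction drawn in the guidelines above between the standard deviation and the standard error of the mean can be illustrated with a minimal sketch. The scores below are hypothetical values standing in for one metric measured over five seeds; they are not results from this paper, and the helper name `summarize_runs` is our own.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample SD, uses the n - 1 denominator
    sem = std / math.sqrt(n)        # SEM shrinks as the number of runs grows
    return mean, std, sem


# Hypothetical accuracies from five seeds (illustrative values only).
scores = [0.50, 0.52, 0.48, 0.51, 0.49]
mean, std, sem = summarize_runs(scores)

print(f"mean             = {mean:.4f}")
print(f"1-sigma SD bar   = +/- {std:.4f}")
print(f"1-sigma SEM bar  = +/- {sem:.4f}")
print(f"2-sigma SEM bar  = +/- {2 * sem:.4f}")
```

Which bar is appropriate depends on the question: the standard deviation describes run-to-run spread, while the standard error describes uncertainty in the estimated mean. Interpreting the 2-sigma bar as an approximate 95% interval additionally assumes Normally distributed errors.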

36. **Experiments compute resources**

37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

38. Answer: `\answerYes{}`{=latex}

39. Justification: We describe the compute resources used in the first section of the supplementary material.

40. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not include experiments.

    -   The paper should indicate the type of compute workers (CPU or GPU) and whether they belong to an internal cluster or a cloud provider, including relevant memory and storage.

    -   The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    -   The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

41. **Code of ethics**

42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <https://neurips.cc/public/EthicsGuidelines>?

43. Answer: `\answerYes{}`{=latex}

44. Justification: We have read the code of ethics and our research complies with all the listed areas.

45. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the authors have not reviewed the NeurIPS Code of Ethics.

    -   If the authors answer `\answerNo{}`{=latex}, they should explain the special circumstances that require a deviation from the Code of Ethics.

    -   The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

46. **Broader impacts**

47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

48. Answer: `\answerNA{}`{=latex}

49. Justification: Our work is foundational in nature and not tied to any specific applications or deployments.

50. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that there is no societal impact of the work performed.

    -   If the authors answer `\answerNA{}`{=latex} or `\answerNo{}`{=latex}, they should explain why their work has no societal impact or why the paper does not address societal impact.

    -   Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    -   The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    -   The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    -   If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

51. **Safeguards**

52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?

53. Answer: `\answerNA{}`{=latex}

54. Justification: Our work poses no such risks because we release neither new data nor pretrained language models. We pretrain language models only to measure their performance.

55. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper poses no such risks.

    -   Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    -   Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    -   We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

56. **Licenses for existing assets**

57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

58. Answer: `\answerYes{}`{=latex}

59. Justification: Every use of code and data is properly cited in the main text.

60. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not use existing assets.

    -   The authors should cite the original paper that produced the code package or dataset.

    -   The authors should state which version of the asset is used and, if possible, include a URL.

    -   The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    -   For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    -   If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, [paperswithcode.com/datasets](https://paperswithcode.com/datasets){.uri} has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    -   For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    -   If this information is not available online, the authors are encouraged to reach out to the asset's creators.

61. **New assets**

62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

63. Answer: `\answerNA{}`{=latex}

64. Justification: We are not releasing any new assets.

65. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not release new assets.

    -   Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    -   The paper should discuss whether and how consent was obtained from people whose asset is used.

    -   At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

66. **Crowdsourcing and research with human subjects**

67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

68. Answer: `\answerNA{}`{=latex}

69. Justification: The paper does not involve crowdsourcing nor research with human subjects.

70. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not involve crowdsourcing nor research with human subjects.

    -   Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    -   According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

71. **Institutional review board (IRB) approvals or equivalent for research with human subjects**

72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

73. Answer: `\answerNA{}`{=latex}

74. Justification: The paper does not involve crowdsourcing nor research with human subjects.

75. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the paper does not involve crowdsourcing nor research with human subjects.

    -   Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    -   We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    -   For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.

76. **Declaration of LLM usage**

77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.

78. Answer: `\answerNA{}`{=latex}

79. Justification: The core method development did not involve LLMs as any important, original, or non-standard components.

80. Guidelines:

    -   The answer `\answerNA{}`{=latex} means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.

    -   Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described.

[^1]: Use footnote for providing further information about author (webpage, alternative address)---*not* for acknowledging funding agencies.
