A Bitter Lesson for Data Filtering

Source

Status And Credibility

Submitted to arXiv on 2026-05-19 as v1. This is a current Stanford University preprint by Christopher Mohri, John Duchi, and Tatsunori Hashimoto. Treat it as credible, current evidence on data filtering for large-scale language-model pretraining: the authors are a strong optimization and language-modeling team, the paper includes a large scaling study, and the authors released replication configs and instructions. It is still an arXiv preprint rather than a peer-reviewed result, and the code repository mainly provides configs around Meta Lingua and DCLM rather than a one-command reproduction.

Core Claim

The paper argues that in high-compute, data-scarce language-model pretraining, the best tested filter can eventually be no filter. Standard filters can help at small compute, but when model size and training steps are scaled enough, an unfiltered Common Crawl pool outperforms filtered subsets and can even benefit from documents that look low quality or distracting.

The central objective is dataset value: the best result a dataset can yield once the training run is allowed to use the model size and step count that best extract it.

That objective is important because it does not ask which dataset wins under one fixed small-compute recipe. It asks which dataset has more extractable value when compute can be increased.
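
One way to write the objective described above is sketched here. The notation is an assumption of this note, not taken from the paper: N is model size, T is the number of training steps, θ_{N,T}(D) is the model obtained by training on dataset D with that configuration, and L_val is the validation negative log-likelihood.

```latex
V(\mathcal{D}) \;=\; \min_{N,\,T}\; \mathcal{L}_{\mathrm{val}}\!\bigl(\theta_{N,T}(\mathcal{D})\bigr)
```

Under this reading, lower V is better, and dataset D_1 is preferred to D_2 when V(D_1) < V(D_2). A fixed small-compute comparison corresponds to pinning (N, T) at one value instead of minimizing over them, which is why the two comparisons can rank datasets differently.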

Author Narrative Context

The X post that triggered this ingest summarizes the result in four points: Stanford researchers tested the assumption that large models need only high-quality filtered data; filtering helps under small compute budgets, but the full unfiltered dataset can win as models get larger and train longer; large models can tolerate messy, irrelevant, and some junk data; and the practical choice becomes a trade-off between compute and curation.

That social framing is directionally aligned with the paper, but it is broader than what the experiments prove. The paper’s claim is about dense transformer language-model pretraining on Common Crawl-like text under a high-compute regime. It is not a proof that every modality, architecture, data source, or deployment stage should stop filtering.

Key Contributions

  • Runs controlled scaling studies over Common Crawl pools and filtered variants, varying model size and training steps rather than comparing one fixed recipe.
  • Compares unfiltered Common Crawl against English, repetition, stop-word, RefinedWeb, and DCLM-Baseline-style filters.
  • Shows that filtered data can be better at low compute, while unfiltered data crosses over and wins for sufficiently large models trained sufficiently long.
  • Tests deliberate low-quality injection with random strings and word-shuffled documents, finding that sufficiently large models are surprisingly robust and can extract useful unigram/co-occurrence signal from shuffled text.
  • Builds scaling-law extrapolations suggesting that the full 240T-token DCLM Common Crawl pool could beat RefinedWeb around the 1e30 FLOP scale under their assumptions.
  • Provides theoretical toy models for why sufficient capacity can absorb additional tasks/noise and when filtering can help or hurt conditional prediction.

Method Notes

The paper uses DCLM-Pool Common Crawl before 2023, parsed from HTML with resiliparse, with subsets from roughly 670M to 10B GPT-NeoX tokens. Models are Llama-style dense transformers from 15M to 7B parameters trained with Meta Lingua. The main metrics are validation negative log-likelihood on English C4, FineWeb-Edu, and Cosmopedia; appendix benchmark results include ARC-Easy, PIQA, and SocialIQA.

The tested filters are representative open curation filters rather than an exhaustive learned data-selection search:

  • English filter using fastText language scoring.
  • Repetition filter inspired by Gopher-style duplicate/repetition thresholds.
  • Stop-word filter.
  • RefinedWeb-style filter stack.
  • DCLM-Baseline-style heuristic, deduplication, and quality filtering.
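
To make the heuristic end of this list concrete, here is a minimal sketch of a stop-word filter and a line-repetition filter. The stop-word list, the thresholds, and the function names are illustrative assumptions, loosely in the spirit of Gopher-style rules; they are not the paper's or any library's exact implementation.

```python
from collections import Counter

# Hypothetical stop-word list and thresholds, chosen only for illustration.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_stop_word_filter(text: str, min_stop_words: int = 2) -> bool:
    """Keep a document only if it contains enough distinct stop words."""
    words = set(text.lower().split())
    return len(words & STOP_WORDS) >= min_stop_words


def passes_repetition_filter(text: str, max_dup_line_frac: float = 0.3) -> bool:
    """Reject documents in which too many non-empty lines are duplicates."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    counts = Counter(lines)
    # Count every copy of a line beyond its first occurrence.
    dup_lines = sum(c - 1 for c in counts.values())
    return dup_lines / len(lines) <= max_dup_line_frac


def apply_filters(docs):
    """Hard-deletion filtering: a document survives only if it passes both checks."""
    return [
        d for d in docs
        if passes_stop_word_filter(d) and passes_repetition_filter(d)
    ]
```

Both checks are hard keep-or-drop decisions, which is the kind of filtering the paper compares against the unfiltered pool.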

The low-quality injection experiments add either random generated strings or additional Common Crawl documents with word order shuffled.
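
The two injection types can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for whatever word segmentation the authors used, and the function names, alphabet, and word length are hypothetical.

```python
import random
import string


def shuffle_words(doc: str, rng: random.Random) -> str:
    """Destroy word order while keeping the document's bag of words intact."""
    words = doc.split()
    rng.shuffle(words)
    return " ".join(words)


def random_string_doc(n_words: int, rng: random.Random, word_len: int = 8) -> str:
    """Generate a document of random character strings with no linguistic signal."""
    alphabet = string.ascii_lowercase + string.digits
    return " ".join(
        "".join(rng.choice(alphabet) for _ in range(word_len))
        for _ in range(n_words)
    )


def inject(pool, extra_docs, mode: str, rng: random.Random, n_words: int = 50):
    """Return the pool plus one low-quality document per entry in extra_docs."""
    if mode == "shuffled":
        junk = [shuffle_words(d, rng) for d in extra_docs]
    elif mode == "random":
        junk = [random_string_doc(n_words, rng) for _ in extra_docs]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return list(pool) + junk
```

Note the asymmetry between the two modes: a shuffled document still carries the word counts of a real Common Crawl document, while a random-string document carries nothing beyond its character statistics.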

```mermaid
flowchart LR
  CC["Common Crawl pool"] --> Filters["standard filters"]
  CC --> Raw["unfiltered pool"]
  CC --> Junk["+ random or shuffled text"]
  Filters --> Train["scale model size and steps"]
  Raw --> Train
  Junk --> Train
  Train --> Eval["best validation loss / benchmarks"]
  Eval --> Verdict["filter helps low compute; pool can win high compute"]
```

Evidence And Results

The strongest empirical result is the crossing behavior. With small models or insufficient training steps, filtered subsets can outperform the raw pool. With larger models and enough training, the unfiltered pool reaches the best loss in the 670M-token setting, and the crossing point arrives earlier as model size increases.
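
A toy model helps show why such a crossing can occur. The sketch below assumes each dataset's loss follows a saturating power law in compute, `floor + scale * compute ** (-exponent)`, with the filtered set improving faster early but saturating at a higher floor. The functional form and all numbers are illustrative assumptions, not the paper's fitted scaling law.

```python
def loss(compute: float, floor: float, scale: float, exponent: float) -> float:
    """Toy saturating power law: loss falls toward `floor` as compute grows."""
    return floor + scale * compute ** (-exponent)


def crossover_compute(filtered, unfiltered):
    """Compute at which two toy curves cross, assuming a shared exponent.

    Each argument is a (floor, scale, exponent) tuple. Returns None when the
    unfiltered curve never overtakes the filtered one under this toy model.
    """
    (floor_f, scale_f, exp_f), (floor_u, scale_u, exp_u) = filtered, unfiltered
    if exp_f != exp_u:
        raise ValueError("closed form below assumes a shared exponent")
    if not (floor_u < floor_f and scale_u > scale_f):
        return None
    # Solve floor_f + scale_f * C**-p == floor_u + scale_u * C**-p for C.
    return ((scale_u - scale_f) / (floor_f - floor_u)) ** (1.0 / exp_f)


# Hypothetical parameters in arbitrary compute units.
FILTERED = (3.0, 20.0, 0.5)
UNFILTERED = (2.8, 40.0, 0.5)
c_star = crossover_compute(FILTERED, UNFILTERED)
```

With these made-up parameters the crossing lands at roughly 1e4 compute units: below it the filtered curve has lower loss, above it the unfiltered curve does. The qualitative point is that a lower floor (more extractable value) can be hidden behind a larger early-training penalty.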

The paper’s scaling-pool experiments add an important qualification: the crossover is not cheap. As the Common Crawl pool size grows, the number of steps needed for the raw pool to beat RefinedWeb grows rapidly, while larger models reduce that crossing step count. Their extrapolation predicts that the full 240T-token DCLM-Pool could require roughly 1e30 FLOPs before no-filter becomes optimal.
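
For a sense of scale, the common rule of thumb that dense-transformer training costs about 6 FLOPs per parameter per token can be applied to these numbers. This is back-of-envelope arithmetic under that approximation, not the paper's own accounting, and the 1e12-parameter model below is a hypothetical chosen for illustration.

```python
def training_flops(n_params: float, n_tokens_seen: float) -> float:
    """Approximate dense-transformer training cost: 6 * parameters * tokens."""
    return 6.0 * n_params * n_tokens_seen


def passes_at_budget(flop_budget: float, n_params: float, pool_tokens: float) -> float:
    """Number of full passes over a pool that a FLOP budget buys at a given model size."""
    return flop_budget / training_flops(n_params, pool_tokens)


# One pass of the largest model in the study (7B) over the largest subset (10B tokens).
one_pass_7b = training_flops(7e9, 10e9)

# Hypothetical 1e12-parameter model spending 1e30 FLOPs on the 240T-token pool.
passes = passes_at_budget(1e30, 1e12, 240e12)
```

Under these assumptions the 7B run costs about 4.2e20 FLOPs per pass, and a 1e30 FLOP budget at one trillion parameters corresponds to about 694 passes over the 240T-token pool. That gap of roughly nine to ten orders of magnitude is why the full-pool conclusion rests on extrapolation rather than direct measurement.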

The low-quality injection experiments refine the story. Random strings and shuffled-word documents hurt smaller models, but gaps close as model size increases. Shuffled documents can help because they still contain useful unigram and co-occurrence statistics from additional Common Crawl documents, even after syntax is damaged.

The degradation section is the main caveat against overgeneralizing. The authors explicitly distinguish easy-to-separate covariate-shift or junk data from harmful conditional-distribution shift: enough false statements such as “The capital of France is Copenhagen” should make the model learn the wrong fact. They also show a first-token degradation case for shuffled documents, because a model cannot detect a shuffled document from an empty prefix.
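
The distinction can be made precise with a two-component mixture. The notation here is an assumption of this note rather than the paper's: p is the clean data distribution, q the injected one, ε the injected fraction, x a prefix, and y its continuation. The sketch also assumes a model with enough capacity, trained to minimize negative log-likelihood, approaches the conditional of the training mixture.

```latex
\begin{aligned}
p_{\mathrm{mix}}(x, y) &= (1-\varepsilon)\, p(x)\, p(y \mid x) + \varepsilon\, q(x)\, q(y \mid x), \\[4pt]
p_{\mathrm{mix}}(y \mid x) &= \frac{(1-\varepsilon)\, p(x)\, p(y \mid x) + \varepsilon\, q(x)\, q(y \mid x)}{(1-\varepsilon)\, p(x) + \varepsilon\, q(x)}, \\[8pt]
\text{separable junk: } q(x) = 0 \text{ on clean prefixes} \;&\Longrightarrow\; p_{\mathrm{mix}}(y \mid x) = p(y \mid x), \\[4pt]
\text{conflicting data: } q(x) = p(x) \;&\Longrightarrow\; p_{\mathrm{mix}}(y \mid x) = (1-\varepsilon)\, p(y \mid x) + \varepsilon\, q(y \mid x).
\end{aligned}
```

In the first case the junk occupies prefixes that clean text never produces, so predictions on clean prefixes are untouched and only capacity is spent. In the second case the false statement shares its prefix with the true one ("The capital of France is"), so the learned conditional is pulled toward q in proportion to ε. The first-token case falls into the second category because the empty prefix has probability one under both p and q, which leaves nothing for the model to condition on.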

Relation To Dynamic Curriculum Learning

This source does not kill the case for Dynamic Curriculum Learning For JEPA. It changes the default filter contract.

The dynamic-curriculum lesson is not “always filter more.” It is:

use filtering or reweighting to allocate compute,
not to irreversibly erase distributional support.

For a useful-signal-poor corpus, repeated normal windows can waste compute, so dynamic selection is still valuable. But this paper warns that human-designed quality filters can remove low-quality-looking examples that actually carry useful weak signal, rare vocabulary, unusual co-occurrences, or long-tail regimes. A dynamic curriculum should therefore behave more like a bounded sampler than a hard deletion policy.

For multivariate time-series models, the translation is:

  • Exact or near-duplicate normal windows can be downweighted because they add little new state information.
  • Structured rare regimes, pre-failure buildup, intervention windows, unusual seasonality, and tail devices should not be removed merely because they have high loss, odd scale, or low surface quality.
  • A uniform sampling floor is not a hack; it is a support-preservation mechanism.
  • High-surprise caps are still needed for corrupt or unlearnable windows, but the cap should be paired with per-regime coverage checks so it does not discard all tails.
  • At high enough compute, no-filter or loose-filter baselines become essential controls. If a dynamic curriculum only wins under tiny budgets by deleting diversity, it is not a foundation-model result.
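
The floor, cap, and coverage ideas in the list above can be sketched as a small sampler. This is a hypothetical illustration: loss-proportional priority, the `floor` and `surprise_cap` parameters, and the regime labels are assumptions of this note, not a method from the paper.

```python
def bounded_sampling_probs(losses, floor: float = 0.2, surprise_cap: float = 5.0):
    """Mix a uniform floor with capped loss-proportional priority.

    Every window keeps at least floor / n probability (support preservation),
    and no single window's priority score can exceed surprise_cap.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must be in [0, 1]")
    n = len(losses)
    capped = [min(max(l, 0.0), surprise_cap) for l in losses]
    total = sum(capped)
    if total == 0.0:
        return [1.0 / n] * n  # no signal: fall back to uniform sampling
    return [floor / n + (1.0 - floor) * c / total for c in capped]


def regime_coverage(probs, regimes):
    """Total sampling mass per regime label, for checking that tails survive."""
    coverage = {}
    for regime, p in zip(regimes, probs):
        coverage[regime] = coverage.get(regime, 0.0) + p
    return coverage


# Hypothetical windows: three redundant normal ones, one rare regime, one corrupt.
losses = [0.1, 0.1, 0.1, 2.0, 50.0]
regimes = ["normal", "normal", "normal", "pre_failure", "corrupt"]
probs = bounded_sampling_probs(losses)
coverage = regime_coverage(probs, regimes)
```

The cap only bounds the corrupt window's share rather than removing it, and the floor keeps the redundant normal windows in the support. A per-regime coverage check like `regime_coverage` is what would flag a policy that starves a tail regime.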

So Alex’s hypothesis is mostly right with a sharper boundary: natural diversity can make aggressive filtering harmful, but redundant repetition still wastes training compute. The useful filter is diversity-preserving reweighting or deduplication, not benchmark- or quality-proxy pruning that shrinks the support of the data distribution.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Data diversity, curriculum, and long tail | warning | Shows that filters can throw away useful low-quality-looking signal and that raw diverse data can win when compute and model capacity are high. | Need matched-compute time-series and video tests comparing no filter, loose filters, deduplication, and dynamic reweighting under rare-state metrics. |
| Dynamic compute allocation | adjacent | Reframes filtering as a compute-vs-data trade-off: with low compute, selection helps; with enough compute, keeping more data can win. | Need policies that adapt selection strength to compute budget, model size, and data repetition rate. |
| Benchmark level | warning | Aggregate validation loss can prefer filtered data early and raw data later; downstream benchmark trends are noisier. | TSFM benchmarks need rare-regime retention, normal-retention, corruption controls, and diversity support metrics, not only average forecast error. |
| Training dynamics | adjacent | Dataset value depends on model size, training steps, epoching, weight decay, and capacity to separate tasks/noise. | Need TSFM evidence under practical optimizers and long streaming corpora. |

Limitations And Gotchas

  • The source is a 2026 arXiv v1 preprint, not a peer-reviewed result.
  • The evidence is language-model pretraining on Common Crawl-like text, not multivariate time series, event streams, JEPA, video, or action-conditioned world models.
  • The experiments use dense transformers and do not test MoE instability, post-training, data curricula, data weights, or safety-heavy filtering.
  • The full-pool conclusion requires very large compute in the authors’ extrapolation. Under realistic low-compute budgets, filtering can still help.
  • The tested filters are common heuristic filters, not an exhaustive search over learned or target-aware filters.
  • The paper does not imply that harmful conditional-distribution shifts, incorrect labels, poisoning, PII, legal constraints, or safety risks should be left unfiltered.
  • The DCLM-Pool data is pre-2023, so the paper does not settle the effect of future Common Crawl with much higher AI-generated-content density.

Open Questions

  • At what compute and model-size scale does no-filter beat dynamic filtering for multivariate time-series and video pretraining?
  • Can a dynamic curriculum preserve the benefits of no-filter diversity while avoiding repeated low-information windows?
  • Which diversity metrics should constrain the sampler: per-regime coverage, gradient diversity, embedding coverage, autocorrelation, topology coverage, or rare-state probe score?
  • How much of the benefit of no-filter comes from true diversity versus simple token volume and epoching effects?
  • Can a learned curriculum discover when to loosen filtering as compute grows?
  • How should privacy, safety, PII, and poisoning filters be separated from quality filters that might remove useful tails?