# Scaling Laws, Carefully

## Provenance

- Source type: author technical blog post plus X announcement.
- Blog title: Scaling Laws, Carefully.
- Blog date: 2026-06-24.
- Author: Lilian Weng.
- Official blog URL: <https://lilianweng.github.io/posts/2026-06-24-scaling-laws/>
- Original X status: <https://x.com/i/status/2070237256070389897>
- Canonical X status resolved from API author metadata: <https://x.com/lilianweng/status/2070237256070389897>
- X post timestamp: 2026-06-25T20:06:39Z.
- Snapshot date: 2026-06-27.
- Local source artifacts: `source_blog.html`, `source_blog_article.html`, `source_blog_article.md`, `x_post_lilianweng_2070237256070389897.json`, `assets/`, and `00README.json`.

## Source Status

This is a 2026 Lil'Log technical blog article, not a peer-reviewed paper. It is credible as an expert synthesis: Lilian Weng is the author of Lil'Log, and the X API author metadata identifies her as a co-founder of Thinking Machines Lab, a former VP of AI Safety and robotics/applied research at OpenAI, and the owner of `lilianweng.github.io`. Treat it as a curated scaling-law explainer and literature map, not as new experimental evidence in itself.

## X Announcement Snapshot

The X post says:

> A super long overdue (3+ years?) post on scaling laws.
>
> Compute is expensive. Scaling laws are a way to help us reason about the optimal compute allocation between data and model size before committing to a large run.
>
> The post covers what scaling laws predict, how compute-optimal allocation works, why Kaplan et al. and Chinchilla disagree, and how data limits + fitting details make extrapolation tricky.
>
> https://t.co/HP26eJvjHB

The API response resolves the URL to <https://lilianweng.github.io/posts/2026-06-24-scaling-laws/>. At extraction time the API public metrics were stored in `x_post_lilianweng_2070237256070389897.json`, but engagement counts are unstable and should not be used as technical evidence.

## Blog Snapshot

The article frames scaling laws as empirical power-law relationships among training loss `L`, model size `N`, dataset size `D`, and training compute `C`. The practical value is not the elegance of a log-log line; it is being able to fit smaller runs and extrapolate before paying for a large training run.

A key approximation repeated in the article is the Kaplan-style training compute estimate:

$$
C \approx 6ND,
$$

where `N` is parameter count and `D` is training tokens. The factor 6 comes from roughly `2N` FLOPs per token in the forward pass plus about twice that (`4N`) in the backward pass.
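
As a quick arithmetic check, the estimate can be evaluated directly. This is a minimal sketch: the function name `training_flops` is made up for this note, and the example numbers are the publicly reported Chinchilla configuration (about 70B parameters and 1.4T tokens), not figures taken from the blog snapshot.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-Transformer training compute via C ~= 6 * N * D."""
    forward = 2.0 * n_params * n_tokens  # ~2N FLOPs per token in the forward pass
    backward = 2.0 * forward             # backward pass costs roughly 2x the forward pass
    return forward + backward


# Assumed example: ~70B parameters trained on ~1.4T tokens.
c = training_flops(70e9, 1.4e12)
print(f"{c:.2e} FLOPs")  # 5.88e+23 FLOPs
```

The estimate counts only weight matrix multiplies, so it should be read as an order-of-magnitude budget rather than a measured cost.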

## Compute-Optimal Allocation

The blog contrasts Kaplan et al. 2020 and Chinchilla / Hoffmann et al. 2022:

- Kaplan-style fits suggested that, under a fixed compute budget, model size should grow faster than data, with the larger model stopped well before convergence.
- Chinchilla-style fits found that the compute-optimal frontier scales model size and training tokens at roughly similar rates, making many older large models undertrained.

The Chinchilla-style fixed-compute optimization is summarized as:

$$
N_{\text{opt}}(C), D_{\text{opt}}(C)
= \operatorname*{arg\,min}_{N,\,D \ \text{s.t.}\ \operatorname{FLOPs}(N,D)=C} \hat{L}(N,D).
$$

For the parametric loss model

$$
\hat{L}(N,D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E,
$$

and the constraint `C ≈ 6ND`, the closed-form optimum is:

$$
N_{\text{opt}} =
\left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}
\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
$$

$$
D_{\text{opt}} =
\left(\frac{\beta B}{\alpha A}\right)^{\frac{1}{\alpha+\beta}}
\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}.
$$

The article emphasizes that when `α ≈ β`, both compute exponents approach 1/2, so parameters and data should scale at roughly equal rates with compute.
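
The closed form above can be checked numerically. The sketch below is illustrative only: the function names are invented for this note, and the coefficients are the parametric fit reported by Hoffmann et al. (2022), used here as an assumption because the blog snapshot does not quote fitted values.

```python
def compute_optimal(C, A, B, alpha, beta):
    """Closed-form minimizer of A/N^alpha + B/D^beta + E subject to C = 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    d_opt = (1.0 / G) * (C / 6.0) ** (alpha / (alpha + beta))
    return n_opt, d_opt


def parametric_loss(N, D, A, B, alpha, beta, E):
    """The parametric loss model L_hat(N, D)."""
    return A / N**alpha + B / D**beta + E


# Assumed coefficients (Hoffmann et al. 2022 parametric fit); not from the blog snapshot.
FIT = dict(A=406.4, B=410.7, alpha=0.34, beta=0.28)
E = 1.69

C = 1e21  # hypothetical compute budget in FLOPs
n_opt, d_opt = compute_optimal(C, **FIT)

# The optimum must satisfy the compute constraint ...
assert abs(6 * n_opt * d_opt - C) / C < 1e-9

# ... and moving N along the constraint in either direction must raise the loss.
best = parametric_loss(n_opt, d_opt, E=E, **FIT)
for scale in (0.5, 0.9, 1.1, 2.0):
    n = n_opt * scale
    assert parametric_loss(n, C / (6 * n), E=E, **FIT) > best

print(f"N_opt={n_opt:.3e}  D_opt={d_opt:.3e}  tokens/param={d_opt / n_opt:.1f}")
```

Because the exponents depend only on `alpha` and `beta`, small changes in either fitted value shift the recommended split at large `C`, which is one reason the fitting details discussed later matter.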

## Data-Limited Scaling

The article then moves from the idealized infinite-unique-data regime to finite-data training. It highlights three points that matter for any fixed-budget training idea:

1. The raw token count `D` assumes cleaned, useful data. Two datasets with the same token count can have very different compute efficiency.
2. Repeated data is not equivalent to fresh data. Muennighoff et al. model repeated tokens through an effective data count `D'` whose marginal value decays with repetition.
3. Lovelace et al. add an explicit overfitting penalty that grows with both the repetition count and the capacity ratio `N / U_D`, where `U_D` is unique data.
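
To make the second point concrete, here is a minimal sketch of a saturating effective-data count. The exponential form is the one associated with Muennighoff et al.; it is not quoted in this note, so both the formula and the decay constant `r_star` (set to 15 here) should be treated as assumptions to check against the paper.

```python
import math


def effective_data(unique_tokens: float, repeats: float, r_star: float = 15.0) -> float:
    """Effective token count D' after `repeats` extra passes over `unique_tokens`.

    Assumed form: D' = U_D + U_D * R* * (1 - exp(-R_D / R*)).
    `r_star` is an assumed decay constant, not a value from this note.
    """
    return unique_tokens + unique_tokens * r_star * (1.0 - math.exp(-repeats / r_star))


U = 100e9  # hypothetical unique-token budget
for r in (0, 1, 3, 15, 60):
    raw = U * (1 + r)            # tokens actually processed
    eff = effective_data(U, r)   # tokens' worth of fresh data, under the assumed form
    print(f"repeats={r:>2}  raw={raw:.2e}  effective={eff:.2e}  ratio={eff / raw:.2f}")
```

Under this form the first few repetitions are worth almost as much as fresh data, while the effective count saturates near `U * (1 + r_star)` no matter how many further passes are made.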

The overfitting-penalty form highlighted by the article is:

$$
\hat{L}(N, U_D, R_D)
= E + \frac{A}{N^\alpha}
+ \frac{B}{\left(U_D(1+R_D)\right)^\beta}
+ P \cdot R_D^\delta \cdot \left(\frac{N}{U_D}\right)^\kappa.
$$
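
The behaviour of the penalty term is easier to see when the expression is evaluated. The sketch below transcribes the formula term by term; every coefficient in `COEFFS` is an illustrative placeholder chosen for this note, not a fitted value from Lovelace et al. or the blog.

```python
def data_limited_loss(N, U_D, R_D, *, E, A, alpha, B, beta, P, delta, kappa):
    """Evaluate the overfitting-penalty loss form for model size N,
    unique tokens U_D, and repetition count R_D."""
    capacity_term = A / N**alpha
    data_term = B / (U_D * (1.0 + R_D)) ** beta
    overfit_term = P * R_D**delta * (N / U_D) ** kappa
    return E + capacity_term + data_term + overfit_term


# Placeholder coefficients for illustration only (hypothetical, not fitted).
COEFFS = dict(E=1.7, A=400.0, alpha=0.34, B=400.0, beta=0.28,
              P=0.01, delta=1.0, kappa=0.5)

U_D = 1e10  # hypothetical unique-token count
for N in (1e8, 1e9, 1e10):
    single_pass = data_limited_loss(N, U_D, 0, **COEFFS)
    four_repeats = data_limited_loss(N, U_D, 4, **COEFFS)
    print(f"N={N:.0e}  R_D=0: {single_pass:.3f}  R_D=4: {four_repeats:.3f}")
```

With `R_D = 0` the penalty vanishes and the expression reduces to the parametric form with `D = U_D`. For `R_D > 0` the penalty grows with `N / U_D`, so at some point repeating data on a larger model can cost more than the extra tokens gain.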

The practical message is that fixed compute does not remove the data-quality and data-repetition problem. A model can be compute-matched and still be in the wrong data regime.

## Fitting And Extrapolation Caveats

The blog closes by warning that scaling-law fits are sensitive to details that can look trivial:

- whether parameters include embeddings;
- loss precision and rounding;
- whether the fitting objective sums or averages its robust loss terms;
- which size/data region is included in the fit;
- whether architecture, optimizer, learning-rate schedule, batch ramp, data mix, tokenizer, and tuning quality remain fixed across the runs.

The source uses the Kaplan-vs-Chinchilla disagreement and Besiroglu et al.'s Chinchilla replication discussion to show why small fitting choices can shift large-run recommendations.
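
The first caveat, whether parameters include embeddings, can be illustrated with a rough parameter count. This sketch assumes the common approximation of about `12 * n_layer * d_model^2` non-embedding parameters for a standard Transformer with `d_ff = 4 * d_model`. The two configurations and the vocabulary size are hypothetical examples, not models discussed in the source.

```python
def transformer_params(n_layer: int, d_model: int, vocab_size: int) -> tuple[int, int]:
    """Return (non_embedding, embedding) parameter counts under a rough approximation."""
    non_embedding = 12 * n_layer * d_model**2  # attention + MLP weights, assuming d_ff = 4*d_model
    embedding = vocab_size * d_model           # token embedding table only
    return non_embedding, embedding


# Hypothetical small and large configurations sharing one vocabulary size.
for name, n_layer, d_model in (("small", 6, 512), ("large", 80, 8192)):
    non_emb, emb = transformer_params(n_layer, d_model, vocab_size=50257)
    share = emb / (non_emb + emb)
    print(f"{name}: non-embedding={non_emb:,}  embedding={emb:,}  embedding share={share:.1%}")
```

At small scale the embedding table can be more than half of all parameters, while at large scale it is a fraction of a percent. The choice of `N` therefore changes the small-model points most, and those points often anchor the fitted slope.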

## Local Interpretation Notes

For this knowledge base, the source is most useful as a method-hygiene anchor for compute-budgeted training ideas. It does not by itself prove a new time-series or world-model scaling law. It does, however, state the right experimental discipline:

- define what counts as compute;
- compare designs on IsoFLOP or fixed-budget frontiers rather than one cherry-picked run;
- name the data unit and data-quality regime;
- separate parameter count, data count, tokenization/compression, and repetition;
- report confidence or sensitivity of the fitted optimum before extrapolating.
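
A fixed-budget comparison along these lines might be laid out as follows. This is a sketch under the `C ≈ 6ND` assumption: `isoflop_grid` is a hypothetical helper, and the budget and model sizes are arbitrary examples.

```python
def isoflop_grid(compute_budget: float, model_sizes):
    """Return (N, D, D/N) rows that all cost ~compute_budget under C ~= 6*N*D."""
    rows = []
    for n in model_sizes:
        d = compute_budget / (6.0 * n)  # tokens affordable at this model size
        rows.append((n, d, d / n))
    return rows


# Hypothetical budget and candidate model sizes.
budget = 1e21
for n, d, ratio in isoflop_grid(budget, [1e8, 3e8, 1e9, 3e9, 1e10]):
    print(f"N={n:.1e}  D={d:.2e}  tokens/param={ratio:8.1f}")
```

Each row is one run on the same IsoFLOP curve. Comparing designs by the minimum of the loss across such a curve, rather than by a single row, is the discipline the list describes. For architectures where `6ND` is a poor cost model, the token budget per row would need a measured FLOP count instead.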

## Limitations

- The source is a blog synthesis rather than a paper with new experiments.
- The central evidence comes from language-model scaling, not numeric time series, event streams, graph time series, or action-conditioned world models.
- The `C ≈ 6ND` estimate is a useful dense-Transformer approximation, not a complete cost model for sparse attention, MoE, adaptive tokenization, recurrent depth, hierarchical compression, memory bandwidth, or real latency.
- Scaling-law fits over aggregate loss do not reveal which latent-state capabilities, rare-regime probes, channel-coupling skills, or intervention-window dynamics have emerged.
- Engagement metrics from the X announcement are not technical evidence.
