Scaling Laws, Carefully

Source

Raw Markdown: paper_scaling-laws-carefully-2026.md
Official blog post: Scaling Laws, Carefully
Original X status: https://x.com/i/status/2070237256070389897
Canonical X status: https://x.com/lilianweng/status/2070237256070389897
Local blog snapshots: papers/scaling-laws-carefully-2026/source_blog.html, papers/scaling-laws-carefully-2026/source_blog_article.md
Local X API snapshot: papers/scaling-laws-carefully-2026/x_post_lilianweng_2070237256070389897.json

Status And Credibility

This is a June 24, 2026 Lil’Log technical blog article announced on X by Lilian Weng on June 25, 2026. It is not a peer-reviewed paper and does not introduce new experiments, but it is credible as an expert synthesis and literature map: the X API metadata identifies the author as Lilian Weng, co-founder of Thinking Machines Lab, former VP of AI Safety and robotics/applied research at OpenAI, and author of Lil’Log. Treat the source as a careful explainer and method-hygiene anchor for scaling-law experiments rather than as standalone SOTA evidence.

Core Claim

Scaling laws are useful because they turn expensive training decisions into a constrained allocation problem. Under a dense-Transformer approximation, training compute is often modeled as:

$C \approx 6 N D,$

where $N$ is model size and $D$ is training data, usually token count. The point of a scaling law is to fit small runs, estimate how loss changes with $N$, $D$, and $C$, and choose the compute-optimal allocation before paying for a large run.
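As a small worked example of the $C \approx 6 N D$ approximation, the sketch below computes a budget from a model size and token count, then shows how the token count must change when the model size is halved at the same budget. The function names and the numbers are illustrative assumptions, not values from the article.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-Transformer training compute, C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Token count D that exhausts a FLOPs budget C at model size N."""
    return flops_budget / (6.0 * n_params)


# Illustrative numbers only (hypothetical, not from the source).
n_params = 1e9     # N: one billion parameters
n_tokens = 20e9    # D: twenty billion training tokens
budget = training_flops(n_params, n_tokens)

# Halving N at fixed C doubles the tokens the budget can pay for.
tokens_half_model = tokens_for_budget(budget, n_params / 2)
```

Because the constraint is a product, any fixed budget traces a hyperbola in $(N, D)$; the scaling-law fit is what picks a point on it.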

The article’s most important practical warning is that the fitted optimum is fragile. Kaplan-style and Chinchilla-style recommendations disagree not because compute-optimality is meaningless, but because parameter counting, fit region, scale range, embedding parameters, data assumptions, and fitting details can shift the inferred frontier.
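One way to see this fragility is a toy example: generate losses from a known law with an irreducible term, then fit a pure power law on different scale ranges. The sketch below is an illustration under assumed values (`A`, `ALPHA`, `E`, and the size grid are hypothetical), not a reproduction of any fit discussed in the article.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Hypothetical ground truth, chosen for illustration only.
A, ALPHA, E = 400.0, 0.3, 1.7
sizes = [10.0 ** (7.0 + 0.5 * k) for k in range(9)]   # 1e7 .. 1e11 parameters
losses = [A / n**ALPHA + E for n in sizes]

# Naive fits that ignore the irreducible loss E, on two overlapping fit regions.
small_scale_exponent = -loglog_slope(sizes[:5], losses[:5])
large_scale_exponent = -loglog_slope(sizes[4:], losses[4:])

# Fitting the reducible loss (L - E) recovers the true exponent.
reducible_exponent = -loglog_slope(sizes, [l - E for l in losses])
```

Both naive exponents come out below the true `ALPHA`, and the large-scale fit is the flatter of the two, so the same data supports different extrapolations depending on the fit region and on whether the irreducible term is modeled.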

Key Contributions

Reconstructs the path from early learning-curve power laws through Rosenfeld-style joint loss models, Kaplan et al. 2020, Chinchilla / Hoffmann et al. 2022, data-constrained scaling, and Chinchilla replication/fitting caveats.
Explains the Chinchilla fixed-compute objective:
$N_{opt} (C), D_{opt} (C) = \arg\min_{N, D \,:\, \mathrm{FLOPs} (N, D) = C} \hat{L} (N, D) .$
Shows how the parametric fit
$\hat{L} (N, D) = \frac{A}{N ^{α}} + \frac{B}{D ^{β}} + E$
yields closed-form optima under $C \approx 6 N D$, and why $α \approx β$ implies roughly equal scaling of parameters and data.
Separates the infinite-unique-data regime from data-limited training where repeated tokens have discounted value and larger models can overfit faster.
Highlights the Lovelace et al. 2026 data-repetition penalty term, where loss grows with both repetition count and the capacity ratio $N / U_{D}$.
Emphasizes that scaling-law fits assume the only changing factor is scale; architecture, optimizer, learning-rate schedule, batch ramp, data mix, tokenizer, and tuning quality should not silently change across the sweep.
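The closed-form optimum mentioned above can be sketched as follows. Substituting $D = C / (6N)$ into $\hat{L}$ and setting the derivative with respect to $N$ to zero gives $N_{opt} = G \, (C/6)^{β/(α+β)}$ and $D_{opt} = G^{-1} (C/6)^{α/(α+β)}$ with $G = (αA / βB)^{1/(α+β)}$. The coefficient values below are placeholders chosen so that $α = β$ and $A = B$; they are assumptions for illustration, not fitted values from any paper.

```python
import math


def chinchilla_optimum(flops_budget, A, B, alpha, beta):
    """Closed-form minimizer of A/N^alpha + B/D^beta + E subject to 6*N*D = C."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = G * (flops_budget / 6.0) ** (beta / (alpha + beta))
    d_opt = (flops_budget / 6.0) ** (alpha / (alpha + beta)) / G
    return n_opt, d_opt


def fitted_loss(n_params, n_tokens, A, B, alpha, beta, E):
    """Parametric loss model L(N, D) = A/N^alpha + B/D^beta + E."""
    return A / n_params**alpha + B / n_tokens**beta + E


# Placeholder coefficients (hypothetical): alpha == beta and A == B.
coeffs = dict(A=400.0, B=400.0, alpha=0.3, beta=0.3)
budget = 6e20

n_opt, d_opt = chinchilla_optimum(budget, **coeffs)

# With alpha == beta, a 100x larger budget scales N and D by 10x each.
n_big, d_big = chinchilla_optimum(100.0 * budget, **coeffs)
```

With equal exponents the budget splits evenly in log space, so $N_{opt}$ and $D_{opt}$ both grow as $\sqrt{C}$; unequal exponents tilt the split toward whichever term decays more slowly.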

Why It Matters For This Wiki

The source upgrades fixed-FLOPs discussions from a slogan to an experimental contract. A fixed-budget hierarchy, router, patcher, MoE, recurrent-depth policy, or context-compression method should not report only a single matched-compute comparison. It should map an IsoFLOP or budget frontier and show where the new allocation variable moves the optimum.
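A minimal sketch of mapping an IsoFLOP frontier, assuming a synthetic loss in place of real training runs: for each budget, sweep model size on a log grid, set $D = C / (6N)$, and locate the loss minimum with a parabola through the best grid point and its two neighbors. The loss coefficients, grid, and budgets are hypothetical, and the three-point parabola is a simplification of fitting a curve to noisy runs.

```python
import math


def synthetic_loss(n_params, n_tokens, A=400.0, B=400.0, alpha=0.3, beta=0.3, E=1.7):
    """Stand-in for a measured final loss; coefficients are placeholders."""
    return A / n_params**alpha + B / n_tokens**beta + E


def isoflop_optimum(flops_budget, log10_n_grid, loss_fn):
    """Estimate the loss-minimizing model size on one IsoFLOP curve.

    Assumes log10_n_grid is equally spaced with at least three points.
    """
    losses = []
    for x in log10_n_grid:
        n_params = 10.0 ** x
        n_tokens = flops_budget / (6.0 * n_params)   # spend the whole budget
        losses.append(loss_fn(n_params, n_tokens))
    i = min(range(len(losses)), key=losses.__getitem__)
    i = min(max(i, 1), len(losses) - 2)              # keep a neighbor on each side
    x0, x1, _ = log10_n_grid[i - 1:i + 2]
    y0, y1, y2 = losses[i - 1:i + 2]
    h = x1 - x0
    # Vertex of the parabola through three equally spaced points.
    vertex = x1 - 0.5 * h * (y2 - y0) / (y0 - 2.0 * y1 + y2)
    return 10.0 ** vertex


log10_n_grid = [8.0 + 0.25 * k for k in range(13)]   # 1e8 .. 1e11 parameters
budgets = [6e18, 6e19, 6e20]
n_opts = [isoflop_optimum(c, log10_n_grid, synthetic_loss) for c in budgets]

# Endpoint slope of log N_opt against log C: the frontier exponent.
frontier_exponent = (math.log10(n_opts[-1]) - math.log10(n_opts[0])) / (
    math.log10(budgets[-1]) - math.log10(budgets[0])
)
```

For a method with an extra allocation variable (for example a compression rate), the same loop would be repeated per setting of that variable, and the comparison would be between frontiers rather than between single points.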

For time-series and world-model work, the direct transfer is methodological rather than empirical. The article is about language-model scaling, not numeric time series. But its warnings become even sharper for multivariate time series because the effective data unit is not just tokens: it can be samples, channel-time cells, irregular events, graph snapshots, intervention windows, or compressed latent states. A time-series scaling claim must name which unit is varied and which dense numeric, rare-regime, or action-conditioned information must be preserved.

Impact On Fixed-FLOPs Hierarchical Training

Hierarchical Modeling with a Fixed FLOPs Budget should treat this source as a calibration checklist:

Fixed FLOPs is a constraint, not a result. The budget controller needs an explicit cost model analogous to $C \approx 6 N D$, but adapted to active tokens, channel groups, width, recurrent loops, MoE experts, return-path decoding, and memory or latency overhead.
Adaptive compression adds a new scaling variable. The experiment should compare not only model size and data size, but also expected active representation count or compression rate. This echoes Compute Optimal Tokenization, but the unit for time series should be samples, channel-time cells, events, or task-grounded information units rather than text tokens.
IsoFLOP curves matter. A learned router should be evaluated against uniform, hand-scheduled, and static bowtie baselines across several budgets, not only at a single chosen budget $B$.
Data limits can reverse the optimum. Temporal corpora that are poor in useful signal contain many repeated normal-state windows. A fixed-FLOPs sampler or router must distinguish fresh informative windows from repeated redundant windows; otherwise it may optimize compute around the wrong effective data count.
Fit sensitivity must be reported. If small TSFM sweeps are extrapolated to larger models, the report should include fit-region, loss-noise, rounding, optimizer, data-mix, and target-construction sensitivity.
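For the data-limits item in the checklist above, one candidate starting point is an exponential-saturation discount of the kind used in data-constrained language-model scaling work (Muennighoff et al., 2023), where each additional pass over the same data is worth less than the previous one. The sketch below is an assumption-laden illustration: the function name is hypothetical, `r_star` is a placeholder rather than a fitted constant, and this is not the Lovelace et al. penalty term, whose exact form is not given in this note. Whether the same form transfers to repeated time-series windows is an open question.

```python
import math


def effective_data(unique_units: float, total_units: float, r_star: float = 15.0) -> float:
    """Unique-data equivalent of training on repeated data.

    unique_units: size of the unique pool (tokens, windows, events, ...).
    total_units:  units actually consumed, counting repeats.
    r_star:       placeholder decay constant; larger means repeats stay useful longer.
    """
    if unique_units <= 0 or total_units < unique_units:
        raise ValueError("need 0 < unique_units <= total_units")
    repeats = total_units / unique_units - 1.0       # passes beyond the first
    return unique_units * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


# Hypothetical corpus: 100 unique windows consumed for four passes.
first_pass_only = effective_data(100.0, 100.0)
four_passes = effective_data(100.0, 400.0)
```

The discounted count never exceeds the raw count and saturates at `unique_units * (1 + r_star)`, so a budget controller that plugs it in for $D$ stops crediting further repetition once the pool is exhausted.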

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute allocation	adjacent	Gives the fixed-compute optimization frame and explains why compute allocation should be inferred from fitted frontiers rather than hand intuition.	Needs numeric time-series sweeps where adaptive routers, patchers, channel hierarchies, or recurrent-depth policies are included as scaling variables.
Scaling and efficiency	warning	Shows that aggregate loss scaling can be sensitive to parameter counting, fit region, precision, data regime, and assumptions about what changes across runs.	TSFM scaling reports need confidence intervals, fit sensitivity, data-quality accounting, and realized latency/memory checks.
Data diversity and curriculum	warning	Data-constrained sections show repeated data is not equivalent to unique data and that capacity ratio interacts with repetition.	Need TSFM-specific effective-data units for repeated normal windows, rare regimes, event streams, and intervention windows.
Adaptive tokenization and patching	adjacent	Reinforces that compression/tokenization changes the unit in a scaling law.	Need non-text units and preservation probes for spikes, change points, dense numeric detail, channel coupling, and action history.
Benchmark hygiene	warning	A smooth aggregate loss frontier does not prove which capabilities emerged, and fit details can move the frontier.	Pair scaling curves with capability probes for rare regimes, context use, channel coupling, event parsing, and action-conditioned rollout.

Limitations

This is a blog synthesis, not a paper with new experiments or a peer-reviewed result.
Most evidence is language-model scaling; transfer to time-series foundation models is methodological.
$C \approx 6 N D$ is a dense-Transformer estimate. It omits sparse kernels, MoE routing, adaptive token counts, recurrent depth, memory bandwidth, batching, and latency.
The source is about loss scaling, not latent-state capability emergence, rare-regime preservation, or action-conditioned world-model utility.

Links Into The Wiki

Open Questions

What is the right non-text analogue of $D$: samples, channel-time cells, compressed bits, events, trajectories, useful-signal windows, or intervention-labeled transitions?
Can a fixed-FLOPs hierarchy fit a scaling law over active representation count, model size, data size, and preservation loss?
How should repeated normal-state windows be discounted when training on observability or industrial telemetry?
Which capability probes should accompany aggregate loss when comparing IsoFLOP curves for latent-state time-series models?
How should theoretical FLOPs be calibrated to wall-clock latency and memory bandwidth for adaptive routing and variable-shape hierarchies?
