Scaling Laws, Carefully

Source

Status And Credibility

This is a June 24, 2026 Lil’Log technical blog article announced on X by Lilian Weng on June 25, 2026. It is not a peer-reviewed paper and does not introduce new experiments, but it is credible as an expert synthesis and literature map: the X API metadata identifies the author as Lilian Weng, co-founder of Thinking Machines Lab, former VP of AI Safety and robotics/applied research at OpenAI, and author of Lil’Log. Treat the source as a careful explainer and method-hygiene anchor for scaling-law experiments rather than as standalone SOTA evidence.

Core Claim

Scaling laws are useful because they turn expensive training decisions into a constrained allocation problem. Under a dense-Transformer approximation, training compute is often modeled as:

$$ C \approx 6ND $$

where $N$ is model size and $D$ is training data, usually token count. The point of a scaling law is to fit small runs, estimate how loss changes with $N$, $D$, and $C$, and choose the compute-optimal allocation before paying for a large run.
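
As a rough illustration of the compute model above, the following sketch converts between a FLOP budget and an (N, D) pair. The function names and the tokens-per-parameter ratio of 20 are assumptions chosen for the example, not values taken from the article.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-Transformer training compute, C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def allocate(compute: float, tokens_per_param: float) -> tuple[float, float]:
    """Split a FLOP budget into (N, D) for an assumed D/N ratio.

    With D = r * N the budget is C = 6 * r * N**2, so N = sqrt(C / (6 * r)).
    """
    n_params = (compute / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params


# Hypothetical run: 1e9 parameters trained on 2e10 tokens (about 1.2e20 FLOPs).
c = training_flops(1e9, 2e10)

# Inverting the same budget with an assumed ratio of 20 tokens per parameter.
n, d = allocate(c, tokens_per_param=20.0)
print(f"C = {c:.3e} FLOPs, N = {n:.3e}, D = {d:.3e}")
```

The ratio passed to `allocate` is exactly the quantity a scaling-law fit is meant to estimate, so hard-coding it here is only a placeholder.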

The article’s most important practical warning is that the fitted optimum is fragile. Kaplan-style and Chinchilla-style recommendations disagree not because compute-optimality is meaningless, but because parameter counting, fit region, scale range, embedding parameters, data assumptions, and fitting details can shift the inferred frontier.
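
One way to see why parameter counting matters is to compare embedding and non-embedding counts across model sizes. The sketch below uses the common approximation of about 12 · n_layer · d_model² non-embedding weights (assuming a feed-forward width of 4 · d_model and no biases); the configurations and vocabulary size are hypothetical and chosen only to show the trend.

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    """Approximate attention + MLP weights: 12 * n_layer * d_model**2.

    Assumes d_ff = 4 * d_model and ignores biases and layer norms.
    """
    return 12 * n_layer * d_model ** 2


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Token-embedding table only (positional embeddings ignored)."""
    return vocab_size * d_model


# Hypothetical configurations as (n_layer, d_model) and a hypothetical vocabulary.
configs = [(6, 512), (12, 768), (24, 2048), (48, 6144)]
vocab = 50_000

shares = []
for n_layer, d_model in configs:
    n_ne = non_embedding_params(n_layer, d_model)
    n_emb = embedding_params(vocab, d_model)
    share = n_emb / (n_ne + n_emb)  # fraction of all weights in the embedding
    shares.append(share)
    print(f"layers={n_layer:2d} d_model={d_model:5d} "
          f"non-emb={n_ne:.2e} emb share={share:.1%}")
```

Under these assumptions the embedding table is more than half of the smallest model and under 2% of the largest, so the choice of whether to count it changes N most in the small-run regime where fits are usually made.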

Key Contributions

  • Reconstructs the path from early learning-curve power laws through Rosenfeld-style joint loss models, Kaplan et al. 2020, Chinchilla / Hoffmann et al. 2022, data-constrained scaling, and Chinchilla replication/fitting caveats.

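  • Explains the Chinchilla fixed-compute objective:

    $$ N_{\text{opt}}(C),\ D_{\text{opt}}(C) = \operatorname*{arg\,min}_{N,\,D \,:\, \text{FLOPs}(N, D) = C} L(N, D) $$

  • Shows how the parametric fit

    $$ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

    yields closed-form optima under $C \approx 6ND$, and why $\alpha \approx \beta$ implies roughly equal scaling of parameters and data.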

  • Separates the infinite-unique-data regime from data-limited training where repeated tokens have discounted value and larger models can overfit faster.

  • Highlights the Lovelace et al. 2026 data-repetition penalty term, where loss grows with both repetition count and the capacity ratio.

  • Emphasizes that scaling-law fits assume the only changing factor is scale; architecture, optimizer, learning-rate schedule, batch ramp, data mix, tokenizer, and tuning quality should not silently change across the sweep.
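
The closed-form optimum mentioned in the list above can be sketched as follows, assuming the parametric form L(N, D) = E + A/N^α + B/D^β and the constraint C = 6ND. Substituting D = C/(6N) and setting the derivative with respect to N to zero gives N_opt = G · (C/6)^(β/(α+β)) and D_opt = (C/6)^(α/(α+β)) / G, with G = (αA / (βB))^(1/(α+β)). The constants below are placeholders in the range reported for the Chinchilla fit and should be read as assumptions, not as the article's numbers.

```python
import numpy as np


def loss(n, d, E, A, B, alpha, beta):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n ** alpha + B / d ** beta


def closed_form_optimum(c, A, B, alpha, beta):
    """Minimise the parametric loss subject to 6 * N * D = c."""
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (c / 6.0) ** (beta / (alpha + beta))
    d_opt = (c / 6.0) ** (alpha / (alpha + beta)) / g
    return n_opt, d_opt


# Placeholder constants (assumed for illustration).
params = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
c = 1e21
n_opt, d_opt = closed_form_optimum(
    c, params["A"], params["B"], params["alpha"], params["beta"]
)

# Numerical cross-check: scan N along the constraint D = c / (6 N).
n_grid = np.logspace(7, 12, 20001)
d_grid = c / (6.0 * n_grid)
n_num = n_grid[np.argmin(loss(n_grid, d_grid, **params))]
print(f"closed form N = {n_opt:.3e}, grid search N = {n_num:.3e}")

# With alpha == beta, both exponents equal 0.5: 100x compute gives 10x N and 10x D.
n1, d1 = closed_form_optimum(1e20, A=400.0, B=400.0, alpha=0.3, beta=0.3)
n2, d2 = closed_form_optimum(1e22, A=400.0, B=400.0, alpha=0.3, beta=0.3)
print(f"N ratio = {n2 / n1:.3f}, D ratio = {d2 / d1:.3f}")
```

When α and β differ, the exponents β/(α+β) and α/(α+β) differ too, which is one way small changes in the fitted exponents move the recommended allocation.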

Why It Matters For This Wiki

The source upgrades fixed-FLOPs discussions from a slogan to an experimental contract. A fixed-budget hierarchy, router, patcher, MoE, recurrent-depth policy, or context-compression method should report more than a single matched-compute comparison: it should map an IsoFLOP or budget frontier and show where the new allocation variable moves the optimum.
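
A minimal sketch of mapping an IsoFLOP frontier, assuming the usual procedure of fitting a parabola to loss versus log N at each fixed budget and reading off the vertex. The losses here come from a synthetic stand-in function with made-up constants, not from real runs, and the grid of model sizes around a guessed ratio of 20 tokens per parameter is also an assumption.

```python
import numpy as np


def isoflop_minimum(n_params, losses):
    """Fit loss = a*x**2 + b*x + c with x = log10(N); return N at the vertex."""
    x = np.log10(np.asarray(n_params, dtype=float))
    a, b, _ = np.polyfit(x, np.asarray(losses, dtype=float), deg=2)
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return 10.0 ** (-b / (2.0 * a))


def synthetic_loss(n, c):
    """Synthetic stand-in for measured final losses at budget c (assumed constants)."""
    d = c / (6.0 * n)
    return 1.7 + 400.0 / n ** 0.34 + 400.0 / d ** 0.28


budgets = [1e19, 1e20, 1e21]
estimates = {}
for c in budgets:
    center = (c / 120.0) ** 0.5  # N at an assumed 20 tokens per parameter
    n_runs = center * np.array([0.25, 0.5, 1.0, 2.0, 4.0])
    estimates[c] = isoflop_minimum(n_runs, synthetic_loss(n_runs, c))
    print(f"C = {c:.0e}: estimated N_opt = {estimates[c]:.3e}")

# Slope of log N_opt against log C is the fitted compute exponent for N.
slope = np.polyfit(np.log10(budgets), np.log10([estimates[c] for c in budgets]), 1)[0]
print(f"fitted exponent for N_opt ~ C**a: a = {slope:.3f}")
```

For a method with an extra allocation variable, such as a compression rate, the same loop would be repeated per setting of that variable so the shift in the vertex becomes visible.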

For time-series and world-model work, the direct transfer is methodological rather than empirical. The article is about language-model scaling, not numeric time series. But its warnings become even sharper for multivariate time series because the effective data unit is not just tokens: it can be samples, channel-time cells, irregular events, graph snapshots, intervention windows, or compressed latent states. A time-series scaling claim must name which unit is varied and which dense numeric, rare-regime, or action-conditioned information must be preserved.

Impact On Fixed-FLOPs Hierarchical Training

Hierarchical Modeling with a Fixed FLOPs Budget should treat this source as a calibration checklist:

  1. Fixed FLOPs is a constraint, not a result. The budget controller needs an explicit cost model analogous to $C \approx 6ND$, but adapted to active tokens, channel groups, width, recurrent loops, MoE experts, return-path decoding, and memory or latency overhead.
  2. Adaptive compression adds a new scaling variable. The experiment should compare not only model size and data size, but also expected active representation count or compression rate. This echoes Compute Optimal Tokenization, but the unit for time series should be samples, channel-time cells, events, or task-grounded information units rather than text tokens.
  3. IsoFLOP curves matter. A learned router should be evaluated against uniform, hand-scheduled, and static bowtie baselines across several budgets, not only at one chosen budget $C$.
  4. Data limits can reverse the optimum. Temporal corpora that are poor in useful signal often contain many repeated normal-state windows. A fixed-FLOPs sampler or router must distinguish fresh informative windows from repeated redundant ones; otherwise it may optimize compute around the wrong effective data count.
  5. Fit sensitivity must be reported. If small TSFM sweeps are extrapolated to larger models, the report should include fit-region, loss-noise, rounding, optimizer, data-mix, and target-construction sensitivity.
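
To make the effective-data point in item 4 concrete, here is a sketch of an exponential-saturation discount of the kind used in data-constrained scaling work, D' = U + U · R* · (1 − exp(−R / R*)), where U is the unique data count and R the number of repeats beyond the first pass. The function name and the default R* of 15 are assumptions for illustration; this is not the Lovelace et al. penalty term, and the right constant for windows or events would have to be fitted.

```python
import math


def effective_data(unique_units: float, epochs: float, r_star: float = 15.0) -> float:
    """Discounted data count under repetition.

    unique_units: unique tokens, windows, or events (the unit is an assumption).
    epochs: total passes over the unique data, must be >= 1.
    r_star: assumed saturation constant; larger means repeats stay useful longer.
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    repeats = epochs - 1.0
    return unique_units * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


unique = 1e9
for epochs in (1, 4, 16, 64):
    seen = unique * epochs
    eff = effective_data(unique, epochs)
    print(f"epochs={epochs:3d} raw={seen:.2e} effective={eff:.2e} "
          f"ratio={eff / seen:.2f}")
```

Under this form the effective count never exceeds U · (1 + R*), so a budget controller that plans around the raw count will overestimate D more and more as repetition grows.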

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute allocation | adjacent | Gives the fixed-compute optimization frame and explains why compute allocation should be inferred from fitted frontiers rather than hand intuition. | Needs numeric time-series sweeps where adaptive routers, patchers, channel hierarchies, or recurrent-depth policies are included as scaling variables. |
| Scaling and efficiency | warning | Shows that aggregate loss scaling can be sensitive to parameter counting, fit region, precision, data regime, and assumptions about what changes across runs. | TSFM scaling reports need confidence intervals, fit sensitivity, data-quality accounting, and realized latency/memory checks. |
| Data diversity and curriculum | warning | Data-constrained sections show repeated data is not equivalent to unique data and that capacity ratio interacts with repetition. | Need TSFM-specific effective-data units for repeated normal windows, rare regimes, event streams, and intervention windows. |
| Adaptive tokenization and patching | adjacent | Reinforces that compression/tokenization changes the unit in a scaling law. | Need non-text units and preservation probes for spikes, change points, dense numeric detail, channel coupling, and action history. |
| Benchmark hygiene | warning | A smooth aggregate loss frontier does not prove which capabilities emerged, and fit details can move the frontier. | Pair scaling curves with capability probes for rare regimes, context use, channel coupling, event parsing, and action-conditioned rollout. |
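
The fit-sensitivity requirement in the scaling-and-efficiency row can be illustrated with a small check of how the fitted exponent depends on the fit region. The sketch below generates synthetic losses with an irreducible term (all constants are made up) and fits a pure power law over different compute windows; it is an assumed diagnostic, not a procedure taken from the article.

```python
import numpy as np


def fit_exponent(compute, loss):
    """Least-squares slope of log(loss) against log(compute).

    Returns gamma for the pure power-law model loss ~ compute**(-gamma).
    """
    slope, _ = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return -slope


# Synthetic sweep: the true reducible exponent is 0.15, plus an irreducible 1.7.
compute = np.logspace(17, 22, 11)
loss = 1.7 + 2.0e3 * compute ** -0.15

# Hypothetical fit regions over the same sweep.
windows = {"low": slice(0, 6), "high": slice(5, 11), "all": slice(None)}
exponents = {
    name: fit_exponent(compute[s], loss[s]) for name, s in windows.items()
}
for name, gamma in exponents.items():
    print(f"fit region {name:>4}: gamma = {gamma:.3f}")
```

Because the pure power law ignores the irreducible term, every window underestimates the true exponent and the estimate drifts with the chosen region, which is the kind of spread a TSFM scaling report would need to disclose alongside its extrapolation.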

Limitations

  • This is a blog synthesis, not a paper with new experiments or a peer-reviewed result.
  • Most evidence is language-model scaling; transfer to time-series foundation models is methodological.
  • $C \approx 6ND$ is a dense-Transformer estimate. It omits sparse kernels, MoE routing, adaptive token counts, recurrent depth, memory bandwidth, batching, and latency.
  • The source is about loss scaling, not latent-state capability emergence, rare-regime preservation, or action-conditioned world-model utility.

Open Questions

  • What is the right non-text analogue of D: samples, channel-time cells, compressed bits, events, trajectories, useful-signal windows, or intervention-labeled transitions?
  • Can a fixed-FLOPs hierarchy fit a scaling law over active representation count, model size, data size, and preservation loss?
  • How should repeated normal-state windows be discounted when training on observability or industrial telemetry?
  • Which capability probes should accompany aggregate loss when comparing IsoFLOP curves for latent-state time-series models?
  • How should theoretical FLOPs be calibrated to wall-clock latency and memory bandwidth for adaptive routing and variable-shape hierarchies?