LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Source

Status And Credibility

arXiv lists the paper as cs.LG, version v1, submitted on 2026-05-22. The rendered PDF first page gives a paper date of 2026-05-25 and affiliations with ByteDance Seed, University of Virginia, and University of California, Berkeley. The ICML virtual site lists the paper as an ICML 2026 poster, which makes this stronger than ordinary preprint-only evidence.

Credibility is high enough for an important ingest because the source has a tier-1 venue page, a ByteDance Seed / UVA / UC Berkeley author team, a public arXiv source, and experiments over Pythia and OLMo2 checkpoint grids. Caveats: the paper is still a scaling-law fit and extrapolation study rather than independently reproduced infrastructure; official code and released model checkpoints were not found at ingest time.

The Gonzo post and ArXivIQ review list Code: N/A and Model: N/A; artifact search at ingest time found no verified official repository or checkpoint release.

Core Claim

The paper argues that classical monotonic LLM scaling laws miss regimes where adding model size, training tokens, or post-training pressure makes downstream loss worse. It proposes the Shannon Scaling Law, treating LLM training as information transmission over a noisy channel: model parameters act like channel bandwidth, training tokens act like signal power, and data/model/architecture perturbations act like noise.

The central capacity form is a function of model size and training tokens with three noise terms: one models data-model interaction noise, one models data-induced accumulated noise, and one is an irreducible noise floor. The paper then links loss inversely to capacity.
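For orientation, the classical Shannon-Hartley theorem that the analogy borrows from is written out here. This is the textbook channel-capacity formula, not the paper's fitted law; the symbols $B$, $S$, and $N_{\text{noise}}$ are the textbook ones, and the mapping in the comments only restates the analogy described above.

```latex
% Classical Shannon-Hartley capacity of a band-limited Gaussian channel.
%   B                : bandwidth     (analogy: model parameters)
%   S                : signal power  (analogy: training tokens)
%   N_{\text{noise}} : noise power   (analogy: data/model/architecture perturbations)
C = B \log_2\!\left(1 + \frac{S}{N_{\text{noise}}}\right)

% High-SNR regime, S / N_{\text{noise}} \gg 1: capacity grows monotonically in B and S.
C \approx B \log_2\!\left(\frac{S}{N_{\text{noise}}}\right)
```

In the textbook formula the noise power is a constant. The paper's version differs in that its noise terms themselves depend on model size and training tokens, which is what allows capacity to fall as either one grows.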

The useful conceptual move goes beyond the equation: the familiar monotonic scaling regime becomes a high-SNR special case, while catastrophic overtraining, aggressive supervised fine-tuning, low-bit quantization, and injected Gaussian noise expose low-SNR U-shaped loss basins.

Evidence

The paper fits and compares scaling-law forms on Pythia-dedup and OLMo2 checkpoints. Evaluation uses WikiText-2 loss after three perturbation families:

  • additive Gaussian weight noise with controlled SNR;
  • full supervised fine-tuning on GSM8K, SiQA, and StarCoder-Python with varying learning rates;
  • GPTQ post-training quantization to 4-bit, 3-bit, and 2-bit.
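Two of these perturbations can be read directly as noise on the weights. The sketch below is a minimal illustration under stated assumptions: SNR is taken per tensor as mean squared weight over mean squared perturbation, expressed in dB, and the quantizer is plain symmetric round-to-nearest, used as a simple stand-in rather than GPTQ. The function names and the random weight matrix are hypothetical and do not come from the paper.

```python
import numpy as np

def add_gaussian_noise(weights, target_snr_db, rng):
    """Add zero-mean Gaussian noise so signal power / noise power matches target_snr_db."""
    signal_power = np.mean(weights ** 2)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    return weights + rng.normal(0.0, np.sqrt(noise_power), size=weights.shape)

def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed stand-in, not GPTQ)."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 positive levels at 4-bit
    scale = np.max(np.abs(weights)) / levels
    return np.clip(np.round(weights / scale), -levels, levels) * scale

def measure_snr_db(clean, perturbed):
    """Empirical SNR in dB, treating (perturbed - clean) as the noise."""
    noise_power = np.mean((perturbed - clean) ** 2)
    return 10.0 * np.log10(np.mean(clean ** 2) / noise_power)

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(512, 512))  # hypothetical weight matrix

# Gaussian noise: the measured SNR should track the requested SNR.
noise_snrs = {t: measure_snr_db(weights, add_gaussian_noise(weights, t, rng))
              for t in (30.0, 10.0, 0.0)}

# Quantization: fewer bits means a coarser grid, hence a lower effective SNR.
quant_snrs = {b: measure_snr_db(weights, quantize_rtn(weights, b)) for b in (4, 3, 2)}

for target, measured in noise_snrs.items():
    print(f"Gaussian noise, target {target:5.1f} dB -> measured {measured:6.2f} dB")
for bits, measured in quant_snrs.items():
    print(f"{bits}-bit round-to-nearest -> effective SNR {measured:6.2f} dB")
```

Expressing quantization error as an effective SNR puts both perturbations on one axis, which is the framing the noisy-channel view relies on. Supervised fine-tuning does not reduce to a closed-form noise level this way and is left out of the sketch.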

Key reported numbers:

| Setting | Reported result for Shannon Scaling Law | Why it matters |
| --- | --- | --- |
| Gaussian-noise fitting | Pythia average …; OLMo2 average … | Robust across high- and low-SNR regimes. |
| SFT fitting | Average … on GSM8K, … on SiQA, … on StarCoder | Captures U-shaped loss basins where monotonic laws can collapse. |
| Quantization fitting | Pythia average …; OLMo2 average … | Models degradation down to 2-bit GPTQ better than standard laws. |
| Unperturbed pretraining | … on Pythia and … on OLMo2 | Treats ordinary pretraining as the high-SNR special case. |
| Joint extrapolation | Fitted on … B Pythia models and … B tokens; predicts held-out 12B up to 307B tokens at pooled … | Shows nontrivial extrapolation, though only about … beyond the fit grid. |
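The remark that monotonic laws can collapse on U-shaped basins can be seen with a toy fit. The sketch below fits a single power law by least squares in log-log space to two synthetic curves and reports the coefficient of determination. Both curves are invented for illustration and are not data from the paper, and using R^2 in log space as the goodness-of-fit score is an assumption of this sketch.

```python
import numpy as np

def fit_power_law(tokens, losses):
    """Fit log L = intercept + slope * log D by least squares; return (slope, intercept, r2)."""
    x, y = np.log(tokens), np.log(losses)
    slope, intercept = np.polyfit(x, y, 1)        # highest-degree coefficient first
    residual = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2

tokens = np.logspace(8, 13, 26)                   # hypothetical token grid
log10_d = np.log10(tokens)

# Synthetic curves, invented for illustration only.
loss_monotonic = 10.0 ** (2.0 - 0.1 * log10_d)                # pure power law
loss_u_shaped = 10.0 ** (0.5 + 0.05 * (log10_d - 10.5) ** 2)  # basin centred mid-grid

_, _, r2_monotonic = fit_power_law(tokens, loss_monotonic)
_, _, r2_u_shaped = fit_power_law(tokens, loss_u_shaped)

print(f"power-law fit on monotonic curve: R^2 = {r2_monotonic:.3f}")
print(f"power-law fit on U-shaped curve:  R^2 = {r2_u_shaped:.3f}")
```

A single power law has no way to turn around, so on a basin centred in the grid the best straight line in log-log space is nearly flat and explains almost none of the variance.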

The exponent analysis is the main interpretability result. In high-SNR regimes, the fitted bandwidth exponent can exceed the model-noise exponent, so scaling model size helps. In low-SNR regimes, the model-noise exponent can exceed the bandwidth exponent, so larger models amplify perturbation. The paper also argues that the data-noise exponent can exceed the signal exponent, making token-axis U-shaped degradation an intrinsic limiting behavior under enough accumulated noise.
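The token-axis argument can be made concrete with a toy law. The sketch below is not the paper's fitted equation: the functional form, every exponent, and every constant are hypothetical, chosen only so that signal grows as a power of tokens while one noise term grows as a different power of tokens. Loss is taken as the reciprocal of capacity, as described above.

```python
import numpy as np

def toy_capacity(n_params, n_tokens, bandwidth_exp=0.3, signal_exp=0.3,
                 model_noise_exp=0.1, data_noise_exp=0.6,
                 k_model=1e-3, k_data=1e-6, noise_floor=1.0):
    """Hypothetical Shannon-style capacity: bandwidth * log2(1 + signal / noise)."""
    bandwidth = n_params ** bandwidth_exp
    signal = n_tokens ** signal_exp
    noise = (noise_floor
             + k_model * n_params ** model_noise_exp   # model-dependent noise
             + k_data * n_tokens ** data_noise_exp)    # token-accumulated noise
    return bandwidth * np.log2(1.0 + signal / noise)

def toy_loss(n_params, n_tokens, **kwargs):
    """Loss modelled as the reciprocal of capacity."""
    return 1.0 / toy_capacity(n_params, n_tokens, **kwargs)

tokens = np.logspace(8, 13, 51)   # hypothetical token budgets
n_params = 1e9                    # hypothetical model size

# Data-noise exponent (0.6) above the signal exponent (0.3): loss turns back up.
loss_u = toy_loss(n_params, tokens)
# Data-noise exponent (0.2) below the signal exponent (0.3): loss keeps falling.
loss_mono = toy_loss(n_params, tokens, data_noise_exp=0.2)

best_u = int(np.argmin(loss_u))
best_mono = int(np.argmin(loss_mono))
print(f"data-noise exp 0.6: minimum loss at {tokens[best_u]:.2e} tokens (index {best_u})")
print(f"data-noise exp 0.2: minimum loss at {tokens[best_mono]:.2e} tokens (index {best_mono})")
```

In this toy form the signal-to-noise ratio eventually decays once the data-noise exponent exceeds the signal exponent, so the loss minimum sits inside the grid. With the order reversed the ratio keeps rising and the minimum sits at the largest token budget.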

Relevance To This Wiki

This is upstream language-model scaling-law evidence, not a time-series foundation-model (TSFM) paper. Its transferable value is the warning that “more parameters” and “more data” are not sufficient scaling variables once noise, compression, post-training, and downstream evaluation are part of the contract.

For time-series and world-model work, the closest analogue is an SNR-aware scaling law over dense numeric observations, event streams, context, channel count, horizon, actions, and interventions. A TSFM might improve predictably in a clean average-forecasting regime while degrading for rare regimes, high-channel coupling, quantized serving, post-training for reasoning, or action-conditioned rollout if the useful signal is too weak relative to noise and compression pressure.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Scaling and efficiency | warning | Shows monotonic LLM power laws can be high-SNR special cases and can fail under quantization, SFT, and noise. | Needs numeric time-series scaling laws with sample rate, horizon, channel count, context, and serving perturbations. |
| Data diversity and curriculum | adjacent | Data-induced noise can grow with token count and overwhelm signal in the fitted law. | Needs TSFM data-mixture and rare-regime studies, not only language-token loss. |
| Training dynamics and post-training | adjacent | SFT learning-rate sweeps produce U-shaped loss basins that the law fits better than monotonic baselines. | Needs direct tests on time-series reasoning, anomaly detection, and action-conditioned post-training. |
| Quantization and deployment compression | warning | 2-bit/3-bit/4-bit GPTQ results show larger or longer-trained models can be more fragile under precision reduction. | Needs hardware-aware latency, throughput, and dense numeric fidelity tests for TSFM serving. |
| Causal structure, counterfactuals, and control | insufficient evidence | The law could inform scaling budgets for action-conditioned models. | No actions, control inputs, interventions, counterfactual rollouts, or control utility are evaluated. |

Limitations

  • The evidence is language-model centric: Pythia, OLMo2, WikiText-2 loss, GSM8K/SiQA/StarCoder SFT, and GPTQ quantization.
  • The full law has nine fitted constants. The paper includes a six-parameter simplified law, but practical use still requires a grid of smaller runs and careful fit diagnostics.
  • The headline extrapolation is meaningful but modest: the held-out 12B model at up to 307B tokens sits only a limited distance beyond the fit grid in the emphasized joint setting, so it is not frontier-scale proof.
  • It treats loss as reciprocal to fitted capacity. That is a useful operational model, but not a mechanistic proof of what representations survive.
  • It does not evaluate numeric time series, event streams, graph time series, robotics trajectories, observability telemetry, or action-conditioned world models.
  • No official code or checkpoints had been released at ingest time, so reproduction effort may be higher.

Open Questions

  • What is the TSFM equivalent of SNR: signal-to-data-noise over samples, signal-to-channel-noise over multivariate coupling, signal-to-context-noise over textual metadata, or signal-to-control-noise over actions and interventions?
  • Do TSFM scaling laws become U-shaped under low-bit serving, long-horizon rollout, heavy post-training, data repetition, or rare-regime under-sampling?
  • Can fitted exponents separate useful extra data from accumulated noise in heterogeneous time-series corpora?
  • Which perturbation should be the first TSFM analogue of the paper’s Gaussian noise, SFT, and GPTQ sweeps: synthetic observation noise, channel dropout, quantized latent state, or reasoning/post-training updates?
  • Can SNR-aware scaling predict when compact models or specialized backbones should beat larger monotonic-scaling candidates under a deployment budget?