LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Source

Raw Markdown: paper_llms-noisy-channels-2026.md
PDF: paper_llms-noisy-channels-2026.pdf
Preprint: arXiv 2605.23901v1
Venue page: ICML 2026 poster
Gonzo ML discussion: Telegram post 5435 (local extract stored at papers/llms-noisy-channels-2026/telegram-post-gonzo-ml-5435.md)
Gonzo-linked review: ArXivIQ review
Podcast pointer: Gonzo ML Podcasts 3760
Local official-artifact metadata: papers/llms-noisy-channels-2026/official_artifacts_metadata.json

Status And Credibility

arXiv lists the paper as cs.LG, version v1, submitted on 2026-05-22. The rendered PDF first page gives a paper date of 2026-05-25 and affiliations with ByteDance Seed, University of Virginia, and University of California, Berkeley. The ICML virtual site lists the paper as an ICML 2026 poster, which makes this stronger than ordinary preprint-only evidence.

Credibility is high enough to treat this as an important ingest: the source has a tier-1 venue page, a ByteDance Seed / UVA / UC Berkeley author team, a public arXiv source, and experiments over Pythia and OLMo2 checkpoint grids. Caveats: the paper is a scaling-law fit and extrapolation study whose results have not been independently reproduced, and no official code or released model checkpoints were found at ingest time.

The Gonzo post and ArXivIQ review list Code: N/A and Model: N/A; artifact search at ingest time found no verified official repository or checkpoint release.

Core Claim

The paper argues that classical monotonic LLM scaling laws miss regimes where adding model size, training tokens, or post-training pressure makes downstream loss worse. It proposes the Shannon Scaling Law, treating LLM training as information transmission over a noisy channel: model parameters act like channel bandwidth, training tokens act like signal power, and data/model/architecture perturbations act like noise.

The central capacity form is:

$$C_{\text{LLM}} = a N^{α} \log_{2}\left(1 + \frac{b D^{β}}{c (D N)^{γ} + d D^{δ} + e}\right),$$

where $N$ is model size, $D$ is training tokens, $c (D N)^{γ}$ models data-model interaction noise, $d D^{δ}$ models data-induced accumulated noise, and $e$ is an irreducible noise floor. The paper then links loss inversely to capacity, $L(N, D) \approx 1 / C_{\text{LLM}}$.
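A minimal sketch of the capacity form and the reciprocal loss link, to make the shape of the law concrete. The function names and every constant below are hypothetical placeholders chosen only for illustration; they are not the paper's fitted values. The constants are picked so that $δ > β$, which lets data-induced noise eventually outgrow the signal along the token axis.

```python
import math

def shannon_capacity(N, D, a, b, c, d, e, alpha, beta, gamma, delta):
    """Capacity form quoted above: a * N^alpha * log2(1 + signal / noise)."""
    signal = b * D**beta
    noise = c * (D * N)**gamma + d * D**delta + e
    return a * N**alpha * math.log2(1.0 + signal / noise)

def predicted_loss(N, D, **params):
    """Reciprocal link L(N, D) ~ 1 / C_LLM."""
    return 1.0 / shannon_capacity(N, D, **params)

# Hypothetical constants (NOT from the paper), with delta > beta.
params = dict(a=1.0, b=1.0, c=1e-4, d=1e-3, e=1.0,
              alpha=0.1, beta=0.5, gamma=0.2, delta=1.0)

# Sweep illustrative token counts from 1 to 1e6 on a log grid at fixed N.
tokens = [10.0 ** (k / 10.0) for k in range(61)]
losses = [predicted_loss(1e3, D, **params) for D in tokens]

# Index of the lowest predicted loss along the token axis.
best = min(range(len(losses)), key=losses.__getitem__)
print(f"lowest predicted loss at D = {tokens[best]:.3g}")
```

Under these placeholder constants the minimum falls strictly inside the swept range, so predicted loss first falls and then rises with more tokens. A monotonic power law cannot produce that shape.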

The useful conceptual move goes beyond the equation: the paper claims that the familiar monotonic scaling regime is a high-SNR special case, while catastrophic overtraining, aggressive supervised fine-tuning, low-bit quantization, and injected Gaussian noise expose low-SNR U-shaped loss basins.

Evidence

The paper fits and compares scaling-law forms on Pythia-dedup and OLMo2 checkpoints. Each checkpoint is evaluated by WikiText-2 loss after one of three perturbation families is applied:

additive Gaussian weight noise with controlled SNR;
full supervised fine-tuning on GSM8K, SiQA, and StarCoder-Python with varying learning rates;
GPTQ post-training quantization to 4-bit, 3-bit, and 2-bit.
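The first perturbation family can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: it assumes SNR is defined as mean weight power over noise variance, expressed in dB, and it uses a random matrix as a stand-in for a real weight tensor. The helper name `add_weight_noise` is hypothetical.

```python
import numpy as np

def add_weight_noise(weights, snr_db, rng):
    """Return weights plus zero-mean Gaussian noise at a target SNR in dB.

    Assumption: SNR = mean(weights**2) / noise_variance.
    """
    signal_power = np.mean(weights**2)
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=weights.shape)
    return weights + noise

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(512, 512))  # stand-in for one weight matrix
w_noisy = add_weight_noise(w, snr_db=10.0, rng=rng)

# Empirical SNR of the perturbed tensor, in dB, as a sanity check.
measured = 10.0 * np.log10(np.mean(w**2) / np.mean((w_noisy - w) ** 2))
print(f"measured SNR: {measured:.2f} dB")
```

Checking the measured SNR against the target confirms the noise scale before a sweep. Without that check, a scaling bug could be mistaken for a model effect.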

Key reported numbers:

Setting	Reported result for Shannon Scaling Law	Why it matters
Gaussian-noise fitting	Pythia average $R^{2} = 0.9613 \pm 0.03$; OLMo2 average $R^{2} = 0.9585 \pm 0.06$	Robust across high- and low-SNR regimes.
SFT fitting	Average $R^{2} = 0.936$ on GSM8K, $0.916$ on SiQA, $0.937$ on StarCoder	Captures U-shaped loss basins where monotonic laws can collapse.
Quantization fitting	Pythia average $R^{2} = 0.9824 \pm 0.02$; OLMo2 average $R^{2} = 0.9548 \pm 0.06$	Models degradation down to 2-bit GPTQ better than standard laws.
Unperturbed pretraining	$R^{2} = 0.9915$ on Pythia and $0.9889$ on OLMo2	Treats ordinary pretraining as the high-SNR special case.
Joint extrapolation	Fitted on $\leq 6.9$B Pythia models and $\leq 180$B tokens; predicts held-out 12B up to 307B tokens at pooled $R^{2} = 0.847$	Shows nontrivial extrapolation, though only about $1.7\times$ beyond the fit grid.
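For reading the table, a short sketch of the goodness-of-fit metric and of the extrapolation distance in the last row. It assumes the standard coefficient of determination; this page does not record whether the paper computes $R^{2}$ on raw or transformed loss, so treat the function as a generic reference. The helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Extrapolation distance implied by the joint-extrapolation row.
size_ratio = 12.0 / 6.9      # held-out 12B model vs largest fitted 6.9B model
token_ratio = 307.0 / 180.0  # 307B held-out tokens vs 180B fitted tokens
print(f"size ratio {size_ratio:.2f}, token ratio {token_ratio:.2f}")
```

Both ratios round to 1.7, so the held-out point sits less than a factor of two beyond the fit grid on each axis.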

The exponent analysis is the main interpretability result. In high-SNR regimes, the fitted bandwidth exponent can exceed the model-noise exponent, so scaling model size helps. In low-SNR regimes, the model-noise exponent can exceed the bandwidth exponent, so larger models amplify perturbation. The paper also argues that the data-noise exponent $δ$ can exceed the signal exponent $β$, making token-axis U-shaped degradation an intrinsic limiting behavior under enough accumulated noise.
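The exponent argument can be read off the capacity form directly. What follows is a sketch derived from the equation as quoted on this page, not a derivation taken from the paper, and it assumes that the "bandwidth exponent" is $α$ and the "model-noise exponent" is $γ$. Write $\mathrm{SNR} = b D^{β} / (c (D N)^{γ} + d D^{δ} + e)$ and use $\log_{2}(1 + x) \approx x / \ln 2$ for small $x$:

$$C_{\text{LLM}} \approx \frac{a b}{\ln 2} \cdot \frac{N^{α} D^{β}}{c (D N)^{γ} + d D^{δ} + e} \qquad (\mathrm{SNR} \ll 1).$$

If the interaction term $c (D N)^{γ}$ dominates the noise, then

$$L \approx \frac{1}{C_{\text{LLM}}} \propto N^{γ - α} D^{γ - β},$$

so loss grows with $N$ whenever $γ > α$. If the data term $d D^{δ}$ dominates instead, then

$$L \propto N^{-α} D^{δ - β},$$

so loss grows with $D$ whenever $δ > β$. In the opposite limit, $\mathrm{SNR} \gg 1$, the factor $\log_{2}(1 + \mathrm{SNR}) \approx \log_{2} \mathrm{SNR}$ changes only logarithmically, so $L \approx 1 / (a N^{α} \log_{2} \mathrm{SNR})$ falls roughly as $N^{-α}$, which is the familiar monotonic behavior.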

Relevance To This Wiki

This is upstream language-model scaling-law evidence, not a time-series foundation-model paper. Its transferable value is the warning that “more parameters” and “more data” are not sufficient scaling variables once noise, compression, post-training, and downstream evaluation are part of the contract.

For time-series and world-model work, the closest analogue is an SNR-aware scaling law over dense numeric observations, event streams, context, channel count, horizon, actions, and interventions. A TSFM might improve predictably in a clean average-forecasting regime while degrading for rare regimes, high-channel coupling, quantized serving, post-training for reasoning, or action-conditioned rollout if the useful signal is too weak relative to noise and compression pressure.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Scaling and efficiency	warning	Shows monotonic LLM power laws can be high-SNR special cases and can fail under quantization, SFT, and noise.	Needs numeric time-series scaling laws with sample rate, horizon, channel count, context, and serving perturbations.
Data diversity and curriculum	adjacent	Data-induced noise can grow with token count and overwhelm signal in the fitted law.	Needs TSFM data-mixture and rare-regime studies, not only language-token loss.
Training dynamics and post-training	adjacent	SFT learning-rate sweeps produce U-shaped loss basins that the law fits better than monotonic baselines.	Needs direct tests on time-series reasoning, anomaly detection, and action-conditioned post-training.
Quantization and deployment compression	warning	2-bit/3-bit/4-bit GPTQ results show larger or longer-trained models can be more fragile under precision reduction.	Needs hardware-aware latency, throughput, and dense numeric fidelity tests for TSFM serving.
Causal structure, counterfactuals, and control	insufficient evidence	The law could inform scaling budgets for action-conditioned models.	No actions, control inputs, interventions, counterfactual rollouts, or control utility are evaluated.

Limitations

The evidence is language-model centric: Pythia, OLMo2, WikiText-2 loss, GSM8K/SiQA/StarCoder SFT, and GPTQ quantization.
The full law has nine fitted constants. The paper includes a six-parameter simplified law, but practical use still requires a grid of smaller runs and careful fit diagnostics.
The headline extrapolation is meaningful but modest: held-out 12B and up to 307B tokens, roughly $1.7\times$ beyond the fit grid in the emphasized joint setting, not frontier-scale proof.
The paper treats loss as the reciprocal of fitted capacity. That is a useful operational model, but not a mechanistic proof of what representations survive.
It does not evaluate numeric time series, event streams, graph time series, robotics trajectories, observability telemetry, or action-conditioned world models.
No official code or checkpoints had been released at ingest time, so reproduction effort may be higher.

Links Into The Wiki

Open Questions

What is the TSFM equivalent of SNR: signal-to-data-noise over samples, signal-to-channel-noise over multivariate coupling, signal-to-context-noise over textual metadata, or signal-to-control-noise over actions and interventions?
Do TSFM scaling laws become U-shaped under low-bit serving, long-horizon rollout, heavy post-training, data repetition, or rare-regime under-sampling?
Can fitted exponents separate useful extra data from accumulated noise in heterogeneous time-series corpora?
Which perturbation should be the first TSFM analogue of the paper’s Gaussian noise, SFT, and GPTQ sweeps: synthetic observation noise, channel dropout, quantized latent state, or reasoning/post-training updates?
Can SNR-aware scaling predict when compact models or specialized backbones should beat larger monotonic-scaling candidates under a deployment budget?
