LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Source
- Raw Markdown: paper_llms-noisy-channels-2026.md
- PDF: paper_llms-noisy-channels-2026.pdf
- Preprint: arXiv 2605.23901v1
- Venue page: ICML 2026 poster
- Gonzo ML discussion: Telegram post 5435 (local extract stored at papers/llms-noisy-channels-2026/telegram-post-gonzo-ml-5435.md)
- Gonzo-linked review: ArXivIQ review
- Podcast pointer: Gonzo ML Podcasts 3760
- Local official-artifact metadata: papers/llms-noisy-channels-2026/official_artifacts_metadata.json
Status And Credibility
arXiv lists the paper as cs.LG, version v1, submitted on 2026-05-22. The rendered PDF first page gives a paper date of 2026-05-25 and affiliations with ByteDance Seed, University of Virginia, and University of California, Berkeley. The ICML virtual site lists the paper as an ICML 2026 poster, which makes this stronger than ordinary preprint-only evidence.
Credibility is high enough for an important ingest because the source has a tier-1 venue page, a ByteDance Seed / UVA / UC Berkeley author team, a public arXiv source, and experiments over Pythia and OLMo2 checkpoint grids. Caveats: the paper is still a scaling-law fit and extrapolation study rather than independently reproduced infrastructure; official code and released model checkpoints were not found at ingest time.
The Gonzo post and ArXivIQ review list Code: N/A and Model: N/A; artifact search at ingest time found no verified official repository or checkpoint release.
Core Claim
The paper argues that classical monotonic LLM scaling laws miss regimes where adding model size, training tokens, or post-training pressure makes downstream loss worse. It proposes the Shannon Scaling Law, treating LLM training as information transmission over a noisy channel: model parameters act like channel bandwidth, training tokens act like signal power, and data/model/architecture perturbations act like noise.
The central capacity form is a function of model size, training tokens, a data-model interaction noise term, a data-induced accumulated noise term, and an irreducible noise floor. The paper then links loss inversely to capacity.
The useful conceptual move is not only the equation. It is the claim that the familiar monotonic scaling regime is a high-SNR special case, while catastrophic overtraining, aggressive supervised fine-tuning, low-bit quantization, and injected Gaussian noise expose low-SNR U-shaped loss basins.
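For orientation, the classical Shannon-Hartley theorem behind the analogy is sketched below. This is the textbook formula only, not the paper's fitted capacity form. The comments restate the correspondence described in this section, and the symbol names are conventional choices, not necessarily the paper's notation.

```latex
% Shannon-Hartley capacity of a band-limited channel with additive white Gaussian noise
C = B \log_2\!\left(1 + \frac{S}{N_{\mathrm{noise}}}\right)
% B                : bandwidth    (analogue here: model parameters)
% S                : signal power (analogue here: training tokens)
% N_{\mathrm{noise}}: noise power  (analogue here: data/model/architecture perturbations)

% High-SNR limit: when S / N_{\mathrm{noise}} \gg 1,
C \approx B \log_2\!\left(\frac{S}{N_{\mathrm{noise}}}\right)
% which, for a fixed noise level, grows monotonically in both B and S.
```

If the noise term itself grows with bandwidth or signal, the ratio inside the logarithm can shrink as scale increases. That is the general mechanism by which a capacity-based law can produce non-monotonic behavior.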
Evidence
The paper fits and compares scaling-law forms on Pythia-dedup and OLMo2 checkpoints. Evaluation uses WikiText-2 loss after three perturbation families:
- additive Gaussian weight noise with controlled SNR;
- full supervised fine-tuning on GSM8K, SiQA, and StarCoder-Python with varying learning rates;
- GPTQ post-training quantization to 4-bit, 3-bit, and 2-bit.
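The first perturbation family can be illustrated with a minimal NumPy sketch that adds Gaussian noise to a weight tensor at a target SNR in decibels. The function names are hypothetical, and defining SNR per tensor (mean squared weight over noise variance) is an assumption; this note does not record how the paper defines or applies its SNR.

```python
import numpy as np


def add_noise_at_snr(weights, snr_db, rng):
    """Return a copy of `weights` with additive Gaussian noise at the target SNR (dB).

    Assumption: SNR is defined per tensor as mean(weights**2) / noise_variance.
    """
    signal_power = np.mean(weights ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=weights.shape)
    return weights + noise


def measured_snr_db(clean, noisy):
    """Empirical SNR (dB) of `noisy` relative to `clean`."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


rng = np.random.default_rng(0)
# Stand-in for one weight matrix of a model; values are synthetic.
w = rng.normal(0.0, 0.02, size=(512, 512))

for target_db in (30.0, 10.0, 0.0):
    w_noisy = add_noise_at_snr(w, target_db, rng)
    print(f"target {target_db:5.1f} dB -> measured {measured_snr_db(w, w_noisy):5.2f} dB")
```

Sweeping `target_db` from high to low values is one way to move a fixed checkpoint from a high-SNR to a low-SNR regime before re-evaluating loss.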
Key reported numbers:
| Setting | Reported result for Shannon Scaling Law | Why it matters |
|---|---|---|
| Gaussian-noise fitting | Pythia average ; OLMo2 average | Robust across high- and low-SNR regimes. |
| SFT fitting | Average on GSM8K, on SiQA, on StarCoder | Captures U-shaped loss basins where monotonic laws can collapse. |
| Quantization fitting | Pythia average ; OLMo2 average | Models degradation down to 2-bit GPTQ better than standard laws. |
| Unperturbed pretraining | on Pythia and on OLMo2 | Treats ordinary pretraining as the high-SNR special case. |
| Joint extrapolation | fitted on B Pythia models and B tokens, predicts held-out 12B up to 307B tokens at pooled | Shows nontrivial extrapolation, though only about beyond the fit grid. |
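The fit-then-extrapolate workflow behind the last row can be sketched with a monotonic power-law baseline. Everything below is illustrative: the data are synthetic, the grid sizes are arbitrary, the helper names are hypothetical, and R² is an assumed choice of score. It shows only the protocol (fit on a small grid, score on held-out larger scales), not the Shannon Scaling Law itself.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return np.exp(intercept), -slope


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
true_a, true_alpha = 20.0, 0.1  # synthetic ground truth, not from the paper


def synthetic_loss(sizes):
    # Power law with small multiplicative noise (0.5 percent in log space).
    return true_a * sizes ** (-true_alpha) * np.exp(rng.normal(0.0, 0.005, sizes.shape))


fit_sizes = np.logspace(7.0, 9.5, 8)    # small-model grid used for fitting
held_sizes = np.logspace(9.7, 10.3, 4)  # larger held-out scales

a_hat, alpha_hat = fit_power_law(fit_sizes, synthetic_loss(fit_sizes))
held_losses = synthetic_loss(held_sizes)
held_pred = a_hat * held_sizes ** (-alpha_hat)

print("fitted alpha:", round(float(alpha_hat), 4))
print("held-out R^2:", round(float(r_squared(held_losses, held_pred)), 4))
```

A monotonic baseline like this extrapolates well only when the held-out regime stays monotonic. The perturbed settings in the table are where such a baseline would be expected to break.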
The exponent analysis is the main interpretability result. In high-SNR regimes, the fitted bandwidth exponent can exceed the model-noise exponent, so scaling model size helps. In low-SNR regimes, the model-noise exponent can exceed the bandwidth exponent, so larger models amplify perturbation. The paper also argues that the data-noise exponent can exceed the signal exponent, making token-axis U-shaped degradation an intrinsic limiting behavior under enough accumulated noise.
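The token-axis argument can be made concrete with a toy model. The functional form below is a hypothetical stand-in chosen for illustration, not the paper's fitted law: signal grows as a power of tokens, noise is a floor plus a term that also grows as a power of tokens, and loss is the reciprocal of a Shannon-style capacity. All parameter names and values are assumptions.

```python
import numpy as np


def toy_loss(tokens, signal_exp, noise_exp,
             noise_floor=1.0, noise_scale=1e-8, bandwidth=1.0):
    """Toy loss = 1 / capacity, with capacity = bandwidth * log2(1 + SNR).

    Hypothetical form: SNR = tokens**signal_exp / (noise_floor + noise_scale * tokens**noise_exp).
    """
    snr = tokens ** signal_exp / (noise_floor + noise_scale * tokens ** noise_exp)
    capacity = bandwidth * np.log2(1.0 + snr)
    return 1.0 / capacity


tokens = np.logspace(6, 12, 61)

# Signal exponent above data-noise exponent: loss keeps falling with more tokens.
monotonic = toy_loss(tokens, signal_exp=0.5, noise_exp=0.3)

# Data-noise exponent above signal exponent: loss falls, bottoms out, then rises.
u_shaped = toy_loss(tokens, signal_exp=0.5, noise_exp=0.9)

best = int(np.argmin(u_shaped))
print("monotonic case strictly decreasing:", bool(np.all(np.diff(monotonic) < 0)))
print("U-shaped case minimum at tokens ~ %.2e" % tokens[best])
```

In this toy, the loss minimum moves to smaller token counts as `noise_scale` grows. This matches the qualitative picture of stronger perturbation shrinking the useful training range.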
Relevance To This Wiki
This is upstream language-model scaling-law evidence, not a time-series foundation-model paper. Its transferable value is the warning that “more parameters” and “more data” are not sufficient scaling variables once noise, compression, post-training, and downstream evaluation are part of the contract.
For time-series and world-model work, the closest analogue is an SNR-aware scaling law over dense numeric observations, event streams, context, channel count, horizon, actions, and interventions. A TSFM might improve predictably in a clean average-forecasting regime while degrading for rare regimes, high-channel coupling, quantized serving, post-training for reasoning, or action-conditioned rollout if the useful signal is too weak relative to noise and compression pressure.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Scaling and efficiency | warning | Shows monotonic LLM power laws can be high-SNR special cases and can fail under quantization, SFT, and noise. | Needs numeric time-series scaling laws with sample rate, horizon, channel count, context, and serving perturbations. |
| Data diversity and curriculum | adjacent | Data-induced noise can grow with token count and overwhelm signal in the fitted law. | Needs TSFM data-mixture and rare-regime studies, not only language-token loss. |
| Training dynamics and post-training | adjacent | SFT learning-rate sweeps produce U-shaped loss basins that the law fits better than monotonic baselines. | Needs direct tests on time-series reasoning, anomaly detection, and action-conditioned post-training. |
| Quantization and deployment compression | warning | 2-bit/3-bit/4-bit GPTQ results show larger or longer-trained models can be more fragile under precision reduction. | Needs hardware-aware latency, throughput, and dense numeric fidelity tests for TSFM serving. |
| Causal structure, counterfactuals, and control | insufficient evidence | The law could inform scaling budgets for action-conditioned models. | No actions, control inputs, interventions, counterfactual rollouts, or control utility are evaluated. |
Limitations
- The evidence is language-model centric: Pythia, OLMo2, WikiText-2 loss, GSM8K/SiQA/StarCoder SFT, and GPTQ quantization.
- The full law has nine fitted constants. The paper includes a six-parameter simplified law, but practical use still requires a grid of smaller runs and careful fit diagnostics.
- The headline extrapolation is meaningful but modest: held-out 12B and up to 307B tokens, roughly beyond the fit grid in the emphasized joint setting, not frontier-scale proof.
- It treats loss as reciprocal to fitted capacity. That is a useful operational model, but not a mechanistic proof of what representations survive.
- It does not evaluate numeric time series, event streams, graph time series, robotics trajectories, observability telemetry, or action-conditioned world models.
- No official code or checkpoint release was found at ingest time, so reproduction effort may be higher.
Links Into The Wiki
- LLMs as Noisy Channels
- Time-Series Scaling And Efficiency
- Training Dynamics
- LLM Post-Training
- Foundation Time-Series Model Research Agenda
- Compute Optimal Tokenization
- TurboQuant
- Scaling-laws for Large Time-series Models
- Scaling Law for Time Series Forecasting
Open Questions
- What is the TSFM equivalent of SNR: signal-to-data-noise over samples, signal-to-channel-noise over multivariate coupling, signal-to-context-noise over textual metadata, or signal-to-control-noise over actions and interventions?
- Do TSFM scaling laws become U-shaped under low-bit serving, long-horizon rollout, heavy post-training, data repetition, or rare-regime under-sampling?
- Can fitted exponents separate useful extra data from accumulated noise in heterogeneous time-series corpora?
- Which perturbation should be the first TSFM analogue of the paper’s Gaussian noise, SFT, and GPTQ sweeps: synthetic observation noise, channel dropout, quantized latent state, or reasoning/post-training updates?
- Can SNR-aware scaling predict when compact models or specialized backbones should beat larger monotonic-scaling candidates under a deployment budget?