Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Source
- Raw Markdown: paper_reinforcement-learning-small-subnetworks-2025.md
- PDF: paper_reinforcement-learning-small-subnetworks-2025.pdf
- Preprint: arXiv:2505.11711
- Venue record: NeurIPS 2025 OpenReview poster
- Official code: SagnikMukherjee/sparsity_in_rl
- Official X thread: Sagnik Mukherjee announcement
- Local X snapshot: papers/reinforcement-learning-small-subnetworks-2025/x-thread-lifan-yuan-1927101010654146972.json
Status And Credibility
This source was first submitted to arXiv on 2025-05-16 and revised to v2 on 2025-12-18. The arXiv page lists NeurIPS 2025, and OpenReview lists it as a NeurIPS 2025 poster, published 2025-09-18 and last modified 2026-04-21. It is credible venue evidence for the LLM post-training cluster, with official code and author/coauthor X discussion captured through the authenticated X API.
The useful caveat is that this is empirical update-geometry evidence over language-model RL/post-training checkpoints and controlled DPO/PRIME experiments. It should not be promoted into a universal law for every optimizer, training duration, modality, or precision regime.
Core Claim
The paper argues that full-parameter RL finetuning often changes only a small subset of LLM parameters. The authors call this parameter update sparsity: the update delta is sparse even though the final model remains dense. Across 7 RL or preference-optimization algorithms and 10 LLMs, they report that RL updates only about 5%-30% of parameters, while SFT produces much denser updates.
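The paper defines update sparsity as the fraction of parameters whose values are unchanged, within a numerical tolerance, between the initial checkpoint and the finetuned checkpoint.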
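A minimal sketch of that measurement, assuming two checkpoints flattened into equal-length lists of floats. The function name `update_sparsity`, the default tolerance `tol=1e-5`, and the toy values are assumptions of this sketch, not taken from the official code.

```python
from typing import Sequence


def update_sparsity(theta0: Sequence[float], theta1: Sequence[float],
                    tol: float = 1e-5) -> float:
    """Fraction of parameters that changed by at most `tol` (assumed tolerance)."""
    if len(theta0) != len(theta1):
        raise ValueError("checkpoints must have the same number of parameters")
    if len(theta0) == 0:
        raise ValueError("checkpoints must not be empty")
    unchanged = sum(1 for a, b in zip(theta0, theta1) if abs(b - a) <= tol)
    return unchanged / len(theta0)


# Toy checkpoints: only indices 2 and 8 differ between base and tuned.
base = [0.50, -1.20, 0.03, 0.75, -0.40, 1.10, 0.00, 0.25, -0.90, 0.60]
tuned = [0.50, -1.20, 0.08, 0.75, -0.40, 1.10, 0.00, 0.25, -0.85, 0.60]

sparsity = update_sparsity(base, tuned)
print(sparsity)  # 0.8
```

A sparsity of 0.8 here means 20% of the toy parameters were updated, which sits inside the 5%-30% range the paper reports for RL.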
The result is not just layer freezing. The updated parameters are distributed across almost all transformer layers and matrices, and the resulting update matrices are nearly full-rank. That separates this phenomenon from LoRA-style low-rank adaptation.
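The difference between sparse and low-rank updates can be illustrated with two toy 4x4 update matrices. This is a hypothetical illustration, not data from the paper: `sparse_delta` has one nonzero entry per row and column, and `low_rank_delta` is a LoRA-style rank-1 outer product. The helpers `matrix_rank` and `density` are written for this sketch.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via exact Gaussian elimination over rationals (fine for tiny matrices)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def density(rows):
    """Fraction of entries that are nonzero."""
    total = sum(len(row) for row in rows)
    nonzero = sum(1 for row in rows for x in row if x != 0)
    return nonzero / total


# Sparse update: only 4 of 16 entries change, yet the matrix is full rank.
sparse_delta = [
    [0, 2, 0, 0],
    [0, 0, 0, 1],
    [3, 0, 0, 0],
    [0, 0, 5, 0],
]

# LoRA-style update: every entry changes, but the matrix has rank 1.
u = [1, 2, 3, 4]
v = [1, -1, 2, 1]
low_rank_delta = [[a * b for b in v] for a in u]

print(density(sparse_delta), matrix_rank(sparse_delta))      # 0.25 4
print(density(low_rank_delta), matrix_rank(low_rank_delta))  # 1.0 1
```

The two properties are independent: a low-rank adapter cannot represent the first matrix, and a sparsity mask alone does not imply the second.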
flowchart LR
    Base[pretrained or SFT model] --> RL[RL / preference optimization]
    RL --> Delta[sparse update delta]
    Delta --> Dense[final dense model]
    Delta --> Mask[subnetwork mask]
    Base --> Masked[masked-gradient finetuning]
    Mask --> Masked
    Masked --> Match[similar accuracy and near-identical weights]
Key Contributions
- Measures update sparsity across seven RL or preference-optimization algorithms (PPO, GRPO, ORPO, KTO, DPO, SimPO, PRIME) and also examines rejection-sampling finetuning checkpoints.
- Reports that RL-finetuned models leave 68.5%-96.0% of parameters unchanged under the paper’s tolerance, while SFT checkpoints are much denser.
- Shows that the sparse update is spread across layers and parameter matrices rather than isolated to a few modules; layer norms are the main high-sparsity exception.
- Shows that sparse RL updates are nearly full-rank, so the phenomenon is not simply low-rank adaptation.
- Tests masked-gradient finetuning on the final RL subnetwork and reports that it recovers or slightly improves DPO and PRIME task performance while producing near-identical weights under relaxed tolerance.
- Finds subnetwork overlap across seeds, data, and even algorithms above random baselines, suggesting a partially reusable update structure.
- Argues that the main driver is training on data close to the policy distribution; KL regularization, gradient clipping, online/offline status, and SFT-before-RL are not sufficient explanations in the paper’s tests.
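The subnetwork-overlap finding in the list above can be made concrete with a small set-based measure. The sketch below assumes overlap is the fraction of one run's updated parameter indices that also appear in the other run's, and compares it with the value expected if the first set were drawn uniformly at random. The function names and toy index sets are hypothetical, and the paper's exact metric may differ.

```python
def overlap(mask_a: set, mask_b: set) -> float:
    """Fraction of mask_a's updated indices that are also updated in mask_b."""
    if not mask_a:
        return 0.0
    return len(mask_a & mask_b) / len(mask_a)


def random_baseline(mask_b: set, n_params: int) -> float:
    """Expected overlap(mask_a, mask_b) if mask_a were a uniformly random index set."""
    return len(mask_b) / n_params


# Toy example: 20 parameters, two runs (e.g. different seeds) with their updated indices.
n_params = 20
run_a = {1, 4, 7, 12}
run_b = {1, 4, 9, 12, 15}

print(overlap(run_a, run_b), random_baseline(run_b, n_params))  # 0.75 0.25
print(overlap(run_b, run_a), random_baseline(run_a, n_params))  # 0.6 0.2
```

Reporting the measure in both directions matters because the two subnetworks generally differ in size, so the overlap is not symmetric.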
Why It Matters
This source sharpens the wiki’s post-training lens. A post-training method should not only be scored by benchmark gain or reward; it should also report update sparsity, update rank, layer distribution, tolerance, retention, and whether the training data is close to the current policy distribution.
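One way such a report could look for the layer-distribution part is sketched below. It assumes each checkpoint is a mapping from tensor name to a flat list of values; the tensor names, the tolerance, and the function `per_tensor_sparsity` are hypothetical placeholders rather than anything from the official code.

```python
from typing import Dict, List


def per_tensor_sparsity(base: Dict[str, List[float]],
                        tuned: Dict[str, List[float]],
                        tol: float = 1e-5) -> Dict[str, float]:
    """Update sparsity per named tensor, so the layer distribution is visible."""
    report = {}
    for name, old in base.items():
        new = tuned[name]
        unchanged = sum(1 for a, b in zip(old, new) if abs(b - a) <= tol)
        report[name] = unchanged / len(old)
    return report


# Toy checkpoints with hypothetical tensor names.
base = {
    "layer0.attn.q_proj": [0.1, 0.2, 0.3, 0.4],
    "layer0.mlp.up_proj": [0.5, 0.6, 0.7, 0.8],
    "layer0.norm": [1.0, 1.0],
}
tuned = {
    "layer0.attn.q_proj": [0.1, 0.25, 0.3, 0.4],
    "layer0.mlp.up_proj": [0.5, 0.6, 0.75, 0.8],
    "layer0.norm": [1.0, 1.0],
}

report = per_tensor_sparsity(base, tuned)
for name, value in report.items():
    print(f"{name}: {value:.2f} unchanged at tol=1e-5")
```

Printing the tolerance next to each number keeps the report comparable across studies that pick different thresholds.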
It also complicates simple parameter-efficient finetuning stories. RL may naturally update a small subset of parameters, but the subset is not confined to a few layers and is not low-rank. If one wants to exploit it, the training system may need sparse full-matrix updates or early subnetwork discovery rather than ordinary adapters.
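A sparse full-matrix update can be sketched as gradient descent with a fixed boolean mask. The example below is a toy under stated assumptions: plain SGD, a quadratic stand-in loss, and a hand-picked mask. It is not the paper's training setup, and all names are hypothetical.

```python
def quadratic_grads(params, target):
    """Gradient of 0.5 * sum((p - t)^2), a stand-in for a real training loss."""
    return [p - t for p, t in zip(params, target)]


def masked_sgd_step(params, grads, mask, lr=0.1):
    """One SGD step that only touches parameters where mask is True."""
    return [p - lr * g if keep else p for p, g, keep in zip(params, grads, mask)]


params = [0.0, 0.0, 0.0, 0.0, 0.0]
target = [1.0, -2.0, 0.5, 3.0, -1.0]
mask = [False, True, False, True, False]  # hypothetical subnetwork mask

for _ in range(200):
    grads = quadratic_grads(params, target)
    params = masked_sgd_step(params, grads, mask)

# Masked-out entries stay exactly at their initial value; the rest converge.
print(params)
```

In a real training stack, zeroing gradients is not always enough to freeze a weight. Optimizers with decoupled weight decay, such as AdamW, still shrink parameters whose gradient is zero, so the mask would also need to cover the decay step.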
Relationship To Existing Wiki Threads
Dynamic Fine-Tuning argues that SFT can over-amplify low-probability expert tokens. This paper adds a direct weight-space contrast: SFT updates densely in the reported comparisons, while RL often updates sparsely when the training data is near the policy distribution.
Evolution Strategies at Scale and Evolutionary Strategies lead to Catastrophic Forgetting in LLMs ask whether reward-only parameter-space search can replace or complement RL. This paper gives gradient-based RL a different diagnostic: sparse, full-rank, broadly distributed updates may be one reason RL can improve target behavior while preserving more prior capability than dense post-training recipes.
The Universal Weight Subspace Hypothesis tracks reusable low-rank adapter or checkpoint subspaces. This paper is a useful counterpoint because the reported RL deltas are sparse but nearly full-rank, so reusable update geometry need not be low-rank.
Limitations
- The paper is empirical and uses finite precision/tolerance when deciding whether parameters changed; very small updates can disappear under bfloat16 or tolerance choices.
- The experiments focus on language models and RL-style LLM post-training, not numeric time-series foundation models, multimodal diffusion, robotics, or action-conditioned world models.
- Fully controlled experiments are expensive, so some comparisons rely on public checkpoints with incomplete training details.
- Sparsity decreases with more training steps in the PRIME analysis, even if it appears to converge to a nonzero level; very long training remains an open question.
- The paper identifies a final subnetwork and then tests masked-gradient finetuning. It does not yet provide a practical early method for discovering the subnetwork before a full RL run.
- The authors note possible confounders and call for more theory; the mechanism is not settled.
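The precision caveat in the list above can be illustrated by emulating bfloat16 rounding with the standard library. The helper `to_bfloat16` is a sketch written for this note: it rounds the float32 bit pattern to its top 16 bits with round-to-nearest-even and does not handle NaN or infinity. The weight and update sizes are arbitrary toy values.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value, returned as a Python float.

    Sketch only: no special handling for NaN or infinity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 1.0
stored_tiny = to_bfloat16(weight + 1e-4)    # update far below bfloat16 spacing near 1.0
stored_larger = to_bfloat16(weight + 1e-2)  # update above half the spacing near 1.0

print(stored_tiny == weight)  # True
print(stored_larger)          # 1.0078125
```

Near 1.0 adjacent bfloat16 values are 2^-7 apart, so an update of 1e-4 is rounded away and the parameter counts as unchanged, while an update of 1e-2 moves the weight to the next representable value.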
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Scaling and efficiency | adjacent | Shows that LLM RL post-training may naturally concentrate updates in a sparse but full-rank subnetwork. | Needs TSFM and action-conditioned world-model post-training runs with update-sparsity, retention, and rare-regime probes. |
| Data diversity and long tail | warning | Suggests in-distribution data is a major driver of sparse updates. | Needs tests where rare or out-of-distribution regimes require dense adaptation rather than conservative updates. |
| Dynamic compute and adaptation | adjacent | Opens a route to sparse-update training if subnetworks can be identified early. | Needs practical sparse-update systems and evidence that sparse updates preserve dense numeric detail. |
| Control and counterfactuals | insufficient evidence | No action-conditioned dynamics or intervention rollout is evaluated. | Needs explicit action/control input data and causal or policy-evaluation tests. |
Links Into The Wiki
- LLM Post-Training
- Training Dynamics
- Time-Series Scaling And Efficiency
- Evolution Strategies
- Dynamic Fine-Tuning
- The Universal Weight Subspace Hypothesis
Open Questions
- Can the RL-updated subnetwork be identified before paying for a full RL run?
- Does sparse full-rank update geometry persist for much longer RL runs, larger models, multimodal models, or diffusion/world-model training?
- Which retention tests best separate useful sparse adaptation from undertraining?
- Can sparse-update training systems exploit this without losing the distributed full-rank structure across layers?
- For time-series or operational agents, does near-policy training preserve old dynamics at the cost of underlearning rare but important regime changes?