Reinforcement Learning on Pre-Training Data
Source
- Raw Markdown: paper_rlpt-2025.md
- PDF: paper_rlpt-2025.pdf
- Preprint: arXiv:2509.19249
- Official X thread: Tencent Hunyuan announcement
- Local X snapshot: papers/rlpt-2025/x-thread-tencent-hunyuan-1971118167281057821.json
Status And Credibility
This arXiv preprint was first submitted on 2025-09-23 and revised to v2 on 2025-09-25. The arXiv comments field reads "Work in progress." The authors are primarily from Tencent’s LLM Department, the HunYuan Infra Team, and the Chinese University of Hong Kong. An official Tencent Hunyuan X announcement was captured through the authenticated X API.
Treat the paper as current but not settled. It has an official lab announcement and a large Tencent/CUHK author team, but no accepted venue record, project page, or official code repository was found during ingest.
Core Claim
RLPT proposes to scale reinforcement learning directly on pre-training text rather than relying only on supervised next-token prediction, human preference labels, or verifiable task labels. The method turns unlabeled text into segment-level RL tasks: predict a target segment from context, then score whether the prediction semantically matches the true next or masked segment.
The paper defines two next-segment reasoning tasks:
- Autoregressive Segment Reasoning (ASR): predict segment $s_i$ from the preceding context $s_{<i}$.
- Middle Segment Reasoning (MSR): predict segment $s_i$ from both the preceding context $s_{<i}$ and the following segment $s_{i+1}$.
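The two task types can be sketched as a small data-construction step. This is a minimal illustration, not the paper's pipeline: the regex sentence splitter, the `SegmentTask` container, and the `build_tasks` helper are all hypothetical names introduced here, and real prompts would wrap these fields in an instruction template.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SegmentTask:
    kind: str                 # "ASR" or "MSR"
    context: str              # text preceding the target segment
    target: str               # reference segment the policy must predict
    following: Optional[str]  # next segment, supplied only for MSR


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_tasks(text: str) -> List[SegmentTask]:
    """Turn one document into ASR and MSR tasks over sentence-level segments."""
    segs = split_sentences(text)
    tasks: List[SegmentTask] = []
    for i in range(1, len(segs)):
        context = " ".join(segs[:i])
        # ASR: predict segment i from everything before it.
        tasks.append(SegmentTask("ASR", context, segs[i], None))
        # MSR: same target, but the following segment is also visible.
        if i + 1 < len(segs):
            tasks.append(SegmentTask("MSR", context, segs[i], segs[i + 1]))
    return tasks
```

A three-sentence document yields two ASR tasks and one MSR task, because the last sentence has no following segment to condition on.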
The training objective mixes the ASR and MSR rewards. The overall pipeline is:

```mermaid
flowchart LR
    Text[pre-training text] --> Seg[segment into s_i windows]
    Seg --> ASR[ASR prompt]
    Seg --> MSR[MSR prompt]
    ASR --> Policy[policy samples continuation]
    MSR --> Policy
    Policy --> GRM[generative reward model]
    Seg --> GRM
    GRM --> GRPO[on-policy GRPO update]
    GRPO --> Model[RLPT model]
```
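One way to write the mixed ASR/MSR objective is sketched below. The notation is an assumption of this note, not copied from the paper: $o$ is the policy's sampled output, $r(o, s_i)$ is the generative reward model's score against the reference segment $s_i$, and $\lambda$ is a mixing weight between the two task types.

```latex
\mathcal{J}_{\mathrm{RLPT}}(\theta)
  = \mathbb{E}_{\mathrm{ASR}}\big[\, r(o, s_i) \,\big]
  + \lambda \, \mathbb{E}_{\mathrm{MSR}}\big[\, r(o, s_i) \,\big]
```

Under this reading, $\lambda$ controls how much the infilling-style MSR task contributes relative to the left-to-right ASR task.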
Mechanism
The data pipeline aggregates web text from sources such as Wikipedia, arXiv, threaded conversation data, and QA-style data. The paper describes MinHash near-deduplication, PII masking, contamination removal against development and evaluation sets, rule-based filtering, and model-based quality filtering.
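The near-deduplication step can be illustrated with a toy MinHash estimator. This is a standard-library sketch under stated assumptions, not the paper's implementation: the word-shingle size, the number of hash functions, and the helper names are all chosen here for illustration, and a production pipeline would add locality-sensitive hashing on top to avoid comparing all pairs.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-word shingles (k=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(salt + s.encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for s in items
        ))
    return sig


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold itself is a tuning decision that the note's source does not specify.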
RLPT requires a cold-start SFT phase because base models need enough instruction-following ability to perform next-segment reasoning. The RL phase uses sentence-level segments by default, samples multiple outputs per prompt, and optimizes with on-policy GRPO, without KL regularization in the reported setup. The reward comes from a generative reward model that checks whether the predicted segment is semantically consistent with the reference. The paper reports that a strict semantic-equivalence reward was too brittle, so it adopts a relaxed prefix reward instead.
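The group-relative step at the core of GRPO can be sketched as follows. This is a minimal illustration, not the paper's training code: the binary rewards stand in for the generative reward model's verdicts, and the population-standard-deviation normalization and the `eps` constant are assumptions of this sketch.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled output's reward against its own sampling group.

    All outputs in `rewards` are assumed to come from the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for one ASR prompt: two judged consistent (1.0), two not (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal. That is one reason a reward that almost never fires, such as a too-strict equivalence check, can stall learning.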
Evidence And Results
- Evaluates Llama-3.2-3B-Base, Qwen3-4B-Base, and Qwen3-8B-Base after cold-start SFT plus RLPT.
- Reports gains on general-domain benchmarks including MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, KOR-Bench, and OlympiadBench.
- For Qwen3-4B-Base, the paper reports absolute gains after RLPT over cold-start on MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, and KOR-Bench.
- Reports math-reasoning gains on MATH-500, AMC23, Minerva Math, AIME24, and AIME25 under two Pass@k settings.
- Reports that RLPT improves the initialization for subsequent RLVR on Qwen3-4B-Base, with additional AIME24/AIME25 gains over RLVR alone in several Pass@k cells.
- Reports a power-law-like scaling trend with training tokens for benchmark performance, but this is a work-in-progress claim and should be treated as recipe-specific evidence.
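For readers comparing the Pass@k numbers above, a commonly used unbiased estimator is sketched here. Whether the paper computes Pass@k exactly this way is an assumption of this note; `n` is the number of samples drawn per problem and `c` the number judged correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which gives a probability of 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n = 4` samples and `c = 1` correct, `pass_at_k(4, 1, 1)` is 0.25 and `pass_at_k(4, 1, 4)` is 1.0. The same model can therefore look very different at small and large k, so gains should be compared cell by cell.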
Why It Matters
RLPT is useful for the wiki because it changes the boundary between pretraining and post-training. It treats pre-training data as a source of reward-bearing tasks, not only as next-token supervised targets. That makes it adjacent to RLVR, DFT, synthetic-data steering, and training-time scaling.
The time-series analogy is not “use text RLPT directly.” It is the broader contract: large unlabeled corpora can be converted into self-supervised trajectory tasks, but the reward definition decides what latent structure the model is pushed to discover. For numeric time series or event streams, an equivalent would need segment, state, or event-continuation rewards that preserve dense numeric detail, rare regimes, exogenous variables, and action histories.
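To make the analogy concrete, here is a hypothetical numeric counterpart of a segment-continuation task. Nothing here comes from the paper: the windowing scheme, the `make_windows` and `tolerance_reward` helpers, and the relative-tolerance rule are assumptions, chosen only to show how a reward definition decides how much numeric detail is preserved.

```python
from typing import List, Tuple


def make_windows(
    series: List[float], context_len: int, target_len: int
) -> List[Tuple[List[float], List[float]]]:
    """Slice a series into (context, target) pairs, stepping by target_len."""
    pairs = []
    last_start = len(series) - context_len - target_len
    for start in range(0, last_start + 1, target_len):
        ctx = series[start:start + context_len]
        tgt = series[start + context_len:start + context_len + target_len]
        pairs.append((ctx, tgt))
    return pairs


def tolerance_reward(
    pred: List[float], target: List[float], rel_tol: float = 0.05
) -> float:
    """Binary reward: 1.0 only if every predicted point is within rel_tol
    of the reference (hypothetical rule; a looser rule discards detail)."""
    if len(pred) != len(target):
        return 0.0
    ok = all(
        abs(p - t) <= rel_tol * max(abs(t), 1e-8) for p, t in zip(pred, target)
    )
    return 1.0 if ok else 0.0
```

A point-wise tolerance like this keeps dense numeric detail but ignores rare regimes, exogenous variables, and action histories. Those would need their own terms in any serious reward design.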
Limitations
- The paper is labeled work in progress and does not provide an official code release in the discovered artifacts.
- The method still depends on cold-start SFT, so the result is not pure RL from a raw base model.
- The reward is generated by a model, not by a deterministic verifier; semantic false positives/negatives and reward-model bias are central risks.
- The strict semantic-equivalence reward failed enough that the paper needed a relaxed prefix reward, showing that segment boundaries and information density matter.
- The evidence is LLM benchmark evidence, not time-series, robotics, observability, or action-conditioned world-model evidence.
- Data provenance and contamination controls are described at a high level; reproducibility depends on unreleased corpus and filtering details.
- Benchmark gains should not be conflated with those from ordinary RLVR, SFT, or pretraining scaling, because RLPT adds a distinct objective and a cold-start stage.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Data diversity, curriculum, and long tail | adjacent | Converts large unlabeled text corpora into segment-level RL tasks and reports scaling with training tokens. | Needs numeric time-series or event-stream analogs with leakage, rare-regime, and contamination controls. |
| Context interface | adjacent | ASR/MSR reward depends on context before and sometimes after the target segment. | Needs structured context, exogenous variables, system metadata, and action history rather than text-only segments. |
| Dynamic compute and reasoning | adjacent | Uses reasoning trajectories during training and reports downstream reasoning gains. | Needs evidence that extra reasoning preserves state dynamics rather than only benchmark answer patterns. |
| Control and counterfactuals | insufficient evidence | RL optimizes text-continuation reward, not action-conditioned transitions or interventions. | Needs explicit actions, control inputs, and counterfactual rollout evaluation. |
Links Into The Wiki
- LLM Post-Training
- Training Dynamics
- Reinforcement Learning Finetunes Small Subnetworks
- Synthetic Data for any Differentiable Target
- Dynamic Fine-Tuning
- TimeOmni-1
- Time-Series Benchmark Hygiene
Open Questions
- Does RLPT still improve when the reward model, segmenter, data filters, and contamination controls are independently reproduced?
- Can next-segment rewards avoid rewarding surface continuation style instead of useful latent reasoning?
- How should RLPT be compared against extra next-token pretraining at matched compute and data?
- Can a similar objective be built for multivariate time series, event streams, or action-conditioned trajectories without erasing dense numeric detail?
- Does RLPT preserve broad capabilities better than SFT or RLVR alone, and what is its parameter-update geometry?