Reinforcement Learning on Pre-Training Data

Source

Raw Markdown: paper_rlpt-2025.md
PDF: paper_rlpt-2025.pdf
Preprint: arXiv:2509.19249
Official X thread: Tencent Hunyuan announcement
Local X snapshot: papers/rlpt-2025/x-thread-tencent-hunyuan-1971118167281057821.json

Status And Credibility

This arXiv preprint was first submitted on 2025-09-23 and revised to v2 on 2025-09-25. The arXiv comments field reads “Work in progress.” The authors are primarily from Tencent’s LLM Department, the HunYuan Infra Team, and the Chinese University of Hong Kong. An official Tencent Hunyuan announcement on X was captured through the authenticated X API.

Treat the paper as current but not settled. It has an official lab announcement and a large Tencent/CUHK author team, but no accepted venue record, project page, or official code repository was found during ingest.

Core Claim

RLPT proposes to scale reinforcement learning directly on pre-training text rather than relying only on supervised next-token prediction, human preference labels, or verifiable task labels. The method turns unlabeled text into segment-level RL tasks: predict a target segment from context, then score whether the prediction semantically matches the true next or masked segment.

The paper defines two next-segment reasoning tasks:

Autoregressive Segment Reasoning (ASR): predict segment $s_i$ from preceding context $s_{<i}$.
Middle Segment Reasoning (MSR): predict segment $s_i$ from both preceding context $s_{<i}$ and following segment $s_{i+1}$.
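
The two task definitions can be sketched as a small data-construction step. This is a minimal illustration, not the paper's pipeline: the `SegmentTask` type, the `build_tasks` helper, joining segments with a space, and starting at the second segment so that the context is never empty are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SegmentTask:
    kind: str                 # "ASR" or "MSR"
    context: str              # preceding segments s_{<i}, joined into one string
    following: Optional[str]  # s_{i+1} for MSR, None for ASR
    target: str               # reference segment s_i, used only for scoring


def build_tasks(segments: List[str]) -> List[SegmentTask]:
    """Turn one document's ordered segments into ASR and MSR tasks."""
    tasks: List[SegmentTask] = []
    # Assumption: skip i = 0 so every task has a non-empty context.
    for i in range(1, len(segments)):
        context = " ".join(segments[:i])
        # ASR sees only the preceding context.
        tasks.append(SegmentTask("ASR", context, None, segments[i]))
        # MSR also sees the next segment, so it needs s_{i+1} to exist.
        if i + 1 < len(segments):
            tasks.append(SegmentTask("MSR", context, segments[i + 1], segments[i]))
    return tasks


segments = ["A.", "B.", "C.", "D."]
tasks = build_tasks(segments)
```

In this sketch the last segment yields only an ASR task, because there is no following segment to condition an MSR prompt on.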

The training objective is the paper’s mixed ASR/MSR reward:

$$J_{\text{RLPT}}(\theta) = \mathbb{E}_{\text{ASR}}\left[r(o, s_i)\right] + \lambda \, \mathbb{E}_{\text{MSR}}\left[r(o, s_i)\right]$$
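
As a worked example of the objective as written on this page, the two expectations can be estimated by averaging sampled rewards. The function name `mixed_objective`, the reward values, and the choice of `lam=0.5` are illustrative assumptions, not figures from the paper.

```python
from statistics import mean
from typing import Sequence


def mixed_objective(asr_rewards: Sequence[float],
                    msr_rewards: Sequence[float],
                    lam: float) -> float:
    """Monte Carlo estimate of E_ASR[r] + lam * E_MSR[r]."""
    return mean(asr_rewards) + lam * mean(msr_rewards)


# Made-up binary rewards, for illustration only.
asr = [1, 0, 1, 1]  # mean 0.75
msr = [0, 1, 0, 0]  # mean 0.25
value = mixed_objective(asr, msr, lam=0.5)  # 0.75 + 0.5 * 0.25 = 0.875
```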

flowchart LR
  Text[pre-training text] --> Seg[segment into s_i windows]
  Seg --> ASR[ASR prompt]
  Seg --> MSR[MSR prompt]
  ASR --> Policy[policy samples continuation]
  MSR --> Policy
  Policy --> GRM[generative reward model]
  Seg --> GRM
  GRM --> GRPO[on-policy GRPO update]
  GRPO --> Model[RLPT model]

Mechanism

The data pipeline aggregates web text from sources such as Wikipedia, arXiv, threaded conversation data, and QA-style data. The paper describes MinHash near-deduplication, PII masking, contamination removal against development and evaluation sets, rule-based filtering, and model-based quality filtering.
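
The MinHash near-deduplication step can be sketched with the standard library. This is a generic MinHash illustration under stated assumptions, not the paper's implementation: the word 3-gram shingles, 64 hash functions, and the use of salted BLAKE2b as the hash family are choices made for this example, and the paper's actual parameters are not given on this page.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Word k-grams of a lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash64(shingle: str, seed: int) -> int:
    """One member of a seeded 64-bit hash family (salted BLAKE2b)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash64(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
doc_c = "reinforcement learning optimizes a policy from scalar reward signals"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
sim_near = estimated_jaccard(sig_a, sig_b)  # high: the texts differ in one word
sim_far = estimated_jaccard(sig_a, sig_c)   # low: the texts share no shingles
```

A deduplication pass would then drop one document from any pair whose estimated similarity exceeds a chosen threshold. At corpus scale, signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.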

RLPT requires a cold-start SFT phase because base models need enough instruction-following ability to perform next-segment reasoning. The RL phase uses sentence-level segments by default, samples multiple outputs per prompt, and updates the policy with on-policy GRPO, without KL regularization in the reported setup. The reward comes from a generative reward model that checks whether the predicted segment is semantically consistent with the reference. The paper reports that a strict semantic-equivalence reward was too brittle, so it moves to a relaxed prefix reward.
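
The GRPO step turns the rewards of several outputs sampled for the same prompt into group-relative advantages. The sketch below shows only that normalization, under assumptions: the function name, the use of the population standard deviation, and the `eps` stabilizer are illustrative choices, and the rewards are made up rather than produced by a reward model.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each sampled output's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, with made-up binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every output gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

One consequence of this normalization is that prompts on which all samples succeed or all fail contribute zero advantage, so the reward definition directly controls how many segments yield a usable gradient.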

Evidence And Results

Evaluates Llama-3.2-3B-Base, Qwen3-4B-Base, and Qwen3-8B-Base after cold-start SFT plus RLPT.
Reports gains on general-domain benchmarks including MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, KOR-Bench, and OlympiadBench.
For Qwen3-4B-Base, the paper reports absolute gains after RLPT over cold-start on MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, and KOR-Bench.
Reports math-reasoning gains on MATH-500, AMC23, Minerva Math, AIME24, and AIME25 with Pass@1 and Pass@8.
Reports that RLPT improves the initialization for subsequent RLVR on Qwen3-4B-Base, with additional AIME24/AIME25 gains over RLVR alone in several Pass@$k$ cells.
Reports a power-law-like scaling trend with training tokens for benchmark performance, but this is a work-in-progress claim and should be treated as recipe-specific evidence.
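
For readers comparing the Pass@1 and Pass@8 numbers, the commonly used unbiased Pass@$k$ estimator can be sketched as follows. Whether the paper computes Pass@$k$ exactly this way is not stated on this page, so treat this as the standard estimator rather than the authors' evaluation code; the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up counts: 16 samples for one problem, 2 of them correct.
p1 = pass_at_k(n=16, c=2, k=1)  # 2 / 16 = 0.125
p8 = pass_at_k(n=16, c=2, k=8)  # 1 - 3003 / 12870 = 23 / 30
```

A benchmark score is then the mean of this per-problem estimate over all problems.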

Why It Matters

RLPT is useful for the wiki because it changes the boundary between pretraining and post-training. It treats pre-training data as a source of reward-bearing tasks, not only as next-token supervised targets. That makes it adjacent to RLVR, DFT, synthetic-data steering, and training-time scaling.

The time-series analogy is not “use text RLPT directly.” It is the broader contract: large unlabeled corpora can be converted into self-supervised trajectory tasks, but the reward definition decides what latent structure the model is pushed to discover. For numeric time series or event streams, an equivalent would need segment, state, or event-continuation rewards that preserve dense numeric detail, rare regimes, exogenous variables, and action histories.

Limitations

The paper is labeled work in progress and does not provide an official code release in the discovered artifacts.
The method still depends on cold-start SFT, so the result is not pure RL from a raw base model.
The reward is generated by a model, not by a deterministic verifier; semantic false positives/negatives and reward-model bias are central risks.
The strict semantic-equivalence reward failed enough that the paper needed a relaxed prefix reward, showing that segment boundaries and information density matter.
The evidence is LLM benchmark evidence, not time-series, robotics, observability, or action-conditioned world-model evidence.
Data provenance and contamination controls are described at a high level; reproducibility depends on unreleased corpus and filtering details.
Benchmark gains should not be collapsed with ordinary RLVR, SFT, or pretraining scaling because RLPT adds a distinct objective and cold-start stage.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Data diversity, curriculum, and long tail	adjacent	Converts large unlabeled text corpora into segment-level RL tasks and reports scaling with training tokens.	Needs numeric time-series or event-stream analogs with leakage, rare-regime, and contamination controls.
Context interface	adjacent	ASR/MSR reward depends on context before and sometimes after the target segment.	Needs structured context, exogenous variables, system metadata, and action history rather than text-only segments.
Dynamic compute and reasoning	adjacent	Uses reasoning trajectories during training and reports downstream reasoning gains.	Needs evidence that extra reasoning preserves state dynamics rather than only benchmark answer patterns.
Control and counterfactuals	insufficient evidence	RL optimizes text-continuation reward, not action-conditioned transitions or interventions.	Needs explicit actions, control inputs, and counterfactual rollout evaluation.

Links Into The Wiki

Open Questions

Does RLPT still improve when the reward model, segmenter, data filters, and contamination controls are independently reproduced?
Can next-segment rewards avoid rewarding surface continuation style instead of useful latent reasoning?
How should RLPT be compared against extra next-token pretraining at matched compute and data?
Can a similar objective be built for multivariate time series, event streams, or action-conditioned trajectories without erasing dense numeric detail?
Does RLPT preserve broad capabilities better than SFT or RLVR alone, and what is its parameter-update geometry?
