---
abstract: |
  The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (`\ours`{=latex}), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, `\ours`{=latex} enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, `\ours`{=latex} eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of `\ours`{=latex}. For example, when applied to Qwen3-4B-Base, `\ours`{=latex} yields absolute improvements of $3.0$, $5.1$, $8.1$, $6.0$, $6.6$, and $5.3$ on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, `\ours`{=latex} provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
author:
- |
  Siheng Li$^{1,3,}$[^1]  $^{,}$[^2]  , Kejiao Li$^{1,\dagger}$, Zenan Xu$^{1,\dagger}$, Guanhua Huang$^1$, Evander Yang$^1$, Kun Li$^{1,3,*}$,\
  Haoyuan Wu$^{1}$, Jiajia Wu$^{1}$, Zihao Zheng$^{1}$, Chenchen Zhang$^{1}$, Kun Shi$^{1}$, Kyrierl Deng$^{1}$, Qi Yi$^{1}$,\
  Ruibin Xiong$^{1}$, Tingqiang Xu$^{1,*}$, Yuhao Jiang$^{1}$, Jianfeng Yan$^{1}$, Yuyuan Zeng$^{1}$, Guanghui Xu$^{1}$,\
  Jinbao Xue$^{2}$, Zhijiang Xu$^{2}$, Zheng Fang$^{2}$, Shuai Li$^{2}$, Qibin Liu$^{2}$, Xiaoxue Li$^{2}$, Zhuoyu Li$^{2}$,\
  Yangyu Tao$^{2}$, Fei Gao$^{2}$, Cheng Jiang$^{2}$, Bo Chao Wang$^{2}$, Kai Liu$^{2}$, Jianchen Zhu$^{2}$,\
  Wai Lam$^{3}$, Bo Zhou$^{1,}$[^3], Di Wang$^{1}$\
  **$^1$LLM Department, Tencent** `\quad `{=latex}**$^2$HunYuan Infra Team**\
  **$^3$The Chinese University of Hong Kong**\
  `\Letter`{=latex} chaysezhou@tencent.com\
bibliography:
- iclr2025_conference.bib
title: Reinforcement Learning on Pre-Training Data
---

\newcommand{\ours}{RLPT}

<figure id="fig:rpt_scaling" data-latex-placement="h">
<img src="figures/rpt_scaling3.png" style="width:100.0%" />
<figcaption>Scaling law of RLPT performance on various benchmarks with respect to training tokens.</figcaption>
</figure>

# Introduction

Large language models (LLMs) have achieved remarkable success across diverse domains, including human-aligned conversational assistants [@bai2022training; @ouyang2022training] and autonomous AI agents [@team2025kimi]. A central driver of this progress has been the scaling of computational resources during training, realized through the simultaneous expansion of both data and model parameters. For instance, training corpora have grown from billions of tokens in BERT [@devlin2019bert] to trillions in Llama [@touvron2023llama; @grattafiori2024llama], while model sizes have scaled from millions of parameters in BERT [@devlin2019bert] to the trillion-parameter level in Kimi K2 [@team2025kimi]. However, parameter scaling requires increasingly demanding infrastructure and results in prohibitive inference costs, whereas data scaling is constrained by the scarcity of high-quality web corpora [@villalobos2024position; @muennighoff2023scaling; @ruan2025reasoning].

In this paper, we propose a new scaling paradigm `\ours`{=latex}[^4] to optimize LLMs through reinforcement learning (RL) on pre-training data. In contrast to prior scaling approaches that primarily rely on supervised learning, `\ours`{=latex} allocates training compute to enable the policy to autonomously explore meaningful reasoning trajectories to learn from pre-training data and improve its overall capabilities through reinforcement learning (RL). This paradigm offers two main advantages. First, it enables reasoning for learning: rather than directly learning token by token, the model generates intermediate reasoning content that can uncover the latent thought process underlying data construction, augment the original data, and support more data-efficient learning [@ruan2025reasoning]. Second, RL leverages self-explored trajectories for training, maintains proximity to the original policy distribution, and thereby fosters stronger generalization capabilities [@chu2025sft; @lai2025reinforcement; @shenfeld2025rl]. However, directly scaling RL also introduces new challenges, since existing frameworks such as reinforcement learning from human feedback (RLHF) [@bai2022training] and reinforcement learning with verifiable rewards (RLVR) [@guo2025deepseek] still rely heavily on human annotation, which constrains their scalability on pre-training data.

To address this challenge, we propose a novel next-segment reasoning objective that obtains meaningful self-supervised rewards from unlabeled internet data. Specifically, the model is first required to predict the subsequent segment of text, and the reward signal is then derived by evaluating the semantic consistency between the predicted segment and the ground-truth segment using a generative reward model. Based on different prediction segment configurations, we propose two tasks with distinct effects. The first requires the model to predict a complete subsequent sentence given the preceding context, which we term the Autoregressive Segment Reasoning (ASR) task. The second involves a context with masked tokens in the middle, where the model must leverage both the preceding and following context to infer a continuous span of masked tokens, which we designate as the Middle Segment Reasoning (MSR) task. During training, we interleave the ASR and MSR tasks to simultaneously optimize the model's autoregressive generation capability and its in-context understanding ability.

We evaluate `\ours`{=latex} across both general-domain and mathematical reasoning tasks using multiple models. Experimental results demonstrate that `\ours`{=latex} delivers consistent and substantial improvements in both settings. For example, when applied to Qwen3-4B-Base, `\ours`{=latex} achieves absolute gains of $3.0$, $5.1$, $8.1$, and $6.0$ on MMLU [@hendrycks2021measuringmmlu], MMLU-Pro [@wang2024mmlu], GPQA-Diamond [@rein2024gpqa], and KOR-Bench [@ma2024kor], respectively, together with improvements of $6.6$ and $5.3$ in Pass@$1$ on AIME24 and AIME25 [@aime]. Comparable gains are also observed on Llama3.2-3B-Base and Qwen3-8B-Base, with detailed results provided in Sec. `\ref{sec:main_results}`{=latex}. Beyond standalone performance, `\ours`{=latex} also strengthens the reasoning capability of LLMs. When serving as the foundation for RLVR, it yields additional improvements of $2.3$ and $1.3$ in Pass@$1$, and $3.7$ and $2.0$ in Pass@$8$, on AIME24 and AIME25 with Qwen3-4B-Base, respectively. We further analyze the scaling behavior of `\ours`{=latex}, showing that downstream performance empirically follows a scaling law with training compute (Fig. `\ref{fig:rpt_scaling}`{=latex}), highlighting its potential for continued progress with increased compute. In addition to quantitative results, qualitative analysis of reasoning trajectories reveals diverse reasoning strategies, providing insight into the effectiveness of `\ours`{=latex}. Finally, we distill practical design lessons from `\ours`{=latex} to inform future research in this direction.

Our contributions can be summarized in three aspects:

- We propose RLPT, a method that scales RL on pre-training data. To remove the reliance on human annotation, we design a next-segment reasoning objective, consisting of ASR and MSR tasks, which reward LLMs for correctly predicting the ground-truth next segment given the preceding context.

- Extensive experiments on general-domain and mathematical reasoning tasks across multiple models show that RLPT substantially improves performance and exhibits a favorable scaling trend, empirically establishing a scaling law in benchmark performance as compute increases, indicating strong potential for continued gains.

- Results further demonstrate that RLPT provides a strong foundation for subsequent RLVR, extending the reasoning boundaries of LLMs and boosting performance on mathematical reasoning benchmarks.

# Preliminary

In this section, we briefly review reinforcement learning (RL) and the supervised learning paradigm of next-token prediction in training large language models (LLMs), as these serve as the foundation of `\ours`{=latex}.

## Reinforcement Learning in Large Language Models

Reinforcement learning (RL) has become an essential component for improving LLMs. Formally, the policy $\pi_{\theta}$ is trained to maximize the expected reward $$\begin{equation}
    \mathcal{J}_{\text{RL}}(\theta) = \mathbb{E}_{q \sim \mathcal{D}_{q},\, o \sim \pi_{\theta}(\cdot \mid q)} [r(o)],
\label{eq:rl}
\end{equation}$$ where $q$ is a prompt sampled from the prompt set $\mathcal{D}_{q}$, $o$ is an output sampled from the policy $\pi_{\theta}$, and $r(o)$ denotes the reward assigned to output $o$. In reinforcement learning from human feedback (RLHF) [@ouyang2022training; @bai2022training], rewards are provided by a neural reward model trained on human-annotated preference pairs. More recently, reinforcement learning with verifiable rewards (RLVR) employs rule-based functions that compare model outputs against reference answers [@guo2025deepseek; @zeng2025simplerl]. Optimizing Eq. `\ref{eq:rl}`{=latex} encourages the model to reinforce behaviors associated with higher rewards while suppressing those linked to lower rewards. In practice, this objective is typically optimized using policy gradient algorithms such as PPO [@schulman2017proximal] and GRPO [@shao2024deepseekmath]. Despite their effectiveness, both RLHF and RLVR face scalability challenges due to their reliance on human annotations.

## Next-Token Prediction

Next-token prediction (NTP) is the fundamental training objective of modern LLMs. Formally, the model minimizes the length-normalized negative log-likelihood $$\begin{equation}
    \mathcal{J}_{\text{NTP}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{x}} \left[ -\frac{1}{|x|} \sum_{i=1}^{|x|} \log \pi_\theta(x_i \mid x_{<i}) \right],
\end{equation}$$ where $x$ is a token sequence sampled from the corpus $\mathcal{D}_{x}$ and $|x|$ is its length. Pre-training and post-training based on NTP constitute the mainstream optimization paradigm for LLMs, yielding remarkable success across diverse applications. Nevertheless, recent studies suggest that supervised fine-tuning (SFT) under the NTP paradigm often promotes surface-level memorization rather than fostering the deeper generalization capabilities achievable with RL [@chu2025sft; @lai2025reinforcement; @shenfeld2025rl].
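As a small illustration of the NTP objective above, the following sketch computes the length-normalized negative log-likelihood for one sequence and averages it over a batch. The function names `ntp_loss` and `ntp_objective` are hypothetical, and the per-token log-probabilities are assumed to be precomputed by the model.

```python
import math


def ntp_loss(token_logprobs):
    """Length-normalized negative log-likelihood of one sequence.

    token_logprobs[i] is assumed to hold log pi_theta(x_i | x_<i).
    """
    if not token_logprobs:
        raise ValueError("sequence must contain at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def ntp_objective(batch_logprobs):
    """Monte Carlo estimate of the objective: mean per-sequence loss."""
    return sum(ntp_loss(lp) for lp in batch_logprobs) / len(batch_logprobs)


# Toy example: three tokens, each predicted with probability 0.5.
example = [math.log(0.5)] * 3
print(ntp_loss(example))  # approximately log 2
```

Because the loss is normalized by $|x|$, short and long sequences contribute equally to the batch average in this sketch.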

# Reinforcement Learning on Pre-Training Data

To address the limitations of scalability and generalization, we propose Reinforcement Learning on Pre-Training data (`\ours`{=latex}). In this framework, next-segment reasoning serves as the reinforcement learning (RL) objective, where the subsequent segment in unlabeled text acts as the ground truth. This self-supervised objective removes the reliance on human annotation and enables RL to scale directly on large pre-training corpora. An overview of `\ours`{=latex} is shown in Fig. `\ref{fig:method_overview}`{=latex}.

<figure id="fig:method_overview" data-latex-placement="t">
<img src="figures/method_overview.png" style="width:100.0%" />
<figcaption>Overview of RLPT. Raw data from internet corpora is processed into training samples of the form <span class="math inline">(<em>s</em><sub> &lt; <em>i</em></sub>, <em>s</em><sub><em>i</em></sub>, <em>s</em><sub><em>i</em> + 1</sub>)</span>. During the reinforcement pre-training stage, the policy LLM predicts <span class="math inline"><em>ŝ</em><sub><em>i</em></sub></span> conditioned on <span class="math inline"><em>s</em><sub> &lt; <em>i</em></sub></span> (ASR) or on <span class="math inline">(<em>s</em><sub> &lt; <em>i</em></sub>, <em>s</em><sub><em>i</em> + 1</sub>)</span> (MSR). The prediction is then compared with <span class="math inline"><em>s</em><sub><em>i</em></sub></span> to compute the reward.</figcaption>
</figure>

## Data Preparation

We construct a corpus for `\ours`{=latex} by aggregating web text from diverse sources such as Wikipedia, arXiv, and threaded conversation data. To ensure data quality and compliance, we apply a multi-stage preprocessing pipeline consisting of: (i) MinHash-based near-deduplication, (ii) detection and masking of personally identifiable information (PII), and (iii) contamination removal with respect to all development and evaluation sets. Given the inherent noise in web corpora, we further implement a rigorous filtering procedure that integrates both rule-based and model-based methods. The rule-based stage eliminates content that is clearly unsuitable for language model training, whereas the model-based stage employs an instruction-tuned language model to perform more fine-grained quality assessments. Furthermore, we curated high-quality QA data from the annealing dataset [@team2025hunyuan] for mathematical reasoning tasks to enhance the model's reasoning ability.
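To make the near-deduplication step concrete, the sketch below shows one simple way MinHash can flag near-duplicate documents. It is an illustration only and not the pipeline used in this work: the shingle size, number of hash functions, threshold, and all function names are assumptions, and production systems typically add locality-sensitive hashing to avoid pairwise comparison.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def _hash(item, seed):
    # 64-bit hash keyed by `seed`, emulating a family of hash functions.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=64):
    """Signature made of the minimum hash value under each hash function."""
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag two texts whose estimated Jaccard similarity reaches `threshold`."""
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

The fraction of matching signature slots is an unbiased estimate of the Jaccard similarity between the two shingle sets, which is why a fixed threshold on it can serve as a near-duplicate test.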

## Next-Segment Reasoning

Given a text $t$ from the pre-training data, we divide it into a sequence of contiguous segments $t = [s_1, s_2, \ldots, s_n]$, where each $s_i$ corresponds to a semantically coherent unit such as a phrase, a complete sentence, or a reasoning step. We then construct a dataset $$\begin{equation}
\mathcal{D}_{s} = \{(s_{<i}, s_i, s_{i+1}) \mid i = 2, \dots, n-1\},
\end{equation}$$ where $s_{<i} = [s_1, s_2, \dots, s_{i-1}]$ denotes the context, $s_i$ is the target segment, and $s_{i+1}$ is its subsequent segment. Based on this formulation, we introduce two segment-level training objectives that capture richer semantics than token-level prediction. Inspired by next-token prediction (NTP), we propose Autoregressive Segment Reasoning (ASR), which trains the policy to predict $s_i$ from $s_{<i}$, aligning with the autoregressive generation process of modern LLMs. To further enable the model to leverage broader contextual information, we introduce Middle Segment Reasoning (MSR), which trains the policy to predict $s_i$ from both $s_{<i}$ and $s_{i+1}$. This resembles masked language modeling [@devlin2019bert; @liu2019roberta; @raffel2020exploring] and is particularly useful for tasks such as code completion. During training, we interleave ASR and MSR by designing different prompts and extracting the predicted segment between special tags in the output. The prompt for the ASR task is illustrated below.

\begin{prompt}[notitle]{-15pt}{-5pt}{}
    Complete the text provided under ### Context by predicting the next most probable sentence. 
    
    Please reason step by step to determine the best possible continuation, and then enclose your final answer within <|startofprediction|> and <|endofprediction|> tags.
    
    ### Context
    
    {context} 
\end{prompt}

Similarly, the prompt for the MSR task is presented as follows.

\begin{prompt}[notitle]{-15pt}{-5pt}{}
    ## Text Material ##:
    {prompt}
    
    <MASK>

    {next_step}
    
    ## Task ##: 
    Fill in the <MASK> section of the material with appropriate sentences or a solution step.
    
    Carefully reason step by step to determine the most suitable completion.  
    Finally, provide your best prediction for the <MASK> section.  
    Enclose your final answer for the <MASK> part within <|startofprediction|> and <|endofprediction|>.
\end{prompt}
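Both prompts ask the policy to wrap its final answer in `<|startofprediction|>` and `<|endofprediction|>` tags. A minimal sketch of how the predicted segment might be extracted from a sampled output is given below. The helper name `extract_prediction` and the choice to keep the last tagged span when several appear are assumptions, not details confirmed by the method description.

```python
import re

START_TAG = "<|startofprediction|>"
END_TAG = "<|endofprediction|>"
_PREDICTION = re.compile(
    re.escape(START_TAG) + r"(.*?)" + re.escape(END_TAG), re.DOTALL
)


def extract_prediction(output):
    """Return the text inside the prediction tags, or None if no tags exist.

    Assumption: if the output contains several tagged spans, the last one
    is treated as the final answer.
    """
    matches = _PREDICTION.findall(output)
    if not matches:
        return None
    return matches[-1].strip()


sample = "Let me think step by step... <|startofprediction|> The sky is blue. <|endofprediction|>"
print(extract_prediction(sample))  # The sky is blue.
```

Returning `None` for malformed outputs lets a downstream reward function treat missing tags as a failed prediction.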

The reward is defined as the semantic consistency between the predicted and reference segments, evaluated by a generative reward model $G_{rm}$. This model assesses whether the two segments convey equivalent content while allowing for linguistic variation. In practice, we find that directly comparing the predicted segment with the ground-truth next segment is overly strict, since the model may generate outputs that span multiple subsequent segments. To address this issue, we provide $G_{rm}$ with several subsequent segments as reference and instruct it to verify whether the predicted segment is a valid prefix of the reference content. The prompt for $G_{rm}$ is shown below.

\begin{prompt}[notitle]{-15pt}{-5pt}{}
    ## Task
    Given a Predicted sentence and a Reference paragraph, determine whether the Predicted text is a prefix (initial segment) of the Reference paragraph, and whether it expresses exactly the same semantic content as the corresponding prefix of the Reference. 
    The Predicted text does not need to match the prefix of the Reference word-for-word, but it must convey the same meaning.
    
    Reference:
    {reference}
    
    Predicted:
    {predicted}
    
    ## Scoring Rules
    
    If the Predicted text semantically matches the prefix of the Reference, assign a score of 1.
    If the Predicted text does not semantically match the prefix of the Reference, assign a score of 0.
    When making your judgment, focus primarily on semantic equivalence, not on exact wording.
    
    Only output the score on a single line; do not provide any explanatory text or additional content. 
    Output format (choose one):
    
    Score: 0  
    or  
    Score: 1
\end{prompt}

Given a predicted segment $\hat{s}_{i}$ extracted from the model output $o$, the reward is specified as $$\begin{equation}
r(o, s_{i}) =
\begin{cases}
1 & \text{if } G_{rm}(\hat{s}_{i}, s_{i}) = 1, \\
0 & \text{otherwise}.
\end{cases}
\end{equation}$$
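The binary reward above can be sketched as a thin wrapper around the generative reward model. In the code below, `judge` is a hypothetical callable standing in for $G_{rm}$: it takes the predicted and reference text and returns the raw judge output in the `Score: 0` / `Score: 1` format requested by the prompt. Treating unparsable outputs and missing predictions as reward $0$ is an assumption of this sketch.

```python
import re

_SCORE = re.compile(r"^\s*Score:\s*([01])\s*$", re.MULTILINE)


def parse_score(judge_output):
    """Parse 'Score: 0' or 'Score: 1' from the judge output, else None."""
    match = _SCORE.search(judge_output)
    return int(match.group(1)) if match else None


def reward(predicted, reference, judge):
    """Binary reward: 1 only if the judge explicitly returns a score of 1.

    `judge` is a hypothetical callable (predicted, reference) -> str.
    """
    if predicted is None:  # no tagged prediction could be extracted
        return 0
    return 1 if parse_score(judge(predicted, reference)) == 1 else 0


# Usage with a stub judge that always accepts the prediction.
print(reward("The sky is blue.", "The sky is blue. It is noon.", lambda p, r: "Score: 1"))
```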

The training objective of `\ours`{=latex} is defined as $$\begin{equation}
\begin{aligned}
    \mathcal{J}_{\text{RLPT}}(\theta) 
    &= \underbrace{\mathbb{E}_{(s_{<i},s_{i}) \sim \mathcal{D}_{s}, \, o \sim \pi_{\theta}(\cdot \mid s_{< i})} [r(o,s_{i})]}_{\text{ASR}} \\
    &\quad + \lambda \, \underbrace{\mathbb{E}_{(s_{<i},s_{i}, s_{i+1}) \sim \mathcal{D}_{s}, \, o \sim \pi_{\theta}(\cdot \mid s_{< i}, s_{i+1})} [r(o,s_{i})]}_{\text{MSR}},
\end{aligned}
\end{equation}$$ where $\lambda \in (0, 1)$ is a hyperparameter that balances the contributions of ASR and MSR terms, and may be adjusted depending on the requirements of specific downstream applications.
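As a worked illustration of how $\lambda$ weights the two terms, the sketch below forms a Monte Carlo estimate of the objective from sampled binary rewards. The function name and the default $\lambda = 0.5$ are illustrative assumptions; the value of $\lambda$ used in training is not specified here.

```python
def next_segment_objective(asr_rewards, msr_rewards, lam=0.5):
    """Monte Carlo estimate: mean ASR reward + lam * mean MSR reward.

    `lam` is an illustrative value in (0, 1), not a reported setting.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    asr_term = sum(asr_rewards) / len(asr_rewards)
    msr_term = sum(msr_rewards) / len(msr_rewards)
    return asr_term + lam * msr_term


# Half of the ASR and half of the MSR predictions are judged correct.
print(next_segment_objective([1, 0, 1, 0], [1, 1, 0, 0], lam=0.5))  # 0.75
```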

## Training Details

#### Cold-Start.

`\ours`{=latex} can be applied to a base model after next-token pre-training, but it requires a minimum level of instruction-following ability to initiate next-segment reasoning. To satisfy this requirement, we introduce a cold-start phase consisting of supervised fine-tuning on instruction-following data.

#### Next-Segment Reasoning.

In this work, we define a segment unit as a sentence by default. We also conducted preliminary studies with alternative segmentation units, such as employing LLMs to extract integrated atomic steps from text, but these approaches did not yield clear improvements over sentence-level segmentation. Therefore, we adopt sentence segmentation as the default setting in our experiments and leave the exploration of other strategies for future work. For sentence segmentation, we use the NLTK toolkit [@bird2006nltk], filtering out sentences that are too short. Each remaining sentence is then treated as a target for RL under the next-segment reasoning objective.
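The construction of training samples from sentence segments can be sketched as follows. To keep the example self-contained, a naive regular-expression splitter stands in for the NLTK sentence tokenizer, and `min_chars` is a hypothetical threshold for "too short". Keeping short sentences in the context while skipping them as targets is also an assumption of this sketch.

```python
import re


def split_sentences(text):
    """Naive stand-in for NLTK sentence tokenization (illustration only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_samples(text, min_chars=10):
    """Build (context, target, next_segment) triples from one document.

    With 1-based segments s_1..s_n, targets are s_i for i = 2..n-1, so every
    target has a non-empty context and a following segment.
    """
    segments = split_sentences(text)
    samples = []
    for i in range(1, len(segments) - 1):  # 0-based index of the target
        target = segments[i]
        if len(target) < min_chars:  # skip targets that are too short
            continue
        samples.append((segments[:i], target, segments[i + 1]))
    return samples


doc = "Alpha is first. Beta follows alpha. Gamma comes third. Delta ends it."
for context, target, nxt in build_samples(doc):
    print(context, "->", target, "|", nxt)
```

Each triple serves both tasks: ASR conditions on the context alone, while MSR additionally conditions on the following segment.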

# Experiments

## Experimental Setup

Experiments are conducted on Llama3 models [@grattafiori2024llama] and Qwen3 models [@yang2025qwen3]. In the cold-start supervised fine-tuning (SFT) stage, we use a batch size of $1024$, a learning rate of $2 \times 10^{-5}$ with a cosine scheduler, and train for $3$ epochs. For next-segment reasoning, we adopt a batch size of $512$, a maximum response length of $8192$, and a constant learning rate of $1 \times 10^{-6}$. For each prompt, we sample $8$ outputs with a temperature of $1.0$, and optimization is performed using on-policy GRPO [@shao2024deepseekmath] without KL regularization. In the mathematical reasoning domain, we further conduct RLVR experiments, evaluating its performance when built on `\ours`{=latex}, with RLVR configured using the same hyperparameters as the next-segment reasoning task.
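For readers unfamiliar with GRPO, the sketch below illustrates only its group-relative advantage step for the $8$ outputs sampled per prompt. It omits the policy-gradient update itself, and the use of the population standard deviation and the value of `eps` are assumptions rather than reported settings.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled outputs.

    Each advantage is (r - group mean) / (group std + eps). The population
    standard deviation and eps are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary rewards for 8 sampled outputs of a single prompt.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
print(group_relative_advantages(rewards))
```

When all outputs in a group receive the same reward, every advantage is zero and the prompt contributes no gradient signal, which is one reason the difficulty of next-segment targets matters in practice.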

## Evaluation Metric

We evaluate model performance on both general-domain and mathematical reasoning tasks. For the general domain, we use benchmarks including MMLU [@hendrycks2021measuringmmlu], MMLU-Pro [@wang2024mmlu], GPQA-Diamond [@rein2024gpqa], SuperGPQA [@du2025supergpqa], KOR-Bench [@ma2024kor], and OlympiadBench [@he2024olympiadbench], reporting accuracy as the evaluation metric. For mathematical reasoning, we evaluate on MATH-500 [@hendrycks2021measuring], AMC23 [@amc], Minerva Math [@lewkowycz2022solving], and AIME [@aime], using the Pass@$k$ metric, which measures the probability that at least one correct solution appears among $k$ independent attempts. We adopt the unbiased estimator of Pass@$k$ [@chen2021evaluating]: $$\begin{equation}
\text{Pass@}k = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}} \left[1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right],
\end{equation}$$ where $n$ is the number of sampled responses per prompt and $c$ is the number of correct responses. We sample $n=64$ responses with temperature $0.6$ and top-$p$ $0.95$, and report Pass@$1$ and Pass@$8$. The maximum generation length is set to $32{,}768$ tokens. Correctness in mathematical reasoning is evaluated using Math-Verify[^5].
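The estimator above can be computed directly from per-problem counts of correct responses. The following sketch assumes one integer $c$ per problem out of $n$ samples; the function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With $n = 64$, a problem solved in $16$ of the samples has Pass@$1$ equal to $16/64 = 0.25$, while Pass@$8$ for the same problem is considerably higher because any of the $8$ draws may succeed.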

## Experimental Results {#sec:main_results}

\renewcommand{\arraystretch}{1.0}
\resizebox{0.95\linewidth}{!}{%
    \begin{tabular}{lcccccc}
      \toprule
      \textbf{Training} 
        & \textbf{MMLU} & \textbf{MMLU-Pro} & \textbf{GPQA-Diamond}  & \textbf{SuperGPQA} & \textbf{KOR-Bench} & \textbf{OlympiadBench} \\
      \midrule
      \midrule
      \textit{Llama-3.2-3B-Base} \\
      \midrule
      \midrule
      Base                  & $4.2$            & $21.3$            & $3.5$            & $7.7$            & $3.1$             & $1.7$  \\
      \midrule
      \quad $+$ Cold-Start  & $59.4$            & $34.7$            & $16.7$            & $15.8$            & $39.1$            & $14.4$  \\
      \quad $+$ \ours       & $59.4$ & $\mathbf{36.2}$     & $\mathbf{28.3}$   & $\mathbf{19.2}$   & $\mathbf{39.4}$   & $\mathbf{15.9}$\\
      \midrule
      \midrule
      \textit{Qwen3-4B-Base} \\
      \midrule
      \midrule
      Base                  & $30.6$            & $16.0$            & $17.7$            & $25.4$            & $3.7$             & $35.0$  \\
      \midrule
      \quad $+$ Cold-Start  & $77.8$            & $59.7$            & $31.3$            & $32.3$            & $50.7$            & $51.7$  \\
      \quad $+$ \ours       & $\mathbf{80.8}$ & $\mathbf{64.8}$     & $\mathbf{39.4}$   & $\mathbf{34.3}$   & $\mathbf{56.7}$   & $\mathbf{52.7}$\\
      \midrule
      \midrule
      \textit{Qwen3-8B-Base} \\
      \midrule
      \midrule
      Base                  & $58.9$            & $47.0$            & $27.8$            & $28.5$            & $40.6$             & $38.8$  \\
      \midrule
      \quad $+$ Cold-Start  & $81.6$            & $64.9$            & $45.5$            & $37.8$            & $55.1$            & $57.6$  \\
      \quad $+$ \ours       & $\mathbf{83.0}$ & $\mathbf{68.3}$     & $\mathbf{47.5}$   & $\mathbf{40.1}$   & $\mathbf{55.7}$   & $\mathbf{59.7}$\\
      \bottomrule
    \end{tabular}%
  }
\renewcommand{\arraystretch}{1.1}
\resizebox{1.0\linewidth}{!}{%
    \begin{tabular}{lcccccccccc}
      \toprule
      \textbf{Training} 
        & \multicolumn{5}{c}{\textbf{Pass@$1$}} 
        & \multicolumn{5}{c}{\textbf{Pass@$8$}} \\
      \cmidrule(lr){2-6} \cmidrule(lr){7-11}
      & \textbf{MATH} & \textbf{AMC23}  & \textbf{Minerva} & \textbf{AIME24} & \textbf{AIME25}
      & \textbf{MATH} & \textbf{AMC23}  & \textbf{Minerva} & \textbf{AIME24} & \textbf{AIME25} \\
      \midrule
      Base         & $39.8$        & $24.7$        & $17.7$        & $7.3$        & $4.5$    
                   & $79.9$        & $65.6$        & $41.0$        & $24.3$       & $21.6$        \\
      \midrule
      \quad $+$ Cold-Start & $83.6$        & $65.9$        & $38.2$        & $20.6$        & $21.9$    
                           & $95.0$        & $91.8$        & $54.1$        & $40.3$        & $39.5$        \\
      \quad $+$ \ours  & $87.4$        & $77.1$        & $40.1$        & $27.2$        & $27.2$    
                       & $95.3$        & $92.1$        & $54.8$        & $45.3$        & $40.9$        \\
      \midrule
      \quad $+$ RLVR   & $89.1$        & $76.3$        & $41.6$        & $27.6$        & $27.7$    
                       & $96.7$        & $\mathbf{94.3}$        & $55.8$        & $49.8$        & $41.6$ \\
      \quad $+$ \ours $+$ RLVR   
                       & $\mathbf{90.6}$ & $\mathbf{79.7}$ & $\mathbf{42.1}$ & $\mathbf{29.9}$ & $\mathbf{29.0}$    
                       & $\mathbf{96.8}$ & $93.5$ & $\mathbf{56.8}$ & $\mathbf{53.5}$ & $\mathbf{43.6}$        \\
      \bottomrule
    \end{tabular}%
  }

#### General Domain.

The performance on general-domain tasks is summarized in Tab. `\ref{tab:general_results}`{=latex}. Relative to the cold-start model, `\ours`{=latex} improves every benchmark on every model, with the single exception of MMLU on Llama-3.2-3B-Base, where the two are tied at $59.4$. In particular, when applied to Qwen3-4B-Base, it achieves absolute improvements of $3.0$, $5.1$, $8.1$, $2.0$, and $6.0$ on MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, and KOR-Bench, respectively. With Qwen3-8B-Base, the improvements are $1.4$, $3.4$, $2.0$, $2.3$, and $2.1$ on MMLU, MMLU-Pro, GPQA-Diamond, SuperGPQA, and OlympiadBench, respectively. Furthermore, results on Llama-3.2-3B-Base confirm the generalizability of `\ours`{=latex} across different model families, with absolute improvements of $1.5$, $11.6$, and $3.4$ on MMLU-Pro, GPQA-Diamond, and SuperGPQA, respectively. Since these benchmarks span diverse domains including STEM, law, economics, and health, the results demonstrate that `\ours`{=latex} effectively leverages the extensive knowledge contained in large-scale pre-training corpora.

#### Mathematical Reasoning.

As shown in Tab. `\ref{tab:math_results}`{=latex}, `\ours`{=latex} yields substantial gains in mathematical reasoning, improving performance on both Pass@$1$ and Pass@$8$. On the challenging AIME24 and AIME25 benchmarks, `\ours`{=latex} achieves absolute improvements over the cold-start model of $6.6$ and $5.3$ on Pass@$1$, and $5.0$ and $1.4$ on Pass@$8$, respectively. These improvements indicate that `\ours`{=latex} is effective in extending the reasoning boundary, thereby providing a strong basis for subsequent RLVR training. Indeed, when `\ours`{=latex} is used as the initialization for RLVR, it further boosts performance over RLVR alone, with absolute gains of $2.3$ (AIME24) and $1.3$ (AIME25) on Pass@$1$, and $3.7$ (AIME24) and $2.0$ (AIME25) on Pass@$8$. This indicates that `\ours`{=latex} enhances both exploitation (Pass@$1$) and exploration (Pass@$8$), which are typically considered competing objectives.

<figure id="fig:rpt_rlvr_scaling" data-latex-placement="h">
<img src="figures/rpt_rlvr_scaling2.png" style="width:100.0%" />
<figcaption>Comparative scaling properties of RLVR and RLPT <span class="math inline">+</span> RLVR.</figcaption>
</figure>

## Analysis

#### Scaling Properties.

As shown in Fig. `\ref{fig:rpt_scaling}`{=latex}, the performance of `\ours`{=latex} on various benchmarks follows a power-law decay with respect to the number of training tokens, suggesting potential for further gains through scaling compute. We also report the scaling trend when `\ours`{=latex} serves as the foundation for RLVR in Fig. `\ref{fig:rpt_rlvr_scaling}`{=latex}. In this setting, `\ours`{=latex} provides a strong initialization, yielding consistent improvements throughout training. Notably, both Pass@$1$ and Pass@$8$ improve, indicating that the gains from `\ours`{=latex} do not come at the expense of exploration capability, which remains a common challenge for current RLVR methods [@yue2025does; @walder2025pass; @chen2025pass; @wu2025invisible].

#### Thinking Patterns.

\scalebox{1.0}{
\begin{tabular}{p{0.95\linewidth}}
\toprule
\textbf{Thinking Process of \ours} \\
\midrule
Alright, let's tackle this step by step. I'm going to predict the next most probable sentence after the given context... \newline
\textcolor{myred}{Understanding the Context:} the Fourier transform of $\chi_{[-a,a]}$ is defined, the integral is restricted to $[-a,a]$, the exponential is split into cosine and sine parts... The cosine integral gives $\tfrac{2 \sin(a \xi)}{\xi}$, while the sine integral gives $0$... \newline
\textcolor{myred}{Identifying the Next Step:} the next move would be to combine these results... Since the sine term vanishes, the final answer should come only from the cosine contribution... \newline
\textcolor{myblue}{Verifying the Flow:} after evaluating both parts, the solution normally concludes by combining them into the closed-form Fourier transform... \newline
\textcolor{myblue}{Considering Alternative Path:} could the solution instead highlight the behavior at $\xi=0$ or present a piecewise definition... \textcolor{myblue}{Alternatively,} the solution might emphasize this special case, yet it is usually handled by noting the limit, so the general form is sufficient... \newline
\textcolor{myblue}{Looking back,} one might also check the special case $\xi=0$, but the limit already gives $2a$, so the concise expression is consistent... \newline
\textcolor{myred}{So the natural continuation is:} \texttt{<|startofprediction|>}Thus, combining these results, the Fourier transform of $\chi_{[-a,a]}$ is given by $F(\xi) = \tfrac{2 \sin(a \xi)}{\xi}$. \texttt{<|endofprediction|>} \\
\bottomrule
\end{tabular}}

To demonstrate the benefits of `\ours`{=latex}, we provide in Tab. `\ref{tab:thinking_pattern}`{=latex} an illustrative example of its reasoning process. In this case, the model approaches the next-segment reasoning task through a structured sequence: it first abstracts the preceding context to capture the overarching flow, then determines the subsequent step, formulates a candidate continuation, verifies its plausibility, explores alternative possibilities, performs backtracking when appropriate, and ultimately produces the final answer. This structured trajectory aligns with the multi-step reasoning strategies exhibited by LLMs in complex problem-solving [@guo2025deepseek; @jaech2024openai], which helps explain the effectiveness of `\ours`{=latex}.

#### Reward Modeling.

![Comparison between Strict Reward and Prefix Reward: (a) Training Reward, (b) Response Length, (c) Validation Performance (Pass@$1$).](figures/reward_modeling.png){#fig:reward_modeling width="100%"}

In developing `\ours`{=latex}, we iteratively refined our reward modeling approach after encountering several challenges with our initial formulation. Our initial approach adopted a strict reward that required the predicted segment to convey exactly the same semantic content as the ground-truth segment. This constraint proved too rigid, leading to numerous false negatives, that is, reasonable predictions that received zero reward. We observed that the model often generated outputs that encompassed multiple ground-truth segments, largely due to the uneven distribution of information across sentence-based segmentation: some sentences contained only a single formula, while others captured the complete solution to a subproblem. Such discrepancies disrupted the training process and yielded only limited improvements in downstream performance, as illustrated in Fig. `\ref{fig:reward_modeling}`{=latex}. To address this issue, we introduce a relaxed prefix reward, which assigns a score of $1$ as long as the predicted segment forms a valid prefix of the ground-truth completion. This adjustment accommodates segments with varying information content and provides a more stable training signal. It also enables the model to generate longer responses, which in turn results in improved performance on downstream mathematical reasoning tasks, as shown in Fig. `\ref{fig:reward_modeling}`{=latex}.
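The difference between the two rewards can be illustrated with a toy sketch. The comparison described above is semantic, whereas the code below uses exact token matching purely as a stand-in so that the example is self-contained. The function names and the whitespace tokenization are assumptions made for illustration only.

```python
def tokens(text):
    """Lowercase whitespace tokenization; a crude proxy for semantic content."""
    return text.lower().split()


def strict_reward(prediction, next_segment):
    """1 only if the prediction matches the single next segment."""
    return 1.0 if tokens(prediction) == tokens(next_segment) else 0.0


def prefix_reward(prediction, following_segments):
    """1 if the prediction is a non-empty prefix of the ground-truth completion.

    following_segments holds the next segment and the ones after it, so a
    prediction that spans several segments can still be rewarded.
    """
    pred = tokens(prediction)
    completion = tokens(" ".join(following_segments))
    return 1.0 if pred and completion[: len(pred)] == pred else 0.0
```

A prediction that covers the next two sentences scores $0$ under `strict_reward` but $1$ under `prefix_reward`, which mirrors the false-negative case that motivated the relaxed formulation.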

# Related Work

#### Scaling Paradigms.

The progress of language models has been fundamentally driven by scaling compute, which can be broadly divided into training-time scaling and test-time scaling. Training-time scaling primarily relies on next-token prediction, increasing computational cost by enlarging model size or expanding pre-training data to reduce prediction loss [@radford2019language; @brown2020language; @kaplan2020scaling; @hoffmann2022training]. In contrast, test-time scaling allocates more compute during inference by generating extended chains of reasoning before producing the final answer [@brown2024large; @jaech2024openai; @muennighoff2025s1; @guo2025deepseek]. `\ours`{=latex} belongs to the training-time scaling paradigm but differs from prior approaches that emphasize supervised learning. Instead, it employs reinforcement learning (RL), allocating compute for the model to self-explore and learn from large-scale pre-training corpora. RL provides two notable advantages. First, it enables the model to uncover the latent reasoning underlying data, which can be regarded as a compressed form of deliberative thinking reflected in scientific papers or textbooks [@ruan2025reasoning]. Second, recent research suggests that RL supports better generalization compared with supervised learning [@chu2025sft; @lai2025reinforcement; @shenfeld2025rl]. The most relevant approaches are RPT [@dong2025reinforcement] and Quiet-STaR [@zelikman2024quietstar], both of which apply RL on unlabeled data for training-time scaling. However, `\ours`{=latex} differs by focusing on next-segment prediction rather than next-token prediction.

#### Reinforcement Learning in LLMs.

RL has become a central paradigm for LLMs. Early applications mainly focused on aligning model outputs with human values [@bai2022training; @ouyang2022training; @mu2024rule], typically through reward models trained on human-annotated preference pairs. More recently, RL has been used to strengthen reasoning abilities by leveraging rule-based reward functions that evaluate outputs against reference answers [@guo2025deepseek; @zhu2025surprising]. Despite these advances, both directions ultimately depend on human-provided or verifiable supervision, which limits scalability. In contrast, `\ours`{=latex} introduces the next-segment reasoning objective, where the subsequent segment in natural text serves as the reference. This design removes the need for human annotation and enables RL to scale effectively on large-scale pre-training data.

# Conclusion

This work introduces `\ours`{=latex}, a new training-time scaling paradigm that applies reinforcement learning to pre-training data. At its core, `\ours`{=latex} adopts a self-supervised next-segment reasoning objective, which removes the need for human annotations and enables RL training on large unlabeled corpora. Extensive experiments demonstrate the effectiveness of `\ours`{=latex}, yielding substantial gains in both general-domain and mathematical reasoning tasks. Moreover, the performance exhibits favorable scaling properties with respect to training compute, suggesting strong potential for further gains.

\bibliographystyle{iclr2025_conference}
\appendix

[^1]:  Work completed during an internship at Tencent.

[^2]:  The first three authors contributed equally to this work.

[^3]:  Project Lead.

[^4]: RLPT stands for Reinforcement Learning on Pre-Training data.

[^5]: <https://github.com/huggingface/Math-Verify>
