LLM Post-Training

Summary

LLM post-training is the wiki’s umbrella for supervised fine-tuning, RL-style optimization, preference optimization, and black-box population search after pretraining. The useful comparison axis is not only benchmark score; it is how much the method moves weights, where those changes land, and which prior capabilities survive.

What The Wiki Currently Believes

Dynamic Fine-Tuning reframes SFT as a policy-gradient-like update with an implicit sparse reward and inverse-probability weighting, then removes the low-probability amplification by scaling each token’s loss by the detached (stop-gradient) target-token probability.
Reinforcement Learning Finetunes Small Subnetworks adds direct update-geometry evidence: in the paper’s tested LLM post-training settings, RL changes a sparse but broadly distributed and nearly full-rank subset of parameters, while SFT changes parameters much more densely.
ExpRL adds the coverage-building mid-training piece: use reference solutions as hidden reward scaffolds for dense on-policy RL rewards instead of cloning them as SFT or distillation targets.
RLPT moves RL earlier in the training stack by turning pre-training text into next-segment reasoning tasks. It is useful current evidence for training-time RL scaling, but it remains a work-in-progress preprint, and no official code release has been found.
Fast-Slow Training adds a second adaptation channel to RLVR: prompts and context are optimized as fast weights while parameters are updated as slow weights, reducing slow-weight drift and preserving later learnability.
DiffusionBlocks is not an LLM post-training paper, but it adds an important future-work hook: converting pretrained large models into independently trainable blocks through fine-tuning. That could become a different adaptation interface from LoRA, prompt/context fast weights, or full-parameter updates.
DMax adds a diffusion-LM self-correction recipe: full-parameter fine-tune a pretrained masked dLLM on its own sampled noisy states, then use soft intermediate embeddings so aggressive parallel decoding remains revisable.
Flow Reasoning Models adds a flow-LM preference recipe: mine the model’s confident wrong completions, contrast them against gold under a shared corrupted state, and restrict FlowDPO to the cells where the wrong completion differs from gold. Its verifier is label free at test time, but the preference mask and training pairs are not.
iLLaDA adds a from-scratch diffusion-LM post-training recipe: keep the masked diffusion objective during SFT, train over a 25B-token instruction corpus for 12 epochs, and use variable-length generation plus confidence-based multiple-choice scoring. Its instruct model still lags Qwen2.5 7B Instruct, so diffusion-LM alignment remains open.
The Flexibility Trap adds diffusion-LM RL evidence: use a left-to-right surrogate policy during GRPO so rollouts confront uncertain reasoning forks and likelihood ratios are exact, then restore parallel diffusion decoding at inference. The result is an ICML 2026 Outstanding Paper, but it is LLaDA-Instruct math/code evidence rather than a general dLLM alignment recipe.
Latent Thought Flow adds a latent-reasoning objective that is RL-adjacent but not ordinary PPO/GRPO: a continuous GFlowNet trains a sampler so hidden trajectories are proportional to an answer-quality and compute-cost reward.
Learning is Forgetting adds the representation view: pretraining learns broad lossy compression for next-token prediction, while post-training can change which information survives by injecting preference information.
LLMs as Noisy Channels adds the scaling-law warning: SFT can act as a perturbation that creates U-shaped loss basins, so post-training should report base-distribution loss and retention, not only target-task score.
FADE adds a selective-forgetting mechanism: per-parameter decay can adapt online instead of applying one fixed forgetting rate everywhere. It is adjacent evidence for controlled parameter forgetting, not an LLM post-training result.
Evolution Strategies at Scale argues that black-box full-parameter ES can optimize LLM behavior through response-level rewards.
Evolution Strategies at the Hyperscale makes the ES path more plausible at scale through low-rank perturbation systems work.
Evolutionary Strategies lead to Catastrophic Forgetting in LLMs warns that ES can match new-task reward while causing dense, high-norm parameter drift and worse retention.
Synthetic Data for any Differentiable Target adds a data-side steering primitive: optimize a synthetic generator with metagradient rewards so ordinary SFT on the generated examples moves a target model toward a chosen differentiable metric or even hidden weight pattern.
The Universal Weight Subspace Hypothesis adds a weight-space subspace view of adapter reuse and model merging, but should be read with the mean-adapter baseline caveat tracked in the source page.
Exploration: Fine-Tuning With Parameter Decomposition adds a decomposition-basis editing primitive. After an expensive VPD pass, Goodfire reports that tuning one scalar on a German-related rank-1 parameter subcomponent can suppress German prediction in a 67M language model, using fewer final-task tokens and causing less French/Spanish collateral damage than LoRA baselines. The result still exposes Italian damage and decomposition-cost caveats.
CWM adds a code-domain staged-training case: world-model mid-training on execution and agent-environment trajectories creates a better starting point for SFT and multi-turn verifiable RL. This is post-training evidence for computational environments, not direct time-series latent-state or telemetry-control evidence.
TimeOmni-1 is the time-series reasoning example of a staged SFT-then-RL curriculum: SFT injects domain reasoning priors, then RL rewards push beyond imitation.
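
The DFT entry in the list above describes scaling each token loss by the detached target-token probability. A minimal numeric sketch of what that does to the gradient, assuming a single softmax over a toy 4-token vocabulary: it relies on the standard identity that the cross-entropy gradient with respect to the logits is p minus the one-hot target, and treats the detached probability as a constant multiplier. The function names and toy logits are hypothetical and are not taken from the DFT paper’s code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_logit_grad(logits, target):
    """Gradient of -log p[target] with respect to the logits: p - onehot."""
    p = softmax(logits)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

def dft_logit_grad(logits, target):
    """Gradient of -stopgrad(p[target]) * log p[target].

    The detached probability is a constant weight, so the gradient is the
    SFT gradient scaled by p[target].
    """
    p_target = softmax(logits)[target]
    return [p_target * g for g in sft_logit_grad(logits, target)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# Toy cases: the expert token (index 0) is likely in one, unlikely in the other.
confident = [4.0, 0.0, 0.0, 0.0]
surprised = [-4.0, 0.0, 0.0, 0.0]

for name, logits in [("confident", confident), ("surprised", surprised)]:
    sft = l2_norm(sft_logit_grad(logits, 0))
    dft = l2_norm(dft_logit_grad(logits, 0))
    print(f"{name}: |SFT grad|={sft:.4f}  |DFT grad|={dft:.4f}")
```

In this toy setting the plain SFT gradient is largest on the low-probability target, while the rescaled gradient shrinks it. That is the same property behind the caveat that conservative scaling may under-learn low-probability targets that carry genuinely new knowledge.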

Weight-Update Lens

The post-training cluster should be evaluated through update geometry:

SFT gives token-level gradients on demonstrations and can overfit exact references.
DFT changes SFT’s gradient scale by downweighting low-confidence expert tokens, aiming for more stable and less outlier-dominated updates.
PPO/GRPO/RLVR-style RL uses sampled trajectories and explicit rewards, often with KL or reference constraints to control drift.
GFlowNet latent-reasoning objectives such as LTF train a distribution over hidden trajectories with flow-balance constraints and reward-proportional sampling. They should be separated from PPO/GRPO-style reward maximization because the goal is preserving a posterior over useful trajectories, not only pushing probability toward one high-reward path.
ExpRL-style RL priming keeps the actor on-policy but uses hidden reference solutions to create dense outcome or process rewards, targeting coverage over productive reasoning paths before sparse-reward RL begins.
RL-induced update sparsity may occur even without explicit sparsity regularization when training data is close to the policy distribution; the useful diagnostic is sparse/full-rank update geometry across layers, not only update norm.
RLPT-style training-time RL converts unlabeled pre-training text into reward-bearing segment prediction tasks; reward-model design and segment boundaries become part of the training objective.
FST keeps RLVR as a slow-weight update but co-optimizes context as fast weights, so task-specific lessons can live in editable textual state instead of being forced entirely into parameters.
FADE-style adaptive decay changes the memory horizon of parameters, making forgetting a controlled update variable rather than only a side effect.
ES searches directly in parameter space with scalar rewards, making credit assignment simple but risking broad dense updates.
DPG-style synthetic-data optimization keeps the outward update as ordinary SFT, but uses higher-order gradients through a target training trajectory to generate examples whose training effect targets a differentiable metric.
Shared subspace or mean-adapter methods constrain task updates to a small set of learned directions or coefficients, but must be compared against simple group-mean and rank-1 baselines before calling the directions task-specific.
Decomposition-basis edits first pay for a mechanistic weight decomposition, then tune masks or scalar prefactors on interpreted parameter subcomponents. They should be compared against LoRA, sparse full-parameter updates, and ordinary SFT with decomposition cost and off-target retention included.
Block-wise denoising updates would move only selected blocks or adapters under local denoising objectives, if pretrained conversion becomes practical.
DMax-style on-policy self-correction fine-tunes on the model’s own sampled noisy states so inference-time errors become train-time inputs; the update target is not human preference or task imitation, but recoverability under aggressive parallel decoding.
FlowDPO-style localized preference training names a self-mined wrong completion as the loser and applies the contrast only where it differs from gold. This is stronger supervision than a label-free verifier and should be evaluated separately from the test-time stability score.
iLLaDA-style diffusion SFT keeps the same random masking objective over concatenated instruction text, so prompt, response, and EOS tokens can all be masked. Its key open question is whether diffusion-specific RL or preference training can close instruct-model gaps without losing bidirectional generation advantages.
JustGRPO-style sequential diffusion RL defines an exact autoregressive policy over a bidirectional dLLM by masking the future suffix and reading the next-position logits. This makes rollout order, proposal coverage, and inference order separate design variables.
SNR-aware SFT analysis treats SFT learning rate or update pressure as a perturbation. The relevant check is whether target-task improvement also creates capacity collapse on base-language, reasoning, or domain-retention probes.
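
Several entries above treat sparsity and rank as separate properties of a weight update. A minimal sketch of how the two can be measured independently on a toy weight matrix, assuming nested Python lists as matrices: the tolerances, helper names, and example matrices are illustrative assumptions, not the measurement protocol of any cited paper, and the elimination-based rank is only suitable for tiny matrices.

```python
def weight_delta(before, after):
    """Element-wise difference between two same-shaped weight matrices."""
    return [[a - b for b, a in zip(rb, ra)] for rb, ra in zip(before, after)]

def update_sparsity(delta, tol=1e-5):
    """Fraction of entries whose absolute change is at most tol."""
    flat = [x for row in delta for x in row]
    return sum(1 for x in flat if abs(x) <= tol) / len(flat)

def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (toy sizes only)."""
    rows = [list(r) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank

# Two hypothetical updates to a 4x4 weight matrix that starts at zero.
base = [[0.0] * 4 for _ in range(4)]
sparse_after = [[0.1 * (i + 1) if i == j else 0.0 for j in range(4)] for i in range(4)]
u, v = [1.0, 2.0, 3.0, 4.0], [0.1, 0.2, 0.3, 0.4]
dense_after = [[ui * vj for vj in v] for ui in u]  # outer product, rank 1

for name, after in [("sparse update", sparse_after), ("dense update", dense_after)]:
    d = weight_delta(base, after)
    print(name, "sparsity:", update_sparsity(d), "rank:", numerical_rank(d))
```

The first update touches only a quarter of the entries yet is full rank; the second touches every entry yet is rank 1. Because both numbers depend on the chosen tolerance, a report should state the tolerance alongside the result.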

The ES catastrophic-forgetting source makes this lens concrete: new-task reward can improve while prior capabilities degrade. DFT adds the complementary lesson that conservative update scaling can improve reasoning generalization but may fail when low-probability targets contain genuinely new knowledge. RL-induced subnetwork evidence adds a sharper measurement target: when RL appears to preserve prior capability better than SFT, check whether it is also moving fewer parameters, whether those updates are full-rank, and whether the data is near the policy distribution. ExpRL adds an exploration-coverage target: when sparse rewards are too rare, references may be more useful as verifier scaffolds for on-policy dense rewards than as trajectories to imitate. RLPT adds a data/objective path: instead of waiting for human or verifiable labels, pre-training text can be reformulated into RL tasks, but then the reward model and segmentation scheme become the main failure surface. FST adds a third adaptation interface so not every task-specific improvement must become a parameter update. Learning-is-Forgetting adds the representation target: post-training changes what information survives the compression. FADE adds the memory-horizon target: not every parameter should forget at the same rate.
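
To make the dense-update risk of parameter-space search concrete, here is a toy sketch of a vanilla evolution-strategies step with a mean-reward baseline. Everything in it is an assumption for illustration: the 8-parameter vector, the reward that depends only on the first two parameters, and the hyperparameters. It does not reproduce any cited paper’s setup; it only shows that a finite-population ES estimate can move coordinates the reward does not depend on.

```python
import random

def toy_reward(theta):
    """Reward depends only on the first two parameters (optimum at 1, -1)."""
    return -((theta[0] - 1.0) ** 2) - ((theta[1] + 1.0) ** 2)

def es_step(theta, reward_fn, rng, sigma=0.1, lr=0.05, population=64):
    """One vanilla ES update: perturb all parameters, weight noise by reward."""
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    rewards = [
        reward_fn([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    baseline = sum(rewards) / population  # mean-reward baseline
    grad = [
        sum((r - baseline) * eps[i] for r, eps in zip(rewards, noises))
        / (population * sigma)
        for i in range(len(theta))
    ]
    return [t + lr * g for t, g in zip(theta, grad)]

def run_es(steps=200, dim=8, seed=0):
    """Run ES from an all-zero parameter vector with a fixed seed."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for _ in range(steps):
        theta = es_step(theta, toy_reward, rng)
    return theta

theta = run_es()
moved = sum(1 for t in theta if abs(t) > 1e-9)
print("reward before:", toy_reward([0.0] * 8), "after:", toy_reward(theta))
print("coordinates moved:", moved, "of", len(theta))
print("drift on reward-irrelevant coordinates:", [round(t, 4) for t in theta[2:]])
```

The reward improves, but the six coordinates that the reward ignores also drift, because every perturbation touches every parameter and the finite-sample estimate is noisy. Whether this toy effect explains the drift reported for LLM-scale ES is an open question, which is why retention probes and drift norms belong next to the reward curve.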

Using FADE-like ideas at LLM scale would require layerwise retention tests and optimizer-state or momentum accounting. The FADE paper’s neural-network evidence covers final-layer online learning, not full-model LLM adaptation.

Implications For Time-Series And World Models

For time-series reasoning models, SFT can inject decomposition priors, formatting, and domain procedures, while RL can reward verifiable temporal reasoning or intervention decisions. The DFT/ES/FST contrast suggests agents should track not just task score, but parameter drift, retention of base-model skills, and whether updates preserve the model’s numeric and temporal priors.

FST is especially relevant when a model must adapt to changing domains, sensors, event streams, or operator policies. The time-series analogue is to separate fast editable context from slow learned dynamics: recent domain instructions, schema changes, incident playbooks, or task-specific constraints may be better stored in context than consolidated immediately into weights.

The durable design pattern is that a model’s preferred input interface can drift as the model trains. Post-training systems should consider co-adapting the prompt, tool context, or task wrapper with the weights, especially when the alternative is writing brittle task-specific details into persistent parameters.

The compression view sharpens the failure mode. If pretraining compresses around next-token prediction, post-training for time-series reasoning and operations should not only add task behavior. It should change the information-preservation target toward numeric state, action consequences, safe abstention, and human preference constraints without erasing broad priors.

Company-Local Block-Wise Fine-Tuning adds a deployment-oriented adaptation question. For enterprise data, the best update geometry may not be only “which optimizer moves weights least.” It may also be “which update can happen where the data lives, and what signal is allowed to leave.” DiffusionBlocks is only a starting mechanism here; gradient leakage and tenant-specific overfitting remain open risks.

Relation To Foundation TSFM Agenda

LLM post-training is adjacent to the Foundation Time-Series Model Research Agenda through reasoning, intervention-decision rewards, and fast editable context. It can improve task wrappers and temporal reasoning behavior, but it should not be counted as progress on latent-state maintenance or action-conditioned dynamics unless the post-training target directly evaluates those interfaces.

Gotchas

“RL-like” does not mean the same mechanism. DFT gives an RL interpretation of an SFT gradient; PPO/GRPO sample trajectories under explicit rewards; ES perturbs parameters and uses scalar fitness.
Sparse updates are not automatically low-rank updates. The RL subnetwork paper reports sparse but nearly full-rank deltas spread across layers.
Update sparsity depends on numerical tolerance, precision, training duration, and data-policy distance; it should be reported as a measurement protocol, not a slogan.
RLPT-style rewards are only as good as their segmenter and reward model. Relaxed prefix rewards can stabilize training, but they can also hide semantic false positives.
ExpRL-style dense rewards are only as good as the problem-matched reference, judge, rubric, and prefix slicing. Wrong references or too-weak judges can turn the reward into noise.
Reward-only success is incomplete without retention tests.
Smaller updates are not automatically better: a method can preserve priors by refusing to learn rare but important new facts.
Information-survival changes from post-training can be family- and recipe-dependent. Learning-is-Forgetting reports consistent preference-information increases for Llama, but a mixed Gemma pattern except for Gemma 3.
Fast context is not free. FST-style methods need prompt optimization, prompt-population management, and safeguards against brittle or bloated task-specific instruction lists.
Benchmark gains should be reported alongside adaptation mode: full-parameter SFT, LoRA, DFT, PPO/GRPO, DPO/RFT, ES, or staged mixtures.
Subspace-adapter gains should report mean-only, basis-only, and full-basis ablations; otherwise a claimed shared task subspace may only be a reusable average update.
Component-basis edits should report decomposition compute, component-selection procedure, autointerp label uncertainty, neighboring-domain retention, and whether a simpler localized LoRA or sparse full-parameter update gives the same target/off-target trade-off.
Benign-looking synthetic SFT examples can be a hidden control channel into model weights; surface data-quality audits are not enough when the data generator is optimized through a differentiable target.
Block-wise fine-tuning is not privacy-preserving by default. Gradients, low-rank deltas, and adapter updates can leak training data unless the protocol includes an explicit threat model and leakage tests.
DMax-style self-correction is not preference learning: self-distilled noisy states can improve recoverability under parallel decoding while preserving the base model’s blind spots unless audited with independent tasks and retention tests.
FRM’s test-time verifier being label free does not make FlowDPO unsupervised: gold completions and a task checker define the negative pairs and wrong-cell mask.
iLLaDA-style repeated diffusion SFT is not a solved alignment recipe: the paper reports large SFT gains but still leaves a substantial gap to Qwen2.5 7B Instruct and explicitly leaves RL alignment to future work.
JustGRPO’s controlled table is stronger than its heterogeneous system comparison, but exact per-position likelihoods remain expensive, and preserving accuracy at a given tokens-per-step setting is not the same as measured serving latency.
GFlowNet-based latent-reasoning objectives still need retention, interpretability, no-latent, and wall-clock latency checks; shorter hidden trajectories are not automatically cheaper in a deployed serving stack.
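
The mean-only ablation mentioned for subspace adapters can be approximated with a simple energy check. The sketch below assumes each task’s adapter update has been flattened into a vector, and it uses the identity that the total squared norm of the updates equals the energy of the group mean plus the energy of the centered residuals. The function name and the toy vectors are hypothetical.

```python
def mean_explained_fraction(deltas):
    """Fraction of total squared update norm carried by the group-mean update.

    Uses sum_k ||d_k||^2 = n * ||mean||^2 + sum_k ||d_k - mean||^2,
    so the returned value lies in [0, 1].
    """
    n = len(deltas)
    dim = len(deltas[0])
    mean = [sum(d[i] for d in deltas) / n for i in range(dim)]
    total = sum(x * x for d in deltas for x in d)
    if total == 0.0:
        return 0.0  # no update at all, nothing to explain
    return n * sum(m * m for m in mean) / total

# Hypothetical flattened adapter updates for two tasks each.
mostly_shared = [[1.0, 1.0, 0.1], [1.0, 1.0, -0.1]]   # tasks differ only slightly
task_specific = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]   # tasks cancel on average

print("mostly shared:", mean_explained_fraction(mostly_shared))
print("task specific:", mean_explained_fraction(task_specific))
```

When this fraction is close to 1, a learned shared basis has little left to capture beyond the average update, so a mean-only baseline is the fair comparison before any directions are described as task-specific.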

Open Questions

Which post-training methods have the best target-gain-to-parameter-drift ratio?
Can sparse RL subnetworks be identified early enough to reduce training cost, and do they remain stable under longer training or harder out-of-distribution tasks?
Does RLPT preserve broad capabilities better than extra next-token pretraining, SFT, or RLVR at matched compute?
Can ExpRL-style priming improve exploration coverage without causing hidden retention loss, and what does its update geometry look like relative to SFT, self-distillation, and sparse-reward GRPO?
Which post-training methods change the information that survives compression, and can that be measured directly?
Can adaptive decay or other controlled-forgetting mechanisms preserve stable priors while clearing stale task mappings?
Can DFT-like reward rectification, FST-style fast context, RL KL constraints, and ES low-rank perturbations be composed without fighting each other?
Which adapter-update geometries transfer task-specific information rather than only a shared mean behavior?
Can parameter decomposition expose reusable edit handles, or does each target behavior require expensive component search and post-hoc label auditing?
Can metagradient-optimized synthetic data be used safely for post-training without creating clean-label backdoors or metric-overfitting artifacts?
Can pretrained LLMs be converted into block-wise trainable systems in a way that improves private or on-premise adaptation rather than only reducing training memory?
Which retention tests should be mandatory for time-series reasoning and world-model post-training?
Can localized preference updates target constraint-violating regions of a numeric trajectory without leaking future targets or suppressing rare but valid regimes?
Can sequentially exposing high-entropy decision points during TSFM or world-model RL improve proposal coverage while retaining parallel rollout generation between those points?
Can SNR-aware scaling fits identify safe post-training intensity before catastrophic overtraining or quantization-induced degradation appears?

Alex Open Research Wiki
