LLM Post-Training

Summary

LLM post-training is the wiki’s umbrella for supervised fine-tuning, RL-style optimization, preference optimization, and black-box population search after pretraining. Benchmark score is not the only useful comparison axis: methods also differ in how much they move weights, where those changes land, and which prior capabilities survive.

What The Wiki Currently Believes

  • Dynamic Fine-Tuning reframes SFT as a policy-gradient-like update with an implicit sparse reward and inverse-probability weighting, then removes the low-probability amplification by scaling each token loss with detached target-token probability.
  • Fast-Slow Training adds a second adaptation channel to RLVR: prompts and context are optimized as fast weights while parameters are updated as slow weights, reducing slow-weight drift and preserving later learnability.
  • DiffusionBlocks is not an LLM post-training paper, but it adds an important future-work hook: converting pretrained large models into independently trainable blocks through fine-tuning. That could become a different adaptation interface from LoRA, prompt/context fast weights, or full-parameter updates.
  • Learning is Forgetting adds the representation view: pretraining learns broad lossy compression for next-token prediction, while post-training can change which information survives by injecting preference information.
  • FADE adds a selective-forgetting mechanism: per-parameter decay can adapt online instead of applying one fixed forgetting rate everywhere. It is adjacent evidence for controlled parameter forgetting, not an LLM post-training result.
  • Evolution Strategies at Scale argues that black-box full-parameter ES can optimize LLM behavior through response-level rewards.
  • Evolution Strategies at the Hyperscale makes the ES path more plausible at scale through low-rank perturbation systems work.
  • Evolutionary Strategies lead to Catastrophic Forgetting in LLMs warns that ES can match new-task reward while causing dense, high-norm parameter drift and worse retention.
  • CWM adds a code-domain staged-training case: world-model mid-training on execution and agent-environment trajectories creates a better starting point for SFT and multi-turn verifiable RL. This is post-training evidence for computational environments, not direct time-series latent-state or telemetry-control evidence.
  • TimeOmni-1 is the time-series reasoning example of a staged SFT-then-RL curriculum: SFT injects domain reasoning priors, then RL rewards push beyond imitation.
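
The Dynamic Fine-Tuning bullet above can be made concrete with a per-token sketch. The code below is a minimal illustration, assuming a single token position with raw logits; the function names are hypothetical and plain Python stands in for an autograd framework, so the "detached" weight is simply treated as a constant when the gradient is written out by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_token_loss_and_grad(logits, target):
    """Cross-entropy for one token: loss = -log p[target].

    The gradient with respect to the logits is (p - onehot(target)).
    """
    p = softmax(logits)
    loss = -math.log(p[target])
    grad = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    return loss, grad

def dft_token_loss_and_grad(logits, target):
    """DFT-style variant: scale the token loss by p[target], held constant.

    Because the weight is detached, the gradient is just the SFT gradient
    multiplied by p[target]; low-probability targets get small updates.
    """
    weight = softmax(logits)[target]  # treated as a constant (no gradient)
    sft_loss, sft_grad = sft_token_loss_and_grad(logits, target)
    return weight * sft_loss, [weight * g for g in sft_grad]

if __name__ == "__main__":
    logits = [4.0, 0.0, 0.0]  # the model strongly prefers token 0
    target = 1                # the demonstration says token 1
    _, g_sft = sft_token_loss_and_grad(logits, target)
    _, g_dft = dft_token_loss_and_grad(logits, target)
    print("SFT grad on target logit:", g_sft[target])
    print("DFT grad on target logit:", g_dft[target])
```

In this toy case the SFT gradient on the unlikely target is large, while the DFT-style gradient is shrunk by the target probability. The same scaling that damps outlier tokens would also damp a rare token that carries genuinely new knowledge.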

Weight-Update Lens

The post-training cluster should be evaluated through update geometry, meaning how large each update is, which parameters it touches, and what signal drives it:

  • SFT gives token-level gradients on demonstrations and can overfit exact references.
  • DFT changes SFT’s gradient scale by downweighting low-confidence expert tokens, aiming for more stable and less outlier-dominated updates.
  • PPO/GRPO/RLVR-style RL uses sampled trajectories and explicit rewards, often with KL or reference constraints to control drift.
  • FST keeps RLVR as a slow-weight update but co-optimizes context as fast weights, so task-specific lessons can live in editable textual state instead of being forced entirely into parameters.
  • FADE-style adaptive decay changes the memory horizon of parameters, making forgetting a controlled update variable rather than only a side effect.
  • ES searches directly in parameter space with scalar rewards, making credit assignment simple but risking broad dense updates.
  • Block-wise denoising updates would move only selected blocks or adapters under local denoising objectives, if pretrained conversion becomes practical.
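
The ES bullet above can be illustrated with a toy step. This is a sketch of a generic baseline-subtracted ES estimator, not the method of any of the cited papers; `toy_reward`, `es_step`, and all hyperparameter values are assumptions chosen for illustration. The reward depends only on the first coordinate, yet the update touches every coordinate.

```python
import random

def toy_reward(theta):
    """Hypothetical scalar reward: only coordinate 0 matters."""
    return -(theta[0] - 1.0) ** 2

def es_step(theta, fitness, rng, sigma=0.1, lr=0.05, population=32):
    """One ES update: perturb all parameters, score each candidate with a
    scalar fitness, and move along the fitness-weighted average noise."""
    n = len(theta)
    noises = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(population)]
    scores = [
        fitness([t + sigma * e for t, e in zip(theta, eps)]) for eps in noises
    ]
    baseline = sum(scores) / population  # subtract the mean to reduce variance
    grad = [
        sum((s - baseline) * eps[i] for s, eps in zip(scores, noises))
        / (population * sigma)
        for i in range(n)
    ]
    return [t + lr * g for t, g in zip(theta, grad)]

if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0] * 6
    for _ in range(200):
        theta = es_step(theta, toy_reward, rng)
    print("reward:", toy_reward(theta))
    print("drift on reward-irrelevant coordinates:",
          [round(x, 3) for x in theta[1:]])
```

Credit assignment here is a single scalar per candidate, which is what makes the method simple. The cost is that noise moves coordinates the reward never looks at, a small-scale picture of the dense-update risk. It is a toy and says nothing quantitative about LLM-scale behavior.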

The ES catastrophic-forgetting source makes this lens concrete: new-task reward can improve while prior capabilities degrade. DFT adds the complementary lesson that conservative update scaling can improve reasoning generalization but may fail when low-probability targets contain genuinely new knowledge. FST adds a third path: change the adaptation interface so not every task-specific improvement must become a parameter update. Learning-is-Forgetting adds the representation target: post-training changes what information survives the compression. FADE adds the memory-horizon target: not every parameter should forget at the same rate.

Using FADE-like ideas at LLM scale would require layerwise retention tests and optimizer-state or momentum accounting. The FADE paper’s neural-network evidence is limited to final-layer online learning, not full-model LLM adaptation.

Implications For Time-Series And World Models

For time-series reasoning models, SFT can inject decomposition priors, formatting, and domain procedures, while RL can reward verifiable temporal reasoning or intervention decisions. The DFT/ES/FST contrast suggests agents should track not just task score but also parameter drift, retention of base-model skills, and whether updates preserve the model’s numeric and temporal priors.
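
A minimal sketch of what such tracking could look like, assuming parameters are available as flat lists of floats and that task and retention scores come from separate evaluations. The function names, the tolerance, and the report fields are hypothetical, not a standard from any of the sources.

```python
import math

def l2_drift(base, tuned):
    """Euclidean distance between base and post-trained parameters."""
    return math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))

def fraction_moved(base, tuned, tol=1e-6):
    """Share of parameters that changed by more than tol (update density)."""
    moved = sum(1 for b, t in zip(base, tuned) if abs(t - b) > tol)
    return moved / len(base)

def post_training_report(base, tuned, task_before, task_after,
                         retained_before, retained_after):
    """Bundle target gain, retention change, and drift into one record."""
    drift = l2_drift(base, tuned)
    gain = task_after - task_before
    return {
        "task_gain": gain,
        "retention_delta": retained_after - retained_before,
        "l2_drift": drift,
        "fraction_moved": fraction_moved(base, tuned),
        # Undefined when nothing moved, so report None instead of dividing.
        "gain_per_unit_drift": gain / drift if drift > 0 else None,
    }

if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    tuned = [0.3, 0.0, 0.0, 0.4]
    report = post_training_report(base, tuned,
                                  task_before=0.50, task_after=0.70,
                                  retained_before=0.80, retained_after=0.75)
    for key, value in report.items():
        print(key, value)
```

A real model would need per-layer breakdowns and a retention suite that covers numeric and temporal skills; the point of the record is only that task gain is reported next to drift and retention rather than alone.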

FST is especially relevant when a model must adapt to changing domains, sensors, event streams, or operator policies. The time-series analogue is to separate fast editable context from slow learned dynamics: recent domain instructions, schema changes, incident playbooks, or task-specific constraints may be better stored in context than consolidated immediately into weights.
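
The separation can be sketched as a small state container. This is not the FST algorithm, which optimizes prompts rather than appending notes; it only illustrates keeping editable context apart from learned parameters. The class name, fields, and the context cap are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class FastSlowState:
    """Hypothetical container: slow numeric weights plus fast textual context."""
    weights: list                       # slow channel: learned parameters
    context: list = field(default_factory=list)  # fast channel: editable notes
    max_context_items: int = 8          # cap to avoid bloated instruction lists

    def fast_update(self, note):
        """Record a task-specific lesson in context; weights are untouched."""
        if note in self.context:
            return
        self.context.append(note)
        overflow = len(self.context) - self.max_context_items
        if overflow > 0:
            del self.context[:overflow]  # drop the oldest notes first

    def slow_update(self, grad, lr=0.01):
        """Consolidate into parameters with a plain gradient step."""
        self.weights = [w - lr * g for w, g in zip(self.weights, grad)]

    def render_prompt(self, task):
        """Prepend the current fast context to the task text."""
        return "\n".join(self.context + [task])

if __name__ == "__main__":
    state = FastSlowState(weights=[1.0, 2.0], max_context_items=2)
    state.fast_update("Sensor B now reports in kPa, not psi.")
    state.fast_update("Escalate if pressure rises for 3 consecutive readings.")
    state.fast_update("Schema v2: field 'ts' renamed to 'timestamp'.")
    print(state.render_prompt("Summarize the last hour of telemetry."))
    print("weights after fast updates:", state.weights)
```

A schema change or new playbook lands in `context` and can be edited or removed later, while `weights` stays fixed until a deliberate slow update. The cap is a crude stand-in for the prompt-management safeguards a real system would need.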

The durable design pattern is that a model’s preferred input interface can drift as the model trains. Post-training systems should consider co-adapting the prompt, tool context, or task wrapper with the weights, especially when the alternative is writing brittle task-specific details into persistent parameters.

The compression view sharpens the failure mode. If pretraining compresses around next-token prediction, post-training for time-series reasoning and operations should not only add task behavior. It should change the information-preservation target toward numeric state, action consequences, safe abstention, and human preference constraints without erasing broad priors.

Company-Local Block-Wise Fine-Tuning adds a deployment-oriented adaptation question. For enterprise data, the best update geometry may not be only “which optimizer moves weights least.” It may also be “which update can happen where the data lives, and what signal is allowed to leave.” DiffusionBlocks is only a starting mechanism here; gradient leakage and tenant-specific overfitting remain open risks.

Relation To Foundation TSFM Agenda

LLM post-training is adjacent to the Foundation Time-Series Model Research Agenda through reasoning, intervention-decision rewards, and fast editable context. It can improve task wrappers and temporal reasoning behavior, but it should not be counted as progress on latent-state maintenance or action-conditioned dynamics unless the post-training target directly evaluates those interfaces.

Gotchas

  • “RL-like” does not mean the same mechanism. DFT gives an RL interpretation of an SFT gradient; PPO/GRPO sample trajectories under explicit rewards; ES perturbs parameters and uses scalar fitness.
  • Reward-only success is incomplete without retention tests.
  • Smaller updates are not automatically better: a method can preserve priors by refusing to learn rare but important new facts.
  • Information-survival changes from post-training can be family- and recipe-dependent. Learning-is-Forgetting reports consistent preference-information increases for Llama, but a mixed pattern for Gemma, with Gemma 3 as the exception.
  • Fast context is not free. FST-style methods need prompt optimization, prompt-population management, and safeguards against brittle or bloated task-specific instruction lists.
  • Benchmark gains should be reported alongside adaptation mode: full-parameter SFT, LoRA, DFT, PPO/GRPO, DPO/RFT, ES, or staged mixtures.
  • Block-wise fine-tuning is not privacy-preserving by default. Gradients, low-rank deltas, and adapter updates can leak training data unless the protocol includes an explicit threat model and leakage tests.

Open Questions

  • Which post-training methods have the best target-gain-to-parameter-drift ratio?
  • Which post-training methods change the information that survives compression, and can that be measured directly?
  • Can adaptive decay or other controlled-forgetting mechanisms preserve stable priors while clearing stale task mappings?
  • Can DFT-like reward rectification, FST-style fast context, RL KL constraints, and ES low-rank perturbations be composed without fighting each other?
  • Can pretrained LLMs be converted into block-wise trainable systems in a way that improves private or on-premise adaptation rather than only reducing training memory?
  • Which retention tests should be mandatory for time-series reasoning and world-model post-training?