Contradictions And Open Tensions

Dataset License Notes Need Pinned Artifacts

MicroSS has conflicting upstream license signals: the repository LICENSE file is GPL-2.0, while the README license section says Apache 2.0. LEMMA-RCA has a similar mismatch: the website and README license section say CC BY-ND 4.0, while Hugging Face metadata and one README paragraph say CC BY-NC 4.0. NeoRL-2 has an artifact-level mismatch: the GitHub README says datasets are CC BY 4.0 and code is Apache 2.0, while Hugging Face frontmatter marks the dataset repository as apache-2.0; the Hugging Face configs also list Salespromotion and Simglucose-high beyond the seven paper/GitHub tasks. Tennessee Eastman Process Simulation Data has weaker but relevant access/licensing uncertainty: the direct Dataverse landing page could not be verified from the ingest environment, while downstream references sometimes describe the artifact as CC0/Public Domain. The wiki should avoid giving reuse advice for these datasets until a pinned release artifact, exact config set, or maintainer confirmation resolves the license terms.

VLWM Reported Corpus Versus Released Artifacts

VLWM reports a training corpus that it intends to open-source, but as of 2026-07-15 its official Hugging Face dataset repository contains no payload and its public model manifest lists no visible weight file. The paper also has two internal corpus tensions: the component rows sum to 2,237.3k trajectories and 10,905.8k steps while the Overall row reports 2,179.6k and 10,604.3k, and the introduction names Ego4D while the implementation section says Ego4D was excluded and the table omits it. Those values should remain paper-reported, not silently reconciled.
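The size of the two corpus gaps follows directly from the totals quoted above. The following is a minimal arithmetic sketch; the variable and function names are illustrative, and only the four totals come from the paper as reported on this page.

```python
# Paper-reported totals in thousands (k), as quoted in the paragraph above.
component_sum = {"trajectories_k": 2237.3, "steps_k": 10905.8}
overall_row = {"trajectories_k": 2179.6, "steps_k": 10604.3}

def gap_report(components, overall):
    """Absolute and relative gap between the summed component rows and the Overall row."""
    report = {}
    for key in overall:
        gap = round(components[key] - overall[key], 1)
        report[key] = {
            "gap_k": gap,
            "gap_pct_of_overall": round(100 * gap / overall[key], 2),
        }
    return report

vlwm_gaps = gap_report(component_sum, overall_row)
print(vlwm_gaps)
```

The gaps come to about 57.7k trajectories and 301.5k steps, a little under 3% of the Overall row in each case. The sketch only quantifies the discrepancy; it does not explain it.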

Action100M is a later official dataset whose paper says it extends and improves the VLWM annotation pipeline, but it does not resolve the missing release. Action100M is HowTo100M-only and exposes segment action/actor/summary annotations; VLWM describes six video sources plus NaturalReasoning and goal/interpretation/action/state-change targets. The wiki must not describe Action100M as the exact released VLWM corpus unless an upstream source later provides an explicit mapping or compatible examples.

Action100M also has a count-versus-field-coverage tension. The paper headlines 147,092,653 annotated segments, reports that 64% of segments are 0-3 seconds, and says nodes shorter than four seconds are discarded from final GPT-OSS aggregation. The official one-video sample retains those short tree nodes with gpt: null. Until the complete corpus publishes explicit missingness statistics, the 147M figure should be described as a hierarchical segment/node count rather than a count of examples with all five structured GPT fields.
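A rough upper bound on fully annotated nodes can be sketched from the two paper-reported numbers. This assumes that every segment in the 0-3 second bucket falls under the four-second discard rule and therefore carries gpt: null, and that the 64% figure is rounded; segments between three and four seconds would lower the bound further. The function name is illustrative.

```python
TOTAL_SEGMENTS = 147_092_653   # paper headline count of annotated segments
SHORT_FRACTION = 0.64          # paper-reported share of 0-3 second segments (rounded)

def max_fully_annotated(total, short_fraction):
    """Upper bound on nodes that can carry GPT fields, assuming every short node is null."""
    return total - round(total * short_fraction)

upper_bound = max_fully_annotated(TOTAL_SEGMENTS, SHORT_FRACTION)
print(f"{upper_bound:,}")
```

Under these assumptions, at most roughly 53 million nodes could have all five structured GPT fields, which is why the headline figure reads better as a node count than as an example count.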

Final Embedding Versus Best Transfer State

Guillotine Regularization, Perception Encoder, and LeVLJEPA all warn that the representation exposed by a popular aggregate metric may not be the representation that transfers best. Guillotine and Perception Encoder focus on layer choice: a model’s final output can be worse than an intermediate state for downstream transfer. LeVLJEPA adds a readout-axis version: pooled zero-shot image-text alignment favors contrastive objectives, while dense patch-token readouts and frozen VLM-backbone transfer favor the non-contrastive objective in its experiments. LeNEPA adds a local time-series SSL version because its useful frozen-probe evidence depends on intermediate-layer readouts rather than an unqualified final embedding. This complicates source pages that summarize a model as “having strong representations”: the wiki should distinguish last-layer embeddings, best-layer embeddings, pooled embeddings, dense local tokens, fixed reporting layers, layer-selection budgets, and aligned outputs. For time-series and world-model sources, this is an open tension around whether reported embeddings preserve downstream-relevant dynamics or merely serve the pretraining head.

Raw Representation Similarity Versus Calibrated Alignment

Aristotelian Representation Hypothesis shows that raw global representational similarity and max-over-layer summaries can be inflated by representation width and layer-search depth. This complicates any wiki claim that larger models “converge” in representation space based only on CKA/RSA/Procrustes-style plots. The surviving calibrated evidence in that source is local-neighborhood agreement, not global metric-space convergence. For time-series and world-model sources, the unresolved tension is whether local-neighborhood alignment preserves dense numeric state, rare regimes, channel identity, events, and action history, and whether the required null calibration can preserve temporal or graph dependence rather than destroying it with naive row shuffles.
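The difference between a naive row-shuffle null and a dependence-preserving null can be sketched on toy data. This is not the source's calibration procedure: the mutual k-nearest-neighbor score, the circular-shift null, the toy trajectory, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest rows (Euclidean) for each row, excluding the row itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def mutual_knn_agreement(A, B, k=5):
    """Mean fraction of k-nearest neighbors shared by two representations of the same rows."""
    na, nb = knn_indices(A, k), knn_indices(B, k)
    shared = [len(set(a) & set(b)) for a, b in zip(na, nb)]
    return float(np.mean(shared)) / k

def null_scores(A, B, mode, k=5, n_null=20, min_shift=20, seed=0):
    """Null agreement scores: 'shuffle' permutes rows, 'shift' circularly shifts B in time."""
    rng = np.random.default_rng(seed)
    n = len(B)
    out = []
    for _ in range(n_null):
        if mode == "shuffle":
            B_null = B[rng.permutation(n)]          # destroys temporal dependence
        else:
            shift = int(rng.integers(min_shift, n - min_shift))
            B_null = np.roll(B, shift, axis=0)      # keeps temporal dependence
        out.append(mutual_knn_agreement(A, B_null, k))
    return np.array(out)

# Toy data: two noisy random projections of one smooth latent trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0, 6 * np.pi, 200)
latent = np.stack([np.sin(t), np.cos(0.5 * t), t / t.max()], axis=1)
A = latent @ rng.normal(size=(3, 16)) + 0.05 * rng.normal(size=(200, 16))
B = latent @ rng.normal(size=(3, 32)) + 0.05 * rng.normal(size=(200, 32))

observed = mutual_knn_agreement(A, B)
shuffle_null = null_scores(A, B, "shuffle")
shift_null = null_scores(A, B, "shift")
print(observed, shuffle_null.mean(), shift_null.mean())
```

In this toy, the shift null sits well above the shuffle null because temporal adjacency alone creates shared neighbors. A score calibrated only against row shuffles would credit that adjacency as representational alignment.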

Residual Accumulation Versus Depth Retrieval

MoDA argues that the standard residual stream is an accumulation interface: earlier signals remain present only through a growing blended state. RAEv2 shows that fixed multi-layer aggregation can be useful for reconstruction and generation, but a separately linked X discussion of RAEv2 also notes that residual-stream summation is a fixed depth-weighting scheme, not an optimal retrieval mechanism. mHC adds a third route: widen the residual stream into parallel streams and constrain residual mixing, rather than only accepting fixed accumulation or adding depth-KV retrieval. Hyperloop Transformers applies that residual-state-capacity idea at loop boundaries. Variable-Width Transformers adds a deterministic carry-forward variant: inactive residual coordinates bypass narrow layers and re-enter later wide layers, so residual capacity can vary across depth without learned projection matrices. The unresolved comparison now spans fixed aggregation, learned layer weights, sparse selection, looped refinement, content-based depth retrieval, constrained matrix residual state, and deterministic residual slicing, all under matched memory bandwidth, cache, latency, and FLOPs. For the wiki’s time-series agenda, this remains an architecture analogy until a source tests whether these mechanisms preserve numeric state, rare regimes, actions, or intervention effects under those matched budgets.

Semantic Latents Versus Pixel Fidelity

Reconstruction or Semantics? argues that semantic latent spaces can be more policy-relevant for robotic world models than reconstruction-focused latents. World Models is the older version of the same tension: its standalone VAE could preserve irrelevant visual detail while missing task-relevant features. The Prism Hypothesis, Tuna-2, and Gemma 4 12B complicate that: Prism frames semantic and pixel encoders as occupying different frequency bands, Tuna-2 argues that end-to-end pixel embeddings can beat pretrained vision encoders for unified multimodal understanding and generation, and Gemma 4 12B turns encoder-free image/audio routing into an open-weight production release. Gemma 4 12B should not be overread as “no input frontends”: it still uses lightweight patch/waveform projection before the shared decoder backbone. RAEv2 adds another route: keep pretrained semantic encoders, but aggregate multiple layers to recover local detail and improve generation/rollout behavior. A separately linked RAEv2 X discussion, not the paper itself, further complicates the mechanism because residual-stream summation acts like a fixed layer-depth weighting, not a proven optimal fusion rule. Self-Teaching Autoencoder adds a lower-evidence route: train the decoder inside the latent objective through transformed self-consistency, so reconstruction grounding is not bolted onto a pretrained latent afterward. Because that evidence is a blog/code/demo snapshot, it should be treated as a mechanism hypothesis rather than a settled resolution of the semantic-versus-fidelity tension. The wiki should not collapse these into one rule; the right latent appears task-dependent, and the layer aggregation, projection-frontend, or decoder-grounding interface is itself part of the unresolved tension.

Learned Simulator Training Versus Simulator Exploitation

World Models shows both sides of learned latent environments: a controller trained in a learned DoomRNN dream can transfer back to VizDoom, but low-uncertainty or imperfect dreams let controllers discover policies that exploit model errors and fail in the real environment. Newer action-conditioned world-model claims should therefore separate model-rollout score, uncertainty handling, and real-environment or live-stand transfer.

Genie adds a newer generated-interactive-environment case: it shows controllable visual rollouts from learned latent actions, but the paper also reports unrealistic hallucinated futures, 16-frame memory, about 1 FPS interaction, and no large-model or training-data release. Generated playability should therefore remain separate from evidence that agents trained inside the simulator transfer robustly.

stable-worldmodel adds a current reproducibility version of the same tension. It standardizes data access, solvers, and factor-of-variation sweeps, but its Push-T analyses still require prediction error and planning success to be reported separately under distribution shift. Better infrastructure lowers ambiguity; it does not remove the learned-simulator exploitation problem.

AdaJEPA adds the adaptation variant: updating a latent world model during MPC can improve closed-loop success under shift, but the benchmark must still separate prediction-error reduction from better candidate-action ranking, and must report adapted layers, buffer policy, reset policy, latency, and failure cases where online updates overfit recent transitions.

SkyJEPA strengthens the positive side with real outdoor quadrotor control rather than only visual rollout or simulator benchmarks: a learned latent dynamics model is used inside MPPI and transfers from domain-randomized simulation to physical flights, including propeller and payload changes. It does not resolve the tension completely because the public code/data are not released yet, the experiments are still within one quadrotor family, and the paper does not provide a general uncertainty or simulator-exploitation audit for arbitrary candidate interventions.

Optimization of computational budget for power system risk assessment adds a pragmatic power-grid variant: learned proxies can rank cases and allocate simulator budget, but the highest-risk contingencies still need physical-simulator validation because ML-only estimates can understate extremes.

LLMServingSim 2.0, Revati, and LLM-Emu add a serving-systems version of this tension. LLMServingSim 2.0 reports low error while expanding simulator coverage to heterogeneous and disaggregated serving, but simulator-based optimization can still overfit missing profiles, interconnect assumptions, workload traces, or framework features. Revati reduces serving-engine reimplementation drift by running real framework control logic, but moves trust into CUDA interposition, kernel-duration models, and virtual-time correctness. LLM-Emu adds a wall-clock vLLM endpoint variant: it avoids CUDA interception and keeps online HTTP behavior, but trust moves into profile coverage, synthetic-output realism, vLLM API drift, and single-node/profile-matched validation. Simulator and emulator wins are proxy-world-model evidence until held-out real deployments validate the optimized policies.

World Model for Robot Learning Survey broadens this into the current robotics benchmark taxonomy: open-loop action-conditioned video quality, closed-loop policy utility, and physical/executability diagnostics are separate evidence layers. Visual plausibility should not be treated as simulator-transfer evidence.

Next-Token Accuracy Versus Latent World-Model Quality

NextLat sharpens a language-model version of the same evidence problem. In the Manhattan taxi setting, models can reach perfect next-token legality while still producing weak internal maps, low valid-trajectory quality, or poor latent compression. That means ordinary next-token loss, perplexity, or one-step legality should not be treated as evidence that a Transformer has learned a compact world model. The unresolved comparison is whether auxiliary next-hidden-state objectives, recurrent memories, action-conditioned latent rollouts, or explicit planning losses best make internal state useful under matched compute and serving budgets. For time-series and operations, this becomes a benchmark requirement: passive forecast accuracy must be paired with latent-state probes, rare-regime preservation checks, and candidate-action or intervention rollouts before calling a model world-model-like.

Comparing Transformers and Hybrid Models at the Token Level adds a matched Olmo 3 versus Olmo Hybrid case: aggregate loss hides whether gains come from state-conditioned readout, visible-prefix retrieval, or structural closure. For time-series foundation models, average forecast loss should therefore be paired with filtered slices for rare regimes, cross-channel binding, event-conditioned transitions, exact recent recall, repeated normal spans, and structural constraints before crediting recurrence, attention, or a hybrid mixer.

Looped World Models is a useful positive case for learned-simulator architecture, but it also strengthens the tension: strong text-environment prediction and parameter-efficiency claims do not by themselves show that planners will not exploit hidden model errors, especially without public artifacts or held-out closed-loop transfer tests.

On Training in Imagination adds that a learned simulator’s score often depends on a separate learned reward model. A controller can fail because the dynamics model is wrong, because the reward model is wrong, because reward labels are too noisy for the rollout budget, or because reward labels are systematically biased. These are distinct failure modes; averaging more rollouts can reduce zero-mean reward noise but cannot remove reward bias.
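The noise-versus-bias distinction can be checked with a small simulation. This is a toy sketch, not the paper's setup: the additive reward-error model, the bias and noise values, and the function name are all assumptions.

```python
import numpy as np

def mean_reward_error(n_rollouts, bias, noise_std, true_return=1.0, n_trials=2000, seed=0):
    """Signed error and spread of the rollout-averaged return estimate across trials."""
    rng = np.random.default_rng(seed)
    # Each rollout's estimated return = truth + systematic bias + zero-mean noise.
    estimates = true_return + bias + noise_std * rng.normal(size=(n_trials, n_rollouts))
    per_trial_error = estimates.mean(axis=1) - true_return
    return float(per_trial_error.mean()), float(per_trial_error.std())

results = {n: mean_reward_error(n, bias=0.3, noise_std=1.0) for n in (1, 16, 256)}
for n, (signed, spread) in results.items():
    print(n, round(signed, 3), round(spread, 3))
```

As the rollout count grows, the spread shrinks roughly with the inverse square root of the count, while the signed error stays near the injected bias of 0.3.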

Digital World Models Versus Operations World Models

Agentic World Modeling usefully defines digital world models around software laws: API contracts, UI state machines, file-system logic, permissions, type constraints, and replayable error branches. This is close to CWM, but it should not be collapsed into the operations target. A web, GUI, or code simulator can satisfy digital-world constraints while still missing numeric telemetry, graph time series, event streams, delayed interventions, hidden concurrent users, failed-action semantics, rewards, and human-approval actions. The wiki should use digital world models as an architectural precedent for action/state/constraint contracts, not as evidence that SRE action-conditioned world modeling is solved.

Raw Trajectories Versus Summary Interfaces

Scaling Test-Time Compute for Agentic Coding shows that compact structured summaries can beat raw action-observation traces for coding-agent selection and reuse. But LLM Agents Need Action-Conditioned World Models and Hierarchical Modeling with a Fixed FLOPs Budget warn that compression can erase timing, magnitude, topology, failed-action status, uncertainty, rare events, and intervention effects. The unresolved tension is how to design summary schemas and preservation probes that reduce trace noise without destroying the state needed for control.

Latent Reasoning Diversity Versus Shortcut Collapse

The Illusion of Superposition shows that continuous latent reasoning interfaces can collapse to effectively discrete behavior or shortcut directly to the answer. Latent Thought Flow is a plausible positive response: train a reward-proportional posterior over variable-length hidden trajectories with entropy-weighted GFlowNet subtrajectory balance and a reference prior. Generative Recursive Reasoning adds a trained stochastic recursive-model branch, while Probabilistic Tiny Recursive Model adds training-free recurrent-noise trajectories on a deterministic TRM. PTRM strengthens both sides of the tension: pass@$K$ shows that diverse proposals can recover answers missed by one deterministic path, but the Maze-Hard gap between pass@$K$ and best-Q@$K$ shows that latent diversity is useful only when the selector is calibrated. The tension is unresolved because these positive sources report task accuracy, reasoning length, entropy, and puzzle coverage, while the negative source emphasizes causal probes and shortcut ablations. For time-series and world-model claims, the wiki should require both: trajectory-sampling gains under matched wall-clock budget, and probes showing that latent state preserves regimes, event timing, exogenous variables, action history, and multiple candidate futures rather than collapsing to an early shortcut. Reasoning length, latent-step count, and nominal FLOPs are only proxies; the unresolved matched-budget comparison must include batching, KV-cache or recurrent-state traffic, scheduler overhead, and stronger wider/deeper non-recurrent baselines.
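The gap between proposal coverage and selector quality can be made concrete. The sketch below assumes that best-Q@K means "score the single sample with the highest selector value", which may differ from PTRM's exact definition; the puzzle data and function names are hypothetical.

```python
def pass_at_k(correct):
    """1.0 if any of the K sampled answers is correct, else 0.0."""
    return float(any(correct))

def best_q_at_k(correct, q_scores):
    """1.0 if the sample with the highest selector score is correct, else 0.0."""
    best = max(range(len(q_scores)), key=lambda i: q_scores[i])
    return float(correct[best])

# Hypothetical puzzles: per-sample correctness and selector (Q) scores for K = 4.
puzzles = [
    {"correct": [False, True, False, False], "q": [0.2, 0.9, 0.1, 0.3]},   # selector finds it
    {"correct": [False, False, True, False], "q": [0.8, 0.1, 0.4, 0.2]},   # selector misses it
    {"correct": [False, False, False, False], "q": [0.5, 0.6, 0.7, 0.4]},  # no correct sample
]

pass_k = sum(pass_at_k(p["correct"]) for p in puzzles) / len(puzzles)
best_q = sum(best_q_at_k(p["correct"], p["q"]) for p in puzzles) / len(puzzles)
selector_gap = pass_k - best_q
print(pass_k, best_q, selector_gap)
```

Here two of three puzzles have a correct sample, but the selector picks it in only one, so a third of the puzzles are lost to selection rather than to missing diversity.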

Arbitrary-Order Flexibility Versus Exploration Coverage

iLLaDA and DMax make the positive case for bidirectional masked modeling, variable reveal order, revisable intermediate states, and aggressive parallel inference. The Flexibility Trap adds a task-conditional counterexample: on math and code, confidence-driven arbitrary order can postpone high-entropy reasoning forks until future context collapses their ambiguity, producing flatter Pass@$k$ coverage than left-to-right order. Its JustGRPO method nevertheless preserves parallel diffusion decoding after training, so the conflict is not “parallel versus sequential models.” The unresolved design space has at least three separate axes: training rollout order, inference commitment/revision order, and task structure. Claims about flexibility should therefore report proposal coverage, one-sample quality, local entropy, exact-likelihood cost, tokens per step, and end-to-end serving cost under matched budgets. A time-series analogue must additionally test rare valid trajectories, numeric fidelity, future-target leakage, and action-conditioned branch coverage.
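A "flatter coverage curve" can be illustrated with the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples with c correct. Whether The Flexibility Trap uses this exact estimator is an assumption here, and the per-problem counts below are hypothetical.

```python
from math import comb

def pass_at_k_unbiased(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage_curve(correct_counts, n, ks):
    """Mean pass@k over problems for each k in ks."""
    return [
        sum(pass_at_k_unbiased(n, c, k) for c in correct_counts) / len(correct_counts)
        for k in ks
    ]

ks = [1, 2, 4, 8]
n = 16
# Hypothetical decoding orders with the same pass@1 but different coverage.
flat_order = coverage_curve([8, 8, 0, 0], n, ks)    # solves two problems often, two never
broad_order = coverage_curve([4, 4, 4, 4], n, ks)   # solves every problem occasionally
print(flat_order)
print(broad_order)
```

Both hypothetical orders start at pass@1 = 0.25, but the first saturates near 0.5 while the second keeps climbing, which is why one-sample quality and proposal coverage need to be reported separately.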

Objective-Relevant Compression Versus Decision-Relevant State

Learning is Forgetting argues that LLM training can improve representations by forgetting input detail that is irrelevant to next-sequence prediction. The fixed-FLOPs and world-model pages need the opposite guardrail: if the objective does not include rare events, dense numeric fidelity, topology, actions, or intervention effects, compression can erase exactly the state a time-series or action-conditioned world model needs. JEPA Slow Features is the adjacent failure mode: predictive objectives can preserve the wrong factors even without constant collapse. The unresolved question is how to make compression accountable to downstream state and control objectives, not only average prediction loss.

TurboQuant adds a lower-level vector-compression example: even when a quantizer is optimized for MSE, inner-product estimates can be biased, so the paper spends a residual QJL bit to preserve scoring geometry. The later vLLM critique adds a second unresolved layer: preserving quality and reducing KV-cache bytes can still lose to FP8 once hardware-native attention, dequantization overhead, latency, throughput, and burst-load behavior are counted. That strengthens rather than resolves the tension. TSFM compression should declare whether it preserves reconstruction, inner products or retrieval, downstream prediction, anomaly sensitivity, or control value, and whether the compressed representation improves the actual serving contract.
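Why an MSE-optimized quantizer can bias inner products is visible even in a one-bit toy case. The sketch below is not TurboQuant's quantizer and does not implement the QJL residual correction; it uses a sign quantizer with the MSE-optimal shared scale, and all names are illustrative.

```python
import numpy as np

def one_bit_mse_quantize(x):
    """Sign quantizer with the shared scale mean(|x|), which minimizes ||x - a*sign(x)||^2."""
    return np.mean(np.abs(x)) * np.sign(x)

rng = np.random.default_rng(0)
d = 4096
x = rng.normal(size=d)
q = one_bit_mse_quantize(x)

# A query correlated with x, so the true inner product is far from zero.
y = x + 0.5 * rng.normal(size=d)

true_ip = float(y @ x)
quant_ip = float(y @ q)
shrink_ratio = quant_ip / true_ip
print(round(shrink_ratio, 3))
```

For Gaussian inputs the ratio lands near 2/π, about 0.64: the quantizer minimizes reconstruction error, yet inner products against it are systematically shrunk. A compression claim therefore needs to say which of these quantities it preserves.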

Latent Context Language Models and Compress & Attend Transformers add the learned-context-compression version of the same tension. LCLM improves long-context language serving by compressing prompts into soft latent tokens before decoder prefill, while CAT replaces older chunks with compressed chunk representations and exposes chunk size as a budget knob. Both are positive evidence that learned compression can work, but both also make preservation the central unresolved issue: LCLM needs reconstruction and expansion safeguards, and CAT reports that larger chunks can miss retrieval-critical details. TSFM compression claims should therefore include preservation probes for rare regimes, channel-specific deviations, event timing, exogenous variables, and action or intervention history.

Oryx, Hybrid Associative Memories, and HOLA add the routing and exact-retention versions of the same tension. Oryx can choose attention versus recurrent mixer mode across spans; HAM can choose which tokens enter a thresholded or learned KV scratchpad; HOLA fixes the cache budget and ranks exact pairs by committed delta-rule update magnitude. All three create selection policies whose failures are easy to hide in aggregate loss. A TSFM analogue should prove that the selector preserves predictable-but-critical scheduled actions, low-salience exogenous variables, rare regime boundaries, and delayed intervention effects, not only high-update observations or retrieval needles. HOLA further shows why “surprise” must be named precisely: $\beta \lVert e \rVert$ measures how much the recurrent state changed, not calibrated uncertainty or downstream decision value.
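How an update-magnitude score can drop a predictable but important item is easy to see in a generic delta-rule memory. The sketch below is not HOLA's implementation: it assumes the textbook update S ← S + β(v − Sk)kᵀ with unit-norm keys, scores each step by β‖e‖, and keeps the top-scoring steps under a hypothetical budget.

```python
import numpy as np

def delta_rule_scores(keys, values, beta=0.5):
    """Run a delta-rule memory and return beta * ||e_t|| for every step."""
    S = np.zeros((values.shape[1], keys.shape[1]))
    scores = []
    for k, v in zip(keys, values):
        k = k / np.linalg.norm(k)        # unit-norm key
        e = v - S @ k                    # prediction error of the current state
        S = S + beta * np.outer(e, k)    # delta-rule write
        scores.append(beta * np.linalg.norm(e))
    return np.array(scores)

rng = np.random.default_rng(0)
# A "scheduled action" repeated six times, followed by three novel observations.
scheduled_k, scheduled_v = rng.normal(size=8), rng.normal(size=4)
keys = np.stack([scheduled_k] * 6 + [rng.normal(size=8) for _ in range(3)])
values = np.stack([scheduled_v] * 6 + [rng.normal(size=4) for _ in range(3)])

update_scores = delta_rule_scores(keys, values)
budget = 3
kept = np.argsort(update_scores)[::-1][:budget]
print(np.round(update_scores, 3), kept)
```

The score of the repeated pair decays geometrically because the state already predicts it, so its latest occurrence falls outside the budget regardless of how much that scheduled action matters downstream.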

Variable-Width Transformers adds a positive structural case: a static hidden-width bottleneck can improve language-model residual entropy and activation utilization, but it does not yet show that rare regimes, dense numeric detail, exogenous variables, or action history survive compression in time-series or world-model settings.

MiniMax Sparse Attention adds the sparse-selection version of the same tension. Selected blocks remain exact tokens rather than compressed summaries, but unselected blocks become invisible to the layer. That means sparse attention can avoid some compression artifacts while still erasing decision-relevant state through selection failure. For time-series and action-conditioned world models, the preservation probe must therefore include selection recall for rare regimes, exogenous variables, event timing, topology-dependent signals, and action history, not only reconstruction quality of retained state.
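A selection-recall probe of the kind described above can be as small as the following sketch. It does not reproduce MiniMax Sparse Attention's selection rule; the router scores, the critical-block labels, and the function name are hypothetical.

```python
import numpy as np

def selection_recall(block_scores, critical_blocks, top_k):
    """Fraction of critical blocks that appear among the top_k highest-scoring blocks."""
    selected = set(np.argsort(block_scores)[::-1][:top_k].tolist())
    critical = set(critical_blocks)
    return len(selected & critical) / len(critical)

# Hypothetical router scores for 8 context blocks.
# Block 2 holds a rare regime and block 6 holds a past action.
router_scores = np.array([0.9, 0.8, 0.1, 0.7, 0.6, 0.5, 0.75, 0.2])
recall = selection_recall(router_scores, critical_blocks=[2, 6], top_k=3)
print(recall)
```

The retained blocks are exact, so a reconstruction metric over them would look perfect, while recall over the critical blocks is only 0.5 because the rare-regime block was never selected.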

Interpretable Policy Distillation for Power Grid Topology Control is a small-grid positive case for auditable controller compression: tree policies distilled from PPO can improve reported Grid2Op closed-loop reward and survival. It does not resolve the broader tension because compressed controllers can still create deterministic safety transients and may not generalize across grid topology or action-space changes without explicit preservation and safety probes.

Universal Weight Subspace Versus Mean-Adapter Baselines

The Universal Weight Subspace Hypothesis argues that many independently trained models or adapters share architecture-specific low-rank weight subspaces that can support compression, merging, and efficient adaptation. An independent stress-test repository linked from that source reports a narrower interpretation on a subset of the paper’s settings: trained LoRA spectra are sharper than random noise, but leave-one-out functional tests suggest the useful transfer may come mostly from a shared mean update rather than a rich task-specific basis. Reinforcement Learning Finetunes Small Subnetworks adds a different reusable-geometry pattern: RL deltas can be sparse, broadly distributed, and nearly full-rank, with overlap across seeds, data, and algorithms. Exploration: Fine-Tuning With Parameter Decomposition adds a fourth pattern: reusable structure may come from an explicit mechanistic decomposition of one target model’s weights, after which a scalar component edit can be effective. The tension is not whether reusable update structure exists at all; it is whether the reusable structure is low-rank, sparse full-rank, mean-like, task-specific, or a decomposed causal component basis, and which matched total-cost and retention baselines are required before calling one geometry the preferred adaptation interface.

Latent Dynamics Versus Governed Law Revision

LeCun APTAMI, LeWorldModel, and the broader latent-prediction thread make a strong case for learned latent dynamics as the substrate for prediction and planning. Agentic World Modeling adds a stricter L3 boundary: revising the model when evidence contradicts predictions. For operational and scientific settings, that likely requires explicit, auditable, revisable constraints, tests, data-version records, or symbolic assets around the latent model. The unresolved question is whether latent dynamics alone can support governed model revision, or whether L3 needs a hybrid latent-plus-symbolic control plane.

AdaJEPA is a concrete latent-only revision mechanism inside MPC, but it does not yet provide governed revision: no persistent memory policy, explicit constraint ledger, calibrated uncertainty, or safety audit says when an update should be accepted, rejected, or rolled back.

Heuristic-Free JEPA Versus Stabilized Predictive Training

LeJEPA, VISReg, LeJEPA Identifiability, and LeWorldModel emphasize Gaussian/SIGReg-style regularization as a path away from teacher-student, stop-gradient, and schedule heuristics. VISReg strengthens the scaling narrative but also creates a new internal tension: if vanilla SIGReg can vanish under collapse, then the practical stabilizer may need VICReg-like scale control plus SWD shape matching rather than a single SIGReg term. LeNEPA strengthens the positive time-series side by using temporal SIGReg to train next-latent predictions without stop-gradient or EMA in the reported fixed-recipe setting. Sensorimotor World Models complicates that by showing a task-grounded inverse-dynamics signal can stabilize a JEPA-style world model without filling the embedding space, while also making the learned state depend on the richness and bias of the action interface. NEPA still uses causal masking and stop-gradient for next-embedding visual prediction. LeVLJEPA is a cross-modal counterweight: even with per-modality SIGReg, direct symmetric image-text regression collapses or underperforms, so multimodal JEPA may still need predictor/stop-gradient asymmetry unless temporal or VISReg-style alternatives are validated. Self-Teaching Autoencoder complicates this further: SIGReg is used, but the blog still needs transformation constraints and step-frozen judging to avoid private-language and artifact-invariance shortcuts. LeJEPA Identifiability sharpens the positive case under Gaussian/OU assumptions, while also making the caveat more explicit: non-Gaussian or policy-shaped trajectories can produce distorted non-collapsed states. The open question is whether Gaussian regularization, temporal SIGReg, VISReg-style scale/shape regularization, action-grounded inverse dynamics, reconstruction grounding, stop-gradient targets, or hybrids are the right stabilizer at scale, and whether their chosen compression preserves rare regimes, dense detail, and intervention-relevant state rather than only preventing constant collapse.

Predictable Latents Versus Task-Relevant Dynamics

Joint Embedding Predictive Architectures Focus on Slow Features shows that latent prediction can select fixed or slow distractors instead of action-relevant state, even when the embedding is not constant-collapsed. LeJEPA Identifiability adds the conditional positive side: under Gaussian/OU latents and successful whitening, predictability can recover true latent state up to rotation. Learn From Your Own Latents And Not From Tokens adds a positive synthetic hierarchy case: when the data has a balanced, recoverable hidden hierarchy, own-latent prediction can recover structure at local clustering scale. Aionoscope adds a controlled time-series diagnostic for the same problem: component presence can be recoverable while dense timing, phase, amplitude, frequency, and regime variables remain weak. That does not remove the slow-feature tension; it narrows the question to diagnostics that distinguish useful hierarchy and dense process-state accessibility from predictable nuisance equivalence classes. For time-series JEPA, the unresolved tension is how to preserve slow regime context without letting static exogenous variables, identifiers, or sensor artifacts dominate the latent state.

Temporal Straightening adds a planner-facing counterpressure: even a predictive latent can be geometrically awkward for differentiable action optimization, so the representation can be regularized to make local temporal progress straighter. This does not resolve the tension universally. Local straightness can itself become a shortcut if it suppresses meaningful turns, regime changes, irreversible transitions, or rare events, and the paper’s positive evidence is limited to visual goal-reaching tasks with symmetric Euclidean costs.

Explicit H-JEPA Stacking Versus Implicit Own-Latent Hierarchy

Learn From Your Own Latents And Not From Tokens argues that, on the Random Hierarchy Model, data2vec-like own-latent targets can implicitly climb the hierarchy without explicitly stacking separate JEPA levels. That creates a design tension for H-JEPA-style systems: explicit multi-scale modules may be useful for real video, robotics, or multivariate streams, but they should beat a strong implicit own-latent baseline under matched compute and preservation probes before the hierarchy itself gets credit.

Uniformity Prevents Collapse Versus Long-Tailed Reality

The Hidden Uniform Cluster Prior in Self-Supervised Learning argues that several SSL anti-collapse mechanisms impose a hidden uniform feature or cluster prior. LeJEPA Identifiability gives the complementary positive case for a Gaussian or whitening prior when the latent world really is close to Gaussian and isotropic. This is still in tension with naturally long-tailed temporal data, where rare events, regimes, interventions, and treatments may matter precisely because they are not uniform. Gaussian or uniform-looking target distributions can be useful stabilizers, but the wiki should treat them as priors whose fit to the domain and data-collection policy must be tested.

Targeted Data Selection Versus Support Preservation

Motion Attribution for Video Generation shows a positive targeted fine-tuning case: after computing attribution over all 10,000 candidates, its query-conditioned 1,000-clip subsets beat fine-tuning on the whole 10,000-clip experimental pool on dynamic degree for ten motion categories. This is neither a 10%-of-pretraining-data result nor evidence of 10% candidate-data access or 90% total-compute savings. A Bitter Lesson for Data Filtering and No Filter warn that filtered or selected corpora can lose long-tail support or reverse under different compute regimes. The unresolved tension is whether dynamic curricula can spend compute on high-value windows while preserving natural distribution support, untargeted capability, and normal-behavior calibration under matched end-to-end budgets.

Tokenizer Removal Has Multiple Incompatible Paths

H-Net learns hierarchical byte chunking end to end, Synergy learns routing over byte-level abstraction, Bolmo byteifies existing subword LMs through distillation, ConceptMoE compresses token streams into concepts inside an MoE, and Compute Optimal Tokenization treats compression rate as a scaling-law variable. These are not the same claim. They agree that fixed tokenization is limiting, but disagree on whether the future is byte-level modeling, learned chunking, concept-level compute allocation, transfer from subword models, or compression-aware training recipes. The wiki should compare them under explicit bytes-per-parameter, inference-FLOPs-per-byte, vocabulary/embedding-cost, and preservation-probe budgets rather than treating “fewer tokens” as automatically better.

ELF adds a different route: keep tokenization and final vocabulary decoding, but move the generative trajectory into continuous contextual embedding space and discretize only at the final step. This complicates any simple tokenizer-removal story. For multimodal time-series models, the question may be less “tokens or no tokens” and more “which parts of the pipeline should remain continuous until the final readout?”

DMax adds another intermediate route: keep the discrete vocabulary and final committed tokens, but represent decoded positions as confidence-weighted token/mask embeddings during iterative denoising. This supports the broader lesson that the continuous/discrete boundary can move inside decoding, not only at tokenization or final readout; it also adds the constraint that soft states worked only after OPUT trained the model to recover from its own predicted noisy states.

iLLaDA adds a more conservative but important route: keep discrete masked tokens and the same masked diffusion objective through pre-training and SFT, then improve scale, architecture, learning-rate schedule, variable-length generation, and confidence-based scoring. This strengthens diffusion language modeling without settling the continuous-versus-discrete boundary. It also creates a separate tension with autoregressive baselines: iLLaDA-Base is competitive with Qwen2.5 7B in the reported table, while iLLaDA-Instruct still trails Qwen2.5 7B Instruct, so post-training and benchmark-scoring protocols remain unresolved rather than a solved diffusion-versus-AR verdict.

Latent Context Language Models and Compress & Attend Transformers add a related but distinct path: keep tokenization inside local spans, then compress long context into latent representations for the decoder. That should not be merged with byte-level modeling or tokenizer removal. The unresolved question is whether learned context compression should be treated as tokenization, memory, serving optimization, or all three, and which evaluation unit replaces tokens when history is compressed.

Distributional Latent Prediction Versus Demonstrated Multi-Modality

VJEPA directly motivates the wiki’s requirement that a latent-state model represent a predictive belief instead of one averaged future embedding. It also exposes an important evidence gap: the paper says expressive predictive families can support multi-modal futures, but the evaluated target and predictor are independent diagonal Gaussians. That implementation is unimodal and can still average incompatible left-versus-right or normal-versus-failure futures into an invalid state. Sampling several points from one broad Gaussian is not evidence of separated modes.
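The averaging failure described above can be illustrated with a toy sketch. It is not VJEPA's code: the one-dimensional "latent", the two modes at -2 and +2, and all variable names are assumptions made for illustration. It fits a single Gaussian to bimodal futures and checks where the fitted mean and its samples land.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D latent futures: half go "left" (-2), half go "right" (+2).
left = rng.normal(loc=-2.0, scale=0.1, size=5000)
right = rng.normal(loc=2.0, scale=0.1, size=5000)
futures = np.concatenate([left, right])

# Maximum-likelihood fit of one Gaussian (the 1-D case of a diagonal Gaussian).
mu = futures.mean()
sigma = futures.std()

# Fraction of real futures that lie within 0.5 of the fitted mean.
real_near_mean = np.mean(np.abs(futures - mu) < 0.5)

# Samples from the fitted Gaussian are spread out, but many fall between
# the two modes, a region the real futures never visit.
samples = rng.normal(mu, sigma, size=10000)
sampled_between_modes = np.mean(np.abs(samples) < 1.0)
real_between_modes = np.mean(np.abs(futures) < 1.0)

print(f"fitted mean={mu:.3f}, fitted std={sigma:.3f}")
print(f"real futures near fitted mean: {real_near_mean:.3f}")
print(f"sampled vs real mass between modes: "
      f"{sampled_between_modes:.3f} vs {real_between_modes:.3f}")
```

In this toy setting the fitted mean sits near zero, where no real future lies, and the broad variance produces samples between the modes. That is the sense in which sample spread alone is not evidence of separated modes.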

The ICML poster strengthens the caution empirically. VJEPA is strongest in the five-environment linear Noisy-TV study at moderate nuisance scale, but deterministic JEPA is strongest and most stable at the hardest scale, while BJEPA is sensitive to misspecified or overly concentrated priors. The accepted source therefore establishes a useful probabilistic latent interface, not a general result that probabilistic JEPA dominates deterministic JEPA or that its current head solves multi-modal future modeling.

This tension also applies to passive time-series generators such as Sundial, latent-mixture world models such as World Models, and candidate-scoring approaches such as Energy-Based Transformers. A credible multi-modal-future claim should report separated regime coverage, probability calibration, invalid between-mode trajectories, tail-risk recall, multivariate constraint preservation, and action- or intervention-sensitive probability shifts, not only CRPS, quantiles, sample diversity, or a learned variance.

Energy-Based Verification Versus Many-Mode Futures

Energy-Based Transformers strengthens the case that explicit energies can provide candidate verification and dynamic compute, but it also reports extra gradient cost, sensitive optimization hyperparameters, and many-mode failures where convex energy-landscape training can merge nearby modes. This complicates the optimistic EBM/world-model thread from APTAMI and LVEBM: energy scores may be useful for ranking candidate futures or actions, but the wiki should not treat them as calibrated multi-future or action-conditioned control evidence until numeric time-series or control benchmarks test that directly.

Fourier Spectra, Functional Geometry, And Bit-Level Number Encodings

FoNE argues that digit-aligned Fourier features can encode numbers compactly as single tokens and improve arithmetic efficiency. Convergent Evolution adds a sharper warning: Fourier spikes can be universal across model families and even raw number-token frequencies while still failing to produce linearly usable modular geometry. BitTokens argues that sinusoidal number encodings are not general-purpose because multiplication and division force non-local decoding, computation, and re-encoding, while IEEE 754 bit features expose arithmetic structure directly. Pre-trained Large Language Models Use Fourier Features To Compute Addition supports the mechanistic basis for Fourier features in addition, but does not settle whether Fourier, bit-level, logarithmic, or hybrid encodings are best for auxiliary numeric values in time-series models.

TabM adds a separate axis: typed tabular numeric-feature embeddings are column-specific and often depend on feature-specific bins, frequencies, or preprocessing. Piecewise-linear and periodic feature embeddings can be useful for static tabular prediction, but they do not settle universal text-number tokenization or exact arithmetic. The wiki should keep typed feature embeddings distinct from free-standing number tokens.

Synthetic Time-Series Data Is Promising But Not Sufficiently Settled

CauKer, MantisV2, Chronos-2, TabPFN-3, TempoPFN, Toto 2.0, ChatTS, TimeOmni-1, TimeOmni-VL, and T2S all use synthetic data or synthetic annotation, but they target different bottlenecks: classification label coverage, grouped/covariate forecasting, PFN-style learned inference, benchmark-leakage control, time-series-language alignment, reasoning supervision, generation fidelity, and text-conditioned annotation over real temporal fragments. TimeCraft sharpens this split through separate TimeDP, BRIDGE, TarDiff, OATS, CaTSG, and Diff-MN branches: prototype-prompt fidelity, generated-description control, downstream-utility guidance, online pretraining gain, causal validity, and continuous-time reconstruction are different contracts. DiGA and MarS add a financial-simulation contract: matching target indicators, stylized facts, and trading-agent utility still leaves market-impact causality, data access, multi-asset correlation, and simulator-assumption risks. T2S further splits the category: TSFragment-600K is synthetic annotation over real time-series fragments, so its risk is caption-model bias and utility of the text-conditioning signal, not only simulator realism. Adjacent language-model evidence from Synthetic Data for any Differentiable Target sharpens this tension: DPG is not a time-series result, but it shows that generated examples can be optimized for hidden downstream training effects while looking benign. TarDiff is the direct clinical time-series case: influence-guided diffusion can improve downstream and minority-class utility, but makes guidance-set isolation, downstream-model dependence, and metric-target overfitting part of the synthetic-data contract. 
The open tension is therefore not only whether synthetic data is realistic or transferable; it is whether target-optimized synthetic or curriculum data can improve rare-state and reasoning metrics without silently steering weights, representations, or benchmark-specific behavior in ways that surface data-quality checks miss.

Aionoscope adds a diagnostic-data branch to this tension: its synthetic streams are not meant to scale pretraining volume but to reveal which exact latent-state variables a representation exposes. This is still synthetic evidence, so success should be read as a controlled unit test rather than real-domain transfer.

Scaling Laws Versus Tiny Efficient Models

Toto 2.0 reports monotonic gains across a 4M to 2.5B parameter family, while Time-MoE and Moirai-MoE argue for sparse capacity scaling. Tiny Time Mixers, Reverso, Kairos, TiRex, and Moirai 2.0 complicate the story by reporting strong results from compact or specialized backbones. Scaling Laws, Carefully adds that compute-optimality itself is a fitted frontier, not a one-run verdict: the chosen fit region, parameter-count convention, data regime, tokenizer or patcher, and loss precision can move the implied optimum. LLMs as Noisy Channels adds a third axis: even when a scaling law fits clean pretraining, low SNR from data noise, quantization, or SFT can turn monotonic gains into U-shaped degradation. Deep Learning is Not So Mysterious or Different adds that parameter count is also a poor proxy for selected-function complexity, so a larger TSFM should be judged by effective dimensionality, compressibility, and preservation probes rather than size alone. Sundial sharpens the ambiguity because it argues that TimeFlow mitigates mode collapse while its TimeBench data-scale experiments also show gains from larger corpora. The wiki should not treat parameter count as the single axis of progress, and should keep objective design, corpus scale/cleaning, perturbation regime, information density, backbone/inference engineering, and scaling-law fit assumptions separable.

Aggregate Loss Versus Capability Emergence

Implicit Curriculum Hypothesis adds a measurement tension for all scaling stories: validation loss can fall smoothly while discrete skills emerge in an ordered sequence. For TSFMs, average forecast loss can likewise hide when the model learns local numeric fidelity, rare-regime sensitivity, context use, channel coupling, event-stream parsing, or action-conditioned next-state dynamics. The wiki should treat aggregate loss as a budget signal, not as proof that the required state capabilities have emerged.
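A toy sketch of this measurement gap follows. The three skill names are taken from the list above, but the sigmoid loss curves, onset steps, and the 0.5 threshold are illustrative assumptions, not values from the Implicit Curriculum Hypothesis paper.

```python
import numpy as np

steps = np.arange(100)

# Hypothetical per-skill loss curves: each skill drops around its own onset step.
onsets = {
    "local_numeric_fidelity": 10,
    "context_use": 40,
    "rare_regime_sensitivity": 80,
}
per_skill = {
    name: 1.0 / (1.0 + np.exp((steps - onset) / 2.0))
    for name, onset in onsets.items()
}

# Aggregate loss is the unweighted mean over skills.
aggregate = np.mean(list(per_skill.values()), axis=0)

# First step at which each skill's loss falls below an (assumed) 0.5 threshold.
emergence_step = {
    name: int(np.argmax(loss < 0.5)) for name, loss in per_skill.items()
}

print("aggregate loss at step 50:", round(float(aggregate[50]), 3))
print("rare-regime loss at step 50:",
      round(float(per_skill["rare_regime_sensitivity"][50]), 3))
print("emergence steps:", emergence_step)
```

In this sketch the aggregate falls monotonically and has dropped by roughly two thirds by step 50, while the rare-regime skill is still essentially unlearned. Only the per-skill breakdown shows that.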

Full-Batch Sharpness Versus Batch Sharpness

SGD at the Edge of Stability complicates the common “sharpness at $2/\eta$” shorthand. In mini-batch SGD, full-batch sharpness can settle below the full-batch gradient-descent threshold because projected gradient noise changes the self-stabilizing oscillation, while batch sharpness is the quantity that approaches the stochastic stability edge. For wiki synthesis, any claim about sharpness, flatness, or batch-size-dependent generalization should name the sharpness protocol, optimizer, loss, and batch regime rather than treating sharpness as one universal scalar.

The same source reports a CNN with cross-entropy on a small dataset that does not enter edge-of-stability, so the tension applies when the training regime sustains progressive sharpening rather than to all SGD runs.
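For orientation, the full-batch threshold behind the $2/\eta$ shorthand can be seen on a one-dimensional quadratic. For the loss $\lambda x^2/2$, gradient descent with learning rate $\eta$ multiplies $x$ by $(1 - \eta\lambda)$ at each step, which contracts only when $\lambda < 2/\eta$. The sketch below is this textbook case only. It does not reproduce the paper's mini-batch analysis or its batch-sharpness measurement, and the function name and constants are illustrative.

```python
def gd_final_abs(sharpness, lr, steps=50, x0=1.0):
    """Run full-batch gradient descent on L(x) = sharpness * x**2 / 2."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x      # dL/dx
        x = x - lr * grad         # equivalent to x *= (1 - lr * sharpness)
    return abs(x)

lr = 0.1
threshold = 2.0 / lr              # stability edge for this quadratic

below = gd_final_abs(0.9 * threshold, lr)   # per-step factor -0.8: contracts
above = gd_final_abs(1.1 * threshold, lr)   # per-step factor -1.2: diverges

print(f"threshold 2/lr = {threshold}")
print(f"|x| after 50 steps, sharpness below threshold: {below:.2e}")
print(f"|x| after 50 steps, sharpness above threshold: {above:.2e}")
```

This scalar picture is the reason the page asks for the sharpness protocol to be named. Under mini-batch noise, the quantity that sits at the edge need not be the full-batch curvature used here.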

Early Generalization Versus Late Memorization

Why Diffusion Models Don’t Memorize shows that diffusion sample quality and training-set memorization can appear on different timescales. A model can reach a useful generalization window at $\tau_{\mathrm{gen}}$, then memorize only later at $\tau_{\mathrm{mem}}$, with $\tau_{\mathrm{mem}}$ growing roughly linearly with dataset size in the reported setups. This complicates any simple “good validation quality means safe generator” story. It specifically qualifies T2S-, Sundial-, or TimeDP-style generation/future-distribution claims: WAPE/MSE/MRR, CRPS, MMD/KL/MDD, FID-like quality, or forecast utility should not be treated as sufficient generated-data hygiene unless paired with checkpoint/update count, dataset size, duplicate or subsequence checks, nearest-neighbor probes, and memorization or membership-inference audits.

Fixed-Point Convergence Versus Correctness

FPRM uses the residual of a looped Transformer’s hidden-state update as an adaptive halting statistic. Flow Reasoning Models supplies a complementary warning: a self-conditioned discrete-flow solver can converge to a confident but wrong fixed point, so FRM separately re-noises a completed assignment and measures whether the model returns to it. On the paper’s injected-candidate Sudoku/Zebra pools, this perturb-and-resolve score ranks correctness almost perfectly, but it still does not prove that fixed-point stability is calibrated uncertainty or correctness outside checkable puzzle domains. Probabilistic Tiny Recursive Model adds a third distinction: recurrent noise can improve proposal coverage even when the inherited supervised Q head cannot select the newly found correct candidate, as on Maze-Hard. Convergence, perturbation response, proposal coverage, and selector calibration are therefore four separate quantities.

For time-series and world-model systems, the durable rule is that convergence answers “has the update stopped changing?”, not “is the state correct?” or “is this the only plausible future?”. Halting residuals, perturbation stability, task constraints, coverage, rare-regime recall, and downstream decision utility should be measured separately. A stable average trajectory can be invalid, while a rare valid regime change can be unstable under a misspecified model.
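A minimal sketch of the rule that convergence is not correctness, assuming a toy update with two stable fixed points. The double-well update, the `TARGET` value, and the tolerance are illustrative choices and are not taken from FPRM or Flow Reasoning Models. Both runs halt with a near-zero residual, but only one halts at the state designated as correct.

```python
def iterate(x0, step=0.05, tol=1e-10, max_iters=10_000):
    """Gradient descent on the double-well potential (x**2 - 1)**2.

    Returns the final state and the last update residual |x_next - x|.
    """
    x = x0
    residual = float("inf")
    for _ in range(max_iters):
        x_next = x - step * 4.0 * x * (x * x - 1.0)
        residual = abs(x_next - x)
        x = x_next
        if residual < tol:        # halting statistic: the update stopped changing
            break
    return x, residual

TARGET = 1.0                      # hypothetical "correct" state

good, good_res = iterate(0.3)     # converges to +1
bad, bad_res = iterate(-0.3)      # converges to -1 with an equally small residual

print(f"run A: state={good:.6f}, residual={good_res:.1e}, "
      f"error={abs(good - TARGET):.1e}")
print(f"run B: state={bad:.6f}, residual={bad_res:.1e}, "
      f"error={abs(bad - TARGET):.1e}")
```

The residual alone cannot tell the two runs apart, which is why correctness needs a separate check such as a task constraint or a perturbation test.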

Looped Depth And Test-Time Memory Versus Matched-Budget Baselines

Universal Transformers, Huginn, Latent Thoughts, LT2, LoopFormer, FPRM, Probabilistic Tiny Recursive Model, Parcae, Sparse Layers are Critical to Scaling Looped Language Models, ELT, The Thinking Pixel, and Looped World Models make the positive case that repeated or recursive latent computation can provide useful effective computation with shared or sparse parameters, while The Illusion of Superposition shows that latent computation can also collapse or shortcut instead of preserving multiple candidate continuations. FPRM sharpens a sub-tension inside the recursive-reasoning cluster: HRM/TRM-style hierarchy may be useful, but some of its gains may have compensated for post-norm signal propagation rather than proving that fast/slow loops are necessary. PTRM adds parallel rollout width and selector cost to the same resource ledger. The broader tension is that the comparison target keeps moving: a looped block can be compared against a shallower model, a deeper unique-weight model, a wider model with the same parameters, a single-pass fine-tuning baseline, an ordinary ensemble, or a model matched on expected training FLOPs and serving latency. LT2, Parallel Samplers, and The Recurrent Transformer add that the budget must include hardware throughput, key-value cache traffic, recurrent-state memory, OOM frontiers, and decoding schedule, not only nominal loop count.

Titans and ATLAS add explicit memory as another resource, while MIRAS and MesaNet make attentional bias, retention gates, and solver or optimization steps part of the resource accounting. Titans Revisited and Universal Transformers Need Memory sharpen the caution. More loops, more memory slots, more retention structure, and more optimization steps can substitute for one another, but only under a declared budget. Titans reports broad long-context gains, while Titans Revisited shows that under-specified implementation details, chunking, and frozen-backbone memory updates can narrow or reverse the claim. The wiki should not merge these into a generic “more thinking helps” claim.

mHC and Hyperloop Transformers add matrix-valued residual streams as another state-capacity resource. Hyperloop reports parameter-memory gains and INT4 robustness, but this does not settle whether residual-stream width, explicit memory tokens, depth-KV retrieval, extra loops, sparse latent adapters, or unique layers best preserve useful temporal state under matched latency, memory bandwidth, KV/cache, and expected-FLOPs budgets. Thinking Pixel adds the same caveat for diffusion latents: sparse latent-step gains need public artifacts and realized-cost comparisons, not only nominal adapter parameter efficiency.

Looped World Models extends the same tension into world-model rollouts. Its deferred-decoding claim should be compared against ordinary per-step decoding under rollout latency, decoder-call savings, hidden-state drift, closed-loop policy transfer, no-loop/no-deferred-decoding ablations, and simulator-exploitation tests, not only parameter count or text-environment prediction scores.

RMT, ARMT, and RATE add the segment-level recurrent-memory version of the same tension. RMT can match Transformer-XL-style cache with fewer carried vectors in reported language-modeling settings, but the cost moves to sequential segment processing, memory capacity, and BPTT stability. ARMT improves recall/rewrite and BABILong long-context QA, but remains sequential across segments and does not yet show a clean language-modeling or numeric time-series transfer win under matched serving budgets. RATE adds the action-trajectory variant: cached hidden states, explicit memory tokens, and MRV-style filtering can trade off differently across dense-feedback and sparse-cue tasks. The wiki should compare memory tokens, KV cache, SSM/RWKV/Mamba state, test-time memory, and retention valves under explicit latency, memory, and training-budget constraints.

Language Models Need Sleep adds another axis to this tension: compact SSM fast weights may have enough capacity but still fail because the model has not spent enough computation consolidating evicted context into a reasoning-usable state. Its sleep phase improves synthetic and GSM-Infinite reasoning by looping before cache eviction, but this shifts cost into consolidation and training rather than eliminating it. The wiki should separate memory capacity, memory update compute, prediction latency, and training stability.

Dragon Hatchling adds a core-fast-state variant: memory is not an add-on module or segment token block, but a large recurrent state comparable to the model’s weight matrices. Matched-budget comparisons should count state size, update bandwidth, sparsity assumptions, BPTT depth, and serving latency before comparing it with KV cache, SSM fast weights, memory tokens, or test-time memory modules.

The Illusion of Superposition sharpens this tension for latent reasoning specifically: a continuous scratchpad is not enough evidence for superposition or parallel path exploration. PTRM adds a complementary operational requirement: even when stochastic rollouts demonstrably increase pass@$K$, the selected output should not get credit for that diversity unless the selector converts it into best-Q@$K$ accuracy. No-latent/no-loop ablations, proposal-coverage curves, selector calibration, and internal-state probes should accompany claims that hidden compute is doing the intended work.
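The gap between proposal coverage and selected accuracy can be made concrete with a small sketch. The puzzle records and Q scores below are invented for illustration, and the two helper functions are simple per-problem definitions (any-correct among exactly K candidates, and correctness of the top-scored candidate), not PTRM's evaluation code.

```python
def pass_at_k(correct):
    """1.0 if any of the K candidates is correct (proposal coverage)."""
    return float(any(correct))

def best_q_at_k(correct, q_scores):
    """1.0 if the candidate with the highest selector score is correct."""
    best = max(range(len(q_scores)), key=q_scores.__getitem__)
    return float(correct[best])

# Hypothetical results for three puzzles with K = 4 rollouts each.
puzzles = [
    # A correct candidate exists, but the selector prefers a wrong one.
    {"correct": [False, True, False, False], "q": [0.9, 0.2, 0.1, 0.3]},
    # A correct candidate exists and the selector picks it.
    {"correct": [True, False, False, False], "q": [0.8, 0.1, 0.2, 0.3]},
    # No rollout found a correct candidate.
    {"correct": [False, False, False, False], "q": [0.5, 0.4, 0.3, 0.2]},
]

coverage = sum(pass_at_k(p["correct"]) for p in puzzles) / len(puzzles)
selected = sum(best_q_at_k(p["correct"], p["q"]) for p in puzzles) / len(puzzles)

print(f"pass@4   = {coverage:.3f}")
print(f"best-Q@4 = {selected:.3f}")
```

Reporting both numbers side by side separates what the rollouts can find from what the selector actually returns.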

The recursive-reasoning descendants also disagree on mechanism. HRM attributes gains to fast/slow hierarchy plus deep supervision; TRM argues that a simpler answer-refinement loop can outperform the hierarchy; URM attributes ARC/Sudoku gains mainly to UT-style recurrence and nonlinear Transformer components; Universal Transformers Need Memory finds memory tokens necessary only for its single-block UT+ACT setting, while noting that HRM, TRM, and URM solve Sudoku through other state mechanisms. GRAM adds trained stochastic width for multi-solution coverage, while PTRM obtains training-free width by repeatedly perturbing a deterministic TRM and selecting with its inherited Q head. The two were both submitted on 2026-05-19 and should be treated as concurrent neighboring mechanisms, not a clean predecessor chain. That strengthens the mechanism disagreement, but remains controlled puzzle evidence rather than proof of numeric time-series state tracking or action-conditioned world modeling.

Architecture Evidence Versus Biological And Reasoning Narrative

Dragon Hatchling is a useful architecture source for sparse positive recurrent fast state, language/translation comparisons, and synapse-level probes. Its public narrative is broader: the paper frames BDH as a bridge between Transformers and brain models, while Pathway’s later Sudoku blog reports a strong internal Sudoku Extreme result. The tension is that these claims have different evidential status. The wiki should cite BDH as architecture evidence, record the Sudoku result as official but not open-reproduced in the public repository, and avoid treating biological plausibility or puzzle performance as direct evidence for multivariate time-series state tracking, event streams, or action-conditioned world models.

Block-Wise Isolation Versus End-To-End Coordination

DiffusionBlocks argues that residual networks can be split into independently trainable diffusion-style denoising blocks, reducing training memory while matching end-to-end training on its reported tasks. This creates a useful tension with the usual end-to-end coordination story: local objectives may be principled, but they still need to preserve global sequence state, rare regimes, action effects, and pretrained representations. For private or company-local adaptation, block isolation is appealing because it suggests new training boundaries, but gradients or update deltas can still leak data. The wiki should treat block-wise training as a promising systems interface, not as settled evidence that local objectives, privacy, and global model quality are simultaneously solved.

For the company-local adaptation research direction, see Company-Local Block-Wise Fine-Tuning.

Linear Scan Efficiency Versus Nonlinear Recurrent Dynamics

Mamba, Mamba-2, and Mamba-3 make compact recurrent state practical by staying in structured linear hidden-state-update families. ParaRNN challenges the implied boundary: nonlinear GRU/LSTM-style recurrence can also be trained in parallel at billion-parameter scale when the hidden trajectory is solved through Newton iterations and parallel reduction. Pretraining Recurrent Networks without Recurrence adds a different challenge: perhaps nonlinear RNNs can be pretrained by imitating predictive memory states without BPTT at all. That shifts the tension from solver design to supervision design: SMT/DMT removes the long BPTT credit path during pretraining, but depends on a time-parallel Transformer teacher, DMT/post-training to control rollout drift, and evidence that the teacher does not cap tasks where nonlinear recurrence is needed. The unresolved tension is whether nonlinear recurrent dynamics are now broadly practical, or only practical for carefully structured cells with fast Newton convergence or strong predictive-memory teachers.

Univariate Simplicity Versus Native Multivariate Dynamics

Many forecasting models simplify heterogeneous pretraining by treating each channel as a univariate series, including Timer, Time-MoE, Sundial, Kairos, Reverso, TiRex, and MIRA. MIRA adds a healthcare-specific version of the same tension: it pretrains on heterogeneous and partly multivariate clinical sources, but formulates forecasting channel-independently, so it is strong irregular-time and domain-specific evidence rather than native multivariate evidence. SensorFM adds the opposite wearable-health case: it natively encodes a fixed 34-feature multivariate sensor window, but that does not resolve high-channel topology, arbitrary channel schemas, or HDTSF-scale modeling. Chronos-2, Moirai, Tiny Time Mixers, and Toto make stronger claims about covariates, grouped series, channel mixing, or factorized variate attention. U-Cast sharpens the conflict by arguing that low-channel benchmarks are a poor test of the channel-dependence question and that high-dimensional settings expose different scalability and structure-learning requirements. The tension is whether broad transfer is better served by simple channel-independent modeling or by native multivariate structure, and whether the answer changes once the channel count reaches the HDTSF regime.

Forecasting Accuracy Versus Reasoning And Control

Eidos argues for latent-space predictive learning for robust forecasting, while TimeOmni-1 targets explicit reasoning and TimeOmni-VL targets unified understanding/generation. Toto, Toto 2.0, and ChronoGraph add the observability version of the tension: passive metric forecasts are valuable, but they are not action-conditioned world models unless operator actions, control inputs, or interventions are modeled directly. A model can be a strong forecaster without being a strong reasoning or control model, and a reasoning model can fail numerical fidelity.

Benchmark Leaderboards Versus Mismatched Protocols

Time-series benchmark comparisons often mix zero-shot base checkpoints, few-shot adaptation, fine-tuning, frozen classifiers, retrieval augmentation, and ensembles. Toto 2.0 reports base, fine-tuned, and ensemble entries; TabPFN-3 reports static TabArena, API/Thinking, and specialized TabPFN-TS-3 entries; MantisV2, UniShape, and TiViT use classification protocols with downstream classifiers or fused features; TiRex and Chronos-2 discuss overlap and leakage controls; TimeRAF adds a retrieval-specific axis where zero-shot base-model scores are not directly comparable with retrieval-augmented scores unless the retrieval knowledge base provenance, near-neighbor overlap, and domain-specific train-split use are audited; U-Cast adds channel count and high-dimensional dependency structure as benchmark axes. Sundial separately reports zero-shot and once-fine-tuned FEV settings, so its leaderboard claims should not be merged without adaptation-mode labels. π0.7 adds the robotics analogue: in very broad robot corpora, claims about unseen tasks or unseen task-robot combinations are hard to verify because related skills may appear under different labels or as incidental sub-behaviors. VLA-JEPA adds another robotics protocol warning: its headline comparisons mix methods with different pretraining and fine-tuning data regimes, and its no-human-video ablation is close to or better than the full model in some SimplerEnv cells while human videos help LIBERO-Plus robustness. Grid2Op adds the energy-control version: challenge score and survival do not by themselves resolve preventive N-1 robustness, corrective response, runtime, simulator-query budget, or continuous-optimization quality. World Model for Robot Learning Survey adds the embodied-world-model version: LIBERO, RoboTwin, CALVIN, and SIMPLER-style results differ by embodiment, action space, task composition, and protocol, so survey tables should be treated as design-map evidence rather than one unified leaderboard. 
Leaderboard rank should be treated as provisional unless the task, horizon, context length, covariates, metrics, channel count, dependency structure, adaptation mode, retrieval policy, control objective, and pretraining-overlap policy match.
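One way to operationalize this matching rule is to attach a protocol record to every leaderboard entry and refuse to rank entries whose records differ. The sketch below is only an assumption about how such a record might look: the `Protocol` fields are a subset of the axes listed above, and the two example entries are invented.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Protocol:
    """Hypothetical evaluation-protocol record for one leaderboard entry."""
    task: str
    horizon: int
    context_length: int
    adaptation_mode: str      # e.g. "zero-shot", "fine-tuned", "ensemble"
    retrieval_policy: str     # e.g. "none", "retrieval-augmented"
    overlap_policy: str       # how pretraining overlap was audited

def mismatched_fields(a, b):
    """Names of protocol fields on which two entries differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]

# Invented example entries, not real benchmark results.
base = Protocol("forecast", 96, 512, "zero-shot", "none", "audited")
augmented = Protocol("forecast", 96, 512, "fine-tuned",
                     "retrieval-augmented", "audited")

mismatches = mismatched_fields(base, augmented)
comparable = not mismatches

print("comparable:", comparable)
print("mismatched fields:", mismatches)
```

Listing the mismatched fields, rather than returning only a boolean, keeps the reason for non-comparability visible next to the scores.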

Aionoscope, LeNEPA, and SensorFM add local protocol warnings. Aionoscope is a public validation-seed diagnostic snapshot, not a hidden leaderboard or real-task transfer claim; LeNEPA is a fixed-recipe stress test, not a proof that every Diag-tuned JEPA recipe is weaker. SensorFM is a private wearable-health benchmark stack where frozen probes, demographic features, engineered-feature baselines, and agentic codegen heads should not be collapsed into one leaderboard claim. Future TSFM benchmark pages should separate public versus hidden streams, categorical versus dense targets, layer-selection budget, probe capacity, fixed-recipe reuse, and fully tuned benchmark comparisons.

Open Robotics Artifacts Versus Private Frontier Evidence

Gemini Robotics 1.5 reports strong embodied-reasoning, Motion Transfer, and long-horizon agent results, but the models and detailed architecture are private. Helix 02 is even more demonstration-heavy, with no public weights, dataset, ablations, or failure-rate protocol. GR00T N1 and OpenVLA provide stronger public artifact surfaces. The wiki should distinguish frontier capability signals from reproducible open baselines when synthesizing robotics progress.

Curated Post-Training Versus Metadata-Conditioned Mixed-Quality Robot Data

π0 emphasizes a pretraining/post-training recipe where high-quality curated data induces reliable dexterous behavior, while π0.7 argues that episode metadata can make larger mixed-quality data, failures, and autonomous rollouts useful instead of harmful. The wiki should treat data filtering and metadata-conditioned reuse as competing or complementary recipes, not one settled rule.

Quality Filtering Versus Distribution Support

A Bitter Lesson for Data Filtering argues that standard text quality filters can help at small compute but lose to unfiltered Common Crawl when model size and training steps are large enough. No Filter shows a related VLM failure mode: English-only filtering improves common Western-centric benchmarks while harming cultural and socioeconomic coverage. The unresolved tension for time-series and world-model training is whether filtering should remove low-information redundancy and corruption, or whether it silently deletes rare regimes, tail tenants, intervention windows, and natural diversity that larger models could use. The safe synthesis is to compare no-filter, loose-filter, dedup-only, and dynamic-reweighting baselines under matched compute and tail-slice probes before calling filtering beneficial.

RL Post-Training Versus Black-Box Population Search

OpenAI ES 2017 showed that ES can trade data efficiency for parallel rollout throughput in policy optimization. Evolution Strategies at Scale argues that this trade now matters for LLM post-training because response-level rewards, long horizons, and reward hacking can make RL brittle. Evolution Strategies at the Hyperscale strengthens the systems case through low-rank perturbations. Evolutionary Strategies lead to Catastrophic Forgetting in LLMs adds the counterclaim that ES can match new-task reward while degrading prior capabilities through dense, high-norm parameter drift. Reinforcement Learning Finetunes Small Subnetworks sharpens the gradient-RL side of the comparison: in the tested LLM post-training settings, RL updates are sparse, full-rank, and broadly distributed across layers, with in-distribution training data as a likely driver. FADE adds a non-LLM continual-learning counterpoint: forgetting can be useful when the decay policy is selective and tied to non-stationarity. The open tension is whether LLM post-training can make forgetting controllable rather than merely smaller, and whether sparse gradient-RL updates remain sparse under longer, harder, or more out-of-distribution objectives.

SFT Generalization Versus Weight-Update Conservatism

Dynamic Fine-Tuning reframes SFT as an RL-like update with an inverse-probability weight, then improves generalization by removing the extreme gradient amplification of low-probability expert tokens. Reinforcement Learning Finetunes Small Subnetworks adds the measured weight-space contrast: SFT updates are dense in the paper’s comparisons, while RL often changes only a sparse but full-rank subnetwork when data is close to the policy distribution. RLPT pushes the other direction by applying RL-style training-time scaling to pre-training text through next-segment rewards, but it shifts risk into the segmenter and generative reward model. This complements the ES forgetting critique: ES may move too many weights too strongly, DFT may move too conservatively on rare or unfamiliar targets, RL may preserve priors through sparse updates, and RLPT may make unlabeled data trainable only when its reward proxy is faithful. The unresolved tension is how to choose post-training objectives that gain target behavior while controlling parameter drift, retaining prior capabilities, and still learning rare new behavior.
