Contradictions And Open Tensions
Dataset License Notes Need Pinned Artifacts
MicroSS has conflicting upstream license signals: the repository LICENSE file is GPL-2.0, while the README license section says Apache 2.0. LEMMA-RCA has a similar mismatch: the website and README license section say CC BY-ND 4.0, while Hugging Face metadata and one README paragraph say CC BY-NC 4.0. The wiki should avoid giving reuse advice for these datasets until a pinned release artifact or maintainer confirmation resolves the license terms.
Final Embedding Versus Best Transfer State
Guillotine Regularization and Perception Encoder both warn that a model’s final output can be worse than an intermediate state for downstream transfer. This complicates source pages that summarize a model as “having strong representations”: the wiki should distinguish last-layer embeddings, best-layer embeddings, and aligned outputs. For time-series and world-model sources, this is an open tension around whether reported embeddings preserve downstream-relevant dynamics or merely serve the pretraining head.
Residual Accumulation Versus Depth Retrieval
MoDA argues that the standard residual stream is an accumulation interface: earlier signals remain present only through a growing blended state. RAEv2 shows that fixed multi-layer aggregation can be useful for reconstruction and generation, but its X discussion also notes that residual-stream summation is a fixed depth-weighting scheme, not an optimal retrieval mechanism. mHC adds a third route: widen the residual stream into parallel streams and constrain residual mixing, rather than only accepting fixed accumulation or adding depth-KV retrieval. Hyperloop Transformers applies that residual-state-capacity idea at loop boundaries. The unresolved comparison is now among fixed aggregation, learned layer weights, sparse selection, looped refinement, content-based depth retrieval, and constrained matrix residual state, all under matched memory bandwidth, cache, latency, and FLOPs budgets. For the wiki’s time-series agenda, this remains an architecture analogy until a source tests whether these mechanisms preserve numeric state, rare regimes, actions, or intervention effects under matched memory and latency budgets.
Semantic Latents Versus Pixel Fidelity
Reconstruction or Semantics? argues that semantic latent spaces can be more policy-relevant for robotic world models than reconstruction-focused latents. World Models is the older version of the same tension: its standalone VAE could preserve irrelevant visual detail while missing task-relevant features. The Prism Hypothesis, Tuna-2, and Gemma 4 12B complicate that: Prism frames semantic and pixel encoders as occupying different frequency bands, Tuna-2 argues that end-to-end pixel embeddings can beat pretrained vision encoders for unified multimodal understanding and generation, and Gemma 4 12B turns encoder-free image/audio routing into an open-weight production release. Gemma 4 12B should not be overread as “no input frontends”: it still uses lightweight patch/waveform projection before the shared decoder backbone. RAEv2 adds another route: keep pretrained semantic encoders, but aggregate multiple layers to recover local detail and improve generation/rollout behavior. A separately linked RAEv2 X discussion, not the paper itself, further complicates the mechanism because residual-stream summation acts like a fixed layer-depth weighting, not a proven optimal fusion rule. Self-Teaching Autoencoder adds a lower-evidence route: train the decoder inside the latent objective through transformed self-consistency, so reconstruction grounding is not bolted onto a pretrained latent afterward. Because that evidence is a blog/code/demo snapshot, it should be treated as a mechanism hypothesis rather than a settled resolution of the semantic-versus-fidelity tension. The wiki should not collapse these into one rule; the right latent appears task-dependent, and the layer aggregation, projection-frontend, or decoder-grounding interface is itself part of the unresolved tension.
Learned Simulator Training Versus Simulator Exploitation
World Models shows both sides of learned latent environments: a controller trained in a learned DoomRNN dream can transfer back to VizDoom, but low-uncertainty or imperfect dreams let controllers discover policies that exploit model errors and fail in the real environment. Newer action-conditioned world-model claims should therefore separate model-rollout score, uncertainty handling, and real-environment or live-stand transfer.
Genie adds a newer generated-interactive-environment case: it shows controllable visual rollouts from learned latent actions, but the paper also reports unrealistic hallucinated futures, 16-frame memory, about 1 FPS interaction, and no large-model or training-data release. Generated playability should therefore remain separate from evidence that agents trained inside the simulator transfer robustly.
stable-worldmodel adds a current reproducibility version of the same tension. It standardizes data access, solvers, and factor-of-variation sweeps, but its Push-T analyses still require prediction error and planning success to be reported separately under distribution shift. Better infrastructure lowers ambiguity; it does not remove the learned-simulator exploitation problem.
World Model for Robot Learning Survey broadens this into the current robotics benchmark taxonomy: open-loop action-conditioned video quality, closed-loop policy utility, and physical/executability diagnostics are separate evidence layers. Visual plausibility should not be treated as simulator-transfer evidence.
On Training in Imagination adds that a learned simulator’s score often depends on a separate learned reward model. A controller can fail because the dynamics model is wrong, because the reward model is wrong, because reward labels are too noisy for the rollout budget, or because reward labels are systematically biased. These are distinct failure modes; averaging more rollouts can reduce zero-mean reward noise but cannot remove reward bias.
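The noise-versus-bias distinction above can be illustrated with a small simulation. This is a minimal sketch, not the source's experiment: the return, bias, and noise values are hypothetical, and the reward model is reduced to "true return plus a constant bias plus zero-mean Gaussian noise".

```python
import random
import statistics


def estimate_return(true_return, bias, noise_std, n_rollouts, rng):
    """Average n_rollouts scores from a biased, noisy (hypothetical) reward model."""
    scores = [
        true_return + bias + rng.gauss(0.0, noise_std)
        for _ in range(n_rollouts)
    ]
    return statistics.fmean(scores)


rng = random.Random(0)
TRUE_RETURN = 1.0  # hypothetical ground-truth return of one policy
BIAS = 0.5         # hypothetical systematic reward-model bias
NOISE_STD = 2.0    # hypothetical zero-mean label noise

errors = {}
for n in (10, 1_000, 100_000):
    estimate = estimate_return(TRUE_RETURN, BIAS, NOISE_STD, n, rng)
    errors[n] = estimate - TRUE_RETURN
    print(f"rollouts={n:>6}  estimation error={errors[n]:+.3f}")
```

Under these assumptions the estimation error settles near the bias as the rollout count grows, rather than near zero: the noise term shrinks roughly as one over the square root of the rollout count, while the bias term is untouched.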
Digital World Models Versus Operations World Models
Agentic World Modeling usefully defines digital world models around software laws: API contracts, UI state machines, file-system logic, permissions, type constraints, and replayable error branches. This is close to CWM, but it should not be collapsed into the operations target. A web, GUI, or code simulator can satisfy digital-world constraints while still missing numeric telemetry, graph time series, event streams, delayed interventions, hidden concurrent users, failed-action semantics, rewards, and human-approval actions. The wiki should use digital world models as an architectural precedent for action/state/constraint contracts, not as evidence that SRE action-conditioned world modeling is solved.
Raw Trajectories Versus Summary Interfaces
Scaling Test-Time Compute for Agentic Coding shows that compact structured summaries can beat raw action-observation traces for coding-agent selection and reuse. But LLM Agents Need Action-Conditioned World Models and Hierarchical Modeling with a Fixed FLOPs Budget warn that compression can erase timing, magnitude, topology, failed-action status, uncertainty, rare events, and intervention effects. The unresolved tension is how to design summary schemas and preservation probes that reduce trace noise without destroying the state needed for control.
Objective-Relevant Compression Versus Decision-Relevant State
Learning is Forgetting argues that LLM training can improve representations by forgetting input detail that is irrelevant to next-sequence prediction. The fixed-FLOPs and world-model pages need the opposite guardrail: if the objective does not include rare events, dense numeric fidelity, topology, actions, or intervention effects, compression can erase exactly the state a time-series or action-conditioned world model needs. JEPA Slow Features is the adjacent failure mode: predictive objectives can preserve the wrong factors even without constant collapse. The unresolved question is how to make compression accountable to downstream state and control objectives, not only average prediction loss.
TurboQuant adds a lower-level vector-compression example: even when a quantizer is optimized for MSE, inner-product estimates can be biased, so the paper spends a residual QJL bit to preserve scoring geometry. The later vLLM critique adds a second unresolved layer: preserving quality and reducing KV-cache bytes can still lose to FP8 once hardware-native attention, dequantization overhead, latency, throughput, and burst-load behavior are counted. That strengthens rather than resolves the tension. TSFM compression should declare whether it preserves reconstruction, inner products or retrieval, downstream prediction, anomaly sensitivity, or control value, and whether the compressed representation improves the actual serving contract.
Latent Dynamics Versus Governed Law Revision
LeCun APTAMI, LeWorldModel, and the broader latent-prediction thread make a strong case for learned latent dynamics as the substrate for prediction and planning. Agentic World Modeling adds a stricter L3 boundary: revising the model when evidence contradicts predictions. For operational and scientific settings, that likely requires explicit, auditable, revisable constraints, tests, data-version records, or symbolic assets around the latent model. The unresolved question is whether latent dynamics alone can support governed model revision, or whether L3 needs a hybrid latent-plus-symbolic control plane.
Heuristic-Free JEPA Versus Stabilized Predictive Training
LeJEPA, LeJEPA Identifiability, and LeWorldModel emphasize Gaussian/SIGReg-style regularization as a path away from teacher-student, stop-gradient, and schedule heuristics. NEPA still uses causal masking and stop-gradient for next-embedding visual prediction. Self-Teaching Autoencoder complicates this further: SIGReg is used, but the blog still needs transformation constraints and step-frozen judging to avoid private-language and artifact-invariance shortcuts. LeJEPA Identifiability sharpens the positive case under Gaussian/OU assumptions, while also making the caveat more explicit: non-Gaussian or policy-shaped trajectories can produce distorted non-collapsed states. The open question is whether Gaussian regularization can replace these stabilizers across large-scale vision and multimodal settings.
Predictable Latents Versus Task-Relevant Dynamics
Joint Embedding Predictive Architectures Focus on Slow Features shows that latent prediction can select fixed or slow distractors instead of action-relevant state, even when the embedding is not constant-collapsed. LeJEPA Identifiability adds the conditional positive side: under Gaussian/OU latents and successful whitening, predictability can recover true latent state up to rotation. The tension is now diagnostic: distinguish benign slow state from slow nuisance features or policy-shaped non-Gaussian marginals. For time-series JEPA, the unresolved tension is how to preserve slow regime context without letting static exogenous variables, identifiers, or sensor artifacts dominate the latent state.
Uniformity Prevents Collapse Versus Long-Tailed Reality
The Hidden Uniform Cluster Prior in Self-Supervised Learning argues that several SSL anti-collapse mechanisms impose a hidden uniform feature or cluster prior. LeJEPA Identifiability gives the complementary positive case for a Gaussian or whitening prior when the latent world really is close to Gaussian and isotropic. This is still in tension with naturally long-tailed temporal data, where rare events, regimes, interventions, and treatments may matter precisely because they are not uniform. Gaussian or uniform-looking target distributions can be useful stabilizers, but the wiki should treat them as priors whose fit to the domain and data-collection policy must be tested.
Tokenizer Removal Has Multiple Incompatible Paths
H-Net learns hierarchical byte chunking end to end, Synergy learns routing over byte-level abstraction, Bolmo byteifies existing subword LMs through distillation, ConceptMoE compresses token streams into concepts inside an MoE, and Compute Optimal Tokenization treats compression rate as a scaling-law variable. These are not the same claim. They agree that fixed tokenization is limiting, but disagree on whether the future is byte-level modeling, learned chunking, concept-level compute allocation, transfer from subword models, or compression-aware training recipes. The wiki should compare them under explicit bytes-per-parameter, inference-FLOPs-per-byte, vocabulary/embedding-cost, and preservation-probe budgets rather than treating “fewer tokens” as automatically better.
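One way to make the budget comparison above concrete is to normalize every approach to per-byte quantities. The sketch below is an assumption-laden illustration, not a method from any of the cited sources: it uses the common rule of thumb of about two forward-pass FLOPs per non-embedding parameter per token, ignores attention cost, and the two example configurations are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizationBudget:
    """Hypothetical record for comparing tokenization schemes per byte."""
    name: str
    params: float           # non-embedding parameters
    bytes_per_token: float  # mean compression rate of the tokenizer or chunker
    train_bytes: float      # raw training bytes

    @property
    def inference_flops_per_byte(self) -> float:
        # Rule of thumb (assumption): ~2 FLOPs per parameter per token.
        return 2.0 * self.params / self.bytes_per_token

    @property
    def bytes_per_parameter(self) -> float:
        return self.train_bytes / self.params


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Input-embedding cost; an untied output head would double this."""
    return vocab_size * d_model


# Hypothetical configurations with identical backbones and training bytes.
byte_level = TokenizationBudget("byte-level", 1e9, 1.0, 2e11)
subword = TokenizationBudget("subword", 1e9, 4.0, 2e11)

for budget in (byte_level, subword):
    print(budget.name, budget.inference_flops_per_byte, budget.bytes_per_parameter)
```

The point of the normalization is that "fewer tokens" shows up as lower FLOPs per byte only if the backbone is held fixed; learned chunkers and concept-level routing add their own compute, which a fuller accounting would need to include.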
ELF adds a different route: keep tokenization and final vocabulary decoding, but move the generative trajectory into continuous contextual embedding space and discretize only at the final step. This complicates any simple tokenizer-removal story. For multimodal time-series models, the question may be less “tokens or no tokens” and more “which parts of the pipeline should remain continuous until the final readout?”
Energy-Based Verification Versus Many-Mode Futures
Energy-Based Transformers strengthens the case that explicit energies can provide candidate verification and dynamic compute, but it also reports extra gradient cost, sensitive optimization hyperparameters, and many-mode failures where convex energy-landscape training can merge nearby modes. This complicates the optimistic EBM/world-model thread from APTAMI and LVEBM: energy scores may be useful for ranking candidate futures or actions, but the wiki should not treat them as calibrated multi-future or action-conditioned control evidence until numeric time-series or control benchmarks test that directly.
Fourier Spectra, Functional Geometry, And Bit-Level Number Encodings
FoNE argues that digit-aligned Fourier features can encode numbers compactly as single tokens and improve arithmetic efficiency. Convergent Evolution adds a sharper warning: Fourier spikes can be universal across model families and even raw number-token frequencies while still failing to produce linearly usable modular geometry. BitTokens argues that sinusoidal number encodings are not general-purpose because multiplication and division force non-local decoding, computation, and re-encoding, while IEEE 754 bit features expose arithmetic structure directly. Pre-trained Large Language Models Use Fourier Features To Compute Addition supports the mechanistic basis for Fourier features in addition, but does not settle whether Fourier, bit-level, logarithmic, or hybrid encodings are best for auxiliary numeric values in time-series models.
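The contrast between the two encoding families can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of FoNE or BitTokens: `digit_fourier_features` uses one cosine/sine pair per power-of-ten period, which is only in the spirit of digit-aligned Fourier features, and `float64_bits` simply exposes the standard IEEE 754 binary64 layout.

```python
import math
import struct


def float64_bits(x: float) -> str:
    """IEEE 754 binary64 bit string: 1 sign, 11 exponent, 52 mantissa bits."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return format(as_int, "064b")


def split_fields(bits: str):
    """Return (sign, exponent, mantissa) substrings of a 64-bit string."""
    return bits[0], bits[1:12], bits[12:]


def digit_fourier_features(x: int, n_digits: int):
    """One (cos, sin) pair per period 10, 100, ... (illustrative assumption)."""
    feats = []
    for i in range(1, n_digits + 1):
        angle = 2.0 * math.pi * x / 10 ** i
        feats.append((math.cos(angle), math.sin(angle)))
    return feats


def units_digit_from_features(feats) -> int:
    """Decode the units digit from the period-10 pair."""
    cos_v, sin_v = feats[0]
    phase = math.atan2(sin_v, cos_v) % (2.0 * math.pi)
    return round(phase / (2.0 * math.pi) * 10) % 10


# Doubling a float changes only the exponent field in the bit encoding.
_, exp3, man3 = split_fields(float64_bits(3.0))
_, exp6, man6 = split_fields(float64_bits(6.0))
exponent_step = int(exp6, 2) - int(exp3, 2)

# The period-10 Fourier pair recovers the units digit of an integer.
recovered = units_digit_from_features(digit_fourier_features(47, 3))
print(exponent_step, man3 == man6, recovered)
```

The bit layout makes multiplication by a power of two a local edit to the exponent field, whereas the sinusoidal features decode digits cleanly but give no comparably local handle on multiplication. That is consistent with the disagreement described above, though it does not settle which encoding suits auxiliary numeric values in time-series models.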
TabM adds a separate axis: typed tabular numeric-feature embeddings are column-specific and often depend on feature-specific bins, frequencies, or preprocessing. Piecewise-linear and periodic feature embeddings can be useful for static tabular prediction, but they do not settle universal text-number tokenization or exact arithmetic. The wiki should keep typed feature embeddings distinct from free-standing number tokens.
Synthetic Time-Series Data Is Promising But Not Sufficiently Settled
CauKer, MantisV2, Chronos-2, TabPFN-3, TempoPFN, Toto 2.0, ChatTS, TimeOmni-1, TimeOmni-VL, and T2S all use synthetic data or synthetic annotation, but they target different bottlenecks: classification label coverage, grouped/covariate forecasting, PFN-style learned inference, benchmark-leakage control, time-series-language alignment, reasoning supervision, generation fidelity, and text-conditioned annotation over real temporal fragments. T2S further splits the category: TSFragment-600K is synthetic annotation over real time-series fragments, so its risk is caption-model bias and utility of the text-conditioning signal, not only simulator realism. The open tension is whether synthetic data quality, reasoning annotations, transfer validity, text-conditioning utility, or representation fidelity is the dominant constraint.
Scaling Laws Versus Tiny Efficient Models
Toto 2.0 reports monotonic gains across a 4M to 2.5B parameter family, while Time-MoE and Moirai-MoE argue for sparse capacity scaling. Tiny Time Mixers, Reverso, Kairos, TiRex, and Moirai 2.0 complicate the story by reporting strong results from compact or specialized backbones. Sundial sharpens the ambiguity because it argues that TimeFlow mitigates mode collapse while its TimeBench data-scale experiments also show gains from larger corpora. The wiki should not treat parameter count as the single axis of progress, and should keep objective design, corpus scale/cleaning, and backbone/inference engineering separable.
Full-Batch Sharpness Versus Batch Sharpness
SGD at the Edge of Stability complicates the common “sharpness at 2/η” shorthand, where η is the learning rate. In mini-batch SGD, full-batch sharpness can settle below the full-batch gradient-descent threshold because projected gradient noise changes the self-stabilizing oscillation, while batch sharpness is the quantity that approaches the stochastic stability edge. For wiki synthesis, any claim about sharpness, flatness, or batch-size-dependent generalization should name the sharpness protocol, optimizer, loss, and batch regime rather than treating sharpness as one universal scalar.
The same source reports a CNN with cross-entropy on a small dataset that does not enter the edge-of-stability regime, so the tension applies only when the training regime sustains progressive sharpening, not to all SGD runs.
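For orientation, the full-batch threshold mentioned above comes from the standard quadratic analysis: gradient descent with learning rate η on a quadratic with curvature λ multiplies the iterate by (1 - ηλ) each step, so it contracts only when λ is below 2/η. The sketch below checks that on a one-dimensional quadratic. It illustrates only the full-batch threshold, not batch sharpness or the mini-batch effect the source describes, and the learning rate is a hypothetical value.

```python
def gd_on_quadratic(sharpness: float, lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f is sharpness * x
    return abs(x)


LR = 0.1                 # hypothetical learning rate
threshold = 2.0 / LR     # full-batch stability threshold on a quadratic

below = gd_on_quadratic(0.95 * threshold, LR)  # curvature just under 2/lr
above = gd_on_quadratic(1.05 * threshold, LR)  # curvature just over 2/lr
print(f"below threshold: |x| = {below:.4f}")
print(f"above threshold: |x| = {above:.1f}")
```

Just under the threshold the iterate oscillates and shrinks; just over it the iterate oscillates and grows. Real networks are not quadratics, which is why the protocol and regime caveats above matter.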
Looped Depth And Test-Time Memory Versus Matched-Budget Baselines
Universal Transformers, Huginn, Latent Thoughts, LoopFormer, Parcae, Sparse Layers are Critical to Scaling Looped Language Models, and ELT make the positive case that repeated depth can provide useful effective computation with shared parameters. The tension is that the comparison target keeps moving: a looped block can be compared against a shallower model, a deeper unique-weight model, a wider model with the same parameters, or a model matched on expected training FLOPs and serving latency. Parallel Samplers and The Recurrent Transformer add that the budget must include hardware throughput, key-value cache traffic, and decoding schedule, not only nominal loop count.
Titans and ATLAS add explicit memory as another resource, while MIRAS and MesaNet make attentional bias, retention gates, and solver or optimization steps part of the resource accounting. Titans Revisited and Universal Transformers Need Memory sharpen the caution. More loops, more memory slots, more retention structure, and more optimization steps can substitute for one another, but only under a declared budget. Titans reports broad long-context gains, while Titans Revisited shows that under-specified implementation details, chunking, and frozen-backbone memory updates can narrow or reverse the claim. The wiki should not merge these into a generic “more thinking helps” claim.
mHC and Hyperloop Transformers add matrix-valued residual streams as another state-capacity resource. Hyperloop reports parameter-memory gains and INT4 robustness, but this does not settle whether residual-stream width, explicit memory tokens, depth-KV retrieval, extra loops, or unique layers best preserve useful temporal state under matched latency, memory bandwidth, KV/cache, and expected-FLOPs budgets.
RMT, ARMT, and RATE add the segment-level recurrent-memory version of the same tension. RMT can match Transformer-XL-style cache with fewer carried vectors in reported language-modeling settings, but the cost moves to sequential segment processing, memory capacity, and BPTT stability. ARMT improves recall/rewrite and BABILong long-context QA, but remains sequential across segments and does not yet show a clean language-modeling or numeric time-series transfer win under matched serving budgets. RATE adds the action-trajectory variant: cached hidden states, explicit memory tokens, and MRV-style filtering can trade off differently across dense-feedback and sparse-cue tasks. The wiki should compare memory tokens, KV cache, SSM/RWKV/Mamba state, test-time memory, and retention valves under explicit latency, memory, and training-budget constraints.
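A rough accounting helper shows how the cost moves rather than disappears. This is a back-of-the-envelope sketch under stated assumptions: the KV-cache formula is the usual keys-plus-values count for a single sequence, memory tokens are treated as plain vectors carried between segments, and every configuration value is hypothetical rather than taken from RMT, ARMT, or RATE.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys plus values cached at every layer and position for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def memory_token_bytes(n_memory_tokens, d_model, bytes_per_elem=2):
    """State carried between segments when memory is a set of token vectors."""
    return n_memory_tokens * d_model * bytes_per_elem


def sequential_segment_steps(seq_len, segment_len):
    """Segments that must be processed one after another (ceiling division)."""
    return -(-seq_len // segment_len)


# Hypothetical 16-bit configuration; not drawn from any cited paper.
kv = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=128, seq_len=32_768)
mem = memory_token_bytes(n_memory_tokens=16, d_model=1024)
steps = sequential_segment_steps(seq_len=32_768, segment_len=512)

print(f"KV cache:      {kv / 2**30:.1f} GiB")
print(f"memory tokens: {mem / 2**10:.1f} KiB carried across {steps} sequential segments")
```

The carried state is orders of magnitude smaller in this toy setting, but the segment count is a serial dependency that a latency budget has to absorb. A matched comparison would also need activation memory, BPTT depth, and throughput, which this sketch omits.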
Language Models Need Sleep adds another axis to this tension: compact SSM fast weights may have enough capacity but still fail because the model has not spent enough computation consolidating evicted context into a reasoning-usable state. Its sleep phase improves synthetic and GSM-Infinite reasoning by looping before cache eviction, but this shifts cost into consolidation and training rather than eliminating it. The wiki should separate memory capacity, memory update compute, prediction latency, and training stability.
Dragon Hatchling adds a core-fast-state variant: memory is not an add-on module or segment token block, but a large recurrent state comparable to the model’s weight matrices. Matched-budget comparisons should count state size, update bandwidth, sparsity assumptions, BPTT depth, and serving latency before comparing it with KV cache, SSM fast weights, memory tokens, or test-time memory modules.
The recursive-reasoning descendants also disagree on mechanism. HRM attributes gains to fast/slow hierarchy plus deep supervision; TRM argues that a simpler answer-refinement loop can outperform the hierarchy; URM attributes ARC/Sudoku gains mainly to UT-style recurrence and nonlinear Transformer components; Universal Transformers Need Memory finds memory tokens necessary only for its single-block UT+ACT setting, while noting that HRM, TRM, and URM solve Sudoku through other state mechanisms.
Architecture Evidence Versus Biological And Reasoning Narrative
Dragon Hatchling is a useful architecture source for sparse positive recurrent fast state, language/translation comparisons, and synapse-level probes. Its public narrative is broader: the paper frames BDH as a bridge between Transformers and brain models, while Pathway’s later Sudoku blog reports a strong internal Sudoku Extreme result. The tension is that these claims have different evidential status. The wiki should cite BDH as architecture evidence, record the Sudoku result as official but not open-reproduced in the public repository, and avoid treating biological plausibility or puzzle performance as direct evidence for multivariate time-series state tracking, event streams, or action-conditioned world models.
Block-Wise Isolation Versus End-To-End Coordination
DiffusionBlocks argues that residual networks can be split into independently trainable diffusion-style denoising blocks, reducing training memory while matching end-to-end training on its reported tasks. This creates a useful tension with the usual end-to-end coordination story: local objectives may be principled, but they still need to preserve global sequence state, rare regimes, action effects, and pretrained representations. For private or company-local adaptation, block isolation is appealing because it suggests new training boundaries, but gradients or update deltas can still leak data. The wiki should treat block-wise training as a promising systems interface, not as settled evidence that local objectives, privacy, and global model quality are simultaneously solved.
For the company-local adaptation research direction, see Company-Local Block-Wise Fine-Tuning.
Linear Scan Efficiency Versus Nonlinear Recurrent Dynamics
Mamba, Mamba-2, and Mamba-3 make compact recurrent state practical by staying in structured linear hidden-state-update families. ParaRNN challenges the implied boundary: nonlinear GRU/LSTM-style recurrence can also be trained in parallel at billion-parameter scale when the hidden trajectory is solved through Newton iterations and parallel reduction. The unresolved tension is whether nonlinear recurrent dynamics are now broadly practical, or only practical for carefully structured cells with fast Newton convergence.
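The reason linear updates parallelize can be shown with a scalar recurrence h_t = a_t * h_{t-1} + b_t. Each step is an affine map, and composing affine maps is associative, so the whole sequence can be reduced pairwise in a tree. The sketch below is a toy scalar illustration with hypothetical coefficients, not the kernel of any Mamba variant or of ParaRNN.

```python
def combine(left, right):
    """Compose two affine updates h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential(steps, h0):
    """Reference: apply the updates one at a time."""
    h = h0
    for a, b in steps:
        h = a * h + b
    return h


def tree_reduce(steps):
    """Pairwise reduction; the combines within each level are independent."""
    steps = list(steps)
    while len(steps) > 1:
        paired = [combine(steps[i], steps[i + 1]) for i in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:
            paired.append(steps[-1])  # odd element carries to the next level
        steps = paired
    return steps[0]


# Hypothetical per-step coefficients (a_t, b_t).
updates = [(0.5, 1.0), (2.0, -1.0), (0.25, 3.0), (1.5, 0.5), (0.5, 2.0)]
h0 = 2.0

a_total, b_total = tree_reduce(updates)
print(sequential(updates, h0), a_total * h0 + b_total)
```

A nonlinear cell such as h_t = tanh(a_t * h_{t-1} + b_t) has no such closed-form composition, which is why the source describes ParaRNN as solving the hidden trajectory through Newton iterations, each of which reduces to a linear problem of this kind.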
Univariate Simplicity Versus Native Multivariate Dynamics
Many forecasting models simplify heterogeneous pretraining by treating each channel as a univariate series, including Timer, Time-MoE, Sundial, Kairos, Reverso, and TiRex. Chronos-2, Moirai, Tiny Time Mixers, and Toto make stronger claims about covariates, grouped series, channel mixing, or factorized variate attention. U-Cast sharpens the conflict by arguing that low-channel benchmarks are a poor test of the channel-dependence question and that high-dimensional settings expose different scalability and structure-learning requirements. The tension is whether broad transfer is better served by simple channel-independent modeling or by native multivariate structure, and whether the answer changes once the channel count reaches the HDTSF regime.
Forecasting Accuracy Versus Reasoning And Control
Eidos argues for latent-space predictive learning for robust forecasting, while TimeOmni-1 targets explicit reasoning and TimeOmni-VL targets unified understanding/generation. Toto, Toto 2.0, and ChronoGraph add the observability version of the tension: passive metric forecasts are valuable, but they are not action-conditioned world models unless operator actions, control inputs, or interventions are modeled directly. A model can be a strong forecaster without being a strong reasoning or control model, and a reasoning model can fail numerical fidelity.
Benchmark Leaderboards Versus Mismatched Protocols
Time-series benchmark comparisons often mix zero-shot base checkpoints, few-shot adaptation, fine-tuning, frozen classifiers, and ensembles. Toto 2.0 reports base, fine-tuned, and ensemble entries; TabPFN-3 reports static TabArena, API/Thinking, and specialized TabPFN-TS-3 entries; MantisV2, UniShape, and TiViT use classification protocols with downstream classifiers or fused features; TiRex and Chronos-2 discuss overlap and leakage controls; U-Cast adds channel count and high-dimensional dependency structure as benchmark axes. Sundial separately reports zero-shot and once-fine-tuned FEV settings, so its leaderboard claims should not be merged without adaptation-mode labels. π0.7 adds the robotics analogue: in very broad robot corpora, claims about unseen tasks or unseen task-robot combinations are hard to verify because related skills may appear under different labels or as incidental sub-behaviors. World Model for Robot Learning Survey adds the embodied-world-model version: LIBERO, RoboTwin, CALVIN, and SIMPLER-style results differ by embodiment, action space, task composition, and protocol, so survey tables should be treated as design-map evidence rather than one unified leaderboard. Leaderboard rank should be treated as provisional unless the task, horizon, context length, covariates, metrics, channel count, dependency structure, adaptation mode, and pretraining-overlap policy match.
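The matching rule in the last sentence above can be written as a small record type. This is a hypothetical schema for wiki bookkeeping, not a format used by any cited benchmark; the field names mirror the axes listed above and the example values are invented.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical protocol record attached to one leaderboard entry."""
    task: str
    horizon: int
    context_length: int
    covariates: bool
    metric: str
    channel_count: int
    dependency_structure: str
    adaptation_mode: str
    pretraining_overlap_policy: str


def protocol_mismatches(a: EvalProtocol, b: EvalProtocol) -> list:
    """Names of the protocol axes on which two entries differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Invented example entries.
base = EvalProtocol(
    task="forecasting",
    horizon=96,
    context_length=512,
    covariates=False,
    metric="MASE",
    channel_count=1,
    dependency_structure="independent",
    adaptation_mode="zero-shot",
    pretraining_overlap_policy="excluded",
)
tuned = replace(base, adaptation_mode="fine-tuned")

mismatches = protocol_mismatches(base, tuned)
print(mismatches or "comparable")
```

An empty list means the two ranks can be compared directly; any non-empty list is the set of labels that should accompany the comparison.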
Open Robotics Artifacts Versus Private Frontier Evidence
Gemini Robotics 1.5 reports strong embodied-reasoning, Motion Transfer, and long-horizon agent results, but the models and detailed architecture are private. Helix 02 is even more demonstration-heavy, with no public weights, dataset, ablations, or failure-rate protocol. GR00T N1 and OpenVLA provide stronger public artifact surfaces. The wiki should distinguish frontier capability signals from reproducible open baselines when synthesizing robotics progress.
Curated Post-Training Versus Metadata-Conditioned Mixed-Quality Robot Data
π0 emphasizes a pretraining/post-training recipe where high-quality curated data induces reliable dexterous behavior, while π0.7 argues that episode metadata can make larger mixed-quality data, failures, and autonomous rollouts useful instead of harmful. The wiki should treat data filtering and metadata-conditioned reuse as competing or complementary recipes, not one settled rule.
RL Post-Training Versus Black-Box Population Search
OpenAI ES 2017 showed that evolution strategies (ES) can trade data efficiency for parallel rollout throughput in policy optimization. Evolution Strategies at Scale argues that this trade now matters for LLM post-training because response-level rewards, long horizons, and reward hacking can make RL brittle. Evolution Strategies at the Hyperscale strengthens the systems case through low-rank perturbations. Evolutionary Strategies lead to Catastrophic Forgetting in LLMs adds the counterclaim that ES can match new-task reward while degrading prior capabilities through dense, high-norm parameter drift. FADE adds a non-LLM continual-learning counterpoint: forgetting can be useful when the decay policy is selective and tied to non-stationarity. The open tension is whether LLM post-training can make forgetting controllable rather than merely smaller, since FADE’s evidence is final-layer online learning rather than full-model LLM adaptation.
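For readers unfamiliar with the mechanism, the sketch below shows a vanilla antithetic ES update on a toy quadratic reward. It is a minimal illustration under stated assumptions, not the recipe of any cited paper: the reward function, target, and hyperparameters are hypothetical, and it omits the reward normalization and low-rank perturbations that the sources discuss.

```python
import math
import random


def reward(theta, target):
    """Toy black-box reward: negative squared distance to a target."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def es_step(theta, target, rng, sigma=0.1, lr=0.05, n_pairs=20):
    """One ES update from mirrored (antithetic) Gaussian perturbations."""
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = reward([t + sigma * e for t, e in zip(theta, eps)], target)
        minus = reward([t - sigma * e for t, e in zip(theta, eps)], target)
        scale = (plus - minus) / (2.0 * sigma * n_pairs)
        for i, e in enumerate(eps):
            grad[i] += scale * e  # reward-weighted perturbation
    return [t + lr * g for t, g in zip(theta, grad)]


rng = random.Random(0)
target = [1.0, -2.0, 0.5]     # hypothetical optimum
theta0 = [0.0, 0.0, 0.0]

theta = list(theta0)
for _ in range(200):
    theta = es_step(theta, target, rng)

initial_reward = reward(theta0, target)
final_reward = reward(theta, target)
drift = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, theta0)))
print(initial_reward, final_reward, drift)
```

Two properties are visible even in this toy: each update needs only scalar rewards from independent rollouts, which is what makes it easy to parallelize, and every coordinate of the parameter vector moves on every step, which is the dense-drift behavior the forgetting critique is concerned with.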
SFT Generalization Versus Weight-Update Conservatism
Dynamic Fine-Tuning reframes SFT as an RL-like update with an inverse-probability weight, then improves generalization by removing the extreme gradient amplification of low-probability expert tokens. This complements the ES forgetting critique: ES may move too many weights too strongly, while DFT may move too conservatively on rare or unfamiliar targets. The unresolved tension is how to choose post-training objectives that gain target behavior while controlling parameter drift and retaining prior capabilities.
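The reweighting can be written out at the token level. The notation below is ours and the form is our reading of the description above, so it should be checked against the source before being cited: $\pi_\theta$ is the model's token distribution, $y_t$ the expert token, and $\operatorname{sg}(\cdot)$ a stop-gradient.

```latex
% Standard SFT: the gradient carries an inverse-probability weight.
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_t \log \pi_\theta(y_t \mid y_{<t}, x),
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\sum_t \frac{1}{\pi_\theta(y_t \mid y_{<t}, x)}\,
    \nabla_\theta \pi_\theta(y_t \mid y_{<t}, x).

% Rescaling each token's loss by its own (stop-gradient) probability
% cancels that weight.
\mathcal{L}_{\mathrm{DFT}}(\theta)
  = -\sum_t \operatorname{sg}\!\big(\pi_\theta(y_t \mid y_{<t}, x)\big)\,
    \log \pi_\theta(y_t \mid y_{<t}, x),
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{DFT}}
  = -\sum_t \nabla_\theta \pi_\theta(y_t \mid y_{<t}, x).
```

In the first gradient, a token with probability near zero receives a very large weight. In the second, that amplification is gone, which also means rare or unfamiliar targets pull on the parameters less strongly. That is the conservatism side of the tension described above.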