Aionoscope Manifold Reconstruction Benchmark
Status: draft research idea for extending Aionoscope from linear probes to manifold-level representation diagnostics.
Collaboration
If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.
Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.
- Email: [email protected]
- X: @chemeris
- Telegram: @alexanderchemeris
Summary
Aionoscope currently asks whether a model representation preserves known latent variables of an observed synthetic process through linear probes. The next step is to reconstruct the latent-variable manifold inside the model representation and evaluate its geometry directly.
The central benchmark question is:
When the true data-generating latent variable has known geometry, does the model representation preserve that geometry in a usable form?
This moves the benchmark beyond “can a readout recover the factor?” A nonlinear probe can recover a factor from a representation that is geometrically tangled. Manifold reconstruction should ask whether the representation itself keeps the right shape: line, circle, torus, product space, branching regime graph, or a messy high-curvature embedding.
Research Spark
Goodfire’s neural-geometry series is the immediate spark. The strongest local lesson is not that time-series work should copy language-model steering directly. The transferable idea is that concepts can live on curved manifolds inside representation space, and that linear directions can miss or distort those structures.
Source status and credibility:
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior was submitted to arXiv on 2026-05-06 by Goodfire-affiliated authors with Stanford, Harvard, Northeastern, and Technion affiliations. It is a recent preprint, not peer reviewed, but credible enough as a timely research spark because it has a full arXiv paper and a Goodfire research post.
- Do Sparse Autoencoders Capture Concept Manifolds? was submitted to arXiv on 2026-04-30 by the same broader Goodfire neural-geometry cluster. It is also a recent preprint, useful for the “linear features versus curved manifolds” framing, but not yet peer-reviewed evidence.
- The Neural Geometry Series is a Goodfire research-blog series, useful for visual intuition and method framing rather than as a benchmark authority.
Core Idea
Aionoscope can generate synthetic time-series samples with known latent factors. That gives the benchmark an advantage over ordinary representation probing: it can densely sample the latent coordinates, sweep one factor at a time, and know the correct topology in advance.
For a synthetic generator

$$x = g(z),$$

where $z$ contains known latent variables such as frequency, phase, amplitude, trend, damping, noise level, regime, or intervention parameter, run a candidate model encoder

$$h = f_\theta(x)$$

and reconstruct the image of controlled latent sweeps in representation space:

$$\mathcal{M} = \{\, f_\theta(g(z)) : z \in \mathcal{Z}_{\text{sweep}} \,\}.$$

The benchmark then evaluates not only whether $z$ is recoverable from $h$, but whether $\mathcal{M}$ has the expected geometry.
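As a rough illustration of a controlled latent sweep, the sketch below builds a tiny sine generator, a stand-in encoder, and a phase sweep. Everything named here is an assumption for illustration: `generate`, `toy_encoder`, and `sweep` are hypothetical helpers, not Aionoscope APIs, and the Fourier-coefficient encoder only stands in for a real model so that the example runs without one.

```python
import numpy as np

def generate(z, n_steps=128):
    """Hypothetical noise-free sine generator; z = (amplitude, frequency, phase)."""
    amplitude, frequency, phase = z
    t = np.arange(n_steps) / n_steps
    return amplitude * np.sin(2 * np.pi * frequency * t + phase)

def toy_encoder(x, n_bins=8):
    """Stand-in for a model encoder (assumption): low-frequency Fourier coefficients."""
    spec = np.fft.rfft(x)[:n_bins] / len(x)
    return np.concatenate([spec.real, spec.imag])

def sweep(factor_index, values, base_z):
    """Vary one latent factor, hold the others at base_z, and encode each sample."""
    reps = []
    for v in values:
        z = list(base_z)
        z[factor_index] = v
        reps.append(toy_encoder(generate(z)))
    return np.stack(reps)

base_z = (1.0, 3.0, 0.0)  # amplitude, frequency (cycles per window), phase
phases = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
phase_manifold = sweep(2, phases, base_z)  # one row per point on the sweep
print(phase_manifold.shape)  # (64, 16)
```

With this particular toy encoder the phase sweep traces a closed loop of constant radius, which is the kind of object the later geometry and topology checks would be run on.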
Why This Is Stronger Than Linear Probing
Linear probes answer a narrow question:
Can a linear readout recover this latent variable from the representation?
That remains useful, especially for linearly identifiable world-model state. But it misses several important cases:
- a good cyclic variable may be represented as a circle rather than a line;
- a useful latent factor may be recoverable only by following a curved path;
- a factor may be present but geometrically tangled with nuisance variables;
- a high-scoring nonlinear probe may hide a representation that is unusable for steering, generation, or local interpolation;
- a model may preserve factor order but break topology, for example turning phase into a line with artificial endpoints.
The manifold benchmark should therefore report both readout recoverability and representation geometry. A model should get credit for preserving a factor, but extra credit for preserving it with the right topology and low distortion.
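The cyclic-variable case in the list above can be made concrete with a small experiment. The sketch below assumes a toy representation in which phase is stored as a perfect circle, linearly embedded in six dimensions; `linear_probe_r2` is a hypothetical helper based on ordinary least squares, not an existing Aionoscope function.

```python
import numpy as np

def linear_probe_r2(H, y):
    """R^2 of an ordinary least-squares readout (with intercept) from H to y."""
    X = np.column_stack([H, np.ones(len(H))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    return 1.0 - residual.var() / y.var()

rng = np.random.default_rng(0)
phase = np.linspace(0.0, 2 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(phase), np.sin(phase)])
H = circle @ rng.normal(size=(2, 6))  # toy representation: circle embedded in 6-D

# Probe the raw angle, then probe its circular coordinates.
r2_raw_phase = linear_probe_r2(H, phase)
r2_circular = min(linear_probe_r2(H, np.cos(phase)),
                  linear_probe_r2(H, np.sin(phase)))
print(f"raw phase R^2: {r2_raw_phase:.2f}, (cos, sin) R^2: {r2_circular:.2f}")
```

The raw-phase probe should score well below 1 even though the representation is geometrically ideal, because the target is a sawtooth with a seam, while the circular coordinates are recovered almost exactly. A probe score alone therefore does not say whether the geometry is right.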
Benchmark Axes
Recoverability
Measure whether known latent variables can be recovered from model representations:
- linear probe score;
- nonlinear probe score with a fixed, simple architecture;
- sample efficiency of the probe;
- in-range interpolation and out-of-range extrapolation under latent sweeps.
This keeps backward compatibility with the current Aionoscope story.
Geometry Fidelity
Measure whether distances and neighborhoods on the reconstructed manifold match the true latent geometry:
- geodesic-distance preservation;
- trustworthiness and continuity;
- neighborhood stability under denser sampling;
- latent-distance versus representation-distance calibration;
- local tangent consistency.
For example, a frequency sweep should usually form an ordered one-dimensional curve, while phase should form a closed one-dimensional loop.
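Two of the fidelity measures listed above can be sketched for a phase sweep: neighborhood agreement and latent-distance versus representation-distance calibration. The helper names are hypothetical, the latent metric is assumed to be circular distance, and the two representations (a circle and a line with a seam) are toy examples rather than model outputs.

```python
import numpy as np

def circular_distance(a):
    """Pairwise shortest-arc distance between angles in [0, 2*pi)."""
    d = np.abs(a[:, None] - a[None, :]) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def pairwise_euclidean(H):
    return np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)

def knn_indices(D, k):
    D = D.copy()
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def neighborhood_overlap(D_latent, D_rep, k=4):
    """Mean fraction of latent k-nearest neighbors kept in the representation."""
    nl, nr = knn_indices(D_latent, k), knn_indices(D_rep, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nl, nr)]))

def distance_correlation(D_latent, D_rep):
    """Pearson correlation between latent and representation distances."""
    iu = np.triu_indices(len(D_latent), k=1)
    return float(np.corrcoef(D_latent[iu], D_rep[iu])[0, 1])

phase = np.linspace(0.0, 2 * np.pi, 60, endpoint=False)
D_latent = circular_distance(phase)
D_circle = pairwise_euclidean(np.column_stack([np.cos(phase), np.sin(phase)]))
D_line = pairwise_euclidean(phase[:, None])  # line-like encoding with a seam

overlap_circle = neighborhood_overlap(D_latent, D_circle)
overlap_line = neighborhood_overlap(D_latent, D_line)
corr_circle = distance_correlation(D_latent, D_circle)
corr_line = distance_correlation(D_latent, D_line)
```

In this toy setting the neighborhood score penalizes the line encoding only near the seam, while the distance correlation drops globally, which suggests reporting local and global fidelity separately. For a production benchmark, a library implementation such as scikit-learn's `trustworthiness` might be preferable to these hand-rolled helpers.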
Linearity And Curvature
Measure whether a factor is represented as a simple linear direction or a tangled curved object:
- PCA residual for the latent sweep;
- chord-distance versus geodesic-distance ratio;
- curvature and tangent-variation statistics;
- local linear-probe degradation along the manifold;
- sensitivity to sampling density and nuisance-factor variation.
The point is not to require every representation to be linear. The point is to make linearity, curvature, and tangling measurable.
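A minimal sketch of two of the listed statistics, assuming the sweep is given as an ordered array of representation vectors. `pca_residual` and `chord_to_path_ratio` are hypothetical names, and the polyline length is used as a simple proxy for geodesic length along the sweep.

```python
import numpy as np

def pca_residual(H, n_components=1):
    """Fraction of sweep variance left outside the top principal components."""
    Hc = H - H.mean(axis=0)
    var = np.linalg.svd(Hc, compute_uv=False) ** 2
    return float(1.0 - var[:n_components].sum() / var.sum())

def chord_to_path_ratio(H):
    """Endpoint chord length divided by the polyline length of an ordered sweep."""
    steps = np.linalg.norm(np.diff(H, axis=0), axis=1)
    return float(np.linalg.norm(H[-1] - H[0]) / steps.sum())

s = np.linspace(0.0, 1.0, 200)
straight = np.outer(s, [1.0, 2.0, -1.0])                       # linear direction
arc = np.column_stack([np.cos(np.pi * s), np.sin(np.pi * s)])  # half circle

straight_stats = (pca_residual(straight), chord_to_path_ratio(straight))
arc_stats = (pca_residual(arc), chord_to_path_ratio(arc))
print(straight_stats, arc_stats)
```

A straight sweep gives a residual near 0 and a ratio near 1. The half circle gives a ratio near 2/π and a clearly nonzero residual, so both numbers move in the expected direction as the sweep bends.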
Topology
Measure whether the representation preserves the expected latent topology:
- connected components for discrete regimes;
- circularity for phase-like variables;
- torus-like structure for two independent phases;
- branch structure for regime-transition graphs;
- persistent homology summaries such as Betti numbers when useful.
This is where the benchmark can become memorable: if phase is truly periodic, the representation should not need an arbitrary seam where $2\pi$ wraps to $0$.
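Two lightweight topology checks can be sketched without a persistent-homology library: a seam ratio for phase-like sweeps and a connected-component count for discrete regimes. Both helpers are hypothetical, the threshold `eps` is an assumed free parameter, and neither replaces a full homology computation.

```python
import numpy as np

def seam_ratio(H):
    """Length of the closing step (last point back to first) over the median step.

    Close to 1 for a closed loop sampled evenly; large when the sweep has a seam.
    """
    steps = np.linalg.norm(np.diff(H, axis=0), axis=1)
    return float(np.linalg.norm(H[0] - H[-1]) / np.median(steps))

def count_components(H, eps):
    """Connected components of the graph linking points closer than eps."""
    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    adjacent = D <= eps
    seen = np.zeros(len(H), dtype=bool)
    n_components = 0
    for start in range(len(H)):
        if seen[start]:
            continue
        n_components += 1
        seen[start] = True
        stack = [start]
        while stack:  # depth-first search over the eps-graph
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i] & ~seen):
                seen[j] = True
                stack.append(j)
    return n_components

phase = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
seam_circle = seam_ratio(np.column_stack([np.cos(phase), np.sin(phase)]))
seam_line = seam_ratio(np.column_stack([phase, np.zeros_like(phase)]))

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])  # three toy regimes
regimes = np.concatenate([c + 0.05 * rng.normal(size=(30, 2)) for c in centers])
n_regimes = count_components(regimes, eps=1.0)
```

On these toy inputs the circle has a seam ratio near 1, the line encoding has a ratio of about the number of samples, and the three well-separated regime clusters are counted as three components.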
Disentanglement And Cross-Coupling
Measure whether sweeping one latent variable unintentionally moves along other latent directions:
- Jacobian cross-terms between latent factors and representation coordinates;
- subspace angle between factor manifolds;
- factor recovery after nuisance-factor randomization;
- product-geometry tests for independent latent factors;
- failure cases where amplitude, phase, frequency, and trend become fused.
This is the natural bridge from “latent variable preserved” to “latent state is usable.”
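The subspace-angle measure can be sketched with principal angles between the dominant directions of two factor sweeps. `sweep_basis` and `principal_angles` are hypothetical helpers, and the sweeps are synthetic stand-ins for encoded amplitude and frequency sweeps; SciPy's `scipy.linalg.subspace_angles` offers a library alternative.

```python
import numpy as np

def sweep_basis(H, n_components=1):
    """Orthonormal basis for the dominant directions of a centered sweep."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:n_components].T  # shape (dim, n_components)

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

t = np.linspace(-1.0, 1.0, 50)
amplitude_sweep = np.outer(t, [1.0, 0.0, 0.0])
frequency_clean = np.outer(t, [0.0, 1.0, 0.0])  # moves in its own direction
frequency_fused = np.outer(t, [1.0, 1.0, 0.0])  # partly moves along amplitude

angle_clean = principal_angles(sweep_basis(amplitude_sweep),
                               sweep_basis(frequency_clean))[0]
angle_fused = principal_angles(sweep_basis(amplitude_sweep),
                               sweep_basis(frequency_fused))[0]
print(np.degrees(angle_clean), np.degrees(angle_fused))
```

An angle near 90 degrees indicates that the two sweeps move in independent directions, while smaller angles indicate fusion. For curved manifolds a single global basis is only a coarse summary, so local tangent estimates would likely be needed.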
Initial Synthetic Tasks
Start with factors where the expected geometry is obvious:
| Synthetic factor | Expected geometry | Failure mode to expose |
|---|---|---|
| Amplitude | interval or ray | collapsed scale, saturation, entanglement with noise |
| Frequency | ordered curve | frequency aliases, local folds, nonlinear stretching |
| Phase | circle | artificial seam, broken wraparound, line-like encoding |
| Trend slope | line | sign collapse, nonlinear compression |
| Regime ID | separated components or graph | arbitrary clustering, mixed regimes |
The first public version should be deliberately small: a sine-family benchmark with amplitude, frequency, phase, trend, and noise may be enough to prove the idea. The second version should add multivariate coupling, regime switches, exogenous variables, and intervention parameters.
Evaluation Contract
The benchmark should produce a compact report per model, layer, and pooling choice:
factor -> recoverability, geometry fidelity, linearity, topology, coupling
Useful output examples:
- phase: nonlinear probe strong, linear probe weak, topology correct circle, low seam error;
- frequency: linear probe strong, manifold nearly straight, low distortion;
- amplitude: recoverable but saturated at high values;
- two_phase: recoverable under nonlinear probe, but torus topology collapsed into a tangled curve;
- regime: separable components, but transition neighborhoods are wrong.
This should be more useful than one global benchmark score. The leaderboard can still summarize, but the diagnostic view should show which latent variables and which geometries survive the representation.
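One possible shape for the per-factor report is a small record type that can be printed as a diagnostic line or serialized for a leaderboard. The field names below are assumptions chosen to mirror the five axes, and the numbers are illustrative placeholders, not measured results.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FactorReport:
    """Hypothetical per-factor record for one model, layer, and pooling choice."""
    factor: str
    linear_probe_r2: float
    nonlinear_probe_r2: float
    geometry_fidelity: float
    pca_residual: float
    expected_topology: str
    topology_ok: bool
    coupling: float

    def summary(self):
        status = "ok" if self.topology_ok else "broken"
        return (f"{self.factor}: linear={self.linear_probe_r2:.2f}, "
                f"nonlinear={self.nonlinear_probe_r2:.2f}, "
                f"fidelity={self.geometry_fidelity:.2f}, "
                f"pca_residual={self.pca_residual:.2f}, "
                f"topology={self.expected_topology} ({status}), "
                f"coupling={self.coupling:.2f}")

# Placeholder values for illustration only.
report = FactorReport(factor="phase", linear_probe_r2=0.60,
                      nonlinear_probe_r2=0.99, geometry_fidelity=0.98,
                      pca_residual=0.50, expected_topology="circle",
                      topology_ok=True, coupling=0.05)
line = report.summary()
payload = json.dumps(asdict(report))  # machine-readable form for aggregation
print(line)
```

Keeping the axes as separate fields, rather than collapsing them into one score, lets a leaderboard summarize while the diagnostic view still shows which geometry survived.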
Relation To Foundation TSFM Agenda
This is an idea page, so the verdicts below describe the intended contribution if the proposed benchmark works. Evidence status is recorded separately in the Evidence and Missing pieces columns.
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Benchmark level | closes | Proposes a benchmark that tests latent-variable preservation, geometry, topology, and coupling rather than only forecasting or probe accuracy. Evidence is an internal design proposal plus Goodfire-inspired method framing. | Build the first public task suite, scoring code, visual reports, and model baselines. |
| Representation quality | partially closes | If validated, the benchmark distinguishes semantic-state preservation, dense numeric detail, topology, and tangling. | Need evidence across time-series encoders, foundation models, layers, and pooling choices. |
| Latent-state prediction | partially closes | Tests whether known latent process variables survive in representations and whether their geometry is usable for state tracking. | Need dynamic tasks where latent state evolves over time, not only static sampled factors. |
| Control and counterfactuals | adjacent | The action/intervention extension would evaluate whether action-conditioned parameter changes create separable manifold families. | Need synthetic actions, counterfactual rollouts, and action-conditioned model baselines. |
| Data diversity and long tail | partially closes | Synthetic generation allows dense sampling of rare regimes, endpoints, wraparound regions, and controlled nuisance variation. | Need real-domain transfer tests and checks that synthetic geometry predicts downstream utility. |
Open Questions
- Which manifold reconstruction method should be the first default: Isomap, diffusion maps, UMAP for visualization only, local PCA, principal curves, persistent homology, or a custom sweep-aware estimator?
- Should the first paper emphasize representation diagnostics, benchmark design, or the failure of linear probing?
- How should scores avoid rewarding beautiful geometry that is not useful for downstream state prediction, generation, or control?
- Which layer and pooling choices should be standardized for fair model comparisons?
- How much of the benchmark should be model-agnostic, and how much should use generative or decoder-based steering tests?
- Can the same manifold diagnostics explain why some encoders work better for TSL-JEPA or LeNEPA-style latent prediction?
- Which visualizations will make the result obvious to the community without overselling low-dimensional plots?
Related Pages
- Foundation Time-Series Model Research Agenda
- Latent-Space Predictive Learning
- Latent-State Time-Series Modeling
- Time-Series Benchmark Hygiene
- Synthetic Data For Time Series
- Time Series Forecasting Using Manifold Learning
- When Does LeJEPA Learn a World Model?
- A Cookbook of Self-Supervised Learning
- TSL-JEPA