Aionoscope Manifold Reconstruction Benchmark

Status: draft research idea for extending the published Aionoscope linear-probe diagnostic into manifold-level representation diagnostics.

Collaboration

If this direction resonates with you, I would be happy to talk with like-minded people, collaborate on research, and work on use-cases together.

Ideas are not the bottleneck. Hands are. Time-series modeling should be moving at least as fast as vision, audio, and robotics.

Summary

Aionoscope now provides the published MILETS 2026 baseline: it asks whether a frozen model representation preserves known latent process variables through pooled linear probes. The next step is to reconstruct the latent-variable manifold inside the model representation and evaluate its geometry directly.

The central benchmark question is:

When the true data-generating latent variable has known geometry,
does the model representation preserve that geometry in a usable form?

This would move the benchmark beyond “can a readout recover the factor?” A nonlinear probe can recover a factor from a representation that is geometrically tangled. Manifold reconstruction should ask whether the representation itself keeps the right shape: line, circle, torus, product space, branching regime graph, or a messy high-curvature embedding.

Research Spark

Goodfire’s neural-geometry series is the immediate spark. The strongest local lesson is not that time-series work should copy language-model steering directly. The transferable idea is that concepts can live on curved manifolds inside representation space, and that linear directions can miss or distort those structures.

Source status and credibility:

Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior was submitted to arXiv on 2026-05-06 by Goodfire-affiliated authors with Stanford, Harvard, Northeastern, and Technion affiliations. It is a recent preprint, not peer reviewed, but credible enough as a timely research spark because it has a full arXiv paper and a Goodfire research post.
Do Sparse Autoencoders Capture Concept Manifolds? was submitted to arXiv on 2026-04-30 by the same broader Goodfire neural-geometry cluster. It is also a recent preprint, useful for the “linear features versus curved manifolds” framing, but not yet peer-reviewed evidence.
The Neural Geometry Series is a Goodfire research-blog series, useful for visual intuition and method framing rather than as a benchmark authority.

Core Idea

Aionoscope can generate synthetic time-series samples with known latent factors. That gives the benchmark an advantage over ordinary representation probing: it can densely sample the latent coordinates, sweep one factor at a time, and know the correct topology in advance.

For a synthetic generator

$$x_{1:T} = g(θ, ϵ),$$

where $θ$ contains known latent variables such as frequency, phase, amplitude, trend, damping, noise level, regime, or intervention parameter, and $ϵ$ is the random noise draw, run a candidate model encoder

$$z = f(x_{1:T})$$

and reconstruct the image of controlled latent sweeps in representation space:

$$M_{j} = \{ f(g(θ_{j}, θ_{-j}, ϵ)) : θ_{j} \in Θ_{j} \}.$$

The benchmark then evaluates not only whether $θ_{j}$ is recoverable from $z$, but whether $M_{j}$ has the expected geometry.
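A minimal sketch of building one sweep manifold $M_{j}$, here for phase. Everything named in it is an assumption made for illustration: `generate` is a hypothetical sine-family generator standing in for $g$, `toy_encoder` (a few FFT coefficients) stands in for a frozen model encoder $f$, and `sweep_manifold` is not an existing Aionoscope function.

```python
import numpy as np

def generate(theta, rng, T=128, noise=0.0):
    """Hypothetical sine-family generator g(theta, eps)."""
    t = np.arange(T) / T
    clean = theta["amplitude"] * np.sin(
        2 * np.pi * theta["frequency"] * t + theta["phase"]
    )
    return clean + noise * rng.standard_normal(T)

def toy_encoder(x):
    """Stand-in for a frozen encoder f: the first 8 real-FFT coefficients."""
    spec = np.fft.rfft(x)[:8] / len(x)
    return np.concatenate([spec.real, spec.imag])

def sweep_manifold(factor, values, fixed, encoder, seed=0):
    """Return M_j as an array with one representation per swept value."""
    rng = np.random.default_rng(seed)
    rows = []
    for v in values:
        theta = {**fixed, factor: v}  # sweep theta_j, hold theta_{-j} fixed
        rows.append(encoder(generate(theta, rng)))
    return np.stack(rows)

phases = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
M_phase = sweep_manifold(
    "phase", phases, {"amplitude": 1.0, "frequency": 3.0}, toy_encoder
)
print(M_phase.shape)  # (64, 16)
```

With this toy encoder the phase sweep traces a circle of constant radius in representation space, which is the kind of object the geometry metrics are meant to score. A real model would replace `toy_encoder` with a chosen layer and pooling.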

Why This Is Stronger Than Linear Probing

Linear probes answer a narrow question:

Can a linear readout recover this latent variable from the representation?

That remains useful, especially for linearly identifiable world-model state. But it misses several important cases:

a good cyclic variable may be represented as a circle rather than a line;
a useful latent factor may be recoverable only by following a curved path;
a factor may be present but geometrically tangled with nuisance variables;
a high-scoring nonlinear probe may hide a representation that is unusable for steering, generation, or local interpolation;
a model may preserve factor order but break topology, for example turning phase into a line with artificial endpoints.

The manifold benchmark should therefore report both readout recoverability and representation geometry. A model should get credit for preserving a factor, but extra credit for preserving it with the right topology and low distortion.
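The cyclic-variable case can be illustrated with a small sketch. It assumes an idealized representation that places phase exactly on a unit circle, and a hypothetical helper `linear_probe_r2` that fits and scores an ordinary least-squares readout on the same points. It is not the published Aionoscope probe.

```python
import numpy as np

def linear_probe_r2(Z, y):
    """R^2 of a least-squares linear readout with intercept (fit and scored in-sample)."""
    A = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - resid.var() / y.var())

phi = np.linspace(0.0, 2 * np.pi, 200, endpoint=False)

# Assumed ideal representation: phase lives exactly on a circle.
Z = np.column_stack([np.cos(phi), np.sin(phi)])

r2_raw = linear_probe_r2(Z, phi)          # target is the raw angle
r2_sin = linear_probe_r2(Z, np.sin(phi))  # target is a circular coordinate

print(f"raw angle R^2: {r2_raw:.2f}, sin(phase) R^2: {r2_sin:.2f}")
```

The raw-angle probe scores well below 1 even though the representation encodes phase perfectly, because a linear map cannot unroll a circle into a line. The score depends on the choice of target as much as on the representation, which is why the geometry is reported separately.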

Benchmark Axes

Recoverability

Measure whether known latent variables can be recovered from model representations:

linear probe score;
nonlinear probe score with a fixed, simple architecture;
sample efficiency of the probe;
in-range interpolation and out-of-range extrapolation under latent sweeps.

This keeps backward compatibility with the current Aionoscope story.

Geometry Fidelity

Measure whether distances and neighborhoods on the reconstructed manifold match the true latent geometry:

geodesic-distance preservation;
trustworthiness and continuity;
neighborhood stability under denser sampling;
latent-distance versus representation-distance calibration;
local tangent consistency.

For example, a frequency sweep should usually form an ordered one-dimensional curve, while phase should form a closed one-dimensional loop.
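One simple way to make neighborhood preservation concrete is a k-nearest-neighbor overlap between latent space and representation space. The sketch below is a crude stand-in for trustworthiness and continuity, not a replacement for them. The function names and the two toy representations (a circle and a line with a seam) are assumptions for illustration.

```python
import numpy as np

def knn_indices(D, k):
    """Indices of the k nearest neighbours per row of a distance matrix, excluding self."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

def neighborhood_overlap(D_latent, D_rep, k=4):
    """Mean fraction of latent k-NN that are also k-NN in the representation."""
    nl, nr = knn_indices(D_latent, k), knn_indices(D_rep, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nl, nr)]))

def circular_distance(phi):
    """True latent distance for a phase variable: shortest arc on the circle."""
    d = np.abs(phi[:, None] - phi[None, :])
    return np.minimum(d, 2 * np.pi - d)

def euclidean_distance(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

phi = np.linspace(0.0, 2 * np.pi, 60, endpoint=False)
D_latent = circular_distance(phi)

Z_circle = np.column_stack([np.cos(phi), np.sin(phi)])  # closed loop
Z_line = phi[:, None]                                   # line with a seam

overlap_circle = neighborhood_overlap(D_latent, euclidean_distance(Z_circle))
overlap_line = neighborhood_overlap(D_latent, euclidean_distance(Z_line))
print(overlap_circle, overlap_line)
```

The circle keeps every latent neighborhood intact, while the line encoding loses neighbors near the wraparound point. The important design choice is that the latent distance respects the known geometry of the factor (circular for phase, ordinary absolute difference for frequency).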

Linearity And Curvature

Measure whether a factor is represented as a simple linear direction or a tangled curved object:

PCA residual for the latent sweep;
chord-distance versus geodesic-distance ratio;
curvature and tangent-variation statistics;
local linear-probe degradation along the manifold;
sensitivity to sampling density and nuisance-factor variation.

The point is not to require every representation to be linear. The point is to make linearity, curvature, and tangling measurable.
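Two of the listed statistics are easy to sketch for an ordered sweep: the PCA residual and the chord-to-path ratio. The sketch assumes the sweep points are stored in sweep order and approximates geodesic length by the polyline length along the sweep. A graph-based geodesic estimate would be needed for unordered samples. Function names are hypothetical.

```python
import numpy as np

def pca_residual(M, n_components=1):
    """Fraction of sweep variance left outside the top principal components."""
    X = M - M.mean(axis=0)
    var = np.linalg.svd(X, compute_uv=False) ** 2
    return float(var[n_components:].sum() / var.sum())

def chord_to_path_ratio(M):
    """Endpoint-to-endpoint distance divided by path length along the ordered sweep."""
    steps = np.linalg.norm(np.diff(M, axis=0), axis=1)
    return float(np.linalg.norm(M[-1] - M[0]) / steps.sum())

t = np.linspace(0.0, 1.0, 100)

# Toy sweeps: a straight segment in 3-D and a half circle in 2-D.
line = t[:, None] * np.array([1.0, 2.0, -1.0])
arc = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])

print(pca_residual(line), chord_to_path_ratio(line))
print(pca_residual(arc), chord_to_path_ratio(arc))
```

A straight sweep gives a residual near 0 and a ratio near 1. The half circle gives a clearly positive residual and a ratio near $2/π$. Both numbers should be reported across sampling densities, since coarse sampling shortens the measured path.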

Topology

Measure whether the representation preserves the expected latent topology:

connected components for discrete regimes;
circularity for phase-like variables;
torus-like structure for two independent phases;
branch structure for regime-transition graphs;
persistent homology summaries such as Betti numbers when useful.

This is where the benchmark can become memorable: if phase is truly periodic, the representation should not need an arbitrary seam where $2 π$ wraps to $0$.
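A cheap seam check can be sketched before reaching for persistent homology. The assumption is an ordered phase sweep over $[0, 2π)$ without a repeated endpoint, so the last point should be about one step away from the first. `seam_ratio` is a hypothetical name, and the check is necessary rather than sufficient for circular topology.

```python
import numpy as np

def seam_ratio(M):
    """Closing gap (last point back to first) over the median consecutive step."""
    steps = np.linalg.norm(np.diff(M, axis=0), axis=1)
    closing = np.linalg.norm(M[0] - M[-1])
    return float(closing / np.median(steps))

phi = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)

loop = np.column_stack([np.cos(phi), np.sin(phi)])  # wraps around cleanly
line = phi[:, None]                                 # artificial endpoints

ratio_loop = seam_ratio(loop)
ratio_line = seam_ratio(line)
print(ratio_loop, ratio_line)
```

A ratio near 1 means the sweep closes on itself. A ratio on the order of the number of sweep points means the representation has cut the circle open. A self-intersecting curve could still pass this check, which is where homology summaries would add information.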

Disentanglement And Cross-Coupling

Measure whether sweeping one latent variable unintentionally moves along other latent directions:

Jacobian cross-terms between latent factors and representation coordinates;
subspace angle between factor manifolds;
factor recovery after nuisance-factor randomization;
product-geometry tests for independent latent factors;
failure cases where amplitude, phase, frequency, and trend become fused.

This is the natural bridge from “latent variable preserved” to “latent state is usable.”
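The subspace-angle idea can be sketched with principal angles between the dominant directions of two factor sweeps. This treats each sweep as locally linear, which is an assumption that only holds for nearly straight manifolds or short sweep segments. The helper names and toy sweeps are hypothetical.

```python
import numpy as np

def sweep_basis(M, k=1):
    """Orthonormal basis (columns) for the top-k principal directions of a sweep."""
    X = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def principal_angles(A, B):
    """Principal angles in radians between the column spaces of orthonormal A and B."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

t = np.linspace(-1.0, 1.0, 50)[:, None]

amp_sweep = t * np.array([1.0, 0.0, 0.0])         # amplitude moves along axis 0
freq_sweep = t * np.array([0.0, 1.0, 0.0])        # frequency moves along axis 1
fused_sweep = t * np.array([1.0, 0.1, 0.0])       # almost the same direction as amplitude

angle_clean = principal_angles(sweep_basis(amp_sweep), sweep_basis(freq_sweep))[0]
angle_fused = principal_angles(sweep_basis(amp_sweep), sweep_basis(fused_sweep))[0]
print(angle_clean, angle_fused)
```

An angle near $π/2$ indicates that the two factors move the representation in independent directions. An angle near 0 indicates fusion. For curved manifolds the same computation would be applied to local tangent estimates rather than to whole sweeps.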

Initial Synthetic Tasks

Start with factors where the expected geometry is obvious:

Synthetic factor	Expected geometry	Failure mode to expose
Amplitude	interval or ray	collapsed scale, saturation, entanglement with noise
Frequency	ordered curve	frequency aliases, local folds, nonlinear stretching
Phase	circle	artificial seam, broken wraparound, line-like encoding
Trend slope	line	sign collapse, nonlinear compression
Regime ID	separated components or graph	arbitrary clustering, mixed regimes

The first public version should be deliberately small: a sine-family benchmark with amplitude, frequency, phase, trend, and noise may be enough to prove the idea. The second version should add multivariate coupling, regime switches, exogenous variables, and intervention parameters.

Evaluation Contract

The benchmark should produce a compact report per model, layer, and pooling choice:

factor -> recoverability, geometry fidelity, linearity, topology, coupling

Useful output examples:

phase: nonlinear probe strong, linear probe weak, topology correct circle, low seam error;
frequency: linear probe strong, manifold nearly straight, low distortion;
amplitude: recoverable but saturated at high values;
two_phase: recoverable under nonlinear probe, but torus topology collapsed into a tangled curve;
regime: separable components, but transition neighborhoods are wrong.

This should be more useful than one global benchmark score. The leaderboard can still summarize, but the diagnostic view should show which latent variables and which geometries survive the representation.
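One possible shape for a single report row is sketched below. The schema, field names, and score ranges are assumptions, and the numbers in the example are placeholders rather than measured results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactorReport:
    """One row of the per-model, per-layer, per-pooling report (hypothetical schema)."""
    factor: str
    expected_topology: str
    observed_topology: str
    linear_probe_r2: float
    nonlinear_probe_r2: float
    geometry_fidelity: float  # e.g. neighbourhood overlap in [0, 1]
    pca_residual: float       # 0 means the sweep is a straight line
    coupling: float           # 0 means no leakage into other factors

    def summary(self) -> str:
        ok = self.observed_topology == self.expected_topology
        topo = "correct" if ok else "wrong"
        return (
            f"{self.factor}: linear {self.linear_probe_r2:.2f}, "
            f"nonlinear {self.nonlinear_probe_r2:.2f}, "
            f"topology {topo} ({self.observed_topology}), "
            f"coupling {self.coupling:.2f}"
        )

# Placeholder numbers for illustration only; they are not measured results.
row = FactorReport("phase", "circle", "circle", 0.61, 0.99, 0.97, 0.50, 0.05)
print(row.summary())
# phase: linear 0.61, nonlinear 0.99, topology correct (circle), coupling 0.05
```

Keeping the expected topology next to the observed one makes the row self-explanatory, and a leaderboard score can be derived from these rows without discarding the per-factor detail.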

Relation To Foundation TSFM Agenda

This is an idea page, so the verdicts below describe the intended contribution if the proposed benchmark works. Evidence status is recorded separately in the Evidence and Missing pieces columns.

Agenda slot	Verdict	Evidence	Missing pieces
Benchmark level	closes	Proposes a benchmark that tests latent-variable preservation, geometry, topology, and coupling rather than only forecasting or probe accuracy. Evidence is an internal design proposal plus Goodfire-inspired method framing.	Build the first public task suite, scoring code, visual reports, and model baselines.
Representation quality	partially closes	If validated, the benchmark distinguishes semantic-state preservation, dense numeric detail, topology, and tangling.	Need evidence across time-series encoders, foundation models, layers, and pooling choices.
Latent-state prediction	partially closes	Tests whether known latent process variables survive in representations and whether their geometry is usable for state tracking.	Need dynamic tasks where latent state evolves over time, not only static sampled factors.
Control and counterfactuals	adjacent	The action/intervention extension would evaluate whether action-conditioned parameter changes create separable manifold families.	Need synthetic actions, counterfactual rollouts, and action-conditioned model baselines.
Data diversity and long tail	partially closes	Synthetic generation allows dense sampling of rare regimes, endpoints, wraparound regions, and controlled nuisance variation.	Need real-domain transfer tests and checks that synthetic geometry predicts downstream utility.

Open Questions

Which manifold reconstruction method should be the first default: Isomap, diffusion maps, UMAP for visualization only, local PCA, principal curves, persistent homology, or a custom sweep-aware estimator?
Should the first paper emphasize representation diagnostics, benchmark design, or the failure of linear probing?
How should scores avoid rewarding beautiful geometry that is not useful for downstream state prediction, generation, or control?
Which layer and pooling choices should be standardized for fair model comparisons?
How much of the benchmark should be model-agnostic, and how much should use generative or decoder-based steering tests?
Can the same manifold diagnostics explain why some encoders work better for TSL-JEPA or LeNEPA-style latent prediction?
Which visualizations will make the result obvious to the community without overselling low-dimensional plots?

Alex Open Research Wiki
