Representational Similarity Calibration

Summary

Representational similarity calibration is the wiki’s protocol for comparing embedding spaces across model families, model sizes, layers, modalities, or checkpoints without mistaking measurement artifacts for learned structure.

The current anchor source is Revisiting the Platonic Representation Hypothesis: An Aristotelian View. Its main lesson is that raw similarity metrics are not self-calibrating. Width, sample size, metric family, and layer-search policy can move the null baseline, so a raw score or max-over-layer trend is not enough evidence that two models have converged to the same representation.

What The Wiki Currently Believes

Raw CKA, RV, RSA, Procrustes, and related global/geometric scores can be inflated by high-dimensional finite-sample effects. A wider representation can look more similar even when sample correspondences are random.
Layerwise representation comparisons have a selection problem. If a report searches all layer pairs and publishes the maximum, deeper models have more chances to produce a high null score.
The reported null must match the reported statistic. If the headline number is max-over-layers, top-$k$ over layers, or a family-level aggregate, calibration should apply to that aggregate rather than to each scalar cell independently.
Local-neighborhood metrics such as mutual $k$-NN have a different null regime and are less width-confounded when $k$ is fixed, but they are not magic. They report neighborhood agreement, not calibrated distances, dense-value preservation, or downstream control utility.
For dependent samples, permutation calibration must preserve dependence structure. Temporal, spatial, grouped, user-level, patient-level, graph-node, or tenant-level data often need blocked, grouped, or otherwise restricted permutations.
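
The width effect in the first point above can be checked with a small simulation. This is a minimal sketch under stated assumptions, not the anchor source's code: it uses linear CKA as the metric and i.i.d. Gaussian features as a stand-in for two unrelated representations. The helper names `linear_cka` and `null_cka_by_width` are hypothetical.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (n_samples, dim) feature matrices."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))


def null_cka_by_width(n_samples=64, widths=(8, 64, 512), seed=0):
    """Raw CKA between independent Gaussian features at several widths."""
    rng = np.random.default_rng(seed)
    scores = {}
    for d in widths:
        X = rng.standard_normal((n_samples, d))
        Y = rng.standard_normal((n_samples, d))  # independent of X by construction
        scores[d] = linear_cka(X, Y)
    return scores


if __name__ == "__main__":
    for width, score in null_cka_by_width().items():
        print(f"width={width:4d}  raw CKA under the null={score:.3f}")
```

Under these assumptions the raw score rises with width at a fixed sample size even though the two matrices share no structure, which is why a raw score should be read against a null baseline for the same width and sample size.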

Calibration Pattern

For a scalar similarity score $s_{obs} = s(X, Y)$:

Define the null hypothesis as no relationship beyond the marginal statistics of the two representation sets.
Generate $K$ null scores by permuting sample correspondences in one representation set: $s^{(k)} = s(X, \pi_{k}(Y))$.
Compute a right-tail critical value $\tau_{\alpha}$ from the combined set of observed and null scores.
Report both a calibrated effect size and a permutation $p$-value.

$$s_{cal} = \max\left(\frac{s_{obs} - \tau_{\alpha}}{s_{max} - \tau_{\alpha}},\ 0\right), \qquad p = \frac{1 + \#\{k : s^{(k)} \geq s_{obs}\}}{K + 1}.$$
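
The steps and formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the anchor source's implementation. It makes several assumptions: linear CKA is the example score, `s_max` defaults to 1.0 (the assumed upper bound of that score), the critical value is taken as the $(1 - \alpha)$ quantile of the pooled scores using `np.quantile`'s default interpolation, and samples are exchangeable so that an unrestricted row permutation is a valid null. The names `linear_cka` and `permutation_calibrate` are hypothetical.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (n_samples, dim) feature matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))


def permutation_calibrate(score_fn, X, Y, n_perm=200, alpha=0.05, s_max=1.0, seed=0):
    """Return the observed score, critical value, calibrated score, and p-value."""
    rng = np.random.default_rng(seed)
    s_obs = float(score_fn(X, Y))

    # Null scores: break sample correspondence by permuting the rows of Y.
    null = np.array([score_fn(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)])

    # Right-tail critical value from the pooled observed + null scores.
    pooled = np.append(null, s_obs)
    tau = float(np.quantile(pooled, 1.0 - alpha))

    # Calibrated effect size, clipped at zero; guard against a degenerate denominator.
    denom = s_max - tau
    s_cal = max((s_obs - tau) / denom, 0.0) if denom > 0 else 0.0

    # Permutation p-value with the +1 correction in numerator and denominator.
    p_value = (1 + int(np.sum(null >= s_obs))) / (n_perm + 1)
    return {"s_obs": s_obs, "tau": tau, "s_cal": s_cal, "p_value": p_value}


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 16))
    Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random rotation of X
    print(permutation_calibrate(linear_cka, X, X @ Q))
    print(permutation_calibrate(linear_cka, X, rng.standard_normal((100, 16))))
```

In the usage example, the rotated copy of `X` should give a calibrated score near 1 and the smallest attainable $p$-value, while the independent matrix should give a calibrated score near 0.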

For a layerwise comparison, use the same sample permutation across layers and build a null distribution for the final aggregate:

flowchart TD
  A[Model A layers] --> S[Layer-pair similarity matrix]
  B[Model B layers] --> S
  S --> T[Reported statistic T: max, top-k, mean, row-max]
  B --> P[Same permutation applied to all B layers]
  P --> SN[Null similarity matrices]
  SN --> TN[Null distribution of T]
  T --> C[Calibrated aggregate + p-value]
  TN --> C
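
The flowchart can be read as the following sketch, which calibrates the reported aggregate rather than each cell of the layer-pair matrix. It is a hedged illustration, not the anchor source's code. It assumes linear CKA as the cell score, equal sample counts in every layer, and exchangeable samples. The names `similarity_matrix` and `calibrate_aggregate` are hypothetical, and `aggregate` can be any function of the matrix (for example `np.max` or `np.mean`).

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (n_samples, dim) feature matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))


def similarity_matrix(layers_a, layers_b):
    """Layer-pair similarity matrix with shape (len(layers_a), len(layers_b))."""
    return np.array([[linear_cka(a, b) for b in layers_b] for a in layers_a])


def calibrate_aggregate(layers_a, layers_b, aggregate=np.max, n_perm=200, seed=0):
    """Permutation null for the reported statistic T = aggregate(similarity matrix)."""
    rng = np.random.default_rng(seed)
    n_samples = layers_a[0].shape[0]
    t_obs = float(aggregate(similarity_matrix(layers_a, layers_b)))

    t_null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(n_samples)
        # The SAME permutation is applied to every layer of model B, so the
        # null keeps cross-layer dependence within B intact.
        permuted_b = [b[perm] for b in layers_b]
        t_null[k] = aggregate(similarity_matrix(layers_a, permuted_b))

    p_value = (1 + int(np.sum(t_null >= t_obs))) / (n_perm + 1)
    return t_obs, t_null, p_value
```

Because the maximum over cells is never smaller than any single cell, the null distribution of a max-over-layers statistic sits at or above the null of any individual layer pair. Comparing an observed maximum against a per-cell null would therefore overstate significance.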

Why It Matters For This Knowledge Base

The wiki frequently compares representations: self-supervised encoders, JEPA/NEPA targets, intermediate layers, vision-language models, latent-state time-series models, and world-model state encoders. Without calibration, an apparent representation-quality improvement can be a metric artifact.

For the foundation time-series model agenda, this matters in four places:

Use case	Required separation
Comparing encoders across scales	Raw similarity versus calibrated similarity; sample size and embedding width should be explicit.
Choosing the best layer	Probe transfer should be separated from max-over-layer similarity; max-over-layer similarity needs aggregation-aware calibration.
Evaluating latent-state consistency	Local-neighborhood agreement should be separated from dense numeric detail, rare-regime retention, and action-conditioned rollout utility.
Monitoring checkpoint trajectories	Representation drift/convergence should be tested with a null that respects checkpoint, sample, channel, and temporal dependence.

Time-Series Translation

Naively permuting rows is usually wrong for real temporal data. A TSFM calibration protocol should ask what exchangeability means for the dataset:

Single univariate series: use block permutations or circular shifts rather than arbitrary timestamp shuffles when autocorrelation matters.
Panel data: permute within entities only when entity-specific distributions should be preserved; otherwise use entity-level permutations.
Multivariate telemetry: preserve channel groups, topology neighborhoods, and known service/host/region groupings when they are part of the null.
Event/action streams: preserve action/event timing structure when the null is meant to break representation correspondence without destroying the action process itself.
Clinical or user data: preserve patient/user/tenant clusters and leakage boundaries.
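
Several of the restricted schemes listed above reduce to generating a different index vector and applying it to one representation set in place of an unrestricted shuffle. The sketch below shows three such generators. It is illustrative only: the function names are hypothetical, the block length and group labels are assumed to be supplied by the analyst, and none of these choices is prescribed by the anchor source.

```python
import numpy as np


def circular_shift_indices(n, rng, min_shift=1):
    """Rotate the time axis by a random offset; local ordering is preserved."""
    shift = int(rng.integers(min_shift, n))  # offset drawn from [min_shift, n - 1]
    return np.roll(np.arange(n), shift)


def block_permutation_indices(n, block_len, rng):
    """Shuffle contiguous blocks; order inside each block is preserved."""
    blocks = [np.arange(start, min(start + block_len, n)) for start in range(0, n, block_len)]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])


def within_group_permutation_indices(groups, rng):
    """Shuffle rows only among rows that share a group label (entity, patient, tenant)."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))
    for label in np.unique(groups):
        members = np.flatnonzero(groups == label)
        idx[members] = rng.permutation(members)
    return idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(circular_shift_indices(10, rng))
    print(block_permutation_indices(10, 3, rng))
    print(within_group_permutation_indices([0, 0, 0, 1, 1, 1, 2, 2], rng))
```

A null score is then computed as `score_fn(X, Y[idx])` with `idx` drawn from whichever generator matches the exchangeability assumption for the dataset. The within-group generator breaks correspondence inside each entity while keeping entity membership fixed, which corresponds to the "permute within entities" case for panel data.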

The target is not to make every representation comparison conservative by default. The target is to make the null match the claim. If the claim is “these two models share cross-channel state structure,” the null should break cross-model correspondence while retaining within-series and within-channel structure that is not evidence for the claim.

Gotchas

A calibrated similarity score is still not a downstream metric. It says that a representation relation exceeds a null baseline, not that the latent state is useful for forecasting, generation, anomaly detection, or control.
Local neighborhood agreement can be semantically useful while still losing exact numeric values. TSFM pages should pair local-neighborhood evidence with reconstruction, rollout, rare-event, and intervention probes.
Aggregation-aware calibration is required when the analysis selects the best layer or best head after seeing the similarity matrix.
The calibration budget matters. Low $K$ can make thresholds unstable; the anchor source recommends about $K \geq 200$ for stable reporting.
Multiple metrics encode different invariances. CKA, CCA, RSA, Procrustes, mKNN, CKNNA, and downstream probes should not be collapsed into one scalar representation-quality claim.
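
The calibration-budget point above can be made concrete with a small simulation. This sketch rests on strong assumptions: the null scores are replaced by draws from a Beta(2, 10) distribution chosen purely for illustration, and the threshold is the $(1 - \alpha)$ quantile of the null draws alone. The names `threshold_spread` and `min_p_value` are hypothetical.

```python
import numpy as np


def threshold_spread(n_perm, alpha=0.05, n_repeats=300, seed=0):
    """Std of the estimated critical value across repeated null samples of size n_perm."""
    rng = np.random.default_rng(seed)
    # Stand-in null scores; a real analysis would use permuted similarity scores.
    draws = rng.beta(2.0, 10.0, size=(n_repeats, n_perm))
    taus = np.quantile(draws, 1.0 - alpha, axis=1)
    return float(taus.std())


def min_p_value(n_perm):
    """Smallest p-value attainable with the (1 + count) / (K + 1) estimator."""
    return 1.0 / (n_perm + 1)


if __name__ == "__main__":
    for k in (20, 200, 2000):
        print(f"K={k:5d}  threshold std={threshold_spread(k):.4f}  min p={min_p_value(k):.4f}")
```

Two effects are visible under these assumptions: the estimated threshold fluctuates less as $K$ grows, and the smallest reportable $p$-value is bounded below by $1 / (K + 1)$, so a small budget also limits how strong a reported result can be.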

Open Questions

Which restricted-permutation families should become standard for high-dimensional multivariate time-series benchmarks?
Should TSFM model cards report calibrated representation similarity between scales, datasets, and checkpoint stages?
Which metric family best predicts downstream transfer when dense numeric detail matters: calibrated global similarity, local neighborhoods, supervised probes, or latent rollout diagnostics?
Can local-neighborhood convergence across TSFM families reveal shared system-state structure, or only shared easy regimes and common seasonal motifs?
