Representational Similarity Calibration
Summary
Representational similarity calibration is the wiki’s protocol for comparing embedding spaces across model families, model sizes, layers, modalities, or checkpoints without mistaking measurement artifacts for learned structure.
The current anchor source is *Revisiting the Platonic Representation Hypothesis: An Aristotelian View*. Its main lesson is that raw similarity metrics are not self-calibrating. Width, sample size, metric family, and layer-search policy can all move the null baseline, so a raw score or a max-over-layer trend is not enough evidence that two models have converged to the same representation.
What The Wiki Currently Believes
- Raw CKA, RV, RSA, Procrustes, and related global/geometric scores can be inflated by high-dimensional finite-sample effects. A wider representation can look more similar even when sample correspondences are random.
- Layerwise representation comparisons have a selection problem. If a report searches all layer pairs and publishes the maximum, deeper models have more chances to produce a high null score.
- The reported null must match the reported statistic. If the headline number is max-over-layers, top-$k$ over layers, or a family-level aggregate, calibration should apply to that aggregate rather than to each scalar cell independently.
- Local-neighborhood metrics such as mutual $k$-NN have a different null regime and are less width-confounded when $k$ is fixed, but they are not magic. They report neighborhood agreement, not calibrated distances, dense-value preservation, or downstream control utility.
- For dependent samples, permutation calibration must preserve dependence structure. Temporal, spatial, grouped, user-level, patient-level, graph-node, or tenant-level data often need blocked, grouped, or otherwise restricted permutations.
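The width effect in the first bullet can be checked with a small simulation. The sketch below is illustrative only: it assumes linear CKA as the raw metric and uses independent Gaussian features, so any nonzero score is a pure finite-sample artifact. The function names and the sample/width settings are hypothetical choices, not taken from the anchor source.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two row-aligned (n_samples, n_features) matrices."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


def null_cka_by_width(n_samples: int, widths, seed: int = 0) -> dict:
    """Raw CKA between independent random features, one score per width."""
    rng = np.random.default_rng(seed)
    out = {}
    for d in widths:
        x = rng.standard_normal((n_samples, d))
        y = rng.standard_normal((n_samples, d))  # independent of x by construction
        out[d] = linear_cka(x, y)
    return out


scores = null_cka_by_width(n_samples=100, widths=[10, 100, 1000])
for width, score in scores.items():
    print(f"width={width:5d}  raw CKA between independent features: {score:.3f}")
```

With the sample count held fixed, the raw score rises with width even though the two feature sets share no structure, which is why width and sample size need to be stated next to any raw similarity number.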
Calibration Pattern
For a scalar similarity score $s(X, Y)$ between two row-aligned representation sets $X$ and $Y$:
- Define the null hypothesis as no relationship beyond the marginal statistics of the two representation sets.
- Generate null scores by permuting sample correspondences in one representation set: $s^{(j)} = s(X, \pi_j(Y))$ for random permutations $\pi_1, \dots, \pi_K$ of the sample index.
- Compute a right-tail critical value from the combined set of observed and null scores.
- Report both a calibrated effect size and a permutation $p$-value.
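The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the anchor source's implementation: linear CKA stands in for the score, `calibrate_score` is a hypothetical helper, `n_perm=200` and `alpha=0.05` are arbitrary placeholders, and the effect size (excess over the critical value, clipped at zero) is one plausible definition chosen for illustration.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two row-aligned (n_samples, n_features) matrices."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


def calibrate_score(x, y, score_fn=linear_cka, n_perm=200, alpha=0.05, seed=0):
    """Permutation-calibrate a scalar similarity score (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    observed = score_fn(x, y)
    # Null scores: break the sample correspondence by permuting rows of y only.
    null = np.array([score_fn(x, y[rng.permutation(len(y))]) for _ in range(n_perm)])
    # Right-tail critical value from the combined observed + null scores.
    critical = float(np.quantile(np.append(null, observed), 1.0 - alpha))
    # Permutation p-value with the usual +1 correction, so it is never exactly zero.
    p_value = (1 + int(np.sum(null >= observed))) / (n_perm + 1)
    # Assumed effect size: excess over the critical value, clipped at zero.
    effect = max(observed - critical, 0.0)
    return {"observed": observed, "critical": critical, "effect": effect, "p_value": p_value}


rng = np.random.default_rng(1)
x = rng.standard_normal((200, 32))
# Related: a noisy linear map of x. Unrelated: independent features.
y_related = x @ rng.standard_normal((32, 32)) + 0.1 * rng.standard_normal((200, 32))
y_unrelated = rng.standard_normal((200, 32))

related = calibrate_score(x, y_related)
unrelated = calibrate_score(x, y_unrelated)
print("related:  ", related)
print("unrelated:", unrelated)
```

Reporting `observed`, `critical`, and `p_value` together keeps the raw score visible while making clear how much of it the null already explains.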
For a layerwise comparison, use the same sample permutation across layers and build a null distribution for the final aggregate:
flowchart TD
    A[Model A layers] --> S[Layer-pair similarity matrix]
    B[Model B layers] --> S
    S --> T[Reported statistic T: max, top-k, mean, row-max]
    B --> P[Same permutation applied to all B layers]
    P --> SN[Null similarity matrices]
    SN --> TN[Null distribution of T]
    T --> C[Calibrated aggregate + p-value]
    TN --> C
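The flow above can be sketched in code. This is a hedged illustration, assuming linear CKA as the cell score and `np.max` as the reported statistic; `similarity_matrix` and `calibrate_aggregate` are hypothetical helper names, and `n_perm=100` is an arbitrary placeholder. Each null draw uses one permutation for every layer of model B and recomputes the aggregate, so the null distribution describes the number that is actually reported.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two row-aligned (n_samples, n_features) matrices."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


def similarity_matrix(layers_a, layers_b, score_fn=linear_cka) -> np.ndarray:
    """Score every (layer of A, layer of B) pair."""
    return np.array([[score_fn(a, b) for b in layers_b] for a in layers_a])


def calibrate_aggregate(layers_a, layers_b, aggregate=np.max, n_perm=100, seed=0):
    """Null distribution for the reported aggregate, not for individual cells."""
    rng = np.random.default_rng(seed)
    n = layers_a[0].shape[0]
    observed = float(aggregate(similarity_matrix(layers_a, layers_b)))
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(n)                # one permutation per null draw...
        permuted_b = [b[perm] for b in layers_b]  # ...shared by all layers of B
        null[i] = aggregate(similarity_matrix(layers_a, permuted_b))
    p_value = (1 + int(np.sum(null >= observed))) / (n_perm + 1)
    return {"observed": observed, "null": null, "p_value": p_value}


rng = np.random.default_rng(2)
# Independent random "layers": any similarity here is a null artifact.
layers_a = [rng.standard_normal((60, 16)) for _ in range(4)]
layers_b = [rng.standard_normal((60, 16)) for _ in range(4)]

max_result = calibrate_aggregate(layers_a, layers_b)            # null for max over 16 pairs
cell_result = calibrate_aggregate(layers_a[:1], layers_b[:1])  # null for a single cell
print("mean null, single cell:      ", cell_result["null"].mean())
print("mean null, max over 16 pairs:", max_result["null"].mean())
```

Both calls use the same seed and therefore the same permutations, so the max-over-pairs null sits at or above the single-cell null draw by draw. Comparing an observed maximum against a per-cell threshold would overstate significance for that reason.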
Why It Matters For This Knowledge Base
The wiki frequently compares representations: self-supervised encoders, JEPA/NEPA targets, intermediate layers, vision-language models, latent-state time-series models, and world-model state encoders. Without calibration, an apparent representation-quality improvement can be a metric artifact.
For the foundation time-series model agenda, this matters in four places:
| Use case | Required separation |
|---|---|
| Comparing encoders across scales | Raw similarity versus calibrated similarity; sample size and embedding width should be explicit. |
| Choosing the best layer | Probe transfer should be separated from max-over-layer similarity; max-over-layer similarity needs aggregation-aware calibration. |
| Evaluating latent-state consistency | Local-neighborhood agreement should be separated from dense numeric detail, rare-regime retention, and action-conditioned rollout utility. |
| Monitoring checkpoint trajectories | Representation drift/convergence should be tested with a null that respects checkpoint, sample, channel, and temporal dependence. |
Time-Series Translation
Naively permuting rows is usually wrong for real temporal data. A calibration protocol for a time-series foundation model (TSFM) should ask what exchangeability means for the dataset:
- Single univariate series: use block permutations or circular shifts rather than arbitrary timestamp shuffles when autocorrelation matters.
- Panel data: permute within entities only when entity-specific distributions should be preserved; otherwise use entity-level permutations.
- Multivariate telemetry: preserve channel groups, topology neighborhoods, and known service/host/region groupings when they are part of the null.
- Event/action streams: preserve action/event timing structure when the null is meant to break representation correspondence without destroying the action process itself.
- Clinical or user data: preserve patient/user/tenant clusters and leakage boundaries.
The target is not to make every representation comparison conservative by default. The target is to make the null match the claim. If the claim is “these two models share cross-channel state structure,” the null should break cross-model correspondence while retaining within-series and within-channel structure that is not evidence for the claim.
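A few of the restricted-permutation families listed above can be sketched as index generators. This is a minimal sketch with hypothetical helper names; which family is appropriate depends on the exchangeability assumption behind the claim, and the block length and group labels below are made-up illustration values. Each function returns an index array that can replace the unrestricted permutation in a calibration loop, for example `y[idx]`.

```python
import numpy as np


def circular_shift_indices(n: int, rng: np.random.Generator, min_shift: int = 1) -> np.ndarray:
    """Rotate the time index by a random nonzero offset; local ordering is preserved."""
    shift = int(rng.integers(min_shift, n - min_shift + 1))  # upper bound is exclusive
    return np.roll(np.arange(n), shift)


def block_permutation_indices(n: int, block_len: int, rng: np.random.Generator) -> np.ndarray:
    """Shuffle contiguous blocks while keeping the order inside each block."""
    blocks = [np.arange(s, min(s + block_len, n)) for s in range(0, n, block_len)]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[i] for i in order])


def within_group_permutation_indices(groups, rng: np.random.Generator) -> np.ndarray:
    """Permute samples only among members of the same group (entity, patient, tenant)."""
    groups = np.asarray(groups)
    idx = np.arange(len(groups))
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        idx[members] = rng.permutation(members)
    return idx


rng = np.random.default_rng(0)
shift_idx = circular_shift_indices(12, rng)
block_idx = block_permutation_indices(12, block_len=4, rng=rng)
group_labels = np.array(["a", "a", "a", "b", "b", "c", "c", "c", "c"])
group_idx = within_group_permutation_indices(group_labels, rng)
print("circular shift:", shift_idx)
print("block shuffle: ", block_idx)
print("within group:  ", group_idx, group_labels[group_idx])
```

Circular shifts and block shuffles keep short-range autocorrelation inside the permuted series, while within-group permutation keeps every sample inside its own entity, so entity-specific distributions survive under the null.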
Gotchas
- A calibrated similarity score is still not a downstream metric. It says that a representation relation exceeds a null baseline, not that the latent state is useful for forecasting, generation, anomaly detection, or control.
- Local neighborhood agreement can be semantically useful while still losing exact numeric values. TSFM pages should pair local-neighborhood evidence with reconstruction, rollout, rare-event, and intervention probes.
- Aggregation-aware calibration is required when the analysis selects the best layer or best head after seeing the similarity matrix.
- The calibration budget matters. A low number of null permutations can make thresholds unstable; the anchor source recommends a minimum permutation budget for stable reporting.
- Multiple metrics encode different invariances. CKA, CCA, RSA, Procrustes, mKNN, CKNNA, and downstream probes should not be collapsed into one scalar representation-quality claim.
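The gap between neighborhood agreement and numeric fidelity noted above can be shown with a toy example. The sketch below assumes a simple mutual $k$-NN score (mean overlap of Euclidean $k$-nearest-neighbor sets, divided by $k$); the helper names, `k=10`, and the rescaling are illustrative assumptions rather than the anchor source's definition.

```python
import numpy as np


def knn_sets(z: np.ndarray, k: int) -> list:
    """Index sets of the k nearest Euclidean neighbors of each row, excluding itself."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [set(row.tolist()) for row in nearest]


def mutual_knn(x: np.ndarray, y: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-nearest neighbors between two row-aligned sets."""
    overlaps = [len(a & b) for a, b in zip(knn_sets(x, k), knn_sets(y, k))]
    return float(np.mean(overlaps) / k)


rng = np.random.default_rng(3)
x = rng.standard_normal((200, 16))
y_rescaled = 5.0 * x + 2.0                     # same neighborhoods, different values
y_unrelated = rng.standard_normal((200, 16))   # independent features

agreement_rescaled = mutual_knn(x, y_rescaled)
agreement_unrelated = mutual_knn(x, y_unrelated)
value_error = float(np.mean((x - y_rescaled) ** 2))
print("mutual k-NN, rescaled copy:     ", agreement_rescaled)
print("mutual k-NN, unrelated features:", agreement_unrelated)
print("mean squared value error, rescaled copy:", value_error)
```

The rescaled copy keeps near-perfect neighborhood agreement while its values are far from the original, which is why neighborhood evidence should be paired with reconstruction or rollout probes when dense numeric detail matters.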
Related Pages
- Time-Series Benchmark Hygiene
- Self-Supervised Representation Learning
- Intermediate-Layer Representations
- Vision-Language Models
- Representation Collapse
- Latent-Space Predictive Learning
- Foundation Time-Series Model Research Agenda
Open Questions
- Which restricted-permutation families should become standard for high-dimensional multivariate time-series benchmarks?
- Should TSFM model cards report calibrated representation similarity between scales, datasets, and checkpoint stages?
- Which metric family best predicts downstream transfer when dense numeric detail matters: calibrated global similarity, local neighborhoods, supervised probes, or latent rollout diagnostics?
- Can local-neighborhood convergence across TSFM families reveal shared system-state structure, or only shared easy regimes and common seasonal motifs?