Intermediate-Layer Representations

Summary

Results on intermediate-layer representations repeatedly warn against treating the final model output as the default state for downstream transfer. The final layer is often an interface optimized for the pretraining objective; the most reusable latent state may live earlier.

What The Wiki Currently Believes

Guillotine Regularization shows that SSL projectors can improve training while making the final representation worse for downstream tasks; the best layer can be the backbone, an intermediate projector layer, or another trunk layer.
Perception Encoder shows the same pattern at large vision-language scale: PE Core learns strong general features, but the best language and spatial features are hidden in intermediate layers until alignment tuning lifts them to the output.
RAEv2 operationalizes the lesson for generation: summing the last $K$ vision-encoder layers improves the reconstruction-generation tradeoff, while the X discussion clarifies that summing residual-stream outputs is also a fixed depth-reweighting scheme rather than a settled optimal aggregator.
MoDA turns the same warning into an internal architecture primitive: instead of selecting, summing, or probing layers after the fact, later layers retrieve from previous depth key/value memories through attention.
mHC and Hyperloop Transformers add a constrained residual-state alternative: instead of retrieving previous layers through depth KV, a model can widen the carried residual state into parallel streams and constrain mixing. This belongs in the same comparison as fixed layer sums, learned layer weights, sparse selection, and depth attention.
TiViT is a time-series-adjacent example: rendered time-series images can benefit from frozen vision-model hidden representations rather than only final outputs.
LeNEPA adds a time-series SSL example: frozen-probe results depend on intermediate-layer readouts, and the paper’s fixed L4 summaries should not be replaced by an unqualified final-layer representation claim.
Implicit Curriculum Hypothesis moves intermediate representations from post-hoc transfer features to training-time diagnostics: residual-stream function vectors predict held-out composite-task learning trajectories across checkpoints.
Is One Layer Enough? adds a tabular foundation model case: per-layer decoders can expose predictive intermediate states before those states are aligned with the original final decoder.
Aristotelian Representation Hypothesis adds the statistical caveat for layer search itself: if a comparison scans all layer pairs and reports the maximum, deeper models receive a larger null search space and the aggregate needs its own permutation calibration.
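
The RAEv2 point about fixed depth reweighting can be made concrete with a small sketch. It assumes a purely additive residual stream, h_l = h_0 + u_1 + ... + u_l, with no per-layer normalization before summing; real encoders may violate that assumption, and the function names here are hypothetical, not taken from any of the cited papers. Under that assumption, summing the last K states gives every block update an implied integer weight that depends only on depth.

```python
def implied_block_weights(num_layers: int, k: int) -> list[int]:
    """Implied weight on h_0 and on each block update u_1..u_L when the
    last k residual-stream states are summed.

    Assumption (not from the source): h_l = h_0 + u_1 + ... + u_l.
    Index 0 of the result is the weight on the embedding h_0.
    """
    if not 1 <= k <= num_layers:
        raise ValueError("k must be between 1 and num_layers")
    first_summed = num_layers - k + 1
    weights = [k]  # h_0 is contained in every summed state
    for block in range(1, num_layers + 1):
        # u_block is contained in every summed state h_l with l >= block
        weights.append(num_layers - max(block, first_summed) + 1)
    return weights


def sum_last_k_states(h0: int, updates: list[int], k: int) -> int:
    """Brute-force reference: build every state, then sum the last k."""
    states, h = [], h0
    for u in updates:
        h = h + u
        states.append(h)
    return sum(states[-k:])


if __name__ == "__main__":
    w = implied_block_weights(num_layers=6, k=3)
    print(w)  # [3, 3, 3, 3, 3, 2, 1]
    h0, updates = 5, [1, 2, 3, 4, 5, 6]
    reweighted = w[0] * h0 + sum(wi * u for wi, u in zip(w[1:], updates))
    print(reweighted == sum_last_k_states(h0, updates, k=3))
```

In this toy setting, early blocks and the embedding receive the largest weight and the last block the smallest, which is one way to read the claim that a plain layer sum is a design choice rather than a neutral aggregator.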

Gotchas

Best layer is a protocol variable. It can change with target task, target data distribution, OOD shift, optimizer, architecture, and whether the target task matches the pretraining invariances.
Max-over-layer similarity is a multiple-comparisons statistic. Reports should distinguish actual downstream best-layer transfer from raw layerwise representational similarity, and should calibrate the reported aggregate rather than each cell independently.
“Discard the head” is too coarse. A projector or late block can contain useful intermediate states even if its final output is too objective-specific.
Layerwise probes should use the target evaluation protocol. Source-task or projector-only probes can miss downstream-relevant state.
Training-time representation probes are not automatically causal. A function-vector neighborhood that predicts a learning trajectory is a diagnostic relation unless perturbations or interventions show the representation drives the capability.
Alignment tuning and layer cutting are different moves. Layer cutting selects an already useful internal state; alignment tuning changes the model so a desired state is exposed at the output.
Depth retrieval is not free layer selection. MoDA-style depth KV makes intermediate state accessible, but cache growth, latency, and matched-budget baselines decide whether the interface is actually better than fixed aggregation, unique extra layers, or looped depth.
Residual-state capacity is not free either. Matrix-valued residual streams can make carried state richer, but memory bandwidth, kernel requirements, and serving latency decide whether they beat retrieval, unique layers, or explicit memory tokens.
Invariance can erase information. For time-series and world-model work, augmentations or objectives that remove scale, phase, local detail, channel identity, action information, or exogenous variables may make final embeddings brittle even when earlier layers still encode those factors.
A fixed reporting layer is part of the benchmark contract. For LeNEPA-style frozen probes, the selected intermediate layer should be named rather than treated as a generic model output.
Decoder alignment is part of the representation contract. The tabular tuned lens in Is One Layer Enough? shows that a final decoder can understate the predictive content of an intermediate tabular foundation model (TFM) state.
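
The multiple-comparisons gotcha can be sketched as a permutation test on the aggregate itself. The code below is a minimal illustration, not the procedure of any cited paper: it assumes linear CKA as the similarity measure, paired samples across the two models, and a null built by shuffling the sample order of one model. All function names are hypothetical. The key step is that each permutation recomputes the maximum over all layer pairs, so a deeper model's larger search space is reflected in the null.

```python
import random


def center(x):
    """Subtract the per-feature mean from a row-major sample matrix."""
    n = len(x)
    means = [sum(col) / n for col in zip(*x)]
    return [[v - m for v, m in zip(row, means)] for row in x]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major sample matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(p * q for p, q in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representations of the same samples."""
    x, y = center(x), center(y)
    denom = (cross_sq_norm(x, x) * cross_sq_norm(y, y)) ** 0.5
    return cross_sq_norm(x, y) / denom if denom > 0 else 0.0


def max_over_layers(layers_a, layers_b):
    """The reported aggregate: best similarity over all layer pairs."""
    return max(linear_cka(a, b) for a in layers_a for b in layers_b)


def calibrated_max(layers_a, layers_b, n_perm=200, seed=0):
    """Return (observed max, permutation p-value for that max)."""
    rng = random.Random(seed)
    observed = max_over_layers(layers_a, layers_b)
    n = len(layers_a[0])
    exceed = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # One permutation is shared by every layer of model B, so the null
        # breaks sample pairing but keeps B's cross-layer structure.
        shuffled = [[layer[i] for i in perm] for layer in layers_b]
        if max_over_layers(layers_a, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Calibrating each cell separately would answer a different question; here the p-value attaches to the single number that actually gets reported.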

Implications For Time-Series And World Models

For time-series encoders, the analogous question is not “which visual layer is best?” but “which latent state preserves the dynamics needed by the downstream task?” A forecasting head, reconstruction head, contrastive projector, or language-alignment adapter may optimize a useful interface while suppressing variables needed for classification, anomaly detection, counterfactual prediction, or action-conditioned world modeling.

The implicit-curriculum result adds a training-time version of the same question: can a latent probe measured early predict which time-series capabilities will emerge later? A useful time-series foundation model (TSFM) release would expose enough checkpoint and layer access to test whether regime, channel-coupling, context-use, and intervention probes follow predictable trajectories.

When evaluating temporal representation models, the wiki should prefer reports that identify the probed layer, head, pooling rule, adaptation budget, target split, and whether the evaluation is in-distribution or OOD. LeNEPA’s fixed-recipe probes are useful partly because they expose layer choice as an explicit protocol variable rather than hiding it inside one final embedding.
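
One way to make the reporting preference above hard to skip is a record type that refuses to be built without the protocol fields. This is a hypothetical sketch: the class name, field names, and example values are illustrative assumptions and do not describe LeNEPA's actual configuration or any existing tool.

```python
from dataclasses import dataclass, fields

ALLOWED_SHIFT = {"in_distribution", "ood"}


@dataclass(frozen=True)
class ProbeReport:
    """A single probe result with its protocol stated explicitly."""
    model: str
    layer: str               # named layer, never an implicit "final"
    head: str                # probe head type
    pooling: str             # how token or time states are pooled
    adaptation_budget: str   # what was trained and for how long
    target_split: str        # split the value was measured on
    shift: str               # "in_distribution" or "ood"
    metric: str
    value: float

    def __post_init__(self):
        # Reject blank protocol fields instead of silently defaulting them.
        for f in fields(self):
            val = getattr(self, f.name)
            if isinstance(val, str) and not val.strip():
                raise ValueError(f"{f.name} must be stated explicitly")
        if self.shift not in ALLOWED_SHIFT:
            raise ValueError(f"shift must be one of {sorted(ALLOWED_SHIFT)}")

    def summary(self) -> str:
        return (
            f"{self.model} | layer={self.layer} | head={self.head} | "
            f"pooling={self.pooling} | budget={self.adaptation_budget} | "
            f"split={self.target_split} | shift={self.shift} | "
            f"{self.metric}={self.value:.3f}"
        )


if __name__ == "__main__":
    # Illustrative values only; not results from any paper.
    report = ProbeReport(
        model="example-encoder",
        layer="block_4",
        head="linear",
        pooling="mean_over_time",
        adaptation_budget="frozen encoder, probe only",
        target_split="test",
        shift="ood",
        metric="accuracy",
        value=0.5,
    )
    print(report.summary())
```

Making the layer a required, non-blank field keeps the layer choice visible as a protocol variable in every reported number.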

Relation To Foundation TSFM Agenda

This page is a representation-quality warning for the Foundation Time-Series Model Research Agenda. It belongs under the semantic-state-versus-dense-numeric-detail slot: a final embedding can be convenient for one objective while an intermediate layer preserves the state needed for forecasting, generation, classification, anomaly detection, or intervention reasoning.

Open Questions

Can one pretraining objective produce a final representation that is simultaneously good for global semantics, local structure, temporal dynamics, and intervention-sensitive state?
Which layerwise diagnostics best predict downstream transfer before training many probes?
Should foundation-model releases expose stable intermediate-layer APIs by default?
Should generation and world-modeling systems use fixed layer sums, learned weights, sparse layer selection, feature-pyramid-style fusion, matrix residual streams, or depth attention over intermediate states?
Can bounded depth-KV slots preserve task-relevant intermediate state under a fixed serving budget?

Alex Open Research Wiki
