Intermediate-Layer Representations

Summary

Work on intermediate-layer representations is a recurring warning against treating the final model output as the default state for downstream transfer. The final layer is often an interface optimized for the pretraining objective; the most reusable latent state may live earlier.

What The Wiki Currently Believes

  • Guillotine Regularization shows that SSL projectors can improve training while making the final representation worse for downstream tasks; the best layer can be the backbone, an intermediate projector layer, or another trunk layer.
  • Perception Encoder shows the same pattern at large vision-language scale: PE Core learns strong general features, but the best language and spatial features are hidden in intermediate layers until alignment tuning lifts them to the output.
  • RAEv2 operationalizes the lesson for generation: summing the last vision-encoder layers improves the reconstruction-generation tradeoff, while the X discussion clarifies that summing residual-stream outputs is also a fixed depth-reweighting scheme rather than a settled optimal aggregator.
  • MoDA turns the same warning into an internal architecture primitive: instead of selecting, summing, or probing layers after the fact, later layers retrieve from previous depth key/value memories through attention.
  • mHC and Hyperloop Transformers add a constrained residual-state alternative: instead of retrieving previous layers through depth KV, a model can widen the carried residual state into parallel streams and constrain mixing. This belongs in the same comparison as fixed layer sums, learned layer weights, sparse selection, and depth attention.
  • TiViT is a time-series-adjacent example: rendered time-series images can benefit from frozen vision-model hidden representations rather than only final outputs.
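
The depth-reweighting point in the RAEv2 bullet can be checked with a toy calculation. The sketch below assumes a plain additive residual stream (h_l = h_{l-1} + u_l, with no final normalization). That is an illustrative assumption, not a description of the RAEv2 encoder, and all function names are hypothetical. Under that assumption, summing the last k layer outputs equals k copies of the input embedding plus a fixed weighted sum of block updates.

```python
def residual_stream(h0, updates):
    """Return [h_1, ..., h_L] with h_l = h_{l-1} + u_l (assumed additive residual stream)."""
    states, h = [], list(h0)
    for u in updates:
        h = [a + b for a, b in zip(h, u)]
        states.append(h)
    return states


def sum_last_k(states, k):
    """Directly sum the last k layer outputs, dimension by dimension."""
    dim = len(states[0])
    return [sum(s[d] for s in states[-k:]) for d in range(dim)]


def implied_update_weights(num_layers, k):
    """Weight each block update receives when the last k states are summed.

    Block i (1-indexed) is contained in every state h_l with l >= i, so it is
    counted once per summed state at or after depth i.
    """
    return [min(k, num_layers - i + 1) for i in range(1, num_layers + 1)]


def reweighted_sum(h0, updates, k):
    """Rebuild the same sum from the embedding and per-block weights."""
    weights = implied_update_weights(len(updates), k)
    return [
        k * h0[d] + sum(w * u[d] for w, u in zip(weights, updates))
        for d in range(len(h0))
    ]


# Toy integer values so the two computations can be compared exactly.
h0 = [1, -2]
updates = [[1, 0], [0, 3], [2, 2], [-1, 4]]
states = residual_stream(h0, updates)
direct = sum_last_k(states, k=2)
implied = reweighted_sum(h0, updates, k=2)
```

With four blocks and k = 2, the implied block weights are [2, 2, 2, 1]: every block except the last is counted twice. The sum is therefore one particular depth weighting chosen by k, which is why it is better compared against learned weights or depth attention than treated as a neutral default.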

Gotchas

  • Best layer is a protocol variable. It can change with target task, target data distribution, OOD shift, optimizer, architecture, and whether the target task matches the pretraining invariances.
  • “Discard the head” is too coarse. A projector or late block can contain useful intermediate states even if its final output is too objective-specific.
  • Layerwise probes should use the target evaluation protocol. Source-task or projector-only probes can miss downstream-relevant state.
  • Alignment tuning and layer cutting are different moves. Layer cutting selects an already useful internal state; alignment tuning changes the model so a desired state is exposed at the output.
  • Depth retrieval is not cost-free layer selection. MoDA-style depth KV makes intermediate state accessible, but cache growth, latency, and matched-budget baselines decide whether the interface is actually better than fixed aggregation, unique extra layers, or looped depth.
  • Residual-state capacity is not free either. Matrix-valued residual streams can make carried state richer, but memory bandwidth, kernel requirements, and serving latency decide whether they beat retrieval, unique layers, or explicit memory tokens.
  • Invariance can erase information. For time-series and world-model work, augmentations or objectives that remove scale, phase, local detail, channel identity, action information, or exogenous variables may make final embeddings brittle even when earlier layers still encode those factors.
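
The first and third gotchas can be made concrete with a minimal layerwise probe that scores each candidate layer on the target task's own held-out split. This is a sketch under stated assumptions: the nearest-centroid probe stands in for whatever probe the target protocol actually prescribes, the layer names and feature values are made-up toy data, and the function names are hypothetical.

```python
def centroid_probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit a nearest-class-centroid probe on train features, score it on test features."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Squared Euclidean distance to each class centroid; ties go to the
        # class seen first in the training data.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def best_layer(features_by_layer, labels, train_idx, test_idx):
    """Score every layer with the same target split and return the best one."""
    scores = {}
    for name, feats in features_by_layer.items():
        scores[name] = centroid_probe_accuracy(
            [feats[i] for i in train_idx], [labels[i] for i in train_idx],
            [feats[i] for i in test_idx], [labels[i] for i in test_idx],
        )
    return max(scores, key=scores.get), scores


# Toy data: the target label depends on a factor (e.g. scale) that the
# hypothetical final layer has been trained to ignore.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features_by_layer = {
    "block_6": [[0.1], [0.2], [0.0], [0.1], [1.0], [0.9], [1.1], [1.0]],
    "final": [[0.5], [0.4], [0.5], [0.4], [0.5], [0.4], [0.5], [0.4]],
}
layer, scores = best_layer(
    features_by_layer, labels, train_idx=[0, 1, 4, 5], test_idx=[2, 3, 6, 7]
)
```

In this toy setup the intermediate layer wins because the final layer carries no information about the target factor. Changing the labels, the split, or the probe can change the winner, which is the sense in which the best layer is a protocol variable.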

Implications For Time-Series And World Models

For time-series encoders, the analogous question is not “which visual layer is best?” but “which latent state preserves the dynamics needed by the downstream task?” A forecasting head, reconstruction head, contrastive projector, or language-alignment adapter may optimize a useful interface while suppressing variables needed for classification, anomaly detection, counterfactual prediction, or action-conditioned world modeling.

When evaluating temporal representation models, the wiki should prefer reports that identify the probed layer, head, pooling rule, adaptation budget, target split, and whether the evaluation is in-distribution or OOD.
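
One way to apply this preference is a small record type that makes missing protocol details visible. The sketch below is illustrative only: the `ProbeReport` class, its field names, and the example values are hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProbeReport:
    """Hypothetical record of the protocol details a layerwise evaluation should state."""
    layer: Optional[str] = None              # which layer was probed
    head: Optional[str] = None               # which head or projector, if any
    pooling: Optional[str] = None            # how token or time steps were pooled
    adaptation_budget: Optional[str] = None  # e.g. linear probe vs. fine-tuning
    target_split: Optional[str] = None       # which split produced the score
    in_distribution: Optional[bool] = None   # False means an OOD evaluation


def missing_protocol_fields(report):
    """List the protocol fields a report leaves unstated, in declaration order."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


complete = ProbeReport(
    layer="block_6",
    head="none",
    pooling="mean over time",
    adaptation_budget="linear probe, frozen encoder",
    target_split="held-out test",
    in_distribution=False,
)
partial = ProbeReport(layer="final", pooling="cls")
```

Calling `missing_protocol_fields(partial)` returns the fields that still need to be stated before the result can be compared with other layerwise reports.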

Relation To Foundation TSFM Agenda

This page is a representation-quality warning for the Foundation Time-Series Model Research Agenda. It belongs under the semantic-state-versus-dense-numeric-detail slot: a final embedding can be convenient for one objective while an intermediate layer preserves the state needed for forecasting, generation, classification, anomaly detection, or intervention reasoning.

Open Questions

  • Can one pretraining objective produce a final representation that is simultaneously good for global semantics, local structure, temporal dynamics, and intervention-sensitive state?
  • Which layerwise diagnostics best predict downstream transfer before training many probes?
  • Should foundation-model releases expose stable intermediate-layer APIs by default?
  • Should generation and world-modeling systems use fixed layer sums, learned weights, sparse layer selection, feature-pyramid-style fusion, matrix residual streams, or depth attention over intermediate states?
  • Can bounded depth-KV slots preserve task-relevant intermediate state under a fixed serving budget?
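
To make the aggregator comparison in these questions concrete, the following toy sketch puts three options behind one interface: a fixed sum of the last k states, a softmax-weighted sum with learnable logits, and dot-product attention over stored layer states with an optional slot bound. It is not the MoDA, mHC, or RAEv2 implementation. The layer states double as keys and values with no projections, the "keep the most recent slots" rule is an assumption, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(states, weights):
    """Combine per-layer state vectors with one scalar weight per layer."""
    dim = len(states[0])
    return [sum(w * s[d] for w, s in zip(weights, states)) for d in range(dim)]


def fixed_sum(states, k):
    """Weight 1 on the last k layers, 0 elsewhere."""
    weights = [0.0] * (len(states) - k) + [1.0] * k
    return weighted_sum(states, weights)


def learned_weights(states, logits):
    """Input-independent layer weights; the logits would be trained parameters."""
    return weighted_sum(states, softmax(logits))


def depth_attention(states, query, max_slots=None):
    """Input-dependent weights: the query attends over stored layer states.

    max_slots bounds how many states are kept (assumed here: the most recent).
    """
    slots = states if max_slots is None else states[-max_slots:]
    scale = math.sqrt(len(query))
    scores = [sum(q * v for q, v in zip(query, s)) / scale for s in slots]
    return weighted_sum(slots, softmax(scores))


states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
summed = fixed_sum(states, k=2)
averaged = learned_weights(states, logits=[0.0, 0.0, 0.0])
bounded = depth_attention(states, query=[0.0, 0.0], max_slots=2)
focused = depth_attention(states, query=[0.0, 10.0])
```

The first two aggregators apply the same weights to every input, while the attention variant changes its weights with the query. The slot bound caps how much intermediate state must be cached, at the cost of dropping older layers, which is the tradeoff the last question asks about.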