Vision Foundation Models
Summary
Vision foundation models in this wiki are evaluated not only by classification accuracy but also by dense feature quality, downstream transfer, latent-space usefulness, and compatibility with generation.
What The Wiki Currently Believes
- DINOv3 is the strongest scaled self-supervised learning (SSL) vision reference, emphasizing versatility as a frozen backbone and dense feature quality.
- ELT is a visual-generation architecture source, not a general vision-encoder source. It matters here because it shows a looped-depth route to parameter-efficient image/video generation and anytime inference.
- Florence-2 is the strongest current data-engine example: a compact prompt-based generalist vision model becomes practical because FLD-5B supplies dense, multi-task annotations built through specialist annotation, filtering, and iterative refinement.
- Genie is adjacent to vision foundation models through spatiotemporal video tokenization and interactive video generation, but its main wiki role is world-model/control-interface evidence rather than frozen visual representation transfer.
- Perception Encoder is the strongest current warning that final-layer visual embeddings can hide the best general features; its language and spatial alignment stages are downstream repairs, not proof that the raw output was already universal.
- Guillotine Regularization provides the earlier projector-level version of the same lesson: layer choice is part of transfer evaluation.
- NEPA explores whether next-embedding prediction can be a simple generative-pretraining alternative for vision.
- Prism argues that semantic and pixel encoders occupy different spectral roles.
- RAEv2 shows a practical middle path for generation: aggregate multiple pretrained encoder layers, then use REPA to regularize spatial structure and provide internal guidance.
- "Reconstruction or Semantics?" shows that semantic visual latents can be more useful than reconstruction latents for policy-relevant robotic world models.
- Self-Teaching Autoencoder is a blog/code boundary case for decoder-grounded latent consistency: it suggests a decoder can be trained through transformed representation agreement, but it is not foundation-scale or peer-reviewed evidence.
- TiViT shows that intermediate features from frozen vision encoders can be repurposed for time-series classification when numeric series are rendered as images.
- Tuna-2 challenges reliance on pretrained vision encoders by learning pixel embeddings end to end.
- Gemma 4 12B adds release-level evidence that a vision encoder can be replaced by a lightweight patch projection frontend in a production/open-weight multimodal model, with the shared LLM backbone taking over visual processing.
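
To make the last bullet concrete, the following is a minimal sketch of what a "lightweight patch projection frontend" could look like in general: cut the image into non-overlapping patches, flatten each one, and apply a single shared linear map so that each patch becomes one token for the backbone. It is a generic illustration, not Gemma 4 12B's actual implementation. The function names (`patchify`, `project`, `make_frontend`), the grayscale input, the patch size, the token width, and the random initialization are all assumptions made for the example.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Row-major flattening of one patch.
            flat = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            patches.append(flat)
    return patches


def project(patches, weights, bias):
    """Apply one shared linear map to every flattened patch: token = W @ patch + b."""
    return [[sum(w * x for w, x in zip(row, p)) + b
             for row, b in zip(weights, bias)]
            for p in patches]


def make_frontend(patch, dim, seed=0):
    """Hypothetical initializer: small Gaussian weights, zero bias."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(patch * patch)]
               for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


# Toy 8 x 8 image, 4 x 4 patches, 6-dimensional tokens (all illustrative sizes).
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
weights, bias = make_frontend(patch=4, dim=6)
tokens = project(patchify(image, 4), weights, bias)
print(len(tokens), len(tokens[0]))
```

The sketch has no attention, no normalization, and no positional information, so any spatial reasoning has to happen downstream. That is the design tension behind the open question about when such a frontend starts behaving like a small encoder again.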
Evidence
The corpus does not point to one universal visual representation. Instead, it maps tradeoffs among semantic abstraction, dense spatial fidelity, pixel-level generation, lightweight projection into a shared backbone, downstream control, cross-domain transfer to non-image data such as rendered time series, data-engine quality, generation-time compute allocation, and the layer at which useful state is exposed.
RAEv2 sharpens that tradeoff. It supports the idea that pretrained semantic encoders can be made more useful for generation by exposing more than the final layer, but the linked discussion warns that the exact layer-weighting mechanism remains open.
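
As a rough illustration of "exposing more than the final layer", the sketch below combines per-layer feature vectors for one token using softmax-normalized weights. Because the layer-weighting mechanism is described above as open, the softmax weighting here is only one hypothetical choice and should not be read as RAEv2's method. The names `softmax` and `aggregate_layers` and the toy feature values are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_logits):
    """Convex combination of per-layer feature vectors for a single token.

    layer_features: one equal-length feature vector per encoder layer.
    layer_logits:   one unnormalized weight per layer (hypothetical parameters).
    """
    if len(layer_features) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [sum(w * feats[i] for w, feats in zip(weights, layer_features))
            for i in range(dim)]


# Toy features from three layers (placeholder values, not real encoder outputs).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = aggregate_layers(layers, [0.0, 0.0, 0.0])          # equal weight per layer
last_only = aggregate_layers(layers, [-50.0, -50.0, 50.0])   # approximates final-layer readout
print(uniform, last_only)
```

With equal logits the result is the plain average of the layers. Pushing one logit far above the others recovers single-layer readout, so final-layer-only use is a special case of the same parameterization.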
Relation To Foundation TSFM Agenda
Vision foundation models are adjacent to the Foundation Time-Series Model Research Agenda as representation-learning analogs. They are useful for the semantic-state-versus-dense-detail slot, intermediate-layer access, generation/understanding tension, and data-engine design. They should not be counted as time-series evidence except where a source explicitly bridges into rendered or numeric time-series tasks.
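
To make "rendered" concrete, here is a minimal sketch of one way a numeric series could be turned into an image-like grid before being passed to a frozen vision encoder: min-max scale the values, then fold the series row by row into a 2D array. This is an illustrative assumption, not the rendering procedure of TiViT or of any other source cited here. The function name `fold_series`, the zero padding, and the grid width are hypothetical.

```python
def fold_series(series, width):
    """Min-max scale a 1D series to [0, 1] and fold it row by row into a 2D grid."""
    if not series or width <= 0:
        raise ValueError("need a non-empty series and a positive width")
    low, high = min(series), max(series)
    span = (high - low) or 1.0          # avoid division by zero for constant series
    scaled = [(v - low) / span for v in series]
    pad = (-len(scaled)) % width        # zero-pad the final row (an assumption)
    scaled += [0.0] * pad
    return [scaled[i:i + width] for i in range(0, len(scaled), width)]


grid = fold_series([3, 5, 7, 9, 11, 13, 15], width=4)
for row in grid:
    print(row)
```

Each row holds a contiguous segment of the series, so vertical neighbors in the grid are `width` time steps apart. Whether a vision encoder exploits that structure, or simply benefits from 2D patching, is exactly the ambiguity raised in the open questions.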
Open Questions
- Can one visual representation support dense prediction, generation, policy evaluation, and VQA without task-specific compromise?
- Should vision foundation model benchmarks always report best-layer and last-layer performance separately?
- Can pixel-space end-to-end training substitute for pretrained semantic encoders as models scale?
- At what point does a Gemma 4 12B-style projection frontend need so many parameters or so much positional structure that it effectively behaves like a small encoder again?
- When are intermediate vision features useful because of visual pretraining scale, and when are they useful because the input was converted into a 2D patching problem?
- Can Florence-style iterative data engines be replicated for temporal domains without amplifying self-labeling errors?
- Do RAEv2-style multi-layer semantic latents, Prism-style unified autoencoding, or Tuna-2-style pixel embeddings scale better once high-resolution generation and action-conditioned rollout are both required?
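
The question above about reporting best-layer and last-layer performance separately can be sketched as a small reporting helper. It assumes a per-layer probe metric has already been computed on held-out data, where higher is better. The function name `layer_report` and the scores passed to it are placeholders invented for the example, not results from any source in this wiki.

```python
def layer_report(scores):
    """Summarize per-layer probe scores, ordered from first to last layer."""
    if not scores:
        raise ValueError("need at least one layer score")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    last_index = len(scores) - 1
    return {
        "best_layer": best_index,
        "best_score": scores[best_index],
        "last_layer": last_index,
        "last_score": scores[last_index],
        # A positive gap means the final layer is not the best readout point.
        "gap": scores[best_index] - scores[last_index],
    }


# Placeholder numbers for illustration only.
report = layer_report([0.41, 0.58, 0.66, 0.71, 0.63])
print(report)
```

Selecting the best layer on the same split used for reporting would inflate the best-layer number, so in practice the layer choice would need its own validation split.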