Vision Foundation Models

Summary

This wiki evaluates vision foundation models not only by classification accuracy but also by dense feature quality, downstream transfer, latent-space usefulness, and compatibility with generation.

What The Wiki Currently Believes

DINOv3 is the strongest scaled SSL vision reference, emphasizing frozen versatility and dense feature quality.
S4L is a historical ICCV 2019 oral source for combining self-supervised and semi-supervised image learning. It is not current vision SOTA, but it is useful context for label-scarce ImageNet baselines and unlabeled-data objectives.
ELT is a visual-generation architecture source, not a general vision-encoder source. It matters here because it shows a looped-depth route for parameter-efficient image/video generation and any-time inference.
Florence-2 is the strongest current data-engine example: a compact prompt-based generalist vision model becomes practical because FLD-5B supplies dense, multi-task annotations built through specialist annotation, filtering, and iterative refinement.
Action100M adds the large video-action data-engine case: V-JEPA 2 hierarchy, PerceptionLM/Llama captions, and GPT-OSS Self-Refine produce multi-scale open-vocabulary action supervision. The release is only a 10% preview and remains automatic pseudo-label evidence rather than human-ground-truth physical state.
Motive adds a video-generation data-curation branch: motion-masked projected gradients identify fine-tuning clips that improve target temporal dynamics, but this is data-attribution evidence rather than a general visual representation or world-model result.
Genie is adjacent to vision foundation models through spatiotemporal video tokenization and interactive video generation, but its main wiki role is world-model/control-interface evidence rather than frozen visual representation transfer.
Perception Encoder is the strongest current warning that final-layer visual embeddings can hide the best general features; its language and spatial alignment stages are downstream repairs, not proof that the raw output was already universal.
Guillotine Regularization provides the earlier projector-level version of the same lesson: layer choice is part of transfer evaluation.
NEPA explores whether next-embedding prediction can be a simple generative-pretraining alternative for vision.
LeVLJEPA adds vision-language evidence that non-contrastive cross-modal pretraining can improve dense patch-token and frozen-backbone transfer while trailing contrastive objectives on pooled zero-shot alignment.
Prism argues that semantic and pixel encoders occupy different spectral roles.
RAEv2 shows a practical middle path for generation: aggregate multiple pretrained encoder layers, then use REPA to regularize spatial structure and provide internal guidance.
The Thinking Pixel adds the generation-time dynamic-compute variant: recursive sparse adapter experts refine visual diffusion latents inside joint attention, improving some text-image alignment metrics but without a public code/model release.
Reconstruction or Semantics? shows semantic visual latents can be more useful than reconstruction latents for policy-relevant robotic world models.
Self-Teaching Autoencoder is a blog/code boundary case for decoder-grounded latent consistency: it suggests a decoder can be trained through transformed representation agreement, but it is not foundation-scale or peer-reviewed evidence.
TiViT shows that intermediate features from frozen vision encoders can be repurposed for time-series classification when numeric series are rendered as images.
Tuna-2 challenges reliance on pretrained vision encoders by learning pixel embeddings end to end.
Gemma 4 12B adds release-level evidence that a vision encoder can be replaced by a lightweight patch projection frontend in a production/open-weight multimodal model, with the shared LLM backbone taking over visual processing.
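The Perception Encoder and Guillotine Regularization entries above both imply that layer choice belongs inside the evaluation protocol. The following is a minimal sketch of that idea, not any paper's protocol: it fits a closed-form ridge probe on frozen features from each layer and reports best-layer and last-layer accuracy side by side. The function names, the ridge probe, and the synthetic features are all assumptions made for illustration.

```python
import numpy as np


def ridge_probe_accuracy(train_x, train_y, val_x, val_y, num_classes, l2=1e-2):
    """Fit a closed-form ridge classifier on frozen features; return val accuracy."""
    targets = np.eye(num_classes)[train_y]                      # one-hot labels
    gram = train_x.T @ train_x + l2 * np.eye(train_x.shape[1])  # regularized Gram matrix
    weights = np.linalg.solve(gram, train_x.T @ targets)        # shape (dim, classes)
    preds = (val_x @ weights).argmax(axis=1)
    return float((preds == val_y).mean())


def layerwise_probe_report(train_feats, train_y, val_feats, val_y, num_classes):
    """Probe every layer and report best-layer and last-layer accuracy separately."""
    accs = [
        ridge_probe_accuracy(ftr, train_y, fva, val_y, num_classes)
        for ftr, fva in zip(train_feats, val_feats)
    ]
    best = int(np.argmax(accs))
    return {
        "per_layer": accs,
        "best_layer": best,
        "best_acc": accs[best],
        "last_layer": len(accs) - 1,
        "last_acc": accs[-1],
    }


# Synthetic stand-in for a frozen encoder (hypothetical): the class signal is
# strongest in a middle layer and weaker at the final layer.
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
class_means = rng.normal(size=(num_classes, dim))
signal_per_layer = [0.2, 1.0, 2.0, 0.3]


def fake_layer_features(labels):
    return [
        s * class_means[labels] + rng.normal(size=(len(labels), dim))
        for s in signal_per_layer
    ]


train_y = rng.integers(0, num_classes, size=300)
val_y = rng.integers(0, num_classes, size=150)
report = layerwise_probe_report(
    fake_layer_features(train_y), train_y,
    fake_layer_features(val_y), val_y,
    num_classes,
)
print(report)
```

Selecting the best layer and reporting its accuracy on the same split is optimistic; a real comparison would pick the layer on a validation split and report on a held-out test split.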

Evidence

The corpus does not point to one universal visual representation. Instead, it maps tradeoffs across many axes: semantic abstraction, dense spatial fidelity, pixel-level generation, lightweight projection into a shared backbone, downstream control, cross-domain transfer to non-image data such as rendered time series, data-engine quality, generation-time compute allocation, recursive latent refinement, objective choice, and the layer or token level at which useful state is exposed.

RAEv2 sharpens that tradeoff. It supports the idea that pretrained semantic encoders can be made more useful for generation by exposing more than the final layer, but the linked discussion warns that the exact layer-weighting mechanism remains open.
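To make "exposing more than the final layer" concrete, here is a minimal sketch of one possible aggregation scheme: standardize each layer's tokens, then take a softmax-weighted sum over layers. This is an illustrative assumption, not RAEv2's mechanism, which the linked discussion treats as open. The names aggregate_layers and layer_logits are hypothetical, and in practice the logits would be learned rather than set by hand.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max()            # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def aggregate_layers(layer_tokens, layer_logits):
    """Combine per-layer token features into a single latent.

    layer_tokens: array-like of shape (num_layers, num_tokens, dim)
    layer_logits: array-like of shape (num_layers,), one scalar per layer
    Returns (latent of shape (num_tokens, dim), layer weights).
    """
    stacked = np.asarray(layer_tokens, dtype=np.float64)
    # Standardize each token vector so no layer dominates purely by scale.
    mean = stacked.mean(axis=-1, keepdims=True)
    std = stacked.std(axis=-1, keepdims=True) + 1e-6
    normed = (stacked - mean) / std
    weights = softmax(np.asarray(layer_logits, dtype=np.float64))
    # Weighted sum over the layer axis.
    latent = np.tensordot(weights, normed, axes=(0, 0))
    return latent, weights


# Toy input: 3 layers, 5 patch tokens, 8 feature dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 5, 8))
latent, weights = aggregate_layers(tokens, layer_logits=[0.0, 0.5, 1.0])
print(latent.shape, weights)
```

Pushing one logit far above the others recovers single-layer behavior, so a final-layer-only latent is a special case of this sketch rather than a separate design.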

Relation To Foundation TSFM Agenda

Vision foundation models are adjacent to the Foundation Time-Series Model Research Agenda as representation-learning analogs. They are useful for the semantic-state-versus-dense-detail slot, intermediate-layer access, generation/understanding tension, and data-engine design. They should not be counted as time-series evidence except where a source explicitly bridges into rendered or numeric time-series tasks.
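As an illustration of what "rendered" can mean in that bridge case, the sketch below folds a 1D series into a 2D grid and resizes it to a square three-channel image that a frozen vision encoder could consume. It is a simplified assumption, not the rendering used by TiViT or any other source; the name series_to_image, the patch_len and out_size parameters, the min-max scaling, and the nearest-neighbor resize are all hypothetical choices.

```python
import numpy as np


def series_to_image(series, patch_len, out_size=32):
    """Render a 1D numeric series as an (out_size, out_size, 3) image in [0, 1]."""
    x = np.asarray(series, dtype=np.float64)
    # Min-max scale to [0, 1]; a constant series maps to all zeros.
    lo, hi = x.min(), x.max()
    x = (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
    # Pad with the last value so the length is a multiple of patch_len.
    pad = (-len(x)) % patch_len
    x = np.pad(x, (0, pad), mode="edge")
    # Each row of the grid is one contiguous segment of the series.
    grid = x.reshape(-1, patch_len)
    # Nearest-neighbor resize to a square image.
    row_idx = np.arange(out_size) * grid.shape[0] // out_size
    col_idx = np.arange(out_size) * grid.shape[1] // out_size
    image = grid[row_idx][:, col_idx]
    # Repeat the single channel three times to mimic an RGB input.
    return np.repeat(image[..., None], 3, axis=-1)


series = np.sin(np.linspace(0.0, 8.0 * np.pi, 100))
image = series_to_image(series, patch_len=10)
print(image.shape)
```

Any accuracy obtained this way depends on both the encoder and the folding choice, which is why such results count as bridged evidence rather than native time-series evidence.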

Open Questions

Can one visual representation support dense prediction, generation, policy evaluation, and VQA without task-specific compromise?
Can non-contrastive cross-modal objectives preserve dense features better than contrastive objectives without sacrificing too much global alignment?
Should vision foundation model benchmarks always report best-layer and last-layer performance separately?
Is pixel-space end-to-end training a scaling substitute for pretrained semantic encoders?
When does a Gemma 4 12B-style projection frontend need enough parameters or positional structure that it starts behaving like a small encoder again?
When are intermediate vision features useful because of visual pretraining scale, and when are they useful because the input was converted into a 2D patching problem?
Can Florence-style iterative data engines be replicated for temporal domains without amplifying self-labeling errors?
When should a visual generator spend extra recursive latent compute instead of using a stronger encoder, a larger backbone, better guidance, or more targeted fine-tuning?
Do RAEv2-style multi-layer semantic latents, Prism-style unified autoencoding, or Tuna-2-style pixel embeddings scale better once high-resolution generation and action-conditioned rollout are both required?

Alex Open Research Wiki
