Florence-2

Summary

Florence-2 is a prompt-based vision foundation model from Microsoft (Azure AI), trained on FLD-5B, a dataset produced by an iterative data engine. It serializes task outputs as text tokens or location tokens, so a single sequence-to-sequence model covers captioning, object detection, grounding, segmentation, OCR-style tasks, and related vision-language tasks.
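The idea of emitting locations as tokens can be illustrated with a small sketch. It assumes coordinates are quantized into 1000 bins per axis and spelled as `<loc_N>` tokens; the bin count, the token spelling, and the function names `box_to_loc_tokens` and `loc_tokens_to_box` are illustrative assumptions, not details stated on this page.

```python
import re

def box_to_loc_tokens(box, width, height, bins=1000):
    """Quantize an (x0, y0, x1, y1) pixel box into a string of location tokens."""
    x0, y0, x1, y1 = box

    def quantize(value, size):
        # Map a pixel coordinate to an integer bin in [0, bins - 1].
        return min(bins - 1, max(0, int(value / size * bins)))

    ids = [quantize(x0, width), quantize(y0, height),
           quantize(x1, width), quantize(y1, height)]
    return "".join(f"<loc_{i}>" for i in ids)

def loc_tokens_to_box(text, width, height, bins=1000):
    """Decode four location tokens back to pixel coordinates (bin centers)."""
    ids = [int(m) for m in re.findall(r"<loc_(\d+)>", text)]
    if len(ids) != 4:
        raise ValueError("expected exactly four location tokens")
    sizes = (width, height, width, height)
    return tuple((i + 0.5) / bins * s for i, s in zip(ids, sizes))

tokens = box_to_loc_tokens((100, 50, 300, 200), width=400, height=400)
decoded = loc_tokens_to_box(tokens, width=400, height=400)
```

Because the box becomes an ordinary token string, the same decoder that writes a caption can write a detection, and the round trip loses at most one bin width of precision per coordinate.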

Role In The Wiki

Florence-2 is the main local example of a practical foundation model whose value comes from an iterative dataset engine, dense multi-task labels, and a structured output contract. It complements Perception Encoder: Perception Encoder uses a video data engine to create better image-video training pairs, while Florence-2 uses a broader visual annotation engine to create dense, multi-task supervision.

For time-series research, Florence-2 is a cross-domain pattern for handling label scarcity and designing outputs: create many structured labels for the same observation, train a seed model to repair annotations when real data is noisily or only partially labeled, and expose non-text outputs through a promptable interface instead of forcing everything into free-form prose.

Data Engine

The Florence data engine starts with image collections and existing partial labels, adds synthetic labels from specialist models, filters noisy text and region annotations, and iteratively refines the dataset with a trained multitask model. FLD-5B contains 126M images and 5.4B annotations across text, region-text, and text-phrase-region formats.
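The annotate, filter, and refine loop described above can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions, not the Florence-2 implementation: `Annotation`, `Record`, `run_engine`, the confidence threshold, and the toy specialists are all hypothetical names, and `refine` stands in for re-annotation by a model trained on the current dataset.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str     # "text", "region-text", or "text-phrase-region"
    value: str
    score: float  # annotator confidence in [0, 1] (hypothetical field)

@dataclass
class Record:
    image_id: str
    annotations: list = field(default_factory=list)

def run_engine(records, specialists, refine, min_score=0.5, rounds=2):
    """Toy data engine: add synthetic labels, then filter and refine in rounds."""
    # Step 1: add synthetic labels from specialist models to existing labels.
    for rec in records:
        for specialist in specialists:
            rec.annotations.extend(specialist(rec))
    for _ in range(rounds):
        for rec in records:
            # Step 2: drop noisy annotations below the confidence threshold.
            rec.annotations = [a for a in rec.annotations if a.score >= min_score]
            # Step 3: let the stand-in for the trained multitask model relabel.
            rec.annotations = refine(rec)
    return records

# Toy stand-ins for specialist models and the refinement model.
def captioner(rec):
    return [Annotation("text", f"caption for {rec.image_id}", 0.9)]

def detector(rec):
    return [Annotation("region-text", "box: cat", 0.3),
            Annotation("region-text", "box: dog", 0.8)]

def refine(rec):
    return [Annotation(a.kind, a.value, min(1.0, a.score + 0.1))
            for a in rec.annotations]

result = run_engine([Record("img0")], [captioner, detector], refine)
```

In this toy run the low-confidence "cat" region is filtered out in the first round, while the surviving annotations are rescored on each pass.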

TSL-JEPA Translation

For TSL-JEPA, Florence-2 is a reminder that the data distribution and the output contract are part of the model recipe. If the goal is a general query-conditioned representation, a time-series window should not be paired with only a single loose text caption. It should carry dense structured targets: labels, numeric properties, shape tags, event flags, retrieval views, and optional captions.
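One way to picture dense structured targets for a window is a single record with one field per target type. The sketch below is hypothetical: `WindowTargets`, `build_targets`, the z-score spike rule, and the subsampled retrieval view are illustrative choices, not a TSL-JEPA specification.

```python
import statistics
from dataclasses import dataclass
from typing import Optional

@dataclass
class WindowTargets:
    labels: list           # categorical labels for the window
    numeric: dict          # scalar properties, e.g. mean and slope
    shape_tags: list       # coarse shape descriptors
    event_flags: dict      # boolean event indicators
    retrieval_views: list  # alternative views used as retrieval positives
    caption: Optional[str] = None  # free text stays optional

def build_targets(values, spike_z=3.0):
    """Derive several structured targets from one window of floats."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # End-to-end slope per step; a crude trend estimate for illustration only.
    slope = (values[-1] - values[0]) / (len(values) - 1)
    if slope > 0:
        trend = "rising"
    elif slope < 0:
        trend = "falling"
    else:
        trend = "flat"
    # Flag a spike when any point lies more than spike_z deviations from the mean.
    has_spike = std > 0 and any(abs(v - mean) > spike_z * std for v in values)
    return WindowTargets(
        labels=[trend],
        numeric={"mean": mean, "min": min(values), "max": max(values), "slope": slope},
        shape_tags=[trend],
        event_flags={"spike": has_spike},
        retrieval_views=[values[::2]],  # every second sample as a cheap view
    )

targets = build_targets([1.0, 2.0, 3.0, 4.0])
```

The point of the record is that every field can serve as a separate supervision signal for the same window, with the caption left empty when no reliable text exists.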

The other transferable lesson is that structured outputs can be generated through one promptable interface. In time-series systems, a model can output an alert label, scalar value, segment boundary, retrieval embedding, or caption depending on the query, while free text remains a selective readout rather than the whole model objective.
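A minimal sketch of such a promptable interface follows. The task tokens (`<ALERT>`, `<VALUE>`, `<SEGMENT>`, `<EMBED>`, `<CAPTION>`), the `query_model` function, and the hand-written readouts are hypothetical placeholders for learned heads; only the dispatch pattern is the point.

```python
def alert(window, threshold=10.0):
    # Categorical readout: a label, not prose.
    return "high" if max(window) > threshold else "normal"

def value(window):
    # Scalar readout: the window mean.
    return sum(window) / len(window)

def segment(window):
    # Boundary readout: index where the largest step change begins.
    diffs = [abs(b - a) for a, b in zip(window, window[1:])]
    return diffs.index(max(diffs)) + 1

def embed(window):
    # Retrieval readout: a tiny fixed-length summary vector.
    return [min(window), value(window), max(window)]

def caption(window):
    # Free text is just one readout among several.
    return f"{len(window)} samples, mean {value(window):.2f}"

READOUTS = {
    "<ALERT>": alert,
    "<VALUE>": value,
    "<SEGMENT>": segment,
    "<EMBED>": embed,
    "<CAPTION>": caption,
}

def query_model(window, task):
    """Route one window to the readout selected by the task prompt."""
    if task not in READOUTS:
        raise ValueError(f"unknown task prompt: {task}")
    return READOUTS[task](window)

window = [1.0, 1.0, 12.0, 12.0]
outputs = {task: query_model(window, task) for task in READOUTS}
```

Each task prompt returns a differently typed output from the same input, which mirrors the contract described above: the query selects the readout, and text is produced only when asked for.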

Evidence

Official Artifacts

Relation To Foundation TSFM Agenda

Use the source-level agenda mapping in florence-2-2023 rather than duplicating verdict rows here.

At the entity level, Florence-2 is the main local example of a practical foundation model whose value comes from an iterative dataset engine, dense multi-task supervision, and structured output serialization; its relation to Perception Encoder is described under Role In The Wiki. This page should stay as the object card; source pages carry slot-level verdicts, evidence, and missing pieces.