Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Source
- Raw Markdown: paper_helix-2025.md
- Official technical writeup: Helix: A Vision-Language-Action Model for Generalist Humanoid Control
- Product page: Figure Helix
Core Claim
Helix is Figure AI’s generalist humanoid VLA for natural-language-conditioned upper-body control. The official writeup presents a two-system architecture: a slower VLM semantic layer produces a latent task representation, and a faster visuomotor policy turns that latent plus high-frequency observations into continuous upper-body control inputs.
Method Notes
- System 2 is described as a 7B-parameter open-source, open-weight VLM running at 7-9 Hz for scene understanding and language comprehension.
- System 1 is described as an 80M-parameter cross-attention encoder-decoder visuomotor Transformer running a 200 Hz upper-body control loop.
- The official writeup says Helix is trained end-to-end from pixels and text to continuous actions with a standard regression loss; it does not claim a diffusion or flow-matching action head.
- Helix is an action generator over humanoid trajectories, not a validated future-observation world model.
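The fast/slow split above can be illustrated with a toy scheduling sketch. This is not Figure's implementation: the function names (`run_dual_rate`, `toy_observe`, `toy_slow_policy`, `toy_fast_policy`) are hypothetical, the 8 Hz slow rate is an assumed value inside the reported 7-9 Hz range, and both systems run in one synchronous loop over simulated ticks rather than as separate asynchronous processes. It only shows the idea that the fast policy acts on every tick while reusing the most recent slow latent.

```python
from dataclasses import dataclass
from typing import Callable, List

FAST_HZ = 200  # System 1 control rate reported in the writeup
SLOW_HZ = 8    # ASSUMPTION: one value inside the reported 7-9 Hz range


@dataclass
class LoopStats:
    fast_steps: int
    slow_updates: int
    actions: List[List[float]]


def run_dual_rate(
    n_fast_ticks: int,
    observe: Callable[[int], float],
    slow_policy: Callable[[float], List[float]],
    fast_policy: Callable[[List[float], float], List[float]],
) -> LoopStats:
    """Simulate a fast control loop that reuses the latest slow latent."""
    latent: List[float] = []
    last_slow_index = -1
    slow_updates = 0
    actions: List[List[float]] = []
    for tick in range(n_fast_ticks):
        obs = observe(tick)
        # Integer arithmetic decides which slow period this tick falls in.
        slow_index = tick * SLOW_HZ // FAST_HZ
        if slow_index != last_slow_index:
            # A new slow period started: refresh the latent.
            latent = slow_policy(obs)
            last_slow_index = slow_index
            slow_updates += 1
        # The fast policy runs on every tick with the latest latent.
        actions.append(fast_policy(latent, obs))
    return LoopStats(len(actions), slow_updates, actions)


# Hypothetical toy stand-ins for the sensors and the two networks.
def toy_observe(tick: int) -> float:
    return tick / FAST_HZ  # simulated time in seconds


def toy_slow_policy(obs: float) -> List[float]:
    return [obs]  # "latent" = observation at the last slow update


def toy_fast_policy(latent: List[float], obs: float) -> List[float]:
    return [latent[0], obs - latent[0]]  # latent plus time since update


if __name__ == "__main__":
    stats = run_dual_rate(FAST_HZ, toy_observe, toy_slow_policy, toy_fast_policy)
    print(stats.fast_steps, stats.slow_updates)
```

With these assumed rates, each slow latent is reused for 25 consecutive fast ticks. A real deployment would presumably run the two systems concurrently, so the latent seen by the fast loop would also carry inference latency that this sketch ignores.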
Evidence And Limitations
The writeup reports about 500 hours of teleoperated trajectory data, onboard deployment on dual embedded GPUs, qualitative full-upper-body control, multi-robot collaboration, and object generalization. Caveats are substantial: the evidence is company-published, the dataset and weights are not public, and neither detailed benchmark protocols nor failure rates are provided.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | adjacent | Helix maps pixels, text, wrist/finger state, and task latents to continuous upper-body action targets in a 200 Hz control loop, which is an analogy for digital-world robot actuation. | It is an action generator, not a validated future-observation or counterfactual world model. |
| Fast/slow architecture | adjacent | The raw writeup separates a slower 7B VLM semantic layer from an 80M high-rate visuomotor Transformer. | No public ablations showing which temporal abstraction level is reusable for TSFMs. |
| Evidence quality | warning | Claims are based on a company technical writeup with no released data, weights, benchmark protocols, or failure statistics. | Needs independent datasets and reproducible evaluation. |
Links Into The Wiki
- Helix
- Helix 02
- Foundation Time-Series Model Research Agenda
- Robotics Time-Series Modeling
- Robotics Text Conditioning
Open Questions
- Is regression over continuous high-rate humanoid actions enough when paired with a strong slow semantic latent?
- Which Helix claims can be independently evaluated without public model weights or datasets?