Helix: A Vision-Language-Action Model for Generalist Humanoid Control

Source

Raw Markdown: paper_helix-2025.md
Official technical writeup: Helix: A Vision-Language-Action Model for Generalist Humanoid Control
Product page: Figure Helix

Core Claim

Helix is Figure AI’s generalist humanoid VLA for natural-language-conditioned upper-body control. The official writeup presents a two-system architecture: a slower VLM semantic layer produces a latent task representation, and a faster visuomotor policy turns that latent plus high-frequency observations into continuous upper-body control inputs.
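As a rough illustration of the two-system split, the sketch below simulates a fast control loop that reuses the most recent latent from a slower semantic layer. It is not Figure's implementation: `slow_semantic_layer`, `fast_policy`, and `simulate_dual_rate` are hypothetical stand-ins, the integer "latent" only records when it was computed, and the slow layer is assumed to return instantly on a fixed schedule, whereas a real deployment would run it asynchronously with nonzero latency. The default rates (200 Hz fast, 8 Hz slow) are taken from the figures the writeup reports.

```python
from dataclasses import dataclass


@dataclass
class LoopStats:
    fast_steps: int            # control ticks executed by the fast policy
    slow_updates: int          # times the slow layer refreshed the latent
    max_latent_age_steps: int  # worst-case latent staleness, in fast ticks


def slow_semantic_layer(tick: int) -> int:
    """Hypothetical stand-in for the VLM: the 'latent' is just the tick it was computed at."""
    return tick


def fast_policy(latent: int, observation: int) -> int:
    """Hypothetical stand-in for the visuomotor policy.

    Combines the (possibly stale) latent with a fresh observation. Here it
    simply reports how many ticks old the latent is.
    """
    return observation - latent


def simulate_dual_rate(duration_s: int = 1, fast_hz: int = 200, slow_hz: int = 8) -> LoopStats:
    total_ticks = duration_s * fast_hz
    latent = 0
    slow_index = -1  # forces a latent refresh on the very first tick
    slow_updates = 0
    max_age = 0
    for tick in range(total_ticks):
        # Integer arithmetic avoids floating-point drift in the schedule.
        due_index = (tick * slow_hz) // fast_hz
        if due_index != slow_index:
            latent = slow_semantic_layer(tick)
            slow_index = due_index
            slow_updates += 1
        # The fast loop runs every tick, whether or not the latent changed.
        age = fast_policy(latent, observation=tick)
        max_age = max(max_age, age)
    return LoopStats(total_ticks, slow_updates, max_age)


if __name__ == "__main__":
    print(simulate_dual_rate())
```

Under these assumptions, each latent is reused for roughly 25 consecutive control ticks, which is the practical meaning of the fast policy conditioning on a slowly changing task representation.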

Method Notes

System 2 is described as a 7B open-source/open-weight VLM running at 7-9 Hz for scene understanding and language comprehension.
System 1 is described as an 80M cross-attention encoder-decoder visuomotor Transformer running a 200 Hz upper-body control loop.
The official writeup says Helix is trained end-to-end from pixels and text to continuous actions with a standard regression loss; it does not claim a diffusion or flow-matching objective.
Helix is an action generator over humanoid trajectories, not a validated future-observation world model.
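To make "standard regression loss" concrete, here is a minimal sketch of one common choice, mean squared error over continuous action vectors. The writeup does not say which regression loss is used, so MSE is an assumption, and `mse_action_loss` is a hypothetical helper written for illustration only.

```python
from typing import Sequence


def mse_action_loss(
    predicted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> float:
    """Mean squared error over a batch of continuous action vectors.

    Assumption: MSE is used as an example of a regression loss; the Helix
    writeup does not specify the exact loss.
    """
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("predicted and target must be non-empty and equal in length")
    total = 0.0
    count = 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("action vectors must have matching dimensions")
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("action vectors must not be empty")
    return total / count


if __name__ == "__main__":
    # One sample with a two-dimensional action: errors are 0 and 2.
    print(mse_action_loss([[0.0, 1.0]], [[0.0, 3.0]]))
```

Unlike a diffusion or flow-matching objective, a loss of this kind trains the policy to output a single point estimate of the action for each input rather than to sample from a distribution over actions.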

Evidence And Limitations

The writeup reports about 500 hours of teleoperated trajectory data, onboard deployment on dual embedded GPUs, qualitative full-upper-body control, multi-robot collaboration, and object generalization. Caveats are substantial: the evidence is company-published, the dataset and weights are not public, and detailed benchmark protocols/failure rates are not provided.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Causal structure, counterfactuals, and control	adjacent	Helix maps pixels, text, wrist/finger state, and task latents to continuous upper-body action targets in a 200 Hz control loop, which is an analogy for digital-world robot actuation.	It is an action generator, not a validated future-observation or counterfactual world model.
Fast/slow architecture	adjacent	The raw writeup separates a slower 7B VLM semantic layer from an 80M high-rate visuomotor Transformer.	No public ablations showing which temporal abstraction level is reusable for TSFMs.
Evidence quality	warning	Claims are based on a company technical writeup with no released data, weights, benchmark protocols, or failure statistics.	Needs independent datasets and reproducible evaluation.

Links Into The Wiki

Open Questions

Is regression over continuous high-rate humanoid actions enough when paired with a strong slow semantic latent?
Which Helix claims can be independently evaluated without public model weights or datasets?
