Robotics Text Conditioning
Summary
Text in robotics has moved from a single language instruction embedding to a multi-level control interface. It can now serve as task context, planner output, executable code, subtask handoff, action-token output space, metadata, strategy hints, visual-subgoal descriptions, progress explanation, and a safety/debugging channel.
The current best mental model is not one path from text to motors, but a stack:
human instruction
-> embodied reasoning / planning text
-> subtask, skill, metadata, or subgoal prompt
-> vision-language-action policy or action expert
-> continuous control inputs
In this wiki’s terminology, most text is context, not an action by itself. It becomes an action only when the system turns it into a selected skill, generated policy program, action token, control input, or subtask command passed to a controller.
World Model for Robot Learning Survey uses a broad predictive-control notation where language or task conditioning can specify desired futures. In this wiki, keep that signal as task context unless the system turns it into a selected skill, action token, or control input.
What The Wiki Currently Believes
Established Pattern: Instruction Embeddings
The older language-conditioned policy interface is simple: pair each robot trajectory with a natural-language instruction, encode the instruction, fuse it with visual and robot-state observations, and train a policy to predict actions. Open X-Embodiment is the local source anchor for this line because it standardizes robot trajectories with language instructions and trains RT-X models over image histories and discretized end-effector actions.
This pattern treats text as task context:
image_history + instruction_embedding -> next action or action token
It is useful for object selection, task identity, and simple semantic generalization, but it usually does not expose the model’s intermediate plan.
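A minimal sketch of the instruction-embedding interface, under loud assumptions: `embed_instruction` is a hypothetical hashed bag-of-words encoder standing in for a learned text encoder, and `predict_action` is a toy linear head with arbitrary weights. Neither reflects RT-X or any other real model; the point is only that text enters as a fixed context vector fused with observations.

```python
import hashlib

EMBED_DIM = 8  # assumed toy embedding width


def embed_instruction(text: str) -> list[float]:
    """Hypothetical stand-in for a learned text encoder: hashed bag of words."""
    vec = [0.0] * EMBED_DIM
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % EMBED_DIM] += 1.0
    return vec


def predict_action(image_history: list[list[float]], instruction: str,
                   weights: list[list[float]]) -> list[float]:
    """Concatenate flattened image features with the instruction embedding,
    then apply a toy linear head with one weight row per action dimension."""
    fused = [x for frame in image_history for x in frame]
    fused += embed_instruction(instruction)
    return [sum(w * x for w, x in zip(row, fused)) for row in weights]


# Two frames of 2-d image features plus an 8-d instruction embedding: 12 inputs.
image_history = [[0.1, 0.4], [0.2, 0.5]]
weights = [[0.1] * 12, [-0.1] * 12]  # arbitrary weights for a 2-d action
action = predict_action(image_history, "pick up the red block", weights)
```

Nothing in this interface returns a plan: the only output is the action vector, which is why the pattern is hard to inspect.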
Planner Pattern: Text Selects Skills
The SayCan-style pattern keeps the language model above the motor policy. A high-level instruction is decomposed into textual skills or subtasks, while robot-specific affordance or value functions decide which skill is feasible in the current state. This makes text a planning interface over a library of low-level behaviors.
The core shape is:
instruction + current state + skill list -> selected textual skill
selected skill -> low-level controller
This is still valuable when the robot’s action space is too brittle for end-to-end control, when safety gates are needed, or when developers want interpretable plans.
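The selection step can be sketched as a product of two scores per skill: how useful the language model thinks the skill is for the instruction, and how feasible an affordance or value function thinks it is in the current state. The function name `select_skill` and all numbers below are illustrative assumptions, not values from any source.

```python
def select_skill(llm_scores: dict[str, float], affordances: dict[str, float]) -> str:
    """Pick the skill with the highest usefulness * feasibility score.

    Skills missing from `affordances` are treated as infeasible (score 0).
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Made-up scores for the instruction "clean up the spill".
llm_scores = {"pick up sponge": 0.6, "go to sink": 0.3, "open drawer": 0.1}
# Made-up feasibility: the sponge is not reachable from the current pose.
affordances = {"pick up sponge": 0.1, "go to sink": 0.9, "open drawer": 0.8}

chosen = select_skill(llm_scores, affordances)  # "go to sink"
```

The language model prefers the sponge, but the affordance score vetoes it, so the planner picks a skill the robot can execute now. That veto is the safety gate mentioned above.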
Program Pattern: Text Becomes Policy Code
Code-as-policies systems use an LLM to transform a natural-language command into executable policy code that calls perception and control APIs. This gives text a precise operational form: loops, conditionals, geometry, object references, and feedback can be written explicitly.
This pattern is strong for structured tabletop tasks, geometric instructions, or environments with reliable symbolic/perception APIs. Its weakness is that it depends on API design and runtime guardrails rather than learning the full policy end to end.
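A hedged sketch of the runtime-guardrail side of this pattern: a string of generated policy code is statically checked against a whitelist of API calls before it runs. The API names (`detect`, `move_to`, `grasp`, `release`), the `generated` program, and the stub implementations are all hypothetical. The check is illustrative only; an AST whitelist plus emptied builtins is not a real sandbox.

```python
import ast

ALLOWED_CALLS = {"detect", "move_to", "grasp", "release"}  # hypothetical robot API


def check_policy(source: str) -> None:
    """Reject imports and any call that is not a plain whitelisted function name."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            raise PermissionError("imports are not allowed in policy code")
        if isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in ALLOWED_CALLS):
                raise PermissionError("policy code calls a non-whitelisted function")


def run_policy(source: str, api: dict) -> None:
    """Check the generated code, then execute it with only the API visible."""
    check_policy(source)
    exec(source, {"__builtins__": {}, **api})


# Stand-in for LLM output for "put both blocks in the bowl".
generated = """
for name in ["red block", "blue block"]:
    pose = detect(name)
    move_to(pose)
    grasp()
    move_to(detect("bowl"))
    release()
"""

log = []  # records what the stub controller was asked to do
api = {
    "detect": lambda name: (name, (0.0, 0.0)),
    "move_to": lambda pose: log.append(("move_to", pose[0])),
    "grasp": lambda: log.append(("grasp",)),
    "release": lambda: log.append(("release",)),
}
run_policy(generated, api)
```

The loop and object references live in the generated program, while feasibility and safety live in the API and the checker, which is why API design carries so much weight here.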
Multimodal Sentence Pattern
PaLM-E-style models inject continuous embodied observations into a language-model embedding space, creating “multimodal sentences” that interleave text, image encodings, and state estimates. The important shift is that robot observations are not merely side inputs to a policy head; they are projected into the same sequence interface as language tokens.
The interface looks like:
text tokens + visual tokens + state tokens -> textual plan or robot-relevant completion
This is a bridge between language-model reasoning and embodied state, but it is not by itself the dominant low-level continuous-control solution.
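One way to picture a multimodal sentence is as a template whose placeholder tokens are replaced by projected observation vectors, so that everything ends up as one sequence of same-width embeddings. This is a toy sketch: `embed_text`, the projection matrix `proj`, and the `<img>` / `<state>` placeholder names are assumptions for illustration, not PaLM-E's actual components.

```python
EMBED_DIM = 4  # assumed toy width of the shared token-embedding space


def embed_text(token: str) -> list[float]:
    """Hypothetical stand-in for a language-model token embedding lookup."""
    return [float(len(token)), float(ord(token[0]) % 7), 0.0, 0.0]


def project(features: list[float], matrix: list[list[float]]) -> list[float]:
    """Linear projection of a raw observation vector into the embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def build_multimodal_sentence(template: list[str], slots: dict) -> list[list[float]]:
    """Replace placeholder tokens with projected observation vectors, in order."""
    sequence = []
    for piece in template:
        if piece in slots:
            sequence.extend(slots[piece])  # one or more observation vectors
        else:
            sequence.append(embed_text(piece))
    return sequence


proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # arbitrary 2-d -> 4-d map
image_patches = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
robot_state = [0.0, 1.0]
slots = {
    "<img>": [project(p, proj) for p in image_patches],
    "<state>": [project(robot_state, proj)],
}
template = ["Given", "<img>", "and", "<state>", "what", "next?"]
sentence = build_multimodal_sentence(template, slots)  # 8 vectors of width 4
```

After this step the backbone no longer distinguishes where a vector came from, which is what "same sequence interface" means in practice.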
VLA Pattern: Text And Actions Share The Token Interface
RT-2 and OpenVLA-style models turn robot actions into text-like output tokens so a vision-language model can be fine-tuned for robot control. The text instruction remains an ordinary language input, image features are projected into the language-model token stream, and the model autoregressively emits action tokens.
This makes robot control look like conditional language modeling:
image tokens + instruction tokens -> action tokens
Open X-Embodiment records the RT-X version of this idea: RT-1-X and RT-2-X both take images plus text instructions and output tokenized end-effector actions, with RT-2-X using a VLM backbone and action-as-language representation.
The strength is semantic transfer from web-scale vision-language pretraining. The weakness is that discrete action tokens can be awkward for high-frequency, dexterous, continuous control.
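The discretization cost can be made concrete with a uniform binning sketch. The bin count of 256, the [-1, 1] action range, and both function names are assumptions chosen for illustration; real systems may use different ranges, bin counts, or learned tokenizers.

```python
NUM_BINS = 256  # assumed number of bins per action dimension


def action_to_tokens(action: list[float], low: float = -1.0, high: float = 1.0) -> list[int]:
    """Clip each dimension to [low, high] and round it to the nearest bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        scaled = (clipped - low) / (high - low) * (NUM_BINS - 1)
        tokens.append(int(scaled + 0.5))
    return tokens


def tokens_to_action(tokens: list[int], low: float = -1.0, high: float = 1.0) -> list[float]:
    """Map bin indices back to the continuous value each bin represents."""
    return [low + t / (NUM_BINS - 1) * (high - low) for t in tokens]


action = [-1.0, 0.37, 1.0, 2.5]  # last value is out of range and gets clipped
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
```

For in-range values the round trip is exact only up to half a bin width, here (high - low) / (2 * 255). That bound is the precision ceiling that makes discrete tokens awkward for fine, high-rate control.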
Diffusion And Flow Pattern: Text Conditions Continuous Action Experts
Newer robotics foundation models increasingly keep the semantic backbone but move low-level action generation into a diffusion or flow-matching action expert. Text remains part of the observation/context block, but actions are generated as continuous chunks rather than ordinary language tokens.
The interface becomes:
image tokens + instruction tokens + state tokens -> continuous action chunk
This pattern is visible in π0, RDT-1B, Octo, GR00T N1, π0.7, and related lines. The key modeling change is that text guides the action distribution, while the action decoder is optimized for physical control rather than next-token language modeling.
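A toy sketch of flow-style sampling for a conditioned action chunk: start from Gaussian noise and integrate a velocity field with Euler steps from t = 0 to t = 1. Everything here is an assumption for illustration. `TARGET_CHUNKS` is a hypothetical lookup standing in for the whole semantic backbone, and `toy_velocity` is a closed-form field that points at the target, where a real action expert would be a trained network conditioned on image, text, and state tokens.

```python
import random

# Hypothetical context -> target chunk lookup (3 time steps x 2 dims, flattened).
TARGET_CHUNKS = {
    "push block left": [-0.1, 0.0, -0.2, 0.0, -0.3, 0.0],
    "push block right": [0.1, 0.0, 0.2, 0.0, 0.3, 0.0],
}


def toy_velocity(x: list[float], t: float, instruction: str) -> list[float]:
    """Stand-in for a learned velocity network v(x, t | context).

    For a straight-line path toward a single target, the velocity at time t
    is (target - x) / (1 - t).
    """
    target = TARGET_CHUNKS[instruction]
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action_chunk(instruction: str, steps: int = 10, seed: int = 0) -> list[float]:
    """Euler-integrate the velocity field from noise (t=0) to an action chunk (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in TARGET_CHUNKS[instruction]]
    dt = 1.0 / steps
    for k in range(steps):
        v = toy_velocity(x, k * dt, instruction)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


chunk = sample_action_chunk("push block left")
```

The instruction never becomes a token in the output; it only shapes the field that carries noise to a continuous chunk, which is the sense in which text "guides the action distribution".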
The important caveat is that “text conditions continuous actions” does not always mean diffusion or flow. Helix uses a slow VLM semantic latent to condition a fast visuomotor Transformer trained for continuous control. Helix 02 extends the hierarchy with a learned whole-body controller below the 200 Hz visuomotor layer. These sources support a broader fast/slow pattern, not a single denoising-only recipe.
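The fast/slow pattern can be sketched as a control loop in which a slow model refreshes a semantic latent only every few ticks while a fast policy acts on every tick using the most recent latent. This is a synchronous simplification with made-up rates and hypothetical function names; deployed systems typically run the two layers asynchronously, and nothing here reflects Helix's actual architecture.

```python
def run_fast_slow(num_ticks, slow_every, slow_model, fast_policy, observe):
    """Run a fast control loop that reuses a slowly refreshed semantic latent."""
    latent = None
    actions = []
    for tick in range(num_ticks):
        obs = observe(tick)
        if tick % slow_every == 0:
            latent = slow_model(obs)  # slow semantic update (e.g. a VLM pass)
        actions.append(fast_policy(obs, latent))  # fast update on every tick
    return actions


slow_calls = []  # records when the slow model actually ran


def slow_model(obs):
    """Stand-in for a VLM that turns instruction + observation into a latent."""
    slow_calls.append(obs)
    return {"goal": 1.0}


def fast_policy(obs, latent):
    """Toy proportional controller steering the observation toward the latent goal."""
    return latent["goal"] - obs


actions = run_fast_slow(10, 4, slow_model, fast_policy, observe=lambda tick: tick / 10)
```

In this run the slow model fires on ticks 0, 4, and 8 only, while the fast policy produces all ten actions, so text influences control through a latent that is stale by up to three ticks.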
Hierarchical Reasoning Pattern
The newest trend is to separate high-level embodied reasoning from low-level action generation. A reasoning model consumes text, images, state, task constraints, and sometimes tool outputs; it then emits natural-language subtasks, progress assessments, or strategy prompts for a lower-level policy.
The Gemini Robotics-ER / Gemini Robotics split is the clearest public version of this pattern: the embodied reasoning model plans and emits natural-language step instructions, while the VLA/action model executes them. GR00T N1’s System 2 / System 1 split has the same broad shape: a vision-language module interprets the environment and instructions, then a diffusion/flow Transformer generates continuous actions. Helix and Helix 02 show a related split where the fast motor layer is a visuomotor Transformer/controller stack rather than a stated diffusion or flow model.
This turns text into a handoff protocol between model layers:
mission text -> plan text -> subtask text -> action expert
Steerable Prompt Pattern
π0.7-style prompting broadens text from “what to do” into “what strategy to use.” The prompt may include:
- task instruction;
- subtask instruction from a high-level policy;
- metadata such as quality, speed, or control mode;
- visual subgoal or world-model output;
- embodiment or control-modality hints.
This is important because it lets a single policy consume heterogeneous data: demonstrations, autonomous rollouts, failures, human videos, and curated specialist data can be made more usable when prompt metadata tells the model how to interpret the behavior.
π0.7 also turns live language coaching into reusable supervision: coaching episodes can train a high-level subtask policy, so text is both runtime context and a low-cost way to teach long-horizon autonomy without new low-level teleoperation.
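A hedged sketch of how such a prompt might be assembled from optional fields. The `SteerablePrompt` class, its field names, and the `key: value; ...` serialization are hypothetical; the source does not specify π0.7's actual prompt format, and a visual subgoal would normally be an image rather than the caption string used here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SteerablePrompt:
    """Hypothetical container for the prompt components listed above."""
    task: str
    subtask: Optional[str] = None
    metadata: dict = field(default_factory=dict)  # e.g. quality, speed, control mode
    subgoal: Optional[str] = None                 # caption standing in for a visual subgoal
    embodiment: Optional[str] = None

    def render(self) -> str:
        """Serialize only the fields that are present, in a fixed order."""
        parts = [f"task: {self.task}"]
        if self.subtask:
            parts.append(f"subtask: {self.subtask}")
        for key in sorted(self.metadata):
            parts.append(f"{key}: {self.metadata[key]}")
        if self.subgoal:
            parts.append(f"subgoal: {self.subgoal}")
        if self.embodiment:
            parts.append(f"embodiment: {self.embodiment}")
        return "; ".join(parts)


# The same task labeled two ways: a curated demonstration and a failed rollout.
demo = SteerablePrompt(task="clear the table", subtask="pick up the mug",
                       metadata={"quality": "high", "speed": "fast"})
failure = SteerablePrompt(task="clear the table",
                          metadata={"quality": "low", "mistake": "yes"})
```

Because both episodes share the task text and differ only in metadata, a policy trained on both can in principle be asked at runtime for the high-quality behavior by rendering the first kind of prompt.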
Local Robotics Text-Conditioning Corpus
| Pattern | Local Anchors | Text Role |
|---|---|---|
| Action-as-language VLA | RT-2, OpenVLA | Instruction tokens condition image observations; action tokens are emitted through the language-model interface |
| Continuous action expert | Octo, RDT-1B, π0, GR00T N1 | Text remains context while diffusion/flow action heads generate continuous control-input chunks |
| Better action tokenization | FAST | Text-conditioned autoregression is preserved, but action chunks are compressed into better discrete tokens |
| Embodied reasoning and thinking | Gemini Robotics 1.5 | Text is both user instruction and intermediate plan/thought/subtask handoff |
| Steerable metadata and subgoals | π0.7 | Prompt includes task, subtask, quality, speed, mistake, control mode, and visual-subgoal context |
| Humanoid fast/slow hierarchy | Helix, Helix 02 | Slow VLM semantic context conditions high-rate continuous motor control |
Action-Level Reasoning Pattern
Action Chain-of-Thought style work points to a limit of pure language reasoning: intermediate language subtasks may be too coarse for precise control. A newer direction is to express reasoning as coarse action intents, reference trajectories, or latent action priors that condition the final action head.
This suggests a likely convergence:
language reasoning for semantic structure
+ action-space reasoning for physical detail
+ continuous action decoder for execution
Interface Taxonomy
| Text Role | Input Or Output | Typical Consumer | Main Benefit | Failure Mode |
|---|---|---|---|---|
| Task instruction | Input | Policy or VLA | Human-friendly task specification | Ambiguous or underspecified commands |
| Skill label | Intermediate output | Low-level controller | Interpretable planning | Skill library bottleneck |
| Policy code | Output | Runtime/API layer | Explicit logic and geometry | API fragility and safety risks |
| Multimodal sentence text | Input/output | LLM/VLM backbone | Shared sequence interface for reasoning | Weak low-level control |
| Action token string | Output | Robot action decoder | Reuses language-model training machinery | Discretization and high-frequency control limits |
| Subtask instruction | Intermediate output | VLA/action expert | Long-horizon decomposition | Error propagation between planner and executor |
| Metadata prompt | Input | Generalist policy | Behavior steering across data quality, speed, or embodiment | Metadata can become inconsistent or underspecified |
| Visual subgoal caption/prompt | Input | Policy or world model | Bridges semantic goal and physical target | May omit contact/geometry details |
| Natural-language rationale | Output | Human/operator/debugger | Transparency and monitoring | Rationale can be unfaithful to action cause |
Practical Design Guidance
- Use plain instruction embeddings when the task set is narrow, the language is simple, and closed-loop behavior is short-horizon.
- Use planner-over-skills when safety, interpretability, or discrete skill reuse matters more than dexterity.
- Use code generation only when perception/control APIs are reliable and runtime verification is strong.
- Use VLA action tokens when semantic generalization is more important than high-frequency precision.
- Use diffusion or flow action experts when continuous dexterous control, multimodal action distributions, or action chunks matter.
- Use hierarchical reasoning when tasks require tools, long-horizon planning, progress estimation, or changing constraints.
- Use steerable metadata prompts when training data mixes high-quality demonstrations, failures, autonomous data, different control modes, or different robot embodiments.
Relation To Foundation TSFM Agenda
Robotics text conditioning is an analogy and interface source for the Foundation Time-Series Model Research Agenda. It clarifies how language can serve as context, handoff protocol, metadata, or selective explanation without being confused with actions or control inputs. For digital-world robots, the portable lesson is to keep operator language, system context, candidate interventions, and execution primitives as distinct interface fields.
External Anchors To Ingest Next
These sources should be ingested as full source pages if robotics text conditioning becomes a larger durable branch of the wiki:
- SayCan: Do As I Can, Not As I Say
- Code as Policies
- PaLM-E: An Embodied Multimodal Language Model
- RT-1: Robotics Transformer for Real-World Control at Scale
- ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Open Questions
- Which information should remain natural language, as in Gemini Robotics 1.5 thinking/subtask traces, and which should be converted into action-space, latent, or visual-subgoal representations?
- How can a policy verify that a natural-language plan is faithful to its actual action trajectory?
- Should metadata prompts such as quality, speed, control mode, and embodiment be human-authored, learned, or generated by a world model?
- How much language capability is lost when a VLM is fine-tuned only on robot action data?
- Can action-level reasoning replace language chain-of-thought for precise contact-rich manipulation?