Robotics Text Conditioning

Summary

Text in robotics has moved from a single language-instruction embedding to a multi-level control interface. It can now serve as task context, planner output, executable code, subtask handoff, action-token output space, metadata, strategy hint, visual-subgoal description, progress explanation, and safety/debugging channel.

The current best mental model is not one path from text to motors, but a stack:

human instruction
  -> embodied reasoning / planning text
  -> subtask, skill, metadata, or subgoal prompt
  -> vision-language-action policy or action expert
  -> continuous control inputs

In this wiki’s terminology, most text is context, not an action by itself. It becomes an action only when the system turns it into a selected skill, generated policy program, action token, control input, or subtask command passed to a controller.
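The context-versus-action rule above can be written down as a small classification. This is only a sketch of the wiki's terminology: the enum and its member names are hypothetical labels chosen for illustration, not an interface from any cited system.

```python
from enum import Enum


class SignalKind(Enum):
    """Hypothetical labels for text-derived signals in the stack above."""
    TASK_INSTRUCTION = "task_instruction"
    PLAN_TEXT = "plan_text"
    RATIONALE = "rationale"
    SELECTED_SKILL = "selected_skill"
    POLICY_PROGRAM = "policy_program"
    ACTION_TOKEN = "action_token"
    CONTROL_INPUT = "control_input"
    SUBTASK_COMMAND = "subtask_command"


# Per the rule above: text counts as an action only once the system has
# turned it into one of these forms. Everything else stays context.
ACTION_KINDS = frozenset({
    SignalKind.SELECTED_SKILL,
    SignalKind.POLICY_PROGRAM,
    SignalKind.ACTION_TOKEN,
    SignalKind.CONTROL_INPUT,
    SignalKind.SUBTASK_COMMAND,
})


def is_action(kind: SignalKind) -> bool:
    """Return True if the signal is an action rather than context."""
    return kind in ACTION_KINDS
```

Under this labeling, a task instruction or a rationale is context, while the same content passed to a controller as a subtask command is an action.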

World Model for Robot Learning Survey uses a broad predictive-control notation where language or task conditioning can specify desired futures. In this wiki, keep that signal as task context unless the system turns it into a selected skill, action token, or control input.

What The Wiki Currently Believes

Established Pattern: Instruction Embeddings

The older language-conditioned policy interface is simple: pair each robot trajectory with a natural-language instruction, encode the instruction, fuse it with visual and robot-state observations, and train a policy to predict actions. Open X-Embodiment is the local source anchor for this line because it standardizes robot trajectories with language instructions and trains RT-X models over image histories and discretized end-effector actions.

This pattern treats text as task context:

image_history + instruction_embedding -> next action or action token

It is useful for object selection, task identity, and simple semantic generalization, but it usually does not expose the model’s intermediate plan.
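A minimal sketch of this interface, assuming a toy hashed bag-of-words encoder and a single linear layer in place of learned networks. The function names, the embedding size, and the weights are all illustrative assumptions, not the RT-X architecture.

```python
import hashlib
from typing import List, Sequence

EMBED_DIM = 8  # assumed size, for illustration only


def embed_instruction(text: str, dim: int = EMBED_DIM) -> List[float]:
    """Toy stand-in for a learned text encoder: hashed bag of words."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def policy(image_features: Sequence[float], instruction: str,
           weights: Sequence[Sequence[float]]) -> List[float]:
    """Fuse observation and instruction by concatenation, then map to an action."""
    fused = list(image_features) + embed_instruction(instruction)
    return [sum(w * x for w, x in zip(row, fused)) for row in weights]


# Example: 3 image features + 8 instruction features -> 2-D action.
example_weights = [[0.1] * 11, [-0.1] * 11]
example_action = policy([0.2, 0.4, 0.6], "pick up the red block", example_weights)
```

Note that the instruction only enters as a fixed vector: nothing in this interface exposes an intermediate plan.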

Planner Pattern: Text Selects Skills

The SayCan-style pattern keeps the language model above the motor policy. A high-level instruction is decomposed into textual skills or subtasks, while robot-specific affordance or value functions decide which skill is feasible in the current state. This makes text a planning interface over a library of low-level behaviors.

The core shape is:

instruction + current state + skill list -> selected textual skill
selected skill -> low-level controller

This is still valuable when the robot’s action space is too brittle for end-to-end control, when safety gates are needed, or when developers want interpretable plans.
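The selection step can be sketched as combining two scores per skill: how useful the language model considers the skill for the instruction, and how feasible the affordance function considers it in the current state. The skill names and numbers below are made up for illustration; in a real system both scores would come from learned models.

```python
from typing import Dict, Tuple


def select_skill(llm_scores: Dict[str, float],
                 affordances: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the skill with the highest (language score x affordance) product."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative values: the LLM prefers picking up the sponge, but the robot
# is not near it yet, so the affordance score vetoes that choice.
llm_scores = {"pick up the sponge": 0.6, "go to the counter": 0.3, "done": 0.1}
affordances = {"pick up the sponge": 0.1, "go to the counter": 0.9, "done": 0.5}
best_skill, combined_scores = select_skill(llm_scores, affordances)
```

With these numbers the selected skill is "go to the counter": the language model alone would have chosen the sponge, but the affordance term makes the feasible step win.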

Program Pattern: Text Becomes Policy Code

Code-as-policies systems use an LLM to transform a natural-language command into executable policy code that calls perception and control APIs. This gives text a precise operational form: loops, conditionals, geometry, object references, and feedback can be written explicitly.

This pattern is strong for structured tabletop tasks, geometric instructions, or environments with reliable symbolic/perception APIs. Its weakness is that it depends on API design and runtime guardrails rather than learning the full policy end to end.
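To make the dependence on API design and guardrails concrete, here is a sketch with a simulated tabletop. The `TabletopAPI` class, its method names, and the generated program are hypothetical; the restricted `exec` namespace only illustrates the idea of a guardrail and is not a real security sandbox.

```python
from typing import Dict, List, Tuple

Position = Tuple[float, float]


class TabletopAPI:
    """Hypothetical perception/control API backed by a simulated tabletop."""

    def __init__(self, objects: Dict[str, Position]) -> None:
        self.objects = dict(objects)
        self.log: List[str] = []

    def get_object_names(self) -> List[str]:
        return sorted(self.objects)

    def put_first_on_second(self, first: str, second: str) -> None:
        self.objects[first] = self.objects[second]
        self.log.append(f"put {first} on {second}")


# What an LLM might return for "put every block on the tray" (illustrative).
GENERATED_POLICY = '''
for name in get_object_names():
    if name.endswith("block"):
        put_first_on_second(name, "tray")
'''


def run_policy(code: str, api: TabletopAPI) -> None:
    """Execute generated code with access to whitelisted API calls only."""
    namespace = {
        "__builtins__": {},  # crude guardrail: no open(), no __import__, etc.
        "get_object_names": api.get_object_names,
        "put_first_on_second": api.put_first_on_second,
    }
    exec(code, namespace)


api = TabletopAPI({"red block": (0.1, 0.2), "blue block": (0.4, 0.1),
                   "mug": (0.7, 0.7), "tray": (0.5, 0.5)})
run_policy(GENERATED_POLICY, api)
```

The loop and the condition live in the generated text, so the instruction's logic is inspectable before it runs; the cost is that the program can only do what the exposed API allows.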

Multimodal Sentence Pattern

PaLM-E-style models inject continuous embodied observations into a language-model embedding space, creating “multimodal sentences” that interleave text, image encodings, and state estimates. The important shift is that robot observations are not merely side inputs to a policy head; they are projected into the same sequence interface as language tokens.

The interface looks like:

text tokens + visual tokens + state tokens -> textual plan or robot-relevant completion

This is a bridge between language-model reasoning and embodied state, but it is not by itself the dominant low-level continuous-control solution.
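A sketch of how such a sequence could be assembled, assuming toy embeddings and fixed projection matrices in place of learned encoders. All names and dimensions here are illustrative and do not describe the PaLM-E implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def embed_word(word: str, dim: int) -> Vector:
    """Toy deterministic word embedding (stand-in for a learned table)."""
    seed = sum(ord(c) for c in word)
    return [((seed * (i + 1)) % 17) / 17.0 for i in range(dim)]


def project(features: Sequence[float],
            matrix: Sequence[Sequence[float]]) -> Vector:
    """Linear projection of an observation into the token embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in matrix]


def build_multimodal_sentence(
    segments: Sequence[Tuple[str, object]],
    projectors: Dict[str, Sequence[Sequence[float]]],
    dim: int,
) -> List[Tuple[str, Vector]]:
    """Interleave text tokens and projected observations in one sequence."""
    sequence: List[Tuple[str, Vector]] = []
    for kind, payload in segments:
        if kind == "text":
            for word in str(payload).split():
                sequence.append(("text", embed_word(word, dim)))
        else:
            sequence.append((kind, project(payload, projectors[kind])))
    return sequence


DIM = 4
projectors = {
    "image": [[0.1, 0.2, 0.3]] * DIM,  # 3 image features -> DIM
    "state": [[0.5, -0.5]] * DIM,      # 2 state features -> DIM
}
sentence = build_multimodal_sentence(
    [("text", "pick up"), ("image", [0.2, 0.5, 0.1]),
     ("text", "near"), ("state", [0.3, -0.1])],
    projectors, DIM,
)
```

Every element ends up as a vector of the same width, which is what lets a language-model backbone consume observations in the same positions it would otherwise read word tokens.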

VLA Pattern: Text And Actions Share The Token Interface

RT-2 and OpenVLA-style models turn robot actions into text-like output tokens so a vision-language model can be fine-tuned for robot control. The text instruction remains an ordinary language input, image features are projected into the language-model token stream, and the model autoregressively emits action tokens.

This makes robot control look like conditional language modeling:

image tokens + instruction tokens -> action tokens

Open X-Embodiment records the RT-X version of this idea: RT-1-X and RT-2-X both take images plus text instructions and output tokenized end-effector actions, with RT-2-X using a VLM backbone and action-as-language representation.

The strength is semantic transfer from web-scale vision-language pretraining. The weakness is that discrete action tokens can be awkward for high-frequency, dexterous, continuous control.
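The discretization cost can be seen in a small round trip. The sketch below uses uniform binning with 256 bins per action dimension as an illustrative default; the bounds and function names are assumptions, and real systems map bin indices onto tokens in the model vocabulary.

```python
from typing import List, Sequence


def action_to_tokens(action: Sequence[float], low: Sequence[float],
                     high: Sequence[float], n_bins: int = 256) -> List[int]:
    """Clip each dimension to its bounds and map it to a bin index."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)
        frac = (a - lo) / (hi - lo)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens: Sequence[int], low: Sequence[float],
                     high: Sequence[float], n_bins: int = 256) -> List[float]:
    """Decode each bin index to the center of its bin."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


low, high = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
original = [0.0, -1.0, 1.0]
tokens = action_to_tokens(original, low, high)
decoded = tokens_to_action(tokens, low, high)
```

The round-trip error per dimension is bounded by half a bin width, here (1 - (-1)) / (2 * 256). That bound is fixed by the bin count, which is one reason fine, high-rate control can be awkward in this representation.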

Diffusion And Flow Pattern: Text Conditions Continuous Action Experts

Newer robotics foundation models increasingly keep the semantic backbone but move low-level action generation into a diffusion or flow-matching action expert. Text remains part of the observation/context block, but actions are generated as continuous chunks rather than ordinary language tokens.

The interface becomes:

image tokens + instruction tokens + state tokens -> continuous action chunk

This pattern is visible in π0, RDT-1B, Octo, GR00T N1, π0.7, and related lines. The key modeling change is that text guides the action distribution, while the action decoder is optimized for physical control rather than next-token language modeling.
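As a rough illustration of flow-style decoding, the sketch below integrates a velocity field from Gaussian noise to an action chunk with Euler steps. The velocity field is a closed-form toy that pulls toward one target per instruction; in the systems named above it would be a learned network conditioned on image, text, and state features, and would represent a full distribution rather than a single target. The instruction table and all names are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical stand-in for the semantic backbone: instruction -> target action.
INSTRUCTION_TARGETS: Dict[str, List[float]] = {
    "push left": [-1.0, 0.0],
    "push right": [1.0, 0.0],
}


def toy_velocity(chunk: List[float], t: float, instruction: str) -> List[float]:
    """Velocity of a straight-line path from the current point to the target."""
    target = INSTRUCTION_TARGETS[instruction]
    dim = len(target)
    return [(target[i % dim] - a) / (1.0 - t) for i, a in enumerate(chunk)]


def sample_action_chunk(instruction: str, horizon: int = 4, action_dim: int = 2,
                        steps: int = 8, seed: int = 0) -> List[List[float]]:
    """Euler-integrate from noise (t=0) toward an action chunk (t=1)."""
    rng = random.Random(seed)
    chunk = [rng.gauss(0.0, 1.0) for _ in range(horizon * action_dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1.0, so the division above is safe
        velocity = toy_velocity(chunk, t, instruction)
        chunk = [a + dt * v for a, v in zip(chunk, velocity)]
    return [chunk[i:i + action_dim] for i in range(0, len(chunk), action_dim)]


left_chunk = sample_action_chunk("push left")
right_chunk = sample_action_chunk("push right")
```

The text never becomes an output token here: it only selects which velocity field is integrated, and the result is a continuous chunk of `horizon` actions.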

The important caveat is that “text conditions continuous actions” does not always mean diffusion or flow. Helix uses a slow VLM semantic latent to condition a fast visuomotor Transformer trained for continuous control. Helix 02 extends the hierarchy with a learned whole-body controller below the 200 Hz visuomotor layer. These sources support a broader fast/slow pattern, not a single denoising-only recipe.
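The fast/slow pattern can be sketched as two loops sharing a latent: a slow semantic module refreshes the latent occasionally, and a fast controller reads the most recent latent on every tick. The 200 Hz fast rate follows the text above; the 8 Hz slow rate and the stub functions are assumptions for illustration, and real systems typically run the two loops asynchronously rather than in one synchronous loop.

```python
from typing import Any, Callable, List, Tuple


def run_fast_slow(n_fast_ticks: int, fast_hz: int, slow_hz: int,
                  slow_fn: Callable[[int], Any],
                  fast_fn: Callable[[Any, int], Any]) -> Tuple[List[Any], int]:
    """Run a fast loop that reuses the latest output of a slower loop."""
    if fast_hz % slow_hz != 0:
        raise ValueError("this sketch assumes fast_hz is a multiple of slow_hz")
    period = fast_hz // slow_hz
    latent = None
    slow_calls = 0
    outputs = []
    for tick in range(n_fast_ticks):
        if tick % period == 0:  # time for a slow semantic update
            latent = slow_fn(tick)
            slow_calls += 1
        outputs.append(fast_fn(latent, tick))
    return outputs, slow_calls


# Stubs: the "VLM" emits a labeled latent, the "motor layer" pairs it with the tick.
outputs, slow_calls = run_fast_slow(
    n_fast_ticks=200, fast_hz=200, slow_hz=8,
    slow_fn=lambda tick: f"latent@{tick}",
    fast_fn=lambda latent, tick: (tick, latent),
)
```

In one simulated second the fast layer produces 200 outputs while the slow layer is queried 8 times, so most motor commands are conditioned on a slightly stale semantic latent.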

Hierarchical Reasoning Pattern

The newest trend is to separate high-level embodied reasoning from low-level action generation. A reasoning model consumes text, images, state, task constraints, and sometimes tool outputs; it then emits natural-language subtasks, progress assessments, or strategy prompts for a lower-level policy.

The Gemini Robotics-ER / Gemini Robotics split is the clearest public version of this pattern: the embodied reasoning model plans and emits natural-language step instructions, while the VLA/action model executes them. GR00T N1’s System 2 / System 1 split has the same broad shape: a vision-language module interprets the environment and instructions, then a diffusion/flow Transformer generates continuous actions. Helix and Helix 02 show a related split where the fast motor layer is a visuomotor Transformer/controller stack rather than a stated diffusion or flow model.

This turns text into a handoff protocol between model layers:

mission text -> plan text -> subtask text -> action expert
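A sketch of that handoff with stub components: a planner turns mission text into subtask strings, and an executor is called once per subtask with a bounded number of retries. The planner table, the flaky executor, and the retry policy are hypothetical; they stand in for a reasoning model and a VLA/action expert.

```python
from typing import Callable, Iterable, List, Tuple

Trace = List[Tuple[str, int, bool]]


def run_mission(mission: str,
                plan_fn: Callable[[str], List[str]],
                execute_fn: Callable[[str], bool],
                max_retries: int = 1) -> Tuple[bool, Trace]:
    """Pass each planned subtask to the executor, retrying on failure."""
    trace: Trace = []
    for subtask in plan_fn(mission):
        for attempt in range(max_retries + 1):
            ok = execute_fn(subtask)
            trace.append((subtask, attempt, ok))
            if ok:
                break
        else:  # every attempt failed: stop and report
            return False, trace
    return True, trace


def stub_planner(mission: str) -> List[str]:
    """Hypothetical planner output for a single known mission."""
    plans = {"put the cup away": ["open the drawer", "pick up the cup",
                                  "place the cup in the drawer"]}
    return plans[mission]


def make_flaky_executor(fail_once: Iterable[str]) -> Callable[[str], bool]:
    """Executor stub that fails the first attempt of the listed subtasks."""
    remaining = set(fail_once)

    def execute(subtask: str) -> bool:
        if subtask in remaining:
            remaining.discard(subtask)
            return False
        return True

    return execute


success, trace = run_mission("put the cup away", stub_planner,
                             make_flaky_executor(["pick up the cup"]))
```

Because the handoff is plain text, the trace doubles as a readable log of which subtask failed and when it was retried.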

Steerable Prompt Pattern

π0.7-style prompting broadens text from “what to do” into “what strategy to use.” The prompt may include:

  • task instruction;
  • subtask instruction from a high-level policy;
  • metadata such as quality, speed, or control mode;
  • visual subgoal or world-model output;
  • embodiment or control-modality hints.

This is important because it lets a single policy consume heterogeneous data: demonstrations, autonomous rollouts, failures, human videos, and curated specialist data can be made more usable when prompt metadata tells the model how to interpret the behavior.

π0.7 also turns live language coaching into reusable supervision: coaching episodes can train a high-level subtask policy, so text is both runtime context and a low-cost way to teach long-horizon autonomy without new low-level teleoperation.
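One way to picture such a prompt is as a record with optional steering fields that is serialized into a single string. The field names and the "key: value" format below are assumptions made for illustration; the actual π0.7 prompt format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SteerablePrompt:
    """Hypothetical prompt record: task text plus optional steering fields."""
    task: str
    subtask: Optional[str] = None
    quality: Optional[str] = None
    speed: Optional[str] = None
    control_mode: Optional[str] = None
    embodiment: Optional[str] = None
    subgoal_caption: Optional[str] = None

    def render(self) -> str:
        """Serialize the fields that are set, in a fixed order."""
        parts = [f"task: {self.task}"]
        for name in ("subtask", "quality", "speed", "control_mode",
                     "embodiment", "subgoal_caption"):
            value = getattr(self, name)
            if value is not None:
                parts.append(f"{name}: {value}")
        return "; ".join(parts)


demo_prompt = SteerablePrompt(task="fold the towel", quality="high", speed="fast")
rollout_prompt = SteerablePrompt(task="fold the towel", quality="low",
                                 control_mode="joint")
```

Two episodes of the same task can then carry different metadata, for example a curated demonstration and a lower-quality autonomous rollout, so the policy can learn from both without treating them as equally good behavior.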

Local Robotics Text-Conditioning Corpus

Pattern | Local Anchors | Text Role
Action-as-language VLA | RT-2, OpenVLA | Instruction tokens condition image observations; action tokens are emitted through the language-model interface
Continuous action expert | Octo, RDT-1B, π0, GR00T N1 | Text remains context while diffusion/flow action heads generate continuous control-input chunks
Better action tokenization | FAST | Text-conditioned autoregression is preserved, but action chunks are compressed into better discrete tokens
Embodied reasoning and thinking | Gemini Robotics 1.5 | Text is both user instruction and intermediate plan/thought/subtask handoff
Steerable metadata and subgoals | π0.7 | Prompt includes task, subtask, quality, speed, mistake, control mode, and visual-subgoal context
Humanoid fast/slow hierarchy | Helix, Helix 02 | Slow VLM semantic context conditions high-rate continuous motor control

Action-Level Reasoning Pattern

Action Chain-of-Thought-style work points to a limit of pure language reasoning: intermediate language subtasks may be too coarse for precise control. A newer direction is to express reasoning as coarse action intents, reference trajectories, or latent action priors that condition the final action head.

This suggests a likely convergence:

language reasoning for semantic structure
+ action-space reasoning for physical detail
+ continuous action decoder for execution
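A toy illustration of the middle layer: a coarse reference trajectory is densified and then adjusted by a residual from the final action head. Linear interpolation and the constant residual are stand-ins chosen for simplicity; they are assumptions, not the method of any cited work.

```python
from typing import Callable, List, Sequence


def densify(waypoints: Sequence[float], points_per_segment: int) -> List[float]:
    """Linearly interpolate a coarse 1-D reference trajectory."""
    if not waypoints:
        return []
    dense = []
    for start, end in zip(waypoints, waypoints[1:]):
        for i in range(points_per_segment):
            dense.append(start + (end - start) * i / points_per_segment)
    dense.append(waypoints[-1])
    return dense


def decode_actions(reference: Sequence[float],
                   residual_fn: Callable[[int, float], float]) -> List[float]:
    """Final actions = reference trajectory + a correction from the action head."""
    return [r + residual_fn(k, r) for k, r in enumerate(reference)]


coarse_intent = [0.0, 1.0, 0.0]          # e.g. "reach out, then retract"
reference = densify(coarse_intent, 2)    # action-space reasoning output
actions = decode_actions(reference, lambda k, r: 0.01)  # stub action head
```

The coarse intent carries structure that a language subtask would leave implicit (how far, in what order), while the residual leaves room for the decoder to handle physical detail.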

Interface Taxonomy

Text Role | Input Or Output | Typical Consumer | Main Benefit | Failure Mode
Task instruction | Input | Policy or VLA | Human-friendly task specification | Ambiguous or underspecified commands
Skill label | Intermediate output | Low-level controller | Interpretable planning | Skill library bottleneck
Policy code | Output | Runtime/API layer | Explicit logic and geometry | API fragility and safety risks
Multimodal sentence text | Input/output | LLM/VLM backbone | Shared sequence interface for reasoning | Weak low-level control
Action token string | Output | Robot action decoder | Reuses language-model training machinery | Discretization and high-frequency control limits
Subtask instruction | Intermediate output | VLA/action expert | Long-horizon decomposition | Error propagation between planner and executor
Metadata prompt | Input | Generalist policy | Behavior steering across data quality, speed, or embodiment | Metadata can become inconsistent or underspecified
Visual subgoal caption/prompt | Input | Policy or world model | Bridges semantic goal and physical target | May omit contact/geometry details
Natural-language rationale | Output | Human/operator/debugger | Transparency and monitoring | Rationale can be unfaithful to action cause

Practical Design Guidance

  • Use plain instruction embeddings when the task set is narrow, the language is simple, and closed-loop behavior is short-horizon.
  • Use planner-over-skills when safety, interpretability, or discrete skill reuse matters more than dexterity.
  • Use code generation only when perception/control APIs are reliable and runtime verification is strong.
  • Use VLA action tokens when semantic generalization is more important than high-frequency precision.
  • Use diffusion or flow action experts when continuous dexterous control, multimodal action distributions, or action chunks matter.
  • Use hierarchical reasoning when tasks require tools, long-horizon planning, progress estimation, or changing constraints.
  • Use steerable metadata prompts when training data mixes high-quality demonstrations, failures, autonomous data, different control modes, or different robot embodiments.

Relation To Foundation TSFM Agenda

Robotics text conditioning is an analogy and interface source for the Foundation Time-Series Model Research Agenda. It clarifies how language can serve as context, handoff protocol, metadata, or selective explanation without being confused with actions or control inputs. For digital-world robots, the portable lesson is to keep operator language, system context, candidate interventions, and execution primitives as distinct interface fields.

External Anchors To Ingest Next

These sources should be ingested as full source pages if robotics text conditioning becomes a larger durable branch of the wiki:

Open Questions

  • Which information should remain natural language, as in Gemini Robotics 1.5 thinking/subtask traces, and which should be converted into action-space, latent, or visual-subgoal representations?
  • How can a policy verify that a natural-language plan is faithful to its actual action trajectory?
  • Should metadata prompts such as quality, speed, control mode, and embodiment be human-authored, learned, or generated by a world model?
  • How much language capability is lost when a VLM is fine-tuned only on robot action data?
  • Can action-level reasoning replace language chain-of-thought for precise contact-rich manipulation?