Robotics Time-Series Modeling
Summary
Robotics foundation models usually do not treat sensor data as a plain forecasting table. They model action-conditioned multimodal trajectories: recent observations, language or task context, proprioceptive state when available, and actions or control inputs over a short control horizon. The time-series object is a trajectory, not a standalone sequence of scalar measurements.
The dominant interface is therefore:
context, observation_history, optional action_history -> action_chunk
or, for world models:
context, observation_history, action_sequence -> future_observation_or_latent_trajectory
On whether the fast action interface is diffusion- or flow-based, the current model-ingest answer is: mostly yes for high-dexterity continuous action generation, but not universally. Many recent generalist robot policies use a diffusion or flow-matching action expert for the fast motor interface: Diffusion Policy, Octo, RDT-1B, π0, GR00T N1, and π0.7. But RT-2, OpenVLA, and FAST keep an autoregressive action-token path; ACT uses Transformer/CVAE action chunks; Helix uses a fast visuomotor Transformer with continuous-action regression; and Helix 02 states a Transformer S1 plus learned S0 controller without claiming diffusion or flow.
So the precise rule is: modern robotics has moved away from naive next-token VLM decoding as the only motor interface, toward short-horizon chunked continuous control. Diffusion and flow matching are the dominant public recipe for that continuous action expert, but Transformers remain central as VLM backbones, denoising/flow backbones, action-token models, and fast visuomotor policies.
This page uses Terminology: robot commands are actions or numeric control inputs; camera, proprioception, force, tactile, and IMU samples are observations; task text and embodiment metadata are context; uncontrolled scene changes are events or exogenous variables.
What The Wiki Currently Believes
Robotics Data Is Trajectory Data
The closest local sources are embodied trajectory datasets and robotic world-model papers. Open X-Embodiment standardizes robot-learning datasets across embodiments and trains RT-X policies over image history, language instructions, and discretized end-effector actions. DROID adds in-the-wild manipulation episodes with synchronized cameras, calibration, depth, and language. BridgeData V2 is a frequent substrate for language-conditioned manipulation policies and action-conditioned world-model evaluation. RoboTurk is earlier teleoperation context: the raw signal is still a time-ordered demonstration with human-generated control inputs.
This differs from most Time-Series Foundation Models. A forecasting model predicts future observations from history; a robot policy chooses a control input; an action-conditioned world model predicts how future observations change under candidate actions.
Common Encoding Pattern
Robotics models tend to encode each timestep as a multimodal bundle:
| Component | Usual Encoding | Time-Series Meaning |
|---|---|---|
| Image/video | CNN, ViT, VLM, or latent encoder tokens per frame | High-dimensional observation of partial state |
| Proprioception | Normalized numeric features projected to tokens or fused into the action head | Observed robot state, not the whole environment state |
| Force, tactile, IMU | High-frequency numeric streams, often downsampled, windowed, or encoded by a small temporal encoder | Contact and motion observations; usually more time-sensitive than RGB |
| Language instruction | Text tokens or a frozen text/VLM embedding | Task context, not a time series unless it changes over the episode |
| Embodiment/control metadata | Robot ID, control mode, action-space metadata, control frequency | Static or slowly changing context that disambiguates action semantics |
| Actions/control inputs | Discrete bins, text-like action tokens, continuous chunks, diffusion/flow trajectories, or frequency-space tokens | Controllable inputs used for policy output or world-model conditioning |
The dominant VLA recipe is late fusion: images, language, and robot-state vectors are encoded by modality-specific frontends, projected to a shared token width, and then concatenated into one Transformer sequence. Proprioception usually enters as one or more continuous state tokens after an MLP or linear projection rather than through scalar binning.
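The late-fusion recipe above can be sketched in a few lines. This is a minimal illustration, not any specific model's code: the token width, feature dimensions, and random projection matrices are all hypothetical stand-ins for learned frontends.

```python
import random

WIDTH = 8  # hypothetical shared token width

def make_projection(out_dim, in_dim, rng):
    # Random stand-in for a learned linear projection (out_dim x in_dim).
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(vec, weights):
    # Matrix-vector product: map one feature vector to one token of width len(weights).
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

rng = random.Random(0)

# Hypothetical frontend outputs (dimensions are illustrative only).
image_feats = [[rng.random() for _ in range(16)] for _ in range(4)]  # 4 patch features
text_feats = [[rng.random() for _ in range(12)] for _ in range(3)]   # 3 text features
proprio_state = [rng.random() for _ in range(7)]                     # one 7-D state vector

w_image = make_projection(WIDTH, 16, rng)
w_text = make_projection(WIDTH, 12, rng)
w_state = make_projection(WIDTH, 7, rng)

# Late fusion: project each modality separately, then concatenate along the sequence axis.
tokens = (
    [project(f, w_image) for f in image_feats]
    + [project(f, w_text) for f in text_feats]
    + [project(proprio_state, w_state)]  # proprioception enters as one continuous token
)
modality = ["image"] * 4 + ["text"] * 3 + ["state"]
```

The `modality` list is kept alongside the tokens because downstream masking or positional schemes usually need to know which block each token came from.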
Canonical Action/State Interfaces
The cross-embodiment problem is mostly an interface problem. A policy trained across robots should not require every embodiment to expose the same raw joints, actuator commands, controller gains, camera rig, or control frequency. It needs a more stable contract between the general policy and the robot-specific mechanics.
Open X-Embodiment coarsely aligns heterogeneous robots by choosing a canonical camera stream and mapping controls into a normalized end-effector action representation, but this loses embodiment-specific details. Newer physical-intelligence-style models increasingly add metadata for control mode, speed, quality, visual subgoals, or embodiment so the same model can interpret the same numeric action representation differently.
The common pieces are:
- End-effector delta pose: a relative command for the tool, gripper, hand, or suction cup, usually dx, dy, dz plus an orientation delta. This is more portable than joint commands because many different arms can move a tool through the same local workspace displacement.
- Gripper action: an open/close, width, force, or suction command. The physical gripper may differ, but the task-level primitive “release or grasp” is often shared.
- Normalized proprioception: robot-state observations such as joint positions, joint velocities, end-effector pose, gripper state, force/torque, base motion, or IMU values projected into normalized numeric tokens. These tokens give the model body awareness without forcing all robots to share the same raw encoder units.
- Action chunks: short future sequences of control inputs. They let the policy express a local trajectory rather than a single twitch, then replan under fresh observations.
- Embodiment and control metadata: robot ID, control mode, action-space shape, control frequency, action horizon, speed or quality tags, and capability masks. These fields tell the model how to interpret an otherwise similar numeric action vector.
- Robot-specific adapter or controller: inverse kinematics, motion planning, trajectory tracking, safety filters, learned whole-body controllers, or actuator-level control loops that map the canonical command to actual joints and motors.
The resulting stack is:
observations + task context + embodiment metadata
-> shared policy or world model
-> canonical action/state interface
-> robot-specific adapter or controller
-> joints, actuators, and safety limits
This interface is not magic. End-effector actions do not preserve every detail of a dexterous hand, humanoid balance controller, contact-rich insertion task, or force-sensitive manipulation. Cross-embodiment normalization is therefore a lossy modeling decision. Agents SHOULD record what was normalized, discretized, resampled, masked, or dropped when converting robot datasets.
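One way to make the "record what was lost" rule concrete is to have the conversion function write to an explicit log. The sketch below is an assumption-heavy illustration: `CanonicalAction`, `ConversionRecord`, `to_canonical`, the limit values, and the robot ID are hypothetical names and numbers, not a schema from Open X-Embodiment or any other dataset.

```python
from dataclasses import dataclass, field

@dataclass
class CanonicalAction:
    delta_xyz: tuple  # end-effector translation, scaled to [-1, 1]
    delta_rpy: tuple  # end-effector orientation delta, scaled to [-1, 1]
    gripper: float    # 0.0 = fully open, 1.0 = fully closed

@dataclass
class ConversionRecord:
    robot_id: str
    notes: list = field(default_factory=list)

def to_canonical(raw_xyz_m, raw_rpy_rad, gripper_width_m, limits, record):
    def scale(value, bound, name):
        # Normalize by a per-robot bound and log any clipping as a lossy step.
        scaled = value / bound
        if abs(scaled) > 1.0:
            record.notes.append(f"clipped {name}")
            scaled = max(-1.0, min(1.0, scaled))
        return scaled

    xyz = tuple(scale(v, limits["max_step_m"], f"xyz[{i}]") for i, v in enumerate(raw_xyz_m))
    rpy = tuple(scale(v, limits["max_rot_rad"], f"rpy[{i}]") for i, v in enumerate(raw_rpy_rad))
    openness = max(0.0, min(1.0, gripper_width_m / limits["max_width_m"]))
    record.notes.append("gripper width reduced to a scalar in [0, 1]")
    return CanonicalAction(xyz, rpy, 1.0 - openness)

# Hypothetical per-robot limits and one raw command.
limits = {"max_step_m": 0.05, "max_rot_rad": 0.25, "max_width_m": 0.08}
record = ConversionRecord(robot_id="example_arm")
action = to_canonical((0.01, -0.10, 0.0), (0.0, 0.0, 0.125), 0.04, limits, record)
```

Here the y translation exceeds the per-step bound, so it is clipped and the clip is written to `record.notes`; a dataset converter could store that record next to each episode.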
Time Handling
Robotics time is mostly control time, not calendar time. The model sees a finite history window and outputs either the next control input, a short action chunk, or a future latent/video trajectory. The key design choices are:
- History length: RT-X-style policies use a short image history; world models condition on a finite visual-action history; diffusion and flow policies condition on recent observations before generating action chunks.
- Action horizon: ACT, diffusion policies, RDT-like models, and π0-like flow policies predict multiple future control inputs at once, then execute part of the chunk before replanning.
- Receding horizon: diffusion and flow policies usually replan repeatedly rather than executing an entire long rollout open-loop.
- Temporal smoothing: ACT-style temporal ensembling averages overlapping action chunks to reduce jitter and compounding errors.
- Frame rate and control frequency: cross-robot data may use different control rates; models either normalize actions into a shared representation, include control-frequency metadata, or rely on the dataset conversion layer.
- Generative timestep: diffusion and flow policies have a second notion of time: the denoising or flow timestep. This is usually encoded explicitly, often with sinusoidal features, and is separate from physical control time.
- Elapsed time versus token index: when sampling is irregular, token position is not enough. This is the same failure mode that Kairos addresses for time-series forecasting with mixed patch sizes and physical-time calibration.
The practical rule is: keep delta_t, control frequency, frame subsampling, and action horizon explicit whenever data crosses robots, sensors, or datasets.
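The temporal-smoothing item in the list above can be illustrated with a small sketch of averaging overlapping chunks. It assumes scalar actions and an exponential weighting `exp(-m * i)` with index 0 for the oldest prediction, which is one possible scheme; the function name, the `chunks` layout, and the value of `m` are hypothetical.

```python
import math

def temporal_ensemble(chunks, t, m=0.1):
    """Average every prediction made for control step t.

    chunks maps the step at which a chunk was predicted to its list of
    scalar actions; chunks[s][k] is the action intended for step s + k.
    """
    preds = []
    for start in sorted(chunks):  # oldest prediction first
        offset = t - start
        if 0 <= offset < len(chunks[start]):
            preds.append(chunks[start][offset])
    if not preds:
        raise ValueError(f"no chunk covers step {t}")
    # Assumed weighting: index 0 (oldest) gets the largest weight.
    weights = [math.exp(-m * i) for i in range(len(preds))]
    return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

# Three overlapping chunks predicted at steps 0, 1, and 2.
chunks = {
    0: [0.0, 0.1, 0.2, 0.3],
    1: [0.2, 0.3, 0.4, 0.5],
    2: [0.1, 0.2, 0.3, 0.4],
}
uniform = temporal_ensemble(chunks, t=2, m=0.0)   # plain mean of 0.2, 0.3, 0.1
weighted = temporal_ensemble(chunks, t=2, m=0.5)  # leans toward the oldest prediction
```

Setting `m=0.0` reduces the scheme to a plain mean; larger `m` trusts older predictions more and reacts more slowly to fresh observations.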
Action Encoding
There are four main action interfaces in the local robotics corpus:
- Discretized action tokens. RT-2 and OpenVLA bucket action dimensions and train with categorical next-token losses, often through a VLM backbone.
- Continuous action chunks without denoising. ACT predicts continuous action chunks with a Transformer/CVAE-style generative model, while Helix reports a fast visuomotor Transformer trained with regression for continuous upper-body control.
- Diffusion or flow action chunks. Diffusion Policy, Octo, RDT-1B, π0, GR00T N1, and π0.7 use diffusion or flow matching over short future control-input trajectories.
- Compressed action tokens. FAST moves an action chunk into frequency space before discrete tokenization, making high-frequency continuous controls more compatible with autoregressive VLA training.
For this wiki, all four are representations of control inputs. They should not be confused with passive events or exogenous variables.
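To make the frequency-space idea concrete, here is a minimal sketch that transforms one action chunk with an orthonormal DCT-II and rounds the coefficients to integers. It illustrates only the transform-then-quantize step; it is not the FAST tokenizer, and the chunk, the `scale` value, and the function names are hypothetical.

```python
import math

def dct(x):
    # Orthonormal DCT-II of a 1-D sequence.
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct(c):
    # Inverse of the orthonormal DCT-II above.
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def tokenize(chunk, scale=10.0):
    # Round scaled frequency coefficients to integers (the discrete tokens).
    return [round(c * scale) for c in dct(chunk)]

def detokenize(tokens, scale=10.0):
    return idct([t / scale for t in tokens])

# A smooth 16-step, single-dimension action chunk (illustrative data).
chunk = [0.5 * math.sin(math.pi * i / 15) for i in range(16)]
tokens = tokenize(chunk)
recon = detokenize(tokens)
nonzero = sum(1 for t in tokens if t != 0)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
```

For smooth chunks most high-frequency coefficients round to zero, so the integer sequence is sparse; `scale` trades reconstruction error against the number of distinct token values.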
Attention And Sequence Mixers
Robotics models reuse Transformer components, but the attention pattern is shaped by the embodied interface:
| Pattern | Typical Use | Attention / Mixer |
|---|---|---|
| Decoder-only policy Transformer | RT-1/RT-X-style policies over image-history and language tokens | Causal or autoregressive self-attention over compressed observation/context tokens before action-token prediction |
| VLM-as-policy | RT-2/OpenVLA-style policies | Pretrained vision-language attention reused; action tokens become part of the output vocabulary |
| Action Chunking Transformer | Bimanual imitation learning and low-data dexterous tasks | Transformer over observations and latent action sequence, with temporal ensembling at inference |
| Diffusion Transformer policy | Diffusion Policy, Octo, and RDT-style policies | Denoising network over action trajectories, conditioned on visual/language/proprioceptive observations by cross-attention or token fusion |
| Flow-matching action expert | π0, π0.7, and GR00T-style policies | VLM backbone supplies semantic context; a separate continuous action expert generates fluent control trajectories |
| Fast visuomotor Transformer | Helix-style humanoid policies | Slow VLM latents condition a high-rate Transformer policy that outputs continuous control inputs directly |
| Whole-body controller hierarchy | Helix 02-style humanoid stacks | Semantic VLA layer supplies targets to a faster learned body controller for balance, contact, and actuator-level execution |
| Blockwise multimodal policy | π0-style physical-intelligence policies | Separate blocks for image/language context, robot state, and future action tokens; causal masking prevents later action tokens from leaking into observation/state processing |
| Action-conditioned latent world model | Robotic video/latent rollout for planning and policy evaluation | DiT-style transition model; Reconstruction Or Semantics? uses factorized spatial attention within frames and causal temporal attention across frames; RAEv2 tests multi-layer RAE latents for action-conditioned navigation rollouts |
| Recurrent or SSM mixer | Low-latency control, long sensor histories, or deployment-constrained loops | Mamba-2-style state-space mixers are attractive when recurrent inference matters more than full attention over long histories |
The attention axis is not just “global versus local.” Robotics often needs structured mixing across space, time, sensor channel, language context, and action horizon. Factorized attention is common because full attention over all video patches, timesteps, views, and action tokens becomes expensive quickly.
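A quick count shows why factorization pays off. The sketch below compares query-key pair counts for full attention against spatial attention within each image plus temporal attention across frames; the frame, view, and patch counts are hypothetical, and action or language tokens are ignored.

```python
def full_attention_pairs(frames, views, patches):
    # Every visual token attends to every other visual token.
    n = frames * views * patches
    return n * n

def factorized_attention_pairs(frames, views, patches):
    # Spatial: patches attend within their own image (one image per frame and view).
    spatial = frames * views * patches * patches
    # Temporal: each (view, patch) position attends across frames only.
    temporal = views * patches * frames * frames
    return spatial + temporal

frames, views, patches = 8, 2, 256  # hypothetical history and camera setup
full = full_attention_pairs(frames, views, patches)
factorized = factorized_attention_pairs(frames, views, patches)
ratio = full / factorized
```

With these numbers, full attention scores 16,777,216 pairs and the factorized pattern scores 1,081,344, roughly a 15x reduction. The gap widens as history length or patch count grows.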
Latent World Models
World Model for Robot Learning Survey gives the current high-level map for this section. It separates world models for policies, learned simulators/evaluators, and robotic video generation, and it narrows the success criterion from visually plausible futures to action-conditioned consistency, physical executability, and downstream task or policy utility.
Reconstruction Or Semantics? is the strongest local anchor for robotic world models. It frames action-conditioned video world models as predictors of future observations from observation and action histories, but argues that the latent space matters more than pixel fidelity alone. Semantic latents can preserve action-relevant object state, task progress, and controllability better than reconstruction latents.
Genie adds a different robotics boundary case: it trains a smaller model on action-free robot videos from RT-1-style data and shows that a latent action model can recover consistent action-like codes even when ground-truth robot actions are ignored. Treat this as evidence for latent action discovery from image/video trajectories, not as a full robot policy, typed control-input model, or contact-rich manipulation benchmark.
RAEv2 adds a narrower navigation-world-model result: in a RECON setup with four past egocentric frames and action tokens, multi-layer RAE latents reduce rollout flicker and improve video prediction metrics over the original RAE baseline. Treat this as evidence about latent interface quality for visual future-state prediction, not as a complete planning or policy-evaluation result.
stable-worldmodel is the infrastructure complement to these model papers. It standardizes trajectory data, planners, and evaluation protocols across visual/control environments, then uses factors of variation to test whether action-conditioned latent world models still plan under visual, geometric, or physical shifts. Treat it as a benchmark and implementation surface, not as evidence that any particular latent objective solves robustness.
That is directly relevant to sensor time series: the model should not merely reconstruct the next frame or next sensor value. It should preserve the latent state variables that make action consequences predictable.
Numeric State Trajectories
Not all robotics time series are vision-heavy. D4RL and CausalWorld are cleaner numeric-control anchors: they expose state-action trajectories and are closer to classical action-conditioned dynamics learning. They are useful when the question is about action-conditioned multivariate time series rather than visual manipulation.
Lessons For Time-Series Foundation Models
Robotics is useful for Time-Series Foundation Models because it exposes what passive forecasting often hides: many real systems are not only observed over time, they are acted on. The transferable lesson is to move from history -> future toward history + planned controls -> future trajectory distribution.
Actions Should Be First-Class Inputs
Classic TSFMs usually forecast future observations from historical observations and optional covariates. Robotics makes the action channel unavoidable. A robot policy must choose a control input, and a robotic world model must predict future observations conditional on candidate action sequences.
This suggests that TSFMs for healthcare, observability, energy, industrial control, recommender systems, education, and finance execution should separate:
- actions and control inputs chosen by an agent or operator;
- interventions whose causal consequences matter;
- events that are observed but not controlled;
- exogenous variables that condition the future but are not policy choices.
Without that separation, a TSFM can forecast passively but cannot answer counterfactual questions such as “what happens if this control plan is executed?”
Forecasting Should Become Counterfactual-Friendly
Robotic world models are useful when they let a planner compare possible action sequences. The analogous TSFM interface is:
past observations + planned actions/control inputs -> distribution over futures
This is stronger than ordinary covariate-conditioned forecasting. Known future exogenous variables describe what is expected to happen outside the modeled agent’s control; planned actions or interventions describe choices whose alternatives should also be evaluable.
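A toy version of this interface samples futures under several candidate plans and ranks them. The scalar linear transition below is a stand-in for a learned action-conditioned model; the coefficients, noise level, plan names, and cost function are all hypothetical.

```python
import random

def rollout(x0, plan, rng, a=0.9, b=0.5, noise=0.02):
    # Stand-in transition x_{t+1} = a * x_t + b * u_t + noise.
    # A learned world model would replace this function.
    xs, x = [], x0
    for u in plan:
        x = a * x + b * u + rng.gauss(0.0, noise)
        xs.append(x)
    return xs

def expected_cost(x0, plan, target, n_samples=200, seed=0):
    # Monte Carlo estimate of the final-step distance to the target under one plan.
    rng = random.Random(seed)
    finals = [rollout(x0, plan, rng)[-1] for _ in range(n_samples)]
    return sum(abs(f - target) for f in finals) / n_samples

# Candidate control plans over a 4-step horizon.
plans = {
    "hold": [0.0] * 4,
    "cool": [-0.4] * 4,
    "heat": [0.4] * 4,
}
costs = {name: expected_cost(1.0, plan, target=0.0) for name, plan in plans.items()}
best = min(costs, key=costs.get)
```

The point of the sketch is the shape of the call: the same history `x0` is paired with different planned controls, and each pairing yields its own distribution over futures that a planner can compare.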
Action Chunking Suggests Blockwise Future Generation
Robotics increasingly predicts short action chunks instead of only the next action. That pattern can transfer to time-series modeling: generate coherent future blocks or latent trajectory chunks, execute or evaluate part of the plan, then recondition and regenerate. This may be more stable than purely step-by-step autoregression for long horizons, especially when control inputs, delays, and feedback loops matter.
For TSFMs, the corresponding design question is whether a model should decode:
- one next value or patch;
- a full forecast horizon;
- a latent future trajectory block;
- a conditional plan-response rollout under a candidate action chunk.
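The generate, partially execute, recondition loop described above can be sketched with a toy one-dimensional system. `propose_chunk` is a hypothetical stand-in for a learned chunk generator, and the horizon, step bound, and disturbance are illustrative values.

```python
def propose_chunk(state, target, horizon, max_step):
    # Hypothetical chunk generator: plan `horizon` bounded steps toward the target.
    chunk, x = [], state
    for _ in range(horizon):
        step = max(-max_step, min(max_step, target - x))
        chunk.append(step)
        x += step
    return chunk

def run_receding_horizon(state, target, horizon=8, execute=2, max_step=0.1,
                         disturbance=None, max_steps=100):
    t, replans = 0, 0
    while abs(target - state) > 1e-6 and t < max_steps:
        chunk = propose_chunk(state, target, horizon, max_step)  # regenerate from fresh state
        replans += 1
        for u in chunk[:execute]:  # execute only the first part of the chunk
            state += u
            if disturbance is not None:
                state += disturbance(t)
            t += 1
    return state, t, replans

# Undisturbed run, then a run with a single push at step 3.
clean_state, clean_steps, clean_replans = run_receding_horizon(0.0, 1.0)
push = lambda t: -0.3 if t == 3 else 0.0
pushed_state, pushed_steps, pushed_replans = run_receding_horizon(0.0, 1.0, disturbance=push)
```

Because each new chunk is conditioned on the state after the push, the disturbed run still reaches the target; it simply needs more steps and more replans than the clean run.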
A Timestep Is A Sensor Bundle
Robotics treats each timestep as a bundle of observations, actions, context, and metadata. TSFMs often still reduce the interface to values x channels. The robotics interface argues for richer time-series examples that can include numeric features, categorical event streams, text context, topology, device metadata, missingness, sampling rate, known future plans, and operator actions.
This is especially relevant to observability and industrial systems: metrics are observations, deployments or rollbacks can be actions or interventions, incidents can be events or exogenous shocks, topology is context, and textual tickets or runbooks are auxiliary context rather than numeric channels.
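A typed record makes the bundle idea tangible. The sketch below is one possible layout, not a schema used by this wiki: `TimestepBundle`, its field names, and the observability example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TimestepBundle:
    timestamp_s: float
    observations: dict                           # numeric channels, e.g. metrics
    actions: dict = field(default_factory=dict)  # choices made by an operator or agent
    events: list = field(default_factory=list)   # observed but not controlled
    context: dict = field(default_factory=dict)  # topology, device metadata, sampling rate
    missing: set = field(default_factory=set)    # channels with no sample at this step

def to_value_row(bundle, channels):
    # The usual "values x channels" reduction keeps only numeric observations.
    return [bundle.observations.get(c) for c in channels]

step = TimestepBundle(
    timestamp_s=120.0,
    observations={"cpu_util": 0.91, "p99_latency_ms": 840.0},
    actions={"rollback": "service-a"},
    events=["incident_opened"],
    context={"topology": "service-a -> service-b", "sampling_rate_hz": 0.1},
    missing={"queue_depth"},
)
row = to_value_row(step, ["cpu_util", "p99_latency_ms", "queue_depth"])
```

The flattened `row` keeps the two metrics and a gap for the missing channel, but the rollback action, the incident event, and the topology context are gone, which is exactly the information an action-conditioned model would need.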
Physical Time Should Be Explicit
Robotics has to track camera rates, control frequency, sensor latency, frame subsampling, and action horizon. TSFMs should apply the same discipline to irregular or multi-rate time series. Token index is not always physical time.
The practical carryover is to keep delta_t, sampling rate, patch duration, control frequency, and forecast-horizon semantics explicit. Kairos is a non-robotics example of this same principle: when token granularity changes, position index alone is a weak proxy for elapsed time.
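A small sketch shows how two sequences with identical token indices can cover very different spans of physical time. The helper name and the timestamps are hypothetical.

```python
def elapsed_features(timestamps_s):
    # Return (token_index, delta_t, elapsed_since_start) for each sample.
    deltas = [0.0] + [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    start = timestamps_s[0]
    return [(i, dt, t - start) for i, (dt, t) in enumerate(zip(deltas, timestamps_s))]

regular = elapsed_features([0.0, 0.5, 1.0, 1.5])    # steady 2 Hz sampling
irregular = elapsed_features([0.0, 0.5, 3.0, 3.5])  # a 2.5 s gap after the second sample

same_indices = [f[0] for f in regular] == [f[0] for f in irregular]
regular_span = regular[-1][2]
irregular_span = irregular[-1][2]
```

Both sequences have token indices 0 through 3, yet one spans 1.5 seconds and the other 3.5 seconds. Feeding `delta_t` or elapsed time alongside position lets the model tell them apart.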
Latent State Quality Matters More Than Reconstruction
Reconstruction Or Semantics? shows the robotics version of a broader TSFM problem: a model can reconstruct observations well while losing action-relevant latent structure. For TSFMs, low error alone does not prove that the representation preserves regimes, controllability, causal variables, rare failure modes, or decision-relevant uncertainty.
Better evaluation should ask:
- Can actions or interventions be recovered from latent transitions?
- Does the latent state preserve regime, task, outcome, or failure information?
- Does the forecast remain useful when planned controls change?
- Can the model rank candidate plans, not only predict the historical continuation?
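The first question in the list, whether actions can be recovered from latent transitions, can be tested with a simple probe. The sketch below fits a one-dimensional least-squares line from latent deltas to actions and reports R^2; the synthetic latents, the noise levels, and the helper name are hypothetical, and a real probe would use held-out data and multivariate latents.

```python
import random

def r_squared(xs, ys):
    # R^2 of a one-variable least-squares fit predicting ys from xs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    ss_res = sum((y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy

rng = random.Random(0)
actions = [rng.uniform(-1.0, 1.0) for _ in range(200)]

# Latent deltas that track the action (synthetic, with small noise).
informative_dz = [2.0 * a + rng.gauss(0.0, 0.1) for a in actions]
# Latent deltas that ignore the action entirely (synthetic).
uninformative_dz = [rng.gauss(0.0, 1.0) for _ in actions]

r2_informative = r_squared(informative_dz, actions)
r2_uninformative = r_squared(uninformative_dz, actions)
```

A latent space where the probe scores near zero may still reconstruct observations well, which is the gap between reconstruction quality and action-relevant structure that this section warns about.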
Attention Should Respect Interface Semantics
Robotics models increasingly use structured attention: separate blocks for observations, state, actions, language, spatial tokens, temporal tokens, and action horizons. TSFMs can borrow this idea by separating past observations, known future exogenous variables, candidate actions/interventions, static context, and target future tokens.
That matters for both leakage and semantics. A target future token SHOULD NOT attend to labels or future observations that would be unavailable at decision time, while a forecast decoder MAY attend to known future exogenous variables and planned actions if those are part of the declared interface.
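The leakage rule can be written down as a block-level attention mask. The sketch below is one possible declared interface, not a standard: the block names, the `ALLOWED` table, and the token layout are hypothetical, and the `future_obs` block is present only to show that it is masked out for target tokens.

```python
# For each query block, the set of key blocks it may attend to.
ALLOWED = {
    "static_context": {"static_context"},
    "past_obs": {"static_context", "past_obs"},
    "known_future_exog": {"static_context", "known_future_exog"},
    "planned_actions": {"static_context", "planned_actions"},
    "future_obs": {"future_obs"},  # e.g. labels kept in the batch; never visible to targets
    "target": {"static_context", "past_obs", "known_future_exog", "planned_actions", "target"},
}

def build_mask(block_of_token):
    # mask[q][k] is True when query token q may attend to key token k.
    n = len(block_of_token)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            q_block, k_block = block_of_token[q], block_of_token[k]
            allowed = k_block in ALLOWED[q_block]
            if q_block == "target" and k_block == "target":
                allowed = k <= q  # causal order inside the target block
            mask[q][k] = allowed
    return mask

layout = [
    "static_context", "past_obs", "past_obs",
    "known_future_exog", "planned_actions", "future_obs",
    "target", "target",
]
mask = build_mask(layout)
```

In this layout the target tokens at positions 6 and 7 can read the planned action at position 4 but not the future observation at position 5, and past observations cannot read anything from the future side.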
Continuous Outputs Should Not Always Become Language Tokens
Robotics exposes the limits of naive action discretization. Continuous control often benefits from continuous heads, diffusion, flow matching, action chunks, or better sequence tokenizers. TSFMs should take the same lesson seriously for numerical values: language-token compatibility is not always worth losing scale, precision, smoothness, or physical units.
For numeric time series, promising output interfaces include continuous probabilistic heads, quantile or distributional decoders, diffusion or flow heads over future blocks, and latent trajectory generators. Tokenization remains useful, but it should be justified by task behavior rather than inherited from language modeling.
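The precision cost of tokenization is easy to quantify for uniform binning. The sketch below discretizes a value into equal-width bins and decodes it back to the bin center; the bin count and value range are hypothetical choices, not settings taken from any model in this page.

```python
def to_bin(value, low, high, n_bins):
    # Clip to the range, then map to an integer bin index in [0, n_bins - 1].
    clipped = max(low, min(high, value))
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)

def from_bin(idx, low, high, n_bins):
    # Decode a bin index to the center of its bin.
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width

LOW, HIGH, N_BINS = -1.0, 1.0, 256  # hypothetical normalized range and vocabulary size
width = (HIGH - LOW) / N_BINS

value = 0.3
token = to_bin(value, LOW, HIGH, N_BINS)
decoded = from_bin(token, LOW, HIGH, N_BINS)
error = abs(decoded - value)
```

Inside the range the round-trip error is at most half a bin width, so halving the error requires doubling the vocabulary. Values outside the range are clipped, which loses scale information entirely; a continuous head avoids both effects.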
Evaluation Should Be Decision-Facing
Robot policies are ultimately judged by closed-loop success, not only prediction loss. TSFMs should adopt analogous decision-facing evaluations when the downstream use case is operational:
- Did the model choose or rank better interventions?
- Did it prevent an incident or reduce cost?
- Did it preserve calibrated uncertainty around rare but important futures?
- Did it support a planner, controller, analyst, or operator better than a passive forecaster?
The larger direction is that TSFMs should become action-conditioned world models for time series when the domain has meaningful actions, control inputs, interventions, or decision logs.
Design Heuristics
- Preserve the distinction between observation, state, latent state, action, control input, and exogenous variable.
- Treat RGB/video as an observation stream, not as the state itself.
- Keep proprioception and action history even when vision dominates; contact-rich manipulation often needs short-term motion and gripper state.
- Represent high-frequency tactile, force, and IMU streams as windowed multivariate time series with explicit sampling rates before fusing them with lower-rate camera frames. The common adaptation path is to add them as new state or observation tokens during fine-tuning.
- Use action chunks when one-step controls are noisy, delayed, or too locally ambiguous.
- Use action-conditioned world models when the task is planning or policy evaluation, not direct behavior cloning.
- Use semantic latent spaces when downstream control cares about object identity, task progress, and action recoverability more than pixel-level reconstruction.
- For visual world models, treat the encoder layer aggregation rule as part of the data/model interface; final-layer latents, multi-layer sums, and learned layer weights may preserve different local details.
- Treat cross-embodiment normalization as a lossy modeling decision; record what was normalized, discretized, resampled, or dropped.
Relation To Foundation TSFM Agenda
This page is the physical-robotics analogy for the Foundation Time-Series Model Research Agenda. It supports the agenda’s control, latent-state, context, and generation/editing slots by showing the mature version of the interface: observations plus context and action history produce control-input chunks or future latent trajectories.
For foundation time-series models, robotics remains an analogy layer unless a source is mapped to a specific slot. It provides strong interface evidence for control, context, and action-conditioned modeling, but a digital-world TSFM still needs numeric telemetry, event streams, topology, and typed interventions rather than image-heavy manipulation alone.
Local Robotics Model Corpus
| Source | Action Interface | Fast-Part Classification |
|---|---|---|
| RT-2 | Discretized action tokens emitted by a VLM | Autoregressive VLA, not diffusion/flow |
| ACT | Continuous action chunks | Transformer/CVAE action chunking, not diffusion/flow |
| Diffusion Policy | Continuous action trajectories | Diffusion denoising policy |
| Octo | Flexible continuous action chunks | Transformer policy with diffusion action head |
| OpenVLA | Discretized action tokens | Autoregressive VLA, not diffusion/flow |
| RDT-1B | Continuous bimanual action chunks | Diffusion Transformer |
| π0 | Continuous action chunks | Flow-matching action expert |
| FAST | Compressed action tokens | Tokenization path for autoregressive VLAs |
| GR00T N1 | Continuous humanoid motor actions | DiT/action flow-matching module |
| Gemini Robotics 1.5 | Continuous numeric robot control inputs plus optional thinking text | Hierarchical VLA; source does not specify diffusion/flow |
| π0.7 | Steerable continuous action chunks | Flow-matching action expert plus FAST-supervised VLM and subgoal world model |
| Helix | Continuous upper-body humanoid controls | Fast visuomotor Transformer/regression, not stated as diffusion/flow |
| Helix 02 | Full-body joint targets and actuator commands | S1 Transformer plus S0 learned controller; no diffusion/flow claim |
External Anchors To Ingest Next
The main immediate robotics action-model sources are now local. Remaining useful follow-ups:
- RT-1: Robotics Transformer for Real-World Control at Scale
- SayCan: Do As I Can, Not As I Say
- Code as Policies
- PaLM-E: An Embodied Multimodal Language Model
- ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models
Open Questions
- Should this wiki treat robot manipulation datasets as a separate branch of Action-Conditioned Time-Series Datasets, or keep them in a robotics-specific topic because vision and embodiment dominate the interface?
- What is the best canonical representation for proprioception, force, tactile, and IMU streams when combining them with VLM-style image tokens?
- When should robot action chunks be represented as continuous trajectories, discrete tokens, or frequency-space tokens?
- Can factorized spatial-temporal attention preserve contact dynamics as well as it preserves visual task progress?
- Which benchmarks evaluate action recoverability, closed-loop success, and latent-state faithfulness rather than only video fidelity or imitation loss?
- Which embodied benchmark family should the wiki ingest next as a primary anchor: open-loop action-conditioned prediction, closed-loop policy evaluation, or executability diagnostics?
- How should robotics benchmarks distinguish genuinely unseen tasks from recombinations of skills latent in massive multi-robot training corpora?
- Can latent actions inferred from action-free robot videos be aligned with typed control-input schemas strongly enough for policy evaluation, transfer, or safety analysis?