Robotics Time-Series Modeling

Summary

Robotics foundation models usually do not treat sensor data as a plain forecasting table. They model action-conditioned multimodal trajectories: recent observations, language or task context, proprioceptive state when available, and actions or control inputs over a short control horizon. The time-series object is a trajectory, not a standalone sequence of scalar measurements.

The dominant interface is therefore:

context, observation_history, optional action_history -> action_chunk

or, for world models:

context, observation_history, action_sequence -> future_observation_or_latent_trajectory
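
The two interfaces above can be written as typed signatures. This is a minimal sketch, not any particular model's API: the class names, field names, and the toy `HoldStillPolicy` are all hypothetical, and feature vectors stand in for encoded camera tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class Context:
    instruction: str   # task text
    embodiment: str    # hypothetical embodiment tag


@dataclass(frozen=True)
class Observation:
    image_features: Vector                    # stand-in for encoded camera tokens
    proprioception: Optional[Vector] = None   # present only when the robot exposes it


class Policy(Protocol):
    def act(
        self,
        context: Context,
        observation_history: Sequence[Observation],
        action_history: Optional[Sequence[Vector]] = None,
    ) -> List[List[float]]:
        """Return an action chunk: one action vector per future control step."""
        ...


class WorldModel(Protocol):
    def rollout(
        self,
        context: Context,
        observation_history: Sequence[Observation],
        action_sequence: Sequence[Vector],
    ) -> List[List[float]]:
        """Return one predicted observation or latent per candidate action."""
        ...


class HoldStillPolicy:
    """Toy policy that satisfies the Policy protocol with an all-zero chunk."""

    def __init__(self, action_dim: int, horizon: int) -> None:
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, context, observation_history, action_history=None):
        return [[0.0] * self.action_dim for _ in range(self.horizon)]


policy: Policy = HoldStillPolicy(action_dim=7, horizon=4)
chunk = policy.act(Context("pick up the cup", "single_arm"), [Observation([0.1, 0.2])])
```

The point of the sketch is the asymmetry: a policy returns actions, while a world model takes actions as input and returns predicted observations or latents.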

Is the fast motor interface a diffusion or flow-matching model? Based on the sources ingested so far, the answer is mostly yes for high-dexterity continuous action generation, but not universally. Many recent generalist robot policies use a diffusion or flow-matching action expert for the fast motor interface: Diffusion Policy, Octo, RDT-1B, π0, GR00T N1, and π0.7. But RT-2, OpenVLA, and FAST keep an autoregressive action-token path; ACT uses Transformer/CVAE action chunks; Helix uses a fast visuomotor Transformer with continuous-action regression; and Helix 02 describes a Transformer S1 plus a learned S0 controller without claiming diffusion or flow.

So the precise rule is: modern robotics has moved away from naive next-token VLM decoding as the only motor interface, toward short-horizon chunked continuous control. Diffusion and flow matching are the dominant public recipe for that continuous action expert, but Transformers remain central as VLM backbones, denoising/flow backbones, action-token models, and fast visuomotor policies.

This page follows the wiki's Terminology conventions: robot commands are actions or numeric control inputs; camera, proprioception, force, tactile, and IMU samples are observations; task text and embodiment metadata are context; uncontrolled scene changes are events or exogenous variables.

What The Wiki Currently Believes

Robotics Data Is Trajectory Data

The closest local sources are embodied trajectory datasets and robotic world-model papers. Open X-Embodiment standardizes robot-learning datasets across embodiments and trains RT-X policies over image history, language instructions, and discretized end-effector actions. DROID adds in-the-wild manipulation episodes with synchronized cameras, calibration, depth, and language. BridgeData V2 is a frequent substrate for language-conditioned manipulation policies and action-conditioned world-model evaluation. RoboTurk is earlier teleoperation context: the raw signal is still a time-ordered demonstration with human-generated control inputs.

This differs from most Time-Series Foundation Models. A forecasting model predicts future observations from history; a robot policy chooses a control input; an action-conditioned world model predicts how future observations change under candidate actions.

Common Encoding Pattern

Robotics models tend to encode each timestep as a multimodal bundle:

Component | Usual Encoding | Time-Series Meaning
Image/video | CNN, ViT, VLM, or latent encoder tokens per frame | High-dimensional observation of partial state
Proprioception | Normalized numeric features projected to tokens or fused into the action head | Observed robot state, not the whole environment state
Force, tactile, IMU | High-frequency numeric streams, often downsampled, windowed, or encoded by a small temporal encoder | Contact and motion observations; usually more time-sensitive than RGB
Language instruction | Text tokens or a frozen text/VLM embedding | Task context, not a time series unless it changes over the episode
Embodiment/control metadata | Robot ID, control mode, action-space metadata, control frequency | Static or slowly changing context that disambiguates action semantics
Actions/control inputs | Discrete bins, text-like action tokens, continuous chunks, diffusion/flow trajectories, or frequency-space tokens | Controllable inputs used for policy output or world-model conditioning

The dominant VLA recipe is late fusion: images, language, and robot-state vectors are encoded by modality-specific frontends, projected to a shared token width, and then concatenated into one Transformer sequence. Proprioception usually enters as one or more continuous state tokens after an MLP or linear projection rather than through scalar binning.
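
The late-fusion recipe can be sketched with plain lists. Everything here is an illustrative assumption: the random matrices stand in for learned projections, the input widths (16, 12, 7) and shared width (8) are arbitrary, and the modality frontends are replaced by random feature vectors.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_projection(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Random stand-in for a learned linear layer (no bias)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights: Matrix, x: List[float]) -> List[float]:
    """Apply the linear map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def fuse(image_tokens, text_tokens, state, w_image, w_text, w_state):
    """Project each modality to the shared width, then concatenate along the sequence."""
    sequence = [project(w_image, tok) for tok in image_tokens]
    sequence += [project(w_text, tok) for tok in text_tokens]
    sequence.append(project(w_state, state))  # one continuous state token, no binning
    return sequence


rng = random.Random(0)
WIDTH = 8  # shared token width (assumed)
w_image = make_projection(16, WIDTH, rng)
w_text = make_projection(12, WIDTH, rng)
w_state = make_projection(7, WIDTH, rng)

image_tokens = [[rng.random() for _ in range(16)] for _ in range(4)]
text_tokens = [[rng.random() for _ in range(12)] for _ in range(3)]
joint_state = [0.1, -0.4, 0.0, 0.7, 0.2, -0.1, 1.0]  # already normalized

sequence = fuse(image_tokens, text_tokens, joint_state, w_image, w_text, w_state)
```

The resulting sequence has four image tokens, three text tokens, and one state token, all at the same width, which is what lets a single Transformer consume them.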

Canonical Action/State Interfaces

The cross-embodiment problem is mostly an interface problem. A policy trained across robots should not require every embodiment to expose the same raw joints, actuator commands, controller gains, camera rig, or control frequency. It needs a more stable contract between the general policy and the robot-specific mechanics.

Open X-Embodiment coarsely aligns heterogeneous robots by choosing a canonical camera stream and mapping controls into a normalized end-effector action representation, but this loses embodiment-specific details. Newer physical-intelligence-style models increasingly add metadata for control mode, speed, quality, visual subgoals, or embodiment so the same model can interpret the same numeric action representation differently.

The common pieces are:

  • End-effector delta pose: a relative command for the tool, gripper, hand, or suction cup, usually dx, dy, dz plus an orientation delta. This is more portable than joint commands because many different arms can move a tool through the same local workspace displacement.
  • Gripper action: an open/close, width, force, or suction command. The physical gripper may differ, but the task-level primitive “release or grasp” is often shared.
  • Normalized proprioception: robot-state observations such as joint positions, joint velocities, end-effector pose, gripper state, force/torque, base motion, or IMU values projected into normalized numeric tokens. These tokens give the model body awareness without forcing all robots to share the same raw encoder units.
  • Action chunks: short future sequences of control inputs. They let the policy express a local trajectory rather than a single twitch, then replan under fresh observations.
  • Embodiment and control metadata: robot ID, control mode, action-space shape, control frequency, action horizon, speed or quality tags, and capability masks. These fields tell the model how to interpret an otherwise similar numeric action vector.
  • Robot-specific adapter or controller: inverse kinematics, motion planning, trajectory tracking, safety filters, learned whole-body controllers, or actuator-level control loops that map the canonical command to actual joints and motors.

The resulting stack is:

observations + task context + embodiment metadata
  -> shared policy or world model
  -> canonical action/state interface
  -> robot-specific adapter or controller
  -> joints, actuators, and safety limits
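
One way to picture the lower half of this stack is a canonical action record plus a robot-specific safety filter. This is a sketch under stated assumptions: the field names, units (meters, radians, a 0 to 1 gripper scale), and the `limit_translation` helper are hypothetical and do not describe any dataset's schema.

```python
import math
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class CanonicalAction:
    delta_position: Tuple[float, float, float]  # dx, dy, dz in meters (assumed unit)
    delta_rotation: Tuple[float, float, float]  # orientation delta in radians (assumed)
    gripper: float                              # 0.0 open .. 1.0 closed (assumed scale)


@dataclass(frozen=True)
class EmbodimentMeta:
    robot_id: str
    control_mode: str
    control_hz: float
    max_step_m: float  # per-step translation limit enforced by the adapter


def limit_translation(action: CanonicalAction, meta: EmbodimentMeta) -> CanonicalAction:
    """Robot-specific safety filter: shrink the translation if it exceeds the limit."""
    norm = math.sqrt(sum(c * c for c in action.delta_position))
    if norm == 0.0 or norm <= meta.max_step_m:
        return action
    scale = meta.max_step_m / norm
    scaled = tuple(c * scale for c in action.delta_position)
    return replace(action, delta_position=scaled)


meta = EmbodimentMeta("arm_a", "ee_delta", control_hz=10.0, max_step_m=0.05)
requested = CanonicalAction((0.3, 0.4, 0.0), (0.0, 0.0, 0.1), gripper=1.0)
safe = limit_translation(requested, meta)
```

The shared policy only ever emits `CanonicalAction`; the limit lives in the adapter, so two robots with different `max_step_m` values can execute the same policy output differently.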

This interface is not magic. End-effector actions do not preserve every detail of a dexterous hand, humanoid balance controller, contact-rich insertion task, or force-sensitive manipulation. Cross-embodiment normalization is therefore a lossy modeling decision. Agents SHOULD record what was normalized, discretized, resampled, masked, or dropped when converting robot datasets.
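
A conversion record of that kind could look like the following. This is only a sketch: the field names, rates, and the record layout are hypothetical, and per-dimension z-scoring is one assumed normalization among several.

```python
from statistics import fmean, pstdev
from typing import Dict, List


def fit_stats(rows: List[List[float]]) -> Dict[str, List[float]]:
    """Per-dimension mean and population std, with a floor to avoid division by zero."""
    columns = list(zip(*rows))
    return {
        "mean": [fmean(col) for col in columns],
        "std": [max(pstdev(col), 1e-8) for col in columns],
    }


def normalize(row: List[float], stats: Dict[str, List[float]]) -> List[float]:
    return [(v - m) / s for v, m, s in zip(row, stats["mean"], stats["std"])]


# Toy proprioception rows: [joint_position, gripper_width] (hypothetical fields).
raw = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
stats = fit_stats(raw)
normalized = [normalize(row, stats) for row in raw]

# Stored next to the converted dataset so the lossy steps stay auditable.
conversion_record = {
    "normalized": {"fields": ["joint_position", "gripper_width"], "stats": stats},
    "resampled": {"from_hz": 30.0, "to_hz": 10.0},
    "discretized": [],
    "masked": [],
    "dropped": ["wrist_force_torque"],
}
```

Keeping the statistics inside the record matters because the normalized values cannot be mapped back to physical units without them.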

Time Handling

Robotics time is mostly control time, not calendar time. The model sees a finite history window and outputs either the next control input, a short action chunk, or a future latent/video trajectory. The key design choices are:

  • History length: RT-X-style policies use a short image history; world models condition on a finite visual-action history; diffusion and flow policies condition on recent observations before generating action chunks.
  • Action horizon: ACT, diffusion policies, RDT-like models, and π0-like flow policies predict multiple future control inputs at once, then execute part of the chunk before replanning.
  • Receding horizon: diffusion and flow policies usually replan repeatedly rather than executing an entire long rollout open-loop.
  • Temporal smoothing: ACT-style temporal ensembling averages overlapping action chunks to reduce jitter and compounding errors.
  • Frame rate and control frequency: cross-robot data may use different control rates; models either normalize actions into a shared representation, include control-frequency metadata, or rely on the dataset conversion layer.
  • Generative timestep: diffusion and flow policies have a second notion of time: the denoising or flow timestep. This is usually encoded explicitly, often with sinusoidal features, and is separate from physical control time.
  • Elapsed time versus token index: when sampling is irregular, token position is not enough. This is the same failure mode that Kairos addresses for time-series forecasting with mixed patch sizes and physical-time calibration.
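
Temporal ensembling over overlapping chunks can be sketched for a scalar action as follows. The exponential weighting, the decay constant `m`, and the choice to give the oldest prediction the largest weight are assumptions of this sketch; the function and variable names are hypothetical.

```python
import math
from typing import Dict, List


def ensemble_action(chunks: Dict[int, List[float]], t: int, m: float = 0.1) -> float:
    """Blend every prediction that overlapping chunks made for control step t.

    `chunks` maps the step at which a chunk was predicted to its scalar actions
    for steps start, start + 1, ... The oldest prediction gets weight exp(0).
    """
    predictions = []
    for start in sorted(chunks):  # oldest chunk first
        offset = t - start
        if 0 <= offset < len(chunks[start]):
            predictions.append(chunks[start][offset])
    if not predictions:
        raise ValueError(f"no chunk covers step {t}")
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


# Two overlapping chunks that disagree about step 1.
chunks = {0: [1.0, 1.0, 1.0], 1: [3.0, 3.0, 3.0]}
uniform = ensemble_action(chunks, t=1, m=0.0)    # plain average of 1.0 and 3.0
decayed = ensemble_action(chunks, t=1, m=0.5)    # pulled toward the older chunk
only_first = ensemble_action(chunks, t=0)        # only the first chunk covers step 0
```

With `m = 0` the blend is a plain average; larger `m` trusts older predictions more, which smooths the command at the cost of slower reaction to new observations.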

The practical rule is: keep delta_t, control frequency, frame subsampling, and action horizon explicit whenever data crosses robots, sensors, or datasets.
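
A small timing record makes these quantities explicit. The class name, field names, and the example rates are hypothetical; the arithmetic is the only claim.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimingSpec:
    control_hz: float      # rate at which control inputs are executed
    camera_hz: float       # native camera frame rate
    frame_stride: int      # keep every k-th camera frame
    action_horizon: int    # number of control inputs per chunk

    @property
    def delta_t(self) -> float:
        """Seconds between consecutive control inputs."""
        return 1.0 / self.control_hz

    @property
    def observation_dt(self) -> float:
        """Seconds between the frames the model actually sees."""
        return self.frame_stride / self.camera_hz

    @property
    def chunk_seconds(self) -> float:
        """Physical duration covered by one action chunk."""
        return self.action_horizon / self.control_hz

    def action_times(self) -> List[float]:
        """Elapsed time of each action in a chunk, measured from the decision point."""
        return [(i + 1) * self.delta_t for i in range(self.action_horizon)]


spec = TimingSpec(control_hz=50.0, camera_hz=30.0, frame_stride=3, action_horizon=50)
```

Two datasets with the same `action_horizon` but different `control_hz` cover different physical durations, which is exactly the mismatch that token indices hide.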

Action Encoding

There are four main action interfaces in the local robotics corpus:

  1. Discretized action tokens. RT-2 and OpenVLA bucket action dimensions and train with categorical next-token losses, often through a VLM backbone.
  2. Continuous action chunks without denoising. ACT predicts continuous action chunks with a Transformer/CVAE-style generative model, while Helix reports a fast visuomotor Transformer trained with regression for continuous upper-body control.
  3. Diffusion or flow action chunks. Diffusion Policy, Octo, RDT-1B, π0, GR00T N1, and π0.7 use diffusion or flow matching over short future control-input trajectories.
  4. Compressed action tokens. FAST moves an action chunk into frequency space before discrete tokenization, making high-frequency continuous controls more compatible with autoregressive VLA training.

For this wiki, all four are representations of control inputs. They should not be confused with passive events or exogenous variables.
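
Interfaces 1 and 4 can be contrasted in a few lines. This is a simplified sketch: uniform binning with 256 bins and a plain orthonormal DCT are assumptions chosen for illustration, not a description of how RT-2, OpenVLA, or FAST are implemented, and all function names are hypothetical.

```python
import math
from typing import List


def encode_bin(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous control value to a uniform bin index."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high lands in the last bin


def decode_bin(index: int, low: float, high: float, n_bins: int = 256) -> float:
    """Map a bin index back to the bin center."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


def dct(chunk: List[float]) -> List[float]:
    """Orthonormal DCT-II of one action dimension over the chunk."""
    n = len(chunk)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(chunk))
        for k in range(n)
    ]


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of `dct` (the transpose of the orthonormal transform)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


token = encode_bin(0.3, low=-1.0, high=1.0)
recovered = decode_bin(token, low=-1.0, high=1.0)

chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]       # smooth ramp
coeffs = dct(chunk)
frequency_tokens = [round(c / 0.05) for c in coeffs]   # assumed quantization step
roundtrip = idct(coeffs)
```

Per-step binning spends one token per dimension per control step regardless of smoothness, while a frequency transform concentrates a smooth chunk into a few low-frequency coefficients before quantization.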

Attention And Sequence Mixers

Robotics models reuse Transformer components, but the attention pattern is shaped by the embodied interface:

Pattern | Typical Use | Attention / Mixer
Decoder-only policy Transformer | RT-1/RT-X-style policies over image-history and language tokens | Causal or autoregressive self-attention over compressed observation/context tokens before action-token prediction
VLM-as-policy | RT-2/OpenVLA-style policies | Pretrained vision-language attention reused; action tokens become part of the output vocabulary
Action Chunking Transformer | Bimanual imitation learning and low-data dexterous tasks | Transformer over observations and latent action sequence, with temporal ensembling at inference
Diffusion Transformer policy | Diffusion Policy, Octo, and RDT-style policies | Denoising network over action trajectories, conditioned on visual/language/proprioceptive observations by cross-attention or token fusion
Flow-matching action expert | π0, π0.7, and GR00T-style policies | VLM backbone supplies semantic context; a separate continuous action expert generates fluent control trajectories
Fast visuomotor Transformer | Helix-style humanoid policies | Slow VLM latents condition a high-rate Transformer policy that outputs continuous control inputs directly
Whole-body controller hierarchy | Helix 02-style humanoid stacks | Semantic VLA layer supplies targets to a faster learned body controller for balance, contact, and actuator-level execution
Blockwise multimodal policy | π0-style physical-intelligence policies | Separate blocks for image/language context, robot state, and future action tokens; causal masking prevents later action tokens from leaking into observation/state processing
Action-conditioned latent world model | Robotic video/latent rollout for planning and policy evaluation | DiT-style transition model; Reconstruction Or Semantics? uses factorized spatial attention within frames and causal temporal attention across frames; RAEv2 tests multi-layer RAE latents for action-conditioned navigation rollouts
Recurrent or SSM mixer | Low-latency control, long sensor histories, or deployment-constrained loops | Mamba-2-style state-space mixers are attractive when recurrent inference matters more than full attention over long histories

The attention axis is not just “global versus local.” Robotics often needs structured mixing across space, time, sensor channel, language context, and action horizon. Factorized attention is common because full attention over all video patches, timesteps, views, and action tokens becomes expensive quickly.
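
The cost argument is simple arithmetic. The sketch below counts query-key pairs for one layer over video tokens only; the frame and patch counts are arbitrary example values, and causal masking, multiple views, and action tokens are ignored.

```python
def full_attention_pairs(frames: int, patches: int) -> int:
    """Every token attends to every token across all frames."""
    tokens = frames * patches
    return tokens * tokens


def factorized_pairs(frames: int, patches: int) -> int:
    """Spatial attention within each frame plus temporal attention per patch location."""
    spatial = frames * patches * patches   # each frame attends over its own patches
    temporal = patches * frames * frames   # each patch location attends across frames
    return spatial + temporal


full = full_attention_pairs(frames=8, patches=256)        # 4_194_304
factorized = factorized_pairs(frames=8, patches=256)      # 540_672
ratio = full / factorized
```

For eight frames of 256 patches the factorized pattern needs roughly an eighth of the pairs, and the gap widens as the history grows because the full pattern scales with the square of frames times patches.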

Latent World Models

World Model for Robot Learning Survey gives the current high-level map for this section. It separates world models for policies, learned simulators/evaluators, and robotic video generation, and it narrows the success criterion from visually plausible futures to action-conditioned consistency, physical executability, and downstream task or policy utility.

Reconstruction Or Semantics? is the strongest local anchor for robotic world models. It frames action-conditioned video world models as predictors of future observations from observation and action histories, but argues that the latent space matters more than pixel fidelity alone. Semantic latents can preserve action-relevant object state, task progress, and controllability better than reconstruction latents.

Genie adds a different robotics boundary case: it trains a smaller model on action-free robot videos from RT-1-style data and shows that a latent action model can recover consistent action-like codes even when ground-truth robot actions are ignored. Treat this as evidence for latent action discovery from image/video trajectories, not as a full robot policy, typed control-input model, or contact-rich manipulation benchmark.

RAEv2 adds a narrower navigation-world-model result: in a RECON setup with four past egocentric frames and action tokens, multi-layer RAE latents reduce rollout flicker and improve video prediction metrics over the original RAE baseline. Treat this as evidence about latent interface quality for visual future-state prediction, not as a complete planning or policy-evaluation result.

stable-worldmodel is the infrastructure complement to these model papers. It standardizes trajectory data, planners, and evaluation protocols across visual/control environments, then uses factors of variation to test whether action-conditioned latent world models still plan under visual, geometric, or physical shifts. Treat it as a benchmark and implementation surface, not as evidence that any particular latent objective solves robustness.

That is directly relevant to sensor time series: the model should not merely reconstruct the next frame or next sensor value. It should preserve the latent state variables that make action consequences predictable.

Numeric State Trajectories

Not all robotics time series are vision-heavy. D4RL and CausalWorld are cleaner numeric-control anchors: they expose state-action trajectories and are closer to classical action-conditioned dynamics learning. They are useful when the question is about action-conditioned multivariate time series rather than visual manipulation.

Lessons For Time-Series Foundation Models

Robotics is useful for Time-Series Foundation Models because it exposes what passive forecasting often hides: many real systems are not only observed over time, they are acted on. The transferable lesson is to move from history -> future toward history + planned controls -> future trajectory distribution.

Actions Should Be First-Class Inputs

Classic TSFMs usually forecast future observations from historical observations and optional covariates. Robotics makes the action channel unavoidable. A robot policy must choose a control input, and a robotic world model must predict future observations conditional on candidate action sequences.

This suggests that TSFMs for healthcare, observability, energy, industrial control, recommender systems, education, and finance execution should separate:

  • actions and control inputs chosen by an agent or operator;
  • interventions whose causal consequences matter;
  • events that are observed but not controlled;
  • exogenous variables that condition the future but are not policy choices.

Without that separation, a TSFM can forecast passively but cannot answer counterfactual questions such as “what happens if this control plan is executed?”

Forecasting Should Become Counterfactual-Friendly

Robotic world models are useful when they let a planner compare possible action sequences. The analogous TSFM interface is:

past observations + planned actions/control inputs -> distribution over futures

This is stronger than ordinary covariate-conditioned forecasting. Known future exogenous variables describe what is expected to happen outside the modeled agent’s control; planned actions or interventions describe choices whose alternatives should also be evaluable.
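
A toy scalar system shows what the interface buys. The linear-Gaussian dynamics, the coefficients, and the plan names are assumptions made purely for illustration; a real model would learn the transition and return a richer distribution.

```python
from typing import Dict, List, Tuple


def rollout(
    x0: float, plan: List[float], a: float = 0.9, b: float = 0.5, noise_var: float = 0.01
) -> List[Tuple[float, float]]:
    """Propagate mean and variance of x[t+1] = a * x[t] + b * u[t] + noise."""
    mean, var = x0, 0.0
    trajectory = []
    for u in plan:
        mean = a * mean + b * u
        var = a * a * var + noise_var
        trajectory.append((mean, var))
    return trajectory


def rank_plans(x0: float, plans: Dict[str, List[float]], target: float) -> List[str]:
    """Order candidate plans by how close the predicted final mean is to the target."""
    scores = {name: abs(rollout(x0, plan)[-1][0] - target) for name, plan in plans.items()}
    return sorted(scores, key=scores.get)


plans = {
    "hold": [0.0, 0.0, 0.0],
    "push": [1.0, 1.0, 1.0],
    "brake": [-1.0, -1.0, -1.0],
}
ranking = rank_plans(x0=0.0, plans=plans, target=2.0)
decay_only = rollout(1.0, [0.0, 0.0])
```

The same history produces three different futures because the plans differ, which a forecaster conditioned only on history cannot express.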

Action Chunking Suggests Blockwise Future Generation

Robotics increasingly predicts short action chunks instead of only the next action. That pattern can transfer to time-series modeling: generate coherent future blocks or latent trajectory chunks, execute or evaluate part of the plan, then recondition and regenerate. This may be more stable than purely step-by-step autoregression for long horizons, especially when control inputs, delays, and feedback loops matter.

For TSFMs, the corresponding design question is whether a model should decode:

  • one next value or patch;
  • a full forecast horizon;
  • a latent future trajectory block;
  • a conditional plan-response rollout under a candidate action chunk.
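
The generate, commit, recondition loop can be sketched with a deliberately trivial block generator. Linear extrapolation stands in for a learned block decoder, and the `horizon` and `commit` values are arbitrary; all names are hypothetical.

```python
from typing import Callable, List


def extrapolate_block(history: List[float], horizon: int) -> List[float]:
    """Toy block generator: continue the last observed slope for `horizon` steps."""
    slope = history[-1] - history[-2]
    return [history[-1] + slope * (i + 1) for i in range(horizon)]


def blockwise_forecast(
    history: List[float],
    total_steps: int,
    horizon: int = 4,
    commit: int = 2,
    generate: Callable[[List[float], int], List[float]] = extrapolate_block,
) -> List[float]:
    """Generate a block, keep only its first `commit` steps, recondition, repeat."""
    context = list(history)
    forecast: List[float] = []
    while len(forecast) < total_steps:
        block = generate(context, horizon)
        kept = block[:commit]
        forecast.extend(kept)
        context.extend(kept)  # in deployment, fresh observations would go here instead
    return forecast[:total_steps]


forecast = blockwise_forecast([0.0, 1.0, 2.0], total_steps=5)
```

The ratio of `commit` to `horizon` is the design knob: committing the whole block is open-loop generation, while committing one step recovers step-by-step autoregression.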

A Timestep Is A Sensor Bundle

Robotics treats each timestep as a bundle of observations, actions, context, and metadata. TSFMs often still reduce the interface to values x channels. The robotics interface argues for richer time-series examples that can include numeric features, categorical event streams, text context, topology, device metadata, missingness, sampling rate, known future plans, and operator actions.

This is especially relevant to observability and industrial systems: metrics are observations, deployments or rollbacks can be actions or interventions, incidents can be events or exogenous shocks, topology is context, and textual tickets or runbooks are auxiliary context rather than numeric channels.
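
For the observability case, such a bundle might be typed as below. The schema, field names, and example values are hypothetical; the sketch only shows actions, events, context, and missingness kept apart from numeric channels.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TimestepBundle:
    timestamp_s: float
    observations: Dict[str, float]                      # metrics
    actions: List[str] = field(default_factory=list)    # deployments, rollbacks
    events: List[str] = field(default_factory=list)     # incidents, exogenous shocks
    missing: List[str] = field(default_factory=list)    # channels absent at this step


@dataclass
class SeriesExample:
    context: Dict[str, str]          # topology, device metadata, ticket text
    sampling_rate_hz: float
    steps: List[TimestepBundle]
    planned_actions: List[str] = field(default_factory=list)  # known future plans


def action_times(example: SeriesExample) -> List[float]:
    """Timestamps at which an operator or agent acted."""
    return [step.timestamp_s for step in example.steps if step.actions]


example = SeriesExample(
    context={"service": "checkout", "upstream": "payments"},
    sampling_rate_hz=1.0 / 60.0,
    steps=[
        TimestepBundle(0.0, {"latency_ms": 120.0, "error_rate": 0.01}),
        TimestepBundle(60.0, {"latency_ms": 340.0}, actions=["deploy v2"], missing=["error_rate"]),
        TimestepBundle(120.0, {"latency_ms": 900.0, "error_rate": 0.2}, events=["incident opened"]),
    ],
    planned_actions=["rollback v2"],
)
```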

Physical Time Should Be Explicit

Robotics has to track camera rates, control frequency, sensor latency, frame subsampling, and action horizon. TSFMs should apply the same discipline to irregular or multi-rate time series. Token index is not always physical time.

The practical carryover is to keep delta_t, sampling rate, patch duration, control frequency, and forecast-horizon semantics explicit. Kairos is a non-robotics example of this same principle: when token granularity changes, position index alone is a weak proxy for elapsed time.
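
A minimal illustration of why index and elapsed time diverge, assuming timestamps in seconds; the function name and tuple layout are hypothetical.

```python
from typing import List, Tuple


def elapsed_features(timestamps: List[float]) -> List[Tuple[int, float, float]]:
    """Return (token_index, delta_t since previous sample, time since first sample)."""
    deltas = [0.0] + [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    since_start = [t - timestamps[0] for t in timestamps]
    return list(zip(range(len(timestamps)), deltas, since_start))


# Irregular sampling: the index advances by one each time, elapsed time does not.
features = elapsed_features([0.0, 1.0, 1.5, 5.0])
```

Tokens 2 and 3 are adjacent in index but 3.5 seconds apart, so a position embedding alone would treat that gap the same as the 0.5-second gap before it.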

Latent State Quality Matters More Than Reconstruction

Reconstruction Or Semantics? shows the robotics version of a broader TSFM problem: a model can reconstruct observations well while losing action-relevant latent structure. For TSFMs, low error alone does not prove that the representation preserves regimes, controllability, causal variables, rare failure modes, or decision-relevant uncertainty.

Better evaluation should ask:

  • Can actions or interventions be recovered from latent transitions?
  • Does the latent state preserve regime, task, outcome, or failure information?
  • Does the forecast remain useful when planned controls change?
  • Can the model rank candidate plans, not only predict the historical continuation?
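
The last question can be scored with a pairwise ranking accuracy over candidate plans. This is one assumed metric among several reasonable ones, and it presumes realized outcomes are available for each plan, for example from a simulator or logged experiments.

```python
from itertools import combinations
from typing import List


def pairwise_ranking_accuracy(predicted: List[float], realized: List[float]) -> float:
    """Fraction of plan pairs whose predicted order matches the realized order.

    Pairs with tied realized outcomes are skipped; predicted ties count as misses.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(predicted)), 2):
        if realized[i] == realized[j]:
            continue
        total += 1
        if (predicted[i] - predicted[j]) * (realized[i] - realized[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")


realized_cost = [10.0, 20.0, 30.0]
perfect = pairwise_ranking_accuracy([1.0, 2.0, 3.0], realized_cost)
reversed_order = pairwise_ranking_accuracy([3.0, 2.0, 1.0], realized_cost)
one_swap = pairwise_ranking_accuracy([1.0, 3.0, 2.0], realized_cost)
```

A model can score well here while having poor absolute forecast error, and the reverse, which is why ranking and reconstruction metrics are worth reporting separately.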

Attention Should Respect Interface Semantics

Robotics models increasingly use structured attention: separate blocks for observations, state, actions, language, spatial tokens, temporal tokens, and action horizons. TSFMs can borrow this idea by separating past observations, known future exogenous variables, candidate actions/interventions, static context, and target future tokens.

That matters for both leakage and semantics. A target future token SHOULD NOT attend to labels or future observations that would be unavailable at decision time, while a forecast decoder MAY attend to known future exogenous variables and planned actions if those are part of the declared interface.
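
Those rules can be expressed as a boolean attention mask over typed tokens. The token-kind labels and the specific policy (every token may read declared inputs, targets are causal among themselves, inputs never read targets) are assumptions of this sketch, and targets are assumed to appear in time order.

```python
from typing import List

CONDITIONING = {"static_context", "past_obs", "future_exog", "planned_action"}


def build_mask(kinds: List[str]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if kinds[k] in CONDITIONING:
                mask[q][k] = True      # declared inputs are visible to everyone
            elif kinds[q] == "target":
                mask[q][k] = k <= q    # targets see themselves and earlier targets only
            # otherwise: a conditioning token never reads a target token
    return mask


kinds = ["static_context", "past_obs", "past_obs", "planned_action", "target", "target"]
mask = build_mask(kinds)
```

Anything not listed in `CONDITIONING`, such as a label channel, is invisible to every query by construction, which is the leakage guarantee the paragraph above asks for.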

Continuous Outputs Should Not Always Become Language Tokens

Robotics exposes the limits of naive action discretization. Continuous control often benefits from continuous heads, diffusion, flow matching, action chunks, or better sequence tokenizers. TSFMs should take the same lesson seriously for numerical values: language-token compatibility is not always worth losing scale, precision, smoothness, or physical units.

For numeric time series, promising output interfaces include continuous probabilistic heads, quantile or distributional decoders, diffusion or flow heads over future blocks, and latent trajectory generators. Tokenization remains useful, but it should be justified by task behavior rather than inherited from language modeling.
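
As one concrete instance of a continuous output interface, a quantile decoder is trained with the pinball loss, which keeps values in their physical units. The function names and the example numbers are illustrative.

```python
from typing import List


def pinball_loss(y: float, prediction: float, q: float) -> float:
    """Quantile (pinball) loss for a single value and quantile level q in (0, 1)."""
    diff = y - prediction
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball(ys: List[float], predictions: List[float], q: float) -> float:
    """Average pinball loss over a forecast block."""
    return sum(pinball_loss(y, p, q) for y, p in zip(ys, predictions)) / len(ys)


under = pinball_loss(10.0, 8.0, q=0.9)    # under-prediction of a high quantile
over = pinball_loss(10.0, 12.0, q=0.9)    # over-prediction by the same amount
block = mean_pinball([10.0, 10.0], [8.0, 12.0], q=0.9)
```

At `q = 0.9` an under-prediction costs nine times as much as an equal over-prediction, which is what pushes the decoder output toward the 90th percentile rather than the mean.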

Evaluation Should Be Decision-Facing

Robot policies are ultimately judged by closed-loop success, not only prediction loss. TSFMs should adopt analogous decision-facing evaluations when the downstream use case is operational:

  • Did the model choose or rank better interventions?
  • Did it prevent an incident or reduce cost?
  • Did it preserve calibrated uncertainty around rare but important futures?
  • Did it support a planner, controller, analyst, or operator better than a passive forecaster?

The larger direction is that TSFMs should become action-conditioned world models for time series when the domain has meaningful actions, control inputs, interventions, or decision logs.

Design Heuristics

  • Preserve the distinction between observation, state, latent state, action, control input, and exogenous variable.
  • Treat RGB/video as an observation stream, not as the state itself.
  • Keep proprioception and action history even when vision dominates; contact-rich manipulation often needs short-term motion and gripper state.
  • Represent high-frequency tactile, force, and IMU streams as windowed multivariate time series with explicit sampling rates before fusing them with lower-rate camera frames. The common adaptation path is to add them as new state or observation tokens during fine-tuning.
  • Use action chunks when one-step controls are noisy, delayed, or too locally ambiguous.
  • Use action-conditioned world models when the task is planning or policy evaluation, not direct behavior cloning.
  • Use semantic latent spaces when downstream control cares about object identity, task progress, and action recoverability more than pixel-level reconstruction.
  • For visual world models, treat the encoder layer aggregation rule as part of the data/model interface; final-layer latents, multi-layer sums, and learned layer weights may preserve different local details.
  • Treat cross-embodiment normalization as a lossy modeling decision; record what was normalized, discretized, resampled, or dropped.

Relation To Foundation TSFM Agenda

This page is the physical-robotics analogy for the Foundation Time-Series Model Research Agenda. It supports the agenda’s control, latent-state, context, and generation/editing slots by showing the mature version of the interface: observations plus context and action history produce control-input chunks or future latent trajectories.

For foundation time-series models, robotics remains an analogy layer unless a source is mapped to a specific slot. It provides strong interface evidence for control, context, and action-conditioned modeling, but a digital-world TSFM still needs numeric telemetry, event streams, topology, and typed interventions rather than image-heavy manipulation alone.

Local Robotics Model Corpus

Source | Action Interface | Fast-Part Classification
RT-2 | Discretized action tokens emitted by a VLM | Autoregressive VLA, not diffusion/flow
ACT | Continuous action chunks | Transformer/CVAE action chunking, not diffusion/flow
Diffusion Policy | Continuous action trajectories | Diffusion denoising policy
Octo | Flexible continuous action chunks | Transformer policy with diffusion action head
OpenVLA | Discretized action tokens | Autoregressive VLA, not diffusion/flow
RDT-1B | Continuous bimanual action chunks | Diffusion Transformer
π0 | Continuous action chunks | Flow-matching action expert
FAST | Compressed action tokens | Tokenization path for autoregressive VLAs
GR00T N1 | Continuous humanoid motor actions | DiT/action flow-matching module
Gemini Robotics 1.5 | Continuous numeric robot control inputs plus optional thinking text | Hierarchical VLA; source does not specify diffusion/flow
π0.7 | Steerable continuous action chunks | Flow-matching action expert plus FAST-supervised VLM and subgoal world model
Helix | Continuous upper-body humanoid controls | Fast visuomotor Transformer/regression, not stated as diffusion/flow
Helix 02 | Full-body joint targets and actuator commands | S1 Transformer plus S0 learned controller; no diffusion/flow claim

External Anchors To Ingest Next

The main immediate robotics action-model sources are now local. Remaining useful follow-ups:

Open Questions

  • Should this wiki treat robot manipulation datasets as a separate branch of Action-Conditioned Time-Series Datasets, or keep them in a robotics-specific topic because vision and embodiment dominate the interface?
  • What is the best canonical representation for proprioception, force, tactile, and IMU streams when combining them with VLM-style image tokens?
  • When should robot action chunks be represented as continuous trajectories, discrete tokens, or frequency-space tokens?
  • Can factorized spatial-temporal attention preserve contact dynamics as well as it preserves visual task progress?
  • Which benchmarks evaluate action recoverability, closed-loop success, and latent-state faithfulness rather than only video fidelity or imitation loss?
  • Which embodied benchmark family should the wiki ingest next as a primary anchor: open-loop action-conditioned prediction, closed-loop policy evaluation, or executability diagnostics?
  • How should robotics benchmarks distinguish genuinely unseen tasks from recombinations of skills latent in massive multi-robot training corpora?
  • Can latent actions inferred from action-free robot videos be aligned with typed control-input schemas strongly enough for policy evaluation, transfer, or safety analysis?