Latent Action Models

Summary

A latent action model (LAM) is an action-discovery module: it infers a compact, action-like code from raw observation transitions when explicit action labels are missing.

In this wiki, LAMs matter because they separate two problems that are often collapsed:

  • learning future dynamics from observation sequences;
  • recovering a control interface that can make those dynamics steerable.

Genie is the current anchor source. It learns discrete latent actions from unlabeled image/video trajectories, then trains a dynamics model that can generate future video tokens conditioned on those latent actions.

Core Contract

The useful LAM contract is:

$$
a_t = \mathrm{LAM}(o_{1:t},\, o_{t+1}), \qquad z_t = \mathrm{Enc}(o_{1:t}), \qquad \hat{z}_{t+1} = \mathrm{Dyn}(z_t,\, a_t)
$$

where $o_t$ is an observation, $z_t$ is a latent state or token representation, and $a_t$ is an inferred latent action. The latent action is an action-like code that explains a transition; it is not automatically a logged action, typed control input, intervention, or executable command.

flowchart LR
  Obs["observation history"]
  NextObs["next observation"]
  LAM["latent action model"]
  Code["latent action code"]
  State["latent state or tokens"]
  Dynamics["action-conditioned dynamics"]
  Rollout["future observation rollout"]
  Adapter["optional action adapter"]
  Typed["typed action or control input"]

  Obs --> LAM
  NextObs --> LAM
  LAM --> Code
  Obs --> State
  State --> Dynamics
  Code --> Dynamics --> Rollout
  Code --> Adapter --> Typed
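
The flow above can be sketched as a toy program. This is a minimal illustration under stated assumptions, not Genie's implementation: observations are single floats, the codebook `CODEBOOK` is fixed by hand rather than learned, and every name (`LatentAction`, `encode_state`, `infer_latent_action`, `dynamics`, `rollout`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Assumption: a hand-picked codebook of transition prototypes.
# A real LAM would learn these codes from unlabeled trajectories.
CODEBOOK = (-1.0, 0.0, 1.0)


@dataclass(frozen=True)
class LatentAction:
    """An inferred action-like code; not a logged or executable action."""
    code: int


def encode_state(obs_history: Sequence[float]) -> float:
    """Toy latent state: just the most recent observation."""
    return obs_history[-1]


def infer_latent_action(obs_history: Sequence[float], next_obs: float) -> LatentAction:
    """LAM step: choose the code whose prototype best explains the transition."""
    delta = next_obs - obs_history[-1]
    code = min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - delta))
    return LatentAction(code)


def dynamics(state: float, action: LatentAction) -> float:
    """Action-conditioned dynamics: predict the next latent state."""
    return state + CODEBOOK[action.code]


def rollout(obs_history: Sequence[float], actions: Sequence[LatentAction]) -> List[float]:
    """Steer a future rollout by choosing latent actions."""
    state = encode_state(obs_history)
    trajectory = []
    for action in actions:
        state = dynamics(state, action)
        trajectory.append(state)
    return trajectory


# Usage: infer a code from an observed transition, then reuse codes to steer.
inferred = infer_latent_action([0.0, 1.0], 1.9)
future = rollout([0.0, 1.0], [inferred, LatentAction(0)])
```

Note that `infer_latent_action` sees the next observation while `rollout` does not: the code is recovered after the fact, then replayed as a control input. The optional adapter from the diagram is deliberately absent here, which is why the result is steerable only inside the learned code space.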

What The Wiki Currently Believes

  • LAMs are useful when important trajectories exist without trustworthy action labels, such as internet video, action-free robot videos, or historical traces where the real control channel was not logged.
  • A learned latent action can make a passive observation model controllable, but only inside the learned action space.
  • A LAM becomes operationally useful only when its latent actions can be aligned with real actions, control inputs, interventions, or safe human-facing commands.
  • When typed actions can be logged directly, the wiki should still prefer explicit action and outcome records over inferring them later.
  • A strong LAM is evidence that the data contract was missing a control channel, not evidence that the recovered channel has causal semantics by default.
  • For the broader robotics taxonomy around latent-space world models and video-to-action pipelines, use World Model for Robot Learning Survey as a map rather than as primary evidence for any one listed LAM system.
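
The alignment step in the list above can be illustrated with a small adapter. This is a sketch under assumptions, not a method from the source: it supposes a small labeled set of `(latent_code, typed_action)` pairs exists, uses a simple majority vote, and leaves a code unmapped when agreement falls below a hypothetical `min_agreement` threshold. The function name `fit_adapter` is also hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def fit_adapter(
    pairs: Iterable[Tuple[int, Hashable]],
    min_agreement: float = 0.8,
) -> Dict[int, Hashable]:
    """Map each latent code to a typed action by majority vote.

    Codes whose labeled examples disagree too much stay unmapped, so
    downstream code can refuse to act on them.
    """
    votes = defaultdict(Counter)
    for code, typed_action in pairs:
        votes[code][typed_action] += 1

    mapping = {}
    for code, counter in votes.items():
        typed_action, count = counter.most_common(1)[0]
        if count / sum(counter.values()) >= min_agreement:
            mapping[code] = typed_action
    return mapping


# Usage: code 0 is consistently "left"; code 1 is ambiguous and stays unmapped.
labeled = [(0, "left")] * 9 + [(0, "right")] + [(1, "up"), (1, "down")]
adapter = fit_adapter(labeled)
```

Leaving ambiguous codes unmapped, rather than forcing a best guess, keeps the adapter from assigning a typed action to a code that may only be a cluster of visual transitions.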

Relation To Genie

Genie introduces the main LAM pattern for this corpus. The paper trains a latent action model from pairs of observations, uses the learned discrete action code to condition video-token dynamics, and shows that the resulting interface can support controllable visual generation.

The most important transfer lesson is not “pixels are enough.” It is that observation-only trajectories can sometimes contain enough transition structure to recover a usable action abstraction. For Alex’s time-series agenda, that suggests a recovery path when action logs are missing, but also a diagnostic: if a model must invent latent actions to explain production telemetry, the system probably needs better action logging.

Boundary With Nearby Concepts

| Concept | Difference From LAM |
| --- | --- |
| Passive dynamics model | Predicts future observations from history without any controllable action-like code. |
| Action-conditioned world model | Conditions on known actions, control inputs, or interventions; a LAM tries to infer such a code when it is missing. |
| Action tokenizer | Compresses known action or control-input trajectories; a LAM discovers action-like codes from observation transitions. |
| Latent state model | Represents the system state; a LAM represents the transition choice or control-like factor between states. |
| Exogenous-variable model | Conditions on outside variables; a LAM should not silently absorb passive shocks and call them actions. |

Relation To Foundation TSFM Agenda

LAMs are adjacent to foundation time-series modeling rather than part of it, but they bear directly on the control and data-contract slots.

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and counterfactuals | adjacent | Genie shows an image/video model can recover a small latent action space from observation-only trajectories and use it for controllable rollout. | Needs alignment to typed actions, interventions, outcomes, failures, timing, and safety constraints. |
| Data contracts | warning | LAMs expose that action-free trajectories may still contain hidden control structure. | Operational systems should log the action channel directly whenever possible. |
| Representation quality | adjacent | Separating latent state from latent action clarifies what part of the representation explains state and what part explains transition choice. | Needs tests that the latent action tracks decision-relevant controls rather than visual nuisance factors or exogenous events. |
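
One possible form of the test called for in the representation-quality row is to compare how much information the latent codes share with real controls versus with exogenous events. The sketch below is an assumption-laden illustration: it presumes a small audit set where both control labels and exogenous-event labels are available, and it uses empirical mutual information as the score. The helper name `mutual_information` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information I(X; Y) in bits for paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in pxy.items()
    )


# Hypothetical audit data: codes follow the control, not the exogenous shock.
codes = [0, 0, 1, 1]
controls = ["a", "a", "b", "b"]
shocks = ["s", "t", "s", "t"]

mi_control = mutual_information(codes, controls)
mi_shock = mutual_information(codes, shocks)
```

A latent action that is doing its job should score high against controls and near zero against exogenous events; the reverse pattern suggests the code has absorbed passive shocks. Empirical mutual information is biased upward on small samples, so scores from a small audit set are a screening signal rather than a proof.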

Open Questions

  • How should latent actions be evaluated when no ground-truth action labels exist?
  • When does a latent action code recover a true control factor rather than a cluster of visual transitions?
  • How much labeled data is needed to map latent actions onto typed actions or control inputs?
  • Can a LAM help recover missing operator actions in observability or industrial time series, or would it mostly expose schema gaps?
  • Should LAMs be discrete, continuous, hierarchical, or tied to an action vocabulary?
  • What safeguards prevent a planner from treating inferred latent actions as causal interventions before they are validated?
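
As one illustration of the safeguard question, a planner interface can refuse raw latent actions at the type level. This is a sketch of one possible guard, not an answer from the source: `LatentAction`, `ValidatedAction`, `UnvalidatedActionError`, `to_validated`, and `plan_step` are all hypothetical names, and the validated mapping is assumed to come from some separate alignment and review process.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LatentAction:
    """Inferred code with no causal guarantee."""
    code: int


@dataclass(frozen=True)
class ValidatedAction:
    """Typed action whose link to a latent code has been reviewed."""
    name: str
    source_code: int


class UnvalidatedActionError(Exception):
    """Raised when a latent code has no validated typed counterpart."""


def to_validated(action: LatentAction, validated: Dict[int, str]) -> ValidatedAction:
    """Promote a latent action only if its code is in the validated mapping."""
    if action.code not in validated:
        raise UnvalidatedActionError(f"latent code {action.code} is not validated")
    return ValidatedAction(name=validated[action.code], source_code=action.code)


def plan_step(action: ValidatedAction) -> str:
    """Planner entry point: rejects anything that is not a ValidatedAction."""
    if not isinstance(action, ValidatedAction):
        raise TypeError("planner accepts only ValidatedAction instances")
    return f"execute:{action.name}"


# Usage: only code 0 has been validated in this hypothetical mapping.
validated_mapping = {0: "open_valve"}
step = plan_step(to_validated(LatentAction(0), validated_mapping))
```

The design choice is to make promotion an explicit, auditable step: an inferred code cannot reach the planner without passing through `to_validated`, so skipping validation raises an error instead of silently acting.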