Latent Action Models
Summary
A latent action model (LAM) is an action-discovery module: it infers a compact action-like code from raw observation transitions when explicit action labels are unavailable.
In this wiki, LAMs matter because they separate two problems that are often collapsed:
- learning future dynamics from observation sequences;
- recovering a control interface that can make those dynamics steerable.
Genie is the current anchor source. It learns discrete latent actions from unlabeled image/video trajectories, then trains a dynamics model that can generate future video tokens conditioned on those latent actions.
Core Contract
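The useful LAM contract is:
$$
a_t = \mathrm{LAM}(o_{\le t},\, o_{t+1}), \qquad z_t = \mathrm{Enc}(o_{\le t}), \qquad \hat{o}_{t+1} \sim p(o_{t+1} \mid z_t,\, a_t)
$$
where $o_t$ is an observation, $z_t$ is a latent state or token representation, and $a_t$ is an inferred latent action. The latent action is an action-like code that explains a transition; it is not automatically a logged action, typed control input, intervention, or executable command.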
flowchart LR
    Obs["observation history"]
    NextObs["next observation"]
    LAM["latent action model"]
    Code["latent action code"]
    State["latent state or tokens"]
    Dynamics["action-conditioned dynamics"]
    Rollout["future observation rollout"]
    Adapter["optional action adapter"]
    Typed["typed action or control input"]
    Obs --> LAM
    NextObs --> LAM
    LAM --> Code
    Obs --> State
    State --> Dynamics
    Code --> Dynamics --> Rollout
    Code --> Adapter --> Typed
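The data flow in the diagram can be illustrated with a minimal sketch in plain Python. It assumes a toy scalar observation, a hand-picked codebook of three transition prototypes, and nearest-prototype quantization standing in for a learned encoder. All names (`CODEBOOK`, `encode_state`, `infer_latent_action`, `step`, `rollout`) are hypothetical, and nothing here reproduces Genie's architecture. The point is only the interface: the LAM sees both sides of a transition, while the dynamics model sees state plus code.

```python
from typing import List, Sequence

# Hypothetical codebook: each latent action code is a prototype state change.
# A real LAM would learn these; here they are fixed by hand for illustration.
CODEBOOK: List[float] = [-1.0, 0.0, 1.0]


def encode_state(history: Sequence[float]) -> float:
    """Toy latent state: the most recent observation."""
    return history[-1]


def infer_latent_action(history: Sequence[float], next_obs: float) -> int:
    """LAM: map a transition (history, next observation) to a discrete code.

    Picks the codebook entry closest to the observed change. This needs the
    next observation, so the code is only available in hindsight.
    """
    delta = next_obs - encode_state(history)
    return min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - delta))


def step(state: float, code: int) -> float:
    """Action-conditioned dynamics: predict the next state from state + code."""
    return state + CODEBOOK[code]


def rollout(history: Sequence[float], codes: Sequence[int]) -> List[float]:
    """Steer a rollout by choosing latent codes instead of observing the future."""
    state = encode_state(history)
    trajectory = []
    for code in codes:
        state = step(state, code)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    observations = [0.0, 1.1, 0.9, 0.0]
    codes = [
        infer_latent_action(observations[: t + 1], observations[t + 1])
        for t in range(len(observations) - 1)
    ]
    print(codes)                  # inferred codes, one per transition
    print(rollout([0.0], codes))  # replay the same codes through the dynamics
```

The replayed rollout is a quantized version of the original trajectory, not a copy of it: the dynamics can only express what the codebook can express, which is the "only inside the learned action space" caveat in concrete form.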
What The Wiki Currently Believes
- LAMs are useful when important trajectories exist without trustworthy action labels, such as internet video, action-free robot videos, or historical traces where the real control channel was not logged.
- A learned latent action can make a passive observation model controllable, but only inside the learned action space.
- A LAM becomes operationally useful only when its latent actions can be aligned with real actions, control inputs, interventions, or safe human-facing commands.
- When typed actions can be logged directly, the wiki should still prefer explicit action and outcome records over inferring them later.
- A strong LAM is evidence that the data contract was missing a control channel, not evidence that the recovered channel has causal semantics by default.
- For the broader robotics taxonomy around latent-space world models and video-to-action pipelines, use World Model for Robot Learning Survey as a map rather than as primary evidence for any one listed LAM system.
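The alignment step mentioned in the list above can be sketched as a majority-vote adapter, assuming a small set of transitions where both the inferred latent code and the logged typed action are known. The function names and both thresholds (`min_count`, `min_purity`) are hypothetical choices for illustration, not values taken from any source.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def fit_action_adapter(
    labeled_pairs: Iterable[Tuple[int, str]],
    min_count: int = 3,
    min_purity: float = 0.8,
) -> Dict[int, str]:
    """Map latent codes to typed actions by majority vote.

    A code is mapped only if it was seen at least `min_count` times and its
    most common typed action accounts for at least `min_purity` of those.
    Both thresholds are illustrative, not recommended values.
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for code, typed_action in labeled_pairs:
        votes[code][typed_action] += 1

    adapter: Dict[int, str] = {}
    for code, counter in votes.items():
        total = sum(counter.values())
        action, top = counter.most_common(1)[0]
        if total >= min_count and top / total >= min_purity:
            adapter[code] = action
    return adapter


def to_typed_action(adapter: Dict[int, str], code: int) -> Optional[str]:
    """Return the typed action for a code, or None if the code is unaligned."""
    return adapter.get(code)


if __name__ == "__main__":
    pairs = (
        [(0, "left")] * 4                   # consistent: gets mapped
        + [(1, "right")] * 3 + [(1, "noop")]  # 75% purity: below threshold
        + [(2, "jump"), (2, "noop")]          # too few samples
    )
    adapter = fit_action_adapter(pairs)
    print(adapter)
    print(to_typed_action(adapter, 1))
```

Leaving ambiguous codes unmapped is deliberate in this sketch: a code that mixes several typed actions is more likely a cluster of similar-looking transitions than a recovered control.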
Relation To Genie
Genie introduces the main LAM pattern for this corpus. The paper trains a latent action model on observation transitions (the frame history together with the next frame), uses the learned discrete action code to condition video-token dynamics, and shows that the resulting interface supports controllable visual generation.
The most important transfer lesson is not “pixels are enough.” It is that observation-only trajectories can sometimes contain enough transition structure to recover a usable action abstraction. For Alex’s time-series agenda, that suggests a recovery path when action logs are missing, but also a diagnostic: if a model must invent latent actions to explain production telemetry, the system probably needs better action logging.
Boundary With Nearby Concepts
| Concept | Difference From LAM |
|---|---|
| Passive dynamics model | Predicts future observations from history without any controllable action-like code. |
| Action-conditioned world model | Conditions on known actions, control inputs, or interventions; a LAM tries to infer such a code when it is missing. |
| Action tokenizer | Compresses known action or control-input trajectories; a LAM discovers action-like codes from observation transitions. |
| Latent state model | Represents the system state; a LAM represents the transition choice or control-like factor between states. |
| Exogenous-variable model | Conditions on outside variables; a LAM should not silently absorb passive shocks and call them actions. |
Relation To Foundation TSFM Agenda
LAMs are only adjacent to foundation time-series modeling, but they matter for the control and data-contract slots.
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and counterfactuals | adjacent | Genie shows an image/video model can recover a small latent action space from observation-only trajectories and use it for controllable rollout. | Needs alignment to typed actions, interventions, outcomes, failures, timing, and safety constraints. |
| Data contracts | warning | LAMs expose that action-free trajectories may still contain hidden control structure. | Operational systems should log the action channel directly whenever possible. |
| Representation quality | adjacent | Separating latent state from latent action clarifies what part of the representation explains state and what part explains transition choice. | Needs tests that the latent action tracks decision-relevant controls rather than visual nuisance factors or exogenous events. |
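One possible form for the test named in the "Representation quality" row is to compare how much information the latent codes carry about logged controls versus known exogenous events. The sketch below uses the empirical (plug-in) mutual information between discrete sequences. It assumes that some held-out control labels and exogenous-event flags exist. The function name and the example data are hypothetical, and the plug-in estimate is biased upward on small samples, so it is a rough diagnostic rather than a validated metric.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


if __name__ == "__main__":
    # Made-up illustration data, not results from any system.
    codes = [0, 0, 1, 1, 0, 0, 1, 1]
    controls = ["a", "a", "b", "b", "a", "a", "b", "b"]  # logged typed actions
    shocks = [0, 1, 0, 1, 0, 1, 0, 1]                    # exogenous event flags

    print(mutual_information(codes, controls))  # high: codes track controls
    print(mutual_information(codes, shocks))    # near zero: codes ignore shocks
```

A latent action space that shares more information with the exogenous flags than with the controls would be a warning sign that passive shocks are being absorbed as "actions".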
Open Questions
- How should latent actions be evaluated when no ground-truth action labels exist?
- When does a latent action code recover a true control factor rather than a cluster of visual transitions?
- How much labeled data is needed to map latent actions onto typed actions or control inputs?
- Can a LAM help recover missing operator actions in observability or industrial time series, or would it mostly expose schema gaps?
- Should LAMs be discrete, continuous, hierarchical, or tied to an action vocabulary?
- What safeguards prevent a planner from treating inferred latent actions as causal interventions before they are validated?
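As one illustration of the last question, a planner could be forced to go through a registry that records whether each latent code has been aligned and validated. This is a sketch under stated assumptions, not an answer to the question: the `LatentActionRecord` type, the `validated` flag, and `resolve_for_execution` are all hypothetical, and what counts as validation is left open.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class LatentActionRecord:
    """Hypothetical registry entry describing one latent action code."""

    code: int
    typed_action: Optional[str] = None  # aligned typed action, if any
    validated: bool = False  # True only after an intervention check passed


class UnvalidatedActionError(RuntimeError):
    """Raised when a planner tries to execute an unvalidated latent action."""


def resolve_for_execution(
    registry: Dict[int, LatentActionRecord], code: int
) -> str:
    """Return the typed action for `code`, or refuse if it is not validated."""
    record = registry.get(code)
    if record is None or record.typed_action is None or not record.validated:
        raise UnvalidatedActionError(
            f"latent action {code} may be used for imagined rollouts only"
        )
    return record.typed_action


if __name__ == "__main__":
    registry = {
        0: LatentActionRecord(code=0, typed_action="open_valve", validated=True),
        1: LatentActionRecord(code=1, typed_action="close_valve"),
    }
    print(resolve_for_execution(registry, 0))
    try:
        resolve_for_execution(registry, 1)
    except UnvalidatedActionError as err:
        print(f"refused: {err}")
```

The design choice is that the default is refusal: a code with no record, no typed action, or no validation can still drive imagined rollouts but cannot reach an executable command.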