Latent Action Models

Summary

A latent action model (LAM) is an action-discovery module: it infers a compact action-like code from raw observation transitions when explicit action labels are unavailable.

In this wiki, LAMs matter because they separate two problems that are often collapsed:

learning future dynamics from observation sequences;
recovering a control interface that can make those dynamics steerable.

Genie is the current anchor source. It learns discrete latent actions from unlabeled image/video trajectories, then trains a dynamics model that can generate future video tokens conditioned on those latent actions. VLA-JEPA is the current robotics VLA variant: it learns latent-action tokens through a leakage-free latent state-prediction objective, then aligns them with robot control-input trajectories through a flow-matching action head. OTF-LAM adds the agent-ambiguity warning: observation-only transitions show mixed effects, so a monolithic latent action can absorb distractors, camera dynamics, and background motion unless the transition interface is structured.

Core Contract

The useful LAM contract is:

$$q_{\phi}(a_{t} \mid x_{\leq t}, x_{t+1})$$

$$p_{\theta}(z_{t+1} \mid z_{\leq t}, a_{t})$$

where $x_{t}$ is an observation, $z_{t}$ is a latent state or token representation, and $a_{t}$ is an inferred latent action. The latent action is an action-like code that explains a transition; it is not automatically a logged action, typed control input, intervention, or executable command.
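The two-part contract can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration, not Genie's or any other paper's implementation: the codebook is hand-written rather than learned, the inference step looks only at $x_{t}$ instead of the full history $x_{\leq t}$, and the latent state $z_{t}$ is taken to be the observation itself.

```python
from typing import List, Sequence

# Hypothetical codebook: each latent action index stands for a 2-D displacement.
# A real LAM would learn these codes; here they are fixed by hand.
CODEBOOK: List[Sequence[float]] = [
    (0.0, 0.0),   # 0: no change
    (1.0, 0.0),   # 1
    (-1.0, 0.0),  # 2
    (0.0, 1.0),   # 3
    (0.0, -1.0),  # 4
]


def infer_latent_action(x_t: Sequence[float], x_next: Sequence[float]) -> int:
    """Toy q(a_t | x_t, x_{t+1}): index of the code nearest to the observed change."""
    delta = [b - a for a, b in zip(x_t, x_next)]

    def sq_dist(code: Sequence[float]) -> float:
        return sum((d - c) ** 2 for d, c in zip(delta, code))

    return min(range(len(CODEBOOK)), key=lambda k: sq_dist(CODEBOOK[k]))


def predict_next(z_t: Sequence[float], a_t: int) -> List[float]:
    """Toy p(z_{t+1} | z_t, a_t): deterministic step that applies the chosen code."""
    return [z + c for z, c in zip(z_t, CODEBOOK[a_t])]


# Observation-only trajectory: no action labels are available.
trajectory = [[0.0, 0.0], [1.1, 0.0], [1.0, 0.9], [0.1, 1.0]]

# Step 1: recover a latent action code for every transition.
codes = [infer_latent_action(a, b) for a, b in zip(trajectory, trajectory[1:])]

# Step 2: replay the codes through the action-conditioned dynamics.
rollout = [trajectory[0]]
for a in codes:
    rollout.append(predict_next(rollout[-1], a))
```

The recovered indices are meaningful only relative to the codebook. Nothing in the sketch says that code 1 corresponds to a logged command, which is the gap the rest of this page is concerned with.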

flowchart LR
  Obs["observation history"]
  NextObs["next observation"]
  LAM["latent action model"]
  Code["latent action code"]
  State["latent state or tokens"]
  Dynamics["action-conditioned dynamics"]
  Rollout["future observation rollout"]
  Adapter["optional action adapter"]
  Typed["typed action or control input"]

  Obs --> LAM
  NextObs --> LAM
  LAM --> Code
  Obs --> State
  State --> Dynamics
  Code --> Dynamics --> Rollout
  Code --> Adapter --> Typed

What The Wiki Currently Believes

LAMs are useful when important trajectories exist without trustworthy action labels, such as internet video, action-free robot videos, or historical traces where the real control channel was not logged.
A learned latent action can make a passive observation model controllable, but only inside the learned action space.
A LAM becomes operationally useful only when its latent actions can be aligned with real actions, control inputs, interventions, or safe human-facing commands.
When typed actions can be logged directly, the wiki should still prefer explicit action and outcome records over inferring them later.
Sensorimotor World Models is the supervised-action contrast case: it uses the executed continuous action as the inverse-dynamics target, so it should be read as evidence for action-grounded representation learning, not latent action discovery.
A strong LAM is evidence that the data contract was missing a control channel, not evidence that the recovered channel has causal semantics by default.
For the broader robotics taxonomy around latent-space world models and video-to-action pipelines, use World Model for Robot Learning Survey as a map rather than as primary evidence for any one listed LAM system.
VLA-JEPA sharpens the leakage boundary: future observations can supervise latent actions through target-side embeddings, but should not enter the learner path as inputs when the learned code is meant to represent transition dynamics.
OTF-LAM sharpens the source-attribution boundary: when the agent, camera, distractors, and background all change together, the first recoverable object may be a factorized set of observed-transition effects rather than the true controlled action.
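The leakage boundary in the VLA-JEPA item above can be pictured as a data-path rule. The sketch below is not VLA-JEPA's implementation; `TrainingExample`, `build_example`, and `mean_pool_encoder` are hypothetical names, and the mean-pooling encoder is only a stand-in for whatever frozen target encoder a real system would use. It shows one way to keep the future frame on the supervision side only.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = Sequence[float]


@dataclass(frozen=True)
class TrainingExample:
    learner_inputs: List[List[float]]  # frames the learner path is allowed to read
    target_embedding: List[float]      # supervision only; never fed to the learner


def build_example(
    frames: Sequence[Frame],
    t: int,
    target_encoder: Callable[[Frame], Sequence[float]],
) -> TrainingExample:
    """Split a clip so that frame t+1 reaches the loss only as a target embedding."""
    if not 0 <= t < len(frames) - 1:
        raise ValueError("t must leave at least one future frame")
    history = [list(f) for f in frames[: t + 1]]      # x_{<=t}: learner side
    target = list(target_encoder(frames[t + 1]))      # x_{t+1}: target side
    return TrainingExample(history, target)


def mean_pool_encoder(frame: Frame) -> List[float]:
    """Hypothetical stand-in for a frozen target encoder."""
    return [sum(frame) / len(frame)]


frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]
example = build_example(frames, t=1, target_encoder=mean_pool_encoder)
```

Keeping the split inside one constructor makes the boundary auditable: any code path that hands `frames[t + 1]` to the learner has to bypass `build_example` to do so.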

Relation To Genie And VLA-JEPA

Genie introduces the main LAM pattern for this corpus. The paper trains a latent action model from pairs of observations, uses the learned discrete action code to condition video-token dynamics, and shows that the resulting interface can support controllable visual generation.

VLA-JEPA is not the same architecture as Genie, but it is useful evidence for the same contract in VLA policy pretraining. It treats latent actions as transition-conditioning tokens whose usefulness is tested through downstream robot success and robustness, not only generated-video controllability.

OTF-LAM is not a bigger Genie-style world model. It is a decomposition step for the ambiguous inverse problem: learn reusable observed-transition primitives first, then aggregate them into a state-aware latent action. That makes it useful when the model should not have to commit immediately to which observed change belongs to the agent.

The most important transfer lesson is not “pixels are enough.” It is that observation-only trajectories can sometimes contain enough transition structure to recover a usable action abstraction, but that abstraction can also be contaminated by non-agent transition sources. For Alex’s time-series agenda, that suggests a recovery path when action logs are missing, but also a diagnostic: if a model must invent latent actions to explain production telemetry, the system probably needs better action logging.

Boundary With Nearby Concepts

Concept	Difference From LAM
Passive dynamics model	Predicts future observations from history without any controllable action-like code.
Action-conditioned world model	Conditions on known actions, control inputs, or interventions; a LAM tries to infer such a code when it is missing.
Observed-transition factorizer	Decomposes what changed in the observation; a LAM still needs to decide which factors should become action-like.
Sensorimotor World Models	Complementary supervised-action case: inverse dynamics predicts the logged action/control input, so the learned transition signal is not a latent action model even though it uses an inverse-dynamics head.
Action tokenizer	Compresses known action or control-input trajectories; a LAM discovers action-like codes from observation transitions.
Latent state model	Represents the system state; a LAM represents the transition choice or control-like factor between states.
Exogenous-variable model	Conditions on outside variables; a LAM should not silently absorb passive shocks and call them actions.

Relation To Foundation TSFM Agenda

LAMs are adjacent to foundation time-series modeling rather than part of it, but they matter for the control and data-contract slots.

Agenda slot	Verdict	Evidence	Missing pieces
Control and counterfactuals	adjacent	Genie shows an image/video model can recover a small latent action space from observation-only trajectories and use it for controllable rollout. VLA-JEPA adds a VLA pretraining variant where latent-action tokens condition a latent world model and a flow-matching action head. OTF-LAM adds a factorized transition-effect bridge before action abstraction.	Needs alignment to typed actions, interventions, outcomes, failures, timing, and safety constraints.
Data contracts	warning	LAMs expose that action-free trajectories may still contain hidden control structure. OTF-LAM shows those trajectories can also contain mixed effects from distractors, camera, and background dynamics.	Operational systems should log the action channel directly whenever possible.
Representation quality	adjacent	Separating latent state from latent action clarifies what part of the representation explains state and what part explains transition choice. OTF-LAM further separates local observed-transition effects before action aggregation.	Needs tests that the latent action tracks decision-relevant controls rather than visual nuisance factors or exogenous events.

Open Questions

How should latent actions be evaluated when no ground-truth action labels exist?
When does a latent action code recover a true control factor rather than a cluster of visual transitions?
How much labeled data is needed to map latent actions onto typed actions or control inputs? OTF-LAM reports a narrow bridge using 32 action-labeled DCS trajectories for the latent-to-action mapping, but richer game, robot, GUI, and telemetry settings remain open.
Can a LAM help recover missing operator actions in observability or industrial time series, or would it mostly expose schema gaps?
Should LAMs be discrete, continuous, hierarchical, or tied to an action vocabulary?
What safeguards prevent a planner from treating inferred latent actions as causal interventions before they are validated?
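The mapping and safeguard questions above can be made concrete with a small sketch. It assumes discrete latent codes and a handful of (code, typed action) pairs from labeled transitions; `fit_code_to_action`, `translate`, and the `min_support` threshold are hypothetical and are not taken from OTF-LAM or any other cited system. The mapping is a majority vote per code, and codes without enough labeled support stay unmapped, so a downstream planner has to abstain rather than treat them as validated actions.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Optional, Tuple


def fit_code_to_action(
    pairs: Iterable[Tuple[int, Hashable]],
    min_support: int = 2,
) -> Dict[int, Hashable]:
    """Majority-vote map from latent code to typed action; weakly supported codes are dropped."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for code, action in pairs:
        votes[code][action] += 1

    mapping: Dict[int, Hashable] = {}
    for code, counter in votes.items():
        action, count = counter.most_common(1)[0]
        if count >= min_support:  # require some evidence before trusting a code
            mapping[code] = action
    return mapping


def translate(mapping: Dict[int, Hashable], code: int) -> Optional[Hashable]:
    """Return the typed action for a code, or None when the code is unvalidated."""
    return mapping.get(code)


# Hypothetical labeled pairs: (latent code, logged typed action).
labeled = [
    (0, "noop"),
    (1, "right"), (1, "right"), (1, "up"),
    (2, "left"), (2, "left"),
    (3, "up"),
]
mapping = fit_code_to_action(labeled)
```

With these pairs, codes 0 and 3 have a single labeled example each and remain unmapped, so `translate` returns `None` for them. How large `min_support` needs to be in richer settings is exactly the open question about labeled-data requirements.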
