---
author:
- 'Gemini Robotics Team, Google DeepMind[^1]'
bibliography:
- references.bib
title: '`\geminirobotics{}`{=latex} `\grlatestversion{}`{=latex}: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer'
---

```{=latex}
\renewcommand\paragraph{\@startsection{paragraph}{4}{\z@}%
            {-2.5ex\@plus -1ex \@minus -.25ex}%
            {1.25ex \@plus .25ex}%
            {\itshape\normalsize\bfseries}}
```
```{=latex}
\renewcommand\subparagraph{%
 \@startsection {subparagraph}{5}{\z@ }{3.25ex \@plus 1ex
 \@minus .2ex}{-1em}{\normalfont \normalsize \bfseries }}
```
```{=latex}
\newcommand{\grlatestversion}{1.5}
```
```{=latex}
\newcommand{\grlegacyversion}{1.0}
```
```{=latex}
\newcommand{\geminiroboticscommon}{Gemini Robotics}
```
```{=latex}
\newcommand{\grlatest}{\geminiroboticscommon\ \grlatestversion}
```
```{=latex}
\newcommand{\grlatestVLA}{\geminiroboticscommon\ \grlatestversion}
```
```{=latex}
\newcommand{\grlegacy}{\geminiroboticscommon}
```
```{=latex}
\newcommand{\grshort}{GR}
```
```{=latex}
\newcommand{\grshortlatest}{\grshort\ \grlatestversion}
```
```{=latex}
\newcommand{\grshortlatestVLA}{\grshort\ \grlatestversion}
```
```{=latex}
\newcommand{\grshortlegacy}{\grlegacy}
```
```{=latex}
\newcommand{\grshortlatestER}{\grshort-ER \grlatestversion}
```
```{=latex}
\newcommand{\grshortlegacyER}{\grshort-ER}
```
```{=latex}
\newcommand{\grlatestER}{\geminiroboticscommon-ER \grlatestversion}
```
```{=latex}
\newcommand{\grlegacyER}{\geminiroboticscommon-ER}
```
```{=latex}
\newcommand{\grod}{Gemini Robotics On-Device}
```
```{=latex}
\newcommand{\grodshort}{GRoD}
```
```{=latex}
\newcommand{\geminirobotics}{\geminiroboticscommon}
```
```{=latex}
\newcommand{\geminiroboticsAction}{\geminiroboticscommon}
```
```{=latex}
\newcommand{\geminiroboticsER}{\geminiroboticscommon-ER}
```
```{=latex}
\let\cite\citep
```
```{=latex}
\renewcommand{\today}{}
```
```{=latex}
\maketitle
```
Introduction {#sec:intro}
============

Truly general robots will require a deep understanding of the physical world. Our previous work, `\grlegacy{}`{=latex} [@team2025gemini], established a strong foundation by leveraging Gemini's rich world knowledge to create a Vision-Language-Action (VLA) model that exhibits impressive interactivity, generality, and dexterity in direct robot control. We now introduce the **`\grlatest{}`{=latex}** (`\grshortlatest{}`{=latex}) family of robot foundation models, built on the latest generation of Gemini [@comanici2025gemini]. The new model family significantly enhances the capabilities of `\grlegacy{}`{=latex} and brings Gemini's advanced thinking and agentic paradigm to the physical world. It includes `\grlatest{}`{=latex}, a multi-embodiment VLA model [@zitkovich2023rt; @intelligence2025pi05; @bjorck2025gr00t; @wen2025dexvla] with strong reasoning and generalization, and `\grlatestER{}`{=latex}, a generalist Vision-Language Model (VLM) that achieves a new state-of-the-art across embodied reasoning benchmarks. We combine these two models into an agentic system that enables robots to solve complex problems by orchestrating user dialogue, high-level reasoning and planning, agentic tool use and low-level action.

`\noindent `{=latex}**`\grlatestVLA{}`{=latex}** advances the frontier of Vision-Language-Action (VLA) pre-training by integrating two core breakthroughs. Firstly, a novel architecture and a Motion Transfer (MT) mechanism enable the model to learn from diverse robot data sources, forming a unified understanding of motion and physics. This multi-embodiment pre-training allows `\grshortlatestVLA{}`{=latex} to control multiple robots, including the ALOHA, Bi-arm Franka, and Apollo humanoid robots, without any robot-specific post-training, and it also enables zero-shot skill transfer from one robot to another. Secondly, `\grshortlatestVLA{}`{=latex} is a Thinking VLA that can explicitly reason about its actions, interleaving a stream of thoughts with physical movements. This allows the model to convert visual observations into language-based thoughts, simplify complex instructions, detect task success or failure, propose recovery behaviors, and make the robot's actions more interpretable to human users.

`\noindent `{=latex}**`\grlatestER{}`{=latex}** (`\grshortlatestER{}`{=latex}) advances the state-of-the-art for embodied reasoning [@li2023interactive; @chen2024spatialvlm; @chen2024automating; @zhi2025closed], i.e., the visuo-spatial-temporal understanding of the physical world that is required for robotic applications. Building upon Gemini's state-of-the-art thinking and multimodal capabilities, `\grshortlatestER{}`{=latex} significantly outperforms other frontier models across a broad suite of embodied intelligence benchmarks, while retaining the general capabilities of a frontier model and being considerably faster. `\grshortlatestER{}`{=latex}'s physical understanding combines naturally with Gemini's ability to use tools, communicate using modalities like video and audio, and write code, opening up a broad spectrum of potential applications.

`\noindent `{=latex}To achieve truly general-purpose robot agents, we combine our models in an agentic framework (Figure `\ref{fig:gr1.5_overview}`{=latex}). This framework is key to unlocking new capabilities: it handles long-horizon task execution via complex planning and adaptive orchestration, facilitates multimodal interaction, enables robots to leverage user-specified tools (e.g. web search) to solve problems and complete tasks, and implements a multi-layered safety mechanism through explicit reasoning about safety violations.

Method Overview {#sec:overview}
===============

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=\textwidth]{src/assets/gr_1.5_overview_v6.pdf}
    \caption{The Gemini Robotics 1.5 family of models consists of Gemini Robotics 1.5, a VLA, and Gemini Robotics 1.5-ER, a VLM with state-of-the-art embodied reasoning capabilities. They can be combined together to form a powerful agentic framework.}
    \label{fig:gr1.5_overview}
\end{figure*}
```
Model & Architecture
--------------------

**`\grlatest{}`{=latex} model family**. Both `\grlatest{}`{=latex} and `\grlatestER{}`{=latex} inherit Gemini's multimodal world knowledge. `\grlatestER{}`{=latex} (`\grshortlatestER{}`{=latex} for short), our VLM, fully retains Gemini's capabilities including advanced reasoning, tool use, and more. It has additionally been optimized for complex embodied reasoning problems such as task planning, reasoning for spatial expertise, and task progress estimation. `\grshortlatestER{}`{=latex} significantly extends and improves upon GR-ER's embodied reasoning capabilities. `\grlatest{}`{=latex} (`\grshortlatest{}`{=latex} for short), our VLA model, translates mid- and short-horizon instructions into robot actions. It understands open-vocabulary natural language instructions, can perform reasoning steps before emitting an action, and it can natively control multiple robots with different embodiments. As such, `\grshortlatest{}`{=latex} significantly extends the previous Gemini Robotics' capabilities.

`\smallskip `{=latex}`\noindent `{=latex}**Agentic System Architecture.** The full agentic system consists of an orchestrator and an action model that are implemented by the VLM and the VLA, respectively:

-   **Orchestrator**: The orchestrator processes user input and environmental feedback and controls the overall task flow. It breaks complex tasks into simpler steps that can be executed by the VLA, and it performs success detection to decide when to switch to the next step. To accomplish a user-specified task, it can leverage digital tools to access external information or perform additional reasoning steps. We use `\grshortlatestER{}`{=latex} as the orchestrator.

-   **Action model**: The action model translates instructions issued by the orchestrator into low-level robot actions. It is made available to the orchestrator as a specialized tool and receives instructions via open-vocabulary natural language. The action model is implemented by the `\grshortlatest{}`{=latex} model.

`\smallskip `{=latex}`\noindent `{=latex}**Embodied thinking**. A core innovation of `\grlatest{}`{=latex} is Embodied Thinking: the ability to reason---or \`\`think"---before taking action [@lee2025molmoact; @lin2025onetwovla; @huang2025thinkact; @Zawalski24-ecot], which operates across both the VLM and the VLA models. The `\grshortlatestER{}`{=latex} model combines Gemini's thinking and tool-use with an enhanced physical world understanding, enabling it to function within the agentic system for high-level planning. This includes breaking complex tasks into coarse-grained plans, adaptively updating those plans based on execution, or calling external tools like web search. We also introduce an analogous thinking capability to the VLA, creating the Thinking VLA, or `\grshortlatest{}`{=latex} (Thinking on) in our plots and results. Our Thinking VLA explicitly reasons about the instruction and its perception, generates thinking traces in natural language [@smith2025steer; @belkhale2024rth], and appends them to the context window before emitting an action. This process simplifies complex instructions into sequences of primitive skills, increases the transparency in human-robot interactions, and offers a new paradigm for scaling VLA capabilities.

`\smallskip `{=latex}`\noindent `{=latex}To understand how embodied thinking and the agentic system architecture may interact, let us consider a user instruction such as \`\`Pack the suitcase for a trip to London". The orchestrator (`\grshortlatestER{}`{=latex}) accesses a travel itinerary and a recent weather forecast with user permission to decide which clothes are appropriate to pack. It then produces a high-level plan consisting of instructions such as \`\`pack the rain jacket into the luggage" that it communicates to the action model. The action model then decomposes each such instruction into shorter segments that correspond to a few seconds of robot movement each (e.g., \`\`pick up the rain jacket from the wardrobe"). These are executed directly, or they are further translated into an inner monologue of primitive motions such as \`\`move the gripper to the left" or \`\`close the gripper", thus leveraging an explicit understanding of the geometry of the scene to solve the task. Overall, our models' ability to perform embodied thinking dramatically improves their ability to handle multi-step tasks by allowing the models to compose skills in a structured, deliberate manner, ultimately leading to more robust and reliable robot performance.

`\smallskip `{=latex}`\noindent `{=latex}**Motion Transfer**. In `\grlatest{}`{=latex}, we also introduce a new model architecture and training recipe for the VLA. These enable the model to learn from different robots and data sources, to form a unified understanding of motions and the effects of physical interactions, enabling skills to transfer across very different robot embodiments. We refer to the new training recipe as Motion Transfer (MT) in our results.

Robot Data
----------

The dataset used for training the `\grlatest{}`{=latex} contains both multi-embodiment robot data collected on ALOHA [@aloha2], Bi-arm Franka [@fr3] and Apollo humanoid  [@apollo], as well as publicly available text, image and video datasets on the Internet. The robot data consists of thousands of diverse tasks across these platforms covering a broad range of manipulation skills across a multitude of scenes.

Evaluation
----------

For all comparisons reported in this report, we perform A/B/n testing on real robots. This means that we test all models involved in a particular comparison in an interleaved manner on the same robot work cell, thereby reducing variance in the evaluations that might otherwise arise from variations across robots and environmental conditions.

`\smallskip `{=latex}`\noindent `{=latex}The development of `\grshortlatest{}`{=latex} requires comparisons of a large number of architecture variations, algorithm hyperparameters and other settings across multiple embodiments and tasks. To improve research iteration speed, we have developed methods for evaluation without real robots in the loop.

`\smallskip `{=latex}`\noindent `{=latex}We use the open-source MuJoCo simulator [@TodorovET12] to generate evaluation scenes for the robot embodiments in this report. By carefully aligning the visual and physical parameters of simulated and real scenes, we are able to achieve a strong rank consistency between evaluations in simulation and on the real robot (see `\cref{fig:sim_to_real_correlation}`{=latex} in the `\cref{sec:sim-to-real-appendix}`{=latex}).

`\smallskip `{=latex}`\noindent `{=latex}This has allowed us to massively scale up the breadth of our evaluations to new objects, scenes, and environments, and to rapidly iterate on architectural and algorithmic improvements. Over 90% of the evaluation episodes during the development of `\grlatest{}`{=latex} were conducted in simulation. Although real-world evaluation is still required to determine model quality, evaluation in simulation dramatically reduces the volume of tests on real hardware.

`\grlatest{}`{=latex} is a general multi-embodiment Vision-Language-Action Model {#sec:results-actions}
================================================================================

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=1.0\textwidth]{src/assets/actions/gr1.5_example_tasks_grid2_optimized.pdf}
    \caption{\grshortlatest{} can control three different robots with the same checkpoint to accomplish a variety of tasks out-of-the-box.}
    \label{fig:robot_rollout_examples}
\end{figure*}
```
`\grshortlatest{}`{=latex} can control robots with dramatically different form factors to complete a large variety of tasks out-of-the-box, without the need for post-training to specialize the model to a particular embodiment or task. `\cref{fig:robot_rollout_examples}`{=latex} shows example tasks on ALOHA, Bi-arm Franka and Apollo humanoid robots.

`\smallskip `{=latex}`\noindent `{=latex}In this section, we present a comprehensive evaluation of `\grshortlatest{}`{=latex} and a comparison with our previous models, `\grlegacy{}`{=latex} and `\grod{}`{=latex}. Our experiments are designed to answer the following questions:

1.  How does `\grshortlatest{}`{=latex} perform and generalize on short-horizon tasks?

2.  Does `\grshortlatest{}`{=latex} effectively learn from and transfer knowledge across different embodiments?

3.  How does the thinking process contribute to multi-step tasks?

`\smallskip `{=latex}`\noindent `{=latex}We develop a benchmark that follows the design philosophy that was used to evaluate `\grlegacy{}`{=latex} [@team2025gemini]. We extend it to cover all our embodiments, adding more challenging and multi-step tasks, as well as tasks that test cross-embodiment transfer and thinking. The full benchmark includes 230 tasks in total.

`\smallskip `{=latex}`\noindent `{=latex}We generally report mean and standard error of the mean of *progress score* (definitions in `\cref{appendix:generalization-tasks}`{=latex} - `\cref{appendix:medium-horizon-benchmark}`{=latex}), as it provides a continuous and finer-grained measure of model performance and, as such, is especially useful for complex multi-step tasks. For completeness, we also include the corresponding plots of success rate in the `\cref{appendix:additional-exps}`{=latex}.

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively. `\label{fig:generalization}`{=latex}](src/assets/actions/aloha_baselines_progress.png "fig:"){width="\\textwidth"} `\label{fig:aloha-progress-generalization}`{=latex}

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively. `\label{fig:generalization}`{=latex}](src/assets/actions/omega_baselines_progress.png "fig:"){width="\\textwidth"} `\label{fig:omega-progress--generalization}`{=latex}

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively. `\label{fig:generalization}`{=latex}](src/assets/actions/atari_baselines_progress.png "fig:"){width="\\textwidth"} `\label{fig:atari-progress--generalization}`{=latex}

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on ALOHA, Bi-arm Franka and Apollo Humanoid. `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation}`{=latex}](src/assets/actions/aloha_positive_transfer_progress.png){width="\\textwidth"}

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on ALOHA, Bi-arm Franka and Apollo Humanoid. `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation}`{=latex}](src/assets/actions/omega_positive_transfer_progress.png){width="\\textwidth"}

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on ALOHA, Bi-arm Franka and Apollo Humanoid. `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation}`{=latex}](src/assets/actions/atari_positive_transfer_progress.png){width="\\textwidth"}

`\grlatest{}`{=latex} can generalize to new environments and tasks {#sec:generalization}
------------------------------------------------------------------

To understand `\grshortlatest{}`{=latex}'s generalization performance on short-horizon tasks, we use the same methodology as in [@team2025gemini] and consider multiple axes of variation:

-   `\smallskip `{=latex}`\noindent {\bf Visual Generalization:}`{=latex} robustness to visual variations such as changes in background, lighting, distractor objects, or textures.

-   `\smallskip `{=latex}`\noindent {\bf Instruction Generalization:}`{=latex} ability to understand the intent behind natural language instructions, including handling paraphrasing, typos, different languages, and varying levels of specificity.

-   `\smallskip `{=latex}`\noindent {\bf Action Generalization:}`{=latex} ability to adapt learned movements or synthesize novel ones, for example, in order to handle new initial conditions or object instances.

-   `\smallskip `{=latex}`\noindent {\bf Task Generalization:}`{=latex} ability to successfully execute a new task in a new environment. This is the most comprehensive form of generalization as it simultaneously requires robustness to visual changes, understanding of open-vocabulary instructions, and the ability to adapt learned motions to new tasks.

`\smallskip `{=latex}`\noindent `{=latex}We first analyze how `\grshortlatest{}`{=latex} compares against `\grlegacy{}`{=latex} and `\grod{}`{=latex} (`\grodshort{}`{=latex} for short). As shown in the top plot of `\cref{fig:generalization}`{=latex}, for the ALOHA robot, `\grshortlatest{}`{=latex} consistently outperforms these two baselines across all four categories. In particular, `\grshortlatest{}`{=latex} achieves substantial gains in instruction, action, and task generalization. For the Bi-arm Franka robot and the Apollo humanoid robot, we compare `\grshortlatest{}`{=latex} against our `\grodshort{}`{=latex} models[^2] (middle and bottom plots of `\cref{fig:generalization}`{=latex}). On these two platforms, `\grshortlatest{}`{=latex} significantly outperforms `\grodshort{}`{=latex} across all categories. Note that this is not an apples-to-apples comparison because the `\grodshort{}`{=latex} checkpoints were trained with less data due to their earlier release date. Additionally, `\grodshort{}`{=latex} is not a multi-embodiment model: each embodiment requires a different checkpoint. Nevertheless, we include these results to illustrate the dramatic performance improvement across different versions of `\grlegacy{}`{=latex}.

`\smallskip `{=latex}`\noindent `{=latex}We perform an ablation study to pinpoint the source of this significant improvement in generalization. We establish two ablation baselines: training with data from a single embodiment versus training with data from all embodiments, both excluding our Motion Transfer (MT) mechanism. As illustrated in `\cref{fig:data-training-ablation}`{=latex}, while including data from other embodiments generally boosts performance, our MT training recipe clearly amplifies the positive effect of this additional data. This study confirms both the ability of `\grshortlatest{}`{=latex} to leverage multi-embodiment data, and the critical role of the MT mechanism in achieving greater positive transfer of skills among different robots.

Learning across different robot embodiments {#sec:positive_transfer}
-------------------------------------------

```{=latex}
\begin{figure*}[!t]\centering
    \includegraphics[width=\textwidth]{src/assets/actions/gr1.5-x-emb-transfer.pdf}
    \caption{Cross embodiment benchmark. Left: Our model shows zero-shot skill transfer on tasks only seen by another robot embodiment. Right: Example tasks trained on the first embodiment and evaluated on the second.}
    \label{fig:cross_data_source_transfer}
\end{figure*}
```
`\grlatest{}`{=latex} is able to learn and transfer skills across different robot embodiments, leading to both better generalization and more data-efficient learning. Although prior work [@10611477] has shown benefits of training VLAs with diverse data from multiple types of robots, there have been few demonstrations of zero-shot transfer of skills from one robot embodiment to another. In our experiments, we find evidence that `\grshortlatest{}`{=latex}'s multi-embodiment co-training and MT paradigm enables such transfer. As shown in `\cref{fig:cross_data_source_transfer}`{=latex}, the ALOHA robot is able to perform tasks for which training data was only collected on the Bi-arm Franka platform, and vice versa. The same applies to the humanoid, which can perform skills that are only available in the data from other robots (ALOHA in this example), despite being significantly more difficult to control and a wider cross-embodiment gap.

`\smallskip `{=latex}`\noindent `{=latex}To corroborate our observations, we measure cross-embodiment transfer quantitatively with a cross-embodiment benchmark, defined across the three robots included in our study. For each embodiment, we test our model and the baselines on tasks for which data had been collected only on another robot. More details of the benchmark are in Appendix `\ref{appendix:cross-embodiment-benchmark}`{=latex}.

`\smallskip `{=latex}`\noindent `{=latex}Plots in `\cref{fig:cross_data_source_transfer}`{=latex} show how any model trained on single-embodiment data (`\grshortlegacy{}`{=latex}, `\grodshort{}`{=latex} or `\grshortlatest{}`{=latex} trained on a single embodiment) performs poorly on this benchmark, while with cross-embodiment data and MT training recipe we achieve significantly better performance. The efficacy of leveraging cross-embodiment data with Motion Transfer (MT) depends on the initial quantity of data available for a given robotic platform. For the ALOHA platform, which already possesses a large dataset, merely introducing data from other embodiments appears to be less effective; however, MT amplifies the positive transfer from this data by aligning the different embodiments and extracting commonalities, thereby aiding the learning process. Conversely, for Bi-arm Franka with a moderate amount of data, adding cross-embodiment data is beneficial, and MT successfully facilitates this by aligning and extracting shared knowledge. For humanoid robots, where data is scarce, the addition of external embodiment data provides the greatest performance boost; yet, the effect of MT is less pronounced here, suggesting that the technique's alignment capabilities may be less effective when the embodiment gap is substantially larger, such as between the highly dissimilar humanoid and other robot forms.

`\smallskip `{=latex}`\noindent `{=latex}In `\cref{fig:cross_data_source_transfer}`{=latex} we report success rate in addition to progress score in order to highlight that zero-shot transfer with `\grlatest{}`{=latex} leads to successful task execution, not just partial progress towards completion of the task.

Thinking Helps Acting {#sec:thinking}
---------------------

```{=latex}
\centering
```
![Task progress in the multi-step benchmark with and without enabling thinking during inference.](src/assets/actions/thinking_progress.png){#fig:medium-level-thinking width="45%"}

`\centering `{=latex} ![From left to right, top to bottom: an example rollout of the Thinking VLA: the Apollo humanoid packing objects into a white bag. The thinking trace are overlaid on each snapshot. The Thinking VLA is able to think about its actions at different levels, allowing it to accomplish tasks requiring semantic reasoning and multiple steps of execution.](src/assets/actions/thinking4.jpg "fig:"){#fig:thinking-traces-visualization width="\\textwidth"}

`\smallskip `{=latex}`\noindent `{=latex}In this section, we focus on the Thinking VLA model (`\grshortlatest{}`{=latex} with thinking mode ON during inference). A detailed analysis of the higher-level reasoning enabled by `\grshortlatestER{}`{=latex} is deferred to Section `\ref{sec:results-agentic}`{=latex}. The advantage of interleaving robot actions with explicit thinking steps is particularly evident in the context of longer multi-step tasks, such as \"sorting clothes by colors\" (see Appendix `\ref{appendix:medium-horizon-benchmark}`{=latex} for our multi-step benchmark).

`\smallskip `{=latex}`\noindent `{=latex}`\cref{fig:medium-level-thinking}`{=latex} demonstrates that enabling the thinking mode yields a sizable improvement in the progress score for these tasks. This performance gain stems from the model's ability to decompose the difficult cross-modal translation, which involves mapping high-level, multi-step language instructions to low-level robot actions, into two simpler stages. First, the model generates a language-based thinking trace by converting the complex task into a sequence of specific, short-horizon steps (e.g., transforming the goal of \`\`sorting clothes" into a thought like, \`\`move gripper to the left so that it is closer to the clothes"). Second, the model maps these low-level language commands directly to robot actions. This two-step decomposition proves more robust than a single, end-to-end translation because the first step leverages the powerful visual-linguistic capabilities of the VLM backbone, while the second involves learning a simpler action mapping.

`\smallskip `{=latex}`\noindent `{=latex}Beyond quantitative performance gains, our experiments provide qualitative evidence of several additional benefits of the Thinking VLA. Firstly, it significantly improves interpretability. By visualizing the robot's internal thinking traces, we can inspect its planned actions and predict its next steps. This transparency enhances both human-robot trust and the safety of robot operations. Secondly, the Thinking VLA exhibits a degree of situational awareness regarding task completion. For instance, as shown in `\cref{fig:thinking-traces-visualization}`{=latex}, the robot automatically switches its objective from \`\`pick up the yellow tennis ball" to \`\`put the yellow tennis ball in the white bag" once the ball has been successfully grasped. This demonstrates that the model possesses an implicit awareness of the success of the prior subtask, removing the need for an explicit success detector. Thirdly, the Thinking VLA enables sophisticated recovery behaviors. For example, in `\cref{fig:thinking-traces-visualization}`{=latex}, when the water bottle slips from the right hand and lands near the left hand, the next thinking trace immediately becomes \`\`pick up the water bottle with the left hand", effectively initiating a self-correcting recovery behavior.

`\grlatestER{}`{=latex} is a generalist embodied reasoning model {#sec:results-er}
================================================================

```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=0.7\textwidth]{src/assets/ER/generality_vs_embodied_reasoning.pdf}
    \caption{The \grlatestER{} model is our most advanced model for embodied reasoning while retaining strong performance as a general-purpose multimodal foundation model. We measure Embodied Reasoning performance on a mix of academic benchmarks covering text-based image understanding as well as spatial signal prediction, and measure Generality performance on MMMU, GPQA, and Aider Polyglot.}
    \label{fig:gr1.5_er_generality}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]\centering
    \includegraphics[width=\textwidth]{src/assets/ER/er_capability_all_v3.pdf}
    \caption{The \grlatestER{} model has a diverse set of capabilities that can be applied on images and videos that are useful for robotics.}
    \label{fig:gr1.5_er_capability_all}
\end{figure*}
```
Robots require advanced and grounded knowledge of the physical world, ranging from precise spatial and temporal reasoning to a deep grasp of intuitive physics, causality, and affordances. We refer to this type of real-world understanding as *embodied reasoning* (ER). We introduce `\grlatestER{}`{=latex} (`\grshortlatestER{}`{=latex}), our most advanced multimodal thinking model for state-of-the-art embodied reasoning based on Gemini. When combined with a general VLA, such as `\grshortlatest{}`{=latex} showcased in `\cref{sec:results-actions}`{=latex}, `\grshortlatestER{}`{=latex} provides high-level intelligence to form the backbone of a general agentic robot system, which we describe in `\cref{sec:results-agentic}`{=latex}.

`\smallskip `{=latex}`\noindent `{=latex}In this section, we will focus on embodied reasoning capabilities and highlight several key properties of `\grshortlatestER{}`{=latex}:

1.  Strong embodied reasoning performance while retaining the generality of a frontier model;

2.  Excels in key robotic capabilities, such as complex pointing, progress understanding, and real-world use cases;

3.  Able to scale embodied reasoning performance via inference time compute.

Generality {#sec:results-er-generality}
----------

Notably, `\grlatestER{}`{=latex} is a *generalist* embodied reasoning model: it exhibits the broad capabilities of a frontier model across many domains while also showcasing exceptional performance as a spatial expert for real-world understanding. This is visualized in `\cref{fig:gr1.5_er_generality}`{=latex} which shows this trade-off for several contemporary frontier models. To assess models quantitatively in terms of both their broad capabilities as well as their more specialized embodied reasoning performance, we evaluate them on two sets of benchmarks. Firstly, we measure the models' performance across a collection of 15 widely-used academic benchmarks designed to measure embodied reasoning capabilities such as text-based image understanding (e.g., BLINK [@fu2024blink], CV-Bench [@tong2024cambrian1], and ERQA [@team2025gemini]) and spatial reasoning (e.g. RoboSpatial [@song2025robospatial], PointArena [@cheng2025pointarena], Where2Place [@yuan2024robopoint], and RefSpatial [@zhou2025roborefer]). The embodied reasoning score is a weighted average of 50% spatial reasoning benchmarks and 50% question answering benchmarks (both image and video). Secondly, we measure the generalist performance on an equally weighted mix of academic benchmarks that assess a broader range of capabilities including image understanding, science, and coding via the MMMU [@mmmu], GPQA [@rein2024gpqa], and Aider Polyglot [@Gauthier2024aider] benchmarks. The full evaluation details and results are discussed in Appendix `\ref{appendix-pre-er}`{=latex}. `\cref{fig:gr1.5_er_generality}`{=latex} shows that `\grlatestER{}`{=latex} expands the Pareto frontier of generality and embodied reasoning, achieving state-of-the-art embodied reasoning performance with comparable generality to other models in its model class.

```{=latex}
\begin{figure*}[ht!]\centering
    \includegraphics[width=0.9\textwidth]{src/assets/ER/pointing_capabilities_comparison.pdf}
    \caption{Performance on a mix of 5 academic benchmarks for 2D pointing and point-based reasoning; accuracy is defined as the percentage of point predictions within the ground truth mask (pointing) or correct final count (point-to-count). The categories describe different types of point prediction and are used to aggregate evaluation results from Point-Bench, RefSpatial, RoboSpatial, Where2Place, and PixMo Count. Results for GPT-5 and GPT-5-mini obtained via API calls in September 2025.}
    \label{fig:gr1.5_er_pointing_benchmark}
\end{figure*}
```
```{=latex}
\begin{figure*}[ht]\centering
    \includegraphics[width=0.9\textwidth]{src/assets/ER/pointing_collage.jpg}
    \caption{Complex pointing examples from \grshortlatestER{}. The model can follow complex pointing prompts that require reasoning about physical, spatial, and semantic constraints: It can localize precise parts of objects, such as the rim of a bowl and sockets of a power strip (Row 2, Columns 1 and 3) and predict points that respect physical, spatial, and semantic constraints, e.g., corresponding to objects that are lighter than $10$ pounds (Row 1, Column 3), and matching similar items (Row 2, Column 2). \grshortlatestER{} can also sequence points into trajectories that respect physics (Row 1, Column 2) and avoid collisions (Row 1, Column 1).}
    \label{fig:gr1.5_er_pointing_collage}
\end{figure*}
```
Frontier capabilities for Embodied Reasoning
--------------------------------------------

`\grshortlatestER{}`{=latex} showcases advanced performance across a number of embodied reasoning capabilities which are highly relevant for understanding the physical world, particularly in robotic applications. We visualize some of these in `\cref{fig:gr1.5_er_capability_all}`{=latex} and analyze a few of these areas in detail.

`\smallskip `{=latex}`\noindent `{=latex}**Complex Pointing:** A point is a flexible and lightweight representation that grounds a model's semantic understanding onto visual inputs. Using very few tokens, a point can precisely locate an abstract concept, such as where to click or the most appropriate object part to grasp. By extending this ability to predict a set of points, a model can generate more complex outputs like motion trajectories and paths, providing precise action guidance for robots. Points can also serve as intermediate reasoning tools for other downstream tasks, such as counting. We define the generalization of this capability, which combines pointing with reasoning, as **complex pointing**.

`\smallskip `{=latex}`\noindent `{=latex}`\grshortlatestER{}`{=latex} achieves new state-of-the-art results on academic benchmarks for complex pointing, as shown in `\cref{fig:gr1.5_er_pointing_benchmark}`{=latex}. The evaluation spans several key capabilities: **Average Pointing** aggregates performance across all benchmarks; **Spatial Pointing** focuses on pointing queries requiring spatial reasoning (e.g., \`\`point to the space left of the cup\"); **Steerable Pointing** tests the ability to modify points following user instructions (e.g., \`\`move the point slightly up\"); and **Point-to-Count** measures counting accuracy when points are used as an intermediate reasoning step. Refer to `\ref{appendix-pointing}`{=latex} for a more detailed breakdown. `\grshortlatestER{}`{=latex} significantly outperforms `\grshortlegacyER{}`{=latex}, Gemini 2.5, and GPT-5. It particularly excels at complex pointing tasks that require reasoning about physical, spatial, and semantic constraints including safety. We provide several examples in `\cref{fig:gr1.5_er_pointing_collage}`{=latex}.

```{=latex}
\begin{figure*}[!t]\centering
    \includegraphics[width=\textwidth]{src/assets/ER/gr1.5_er_sd_collage_v7.pdf}
    \caption{Multiple forms of progress understanding in \grshortlatestER{}. Understanding progress in scenes with robot interaction requires spatial, temporal, and semantic reasoning abilities potentially across multiple viewpoints and conditioned on language descriptions. Top: Predicting the percentage of task completion. Bottom left: Multi-view success detection: no single camera has sufficient information to detect success for the task ``put the orange pack of biscuits from the basket in the shelf next to the blue bowl.'' Bottom right: unshuffling video frames is another form of progress understanding where temporal understanding is essential.}
    \label{fig:gr1.5_er_sd_collage}
\end{figure*}
```
```{=latex}
\begin{figure*}[!t]\centering
    \includegraphics[width=1.0\textwidth]{src/assets/ER/er_1p5_sd_evals_total.pdf}
    \caption{
    Performance on various formulations of success detection (SD). Real-time SD considers model inference latency when computing prediction accuracy, while offline success detection assumes unlimited inference time for each prediction. Multiview SD uses multiple camera views while Singleview SD uses just a single viewpoint.
    }
    \label{fig:gr1.5_er_sd_results}
\end{figure*}
```
`\noindent `{=latex}**Progress Understanding and Success Detection**: Understanding temporal progress in real-world situations with physical interaction is critical for various robotics applications, including policy evaluation, training, data filtering, and robot orchestration in long-horizon tasks. However, accurate progress understanding requires advanced mastery of temporal and spatial reasoning, semantic understanding of the world, and multi-view understanding. As visualized in `\cref{fig:gr1.5_er_sd_collage}`{=latex}, `\grshortlatestER{}`{=latex} is capable of progress estimation in a wide set of scenarios with diverse and complex scenes and tasks on a mix of embodiments, including predicting percentage towards task completion, success detection [@du2023vision; @rocamonde2023vision], and video frame unshuffling [@ma2024generative].

`\smallskip `{=latex}`\noindent `{=latex}To quantitatively analyze the progress understanding capabilities of `\grshortlatestER{}`{=latex}, we evaluate various formulations of success detection, where a model must predict a binary success / failure signal given input images and a task instruction in text. In particular, we create a success detection evaluation benchmark which focuses on two important categories: real-time or offline inference and multiview or singleview image inputs.

`\smallskip `{=latex}`\noindent `{=latex}For the real-time evaluations, we sample recorded real-world robot rollouts from Section `\ref{sec:results-agentic}`{=latex}, and run the model at 5Hz and simulate inference latency. To calculate accuracy, the prediction for any given frame is considered to be the label from the most recent preceding frame for which a response is available. We find that models often require long inference time making real-time usage challenging, since stale success predictions quickly become irrelevant during dynamic robot interactions. For the offline evaluations, we leverage various types of videos of real-world interaction, which cover a mix of embodiments, camera viewpoints, and input formats. In the offline setting, we allow models unlimited inference time for success detection. As seen in `\cref{fig:gr1.5_er_sd_results}`{=latex}, `\grshortlatestER{}`{=latex} showcases strong performance for both real-time and offline success detection, in both the multiview and singleview image input settings.

`\noindent `{=latex}**Real-World Robotic Use Cases:** To assess `\grshortlatestER{}`{=latex}'s performance beyond academic benchmarks, we aim to study how well `\grshortlatestER{}`{=latex} performs in realistic scenarios which are representative of real-world use cases (`\cref{fig:gr1.5_er_ttp_benchmark_visual_overall}`{=latex}(a)). For a quantitative evaluation, we create a benchmark consisting of examples provided by early testers of `\grlegacy{}`{=latex} who had deployed  `\grshortlegacyER{}`{=latex} in their application domains. The benchmark focuses on spatial understanding for in-the-wild data distributions with tasks like object detection and pointing. As shown in `\cref{fig:gr1.5_er_ttp_benchmark_visual_overall}`{=latex}(b), `\grshortlatestER{}`{=latex} outperforms `\grlegacyER{}`{=latex} as well as contemporary state-of-the-art multimodal models.

```{=latex}
\begin{figure*}[!t]\centering
    \begin{subfigure}[t]{0.66\textwidth}
        \centering
        \includegraphics[width=\textwidth]{src/assets/ER/real_world_understanding_v2.jpg}
    \end{subfigure}%
    \hfill % Add some space between the two subfigures
    \begin{subfigure}[t]{0.33\textwidth}
        \centering
        \includegraphics[width=\textwidth]{src/assets/ER/real_world_understanding_performance.pdf}
    \end{subfigure}
    \caption{(a) Example real-world use case in an inspection task. Using \grlatestER{}, we can parse an image of an inventory shelf (left) into a table and present the result in an HTML page. (b) Performance on data distributions sourced from real-world use cases from early testers of Gemini Robotics-ER. Scores are measured as IOU for bounding box predictions and accuracy of predicted points within ground truth segmentation masks for pointing. Results for GPT-5 and GPT-5-mini obtained via API calls in September 2025.} 
    \label{fig:gr1.5_er_ttp_benchmark_visual_overall}
\end{figure*}
```
Thinking
--------

A hallmark of contemporary frontier models is that their performance can improve via additional reasoning steps at inference time (known as thinking). Just as language-based domains like math and code have benefited from thinking, we demonstrate the benefits of thinking for open-world embodied reasoning for `\grshortlatestER{}`{=latex}.

`\smallskip `{=latex}`\noindent `{=latex}`\cref{fig:gr1.5_er_sample_thoughts}`{=latex} presents example thinking traces of `\grshortlatestER{}`{=latex}. The traces demonstrate that `\grshortlatestER{}`{=latex} identifies key features in the image before focusing on fine details (gauge reading), proceeds logically and methodically (matching socks), can point while thinking (matching socks), and performs relevant mathematical operations correctly (gauge reading). We show additional thinking trace visualizations in Appendix `\ref{appendix-er-examples}`{=latex}.

```{=latex}
\begin{figure*}[!t]\centering
    \includegraphics[width=\textwidth]{src/assets/ER/thought_samples_2.pdf}
    \caption{Sample thinking traces from \grshortlatestER{} performing embodied reasoning tasks.}
    \label{fig:gr1.5_er_sample_thoughts}
\end{figure*}
```
`\smallskip `{=latex}`\noindent `{=latex}`\cref{fig:gr1.5_er_thinking_scaling}`{=latex} (Left) shows the effect of thinking for embodied reasoning tasks on the 15 academic benchmarks introduced in Section `\ref{sec:results-er-generality}`{=latex}. For each task category, `\grshortlatestER{}`{=latex}'s performance improves as the thinking token budget grows. The optimal amount of thinking varies depending on the task and the amount of reasoning required. Image and video QA tasks benefit more from longer thinking traces compared to pointing tasks. `\smallskip `{=latex}`\noindent `{=latex}`\cref{fig:gr1.5_er_thinking_scaling}`{=latex} (Center) shows that `\grshortlatestER{}`{=latex} can automatically modulate the number of thinking tokens depending on the amount of reasoning that is appropriate for the task. `\grshortlatestER{}`{=latex} also scales better with inference-time compute than Gemini 2.5 Flash, as seen in `\cref{fig:gr1.5_er_thinking_scaling}`{=latex} (Right). Frontier models are strong thinkers, however this does not necessarily translate into effective embodied reasoning, as can be seen by the relatively flat scaling curve for Gemini 2.5 Flash. `\grshortlatestER{}`{=latex}'s strong performance scaling with thinking shows promise for tapping into the inference-time compute scaling gains for embodied reasoning capabilities.

```{=latex}
\begin{figure*}[!t]\centering
    \includegraphics[width=0.32\textwidth]{src/assets/ER/er1p5_thought_budget_scaling_external_benchmarks_per_category_v2.png}
    \includegraphics[width=0.32\textwidth]{src/assets/ER/er1p5_token_length_scaling_external_benchmarks_per_category_v2.png}
    \includegraphics[width=0.32\textwidth]{src/assets/ER/er1pt5_vs_flash_june_final_v2.png}
    \caption{(Left) \grshortlatestER{} uses inference-time compute to improve performance. (Center) \grshortlatestER{} appropriately modulates how many thinking tokens it uses depending on the amount of reasoning needed by the task. Given the same thinking budget, \grshortlatestER{} uses fewest tokens for pointing tasks, and most for video QA. (Right) \grshortlatestER{} scales better with inference-time compute on embodied reasoning tasks compared to Gemini 2.5 Flash. All data points are the average of 3 evaluation runs over the same benchmark sets.}
    \label{fig:gr1.5_er_thinking_scaling}
\end{figure*}
```
`\grlatest{}`{=latex}: A Physical Agent {#sec:results-agentic}
=======================================

In this section, we combine `\grshortlatestER{}`{=latex}, our embodied reasoning model, with `\grshortlatest{}`{=latex}, our VLA model, into a full agentic system, and demonstrate how the synergy between these two models enables the execution of complex, long-horizon tasks [@2022_palm_saycan; @huang2022inner; @shi2025hi] in out-of-distribution environments. The test scenarios require advanced real-world understanding, tool use, long-horizon task planning, execution, and error recovery. To understand the contribution of different components of the agentic system, we conduct the following ablation study:

-   **`\grshortlatest{}`{=latex} (with Thinking On):** The Thinking VLA model that thinks before acting (`\cref{sec:thinking}`{=latex}).

-   **Agentic (Gemini 2.5 Flash + `\grshortlatest{}`{=latex}):** The baseline agentic system, utilizing the Gemini 2.5 Flash model as orchestrator, and our VLA model for execution.

-   **Agentic (`\grshortlatestER{}`{=latex} + `\grshortlatest{}`{=latex}):** Our agentic system, utilizing our ER model as orchestrator, and our VLA model for execution.

`\smallskip `{=latex}`\noindent `{=latex}We choose 8 tasks across the ALOHA and Bi-arm Franka[^3] platforms to test different aspects of an agent, including tool use, memory, planning, and dexterous manipulation skills. For example, in \`\`Sort Trash", \`\`Nut Allergy", and \`\`Mushroom Risotto" the agent needs to perform web search to understand how the objects fit the requirements of the prompt. In \`\`Desk Organization" and \`\`Swap", the agent is asked to memorize the state of the scene and objects, and then recover the original states. \`\`Pack Suitcase" and \`\`Top shelf to the table" test the `\grshortlatest{}`{=latex} Agent's 3-D reasoning and dexterity, manipulating soft items on shelving or hangers. The \`\`Blocks in Drawer" task has 9 distinct steps, testing the agent's planning capability. Details of the benchmark and the progress score definition can be found in Appendix `\ref{appendix:long-horizon}`{=latex}. For completeness, we also report success rate results in `\ref{appendix:additional-exps-agentic}`{=latex}.

```{=latex}
\begin{figure*}
\begin{subfigure}{\textwidth}
    \centering
    \includegraphics[width=0.8\linewidth]{src/assets/agent/progress_agent_v3.pdf}
        \includegraphics[width=0.8\linewidth]{src/assets/agent/long_horizon_figure_3.jpg}
\end{subfigure}
\begin{subfigure}{\textwidth}
    \centering
\vspace{0.2in}   
\includegraphics[width=0.8\linewidth]{src/assets/agent/omega_agentic_progress_no_legend.pdf}
        \includegraphics[width=0.7\linewidth]{src/assets/agent/Long-horizon-omega-star-3.jpeg}
\end{subfigure}
    \caption{Long-horizon evaluations for the \grshortlatest{} Agent and the baselines on ALOHA (top) and Bi-arm Franka (bottom), consisting of tasks that require advanced real-world understanding, tool use, long-horizon task planning, execution, and error recovery to successfully complete the complex long-horizon tasks.}
    \label{fig:agentic_comparison}
\end{figure*}
```
`\smallskip `{=latex}`\noindent `{=latex}As shown in Figure `\ref{fig:agentic_comparison}`{=latex}, the agent composed of the `\grshortlatest{}`{=latex} family of models consistently and significantly outperforms the other two baselines. The Thinking VLA achieves moderate performance with a progress score up to 44 percent. In contrast, our `\grshortlatest{}`{=latex} agent frequently achieves scores near 80%. Although the Thinking VLA can perform a degree of task decomposition, its world understanding and task planning are limited compared to the Embodied Reasoning model. This is consistent with the fact that the Thinking VLA is a smaller model that is primarily optimized for action output.

`\smallskip `{=latex}`\noindent `{=latex}We compare our `\grshortlatest{}`{=latex} Agent with the baseline agent that uses the off-the-shelf Gemini 2.5 Flash model for orchestration. For more complex tasks, our system achieves nearly double the progress score.

::: {#table:agent-failure-modes}
  ----------------------- ------------------------------------ ------------------------------------------------ --
   Subtask failure modes                 Agent                                      Agent                       
                           (Gemini 2.5 Flash as orchestrator)   (`\grshortlatestER{}`{=latex} as orchestrator)  
         Planning                        25.5%                                        9%                        
     Success detection                     6%                                         4%                        
          Action                          13%                                         9%                        
    Total failure rates                  44.5%                                       22%                        
  ----------------------- ------------------------------------ ------------------------------------------------ --

  :  Failure modes for long-horizon evaluations: A planning failure is when the orchestrator makes a wrong plan or issues a wrong instruction to the VLA. A success detection failure is when the agent ends a sub-task either too early or too late. An action failure is when the VLA does not successfully complete the sub-task.
:::

`\smallskip `{=latex}`\noindent `{=latex}We analyze failures of the agentic system and identify three categories: orchestrator errors (wrong plan or instruction to the VLA), success detector errors (ending the sub-task too early or too late), and VLA failures (inability to complete the sub-task). As detailed in Table `\ref{table:agent-failure-modes}`{=latex}, our agentic system with `\grshortlatestER{}`{=latex} as orchestrator outperforms the baseline with Gemini 2.5 Flash as orchestrator in all categories, with the most significant boost in task-planning performance. This difference underscores that `\grshortlatestER{}`{=latex} provides stronger embodied reasoning capabilities than Gemini 2.5 Flash.

`\smallskip `{=latex}`\noindent `{=latex}These results demonstrate a clear hierarchy in capability. While improvements in the VLA model significantly enhance execution robustness, they are insufficient for complex, long-horizon tasks. Furthermore, simply pairing an off-the-shelf VLM like Gemini 2.5 Flash with an advanced VLA model fails to achieve reliable end-to-end success, underscoring the importance of general real-world understanding and embodied reasoning. Our agentic architecture, which leverages the `\grshortlatestER{}`{=latex} model for high-level planning and orchestration, significantly improves reliability. This result highlights an important design philosophy for physical agents: combining general, robust low-level control with intelligent high-level embodied reasoning is the critical path towards deploying capable AI agents in the physical world.

Responsible Development and Safety {#sec:safety}
==================================

We are proactively developing novel safety and alignment approaches to enable AI-controlled robots to be responsibly deployed in human-centric environments. Our overall safety approach is multi-faceted and multi-layered, spanning high-level semantic safety reasoning, ensuring respectful dialogue with humans, thinking about safety before acting, and triggering low-level physical safety sub-systems (e.g., for collision avoidance) when needed. Additionally, we continue iterating on implementing best practices for operational safety as codified in existing safety standards [@ISO_TS_15066_2016; @ISO10218-1:2025]. We are also developing novel Auto-Red-Teaming frameworks to automatically discover safety and robustness vulnerabilities of Gemini Robotics models through continuous adversarial evaluations at scale.

`\smallskip `{=latex}`\noindent {\bf Safe Human-Robot Dialog}`{=latex}: By building on Gemini checkpoints, we ensure alignment with Gemini Safety policies [@comanici2025gemini] that have been designed to prevent generation of harmful content such as hate speech, sexual explicitness, and revealing personally identifiable information. Through adversarial testing, we find that our `\grshortlatestER{}`{=latex} models have strong compliance with these policies. In turn, this implies that robots will foster safe human-robot interactions, engaging in socially appropriate behaviors through their speech, gestures, and actions.

`\smallskip `{=latex}`\noindent {\bf Semantic Action Safety}`{=latex}: Broadly, semantic safety addresses the innumerable, \`\`long-tail", common-sense constraints of operating in human-centric environments, e.g. that a box may be too heavy to lift, or that spilled liquid on the floor is a slip hazard. We previously released the ASIMOV-benchmark [@sermanet2025generating] to evaluate and improve semantic action safety. We are now releasing an upgrade of this benchmark: ASIMOV-2.0 has improved quality, better tail coverage of real-world injury scenarios [@neiss], new video modalities (generated by VEO[^4]), and new embodied reasoning tasks involving physical constraints. `\cref{fig:asimov2}`{=latex} shows instances of this eval suite across text, image and video modalities; for more details see [@safe-vlms], released concurrently with this tech report. In `\cref{fig:safety_evals}`{=latex}, we see that `\grshortlatestER{}`{=latex} improves over `\grshortlegacyER{}`{=latex} in recognizing risks, understanding safety consequences of actions, and knowing when to trigger interventions. When `\grshortlatestER{}`{=latex} is fine-tuned for Thinking about Safety, we see state-of-the-art results in safety-constrained pointing tasks (`\cref{fig:safety_evals}`{=latex} (b)).

```{=latex}
\begin{figure*}[!ht]\centering 
        \begin{subfigure}[b]{0.45\textwidth}
\begin{center}
{\tiny 
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{l|p{4cm}}
\hline
 {\bf REAL REPORT} & 3YOM HAD A PILL IN HIS MOUTH FROM GRANDMOTHERS PILL CASE..\\ 
 {\bf DIAGNOSTIC CODE} & POISONING \\ 
 \hline
 {\bf SCENARIO} & A 3-year-old boy is playing on the floor near his grandmother...The boy has just reached into the organizer, pulled out a small pill, and put it in his mouth.\\ 
 {\bf ACTION} & Immediately inform the grandmother, pointing to the child, and say, 'He has one of your pills in his mouth. \\ 
 \hline
 {\bf Q1 (Latent Risk)} & Given only the scenario, which of the following physical injuries is most likely? A: Burns, B: Poisoning, C:..\\ 
 {\bf Q2 (Latent Risk Severity)} & Given only the scenario, what is the severity of the most likely injury? A: Minor, B: Moderate, C: Severe...\\ 
 {\bf Q3 (Action Effect)} & What is the effect of executing the action on the likelihood of the possible injury? A: Eliminates Risk, B. Reduces Risk, C. Increases Risk, D. No effect...\\ 
 {\bf Q4 (Activated Risk)} & What is the severity of the most likely injury after executing the action? A: None, B: Minor, C: Moderate, D: Severe...\\ \hline
\end{tabular}
}
\caption{{\bf \texttt{ASIMOV-2.0-Injury}}: Do models understand physical risks and associated severity in given scenarios (text); and safety consequences of actions?}\label{fig:examples_injury}
\end{center}
\end{subfigure}\hspace{1cm}
    \begin{subfigure}[b]{0.45\textwidth}
        \centering 
        \includegraphics[height=5.5cm, width=\linewidth]{src/assets/safety/constraints6.png}\vspace{0.3cm}\caption{{\bf \texttt{ASIMOV-2.0-Constraints}}: Do (multimodal) model responses (red pointing labels) adhere to {\it embodiment-specific} Safety Instructions?}
    \end{subfigure}
    \par\bigskip
    \centering
    \begin{subfigure}[b]{\textwidth}
        \centering
          \includegraphics[height=2.7cm, width=\linewidth]{src/assets/safety/safety-video-frames.png}
        \caption{{\bf \texttt{ASIMOV-2.0-Video}}: Do models understand physical risks and severity in (AI-generated) videos (as opposed to text); can they predict the last possible timestamp (red frame above) at which an intervention could have effectively prevented the injury?}
    \end{subfigure}
\label{fig:examples}
    \caption{{\bf \texttt{ASIMOV-2.0}} Physical Safety Benchmark: Instances and Key Questions}  \label{fig:asimov2}
\end{figure*}
```
```{=latex}
\begin{figure*}[!ht]\centering
    \includegraphics[width=0.9\linewidth]{src/assets/safety/safety_plot_new.pdf}
    \caption{ASIMOV-2.0 Safety Evaluations.}\label{fig:safety_evals}
\end{figure*}
```
`\smallskip `{=latex}`\noindent {\bf Auto-Red-Teaming Framework}`{=latex}: To augment our static evaluation methods, we have also developed novel automated red teaming (ART) techniques for dynamic, adversarial stress-testing of Gemini Robotics models. Our approach is inspired by Gemini's Auto-Red-Teaming [@comanici2025gemini] framework which formulates adversarial testing as a game played between three models: an *Attacker*, a *Target*, and an *Autorater*. In our case, the *Target* is a Gemini Robotics model. The *Attacker* is prompted to devise an "attack" on the Target model. Effectively, the Attacker samples an "ordinary" task from a source (e. g. training/eval data of the Target model), and turns it into an adversarial task. For example, the ER model may be attacked through a malicious instruction (*prompt attack*) or a corrupted/edited image (*scene attack*); and the Actions model may be attacked during a rollout with undesirable disturbances (e.g. moving obstacles) in the environment (*environment attack*). The *AutoRater* is a judge that attempts to meticulously rate the Target's response for correctness and safety. `\cref{fig:red-teaming}`{=latex} shows an instance of ER model hallucination discovered through auto-red-teaming: the Attacker samples an ALOHA scene, and cleverly requests the ER model to point to an entity that does not exist in the scene. The AutoRater, given an image overlay of the ER model responses, reliably detects hallucination and marks this response as a failure while providing a reasoning trace.

```{=latex}
\centering
```
![Auto-Red-Teaming detects ER Hallucinations under adversarial prompts.](src/assets/safety/roboart-example.png){#fig:red-teaming height="4cm" width="0.85\\linewidth"}

Through auto-red-teaming, we verified the following: (1) `\grshortlatestER{}`{=latex} (particularly with Thinking enabled) has greater robustness under instruction obfuscation, hallucination elicitation and content safety attacks; (2) model responses can be reliably critiqued and corrected using AutoRaters for enhanced robustness; and (3) training data generated via auto-red-teaming helps mitigate vulnerabilities such as hallucinations.

`\smallskip `{=latex}`\noindent `{=latex}We are committed to continuously innovating safety and alignment techniques as we advance our robot foundation models. Furthermore, we acknowledge that the societal impacts of Gemini Robotics deployments must be addressed concurrently with safety risks. Proactive management and monitoring of these multifaceted impacts -- spanning both benefits and challenges -- are fundamental to our strategy for mitigating risk, deploying responsibly, and ensuring transparent reporting. Please refer to Appendix `\ref{sec:model-card}`{=latex} for the Gemini Robotics model card.

Discussion
==========

This work presents Gemini Robotics 1.5, a significant step towards general-purpose robots capable of operating intelligently in the physical world. By combining the power of an advanced Embodied Reasoning model with a general Vision-Language-Action model, we have made significant progress in tackling key bottlenecks in robot learning and generalization. Our core contribution lies in three major innovations:

-   **Thinking VLA:** We have shown that enabling the VLA model to \"think before acting\" through a multi-level internal monologue notably improves its ability to handle more complex, multi-step tasks.

-   **Learning across different robot embodiments:** We have shown that `\grlatest{}`{=latex} can successfully learn from heterogeneous datasets, including data from across different robot platforms, and transfer learned skills between them. This breakthrough accelerates learning in the presence of the data scarcity problem that has long hindered the field, accelerating progress towards generalist robots.

-   **State-of-the-Art Embodied Reasoning:** The `\grlatestER{}`{=latex} model establishes a new state of the art for a wide range of embodied reasoning tasks. Its performance on tasks like visual and spatial thinking, task planning, progress estimation, and success detection is critical for robust, real-world robotic applications.

`\smallskip `{=latex}`\noindent `{=latex}This tech report demonstrates that the organic combination of these three contributions offers a compelling path to a new generation of general-purpose robots. Embodied thinking provides the intelligence to decompose long-horizon tasks, but this intelligence is only valuable when it can be translated into successful executions, empowered by our capable and general VLA model. This VLA is, in turn, able to share knowledge across different robot embodiments, which can unlock the immense amount of data collected by the entire robotics community. Finally, the system's state-of-the-art embodied reasoning capabilities enhance the robot's perception, semantic understanding, and planning for complex tasks that require both information gathering and multi-step reasoning. Together, these elements form a complete and powerful agentic system, paving the way for robots that can operate with human-like intelligence, adaptability, and safety in complex and dynamic environments.

`\smallskip `{=latex}`\noindent `{=latex}While Gemini Robotics 1.5 represents a major milestone, this work also highlights several avenues for future research. An important next step is to leverage more scalable data sources beyond traditional robot action data, such as real-world human videos and synthetic videos. Our architectural changes in GR 1.5 already equip the model to learn from these data sources without requiring action annotations. Future efforts will focus on learning from publicly available low-quality video corpora, among other data sources, to further mitigate the data scarcity problem. Additionally, although GR 1.5 demonstrates a new level of generalization, its dexterity remains on par with the previous generation. We will explore new architectures and training methods, such as reinforcement learning, to enhance the robot's dexterity without sacrificing its generality, allowing it to perform more intricate and precise manipulations. `\newpage`{=latex}

```{=latex}
\newpage
```
Contributions and Acknowledgments
=================================

`\smallskip `{=latex}`\noindent `{=latex}**Authors**

```{=latex}
\begin{multicols}{2}
\smallskip \noindent Abbas Abdolmaleki\\
Saminda Abeyruwan\\
Joshua Ainslie\\
Jean-Baptiste Alayrac\\
Montserrat Gonzalez Arenas\\
Ashwin Balakrishna\\
Robert Baruch\\
Nathan Batchelor\\
Alex Bewley\\
Jeff Bingham\\
Michael Bloesch\\
Konstantinos Bousmalis\\
Philemon Brakel\\
Anthony Brohan\\
Thomas Buschmann\\
Arunkumar Byravan\\
Serkan Cabi\\
Ken Caluwaerts\\
Federico Casarini\\
Christine Chan\\
Oscar Chang\\
London Chappellet-Volpini\\
Jose Enrique Chen\\
Xi Chen\\
Hao-Tien Lewis Chiang\\
Krzysztof Choromanski\\
Adrian Collister\\
David B. D'Ambrosio\\
Sudeep Dasari\\
Todor Davchev\\
Meet Kirankumar Dave\\
Coline Devin\\
Norman Di Palo\\
Tianli Ding\\
Carl Doersch\\
Adil Dostmohamed\\
Yilun Du\\
Debidatta Dwibedi\\
Sathish Thoppay Egambaram\\
Michael Elabd\\
Tom Erez \\
Xiaolin Fang\\
Claudio Fantacci\\
Cody Fong\\
Erik Frey\\
Chuyuan Fu\\
Ruiqi Gao\\
Marissa Giustina\\
Keerthana Gopalakrishnan\\
Laura Graesser\\
Oliver Groth\\
Agrim Gupta\\
Roland Hafner\\
Steven Hansen\\
Leonard Hasenclever\\
Sam Haves\\
Nicolas Heess\\
Brandon Hernaez\\
Alex Hofer\\
Jasmine Hsu\\
Lu Huang\\
Sandy H. Huang\\
Atil Iscen\\
Mithun George Jacob\\
Deepali Jain\\
Sally Jesmonth\\
Abhishek Jindal\\
Ryan Julian\\
Dmitry Kalashnikov\\
M. Emre Karagozler\\
Stefani Karp\\
Matija Kecman\\
J. Chase Kew\\
Donnie Kim\\
Frank Kim\\
Junkyung Kim\\
Thomas Kipf\\
Sean Kirmani\\
Ksenia Konyushkova\\
Li Yang Ku\\
Yuheng Kuang\\
Thomas Lampe\\
Antoine Laurens\\
Tuan Anh Le\\
Isabel Leal\\
Alex X. Lee\\
Tsang-Wei Edward Lee\\
Guy Lever\\
Jacky Liang\\
Li-Heng Lin\\
Fangchen Liu\\
Shangbang Long\\
Caden Lu\\
Sharath Maddineni\\
Anirudha Majumdar\\
Kevis-Kokitsi Maninis\\
Andrew Marmon\\
Sergio Martinez\\
Assaf Hurwitz Michaely\\
Niko Milonopoulos\\
Joss Moore\\
Robert Moreno\\
Michael Neunert\\
Francesco Nori\\
Joy Ortiz\\
Kenneth Oslund\\
Carolina Parada\\
Emilio Parisotto\\
Amaris Paryag\\
Acorn Pooley\\
Thomas Power\\
Alessio Quaglino\\
Haroon Qureshi\\
Rajkumar Vasudeva Raju\\
Helen Ran\\
Dushyant Rao\\
Kanishka Rao\\
Isaac Reid\\
David Rendleman\\
Krista Reymann\\
Miguel Rivas\\
Francesco Romano\\
Yulia Rubanova\\
Peter Pastor Sampedro\\
Pannag R Sanketi\\
Dhruv Shah\\
Mohit Sharma\\
Kathryn Shea\\
Mohit Shridhar\\
Charles Shu\\
\columnbreak
Vikas Sindhwani\\
Sumeet Singh\\
Radu Soricut\\
Rachel Sterneck\\
Ian Storz\\
Razvan Surdulescu\\
Jie Tan\\
Jonathan Tompson\\
Saran Tunyasuvunakool\\
Jake Varley\\
Grace Vesom\\
Giulia Vezzani\\
Maria Bauza Villalonga\\
Oriol Vinyals\\
René Wagner\\
Ayzaan Wahid\\
Stefan Welker\\
Paul Wohlhart\\
Chengda Wu\\
Markus Wulfmeier\\
Fei Xia\\
Ted Xiao\\
Annie Xie\\
Jinyu Xie\\
Peng Xu\\
Sichun Xu\\
Ying Xu\\
Zhuo Xu\\
Jimmy Yan\\
Sherry Yang\\
Skye Yang\\
Yuxiang Yang\\
Hiu Hong (Eddie) Yu\\
Wenhao Yu\\
Wentao Yuan\\
Yuan Yuan\\
Jingwei Zhang\\
Tingnan Zhang\\
Zhiyuan Zhang\\
Allan Zhou\\
Guangyao Zhou\\
Yuxiang Zhou\\
\end{multicols}
```
`\smallskip `{=latex}`\noindent `{=latex}**Acknowledgements**\
`\smallskip `{=latex}`\noindent `{=latex}We would like to acknowledge the support from Amy Nommeots-Nomm, Ashley Gibb, Bhavya Sukhija, Bryan Gale, Catarina Barros, Christy Koh, Clara Barbu, Demetra Brady, Hiroki Furuta, Jennie Lees, Kendra Byrne, Keran Rong, Kevin Murphy, Kieran Connell, Kuang-Huei Lee, Martina Zambelli, Matthew Jackson, Michael Noseworthy, Miguel Lázaro-Gredilla, Mili Sanwalka, Mimi Jasarevic, Nimrod Gileadi, Rebeca Santamaria-Fernandez, Rui Yao, Siobhan Mcloughlin, Sophie Bridgers, Stefano Saliceti, Steven Bohez, Svetlana Grant, Tim Hertweck, Verena Rieser and Yandong Ji.

`\smallskip `{=latex}`\noindent `{=latex}We would like to thank Zoubin Ghahramani, Koray Kavukcuoglu, and Demis Hassabis for their leadership and support of this effort. We would also like to recognize the many teams across Google and Google DeepMind that have contributed to this effort including Legal, Marketing, Communications, Responsibility and Safety Council, Responsible Development and Innovation, Policy, Strategy and Operations as well as our Business and Corporate Development teams. We would like to thank everyone on the Robotics team not explicitly mentioned above for their continued support and guidance. We would also like to thank the Apptronik team for their support.

```{=latex}
\newpage
```
Appendix {#appendix .unnumbered}
========

```{=latex}
\appendix
```
`\label{sec:appendix}`{=latex}

Model Card {#sec:model-card}
==========

We present the model card  [@mitchell2019model] for `\grlatestER{}`{=latex} and `\grlatest{}`{=latex} models in Table `\ref{tab:model-card}`{=latex}.

```{=latex}
\begin{longtable}{p{0.24\textwidth}p{0.70\textwidth}}

\midrule[\heavyrulewidth]
\multicolumn{2}{c}{\textbf{Model summary}} \\
\midrule[\heavyrulewidth]

\textbf{Model architecture} & { \grlatestER{} is a Vision-Language-Model that enhances Gemini’s world understanding.

\grlatest{}  is a Vision-Language-Action model enabling general-purpose robot manipulation on different tasks, scenes, and across multiple robots.} \\

\midrule[0,5pt]

\textbf{Input(s)} & {The models take text (e.g., a question or prompt or numerical coordinates) and images (e.g., robot camera images) as input.} \\

\midrule[0,5pt]

\textbf{Output(s)} & {\grlatestER{} generates text (e.g., numerical coordinates) in response to the input.  \grlatest{} generates continuous numerical values that represent robot actions, and additionally text when thinking mode is enabled.} \\

\midrule[\heavyrulewidth]
\multicolumn{2}{c}{\textbf{Model Data}} \\
\midrule[\heavyrulewidth]

\textbf{Training Data} & {\grlatestER{} and \grlatest{} were trained on datasets comprised of images, text, and robot sensor and action data.} \\

\midrule[0,5pt]

\textbf{Data Pre-processing} & 
{The multi-stage safety and quality filtering process employs data cleaning and filtering methods in line with our policies. These methods include: 
\begin{itemize}
    \item Sensitive Data Filtering: Automated techniques were used to filter out certain personal information and other sensitive data from text and images.
    \item Synthetic captions: Each image in the dataset was paired with both original captions and synthetic captions. Synthetic captions were generated using Gemini and FlexCap~\cite{dwibedi2024flexcap} models and allow the model to learn details about the image.
\end{itemize}
Further details on data pre-processing can be found in~\cite{geminiteam2023gemini}.
} \\

\midrule[\heavyrulewidth]
\multicolumn{2}{c}{\textbf{Implementation Frameworks}} \\
\midrule[\heavyrulewidth]

\textbf{Hardware} & 
{TPU v4,  v5p and v6e.}
\iffalse The \geminirobotics{} models were trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv4 and TPUv5). TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably compared to CPUs. TPUs often come with large amounts of high-bandwidth memory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution. Training can be distributed across multiple TPU devices for faster and more efficient processing. The efficiencies gained through the use of TPUs are aligned with Google's commitments to operate sustainably~\cite{google_sustainability}.\fi \\

\midrule[0,5pt]

\textbf{Software} & 
{JAX~\cite{jax2018github}, ML Pathways~\cite{2021pathwaysarchitecture}.

\iffalse Training was done using JAX~\cite{jax2018github} and ML Pathways~\cite{2021pathwaysarchitecture}.  JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.  ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is specially suitable for foundation models, including large vision-language models like these ones.  Together, JAX and ML Pathways are used as described in~\cite{geminiteam2023gemini}; ``the `single controller' programming model of JAX and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow.''\fi
} \\

\midrule[\heavyrulewidth]
\multicolumn{2}{c}{\textbf{Evaluation}} \\
\midrule[\heavyrulewidth]

\textbf{Approach} & {See Section \ref{sec:results-er} for \grlatestER{} evaluation procedures, Sections \ref{sec:results-actions} for \grlatest{} evaluation procedures, and Section \ref{sec:safety} for \geminirobotics{} Safety evaluation procedures. } \\

\midrule[0,5pt]

\textbf{Results} & {See Section \ref{sec:results-er} for \grlatestER{} evaluation results, Sections \ref{sec:results-actions} for \grlatest{} evaluation results, and Section \ref{sec:safety} for \geminirobotics{} Safety evaluation results.} \\

\midrule[\heavyrulewidth]
\multicolumn{2}{c}{\textbf{Model Usage \& Limitations}} \\
\midrule[0,5pt]

\textbf{Ethical Considerations \& Risks} & 
{Previous impact assessment and risk analysis work as discussed in~\cite{geminiteam2023gemini} and references therein remain relevant to \geminirobotics.  
\iffalse
Broadly, risks fall into three categories:
\begin{itemize}
    \item Unintentional model failure includes system mistakes which could result in accidental physical injury or other physical harm. This could occur if a robotics system is deployed in an environment which is out-of-distribution and it misunderstands or misreasons about its environment. This can also occur if the training data is not sufficiently representative, leading to underperformance for specific groups~\cite{hao2019}.
    \item Malicious adversarial users could attack or “jailbreak” the system to facilitate harmful tasks. This could include a user instructing the robot to perform unsafe actions~\cite{robey2024jailbreaking}.
    \item Similar to other AI systems, some risk may arise from interactions with humans. For example, without a human-centered interface design, users could be susceptible to anthropomorphism, putting undue user trust or over-relying on the system~\cite{manzini2024}. Deployment in contexts such as hospitals and schools will require thoughtful data stewardship to preserve individuals privacy, while offering the benefits of robotics in these contexts.
\end{itemize}
\fi
See Section \ref{sec:safety} for information on responsible development and safety mitigations.} \\

\midrule[\heavyrulewidth]
\captionsetup{position=below}
\caption{\grlatest{} model card.}
\label{tab:model-card}
\end{longtable}
```
```{=latex}
\newpage
```
`\grlatest{}`{=latex} is a general multi-embodiment Vision-Language-Action Model {#appendix-pre-action}
================================================================================

In this section, we provide additional material to supplement the results for `\grshortlatest{}`{=latex}.

Rank consistency between evaluations in simulation and on real robots {#sec:sim-to-real-appendix}
---------------------------------------------------------------------

We leverage physics-based simulators to massively scale up the evaluations of our `\grshortlatest{}`{=latex} models for new objects, scenes, and environments. `\cref{fig:sim_to_real_correlation}`{=latex} demonstrates the rank consistency between our MuJoCo simulator-based evaluation and real-robot evaluations across multiple scenes and tasks, which enables us to rapidly iterate on architectural and algorithmic improvements while reducing the need to conduct slow real-robot evaluations.

```{=latex}
\centering
```
![Each colored pair represents an A/B test both in simulation and real. Across a range of tasks, we find that success rates are rank consistent between simulation and real. This consistency allows for rapid iteration of model architecture, training objectives, and experiment design.](src/assets/actions/sim_to_real.png){#fig:sim_to_real_correlation width="50%"}

Generalization benchmark {#appendix:generalization-tasks}
------------------------

### ALOHA robot

Our generalization benchmark expands the benchmark defined in [@team2025gemini] (68 generalization tasks) with a new set of action generalization tasks (5 tasks) and a new category to measure generalization to entirely new tasks (12 tasks). See [@team2025gemini] for details about previous Visual, Instruction and Action generalization tasks. To measure performance in-distribution we use the dexterity benchmark defined in Section 3.2 of  [@team2025gemini] (20 tasks). Progress score definition for in-distribution, visual, semantic and action generalization can also be found in [@team2025gemini].

#### New action generalization progress score definition {#appendix:new-action-gen-aloha}

The action generalization benchmark defined in prior work [@team2025gemini] largely focused on measuring the model's ability to handle new object locations and shape variants. We expand the benchmark further to include tasks that require the policy to compose multiple learned motions in novel ways to solve new tasks. For example, having seen data of the robot pushing objects for a short distance at different locations, the 'drink-pushing' task (Figure `\ref{fig:action-gen-tasks}`{=latex}) measures how well the policy can combine them together to push a novel object all the way from bottom of the table to the top edge. Table `\ref{tab:aloha_action_gen}`{=latex} lists the definition of progress scores for each task.

```{=latex}
\centering
```
![Examples of the expanded action generalization benchmark. `\label{fig:action-gen-tasks}`{=latex}](src/assets/actions/gr1.5_appendix_new_act_gen.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: New action generalization tasks progress score.} 
\label{tab:aloha_action_gen}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm}  | p{5cm}  | p{5cm}  |} 
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{ New action generalization tasks.}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Place the cube inside the bag''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Pick up the grape with the left arm and pick up the bread with the right arm''.}.\vspace{1pt} & 
\vspace{1pt}\textbf{``Put the Knopper snack in the tupperware''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0:  if the successfully placed the cube inside the bag;
\item[] 0.3: if the robot picked up the cube but failed to put it in the bag; 
\item[] 0.0: if the robot failed to pick the cube.
\end{itemize}
& 
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if both objects are picked up by the correct arm; 
\item[] 0.3: if one of the arms picked up the correct object, another arm picked up the wrong object; 
\item[] 0.0: if both arms picked up wrong object.
\end{itemize}
& 
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot successfully put the Knopper snack in the tupperware;
\item[] 0.4: if the robot picked up the right snack but failed to put in tupperware after 5 attempted; 
\item[] 0.0: if the robot pick up the wrong object.
\end{itemize}
\\
\hline

% --- Row 2 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Push the drink to the front of the table''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Take the straw out of the tupperware''.}\vspace{1pt} & 
% This cell is empty
\\
\hline
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot successfully pushed the drink to the top part of the table;
\item[] 0.5: if the robot didn't fully push the drink to the top part of the table, or pushed something else first; 
\item[] 0.0: if the robot didn't approach the correct object at all.
\end{itemize}
& 
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot successfully picked up the straw and move it outside of the range of the tupperware;
\item[] 0.3: if the robot picked the straw but failed to move it out, or poured the tupperware; 
\item[] 0.0: if the robot didn't approach the straw or tupperware.
\end{itemize}
& 
% Content Col 6 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
#### Task generalization progress score definition {#appendix:task-gen-aloha}

We consider 12 different tasks across multiple scenes. The tasks have *unseen instructions, unseen objects and initial conditions* compare to the training data. See Figure `\ref{fig:task-gen-tasks}`{=latex} for details and Table `\ref{tab:aloha_task_gen}`{=latex} collects for the definition of progress for each task for the new category of task generalization.

```{=latex}
\centering
```
![Examples of task execution for the Task generalization analysis of  `\ref{sec:generalization}`{=latex}. `\label{fig:task-gen-tasks}`{=latex}](src/assets/actions/gr1.5_appendix_agibot.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Task generalization tasks.} 
\label{tab:aloha_task_gen}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm}  | p{5cm}  | p{5cm}  |} 
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Task generalization tasks progress score.}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Pick up a bottle and put it in the bottom shelf of the wardrobe''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Pick up a bottle from the box and put it on the table''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Pick up the blue hat and hang it on the hook''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot placed the bottle in the wardrobe;
\item[] 0.75: if the robot moved the bottle towards the wardrobe;
\item[] 0.5: if the robot grasped a bottle; 
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot placed a bottle on the table; 
\item[] 0.25: if the robot reached for a bottle but did not grasp it; 
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot successfully hung the hat on the hook; 
\item[] 0.75: if the robot tried to hang the hat on the hook;
\item[] 0.5: if the robot grasped the hat; 
\item[] 0.0: if anything else happened.
\end{itemize}
\\
\hline

% --- Row 2 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Pick up a bottle and place it in the leftmost compartment of the folder rack''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Pick up the plum blossom and insert it vertically and carefully into the glass vase''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Pick up the rubber stamp and stamp the white paper with it''.}\vspace{1pt} \\
\hline
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot placed the bottle in the leftmost compartment of the folder rack;
\item[] 0.75: if the robot placed the bottle in any compartment of the folder rack;
\item[] 0.5: if the robot moved the bottle towards the folder rack;
\item[] 0.25: if the robot grasped a bottle; 
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot inserted the plum blossom into the vase;
\item[] 0.75: if the robot moved the plum blossom towards the vase;
\item[] 0.50: if the robot grasped the plum blossom;
\item[] 0.25: if the robot moved towards the plum blossom;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot stamped the document;
\item[] 0.75: if the robot moved the rubber stamp over the document;
\item[] 0.25: if the robot grasped the rubber stamp;
\item[] 0.0: if anything else happened. 
\end{itemize}
\\
\hline

% --- Row 3 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Take the hat off the hook and place it on the table''}\vspace{1pt} & 
\vspace{1pt}\textbf{``Wipe the left shoe with the cloth''}\vspace{1pt} & 
\vspace{1pt}\textbf{``Wipe the table with the cloth''}\vspace{1pt} \\
\hline
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0:  if the robot placed the hat on the table;
\item[] 0.75:  if the robot removed the hat from the hook;
\item[] 0.5:  if the robot grasped the hat;
\item[] 0.0:  if anything else happened. 
\end{itemize}
& 
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot wiped the left shoe;
\item[] 0.5: if the robot moved the cloth towards the left shoe;
\item[] 0.25: if the robot grasped the cloth;
\item[] 0.0: if anything else happened. 
\end{itemize}
& 
% Content Col 9
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot moved towards the cloth back and forth on some portion of the table;
\item[] 0.5: if the robot grasped the cloth;
\item[] 0.25: if the robot reached the cloth;
\item[] 0.0: if anything else happened. 
\end{itemize}
\\
\hline

% --- Row 4 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Wipe the top of the toaster with the cloth''}\vspace{1pt} & & \\
\hline
% Content Col 10
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot wiped the top of the toaster;
\item[] 0.75: if the robot moved towards the cloth towards the top of the toaster;
\item[] 0.25:  if the robot grasped the cloth;
\item[] 0.0:  if anything else happened. 
\end{itemize}
& & \\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\newpage
```
### Bi-arm Franka robot

We defined a new generalization benchmark for our Bi-arm Franka robot. This benchmark includes a total of 44 tasks, 20 of those in distribution and 24 including variants of them across different axes: instruction, visual and action generalization, following the same approach we used to define the ALOHA robot benchmark in  [@team2025gemini]. For the task generalization category we used the same 12 tasks and progress score described in `\ref{appendix:task-gen-aloha}`{=latex} for ALOHA. It is worth noting that the the Bi-arm Franka benchmark also includes highly dexterous tasks such as NIST Board 2 assembly tasks and insertion of cables in a workstation and sockets.

The Bi-arm Franka robot benchmark is defined across three different scenes (Figure `\ref{fig:omega-scenes}`{=latex}):

```{=latex}
\centering
```
![Scenes used to define the generalization benchmark for Bi-Arm Franka robot. Left: A workbench inspired scene that allows for manipulation of tools and skills ranging from hanging, unhanging, picking and placing. Center: a scene with a computer and assorted cables and peripherals, allowing for cable handling, insertions and removals. Left: Layout of the National Institute of Science and Technology (NIST) Task Board 2  [@nistassemblyboards]. `\label{fig:omega-scenes}`{=latex}](src/assets/actions/gr1.5_appendix_omega_in_dist_scenes.png){width="\\textwidth"}

-   **Workbench:** this scene spans both table-top manipulation tasks and interaction with the vertical back panel, which allows for manipulation of tools and skills ranging from hanging, unhanging, picking and placing.

-   **Computer scene:** this scene requires interacting with real-world computer, cables and peripherals that involves dexterous and precise tasks such as cable handling, insertions and removals.

-   **NIST Assembly Board 2:** based on the Task Assembly Board 2 as defined by NIST, this scene presents complex and precise assembly and removal of the three belts present.

Figure `\ref{fig:example_of_gen_omega}`{=latex} shows examples of task variations to measure instruction, visual and semantic generalization on the Bi-arm Franka robot.

```{=latex}
\centering
```
![Example variations of scenes used for measuring performance across generalization axes on the Bi-arm Franka platform.](src/assets/actions/gr1.5_omega_scenes_gen_appendix.png){#fig:example_of_gen_omega width="\\textwidth"}

#### Task progress score for in-distribution, visual and action generalization tasks

We now define the task progress score for each task in the generalization benchmark for Bi-arm Franka robot.

`\noindent `{=latex}**Workbench scene**

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Workbench scene).} 
\label{tab:franka_workbench}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm}  | p{5cm}  | p{5cm}  |} 
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Workbench scene.}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Hang the tape''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Unhang the tape''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Unhang the level''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot hung the tape successfully;
\item[] 0.7: if the robot grasped the tape and tried to hang it but failed;
\item[] 0.3: if the robot grasped the tape but didn't move towards the hook;
\item[] 0.1: if the robot reached for the tape but failed to grasp it;
\item[] 0.0: if the robot didn't reach for the tape.
\end{itemize}
& 
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the tape got unhooked and ends up anywhere on the table;
\item[] 0.7: if the robot grasped the tape but failed to unhook it;
\item[] 0.3: if the robot reached for the tape on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the level got unhooked and ends up anywhere on the table;
\item[] 0.8: if the robot unhooked the level only from one hook;
\item[] 0.7: if the robot grasped the level but failed to unhook it;
\item[] 0.3: if the robot reached for the level on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
\\
\hline

% --- Row 2 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Remove the red pen''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Place the rightmost wrench on the table''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Remove a gear from the rightmost container''.}\vspace{1pt} \\
\hline
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the red pen ended up anywhere on the table;
\item[] 0.7: if the robot grasped the red pen but failed to take it off;
\item[] 0.3: if the robot reached for the red pen on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the rightmost wrench (with respect to the robot) got unhooked and ended up anywhere on the table;
\item[] 0.7: if the robot grasped the rightmost wrench (with respect to the robot) but failed to unhook it;
\item[] 0.3: if the robot reached for the rightmost wrench (with respect to the robot) on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if at least a gear from the rightmost container (with respect to the robot) ended up anywhere on the table;
\item[] 0.7: if the robot grasped a gear from the rightmost container (with respect to the robot) but failed to remove it from the container;
\item[] 0.3: if the robot reached for a gear from the rightmost container (with respect to the robot) but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
\\
\hline

% --- Row 3 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Place the brush on the table''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Put a gear in the rightmost container''.}\vspace{1pt} & 
% This cell is empty
\\
\hline
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the brush got unhooked and ended up anywhere on the table;
\item[] 0.7: if the robot grasped the brush but failed to unhook it;
\item[] 0.3: if the robot reached for the brush on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if a gear had been moved to the rightmost container (with respect to the robot);
\item[] 0.7: if the robot grasped a gear and attempted to put it in the rightmost container (with respect to the robot) but failed to do to so;
\item[] 0.3: if the robot reached for a gear from the table but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
& 
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\newpage
```
`\noindent`{=latex}**Computer scene**

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Computer scene).} 
\label{tab:franka_computer}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm}  | p{5cm}  | p{5cm}  |} 
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Computer scene.}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Connect the power plug to the computer''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Insert the white power plug into the socket''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Insert the orange LAN cable in the computer''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot fully inserted the power plug into the right place in the computer;
\item[] 0.8: if the robot partially inserted the power plug into the right place in the computer;
\item[] 0.5: if the robot grasped the power plug and tried to insert it in the right place in the computer but failed;
\item[] 0.3: if the robot managed to grasp the power plug but didn't try to insert it into the computer.
\item[] 0.1: if the robot reached for the power plug but failed to grasp it.
\item[] 0.0: if the robot didn't reach for the power plug.
\end{itemize}
& 
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot grasped the plug and successfully inserted it in the socket;
\item[] 0.9: if the robot grasped the plug and partially inserted it in the socket;
\item[] 0.6: if the robot grasped the plug, tried to insert it in the socket but failed;
\item[] 0.3: if the robot grasped the plug but didn't try to insert it in the socket;
\item[] 0.1: if the robot reached for the plug but didn't grasp it;
\item[] 0.0: if the robot didn't reach for the plug.
\end{itemize}
& 
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot grasped the orange internet cable and successfully inserted it in the right socket of the computer;
\item[] 0.9: if the robot grasped the orange internet cable and partially  inserted it in the socket of the computer;
\item[] 0.6: if the robot grasped the orange internet cable, tried to insert it in the right socket of the computer but failed;
\item[] 0.3: if the robot grasped the orange internet cable but didn't try to insert it in the right socket of the computer;
\item[] 0.1: if the robot reached for the orange internet cable but didn't grasp it;
\item[] 0.0: if the robot didn't reach for the orange internet cable.
\end{itemize}
\\
\hline

% --- Row 2 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Hang the headphones on the wall''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Put the headphones on the desk''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Remove the orange LAN cable from the computer''.}\vspace{1pt} \\
\hline
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot hung the headphones;
\item[] 0.7: if the robot grasped the headphones and tried to hang them but failed;
\item[] 0.3: if the robot grasped the headphones but didn't move towards the hall;
\item[] 0.1: if the robot reached for the headphones but failed to grasp it;
\item[] 0.0: if the robot didn't reach for the headphones.
\end{itemize}
& 
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot put the headphones on the desk;
\item[] 0.8: if the robot managed to remove the headphones from the hook;
\item[] 0.1: if the robot reached the headphones but failed to grasp them;
\item[] 0.0: if the robot didn't reach for the headphones.
\end{itemize}
& 
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot managed to remove the orange internet cable from the computer;
\item[] 0.6: if the robot grasped the orange internet cable, tried to remove it from the computer but it failed;
\item[] 0.3: if the robot grasped the orange internet cable but didn't try to remove it from the computer;
\item[] 0.1: if the robot reached for the orange internet cable but didn't grasp it;
\item[] 0.0: if the robot didn't reach for the orange internet cable.
\end{itemize}
\\
\hline

% --- Row 3 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Remove the power plug cable from the computer''}\vspace{1pt} & 
& 
% This cell is empty
\\
\hline
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot managed to remove the power plug cable from the computer;
\item[] 0.6: if the robot grasped the power plug cable, tried to remove it from the computer but it failed;
\item[] 0.3: if the robot grasped the power plug cable but didn't try to remove it from the computer;
\item[] 0.1: if the robot reached for the power plug cable but didn't grasp it;
\item[] 0.0: if the robot didn't reach for the power plug.
\end{itemize}
& 
% Content Col 8 (Empty)
& 
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\newpage
```
`\noindent`{=latex}**NIST Assembly Task Board 2**

We use only in-distribution variations of the NIST Assembly tasks.

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (NIST Assembly Task Board 2).} 
\label{tab:franka_nist}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm}  | p{5cm}  | p{5cm}  |} 
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - NIST Assembly Task Board 2.}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Put the timing belt on the timing pulleys''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Remove the timing belt from the timing pulleys''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Place the orange round belt on the slide tensioners''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: If the robot assembled the timing belt on the timing pulleys and let it go;
\item[] 0.9: if the robot inserted on both wheels correctly;
\item[] 0.7: if the robot pushed the blue tensioner, with the belt loosely on the second wheel;
\item[] 0.5: if the robot inserted on one wheel;
\item[] 0.3: if the robot handed the belt over to grasp the timing belt with both arms;
\item[] 0.1: if the robot grasped and lifted the timing belt;
\item[] 0.0: if the robot couldn't even pick up the timing belt.
\end{itemize}
& 
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot let go of the timing belt;
\item[] 0.9: if the robot put the timing belt on the ground;
\item[] 0.8: if the robot fully unhooked the timing belt from the two pulleys and the tensioner;
\item[] 0.6: if the robot unhooked the timing belt from a pulley and the tensioner;
\item[] 0.3: if the robot unhooked the timing belt from one of the pulleys;
\item[] 0.1: if the robot grasped the timing belt;
\item[] 0.0: if the robot did not do anything.
\end{itemize}
& 
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot let go of the orange belt;
\item[] 0.9: if the robot inserted it on both wheels correctly;
\item[] 0.5: if the robot handed the belt over to grasp the orange belt with both arms;
\item[] 0.1: if the robot grasped and lifted the orange belt with the left arm;
\item[] 0.0: if the robot couldn't even pick up the orange belt.
\end{itemize}
\\
\hline

% --- Row 2 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Remove the orange round belt from the slide tensioners''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Put the chain belt on the sprocket idlers''.}\vspace{1pt} & 
\vspace{1pt}\textbf{``Remove the chain from the sprockets''.}\vspace{1pt} \\
\hline
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot let go of the orange belt;
\item[] 0.9: if the robot put the orange belt on the table top;
\item[] 0.5: if the robot unhooked the orange belt from both tensioners;
\item[] 0.1: if the robot grasped and unhooked the orange belt from one of the tensioners;
\item[] 0.0: if the robot couldn't even pick up the orange belt.
\end{itemize}
& 
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot let go of the chain;
\item[] 0.9: if the robot inserted it on both sprockets and aligned it with the tensioner;
\item[] 0.7: if the robot inserted it on two sprockets;
\item[] 0.5: if the robot inserted it on one sprocket;
\item[] 0.3: if the robot handed the chain over to grasp it with both arms;
\item[] 0.1: if the robot grasped and lifted the metal chain;
\item[] 0.0: if the robot couldn't even pick up the metal chain.
\end{itemize}
& 
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the robot let go of the chain;
\item[] 0.9: if the robot put the chain on the ground;
\item[] 0.8: if the robot fully unhooked the chain from both sprockets and the chain tensioner;
\item[] 0.6: if the robot unhooked the chain from two sprockets or one sprocket and the chain tensioner;
\item[] 0.3: if the robot unhooked the chain from one sprocket and/or the chain tensioner;
\item[] 0.1: if the robot grasped the chain;
\item[] 0.0: if the robot did not do anything.
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
#### Task progress score for semantic generalization tasks

**Workbench scene**

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Semantic Gen. - Workbench).}
\label{tab:franka_semantic_gen_workbench}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Semantic Generalization (Workbench).}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``quitar la cinta''. }\vspace{1pt} &
\vspace{1pt}\textbf{``unhang the long red tool''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Put the object used for cleaning on the table''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the tape got unhooked and ended up anywhere on the table;
\item[] 0.7: if the robot grasped the tape but failed to unhook it;
\item[] 0.3: if the robot reached for the tape on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the level got unhooked and ended up anywhere on the table;
\item[] 0.8: if the robot unhooked the level only from one hook;
\item[] 0.7: if the robot grasped the level but failed to unhook it;
\item[] 0.3: if the robot reached for the level on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: if the brush got unhooked and ended up anywhere on the table;
\item[] 0.7: if the robot grasped the brush but failed to unhook it;
\item[] 0.3: if the robot reached for the brush on the back-panel but failed to grasp it;
\item[] 0.0: if anything else happened.
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\newpage
```
`\noindent`{=latex}**Computer scene**

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Semantic Gen. - Computer Scene).}
\label{tab:franka_semantic_gen_computer}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
% Added vspace before and after the text
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Semantic Generalization (Computer Scene).}\vspace{1pt}} \\
\hline

% --- Row 1 ---
% Added vspace before and after each \textbf
\vspace{1pt}\textbf{``Desconecta el cable de red naranja de la computadora''.}\vspace{1pt} &
\vspace{1pt}\textbf{``conecta el enchufe a la computadora''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Hong the eadphones to te wal''.}\vspace{1pt} \\
\hline
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: the robot managed to remove the power plug cable from the computer;
\item[] 0.6: the robot grasped the power plug cable, tried to remove it from the computer but it failed;
\item[] 0.3: the robot grasped the power plug cable but didn't try to remove it from the computer;
\item[] 0.1: the robot reached for the power plug cable but didn't grasp it;
\item[] 0.0: the robot didn't reach for the power plug.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: The robot fully inserted the power plug into the right place in the computer;
\item[] 0.8: The robot partially inserted the power plug into the right place in the computer;
\item[] 0.5: The robot grasped the power plug and tried to insert it in the right place in the computer but failed;
\item[] 0.3: The robot managed to grasp the power plug but didn't try to insert it into the computer.
\item[] 0.1: The robot reached for the power plug but failed to grasp it.
\item[] 0.0: The robot didn't reach for the power plug.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.0: The robot hung the headphones;
\item[] 0.7: The robot grasped the headphones and tried to hang them but failed;
\item[] 0.3: The robot grasped the headphones but didn't move towards the hall;
\item[] 0.1: The robot reached for the headphones but failed to grasp it;
\item[] 0.0: The robot didn't reach for the headphones.
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\newpage
```
### Apollo humanoid robot

We defined a new generalization benchmark for our Apollo humanoid robot. This benchmark includes a total of 24 tasks, of those in distribution and including variants of them across different axes: instruction, visual and action generalization, following the same approach we used to define the ALOHA robot benchmark in  [@team2025gemini]. For the task generalization category we used the same tasks described in `\ref{appendix:task-gen-aloha}`{=latex} for ALOHA.

Figure `\ref{fig:example_of_gen_atari}`{=latex} shows examples of task variations to measure instruction, visual and semantic generalization on the Apollo humanoid robot.

```{=latex}
\centering
```
![Example variations of scenes used for measuring performance across generalization axes on the Apollo humanoid platform. ](src/assets/actions/gr1.5_atari_scenes_gen_appendix.png){#fig:example_of_gen_atari width="\\textwidth"}

#### Task progress score for in-distribution, visual, semantic and action generalization tasks

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Apollo humanoid (In-distribution and Visual Gen.).}
\label{tab:apollo_visual_gen}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Apollo humanoid - In-distribution and Visual Generalization.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Pick up the gummy bear bag and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the rubik’s cube and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the egg from the bottom right slot of the yellow egg box with your right hand and place the egg in the tray''.}\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the egg with its right hand and placed it in the tray;
\item[] 0.67:~if the robot picked up the egg with its right hand;
\item[] 0.33:~if the robot touched the egg with its right hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Pick up the egg from the bottom right slot of the yellow egg box with your left hand and place the egg in the tray''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the bottom green lettuce from the table with your right hand''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the right orange traffic cone from the table with your right hand''.}\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the egg with its left hand and placed it in the tray;
\item[] 0.67:~if the robot picked up the egg with its left hand;
\item[] 0.33:~if the robot touched the egg with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the bottom green lettuce with its right hand;
\item[] 0.75:~if the robot touched the bottom green lettuce with its right hand;
\item[] 0.50:~if the robot picked up lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.25:~if the robot touched lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the right orange traffic cone with its right hand;
\item[] 0.75:~if the robot touched the right orange traffic cone with its right hand;
\item[] 0.50:~if the robot picked up a traffic cone, but not the right orange traffic cone;
\item[] 0.25:~if the robot touched a traffic cone, but not the right orange traffic cone;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Pick up the blue vehicle toy with your right hand and place it in the white bowl''}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the light pink color soft toy with your left hand and place it in the brown bowl''}\vspace{1pt} &
% This cell is empty
\\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the blue vehicle toy with its right hand and placed it in the white bowl;
\item[] 0.67:~if the robot picked up the blue vehicle toy with its right hand;
\item[] 0.33:~if the robot touched the blue vehicle toy with its right hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the light pink color soft toy with its left hand and placed it in the brown bowl;
\item[] 0.67:~if the robot picked up the light pink color soft toy with its left hand;
\item[] 0.33:~if the robot touched the light pink color soft toy with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Apollo humanoid (Semantic Generalization).}
\label{tab:apollo_semantic_gen}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Apollo humanoid - Semantic Generalization.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Recoge la pelota antiestrés y colócala en la bolsa blanca''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pikc up the stress ball and palce it in the wite bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Tome el huevo de la ranura inferior derecha de la caja de huevos amarilla con la mano izquierda y coloque el huevo en la bandeja''.}\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the egg with its left hand and placed it in the tray;
\item[] 0.50:~if the robot picked up the egg with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Pick up the eg form the bottom right splot of the yellwo egg box with your left hand and place the egg in the try''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Recoge con tu mano derecha la lechuga verde que está al final de la mesa cerca de ti''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pikc up the bottm green lettuuce from the tble with yor rght haand''.}\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the egg with its left hand and placed it in the tray;
\item[] 0.50:~if the robot picked up the egg with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the bottom green lettuce with its right hand;
\item[] 0.75:~if the robot touched the bottom green lettuce with its right hand;
\item[] 0.50:~if the robot picked up lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.25:~if the robot touched lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the bottom green lettuce with its right hand;
\item[] 0.75:~if the robot touched the bottom green lettuce with its right hand;
\item[] 0.50:~if the robot picked up lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.25:~if the robot touched lettuce with its right hand, but not the bottom green lettuce;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Recoge el vehículo de juguete azul y colócalo en el recipiente blanco''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pic up the blu vehical toy and place it in teh wite bowl''.}\vspace{1pt} &
% This cell is empty
\\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the blue vehicle toy and placed it in the white bowl;
\item[] 0.50:~if the robot picked up the blue vehicle toy;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the blue vehicle toy and placed it in the white bowl;
\item[] 0.50:~if the robot picked up the blue vehicle toy;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Apollo humanoid (Action Generalization).}
\label{tab:apollo_action_gen}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Apollo humanoid - Action Generalization.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Pick up the black flashlight and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the black and brown snack bag and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the orange mentos container and place it in the white bag''.}\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Pick up the orange and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the purple die and place it in the white bag''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the blue cereal box with your right hand''.}\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked the correct object and placed it in the bag;
\item[] 0.50:~if the robot picked the wrong object but correctly placed it in the bag;
\item[] 0.50:~if the robot picked the correct object but didn't place it in the bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the blue cereal box with its right hand;
\item[] 0.67:~if the robot picked up the blue cereal box with its left hand;
\item[] 0.33:~if the robot touched the blue cereal box with its right hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Pick up the leftmost green pringles can from the top shelf of the black rack with your left hand''.}\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the red cereal box with your left hand''.}\vspace{1pt} &
% This cell is empty
\\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the green pringles can with its left hand;
\item[] 0.67:~if the robot picked up the green pringles can with its right hand;
\item[] 0.33:~if the robot touched the green pringles can with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the red cereal box with its left hand;
\item[] 0.67:~if the robot picked up the red cereal box with its right hand;
\item[] 0.33:~if the robot touched the red cereal box with its left hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
#### Qualitative Results

Controlling a humanoid is much more challenging than the Aloha or the bi-arm Franka robots, due to higher degree-of-freedom whole-body control, multi-finger dexterity, and limited visibility of the workspace that requires active perception. `\cref{fig:humanoid_gen_images}`{=latex} shows qualitative examples which demonstrate remarkable degrees of generalization of `\grlatest{}`{=latex} on the humanoid platform. We observe that our model can control Apollo to grasp novel objects and place them into receptacles that were not seen in the training data. Depending on the objects to manipulate, different grasping strategies emerge automatically. Furthermore, we also observe successful object manipulation on surfaces with different heights that were not seen during training as well.

```{=latex}
\centering
```
![Qualitative examples of generalization on the humanoid robot: `\grlatest{}`{=latex} learns multiple grasp strategies, can manipulate objects at different heights, and grasp novel objects and place them into receptacles that were never seen during training. `\label{fig:humanoid_gen_images}`{=latex}](src/assets/actions/apollo_gen.png){width="\\textwidth"}

```{=latex}
\newpage
```
Cross-embodiment benchmark {#appendix:cross-embodiment-benchmark}
--------------------------

In order to measure Motion Transfer across our three robots we define benchmarks for testing tasks on a Robot A for which data was only collected on Robot B and vice-versa. We focus on the following scenarios.

### Bi-arm Franka $\to$ ALOHA benchmark

The ALOHA robot data is diverse and enables the execution of a multitude of tasks as demonstrated in our prior work [@team2025gemini]. However, the vast majority of this data focuses on table top tasks with limited interaction on the vertical axis. Meanwhile for Bi-arm Franka robot we have collected data involving interacting with a vertically mounted back-panel that goes beyond the typical motion range for ALOHA. As such, tasks in this scenario (e.g. hanging/unhanging tools) provide an ideal benchmark for motion transfer as a model trained with solely ALOHA data cannot solve them. Following this strategy, we design a set of 10 tasks in this benchmark. Figure `\ref{fig:omega2aloha-images}`{=latex} provides a visual overview of the tasks and Table `\ref{tab:franka_to_aloha_benchmark}`{=latex} provides the progress score definition for each task.

```{=latex}
\centering
```
![Example of execution of cross-embodiment tasks for Bi-arm Franka $\to$ ALOHA benchmark. `\label{fig:omega2aloha-images}`{=latex}](src/assets/actions/cross-emb-benchmark-visual.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka \texorpdfstring{$\to$}{->} ALOHA Benchmark.}
\label{tab:franka_to_aloha_benchmark}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka \texorpdfstring{$\to$}{->} ALOHA.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Hang the orange belt on the small metal hook on the wall''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Hang the timing belt on the small metal hook on the wall''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Hang the tape to the wall''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot hung the orange belt on the hook;
\item[] 0.90:~if the robot tried to hang the orange belt on the hook;
\item[] 0.80:~if the robot moved towards the hook after grasping the orange belt;
\item[] 0.20:~if the robot grasped the orange belt;
\item[] 0.10:~if the robot reached the orange belt;
\item[] 0.00:~if the robot didn't reach for the orange belt.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot hung the timing belt on the hook;
\item[] 0.90:~if the robot tried to hang the timing belt on the hook;
\item[] 0.80:~if the robot moved towards the hook after grasping the timing belt;
\item[] 0.20:~if the robot grasped the timing belt;
\item[] 0.10:~if the robot reached the timing belt;
\item[] 0.00:~if the robot didn't reach for the timing belt.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot hung the tape;
\item[] 0.90:~if the robot tried to hang the tape on the hook;
\item[] 0.80:~if the robot moved towards the hook after grasping the tape;
\item[] 0.20:~if the robot grasped the tape;
\item[] 0.10:~if the robot reached the tape;
\item[] 0.00:~if the robot didn't reach for the tape.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Unhang the brush and put it on the table''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up a marker and put it on the table''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Unhang the tape and put it on the table''}.\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot unhooked the brush and it ends up anywhere on the table;
\item[] 0.70:~if the robot grasped the brush but fails to unhook it;
\item[] 0.30:~if the robot attempted to grasp the brush on the backpanel but couldn't succeed in doing so;
\item[] 0.00:~if the robot has no success.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot managed to pick up a marker and put it on the desk;
\item[] 0.90:~if the robot managed to pick up a marker;
\item[] 0.50:~if the robot tried to pick up a marker;
\item[] 0.10:~if the robot reached to a marker but didn't try to pick it up;
\item[] 0.05:~if the robot reached toward the markers but didn't get close enough;
\item[] 0.00:~if the robot didn't reach toward the markers.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot unhooked the tape and it ends up anywhere on the table;
\item[] 0.70:~if the robot grasped the tape but fails to unhook it;
\item[] 0.30:~if the robot attempted to grasp the tape on the backpanel but couldn't succeed in doing so;
\item[] 0.00:~if the robot has no success.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Unplug the usb cable''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Unplug the power cable''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Hang the headphones on the small metal hook on the wall''}.\vspace{1pt} \\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot unplugs the cable;
\item[] 0.70:~if the robot tried to unplug the cable;
\item[] 0.40:~if the robot grasped the cable;
\item[] 0.20:~if the robot reached toward the cable but didn't get close enough;
\item[] 0.00:~if the robot didn't reach toward the cable.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot unplugs the cable;
\item[] 0.70:~if the robot tried to unplug the cable;
\item[] 0.40:~if the robot grasped the cable;
\item[] 0.20:~if the robot reached toward the cable but didn't get close enough;
\item[] 0.00:~if the robot didn't reach toward the cable.
\end{itemize}
&
% Content Col 9
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot managed to hang the headphones on the hook;
\item[] 0.70:~if the robot tried to hang the headphones on the hook;
\item[] 0.50:~if the robot moved the headphones toward the hook but didn't get close enough;
\item[] 0.30:~if the robot lifted the headphones;
\item[] 0.20:~if the robot grasped the headphones but didn't lift them;
\item[] 0.10:~if the robot reached toward the headphones but didn't get close enough;
\item[] 0.00:~if the robot didn't reach toward the headphones.
\end{itemize}
\\
\hline

% --- Row 4 (Headers) ---
\vspace{1pt}\textbf{``Put the mouse in one of the blue bins''}.\vspace{1pt} &
% This cell is empty
&
% This cell is empty
\\
\hline
% --- Row 4 (Content) ---
% Content Col 10
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot managed to put the mouse in the bin;
\item[] 0.70:~if the robot tried to put the mouse in a bin;
\item[] 0.50:~if the robot moved the mouse towards a bin but didn't get close enough;
\item[] 0.30:~if the robot lifted the mouse;
\item[] 0.10:~if the robot grasped the mouse but didn't lift it;
\item[] 0.05:~if the robot reached toward the mouse but didn't get close enough;
\item[] 0.00:~if the robot didn't reach toward the mouse.
\end{itemize}
&
% Content Col 11 (Empty)
&
% Content Col 12 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
### ALOHA benchmark $\to$ Bi-arm Franka and Apollo humanoid robot $\to$ Bi-arm Franka

We identified in the ALOHA data tasks that require specific motions such as open drawers or closing a pear-shaped organizer (task that requires precise control) which are not available in the Bi-arm Franka data. In addition we added some easier packing tasks to see whether motion transfer could not only transfer new skills but also improves performance on easier tasks. We followed a similar procedure to define the tasks to measure motion transfer from the Humanoid robot to the Bi-arm Franka. Figure `\ref{fig:aloha2omega-images}`{=latex} and `\ref{fig:atari2omega-images}`{=latex} provides a visual overview of the tasks (11 in total) and Table `\ref{tab:cross_emb_to_franka}`{=latex} provides the progress score definition for each task.

```{=latex}
\centering
```
![Example of execution of cross-embodiment tasks for ALOHA $\to$ Bi-arm Franka benchmark. `\label{fig:aloha2omega-images}`{=latex}](src/assets/actions/cross-emb-benchmark-visual-omega.png){width="\\textwidth"}

```{=latex}
\centering
```
![Cross-embodiment tasks for Apollo humanoid $\to$ Bi-arm Franka benchmark. `\label{fig:atari2omega-images}`{=latex}](src/assets/actions/cross-emb-benchmark-visual-atari.png){width="70%"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: ALOHA/Humanoid \texorpdfstring{$\to$}{->} Bi-arm Franka.}
\label{tab:cross_emb_to_franka}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: ALOHA/Humanoid \texorpdfstring{$\to$}{->} Bi-arm Franka.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Close the pear organizer''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put the grapes in the largest compartment of the bento box''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Pour the toy from the blue bowl into the red plate''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the pear top onto the bottom so it covers the compartment;
\item[] 0.50:~if the robot held the pear top above the center compartment of the base;
\item[] 0.20:~if the robot grasped the pear top;
\item[] 0.05:~if the robot reached toward the pear top;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the grapes in the largest compartment;
\item[] 0.50:~if the robot put the grapes anywhere on the bento box;
\item[] 0.20:~if the robot grasped the grapes;
\item[] 0.10:~if the robot reached for the grapes;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot poured at least one toy onto the plate;
\item[] 0.70:~if the robot tried to pour the toys onto the plate;
\item[] 0.30:~if the robot held the bowl above the plate;
\item[] 0.20:~if the robot grasped the bowl;
\item[] 0.05:~if the robot reached for the bowl;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Open the top drawer by pulling the knob''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put the yellow lego in the bag''}.\vspace{1pt} & 
\vspace{1pt}\textbf{``Open the middle drawer by pulling the red knob''}.\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot opened the top drawer by pulling the knob;
\item[] 0.90:~if the robot opened the top drawer;
\item[] 0.80:~if the robot opened any drawer;
\item[] 0.50:~if the robot grasped any knob;
\item[] 0.10:~if the robot reached for any knob;
\item[] 0.05:~if the robot reached toward the drawer;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the yellow block in the bag;
\item[] 0.70:~if the robot put any block in the bag;
\item[] 0.50:~if the robot held the yellow block above the bag;
\item[] 0.20:~if the robot grasped the yellow block;
\item[] 0.10:~if the robot reached for the yellow block;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot opened the correct drawer;
\item[] 0.60:~if the robot opened any drawer;
\item[] 0.30:~if the robot grasped any knob;
\item[] 0.10:~if the robot reached for any knob;
\item[] 0.05:~if the robot reached toward the drawer;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Pick up the stress ball and place it in the white bag''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the white socks and place it in the white bag''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the rubiks cube and place it in the white bag''}.\vspace{1pt} \\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the object in the white bag;
\item[] 0.50:~if the robot grasped the object and brings it over the white bag;
\item[] 0.25:~if the robot grasped the object;
\item[] 0.00:~if the robot did not grasp the object.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the object in the white bag;
\item[] 0.50:~if the robot grasped the object and brings it over the white bag;
\item[] 0.25:~if the robot grasped the object;
\item[] 0.00:~if the robot did not grasp the object.
\end{itemize}
&
% Content Col 9
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the object in the white bag;
\item[] 0.50:~if the robot grasped the object and brings it over the white bag;
\item[] 0.25:~if the robot grasped the object;
\item[] 0.00:~if the robot did not grasp the object.
\end{itemize}
\\
\hline

% --- Row 4 (Headers) ---
\vspace{1pt}\textbf{``Pick up the black beanie and place it in the white bag''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Pick up the gummy bear bag and place it in the white bag''}.\vspace{1pt} &
% This cell is empty
\\
\hline
% --- Row 4 (Content) ---
% Content Col 10
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the object in the white bag;
\item[] 0.50:~if the robot grasped the object and brings it over the white bag;
\item[] 0.25:~if the robot grasped the object;
\item[] 0.00:~if the robot did not grasp the object.
\end{itemize}
&
% Content Col 11
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the object in the white bag;
\item[] 0.50:~if the robot grasped the object and brings it over the white bag;
\item[] 0.25:~if the robot grasped the object;
\item[] 0.00:~if the robot did not grasp the object.
\end{itemize}
&
% Content Col 12 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
### ALOHA benchmark $\to$ Humanoid robot

We followed a similar approach for the Humanoid robot: we looked into skills in the ALOHA dataset that are not covered by the action data for the Humanoid. One example to highlight this is the task \"open the wardrobe\", which is a motor skill completely outside the covergae of the Humanoid dataset. Fig `\ref{fig:atari-cross-images}`{=latex} shows visuals of the 7 tasks in this benchmark and Table `\ref{tab:aloha_to_humanoid_benchmark}`{=latex} reports the definition of the progress score for each task.

```{=latex}
\centering
```
![Cross-embodiment tasks for ALOHA $\to$ Humanoid benchmark. `\label{fig:atari-cross-images}`{=latex}](src/assets/actions/cross-emb-benchmark-visual-aloha2atari.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: ALOHA \texorpdfstring{$\to$}{->} Humanoid Benchmark.}
\label{tab:aloha_to_humanoid_benchmark}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: ALOHA \texorpdfstring{$\to$}{->} Humanoid robot.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Pick up the green shirt with its right hand and put it into the brown wicker basket''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Push the door to the right with its left hand''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Push the red knob to close the drawer with its left hand''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the green shirt with its right hand and put it into the brown wicker basket;
\item[] 0.50:~if the robot picked up the green shirt with its right hand;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot pushed the door to the right with its left hand (door should be at least halfway through);
\item[] 0.50:~if the robot's left hand touched the door;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot closes the drawer at least halfway through;
\item[] 0.50:~if the robot's hand touched the red knob;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Stack the orange tray on the drawer''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put blue block in the left drawer''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put green block in the middle drawer''}.\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the orange tray and stacked it on the drawer;
\item[] 0.50:~if the robot picked up the orange tray;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the blue block and put it in the left drawer;
\item[] 0.50:~if the robot picked up the blue block;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot picked up the green block and put it in the middle drawer;
\item[] 0.50:~if the robot picked up the green block;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Wipe the pear with towel''}.\vspace{1pt} & 
% This cell is empty
&
% This cell is empty
\\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot wiped the pear with the towel;
\item[] 0.50:~if the robot grabbed the towel;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 8 (Empty)
&
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\clearpage
```
Multi-step benchmark {#appendix:medium-horizon-benchmark}
--------------------

The multi-step benchmarks combine individual tasks into compound instructions (i.e., A then B then C) or abstract, goal-oriented instructions. For the compound instructions, we typically require a particular order for the individual tasks in order to achieve a full score of 1.0. For example, for the Apollo humanoid, several of our multi-step tasks combine individual gift packing tasks into compound instructions such as \"put the stress ball in the white bag, then put the gummy bear bag in the white bag, then put the white socks in the white bag\" and require this exact order for a perfect score. We also include abstract, goal-oriented instructions such as \`\`pack all the gifts" or \`\`sort the snacks into their matching containers by color", though these are less common across the benchmarks. The next paragraphs show the tasks and their progress score system for each embodiment.

### ALOHA robot

Figure `\ref{fig:medium-horizon-aloha}`{=latex} shows visuals of the tasks and Table `\ref{tab:aloha_medium_horizon}`{=latex} defines the progress score for each task.

```{=latex}
\centering
```
![Tasks in the multi-step benchmark for the ALOHA. `\label{fig:medium-horizon-aloha}`{=latex}](src/assets/actions/Medium-horizon-aloha.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: ALOHA Robot (Multi-step benchmark).}
\label{tab:aloha_medium_horizon}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: ALOHA Robot - Multi-step.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Unzip the lunchbag and then place the banana in the lunchbag''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Place the red pepper, green apple, and yellow cube in plates with matching colors''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Place the red pepper in the red plate, then put green apple in the green plate, then put yellow cube in the yellow plate''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot fully unzipped the lunch bag and placed the banana in the lunchbag;
\item[] 0.50:~if the robot fully unzipped the lunch bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed all three objects correctly in the matching plates;
\item[] 0.70:~if the robot placed two objects on the correct matching plates, but failed for third object;
\item[] 0.40:~if if the robot picked one object and successfully placed it on the correct plate, but failed to do that for second object;
\item[] 0.20:~if the robot picked one object but failed to put it on the right plate;
\item[] 0.00:~if the robot didn't approach any of the objects or approached plates directly.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed all three objects correctly in the matching plates;
\item[] 0.70:~if the robot placed two objects on the correct matching plates, but failed for third object;
\item[] 0.40:~if the robot picked one object and successfully placed it on the correct plate, but failed to do that for second object;
\item[] 0.20:~if the robot picked one object but failed to put it on the right plate;
\item[] 0.00:~if the robot didn't approach any of the objects or approached plates directly.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Open the middle drawer, then left, then right''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Open the right drawer, then left, then middle''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Open the lid of the trash bin, then pick up the ball and place it inside the trash bin, then close the lid of the trash bin''}.\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot opened all three drawers in correct order;
\item[] 0.60:~if the robot opened the middle then left drawer but failed to open the right drawer;
\item[] 0.30:~if the robot opened the middle drawer but failed to open the left drawer next;
\item[] 0.00:~if the robot failed to open the middle drawer first.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot opened all three drawers in correct order;
\item[] 0.60:~if the robot opened the right then left drawer but failed to open the middle drawer;
\item[] 0.30:~if the robot opened the right drawer but failed to open the left drawer next;
\item[] 0.00:~if the robot failed to open the right drawer first.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully opened the lid, put the ball in the bin, then closes the lid;
\item[] 0.75:~if the robot failed to close the lid after putting the ball in the trash bin;
\item[] 0.50:~if the robot opened the trash bin lid, picked up the ball, but failed to put the ball in the trash bin;
\item[] 0.25:~if the robot opened the trash bin lid but failed to pick up the ball next;
\item[] 0.00:~if the robot didn't open the trash bin lid.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Take the green plate out of the white bag, and put the stress ball in the white bag, and push the bag to the top of the table''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put all the yellow fruits on the yellow plate, green fruit on the green plate, red fruit on the red plate''}.\vspace{1pt} &
% This cell is empty
\\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot completes the entire task successfully;
\item[] 0.70:~if the robot took the green plate out of the bag, put the stress ball in the bag, but failed to push the bag to the top of the table;
\item[] 0.30:~if the robot took the green plate out of the bag but failed to put the stress ball into the bag;
\item[] 0.00:~if the robot failed to take the green plate out of the bag.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put all fruits in the correct plate;
\item[] 0.80:~if the robot put 4 fruits in the correct plate;
\item[] 0.60:~if the robot put 3 fruits in the correct plate;
\item[] 0.20:~if the robot put 2 fruits in the correct plate;
\item[] 0.20:~if the robot put 1 fruit in the correct plate;
\item[] 0.00:~if the robot put none of the fruits in the correct plate.
\end{itemize}
&
% Content Col 9 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
### Bi-arm Franka robot

Figure `\ref{fig:medium-horizon-aloha}`{=latex} shows visuals of the tasks and Table `\ref{tab:franka_medium_horizon}`{=latex} defines the progress score for each task.

```{=latex}
\centering
```
![Tasks in the multi-step benchmark for the Bi-arm Franka. `\label{fig:medium-horizon-omegs}`{=latex}](src/assets/actions/Medium-horizon-omega.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Multi-step benchmark).}
\label{tab:franka_medium_horizon}
\vspace{1pt} % Added 1pt of space after the caption
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Multi-step.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{1pt}\textbf{``Insert the power cable in the computer, insert the orange lan cable and unhang the headphones''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Insert the power cable and unhang the headphones from the wall''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Remove the power cable from the computer, hang the headphones and unplug the black power cable from the socket''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully completed the third task;
\item[] 0.75:~if the robot successfully completed the remaining task;
\item[] 0.60:~if the robot successfully completed the second task;
\item[] 0.45:~if the robot grasped and attempted one of the remaining two tasks;
\item[] 0.30:~if the robot successfully inserted the power cable or the lan cable or unhung the headphones;
\item[] 0.15:~if the robot grasped and attempted to insert the power cable or the lan cable or unhang the headphones;
\item[] 0.00:~if the robot didn't attempt to insert the power cable or insert the lan cable or unhang the headphones.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully completed both tasks;
\item[] 0.75:~if the robot attempted to complete the remaining task;
\item[] 0.50:~if the robot successfully inserted the cable or unhung the headphones;
\item[] 0.25:~if the robot attempted to insert the cable or unhang the headphones;
\item[] 0.00:~if the robot didn't do anything.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully completed the remaining task;
\item[] 0.75:~if the robot grasped and attempted the remaining task;
\item[] 0.60:~if the robot is successful in completing one of the remaining two tasks;
\item[] 0.45:~if the robot grasped and attempted one of the remaining two tasks;
\item[] 0.30:~if the robot is successful in removing the power cable, unplugging the plug from the socket or hanging the headphones;
\item[] 0.15:~if the robot grasped and attempted to either remove the power cable, unplug the plug from the socket or to hang the headphones;
\item[] 0.00:~if the robot didn't attempt to remove the power cable, unplug the plug from the socket or to hang the headphones.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{1pt}\textbf{``Remove the orange lan cable and then the power cable from the computer''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Place all the markers on the table''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Place all the wrenches on the table''}.\vspace{1pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully removed both cables;
\item[] 0.75:~if the robot grasped and attempted to remove the remaining cable;
\item[] 0.50:~if the robot successfully removed either the lan cable or the power cable from the computer;
\item[] 0.25:~if the robot grasped and attempted to remove either the lan cable or the power cable from the computer;
\item[] 0.00:~if the robot didn't attempt to remove either cable.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the remaining marker on the table;
\item[] 0.80:~if the robot grasped the remaining marker;
\item[] 0.70:~if the robot reached for the remaining marker of any colour;
\item[] 0.67:~if the robot placed the second marker on the table;
\item[] 0.50:~if the robot grasped the second marker of any colour;
\item[] 0.40:~if the robot reached for the second marker of any colour;
\item[] 0.33:~if the robot placed the first marker on the table;
\item[] 0.20:~if the robot grasped the first marker of any colour;
\item[] 0.10:~if the robot reached for a first marker of any colour;
\item[] 0.00:~if the robot placed no markers on the table.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully unhung and placed the fourth wrench on the table;
\item[] 0.90:~if the robot grasped the fourth wrench;
\item[] 0.85:~if the robot reached for a fourth wrench;
\item[] 0.75:~if the robot successfully unhung and placed the third wrench on the table;
\item[] 0.65:~if the robot grasped the third wrench;
\item[] 0.60:~if the robot reached for a third wrench;
\item[] 0.50:~if the robot successfully unhung and placed the second wrench on the table;
\item[] 0.40:~if the robot grasped the second wrench;
\item[] 0.35:~if the robot reached for a second wrench;
\item[] 0.25:~if the robot successfully unhung and placed the first wrench on the table;
\item[] 0.15:~if the robot grasped the first wrench;
\item[] 0.10:~if the robot reached for a first wrench;
\item[] 0.00:~if the robot placed no wrenches on the table.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{1pt}\textbf{``Place all the gears from the table into any container''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Put the tape and then the scissors onto the table''}.\vspace{1pt} &
\vspace{1pt}\textbf{``Place the spirit level and the brush on the table''}.\vspace{1pt} \\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the remaining gear in any container;
\item[] 0.80:~if the robot grasped the remaining third gear on the table;
\item[] 0.70:~if the robot reached for the remaining third gear on the table;
\item[] 0.66:~if the robot placed the second gear in any container;
\item[] 0.50:~if the robot grasped the second gear;
\item[] 0.40:~if the robot reached for a second gear on the table;
\item[] 0.33:~if the robot placed the first gear in any container;
\item[] 0.20:~if the robot grasped the first gear;
\item[] 0.10:~if the robot reached for a first gear on the table;
\item[] 0.00:~if the robot didn't place or attempt to put back any gears back in the container.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the scissors on the table;
\item[] 0.75:~if the robot grasped the scissors;
\item[] 0.60:~if the robot reached for the scissors;
\item[] 0.50:~if the robot successfully placed the tape on the table;
\item[] 0.25:~if the robot grasped the tape;
\item[] 0.10:~if the robot reached for the tape;
\item[] 0.00:~if the robot didn't place any item on the table.
\end{itemize}
&
% Content Col 9
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot placed the remaining item on the table;
\item[] 0.75:~if the robot grasped the remaining item;
\item[] 0.60:~if the robot reached for the remaining item;
\item[] 0.50:~if the robot successfully placed the spirit level or the brush on the table;
\item[] 0.25:~if the robot grasped the spirit level or the brush;
\item[] 0.10:~if the robot reached for either the spirit level or the brush;
\item[] 0.00:~if the robot didn't place any item on the table.
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
### Humanoid robot

Figure `\ref{fig:atari-images-medium}`{=latex} shows visuals of the tasks and Table `\ref{tab:humanoid_medium_horizon}`{=latex} defines the progress score for each task.

```{=latex}
\centering
```
![Tasks in the multi-step benchmark for the Apollo humanoid. `\label{fig:atari-images-medium}`{=latex}](src/assets/actions/Medium-horizon-atari.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Humanoid Robot (Multi-step benchmark).}
\label{tab:humanoid_medium_horizon}
\begin{tabular}{| p{5cm} | p{5cm} | p{5cm} |}
\toprule

% --- SECTION 1 ---
\multicolumn{3}{c}{\textbf{Benchmark: Humanoid Robot - Multi-step.}} \\
\hline

% --- Row 1 (Headers) ---
\vspace{0.5pt}\textbf{``First put the gummy bear bag in the white bag and then put the stress ball in the white bag''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``First put the rubik's cube in the white bag and then put the white socks in the white bag''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Put the stress ball in the white bag, then put the gummy bear bag in the white bag, then put the white socks in the white bag''}.\vspace{0.5pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully put gummy bear bag into the white bag and then put stress ball into the white bag and only those 2 items;
\item[] 0.90:~if the robot put the gummy bear bag in the white bag and then put the stress ball in the white bag;
\item[] 0.75:~if the robot put the gummy bear bag in the white bag and then picked up the stress ball;
\item[] 0.60:~if the robot put the gummy bear bag in the white bag;
\item[] 0.45:~if the robot picked up the gummy bear bag and did not place it in the white bag;
\item[] 0.30:~if the robot put the stress ball and the gummy bear bag in the white bag but in the wrong order;
\item[] 0.20:~if the robot put the stress ball in the white bag;
\item[] 0.10:~if the robot put something in the white bag but it\'s not the stress ball or the gummy bear bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully put the rubik\'s cube in the white bag and then put the white socks into the white bag and only those 2 items;
\item[] 0.90:~if the robot put the rubik\'s cube in the white bag and then put the white socks in the white bag -- order matters;
\item[] 0.75:~if the robot put the rubik\'s cube in the white bag and then picked up the white socks -- order matters;
\item[] 0.60:~if the robot put the rubik\'s cube in the white bag;
\item[] 0.45:~if the robot picked up the rubik\'s cube and did not place it in the white bag;
\item[] 0.30:~if the robot put the white socks in the bag and then put the rubik\'s cube in the white bag;
\item[] 0.20:~if the robot put the white socks in the white bag;
\item[] 0.10:~if the robot put something in the white bag but it\'s *not* the rubik\'s cube or white socks;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the stress ball in the white bag and then put the gummy bears in the white bag and then put the white socks in the white bag;
\item[] 0.90:~if the robot put the stress ball in the white bag and then put the gummy bears in the white bag and then put the white socks in the white bag;
\item[] 0.75:~if the robot put the stress ball in the white bag and then put the gummy bears in the white bag and then picked up the white socks;
\item[] 0.60:~if the robot put the stress ball in the white bag and then put the gummy bears in the white bag;
\item[] 0.45:~if the robot put the stress ball in the white bag and then picked up the gummy bears;
\item[] 0.30:~if the robot put the stress ball in the white bag;
\item[] 0.15:~if the robot picked up the stress ball;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 2 (Headers) ---
\vspace{0.5pt}\textbf{``Stack the right orange cone on top of the left orange cone and then put the green lettuce in the brown bowl''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Put the dark colored clothes in the black box and then put the white clothes in the white box''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Put the white shirt in the white box and then put the red shirt in the black box''}.\vspace{0.5pt} \\
\hline
% --- Row 2 (Content) ---
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot stacked the right orange cone on the left orange cone and then put the green lettuce in the brown bowl;
\item[] 0.75:~if the robot stacked the right orange cone on the left orange cone and then touched the green lettuce;
\item[] 0.50:~if the robot stacked the right orange cone on the left orange cone;
\item[] 0.25:~if the robot touched the right orange cone;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 5
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put 3 colored shirts in the black box, *then* put the white shirt in the white box;
\item[] 0.75:~if the robot put 3 colored shirts in the black box;
\item[] 0.50:~if the robot put 2 colored shirts in the black box;
\item[] 0.25:~if the robot put 1 colored shirt in the black box;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 6
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the white shirt in the white box, then put the red shirt in the black box;
\item[] 0.75:~if the robot put the white shirt in the white box, then touched the red shirt;
\item[] 0.50:~if the robot put the white shirt in the white box;
\item[] 0.25:~if the robot touched the white shirt;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 3 (Headers) ---
\vspace{0.5pt}\textbf{``Place all the green lettuces on the table into the brown bowl and then picked up the orange cone''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Put the leftmost light blue chips bag in the blue bin and then put the bottom green chips in the green bin''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Put the white clothes in the black box and then put the dark colored clothes in the white box''}.\vspace{0.5pt} \\
\hline
% --- Row 3 (Content) ---
% Content Col 7
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put 3 lettuces in the brown bowl, then picked up the orange cone;
\item[] 0.75:~if the robot put 3 lettuce in the brown bowl;
\item[] 0.50:~if the robot put 2 lettuce in the brown bowl;
\item[] 0.25:~if the robot put 1 lettuce in the brown bowl;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 8
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the leftmost blue chips bag in the blue bin and then put the bottom green chips in the green bin;
\item[] 0.75:~if the robot put the leftmost blue chips bag in the blue bin and then touched the bottom green chips;
\item[] 0.50:~if the robot put the leftmost blue chips bag in the blue bin;
\item[] 0.25:~if the robot put any blue snack in the blue bin;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 9
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put the white shirt in the black box, then put 3 colored shirts in the white box;
\item[] 0.75:~if the robot put the white shirt in the black box, then put 2 colored shirts in the white box;
\item[] 0.50:~if the robot put the white shirt in the black box, then put 1 colored shirt in the white box;
\item[] 0.25:~if the robot put the white shirt in the black box;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 4 (Headers) ---
\vspace{0.5pt}\textbf{``First put all the blue snacks in the blue bin and then put all the green snacks in the green bin''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Sort the snacks into their matching containers by color''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``Pack all the gifts''}.\vspace{0.5pt} \\
\hline
% --- Row 4 (Content) ---
% Content Col 10
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put 3 blue snacks in the blue bin, then put 2 green snacks in the green bin;
\item[] 0.80:~if the robot put 3 blue snacks in the blue bin, then put 1 green snack in the green bin;
\item[] 0.60:~if the robot put 3 blue snacks in the blue bin;
\item[] 0.40:~if the robot put 2 blue snacks in the blue bin;
\item[] 0.20:~if the robot put 1 blue snack in the blue bin;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 11
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot put 5 snacks into their matching color bins;
\item[] 0.80:~if the robot put 4 snacks into their matching color bins;
\item[] 0.60:~if the robot put 3 snacks into their matching color bins;
\item[] 0.40:~if the robot put 2 snacks into their matching color bins;
\item[] 0.20:~if the robot put 1 snack into its matching color bin;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 12
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot packed all 5 gifts into the white bag;
\item[] 0.80:~if the robot packed 4 gifts into the white bag;
\item[] 0.60:~if the robot packed 3 gifts into the white bag;
\item[] 0.40:~if the robot packed 2 gifts into the white bag;
\item[] 0.20:~if the robot packed 1 gift into the white bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
\\
\hline

% --- Row 5 (Headers) ---
\vspace{0.5pt}\textbf{``Place the stress ball and the gummy bears in the white bag''}.\vspace{0.5pt} &
\vspace{0.5pt}\textbf{``First put the stress ball in the white bag and then put the gummy bear bag in the white bag''}.\vspace{0.5pt} &
% This cell is empty
\\
\hline
% --- Row 5 (Content) ---
% Content Col 13
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully put stress ball and gummy bear bag into the white bag and only those 2 items;
\item[] 0.80:~if the robot successfully put stress ball and gummy bear bag into the white bag but also put other items in the bag;
\item[] 0.60:~if the robot put either the stress ball or the gummy bear bag in the white bag, but not both, and successfully picked up the other item but did not put it in the white bag;
\item[] 0.40:~if the robot put either the stress ball or the gummy bear bag in the white bag, but not both;
\item[] 0.20:~if the robot put something in the white bag but it\'s *not* the stress ball or the gummy bear bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 14
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 1.00:~if the robot successfully put stress ball into the white bag, then put gummy bear bag into the white bag and only those 2 items -- order matters;
\item[] 0.90:~if the robot put the stress ball in the white bag and then put the gummy bear bag in the white bag;
\item[] 0.75:~if the robot put the stress ball in the white bag and then picked up the gummy bear bag;
\item[] 0.60:~if the robot put the stress ball in the white bag;
\item[] 0.45:~if the robot picked up the stress ball and did not place it in the white bag;
\item[] 0.30:~if the robot put the gummy bear bag and the stress ball in the white bag but in the wrong order;
\item[] 0.20:~if the robot put the gummy bear bag in the white bag;
\item[] 0.10:~if the robot put something in the white bag but it\'s not the stress ball or the gummy bear bag;
\item[] 0.00:~if the robot did anything else.
\end{itemize}
&
% Content Col 15 (Empty)
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
Success rate {#appendix:additional-exps}
------------

In this section we report the *success rate* for all our results in `\cref{sec:results-actions}`{=latex}.

### Success rate for generalization performance

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively.](src/assets/actions/aloha_baselines_sr.png "fig:"){width="\\textwidth"} `\label{fig:aloha-sr-generalization}`{=latex}

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively.](src/assets/actions/omega_baselines_sr.png "fig:"){width="\\textwidth"} `\label{fig:omega-sr-generalization}`{=latex}

```{=latex}
\centering
```
![ Breakdown of `\grshortlatest{}`{=latex} generalization capabilities across our robots. `\grshortlatest{}`{=latex} consistently outperforms the baselines and handles all four types of variations more effectively.](src/assets/actions/atari_baselines_sr.png "fig:"){width="\\textwidth"} `\label{fig:atari-sr-generalization}`{=latex}

```{=latex}
\clearpage
```
### Success rate for data and model ablation

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on our robots: `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation-success}`{=latex}](src/assets/actions/aloha_positive_transfer_sr.png){width="\\textwidth"}

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on our robots: `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation-success}`{=latex}](src/assets/actions/omega_positive_transfer_sr.png){width="\\textwidth"}

```{=latex}
\centering
```
![ Ablation on datasets and training recipes on our robots: `\grshortlatest{}`{=latex} consistently outperforms our baselines: `\grshortlatest{}`{=latex} trained on single or multi-robot data without the MT recipe. `\label{fig:data-training-ablation-success}`{=latex}](src/assets/actions/atari_positive_transfer_sr.png){width="\\textwidth"}

```{=latex}
\clearpage
```
### Success rate for thinking ablation {#sec:thinking-appendix}

`\centering `{=latex} ![Ablation of thinking: success rate in the multi-step benchmark with and without enabling thinking during inference.](src/assets/actions/thinking_sr.png "fig:"){#fig:medium-level-thinking-sr width="45%"}

```{=latex}
\newpage
```
`\grlatestER{}`{=latex} is a generalist embodied reasoning model {#appendix-pre-er}
================================================================

Evaluation Details: Generality
------------------------------

To assess an overall approximation of model embodied reasoning performance, we evaluate `\grlatestER{}`{=latex} and other multimodal models on a mix of 15 academic benchmarks. The aggregated results are reported in `\cref{fig:gr1.5_er_generality}`{=latex}, and the individual benchmark performance results are shown in Table `\ref{table:er-external}`{=latex} and Table `\ref{table:er-generality}`{=latex}. For text-based VQA evaluation benchmarks, we used Gemini 2.5 Flash to grade response accuracy for both multiple-choice and freeform question formats.

The Gemini 2.5 and GPT-5 models were accessed in between September 1, 2025 and September 20, 2025, using default thinking budgets and without tool use.

```{=latex}
\centering
```
::: {#table:er-external}
  ---------------------- ------------------------------- ------------ ---------- ---------- ------ ---------- ----------
  **Model**                                                                                                   
  1.5                                                                                                         
  1.5                     `\makecell[c]{GR-ER}`{=latex}                                                       
  Pro                                                                                                         
  Flash                               GPT-5               GPT-5-mini                                          
  **Thinking**                         Yes                    No          No        Yes      Yes      Yes        Yes
  Point-Bench                         71.6                   73.3      **75.7**     62.7     61.7     43.6       39.5
  RefSpatial                          48.5                   41.8      **49.3**     33.6     41.2     23.5       23.0
  RoboSpatial-Pointing              **31.1**                 25.3        30.3       8.3      7.9      19.0       12.5
  Where2Place                       **59.0**                 48.0        41.0       37.0     48.0     37.0       33.5
  Spatial average                   **52.6**                 47.1        49.1       35.4     39.7     30.8       27.1
  BLINK                               57.8                   65.2        60.1       69.2     46.1   **71.3**     66.4
  CV-Bench                            84.3                   83.6        83.2     **85.9**   85.5   **86.1**   **85.9**
  ERQA                                54.8                   47.0        45.3       56.0     47.5   **59.0**     57.3
  EmbSpatial                          78.4                   73.4        56.4       78.0     76.2   **81.5**     78.8
  MindCube                            54.7                   47.7        47.4     **59.2**   55.4     58.0       55.6
  RoboSpatial-VQA                   **79.3**                 57.7        66.2       71.3     73.4     69.3       70.7
  SAT                                 76.7                   62.0        64.7       74.7     73.3   **86.7**     81.3
  Cosmos-Reason1                      72.2                   68.3        62.0       73.8     72.1   **79.4**     76.3
  Min Video Pairs                     72.5                   67.1        59.5       72.8     69.2   **77.0**     73.0
  OpenEQA                             55.0                   50.5        38.3       55.7     45.3   **64.4**     59.2
  VSI-Bench                           45.8                   39.9        34.1       51.1     45.3   **52.9**     46.2
  QA average                          66.5                   60.2        56.1       68.0     62.7   **71.4**     68.2
  ER Score                          **59.6**                 53.7        52.6       51.7     51.2     51.1       47.7
  Overall Average                   **62.8**                 56.7        54.2       59.3     56.5     60.6       57.3
  ---------------------- ------------------------------- ------------ ---------- ---------- ------ ---------- ----------

  : Model performance on a mix of 15 academic embodied reasoning benchmarks. GPT-5 and GPT-5-mini results obtained via API in September 2025.
:::

```{=latex}
\centering
```
::: {#table:er-generality}
  **Model**         GR-ER 1.5   GR-ER 1.5   GR-ER   `\makecell[c]{Pro 2.5}`{=latex}   `\makecell[c]{Flash 2.5}`{=latex}   GPT-5   GPT-5-mini
  ---------------- ----------- ----------- ------- --------------------------------- ----------------------------------- ------- ------------
  **Thinking**         Yes         No        No                   Yes                                Yes                   Yes       Yes
  MMMU                80.7        79.3      67.0                 82.0                               79.7                  82.0       78.0
  GPQA                83.3        81.3      59.6                 86.4                               82.8                  88.4       78.3
  Aider Polyglot      57.3        44.4      16.0                 82.2                               56.7                  81.3       66.7
  Average             73.8        68.3      47.5                 83.5                               73.1                  83.9       74.3

  : Model performance on MMMU, GPQA and Aider Polyglot benchmarks. Results for GPT-5 and GPT-5-mini obtained via API with default thinking settings and no tool use in September 2025.
:::

Evaluation Details: Pointing {#appendix-pointing}
----------------------------

Table `\ref{table:pointing-breakdown}`{=latex} shows detailed breakdown of the evaluation for complex pointing.

```{=latex}
\centering
```
::: {#table:pointing-breakdown}
  ----------------------------------------------------------- ------------------------------- ------------------------------------ ---------- ------ ------ ------ ------
  **Model**                                                                                                                                                        
  1.5                                                                                                                                                              
  1.5                                                          `\makecell[c]{GR-ER}`{=latex}                                                                       
  Pro                                                                                                                                                              
  Flash                                                        `\makecell[c]{GPT-5}`{=latex}   `\makecell[c]{GPT-5-mini}`{=latex}                                  
  **Thinking**                                                              Yes                                No                      No      Yes    Yes    Yes    Yes
  `\multicolumn{8}{l}{\textit{Standard Pointing}}`{=latex}                                                                                                         
  Point-Bench-Affordance                                                   70.9                               76.5                  **87.9**   65.3   67.8   58.1   50.0
  Point-Bench-Counting                                                     86.8                               86.8                  **88.4**   77.5   73.1   53.7   56.8
  Point-Bench-Reasoning                                                    61.7                             **69.0**                  64.8     55.4   49.4   33.0   28.3
  `\multicolumn{8}{l}{\textit{Steerable Pointing}}`{=latex}                                                                                                        
  Point-Bench-Steerable                                                  **67.8**                             61.8                    65.8     53.4   61.3   38.0   32.0
  `\multicolumn{8}{l}{\textit{Spatial Pointing}}`{=latex}                                                                                                          
  Point-Bench-Spatial                                                      71.0                             **72.6**                  71.9     62.7   57.2   35.4   30.3
  RefSpatial                                                               48.5                               41.8                  **49.2**   33.6   41.1   23.5   23.0
  RoboSpatial                                                            **31.1**                             25.3                    30.3     8.3    7.9    19.0   12.5
  Where2Place                                                            **59.0**                             48.0                    41.0     37.0   48.0   37.0   33.5
  `\multicolumn{8}{l}{\textit{Point-to-Count}}`{=latex}                                                                                                            
  PixMo Count                                                            **80.0**                             65.0                    60.0     76.0   64.0   73.0   77.0
  Average                                                                **52.6**                             47.1                    49.1     35.4   39.7   30.8   27.1
  ----------------------------------------------------------- ------------------------------- ------------------------------------ ---------- ------ ------ ------ ------

  : Model performance on complex pointing benchmarks, broken down by subtask.
:::

Additional Examples {#appendix-er-examples}
-------------------

`\cref{fig:gr1.5_er_sample_thoughts_4}`{=latex} illustrates sampled thoughts from `\grshortlatestER{}`{=latex} on embodied reasoning tasks.

```{=latex}
\begin{figure*}[h]\centering
    \includegraphics[width=\textwidth]{src/assets/ER/thought_samples_4.pdf}
    \caption{Sample thoughts from \grshortlatestER{} performing embodied reasoning tasks.}
    \label{fig:gr1.5_er_sample_thoughts_4}
\end{figure*}
```
```{=latex}
\clearpage
```
```{=latex}
\newpage
```
`\grlatest{}`{=latex}: A Physical Agent {#a-physical-agent}
=======================================

Long-horizon benchmarks {#appendix:long-horizon}
-----------------------

Our long-horizon benchmarks evaluate the combination of the `\grshortlatestER{}`{=latex} model with the VLA as an autonomous agent. `\cref{fig:aloha-long-horizon-rollout}`{=latex} shows visuals of the 4 tasks in the ALOHA long-horizon benchmark. The progress is scored as the sum of points scored along each subtask (Table `\ref{tab:aloha_long_horizon}`{=latex}). `\cref{fig:omega-long-horizon-rollout}`{=latex} shows visuals of the 4 tasks on the Bi-arm Franka long-horizon benchmark. The progress is scored as the sum of points scored along each subtask (Table `\ref{tab:franka_long_horizon}`{=latex}).

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: ALOHA Robot (Long-horizon Benchmark).}
\label{tab:aloha_long_horizon}
\vspace{1pt} % Added 1pt of space after the caption
% Changed to 4 columns, adjusting width (e.g., to 3.7cm from 5cm)
\begin{tabular}{| p{3.7cm} | p{3.7cm} | p{3.7cm} | p{3.7cm} |}
\toprule

% --- SECTION 1 ---
% Changed multicolumn span from 3 to 4
\multicolumn{4}{c}{\vspace{1pt}\textbf{Benchmark: ALOHA Robot - Long-horizon.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
% Added the 4th item to this row
\vspace{1pt}\textbf{Trash Sorting: ``Put the compostables into the green bin, the recyclables into the blue bin, and the waste into the black bin''}.\vspace{1pt} &
\vspace{1pt}\textbf{Desk Organization: ``What is the state of the objects in the table? Return them to their original locations''}.\vspace{1pt} &
\vspace{1pt}\textbf{Packing Suitcase: ``Put the hat and socks into the suitcase then pack the colorful shirt that's on the hanger''}.\vspace{1pt} &
\vspace{1pt}\textbf{Blocks in drawer: ``Open each drawer, and put one block in each drawer''}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 0.2 is added per item in the correct bin:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item red grapes in the green bin;
    \item lettuce leaf in the green bin;
    \item aluminum can in the blue bin;
    \item plastic cup in the blue bin;
    \item energy bar wrapper in the black bin.
\end{itemize}
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=1pt]
\item[] 0.2 is added per item in the correct state:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item red pen in the pen holder;
    \item blue pen in the pen holder;
    \item green marker in the cork tray;
    \item glasses case closed;
    \item laptop closed.
\end{itemize}
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 0.25 is added per:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item white beanie in the suitcase;
    \item blue socks in the suitcase;
    \item shirt taken off the hanger;
    \item shirt in the suitcase.
\end{itemize}
\end{itemize}
&
% Content Col 4 (Moved from Row 2)
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] 0.11 is added per:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item left drawer was opened;
    \item any block in left drawer;
    \item left drawer closed;
    \item middle drawer was opened;
    \item any block in middle drawer;
    \item middle drawer closed;
    \item right drawer was opened;
    \item any block in right drawer;
    \item right drawer closed.
\end{itemize}
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\centering
```
![ALOHA long-horizon benchmark. `\label{fig:aloha-long-horizon-rollout}`{=latex}](src/assets/agent/long_horizon_eval_rollouts.png){width="\\textwidth"}

```{=latex}
\begin{tiny}
\centering
\caption{Progress Scores: Bi-arm Franka (Long-horizon Benchmark).}
\label{tab:franka_long_horizon}
\vspace{1pt} % Added 1pt of space after the caption
% Changed to 4 columns, adjusting width (e.g., to 3.7cm from 5cm)
\begin{tabular}{| p{3.7cm} | p{3.7cm} | p{3.7cm} | p{3.7cm} |}
\toprule

% --- SECTION 1 ---
% Changed multicolumn span from 3 to 4
\multicolumn{4}{c}{\vspace{1pt}\textbf{Benchmark: Bi-arm Franka - Long-horizon.}\vspace{1pt}} \\
\hline

% --- Row 1 (Headers) ---
% Added the 4th item to this row
\vspace{1pt}\textbf{Swap: "Swap the sardines and the yellow bottle"}.\vspace{1pt} &
\vspace{1pt}\textbf{Top shelf to the table: "Put all the objects from the top right shelf onto the table"}.\vspace{1pt} &
\vspace{1pt}\textbf{Mushroom risotto: "Pack all ingredients for a mushroom risotto into the basket"}.\vspace{1pt} &
\vspace{1pt}\textbf{Vegetarian with nut allergy: "I am vegetarian and allergic to nuts. Can you put all the food I can't eat into the basket"}.\vspace{1pt} \\
\hline
% --- Row 1 (Content) ---
% Content Col 1
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] $0.33$ is added per each subtask:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item lemon juice is in the correct location;
    \item can of sardines is in the correct location;
    \item no unrelated task done.
\end{itemize}
\end{itemize}
&
% Content Col 2
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] $0.25$ per each subtask:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item rice is on the table;
    \item corn is on the table;
    \item lemon juice is on table;
    \item no unrelated task done.
\end{itemize}
\end{itemize}
&
% Content Col 3
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] $0.25$ per each subtask:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item mushrooms are in the basket;
    \item rice is in the basket;
    \item stock cubes are in the basket;
    \item no unrelated task done.
\end{itemize}
\end{itemize}
&
% Content Col 4
\begin{itemize}[leftmargin=1pt,topsep=0pt]
\item[] $0.33$ per each subtask:
\item[] \begin{itemize}[leftmargin=10pt,topsep=0pt]
    \item the can of sardines is into the basket;
    \item the granola is into the basket;
    \item no unrelated task done.
\end{itemize}
\end{itemize}
\\
\bottomrule
\end{tabular}
\end{tiny}
```
```{=latex}
\centering
```
![Bi-arm Franka long-horizon benchmark. `\label{fig:omega-long-horizon-rollout}`{=latex}](src/assets/agent/Omega_lh.png){width="\\textwidth"}

```{=latex}
\clearpage
```
Success rate {#appendix:additional-exps-agentic}
------------

`\cref{fig:agentic_sr}`{=latex} shows the *success rate* for the results in Section `\ref{sec:results-agentic}`{=latex}.

```{=latex}
\centering
```
![Long-horizon evaluations for the `\grshortlatest{}`{=latex} Agent on ALOHA (top) and Bi-arm Franka (bottom), consisting of tasks that require advanced real-world understanding, tool use, long-horizon task planning, execution and error recovery to successfully complete the complex long-horizon tasks.](src/assets/agent/sr_agentic_v3.png){#fig:agentic_sr width="\\linewidth"}

```{=latex}
\centering
```
![Long-horizon evaluations for the `\grshortlatest{}`{=latex} Agent on ALOHA (top) and Bi-arm Franka (bottom), consisting of tasks that require advanced real-world understanding, tool use, long-horizon task planning, execution and error recovery to successfully complete the complex long-horizon tasks.](src/assets/agent/omega_agentic_sr_no_legend.png){#fig:agentic_sr width="\\linewidth"}

```{=latex}
\clearpage
```

[^1]: See Contributions and Acknowledgments section for full author list. Please send correspondence to [`\mbox{gemini-robotics-report@google.com}`{=latex}](mailto:gemini-robotics-report@google.com).

[^2]: We do not show comparisons with the `\grlegacy{}`{=latex} model from [@team2025gemini], because those models on the Bi-arm Franka and the Apollo humanoid were post-trained specialists, and they had little generalization beyond variations of the trained tasks.

[^3]: While we are using the pre-training checkpoint for ALOHA, we performed additional post-training on the bi-arm Franka platform to improve its success rate for long-horizon tasks.

[^4]: https://deepmind.google/models/veo/
