---
author:
- Srihas Yarlagadda$^{\dag*\text{1}}$`\enskip`{=latex} Amey Agrawal$^{\text{*1}}$`\enskip`{=latex} Elton Pinto$^{\text{*1}}$`\enskip`{=latex} Hakesh Darapaneni$^{\dag\text{1}}$`\enskip`{=latex} Mitali Meratwal$^{\dag\text{1}}$`\enskip`{=latex} Shivam Mittal$^{\dag\text{1}}$`\enskip`{=latex} Pranavi Bajjuri$^{\dag\text{1}}$`\enskip`{=latex} Srinivas Sridharan$^{\text{2}}$`\enskip`{=latex} Alexey Tumanov$^{\text{1}}$
bibliography:
- main.bib
title: "`\\sysname`{=latex}: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation"
---

```{=latex}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
```
```{=latex}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
```
```{=latex}
\def\UrlBreaks{\do\/\do-}
```
```{=latex}
\renewcommand{\sectionautorefname}{\S\@gobble}
```
```{=latex}
\renewcommand{\subsectionautorefname}{\S\@gobble}
```
```{=latex}
\renewcommand{\subsubsectionautorefname}{\S\@gobble}
```
```{=latex}
\newcommand{\sysname}{\textsc{Maya}\xspace}
```
```{=latex}
\newcommand{\syssearchname}{\textsc{\sysname-Search}\xspace}
```
```{=latex}
\newcommand{\mycaption}[2]{\caption{\textbf{#1} {#2}}}
```
```{=latex}
\newcommand{\myx}{$\times$\xspace}
```
```{=latex}
\newcommand{\csl}{custom specification languages\xspace}
```
```{=latex}
\newcommand{\semg}{semantic gap\xspace}
```
```{=latex}
\newcommand{\ie}{\textit{i.e.,}\xspace}
```
```{=latex}
\newcommand{\eg}{\textit{e.g.,}\xspace}
```
```{=latex}
\newcommand{\vheading}[1]{\noindent\textbf{#1.}}
```
```{=latex}
\newcommand{\sref}[1]{\S\ref{#1}}
```
```{=latex}
\newcommand{\ra}[1]{\renewcommand{\arraystretch}{#1}}
```
```{=latex}
\newcommand{\evalpe}{5}
```
```{=latex}
\newcommand{\evaltc}{56}
```
```{=latex}
\newcommand{\evalcg}{2}
```
```{=latex}
\newcommand{\ctc}{74}
```
```{=latex}
\newcommand{\greencheck}{\textcolor{codegreen}{\checkmark}}
```
```{=latex}
\newcommand{\redcross}{\textcolor{red}{$\times$}}
```
```{=latex}
\newcommand{\bluecross}{\textcolor{blue}{$\times$}}
```
```{=latex}
\newcommand{\greenuparrow}{\textcolor{codegreen}{$\uparrow$}}
```
```{=latex}
\newcommand{\greendownarrow}{\textcolor{codegreen}{$\downarrow$}}
```
```{=latex}
\newcommand{\reduparrow}{\textcolor{red}{$\uparrow$}}
```
```{=latex}
\newcommand{\reddownarrow}{\textcolor{red}{$\downarrow$}}
```
```{=latex}
\newcommand{\recipe}{\textit{recipe}\xspace}
```
```{=latex}
\newcommand{\recipes}{\textit{recipes}\xspace}
```
```{=latex}
\newcommand{\cuda}{CUDA\xspace}
```
```{=latex}
\newcommand{\amped}{AMPeD\xspace}
```
```{=latex}
\newcommand{\calc}{Calculon\xspace}
```
```{=latex}
\newcommand{\prot}{Proteus\xspace}
```
```{=latex}
\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}
```
```{=latex}
\acmYear{2026}
```
```{=latex}
\copyrightyear{2026}
```
```{=latex}
\setcopyright{cc}
```
```{=latex}
\setcctype[4.0]{by}
```
```{=latex}
\acmConference[EUROSYS '26]{European Conference on Computer Systems}{April 27--30, 2026}{Edinburgh, Scotland Uk}
```
```{=latex}
\acmBooktitle{European Conference on Computer Systems (EUROSYS '26), April 27--30, 2026, Edinburgh, Scotland Uk}
```
```{=latex}
\acmDOI{10.1145/3767295.3769366}
```
```{=latex}
\acmISBN{979-8-4007-2212-7/26/04}
```
```{=latex}
\affiliation{ $^{\text{1}}$Georgia Institute of Technology\enskip
$^{\text{2}}$NVIDIA Inc. \country{}}
```
[^1]

```{=latex}
\maketitle
```
```{=latex}
\renewcommand{\shortauthors}{Yarlagadda, Agrawal, Pinto et al.}
```
# Introduction {#sec:Introduction}

<figure id="fig:banner">
<figure id="fig:banner:tradeoff">
<img src="figures/maya_trilemma_new.png" />
<figcaption>Trade-off Space.</figcaption>
</figure>
<figure id="fig:banner:overview">
<img src="figures/maya_banner_flow.png" />
<figcaption>overview.</figcaption>
</figure>
<figcaption>Existing deep learning training performance modeling systems struggle with a tradeoff between fidelity, usability, and generality. Naive analytical models lack fidelity, while the advanced modeling approaches make a tradeoff between usability and generality. breaks this tradeoff through a novel device emulation approach, achieving all three simultaneously.</figcaption>
</figure>

```{=latex}
\begin{figure*}[htbp]


    \begin{subfigure}{0.69\textwidth}

        \includegraphics[width=\textwidth]{figures/h100_strong_scaling_parallel_coordinates_edited.pdf}

        \caption{Configuration Shifts Across Cluster Sizes}
        \label{fig:motivation:strongscalingconfigs:configs}
    \end{subfigure}
    \hfill
    \begin{subfigure}{0.29\textwidth}

        \includegraphics[width=\textwidth]{figures/h100_strong_scaling_confusion_matrix.pdf}

        \caption{Cross-Deployment Inefficiency Matrix}
    \label{fig:motivation:strongscalingconfigs:costofmisconfig}
    \end{subfigure}
    \caption{Sensitivity of optimal training configurations to cluster size for GPT-3 18.4B on H100 GPUs. As GPU counts increase, configurations shift fundamentally --- from memory-efficient combinations of tensor and pipeline parallelism in smaller clusters to higher data-parallel degrees in larger clusters. The cross-deployment cost matrix highlights that deploying configurations tuned for one cluster size on another can lead to inefficiencies, increasing costs by up to \ctc{}\% due to suboptimal resource use. These results emphasize the necessity of scenario-specific configuration tuning, a key challenge \sysname addresses through precise performance modeling.}
\label{fig:motivation:strongscalingconfigs}
\end{figure*}
```
Large foundation models like ChatGPT [@openai2023chatgpt] and Sora [@sora] have demonstrated human-level performance across various natural language and visual tasks. The development of these models critically depends on scaling both model sizes and training corpora [@kaplan2020scaling]. Consequently, the computational demands for training have reached staggering proportions. For instance, training the Llama-3 405B model required 54 days on 16,000 accelerators [@dubey2024llama], a setup that would cost over \$250 million on the Microsoft Azure public cloud [@dubey2024llama].

Training at this scale requires sophisticated system optimizations. Researchers have developed techniques spanning distributed training strategies (tensor, pipeline, expert parallelism) [@megatron; @singh2023hybrid], compute optimizations (kernel fusion, pipeline interleaving) [@megatron], and memory optimizations (activation checkpointing, gradient accumulation) [@megatron; @zero; @korthikanti2023reducing]. Engineers meticulously craft *training recipes* that combine these techniques to maximize hardware utilization. However, the vast array of techniques and their associated parameters create a combinatorial explosion in the configuration space.

Furthermore, these recipes need to be tailored for each deployment scenario. As depicted in `\Cref{fig:motivation:strongscalingconfigs:configs}`{=latex}, even small changes in the deployment scenario can require significant alterations to the configuration. Applying a recipe optimized for one scenario to another can degrade efficiency by up to `\ctc{}`{=latex}%, as shown in `\Cref{fig:motivation:strongscalingconfigs:costofmisconfig}`{=latex}.

These challenges prop up the need for efficient runtime modeling that can evaluate training strategies without requiring actual hardware deployment. Naive analytical models cannot capture the complex characteristics of these distributed training workloads, leading to inaccurate prediction. To address this challenge, several advanced runtime modeling systems [@duan2023proteus; @bang2023vtrain; @lu2023distsim; @zhu2020daydream; @isaev2023calculon; @moolchandani2023amped; @santhanam2021distir] have been proposed. However, these systems suffer from a fundamental limitation: they cannot operate directly on the user code, and require translating the workloads into *`\csl`{=latex}*.

This translation process introduces two fundamental critical challenges. Consider that a user wants to optimize a GPT-3 model run using one of the existing tools. To employ Proteus [@duan2023proteus], an engineer must translate the original PyTorch workload into a \`\`Strategy Tree\" format --- requiring hundreds of lines of specialized code [@proteusgithub] that explicitly encode parallelization patterns, communication topology, and memory optimizations. Any detail omitted or simplified during this manual translation leads to prediction errors (`\Cref{fig:eval:fidelity:pointcloud}`{=latex}), resulting in up to `\evaltc`{=latex}% higher training costs (`\Cref{fig:eval:fidelity:cost}`{=latex}). We term this loss of implementation detail during the translation process as the ***semantic gap***.

Second, system designers face an inherent ***generality-usability tradeoff***. Systems prioritizing generality like Proteus employ expressive but complex specifications, while systems optimizing for usability like Calculon [@isaev2023calculon] and AMPed [@moolchandani2023amped] offer simpler interfaces but only support specific frameworks like Megatron-LM, limiting their applicability.

We observe that while training systems are complex, they interact with accelerators through a narrow, well-defined interface of device APIs. Moreover, training workloads exhibit a fundamental property: control flow (executed on CPUs) rarely depends on specific numerical computation results (executed on accelerators). This decoupling is pervasive in modern training --- data-parallelism processes different data shards with identical control flow, and even techniques like gradient accumulation and mixed precision training maintain deterministic control patterns. While this excludes certain architectures with data-dependent control flow (e.g., some MoE implementations), these represent a small fraction of workloads found in the wild [@zhu2020daydream; @geoffrey2021habitat; @duan2023proteus; @isaev2023calculon; @lu2023distsim; @flexflow; @unger2022unity].

We present `\sysname`{=latex}, a transparent runtime modeling system that exploits these insights through *transparent device emulation*. Rather than requiring workload translation, `\sysname `{=latex}intercepts and emulates all accelerator API interactions from unmodified training code, creating the illusion of actual device execution while capturing complete workload behavior. This approach eliminates both the semantic gap and the generality-usability tradeoff. Our evaluation demonstrates prediction error below `\evalpe{}`{=latex}% across diverse configurations.

`\noindent `{=latex}In summary, we make the following contributions:

-   We identify intrinsic limitations of existing runtime modeling approaches for DL training workloads, specifically the semantic gap and generality-usability tradeoff arising from custom specification languages.

-   We propose `\sysname`{=latex}, a transparent and flexible runtime modeling system that emulates workload execution on accelerated compute clusters.

-   We demonstrate that `\sysname `{=latex}can predict the end-to-end runtime of workloads with \< `\evalpe{}`{=latex}% error across a variety of models and training configurations.

-   Finally, we demonstrate the efficacy of our system for finding optimal training recipes, reducing training cost by up to `\evaltc{}`{=latex}% compared to existing systems.

```{=latex}
\begin{table*}[htbp!]\small

    \begin{tabular}{lcccccccc}
    \toprule
     &  & \textbf{Domain Specific Simulators} & \textbf{Analytical Models} \\
    \cmidrule(l{2pt}r{2pt}){3-6}\cmidrule(l{2pt}r{2pt}){7-9}
    & \textbf{\sysname} & Proteus & vTrain & DistSim & Daydream & Calculon & AMPed & DistIR \\
    \noalign{\hrule height 0.5pt}
    \rowcolor{gray!15}
         \textit{System Properties}\\\hline Deployment-Free Prediction  & \greencheck & \greencheck & \redcross & \redcross & \redcross & \greencheck & \greencheck & \greencheck \\
Transparent -- \textit{No Code Modifications} & \greencheck & \redcross & \redcross & \redcross & \redcross & \redcross & \redcross & \redcross \\
Workload Agnostic & \greencheck & \greencheck & \redcross & \greencheck & \greencheck & \redcross & \redcross & \greencheck \\
\noalign{\hrule height 0.5pt}
    \rowcolor{gray!15}
         \textit{Modeling Domain}\\\hline
    Data Parallel & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck \\
    Tensor Parallel & \greencheck & \greencheck & \greencheck & \greencheck & \redcross & \greencheck & \greencheck & \greencheck \\
     Pipeline Parallel & \greencheck & \greencheck & \greencheck & \greencheck & \redcross & \greencheck & \greencheck & \greencheck \\
    Sequence Parallel & \greencheck & \redcross & \redcross & \redcross & \redcross & \greencheck & \redcross & \redcross \\
    Pipeline Interleaving & \greencheck & \greencheck & \redcross & \redcross & \redcross & \greencheck & \redcross & \greencheck \\
    Distributed Optimizer & \greencheck & \greencheck & \redcross & \redcross & \redcross & \greencheck & \redcross & \greencheck \\
    Activation Recomputation & \greencheck & \greencheck & \redcross & \redcross & \redcross & \greencheck & \redcross & \redcross \\
    Gradient Accumulation & \greencheck & \redcross & \redcross & \redcross & \redcross & \greencheck & \redcross & \greencheck \\

    \bottomrule
    \end{tabular}
    \caption{
    Comparison of \sysname with existing performance modeling approaches. \sysname uniquely combines deployment-free prediction, transparency, and workload agnosticism, supporting a broad range of parallelism and optimization strategies across training configurations. Competing systems are limited in either coverage and often require code modifications.}
    \label{tab:features_systems}
\end{table*}
```
# Background {#sec:background}

Deep learning training (DLT) workloads have grown to unprecedented scales, with state-of-the-art models now containing billions of parameters. For instance, training the Llama-3 405B required an estimated 864 exaflop-days of compute [@dubey2024llama]. Training such large-scale deep learning models requires parallelizing computation across hundreds or thousands of accelerator devices. This parallelization employs techniques such as Data Parallelism (DP), which replicates the model across devices; Tensor Parallelism (TP), which partitions individual layers; and Pipeline Parallelism (PP), which splits the model into stages. However, parallelization alone is insufficient; achieving high efficiency requires carefully balancing compute, memory, and communication bottlenecks.

`\vheading{Balancing Resource Utilization}`{=latex} To address these bottlenecks, researchers have proposed various techniques that trade off different resources (Table `\ref{tab:knob_impacts}`{=latex}). For instance, tensor parallelism can reduce memory pressure by partitioning layers across devices, but increases communication overhead due to frequent all-reduce operations between partitioned layers. Pipeline parallelism [@gpipe; @pipedream] introduces pipeline bubbles that reduce compute efficiency, but enables parallelization at a comparatively low communication cost and memory pressure. The Zero Redundancy Optimizer (ZeRO) [@zero] shards model parameters, gradients, and optimizer states across workers, reducing memory pressure at the cost of increased communication. Activation checkpointing [@shah2020memory] performs additional recomputation to reduce memory usage, dropping activations after the forward pass.

Efficient hardware utilization requires careful composition of these techniques based on the model architecture, training parameters, and available resources. Each technique introduces additional tunable parameters that affect this balance. For example, the interleaved 1F1B pipelining schedule [@pipedream] reduces pipeline bubbles by assigning multiple micro-batches to each pipeline stage, but requires careful tuning of micro-batch counts to balance communication overlap. Similarly, ZeRO offers different sharding stages that provide varying tradeoffs between memory and communication.

`\vheading{Composing Training Recipes}`{=latex} The vast configuration space generated by these techniques, combined with their interdependencies, makes tuning DLT workloads challenging. `\Cref{fig:motivation:strongscalingconfigs}`{=latex} illustrates optimal configurations for training the same model with varying numbers of accelerators. With limited resources (16 devices), a combination of tensor and pipeline parallelism alleviates memory pressure. When scaling to 128 devices, the reduced per-device memory requirement allows leveraging data parallelism instead of tensor parallelism, avoiding additional communication overhead. Our experiments show that using the optimal configuration for 16 devices results in a 1.74`\myx `{=latex}higher cost when applied to 128 devices, compared to the optimal configuration. This performance sensitivity makes it impractical to rely solely on heuristics or previous experience for configuration selection.

***Takeaway:** DLT workloads employ a rich set of parallelization and optimization techniques with unique resource tradeoffs. These techniques must be carefully composed to maximize hardware utilization.*

::: {#tab:knob_impacts}
  **Resource Load**                   Compute                      Memory                      Network
  -------------------------- -------------------------- ---------------------------- ----------------------------
  Data Parallel               `\reddownarrow `{=latex}   `\greendownarrow `{=latex}     `\reduparrow `{=latex}
  Tensor Parallel             `\reddownarrow `{=latex}   `\greendownarrow `{=latex}     `\reduparrow `{=latex}
  Pipeline Parallel           `\reddownarrow `{=latex}   `\greendownarrow `{=latex}    `\reduparrow  `{=latex}
  Sequence Parallel           `\reddownarrow `{=latex}   `\greendownarrow `{=latex}     `\reduparrow `{=latex}
  Pipeline Interleaving       `\greenuparrow `{=latex}   `\greendownarrow `{=latex}     `\reduparrow `{=latex}
  Distributed Optimizer                **--**            `\greendownarrow `{=latex}     `\reduparrow `{=latex}
  Activation Recomputation    `\reddownarrow `{=latex}   `\greendownarrow `{=latex}             **--**
  Gradient Accumulation       `\reddownarrow `{=latex}   `\greendownarrow `{=latex}   `\greendownarrow `{=latex}

  : Effect of configuration knobs on compute utilization, memory load, and network load in a fixed global batch size setting. This table highlights the trade-offs associated with each knob: while some configurations increase compute utilization, others may reduce memory or network load, illustrating the balancing act required to optimize large-scale training jobs.
:::

Given these challenges, we pose the following question: *Can we transparently, accurately, and efficiently predict the performance of arbitrary DLT configurations without access to target hardware?* Answering this question is crucial for enabling rapid exploration of the configuration space to identify resource-efficient training recipes.

# Challenges & Key Idea {#sec:motivation}

As modern deep learning training (DLT) workloads scale to unprecedented levels, it has become increasingly important to optimize their resource allocation, cost, and environmental impact. Runtime performance prediction systems vastly aid such optimization efforts; however, the complexity of modern DLT workloads (involving clusters with hundreds-thousands of GPUs) and the use of domain-specific optimizations makes developing such systems quite challenging.

At a high level, state-of-the-art runtime modeling systems [@duan2023proteus; @isaev2023calculon; @bang2023vtrain; @lu2023distsim; @zhu2020daydream; @geoffrey2021habitat; @moolchandani2023amped; @santhanam2021distir] follow a four-phase approach to predict DLT workload performance (`\Cref{fig:simanatomy}`{=latex}):

1.  **Workload Specification:** Translate the DLT job into a framework-specific representation, capturing computational flow of the workload.

2.  **Kernel Decomposition:** Decompose the workload into an execution graph containing the kernels.

3.  **Kernel Runtime Prediction:** Estimate execution times for the individual kernels using analytical models or historical profiling data.

4.  **Distributed Execution Simulation:** Model the end-to-end execution, accounting for inter-device communication and synchronization.

There is, however, an inherent flaw in this approach --- it lacks transparency. Users have to explicitly encode their workload in a custom out-of-band specification language, which then has to be decomposed into a kernel execution graph. This undermines the efficacy of such systems due to two interrelated issues: (1) the custom specification language may be insufficiently expressive or difficult to use (`\sref{sub:usability-tradeoff}`{=latex}), and (2) the workload specification may not accurately represent the true workload (`\sref{sub:semantic-gap}`{=latex}). Further, even if it were possible to devise a language that is both expressive and easy-to-use, lack of transparency makes such systems *fragile* and *inflexible*. Users are required to revisit existing specifications and devise new ones as DLT workloads evolve. Therefore, there is a clear need for a transparent, user-friendly, and accurate runtime modeling system.

## The Generality-Usability Tradeoff {#sub:usability-tradeoff}

Existing systems attempt to maintain fidelity while navigating this semantic gap through two primary approaches. On one end of the spectrum, systems like DistIR [@santhanam2021distir] and Proteus [@duan2023proteus] opt for highly expressive but complex representation formats. While these can capture intricate details of the workload, they require users to translate their jobs into hundreds of lines of specialized code [@distirgithub; @proteusgithub]. This imposes a substantial burden on users and, consequently, introduces opportunities for translation errors that can compromise prediction accuracy.

On the other end, systems like VTrain [@bang2023vtrain] and Calculon [@isaev2023calculon] prioritize usability by providing simpler interfaces where users only need to specify configuration parameters. However, this simplicity comes at the cost of generality. These systems are tightly coupled to specific workload implementations, such as Megatron-LM [@megatron], limiting their applicability to a narrow range of use cases.

As a result, there is a tension between usability and generality in current approaches. Systems that strive for broad applicability often sacrifice ease of use, while those focusing on user-friendliness sacrifice generality.

## Semantic Gap in Workload Representation {#sub:semantic-gap}

Existing runtime prediction frameworks require users to manually encode their workloads either using out-of-band `\csl `{=latex}or through static configuration knobs. This approach makes it more likely for a *semantic gap* to manifest between the actual workload and its abstract representation, leading to inaccurate predictions. For instance, users may inadvertently omit complex hardware-specific optimizations or subtle system interactions. Additionally, many DLT workloads exhibit complex runtime behaviors that are non-trivial to represent. This is an inherent \`\`garbage in, garbage out" problem --- inaccurate workload specifications lead to inaccurate predictions.

## Illustrative Example

Consider the scenario presented in `\cref{fig:motiv:example}`{=latex}. AMPed restricts users to a fixed set of operators with carefully curated analytical models. While these analytical predictions can be composed to produce final runtime estimates, the rigid modeling language introduces significant approximation errors. In contrast, Proteus provides an expressive intermediate representation (IR) that allows users to encode diverse parallelism schemes through strategy trees. However, this flexibility comes at a cost --- users may inadvertently model features incorrectly, and verifying that the translated model accurately represents the original program is a challenging and error-prone process.

These limitations are apparent when attempting to evaluate a new framework optimization. A pertinent example is DualPipe [@dualpiperepo], a pipeline parallelism schedule utilized in the training of DeepSeek-R1 [@deepseek]. This schedule differs from the usual interleaved 1F1B schedule proposed by Megatron-LM in that it increases overlapping by running pairs of micro-batches bidirectionally. A static analysis approach would require relevant calculations for the forward and backward passes --- specifically those accounting for the pipeline bubble --- to be rewritten to reflect the increased overlap. On the other hand, expressing this schedule with Proteus would require a custom graph transformation pass on the strategy tree to introduce additional compute/communication nodes, which is a manual and cumbersome process.

![The user workflow across three systems. With AMPed (top), the user provides a declarative configuration specifying high-level parameters which are then fed into a predefined analytical model in the system. If a new model architecture or performance optimization is introduced, the system is rendered unusable. In Proteus (middle), the user must manually translate their entire model into a custom format and write a separate \`\`Strategy Tree" to explicitly define the parallelization strategy. With `\sysname `{=latex}(bottom), the user runs their original, unmodified training script, and the system automatically captures a low-level execution trace through transparent emulation, requiring no user intervention. ](figures/maya_example.png){#fig:motiv:example width="\\linewidth"}

## Solution: Transparent Device Emulation

To address the absence of transparent abstractions, we propose a novel solution that leverages the unique characteristics of DLT workloads to achieve generality and ease of use without sacrificing fidelity: *transparent device emulation*. Instead of having the user specify their workload in an obtuse specification language, we intercept and emulate all application interactions with the accelerator. The system then simulates cluster behavior from these intercepted traces, yielding a performance prediction.

An emulation-driven approach is viable for two key reasons. First, despite the complexity of DLT workloads, these systems interact with accelerators through narrow-waist, well-defined accelerator APIs. This allows `\sysname `{=latex}to mimic the functionality of device management API calls such as `cudaMalloc`, `cudaSetDevice`, and `cublasSetMatrix`, creating the illusion that the user application is running on the actual device. Second, DLT applications exhibit a critical decoupling between control flow (executed on the CPU) and computation (executed on the accelerator device), with the former rarely depending on the actual computation results. This separation allows us to emulate device execution without affecting the application's control flow. We simply save metadata for each compute operation but skip their actual execution, enabling rapid and resource-efficient tracing.

There are several benefits to this approach. First, by transparently intercepting accelerator interactions, `\sysname `{=latex}is able to **accurately capture** the entire workload behavior **without requiring any changes to application code**. This on its own addresses the generality-usability trade-off and semantic gap. Users can now model DLT workloads without an intermediate workload encoding step, significantly lowering the barrier to adoption. The transparency of `\sysname `{=latex}makes it highly adaptable and resilient to the ever-evolving landscape of DLT workloads.

Second, the detailed workload trace from emulation can be used to produce high-fidelity predictions (`\sref{sec:eval:fidelity}`{=latex}). Downstream processing and simulation can accurately represent low-level behavior, making it easy to identify bottlenecks and generate reliable predictions. Further, each component of the `\sysname `{=latex}stack --- emulation, trace processing, runtime estimation, and simulation --- **is pluggable and can be tuned separately**, enabling improved overall accuracy of runtime predictions and better flexibility, all while remaining transparent.

Finally, `\sysname `{=latex}is fundamentally *not coupled* to a specific framework or model, allowing it to seamlessly integrate with existing workflows. It can also be used to build more sophisticated systems such as configuration search (`\sref{sec:eval:opti}`{=latex}).

To summarize, `\sysname `{=latex}is a transparent, user-friendly, and accurate performance modeling system designed to predict DLT workload behavior without requiring access to accelerator hardware. The key insight behind `\sysname`{=latex}'s design is that by operating at the narrow interface between training frameworks and accelerator devices, we can eliminate the fundamental trade-offs between modeling accuracy, ease of use, and generality.

```{=latex}
\begin{figure*}
    \small
    \includegraphics[width=0.7\linewidth]{figures/maya_simulator_anatomy.pdf}
    \caption{Comparison of modeling approaches: Traditional systems require explicit workload specification and several complex heuristics steps to obtain a kernel level execution graph apt for simulation. On the other hand, \sysname directly captures the computation graph at a lower-level through transparent device emulation.}
    \label{fig:simanatomy}
\end{figure*}
```
![`\sysname `{=latex}architecture: (1) Unmodified training code is executed through a virtual runtime that emulates the device drivers given emulation specs, (2) Worker traces capturing device API calls are merged into a unified trace, (3) Kernels in the unified trace are annotated with predicted runtimes using pre-trained estimators, (4) An event-driven simulator processes the annotated traces using cluster specifications to produce performance predictions.](figures/maya_hld.png){#fig:hld width="0.55\\linewidth"}

# `\sysname `{=latex}: System Design {#sec:design}

```{=latex}
\begin{figure*}

    \includegraphics[width=\linewidth]{figures/trace_flow.pdf}
    \caption{\small \sysname's trace processing pipeline: Starting with unmodified user code, the device emulator captures raw traces containing API calls, kernel launches, and synchronization events across multiple GPU streams. The trace collator merges these per-GPU traces and resolves collective operations, creating a unified job-level trace. The kernel runtime estimator then annotates compute operations with predicted durations. Finally, the event-driven simulator processes this trace to model the complex interactions between compute operations, synchronization events, and communication collectives across streams and devices, producing an accurate timeline of execution.}
    \label{fig:traceflow}
\end{figure*}
```
The foundation of `\sysname `{=latex}is a transparent device emulator that functions as the interface between unmodified training workloads and the modeling pipeline. This component interposes on the accelerator device APIs, virtualizing device interactions while maintaining execution fidelity. The emulator captures a precise trace of device operations --- compute kernels, memory operations, and synchronization events --- completely on the CPU. This yields detailed execution traces while remaining fully transparent to the training application, which executes as if on a real accelerator.

These raw execution traces then flow through a trace collation and analysis pipeline that reconstructs the distributed execution pattern. The collator combines traces from multiple workers, resolving dependencies across both space (between workers) and time (within execution streams) by identifying collective communication operations, which are crucial for modeling distributed training workloads. The performance estimation phase then augments this execution trace with runtime predictions. Since the emulator captures operation metadata but does not execute compute kernels, `\sysname `{=latex}employs a combination of machine learning and analytical models to predict operation runtimes.

The final phase uses event-driven simulation to model end-to-end execution. The simulator processes the annotated trace according to a specified hardware configuration, modeling complex execution dependencies within and across workers in a distributed training workload. This captures critical performance characteristics like pipeline bubbles and compute-communication overlap that emerge from the interaction between device operations. The output is a comprehensive simulation report that encompasses metrics such as batch execution time, communication time & memory usage.

`\sysname`{=latex}'s architecture enables it to capture the full complexity of modern DLT optimizations while providing high-fidelity performance predictions. By operating on unmodified user code and eliminating the requirement for accelerator hardware during prediction, `\sysname `{=latex}offers a unique combination of transparency and efficiency. In the rest of this section, we provide the design details of each component in `\sysname `{=latex}(Figure `\ref{fig:hld}`{=latex}).

## Transparent Accelerator Emulation

We make a key observation on the nature of deep learning training (DLT) workloads: the CPU-side control flow of the application is fundamentally decoupled from the computation executed on accelerator devices. Since there is minimal feedback to the control flow from the results of device operations, it is possible to emulate device behavior without materializing output values. `\sysname`{=latex}'s emulator exploits this characteristic --- turning compute operations into no-ops while carefully managing device state and dependencies.

To achieve transparency, the emulator *intercepts* calls to device APIs without requiring modifications to the training application. Most device operations, particularly compute kernels, are transformed into no-ops that record metadata about the operation and return immediately. However, `\sysname `{=latex}must still precisely track device state to ensure correct execution. For instance, when applications query device state through APIs like `cudaMemGetInfo`, `\sysname `{=latex}returns carefully constructed responses that mimic device behavior, allowing frameworks like PyTorch to make memory management decisions as they would on real hardware.

The design of this architecture addresses three key challenges in maintaining execution fidelity despite no-op execution: (1) maintaining the semantic meaning of API sequences, (2) tracking both physical and virtual resources, and (3) handling distributed dependencies. These challenges guide `\sysname`{=latex}'s semantically-aware emulation layer.

`\vheading{Context-aware Operation Modeling}`{=latex} Several device operations gain meaning only when considered within the context of a broader sequence of API calls, requiring careful state tracking. In CUDA, for instance, `cudaStreamWaitEvent()` synchronizes ops across compute streams based on events recorded by `cudaEventRecord()`. We maintain a map of device state to model these dependencies correctly, even though the underlying operations don't execute.

A similar treatment is required for operations involving opaque libraries like cuBLAS and cuDNN, where configurations are built incrementally. For instance, a cuBLAS matrix multiplication involves a sequence of setup calls (`cublasSetMatrix()`, `cublasSetStream()`) before the actual computation (`cublasGemmEx()`). `\sysname `{=latex}tracks these stateful API sequences to construct the complete operation metadata, essential for modeling performance-critical operations like matrix multiplications and convolutions.

`\vheading{Resource Tracking}`{=latex} `\sysname `{=latex}maintains a dynamic mapping of both physical and virtual resources during emulation. For memory management, `\sysname `{=latex}tracks allocations and deallocations, allowing it to simulate real hardware constraints and detect errors such as out-of-memory (OOM) conditions and invalid memory accesses. In unified memory configurations, `\sysname `{=latex}tracks tensor locations across host and device spaces and resolves ambiguity in API calls like `cudaMemcpyAsync` to accurately model workload behavior.

In addition to physical resources, `\sysname `{=latex}creates and manages virtual resources and handles that are returned to the application; examples of this include device handles, CUDA streams, and CUDA events. Any misconfiguration or user error --- such as using an invalid stream or an uninitialized descriptor --- is identified and flagged by `\sysname `{=latex}using each handle's state. Through detailed accounting of both physical and virtual resources, `\sysname `{=latex}provides a realistic foundation to emulate device behavior and potential failure scenarios; this is a key benefit unique to emulated tracing.

`\vheading{Inter-Device Dependencies}`{=latex} In distributed deep learning, collective communication operations are used to synchronize data across devices. To emulate them accurately, `\sysname `{=latex}captures the full lifecycle of these collectives. Each worker initializes a communicator using an API like `ncclCommInitRank`, which assigns ranks and defines the communication topology for operations like `ncclAllReduce`. This setup enables `\sysname `{=latex}to accurately track data dependencies and the role of each worker device within the collective operation.

Once initialized, these communicators facilitate data transfers that often run on dedicated streams to achieve compute-communication overlap. For example, in `ncclAllReduce`, each device concurrently performs compute tasks on one stream while executing collective communication on another. `\sysname `{=latex}models this behavior by simultaneously tracking communication and compute streams, enabling it to accurately capture blocking dependencies and the resulting overlap between computation and data transfer. For more complex parallelism patterns, such as 3D parallelism, `\sysname `{=latex}tracks multiple communicators operating across different workload dimensions --- assigning unique identifiers to each one and logging associated events in the trace. Just like compute operations, there is no need to actually share data between worker CPU processes since the control flow does not depend on the result of the collective; this obviates the need for IPC and synchronization in the emulator.

This approach works particularly well for DLT workloads due to their predictable, repetitive nature. The training loop typically executes the same sequence of operations repeatedly in each iteration, with control-flow decisions rarely depending on specific numerical results from device computation. By exploiting this characteristic while carefully maintaining device state, `\sysname `{=latex}can accurately model workloads without executing device operations. The emulator produces detailed traces that capture the full complexity of device interactions while remaining lightweight and efficient.

## Trace Collection and Analysis

`\vheading{Worker Trace Generation}`{=latex} `\sysname `{=latex}captures detailed execution traces for each worker in the distributed training job. Rather than just logging API calls, we maintain rich context about each operation. For compute kernels, we record essential metadata including input/output tensor shapes, data types, and memory layouts --- information critical for runtime prediction. For instance, in a transformer layer's attention computation, we track matrix dimensions and sparsity patterns that significantly impact performance.

Each trace entry also includes precise timing of CPU-side operations between kernel launches, capturing essential host overhead and dispatch latency. We achieve this by measuring wall-clock deltas between API calls during emulation. This is particularly important when operating with state-of-the-art devices like NVIDIA H100s, where dispatch overhead can be significant, especially in workloads with many small kernels.

`\vheading{Trace Collation}`{=latex} A key challenge in analyzing distributed training workloads is reconstructing the global execution pattern from individual worker traces. While individual workers are aware of the number of participants in a collective operation, they do not have visibility into *which* workers are involved or the topology of the communication graph. The collator identifies collective operations (like `ncclAllReduce`) and matches them across workers using communicator IDs and sequence numbers. This allows us to reconstruct and model the full communication pattern faithfully.

`\label{sec:deduplication}`{=latex} `\vheading{Optimization: Worker Deduplication}`{=latex} An insight that enables `\sysname `{=latex}to efficiently scale to large distributed workloads is that many workers in DLT perform identical work. For instance, in data parallel training, each worker executes the same computation on different data shards. We exploit this pattern through dynamic worker deduplication. During the first training iteration, we profile all workers to establish operation patterns. We compute rolling hashes of operation sequences, allowing us to identify workers performing redundant computation. Upon detecting duplicates, we terminate redundant workers and continue profiling only the unique ranks. The trace collator later reconstructs the full execution pattern using these profiled ranks.

This optimization is particularly effective for large-scale training jobs. For example, in a 64-GPU job with 8-way TP and 8-way DP, we only need to profile a single worker since tensor and data parallel workers exhibit identical behavior.

## End-to-end Simulator

The emulator trace contains metadata for each operation but lacks execution times, since operations are emulated and not dispatched to actual hardware. To produce an end-to-end performance estimate from this trace, the simulator pipeline i) predicts per-operation runtimes from metadata recorded during emulation, and ii) conducts a discrete-event simulation of cluster behavior.

`\vheading{Kernel Runtime Estimation}`{=latex} `\sysname`{=latex}'s kernel runtime estimators are pluggable components that estimate latency and bandwidth for *individual* compute and collective operations. Users can provide any runtime estimator of their choosing for any kernel type (eg. Habitat [@geoffrey2021habitat], GPU-Mangrove [@braun2020mangrove], static-analysis based approaches [@alavani2018predstatic]).

By default, `\sysname `{=latex}uses random forest regressors trained on profiling data from kernel microbenchmarks, similar to prior approaches [@zhu2020daydream; @vidur]. For collective operations, the reference estimators leverage profiling data of intra-host and inter-host link characteristics, considering varying data sizes and the topology of participating devices. Please refer to Appendix `\ref{appendix:predictions}`{=latex} for more details.

To facilitate easy onboarding of new operations, `\sysname `{=latex}offers a transparent *profiling* mode that dispatches operations on real hardware (rather than emulating them), logging each operation's arguments and observed runtime. This enables us to progressively build and integrate prediction models from production workloads.

`\vheading{Resource Model}`{=latex} The simulator models both host and accelerator resources. Each host machine is represented by a dispatch queue that processes operations and manages device interactions. Each accelerator manages multiple execution streams and models concurrent operation processing. Synchronization operations like `cudaDeviceSynchronize` and `cudaStreamWaitEvent` are modeled using blocking waits in the corresponding streams. Host-side computation and launch overheads are also modeled as blocking operations in the dispatch queue using measurements from the emulation phase. `\sysname `{=latex}shares its overall discrete-event simulation approach with prior work [@zhu2020daydream; @santhanam2021distir] - however, we can capture fine-grained dependencies at the CUDA API granularity owing to the detailed traces collected via emulation. For more details, please refer to Algorithm `\ref{alg:simulator}`{=latex}, Appendix `\ref{appendix:simulator_core}`{=latex}.

`\vheading{Network Model}`{=latex} Network operations are implemented using a global waitmap where participating devices register themselves, blocking their respective streams until all workers join. This waitmap can capture pipeline bubbles and effectively model compute-comms overlap --- data dependencies manifest as stalls on the corresponding accelerator stream, while any concurrent compute streams can proceed to the next event (`\autoref{fig:traceflow}`{=latex}). This behavior is described in more detail in Algorithm `\ref{alg:waitmaps}`{=latex}, Appendix `\ref{appendix:simulator_core}`{=latex}.

After all participants join, the on-the-wire duration of each collective operation is a black-box prediction from the corresponding kernel runtime estimator. This abstracts any topology-dependent runtime effects into a single discrete event that is separate from other dataflow dependencies in the simulator, allowing network operations to be modeled in isolation. This allows users to choose between profiled collective data from their target cluster (nccl-tests), or network simulators like ASTRA-sim [@won2023astra].

By reproducing the behavior of accelerator primitives, the simulator provides an accurate representation of cluster behavior. Operation-level modeling ensures we can capture fine-grained behavior while remaining general; new computational optimizations can be captured without additional effort. While our implementation targets CUDA devices, the design generalizes to other accelerators.

# Workload Tuning with `\syssearchname`{=latex}

`\sysname`{=latex}'s transparent emulation enables efficient exploration of the vast configuration space of DLT configs. While prior approaches require explicit modeling of each optimization technique, our emulation-based design naturally captures the impact of any configuration change through its low-level tracing. We leverage this capability to build an automated configuration search system that can rapidly evaluate different training recipes without requiring GPU resources.

The key insight is that by operating at the accelerator API level, `\sysname `{=latex}can accurately predict the performance impact of configuration changes without needing to understand their semantic meaning. This allows us to treat configuration search as a black-box optimization problem, evaluating arbitrary combinations of parallelization strategies and system optimizations through lightweight emulation. The system takes as input a configuration space specification (defining the parallelization strategies and optimization knobs to explore), a resource specification (describing the target GPU cluster), and the training script. It then orchestrates concurrent trials that use `\sysname `{=latex}to evaluate different configurations, continuously refining the search based on predicted performance using standard hyperparameter optimization techniques such as Bayesian optimization [@baysianoptimization] or Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [@cma].

## Concurrent Trial Scheduling

While `\sysname`{=latex}'s emulation engine provides a cheap way to evaluate different configurations, the search process can still be prohibitively slow if done sequentially. This necessitates careful resource management to enable the concurrent evaluation of multiple configurations. We solve this problem by developing a CPU scheduler that distributes the emulation of concurrent trials across CPU cores.

Since `\sysname `{=latex}relies on wall-clock measurements for host-side overheads, concurrent trials would be affected by interference if they contend for CPU resources. To avoid this, we i) pin individual worker processes to CPU cores, and ii) run each emulated worker to completion before switching.

Second, we tackle memory pressure through careful process management. Each emulated GPU rank initially requires a complete copy of the user libraries (for instance, PyTorch runtime stack), which can quickly exhaust system memory when running several concurrent trials. We address this using Python's *forkserver* mechanism to maintain a single copy of user libraries across workers, reducing memory footprint.

```{=latex}
\begin{figure*}[t!]

    \includegraphics[width=0.8\textwidth]{figures/all_fidelity_clouds.pdf}
    \caption{Runtime prediction accuracy comparison across different scales and hardware. We evaluate GPT3-2.7B (left) and GPT3-18.4B (right) models on V100 and H100 clusters. For each hardware setup, we plot the predicted vs actual per-iteration runtime for the top 100 valid configurations ranked by measured performance. \sysname consistently achieves high prediction fidelity across model sizes and hardware setups compared to existing approaches, with most predictions falling within 5\% of measured values. \todo{regen the plot with maya name}}
    \label{fig:eval:fidelity:pointcloud}
\end{figure*}
```
## Fidelity-Preserving Trial Pruning

While concurrent execution improves throughput, we can further accelerate the search by intelligently pruning or skipping configurations that are guaranteed to perform worse than already evaluated ones. The key challenge is determining when such skips preserve prediction fidelity without missing potentially optimal configurations.

We develop a domain-aware trial scheduler that leverages known relationships between training configurations. For instance, if a configuration with activation recomputation enabled leads to out-of-memory (OOM) errors, we can safely skip evaluating the same configuration with recomputation disabled, as it will necessarily consume more memory and thus OOM. These relationships form a partial ordering over configurations based on their resource consumption.

The trial scheduler maintains a history of evaluated configurations and employs a set of conservative tactics to identify configurations that are dominated by previously seen ones. Pruning using this configuration history is *fidelity-preserving*, meaning that no potentially optimal configuration is skipped while still achieving significant reductions in search time.

# Implementation

`\sysname`{=latex}'s CUDA emulator is implemented as a shared library ($\sim$`<!-- -->`{=html}2,500 lines in C++) that intercepts GPU-related API calls through dynamic linking. We use `LD_PRELOAD` to inject our library at runtime, replacing symbols for the CUDA runtime API, driver API, and related libraries (cuBLAS, cuDNN, NCCL) with our implementations. This is similar to prior work on GPU virtualization [@singularity]. The event-driven simulator is implemented in Python ($\sim$`<!-- -->`{=html}3,000 lines) using a priority queue to process operation timings. The simulator includes specialized handlers for different operation types (compute, memory transfers, synchronization) and a topology-aware network model for accurate collective operation simulation. The configuration search system extends Ray Tune [@ray-tune] with domain-specific optimizations. The system exposes a simple Python API to integrate `\syssearchname `{=latex}in less than 15 lines of code changes and abstracts the complexity of emulation and trial management.

# Evaluation {#sec-eval}

In this section, we present a comprehensive evaluation of `\sysname `{=latex}to demonstrate its effectiveness in predicting LLM training performance and optimizing deployment configurations. Our evaluation aims to answer these key questions:

1\. How accurate is `\sysname `{=latex}in predicting end-to-end runtime of training workloads across models of various sizes, different configurations and deployment environments? (`\sref{sec:eval:fidelity}`{=latex})

2\. Can `\sysname `{=latex}effectively optimize DLT workload deployment while navigating large configuration spaces? (`\sref{sec:eval:opti}`{=latex})

3\. How can `\sysname `{=latex}scale to large clusters?(`\sref{sec:eval:hyperscale}`{=latex})

To answer these questions, we conduct a series of experiments, where we compare `\sysname`{=latex}'s predictions against real-world measurements and existing state-of-the-art runtime modeling systems. Finally, we also present ablation studies to evaluate the effectiveness and scalability of different components of `\sysname `{=latex}(`\sref{sec:eval:ablation}`{=latex}).

```{=latex}
\begin{figure*}[htbp]

   \includegraphics[width=0.8\textwidth]{figures/normalized_cost_comparison_all.pdf}
   \caption{Cost impact of prediction accuracy on configuration selection. We evaluate GPT3-2.7B and GPT3-18.4B models across V100 and H100 clusters, showing the cost of each system's selected configuration normalized to the optimal configuration's cost. \sysname consistently identifies configurations within 2\% of optimal cost, while baseline systems can result in up to 56\% higher cost. \todo{regen the plot with maya name}}
   \label{fig:eval:fidelity:cost}
\end{figure*}
```
## Experimental Setup

`\vheading{Baselines}`{=latex} We compare `\sysname `{=latex}against a variety of state-of-the-art runtime modeling systems. We consider two analytical modeling frameworks -- Calculon [@isaev2023calculon] and AMPed [@moolchandani2023amped], and one domain specific simulator, Proteus [@duan2023proteus]. We omit vTrain [@bang2023vtrain], DistSim [@lu2023distsim] and Daydream [@zhu2020daydream] from comparison due to unavailability of their source code. While DistIR [@santhanam2021distir] is available publicly, it only supports modeling the training performance for simple MLP workloads.

`\vheading{Models}`{=latex} In order to facilitate a direct comparison, we conduct our experiments on the GPT-3 [@brown2020language] family of models -- the only workload natively supported by our baselines AMPed and Calculon. We use Megatron-LM [@megatron] GPT-3 [@brown2020language] 2.7B, 18.4B and 145.6B models in our experiments, with fixed global batch sizes of 256, 512 and 12k respectively (unless otherwise mentioned). The training scripts use HuggingFace Accelerate [@hfaccelerate], Pytorch 2.1.0 [@ansel2024pytorch] and bfloat16 mixed precision. We also verify `\sysname `{=latex}on a host of models using FSDP and `torch.compile` (Table `\ref{tab:framework_generality}`{=latex}).

`\vheading{Hardware}`{=latex} We evaluate the performance of `\sysname `{=latex}in three different scenarios -- a 64 GPU NVIDIA H100 DGX [@h100] cluster, a 16 GPU V100 [@v100] DGX cluster, and a node containing 8 A40 GPUs. Each DGX-H100 server has 8 NVIDIA H100 GPUs with 80GB of High Bandwidth Memory (HBM). GPUs within a server are connected with NVLINK4.0 providing 900GBps bidirectional bandwidth. GPUs across servers are connected via Ethernet with RoCE offering 400Gbps per GPU pair.

The V100 DGX servers are equipped with 8 GPUs with 40GB HBM memory capacity. Intra-node NVLINK connectivity is in an asymmetric cubemesh topology [@v100] with 300GBps links. These machines are connected using a 100GBps Infiniband [@infiniband] link. The A40 node uses pairwise NVLINK 4.0 between GPUs. Finally, we run the `\sysname `{=latex}prediction pipeline on a CPU-only node (AMD 7513 EPYC, 128 cores, 504GB RAM) for configuration search, an AMD 9334 EPYC processor with 64 cores and 750GB memory for the scaling experiments.

`\vheading{Configuration Space}`{=latex} We analyze `\sysname`{=latex}'s performance on a rich configuration space ($\sim$`<!-- -->`{=html}2000 points for each hardware cluster) formed by the composition of eight different configuration parameters -- mapping to different parallelization strategies and memory/compute optimizations. A summary of all the config knobs and their impact on compute system utilization is listed in `\Cref{tab:knob_impacts}`{=latex}. All baseline systems do not support every optimization parameter shown in `\Cref{tab:features_systems}`{=latex} and we skip these unsupported configs. Furthermore, we omit `\calc `{=latex}and `\amped `{=latex}baselines for the Volta architecture because they do not support modeling bfloat16.

## Prediction Quality {#sec:eval:fidelity}

We first evaluate `\sysname`{=latex}'s accuracy in predicting the end-to-end runtime of training workloads across various models, configurations, and deployment setups. We compare `\sysname `{=latex}against state-of-the-art: `\prot`{=latex}, `\calc`{=latex}, and `\amped`{=latex}.

`\vheading{Accuracy Across Configurations}`{=latex} `\Cref{fig:eval:fidelity:pointcloud}`{=latex} illustrates the prediction quality of `\sysname `{=latex}across the top one hundred configurations for training GPT3 models on four different deployment setups. `\sysname `{=latex}consistently predicts the end-to-end runtime with high fidelity across all configurations and deployment setups. While `\prot `{=latex}achieves comparable fidelity on V100 GPUs, it only supports a subset of configuration knobs, limiting its ability to identify top-performing configurations. Moreover, `\prot`{=latex}'s performance degrades significantly on H100 GPUs, with predictions often deviating by an order of magnitude. This is particularly surprising since `\prot `{=latex}performs explicit profiling of kernel execution times on actual GPUs as opposed to all the other systems considered in this experiment. As shown in `\Cref{fig:eval:fidelitycdf}`{=latex}, `\amped `{=latex}[^2] consistently overestimates execution time by 2-3$\times$. Despite `\calc `{=latex}and `\amped `{=latex}being specialized for GPT3 training with Megatron-LM, they exhibit significantly higher prediction error compared to `\sysname`{=latex}. `\sysname `{=latex}achieves remarkable fidelity, predicting runtimes within 1% error margin for $\sim$`<!-- -->`{=html}65% of configurations on 8 V100 GPUs. This extends to larger deployments, with `\sysname `{=latex}maintaining 10% error margin for $\sim$`<!-- -->`{=html}90% of configurations even at 64 H100s.

![Cumulative distribution of prediction errors across configurations. `\sysname `{=latex}achieves less than 1% prediction error for 65% of configurations on 8`\myx`{=latex}V100 cluster. `\sysname `{=latex}achieves sub 10% prediction error for 90% of configurations on 64`\myx`{=latex}H100 cluster, while baseline systems show 10-1000% errors. `\todo{regen the plot with maya name}`{=latex}](figures/combined_error_cdf.png){#fig:eval:fidelitycdf width="\\linewidth"}

`\vheading{Impact on Configuration Selection}`{=latex} `\Cref{fig:eval:fidelity:cost}`{=latex} demonstrates how prediction accuracy directly impacts the identification of optimal training configurations. The graph shows the normalized cost (relative to the optimal configuration) of the best configuration selected by each system on actual deployment. `\sysname `{=latex}consistently identifies configurations within 2% of optimal cost across all scenarios, showcasing its ability to effectively navigate complex configuration spaces. In contrast, `\prot `{=latex}selects configurations 5-17% more costly than optimal, with the gap widening for larger models and GPU counts. `\calc`{=latex}'s consistent underestimation leads to configurations with 10-15% higher costs, while `\amped`{=latex}'s overestimation results in configurations up to 56% more expensive than optimal. `\sysname`{=latex}'s exceptional prediction accuracy across diverse model sizes, GPU configurations, and optimization strategies directly translates to identifying highly efficient training configurations, enabling significant savings in computational resources and associated costs for large-scale deep learning training workloads.

`\vheading{Breakdown of Prediction Error}`{=latex} The end-to-end prediction error can broadly be attributed to i) the prediction error of individual kernel runtimes, and ii) loss of detail in the emulation and simulation phases. To better characterize these errors, we compare against an *oracle prediction* --- this is a modified version of `\sysname `{=latex}that uses profiled (actual) per-kernel runtimes instead of predicted values from a regressor. The results obtained on a single-node and multi-node V100 setup are summarized in Table `\ref{tab:oracle_fidelity}`{=latex}.

::: {#tab:framework_generality}
  -------------------------------------------------------------------------------------------------------------------------
  **Framework**   **Optimizations**                           **Models**
  --------------- ------------------------------------------- -------------------------------------------------------------
  DeepSpeed       ZeRO 1-3, `\newline `{=latex}Act. Offload   ResNet, DenseNet, MobileNet, VGG, BERT, GPT, Llama, T5, ViT

  1-2 PyTorch     `torch.compile`, FSDP, DDP
  -------------------------------------------------------------------------------------------------------------------------

  : Frameworks and models tested with `\sysname `{=latex}emulation, in addition to Megatron-LM.
:::

A key observation from running training scripts in the wild is that extra verification steps can occasionally lead to emulation failures if not disabled --- this is because they attempt to load and check the contents of certain portions of output buffers. We found this to be mitigated by allowing the emulator to `memcpy` small buffers to mock host-host and host-device transfers, passing most verification checks that inspect metadata (such as a tensor count or rank order).

To further validate the efficacy of `\sysname `{=latex}across model architectures, we collect results from a representative vision model, ResNet152 (`\Cref{fig:eval:resnet152}`{=latex}) on an 8xA40 node. This specific workload is particularly challenging due to heterogeneous GPU links and the use of `torch.compile` (Appendix `\ref{appendix:predictions}`{=latex}). Despite this, we observe consistent high-fidelity runtime predictions with less than 5% error over half of all configurations, similar to our experiments with Megatron-LM.

![Prediction accuracy of `\sysname `{=latex}across different configurations of ResNet152 deployed on 8xA40 GPUs.](figures/resnet152_performance.png){#fig:eval:resnet152 width="0.5\\linewidth"}

::: {#tab:eval-config-knobs}
  **Configuration Knob**       **Search Space**
  -------------------------- ------------------
  Tensor Parallel Degree             1, 2, 4, 8
  Pipeline Parallel Degree           1, 2, 4, 8
  Microbatch Multiplier           1, 2, 4, 6, 8
  Number of Virtual Stages              1, 2, 4
  Activation Recomputation          True, False
  Sequence Parallelism              True, False
  Distributed Optimizer             True, False

  : Configuration knobs and their search space.
:::

## Configuration Search with `\sysname`{=latex} {#sec:eval:opti}

We ran a hyperparameter search using our system over the Megatron-LM configuration space for each resource/model specification (`\Cref{tab:eval-config-knobs}`{=latex}). The system was configured to use CMA-ES [@cma; @hansen2001completely] as the search algorithm. Further, we enabled all of our optimizations: dynamic worker de-duplication, inter+intra trial concurrency, and fidelity-preserving trial pruning (using Megatron-LM specific tactics, detailed in Appendix `\ref{appendix:tactics}`{=latex}). The early stopping mechanism was configured to terminate the search if the MFU of the top 5 configs remained the same for 20 consecutive non-OOMing configs.

<figure id="fig:config-search-mfu-optimality">
<figure id="fig:config-search-runtime">
<img src="figures/config_search_e2e.png" />
<figcaption>Configuration search runtime</figcaption>
</figure>
<figure id="fig:config-search-mfu-optimality">
<img src="figures/mfu_optimality-simplified.png" />
<figcaption>Normalized cost</figcaption>
</figure>
<figcaption>End-to-end runtime and fidelity of configuration search. We compare the normalized cost of configurations found using against the optimal. For reference, we also include the optimal configuration found using grid search with .</figcaption>
</figure>

`\vheading{End to end performance}`{=latex} The search completed in under an hour across all resource/model specs (`\Cref{fig:config-search-runtime}`{=latex}). Further, the search was able to find configurations very close to if not the same as the optimal across all resource specs (`\Cref{fig:config-search-mfu-optimality}`{=latex}).

## Supporting Hyperscale Workloads {#sec:eval:hyperscale}

Our experiments thus far have maintained transparency, requiring no domain-specific knowledge of the workload. This also applies to worker deduplication (`\Cref{sec:deduplication}`{=latex}) --- in order to identify which workers are duplicates, the system first emulates all workers for at least one iteration. This presents a challenge when attempting to scale `\sysname `{=latex}to large clusters with thousands of GPUs.

With some explicit knowledge of the workload, however, we observe that unique workers can be identified ahead of time. For instance, in Megatron-LM, we can calculate which ranks would participate in tensor, data, and pipeline communication using the parallelism configuration (`\cref{tab:eval-config-knobs}`{=latex}). This determines the set of unique workers --- specifically, the first data-parallel rank of each communicator group and every pipeline parallel rank. Using this information, we extend `\sysname `{=latex}to *selectively launch* unique workers, drastically reducing overheads.

This optimization enables us to study the behavior of clusters with up to 16K GPUs. Since we did not have access to clusters of this size for profiling collectives, we integrated with ASTRA-sim [@won2023astra] for network simulation. First, keeping the parallelism configuration fixed (TP8, PP8, 12K batch size, 64 microbatches), we vary the data-parallel degree (`\Cref{fig:eval:h100_dp_scaling_mfu}`{=latex}). The results demonstrate the expected trend of ***sublinear scaling*** --- as the number of GPUs is scaled, communication overhead dominates and leads to low MFU.

<figure id="fig:eval:eager-scaling">
<img src="figures/h100_superscale_dp_scaling_mfu.png" style="width:90.0%" />
<img src="figures/h100_superscale_stack_runtime.png" style="width:92.0%" />
<figcaption>stack runtime when scaling to 16K GPUs.</figcaption>
</figure>

In `\cref{fig:eval:eager-scaling}`{=latex}, we keep the configuration entirely fixed and scale the global batch size. The largest configuration takes $\sim$`<!-- -->`{=html}25 minutes to run using 8 unique workers, each corresponding to a pipeline parallel rank. While not conducive to an exhaustive config search, these results demonstrate that `\sysname `{=latex}can effectively scale to thousands of GPUs.

## Ablation studies {#sec:eval:ablation}

`\vheading{Impact of dynamic worker deduplication}`{=latex} To quantify the impact of dynamic worker deduplication on `\sysname`{=latex}'s end-to-end runtime, we fix the parallelism configuration and increase the data parallel degree (thereby testing a larger cluster). Any new DP workers added would be redundant from the perspective of emulation --- this allows us to isolate the impact of dynamic worker deduplication. `\Cref{fig:ablation-unique-ranks}`{=latex} illustrates the results.

<figure id="fig:config-search-heuristic-performance">
<img src="figures/v100-unique_ranks_ablation.png" />
<img src="figures/heuristic_performance.png" />
<figcaption>Trial status breakdown during config search.</figcaption>
</figure>

Without worker deduplication, we observe a significant increase in runtime, with the H100 64 GPU run taking approximately two hours. This is because the system has to emulate and subsequently simulate the execution of every GPU, which results in increased overhead as the number of GPUs increases. In contrast, with dynamic worker deduplication, we observe that the runtime remains approximately the same, with the H100 64 GPU run now taking only 7 minutes -- a 94% improvement. We attribute this to the following. First, deduplication eliminates both the emulation and simulation of redundant GPUs. Second, scaling certain parallelism configuration knobs does not impact the number of *unique* workers; this can be exploited to improve efficiency.

`\vheading{Impact of fidelity-preserving trial pruning}`{=latex} For the configuration search carried out in `\Cref{sec:eval:opti}`{=latex}, the trial skipping mechanism skipped around 20-30% of configurations (`\Cref{fig:config-search-heuristic-performance}`{=latex}) across all resource/model specs, thereby playing a considerable role in bringing down the overall search time.

::: {#tab:eval-h100-32-ablation}
       **Stage**        `\sysname `{=latex}   No Optimization
  -------------------- --------------------- -----------------
       Emulation                9m                  14m
    Trace Collation             2m                  7m
   Runtime prediction          1.5m                 8m
       Simulation              4.5m                 55m
   Total search time            38m               \>24hrs

  : Runtime statistics of configuration search on the H100 32 resource/model spec with and without optimizations enabled. The per-stage times are averaged across all trials.
:::

`\vheading{Impact of optimizations on config search runtime}`{=latex} To evaluate the impact of all optimizations (including the use of the CMA search algorithm) on overall search runtime, we compare against grid search without any heuristic optimizations. As evident in `\Cref{tab:eval-h100-32-ablation}`{=latex}, the optimizations significantly reduce the overall search time, bringing it down from over a day to just under 40 minutes. Worker deduplication is a key enabler for this reduction since it reduces the resource usage of each trial, enabling greater concurrency. This is corroborated by the increased OOM rate when running the full set of workers without optimizations. `\sysname`{=latex}'s applicability to large configuration spaces would not be possible without deduplication and trial pruning.

# Discussion

`\vheading{Taxonomy of CPU computation}`{=latex} `\sysname `{=latex}models host-side overheads as wall-clock time measurements between API calls to the emulator. This allows arbitrary host logic to be abstracted away, while still accounting for the impact of these overheads on end-to-end latency. However, there are workloads where significant CPU computation is involved, and this could affect prediction accuracy if there are hardware differences between the machine used during emulation vs. the target cluster. This can be addressed by applying the per-operation prediction approach to CPU work instead of simply collecting a wall-clock time, though this may not be exposed through a narrow API surface like accelerators. A combination of these two models could enable more general CPU overhead estimates.

`\vheading{Dynamic control flow}`{=latex} As a result of relying on emulation, `\sysname `{=latex}does not model computation graphs where the control flow depends on the result of tensor computation. This assumption is shared with several DL compilers and parallelism search engines [@zheng2022alpa], [@flexflow]. Mixture-of-Experts (MoE) architectures display this pattern --- while most expert-parallel kernels used for MoE training [@deepep2025; @pplxrepo] remove the need for data-dependent control-flow, there are some implementations that use gating on the host.

For expert-parallel kernels, runtime predictors can be trained by encoding the input distribution during the profiling phase [@lin2025apexextensibledynamismawaresimulator; @vidur], keeping the rest of the `\sysname `{=latex}flow unchanged. To handle host-side gating, annotations on the source model can be used to identify the gating function --- instead of returning a random tensor during emulation, we would sample a distribution to generate a spread of runtimes.

`\vheading{SM Contention}`{=latex} `\sysname `{=latex}assumes decoupling between network collectives and concurrent compute streams. As a result, while we are able to model overlapping streams and arbitrary synchronization, we cannot trivially model SM-level interference where network and compute kernels contend for resources. It could be possible to modify and extend the simulator to identify such patterns and scale predicted durations accordingly --- we leave this to future work.

# Related Work

The growing computational demands of training large foundation models have driven significant research into performance modeling and optimization of DLT workloads.

`\vheading{Kernel runtime prediction}`{=latex} Habitat [@geoffrey2021habitat] extrapolates single-GPU measurements to predict cross-device performance. More recent approaches such as NeuSight [@neusight], Omniwise [@omniwise] rely on a mix of profiling data and architectural details of the accelerator to predict kernel runtimes more accurately. ASTRA-sim [@rashidi2020astra] focuses specifically on network topology and collective communication modeling. These are complementary to `\sysname `{=latex}and can be plugged in as needed, enabling end-to-end estimates on a wide range of workloads.

`\vheading{Analytical Performance Models}`{=latex} Analytical models predict DLT performance through mathematical formulations of system behavior. AMPed [@moolchandani2023amped] and Calculon [@isaev2023calculon] propose specialized models for LLMs but support only limited parallelization strategies and require explicit modeling of new optimizations. Other work focuses on specific architectures like CNNs [@yan2015performance; @qi2017paleo; @gianniti2018performance]. While these techniques can provide quick estimates, their applicability is limited to specific models and configurations.

`\vheading{Domain-Specific Simulators}`{=latex} Simulation-based approaches aim to capture detailed system behavior through explicit modeling. Proteus [@duan2023proteus] introduces a strategy tree abstraction for modeling parallelization patterns but requires translation into a custom specification language. DistIR [@santhanam2021distir] proposes an intermediate representation for distributed computations but struggles with complex parallelization strategies. Daydream [@zhu2020daydream] captures dependency graphs from execution traces, but requires GPU access and manual optimization modeling. vTrain [@bang2023vtrain] uses CUPTI profiling to measure kernel runtimes but faces challenges modeling communication patterns in complex parallelization strategies.

# Conclusion {#sec:conclusion}

Training large foundation models at scale has made the optimization of training recipes for hardware utilization a critical challenge, with costs reaching hundreds of millions of dollars. We introduce, `\sysname`{=latex}, a runtime modeling system, addresses this challenge through a fundamental insight: by operating at the narrow interface between training frameworks and accelerator devices, we can eliminate the semantic gap that forces existing systems to trade off between accuracy, usability, and generality. Through transparent device emulation and precise runtime simulation, `\sysname `{=latex}achieves prediction accuracy within `\evalpe`{=latex}% error and identifies configurations within `\evalcg`{=latex}% of optimal cost across diverse scenarios, from small V100 clusters to large-scale H100 deployments. As distributed training continues to push the boundaries of scale and complexity, `\sysname`{=latex}'s transparent runtime modeling approach is a crucial step toward sustainable and efficient deployment of large-scale AI systems.

# Acknowledgments {#acknowledgments .unnumbered}

This material is based on work that was partially supported by the National Science Foundation under grant number CNS-2420977. We would like to express our sincere gratitude to the reviewers, the PC panel, and especially our shepherd Prof. Richard Mortier for their insightful comments and thoughtful consideration, which significantly improved the quality of this paper.\
**Disclaimer**: Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

```{=latex}
\bibliographystyle{plain}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
# Key Algorithms {#appendix:simulator_core}

```{=latex}
\vheading{Simulator}
```
The core logic of the simulator is summarized in Algorithm `\ref{alg:simulator}`{=latex}. At a high-level, the simulator handles discrete events and increments a clock on the completion of each event. Events are classified into different types based on their effect --- for instance, some events indicate the enqueuing of a kernel on the GPU, while others indicate a synchronization between device streams.

The scheduler (Algorithm `\ref{alg:scheduler}`{=latex}) handles updating the state of each host and device in the simulated cluster and adding or removing operations from the corresponding queues. Resources that are busy will cause any new ops targeting them to be queued, effectively simulating blocking delays by deferring their execution. Every time an operation completes execution, the next scheduler tick pops an operation from the resource-specific queue and adds an $\texttt{EndEvent}$ marking its completion in the future. Every discrete $\texttt{EndEvent}$ in the top-level queue is followed by a scheduler tick, ensuring that operations do not block forever on resources.

```{=latex}
\begin{algorithm}
\caption{Core Discrete-Event Simulator Algorithm}
\label{alg:simulator}
\begin{algorithmic}[1]
\Procedure{Simulate}{config, host\_op\_trace}
    \State $\textit{time} \gets 0$
    \State $\textit{event\_queue} \gets \text{ PriorityQueue()}$
    \State $\textit{cluster} \gets \text{ Cluster(config)}$
    \State $\textit{scheduler} \gets \text{ Scheduler(config, cluster)}$

    \Statex \LComment{Init queue with host ops and inter-host-op overheads from the input trace}
    \For{each $\textit{host\_op} \lor \textit{overhead}$  in $\textit{host\_op\_trace}$}
        \State $\textit{event} \gets \text{ HostOpArrivalEvent}(\textit{host\_op})$
        \State $\textit{overhead} \gets \text{ HostOverhead}(\textit{overhead\_dur})$
        \State $\textit{event\_queue.put(event)}$
        \State $\textit{event\_queue.put(overhead)}$

    \EndFor

    \Statex
    \While{$\neg \textit{event\_queue.empty()}$}
        \LComment{Get the next chronological event}
        \State $\textit{event} \gets \textit{event\_queue.get()}$

        \Statex \LComment{Update simulation time.}
        \State $\textit{time} \gets \textit{event.end\_time}$

        \Statex \Comment{Handle event polymorphically based on its type.}
        \State $\textit{new\_events} \gets \textit{event.handle\_event(scheduler)}$

        \LComment{Add newly generated events to the queue}
        \For{each $\textit{new\_event}$ in $\textit{new\_events}$}
            \State $\textit{event\_queue.put(new\_event)}$
        \EndFor

    \EndWhile

    \Statex
    \Return $\textit{time}$
\EndProcedure
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\begin{algorithm}
\caption{Scheduler Event Handling Logic}
\label{alg:scheduler}
\begin{algorithmic}[1]
\Procedure{Event.handle\_event}{scheduler}
    \State \Comment{The logic here is polymorphic, depending on the concrete event type.}
    \Statex

    \If{$\textit{event}$ is an \textbf{OpArrivalEvent}}
        \LComment{An operation from the trace has arrived (e.g., kernel launch).}
        \State \Return $\textit{scheduler.schedule\_operation(event.op)}$

    \ElsIf{$\textit{event}$ is an \textbf{EndEvent}}
        \LComment{An operation has finished, freeing a resource.}
        \State \Return $\textit{scheduler.op\_complete(event.op)}$
    \ElsIf{$\textit{event}$ is a \textbf{ScheduleEvent}}
        \LComment{A global scheduling tick occurs.}
        \State $\textit{newly\_started\_ops} \gets \textit{scheduler.schedule()}$
        \State \Return $\text{create\_end\_events}(\textit{newly\_started\_ops})$

    \Else
        \LComment{Handle other event types (e.g., sync, collective).}
        \State \Return $\textit{handle\_other\_events(scheduler, event)}$
    \EndIf
\EndProcedure
\end{algorithmic}



\begin{algorithmic}[1]
\Procedure{Scheduler.schedule()}{}
\State $\textit{newly\_started\_ops}$ = $\emptyset$
    \For{each $\textit{device, stream}$ in $\textit{scheduler.cluster}$ }
        \If{$\textit{device.is\_busy()} \lor \textit{stream.is\_busy()}$}
            \LComment{A required resource is busy, so don't deque}
            \State $\textbf{continue}$
        \Else
            \LComment{Resources are free, so process the op}
            \State $\textit{op} \gets \textit{device\_queue}$.front()
            \State $\textit{device.set\_busy()}$
            \State $\textit{stream.set\_busy()}$

            \State $\textit{duration} \gets \text{get\_runtime(op)}$
            \State $\textit{end\_time} \gets \textit{current\_time} + \textit{duration}$
            \State $\textit{end\_event} \gets \text{ EndEvent}(\textit{end\_time}, \textit{op})$
        \EndIf
    \EndFor
    \State \Return $\textit{newly\_started\_ops}$
\EndProcedure
\end{algorithmic}



\begin{algorithmic}[1]
\Procedure{Scheduler.op\_complete}{completed\_op}
    \State $\textit{device}, \textit{stream} \gets \text{completed\_op.get\_resources()}$

    \LComment{Mark the resources as free}
    \State $\textit{device.set\_free()}$
    \State $\textit{stream.set\_free()}$

    \If{$\textit{device.wait\_queue.is\_not\_empty()}$}
        \State \Comment{Check for and schedule the next pending operation}
        \State $\textit{next\_op} \gets \textit{device.wait\_queue.get\_next()}$
        \State \Return $\text{schedule\_operation}(\textit{next\_op})$
    \Else
        \State \Return $\emptyset$ \Comment{No pending work for this resource}
    \EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\begin{algorithm}
\caption{Synchronization Wait Map Structures}
\label{alg:waitmaps}
\begin{algorithmic}[1]
\algblock{CudaEventWaitMap}{EndCudaEventWaitMap}
\CudaEventWaitMap
    \State \Comment{\textbf{Structure:} Map from a CUDA (event ID, version) pair to a list ops waiting for it. Versions track re-use of the same CUDA event handle.}
    \State $\textit{events}: (\textit{event\_id}, \textit{version}) \to \textit{waiting\_ops}$
    \Statex
    \Procedure{BlockOnEvent}{event\_id, version, op}
        \LComment{An operation `op` (from a host or stream) blocks on a future event.}
        \State $\textit{events}[\textit{event\_id},\textit{version}].add(op)$
        \State Stall the host/stream associated with $\textit{op}$
    \EndProcedure
    \Statex
    \Procedure{ReleaseWaiters}{event\_id, version}
        \State \Comment{The event has been recorded; release all waiting operations for scheduling.}
        \State $\textit{released\_ops} \gets \textit{events}.pop(\textit{event\_id},\textit{version})$
        \For{each $op$ in $released\_ops$}
        \State Free the host/stream associated with $op$ by creating the associated $EndEvent$ instances
        \EndFor
        \State \Return $\textit{released\_ops}$
    \EndProcedure
\EndCudaEventWaitMap
\end{algorithmic}


\hrule

\begin{algorithmic}[1]
\algblock{NetworkCollectiveWaitMap}{EndNetworkCollectiveWaitMap}
\NetworkCollectiveWaitMap
    \State \Comment{\textbf{Structure:} A map from a NCCL collective's unique ID to its list of participant kernels.}
    \State $\textit{collectives}: (\textit{nccl\_group\_id}, \textit{call\_idx}) \to \textit{kernels}$
    \Statex
    \Procedure{JoinCollective}{kernel}
        \State \Comment{A device's kernel joins a collective operation and waits for peers.}
        \State $collectives[\textit{group\_id}, \textit{call\_idx}].add(kernel)$
        \State $\textit{wait\_list} \gets collectives[\textit{group\_id}, \textit{call\_idx}]$
        \Statex
        \If{$\text{length}(\textit{wait\_list}) = \textit{kernel.num\_ranks}$}
            \LComment{The last worker has arrived; the collective can proceed.}
            \State $collectives.pop(\textit{group\_id}, \textit{call\_idx})$
            \LComment{Return all kernels to be scheduled.}
            \State \Return $\textit{wait\_list}$
        \Else
            \State \Comment{Not all workers have arrived; keep blocking.}
            \State \Return $\emptyset$
        \EndIf
    \EndProcedure
\EndNetworkCollectiveWaitMap
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\vfill
```
```{=latex}
\break
```
The simulator maintains two global structures to track synchronization across hosts and accelerators --- the CUDA Event Wait Map, and the Network Collective Wait Map. These are detailed in Algorithm `\ref{alg:waitmaps}`{=latex}. Corresponding event types that perform synchronization lookup these maps in their handlers. For example, in the case of $\texttt{cudaEventSynchronize}$, the specific device stream or host that intends to block while waiting for a CUDA event inserts an entry in the wait-map (keeping its associated resources from processing ops), while the matching $\texttt{cudaEventRecord}$ triggers the resources to be freed by creating $\texttt{EndEvent}$ instances for the blocking ops.

A similar mechanism is used to model collectives --- each worker makes an entry in the collective wait map. Once the final worker of the collective joins, all the corresponding streams are unblocked and can proceed. In this case, $\texttt{EndEvent}$s are scheduled with a timestamp after the predicted duration of the collective; effectively, `\sysname `{=latex}models the delays involved in starting a collective using a global sync point and then assumes that workers move in lockstep. Any effects associated with the on-the-wire time of the collective can thus be abstracted away in the predicted time --- while this is not completely faithful to the setup/teardown of NCCL collectives, it is sufficiently accurate for an end-to-end accounting of latency.

These fairly simple structures can express a wide variety of possible synchronization behaviors since they operate at the CUDA stream level. Computation streams can overlap with collectives since each stream is a separate resource. The host queue can block on a specific CUDA event or a device stream, deferring the execution of future CUDA API calls. An arbitrary pipeline parallel schedule is a combination of such synchronizing events, and thus `\sysname `{=latex}can trivially capture these behaviors without any explicit modeling.

# Per-kernel prediction accuracy {#appendix:predictions}

The default predictors in `\sysname `{=latex}use random forest regressors trained on per-kernel runtime data. Tables `\ref{tab:h100_kernel_accuracy}`{=latex}, `\ref{tab:v100_kernel_accuracy}`{=latex} and `\ref{tab:a40_kernel_accuracy}`{=latex} include metrics on the prediction error of the kernels trained for Megatron-LM (H100, V100) and PyTorch FSDP (A40). All results use a random 80:20 training/test data split.

As a general theme, we observe the same characteristics as [@geoffrey2021habitat], [@zhu2020daydream] --- a small portion of the kernels are responsible for a significant portion of end-to-end prediction error (matmuls for language models, convolution kernels for vision models). As a result, even large percentage-wise errors in several other kernels do not cause any significant degradation in end-to-end accuracy.

In keeping with this observation, we conduct more extensive profiling of these heavy-hitter kernels --- sweeping a large space of input dimensions for convolution/matmul. The remaining kernels are scraped from traces, collecting by running a single-layer LLaMa/OPT/vision model over a range of batch sizes and tensor-parallel dimensions (since other optimizations like pipeline parallelism do not affect the runtime of a single kernel). The training set for the heavy-hitter kernels included $\approx$ 42k individual points, compared to a few thousand points each for the rest.

::: {#tab:h100_kernel_accuracy}
  **Kernel**                              **MAPE**
  --------------------------------------- ----------
  RadixSortOnesweepKernel                 7.80%
  cuComputeGradGammaBeta                  7.95%
  masked_softmax_warp_backward            0.73%
  compute_num_of_partial_segments         7.37%
  unrolled_elementwise_kernel             5.80%
  write_num_of_segments                   7.27%
  cuApplyLayerNorm                        1.98%
  MemcpyHtoD                              14.23%
  CatArrayBatchedCopy_aligned16_contig    5.79%
  cuComputeGradInput                      3.50%
  MemcpyDtoH                              7.85%
  compute_grad_weight                     3.63%
  at_cuda_detailcubDeviceScanKernel       5.37%
  cublasSgemm_v2                          3.65%
  cublasSgemmStridedBatched               2.22%
  indexSelectLargeIndex                   1.88%
  multi_tensor_apply_kernel               1.68%
  at_cuda_detailcubDeviceScanInitKernel   6.99%
  triu_tril_kernel                        4.38%
  vectorized_elementwise_kernel           8.44%
  krn_partial_segment_offset              55.35%
  RadixSortExclusiveSumKernel             14.30%
  CatArrayBatchedCopy                     43.71%
  fused_dropout_kernel_vec                1.50%
  index_elementwise_kernel                12.86%
  sum_and_scatter                         48.82%
  MemcpyDtoD                              0.00%
  reduce_kernel                           16.75%
  RadixSortHistogramKernel                9.01%
  masked_softmax_warp_forward             1.00%
  cuComputePartGradGammaBeta              4.12%
  krn_partials_per_segment                7.16%
  elementwise_kernel                      10.28%
  elementwise_kernel_with_index           24.67%
  thrustcuda_cubcore_kernel_agent         12.51%
  Memset                                  13.25%

  : Mean absolute percentage error on a held-out validation set, trained on H100 kernel runtimes. Important kernel types for Megatron-LM models include `cublasSgemm_v2` and `cublasSgemmStridedBatched`, where we have \<5% prediction error. Kernels with large percentage-wise errors are extremely short in duration, and thus do not impact end-to-end latency significantly.
:::

::: {#tab:v100_kernel_accuracy}
  **Kernel**                              **MAPE**
  --------------------------------------- ----------
  scaled_masked_softmax_warp_backward     0.41%
  at_cuda_detailcubDeviceScanKernel       5.80%
  sum_and_scatter                         49.87%
  write_num_of_segments                   30.59%
  vectorized_elementwise_kernel           11.44%
  indexSelectLargeIndex                   7.20%
  elementwise_kernel                      26.48%
  krn_partial_segment_offset              48.18%
  fused_dropout_kernel_vec                1.03%
  index_elementwise_kernel                10.47%
  cuApplyLayerNorm                        1.36%
  elementwise_kernel_with_index           31.91%
  compute_num_of_partial_segments         10.39%
  cuComputePartGradGammaBeta              3.05%
  MemcpyDtoH                              39.56%
  unrolled_elementwise_kernel             13.89%
  Memset                                  36.75%
  MemcpyHtoD                              25.61%
  CatArrayBatchedCopy                     105.45%
  krn_partials_per_segment                11.16%
  cuComputeGradInput                      1.80%
  cublasSgemm_v2                          4.58%
  compute_grad_weight                     2.23%
  triu_tril_kernel                        11.76%
  multi_tensor_apply_kernel               3.40%
  RadixSortHistogramKernel                9.00%
  at_cuda_detailcubDeviceScanInitKernel   14.83%
  masked_softmax_warp_forward             1.20%
  RadixSortExclusiveSumKernel             38.65%
  thrustcuda_cubcore_kernel_agent         32.53%
  cublasSgemmStridedBatched               1.84%
  cuComputeGradGammaBeta                  18.95%
  CatArrayBatchedCopy_aligned16_contig    20.19%
  reduce_kernel                           24.64%
  RadixSortOnesweepKernel                 13.54%
  MemcpyDtoD                              33.25%
  scaled_masked_softmax_warp_forward      0.48%
  softmax_warp_backward                   1.04%

  : Mean absolute percentage error on a held-out validation set, trained on V100 kernel runtimes. Important kernel types for Megatron-LM models include `cublasSgemm_v2` and `cublasSgemmStridedBatched`, where we have \<5% prediction error. Kernels with large percentage-wise errors are extremely short in duration, and thus do not impact end-to-end latency significantly.
:::

In contrast to computation operations, there is a much smaller set of network collectives (\<10) that is used in deep learning workloads. Furthermore, the input space of these operators typically much smaller, typically comprising only two parameters -- number of workers and input size. This allows us to devise a simple policy for modeling these operations. We first collect performance data in a fashion similar to `nccl-tests`. We only sample data in the range that is generally relevant for training workloads ranging from tens of megabytes to tens of gigabytes. We then use our regression pipeline to interpolate within this range. While this affects generalization to dimensions outside the range of the training set, this does not pose a problem in practice since the collective sizes are bounded by the batch size, model parameters and accelerator memory.

Automatically generated fused kernels pose a unique challenge due to an explosion in generated kernel signatures --- arising from a large number of op combinations. We address this by collecting information from the compiler IR about the content of the kernels rather than just their inputs. In our experiments with the compiler-fused Triton kernels used in `torch.compile`, features such as the number of primitive Triton language instructions (add, sub etc.) in the kernel definition proved valuable in predicting kernel runtimes. The corresponding training data was collected by sweeping workload traces for different models/batch sizes and extracting the relevant features/runtimes. Through this approach, we achieve comparable accuracy to that of the kernel predictions trained on focused micro-benchmarks.

::: {#tab:a40_kernel_accuracy}
  **Kernel**                                    **MAPE**
  --------------------------------------------- ----------
  cudnnConvolutionBackwardFilter                9.16%
  elementwise_kernel                            24.35%
  CatArrayBatchedCopy_aligned16_contig          14.96%
  Memset                                        34.27%
  triton                                        4.13%
  cudnnConvolutionBackwardData                  7.89%
  tensor_kernel_scan_innermost_dim              153.91%
  MemcpyDtoH                                    37.05%
  cublasSgemm_v2                                37.08%
  softmax_warp_forward                          229.09%
  MemcpyHtoD                                    27.71%
  cudnnConvolutionForward                       6.31%
  multi_tensor_apply_kernel                     1.51%
  cublasSgemmStridedBatched                     63.61%
  nll_loss_backward_reduce_cuda_kernel_2d       253.28%
  softmax_warp_backward                         164.03%
  unrolled_elementwise_kernel                   10.98%
  max_pool_backward_nhwc                        17.19%
  cublasLtMatmul                                83.92%
  MemcpyDtoD                                    65.84%
  CatArrayBatchedCopy                           96.63%
  vectorized_elementwise_kernel                 18.18%
  distribution_elementwise_grid_stride_kernel   228.62%
  nll_loss_forward_reduce_cuda_kernel_2d        171.61%

  : Mean absolute percentage error on a held-out validation set, trained on A40 kernel runtimes. Important kernel types for vision models include `cudnnConvolution` and `triton`, where we have \<10% prediction error. Kernels with large percentage-wise errors are extremely short in duration, and thus do not impact end-to-end latency significantly.
:::

# Performance of alternate search algorithms

Tune supports several search algorithms out of the box. We investigated the performance of a subset of these algorithms by the progress of the search at distinct phases. Each phase was defined by the number of unique valid configurations sampled by the algorithm up to that point. Every algorithm (with the exception of grid search) was allocated a budget of 2000 samples. `\Cref{fig:config-search-alg-comparison}`{=latex} shows the results of this experiment, where the MFU is computed from the iteration times predicted by `\sysname`{=latex}. Interestingly, despite the fact that these algorithms are general-purpose and therefore lack domain-specific knowledge of the search space, they appear to converge after having sampled about 200 to 300 unique valid configurations, a 60-75% improvement over grid search.

# Fidelity-preserving Tactics for the Megatron-LM Search Space {#appendix:tactics}

We leverage the performance characteristics of certain Megatron-LM configuration knobs to devise four fidelity-preserving tactics, summarized in `\Cref{table:fidelity-preserving-tactics}`{=latex}.

![Comparison of search algorithms exploring GPT3-2.7B (left) and GPT3-18.4B (right). Each algorithm is given a 2000 sample budget. Most algorithms achieve near-optimal MFU after 200-300 valid configurations, providing 60-75% improvement over grid search. `\amey{is the model same in both cases?}`{=latex}](figures/alg_comparison_mini.png){#fig:config-search-alg-comparison width="44%"}

```{=latex}
\begin{table*}[t]

\begin{tabular}{c p{0.3\textwidth} p{0.4\textwidth}}
\toprule
\textbf{Knob} & \textbf{Performance characteristics} & \textbf{Tactic} \\
\midrule
Activation recomputation & Reduces memory footprint through smart activation checkpointing
 & If a prior config OOMed with activation recomputation enabled, then skip the similar config that only disables activation recomputation and mark its result as OOMed \\
Sequence parallelism & Reduces memory footprint by reducing activation memory with no added communication cost & If a prior config OOMed with sequence parallelism enabled, then skip the similar config that only disables sequence parallelism and mark its result as OOMed \\
Distributed optimizer & Reduces memory footprint by sharding gradient and optimizer state with added communication cost & If a prior config did not OOM without the distributed optimizer, then skip the similar config that only enables the distributed optimizer and set its runtime to be the same \\
No. of microbatches & In the absence of pipeline parallelism, hardware utilization is inversely proportional to the number of microbatches \cite{megatron}. & If a prior config did not OOM with number of microbatches $n$ and no pipeline parallelism, then skip the similar config that only increases the number of microbatches and set its runtime to be the same \\
\bottomrule
\end{tabular}
\caption{Summary of fidelity-preserving tactics used in Megatron-LM configuration search experiment}
\label{table:fidelity-preserving-tactics}
\end{table*}
```

[^1]: Equal technical contribution. Srihas Yarlagadda and Elton Pinto led the development on `\sysname `{=latex}and `\syssearchname `{=latex}respectively, Amey Agrawal was responsible for the overall system design. $^\dag$ Work done while at Georgia Institute of Technology.

[^2]: We contacted the authors of `\prot `{=latex}and `\amped `{=latex}to resolve these anomalies but could not arrive at a resolution.
