---
abstract: |
  Deploying LLMs efficiently requires testing hundreds of serving configurations, but evaluating each one on a GPU cluster takes hours and costs thousands of dollars. Discrete-event simulators are faster and cheaper, but they require re-implementing the serving system's control logic --- a burden that compounds as frameworks evolve.

  We present `\sysname{}`{=latex}, a *time-warp emulator* that enables performance modeling by directly executing real serving system code at simulation-like speed. The system intercepts CUDA API calls to virtualize device management, allowing serving frameworks to run without physical GPUs. Instead of executing GPU kernels, it performs *time jumps* -- fast-forwarding virtual time by predicted kernel durations. We propose a coordination protocol that synchronizes these jumps across distributed processes while preserving causality. On vLLM and SGLang, `\sysname{}`{=latex} achieves $<$`<!-- -->`{=html}5% prediction error across multiple models and parallelism configurations, while running 5-17$\times$ faster than real GPU execution.
author:
- Amey Agrawal$^*$
- Mayank Yadav$^*$
- Sukrit Kumar$^\dagger$
- Anirudha Agrawal$^\dagger$
- Garv Ghai$^\dagger$
- Souradeep Bera
- Elton Pinto
- Sirish Gambhira
- Mohammad Adain
- Kasra Sohrab
- Chus Antonanzas
- Alexey Tumanov
bibliography:
- all.bib
title: "`\\sysname`{=latex}: Transparent GPU-Free Time-Warp Emulation for LLM Serving"
---

```{=latex}
\def\UrlBreaks{\do\/\do-}
```
```{=latex}
\newcommand{\sysname}{\textsc{Revati}\xspace}
```
```{=latex}
\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}
```
```{=latex}
\newcommand{\vllm}{vLLM\xspace}
```
```{=latex}
\newcommand{\sglang}{SGLang\xspace}
```
```{=latex}
\newcommand{\sarathi}{Sarathi-Serve\xspace}
```
```{=latex}
\newcommand{\dynamo}{NVIDIA Dynamo\xspace}
```
```{=latex}
\newcommand{\distserve}{DistServe\xspace}
```
```{=latex}
\newcommand{\orca}{Orca\xspace}
```
```{=latex}
\newcommand{\mooncake}{Mooncake\xspace}
```
```{=latex}
\newcommand{\splitwise}{Splitwise\xspace}
```
```{=latex}
\newcommand{\vidur}{Vidur\xspace}
```
```{=latex}
\newcommand{\lss}{LLMServingSim\xspace}
```
```{=latex}
\newcommand{\apex}{APEX\xspace}
```
```{=latex}
\newcommand{\frontier}{Frontier\xspace}
```
```{=latex}
\newcommand{\mist}{MIST\xspace}
```
```{=latex}
\newcommand{\sgap}{\textit{semantic gap}\xspace}
```
```{=latex}
\newcommand{\Sgap}{\textit{Semantic Gap}\xspace}
```
```{=latex}
\newcommand{\tacc}{\textit{time accelerated emulation}\xspace}
```
```{=latex}
\newcommand{\Tacc}{\textit{Time Accelerated Emulation}\xspace}
```
```{=latex}
\newcommand{\vgpu}{\textit{virtual emulated GPUs}\xspace}
```
```{=latex}
\newcommand{\Vgpu}{\textit{Virtual Emulated GPUs}\xspace}
```
```{=latex}
\newcommand{\llmservingsim}{LLMServingSim\xspace}
```
```{=latex}
\newcommand{\maya}{Maya\xspace}
```
```{=latex}
\newcommand{\greencheck}{\textcolor{codegreen}{\checkmark}}
```
```{=latex}
\newcommand{\redcross}{\textcolor{red}{$\times$}}
```
```{=latex}
\newcommand{\partialmark}{{\color{blue}$\sim$}}
```
```{=latex}
\newcommand{\transparentemulation}{transparent emulation\xspace}
```
```{=latex}
\newcommand{\timeaccelemulation}{time-accelerated emulation\xspace}
```
```{=latex}
\newcommand{\semanticgap}{semantic gap\xspace}
```
```{=latex}
\newcommand{\virtualtime}{virtual time\xspace}
```
```{=latex}
\newcommand{\controlplane}{control plane\xspace}
```
```{=latex}
\newcommand{\dataplane}{data plane\xspace}
```
```{=latex}
\newcommand{\timekeeper}{\textit{Timekeeper}\xspace}
```
```{=latex}
\newcommand{\actors}{\textit{Actors}\xspace}
```
```{=latex}
\newcommand{\observers}{\textit{Observers}\xspace}
```
```{=latex}
\newcommand{\actor}{\textit{Actor}\xspace}
```
```{=latex}
\newcommand{\observer}{\textit{Observer}\xspace}
```
```{=latex}
\newcommand{\pagedattention}{PagedAttention\xspace}
```
```{=latex}
\newcommand{\radixattention}{RadixAttention\xspace}
```
```{=latex}
\newcommand{\Prefill}{\textit{Prefill}\xspace}
```
```{=latex}
\newcommand{\prefill}{\textit{prefill}\xspace}
```
```{=latex}
\newcommand{\Decode}{\textit{Decode}\xspace}
```
```{=latex}
\newcommand{\decode}{\textit{decode}\xspace}
```
```{=latex}
\newcommand{\myx}{$\times$\xspace}
```
```{=latex}
\newcommand{\llama}{LLaMA\xspace}
```
```{=latex}
\newcommand{\llamatwo}{LLaMA-2\xspace}
```
```{=latex}
\newcommand{\llamathree}{LLaMA-3.1\xspace}
```
```{=latex}
\newcommand{\mistral}{Mistral\xspace}
```
```{=latex}
\newcommand{\mixtral}{Mixtral\xspace}
```
```{=latex}
\newcommand{\ie}{\textit{i.e.,}\xspace}
```
```{=latex}
\newcommand{\eg}{\textit{e.g.,}\xspace}
```
```{=latex}
\newcommand{\etal}{\textit{et al.}\xspace}
```
```{=latex}
\newcommand{\sref}[1]{\S\ref{#1}}
```
```{=latex}
\newcommand{\mycaption}[2]{\caption{\textbf{#1} {#2}}}
```
```{=latex}
\newcommand{\vheading}[1]{\noindent\textbf{#1.}}
```
```{=latex}
\newcommand{\begincompactitemize}{\begin{itemize}[noitemsep,topsep=0pt,parsep=0pt,partopsep=0pt]}
```
```{=latex}
\newcommand{\todo}[1]{}
```
```{=latex}
\newcommand{\llamatwoSevenB}{7B\xspace}
```
```{=latex}
\newcommand{\llamatwoThirteenB}{13B\xspace}
```
```{=latex}
\newcommand{\llamatwoSeventyB}{70B\xspace}
```
```{=latex}
\newcommand{\llamathreeEightB}{8B\xspace}
```
```{=latex}
\newcommand{\llamathreeSeventyB}{70B\xspace}
```
```{=latex}
\newcommand{\llamathreeFourZeroFiveB}{405B\xspace}
```
```{=latex}
\newcommand{\mistralLargeSize}{123B\xspace}
```
```{=latex}
\newcommand{\mixtralSize}{8x22B\xspace}
```
```{=latex}
\newcommand{\maxModelSize}{480B\xspace}
```
```{=latex}
\newcommand{\llamaS}{Llama-3.1-8B\xspace}
```
```{=latex}
\newcommand{\llamaL}{Llama-3.1-70B\xspace}
```
```{=latex}
\newcommand{\qwenMM}{Qwen3-30B-A3B\xspace}
```
```{=latex}
\newcommand{\deepseek}{TODO deepseek\xspace}
```
```{=latex}
\newcommand{\kimitwo}{TODO Kimi2\xspace}
```
```{=latex}
\newcommand{\revatiMedianError}{5.1\%\xspace}
```
```{=latex}
\newcommand{\revatiMeanError}{X\%\xspace}
```
```{=latex}
\newcommand{\revatiPNineFiveError}{Y\%\xspace}
```
```{=latex}
\newcommand{\vidurError}{9\%\xspace}
```
```{=latex}
\newcommand{\apexError}{10.7\%\xspace}
```
```{=latex}
\newcommand{\llmservingsimError}{14.7\%\xspace}
```
```{=latex}
\newcommand{\mayaError}{<5\%\xspace}
```
```{=latex}
\newcommand{\simulatorErrorRange}{9-15\%\xspace}
```
```{=latex}
\newcommand{\predictionModelError}{<3\%\xspace}
```
```{=latex}
\newcommand{\batchCompositionAccuracy}{93\%\xspace}
```
```{=latex}
\newcommand{\cacheHitAccuracy}{>95\%\xspace}
```
```{=latex}
\newcommand{\sloAccuracy}{98\%\xspace}
```
```{=latex}
\newcommand{\timeAccelerationMin}{20\xspace}
```
```{=latex}
\newcommand{\timeAccelerationMax}{50\xspace}
```
```{=latex}
\newcommand{\timeAccelerationRange}{20-50}
```
```{=latex}
\newcommand{\timeAccelerationMultiplier}{20-50\myx}
```
```{=latex}
\newcommand{\chatgptDailyCost}{\$694K\xspace}
```
```{=latex}
\newcommand{\apexExplorationCost}{\$218K\xspace}
```
```{=latex}
\newcommand{\numConfigsExplored}{2000\xspace}
```
```{=latex}
\newcommand{\vidurGPUHours}{42K GPU hours\xspace}
```
```{=latex}
\newcommand{\realDeploymentConfigCost}{\$12.25\xspace}
```
```{=latex}
\newcommand{\revatiConfigCost}{\$0.0003\xspace}
```
```{=latex}
\newcommand{\mayaConfigCost}{\$0.025\xspace}
```
```{=latex}
\newcommand{\costReductionFactor}{41,000\myx}
```
```{=latex}
\newcommand{\hOneHundredHourlyRate}{\$4.90/hour\xspace}
```
```{=latex}
\newcommand{\cpuHourlyRate}{\$0.01/hour\xspace}
```
```{=latex}
\newcommand{\simulatorCostRange}{\$7-100\xspace}
```
```{=latex}
\newcommand{\costReductionMultiplier}{WWW\myx}
```
```{=latex}
\newcommand{\totalLOC}{8,500\xspace}
```
```{=latex}
\newcommand{\coordinatorLOC}{2,100\xspace}
```
```{=latex}
\newcommand{\emulationLibraryLOC}{3,800\xspace}
```
```{=latex}
\newcommand{\pythonBindingsLOC}{1,200\xspace}
```
```{=latex}
\newcommand{\profilingToolsLOC}{900\xspace}
```
```{=latex}
\newcommand{\workloadGenLOC}{500\xspace}
```
```{=latex}
\newcommand{\vllmIntegrationLOC}{39\xspace}
```
```{=latex}
\newcommand{\sglangIntegrationLOC}{52\xspace}
```
```{=latex}
\newcommand{\sarathiIntegrationLOC}{39\xspace}
```
```{=latex}
\newcommand{\integrationLOCRange}{39-52\xspace}
```
```{=latex}
\newcommand{\newFrameworkIntegrationLOC}{30-50\xspace}
```
```{=latex}
\newcommand{\numCUDAFunctions}{87\xspace}
```
```{=latex}
\newcommand{\numCuBLASFunctions}{23\xspace}
```
```{=latex}
\newcommand{\numNCCLFunctions}{15\xspace}
```
```{=latex}
\newcommand{\apiCoveragePercent}{95\%\xspace}
```
```{=latex}
\newcommand{\sharegptPromptTokens}{161\xspace}
```
```{=latex}
\newcommand{\sharegptOutputTokens}{127\xspace}
```
```{=latex}
\newcommand{\arxivPromptTokens}{2048\xspace}
```
```{=latex}
\newcommand{\arxivOutputTokens}{256\xspace}
```
```{=latex}
\newcommand{\codingPromptTokens}{512\xspace}
```
```{=latex}
\newcommand{\codingOutputTokens}{384\xspace}
```
```{=latex}
\newcommand{\longContextPromptTokens}{8192\xspace}
```
```{=latex}
\newcommand{\longContextOutputTokens}{512\xspace}
```
```{=latex}
\newcommand{\minRequestRate}{1 req/s\xspace}
```
```{=latex}
\newcommand{\maxRequestRate}{50 req/s\xspace}
```
```{=latex}
\newcommand{\calibrationTime}{2 hours\xspace}
```
```{=latex}
\newcommand{\calibrationSamples}{5000\xspace}
```
```{=latex}
\newcommand{\modelFittingTime}{10 minutes\xspace}
```
```{=latex}
\newcommand{\validationTime}{1 hour\xspace}
```
```{=latex}
\newcommand{\totalCalibrationTime}{~3 hours\xspace}
```
```{=latex}
\newcommand{\gemmPredictionTime}{50$\mu$s\xspace}
```
```{=latex}
\newcommand{\gemmActualTimeMin}{2-5ms\xspace}
```
```{=latex}
\newcommand{\configEvalTimeGPU}{2.5 H100 hours\xspace}
```
```{=latex}
\newcommand{\configEvalTimeCPU}{1.5 CPU minutes\xspace}
```
```{=latex}
\newcommand{\numHOneHundreds}{4\xspace}
```
```{=latex}
\newcommand{\numAOneHundreds}{8\xspace}
```
```{=latex}
\newcommand{\hOneHundredMemory}{80GB\xspace}
```
```{=latex}
\newcommand{\aOneHundredMemory}{80GB\xspace}
```
```{=latex}
\newcommand{\emulationCPUCores}{128 cores\xspace}
```
```{=latex}
\newcommand{\emulationRAM}{512GB\xspace}
```
```{=latex}
\newcommand{\workloadConfigVariation}{2\myx}
```
```{=latex}
\affil{Georgia Institute of Technology}
```
```{=latex}
\maketitle
```
```{=latex}
\blfootnote{$^*$$^\dagger$ Denote equal contribution levels. Revati is a mythological character who features in one of the earliest fictional tales of time dilation.}
```
# Introduction

LLM inference increasingly dominates AI costs. To manage these costs, operators need to tune a vast set of configuration knobs, including parallelization strategies, caching policies, and scheduling algorithms [@vidur; @maya]. These knobs must be retuned for each model, workload, and hardware platform as they evolve over time. Finding the optimal deployment configuration through trial-and-error end-to-end evaluation is prohibitively slow and expensive. Evaluating even a single configuration on a GPU cluster can take several hours and cost thousands of dollars in compute [@vidur]. This bottlenecks both production deployment and systems research.

![ Discrete-event simulators must re-implement the entire control-flow and scheduling logic of serving systems; as a result, the rapid advancements in LLM inference render these simulators perpetually outdated. `\sysname{}`{=latex} ***entirely eliminates*** this problem by ***directly*** running the serving systems in an emulated environment.](figures-graphics/mini_hld_revati.png){width="0.75\\columnwidth"}

`\vheading{Limitations of Current Approaches}`{=latex} To address this challenge, the community has developed discrete-event simulators [@vidur; @llmservingsim; @apex] that can cheaply model serving performance. Simulators offer compelling benefits: they run orders of magnitude faster than real execution and are easy to modify for prototyping new policies and optimizations. However, they require re-implementing all the core system components, including the scheduler, memory allocator, and request dispatcher. This approach worked when systems were simple: modeling vLLM's original continuous batching scheduler took fewer than 150 lines in `\vidur`{=latex} [@vidur]. However, today's LLM engines include complex features like prefill-decode disaggregation [@splitwise; @distserve; @mooncake; @nvidia-dynamo; @cheng2025lmcache] and distributed prefix caching [@nvidia-dynamo; @cheng2025lmcache] with implementations spanning tens of thousands of lines of code. Moreover, with thousands of commits per year from hundreds of contributors, inference serving systems evolve rapidly, causing simulators to perpetually lag behind under a mounting maintenance burden.

`\vheading{Key Insight}`{=latex} Serving systems spend most of their execution time waiting for GPU computation, *not* making control decisions. The CPU control plane (scheduling, batching, memory management) executes in microseconds, while GPU operations take tens of milliseconds, accounting for more than 90% of execution time. Moreover, the control flow logic is largely decoupled from the output of GPU computation. This separation enables a radical alternative: run the *real* serving system while skipping GPU waits through virtual time advancement. We call this ***Time-Warp Emulation***.

`\vheading{Our Approach}`{=latex} `\sysname `{=latex}executes actual serving frameworks like vLLM or SGLang with GPU operations replaced by virtual time jumps. When a worker prepares to execute a batch, it asks: "How long would this take on an H100?" Using the predicted duration, it requests a ***time jump*** from a central `\timekeeper `{=latex}that coordinates virtual time across all processes. The `\timekeeper `{=latex}ensures all processes advance together while preserving causality: if worker A needs 15 ms and worker B needs 25 ms, the system advances by 15 ms first. A CUDA emulation layer transparently intercepts GPU calls, presenting virtual devices to the serving framework while bypassing actual execution. This allows us to directly execute the serving system's code in accelerated virtual time *without the need for actual GPUs*.

By running an unchanged control plane, `\sysname `{=latex}automatically captures complex system behaviors without requiring manual modeling. Features such as adaptive batching, multi-tiered prefix cache management, and PD disaggregation work out of the box. `\sysname `{=latex}imposes minimal maintenance overhead: onboarding a serving system requires fewer than 25 changed lines. This combination of speed, cost-effectiveness, and ease of adoption enables rapid exploration of optimal deployment configurations and accelerates both the development and adoption of state-of-the-art LLM inference techniques.

<figure>
<p><img src="figures-graphics/time_progression_revati.png" /> </p>
</figure>

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{2pt}
```
```{=latex}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{lccccccc}
\toprule
& \textbf{\sysname} & \multicolumn{6}{c}{\textbf{Discrete-Event Simulators}} \\
\cmidrule(lr){3-8}
&  & VD & LS1 & AP & FT & TS & LS2 \\
\midrule
\rowcolor{gray!15} \textit{System Properties} \\
Direct System Emulation & \greencheck & \redcross & \redcross & \redcross & \redcross & \redcross & \redcross \\
Serving System Agnostic & \greencheck & \redcross & \redcross & \redcross & \redcross & \redcross & \redcross \\
Minimal Maintenance Overhead & \greencheck & \redcross & \redcross & \redcross & \redcross & \redcross & \redcross \\
\midrule
\rowcolor{gray!15} \textit{Modeling Domain} \\
Continuous Batching & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck \\
Chunked Prefill & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck & \greencheck \\
Prefix Caching & \greencheck & \redcross & \redcross & \redcross & \redcross & \greencheck & \greencheck \\
Hierarchical Caching & \greencheck & \redcross & \redcross & \redcross & \redcross & \redcross & \greencheck \\
PD Disaggregation & \greencheck & \redcross & \redcross & \redcross & \greencheck & \greencheck & \greencheck \\
DP Attention & \greencheck & \redcross & \redcross & \redcross & \greencheck & \redcross & \greencheck \\
MoE / Expert Parallel & \greencheck & \redcross & \redcross & \greencheck & \greencheck & \redcross & \greencheck \\
\bottomrule
\end{tabular}%
}
```
`\noindent `{=latex}This paper makes the following contributions:

-   We present `\sysname{}`{=latex}, a time-warp emulator that models serving system performance at simulation-like speed without re-implementing system logic.

-   We develop a virtual time protocol that coordinates time jumps across distributed processes while preserving causality.

-   We evaluate `\sysname{}`{=latex} on vLLM and SGLang, achieving $<$`<!-- -->`{=html}5% prediction error while running 5--17$\times$ faster than real GPU execution.

# Background & Motivation {#sec:background}

Performance modeling is essential for the efficient deployment of LLM serving systems because real-world experimentation on GPU clusters is prohibitively expensive. However, existing simulator-based approaches struggle to maintain fidelity as serving systems evolve due to lagging feature parity. This section characterizes the configuration optimization challenge and illustrates why discrete-event simulation fails to meet the needs of modern serving infrastructure.

## Configuration Optimization Problem

Modern LLM serving systems expose vast configuration spaces. Consider deploying Qwen3-235B-A22B [@qwen3]: operators must select the tensor parallelism (TP) degree (1-8), number of pipeline parallelism (PP) stages (1-8), expert parallelism (EP) degree (1-8), maximum batch size (1-256), chunked prefill size (256-8192 tokens), KV cache eviction policy (LRU, LFU, cost-aware), routing policy (round-robin, sticky, cache-aware), and disaggregation strategy (co-located vs. separate prefill/decode) [@distserve; @sarathi2023; @mooncake] --- yielding a large configuration space. Prior research has shown that tuning configuration choices in accordance with model and workload characteristics can improve throughput by 3-5$\times$ [@vidur; @distserve]. For instance, Mitra et al. [@mitra2025beyond] show that serving Llama-70B with PD disaggregation [@splitwise; @distserve] can provide 1.8$\times$ throughput improvement over co-location [@taming] for RAG-like, prefill-heavy workloads (16K input tokens, 2K output tokens). Yet for decode-heavy workloads (2K input, 8K output), they find that disaggregation provides minimal benefit ($<$`<!-- -->`{=html}10% throughput gain) and can even reduce performance due to KV cache transfer overhead and dynamic shift in traffic patterns.

To deploy these systems effectively, operators evaluate these configurations on representative workloads and pick those that satisfy their latency requirements while providing maximal throughput. Evaluating each configuration requires around 2-4 hours of profiling to collect statistically significant tail latency characteristics across request load levels [@vidur]. For a 64-GPU cluster at current cloud pricing (\$2.50-\$7 per GPU-hour for an H100), evaluating just 100 configurations takes 12,800-25,600 GPU-hours, costing \$32,000-\$179,200. These factors make broader exploration of the design space economically and practically infeasible, forcing practitioners to resort to *rule-of-thumb* heuristics.
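
The GPU-hour and cost ranges above follow from simple multiplication. The short Python sketch below reproduces them; the helper name `sweep_cost` is ours and purely illustrative.

```python
def sweep_cost(num_gpus, hours_per_config, num_configs, usd_per_gpu_hour):
    """Return (gpu_hours, cost_usd) for evaluating every configuration on real GPUs."""
    gpu_hours = num_gpus * hours_per_config * num_configs
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# Lower bound: 2 hours per configuration at 2.50 USD per GPU-hour.
low_hours, low_cost = sweep_cost(64, 2, 100, 2.50)
# Upper bound: 4 hours per configuration at 7.00 USD per GPU-hour.
high_hours, high_cost = sweep_cost(64, 4, 100, 7.00)

print(f"{low_hours:,}-{high_hours:,} GPU-hours")   # 12,800-25,600 GPU-hours
print(f"{low_cost:,.0f}-{high_cost:,.0f} USD")     # 32,000-179,200 USD
```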

## Discrete-Event Simulators {#sec:background:sims}

To reduce the cost of evaluating a configuration, researchers have developed Discrete-Event Simulators (DES) [@vidur; @llmservingsim; @apex; @frontier; @llmservingsim2; @tokensim]. These simulators provide a cheap and fast way to model the performance of a serving system without the need for expensive hardware. They operate by manually re-implementing the serving system's control logic -- schedulers, memory allocators, and request routers -- within a simplified simulation framework. Whenever the re-implemented scheduler forms a batch, a performance model predicts its GPU execution time. The simulator advances its clock by this predicted duration and processes the next event. By eliminating physical GPU execution, simulators achieve speedups of up to an order of magnitude over real-time execution without requiring hardware.
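
To make this structure concrete, the following is a minimal sketch of a discrete-event simulator, not the implementation of any cited tool. The batching policy (batch everything that is waiting whenever the GPU is free) and the linear cost model in `predict_batch_ms` are assumptions chosen for brevity. Note that the scheduler lives inside the simulator's event loop: this is the logic that has to be re-implemented and kept in sync with the real system.

```python
import heapq
import itertools

def predict_batch_ms(total_tokens):
    """Toy performance model (assumed): 5 ms overhead plus 0.05 ms per token."""
    return 5.0 + total_tokens / 20.0

def simulate(arrivals):
    """arrivals: list of (arrival_ms, num_tokens). Returns {request_id: finish_ms}."""
    seq = itertools.count()  # tie-breaker so heap entries never compare payloads
    events = [(t, next(seq), "arrive", rid) for rid, (t, _) in enumerate(arrivals)]
    heapq.heapify(events)
    waiting, finish, gpu_busy = [], {}, False
    while events:
        clock, _, kind, payload = heapq.heappop(events)  # jump straight to next event
        if kind == "arrive":
            waiting.append(payload)
        else:  # "batch_done": payload is the list of request ids in the batch
            for rid in payload:
                finish[rid] = clock
            gpu_busy = False
        # Re-implemented scheduler: batch every waiting request once the GPU is free.
        if not gpu_busy and waiting:
            batch, waiting = waiting, []
            tokens = sum(arrivals[rid][1] for rid in batch)
            done_at = clock + predict_batch_ms(tokens)
            heapq.heappush(events, (done_at, next(seq), "batch_done", batch))
            gpu_busy = True
    return finish

result = simulate([(0.0, 1000), (10.0, 1000), (20.0, 1000)])
print(result)  # {0: 55.0, 1: 160.0, 2: 160.0}
```

The second and third requests arrive while the first batch is in flight, so the toy scheduler groups them into one batch that starts at 55 ms.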

## Limitations of Simulators {#sec:des_limitations}

```{=latex}
\vheading{Semantic Gap}
```
The re-implementation effort inevitably introduces subtle differences between the simulator and actual system behavior. Early systems were tractable to model. For instance, Vidur modeled the original vLLM scheduler in $\sim$`<!-- -->`{=html}150 lines of code. Today, production control planes exceed tens of thousands of lines of code [@vLLM:github; @sglang:github] with complex logic for implementing features like hierarchical prefix caching and disaggregated execution. Capturing the nuanced performance characteristics of these features requires an unwieldy amount of engineering effort. Any deviation in the simulator's logic from the ground truth leads to critical inaccuracies in performance prediction.

```{=latex}
\vheading{Framework Specificity}
```
Simulators also struggle with generality because serving frameworks often make fundamentally different implementation choices for the same feature. For instance, while both vLLM and SGLang implement the batching policy described in Sarathi-Serve [@taming] at a high level, their implementations differ critically. vLLM adopts both chunked prefills and mixed batching (combining prefill and decode requests in a single batch), whereas SGLang by default does not perform mixed batching and instead uses a prefill-prioritizing policy. This difference leads to divergent latency characteristics that a generic implementation would fail to capture. We observe such differences in other features too, such as prefix caching and disaggregated execution. For example, hierarchical caching in vLLM (using LMCache [@lmcache:github]) uses a write-through policy, where every cache element written to GPU is also immediately written to lower tiers (CPU, NVMe, etc.). SGLang, on the other hand, adopts a write-through-selective policy that asynchronously copies data only on the first cache hit.

To model this landscape accurately, researchers are forced to build and maintain separate, specialized simulators for each major framework: Vidur [@vidur] is modeled after Sarathi-Serve [@taming], and LLMServingSim2.0 [@llmservingsim2] is modeled after vLLM. This approach both limits flexibility and increases engineering effort required to construct these simulators.

```{=latex}
\vheading{Maintenance Burden}
```
The engineering overhead of building a DES is exacerbated by the breakneck pace of LLM server evolution. vLLM alone has received over 2,800 commits from hundreds of contributors in the first half of 2025 [@vLLM:github]. Each commit has the potential to invalidate the modeling assumptions of a simulator. As a result, simulators like Vidur and LLMServingSim have historically fallen behind shortly after release. There have been attempts to bridge this gap --- Frontier [@frontier] and LLMServingSim2.0 [@llmservingsim2] have extended the past generation of simulators with modern features like prefix caching and PD disaggregation. However, even these efforts are destined to face the inevitable maintainability challenge as the field progresses.\
\
***Takeaway:** Existing discrete-event simulation approaches face a perpetual maintenance burden. We need a fundamentally new runtime modeling approach that does not require manual re-implementation of the serving engine logic.*

# Time-Accelerated Emulation {#sec:time_accel_emulation}

To address the limitations of discrete-event simulation, we propose a new modeling paradigm: *Time-Accelerated Emulation*. Instead of re-implementing the serving system's logic, this approach allows us to *directly* execute the unmodified LLM engine, providing high-fidelity performance modeling while retaining the speed and cost-effectiveness of simulation. This section outlines the key building blocks that enable this approach.

## Transparent Device Virtualization {#sec:emulation:virtualization}

Recent work on performance modeling for deep learning training, Maya [@maya], has demonstrated the viability of transparent device emulation. By intercepting CUDA API calls via `LD_PRELOAD`, it is possible to create "virtual devices" that model the runtime of a workload without the need for physical GPUs. Maya achieves this with a two-step pipeline. First, the workload is executed on the virtual device, producing a trace of all the CUDA API invocations it makes. A simulator then replays the trace, predicting the runtime of each operation before finally outputting an end-to-end runtime estimate.

Applying this technique to inference serving promises to allow direct execution of unmodified serving system code -- fundamentally eliminating the primary limitations of discrete-event simulation, providing a path toward robust and maintainable runtime modeling for modern LLM serving systems.

## Temporal Semantics in Online Serving

Unfortunately, unlike training workloads which are typically offline, inference is an *online* process where request arrival timing fundamentally alters control flow. Ignoring GPU execution time breaks these temporal semantics, leading to incorrect batching decisions.

For instance, suppose we have two requests, A and B, that both require 500 ms for prefill. Request A arrives at $t=0$, and Request B arrives at $t=200$ ms. In a real system, the scheduler would begin processing A; when B arrives 200 ms later, the GPU is still busy with A. B would be added to the queue, and the scheduler would eventually schedule B once prefill for request A is completed -- perhaps batching B's prefill with A's subsequent decode steps. However, if we simply trace GPU calls without accounting for GPU execution time (as in Maya's offline emulation approach), Request A completes instantly at $t \approx 0$. When Request B arrives at $t=200$ ms, the system perceives an idle GPU and an empty queue. The temporal overlap is lost, the batching pattern changes completely, and the modeling diverges from reality.
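
The two-request scenario can be written out as a small calculation. The sketch below is only an illustration of the argument; the function name and its parameters are hypothetical, and it reports what the scheduler would observe when Request B arrives, with and without GPU execution time being accounted for.

```python
def request_b_view(prefill_ms, arrival_a_ms, arrival_b_ms, account_gpu_time):
    """Return (gpu_busy_on_arrival, b_start_ms) as seen by the scheduler."""
    # If GPU time is ignored, A's prefill appears to finish the moment it starts.
    observed_prefill_ms = prefill_ms if account_gpu_time else 0.0
    a_done_ms = arrival_a_ms + observed_prefill_ms
    gpu_busy = arrival_b_ms < a_done_ms
    b_start_ms = max(arrival_b_ms, a_done_ms)
    return gpu_busy, b_start_ms

real = request_b_view(500.0, 0.0, 200.0, account_gpu_time=True)
traced = request_b_view(500.0, 0.0, 200.0, account_gpu_time=False)
print(real)    # (True, 500.0): B queues behind A for 300 ms
print(traced)  # (False, 200.0): B sees an idle GPU and starts immediately
```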

```{=latex}
\vheading{A Strawman Approach}
```
To preserve temporal semantics, the emulator must force the control plane to experience the passage of time. A naive but correct approach is to mimic GPU latency by *sleeping* for the predicted duration of the kernel. If a batch is predicted to take 500 ms, the emulator blocks the CPU thread for 500 ms. This restores fidelity: during the 500 ms sleep for Request A, the wall clock advances. Request B arrives at $t=200$ ms as scheduled, but now finds the worker "busy" (sleeping). The scheduler correctly queues B, preserving the system's state dynamics and resource contention exactly as they would appear on real hardware.

While sleep-based emulation guarantees correctness, it is prohibitively slow. Evaluating a 2-hour trace would require 2 hours of real time. For configuration searches requiring thousands of evaluations, this real-time constraint renders the approach impractical. We refer to this strawman approach as *Sleep-Based Emulation*.
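
A minimal sketch of the strawman, assuming the predicted batch durations are already available as a list: each GPU batch is replaced by a blocking sleep of the same length. The function name and the injectable `sleep` parameter are ours, added so that the example can report the blocked time without actually waiting.

```python
import time

def run_sleep_based(batch_predictions_ms, sleep=time.sleep):
    """Emulate a sequence of GPU batches by blocking for each predicted duration.

    Returns the total wall-clock time, in seconds, spent blocked.
    """
    blocked_s = 0.0
    for predicted_ms in batch_predictions_ms:
        seconds = predicted_ms / 1000.0
        sleep(seconds)  # the worker looks busy to the scheduler for this interval
        blocked_s += seconds
    return blocked_s

# A no-op sleep is injected here so the example finishes instantly while still
# reporting how long the default time.sleep would have blocked.
blocked = run_sleep_based([500.0, 500.0], sleep=lambda seconds: None)
print(blocked)  # 1.0
```

With the default `time.sleep`, the emulation takes as long as the batches themselves, which is exactly the real-time constraint described above.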

<figure>
<p><img src="figures-graphics/revati_e2e_mot_stacked_comparison.png" />  </p>
</figure>

## Key Insight: Time Acceleration {#sec:emulation:enabling-insight}

The strawman emulation approach pays wall-clock cost identical to full GPU execution. However, a closer look at the design of inference engines reveals an opportunity: these systems are engineered to maximize GPU utilization while keeping CPU overhead minimal. `\autoref{fig:cpu-gpu-breakdown}`{=latex} quantifies this observation. We process 100 requests in offline mode (all requests available at start) and measure execution time under two conditions -- normal GPU execution and GPU-skip mode where all GPU execution is bypassed via emulation. The CPU time represents the critical-path control-plane overhead, while GPU time captures actual kernel execution. Across both vLLM and SGLang, GPU computation accounts for 90-95% of total execution time, with CPU overhead remaining minimal regardless of model size or serving framework. Consequently, the system spends the vast majority of wall-clock time simply waiting for GPU computations to complete.

Crucially, the control plane's logic also does not depend on the *values* computed by the GPU. Schedulers operate on logical abstractions (requests, batches, memory blocks), while GPU workers perform physical computation without making control flow decisions[^1]. These properties naturally open a path to a critical optimization: instead of sleeping during the 85-95% of time the system is idle, we can *jump* virtual time to accelerate emulation.
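
A back-of-the-envelope bound shows how much acceleration this insight can buy. The sketch below assumes, for illustration only, that every GPU wait is skipped at zero cost and that control-plane CPU time is unchanged; under those assumptions the speedup over real execution is at most total time divided by CPU time. The function name is hypothetical.

```python
def max_speedup(cpu_ms, gpu_ms):
    """Upper bound on speedup when all GPU wait time is skipped and CPU time is kept."""
    if cpu_ms <= 0:
        raise ValueError("cpu_ms must be positive")
    return (cpu_ms + gpu_ms) / cpu_ms

# GPU share of 90% and 95% of execution time, the range measured above.
print(max_speedup(10.0, 90.0))  # 10.0
print(max_speedup(5.0, 95.0))   # 20.0
```

The 5-17$\times$ speedup reported in the abstract is consistent with this ceiling of 10-20$\times$.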

```{=latex}
\vheading{Tackling Causality}
```
While the concept of time-jumping is straightforward in principle, implementing it in a distributed serving system introduces a fundamental challenge with respect to causality. In a serving system, multiple independent processes (the batch scheduler, workers, and the request load generator) operate concurrently. Each process has its own view of the next event. For example, a worker may be ready to *jump* 50 ms to simulate batch execution, while the load generator is scheduled to inject a new request in 10 ms.

If the worker unilaterally advances its local clock by 50 ms, it would effectively "step over" the arrival of the new request. In the virtual timeline, the request arrives *after* the batch completes, whereas in reality, it would have arrived *during* execution, potentially altering the scheduler's decision for the subsequent batch. This violation of causality leads to incorrect queuing behavior and invalid latency measurements.
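
The effect can be illustrated with the numbers from this example. The sketch below is a deliberately simplified model with a hypothetical function name: it computes the virtual time at which the worker's clock first reflects the new arrival, depending on whether the worker jumps unilaterally or stops at the earliest pending event in the system.

```python
def arrival_seen_at_ms(batch_ms, arrival_ms, coordinated):
    """Virtual time at which the worker's clock first reflects a new arrival.

    Both durations are measured from the current virtual time, taken as 0.
    """
    if coordinated:
        # Stop at the earliest pending event of any process.
        worker_clock = min(batch_ms, arrival_ms)
    else:
        # Unilateral jump: the worker skips its whole batch in one step.
        worker_clock = batch_ms
    return max(worker_clock, arrival_ms)

unilateral = arrival_seen_at_ms(50.0, 10.0, coordinated=False)
coordinated = arrival_seen_at_ms(50.0, 10.0, coordinated=True)
print(unilateral)   # 50.0: the request appears only after the batch completes
print(coordinated)  # 10.0: the request appears during the batch, as in reality
```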

`\vheading{Time Acceleration in the Literature}`{=latex} This is a classic problem, widely studied in the distributed event simulator literature. Optimistic approaches [@virtualtime; @timewarp] allow different processes to advance their clocks independently. When a causality violation is detected, the process state is rolled back to the last consistent checkpoint. However, this approach is infeasible for emulation because serving systems cannot be rolled back. In contrast, conservative approaches [@distributedsim; @asyncdistributedsim; @packetcommunication] are more amenable to emulation. These approaches require every process in the system to agree on a common lookahead time within which execution can safely be advanced without violating causality.

While several implementations of conservative time acceleration have been discussed in the literature for discrete-event simulation [@packetcommunication; @distributedsim; @asyncdistributedsim], none are directly applicable to emulation. In an event-driven simulator, the event loop has direct control over the causality of events. In contrast, in emulation systems, all the distributed components operate in real time, and we need to *transparently* accelerate time without introducing any additional overhead. For instance, the Chandy-Misra-Bryant [@asyncdistributedsim; @packetcommunication] algorithm requires each simulation worker to send a null message to each peer with the next lookahead time. The simulator needs to be able to control the event scheduling to implement this approach. In this paper, we propose an approach to enable time acceleration without violating causality in emulation systems.

<figure>
<p><img src="figures-graphics/process_model.png" />  </p>
</figure>

# `\sysname`{=latex}: System Design {#sec:design}

`\sysname`{=latex} bridges the gap between parallel distributed event-driven simulation and emulation approaches through *Time-Warp Emulation*. Realizing this paradigm requires resolving the tension between the asynchronous nature of distributed serving systems and the strict causality required for accurate modeling. The rest of this section details the design of `\sysname`{=latex} and how it addresses challenges related to time virtualization.

## Overview {#sec:design:overview}

`\vheading{Process Model}`{=latex} Running a performance evaluation for an LLM serving system involves two main pieces: a benchmark runner, which submits requests, and the LLM inference engine, which processes these requests (`\Cref{fig:process-model}`{=latex}). These components map to two temporal roles in `\sysname{}`{=latex}'s execution model. *`\actors`{=latex}* have deterministic plans and know when their future actions will complete. For instance, the benchmark runner's request dispatcher knows when to dispatch each new request (e.g., at t=100ms, t=250ms), and the inference engine's GPU workers know when batch execution will finish. These processes actively drive virtual time forward. *`\observers`{=latex}* are purely reactive and respond to events as they occur. The benchmark runner's output processor and the inference engine's request scheduler are two such examples.

`\vheading{System Architecture}`{=latex} `\sysname{}`{=latex} comprises two main components: the **Timekeeper**, which synchronizes virtual time across distributed components and manages time jumps to accelerate execution, and the **`\mbox{Device Emulation Layer}`{=latex}**, which transparently intercepts and models GPU operations without requiring physical hardware.

Integrating these pieces with `\sysname{}`{=latex} involves patching every `\actor `{=latex}in the system to make time jumps when possible. The benchmark runner must be modified to make time jumps between request dispatches. Similarly, the GPU workers of the inference engine need to jump over GPU batch executions. Combined, these take up about 50 lines of code for vLLM and SGLang. The Device Emulation Layer is loaded via `LD_PRELOAD` to intercept any device management calls separate from GPU kernel dispatches. When execution begins, `\actors `{=latex}submit concurrent time jump requests to the `\timekeeper`{=latex}: the benchmark runner requests a jump to the next request arrival, and GPU workers request jumps over their batch execution durations. The Timekeeper computes the minimum target across all `\actors`{=latex}, advances the global virtual clock to that point, and broadcasts the update. `\actors `{=latex}whose targets have been reached return immediately. `\actors `{=latex}with further targets remain blocked and resubmit in subsequent barrier rounds. Observers query virtual time asynchronously to timestamp events without participating in this coordination.

## `\timekeeper`{=latex} {#sec:design:timekeeper}

The `\timekeeper `{=latex}is a service that manages virtual time across connected clients. It exposes a [TimeJump($\Delta t$)]{.smallcaps} API that allows clients to request time jumps and receive clock updates. We classify clients of the `\timekeeper `{=latex}as either `\actors `{=latex}or `\observers`{=latex}. `\actors `{=latex}are active drivers of the simulation that process operations with predictable durations and must actively request virtual time advancement. `\observers `{=latex}are reactive components that consume virtual time to timestamp events but do not block its progression. This distinction allows `\sysname`{=latex} to minimize coordination overhead by requiring synchronization only from `\actors`{=latex}, while enabling integration with serving frameworks through minimal code changes.

<figure>
<p><img src="figures-graphics/multi_chunk_jump.png" />  </p>
</figure>

To ensure scalability, the architecture separates the *request path* from the *update path*. `\actors`{=latex} submit time jump requests individually to the `\timekeeper `{=latex}via a reliable channel (fan-in). The `\timekeeper `{=latex}disseminates clock updates to all clients simultaneously via a broadcast channel (fan-out). This asymmetry allows the `\timekeeper `{=latex}to update hundreds of distributed workers with constant serialization cost per round, preventing bottlenecks during high-frequency barrier resolutions.

### Virtual Time Protocol {#sec:design:timekeeper:time-protocol}

Coordinating virtual time across distributed processes requires a protocol that preserves causality without controlling process execution. This section presents `\sysname`{=latex}'s barrier-based protocol and analyzes its correctness properties.

```{=latex}
\vheading{Design Constraints}
```
Adapting virtual time to real-time emulation introduces three constraints absent in traditional discrete-event simulators:

-   ***No rollback.*** Optimistic approaches like Time Warp [@virtualtime; @timewarp] speculatively advance time and roll back on causality violations. Since we execute real system code with irreversible side effects (network messages sent, data structures modified), this approach is infeasible.

-   ***No event scheduling control.*** Conservative protocols like Chandy-Misra-Bryant [@asyncdistributedsim] assume the simulator controls event processing order. Our processes run asynchronously in wall-clock time; the protocol must coordinate time advancement without controlling execution.

-   ***Graceful degradation.*** When coordination fails (stragglers, network delays), the system must remain correct, potentially sacrificing speed but never producing invalid results.

```{=latex}
\begin{algorithm}[t]\caption{Client: TimeJump($\Delta t$)}
\label{alg:client-time-jump}
\begin{algorithmic}[1]
\Require $\Delta t > 0$: virtual time to advance (ms)
\State $t_{target} \gets \textsc{GetVirtualTime}() + \Delta t$
\Statex $\triangleright$ \textit{Compute absolute target once}
\While{$\textsc{GetVirtualTime}() < t_{target}$}
    \State \textsc{SendTimeJumpRequest}$(t_{target})$
    \Statex \quad $\triangleright$ \textit{Request advance to target}
    \State \textsc{WaitForAck}()
    \State $t_{remaining} \gets t_{target} - \textsc{GetVirtualTime}()$
    \If{$t_{remaining} > 0$}
        \State \textsc{WaitForClockUpdate}$(t_{remaining})$
        \Statex \qquad $\triangleright$ \textit{Block until broadcast or timeout}
    \EndIf
\EndWhile
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\vheading{Virtual Time Representation}
```
`\sysname`{=latex} maintains a global virtual clock as an offset from wall-clock time: $$t_{virtual} = t_{wall} + \mathit{offset}$$ Initially $\mathit{offset} = 0$, so virtual time equals wall time. As `\actors`{=latex} request time jumps, the `\timekeeper `{=latex}increases the offset, causing virtual time to advance faster than wall time. This representation allows `\observers`{=latex} to query virtual time without coordination---they simply read the current offset and add wall time.
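
The offset representation can be illustrated with a minimal Python sketch. The names `VirtualClock`, `now_ms`, and `apply_update` are hypothetical and are not part of `\sysname`{=latex}'s API; the injectable wall clock is an assumption added only so the sketch can be exercised deterministically.

```python
import time


class VirtualClock:
    """Virtual clock stored as an offset from wall-clock time (all values in ms)."""

    def __init__(self, wall_clock_ms=None):
        # Hypothetical hook: allow a fake wall clock for deterministic testing.
        self._wall_clock_ms = wall_clock_ms or (lambda: time.monotonic() * 1000.0)
        self.offset_ms = 0.0  # initially, virtual time equals wall time

    def now_ms(self):
        # t_virtual = t_wall + offset; observers can call this without coordination.
        return self._wall_clock_ms() + self.offset_ms

    def apply_update(self, new_offset_ms):
        # Offsets only grow, so virtual time never moves backwards.
        self.offset_ms = max(self.offset_ms, new_offset_ms)
```

Between updates the virtual clock ticks at wall-clock rate, so an observer that reads it needs only the latest offset.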

```{=latex}
\vheading{Barrier-Based Coordination}
```
When an `\actor`{=latex} calls [TimeJump]{.smallcaps}($\Delta t$), it computes its target virtual time and sends a request to the `\timekeeper`{=latex}. The `\timekeeper `{=latex}collects requests from all `\actors`{=latex}, then advances virtual time to the *minimum* requested target. This minimum-advancement rule ensures no `\actor`{=latex} jumps past its intended time, preserving causality. For example, if one worker needs 50ms to process its batch while another needs 10ms, advancing each independently would break causality: an `\actor`{=latex} at virtual time $t=1000$ could observe events from an `\actor`{=latex} still at $t=500$. `\autoref{alg:server-barrier}`{=latex} shows the server-side protocol.

A single [TimeJump]{.smallcaps} call may span multiple barrier rounds. Consider two workers: $W_A$ calls [TimeJump]{.smallcaps}(50), $W_B$ calls [TimeJump]{.smallcaps}(10). The `\timekeeper `{=latex}computes $t_{min}$ from $W_B$'s target and advances by 10ms. $W_B$'s call returns. $W_A$ remains in its loop---virtual time has advanced but not to $W_A$'s target---and re-requests in the next barrier round. `\Cref{alg:client-time-jump}`{=latex} shows the client-side protocol. The client computes its target time once (line 1), then loops until virtual time reaches this target. Each iteration sends the target to the `\timekeeper`{=latex}, receives acknowledgment, and waits for a clock update broadcast with a timeout equal to the remaining virtual time needed.

```{=latex}
\begin{algorithm}[t]\caption{Server: ProcessTimeJumpRequests()}
\label{alg:server-barrier}
\begin{algorithmic}[1]
\Require $numActors$: number of registered actors
\State $pending \gets \emptyset$, $offset \gets 0$
\Statex $\triangleright$ \textit{Initialize pending requests and virtual offset}
\While{\textbf{true}}
    \State $(c, t_{target}) \gets \textsc{ReceiveRequest}()$
    \State $pending[c] \gets t_{target}$
    \Statex \quad $\triangleright$ \textit{Store request for client $c$}
    \If{$|pending| = numActors$}
        \Statex \qquad $\triangleright$ \textit{All actors at barrier---advance virtual time}
        \State $t_{min} \gets \min\{t : t \in pending\}$
        \Statex \qquad $\triangleright$ \textit{Minimum target preserves causality}
        \If{$\textsc{GetWallTime}() < t_{min}$}
            \State $t_{wall} \gets \textsc{GetWallTime}()$
            \State $offset \gets \max(offset, t_{min} - t_{wall})$
            \State \textsc{BroadcastClockUpdate}$(offset)$
            \Statex \qquad\quad $\triangleright$ \textit{Notify all clients}
        \EndIf
        \State $pending \gets \emptyset$
        \Statex \qquad $\triangleright$ \textit{Reset for next round}
    \EndIf
\EndWhile
\end{algorithmic}
\end{algorithm}
```
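
The two-worker example above can be replayed with a small sketch of the round-resolution step in `\Cref{alg:server-barrier}`{=latex}. It is a simplification under stated assumptions: wall time is frozen, each actor issues a single jump, and an actor leaves the barrier once its call returns (in a live system it would issue its next jump instead). The function names are illustrative.

```python
def resolve_barrier_round(pending, offset_ms, wall_now_ms):
    """Resolve one round once every participating actor has a pending target.

    pending maps actor id -> absolute virtual-time target (ms).
    Returns the new offset; virtual time is wall time plus offset.
    """
    t_min = min(pending.values())  # minimum target preserves causality
    if wall_now_ms < t_min:
        offset_ms = max(offset_ms, t_min - wall_now_ms)
    return offset_ms


def simulate(jumps_ms, wall_now_ms=0.0):
    """Replay barrier rounds for actors that each call TimeJump once.

    Returns (actor, virtual time at return) pairs in completion order.
    """
    offset_ms = 0.0
    targets = {actor: wall_now_ms + offset_ms + dt for actor, dt in jumps_ms.items()}
    returned = []
    while targets:
        offset_ms = resolve_barrier_round(targets, offset_ms, wall_now_ms)
        virtual_now = wall_now_ms + offset_ms
        # Actors whose targets have been reached return; the rest re-request.
        for actor in sorted(a for a, t in targets.items() if t <= virtual_now):
            returned.append((actor, virtual_now))
            del targets[actor]
    return returned
```

Under these assumptions, `simulate({"W_A": 50, "W_B": 10})` completes $W_B$ in the first round at virtual time 10 and $W_A$ in the second round at virtual time 50.
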
```{=latex}
\vheading{Graceful Degradation}
```
The timeout in `\Cref{alg:client-time-jump}`{=latex} ensures correctness under failures. If the broadcast never arrives, whether due to network issues or a stalled `\actor`{=latex}, the wait times out after $t_{remaining}$ wall-clock milliseconds. At that point, wall time has advanced by $t_{remaining}$, and since $t_{virtual} = t_{wall} + \mathit{offset}$, virtual time has advanced by at least as much and has reached $t_{target}$. The loop condition no longer holds, and [TimeJump]{.smallcaps} returns.

The operation completes correctly but at wall-clock speed. In the worst case, `\sysname`{=latex} degenerates to sleep-based emulation---slow but never incorrect. This property guarantees `\sysname`{=latex} cannot produce invalid results or deadlock; it can only lose acceleration.
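
The fallback behavior can be sketched in Python as follows. This is not `\sysname`{=latex}'s client library: a condition variable stands in for the broadcast channel, the acknowledgment step is omitted, and `LocalClock`, `time_jump`, and `send_request` are hypothetical names.

```python
import threading
import time


class LocalClock:
    """Client-side view of virtual time; a condition variable models the broadcast."""

    def __init__(self):
        self._cond = threading.Condition()
        self._offset_ms = 0.0

    def now_ms(self):
        with self._cond:
            return time.monotonic() * 1000.0 + self._offset_ms

    def broadcast(self, offset_ms):
        # Called when a clock update arrives from the coordinator.
        with self._cond:
            self._offset_ms = max(self._offset_ms, offset_ms)
            self._cond.notify_all()

    def wait_for_update(self, timeout_ms):
        # Returns on a broadcast or after the timeout, whichever comes first.
        with self._cond:
            self._cond.wait(timeout=timeout_ms / 1000.0)


def time_jump(clock, delta_ms, send_request):
    target = clock.now_ms() + delta_ms  # compute the absolute target once
    while clock.now_ms() < target:
        send_request(target)
        remaining = target - clock.now_ms()
        if remaining > 0:
            # If no update arrives, wall time alone carries us to the target.
            clock.wait_for_update(remaining)
    return target
```

If `send_request` never triggers a broadcast, the call behaves like a sleep of `delta_ms`; if an update arrives, it returns as soon as virtual time reaches the target.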

```{=latex}
\vheading{Handling Message Jitter}
```
Real-time emulation must account for network delays and CPU scheduling jitter. Consider this scenario: the inference server generates a token at wall time $t_w{=}100ms$ and virtual time $t_v{=}3400ms$. The benchmark runner receives it promptly and records $t_v{=}3400ms$. The server then processes the next batch (20ms predicted), updating virtual time to $t_v{=}3420ms$ at $t_w{=}101ms$. Due to network delay, the runner reads the next token at $t_w{=}102.5ms$. Meanwhile, the server completes another batch and updates to $t_v{=}3450ms$ at $t_w{=}102ms$. When the runner queries virtual time at $t_w{=}102.5ms$, it observes 3450ms instead of 3420ms---an inaccurate latency measurement.

Attaching virtual timestamps to messages as soon as they are produced would eliminate this ambiguity. While accurate, this approach requires significant changes (a few hundred lines) to the serving engine. For easier integration, `\sysname `{=latex}also offers a *bounded-jitter model*: assuming messages are received within time $J$ of transmission, inserting a cooldown of duration $J$ between consecutive time jumps ensures that `\observers`{=latex} timestamp each message before virtual time jumps again. We find that $J \approx 500\mu s$ suffices in most practical settings.
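
A cooldown of this kind can be sketched as a small gate that is consulted before each jump. Where `\sysname`{=latex} actually enforces the cooldown is not specified here, so the placement, the class name `JumpCooldown`, and the injectable `now` and `sleep` hooks are all assumptions made for illustration.

```python
import time


class JumpCooldown:
    """Enforce at least J seconds of wall time between consecutive time jumps."""

    def __init__(self, jitter_bound_s=500e-6, now=time.monotonic, sleep=time.sleep):
        self.jitter_bound_s = jitter_bound_s
        self._now = now      # injectable for deterministic testing
        self._sleep = sleep
        self._last_jump_s = None

    def wait_turn(self):
        """Block until the cooldown since the previous jump has elapsed."""
        if self._last_jump_s is not None:
            remaining = self._last_jump_s + self.jitter_bound_s - self._now()
            if remaining > 0:
                self._sleep(remaining)
        self._last_jump_s = self._now()
```

A caller would invoke `wait_turn()` immediately before applying each jump, so that any message emitted after the previous jump has up to $J$ of wall time to be received and timestamped.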

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{figures-graphics/revati_e2e_ttft_tpot.pdf}
\mycaption{End-to-end accuracy.}{\sysname{} (dashed) closely matches real execution (solid) for both TTFT and TPOT distributions across three model configurations. Prediction error remains below 5\% even at the tail.}
\label{fig:eval:e2e:accuracy}
\end{figure*}
```
<figure>
<p><img src="figures-graphics/revati_e2e_speedup.png" />  </p>
</figure>

## Device Emulation Layer {#sec:design:emulation-layer}

Serving frameworks are deeply integrated with CUDA and NCCL. Modifying all call sites would require invasive per-framework changes, undermining `\sysname`{=latex}'s transparency goal. The Device Emulation Layer adapts Maya's [@maya] transparent device virtualization to intercept and emulate CUDA API calls for inference workloads. Unmodified framework code executes as if it had access to the target hardware. In addition to maintaining transparency, this approach enables evaluation at scales beyond available resources. For example, a researcher who wants to evaluate a 128-GPU H200 cluster but lacks access to such hardware can simply configure `\sysname{}`{=latex} to emulate it.

However, performing emulation in real-time (as opposed to Maya's offline approach) introduces some unique challenges that we discuss below.

```{=latex}
\vheading{Preserving Distributed Dependencies}
```
In distributed serving (pipeline parallelism, tensor parallelism), NCCL collectives enforce causal dependencies---stage $i{+}1$ cannot proceed until stage $i$ completes `ncclSend`. Without actual GPU execution, `\sysname`{=latex} must block emulated stage $i{+}1$ until emulated stage $i$ reaches the corresponding send. We convert NCCL collectives into barrier synchronization points across participating workers, preserving temporal ordering without data transfer.
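
The idea of replacing a data transfer with a rendezvous can be sketched with Python threads standing in for emulated workers. This is an analogy rather than the interception code itself: `EmulatedCollective` and `run_pipeline_demo` are hypothetical names, and `threading.Barrier` plays the role of the cross-process synchronization point.

```python
import threading


class EmulatedCollective:
    """Stand-in for an intercepted communication call: synchronize, move no data."""

    def __init__(self, num_participants):
        self._barrier = threading.Barrier(num_participants)

    def enter(self, timeout_s=5.0):
        # Every participant blocks here until all of them have arrived.
        self._barrier.wait(timeout=timeout_s)


def run_pipeline_demo():
    link = EmulatedCollective(num_participants=2)
    events = []

    def stage0():
        events.append("stage0: reached send")
        link.enter()

    def stage1():
        link.enter()  # blocks until stage 0 reaches its send
        events.append("stage1: proceeds")

    # Start the downstream stage first to show that it still waits.
    threads = [threading.Thread(target=stage1), threading.Thread(target=stage0)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Regardless of thread start order, the downstream stage records its event only after the upstream stage has reached the matching call, which is the ordering property the barrier conversion is meant to preserve.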

```{=latex}
\vheading{Split-State Memory Model}
```
Unlike training workloads, serving systems often use GPU channels for control-plane communication. SGLang, for instance, broadcasts batch composition via NCCL. Pure black-box emulation would break this because the CPU expects valid data in these buffers. To tackle this, `\sysname`{=latex} bifurcates allocations into two categories. Metadata buffers -- small allocations (below 4MB by default) potentially used for control decisions -- are backed by real host memory and their operations execute faithfully. Compute buffers -- large allocations for KV caches and weights -- receive virtual pointers with no physical backing and their operations become no-ops with durations estimated by the runtime predictor. This model relies on compute buffers never being read by the CPU. To enforce this invariant, `\sysname `{=latex}raises a fatal exception if the application attempts to read from a virtual compute buffer rather than returning garbage values. A successful emulation run thus guarantees the control plane never operated on phantom data.
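
The split between backed and virtual allocations can be sketched as follows. The 4MB threshold comes from the text above; everything else (`emulated_malloc`, `EmulatedBuffer`, `VirtualBufferReadError`) is a hypothetical Python stand-in for logic that in practice sits behind intercepted CUDA calls.

```python
METADATA_THRESHOLD_BYTES = 4 * 1024 * 1024  # allocations below this are metadata


class VirtualBufferReadError(RuntimeError):
    """Raised when the host reads a buffer that has no physical backing."""


class EmulatedBuffer:
    def __init__(self, nbytes, backed):
        self.nbytes = nbytes
        # Metadata buffers get real host memory; compute buffers get none.
        self._data = bytearray(nbytes) if backed else None

    @property
    def is_virtual(self):
        return self._data is None

    def write(self, offset, payload):
        if self._data is not None:
            self._data[offset:offset + len(payload)] = payload
        # Writes to virtual buffers are no-ops; only their duration is modeled.

    def read(self, offset, length):
        if self._data is None:
            # Fail loudly instead of returning garbage to the control plane.
            raise VirtualBufferReadError("host read from a virtual compute buffer")
        return bytes(self._data[offset:offset + length])


def emulated_malloc(nbytes, threshold=METADATA_THRESHOLD_BYTES):
    return EmulatedBuffer(nbytes, backed=nbytes < threshold)
```

In this sketch a 64-byte control buffer round-trips its contents, while reading from an 8MB allocation raises immediately.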

## Runtime Prediction {#sec:runtime-prediction}

`\sysname`{=latex} provides a pluggable interface supporting both analytical and profiling-based runtime predictors. By default, we extend Vidur's [@vidur] operator-level models with support for Mixture-of-Experts routing, fused attention variants, and optimized all-reduce collectives. The predictor accepts batch composition (prefill count, decode count, sequence lengths) and target hardware spec, returning an estimated duration.
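
One way such a pluggable interface could look is sketched below. The type and method names (`BatchSpec`, `RuntimePredictor`, `predict_ms`) are assumptions, not `\sysname`{=latex}'s actual interface, and the coefficients of the toy linear model are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class BatchSpec:
    prefill_lengths: Sequence[int]          # tokens per prefill request
    decode_context_lengths: Sequence[int]   # context length per decode request


class RuntimePredictor(Protocol):
    def predict_ms(self, batch: BatchSpec, hardware: str) -> float: ...


class StaticPredictor:
    """Returns a fixed duration regardless of batch composition."""

    def __init__(self, duration_ms: float):
        self.duration_ms = duration_ms

    def predict_ms(self, batch: BatchSpec, hardware: str) -> float:
        return self.duration_ms


class LinearTokenPredictor:
    """Toy analytical model; the coefficients are illustrative, not measured."""

    def __init__(self, base_ms: float, ms_per_prefill_token: float, ms_per_decode_request: float):
        self.base_ms = base_ms
        self.ms_per_prefill_token = ms_per_prefill_token
        self.ms_per_decode_request = ms_per_decode_request

    def predict_ms(self, batch: BatchSpec, hardware: str) -> float:
        return (self.base_ms
                + self.ms_per_prefill_token * sum(batch.prefill_lengths)
                + self.ms_per_decode_request * len(batch.decode_context_lengths))


def batch_duration_ms(predictor: RuntimePredictor, batch: BatchSpec, hardware: str) -> float:
    # Callers depend only on the protocol, so predictors can be swapped freely.
    return predictor.predict_ms(batch, hardware)
```

A worker would pass the returned duration to its time jump; swapping `StaticPredictor` for a profiling-based model requires no change at the call site.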

# Implementation {#sec:design:implementation}

The `\timekeeper `{=latex}server and client library comprise ${\sim}800$ lines of C++, with Python bindings for framework integration. The device emulator adds ${\sim}6000$ lines of C++. LLM server patches total fewer than 50 lines each for vLLM and SGLang. The messaging layer uses ZeroMQ for low-latency communication. The `\timekeeper `{=latex}server employs a multi-threaded architecture: a dedicated I/O thread handles serialization and socket operations, and a background thread manages the barrier state and time-jump logic. This separation ensures high-frequency requests do not block barrier resolution.

# Evaluation {#sec:evaluation}

In this section, we present a comprehensive evaluation of `\sysname `{=latex}to demonstrate its fidelity in modeling LLM inference performance. Our evaluation addresses the following questions:

-   **Accuracy**: How accurately does `\sysname `{=latex}predict end-to-end inference latency across varying model sizes and deployment configurations?

-   **Efficiency**: Can `\sysname `{=latex}effectively accelerate evaluations compared to real execution and the naive sleep-based approach?

## Experimental Setup

```{=latex}
\vheading{Models and Inference Systems}
```
We conduct profiling and simulation across a spectrum of model sizes to ensure generalization. Specifically, we evaluate our system on two dense models (`\llamaS `{=latex}and `\llamaL`{=latex}) and one sparse model (`\qwenMM`{=latex}). We set the tensor-parallel degree to one for `\llamaS `{=latex}and four for `\llamaL`{=latex}, and we run `\qwenMM `{=latex}with an expert-parallel degree of two. To evaluate the generality of the system, we experiment with two serving engines: vLLM [@vLLM:github] and SGLang [@sglang:github]. To ensure a consistent comparison, we enable chunked prefill with mixed batching across all settings, with a chunk size of 512.

```{=latex}
\vheading{Hardware}
```
We perform our evaluations on a machine with four H200 GPUs connected by a fully connected NVLink fabric, a 128-core AMD EPYC 9334 CPU, and 756GB of memory.

<figure>
<p><img src="figures-graphics/revati_abl1_simple.png" />  </p>
</figure>

<figure>
<p><img src="figures-graphics/revati_abl2_simple.png" />  </p>
</figure>

<figure>
<p><img src="figures-graphics/revati_combined_speedup.png" />  </p>
</figure>

## End-to-end Evaluations

We evaluate `\sysname{}`{=latex}'s accuracy and speedup on end-to-end inference benchmarks across two production serving engines (vLLM and SGLang) and three model configurations (`\llamaS{}`{=latex}, `\llamaL{}`{=latex}, and `\qwenMM{}`{=latex}), using the ShareGPT dataset with Poisson arrivals.

```{=latex}
\vheading{Accuracy}
```
`\autoref{fig:eval:e2e:accuracy}`{=latex} compares the latency distributions predicted by `\sysname{}`{=latex} (dashed) against real execution (solid). Across all model sizes and both serving engines, `\sysname{}`{=latex} closely tracks the baseline time-to-first-token (TTFT) and time-per-output-token (TPOT) distributions. The predicted latencies match nearly identically across the CDF, with median prediction error below 5%. By directly executing serving engines, `\sysname `{=latex}is able to accurately model minute details in the scheduling policies of these systems. We observe that although both systems exhibit similar TTFT performance, SGLang suffers from a significantly worse tail in decode TPOT. This is because SGLang does not perform mixed batching by default, though it does perform chunked prefills.

`\vheading{Speedup}`{=latex} As shown in `\autoref{fig:eval:e2e:speedup}`{=latex}, `\sysname{}`{=latex} achieves over an order of magnitude speedup across both vLLM and SGLang relative to real execution, owing to its time acceleration protocol. As models get larger and GPU execution time increases, we obtain higher speedups, with the peak speedup for `\llamaL`{=latex}.

## Ablation Experiments

To understand how workload characteristics affect `\sysname{}`{=latex}'s performance, we conduct various ablation studies.

`\vheading{Varying Batch Durations}`{=latex} First, we evaluate how accurately and efficiently `\sysname `{=latex}operates as the model size and hardware change. We mimic this by using static batch time predictions of varying durations between 5 and 40 ms. We compare against the naive sleep-based approach, where GPU workers simply sleep for the batch duration instead of performing time jumps.

As shown in `\autoref{fig:eval:ablation:batch}`{=latex}, `\sysname{}`{=latex} produces highly accurate TTFT and TPOT distributions with less than 5% error. As we increase the batch duration, we obtain higher speedup, reaching 27$\times$ at a batch runtime of 40ms. This is a direct consequence of `\sysname{}`{=latex}'s ability to skip over GPU compute wait times.

`\vheading{Varying Arrival Rates}`{=latex} Another aspect of the workload that could directly affect the accuracy and efficacy of the virtual time advancement protocol is the request arrival rate. `\autoref{fig:eval:ablation:qps}`{=latex} shows `\sysname`{=latex}'s performance for request arrival rates varying between 0.5 and 8 queries per second (QPS). To exclude any errors introduced by the runtime predictor, we run this experiment with a fixed batch time of 20 ms and compare against the sleep-based baseline. The system maintains less than 5% error in TTFT predictions across the board.

As the arrival rate increases, we observe a slight reduction in speedup. This is because at higher arrival rates the system operates with larger batch sizes, which results in more CPU work for batch formation and bookkeeping. However, since this overhead is relatively small (typically \< 200$\mu$s) compared to batch execution time, `\sysname `{=latex}sustains over an order of magnitude speedup even at high request load.

# Related Work {#sec:related-work}

```{=latex}
\vheading{Virtual Time in Distributed Simulation}
```
Jefferson's Time Warp [@virtualtime; @timewarp] pioneered optimistic virtual time: processes advance speculatively and roll back on causality violations. Chandy-Misra-Bryant [@asyncdistributedsim; @packetcommunication] introduced conservative protocols where processes exchange lookahead information to advance safely. Both assume the simulator controls event scheduling. `\sysname`{=latex} operates under a different constraint: processes run asynchronously in wall-clock time with real side effects that cannot be rolled back. Our barrier-based protocol adapts conservative principles to this setting, using minimum-target advancement to preserve causality while timeouts ensure graceful degradation to real-time execution.

```{=latex}
\vheading{Time Dilation for Network Emulation}
```
Time dilation [@gupta2006infinity; @modelnet; @diecast] slows virtual time to make physical hardware appear faster---a 1 Gbps NIC can emulate 10 Gbps by running applications at $10\times$ slower virtual time. Tools like DieCast [@diecast], ModelNet [@modelnet], and Mininet [@mininet] use this technique for network testing at scale. `\sysname`{=latex} inverts this relationship: rather than slowing time to amplify hardware capability, we accelerate time to skip over GPU computation. Both share the insight that applications can operate in virtual time decoupled from wall-clock time, but the mechanisms differ substantially. Network emulators intercept system calls and scale delays; `\sysname`{=latex} coordinates distributed processes through explicit barrier synchronization to maintain causality across time jumps.

# Discussion {#sec:discussion}

```{=latex}
\vheading{Towards Fully Transparent Emulation}
```
`\sysname`{=latex} requires a minimal (${\sim}$`<!-- -->`{=html}50 line) patch per framework to invoke time jumps. This raises a natural question: could we eliminate this requirement entirely through kernel-level interception, as Maya does for training? The main obstacle is that modern inference systems have framework-specific implementations for complex operations such as flash attention and MoE routing. Predicting their runtime from kernel API signatures alone would require complex reverse-engineering of framework internals, introducing the maintenance burden that `\sysname`{=latex} aims to avoid. More recently, there have been attempts to directly estimate the runtime of GPU kernels from their source code and input parameters [@neusight; @ithemal]; this direction promises to enable fully transparent emulation in the future.

```{=latex}
\vheading{Speed of Light Execution}
```
The systems community has been working to minimize CPU overheads in serving systems in order to maximize GPU utilization. This is becoming increasingly important as GPUs become faster and more capable over generations. Recently, there have been multiple efforts to build highly efficient implementations in C++ and Rust [@tgi; @llamacpp] instead of the traditional Python ones. Existing frameworks like SGLang have also recently started migrating critical segments of their codebase, such as the prefix cache tree, to C++ to achieve this goal [@sglang:github]. This raises the question: what maximum acceleration can we achieve with `\sysname`{=latex}? As the emulation speeds up, we expect to encounter new bottlenecks in the system; for instance, the tokenization and detokenization steps could add non-trivial overheads. We hope to explore these challenges in future work.

# Conclusion {#sec:conclusion}

The rapid evolution of LLM serving systems has rendered traditional discrete-event simulation unsustainable. The gap between a simulator's model and frontier-grade serving systems is simply too expensive to bridge manually. `\sysname`{=latex} eliminates this gap by enabling cheap and accurate runtime modeling directly on top of the original serving frameworks. We show that the transparent device emulation layer, combined with the barrier-based virtual time protocol, enables accurate runtime modeling (${<}5\%$ error) while achieving an order of magnitude speedup over physical execution.

```{=latex}
\bibliographystyle{plain}
```

[^1]: Note: In real deployments, request termination is determined by the generation of EOS tokens. However, for performance modeling purposes, it is common practice to run each request for a predetermined number of tokens by setting specific sampling parameters.
