Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving

Source

Raw Markdown: paper_revati-2026.md
PDF: paper_revati-2026.pdf
Preprint: arXiv:2601.00397
Alex-provided X trigger: Amey Agrawal thread root. Local snapshots: papers/revati-2026/x_post_2013762528573096077.json, papers/revati-2026/x_thread_author_2013762528573096077.json, and papers/revati-2026/x_post_2013762528573096077.md.

Status And Credibility

Revati is an arXiv v1 preprint submitted on 2026-01-01 by Amey Agrawal, Mayank Yadav, Sukrit Kumar, Anirudha Agrawal, Garv Ghai, Souradeep Bera, Elton Pinto, Sirish Gambhira, Mohammad Adain, Kasra Sohrab, Chus Antonanzas, and Alexey Tumanov. It does not yet have a peer-reviewed venue in the arXiv metadata.

Treat it as current and credible preprint evidence because it comes from an author line connected to Vidur and Maya, evaluates real serving frameworks vLLM and SGLang, and is accompanied by a public author X discussion. Keep the venue caveat: it is not yet peer reviewed.

Core Claim

Revati argues that LLM serving optimization needs an evaluator that is cheaper than real GPU clusters but more maintainable than discrete-event simulators that reimplement serving-engine logic.

The system runs real serving framework code without physical GPUs. It virtualizes CUDA device management, intercepts CUDA API calls, stubs GPU work, and advances virtual time by predicted kernel durations. A distributed coordination protocol synchronizes time jumps across processes while preserving causality.

flowchart LR
  Engine[vLLM / SGLang real serving code] --> CUDA[CUDA API interception]
  CUDA --> Virtual[virtual GPU device state]
  CUDA --> Durations[predicted kernel durations]
  Durations --> Warp[virtual time jumps]
  Warp --> Barrier[distributed causality protocol]
  Barrier --> Metrics[latency and throughput predictions]
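
The time-warp idea in the flowchart can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Revati's implementation: the duration table, the `VirtualWorker` class, and the `warp` function are all hypothetical names, and the coordination rule (every process jumps to the earliest pending completion across all processes) is a simple conservative scheme standing in for the paper's protocol. Times are integer microseconds to keep the arithmetic exact.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical predicted kernel durations in microseconds (not from the paper).
PREDICTED_DURATION_US = {"attention": 4000, "mlp": 6000, "allreduce": 2000}


@dataclass
class VirtualWorker:
    """One serving process with its own virtual clock and stubbed GPU stream."""
    name: str
    now: int = 0                                       # local virtual time (us)
    pending: List[int] = field(default_factory=list)   # completion times, ascending

    def launch(self, kernel: str) -> None:
        # Stub the launch: no GPU work runs; only record when it would finish.
        # Kernels on one stream run back to back, so start after the last one.
        start = self.pending[-1] if self.pending else self.now
        self.pending.append(start + PREDICTED_DURATION_US[kernel])

    def next_event(self) -> Optional[int]:
        return self.pending[0] if self.pending else None

    def advance_to(self, t: int) -> None:
        # Jump the clock forward and retire every kernel finished by then.
        self.now = max(self.now, t)
        self.pending = [c for c in self.pending if c > self.now]


def warp(workers: List[VirtualWorker]) -> Optional[int]:
    """Jump all workers to the earliest pending completion, or return None.

    No worker moves past an event another worker has not yet observed,
    which is the causality property this toy rule is meant to show.
    """
    events = [e for e in (w.next_event() for w in workers) if e is not None]
    if not events:
        return None
    target = min(events)
    for w in workers:
        w.advance_to(target)
    return target


if __name__ == "__main__":
    rank0, rank1 = VirtualWorker("rank0"), VirtualWorker("rank1")
    rank0.launch("attention")
    rank0.launch("mlp")
    rank1.launch("allreduce")
    while (t := warp([rank0, rank1])) is not None:
        print(f"virtual time jumped to {t} us")
```

Because no wall-clock waiting happens between jumps, the loop finishes almost instantly regardless of how long the modeled kernels would take, which is the source of the faster-than-real-time behavior described above.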

Evidence And Results

The paper reports less than 5% prediction error across multiple models and parallelism configurations.
It reports 5x to 17x faster-than-real-GPU execution on vLLM and SGLang.
The author X context frames the motivation as agentic serving-system optimization: AI-driven search over serving designs needs a fast, cheap, accurate evaluation mechanism.
The paper explicitly positions emulation against the simulator maintenance tax: new serving framework control logic does not need to be manually reimplemented if the real code can run against virtualized CUDA.
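
To make the two headline numbers concrete, here is a small worked example of how a prediction error and a speedup factor of this kind are typically computed. The formulas are the usual relative-error and wall-clock-ratio definitions, assumed here because this page does not record the paper's exact metric definitions; the input values are hypothetical and are not results from the paper.

```python
def relative_error(predicted: float, measured: float) -> float:
    """Absolute relative error of an emulated metric against a real-GPU run."""
    return abs(predicted - measured) / measured


def speedup(real_wall_s: float, emulated_wall_s: float) -> float:
    """How many times faster the emulated run finished than the real run."""
    return real_wall_s / emulated_wall_s


# Hypothetical values for illustration only.
err = relative_error(predicted=96.0, measured=100.0)   # latencies in ms
gain = speedup(real_wall_s=600.0, emulated_wall_s=60.0)
print(f"prediction error: {err:.1%}, speedup: {gain:.0f}x")
```

With these made-up inputs the error is 4% and the speedup is 10x, which would fall inside the ranges the paper reports.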

Why It Matters For GPU Inference Optimization

Revati is the time-warp emulator source on GPU Inference Optimization, now complemented by LLM-Emu’s wall-clock vLLM endpoint emulation. Together they sharpen the simulator-versus-emulator distinction:

simulators such as Vidur and LLMServingSim 2.0 can explore designs cheaply but must model or reimplement enough serving behavior;
emulators such as Revati and LLM-Emu run real framework control logic and virtualize only the accelerator/time interface.

For an agentic optimization loop, Revati is attractive because the evaluator can follow framework changes more naturally than a reimplemented simulator. The cost is a new dependency on API interposition, kernel-duration prediction, and distributed virtual-time correctness. LLM-Emu moves that tradeoff: it avoids CUDA interposition and keeps wall-clock online HTTP behavior, but trust shifts to profile coverage, synthetic-output realism, vLLM API drift, and single-node/profile-matched validation.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	Provides GPU-free evaluation of serving configurations for real LLM serving engines.	Does not evaluate TSFM serving or stateful time-series deployment directly.
Control and counterfactuals	adjacent	Enables what-if serving-configuration exploration while executing real control logic.	Counterfactual validity depends on timing models and virtual-time causality.
Benchmark validity	warning	Shows why simulator maintenance tax can invalidate evaluations as serving frameworks evolve.	Emulation can also drift when CUDA APIs, kernels, or prediction models change.

Limitations And Gotchas

Revati is a 2026 arXiv preprint and should not yet be treated as peer-reviewed evidence.
It still needs predicted kernel durations; bad profiles can produce bad virtual time.
CUDA API interposition may need maintenance as frameworks, libraries, and hardware features evolve.
It is designed for offline time-warped performance modeling, not necessarily for exercising every online endpoint overhead in wall-clock mode.
