GPU Inference Optimization
Purpose
This page tracks GPU inference optimization sources that are useful for designing, evaluating, or automating LLM serving systems. The current page is organized around three branches:
- Simulators that reimplement or model serving behavior to explore configurations cheaply.
- Emulators that run real framework code while virtualizing the accelerator/time interface.
- Serving optimization papers that use forecasts, scheduling, routing, placement, or autoscaling to reduce GPU cost while respecting latency constraints.
The page is intentionally systems-oriented. It is adjacent to the time-series foundation-model agenda because serving workloads are event streams and because optimization policies act on system state over time, but the current sources are LLM infrastructure papers rather than time-series foundation-model papers.
Taxonomy
- GPU inference optimization
  - Simulators
    - Vidur
      - profiling + predictive modeling
      - Vidur-Search configuration search
    - LLMServingSim
      - iteration-level HW/SW co-simulation
      - computation reuse
    - LLMServingSim 2.0
      - heterogeneous hardware
      - disaggregated serving
      - routing/cache/scheduling/power
  - Emulators
    - Revati
      - CUDA interception
      - GPU-free time-warp serving emulation
      - vLLM/SGLang real control logic
    - Maya
      - training-side runtime emulation
      - predecessor mechanism for Revati
  - Serving control
    - SageServe
      - forecast-aware autoscaling
      - routing and model placement
      - regional GPU-hour savings
Simulator Branch
Vidur
Vidur is the baseline large-scale LLM inference simulator in this batch. It combines profiling and predictive modeling to estimate latency and throughput across workloads, models, batching policies, scheduling policies, and parallelization choices. Vidur-Search then searches deployment configurations that meet application constraints.
Its value is the economic argument for simulation: the paper reports finding a LLaMA2-70B deployment configuration in one CPU-machine hour instead of an estimated 42K GPU hours. It also makes workload specificity part of the serving contract: the best configuration is a function of both model and request trace, and reusing one workload’s optimum on another can produce about a 2x cost overhead. Its caveat is the simulator maintenance tax: if the simulator reimplements serving-engine behavior, framework changes must be mirrored inside the simulator.
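The search pattern and the workload-specificity point can be illustrated with a minimal sketch. It assumes a hypothetical toy latency model in place of Vidur's profiled predictors; the names `Config`, `predict_latency_ms`, `cost_per_request`, and `cheapest_config`, and every numeric constant, are illustrative assumptions rather than Vidur's API or measured values.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Config:
    tensor_parallel: int  # GPUs per replica
    batch_size: int       # max requests batched per iteration


def predict_latency_ms(cfg: Config, mean_tokens: int) -> float:
    """Toy stand-in for a learned latency predictor (assumed functional form)."""
    fixed = 10.0                                      # assumed per-iteration overhead
    compute = 0.5 * mean_tokens * cfg.batch_size / cfg.tensor_parallel
    comm = 2.0 * (cfg.tensor_parallel - 1)            # assumed synchronization cost
    return fixed + compute + comm


def cost_per_request(cfg: Config, mean_tokens: int) -> float:
    """GPU-milliseconds consumed per request under the toy model."""
    latency = predict_latency_ms(cfg, mean_tokens)
    return latency * cfg.tensor_parallel / cfg.batch_size


def cheapest_config(mean_tokens: int, slo_ms: float) -> Optional[Config]:
    """Exhaustively search a small grid for the cheapest config that meets the SLO."""
    grid = [Config(tp, bs) for tp, bs in product([1, 2, 4, 8], [1, 4, 16, 64])]
    feasible = [c for c in grid if predict_latency_ms(c, mean_tokens) <= slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda c: cost_per_request(c, mean_tokens))


short_best = cheapest_config(mean_tokens=100, slo_ms=500.0)
long_best = cheapest_config(mean_tokens=2000, slo_ms=500.0)
```

Under these assumed constants, the short-request workload and the long-request workload select different configurations, and the short workload's optimum violates the latency bound when reused on the long one. The sketch only shows why the optimum depends on the trace; it does not reproduce the paper's reported overhead.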
LLMServingSim Line
LLMServingSim 2024 introduces iteration-level HW/SW co-simulation for LLM serving. It models dynamic autoregressive serving behavior and reuses computation across decoder blocks and iterations. It reports less than 14.7% error against a real GPU-based LLM serving system and 91.5x faster simulation than existing accelerator simulators.
LLMServingSim2.0 2025 is the short CAL bridge paper. It adds trace-driven performance modeling and an operator-level latency profiler so new accelerators can be integrated with less custom code. It reports 1.9% GPU-serving reproduction error and an 18.5x reduction in integration code in a TPU case study.
LLMServingSim 2.0 2026 is the fuller current reference. It targets heterogeneous and disaggregated LLM serving, embedding batching, routing, offloading, memory, power, and hardware behavior in a unified runtime loop. The paper reports 0.95% average error for key performance, memory, and power metrics and simulation times around 10 minutes for complex configurations.
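The iteration-level loop with computation reuse described for LLMServingSim 2024 can be sketched roughly as follows. This is an illustration under stated assumptions, not the simulator's code: `block_latency_us` is a hypothetical toy cost function standing in for an expensive hardware simulation, `NUM_DECODER_BLOCKS` is an assumed depth, and memoization via `functools.lru_cache` is one simple way to express reuse.

```python
from functools import lru_cache

NUM_DECODER_BLOCKS = 32  # assumed model depth, for illustration only


@lru_cache(maxsize=None)
def block_latency_us(batch_size: int, context_len: int) -> float:
    """Toy cost of one decoder block; stands in for a slow hardware simulation."""
    attention = 0.02 * batch_size * context_len  # assumed attention cost
    mlp = 1.5 * batch_size                       # assumed feed-forward cost
    return attention + mlp


def iteration_latency_us(batch_size: int, context_len: int) -> float:
    # Decoder blocks share one shape, so simulate a single block and scale by depth.
    return NUM_DECODER_BLOCKS * block_latency_us(batch_size, context_len)


def simulate_decode(batch_size: int, prompt_len: int, new_tokens: int) -> float:
    """Sum per-iteration latencies as the context grows by one token per step."""
    total = 0.0
    for step in range(new_tokens):
        total += iteration_latency_us(batch_size, prompt_len + step)
    return total
```

Reuse shows up in two places: one block result is scaled across all decoder blocks, and a repeated `(batch_size, context_len)` shape in a later iteration or a later run is served from the cache instead of being simulated again.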
Emulator Branch
Revati
Revati is the serving emulator source. Instead of reimplementing vLLM or SGLang control logic, Revati runs the real serving framework code against a virtualized CUDA interface. GPU kernels are not executed; the emulator advances virtual time by predicted kernel durations and uses a distributed coordination protocol to preserve causality across processes.
It reports less than 5% prediction error and 5-17x faster-than-real-GPU execution across multiple models and parallelism configurations. The central design tradeoff is clear: Revati reduces simulator maintenance burden but introduces dependency on CUDA interposition, kernel-duration models, and virtual-time correctness.
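The virtual-time idea can be sketched in a few lines. This is a simplified illustration, not Revati's implementation: `VirtualClock`, `FakeDevice`, `serve_one_step`, and `barrier` are hypothetical names, the predicted durations are made up, and the barrier rule (all ranks jump to the latest clock) is an assumed stand-in for the paper's coordination protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VirtualClock:
    now_us: float = 0.0

    def advance(self, duration_us: float) -> None:
        self.now_us += duration_us


@dataclass
class FakeDevice:
    """Stands in for an intercepted accelerator API; no kernel is executed."""
    clock: VirtualClock
    predicted_us: Dict[str, float]
    launched: List[str] = field(default_factory=list)

    def launch(self, kernel_name: str) -> None:
        # Record the call, then move virtual time forward by the predicted duration.
        self.launched.append(kernel_name)
        self.clock.advance(self.predicted_us[kernel_name])


def serve_one_step(device: FakeDevice) -> None:
    """Placeholder for unmodified framework logic that issues kernels."""
    for name in ("attention", "mlp", "sampling"):
        device.launch(name)


def barrier(clocks: List[VirtualClock]) -> None:
    """At a collective, no rank may proceed before the slowest rank arrives."""
    latest = max(c.now_us for c in clocks)
    for c in clocks:
        c.now_us = latest


predictions = {"attention": 120.0, "mlp": 80.0, "sampling": 5.0}  # made-up values
device = FakeDevice(VirtualClock(), predictions)
serve_one_step(device)
```

The control logic in `serve_one_step` runs unchanged while the clock, not the wall time, carries the cost. The accuracy of the result therefore rests entirely on the duration predictions and on the clock synchronization rule.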
Maya
Maya is not an inference-serving paper, but it belongs here as Revati’s training-side predecessor. It uses transparent device emulation to model deep-learning training workloads from unmodified code. The paper reports less than 5% prediction error and up to 56% training-cost reduction from identified configurations.
Maya’s relevance is the semantic-gap critique: if users must translate real workloads into custom specification languages, the evaluator can stop matching the actual framework behavior. Revati applies the same critique to LLM serving.
Forecast-Aware Serving Optimization
SageServe is the workload-forecasting and resource-control branch. It characterizes Microsoft Office 365 LLM serving workloads and proposes a multi-timescale controller for mixed interactive and non-interactive workloads. The controller combines short-term routing with longer-lead-time GPU VM scaling and model placement, using traffic forecasts and an ILP formulation.
The reported results are up to 25% GPU-hour savings, an 80% reduction in GPU-hour wastage from inefficient autoscaling, and potential savings of up to $2.5M per month while maintaining tail latency and SLAs.
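The interaction between traffic forecasts and scaling lead time can be illustrated with a small sketch. It deliberately swaps the paper's ILP for a greedy look-ahead rule, and everything in it is an assumption for illustration: the function names `plan_instances` and `gpu_hours`, the per-instance capacity, the lead time, and the forecast values.

```python
import math
from typing import List


def plan_instances(forecast_rps: List[float], capacity_rps: float,
                   lead_steps: int, headroom: float = 0.1) -> List[int]:
    """Instance count to hold at each step so capacity covers the peak forecast
    within the provisioning lead window (greedy rule, not an ILP)."""
    plan = []
    for t in range(len(forecast_rps)):
        window = forecast_rps[t:t + lead_steps + 1]  # now plus the lead window
        peak = max(window)
        plan.append(math.ceil(peak * (1.0 + headroom) / capacity_rps))
    return plan


def gpu_hours(plan: List[int], step_hours: float = 1.0) -> float:
    """Total instance-hours implied by a plan."""
    return sum(plan) * step_hours


forecast = [100.0, 200.0, 400.0, 150.0]  # made-up hourly requests per second
plan = plan_instances(forecast, capacity_rps=50.0, lead_steps=1, headroom=0.0)
static_peak = [max(plan)] * len(forecast)  # provision for the peak all day
```

With these made-up numbers the look-ahead plan is `[4, 8, 8, 3]`, which totals 23 instance-hours against 32 for static peak provisioning. The gap depends entirely on forecast shape and lead time, so it says nothing about the savings reported in the paper.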
Comparison Table
| Source | Branch | Evaluator / optimizer | Main knob | Reported fidelity or savings | Credibility note |
|---|---|---|---|---|---|
| Vidur 2024 | Simulator | Profiling + predictive modeling + Vidur-Search | Deployment configuration, batching, scheduling, parallelism | <9% latency error; LLaMA2-70B search in one CPU-machine hour vs estimated 42K GPU hours | Older than one year, but official Microsoft repo and important baseline. |
| LLMServingSim 2024 | Simulator | Iteration-level HW/SW co-simulation | Accelerator/compiler stack and serving iteration behavior | <14.7% GPU-serving error; 91.5x faster simulation | IISWC 2024 KAIST source. |
| LLMServingSim2.0 2025 | Simulator | Trace-driven performance modeling + operator profiler | Hardware extensibility plus routing/cache/scheduling interfaces | 1.9% GPU-serving error; 18.5x fewer LoC in TPU integration case | IEEE CAL 2025 KAIST source. |
| LLMServingSim 2.0 2026 | Simulator | Unified runtime loop for heterogeneous/disaggregated serving | Batching, routing, offloading, memory, power, parallelism | 0.95% average error; ~10 minute complex simulations | Current primary LLMServingSim reference; official repo lists ISPASS 2026. |
| Revati 2026 | Emulator | Real vLLM/SGLang code + CUDA virtualization + time warp | Serving configuration under real framework control logic | <5% prediction error; 5-17x faster than real GPU execution | Current arXiv preprint; author X context, not yet peer reviewed. |
| Maya 2025 | Emulator | Training-code device emulation | Training configuration search | <5% prediction error; up to 56% cost reduction | EuroSys 2026; training-side predecessor to Revati. |
| SageServe 2025 | Serving control | Forecast-aware routing/scaling/model placement | Multi-region GPU VM allocation under SLAs | Up to 25% GPU-hour savings; 80% less autoscaling waste | ACM POMACS 2025; Microsoft O365 production workload study. |
Simulator vs Emulator Reading Rule
A simulator is best when the goal is controlled, fast exploration of a modeled design space. It is only trustworthy when the modeled policies, kernels, workload distributions, and hardware abstractions match the real system being optimized.
An emulator is best when framework control logic changes quickly and should be executed directly. It is only trustworthy when device virtualization, API coverage, predicted durations, and distributed time semantics preserve the relevant behavior.
In both cases, the evaluator is a proxy world model of the serving system. Optimization results should be checked against held-out real deployments before claiming production savings.
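One way to make the held-out check concrete is a simple gate on relative error. This is a sketch under stated assumptions: the metric names, the 5% tolerance, and the functions `relative_errors` and `evaluator_is_trusted` are hypothetical, and a real protocol would also need to cover drift and uncertainty.

```python
from typing import Dict


def relative_errors(predicted: Dict[str, float],
                    measured: Dict[str, float]) -> Dict[str, float]:
    """Per-metric relative error of evaluator predictions against real measurements."""
    return {name: abs(predicted[name] - measured[name]) / measured[name]
            for name in measured}


def evaluator_is_trusted(predicted: Dict[str, float],
                         measured: Dict[str, float],
                         tolerance: float = 0.05) -> bool:
    """Accept the evaluator only if every held-out metric is within tolerance."""
    return max(relative_errors(predicted, measured).values()) <= tolerance


measured = {"p99_latency_ms": 200.0, "throughput_rps": 100.0}    # held-out deployment
close = {"p99_latency_ms": 204.0, "throughput_rps": 98.0}        # evaluator output
drifted = {"p99_latency_ms": 230.0, "throughput_rps": 98.0}      # stale evaluator
```

Here `evaluator_is_trusted(close, measured)` passes and `evaluator_is_trusted(drifted, measured)` fails. Gating on the worst metric rather than the average keeps a single badly modeled signal from being hidden by accurate ones.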
Relation To Foundation TSFM Agenda
GPU inference optimization is not itself a time-series foundation-model method, but it is operationally relevant to always-on stateful systems:
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | The sources expose serving-time compute, routing, batching, memory, parallelism, and scaling as first-class variables. | Need direct TSFM serving studies with long observation histories, state updates, and action-conditioned forecasts. |
| Observability and event streams | adjacent | SageServe and the simulators treat request arrivals, queue state, latency, throughput, memory, and power as time-varying operational signals. | Need public traces and benchmarks that include richer event streams, interventions, exogenous variables, and incident outcomes. |
| Control and counterfactuals | adjacent | Simulators/emulators support what-if exploration over serving control inputs. | Need calibrated counterfactual validation against real deployments and explicit uncertainty. |
| Benchmark validity | warning | The simulator/emulator split shows how evaluator drift can make benchmark wins brittle. | Need held-out deployment checks, drift monitoring, and versioned evaluator contracts. |
Open Questions
- When should an optimizer use a simulator, an emulator, or a small real-GPU canary loop?
- Can an agent safely optimize serving policies using simulator/emulator rewards without overfitting evaluator blind spots?
- Which serving state should be exposed as an event stream: request arrivals, KV-cache occupancy, queue state, power, network traffic, memory-tier movement, or tenant/application metadata?
- How should uncertainty in forecast models and latency models propagate into routing, scaling, and model-placement decisions?
- Can TSFM-style latent-state models improve workload forecasting or serving-control policies beyond hand-built forecast/ILP pipelines?
- What is the right held-out validation protocol for disaggregated serving and heterogeneous accelerators, including profile drift, interconnect/data-movement assumptions, serving-framework feature coverage, and fast hardware turnover?