GPU Inference Optimization

Purpose

This page tracks GPU inference optimization sources that are useful for designing, evaluating, or automating LLM serving systems. The current page is organized around three branches:

  1. Simulators that reimplement or model serving behavior to explore configurations cheaply.
  2. Emulators that run real framework code while virtualizing the accelerator/time interface.
  3. Serving optimization papers that use forecasts, scheduling, routing, placement, or autoscaling to reduce GPU cost while respecting latency constraints.

The page is intentionally systems-oriented. It is adjacent to the time-series foundation-model agenda because serving workloads are event streams and because optimization policies act on system state over time, but the current sources are LLM infrastructure papers rather than time-series foundation-model papers.

Taxonomy

mindmap
  root((GPU inference optimization))
    Simulators
      Vidur
        profiling + predictive modeling
        Vidur-Search configuration search
      LLMServingSim
        iteration-level HW/SW co-simulation
        computation reuse
      LLMServingSim 2.0
        heterogeneous hardware
        disaggregated serving
        routing/cache/scheduling/power
    Emulators
      Revati
        CUDA interception
        GPU-free time-warp serving emulation
        vLLM/SGLang real control logic
      Maya
        training-side runtime emulation
        predecessor mechanism for Revati
    Serving control
      SageServe
        forecast-aware autoscaling
        routing and model placement
        regional GPU-hour savings

Simulator Branch

Vidur

Vidur is the baseline large-scale LLM inference simulator in this batch. It combines profiling and predictive modeling to estimate latency and throughput across workloads, models, batching policies, scheduling policies, and parallelization choices. Vidur-Search then searches deployment configurations that meet application constraints.

Its value is the economic argument for simulation: the paper reports finding a LLaMA2-70B deployment configuration in one CPU-machine hour instead of an estimated 42K GPU hours. It also makes workload specificity part of the serving contract: the best configuration is a function of both model and request trace, and reusing one workload’s optimum on another can produce about a 2x cost overhead. Its caveat is the simulator maintenance tax: if the simulator reimplements serving-engine behavior, framework changes must be mirrored inside the simulator.
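The search step and the workload-specificity point can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Config`, `toy_predict`, and `search` are hypothetical names, and the latency formula is a toy stand-in for a simulator, not Vidur's model or Vidur-Search's algorithm.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Config:
    tensor_parallel: int  # GPUs per replica
    replicas: int
    max_batch_size: int


def toy_predict(cfg: Config, request_rate: float) -> tuple[float, float]:
    """Hypothetical simulator stand-in: returns (latency_s, capacity_qps)."""
    # Assumed toy model: a step gets faster with more tensor parallelism
    # and slower with larger batches.
    step_s = 0.5 / cfg.tensor_parallel + 0.02 * cfg.max_batch_size
    capacity = cfg.replicas * cfg.max_batch_size / step_s
    if request_rate >= capacity:
        return float("inf"), capacity  # overloaded: queue grows without bound
    utilization = request_rate / capacity
    return step_s / (1.0 - utilization), capacity


def search(request_rate: float, slo_s: float, gpu_hour_cost: float = 1.0
           ) -> Optional[tuple[float, Config, float]]:
    """Exhaustively scan a small grid; return (cost, config, latency) or None."""
    best = None
    for tp, rep, bs in product([1, 2, 4, 8], [1, 2, 4], [8, 16, 32, 64]):
        cfg = Config(tp, rep, bs)
        latency, _ = toy_predict(cfg, request_rate)
        if latency > slo_s:
            continue  # violates the latency constraint
        cost = cfg.tensor_parallel * cfg.replicas * gpu_hour_cost
        if best is None or cost < best[0]:
            best = (cost, cfg, latency)
    return best
```

In this toy model, the cheapest configuration found for a light workload, `search(10.0, 2.0)`, no longer meets the latency constraint when replayed at a heavier request rate, which is the sense in which the optimum depends on the trace as well as the model.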

LLMServingSim Line

LLMServingSim 2024 introduces iteration-level HW/SW co-simulation for LLM serving. It models dynamic autoregressive serving behavior and reuses computation across decoder blocks and iterations. It reports less than 14.7% error against a real GPU-based LLM serving system and 91.5x faster simulation than existing accelerator simulators.
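The computation-reuse idea can be sketched as memoization: if every decoder block in an iteration has the same shape, the expensive per-block simulation only needs to run once per distinct shape. The function names and the cost formula below are hypothetical placeholders, not LLMServingSim's implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def simulate_block_latency_us(batch_tokens: int, hidden_size: int) -> float:
    """Hypothetical expensive per-block simulation (toy cost model)."""
    # Placeholder arithmetic; a real simulator would do detailed modeling here.
    return 1e-6 * batch_tokens * hidden_size


def iteration_latency_us(num_blocks: int, batch_tokens: int, hidden_size: int) -> float:
    """Sum per-block latency; identical blocks hit the cache after the first."""
    return sum(
        simulate_block_latency_us(batch_tokens, hidden_size)
        for _ in range(num_blocks)
    )
```

Calling `simulate_block_latency_us.cache_info()` after a few iterations shows how many distinct shapes were actually simulated versus served from the cache.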

LLMServingSim2.0 2025 is the short IEEE CAL (Computer Architecture Letters) bridge paper. It adds trace-driven performance modeling and an operator-level latency profiler so new accelerators can be integrated with less custom code. It reports 1.9% GPU-serving reproduction error and an 18.5x reduction in integration code in a TPU case study.

LLMServingSim 2.0 2026 is the fuller current reference. It targets heterogeneous and disaggregated LLM serving, embedding batching, routing, offloading, memory, power, and hardware behavior in a unified runtime loop. The paper reports 0.95% average error for key performance, memory, and power metrics and simulation times around 10 minutes for complex configurations.

Emulator Branch

Revati

Revati is the serving emulator source. Instead of reimplementing vLLM or SGLang control logic, Revati runs the real serving framework code against a virtualized CUDA interface. GPU kernels are not executed; the emulator advances virtual time by predicted kernel durations and uses a distributed coordination protocol to preserve causality across processes.

It reports less than 5% prediction error and 5-17x faster-than-real-GPU execution across multiple models and parallelism configurations. The central design tradeoff is clear: Revati reduces simulator maintenance burden but introduces dependency on CUDA interposition, kernel-duration models, and virtual-time correctness.
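The virtual-time idea can be illustrated with a minimal sketch: a fake device records kernel launches without running them and advances a per-process clock by a predicted duration, and a synchronization point moves every participant to the latest arrival time. All names here (`VirtualClock`, `EmulatedDevice`, `barrier`, `toy_duration`) and the per-token durations are assumptions for illustration; this is not Revati's interception layer or its coordination protocol.

```python
class VirtualClock:
    """Per-process virtual time in seconds."""

    def __init__(self) -> None:
        self.now_s = 0.0

    def advance(self, duration_s: float) -> None:
        if duration_s < 0:
            raise ValueError("virtual time cannot move backwards")
        self.now_s += duration_s


def toy_duration(name: str, num_tokens: int) -> float:
    """Hypothetical kernel-duration predictor (assumed per-token costs)."""
    per_token_s = {"attention": 2e-5, "mlp": 1e-5}
    return per_token_s[name] * num_tokens


class EmulatedDevice:
    """Stand-in for an intercepted accelerator API: no kernel is executed."""

    def __init__(self, predict=toy_duration) -> None:
        self.clock = VirtualClock()
        self.predict = predict
        self.launched: list[str] = []

    def launch_kernel(self, name: str, num_tokens: int) -> None:
        # Skip execution; only account for the predicted time.
        self.clock.advance(self.predict(name, num_tokens))
        self.launched.append(name)


def barrier(devices: list[EmulatedDevice]) -> float:
    """All participants leave the barrier at the latest arrival time."""
    release_s = max(d.clock.now_s for d in devices)
    for d in devices:
        d.clock.now_s = release_s
    return release_s
```

The barrier is the simplest way to keep one process from observing an event before its virtual cause has happened; a real emulator needs this guarantee for every cross-process interaction, not just explicit synchronization points.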

Maya

Maya is not an inference-serving paper, but it belongs here as Revati’s training-side predecessor. It uses transparent device emulation to model deep-learning training workloads from unmodified code. The paper reports less than 5% prediction error and up to 56% training-cost reduction from identified configurations.

Maya’s relevance is the semantic-gap critique: if users must translate real workloads into custom specification languages, the evaluator can stop matching the actual framework behavior. Revati applies the same critique to LLM serving.

Forecast-Aware Serving Optimization

SageServe is the workload-forecasting and resource-control branch. It characterizes Microsoft Office 365 LLM serving workloads and proposes a multi-timescale controller for mixed interactive and non-interactive workloads. The controller combines short-term routing with longer-lead-time GPU VM scaling and model placement, using traffic forecasts and an ILP formulation.

The reported results are up to 25% GPU-hour savings, 80% reduction in GPU-hour wastage from inefficient autoscaling, and potential savings up to $2.5M per month while maintaining tail latency and SLAs.
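The two timescales can be sketched with a simplified stand-in. The slow loop provisions instances for the forecast peak inside the scaling lead time, and the fast loop splits current traffic across regions in proportion to capacity. This is a greedy illustration under assumed names and parameters (`plan_instances`, `route`, `lead_steps`, `headroom`), not SageServe's ILP formulation.

```python
import math


def plan_instances(forecast_qps: list[float], per_instance_qps: float,
                   lead_steps: int, headroom: float = 1.1) -> list[int]:
    """Slow loop: cover the forecast peak over the scaling lead window."""
    if per_instance_qps <= 0:
        raise ValueError("per_instance_qps must be positive")
    plan = []
    for t in range(len(forecast_qps)):
        # Instances requested now must cover demand until new ones can arrive.
        window = forecast_qps[t:t + lead_steps + 1]
        plan.append(math.ceil(max(window) * headroom / per_instance_qps))
    return plan


def route(request_qps: float, capacity_by_region: dict[str, float]) -> dict[str, float]:
    """Fast loop: split traffic in proportion to each region's capacity."""
    total = sum(capacity_by_region.values())
    if total <= 0:
        raise ValueError("no serving capacity available")
    return {r: request_qps * c / total for r, c in capacity_by_region.items()}
```

For a forecast of `[100, 300, 200, 50]` requests per second, 100 per instance, a one-step lead time, and no headroom, the plan follows the forecast down after the peak instead of holding peak capacity for every step, which is where the GPU-hour savings in this toy setting come from.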

Comparison Table

| Source | Branch | Evaluator / optimizer | Main knob | Reported fidelity or savings | Credibility note |
| --- | --- | --- | --- | --- | --- |
| Vidur 2024 | Simulator | Profiling + predictive modeling + Vidur-Search | Deployment configuration, batching, scheduling, parallelism | <9% latency error; LLaMA2-70B search in one CPU-machine hour vs estimated 42K GPU hours | Older than one year, but official Microsoft repo and important baseline. |
| LLMServingSim 2024 | Simulator | Iteration-level HW/SW co-simulation | Accelerator/compiler stack and serving iteration behavior | <14.7% GPU-serving error; 91.5x faster simulation | IISWC 2024 KAIST source. |
| LLMServingSim2.0 2025 | Simulator | Trace-driven performance modeling + operator profiler | Hardware extensibility plus routing/cache/scheduling interfaces | 1.9% GPU-serving error; 18.5x fewer LoC in TPU integration case | IEEE CAL 2025 KAIST source. |
| LLMServingSim 2.0 2026 | Simulator | Unified runtime loop for heterogeneous/disaggregated serving | Batching, routing, offloading, memory, power, parallelism | 0.95% average error; ~10 minute complex simulations | Current primary LLMServingSim reference; official repo lists ISPASS 2026. |
| Revati 2026 | Emulator | Real vLLM/SGLang code + CUDA virtualization + time warp | Serving configuration under real framework control logic | <5% prediction error; 5-17x faster than real GPU execution | Current arXiv preprint; author X context, not yet peer reviewed. |
| Maya 2025 | Emulator | Training-code device emulation | Training configuration search | <5% prediction error; up to 56% cost reduction | EuroSys 2026; training-side predecessor to Revati. |
| SageServe 2025 | Serving control | Forecast-aware routing/scaling/model placement | Multi-region GPU VM allocation under SLAs | Up to 25% GPU-hour savings; 80% less autoscaling waste | ACM POMACS 2025; Microsoft O365 production workload study. |

Simulator vs Emulator Reading Rule

A simulator is best when the goal is controlled, fast exploration of a modeled design space. It is only trustworthy when the modeled policies, kernels, workload distributions, and hardware abstractions match the real system being optimized.

An emulator is best when framework control logic changes quickly and should be executed directly. It is only trustworthy when device virtualization, API coverage, predicted durations, and distributed time semantics preserve the relevant behavior.

In both cases, the evaluator is a proxy world model of the serving system. Optimization results should be checked against held-out real deployments before claiming production savings.
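A held-out check of this kind can be sketched as a simple gate that compares evaluator predictions with measurements from a real deployment. The metric names, the sample values, and the 5% default threshold are assumptions for illustration, not a protocol taken from any of the sources.

```python
def relative_errors(predicted: dict[str, float],
                    measured: dict[str, float]) -> dict[str, float]:
    """Per-metric relative error of the evaluator against real measurements."""
    errors = {}
    for name, real in measured.items():
        if name not in predicted:
            raise KeyError(f"evaluator produced no prediction for {name!r}")
        if real == 0:
            raise ValueError(f"measured value for {name!r} is zero")
        errors[name] = abs(predicted[name] - real) / abs(real)
    return errors


def passes_gate(predicted: dict[str, float], measured: dict[str, float],
                max_error: float = 0.05) -> bool:
    """Accept evaluator-driven results only if every metric is within tolerance."""
    return max(relative_errors(predicted, measured).values()) <= max_error
```

Gating on the worst metric rather than the mean is a deliberately conservative choice: an evaluator that is accurate on throughput but far off on tail latency should not be trusted for latency-constrained optimization.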

Relation To Foundation TSFM Agenda

GPU inference optimization is not itself a time-series foundation-model method, but it is operationally relevant to always-on stateful systems:

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute and serving | adjacent | The sources expose serving-time compute, routing, batching, memory, parallelism, and scaling as first-class variables. | Needs direct TSFM serving studies with long observation histories, state updates, and action-conditioned forecasts. |
| Observability and event streams | adjacent | SageServe and the simulators treat request arrivals, queue state, latency, throughput, memory, and power as time-varying operational signals. | Need public traces and benchmarks that include richer event streams, interventions, exogenous variables, and incident outcomes. |
| Control and counterfactuals | adjacent | Simulators/emulators support what-if exploration over serving control inputs. | Need calibrated counterfactual validation against real deployments and explicit uncertainty. |
| Benchmark validity | warning | The simulator/emulator split shows how evaluator drift can make benchmark wins brittle. | Need held-out deployment checks, drift monitoring, and versioned evaluator contracts. |

Open Questions

  • When should an optimizer use a simulator, an emulator, or a small real-GPU canary loop?
  • Can an agent safely optimize serving policies using simulator/emulator rewards without overfitting evaluator blind spots?
  • Which serving state should be exposed as an event stream: request arrivals, KV-cache occupancy, queue state, power, network traffic, memory-tier movement, or tenant/application metadata?
  • How should uncertainty in forecast models and latency models propagate into routing, scaling, and model-placement decisions?
  • Can TSFM-style latent-state models improve workload forecasting or serving-control policies beyond hand-built forecast/ILP pipelines?
  • What is the right held-out validation protocol for disaggregated serving and heterogeneous accelerators, including profile drift, interconnect/data-movement assumptions, serving-framework feature coverage, and fast hardware turnover?