GPU Inference Optimization

Purpose

This page tracks GPU inference optimization sources that are useful for designing, evaluating, or automating LLM serving systems. The current page is organized around four branches:

Simulators that reimplement or model serving behavior to explore configurations cheaply.
Emulators that run real framework code while virtualizing the accelerator/time interface.
Serving optimization papers that use forecasts, scheduling, routing, placement, or autoscaling to reduce GPU cost while respecting latency constraints.
Sparse attention kernels that change the attention computation itself and must be evaluated as algorithm-plus-kernel systems.

The page is intentionally systems-oriented. It is adjacent to the time-series foundation-model agenda because serving workloads are event streams and because optimization policies act on system state over time, but the current sources are LLM infrastructure papers rather than time-series foundation-model papers.

Taxonomy

mindmap
  root((GPU inference optimization))
    Simulators
      Vidur
        profiling + predictive modeling
        Vidur-Search configuration search
      LLMServingSim
        iteration-level HW/SW co-simulation
        computation reuse
      LLMServingSim 2.0
        heterogeneous hardware
        disaggregated serving
        routing/cache/scheduling/power
    Emulators
      LLM-Emu
        vLLM wall-clock endpoint
        profile-sampled GPU forward
        synthetic output tokens
      Revati
        CUDA interception
        GPU-free time-warp serving emulation
        vLLM/SGLang real control logic
      Maya
        training-side runtime emulation
        predecessor mechanism for Revati
    Serving control
      SageServe
        forecast-aware autoscaling
        routing and model placement
        regional GPU-hour savings

Simulator Branch

Vidur

Vidur is the baseline large-scale LLM inference simulator in this batch. It combines profiling and predictive modeling to estimate latency and throughput across workloads, models, batching policies, scheduling policies, and parallelization choices. Vidur-Search then searches deployment configurations that meet application constraints.

Its value is the economic argument for simulation: the paper reports finding a LLaMA2-70B deployment configuration in one CPU-machine hour instead of an estimated 42K GPU hours. It also makes workload specificity part of the serving contract: the best configuration is a function of both model and request trace, and reusing one workload’s optimum on another can produce about a 2x cost overhead. Its caveat is the simulator maintenance tax: if the simulator reimplements serving-engine behavior, framework changes must be mirrored inside the simulator.
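The search pattern described above can be sketched as a brute-force loop over a cheap predictor. This is a minimal illustration, not Vidur-Search itself: `predict`, its cost formula, and the candidate grids are all hypothetical stand-ins for a real simulator call.

```python
import math
from itertools import product


def predict(tp: int, batch: int) -> tuple:
    """Hypothetical simulator stand-in: (latency in s, requests/s per replica)."""
    latency = 0.5 + 0.05 * batch / tp  # invented cost model, not Vidur's
    return latency, batch / latency


def search(slo_s: float, demand_rps: float):
    """Return the cheapest config (in GPUs) that meets the SLO, or None."""
    best = None
    for tp, batch in product([1, 2, 4, 8], [8, 16, 32, 64]):
        latency, rps = predict(tp, batch)
        if latency > slo_s:
            continue  # violates the latency constraint
        replicas = math.ceil(demand_rps / rps)  # replicas needed for demand
        gpus = replicas * tp
        if best is None or gpus < best["gpus"]:
            best = {"tp": tp, "batch": batch, "replicas": replicas, "gpus": gpus}
    return best


print(search(slo_s=1.0, demand_rps=100.0))
```

Because `demand_rps` and the SLO enter the objective directly, the winning configuration depends on the workload as well as the model, which is the workload-specificity point made above.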

LLMServingSim Line

LLMServingSim 2024 introduces iteration-level HW/SW co-simulation for LLM serving. It models dynamic autoregressive serving behavior and reuses computation across decoder blocks and iterations. It reports less than 14.7% error against a real GPU-based LLM serving system and 91.5x faster simulation than existing accelerator simulators.

LLMServingSim2.0 2025 is the short IEEE CAL bridge paper. It adds trace-driven performance modeling and an operator-level latency profiler so new accelerators can be integrated with less custom code. It reports 1.9% GPU-serving reproduction error and an 18.5x reduction in integration code in a TPU case study.

LLMServingSim 2.0 2026 is the fuller current reference. It targets heterogeneous and disaggregated LLM serving, embedding batching, routing, offloading, memory, power, and hardware behavior in a unified runtime loop. The paper reports 0.95% average error for key performance, memory, and power metrics and simulation times around 10 minutes for complex configurations.
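To make "iteration-level" concrete, here is a toy simulation loop that advances time one batch iteration at a time. It is a sketch under stated assumptions, not LLMServingSim code: the `Request` type, the linear `iteration_latency` model, and the admission rule are hypothetical. The `lru_cache` is only loosely analogous to the computation reuse the paper describes, in that repeated iteration shapes are costed once.

```python
from collections import deque
from dataclasses import dataclass
from functools import lru_cache
from typing import Optional


@dataclass
class Request:
    arrival: float
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    finish: Optional[float] = None


@lru_cache(maxsize=None)  # identical iteration shapes are costed once
def iteration_latency(prefill_tokens: int, decode_seqs: int) -> float:
    # Hypothetical linear cost model in seconds; a real simulator would
    # query profiled or hardware-simulated operator latencies here.
    return 0.002 + 0.00005 * prefill_tokens + 0.0005 * decode_seqs


def simulate(requests, max_batch: int = 4) -> float:
    """Run all requests to completion and return the final simulated time."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    running, now = [], 0.0
    while pending or running:
        prefill_tokens = 0
        # Admit requests that have arrived, at the iteration boundary.
        while pending and pending[0].arrival <= now and len(running) < max_batch:
            req = pending.popleft()
            prefill_tokens += req.prompt_tokens
            running.append(req)
        if not running:
            now = pending[0].arrival  # idle: jump to the next arrival
            continue
        now += iteration_latency(prefill_tokens, len(running))
        for req in running:  # every running sequence emits one token
            req.generated += 1
            if req.generated >= req.output_tokens:
                req.finish = now
        running = [r for r in running if r.finish is None]
    return now


reqs = [Request(0.0, 100, 2), Request(0.0, 100, 2)]
print(simulate(reqs))
```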

External Simulator Watchlist

These works are not yet ingested as source pages. They are listed here so future agents do not miss them during simulator, emulator, or hardware-analysis scans. Preserve the reading rule below: a simulator or analytical evaluator is a proxy serving-world model until its assumptions are checked against the target serving stack, hardware generation, workload mix, and held-out real deployments.

Work	Date / credibility	Branch fit	What to remember
APEX / APEX+, Microsoft repo	Submitted 2024-11, revised 2025-04; Microsoft-affiliated authors and official Microsoft repository.	Simulator	Extensible, dynamism-aware simulator for automated parallel execution planning in LLM serving. Models iteration-level batching, memory, quantization, device clusters, and data/pipeline/tensor parallelism tradeoffs; reports CPU-side execution-plan search within about 15 minutes.
TokenSim	2025-03 arXiv preprint; current but not yet a source page here.	Simulator	Hardware/software exploration simulator for LLM inference systems. Emphasizes extensible scheduling and memory-management studies and reports less than 1% validation error against real-system datasets.
MIST / HERMES	2025-04 arXiv preprint; current multi-stage inference work.	Simulator / co-design framework	Multi-stage LLM inference simulator and co-design framework for RAG, KV retrieval, reasoning, prefill, and decode stages across heterogeneous devices. Useful when serving is a pipeline of interacting stages, not only prefill/decode on one engine.
Frontier v1	2025-08 arXiv preprint from CUHK / StepFun / Anuttacon authors.	Simulator	Discrete-event simulator aimed at next-generation LLM inference systems, especially disaggregated and MoE serving. Treat as the earlier Frontier reference.
Frontier v2, NetX-lab repo	2026-05 arXiv preprint; current revision with public GitHub repository.	Simulator	Broader Frontier reference for comprehensive LLM inference simulation, including disaggregated, MoE, stateful, agentic, scheduling, and post-training reconfiguration scenarios. Prefer v2 when only one Frontier link is needed.
AIConfigurator	2026-01 arXiv preprint; NVIDIA-authored system.	Optimizer / performance-modeling system	Configuration optimizer across vLLM, SGLang, TensorRT-LLM, and Dynamo. Near-simulator because it uses fast performance modeling to choose serving configurations rather than only benchmark one stack.
Dooly	2026-05 arXiv preprint from UT Austin authors.	Profiling substrate for simulation	Configuration-agnostic, redundancy-aware profiling for LLM inference simulation. Relevant as a lower-maintenance way to feed simulator latency models without exhaustive profiling every configuration.
AgentServeSim	2026-06 arXiv preprint from UCF authors.	Simulator	Hardware-aware simulator for multi-turn LLM agent serving. Important because agent workloads are stateful event streams with turn dependencies, tool-induced gaps, reusable KV state, routing, and scheduling constraints.
GenZ, code	2024-06 arXiv preprint from Georgia Tech / Intel Labs authors; public analyzer.	Analytical platform analyzer	Lower-level analytical tool for relating LLM inference performance, latency, memory, and platform parameters. Use as hardware-platform context, not a full serving-engine simulator.
LLMCompass, Princeton repo	ISCA 2024; Princeton authors; DOI and public repository.	Hardware evaluation framework	Hardware evaluation framework for LLM inference workloads. Useful for architecture and accelerator studies; adjacent to serving simulators because it models inference hardware behavior rather than full production serving control.

Emulator Branch

LLM-Emu

LLM-Emu is the serving-native wall-clock emulator source. It runs the real vLLM online serving stack and replaces only GPU forward execution with profile-sampled latency plus synthetic output tokens. This means live HTTP traffic, admission, scheduling, KV-cache management, and output processing still exercise vLLM itself rather than a simulator-side reimplementation.

The reported fidelity is strong for steady-state metrics: TPOT and ITL stay within 4.8% absolute error, E2E latency within 5.3%, and output throughput within 1.9% across two GPUs, two model families, four model variants, two attention backends, Poisson arrivals, and bursty ShareGPT workloads. TTFT is less stable, with maximum error 10.41%, because it is especially sensitive to admission and queue state. Its research value is also architectural: the profile-sampled oracle and synthetic-token path are natural insertion points for learned latency, output-length, or workload models.
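The profile-sampled replacement for GPU forward execution can be pictured with a small stand-in. This is an illustrative sketch, not LLM-Emu's implementation: the `PROFILE` table, the nearest-batch-size lookup, and the `ProfileSampledForward` class are assumptions made here for the example.

```python
import random
import time

# Hypothetical profile: batch size -> measured forward latencies in seconds.
PROFILE = {1: [0.010, 0.011, 0.012], 8: [0.018, 0.020, 0.021]}


class ProfileSampledForward:
    """Stand-in for a GPU forward pass: sampled latency plus synthetic tokens."""

    def __init__(self, profile, seed=0, sleep=time.sleep):
        self.profile = profile
        self.rng = random.Random(seed)  # seeded for repeatable runs
        self.sleep = sleep              # injectable so tests can skip waiting

    def __call__(self, batch_size: int):
        # Use the nearest profiled batch size (an assumption of this sketch).
        key = min(self.profile, key=lambda b: abs(b - batch_size))
        latency = self.rng.choice(self.profile[key])
        self.sleep(latency)  # occupy wall-clock time like a real forward
        return latency, [0] * batch_size  # one synthetic token id per sequence


forward = ProfileSampledForward(PROFILE, sleep=lambda s: None)
print(forward(7))
```

The callable boundary is what makes the insertion point useful: anything that returns a latency and token ids for a batch could take the place of the sampled table.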

Revati

Revati is the time-warp serving emulator source. Instead of reimplementing vLLM or SGLang control logic, Revati runs the real serving framework code against a virtualized CUDA interface. GPU kernels are not executed; the emulator advances virtual time by predicted kernel durations and uses a distributed coordination protocol to preserve causality across processes.

It reports less than 5% prediction error and 5-17x faster-than-real-GPU execution across multiple models and parallelism configurations. The central design tradeoff is clear: Revati reduces simulator maintenance burden but introduces dependency on CUDA interposition, kernel-duration models, and virtual-time correctness. LLM-Emu takes the complementary online path: less accelerated and less CUDA-deep, but closer to wall-clock endpoint behavior.
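The virtual-time idea can be illustrated with a toy model in which intercepted kernel launches only advance a per-process clock. This is a sketch under assumptions, not Revati's coordination protocol: the `PREDICTED` durations and the `Worker` class are hypothetical, and the receive rule is a simple Lamport-style maximum used here to show why causality needs explicit handling.

```python
# Hypothetical predicted kernel durations in seconds.
PREDICTED = {"attention": 0.004, "mlp": 0.006}


class Worker:
    """One emulated process with its own virtual clock."""

    def __init__(self):
        self.now = 0.0

    def run_kernel(self, name: str) -> None:
        # No GPU work happens; virtual time jumps by the predicted duration.
        self.now += PREDICTED[name]

    def send(self) -> float:
        # Messages carry the sender's virtual timestamp.
        return self.now

    def recv(self, sent_at: float, link_delay: float = 0.0) -> None:
        # A receiver can never observe a message before it was sent.
        self.now = max(self.now, sent_at + link_delay)


a, b = Worker(), Worker()
a.run_kernel("attention")
a.run_kernel("mlp")
b.run_kernel("attention")
b.recv(a.send())        # b waits (virtually) for a's result
b.run_kernel("mlp")
print(a.now, b.now)
```

Since no process sleeps for the predicted duration, the emulation can run faster than real GPU execution, and its accuracy rests on the duration predictions and the clock-synchronization rule.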

Maya

Maya is not an inference-serving paper, but it belongs here as Revati’s training-side predecessor. It uses transparent device emulation to model deep-learning training workloads from unmodified code. The paper reports less than 5% prediction error and up to 56% training-cost reduction from identified configurations.

Maya’s relevance is the semantic-gap critique: if users must translate real workloads into custom specification languages, the evaluator can stop matching the actual framework behavior. Revati applies the same critique to LLM serving.

Forecast-Aware Serving Optimization

SageServe is the workload-forecasting and resource-control branch. It characterizes Microsoft Office 365 LLM serving workloads and proposes a multi-timescale controller for mixed interactive and non-interactive workloads. The controller combines short-term routing with longer-lead-time GPU VM scaling and model placement, using traffic forecasts and an ILP formulation.

The reported results are up to 25% GPU-hour savings, 80% reduction in GPU-hour wastage from inefficient autoscaling, and potential savings up to $2.5M per month while maintaining tail latency and SLAs.
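The role of forecasts in the longer-lead-time scaling decision can be shown with a deliberately simple rule. This sketch replaces SageServe's ILP with a greedy per-step calculation, and every name and number in it (`plan_instances`, `capacity_rps`, `lead_steps`, `headroom`) is hypothetical.

```python
import math


def plan_instances(forecast_rps, capacity_rps, lead_steps, headroom=1.1):
    """Instances to request at each step so they are ready after the lead time.

    forecast_rps: forecast request rate per time step
    capacity_rps: request rate one instance can serve within SLA (assumed)
    lead_steps:   provisioning delay in steps, so step t sizes for t + lead
    headroom:     multiplicative safety margin on the forecast
    """
    last = len(forecast_rps) - 1
    plan = []
    for t in range(len(forecast_rps)):
        target = min(t + lead_steps, last)  # the step this decision serves
        plan.append(math.ceil(forecast_rps[target] * headroom / capacity_rps))
    return plan


print(plan_instances([10, 20, 40, 40], capacity_rps=10, lead_steps=1))
```

A reactive scaler would size for the current step and arrive late to the ramp. Sizing for `t + lead_steps` is the minimal form of forecast awareness, with short-term routing left to absorb forecast error.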

Sparse Attention Kernel Branch

MiniMax Sparse Attention

MiniMax Sparse Attention belongs in this systems page because its claim is not only architectural sparsity. The paper co-designs the learned block selector with TopK and sparse-attention kernels, using exp-free selection and KV-outer execution so selected KV blocks gather associated queries and keep tensor-core work dense enough to matter.

The reported 1M-context numbers are 28.4x lower per-token attention FLOPs, 14.2x prefill speedup, and 7.6x decode speedup versus dense GQA in the paper’s setup. The reading rule is the same as for simulators and emulators: sparse-attention wins are proxy evidence until reproduced under the actual serving stack, hardware generation, batching regime, KV-cache pressure, and workload mix.
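One plausible reading of exp-free selection is that picking the top-k blocks does not require a softmax, because softmax is monotonic and so preserves ranking. The sketch below shows only that property in plain Python. It is an assumption about the idea, not the paper's selector or kernel, and `select_blocks` and the example scores are hypothetical.

```python
import heapq
import math


def select_blocks(scores, k):
    """Indices of the k highest-scoring KV blocks, using raw scores only."""
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)  # sorted so selected blocks are visited in memory order


def softmax(xs):
    """Reference softmax, used here only to check that ranking is unchanged."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


block_scores = [0.1, 2.0, -1.0, 1.5, 0.3]  # one score per KV block
raw = select_blocks(block_scores, k=2)
normalized = select_blocks(softmax(block_scores), k=2)
print(raw, normalized)
```

Both calls select the same blocks, so the exponentials can be skipped at selection time. Whether that translates into end-to-end speedup still depends on the kernel and serving conditions named above.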

Comparison Table

Source	Branch	Evaluator / optimizer	Main knob	Reported fidelity or savings	Credibility note
Vidur 2024	Simulator	Profiling + predictive modeling + Vidur-Search	Deployment configuration, batching, scheduling, parallelism	`<9%` latency error; LLaMA2-70B search in one CPU-machine hour vs estimated 42K GPU hours	Older than one year, but official Microsoft repo and important baseline.
LLMServingSim 2024	Simulator	Iteration-level HW/SW co-simulation	Accelerator/compiler stack and serving iteration behavior	`<14.7%` GPU-serving error; `91.5x` faster simulation	IISWC 2024 KAIST source.
LLMServingSim2.0 2025	Simulator	Trace-driven performance modeling + operator profiler	Hardware extensibility plus routing/cache/scheduling interfaces	`1.9%` GPU-serving error; `18.5x` fewer LoC in TPU integration case	IEEE CAL 2025 KAIST source.
LLMServingSim 2.0 2026	Simulator	Unified runtime loop for heterogeneous/disaggregated serving	Batching, routing, offloading, memory, power, parallelism	`0.95%` average error; ~10 minute complex simulations	Current primary LLMServingSim reference; official repo lists ISPASS 2026.
LLM-Emu 2026	Emulator	Real vLLM online serving path + profile-sampled GPU-forward latency	Wall-clock online endpoint, scheduler/KV/output fidelity, profile oracle	TPOT/ITL within `4.8%`, E2E within `5.3%`, throughput within `1.9%`; TTFT max error `10.41%`	Current arXiv preprint with Apache-2.0 code; single-node and profile-matched validation.
Revati 2026	Emulator	Real vLLM/SGLang code + CUDA virtualization + time warp	Serving configuration under real framework control logic	`<5%` prediction error; `5--17x` faster than real GPU execution	Current arXiv preprint; author X context, not yet peer reviewed.
Maya 2025	Emulator	Training-code device emulation	Training configuration search	`<5%` prediction error; up to `56%` cost reduction	EuroSys 2026; training-side predecessor to Revati.
SageServe 2025	Serving control	Forecast-aware routing/scaling/model placement	Multi-region GPU VM allocation under SLAs	Up to `25%` GPU-hour savings; `80%` less autoscaling waste	ACM POMACS 2025; Microsoft O365 production workload study.
MiniMax Sparse Attention 2026	Sparse attention kernel	Learned block selector + GPU kernel co-design	1M-context attention compute and sparse prefill/decode	`28.4x` lower attention FLOPs, `14.2x` prefill speedup, `7.6x` decode speedup in paper setup	Current MiniMax technical report with MIT code and MiniMax-M3 release; needs independent serving-stack reproduction.

Learned / Hybrid Simulator Ingredients

Alex’s LLM-Emu note frames a research gap that is not fully covered by current systems: an end-to-end hybrid learned simulator for LLM serving. The deterministic loop or serving-native runtime path would remain explicit, while the uncertain parts are learned or statistically generated from traces.

Component	Candidate research links	Role in a learned/hybrid serving simulator
Base event loop or serving runtime	Vidur, LLMServingSim 2.0, MIST, TokenSim, Frontier, LLM-Emu	Preserve queues, batching, KV-cache state, prefill/decode, disaggregation, and scheduler decisions.
Latency / throughput surrogate	IBM ALA, Roofline-driven ML latency prediction, llm-d predicted-latency scheduling, MaverIQ, LENS	Predict per-request, per-batch, or per-configuration TTFT/TPOT/latency/throughput from model, hardware, batch, queue, KV-cache, and configuration features.
Workload generator	ServeGen	Generate realistic arrivals, prompt/output length distributions, multimodal/reasoning composition, and per-client patterns.
Output-length model	S3, SSJF, vLLM-LTR, Magnus, ForeLen	Predict decode cost or at least rank short versus long requests, which is essential for realistic queue and KV-cache evolution.
Uncertainty layer	ALA-style uncertainty, held-out profile checks, live canaries	Propagate profile and workload uncertainty into scheduling, scaling, and capacity-planning decisions.

The important distinction is research-operational: LLM-Emu is not the learned simulator itself, but it is a strong substrate because its profile-sampled GPU-forward latency and synthetic output-token interface can be replaced by learned predictors while keeping the serving stack real.
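The substitution argument can be sketched as an interface: the serving loop depends only on a latency oracle, so a fixed profile value and a learned predictor are interchangeable. Everything here is hypothetical, including the `LatencyOracle` protocol, the single `batch_tokens` feature, and the closed-form linear fit standing in for a learned surrogate.

```python
from typing import Protocol, Sequence


class LatencyOracle(Protocol):
    def predict(self, batch_tokens: int) -> float: ...


class FixedOracle:
    """Simplest profile-style oracle: one measured latency for every batch."""

    def __init__(self, latency_s: float):
        self.latency_s = latency_s

    def predict(self, batch_tokens: int) -> float:
        return self.latency_s


class LinearOracle:
    """Learned stand-in: least-squares fit of latency = a + b * batch_tokens."""

    def __init__(self, xs: Sequence[float], ys: Sequence[float]):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        self.b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        self.a = my - self.b * mx

    def predict(self, batch_tokens: int) -> float:
        return self.a + self.b * batch_tokens


def total_forward_time(oracle: LatencyOracle, batches: Sequence[int]) -> float:
    """The serving-side loop sees only the oracle interface."""
    return sum(oracle.predict(b) for b in batches)


learned = LinearOracle([100, 200, 300], [0.02, 0.03, 0.04])
print(total_forward_time(FixedOracle(0.03), [100, 200]))
print(total_forward_time(learned, [100, 200]))
```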

Simulator vs Emulator Reading Rule

A simulator is best when the goal is controlled, fast exploration of a modeled design space. It is only trustworthy when the modeled policies, kernels, workload distributions, and hardware abstractions match the real system being optimized.

An emulator is best when framework control logic changes quickly and should be executed directly. It is only trustworthy when device virtualization, API coverage, predicted durations, wall-clock or virtual-time semantics, and synthetic-output assumptions preserve the relevant behavior.

In both cases, the evaluator is a proxy world model of the serving system. Optimization results should be checked against held-out real deployments before claiming production savings.
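A held-out check of this kind can be as simple as comparing evaluator predictions with measurements from a real deployment and refusing to trust metrics that exceed a tolerance. The function names, metric keys, numbers, and the 5% tolerance below are illustrative assumptions, not values taken from any of the sources.

```python
def pct_error(predicted: float, measured: float) -> float:
    """Absolute percentage error of a prediction against a measurement."""
    return abs(predicted - measured) / measured * 100.0


def validate(predicted: dict, measured: dict, tolerance_pct: float) -> dict:
    """Return {metric: (error_pct, passed)} for every measured metric."""
    report = {}
    for metric, real in measured.items():
        err = pct_error(predicted[metric], real)
        report[metric] = (err, err <= tolerance_pct)
    return report


# Hypothetical numbers in seconds: evaluator output vs a held-out real run.
evaluator = {"tpot": 0.052, "ttft": 0.33}
real_run = {"tpot": 0.050, "ttft": 0.30}
report = validate(evaluator, real_run, tolerance_pct=5.0)
print(report)
```

Reporting per-metric results matters because fidelity can differ sharply across metrics, as with LLM-Emu's steady-state metrics versus TTFT.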

Relation To Foundation TSFM Agenda

GPU inference optimization is not itself a time-series foundation-model method, but it is operationally relevant to always-on stateful systems:

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	The sources expose serving-time compute, routing, batching, memory, parallelism, emulated GPU-forward latency, sparse attention kernels, output-length uncertainty, and scaling as first-class variables.	Needs direct TSFM serving studies with long observation histories, state updates, and action-conditioned forecasts.
Observability and event streams	adjacent	SageServe and the simulators treat request arrivals, queue state, latency, throughput, memory, and power as time-varying operational signals.	Need public traces and benchmarks that include richer event streams, interventions, exogenous variables, and incident outcomes.
Control and counterfactuals	adjacent	Simulators/emulators support what-if exploration over serving control inputs.	Need calibrated counterfactual validation against real deployments and explicit uncertainty.
Benchmark validity	warning	The simulator/emulator split shows how evaluator drift can make benchmark wins brittle.	Need held-out deployment checks, drift monitoring, and versioned evaluator contracts.

Open Questions

When should an optimizer use a simulator, a wall-clock emulator, a time-warp emulator, or a small real-GPU canary loop?
Can MSA-style sparse attention kernels keep their reported advantage once measured inside full serving stacks with batching, prefix caching, KV-cache pressure, and bursty multi-tenant workloads?
Can an agent safely optimize serving policies using simulator/emulator rewards without overfitting evaluator blind spots?
Can LLM-Emu-style serving-native emulation be combined with ServeGen-style workload generation, output-length predictors, and learned latency surrogates into a calibrated hybrid learned simulator?
Which serving state should be exposed as an event stream: request arrivals, KV-cache occupancy, queue state, power, network traffic, memory-tier movement, or tenant/application metadata?
How should uncertainty in forecast models and latency models propagate into routing, scaling, and model-placement decisions?
Can TSFM-style latent-state models improve workload forecasting or serving-control policies beyond hand-built forecast/ILP pipelines?
What is the right held-out validation protocol for disaggregated serving and heterogeneous accelerators, including profile drift, interconnect/data-movement assumptions, serving-framework feature coverage, and fast hardware turnover?
