LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale

Source

Raw Markdown: paper_llmservingsim-2024.md
PDF: paper_llmservingsim-2024.pdf
Preprint: arXiv:2408.05499
Related DOI: 10.1109/IISWC63097.2024.00012
Official code: casys-kaist/LLMServingSim. The repository snapshot is stored with the 2026 successor source as papers/llmservingsim-2-0-2026/source_repo_metadata.json and papers/llmservingsim-2-0-2026/source_github_readme.md.

Status And Credibility

LLMServingSim was submitted to arXiv on 2024-08-10 and has an IISWC 2024 venue reference with an IEEE DOI. The paper comes from KAIST and is backed by an official public simulator repository. It is more than a year old and has been directly superseded by the 2025 and 2026 LLMServingSim 2.0 papers, but it remains the baseline source for the LLMServingSim line.

Use it as historical and baseline evidence for hardware/software co-simulation of LLM serving, not as the final state of the LLMServingSim system.

Core Claim

The paper argues that LLM serving simulators need to capture the dynamic, autoregressive nature of generation without paying the cost of fully repeated accelerator simulation.

LLMServingSim makes two central moves:

simulate LLM serving at iteration granularity so prefill/decode dynamics and scheduling behavior are visible;
reuse simulation results across decoder blocks and iterations to avoid redundant accelerator simulations.
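The reuse idea can be illustrated with a small memoization sketch. This is not the paper's implementation: `slow_block_sim`, its toy cost formula, and the `(phase, batch_size, seq_len)` cache key are all assumptions chosen for illustration. The sketch only shows that when every decoder block in an iteration has the same shape, one simulated block can stand in for all of them, and that a later iteration with the same shape needs no new simulation at all.

```python
from functools import lru_cache

# Counts how often the (hypothetical) expensive simulator is actually invoked.
CALLS = {"count": 0}

def slow_block_sim(phase, batch_size, seq_len):
    """Hypothetical stand-in for one expensive accelerator simulation
    of a single decoder block. The cost formula is a toy, not a real model."""
    CALLS["count"] += 1
    tokens = batch_size * seq_len if phase == "prefill" else batch_size
    return 1e-6 * tokens + 1e-8 * batch_size * seq_len

@lru_cache(maxsize=None)
def block_latency(phase, batch_size, seq_len):
    # Results are cached by shape, so a repeated shape costs no new simulation.
    return slow_block_sim(phase, batch_size, seq_len)

def iteration_latency(phase, batch_size, seq_len, num_blocks):
    # Assumption: all decoder blocks in an iteration share one shape,
    # so a single simulated block is reused num_blocks times.
    return num_blocks * block_latency(phase, batch_size, seq_len)
```

With 32 blocks, calling `iteration_latency("decode", 4, 128, 32)` twice triggers a single call to `slow_block_sim`; only a new shape, such as a prefill iteration, triggers another. In a real decode loop the context length grows each step, so how shapes are keyed or bucketed determines how much reuse is available; that choice is left open here.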

It also exposes a plug-in path for accelerator compiler-and-simulation stacks, making it a hardware/software co-design tool rather than only a queueing simulator.

Evidence And Results

The paper reports less than 14.7% error against a real GPU-based LLM serving system.
It reports 91.5x faster simulation speed compared with existing accelerator simulators.
The simulator is designed around LLM-specific behavior: autoregressive iteration, dynamic request arrivals, layer-specific processing, and decoder-block redundancy.
The official repository later became the home of the broader LLMServingSim 2.0 line.

Why It Matters For GPU Inference Optimization

LLMServingSim is the simulator branch of GPU Inference Optimization. It asks how to evaluate serving designs, accelerator choices, and scheduling decisions before spending real GPU time.

Its most transferable idea is that serving simulation should preserve the event-stream structure of online inference: requests arrive over time, batching changes over time, and each decode step changes the active workload. That makes static per-kernel profiling insufficient on its own.
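A minimal sketch of that event-stream structure, assuming a simple iteration-level scheduler with a fixed batch cap: requests become visible at a given iteration, are admitted when there is room, and leave once they have produced their tokens. The `Request` fields, the `run` function, and the `max_batch` policy are hypothetical and are not taken from LLMServingSim.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    arrival_iter: int   # iteration at which the request becomes visible
    output_len: int     # tokens to generate before the request finishes
    generated: int = 0

def run(requests, max_batch=2):
    """Return a trace of (iteration, [request ids in the batch])."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_iter))
    active, trace, it = [], [], 0
    while pending or active:
        # Admit requests that have arrived, as long as the batch has room.
        while pending and pending[0].arrival_iter <= it and len(active) < max_batch:
            active.append(pending.popleft())
        # Each active request produces one token in this iteration.
        for r in active:
            r.generated += 1
        trace.append((it, [r.rid for r in active]))
        # Finished requests leave, changing the workload of the next iteration.
        active = [r for r in active if r.generated < r.output_len]
        it += 1
    return trace
```

For three requests arriving at iterations 0, 1, and 1 with output lengths 3, 1, and 2, the trace is `[(0, [0]), (1, [0, 1]), (2, [0, 2]), (3, [2])]`: the batch composition differs at every iteration, which is the behavior a static per-kernel profile cannot express on its own.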

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	Models serving-time batching, iteration structure, and accelerator behavior under dynamic LLM workloads.	Does not study time-series foundation model architecture or deployment directly.
Observability and event streams	adjacent	Treats LLM request arrivals and decode iterations as the relevant serving event stream.	Needs broader workload traces and direct links to operational time-series modeling.
Control and counterfactuals	adjacent	Supports what-if exploration over system design choices and hardware/software configurations.	Simulator fidelity and supported serving policies bound the validity of counterfactual conclusions.

Limitations And Gotchas

The 2024 version focuses on the first-generation co-simulation abstraction; later 2.0 work expands heterogeneous hardware, disaggregation, routing, caching, scheduling, memory, and power coverage.
A simulator can reduce exploration cost, but it still needs accurate profiles, workload models, and explicit support for the serving features being tested.
The maintenance burden of reimplementing serving logic is the key limitation later highlighted by Revati.
