LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
Source
- Raw Markdown: paper_llmservingsim-2024.md
- PDF: paper_llmservingsim-2024.pdf
- Preprint: arXiv:2408.05499
- Related DOI: 10.1109/IISWC63097.2024.00012
- Official code: casys-kaist/LLMServingSim. The repository snapshot is stored with the 2026 successor source as papers/llmservingsim-2-0-2026/source_repo_metadata.json and papers/llmservingsim-2-0-2026/source_github_readme.md.
Status And Credibility
LLMServingSim was submitted to arXiv on 2024-08-10 and carries an IISWC 2024 venue reference with an IEEE DOI. The paper comes from KAIST and is backed by an official public simulator repository. Although it is more than a year old, it remains the baseline source for the LLMServingSim line; the 2025 and 2026 LLMServingSim 2.0 papers directly supersede it.
Use it as historical and baseline evidence for hardware/software co-simulation of LLM serving, not as the final state of the LLMServingSim system.
Core Claim
The paper argues that LLM serving simulators need to capture the dynamic, autoregressive nature of generation without paying the cost of fully repeated accelerator simulation.
LLMServingSim makes two central moves:
- simulate LLM serving at iteration granularity so prefill/decode dynamics and scheduling behavior are visible;
- reuse simulation results across decoder blocks and iterations to avoid redundant accelerator simulations.
It also exposes a plug-in path for accelerator compiler-and-simulation stacks, making it a hardware/software co-design tool rather than only a queueing simulator.
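The reuse idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: `simulate_block_us`, its cost model, and the cache key `(phase, batch_tokens)` are hypothetical stand-ins, and the sketch assumes every decoder block has the same cost for a given input shape. A real cache key would need to include anything else that changes latency, such as the context length seen by attention.

```python
from functools import lru_cache

# Counts how often the stand-in accelerator simulation actually runs.
SIM_CALLS = {"count": 0}


@lru_cache(maxsize=None)
def simulate_block_us(phase: str, batch_tokens: int) -> float:
    """Hypothetical stand-in for one costly accelerator simulation of a single decoder block."""
    SIM_CALLS["count"] += 1
    per_token_us = 2.0 if phase == "prefill" else 5.0  # made-up cost model
    return 10.0 + per_token_us * batch_tokens


def iteration_latency_us(phase: str, batch_tokens: int, num_blocks: int) -> float:
    """Latency of one serving iteration, assuming all decoder blocks cost the same."""
    # Reuse across blocks: simulate one block and scale by the block count.
    # Reuse across iterations: lru_cache returns the stored result for a repeated key.
    return num_blocks * simulate_block_us(phase, batch_tokens)


def run_request_us(prompt_tokens: int, decode_steps: int,
                   batch_size: int, num_blocks: int) -> float:
    """One prefill iteration followed by decode iterations at a fixed batch size."""
    total = iteration_latency_us("prefill", prompt_tokens, num_blocks)
    for _ in range(decode_steps):
        total += iteration_latency_us("decode", batch_size, num_blocks)
    return total
```

With these made-up numbers, `run_request_us(128, 3, 4, 32)` covers four iterations over 32 blocks each, yet the stand-in simulator runs only twice: once for the prefill shape and once for the decode shape.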
Evidence And Results
- The paper reports less than 14.7% error against a real GPU-based LLM serving system.
- It reports simulation that is 91.5x faster than existing accelerator simulators.
- The simulator is designed around LLM-specific behavior: autoregressive iteration, dynamic request arrivals, layer-specific processing, and decoder-block redundancy.
- The official repository later became the home of the broader LLMServingSim 2.0 line.
Why It Matters For GPU Inference Optimization
LLMServingSim is the simulator branch of GPU Inference Optimization. It asks how to evaluate serving designs, accelerator choices, and scheduling decisions before spending real GPU time.
Its most transferable idea is that serving simulation should preserve the event-stream structure of online inference: requests arrive over time, batching changes over time, and each decode step changes the active workload. That makes static per-kernel profiling insufficient on its own.
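A toy iteration-level loop makes the event-stream point concrete. This is a sketch under stated assumptions, not LLMServingSim's scheduler: the `Request` fields, the `iteration_cost_us` model, and the admission policy (admit any arrived request while the batch has room) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    arrival_us: float
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    prefilled: bool = False


def iteration_cost_us(prefill_tokens: int, decode_requests: int) -> float:
    """Made-up per-iteration cost: fixed overhead plus prefill and decode work."""
    return 50.0 + 1.0 * prefill_tokens + 4.0 * decode_requests


def simulate(requests, max_batch: int = 8):
    """Return (trace, finish): per-iteration (start time, batch size) and finish times."""
    pending = sorted(requests, key=lambda r: r.arrival_us)
    running, trace, finish = [], [], {}
    clock = 0.0
    while pending or running:
        # Admit requests that have arrived by now, up to the batch limit.
        while pending and pending[0].arrival_us <= clock and len(running) < max_batch:
            running.append(pending.pop(0))
        if not running:
            clock = pending[0].arrival_us  # idle: jump to the next arrival
            continue
        prefill_tokens = sum(r.prompt_tokens for r in running if not r.prefilled)
        decode_requests = sum(1 for r in running if r.prefilled)
        trace.append((clock, len(running)))
        clock += iteration_cost_us(prefill_tokens, decode_requests)
        for r in running:
            # Each iteration yields one token per request; prefill yields the first.
            r.generated = r.generated + 1 if r.prefilled else 1
            r.prefilled = True
            if r.generated >= r.output_tokens:
                finish[r.rid] = clock
        running = [r for r in running if r.rid not in finish]
    return trace, finish
```

For two requests, `Request(0, 0.0, 10, 3)` and `Request(1, 100.0, 20, 2)`, the batch size per iteration is 1, 1, 2, 1: the second request joins mid-stream and shares one iteration with the first. A static per-kernel profile would not show that the cost of the third iteration depends on when the second request arrived.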
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | Models serving-time batching, iteration structure, and accelerator behavior under dynamic LLM workloads. | Does not study time-series foundation model architecture or deployment directly. |
| Observability and event streams | adjacent | Treats LLM request arrivals and decode iterations as the relevant serving event stream. | Needs broader workload traces and direct links to operational time-series modeling. |
| Control and counterfactuals | adjacent | Supports what-if exploration over system design choices and hardware/software configurations. | Simulator fidelity and supported serving policies bound the validity of counterfactual conclusions. |
Limitations And Gotchas
- The 2024 version focuses on the first-generation co-simulation abstraction; later 2.0 work expands heterogeneous hardware, disaggregation, routing, caching, scheduling, memory, and power coverage.
- A simulator can reduce exploration cost, but it still needs accurate profiles, workload models, and explicit support for the serving features being tested.
- The maintenance burden of reimplementing serving logic is the key limitation later highlighted by Revati.