LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Source
- Raw Markdown: paper_llmservingsim-2-0-2025.md
- PDF: paper_llmservingsim-2-0-2025.pdf
- Preprint: arXiv:2511.07229
- Related DOI: 10.1109/LCA.2025.3628325
- Official code: casys-kaist/LLMServingSim. The local repository snapshot is stored with the 2026 successor source.
Status And Credibility
This is a short IEEE Computer Architecture Letters 2025 paper, posted to arXiv on 2025-11-10. It is authored by Jaehong Cho, Hyunmin Choi, and Jongse Park from KAIST and is part of the same official LLMServingSim repository lineage as the IISWC 2024 and ISPASS 2026 versions.
Treat it as a credible intermediate systems source that records the 2.0 shift toward trace-driven hardware extensibility and broader serving-technique coverage. The later 2026 paper is the more complete reference for heterogeneous and disaggregated serving infrastructure.
Core Claim
LLMServingSim2.0 targets two gaps in first-generation serving simulators:
- integrating new hardware models into a system-level simulator is too manual;
- existing simulators cover too narrow a set of modern serving techniques.
The paper introduces trace-driven performance modeling plus an operator-level latency profiler. This lets users integrate new accelerators with much less glue code while still exposing policy interfaces for request routing, cache management, and scheduling.
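To make the trace-driven idea concrete, here is a minimal sketch of an operator-level latency lookup. It is an illustration only, not the simulator's actual API: the `PROFILE` table, its latency values, and the functions `op_latency_ms` and `layer_latency_ms` are all hypothetical, and linear interpolation between profiled points is an assumption rather than something the paper is known to specify.

```python
from bisect import bisect_left

# Hypothetical profile: operator name -> sorted (num_tokens, latency_ms) samples.
# The values are invented for illustration; they are not measurements from the paper.
PROFILE = {
    "attention": [(128, 0.20), (256, 0.40), (512, 0.80)],
    "mlp": [(128, 0.10), (256, 0.20), (512, 0.40)],
}


def op_latency_ms(profile, op, num_tokens):
    """Estimate one operator's latency from profiled samples.

    Interpolates linearly between the two nearest profiled sizes and
    clamps to the first or last sample outside the profiled range.
    """
    samples = profile[op]
    sizes = [size for size, _ in samples]
    i = bisect_left(sizes, num_tokens)
    if i == 0:
        return samples[0][1]
    if i == len(samples):
        return samples[-1][1]
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)


def layer_latency_ms(profile, ops, num_tokens):
    """Sum per-operator estimates for a sequence of operators."""
    return sum(op_latency_ms(profile, op, num_tokens) for op in ops)
```

Under this sketch, supporting a new accelerator means supplying a new `PROFILE` table rather than writing a new hardware model, which is the kind of integration saving the paper describes.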
Evidence And Results
- The TPU case study reports 18.5x fewer lines of code relative to the predecessor’s hardware-simulator integration path.
- GPU-based LLM serving is reproduced with 1.9% error in the reported experiments.
- The paper reports that simulation time stays practical even as hardware and serving-technique coverage broadens.
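For readers who want to check a figure like the 1.9% error against their own runs, the sketch below shows one common way to compute a reproduction error. The exact metric the paper uses is not stated in these notes, so mean relative error over (simulated, measured) pairs is an assumption, and the function names are hypothetical.

```python
def relative_error_pct(simulated, measured):
    """Absolute relative error of one simulated value, in percent."""
    return abs(simulated - measured) / measured * 100.0


def mean_relative_error_pct(pairs):
    """Average relative error over (simulated, measured) pairs, in percent."""
    errors = [relative_error_pct(sim, meas) for sim, meas in pairs]
    return sum(errors) / len(errors)
```

The pairs could be, for example, simulated and measured throughput for the same workload and hardware configuration.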
Why It Matters For GPU Inference Optimization
This paper bridges the HW/SW co-simulation idea of the 2024 LLMServingSim paper and the more complete heterogeneous and disaggregated simulator described in the 2026 paper.
Its key contribution for GPU Inference Optimization is the profile-driven abstraction: instead of hand-integrating every accelerator in detail, collect operator-level latency profiles and plug them into a system-level runtime model that still sees routing, caching, and scheduling decisions.
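The separation between a latency profile and a system-level runtime can be sketched as a toy single-device loop. Everything here is hypothetical (the `Request` type, the `fcfs` and `shortest_first` policies, and the `simulate` function), and the real simulator models far more (batching, caching, multiple devices). The point is only that the runtime sees a latency callable and a policy callable, so either can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Request:
    rid: int
    arrival_ms: float
    num_tokens: int


# A policy picks the next request to run from the ready queue.
Policy = Callable[[List[Request]], Request]


def fcfs(queue: List[Request]) -> Request:
    """First come, first served; ties broken by request id."""
    return min(queue, key=lambda r: (r.arrival_ms, r.rid))


def shortest_first(queue: List[Request]) -> Request:
    """Run the request with the fewest tokens first; ties broken by request id."""
    return min(queue, key=lambda r: (r.num_tokens, r.rid))


def simulate(
    requests: List[Request],
    latency_ms: Callable[[int], float],
    policy: Policy,
) -> Dict[int, float]:
    """Run requests one at a time and return {rid: finish time in ms}."""
    pending = sorted(requests, key=lambda r: r.arrival_ms)
    clock = 0.0
    finish: Dict[int, float] = {}
    while pending:
        ready = [r for r in pending if r.arrival_ms <= clock]
        if not ready:
            # Idle until the next request arrives.
            clock = pending[0].arrival_ms
            continue
        chosen = policy(ready)
        pending.remove(chosen)
        clock += latency_ms(chosen.num_tokens)
        finish[chosen.rid] = clock
    return finish
```

A what-if comparison then amounts to calling `simulate` twice with the same requests and latency callable but different policies, or with the same policy and a latency callable built from a different hardware profile.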
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | Models serving policy knobs and hardware latency profiles under LLM workloads. | Does not evaluate time-series model serving or stateful TSFM deployment directly. |
| Benchmark validity | adjacent | Explicitly reports reproduction error against GPU serving behavior. | Simulator fidelity depends on profile coverage and workload realism. |
| Control and counterfactuals | adjacent | Exposes routing, cache-management, and scheduling interfaces for what-if exploration. | Does not prove those policies transfer across all serving engines or hardware generations. |
Limitations And Gotchas
- This is a four-page CAL paper; use the 2026 paper for the fuller simulator description and disaggregation results.
- Trace-driven profiles reduce integration effort but can become stale as kernels, frameworks, hardware, and serving features evolve.
- Low reproduction error on the tested cases does not imply complete fidelity for unseen workload distributions or new accelerators.