LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure

Source

Raw Markdown: paper_llmservingsim-2-0-2025.md
PDF: paper_llmservingsim-2-0-2025.pdf
Preprint: arXiv:2511.07229
Related DOI: 10.1109/LCA.2025.3628325
Official code: casys-kaist/LLMServingSim. A local repository snapshot is stored with the 2026 successor source.

Status And Credibility

This is a short IEEE Computer Architecture Letters 2025 paper, posted to arXiv on 2025-11-10. It is authored by Jaehong Cho, Hyunmin Choi, and Jongse Park from KAIST and is part of the same official LLMServingSim repository lineage as the IISWC 2024 and ISPASS 2026 versions.

Treat it as a credible intermediate systems source that records the 2.0 shift toward trace-driven hardware extensibility and broader serving-technique coverage. The later 2026 paper is the more complete reference for heterogeneous and disaggregated serving infrastructure.

Core Claim

LLMServingSim2.0 targets two gaps in first-generation serving simulators:

integrating new hardware models into a system-level simulator is too manual;
existing simulators cover too narrow a set of modern serving techniques.

The paper introduces trace-driven performance modeling plus an operator-level latency profiler. This lets users integrate new accelerators with much less glue code while still exposing policy interfaces for request routing, cache management, and scheduling.
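As a rough illustration of what trace-driven, operator-level latency modeling can look like, the sketch below looks up per-operator latencies from a profiled table and interpolates between profiled sizes. It is a minimal sketch under stated assumptions: the table layout, the operator names, the function names, and every number are hypothetical and are not taken from the paper or from the LLMServingSim code.

```python
from bisect import bisect_left

# Hypothetical profile: operator name -> sorted (num_tokens, latency_ms) samples.
# All values are placeholders for illustration, not measurements from the paper.
PROFILE = {
    "attention": [(128, 0.40), (256, 0.80), (512, 1.60)],
    "mlp": [(128, 0.30), (256, 0.50), (512, 0.90)],
}


def op_latency_ms(profile, op, num_tokens):
    """Return an operator's latency, interpolating linearly between samples."""
    samples = profile[op]
    sizes = [size for size, _ in samples]
    i = bisect_left(sizes, num_tokens)
    if i == 0:
        return samples[0][1]   # clamp at or below the smallest profiled size
    if i == len(samples):
        return samples[-1][1]  # clamp above the largest profiled size
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)


def layer_latency_ms(profile, ops, num_tokens):
    """Sum the profiled latencies of the operators that make up one layer."""
    return sum(op_latency_ms(profile, op, num_tokens) for op in ops)
```

Under this assumed design, supporting a new accelerator means supplying a new profile table rather than writing integration code, which is the kind of saving the paper's lines-of-code comparison points at.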

Evidence And Results

The TPU case study reports 18.5x fewer lines of code relative to the predecessor’s hardware-simulator integration path.
GPU-based LLM serving is reproduced with 1.9% error in the reported experiments.
The authors report that simulation time remains practical even as hardware and serving-technique coverage broadens.
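To make the reproduction-error figure concrete, here is one common way such a number could be computed: the mean absolute percentage error between simulated and measured values. This is a sketch under an assumption, since this note does not record which error metric the paper uses. The function name and the sample values are hypothetical.

```python
def mean_abs_pct_error(simulated, measured):
    """Mean absolute percentage error of simulated values against measurements."""
    if len(simulated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(s - m) / m for s, m in zip(simulated, measured))
    return 100.0 * total / len(measured)


# Hypothetical throughput numbers (tokens/s), for illustration only.
measured = [100.0, 200.0, 400.0]
simulated = [102.0, 196.0, 404.0]
error_pct = mean_abs_pct_error(simulated, measured)
```

An averaged metric like this can hide large errors on individual configurations, so per-case errors are worth checking when comparing simulators.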

Why It Matters For GPU Inference Optimization

This paper bridges the HW/SW co-simulation idea of the 2024 LLMServingSim paper (IISWC) and the more complete heterogeneous/disaggregated simulator described in the 2026 LLMServingSim 2.0 paper (ISPASS).

Its key contribution for GPU Inference Optimization is the profile-driven abstraction: instead of hand-integrating every accelerator in detail, collect operator-level latency profiles and plug them into a system-level runtime model that still sees routing, caching, and scheduling decisions.
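The idea of a system-level runtime model that consumes hardware latency profiles while still seeing policy decisions can be sketched as a toy what-if experiment. The code below is an assumption-laden illustration, not the simulator's interface: the `Router` signature, the policy functions, and the per-token costs are all hypothetical, and it ignores batching, arrival times, and caching.

```python
from typing import Callable, List

# Hypothetical policy interface: given current per-instance load (ms) and a
# request id, return the index of the instance that should serve the request.
Router = Callable[[List[float], int], int]


def round_robin(loads: List[float], request_id: int) -> int:
    return request_id % len(loads)


def least_loaded(loads: List[float], request_id: int) -> int:
    # Ties resolve to the lowest instance index.
    return min(range(len(loads)), key=loads.__getitem__)


def simulate(requests: List[int], ms_per_token: List[float], router: Router) -> float:
    """Return the makespan (ms) when all requests are queued at time zero.

    ms_per_token stands in for profile-derived per-instance cost; in a real
    setup it would come from operator-level latency profiles.
    """
    loads = [0.0] * len(ms_per_token)
    for request_id, tokens in enumerate(requests):
        i = router(loads, request_id)
        loads[i] += tokens * ms_per_token[i]
    return max(loads)


# Two heterogeneous instances (placeholder costs) and four equal requests.
costs = [1.0, 2.0]
reqs = [100, 100, 100, 100]
rr_makespan = simulate(reqs, costs, round_robin)
ll_makespan = simulate(reqs, costs, least_loaded)
```

Swapping the `router` argument changes the policy without touching the hardware costs, which is the separation the profile-driven abstraction is meant to provide.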

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	Models serving policy knobs and hardware latency profiles under LLM workloads.	Does not evaluate time-series model serving or stateful TSFM deployment directly.
Benchmark validity	adjacent	Explicitly reports reproduction error against GPU serving behavior.	Simulator fidelity depends on profile coverage and workload realism.
Control and counterfactuals	adjacent	Exposes routing, cache-management, and scheduling interfaces for what-if exploration.	Does not prove those policies transfer across all serving engines or hardware generations.

Limitations And Gotchas

This is a four-page CAL paper; use the 2026 paper for the fuller simulator description and disaggregation results.
Trace-driven profiles reduce integration effort but can become stale as kernels, frameworks, hardware, and serving features evolve.
Low reproduction error on the tested cases does not imply complete fidelity for unseen workload distributions or new accelerators.
