LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Source

Status And Credibility

This is arXiv v2 from 2026-03-23, authored by Jaehong Cho, Hyunmin Choi, Guseul Heo, and Jongse Park at KAIST. The official repository lists the corresponding publication as ISPASS 2026 and links to an IEEE DOI and Zenodo artifact. The same repository also lists the IISWC 2024 and CAL 2025 predecessors.

Treat it as the current primary LLMServingSim reference because it extends the simulator's scope from heterogeneous hardware to heterogeneous and disaggregated LLM serving infrastructure.

Core Claim

LLMServingSim 2.0 argues that future LLM serving performance is dominated by runtime interactions among heterogeneous accelerators, disaggregated memory/computation, scheduling, data movement, interconnect behavior, cache policy, and power.

The simulator embeds serving decisions and hardware behavior in a single runtime loop. It supports profile-based integration of emerging accelerators and memory systems while modeling dynamic serving behavior.

The official repository describes the implementation as a cycle-level simulator with:

  • a Python frontend mirroring vLLM continuous batching;
  • an ASTRA-Sim C++ analytical network backend;
  • per-hardware latency data from a vLLM-based layerwise profiler;
  • support for heterogeneous accelerators, CPU/CXL/PIM memory tiers, MoE routing, and TP/PP/EP/DP parallelism.
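
The idea of one runtime loop that couples serving decisions with profiled hardware behavior can be illustrated with a toy sketch. This is not LLMServingSim code or its API: the `Request` fields, the `PROFILE_MS` table, the hardware name `gpu_a`, and every latency number are hypothetical, and the scheduler is a deliberately simplified stand-in for continuous batching.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical latency profile in milliseconds. A real profile would come from
# measurement on the target hardware; these numbers are invented.
PROFILE_MS = {
    "gpu_a": {"iter_base": 8.0, "prefill_per_token": 0.05, "decode_per_seq": 0.4},
}

@dataclass
class Request:
    rid: int
    prompt_tokens: int
    output_tokens: int
    generated: int = 0
    prefilled: bool = False
    finish_ms: float = 0.0

def simulate(requests, hardware="gpu_a", max_batch=4):
    """Toy iteration-level loop: schedule, look up latency, advance the clock."""
    prof = PROFILE_MS[hardware]
    waiting, running, done = deque(requests), [], []
    clock_ms = 0.0
    while waiting or running:
        # Serving decision: admit waiting requests while the batch has room.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # Hardware behavior: cost of this iteration from the profile table.
        step_ms = prof["iter_base"]
        for r in running:
            if r.prefilled:
                step_ms += prof["decode_per_seq"]
            else:
                step_ms += prof["prefill_per_token"] * r.prompt_tokens
        clock_ms += step_ms
        # State update: every running request emits one token per iteration.
        for r in running:
            r.prefilled = True
            r.generated += 1
            if r.generated >= r.output_tokens:
                r.finish_ms = clock_ms
                done.append(r)
        running = [r for r in running if r.generated < r.output_tokens]
    return clock_ms, done

total_ms, finished = simulate([Request(0, 100, 2), Request(1, 200, 3)])
print(f"makespan: {total_ms:.1f} ms")
for r in finished:
    print(f"request {r.rid} finished at {r.finish_ms:.1f} ms")
```

Because the scheduler and the latency lookup share one clock, a change to either the batching policy or the profile table shifts every downstream completion time, which is the kind of interaction the paper argues must be modeled jointly.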

Evidence And Results

  • The paper reports an average error of 0.95% for key performance, memory, and power metrics against real deployments.
  • It reports simulation times of around 10 minutes even for complex configurations.
  • The comparison table in the paper positions the system as covering prefill/decode disaggregation, accelerator flexibility, heterogeneous topologies, multiple parallelism modes, profiling accuracy, profile collection, execution/offload, and power modeling more comprehensively than prior simulators.
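
As a reading aid for the accuracy figure, here is a minimal sketch of how an average percentage error between simulated and measured metrics could be computed. It assumes the error is a mean absolute percentage error; this note does not record the paper's exact definition, and the function name and sample values are hypothetical.

```python
def mean_abs_pct_error(simulated, measured):
    """Mean absolute percentage error of simulated values against measurements."""
    if len(simulated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for sim, real in zip(simulated, measured):
        if real == 0:
            raise ValueError("measured values must be non-zero")
        total += abs(sim - real) / abs(real)
    return 100.0 * total / len(measured)

# Invented values: simulated vs. measured throughput for two configurations.
print(mean_abs_pct_error([101.0, 198.0], [100.0, 200.0]))
```

An average like this can hide large errors on individual metrics or configurations, so per-metric errors are worth checking alongside the headline number.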

Why It Matters For GPU Inference Optimization

This is the most complete simulator in the current ingest batch for GPU Inference Optimization. It moves beyond “how fast is this model on this GPU” toward hardware/software co-design for serving infrastructure where memory tiers, accelerators, network topology, scheduling, routing, and model parallelism interact.

For an agentic optimization loop, LLMServingSim 2.0 can be read as a candidate world model of serving infrastructure: it predicts how future system state changes under candidate control inputs such as scheduling, routing, parallelism, cache, offload, and hardware choices. The caveat is that it is still a simulator whose action-conditioned predictions are bounded by profiling and model assumptions.
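
A what-if sweep over candidate control inputs might look like the sketch below. The `toy_predict` function is an invented stand-in for a simulator call, not the LLMServingSim interface, and the parameter names (`tp`, `max_batch`, `offload`) and cost terms are assumptions chosen only to make the search loop concrete.

```python
from itertools import product

def toy_predict(tp, max_batch, offload):
    """Stand-in for a simulator run; the cost formula is invented for illustration."""
    compute_ms = 80.0 / tp                 # more tensor parallelism, less compute per device
    comm_ms = 3.0 * (tp - 1)               # but more collective communication
    queue_ms = 40.0 / max_batch            # larger batches drain the queue faster
    offload_ms = 6.0 if offload else 0.0   # offloading adds data movement
    return compute_ms + comm_ms + queue_ms + offload_ms

def sweep(predict, space):
    """Evaluate every combination of control inputs and rank by predicted latency."""
    keys = list(space)
    results = []
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((predict(**config), config))
    return sorted(results, key=lambda item: item[0])

SPACE = {"tp": [1, 2, 4, 8], "max_batch": [4, 8, 16], "offload": [False, True]}
ranked = sweep(toy_predict, SPACE)
best_ms, best_config = ranked[0]
print(best_ms, best_config)
```

In an agentic loop, the ranked list would only nominate candidates; the top few configurations still need confirmation on real hardware, since the ranking is only as good as the predictor behind it.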

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | Models runtime interactions among batching, routing, offloading, memory, power, and parallelism. | Evidence is for LLM serving infrastructure, not TSFM serving. |
| Control and counterfactuals | adjacent | Enables systematic what-if exploration over hardware/software serving decisions. | Fidelity for unseen accelerators and serving frameworks depends on profile quality and model coverage. |
| Observability and event streams | adjacent | Serving requests, scheduler state, memory movement, and power metrics are modeled as dynamic system state. | Does not provide a general learned world model over operational telemetry. |
| Benchmark validity | warning | Reports low average error, but simulator-based optimization can overfit missing mechanisms. | Needs continuous validation as kernels, engines, workloads, and hardware evolve. |

Limitations And Gotchas

  • The simulator is only as useful as its hardware profiles, workload traces, and modeled serving features.
  • Disaggregated serving introduces data movement and interconnect assumptions that can dominate latency; those assumptions must be checked before trusting optimization results.
  • It complements, rather than replaces, real deployment experiments and transparent emulators such as Revati.
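
The data-movement caveat above can be sanity-checked with a back-of-envelope estimate of KV cache transfer time between a prefill node and a decode node. The sketch uses the standard KV cache size formula for attention models (keys plus values, per layer, per KV head). The model shape, link bandwidth, and link latency are illustrative assumptions, not values from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

def transfer_ms(num_bytes, bandwidth_gb_per_s, latency_ms=0.0):
    """First-order transfer time: fixed link latency plus size over bandwidth."""
    seconds = num_bytes / (bandwidth_gb_per_s * 1e9)
    return latency_ms + seconds * 1e3

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16, 1000-token prompt.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1000)
# Assumed link: 50 GB/s with 0.01 ms latency.
print(size, transfer_ms(size, bandwidth_gb_per_s=50.0, latency_ms=0.01))
```

If an estimate like this is already comparable to a decode iteration, the interconnect assumptions deserve scrutiny before any simulated optimization result is trusted.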