Vidur: A Large-Scale Simulation Framework For LLM Inference
Source
- Raw Markdown: paper_vidur-2024.md
- PDF: paper_vidur-2024.pdf
- Preprint: arXiv:2405.05465
- Official code: microsoft/vidur. Local snapshots:
papers/vidur-2024/source_repo_metadata.json and papers/vidur-2024/source_github_readme.md.
Status And Credibility
Vidur is a 2024 arXiv preprint, last revised on 2024-05-21, with an official repository under Microsoft's GitHub organization. The repository describes it as an MLSys 2024 paper and provides a maintained simulator implementation.
Although it is older than one year, Vidur is included as an important baseline because it is a named simulator in the current Revati and LLMServingSim comparison space and because Alex explicitly included both the paper and official repository in this ingest batch.
Core Claim
Vidur addresses the cost of exploring LLM inference deployment configurations. Instead of running every candidate configuration on a GPU cluster, it simulates end-to-end inference performance using a combination of experimental profiling and predictive modeling.
The system estimates metrics such as latency and throughput under different workloads and system knobs, including parallelization strategies, batching, scheduling, and deployment topology. Vidur-Search then uses the simulator to search for cost-effective configurations that satisfy application constraints.
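The search idea described above can be illustrated with a minimal sketch. It is not the actual API of Vidur or Vidur-Search: `Config`, `Estimate`, `toy_simulator`, `search`, the latency and throughput formulas, and the $5 per GPU-hour price are all hypothetical stand-ins chosen for illustration. The sketch enumerates candidate configurations, asks a simulator-like function for predicted latency and throughput, and keeps the cheapest configuration that meets a latency constraint.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Config:
    tensor_parallel: int  # GPUs per replica (hypothetical knob)
    max_batch_size: int   # batching limit (hypothetical knob)


@dataclass(frozen=True)
class Estimate:
    p99_latency_s: float    # predicted tail latency in seconds
    qps_per_replica: float  # predicted sustainable queries per second


def toy_simulator(cfg: Config) -> Estimate:
    """Stand-in for a simulator call. The formulas are invented, not Vidur's."""
    latency = 2.0 / cfg.tensor_parallel + 0.01 * cfg.max_batch_size
    qps = 0.5 * cfg.max_batch_size * cfg.tensor_parallel ** 0.5
    return Estimate(p99_latency_s=latency, qps_per_replica=qps)


def cost_per_query(cfg: Config, est: Estimate, gpu_hour_usd: float) -> float:
    """GPU-hours consumed per query, priced at an assumed hourly rate."""
    gpu_hours_per_query = cfg.tensor_parallel / (est.qps_per_replica * 3600)
    return gpu_hours_per_query * gpu_hour_usd


def search(
    configs: Iterable[Config],
    simulate: Callable[[Config], Estimate],
    slo_s: float,
    gpu_hour_usd: float,
) -> Optional[Tuple[Config, float]]:
    """Return the cheapest (config, cost) whose predicted latency meets the SLO."""
    best: Optional[Tuple[Config, float]] = None
    for cfg in configs:
        est = simulate(cfg)
        if est.p99_latency_s > slo_s:
            continue  # violates the latency constraint
        cost = cost_per_query(cfg, est, gpu_hour_usd)
        if best is None or cost < best[1]:
            best = (cfg, cost)
    return best


candidates = [Config(tp, bs) for tp, bs in product((1, 2, 4, 8), (8, 32, 128))]
best = search(candidates, toy_simulator, slo_s=1.0, gpu_hour_usd=5.0)
print(best)
```

Because every candidate is evaluated by a function call rather than a GPU deployment, the loop runs in well under a second on a CPU. A real simulator would replace `toy_simulator` with profiled and predicted operator timings.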
Evidence And Results
- The arXiv abstract reports less than 9% inference-latency error across tested LLMs.
- Vidur-Search reportedly finds a deployment configuration for LLaMA2-70B in about one hour on a CPU machine, whereas deployment-based exploration would take an estimated 42K GPU hours at a cost of about $218K.
- The official repository positions Vidur for performance analysis, capacity planning, and rapid prototyping of scheduling or speculative-decoding ideas.
- The paper reports that the optimal deployment configuration is workload-specific: applying the optimum found for one trace to another can raise serving cost by up to about 2x.
- The README lists support for common LLaMA, CodeLLaMA, InternLM, and Qwen models across A100/H100/A40-style device and topology combinations.
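As a back-of-the-envelope reading of the figures in this list, the snippet below derives the GPU-hour price implied by the reported 42K GPU hours and roughly $218K, and shows one common way to compute a relative latency error. The relative-error definition and the sample latencies are assumptions for illustration; the paper's exact error metric is not restated here.

```python
GPU_HOURS = 42_000   # estimated deployment-based exploration effort (from the paper)
TOTAL_USD = 218_000  # approximate cost reported alongside it

# Price per GPU hour implied by the two reported figures.
implied_usd_per_gpu_hour = TOTAL_USD / GPU_HOURS


def relative_error(predicted_s: float, measured_s: float) -> float:
    """Absolute relative error of a predicted latency (assumed metric)."""
    return abs(predicted_s - measured_s) / measured_s


# Hypothetical latencies: a 1.08 s prediction against a 1.00 s measurement.
sample_error = relative_error(1.08, 1.00)

print(f"implied price: ${implied_usd_per_gpu_hour:.2f} per GPU hour")
print(f"sample relative error: {sample_error:.1%}")
```

On these two reported numbers the implied price is roughly $5.19 per GPU hour. The sample error of 8% is only an example of a value that would fall under the reported 9% bound.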
Why It Matters For GPU Inference Optimization
Vidur is the canonical simulator baseline on the GPU Inference Optimization page. It makes the economic case for simulation: deployment search can be too expensive to do directly on GPUs.
It also exposes the limitation that motivates emulation systems such as Revati: once a simulator reimplements serving-engine behavior, it can lag behind fast-moving frameworks such as vLLM and SGLang.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | Simulates deployment configurations, scheduling, batching, and parallelism for LLM inference. | Does not evaluate time-series model serving directly. |
| Benchmark validity | adjacent | Reports latency error against profiled deployments and makes configuration-search cost explicit. | Fidelity depends on profiling coverage and correct modeling of serving-engine behavior. |
| Control and counterfactuals | adjacent | Vidur-Search performs what-if exploration over deployment configurations. | Simulator actions are system knobs, not learned action-conditioned world-model rollouts. |
Limitations And Gotchas
- As a simulator, Vidur has a maintenance burden: new serving-engine features must be represented in the simulator to be evaluated accurately.
- Profiling and predictive models can miss framework overheads, kernel changes, or workload regimes outside the training/calibration set.
- The paper predates much of the current vLLM/SGLang and disaggregated-serving landscape, so use it as a baseline rather than as the sole current source.