Vidur: A Large-Scale Simulation Framework For LLM Inference

Source

Raw Markdown: paper_vidur-2024.md
PDF: paper_vidur-2024.pdf
Preprint: arXiv:2405.05465
Official code: microsoft/vidur. Local snapshots: papers/vidur-2024/source_repo_metadata.json and papers/vidur-2024/source_github_readme.md.

Status And Credibility

Vidur was posted to arXiv in 2024 (last revised 2024-05-21) and has an official GitHub repository under the microsoft organization. The repository describes the work as an MLSys 2024 paper and provides a maintained implementation of the simulator.

Although the paper is more than a year old, Vidur is included as an important baseline for two reasons: it is a named simulator in the current Revati and LLMServingSim comparison space, and Alex explicitly included both the paper and the official repository in this ingest batch.

Core Claim

Vidur addresses the cost of exploring LLM inference deployment configurations. Instead of running every candidate configuration on a GPU cluster, it simulates end-to-end inference performance using a combination of experimental profiling and predictive modeling.
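The "profile a few points, predict the rest" idea can be sketched as follows. This is a minimal illustration only: the profiled numbers are invented placeholders, the function names are hypothetical, and plain linear interpolation stands in for whatever predictive model Vidur actually trains.

```python
from bisect import bisect_left

# Hypothetical profiled points: (tokens in batch, measured operator runtime in ms).
# Placeholder values for illustration, not measurements from the paper.
PROFILED = [(128, 1.0), (256, 1.8), (512, 3.4), (1024, 6.6)]


def predict_runtime_ms(num_tokens, profiled=PROFILED):
    """Estimate an operator runtime by linear interpolation between profiled points."""
    xs = [x for x, _ in profiled]
    ys = [y for _, y in profiled]
    if not xs[0] <= num_tokens <= xs[-1]:
        # Refuse to extrapolate beyond what was profiled.
        raise ValueError("num_tokens is outside the profiled range")
    i = bisect_left(xs, num_tokens)
    if xs[i] == num_tokens:
        return ys[i]  # exact profiled point
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (num_tokens - x0) / (x1 - x0)


def simulate_iteration_ms(num_tokens, num_layers):
    """Toy end-to-end estimate: one predicted operator runtime per layer."""
    return num_layers * predict_runtime_ms(num_tokens)
```

For example, `predict_runtime_ms(384)` falls between the 256- and 512-token points, and `simulate_iteration_ms(384, 32)` scales that estimate by a layer count without touching a GPU.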

The system estimates metrics such as latency and throughput under different workloads and system knobs, including parallelization strategies, batching, scheduling, and deployment topology. Vidur-Search then uses the simulator to search for cost-effective configurations that satisfy application constraints.
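A constrained configuration search of this kind can be sketched as a loop over candidate configurations that calls a simulator instead of a cluster. Everything here is an assumption for illustration: `Config`, `toy_simulate`, the latency formula, and the throughput-per-dollar objective are hypothetical and are not Vidur-Search's API or algorithm.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    tensor_parallel: int
    num_replicas: int
    max_batch_size: int

    @property
    def num_gpus(self):
        return self.tensor_parallel * self.num_replicas


def toy_simulate(cfg):
    """Stand-in for a simulator call; the formulas are invented for this sketch."""
    latency_s = 0.2 + 0.01 * cfg.max_batch_size / cfg.tensor_parallel
    throughput_qps = cfg.num_replicas * cfg.max_batch_size / latency_s
    return latency_s, throughput_qps


def search(configs, simulate, slo_latency_s, gpu_price_per_hour):
    """Return the config with the best QPS per dollar-hour that meets the latency SLO."""
    best_cfg, best_score = None, float("-inf")
    for cfg in configs:
        latency_s, qps = simulate(cfg)
        if latency_s > slo_latency_s:
            continue  # violates the application constraint
        score = qps / (cfg.num_gpus * gpu_price_per_hour)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg


# Small hypothetical grid of system knobs.
GRID = [Config(tp, r, b) for tp, r, b in product((1, 2), (1, 2), (8, 32))]
best = search(GRID, toy_simulate, slo_latency_s=0.4, gpu_price_per_hour=1.0)
```

Because each candidate costs only a simulator call, the grid can be widened to many more knobs than would be affordable with real deployments; `search` returns `None` when no candidate meets the constraint.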

Evidence And Results

The arXiv abstract reports less than 9% inference-latency error across tested LLMs.
Vidur-Search reportedly finds a deployment configuration for LLaMA2-70B in about one hour on a CPU machine, compared with an estimated 42K GPU hours (about $218K) for deployment-based exploration.
The official repository positions Vidur for performance analysis, capacity planning, and rapid prototyping of scheduling or speculative-decoding ideas.
The paper reports that optimal deployment configuration is workload-specific; applying one trace’s optimum to another can create up to about 2x serving-cost overhead.
The README lists support for common LLaMA, CodeLLaMA, InternLM, and Qwen models across A100/H100/A40-style device and topology combinations.
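The figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes the latency error is an absolute relative difference against measured latency and that cost overhead is a simple ratio; both definitions are assumptions here, and the implied GPU-hour rate is derived from the rounded numbers rather than quoted from the paper.

```python
# Figures as reported in the abstract (rounded).
GPU_HOURS = 42_000
DOLLARS = 218_000

# Derived value: dollars per GPU-hour implied by the two reported figures.
implied_rate = DOLLARS / GPU_HOURS  # roughly 5.19


def relative_error_pct(simulated, measured):
    """Absolute relative error of a simulated metric, in percent."""
    return 100 * abs(simulated - measured) / measured


def cost_overhead(cost_with_borrowed_config, cost_with_own_optimum):
    """Ratio of serving cost when reusing another trace's optimum vs. the trace's own."""
    return cost_with_borrowed_config / cost_with_own_optimum
```

Under these definitions, a simulated latency of 109 ms against a measured 100 ms sits right at the 9% bound, and a ratio of 2.0 from `cost_overhead` corresponds to the roughly 2x overhead reported for mismatched workloads.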

Why It Matters For GPU Inference Optimization

Vidur is the canonical simulator baseline on the GPU Inference Optimization page. It makes the economic case for simulation: deployment search can be too expensive to do directly on GPUs.

It also exposes the limitation that motivates emulation systems such as Revati and LLM-Emu: once a simulator reimplements serving-engine behavior, it can lag behind fast-moving frameworks such as vLLM and SGLang. LLM-Emu applies that response to a wall-clock, online serving endpoint: it keeps vLLM’s serving path and swaps out only the GPU forward execution.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	Simulates deployment configurations, scheduling, batching, and parallelism for LLM inference.	Does not evaluate time-series model serving directly.
Benchmark validity	adjacent	Reports latency error against profiled deployments and makes configuration-search cost explicit.	Fidelity depends on profiling coverage and correct modeling of serving-engine behavior.
Control and counterfactuals	adjacent	Vidur-Search performs what-if exploration over deployment configurations.	Simulator actions are system knobs, not learned action-conditioned world-model rollouts.

Limitations And Gotchas

As a simulator, Vidur has a maintenance burden: new serving-engine features must be represented in the simulator to be evaluated accurately.
Profiling and predictive models can miss framework overheads, kernel changes, or workload regimes outside the training/calibration set.
The paper predates the current vLLM/SGLang and disaggregated-serving landscape, so use it as a baseline rather than as the sole current source.
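One practical guard against the calibration-coverage problem is to check how much of a workload falls outside the profiled range before trusting simulated numbers. The sketch below is a hypothetical helper, not part of Vidur; the range bounds and request lengths are made-up values.

```python
def out_of_coverage_fraction(request_lengths, profiled_min, profiled_max):
    """Fraction of requests whose token length lies outside the profiled range."""
    if not request_lengths:
        return 0.0
    outside = sum(
        1 for n in request_lengths if n < profiled_min or n > profiled_max
    )
    return outside / len(request_lengths)


# Hypothetical workload: token lengths per request.
workload = [100, 500, 5000, 9000]
fraction = out_of_coverage_fraction(workload, profiled_min=128, profiled_max=4096)
```

A high fraction suggests the simulator is extrapolating for much of the workload, so its estimates deserve extra scrutiny or additional profiling.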
