LLM-Emu
Summary
LLM-Emu is a serving-native, wall-clock emulator for vLLM. It keeps the real vLLM HTTP path, admission logic, scheduler, KV-cache management, and output pipeline, but replaces GPU forward execution with latency sampled from an offline profile pack and synthetic output tokens.
Interface
- Serving framework: vLLM 0.18.1.
- Runtime boundary: GPU worker / executor step.
- Profile key: total tokens in the step plus request concurrency, split into decode-only and prefill/mixed buckets.
- Runtime output: timer-resolved Future plus synthetic token IDs.
- Public artifact: AKafakA/llm-emu, Apache-2.0.
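
The interface above can be illustrated with a minimal sketch. This is not LLM-Emu's code: the profile layout, the nearest-bucket lookup, and the names `PROFILE`, `sample_latency`, and `emulate_step` are assumptions made for illustration. It shows a step latency drawn from a profile keyed by bucket kind, total tokens, and concurrency, and a Future that a wall-clock timer resolves with synthetic token IDs.

```python
import random
import threading
from concurrent.futures import Future

# Hypothetical profile pack (illustrative values, not measured data):
# (bucket kind, total tokens in step, request concurrency) -> latency samples in seconds.
PROFILE = {
    ("decode", 8, 8): [0.010, 0.011, 0.012],
    ("mixed", 512, 4): [0.040, 0.045],
}


def bucket_kind(num_prefill_tokens):
    """Decode-only steps have no prefill tokens; everything else is prefill/mixed."""
    return "decode" if num_prefill_tokens == 0 else "mixed"


def sample_latency(profile, kind, total_tokens, concurrency, rng):
    """Pick the nearest profiled key of the same kind and sample one latency.

    Nearest-key matching is an assumption of this sketch.
    """
    candidates = [k for k in profile if k[0] == kind]
    if not candidates:
        raise KeyError(f"no profile entries for bucket kind {kind!r}")
    key = min(
        candidates,
        key=lambda k: (abs(k[1] - total_tokens), abs(k[2] - concurrency)),
    )
    return rng.choice(profile[key])


def emulate_step(profile, num_prefill_tokens, num_decode_tokens,
                 concurrency, rng, vocab_size=32000):
    """Stand in for a GPU forward step: return a Future that a timer resolves."""
    total_tokens = num_prefill_tokens + num_decode_tokens
    kind = bucket_kind(num_prefill_tokens)
    latency = sample_latency(profile, kind, total_tokens, concurrency, rng)

    # One synthetic token ID per running request.
    tokens = [rng.randrange(vocab_size) for _ in range(concurrency)]

    future = Future()
    timer = threading.Timer(latency, future.set_result, args=(tokens,))
    timer.daemon = True  # do not keep the process alive for pending steps
    timer.start()
    return future, latency


if __name__ == "__main__":
    rng = random.Random(0)
    fut, latency = emulate_step(PROFILE, 0, 8, 8, rng)
    print(f"sampled latency: {latency:.3f}s, tokens: {fut.result(timeout=2)}")
```

Because the Future resolves after real elapsed time, the caller (in the real system, the serving stack above the executor boundary) observes a wall-clock delay rather than an advance of virtual time.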
Role In The Wiki
LLM-Emu belongs to the emulator branch of GPU Inference Optimization. It complements simulator sources such as Vidur and LLMServingSim 2.0, and it contrasts with Revati: Revati virtualizes CUDA and advances virtual time, while LLM-Emu preserves a wall-clock online vLLM endpoint and swaps only GPU forward execution.
For a future learned/hybrid LLM-serving simulator, LLM-Emu is mainly an engineering substrate. Its profile-sampled latency oracle, synthetic output-token path, and workload/benchmark interface are the places where learned latency predictors, output-length predictors, and statistical workload generators could be inserted.
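
One way to picture the insertion point for a learned latency predictor is a small interface that both a profile sampler and a model can satisfy. The sketch below is hypothetical: `LatencyOracle`, `ProfileSampledOracle`, `LinearOracle`, and `step_latency` are illustrative names rather than LLM-Emu classes, and the linear model is only a stand-in for whatever learned predictor might be used.

```python
import random
from typing import Dict, Protocol, Sequence


class LatencyOracle(Protocol):
    """Anything that maps step features to a latency in seconds."""

    def predict(self, total_tokens: int, concurrency: int,
                decode_only: bool) -> float: ...


class ProfileSampledOracle:
    """Samples from offline measurements, split by decode-only vs. prefill/mixed."""

    def __init__(self, samples: Dict[bool, Sequence[float]], seed: int = 0):
        self._samples = samples
        self._rng = random.Random(seed)

    def predict(self, total_tokens: int, concurrency: int,
                decode_only: bool) -> float:
        # This simplified sampler ignores token count and concurrency.
        return self._rng.choice(list(self._samples[decode_only]))


class LinearOracle:
    """Stand-in for a learned predictor: a fixed linear model of the step features."""

    def __init__(self, base: float, per_token: float, per_request: float):
        self.base = base
        self.per_token = per_token
        self.per_request = per_request

    def predict(self, total_tokens: int, concurrency: int,
                decode_only: bool) -> float:
        return (self.base
                + self.per_token * total_tokens
                + self.per_request * concurrency)


def step_latency(oracle: LatencyOracle, total_tokens: int,
                 concurrency: int, decode_only: bool) -> float:
    """The emulated step only depends on the oracle interface, not its implementation."""
    return oracle.predict(total_tokens, concurrency, decode_only)


if __name__ == "__main__":
    sampled = ProfileSampledOracle({True: [0.010, 0.012], False: [0.040, 0.045]})
    learned = LinearOracle(base=0.005, per_token=1e-5, per_request=1e-4)
    print(step_latency(sampled, 8, 8, True))
    print(step_latency(learned, 1000, 10, False))
```

With this shape, swapping the sampled oracle for a learned one changes a single constructor call while the rest of the emulated step stays the same.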