LLM-Emu

Summary

LLM-Emu is a serving-native, wall-clock emulator for vLLM. It keeps the real vLLM HTTP path, admission logic, scheduler, KV-cache management, and output pipeline, but replaces GPU forward execution with two stand-ins: a step latency sampled from an offline profile pack, and synthetic output tokens.

Interface

  • Serving framework: vLLM 0.18.1.
  • Runtime boundary: GPU worker / executor step.
  • Profile key: total tokens in the step plus request concurrency, split into decode-only and prefill/mixed buckets.
  • Runtime output: timer-resolved Future plus synthetic token IDs.
  • Public artifact: AKafakA/llm-emu, Apache-2.0.
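
The profile key and runtime output listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the llm-emu code base: the PROFILE layout, the nearest-key lookup, the names sample_latency and emulated_step, and the use of threading.Timer with concurrent.futures.Future. It only shows the general shape of the idea: pick a bucket (decode-only or prefill/mixed), look up a latency by total tokens and concurrency, and return a Future that a timer resolves with synthetic token IDs.

```python
import random
import threading
import time
from concurrent.futures import Future

# Hypothetical profile pack: (bucket, total_tokens, concurrency) -> observed
# step latencies in seconds. Values are made up for illustration.
PROFILE = {
    ("decode", 8, 8): [0.010, 0.012],
    ("decode", 32, 32): [0.015, 0.018],
    ("prefill_mixed", 512, 4): [0.040, 0.045],
}


def sample_latency(total_tokens, concurrency, decode_only, rng):
    """Sample one latency from the nearest profiled key in the right bucket."""
    bucket = "decode" if decode_only else "prefill_mixed"
    keys = [k for k in PROFILE if k[0] == bucket]
    # Assumption: nearest neighbour on tokens first, then on concurrency.
    nearest = min(
        keys, key=lambda k: (abs(k[1] - total_tokens), abs(k[2] - concurrency))
    )
    return rng.choice(PROFILE[nearest])


def emulated_step(total_tokens, concurrency, decode_only, rng, vocab_size=32000):
    """Return a Future resolved by a wall-clock timer with synthetic token IDs."""
    latency = sample_latency(total_tokens, concurrency, decode_only, rng)
    # One synthetic token per running request stands in for the model output.
    token_ids = [rng.randrange(vocab_size) for _ in range(concurrency)]
    fut = Future()
    timer = threading.Timer(latency, fut.set_result, args=(token_ids,))
    timer.daemon = True
    timer.start()
    return fut


if __name__ == "__main__":
    rng = random.Random(0)
    start = time.monotonic()
    ids = emulated_step(10, 8, True, rng).result(timeout=2)
    print(len(ids), "tokens after", round(time.monotonic() - start, 3), "s")
```

Because the Future is resolved by a real timer rather than by advancing a virtual clock, the caller waits in wall-clock time, which is the property that lets the rest of the serving stack run unmodified.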

Role In The Wiki

LLM-Emu belongs to the emulator branch of GPU Inference Optimization. It complements simulator sources such as Vidur and LLMServingSim 2.0, and it contrasts with Revati: Revati virtualizes CUDA and advances virtual time, while LLM-Emu preserves a wall-clock online vLLM endpoint and swaps only GPU forward execution.

For a future learned/hybrid LLM-serving simulator, LLM-Emu is mainly an engineering substrate. Its profile-sampled latency oracle, synthetic output-token path, and workload/benchmark interface are the places where learned latency predictors, output-length predictors, and statistical workload generators could be inserted.
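
The insertion points named above can be pictured as small interfaces. The sketch below is hypothetical: StepShape, LatencyOracle, OutputLengthPredictor, and the toy implementations are names invented here, not part of LLM-Emu or vLLM. It only shows how a learned latency predictor and an output-length predictor could sit behind the same call sites that a profile-sampled oracle and a synthetic-token path occupy today. For brevity it sums latencies instead of sleeping, so it is not itself a wall-clock emulator.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class StepShape:
    """Hypothetical description of one engine step."""
    total_tokens: int
    concurrency: int
    decode_only: bool


class LatencyOracle(Protocol):
    def step_latency(self, shape: StepShape) -> float: ...


class OutputLengthPredictor(Protocol):
    def output_length(self, prompt_tokens: int) -> int: ...


class LinearLatencyOracle:
    """Toy stand-in for a learned predictor: base cost plus per-token cost."""

    def __init__(self, base_s: float, per_token_s: float) -> None:
        self.base_s = base_s
        self.per_token_s = per_token_s

    def step_latency(self, shape: StepShape) -> float:
        return self.base_s + self.per_token_s * shape.total_tokens


class FixedLengthPredictor:
    """Toy stand-in for an output-length model: always the same length."""

    def __init__(self, length: int) -> None:
        self.length = length

    def output_length(self, prompt_tokens: int) -> int:
        return self.length


def request_latency(
    oracle: LatencyOracle, predictor: OutputLengthPredictor, prompt_tokens: int
) -> float:
    """One prefill step, then one decode step per remaining output token."""
    n_out = predictor.output_length(prompt_tokens)
    total = oracle.step_latency(StepShape(prompt_tokens, 1, False))
    for _ in range(max(n_out - 1, 0)):
        total += oracle.step_latency(StepShape(1, 1, True))
    return total
```

Swapping LinearLatencyOracle for a profile-backed or learned implementation would leave request_latency unchanged, which is the sense in which the emulator acts as a substrate.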

Evidence