LLM-Emu

Summary

LLM-Emu is a serving-native, wall-clock emulator for vLLM. It keeps the real vLLM HTTP path, admission logic, scheduler, KV-cache management, and output pipeline, but replaces GPU forward execution: each step waits for a latency sampled from an offline profile pack and then emits synthetic output tokens.
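A minimal sketch of this substitution, assuming a flat list of profiled step latencies and uniformly random token IDs. `EmulatedForward`, `StepResult`, and `step` are hypothetical names, not LLM-Emu or vLLM APIs. The real emulator keys its latency samples by step shape (see Interface) rather than drawing from one flat list.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    # Hypothetical container: one synthetic token ID per scheduled request.
    token_ids: dict[str, int]
    latency_s: float


class EmulatedForward:
    """Illustrative stand-in for a GPU forward pass (not LLM-Emu's code).

    Instead of running a model, it blocks for a latency drawn from a list
    of profiled step latencies and returns random token IDs.
    """

    def __init__(self, profiled_latencies_s: list[float], vocab_size: int, seed: int = 0):
        if not profiled_latencies_s:
            raise ValueError("need at least one profiled latency sample")
        self.samples = profiled_latencies_s
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)

    def step(self, request_ids: list[str]) -> StepResult:
        # Empirical sampling: pick one measured latency uniformly at random.
        latency = self.rng.choice(self.samples)
        # Wall-clock emulation: the calling thread really blocks, so code
        # around the step observes real elapsed time, not virtual time.
        time.sleep(latency)
        # Synthetic output: one random token ID per request in the batch.
        tokens = {rid: self.rng.randrange(self.vocab_size) for rid in request_ids}
        return StepResult(token_ids=tokens, latency_s=latency)


if __name__ == "__main__":
    fwd = EmulatedForward([0.010, 0.012, 0.011], vocab_size=32000)
    print(fwd.step(["req-0", "req-1"]))
```

Because the delay is real sleep time rather than an advanced counter, everything outside the step experiences the same timing it would against a GPU, which is the property that distinguishes a wall-clock emulator from a virtual-time simulator.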

Interface

Serving framework: vLLM 0.18.1.
Runtime boundary: GPU worker / executor step.
Profile key: total tokens in the step plus request concurrency, split into decode-only and prefill/mixed buckets.
Runtime output: timer-resolved Future plus synthetic token IDs.
Public artifact: AKafakA/llm-emu, Apache-2.0.
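The profile key and the timer-resolved Future can be sketched as follows. This is an illustration under stated assumptions: the key tuple layout, rounding a step up to the next profiled token count, exact matching on concurrency, and the names `profile_key`, `ProfilePack`, and `submit_emulated_step` are hypothetical, not taken from LLM-Emu's code. The Future is Python's standard `concurrent.futures.Future`, resolved by a `threading.Timer`.

```python
import bisect
import random
import threading
from concurrent.futures import Future


def profile_key(num_prefill_tokens: int, num_decode_tokens: int, num_requests: int) -> tuple[str, int, int]:
    """Hypothetical key: (bucket, total tokens in the step, request concurrency)."""
    bucket = "decode" if num_prefill_tokens == 0 else "prefill_mixed"
    return bucket, num_prefill_tokens + num_decode_tokens, num_requests


class ProfilePack:
    """Toy profile pack mapping (bucket, total_tokens, concurrency) to measured
    latency samples in seconds. Assumption: concurrency must match exactly,
    while total tokens round up to the next profiled grid point."""

    def __init__(self, table: dict[tuple[str, int, int], list[float]]):
        self.table = table
        self.grid: dict[tuple[str, int], list[int]] = {}
        for bucket, tokens, conc in table:
            self.grid.setdefault((bucket, conc), []).append(tokens)
        for points in self.grid.values():
            points.sort()

    def sample(self, key: tuple[str, int, int], rng: random.Random) -> float:
        bucket, tokens, conc = key
        points = self.grid[(bucket, conc)]  # KeyError if never profiled
        # Round up to the next profiled token count, clamping at the largest.
        i = min(bisect.bisect_left(points, tokens), len(points) - 1)
        return rng.choice(self.table[(bucket, points[i], conc)])


def submit_emulated_step(latency_s: float, token_ids: list[int]) -> Future:
    """Return a Future that resolves with token_ids after latency_s of wall-clock time."""
    # Constructing Future() directly is done here for brevity; normally an
    # Executor creates it.
    fut: Future = Future()
    timer = threading.Timer(latency_s, fut.set_result, args=(token_ids,))
    timer.daemon = True  # do not keep the process alive for pending steps
    timer.start()
    return fut
```

A caller can then block on `fut.result()` or attach `fut.add_done_callback(...)`. In both cases the step "completes" only when the sampled wall-clock delay has elapsed.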

Role In The Wiki

LLM-Emu belongs to the emulator branch of GPU Inference Optimization. It complements simulator sources such as Vidur and LLMServingSim 2.0, and it contrasts with Revati: Revati virtualizes CUDA and advances virtual time, while LLM-Emu preserves a wall-clock online vLLM endpoint and swaps only GPU forward execution.

For a future learned/hybrid LLM-serving simulator, LLM-Emu is mainly an engineering substrate. Its profile-sampled latency oracle, synthetic output-token path, and workload/benchmark interface are the places where learned latency predictors, output-length predictors, and statistical workload generators could be inserted.
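To show where a learned predictor could slot in, here is a sketch of a hypothetical `LatencyOracle` interface with two interchangeable implementations: one that samples profiled latencies per bucket, and a linear model standing in for a learned predictor. The interface, class names, and coefficients are illustrative assumptions. LLM-Emu's actual extension points may look different.

```python
import random
from typing import Protocol


class LatencyOracle(Protocol):
    """Hypothetical plug-in point: seconds to wait for one emulated step."""

    def predict(self, total_tokens: int, concurrency: int, is_decode_only: bool) -> float: ...


class SampledOracle:
    """Profile-sampling style: draw from measured latencies for the bucket."""

    def __init__(self, samples: dict[bool, list[float]], seed: int = 0):
        self.samples = samples  # keyed by is_decode_only
        self.rng = random.Random(seed)

    def predict(self, total_tokens: int, concurrency: int, is_decode_only: bool) -> float:
        return self.rng.choice(self.samples[is_decode_only])


class LinearOracle:
    """Stand-in for a learned predictor: a + b * tokens + c * concurrency.
    Coefficients are made up for illustration, not fitted to any hardware."""

    def __init__(self, a: float, b: float, c: float):
        self.a, self.b, self.c = a, b, c

    def predict(self, total_tokens: int, concurrency: int, is_decode_only: bool) -> float:
        return max(0.0, self.a + self.b * total_tokens + self.c * concurrency)


def step_latency(oracle: LatencyOracle, total_tokens: int, concurrency: int, is_decode_only: bool) -> float:
    # The emulated step depends only on this interface, so either oracle fits.
    return oracle.predict(total_tokens, concurrency, is_decode_only)
```

In this sketch, swapping `SampledOracle` for `LinearOracle` changes only the timing source; the code that consumes `step_latency` is unchanged, which mirrors the idea of keeping the serving path fixed and replacing only the latency oracle.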

Evidence

LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling

Alex Open Research Wiki
