LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
Source
- Raw Markdown: paper_llm-emu-2026.md
- PDF: paper_llm-emu-2026.pdf
- Preprint: arXiv:2605.00616
- Official code: AKafakA/llm-emu. Local metadata: papers/llm-emu-2026/official_artifacts_metadata.json.
- Alex-provided research-context links: papers/llm-emu-2026/research-context-links.md
Status And Credibility
LLM-Emu is an arXiv v1 preprint submitted on 2026-05-01 by Wei Da and Evangelia Kalyvianaki from the University of Cambridge. The arXiv metadata lists cs.DC, a CC BY 4.0 license, and the comment 6 page, 2 figures, workshop paper shape.
Treat it as current, useful systems evidence with a venue caveat. Credibility signals are the public Apache-2.0 repository, a small and inspectable vLLM plugin implementation, direct comparison against real vLLM runs, and validation across two GPUs, two model families, four model variants, two attention backends, Poisson arrivals, and bursty ShareGPT workloads. Caveats: no peer-reviewed venue is visible in arXiv metadata, validation is single-node, and each model/hardware/configuration still needs profile collection.
Core Claim
LLM-Emu argues that realistic LLM-serving evaluation needs online arrivals, queueing, live HTTP behavior, framework scheduling, KV-cache behavior, and output processing, but real-GPU experiments are expensive and pure simulators often drift from the serving engine they model.
The system keeps the real vLLM online serving path and replaces only GPU forward execution:
    flowchart LR
    Client[HTTP clients / vLLM bench serve]
    Stack[vLLM HTTP, admission, scheduler, KV cache, output path]
    Hook[LLM-Emu executor-boundary hook]
    Oracle[density-aware profile oracle]
    Future[timer-resolved Future]
    Tokens[synthetic token IDs]
    Metrics[TTFT, TPOT, ITL, E2E, throughput]
    Client --> Stack --> Hook
    Hook --> Oracle --> Future --> Tokens --> Stack
    Stack --> Metrics
The key design point is that LLM-Emu is serving-native wall-clock emulation, not a standalone discrete-event simulator. It runs the same vllm serve interface and exercises vLLM’s own online stack while swapping GPU forward work for sampled delay plus synthetic outputs.
Method Notes
LLM-Emu targets vLLM 0.18.1. When the emulator is enabled, a runtime plugin:
- bypasses real model loading and GPU setup;
- redirects the GPU worker’s per-step execution path to vllm_emulator/;
- loads a profile pack captured from real GPU serving traces;
- samples latency from decode-only or prefill/mixed buckets keyed by total tokens in the step and request concurrency;
- returns a timer-resolved Future so scheduler-worker overlap remains asynchronous;
- emits synthetic token IDs so vLLM’s downstream output path still runs.
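The last two steps in the list above can be illustrated with a small standalone sketch. It is not the plugin's code: the function name emulate_step, the vocab_size default, and the use of threading.Timer are assumptions made for illustration. It only shows the general pattern of returning a Future that a timer resolves after a sampled delay, so the caller is never blocked while the "GPU step" is pending.

```python
import random
import threading
import time
from concurrent.futures import Future


def emulate_step(latency_samples, num_tokens, vocab_size=32000, rng=None):
    """Return a Future that resolves to synthetic token IDs after a sampled delay.

    latency_samples: recorded per-step latencies in seconds for one profile
    bucket (hypothetical input; a real profile pack has many buckets).
    """
    rng = rng or random.Random()
    delay = rng.choice(latency_samples)  # draw one raw latency sample
    tokens = [rng.randrange(vocab_size) for _ in range(num_tokens)]

    fut = Future()
    # The timer thread resolves the Future; the caller returns immediately
    # and can keep scheduling while the emulated step is "in flight".
    timer = threading.Timer(delay, fut.set_result, args=(tokens,))
    timer.daemon = True
    timer.start()
    return fut


if __name__ == "__main__":
    started = time.monotonic()
    pending = emulate_step([0.05, 0.06, 0.07], num_tokens=4)
    print("returned immediately, done =", pending.done())
    print("tokens:", pending.result(timeout=2))
    print("elapsed seconds:", round(time.monotonic() - started, 3))
```

Because the caller gets the Future back before the delay elapses, a scheduler loop can prepare the next batch in the meantime, which is the overlap property the list describes.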
The profile pack stores raw per-step latency samples rather than only aggregate summaries. Sparse profile regions are handled by adaptive nearest-neighbor expansion and Shepard-weighted sampling. A representative main profile in the paper has about 276K samples across about 7.3K (total tokens, concurrency) buckets and takes about 3.5 to 4.5 GPU hours to capture.
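One way to picture sparse-region handling is the sketch below. It is a simplified reading of "adaptive nearest-neighbor expansion and Shepard-weighted sampling", not the paper's algorithm: the log-scaled distance, the min_samples threshold, the power exponent, and the smoothing term are all assumptions chosen for illustration. The profile is modeled as a dict from (total tokens, concurrency) to a list of raw latency samples.

```python
import math
import random


def sample_latency(profile, tokens, concurrency, min_samples=8,
                   power=2.0, smoothing=0.1, rng=None):
    """Draw one latency sample for a (tokens, concurrency) query."""
    rng = rng or random.Random()

    # Dense case: the exact bucket has enough raw samples, so use it directly.
    exact = profile.get((tokens, concurrency), [])
    if len(exact) >= min_samples:
        return rng.choice(exact)

    def distance(key):
        # Assumed metric: Euclidean distance in log space on both axes.
        dt = math.log1p(key[0]) - math.log1p(tokens)
        dc = math.log1p(key[1]) - math.log1p(concurrency)
        return math.hypot(dt, dc)

    neighbors = sorted((k for k, v in profile.items() if v), key=distance)
    if not neighbors:
        raise ValueError("profile has no samples")

    # Adaptive expansion: add nearest buckets until enough samples are pooled.
    chosen, pooled = [], 0
    for key in neighbors:
        chosen.append(key)
        pooled += len(profile[key])
        if pooled >= min_samples:
            break

    # Shepard-style inverse-distance weights; the smoothing term (assumption)
    # keeps a near-empty exact bucket from absorbing all the probability mass.
    weights = [1.0 / (distance(k) + smoothing) ** power for k in chosen]
    bucket = rng.choices(chosen, weights=weights, k=1)[0]
    return rng.choice(profile[bucket])
```

Sampling a raw value from a weighted bucket, rather than averaging neighbors, preserves the latency distribution's spread, which is the reason given above for storing raw samples instead of aggregates.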
Evidence And Results
| Evaluation axis | Paper evidence | Local interpretation |
|---|---|---|
| vLLM path fidelity | vLLM HTTP, admission, scheduler, KV-cache, and output pipeline stay on the real code path; only GPU forward is replaced. | Stronger online-stack fidelity than simulators that reimplement scheduling, but still only for vLLM 0.18.1 and covered execution paths. |
| Accuracy | TPOT and ITL within 4.8% absolute error, E2E latency within 5.3%, output throughput within 1.9%; TTFT max error 10.41%. | Good steady-state serving fidelity; TTFT remains the most sensitive metric because admission and queue state amplify small timing differences. |
| Coverage | Tested on RTX 8000 and A40, Qwen3-4B/8B/14B and Llama-3.1-8B, FlashInfer and FlashAttention 2, Poisson and bursty arrivals. | Useful breadth for a workshop-sized systems paper; not yet a cluster-level or multi-node evaluation. |
| Maintenance surface | About 2.5K LoC of LLM-Emu code plus around 173 LoC of vLLM wiring, according to the paper. | Lower maintenance surface than reimplementing a full simulator, but still sensitive to vLLM API drift. |
| Public artifacts | Official GitHub repository is Apache-2.0 and includes plugin code, vLLM patch/overrides, example profiling data, and reproduction docs. | Strong reproducibility signal relative to simulator/emulator papers without public implementations. |
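
For readers unfamiliar with the metrics in the Accuracy row, the sketch below computes them from per-request timestamps and compares an emulated value against a real one. The definitions follow common serving-benchmark conventions (TPOT as decode time divided by output tokens after the first), and the error is assumed to be absolute percentage error relative to the real run; the function names are hypothetical and the paper's exact aggregation may differ.

```python
def request_metrics(arrival, token_times):
    """Per-request latency metrics from an arrival time and token emit times (seconds)."""
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival            # time to first token
    e2e = token_times[-1] - arrival            # end-to-end latency
    # Inter-token latencies: gaps between consecutive output tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # Time per output token, excluding the first token.
    n_after_first = len(token_times) - 1
    tpot = (e2e - ttft) / n_after_first if n_after_first else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}


def abs_pct_error(emulated, real):
    """Absolute percentage error of an emulated metric against the real run."""
    if real == 0:
        raise ValueError("real value must be non-zero")
    return abs(emulated - real) / abs(real) * 100.0


if __name__ == "__main__":
    real = request_metrics(0.0, [0.50, 0.60, 0.70, 0.80])
    emu = request_metrics(0.0, [0.54, 0.64, 0.75, 0.84])
    for name in ("ttft", "tpot", "e2e"):
        print(name, round(abs_pct_error(emu[name], real[name]), 2), "%")
```

In this toy comparison the same absolute shift of a few tens of milliseconds produces a larger percentage error on TTFT than on E2E, simply because TTFT is the smaller quantity.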
Placement In The Inference-Emulation Landscape
LLM-Emu belongs on GPU Inference Optimization as a serving-native emulator. It occupies a distinct point relative to discrete-event simulators such as Vidur and LLMServingSim 2.0 on one side and the CUDA-virtualizing emulator Revati on the other:
- Vidur / LLMServingSim / Frontier / TokenSim / MIST: discrete-event or runtime-driven simulators that model serving dynamics and often rely on profiling, analytical models, or learned predictors.
- Revati: emulator that runs real serving-framework code, but virtualizes CUDA and advances virtual time by predicted kernel durations.
- LLM-Emu: wall-clock vLLM emulator that preserves the online endpoint and serving stack, avoids CUDA interception, and replaces only GPU forward execution with profile-sampled latency and synthetic output tokens.
The user’s supplied research landscape also points to a broader hybrid simulator recipe: keep a deterministic event loop or serving-native runtime path, but learn workload arrivals, output length, kernel/server latency, and uncertainty from traces. LLM-Emu is not itself ML-heavy; its value is the engineering insertion point where learned latency or output models could replace sampled profiles.
Research-Context Links From Alex’s Note
The following list preserves only research context from Alex’s prompt, not business framing.
| Branch | Work | Link | Why it matters here |
|---|---|---|---|
| Deterministic / event-driven simulator | Vidur / Vidur-Bench | arXiv:2405.05465 | Baseline large-scale simulator; operator performance uses profiling and predictive modeling. |
| Deterministic / event-driven simulator | LLMServingSim 2.0 | arXiv:2602.23036 | Runtime-driven heterogeneous/disaggregated serving simulator. |
| Deterministic / event-driven simulator | MIST | arXiv:2504.09775 | Multi-stage inference simulator for RAG, KV retrieval, reasoning, prefill, and decode stages. |
| Deterministic / event-driven simulator | TokenSim | arXiv:2503.08415, GitHub | Hardware/software exploration simulator for scheduling and memory-management studies. |
| Deterministic / event-driven simulator | Frontier | arXiv:2605.21312, GitHub | Modern discrete-event simulator for disaggregated serving, complex parallelism, runtime optimizations, and stateful workloads. |
| Learned latency / throughput surrogate | IBM ALA | arXiv:2505.09319 | Analytical with Learning Augmentation framework for throughput prediction and uncertainty estimation. |
| Learned latency / throughput surrogate | Roofline-driven ML latency prediction | IBM Research | Combines an LLM-specific Roofline model with regression trained on historical runtime data. |
| Learned latency / throughput surrogate | llm-d predicted-latency scheduling | blog, guide | Online XGBoost TTFT/TPOT model using prompt length, queue/server state, cache, and concurrency features. |
| Learned latency / throughput surrogate | MaverIQ | ACM DOI, GitHub | Fingerprint-guided latency/configuration extrapolation and fragmentation-aware deployment. |
| Learned latency / throughput surrogate | LENS | arXiv:2606.18042 | NPU latency estimator that avoids needing hidden microarchitecture/compiler details. |
| Workload generator | ServeGen | arXiv:2505.09999, USENIX NSDI 2026, GitHub | Production-informed generator for arrivals, input/output lengths, and per-client composition. |
| Output-length model | S3 output-length model | arXiv:2306.06000 | Predicts output length for scheduling and KV-cache utilization. |
| Output-length model | SSJF / proxy model sequence-length prediction | arXiv:2404.08509, author page/code link | Uses a lightweight proxy model for output-length prediction and speculative shortest-job-first scheduling. |
| Output-length model | vLLM-LTR | arXiv:2408.15792, GitHub, project blog | Learns relative output-length ranks when exact generation length is hard to predict. |
| Output-length model | Magnus | arXiv:2406.04785 | Predicts generation length from input length plus application/user semantic features for batching. |
| Output-length model | ForeLen / entropy-guided representations | arXiv:2602.11812 | Uses on-the-fly activations and entropy-guided pooling for output-length prediction and introduces ForeLen. |
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Dynamic compute and serving | adjacent | LLM-Emu makes serving-time batching, queueing, KV-cache state, prefill/decode phases, and request arrivals explicit in an online emulator. | Evidence is LLM serving, not TSFM serving with long observation histories or stateful updates. |
| Observability and event streams | adjacent | The emulator evaluates request arrivals, queue/admission effects, concurrency, latency, and throughput as time-varying operational signals. | Does not provide a general observability benchmark, incident labels, or action-conditioned system-dynamics model. |
| Control and counterfactuals | adjacent | Enables what-if testing of vLLM serving configurations under emulated GPU-forward latency. | Counterfactual validity depends on profile match, output-token realism, and held-out real-GPU validation. |
| Benchmark validity | warning | Shows that keeping real framework control logic can reduce simulator drift. | The emulator can still drift across vLLM releases, GPUs, workload distributions, and profile gaps. |
Limitations And Gotchas
- LLM-Emu is an arXiv preprint / workshop-shaped paper at ingest time.
- It validates single-node serving only; multi-GPU, multi-node, and cluster-level emulation remain future work.
- It still needs GPU time to collect a profile for each model, hardware platform, attention backend, and serving configuration.
- The profile and evaluation workloads are matched; out-of-distribution request mixes need separate validation.
- TTFT is less stable than TPOT/ITL/E2E/throughput because queueing and startup effects are sensitive to small timing differences.
- Synthetic output tokens preserve the output path, but they are not a learned or semantic output model.
- The plugin targets vLLM 0.18.1, so API drift checks are part of the maintenance contract.
- It is not a learned event simulator by itself; learned/hybrid simulation would require replacing or augmenting the profile oracle, output-length source, and workload generator.
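The matched-workload caveat in the list above suggests a cheap pre-flight check before trusting an emulated run: measure how many of a workload's scheduler steps land in profile buckets that hold enough samples. The sketch below is a hypothetical helper, not part of LLM-Emu; coverage_report, its inputs, and the min_samples threshold are assumptions for illustration.

```python
def coverage_report(bucket_counts, workload_steps, min_samples=8):
    """Estimate how well a profile covers a workload.

    bucket_counts: dict mapping (total_tokens, concurrency) -> number of
        profiled latency samples in that bucket.
    workload_steps: iterable of (total_tokens, concurrency) pairs observed
        or expected for the workload under test.
    """
    steps = list(workload_steps)
    if not steps:
        raise ValueError("workload has no steps")
    # A step is uncovered if its bucket is missing or too sparsely sampled.
    uncovered = [s for s in steps if bucket_counts.get(s, 0) < min_samples]
    return {
        "covered_fraction": 1.0 - len(uncovered) / len(steps),
        "uncovered_buckets": sorted(set(uncovered)),
    }


if __name__ == "__main__":
    counts = {(100, 4): 20, (200, 8): 3}
    steps = [(100, 4), (100, 4), (200, 8), (300, 16)]
    print(coverage_report(counts, steps))
```

A low covered fraction does not by itself mean the emulation is wrong, since sparse regions are interpolated, but it flags request mixes that deserve a held-out real-GPU comparison.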
Open Questions
- Can LLM-Emu’s profile-sampled oracle be replaced with an uncertainty-aware learned latency model such as an ALA-style or llm-d-style predictor?
- Can output-length models such as SSJF, vLLM-LTR, Magnus, or ForeLen produce synthetic decode traces that match real online workloads better than fixed benchmark reference lengths?
- What held-out validation protocol should decide when a profile pack or learned surrogate is stale after vLLM, CUDA, attention backend, GPU, or workload changes?
- How should a hybrid emulator propagate uncertainty from workload generation, output-length prediction, and latency prediction into scheduler or capacity-planning decisions?
- Can an emulator that preserves serving-engine code serve as a controllable world model for agentic serving-policy search without overfitting profile gaps?