LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling

Source

Raw Markdown: paper_llm-emu-2026.md
PDF: paper_llm-emu-2026.pdf
Preprint: arXiv:2605.00616
Official code: AKafakA/llm-emu. Local metadata: papers/llm-emu-2026/official_artifacts_metadata.json.
Alex-provided research-context links: papers/llm-emu-2026/research-context-links.md

Status And Credibility

LLM-Emu is an arXiv v1 preprint submitted on 2026-05-01 by Wei Da and Evangelia Kalyvianaki from the University of Cambridge. The arXiv metadata lists cs.DC as the category, a CC BY 4.0 license, and the comment "6 page, 2 figures"; the paper has the shape of a workshop paper.

Treat it as current, useful systems evidence with a venue caveat. Credibility signals are the public Apache-2.0 repository, a small and inspectable vLLM plugin implementation, direct comparison against real vLLM runs, and validation across two GPUs, two model families, four model variants, two attention backends, Poisson arrivals, and bursty ShareGPT workloads. Caveats: no peer-reviewed venue is visible in arXiv metadata, validation is single-node, and each model/hardware/configuration still needs profile collection.

Core Claim

LLM-Emu argues that realistic LLM-serving evaluation needs online arrivals, queueing, live HTTP behavior, framework scheduling, KV-cache behavior, and output processing, but real-GPU experiments are expensive and pure simulators often drift from the serving engine they model.

The system keeps the real vLLM online serving path and replaces only GPU forward execution:

flowchart LR
  Client[HTTP clients / vLLM bench serve]
  Stack[vLLM HTTP, admission, scheduler, KV cache, output path]
  Hook[LLM-Emu executor-boundary hook]
  Oracle[density-aware profile oracle]
  Future[timer-resolved Future]
  Tokens[synthetic token IDs]
  Metrics[TTFT, TPOT, ITL, E2E, throughput]

  Client --> Stack --> Hook
  Hook --> Oracle --> Future --> Tokens --> Stack
  Stack --> Metrics

The key design point is that LLM-Emu is serving-native wall-clock emulation, not a standalone discrete-event simulator. It runs the same vllm serve interface and exercises vLLM’s own online stack while swapping GPU forward work for sampled delay plus synthetic outputs.

Method Notes

LLM-Emu targets vLLM 0.18.1. When the emulator is enabled, a runtime plugin:

bypasses real model loading and GPU setup;
redirects the GPU worker’s per-step execution path to vllm_emulator/;
loads a profile pack captured from real GPU serving traces;
samples latency from decode-only or prefill/mixed buckets keyed by total tokens in the step and request concurrency;
returns a timer-resolved Future so scheduler-worker overlap remains asynchronous;
emits synthetic token IDs so vLLM’s downstream output path still runs.
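
The timer-resolved Future in the steps above can be illustrated with a minimal sketch. It assumes a hypothetical helper `emulate_step` and a plain list of latency samples; it is not the LLM-Emu or vLLM API, only a standard-library illustration of how a sampled delay plus synthetic token IDs can be returned without blocking the caller.

```python
import random
import threading
from concurrent.futures import Future


def emulate_step(latency_samples_s, num_tokens, rng=random):
    """Return a Future that resolves with synthetic token IDs after a sampled delay.

    Hypothetical helper for illustration; not the LLM-Emu or vLLM API.
    """
    delay_s = rng.choice(latency_samples_s)  # one raw per-step latency sample
    # Placeholder token IDs (a vocabulary size of 1000 is an arbitrary assumption).
    token_ids = [rng.randrange(1000) for _ in range(num_tokens)]

    future = Future()
    # A timer thread resolves the Future; the caller is never blocked here.
    timer = threading.Timer(delay_s, future.set_result, args=(token_ids,))
    timer.daemon = True
    timer.start()
    return future


# Usage: the caller can keep scheduling while the emulated step is "running".
pending = emulate_step([0.010, 0.012, 0.015], num_tokens=4)
tokens = pending.result(timeout=1.0)  # blocks only when the result is needed
print(len(tokens))
```

Because the delay elapses on a timer thread rather than in a blocking sleep, the scheduler side stays free to prepare the next step, which is the asynchronous overlap the note describes.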

The profile pack stores raw per-step latency samples rather than only aggregate summaries. Sparse profile regions are handled by adaptive nearest-neighbor expansion and Shepard-weighted sampling. A representative main profile in the paper has about 276K samples across about 7.3K (total tokens, concurrency) buckets and takes about 3.5 to 4.5 GPU hours to capture.
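
One way to read "adaptive nearest-neighbor expansion and Shepard-weighted sampling" is sketched below. This is an interpretation, not the paper's algorithm: the function name `sample_latency`, the `min_samples` threshold, the Euclidean distance over raw (total tokens, concurrency) units, and the `+1` offset in the weight are all assumptions made for illustration.

```python
import math
import random


def sample_latency(profile, total_tokens, concurrency,
                   min_samples=8, power=2.0, rng=random):
    """Draw one step latency from `profile`, a dict mapping
    (total_tokens, concurrency) -> list of raw latency samples in seconds.
    """
    query = (total_tokens, concurrency)
    exact = profile.get(query, [])
    if len(exact) >= min_samples:
        return rng.choice(exact)  # dense bucket: sample it directly

    # Sparse region: rank all non-empty buckets by distance to the query.
    ranked = sorted(
        ((math.dist(query, key), samples)
         for key, samples in profile.items() if samples),
        key=lambda pair: pair[0],
    )
    if not ranked:
        raise ValueError("profile contains no samples")

    # Adaptive expansion: add nearest buckets until enough samples are pooled.
    neighbors, pooled = [], 0
    for dist, samples in ranked:
        neighbors.append((dist, samples))
        pooled += len(samples)
        if pooled >= min_samples:
            break

    # Shepard-style inverse-distance weights; the +1 offset (an assumption)
    # keeps the weight finite when the exact bucket exists but is sparse.
    weights = [1.0 / (1.0 + dist) ** power for dist, _ in neighbors]
    _, chosen = rng.choices(neighbors, weights=weights, k=1)[0]
    return rng.choice(chosen)


# Usage with made-up samples: (210, 4) has no bucket, so neighbors are pooled.
toy_profile = {(100, 2): [0.010] * 10, (200, 4): [0.020] * 3}
print(sample_latency(toy_profile, 210, 4))
```

Sampling a raw value from a weighted neighbor, rather than averaging neighbors, preserves the per-step latency variance that storing raw samples is meant to keep.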

Evidence And Results

Evaluation axis	Paper evidence	Local interpretation
vLLM path fidelity	vLLM HTTP, admission, scheduler, KV-cache, and output pipeline stay on the real code path; only GPU forward is replaced.	Stronger online-stack fidelity than simulators that reimplement scheduling, but still only for vLLM 0.18.1 and covered execution paths.
Accuracy	TPOT and ITL within 4.8% absolute error, E2E latency within 5.3%, output throughput within 1.9%; TTFT max error 10.41%.	Good steady-state serving fidelity; TTFT remains the most sensitive metric because admission and queue state amplify small timing differences.
Coverage	Tested on RTX 8000 and A40, Qwen3-4B/8B/14B and Llama-3.1-8B, FlashInfer and FlashAttention 2, Poisson and bursty arrivals.	Useful breadth for a workshop-sized systems paper; not yet a cluster-level or multi-node evaluation.
Maintenance surface	About 2.5K LoC of LLM-Emu code plus around 173 LoC of vLLM wiring, according to the paper.	Lower maintenance surface than reimplementing a full simulator, but still sensitive to vLLM API drift.
Public artifacts	Official GitHub repository is Apache-2.0 and includes plugin code, vLLM patch/overrides, example profiling data, and reproduction docs.	Strong reproducibility signal relative to simulator/emulator papers without public implementations.
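
The accuracy row reports percentage errors between emulated and real runs. A minimal sketch of how such figures could be computed follows, assuming the error is the absolute percentage difference relative to the real-GPU value and that the reported bound is the worst case across configurations; the paper's exact definition may differ. The function names and the numbers in the usage lines are made up for illustration and are not taken from the paper.

```python
def abs_pct_error(emulated, real):
    """Absolute percentage error of an emulated metric against the real value."""
    if real == 0:
        raise ValueError("real value must be non-zero")
    return 100.0 * abs(emulated - real) / abs(real)


def max_error_per_metric(runs):
    """Worst-case error per metric over (real_metrics, emulated_metrics) pairs."""
    worst = {}
    for real, emulated in runs:
        for metric, real_value in real.items():
            err = abs_pct_error(emulated[metric], real_value)
            worst[metric] = max(worst.get(metric, 0.0), err)
    return worst


# Usage with illustrative (not measured) values, one pair per configuration.
runs = [
    ({"tpot_ms": 20.0, "ttft_ms": 150.0}, {"tpot_ms": 20.6, "ttft_ms": 162.0}),
    ({"tpot_ms": 40.0, "ttft_ms": 300.0}, {"tpot_ms": 39.2, "ttft_ms": 291.0}),
]
print(max_error_per_metric(runs))
```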

Placement In The Inference-Emulation Landscape

LLM-Emu belongs on GPU Inference Optimization as a serving-native emulator. It occupies a distinct point between Vidur, LLMServingSim 2.0, and Revati:

Vidur / LLMServingSim / Frontier / TokenSim / MIST: discrete-event or runtime-driven simulators that model serving dynamics and often rely on profiling, analytical models, or learned predictors.
Revati: emulator that runs real serving-framework code, but virtualizes CUDA and advances virtual time by predicted kernel durations.
LLM-Emu: wall-clock vLLM emulator that preserves the online endpoint and serving stack, avoids CUDA interception, and replaces only GPU forward execution with profile-sampled latency and synthetic output tokens.

The user’s supplied research landscape also points to a broader hybrid simulator recipe: keep a deterministic event loop or serving-native runtime path, but learn workload arrivals, output length, kernel/server latency, and uncertainty from traces. LLM-Emu is not itself ML-heavy; its value is the engineering insertion point where learned latency or output models could replace sampled profiles.
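
The "insertion point" idea can be made concrete with a small interface sketch. Everything here is hypothetical: `LatencyOracle`, `ProfileSampledOracle`, and `LinearModelOracle` are illustrative names, and the linear model is only a stand-in for a learned predictor. The point is that the serving-side caller depends on one method, so a sampled profile and a learned model are interchangeable behind it.

```python
import random
from typing import Protocol


class LatencyOracle(Protocol):
    """Anything that maps a step's shape to a latency in seconds."""

    def step_latency_s(self, total_tokens: int, concurrency: int) -> float: ...


class ProfileSampledOracle:
    """Samples a raw latency from the exact (total_tokens, concurrency) bucket."""

    def __init__(self, samples_by_bucket, seed=0):
        self._samples = samples_by_bucket
        self._rng = random.Random(seed)

    def step_latency_s(self, total_tokens, concurrency):
        return self._rng.choice(self._samples[(total_tokens, concurrency)])


class LinearModelOracle:
    """Stand-in for a learned predictor: a fixed linear latency model."""

    def __init__(self, base_s, per_token_s, per_request_s):
        self._base_s = base_s
        self._per_token_s = per_token_s
        self._per_request_s = per_request_s

    def step_latency_s(self, total_tokens, concurrency):
        return (self._base_s
                + self._per_token_s * total_tokens
                + self._per_request_s * concurrency)


def total_emulated_time_s(oracle: LatencyOracle, steps):
    """Sum emulated latency over a sequence of (total_tokens, concurrency) steps."""
    return sum(oracle.step_latency_s(t, c) for t, c in steps)


# Usage: the same caller works with either oracle.
steps = [(100, 2), (200, 4)]
sampled = ProfileSampledOracle({(100, 2): [0.010], (200, 4): [0.020]})
learned = LinearModelOracle(base_s=0.001, per_token_s=0.00001, per_request_s=0.0001)
print(total_emulated_time_s(sampled, steps), total_emulated_time_s(learned, steps))
```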

Research-Context Links From Alex’s Note

The following list preserves only research context from Alex’s prompt, not business framing.

Branch	Work	Link	Why it matters here
Deterministic / event-driven simulator	Vidur / Vidur-Bench	arXiv:2405.05465	Baseline large-scale simulator; operator performance uses profiling and predictive modeling.
Deterministic / event-driven simulator	LLMServingSim 2.0	arXiv:2602.23036	Runtime-driven heterogeneous/disaggregated serving simulator.
Deterministic / event-driven simulator	MIST	arXiv:2504.09775	Multi-stage inference simulator for RAG, KV retrieval, reasoning, prefill, and decode stages.
Deterministic / event-driven simulator	TokenSim	arXiv:2503.08415, GitHub	Hardware/software exploration simulator for scheduling and memory-management studies.
Deterministic / event-driven simulator	Frontier	arXiv:2605.21312, GitHub	Modern discrete-event simulator for disaggregated serving, complex parallelism, runtime optimizations, and stateful workloads.
Learned latency / throughput surrogate	IBM ALA	arXiv:2505.09319	Analytical with Learning Augmentation framework for throughput prediction and uncertainty estimation.
Learned latency / throughput surrogate	Roofline-driven ML latency prediction	IBM Research	Combines an LLM-specific Roofline model with regression trained on historical runtime data.
Learned latency / throughput surrogate	llm-d predicted-latency scheduling	blog, guide	Online XGBoost TTFT/TPOT model using prompt length, queue/server state, cache, and concurrency features.
Learned latency / throughput surrogate	MaverIQ	ACM DOI, GitHub	Fingerprint-guided latency/configuration extrapolation and fragmentation-aware deployment.
Learned latency / throughput surrogate	LENS	arXiv:2606.18042	NPU latency estimator that avoids needing hidden microarchitecture/compiler details.
Workload generator	ServeGen	arXiv:2505.09999, USENIX NSDI 2026, GitHub	Production-informed generator for arrivals, input/output lengths, and per-client composition.
Output-length model	S3 output-length model	arXiv:2306.06000	Predicts output length for scheduling and KV-cache utilization.
Output-length model	SSJF / proxy model sequence-length prediction	arXiv:2404.08509, author page/code link	Uses a lightweight proxy model for output-length prediction and speculative shortest-job-first scheduling.
Output-length model	vLLM-LTR	arXiv:2408.15792, GitHub, project blog	Learns relative output-length ranks when exact generation length is hard to predict.
Output-length model	Magnus	arXiv:2406.04785	Predicts generation length from input length plus application/user semantic features for batching.
Output-length model	ForeLen / entropy-guided representations	arXiv:2602.11812	Uses on-the-fly activations and entropy-guided pooling for output-length prediction and introduces ForeLen.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Dynamic compute and serving	adjacent	LLM-Emu makes serving-time batching, queueing, KV-cache state, prefill/decode phases, and request arrivals explicit in an online emulator.	Evidence is LLM serving, not TSFM serving with long observation histories or stateful updates.
Observability and event streams	adjacent	The emulator evaluates request arrivals, queue/admission effects, concurrency, latency, and throughput as time-varying operational signals.	Does not provide a general observability benchmark, incident labels, or action-conditioned system-dynamics model.
Control and counterfactuals	adjacent	Enables what-if testing of vLLM serving configurations under emulated GPU-forward latency.	Counterfactual validity depends on profile match, output-token realism, and held-out real-GPU validation.
Benchmark validity	warning	Shows that keeping real framework control logic can reduce simulator drift.	The emulator can still drift across vLLM releases, GPUs, workload distributions, and profile gaps.

Limitations And Gotchas

LLM-Emu is an arXiv preprint / workshop-shaped paper at ingest time.
It validates single-node serving only; multi-GPU, multi-node, and cluster-level emulation remain future work.
It still needs GPU time to collect a profile for each model, hardware platform, attention backend, and serving configuration.
The profile and evaluation workloads are matched; out-of-distribution request mixes need separate validation.
TTFT is less stable than TPOT/ITL/E2E/throughput because queueing and startup effects are sensitive to small timing differences.
Synthetic output tokens preserve the output path, but they are not a learned or semantic output model.
The plugin targets vLLM 0.18.1, so API drift checks are part of the maintenance contract.
It is not a learned event simulator by itself; learned/hybrid simulation would require replacing or augmenting the profile oracle, output-length source, and workload generator.

Open Questions

Can LLM-Emu’s profile-sampled oracle be replaced with an uncertainty-aware learned latency model such as an ALA-style or llm-d-style predictor?
Can output-length models such as SSJF, vLLM-LTR, Magnus, or ForeLen produce synthetic decode traces that match real online workloads better than fixed benchmark reference lengths?
What held-out validation protocol should decide when a profile pack or learned surrogate is stale after vLLM, CUDA, attention backend, GPU, or workload changes?
How should a hybrid emulator propagate uncertainty from workload generation, output-length prediction, and latency prediction into scheduler or capacity-planning decisions?
Can an emulator that preserves serving-engine code serve as a controllable world model for agentic serving-policy search without overfitting profile gaps?
