LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling

Status And Credibility

LLM-Emu is an arXiv v1 preprint submitted on 2026-05-01 by Wei Da and Evangelia Kalyvianaki from the University of Cambridge. The arXiv metadata lists the category cs.DC, a CC BY 4.0 license, and a comment indicating 6 pages and 2 figures, a shape consistent with a workshop paper.

Treat it as current, useful systems evidence with a venue caveat. Credibility signals are the public Apache-2.0 repository, a small and inspectable vLLM plugin implementation, direct comparison against real vLLM runs, and validation across two GPUs, two model families, four model variants, two attention backends, Poisson arrivals, and bursty ShareGPT workloads. Caveats: no peer-reviewed venue is visible in arXiv metadata, validation is single-node, and each model/hardware/configuration still needs profile collection.

Core Claim

LLM-Emu argues that realistic LLM-serving evaluation needs online arrivals, queueing, live HTTP behavior, framework scheduling, KV-cache behavior, and output processing, but real-GPU experiments are expensive and pure simulators often drift from the serving engine they model.

The system keeps the real vLLM online serving path and replaces only GPU forward execution:

flowchart LR
  Client[HTTP clients / vLLM bench serve]
  Stack[vLLM HTTP, admission, scheduler, KV cache, output path]
  Hook[LLM-Emu executor-boundary hook]
  Oracle[density-aware profile oracle]
  Future[timer-resolved Future]
  Tokens[synthetic token IDs]
  Metrics[TTFT, TPOT, ITL, E2E, throughput]

  Client --> Stack --> Hook
  Hook --> Oracle --> Future --> Tokens --> Stack
  Stack --> Metrics

The key design point is that LLM-Emu is serving-native wall-clock emulation, not a standalone discrete-event simulator. It runs the same vllm serve interface and exercises vLLM’s own online stack while swapping GPU forward work for sampled delay plus synthetic outputs.

Method Notes

LLM-Emu targets vLLM 0.18.1. When the emulator is enabled, a runtime plugin:

  • bypasses real model loading and GPU setup;
  • redirects the GPU worker’s per-step execution path to vllm_emulator/;
  • loads a profile pack captured from real GPU serving traces;
  • samples latency from decode-only or prefill/mixed buckets keyed by total tokens in the step and request concurrency;
  • returns a timer-resolved Future so scheduler-worker overlap remains asynchronous;
  • emits synthetic token IDs so vLLM’s downstream output path still runs.
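The last two bullets can be illustrated with a minimal sketch. This is not the plugin's code: `emulated_step`, its parameters, and the use of `threading.Timer` are hypothetical choices made for illustration, assuming only that a step should return immediately with a Future that resolves after the sampled delay and carries one synthetic token ID per running request.

```python
import random
import threading
import time
from concurrent.futures import Future


def emulated_step(sample_latency_s, num_requests, vocab_size=32000, rng=None):
    """Return a Future that resolves after a sampled delay (hypothetical helper).

    sample_latency_s: zero-argument callable returning a delay in seconds,
                      standing in for the profile oracle.
    num_requests:     number of running requests in this step.
    """
    rng = rng or random.Random()
    future = Future()
    delay = sample_latency_s()

    def resolve():
        # One synthetic token ID per request; the values carry no meaning,
        # they only let the downstream output path run.
        future.set_result([rng.randrange(vocab_size) for _ in range(num_requests)])

    timer = threading.Timer(delay, resolve)
    timer.daemon = True  # do not keep the process alive for pending steps
    timer.start()
    return future  # returned immediately, so the caller is not blocked


if __name__ == "__main__":
    start = time.perf_counter()
    fut = emulated_step(lambda: 0.02, num_requests=4, rng=random.Random(0))
    # The caller could do scheduling work here while the "GPU step" is pending.
    tokens = fut.result(timeout=1.0)
    print(len(tokens), f"{time.perf_counter() - start:.3f}s")
```

Because the Future is handed back before the timer fires, the caller can overlap its own work with the emulated step, which is the property the asynchronous design is meant to preserve.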

The profile pack stores raw per-step latency samples rather than only aggregate summaries. Sparse profile regions are handled by adaptive nearest-neighbor expansion and Shepard-weighted (inverse-distance-weighted) sampling. A representative main profile in the paper has about 276K samples across about 7.3K (total tokens, concurrency) buckets and takes about 3.5 to 4.5 GPU hours to capture.
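One way to picture the sparse-region handling is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: the function name, the `min_samples` threshold, the radius-doubling rule, the unscaled Euclidean distance over (total tokens, concurrency), and the softened weight `1 / (1 + d)^p` are all hypothetical choices.

```python
import math
import random


def sample_step_latency(profile, total_tokens, concurrency,
                        min_samples=32, power=2.0, max_radius=64, rng=None):
    """Draw one latency sample for a step (hypothetical sketch).

    profile: dict mapping (total_tokens, concurrency) -> list of raw
             per-step latency samples in seconds.
    """
    rng = rng or random.Random()
    query = (total_tokens, concurrency)

    # Dense region: the exact bucket has enough raw samples on its own.
    exact = profile.get(query, [])
    if len(exact) >= min_samples:
        return rng.choice(exact)

    # Adaptive expansion: widen the search radius until the pooled
    # neighbor buckets hold enough samples (or the radius cap is hit).
    radius = 1
    while True:
        neighbors = [(key, samples) for key, samples in profile.items()
                     if samples and math.dist(key, query) <= radius]
        pooled = sum(len(samples) for _, samples in neighbors)
        if pooled >= min_samples or radius >= max_radius:
            break
        radius *= 2
    if not neighbors:
        raise LookupError(f"no profile data within radius {max_radius} of {query}")

    # Shepard-style weighting: closer buckets are chosen more often.
    # The "+ 1" keeps the weight finite when the distance is zero.
    weights = [1.0 / (1.0 + math.dist(key, query)) ** power
               for key, _ in neighbors]
    _, chosen = rng.choices(neighbors, weights=weights, k=1)[0]

    # Sample a raw latency rather than returning a bucket average.
    return rng.choice(chosen)
```

Sampling a raw value from the chosen bucket, instead of interpolating bucket means, keeps per-step variance in the emulated timing, which is the reason for storing raw samples rather than aggregates.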

Evidence And Results

| Evaluation axis | Paper evidence | Local interpretation |
| --- | --- | --- |
| vLLM path fidelity | vLLM HTTP, admission, scheduler, KV-cache, and output pipeline stay on the real code path; only GPU forward is replaced. | Stronger online-stack fidelity than simulators that reimplement scheduling, but still only for vLLM 0.18.1 and covered execution paths. |
| Accuracy | TPOT and ITL within 4.8% absolute error, E2E latency within 5.3%, output throughput within 1.9%; TTFT max error 10.41%. | Good steady-state serving fidelity; TTFT remains the most sensitive metric because admission and queue state amplify small timing differences. |
| Coverage | Tested on RTX 8000 and A40, Qwen3-4B/8B/14B and Llama-3.1-8B, FlashInfer and FlashAttention 2, Poisson and bursty arrivals. | Useful breadth for a workshop-sized systems paper; not yet a cluster-level or multi-node evaluation. |
| Maintenance surface | About 2.5K LoC of LLM-Emu code plus around 173 LoC of vLLM wiring, according to the paper. | Lower maintenance surface than reimplementing a full simulator, but still sensitive to vLLM API drift. |
| Public artifacts | Official GitHub repository is Apache-2.0 and includes plugin code, vLLM patch/overrides, example profiling data, and reproduction docs. | Strong reproducibility signal relative to simulator/emulator papers without public implementations. |
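For readers unfamiliar with the metric names in the accuracy row, the sketch below shows the conventional per-request definitions and an absolute percentage error between a real and an emulated value. The helper names are hypothetical, and whether the paper compares means, medians, or tail percentiles is not stated in these notes, so the aggregation step is deliberately left out.

```python
def request_metrics(arrival_s, token_times_s):
    """Per-request latency metrics from arrival time and token emit times.

    Both arguments are timestamps in seconds on the same clock.
    """
    if not token_times_s:
        raise ValueError("request produced no tokens")
    n = len(token_times_s)
    ttft = token_times_s[0] - arrival_s    # time to first token
    e2e = token_times_s[-1] - arrival_s    # end-to-end latency
    # Inter-token latencies: gaps between consecutive output tokens.
    itl = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    # Time per output token, excluding the first token.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl, "e2e": e2e}


def abs_pct_error(real, emulated):
    """Absolute error of the emulated value relative to the real one, in percent."""
    return abs(emulated - real) / real * 100.0


if __name__ == "__main__":
    m = request_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
    print(m["ttft"], m["e2e"], round(m["tpot"], 3))
    print(round(abs_pct_error(0.100, 0.104), 2))
```

TTFT depends on a single timestamp per request, while TPOT and ITL average over many decode steps, which is one intuition for why TTFT error is the least stable figure in the table.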

Placement In The Inference-Emulation Landscape

LLM-Emu belongs on GPU Inference Optimization as a serving-native emulator. It occupies a distinct point between Vidur, LLMServingSim 2.0, and Revati:

  • Vidur / LLMServingSim / Frontier / TokenSim / MIST: discrete-event or runtime-driven simulators that model serving dynamics and often rely on profiling, analytical models, or learned predictors.
  • Revati: emulator that runs real serving-framework code, but virtualizes CUDA and advances virtual time by predicted kernel durations.
  • LLM-Emu: wall-clock vLLM emulator that preserves the online endpoint and serving stack, avoids CUDA interception, and replaces only GPU forward execution with profile-sampled latency and synthetic output tokens.

The user’s supplied research landscape also points to a broader hybrid simulator recipe: keep a deterministic event loop or serving-native runtime path, but learn workload arrivals, output length, kernel/server latency, and uncertainty from traces. LLM-Emu is not itself ML-heavy; its value is the engineering insertion point where learned latency or output models could replace sampled profiles.

The following table preserves only the research context from Alex’s prompt, not the business framing.

| Branch | Work | Link | Why it matters here |
| --- | --- | --- | --- |
| Deterministic / event-driven simulator | Vidur / Vidur-Bench | arXiv:2405.05465 | Baseline large-scale simulator; operator performance uses profiling and predictive modeling. |
| Deterministic / event-driven simulator | LLMServingSim 2.0 | arXiv:2602.23036 | Runtime-driven heterogeneous/disaggregated serving simulator. |
| Deterministic / event-driven simulator | MIST | arXiv:2504.09775 | Multi-stage inference simulator for RAG, KV retrieval, reasoning, prefill, and decode stages. |
| Deterministic / event-driven simulator | TokenSim | arXiv:2503.08415, GitHub | Hardware/software exploration simulator for scheduling and memory-management studies. |
| Deterministic / event-driven simulator | Frontier | arXiv:2605.21312, GitHub | Modern discrete-event simulator for disaggregated serving, complex parallelism, runtime optimizations, and stateful workloads. |
| Learned latency / throughput surrogate | IBM ALA | arXiv:2505.09319 | Analytical with Learning Augmentation framework for throughput prediction and uncertainty estimation. |
| Learned latency / throughput surrogate | Roofline-driven ML latency prediction | IBM Research | Combines an LLM-specific Roofline model with regression trained on historical runtime data. |
| Learned latency / throughput surrogate | llm-d predicted-latency scheduling | blog, guide | Online XGBoost TTFT/TPOT model using prompt length, queue/server state, cache, and concurrency features. |
| Learned latency / throughput surrogate | MaverIQ | ACM DOI, GitHub | Fingerprint-guided latency/configuration extrapolation and fragmentation-aware deployment. |
| Learned latency / throughput surrogate | LENS | arXiv:2606.18042 | NPU latency estimator that avoids needing hidden microarchitecture/compiler details. |
| Workload generator | ServeGen | arXiv:2505.09999, USENIX NSDI 2026, GitHub | Production-informed generator for arrivals, input/output lengths, and per-client composition. |
| Output-length model | S3 output-length model | arXiv:2306.06000 | Predicts output length for scheduling and KV-cache utilization. |
| Output-length model | SSJF / proxy model sequence-length prediction | arXiv:2404.08509, author page/code link | Uses a lightweight proxy model for output-length prediction and speculative shortest-job-first scheduling. |
| Output-length model | vLLM-LTR | arXiv:2408.15792, GitHub, project blog | Learns relative output-length ranks when exact generation length is hard to predict. |
| Output-length model | Magnus | arXiv:2406.04785 | Predicts generation length from input length plus application/user semantic features for batching. |
| Output-length model | ForeLen / entropy-guided representations | arXiv:2602.11812 | Uses on-the-fly activations and entropy-guided pooling for output-length prediction and introduces ForeLen. |

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Dynamic compute and serving | adjacent | LLM-Emu makes serving-time batching, queueing, KV-cache state, prefill/decode phases, and request arrivals explicit in an online emulator. | Evidence is LLM serving, not TSFM serving with long observation histories or stateful updates. |
| Observability and event streams | adjacent | The emulator evaluates request arrivals, queue/admission effects, concurrency, latency, and throughput as time-varying operational signals. | Does not provide a general observability benchmark, incident labels, or action-conditioned system-dynamics model. |
| Control and counterfactuals | adjacent | Enables what-if testing of vLLM serving configurations under emulated GPU-forward latency. | Counterfactual validity depends on profile match, output-token realism, and held-out real-GPU validation. |
| Benchmark validity | warning | Shows that keeping real framework control logic can reduce simulator drift. | The emulator can still drift across vLLM releases, GPUs, workload distributions, and profile gaps. |

Limitations And Gotchas

  • LLM-Emu is an arXiv preprint / workshop-shaped paper at ingest time.
  • It validates single-node serving only; multi-GPU, multi-node, and cluster-level emulation remain future work.
  • It still needs GPU time to collect a profile for each model, hardware platform, attention backend, and serving configuration.
  • The profile and evaluation workloads are matched; out-of-distribution request mixes need separate validation.
  • TTFT is less stable than TPOT/ITL/E2E/throughput because queueing and startup effects are sensitive to small timing differences.
  • Synthetic output tokens preserve the output path, but they are not a learned or semantic output model.
  • The plugin targets vLLM 0.18.1, so API drift checks are part of the maintenance contract.
  • It is not a learned event simulator by itself; learned/hybrid simulation would require replacing or augmenting the profile oracle, output-length source, and workload generator.

Open Questions

  • Can LLM-Emu’s profile-sampled oracle be replaced with an uncertainty-aware learned latency model such as an ALA-style or llm-d-style predictor?
  • Can output-length models such as SSJF, vLLM-LTR, Magnus, or ForeLen produce synthetic decode traces that match real online workloads better than fixed benchmark reference lengths?
  • What held-out validation protocol should decide when a profile pack or learned surrogate is stale after vLLM, CUDA, attention backend, GPU, or workload changes?
  • How should a hybrid emulator propagate uncertainty from workload generation, output-length prediction, and latency prediction into scheduler or capacity-planning decisions?
  • Can an emulator that preserves serving-engine code serve as a controllable world model for agentic serving-policy search without overfitting profile gaps?