SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling

Source

Raw Markdown: paper_sageserve-2025.md
PDF: paper_sageserve-2025.pdf
Preprint: arXiv:2502.14617
ACM DOI: 10.1145/3771576
Official code / traces: shashwatj07/SageServe. Local snapshots: papers/sageserve-2025/source_repo_metadata.json and papers/sageserve-2025/source_github_readme.md.

Status And Credibility

SageServe is an arXiv preprint first posted on 2025-02-20 and last revised as v3 on 2025-11-12. The paper has an ACM POMACS journal reference, “Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 9, No. 3, Article 61, December 2025,” and an ACM DOI. The author list spans UIUC, Georgia Tech, IISc, and Microsoft.

Treat it as credible systems evidence for large-scale LLM serving because it combines a peer-reviewed ACM publication, Microsoft O365 production workload characterization, a public repository, and explicit cost/SLA evaluation. It is not a general proof that every cloud LLM serving workload has the same forecastability or multi-region scaling economics.

Core Claim

SageServe argues that separating fast interactive workloads and slower non-interactive workloads into siloed GPU pools wastes expensive accelerator capacity when demand shifts across pools, regions, and models.

The system uses a forecast-aware, multi-timescale controller that combines:

short-term request routing across data centers;
longer-lead-time GPU VM scaling;
model placement decisions;
a traffic forecast model plus an integer linear programming formulation to co-optimize routing and resource allocation.
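
The co-optimization idea above can be illustrated with a minimal sketch. This is not the paper's ILP formulation: it is a toy integer program solved by brute-force enumeration, and every name and number in it (`FORECAST`, `VM_CAPACITY`, `VM_COST`, `MAX_VMS`, `CROSS_REGION_PENALTY`, `plan`) is a hypothetical assumption chosen for illustration. It picks integer VM counts per region so that total capacity covers forecast demand, and charges a small penalty for requests that must be routed away from their home region.

```python
from itertools import product

# Hypothetical toy inputs: none of these numbers come from the paper.
FORECAST = {"east": 900, "west": 300}   # forecast requests/min per region
VM_CAPACITY = 250                       # requests/min one GPU VM can serve
VM_COST = {"east": 3.0, "west": 2.0}    # cost per VM-hour in each region
MAX_VMS = {"east": 4, "west": 4}        # regional capacity limit
CROSS_REGION_PENALTY = 0.001            # cost per request routed off-region


def plan(forecast, vm_capacity, vm_cost, max_vms, penalty):
    """Enumerate integer VM counts and return the cheapest feasible plan.

    Returns (cost, vm_counts, overflow), where overflow is the number of
    requests/min that must be routed to a region other than their own.
    """
    regions = sorted(forecast)
    total_demand = sum(forecast.values())
    best = None
    for counts in product(*(range(max_vms[r] + 1) for r in regions)):
        capacity = {r: n * vm_capacity for r, n in zip(regions, counts)}
        if sum(capacity.values()) < total_demand:
            continue  # infeasible: total capacity cannot cover the forecast
        # Serve locally first; the remainder is routed to regions with spare
        # capacity, which exists because total capacity covers total demand.
        overflow = sum(max(0, forecast[r] - capacity[r]) for r in regions)
        vm_total = sum(vm_cost[r] * n for r, n in zip(regions, counts))
        cost = vm_total + penalty * overflow
        if best is None or cost < best[0]:
            best = (cost, dict(zip(regions, counts)), overflow)
    return best


cost, counts, overflow = plan(
    FORECAST, VM_CAPACITY, VM_COST, MAX_VMS, CROSS_REGION_PENALTY
)
print(counts, overflow, round(cost, 2))
```

With these toy inputs the cheapest plan concentrates VMs in the cheaper region and routes the excess demand there. Raising `CROSS_REGION_PENALTY` shifts capacity back toward the region where demand originates, which is the routing-versus-allocation trade-off in miniature. A real deployment would use an ILP solver and add latency, model-placement, and provisioning-lead-time constraints.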

In the terminology of this wiki, incoming requests are an event stream; forecasted demand is context; routing, scaling, and model-placement decisions are actions or control inputs; SLA and GPU-hour usage are outcomes.

flowchart LR
  Trace[production LLM request event stream] --> Forecast[traffic forecast]
  Forecast --> ILP[resource-routing optimization]
  Capacity[regional GPU VM capacity] --> ILP
  Models[model placement constraints] --> ILP
  ILP --> Route[short-term request routing]
  ILP --> Scale[long-term GPU VM scaling]
  Route --> SLA[SLA and tail latency]
  Scale --> Cost[GPU-hour utilization]

Evidence And Results

The paper characterizes Microsoft Office 365 LLM serving workloads at over 10 million requests per day across regions and workload classes.
Evaluation uses real runs and realistic simulations on 10 million production requests across three regions and four open-source models.
Reported result: up to 25% GPU-hour savings compared with the baseline deployment.
Reported result: 80% reduction in GPU-hour wastage caused by inefficient autoscaling, which the paper translates into potential monthly savings of up to $2.5M while maintaining tail latency and SLAs.
The official repository exposes a simulator harness, traces, and scheduling logic, making the paper useful as both a serving-optimization source and a workload-analysis source.
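
To make the two reported metrics concrete, here is a minimal arithmetic sketch. It assumes one plausible reading of "wastage" (provisioned GPU-hours that were not used) and of "GPU-hour savings" (relative drop in total provisioned GPU-hours). The paper's exact definitions may differ, the helper names are hypothetical, and the hourly numbers are invented for illustration rather than taken from the paper.

```python
def gpu_hour_waste(provisioned, used):
    """Provisioned-but-unused GPU-hours, summed over aligned hourly samples."""
    if len(provisioned) != len(used):
        raise ValueError("provisioned and used must have the same length")
    return sum(max(0.0, p - u) for p, u in zip(provisioned, used))


def relative_reduction(baseline, candidate):
    """Fractional reduction of candidate relative to baseline."""
    return (baseline - candidate) / baseline


# Hypothetical hourly GPU counts (not data from the paper).
used = [4, 6, 10, 6]                  # GPUs actually busy each hour
baseline_prov = [10, 10, 12, 10]      # static, over-provisioned pool
forecast_prov = [5, 7, 11, 7]         # forecast-aware scaling with headroom

waste_base = gpu_hour_waste(baseline_prov, used)
waste_new = gpu_hour_waste(forecast_prov, used)
print(relative_reduction(waste_base, waste_new))                   # waste cut
print(relative_reduction(sum(baseline_prov), sum(forecast_prov)))  # GPU-hours cut
```

The two ratios are distinct quantities: a large cut in wasted GPU-hours can coexist with a smaller cut in total GPU-hours, because the GPU-hours that serve real demand are unchanged.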

Why It Matters For GPU Inference Optimization

SageServe belongs in GPU Inference Optimization as the demand-forecasting and resource-control branch rather than the micro-kernel or inference-engine branch. It treats GPU inference efficiency as a closed-loop control problem over workload forecasts, regional capacity, VM provisioning lead times, model placement, and SLA constraints.

For time-series and world-model readers, the key lesson is that serving optimization depends on modeling both observations and actions. A passive request-volume forecast is not enough; the serving controller needs a forecast that supports action-conditioned decisions about where capacity should be placed and how requests should be routed.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Context interface	adjacent	Uses forecasted traffic, region, model, workload class, SLA, and capacity context to condition serving decisions.	Does not study foundation time-series model pretraining or general context interfaces beyond LLM serving traces.
Control and counterfactuals	adjacent	Evaluates routing, scaling, and placement decisions under simulated production traces.	Requires domain-specific assumptions about cloud provisioning, regional demand, and simulator fidelity.
Observability and event streams	adjacent	Production request traces are treated as time-varying workload signals with SLA outcomes.	Raw production trace details are necessarily limited; transfer to other providers is not guaranteed.
Dynamic compute and serving	partially closes	Shows a concrete multi-timescale GPU-serving control loop with cost and tail-latency outcomes.	Does not address low-level kernel, KV-cache, quantization, or framework-level inference optimization.

Limitations And Gotchas

The strongest evidence is tied to Microsoft O365 and Azure-style regional serving; other providers may have different workload mixes, provisioning delays, and SLA economics.
The optimization layer depends on traffic forecast quality. A forecast error propagates into scaling and routing decisions, so it degrades the controller's actions, not only the forecast's accuracy metrics.
Simulation fidelity matters because several results come from realistic simulations rather than from running every candidate policy live at full scale.
The paper optimizes serving operations, not model architecture, quantization, KV-cache compression, or inference-engine kernels.
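
The forecast-quality caveat above can be illustrated with a small sketch of how an under-forecast turns into an under-provisioned pool. The functions `vms_needed` and `unserved`, the `headroom` parameter, and all numbers are hypothetical assumptions for illustration. They are not the paper's controller.

```python
import math


def vms_needed(forecast_rpm, vm_capacity_rpm, headroom=0.0):
    """VMs to provision for a forecast, with optional fractional headroom."""
    return math.ceil(forecast_rpm * (1.0 + headroom) / vm_capacity_rpm)


def unserved(actual_rpm, vms, vm_capacity_rpm):
    """Requests/min exceeding provisioned capacity (a proxy for SLA risk)."""
    return max(0, actual_rpm - vms * vm_capacity_rpm)


# Hypothetical scenario: the forecast undershoots actual demand.
forecast, actual, capacity = 900, 1100, 250

tight = vms_needed(forecast, capacity)                    # trusts the forecast
padded = vms_needed(forecast, capacity, headroom=0.25)    # adds 25% headroom

print(tight, unserved(actual, tight, capacity))
print(padded, unserved(actual, padded, capacity))
```

In this toy case the tight plan leaves demand unserved while the padded plan absorbs the error at the cost of one extra VM. Because GPU VM scaling has a long lead time, the controller cannot correct such an error instantly, which is why forecast error acts as a control-input error.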
