SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
Source
- Raw Markdown: paper_sageserve-2025.md
- PDF: paper_sageserve-2025.pdf
- Preprint: arXiv:2502.14617
- ACM DOI: 10.1145/3771576
- Official code / traces: shashwatj07/SageServe. Local snapshots:
papers/sageserve-2025/source_repo_metadata.json and papers/sageserve-2025/source_github_readme.md.
Status And Credibility
SageServe is an arXiv preprint first posted on 2025-02-20 and last revised as v3 on 2025-11-12. The paper has an ACM POMACS journal reference, “Proceedings of the ACM on Measurement and Analysis of Computing Systems, Vol. 9, No. 3, Article 61, December 2025,” and an ACM DOI. The author list spans UIUC, Georgia Tech, IISc, and Microsoft.
Treat it as credible systems evidence for large-scale LLM serving because it combines a peer-reviewed ACM publication, Microsoft O365 production workload characterization, a public repository, and explicit cost/SLA evaluation. It is not a general proof that every cloud LLM serving workload has the same forecastability or multi-region scaling economics.
Core Claim
SageServe argues that separating fast interactive workloads and slower non-interactive workloads into siloed GPU pools wastes expensive accelerator capacity when demand shifts across pools, regions, and models.
The system uses a forecast-aware, multi-timescale controller:
- short-term request routing across data centers;
- longer-lead-time GPU VM scaling;
- model placement decisions;
- a traffic forecast model plus an integer linear programming formulation to co-optimize routing and resource allocation.
In the terminology of this wiki, incoming requests are an event stream; forecasted demand is context; routing, scaling, and model-placement decisions are actions or control inputs; SLA and GPU-hour usage are outcomes.
flowchart LR
    Trace[production LLM request event stream] --> Forecast[traffic forecast]
    Forecast --> ILP[resource-routing optimization]
    Capacity[regional GPU VM capacity] --> ILP
    Models[model placement constraints] --> ILP
    ILP --> Route[short-term request routing]
    ILP --> Scale[long-term GPU VM scaling]
    Route --> SLA[SLA and tail latency]
    Scale --> Cost[GPU-hour utilization]
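The co-optimization of routing and resource allocation can be illustrated with a toy integer program. The sketch below is not the paper's formulation: it replaces an ILP solver with exhaustive search over integer VM counts, and every name and number in it (`FORECAST_RPS`, `VM_CAPACITY_RPS`, `MAX_VMS`, `CROSS_REGION_PENALTY`, `VM_COST`, `plan`) is a hypothetical chosen for illustration.

```python
from itertools import product

# Hypothetical inputs for illustration only; none of these come from the paper.
FORECAST_RPS = {"east": 120, "west": 40}  # forecasted demand per region (requests/s)
VM_CAPACITY_RPS = 50                      # requests/s that one GPU VM can serve
MAX_VMS = {"east": 2, "west": 3}          # VM availability per region
CROSS_REGION_PENALTY = 0.1                # cost per rerouted request/s
VM_COST = 1.0                             # cost per provisioned VM


def plan(forecast, max_vms):
    """Enumerate integer VM counts and return the cheapest feasible plan.

    Returns (cost, vm_counts, rerouted_rps), or None if no plan is feasible.
    """
    regions = sorted(forecast)
    total_demand = sum(forecast.values())
    best = None
    for counts in product(*(range(max_vms[r] + 1) for r in regions)):
        capacity = {r: n * VM_CAPACITY_RPS for r, n in zip(regions, counts)}
        if sum(capacity.values()) < total_demand:
            continue  # global capacity cannot cover the forecast
        # Demand above local capacity must be routed to a region with spare room.
        rerouted = sum(max(0, forecast[r] - capacity[r]) for r in regions)
        cost = VM_COST * sum(counts) + CROSS_REGION_PENALTY * rerouted
        if best is None or cost < best[0]:
            best = (cost, dict(zip(regions, counts)), rerouted)
    return best


best_cost, vm_counts, rerouted_rps = plan(FORECAST_RPS, MAX_VMS)
print(best_cost, vm_counts, rerouted_rps)
```

With these made-up inputs the search settles on two VMs in each region and reroutes 20 requests/s from east to west, because east cannot grow past two VMs. A real deployment would add model placement, per-class SLA constraints, and a proper solver; exhaustive search only works at this toy scale.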
Evidence And Results
- The paper characterizes Microsoft Office 365 LLM serving workloads at over 10 million requests per day across regions and workload classes.
- Evaluation uses real runs and realistic simulations on 10 million production requests across three regions and four open-source models.
- Reported result: up to 25% GPU-hour savings compared with the baseline deployment.
- Reported result: 80% reduction in GPU-hour wastage caused by inefficient autoscaling, translating to potential monthly savings of up to $2.5M while maintaining tail latency and meeting SLAs.
- The official repository exposes a simulator harness, traces, and scheduling logic, making the paper useful as both a serving-optimization source and a workload-analysis source.
Why It Matters For GPU Inference Optimization
SageServe belongs in GPU Inference Optimization as the demand-forecasting and resource-control branch rather than the micro-kernel or inference-engine branch. It treats GPU inference efficiency as a closed-loop control problem over workload forecasts, regional capacity, VM provisioning lead times, model placement, and SLA constraints.
For time-series and world-model readers, the key lesson is that serving optimization depends on modeling both observations and actions. A passive request-volume forecast is not enough; the serving controller needs a forecast that supports action-conditioned decisions about where capacity should be placed and how requests should be routed.
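A minimal sketch of why the forecast has to support decisions rather than just describe volume, assuming a fixed VM provisioning lead time measured in forecast steps. The function `vms_to_request` and all of its parameters are hypothetical and are not taken from the paper or its repository.

```python
import math


def vms_to_request(forecast_rps, step, lead_steps, active_vms, pending_vms,
                   vm_capacity_rps):
    """Return how many VMs to request now so capacity exists at step + lead_steps.

    A VM requested at `step` only becomes useful `lead_steps` later, so the
    decision is driven by the forecast at that future step, not by current load.
    """
    target_step = min(step + lead_steps, len(forecast_rps) - 1)
    needed = math.ceil(forecast_rps[target_step] / vm_capacity_rps)
    # VMs already running or already requested count toward the target.
    return max(0, needed - active_vms - pending_vms)


# Hypothetical forecast (requests/s per step) with a ramp three steps ahead.
forecast = [100, 120, 180, 260, 240]
proactive = vms_to_request(forecast, step=0, lead_steps=3,
                           active_vms=2, pending_vms=1, vm_capacity_rps=50)
reactive = vms_to_request(forecast, step=0, lead_steps=0,
                          active_vms=2, pending_vms=1, vm_capacity_rps=50)
print(proactive, reactive)
```

In this example the lead-time-aware call requests three VMs while the purely reactive call requests none, which would leave the ramp at step 3 to be absorbed by routing or by SLA violations.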
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Context interface | adjacent | Uses forecasted traffic, region, model, workload class, SLA, and capacity context to condition serving decisions. | Does not study foundation time-series model pretraining or general context interfaces beyond LLM serving traces. |
| Control and counterfactuals | adjacent | Evaluates routing, scaling, and placement decisions under simulated production traces. | Requires domain-specific assumptions about cloud provisioning, regional demand, and simulator fidelity. |
| Observability and event streams | adjacent | Production request traces are treated as time-varying workload signals with SLA outcomes. | Raw production trace details are necessarily limited; transfer to other providers is not guaranteed. |
| Dynamic compute and serving | partially closes | Shows a concrete multi-timescale GPU-serving control loop with cost and tail-latency outcomes. | Does not address low-level kernel, KV-cache, quantization, or framework-level inference optimization. |
Limitations And Gotchas
- The strongest evidence is tied to Microsoft O365 and Azure-style regional serving; other providers may have different workload mixes, provisioning delays, and SLA economics.
- The optimization layer depends on traffic forecast quality. Forecast errors can become bad control inputs, not just worse metrics.
- Simulation fidelity matters because several results rest on realistic simulations rather than on live, full-scale runs of every candidate policy.
- The paper optimizes serving operations, not model architecture, quantization, KV-cache compression, or inference-engine kernels.
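The forecast-quality limitation above can be made concrete with a small sketch. One common mitigation, assumed here and not attributed to the paper, is to scale the forecast by a headroom factor derived from past under-forecasting. The helper `headroom_factor` and the sample numbers are hypothetical.

```python
import math


def headroom_factor(actual, predicted, quantile=0.9):
    """Multiplier that would have covered `quantile` of past actual/predicted ratios."""
    ratios = sorted(a / p for a, p in zip(actual, predicted) if p > 0)
    if not ratios:
        return 1.0
    # Nearest-rank quantile, clamped to a valid index.
    idx = max(0, min(len(ratios) - 1, math.ceil(quantile * len(ratios)) - 1))
    # Never shrink the forecast: over-forecasting history yields a factor of 1.0.
    return max(1.0, ratios[idx])


# Hypothetical history of observed vs. forecasted load (requests/s).
actual = [100, 110, 90, 150, 120]
predicted = [100, 100, 100, 120, 100]
factor = headroom_factor(actual, predicted)

next_forecast_rps = 200
vm_capacity_rps = 50
naive_vms = math.ceil(next_forecast_rps / vm_capacity_rps)
padded_vms = math.ceil(next_forecast_rps * factor / vm_capacity_rps)
print(factor, naive_vms, padded_vms)
```

Here the padded plan provisions five VMs instead of four. The trade-off is explicit: a higher quantile buys SLA protection against forecast error at the price of more idle GPU-hours, which is the cost the controller is trying to reduce in the first place.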