---
abstract: |
  Recently, there has been an extensive research effort in building efficient large language model (LLM) inference serving systems. These efforts not only include innovations in the algorithm and software domains but also constitute developments of various hardware acceleration techniques. Nevertheless, there is a lack of simulation infrastructure capable of accurately modeling versatile hardware-software behaviors in LLM serving systems without extensively extending the simulation time.

  This paper aims to develop an effective simulation tool, called `\simname`{=latex}, to support future research in LLM serving systems. In designing `\simname`{=latex}, we focus on two limitations of existing simulators: (1) they lack consideration of the dynamic workload variations of LLM inference serving due to its autoregressive nature, and (2) they incur repetitive simulations without leveraging algorithmic redundancies in LLMs. To address these limitations, `\simname `{=latex}simulates the LLM serving in the granularity of iterations, leveraging the computation redundancies across decoder blocks and reusing the simulation results from previous iterations. Additionally, `\simname `{=latex}provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stacks for exploring various system designs with heterogeneous processors. Our experiments demonstrate that `\simname `{=latex}produces simulation results closely following the performance behaviors of real GPU-based LLM serving system with less than 14.7% error rate, while offering 91.5$\times$ faster simulation speed compared to existing accelerator simulators.
author:
- |
  `\IEEEauthorblockN{Jaehong Cho}`{=latex}\
  `\IEEEauthorblockA{\textit{School of Computing} \\
  \textit{KAIST}\\
  Daejeon, South Korea \\
  \href{mailto:jhcho@casys.kaist.ac.kr}{\textcolor{blue}{jhcho@casys.kaist.ac.kr}}}`{=latex}
- |
  `\IEEEauthorblockN{Minsu Kim}`{=latex}\
  `\IEEEauthorblockA{\textit{School of Computing} \\
  \textit{KAIST}\\
  Daejeon, South Korea \\
  \href{mailto:mskim@casys.kaist.ac.kr}{\textcolor{blue}{mskim@casys.kaist.ac.kr}}}`{=latex}
- |
  `\IEEEauthorblockN{Hyunmin Choi}`{=latex}\
  `\IEEEauthorblockA{\textit{School of Computing} \\
  \textit{KAIST}\\
  Daejeon, South Korea \\
  \href{mailto:hmchoi@casys.kaist.ac.kr}{\textcolor{blue}{hmchoi@casys.kaist.ac.kr}}}`{=latex}\
  `\linebreakand`{=latex}\
  `\IEEEauthorblockN{Guseul Heo}`{=latex}\
  `\IEEEauthorblockA{\textit{School of Computing} \\
  \textit{KAIST}\\
  Daejeon, South Korea \\
  \href{mailto:gsheo@casys.kaist.ac.kr}{\textcolor{blue}{gsheo@casys.kaist.ac.kr}}}`{=latex}
- |
  `\IEEEauthorblockN{Jongse Park}`{=latex}\
  `\IEEEauthorblockA{\textit{School of Computing} \\
  \textit{KAIST}\\
  Daejeon, South Korea \\
  \href{mailto:jspark@casys.kaist.ac.kr}{\textcolor{blue}{jspark@casys.kaist.ac.kr}}}`{=latex}
bibliography:
- IISWC24.camera_ready/paper.bib
title: |
  `\simnametitle`{=latex}: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale\
---

```{=latex}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
```
```{=latex}
\renewcommand\DTstyle{\ttfamily\small}
```
```{=latex}
\newcommand{\todo}[2]{\textcolor{red}{TODO[#1]}: \textcolor{blue}{#2}}
```
```{=latex}
\newcommand{\niparagraph}[1]{\noindent\textbf{#1}}
```
```{=latex}
\newcommand*\circled[1]{\tikz[baseline=(char.base)]{\node[shape=circle,fill,inner sep=0.7pt,minimum size=1.0em] (char) {\textcolor{white}{#1}};}}
```
```{=latex}
\newcommand{\bluetext}[1]{\textcolor{blue}{#1}}
```
```{=latex}
\newcommand{\simname}{\textit{LLMServingSim}\xspace}
```
```{=latex}
\newcommand{\simnametitle}{LLMServingSim\xspace}
```
```{=latex}
\newcommand{\simnamesection}{LLMServingSim\xspace}
```
```{=latex}
\newcommand{\linebreakand}{%
  \end{@IEEEauthorhalign}
  \hfill\mbox{}\par
  \mbox{}\hfill\begin{@IEEEauthorhalign}
}
```
```{=latex}
\maketitle
```
```{=latex}
\begin{IEEEkeywords}
Large language model (LLM), Inference serving, Simulation infrastructure, Neural processing unit (NPU), Processing-in-memory (PIM), Heterogeneous system
\end{IEEEkeywords}
```
# Introduction

Currently, there is a significant surge in efforts to exploit large language model (LLM) as a crucial component in real-world applications [@naveed2024comprehensive; @zhao2023survey]. Given the prohibitively high costs associated with building on-premise infrastructure for LLM inference, the common practice is to offload LLM inference to multi-tenant \`\`inference serving" systems in the cloud, exemplified by OpenAI's ChatGPT service [@chatgpt]. The massive compute and memory requirements (both bandwidth and capacity) are forcing these systems to be equipped with many AI accelerators (or NPUs).

There has been a large body of research works that aim to develop efficient hardware and software for LLM inference serving systems. Some works target to develop customized hardware techniques for accelerating LLM inference serving [@hong2022dfx; @park2024lpddr], while others focus on developing optimized system software on GPU-based scale-out systems [@vllm; @dao2023flashattention2; @patel2023splitwise; @exegpt; @miao2024spotserve]. Recently, a few pioneering works propose to take into consideration both hardware and software together for designing holistic end-to-end accelerated systems [@neupims; @attacc]. However, there is currently a lack of simulation infrastructure that allows researchers to explore their hardware-software proposals in a scale-out setting. This limitation not only makes it difficult for architecture researchers to explore scalable accelerator solutions, but also forces system software researchers to exclusively rely on GPU-based system software in the era of specialized hardware.

This paper sets out to address this limitation and develop a LLM inference serving system simulator, called `\simname`{=latex}, that jointly simulates the behaviors of LLM-customized accelerators and LLM inference serving system software. `\simname `{=latex}is built on top of an existing AI system simulator, ASTRA-sim [@astrasim], which jointly models both hardware and software for AI workloads. However, there are primarily two algorithmic differences, making the design principles of `\simname `{=latex}and ASTRA-sim largely different, as described below.

-   **Autoregressive nature of LLM generation.** ASTRA-sim focuses on distributed training, which entails millions of \`\`identical" iterations of computing that simplify the simulation. On the contrary, we target LLM inference serving that involves autoregressive token generations, producing dynamically changing behaviors across different iterations, requiring independent simulation runs.

-   **Redundancies across decoder blocks in LLMs.** Unlike ASTRA-sim that targets general models, we focus on LLMs, thus offering opportunities to customize simulations for LLM's model architecture. Modern LLMs constitute a set of decoder blocks that share common compute patterns, while the hyperparameters can vary.

To this end, we design `\simname `{=latex}in such a way that it prudently compromises simulation accuracy for achieving feasible simulation time, effectively bridging the so-called \`\`real2sim" gap, and facilitating future research in LLM inference serving systems. To accomplish these objectives, we exploit the following three major techniques.

-   **Iteration-level hardware-system simulation.** As each iteration takes different input prompts, `\simname `{=latex}simulates the iterations one by one temporally and aggregates the entirety of resulting statistics at the end. For each iteration, `\simname `{=latex}first performs prompt scheduling that determines tasks for accelerators, then analyzes the accelerator behaviors using hardware simulator, and finally sweeps through the stages in the system pipeline to simulate overall system behaviors. For hardware simulation, we employ GeneSys [@genesys], an open-source end-to-end NPU simulator that comes with a full software stack. Note that although we use GeneSys for prototyping purposes, any NPU simulator can be integrated into `\simname`{=latex}, as the system simulation workflow remains consistent regardless of the specific simulator employed. The aforementioned three steps are repeated over the iterations progressively.

-   **Compiler and simulator optimization through computation reuse.** We notice that the hardware simulator experiences a substantial bottleneck at the compilation and hardware simulation phases. `\simname `{=latex}addresses this bottleneck by optimizing implementations exploiting the redundancy of common LLM architecture and employing computation reuse techniques. Exploiting the property that the decoder-based LLM architecture consists of repeated transformer blocks, `\simname `{=latex}compiles just one transformer block and replicates it, significantly reducing the overall compile time required. In addition, we also reduce simulation time by separating simulations of attention layers from non-attention layers since it is the only computational difference between the initiation and generation phases.

-   **Operator mapping and scheduling for heterogeneous accelerator simulations.** As modern LLM serving systems are often equipped with heterogeneous accelerators such as GPU, NPU, and PIM, accurately simulating the heterogeneous systems is an important challenge. While ASTRA-sim is capable of simulating heterogeneous accelerators, their operator mapping and scheduling are manually and statically determined, which would not work for LLM serving since the tasks dynamically change. `\simname `{=latex}provides a flexible framework that allows users to plug in any accelerator compiler-and-simulation stacks for exploring various system designs with heterogeneous processors. To achieve this goal, `\simname `{=latex}comes with a \`\`skeleton" interface where the simulator users can fill the system-specific operator mapping and scheduling mechanisms. The interface connects the mapping-scheduling mechanisms with the accelerator's compiler and system simulators.

Our experiments demonstrate that the simulation results produced by `\simname `{=latex}experience average 14.7% error rate, showing a similar trend as in the real LLM inference serving system, vLLM [@vllm], equipped with multiple GPUs. Note that we observe that `\simname `{=latex}consistently produces accurate simulation results as we vary LLM architectures, parallelization schemes, number of NPUs, and different heterogeneity. `\simname `{=latex}achieves the high level of accuracy, while offering 34.7$\times$ to 491.0$\times$ faster simulation speed compared to three existing accelerator simulators: mNPUsim [@mnpusim], GeneSys [@genesys], and NeuPIMs [@neupims]. These promising results suggest that `\simname `{=latex}has a significant potential to be an effective system simulation tool for LLM serving system research, in hardware, software, or both. `\simname `{=latex}is available at `\bluetext{\url{https://github.com/casys-kaist/llmservingsim}}`{=latex}.

# Background

![Architecture of large language model.](figure/transformer.png){#transformer width="1.0\\linewidth"}

## Characteristics of LLM Model Architecture {#sec:llm-characteristics}

Most modern large language models (LLMs) employ decoder-based transformer architecture [@vaswani2023attention], as described in Figure `\ref{transformer}`{=latex}. This architecture consists of its fundamental building blocks: the embedding layer, transformer blocks, and language modeling (LM) head. Each transformer block constitutes three main components: Query-Key-Value (QKV) generation, multi-head attention, and feedforward networks.

Decoder-based transformer model operates in two distinct phases during their inference: *initiation* and *generation* phase. The initiation phase begins with receiving the prompt as input and generates QKV for all input tokens. Generated QKV values pass through subsequent multi-head attention layers and feed-forward networks. This phase predominantly involves General Matrix Multiply (GEMM) operations, which handle the bulk of computation by processing multiple data points collectively. Once the initiation phase is completed, the model outputs one token and transitions to the generation phase, with the generated token as the new input. This generation phase has autoregressive characteristics where each output token is passed to the next iteration, and the generation continues sequentially. In this phase, QKV values for newly generated tokens need to be computed, while utilizing the cached key-values of previous tokens, known as KV cache. Consequently, this phase is characterized by General Matrix-Vector Multiply (GEMV) in *Score* and *Attend* operations of multi-head attention, which involve handling single vector calculations against the entire matrix of keys and values.

## Batching and Memory Management for LLM

To minimize latency and maximize hardware utilization, LLM inference serving system often employs *batching*, which involves grouping multiple requests into a single group. However, it presents a challenge, particularly with the multi-head attention layer, which makes batching difficult. Additionally, it faces the drawback of needing to complete all requests before proceeding to the next batch, which can lead to inefficiencies. To tackle this challenge, Orca [@orca] proposes two techniques: selective batching and iteration-level scheduling. Selective batching allows batching in specific layers, such as QKV generation and feed-forward networks, while in multi-head attention layers, it allows a batch to be divided and allocated to multiple workers individually. Iteration-level scheduling involves rescheduling the batch at each iteration, removing completed requests, and adding new ones. This technique enhances hardware utilization and reduces latency by dynamically updating the batch to include only active requests, thereby streamlining the process.

Another challenge in the scale-out inference serving system is to effectively handle KV cache. Conventional LLM serving allocates KV cache based on the maximum possible sequence length, and this results in underutilized memory spaces and limited batch sizes. vLLM [@vllm] introduces a paging scheme for memory management that functions similarly to the virtual memory of operating systems. Managing memory on a page-by-page basis, vLLM effectively reduces memory fragmentation, enabling larger batch size and higher throughput.

## Processing-in-Memory (PIM) for LLM

In the generation phase, LLM inferencing heavily relies on General Matrix-Vector multiplication (GEMV) operations, especially within the multi-headed attention layers. These GEMV operations are characterized by being memory-bound with low arithmetic intensity due to the lack of matrix reuse. Processing-in-Memory (PIM) techniques are recognized for their ability to accelerate memory-intensive operations such as GEMVs. PIM optimizes memory-intensive tasks by reducing data movement through the placement of compute unit in each memory bank. This approach utilizes aggregated bandwidth to read intermediate values, execute computations, and send only the results to the host system.

There has been significant research on using PIM to accelerate LLM inference. TransPIM [@transpim] has proposed a PIM-focused solution specifically for speeding up end-to-end Transformer inference. More recently, AttAcc [@attacc], IANUS [@ianus], and NeuPIMs [@neupims] have developed approaches that integrate PIM for GEMV and activation function computations, alongside compute-centric accelerators such as NPUs and GPUs, aiming to improve the overall efficiency of LLM inference.

# Motivation

<figure id="fig:motive">
<p><span><img src="figure/motivation.png" /></span></p>
<figcaption>(a) Simulation time comparison between mNPUsim, GeneSys, and NeuPIMs. (b) Roofline analysis on the arithmetic intensity of LLM inference operations.</figcaption>
</figure>

## Need for LLM Serving System Simulators

LLMs with parameters ranging from a few hundred million to several hundred billion or even trillions, require enormous computational and memory capabilities for inference. To handle batched requests from multiple users, the scale-out serving systems often constitute hundreds of nodes, each equipped with multiple high-performance AI accelerators with high-bandwidth and high-capacity memories [@park2024lpddr; @nvidia-h100; @neupims; @attacc]. Recently, several studies have explored solutions involving software, hardware, or both of them for such large-scale LLM serving systems [@hong2022dfx; @lu2020ha; @ham2021elsa]. However, the absence of an effective system-level simulator for scale-out LLM serving systems remains a major barrier for researchers and engineers who continue to explore solutions.

## Limitation of Existing AI System Simulators

```{=latex}
\niparagraph{ASTRA-sim.}
```
We are not the first one who propose to develop scale-out system simulators for AI workloads. ASTRA-sim [@astrasim] is an effective open-source tool for simulating scale-out system for AI workloads and can be considered as an alternative to `\simname`{=latex}. However, ASTRA-sim focuses on *training* with repetitive yet identical iterations, which does not align well with the nature of LLM inference serving, where each iteration processes different batches with varying compositions of variable-length prompts. While ASTRA-sim is insufficient for our purposes as it stands, we notice that it offers essential features for simulating a scale-out AI system. Therefore, we decided to avoid reinventing the wheel and integrate ASTRA-sim as a module within `\simname `{=latex}to simulate a single iteration.

```{=latex}
\niparagraph{LLM inference simulators.}
```
Rather disjointly, there have been recent research efforts to build LLM inference simulators [@genesys; @neupims; @mnpusim; @onnxim], while they are not suitable for system-level simulation at scale. The main reason is that the current simulators operate at a slow pace, rendering them insufficient for simulating large-scale LLM inference serving, which is inherently iterative in nature. Figure `\ref{fig:motive}`{=latex}(a) shows the simulation time for one inference iteration of existing LLM simulators, including mNPUsim [@mnpusim], GeneSys [@genesys], and NeuPIMs [@neupims]. We observe that NPU simulators, mNPUsim and GeneSys, take about 10 and 1.5 hours respectively, while NeuPIMs, which simulates NPU-PIM hardware, takes roughly 2 hours. To fully process requests, multiple iterations are required until the generation phase ceases, significantly extending the entire simulation time, exceeding far beyond the simulation time reported above. Thus, it is apparent that exploring the LLM serving system designs with these inefficient simulators is nearly infeasible. This insight motivates us to devise a new LLM serving system simulator that enables efficient system-level exploration within feasible simulation time, while accurately evaluating the hardware-software behaviors.

<figure id="topo">
<p><span><img src="figure/topology.png" /></span></p>
<figcaption>Example system topology of configured with hybrid parallelism, consisting of 4 pipeline parallel groups and 4 tensor parallel NPU nodes.</figcaption>
</figure>

## Need for Simulators with Heterogeneity

As discussed in Section `\ref{sec:llm-characteristics}`{=latex}, one of the notable characteristics of LLM inference is that compute-intensive operations and memory-intensive operations are intermixed. We analyze and compare the computation and memory usage of each operation using GPT3-7B model with NVIDIA RTX 3090 GPU. Figure `\ref{fig:motive}`{=latex}(b) shows the result of roofline analysis comparing arithmetic intensity of operations during inference. We notice that operators of multi-head attention and layer normalization have low arithmetic intensity and are bandwidth-bound. On the contrary, operators of QKV generation and feed-forward networks have high arithmetic intensity and are compute-bound. These two types of operators require high memory bandwidth and high compute capability, respectively. Meanwhile, another characteristic of LLM inference is that key-value (KV) cache imposes a significant overhead on memory capacity [@hooper2024kvquant; @zhao2024alisa; @vllm], since keys and values are generated for every token of the entire sequence and every transformer block.

Requirements for high memory bandwidth, memory capacity, and computation power of LLM inference make it difficult to find \`\`one-fits-all" solution for acceleration. GPUs equipped with high-bandwidth memory such as NVIDIA H100 [@nvidia-h100] appear to be this solution, but they have small and limited scalability of capacity. Several recent studies have proposed solutions with heterogeneous accelerators for LLM inference serving [@neupims; @attacc; @hbm-pim; @patel2023splitwise]. A system using heterogeneous accelerators maps operators with conflicting properties to devices with different characteristics. For instance, with inference acceleration solutions using PIM and NPU [@neupims; @attacc], operators with low arithmetic intensity are mapped to PIM devices, and other operators are mapped to NPU devices. Following this pioneering research, various combinations of heterogeneous accelerators are being explored [@neupims; @attacc; @smart-infinity], highlighting the need for simulator frameworks to support these efforts. This phenomenon suggests that `\simname `{=latex}must provide rich flexibility in system configurations, while supporting simulation of various hardware in a plugin manner.

# `\simnamesection`{=latex}

<figure id="fig:llmsim">
<p><span><img src="figure/LLM-Sim.png" /></span></p>
<figcaption>Workflow of .</figcaption>
</figure>

We design `\simname`{=latex}, a novel system-level simulator for LLM inference workloads that jointly simulates LLM serving system software and heterogeneous hardware accelerators. For simplicity of explanation, we provide an example where `\simname `{=latex}simulates a distributed and heterogeneous system consisting of one host node, NPU nodes, and PIM nodes. Figure `\ref{topo}`{=latex} illustrates an example `\simname `{=latex}system topology configured to utilize hybrid parallelism with 4 NPU groups and 16 NPU nodes. Note that `\simname `{=latex}can be flexibly configured with various system topologies and combinations of heterogeneous accelerators.

## Simulator Design {#sec:infrastructure}

```{=latex}
\niparagraph{Overview.}
```
Figure `\ref{fig:llmsim}`{=latex} depicts the `\simname `{=latex}workflow, which is designed to perform iteration-level simulation for distributed system with heterogeneous hardware. `\simname `{=latex}consists of the following components:

```{=latex}
\begin{description}[labelindent=0.0em,nolistsep,leftmargin=1.5em]
\item[(1)] \textbf{Scheduler}
receives and organizes user requests into feasible batches based on the scheduling, KV cache management, and operator mapping strategy. It also makes the next scheduling decision based on the results of ASTRA-sim.
%
\item[(2)] \textbf{Execution engine stack}
compiles the model according to the batch configuration created by Scheduler, and performs hardware simulation for a single device. Each heterogeneous accelerator has distinct engine and produces distinct trace. Execution engine stack schedules operators from multiple traces and reconstructs them into a single trace.
%
\item[(3)] \textbf{Graph converter}
generates execution graphs using the given trace from execution engine stack, according to the configured parallelism strategy.
%
\item[(4)] \textbf{ASTRA-sim~\cite{astrasim}}
takes execution graph represented in Chakra graphs~\cite{chakra} as inputs, performs system simulation, and returns results back to the scheduler.
\end{description}
```
```{=latex}
\niparagraph{Iteration-level scheduling.}
```
LLM processes input prompts autoregressively by generating one token at a time during inference. To efficiently process the iterations, a state-of-the-art LLM serving system, Orca [@orca], proposes iteration-level scheduling. We employ this technique in `\simname `{=latex}by designing the simulation workflow as repeated alternations of prompt batch scheduling, hardware simulation, and system simulation at the iteration level.

`\simname `{=latex}scheduler first receives requests and compares their arrival times to the scheduler's timer to select *batchable* requests. In response to the dynamic changes in requests, the scheduler leverages execution engine stack, consisting of the engine-specific compilers and simulators, to simulate the behavior of accelerators. In a heterogeneous environment, operators are dealt in different accelerators, so the scheduler offloads operators to each execution engine according to the mapping strategy. Each execution engine compiles the model and simulates the hardware with specified input configurations. After hardware simulation, the graph converter converts the simulation results to an execution graph that maps the hardware to the system. This graph is then fed into ASTRA-sim to simulate and analyze the system behavior comprehensively. System simulation results are fed back to the scheduler, and the scheduler's timer, which is used to assemble a new batch for the next iteration, is updated accordingly. This cyclical interaction enables `\simname `{=latex}to progress through iterations efficiently.

```{=latex}
\niparagraph{Supporting for LLM parallelism strategies.}
```
`\label{sec:parallelism}`{=latex} In the context of LLM inference, parallelism that distributes the model weights and layers of substantial size is crucial for enhancing performance. There are three major types of model parallelism: tensor parallelism, pipeline parallelism, and hybrid parallelism [@megatron]. Tensor parallelism distributes the weight matrix across multiple workers. Pipeline parallelism assigns different layers of the model to different workers. Hybrid parallelism combines features of both tensor and pipeline parallelism.

`\simname `{=latex}can be configured to utilize a specific parallelism strategy by setting the number of accelerator groups according to the system's topology. When graph converter receives the output trace from execution engine stack, it identifies configured parallelism strategy and constructs execution graph accordingly for each accelerator. For tensor parallelism, it distributes tensors across the entire nodes and inserts *ALL-REDUCE* operators to the execution graph for intermediate synchronization. For pipeline parallelism, it allocates decoder blocks to nodes in sequence, allowing chained computation across them. Hybrid parallelism combines both parallelism strategies by distributing tensors and layers within and across accelerator groups, respectively.

To employ selective batching, where attention layers are processed in parallel across different workers, the execution engine stack and graph converter work together. Hardware simulator assigns unique identifiers to the attention layers and records them in the output trace. Graph converter then assigns these attention layers to different nodes based on their identifiers. As illustrated in Figure `\ref{topo}`{=latex}, each node within an accelerator group independently processes distinct inputs with different sequence lengths, parallelizing batch processing.

<figure id="fig:npupim">
<p><span><img src="figure/npupim.png" /></span></p>
<figcaption>Two example system topology of with NPU and PIM hardware.</figcaption>
</figure>

```{=latex}
\niparagraph{KV cache-aware memory modeling.}
```
While ASTRA-sim has a simple memory model in its implementation, it lacks some memory constraints such as capacity and memory fragmentation. However, LLM inference is sensitive to memory capacity due to their significant memory usage of model parameters and KV cache. `\simname `{=latex}uses detailed memory modeling scheme with several memory constraints to reduce the gap with actual systems. Memory model of `\simname `{=latex}includes management of KV cache and generated tokens by incorporating demand paging technique from vLLM [@vllm].

The management of KV cache and generated tokens in `\simname `{=latex}scheduler is intertwined with iteration-level scheduling, which conducts batch reconstruction each iteration, by checking generated tokens and KV cache size of each batch. First, scheduler assesses the length of incoming requests to determine the required number of KV cache pages and allocates them to the local memory of accelerators accordingly to form a single batch. After an iteration completed, the scheduler reassesses the requests. If increased sequence length due to generated tokens requires additional page or incoming requests need to be added to the batch, new page is allocated on demand. If there is insufficient memory capacity for new pages, the entire page for KV cache and sequence of the last added requests are evicted to host memory. When memory availability permits, evicted KV cache blocks are reloaded from host memory for processing in subsequent batches.

The graph converter inserts operators into the execution graph for page eviction and reloading based on the decision of the scheduler. Whenever page eviction or reloading occurs, it inserts memory store or load operators embedded with the time taken to transfer the pages between accelerator device memory and host memory into the graph. This interaction between the scheduler and graph converter enables `\simname `{=latex}to effectively utilize page-based memory modeling.

## Simulating Heterogeneity in LLM Serving

```{=latex}
\niparagraph{Heterogeneous system overview.}
```
`\simname `{=latex}supports simulation of heterogeneous systems composed of two or more different types of accelerator hardware beyond homogeneous systems. In this paper, we use example systems consisting of NPU devices for compute-bound operations and PIM devices for memory-bound operations. While we use these particular systems as running examples, it is worthwhile to note that `\simname `{=latex}also supports simulation with hardware accelerators other than NPU or PIM by adding new execution engine to `\simname `{=latex}infrastructure in a plug-in manner.

As discussed in Section `\ref{sec:infrastructure}`{=latex}, `\simname `{=latex}can flexibly configure the system topology. Figure `\ref{fig:npupim}`{=latex} illustrates two example systems: a heterogeneous system where NPU and PIM devices are directly connected, and a heterogeneous system where there are separate pools of NPU devices and PIM devices. For both example systems, accelerator nodes are connected to other accelerator nodes and hosts through high bandwidth interconnects such as CXL [@cxl].

Algorithm `\ref{alg:scheduler-overview}`{=latex} describes the overall workflow of `\simname `{=latex}scheduler, execution engine stack, and graph converter to generate execution graph with the given request batch in the context of operator mapping and scheduling. In the following section, we discuss how `\simname`{=latex}'s operator mapping and scheduling decisions are made depending on the system topology or computational properties of acceleration hardware.

```{=latex}
\niparagraph{Operator mapping.}
```
The components that perform operator mapping could be different depending on the heterogeneous system's topology and configuration. In `\simname`{=latex}, the components responsible for operator mapping are execution engine, scheduler, and graph converter. To understand how these three components interplay, we describe operator mappings in the two example systems depicted in Figure `\ref{fig:npupim}`{=latex}.

```{=latex}
\niparagraph{\circled{1} Operator mapping in execution engine.}
```
For instance, in a heterogeneous system consisting of NPU and PIM devices, memory-bound operations such as *Attend* and *Score* of multi-head attention layers are mapped to the PIM module. And, remaining compute-bound operations are mapped to the NPU module. However, as the NPU and PIM devices are directly connected each other, they act as one node at the system-level, so there is only one execution engine in `\simname`{=latex}. Therefore, in the simulation of NPU-PIM system, mapping decision is done in the internal scheduler of execution engine.

```{=latex}
\niparagraph{\circled{2} Operator mapping in scheduler.}
```
On the other hand, in a heterogeneous system consisting of NPU and PIM pools, operator mapping is done in two components, scheduler and graph converter, rather than execution engine. In Line `\ref{line:operator_mapping}`{=latex} of Algorithm `\ref{alg:scheduler-overview}`{=latex}, the scheduler decides which operator will be mapped to which device by considering the characteristics of both the operators and the hardware devices, and creates simulation plans. Then the scheduler delivers simulation plans composed of various operators mapped to the appropriate execution engine based on the operator mapping strategy. For instance, in the simulation of a heterogeneous system consisting of NPU and PIM pools, `\simname `{=latex}scheduler creates a simulation plan consisting of memory-bound GEMV operations and delivers it to the PIM execution engine. Conversely, a simulation plan consisting of the remaining compute-bound operations is delivered to the NPU execution engine. Finally, the scheduler triggers an execution engine stack consisting of compilers and simulators, each of which executes the scheduler's simulation plan to generate a trace, the input to `\simname`{=latex}'s graph converter.

```{=latex}
\SetKwInOut{Input}{input}
```
```{=latex}
\SetKwInOut{Output}{output}
```
```{=latex}
\SetKwInput{KwData}{Parameter}
```
```{=latex}
\SetKwComment{Comment}{\textsf{//} }{}
```
```{=latex}
\SetKwComment{CommentTwo}{/* }{~*/}
```
```{=latex}
\SetKwFunction{Formbatch}{\textnormal{Batch\_formatting}}
```
```{=latex}
\SetKwFunction{Splitbatch}{\textnormal{Batch\_partitioning}}
```
```{=latex}
\SetKwFunction{Profileops}{\textnormal{Operator\_profiling}}
```
```{=latex}
\SetKwFunction{Mapops}{\textnormal{Operator\_mapping}}
```
```{=latex}
\SetKwFunction{Executionengine}{\textnormal{Execution\_engine}}
```
```{=latex}
\SetKwFunction{append}{\textnormal{append}}
```
```{=latex}
\SetKwFunction{Batchschedule}{\textnormal{Operator\_scheduling}}
```
```{=latex}
\begin{algorithm}[tb]\footnotesize
\caption{Operator Mapping and Scheduling}
\label{alg:scheduler-overview}
\LinesNumbered

\KwIn{$
    {L_{req}} \ \ \ \ :\ \small\textit{A list of proceeding request information}\newline
    {L_{dev}} \ \ \ \ :\ \small\textit{A list of devices for mapping operators}\newline
    \ \ \ {Mem_{free}}:\ \small\textit{Available memory for storing KV cache}\newline
    \ \ \ {Time_{cur}}:\ \small\textit{Current system clock time}\newline
    \ \ \ {Criteria}:\ \small\textit{Key criteria to evenly partition batch}
$}
\KwOut{${G_{exec}}:\ \small\textit{Execution graph for system simulation}
$}


$Batch\ =\ \Formbatch{\ensuremath{L_{req}}, \ensuremath{Mem_{free}}, \ensuremath{Time_{cur}}}$\;
$L_{sub\_batch}\ =\  \Splitbatch{\ensuremath{Batch,\  Criteria}}$\;
\label{line:batch-partitioning}
$L_{sub\_batch\_sim}\ \leftarrow\ []$\;

\ForEach{$sub\_batch$ in \ $L_{sub\_batch}$}{


    $L_{ops}\ =\ \Profileops{\ensuremath{sub\_batch}}$\;


    $L_{ops\_mapped}\ =\ \Mapops{\ensuremath{L_{ops}}, \ensuremath{L_{dev}}}$\;
    \label{line:operator_mapping}
    $L_{sim\_ops}\ \leftarrow\ []$\;

    \ForEach{$(operator, device)$ in\ $L_{ops\_mapped}$}{


        $Ops_{sim}\ =\ \Executionengine{\ensuremath{operator, device}}$\;
        $L_{sim\_ops}.\append{\ensuremath{Ops_{sim}}}$\;
    }
    $L_{sub\_batch\_sim}.\append{\ensuremath{L_{sim\_ops}}}$\;
}

$Trace = \Batchschedule{\ensuremath{L_{sub\_batch\_sim}}}$\;
\label{line:operator-scheduling}
$G_{exec}= {\textnormal{Graph\_converter}}(Trace)$\;
\label{line:graph_converter}
\Return{$G_{exec}$}
\end{algorithm}
```
```{=latex}
\niparagraph{\circled{3} Operator mapping in graph converter.}
```
`\simname `{=latex}graph converter translates output trace into an execution graph, in Line `\ref{line:graph_converter}`{=latex} in Algorithm `\ref{alg:scheduler-overview}`{=latex}. It embeds the type and ID of compute node where each operator will be executed into the execution graph, allowing accurate simulation based on the mapping result. The graph converter also inserts data transfer operators to indicate the exchange of intermediate results between different accelerator pools. In the example system shown in Figure `\ref{fig:npupim}`{=latex}(b), the NPU pool and PIM pool are connected through high bandwidth interconnects, so the graph converter inserts data transfer operators before and after the GEMV operator, which is processed in the PIM pool. Operator mapping is done orthogonally to the parallelism strategy discussed in Section `\ref{sec:parallelism}`{=latex}, and distinct network topologies and parallelism strategies can be applied to each accelerator pool.

```{=latex}
\niparagraph{Operator scheduling.}
```
Due to dependency between operators, serial execution of a batch inevitably leads to under-utilization of heterogeneous accelerators. To overcome this limitation, `\simname `{=latex}scheduler performs batch partitioning in Line `\ref{line:batch-partitioning}`{=latex} of Algorithm `\ref{alg:scheduler-overview}`{=latex}. `\simname `{=latex}scheduler splits request batch into independent sub-batches to exploit overlapping between sub-batches across heterogeneous accelerators while satisfying criteria such as fairness of computation load or memory accesses. Subsequently, each sub-batch undergoes the operator mapping.

After operator mapping, each execution engine compiles and simulates mapped operators, and creates output trace. These output traces include mapping and simulation information for each operator and for each hardware. Operator scheduler of execution engine stack performs operator scheduling within a given batch by utilizing this information. In Line `\ref{line:operator-scheduling}`{=latex} of Algorithm `\ref{alg:scheduler-overview}`{=latex}, operator scheduling decides the execution order of operators using a greedy heuristic by considering dependencies between operators and the availability of heterogeneous accelerators. It maximizes hardware utilization of heterogeneous accelerators by allowing overlapping between operators and sub-batches.

## Techniques for Fast Simulation

LLM typically necessitates lengthy compile and hardware simulation time. To solve the problem, we introduce result-reusing techniques, which reduce computation redundancy.

```{=latex}
\niparagraph{Model redundancy reuse.}
```
First, we achieve significant time savings by exploiting the redundancy of common LLM architecture. As described in Figure `\ref{transformer}`{=latex}, decoder-based LLM architecture consists of an embedding layer followed by repeated transformer blocks. `\simname `{=latex}compiles one transformer block and replicates it, largely reducing the overall compile time required. Another optimization to reduce simulation time involves separating attention layers from non-attention layers. The initiation phase and the generation phase differ only in attention layers, depending on the presence or absence of KV cache. Therefore, `\simname `{=latex}compiles and simulates the time-consuming non-attention layers just once, and subsequently, it simply swaps out the less time-intensive attention layers, cutting down on the total processing time.

```{=latex}
\niparagraph{Computation reuse.}
```
Given the dynamic nature of input and output lengths in LLM inference, models typically need to be continuously compiled and simulated. `\simname `{=latex}adopts a strategy of reusing previously simulated results through caching. For effective caching, it manages the non-attention and attention layers differently. Non-attention layers take longer than other layers to be processed but can be reused frequently. However, attention layers require more frequent compilation and simulation but take less time. We conduct an evaluation to evaluate the impact of this caching strategy and demonstrate that our optimization technique is effective in reducing the overall simulation time.

# Discussion

## Usability of `\simname`{=latex} {#sec:usability}

```{=latex}
\niparagraph{Pluggability to 3rd-party accelerators.}
```
`\simname`{=latex}'s infrastructure allows for high configurability in system configuration and simulators of various hardware can be seamlessly attached via interfaces to the `\simname `{=latex}scheduler and graph converter. Therefore, integration with hardware simulators for various third-party accelerators other than NPU or PIM devices introduced in this paper is also possible. Beyond acceleration hardware, it is possible to extend memory features, for instance, by adding storage capabilities like HDDs and SSDs, or incorporating computational storage nodes such as SmartSSDs. This flexibility makes `\simname `{=latex}a highly versatile tool for simulation and development.

```{=latex}
\niparagraph{Compatibility with existing machine learning frameworks.}
```
`\simname `{=latex}takes the ONNX[@onnx] model format as an input, enabling interoperability with various machine learning frameworks. It allows users to seamlessly integrate and simulate widely-used open-source ONNX models written for frameworks such as PyTorch [@torch] and TensorFlow [@tensorflow]. These models can be converted into ONNX format for use within `\simname`{=latex}, facilitating a broad range of model experimentation and deployment scenarios.

```{=latex}
\begin{figure*}

    {\includegraphics[width=1.0\linewidth]{figure/tendency.pdf}}
    \caption{Comparison of throughput over time between vLLM with GPUs and \simname using request pattern following a Poisson distribution.}
    \label{fig:tendency}
\end{figure*}
```
## Limitations and Future Works

With the rapid advancement in the fields of machine learning and large language models, new variant architectures of LLM such as multi-modal [@clip; @minigpt4; @palme; @nextgpt; @chameleonteam2024chameleon; @peng2023kosmos2; @liu2024llavanext; @chen2023minigptv2], mixture of experts (MoE) [@grok; @dai2024deepseekmoe; @deepseekai2024deepseekv2], and retrieval augmented generation (RAG) [@chen2023benchmarking; @lewis2020retrieval; @shao2023enhancing; @jiang2023active; @xu2024retrieval; @ram2023context] have been developed to address limitations of the original architecture. Additionally, lightweight techniques such as quantization [@xiao2023smoothquant; @frantar-gptq; @yao2022zeroquant; @dettmers2024qlora; @ma2024era] and pruning [@frantar2023sparsegpt; @ma2023llm; @sun2023simple], and fine tuning [@hu2021lora; @hu2023llm] techniques, offer traditional but effective solutions for optimizing LLMs. LLMs using these architectures or technologies have different mathematical and computational characteristics compared to traditional decoder-based architectures, and ongoing studies are exploring accelerator architectures and inference systems to support these new models [@patel2023splitwise; @miao2024spotserve; @agrawal2023sarathi; @moe-lina].

Although `\simname `{=latex}currently focuses on traditional decoder-based LLM architectures, it can support new model variants with slight modification or even without any modification. This is due to its inherent flexibility at the system level and its use ONNX models, which allow for flexible model construction. For example, `\simname `{=latex}can support MoE models by assigning each expert to one node and configuring the network topology to route to one of the expert nodes based on the inference results of the gating network. In addition, for RAG models, it is possible to configure the system such that vector storage is simulated in a storage node, and the results retrieved from this storage are used for further inference. Novel systems supporting these models may adopt new scheduling strategies as suggested in existing solutions [@wu2023fast; @duan2024muxserve; @exegpt], but we believe they can be accommodated within `\simname`{=latex}'s simulation infrastructure.

# Evaluation {#sec:evaluation}

![image](table/methodology_table.png){width="1.0\\linewidth"} `\label{tab:methodology}`{=latex}

## Methodology {#sec:methodology}

```{=latex}
\niparagraph{System baselines.}
```
We use a homogeneous system consisting of only NPU and a heterogeneous system consisting of NPU and PIM for the validation and evaluation of `\simname`{=latex}. Throughout our evaluation, we use a GPU system equipped with 4 NVIDIA RTX 3090 GPUs with 24GB VRAM and Intel Xeon Gold 6326 CPU as the actual inference serving system baseline. We use vLLM [@vllm] framework as LLM inference serving system software. This GPU system corresponds to a homogeneous system composed of multiple NPU devices, and in simulations using `\simname`{=latex}, the performance of NPU device is set to be similar to that of the GPU. Additionally, we use NeuPIMs [@neupims], an NPU-PIM heterogeneous LLM inference acceleration system, as one of the evaluation baselines.

```{=latex}
\niparagraph{\simname configuration.}
```
`\label{sec:llmconfig}`{=latex} For running `\simname`{=latex}, we use a CPU system equipped with an Intel Xeon Gold 6226R CPU with 96GB DRAM. NPU and PIM are integrated into `\simname `{=latex}as simulator plug-ins to the execution engine stack. For NPU, we use PolyMath compiler [@polymath] and GeneSys simulator [@genesys], and for PIM, we use an in-house PIM simulator. Table `\ref{tab:methodology}`{=latex} lists the specifications of NPU and PIM hardware used throughout the evaluation. We configure the hardware architecture of the NPU in `\simname `{=latex}as a 128x128 systolic array with a clock speed of 1GHz to achieve similar performance to the GPU baseline, NVIDIA RTX 3090 GPU. We set inter-device link bandwidth and latency is set to be equivalent to PCIe 4.0 $\times$`<!-- -->`{=html}16 bandwidth at 64GB/s and latency of 100ns, respectively. In evaluation using NPU-PIM heterogeneous system, we use the same PIM hardware specification as used in NeuPIMs.

```{=latex}
\niparagraph{Simulator baselines.}
```
We also compare the simulation time of `\simname `{=latex}with other hardware simulators that support LLM inference. We use mNPUsim [@mnpusim] and GeneSys [@genesys] for NPU simulation, and NeuPIMs [@neupims] for NPU-PIM heterogeneous accelerator simulation. As these baselines lack features for LLM inference serving, the simulation time for one iteration is used for comparison.

## Simulator Validation

We evaluate the simulation accuracy of `\simname `{=latex}against the real LLM serving systems with homogeneous or heterogeneous accelerators.

```{=latex}
\niparagraph{NPU homogeneous system.}
```
Figure `\ref{fig:tendency}`{=latex} shows the fluctuation in throughput using a dynamic request pattern for GPT-3 [@gpt3] and LLaMA [@llama] models, with parameter size 7B and 30B. We synthesize request arrival patterns using Poisson distribution by sampling them from ShareGPT [@sharegpt]. We set the tensor parallelism degree as 1 and 4 depending on the model size.

In the throughput trend of initiation phase, as shown in the upper row of Figure `\ref{fig:tendency}`{=latex}, we observe a high degree of similarity in the prompt throughput trends between `\simname `{=latex}and GPU-based vLLM system. Specifically, throughput of initiation phase is influenced not only by the scheduling decision to form a request batch but also by the system's capability to accommodate the incoming requests' KV cache in memory. Therefore, these trend results demonstrate that the iteration-level prompt scheduling and detailed memory modeling of `\simname `{=latex}closely mirrors the behavior observed in homogeneous GPU-based system.

The lower row of Figure `\ref{fig:tendency}`{=latex} depicts the throughput trend in generation phase. We observe that `\simname `{=latex}also follows the generation throughput trend of the vLLM baseline system. However, unlike the trend observed in the intitiation phase, there are some performance discrepancies, which can be attributed to several factors. First, it is challenging to configure NPU architecture to precisely match the performance of the GPU. Additionally, the degree of kernel operation optimization varies between GPU-based system and `\simname`{=latex}. While GPU systems often employ kernel optimization techniques such as FlashAttention [@dao2023flashattention2], the absence of such kernel optimization in `\simname `{=latex}leads to throughput differences, especially under request-intensive conditions. Despite the deviations, the overall throughput trend of `\simname `{=latex}resembles that of the GPU-based vLLM, confirming that `\simname `{=latex}can effectively simulate LLM serving systems.

<figure id="fig:pimvalid">
<p><span><img src="figure/neupims.png" /></span></p>
<figcaption>Throughput comparison of and NeuPIMs.</figcaption>
</figure>

```{=latex}
\niparagraph{NPU-PIM heterogeneous system.}
```
Figure `\ref{fig:pimvalid}`{=latex} shows the throughput comparison results between `\simname `{=latex}and NPU-PIM heterogeneous accelerator, NeuPIMs [@neupims]. For the workload, we sample requests from Alpaca[@alpaca] and use 256 requests for each experiment. We attach in-house PIM simulator to `\simname `{=latex}in order to configure a heterogeneous NPU-PIM system. We set the number of NPU and PIM devices depending on the model size, and use various parallelization schemes. Overall, `\simname `{=latex}shows lower throughput than NeuPIMs, since `\simname `{=latex}focuses on implementing detailed systems features such as inter-device link and synchronization at the system level. However, despite these features, the performance trend shown by `\simname `{=latex}is similar to that of the NPU-PIM heterogeneous accelerator system. Comparing using various models and parallelism schemes, both `\simname `{=latex}and NeuPIMs demonstrate similar throughput, with error margins below 20% and a geometric mean error rate of 8.88%. This indicates that `\simname `{=latex}can effectively simulate LLM serving system with heterogeneous accelerator under varying configurations.

## Simulation Time Speedup

Figure `\ref{fig:simcomp}`{=latex} compares the simulation time of various LLM simulators including mNPUsim [@mnpusim], GeneSys [@genesys], NeuPIMs [@neupims], and our simulator `\simname`{=latex}. We measure the simulation time for one iteration of processing inputs with a batch size of 32 and a sequence length of 512 across all simulator baselines and `\simname`{=latex}. Also, we use GPT3 model with parameter sizes ranging from 7B to 30B.

Throughout this experiment, mNPUsim shows the longest simulation time compared to the baseline simulators and `\simname`{=latex}. Following mNPUsim, NeuPIMs, and GeneSys, `\simname `{=latex}shows the fastest simulation times. `\simname `{=latex}show an average speedup of 490.98$\times$ over mNPUsim, 34.71$\times$ over GeneSys, and 44.97$\times$ over NeuPIMs. Note that, in this experiment, we assume that the initial requests have just arrived, removing the opportunities for computation reuse optimization. Exploiting model redundancy reuse optimization has significantly boosted our simulator by skipping repeated transformer blocks. These results illustrate `\simname`{=latex}'s superior efficiency and capability in handling large-size LLMs.

<figure id="fig:simcomp">
<p><span><img src="figure/sim_compare.png" /></span></p>
<figcaption>Simulation time comparison of three LLM simulators mNPUsim, GeneSys, NeuPIMs with .</figcaption>
</figure>

<figure id="fig:runtime">
<p><span><img src="figure/sim_breakdown.png" /></span></p>
<figcaption>Breakdown of simulation time with and without computation reuse using varying parallelism strategies.</figcaption>
</figure>

## Simulation Time Reduction

Figure `\ref{fig:runtime}`{=latex} shows the simulation time and its breakdown to each component of `\simname `{=latex}with various system configurations. In this measurement, we use GPT-3 30B model and measure the simulation time to complete one iteration with batch size of 64 and sequence length of 1024 input. As depicted in the graph, the running time of `\simname `{=latex}varies significantly depending on whether reuse optimization was utilized or not. Without reuse, running time ranges from 198.0 to 215.7 seconds, but when the optimization is enabled, it ranges from 16.3 to 33.6 seconds, demonstrating a substantial speedup of 6.4$\times$ to 12.2$\times$. Computation reuse optimization eliminates the need to rerun the compilers and simulators in execution engine stack for each iteration. This highlights the significant performance benefits of computation reuse optimization applied to `\simname`{=latex}.

Figure `\ref{fig:runtime}`{=latex} also compares the total simulator running time using five parallelism strategies. Execution time of ASTRA-sim for system-level simulation is the longest when using tensor parallelism solely, as it requires more synchronization operations than other parallelism strategies. As the number of tensor parallel nodes decreases, the total simulation time also decreases, and the shortest simulation time is achieved when only pipeline parallelism strategy is used. While there is a variance in simulation time among `\simname`{=latex}'s parallelism strategies, the difference is minimal. This allows for the simulation of various system configurations within a feasible time, facilitating hardware and software exploration for LLM inference serving system.

<figure id="fig:simnpu">
<p><span><img src="figure/sim_npu.png" /></span></p>
<figcaption>Simulation time of while sweeping the number of NPU configured for tensor parallelism.</figcaption>
</figure>

## Simulation Time Scalability

Figure `\ref{fig:simnpu}`{=latex} depicts simulation times as we sweep the number of NPUs, ranging from 8 to 2048, with the system configured to use tensor parallelism. Additionally, we compare the simulation time using various GPT-3 models with parameter sizes including 7B, 30B, and 175B. We use inputs with a batch size of 64 and a sequence length of 1024, and measure simulation time of one iteration. To isolate the effect of model size and system configuration on simulation time for scalability analysis, we do not use computation reuse optimization by assuming there is no cached results for the input.

Figure `\ref{fig:simnpu}`{=latex} shows that the trend of simulation time tends to be proportional to the number of NPUs. As the system is configured to use only tensor parallelism, all nodes use the same compilation and simulation results. However, as the number of NPUs increases, system becomes more complex, requiring `\simname `{=latex}longer time to coordinate and simulate each component. Thus, the increased execution time is primarily due to system-level coordination and simulation of ASTRA-Sim and graph converter. Even when scaling up the system to a vast extent using GPT-3 175B and 2048 NPUs, `\simname `{=latex}takes 4.13 hours to simulate, outperforming other LLM simulators. This result demonstrates the scalability of our simulator in effectively managing extensive computational loads, even at scale.

# Related Work

```{=latex}
\niparagraph{NPU simulators.}
```
Several simulators have been proposed to accurately model NPU behavior for ML workloads. These simulators can either focus on a single core [@samajdar2019scalesim; @gem5_aladdin] or model interactions between cores in a multi-core NPU [@onnxim; @gemmini-dac; @mnpusim]. However, current simulators only consider the behavior of individual NPU chips and don't model systems with multiple interconnected NPU chips.

```{=latex}
\niparagraph{Non-NPU simulators.}
```
With the emergence of research [@ham2020a3; @ham2021elsa; @spatten; @gobo; @edgebert] on accelerating Attention operations in Transformer [@vaswani2023attention] and PNM (Processing near Memory) / PIM (Processing in Memory) research [@upmem-hpca24; @attacc; @neupims; @ianus; @sait-pimsim; @park2024lpddr] on accelerating memory-bound operations in LLMs, there is a lot of research on measuring performance with simulation. However, these studies have only modeled the performance of a single accelerator and lack simulations of the system.

```{=latex}
\niparagraph{ML system simulators.}
```
In the realm of distributed system simulators, tools have been developed to cater to a range of needs from general-purpose workload simulators [@simtool; @dist-gem5; @COSSIM] to those specifically designed for neural networks [@dts; @astrasim; @shen2023mars]. Recently, a specialized simulator tailored for LLM training [@bang2023vtrain] has emerged.

```{=latex}
\niparagraph{LLM inference serving simulators.}
```
Recently, reflecting the growing interest in LLM inference, a variety of simulators have been introduced [@splitwise; @distserve; @vidur]. However, these simulators are GPU-based and perform approximate simulation through methods like ML prediction, mathematical modeling, and using latency database instead of cycle-accurate simulations.

*To overcome the limitations of these existing studies, we propose `\simname`{=latex}. Any hardware simulator that supports LLM operators can be integrated into our system simulator as an execution engine to perform system simulation. `\simname `{=latex}is system simulator that supports multi-device and heterogeneous system configurations, rather than simulating a single hardware device. In addition, `\simname `{=latex}successfully tackles them by exploiting techniques including iteration-level scheduling [@orca], KV cache paging [@vllm], and the interaction between hardware and system simulators.*

# Conclusion

The absence of system simulator for LLM inference serving presents challenges for researchers in system or hardware architecture exploration. In this paper, we address these challenges by introducing `\simname`{=latex}, a fast and accurate hardware-software co-simulation infrastructure for LLM inference serving systems, leveraging unique algorithmic characteristics of LLM serving. We believe that simulation tools for scale-out and heterogeneous LLM serving systems are crucial for accelerating research progress in this field, and `\simname `{=latex}successfully makes a significant initial contribution towards meeting these needs.

# Acknowledgments {#acknowledgments .unnumbered}

This work was supported by the National Research Foundation of Korea (NRF) (No.RS-2024-00342148), Institute of Information & communications Technology Planning & Evaluation (IITP) (No.RS-2024-00459797, No.RS-2024-00396013, No.2022-0-01037), and the Graduate School of Artificial Intelligence Semiconductor (No.RS-2023-00256472), funded by the Korea government (MSIT). This work was also partly supported by HyperAccel.

```{=latex}
\bibliographystyle{IEEEtranS}
```
```{=latex}
\balance
```
```{=latex}
\clearpage
```
```{=latex}
\appendix
```
# Artifact Appendix

## Abstract

`\simname `{=latex}is a fast and accurate hardware-software co-simulation infrastructure for LLM inference serving systems written with C++ and Python. `\simname `{=latex}receives several system configurations and request traces from the user, calculates the cycles and throughput of the system composed of various accelerators, and measures the inference latency for each request.

## Artifact Check-list (Meta-information)

```{=latex}
\small
```
-   **Compilation:** g++ v7.5.0

-   **Run-time environment:** Ubuntu 18.04 Kernel v4.15.0

-   **Hardware:** x86-64

-   **Output:** standard output, *TSV* files

-   **How much disk space required (approximately)?:** Artifact evaluation requires up to 30GB of disk space, but depending on the models and datasets, it may require 1GB $\sim$ 400GB of disk space.

-   **How much time is needed to prepare workflow (approximately)?:** 5 minutes

-   **How much time is needed to complete experiments (approximately)?:** Artifact evaluation takes approximately 12 hours, but depending on the models and datasets, it may take 30 seconds $\sim$ 24 hours.

-   **Publicly available?:** Yes

-   **Code licenses (if publicly available)?:** Creative Commons Attribution 4.0 International, MIT License

-   **Workflow framework used?:** No

-   **Archived (provide DOI)?:** Yes (10.5281/zenodo.12803583)

## Description

### How to access

-   Zenodo: `\simname `{=latex}is published on Zenodo:

    `\bluetext{\url{https://doi.org/10.5281/zenodo.12803583}}`{=latex}

-   GitHub: `\simname `{=latex}is available on GitHub:

    `\bluetext{\url{https://github.com/casys-kaist/llmservingsim}}`{=latex}

### Hardware dependencies

`\simname `{=latex}requires an x86-64 architecture, and the simulation time may be affected by hardware differences. For similar simulation time results, we recommend using the hardware specified in Section `\ref{sec:methodology}`{=latex}.

### Software dependencies

`\simname `{=latex}has been tested on Ubuntu 18.04 with Python 3.9 and requires gcc and g++ versions 7.5.0 or higher. Additionally, it requires the software prerequisites of ASTRA-Sim [@astrasim], Chakra [@chakra], and Polymath [@polymath]. To meet these software prerequisites, we use the latest version of Conda. We provide the instructions for Conda installation in Appendix `\ref{sec:installation}`{=latex} and *README* file. Also, it can be downloaded individually from the following link: `\bluetext{\url{https://repo.anaconda.com/archive/}}`{=latex}.

### Data sets

We use ShareGPT [@sharegpt] and Alpaca [@alpaca] datasets to generate arbitrary request trace.

### Models

We use GPT3 [@gpt3] and LLaMA [@llama] with model size of 7B to 175B for our evaluation. Their model architecture follows the decoder-based transformer model.

## Installation {#sec:installation}

-   Clone the `\simname `{=latex}repository.

    ``` {.bash language="bash"}
    $ git clone --recurse-submodules https://github.com/casys-kaist/LLMServingSim.git
    $ cd LLMServingSim
    ```

-   Conda install (optional).

    ``` {.bash language="bash"}
    $ curl -O https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
    $ bash Anaconda3-2024.06-1-Linux-x86_64.sh
    ```

-   Install dependencies.

    ``` {.bash language="bash"}
    $ conda env create -p ./env -f ./environment.yml
    $ conda activate ./env
    ```

-   Build submodules.

    ``` {.bash language="bash"}
    $ cd astra-sim
    $ ./build/astra_analytical/build.sh
    $ cd extern/graph_frontend/chakra
    $ pip install .
    $ cd ../../../../execution_engine/polymath
    $ pip install .
    $ cd ../..
    ```

## Experiment Workflow

The workflow of `\simname `{=latex}is well illustrated in Section `\ref{sec:infrastructure}`{=latex}, particularly in Figure `\ref{fig:llmsim}`{=latex}. To explain this in detail, the simulator first receives the system configuration and request trace from the user. Then, the scheduler selects requests that can be batched in each iteration using the KV cache information and creates a simulation plan that maps operators to each execution engine. Each engine simulates the operators using this simulation plan, and the results are combined into a single trace through operator scheduling. This trace goes through a graph converter to become an execution graph [@chakra], which is then simulated at the system level by ASTRA-Sim [@astrasim]. Finally, the execution results are returned to the scheduler, which uses them to proceed to the next iteration. In this process, the simulator calculates the system's throughput and latency.

## Evaluation and Expected Results

The evaluation conducted in this paper can be categorized into five parts, as described in Section `\ref{sec:evaluation}`{=latex}. To facilitate the execution of these five evaluations, we have created five separate scripts, stored in the *evaluation* folder, each for running an individual experiment. We also provide a script (*evaluation_all.sh*) to run all of them at once.

-   Move to *evaluation* folder.

    ``` {.bash language="bash"}
    $ cd evaluation
    ```

-   Run each evaluation one by one.

    ``` {.bash language="bash"}
    $ ./evaluation1.sh
    $ ./evaluation2.sh
    ...
    $ ./evaluation5.sh
    ```

-   Run all evaluation at once.

    ``` {.bash language="bash"}
    $ ./evaluation_all.sh
    ```

The results of each script are stored in their respective evaluation folders. Each command within a script generates three files, including (1) *text* file containing the redirected standard output and (2) two *TSV* files storing throughput and simulation time.

To facilitate verification of the results used in the paper, we provide an *Excel* file (*evaluation.xlsx*) in the *evaluation* folder. The *Excel* file contains the numbers and figures from the paper, along with instructions to manipulate the raw data. For more information, please refer to Section `\ref{sec:evaluation}`{=latex} and the *README* files in each folder. Figure `\ref{fig:dirtree}`{=latex} illustrates the directory tree of `\simname `{=latex}with the evaluation scripts, their outputs, *Excel* file, and *README* files.

<figure id="fig:dirtree">

<figcaption>Directory tree of .</figcaption>
</figure>

## Experiment Customization

### Input configurations

`\simname `{=latex}takes configuration files for hardware and network as input.

-   *NPU config*: a *json* file that contains configurations of the NPU. It is located at *execution\_engine/codelets\_src/ codelets/examples/genesys/configs/* folder.

-   *network config*: a *json* file that contains configurations of the system network topology. It is located at *astra-sim/inputs/network/analytical/* folder.

### Input dataset

`\simname `{=latex}takes LLM inference request datasets with various request patterns as input.

-   *dataset*: a *TSV* file that contains the input token length, output token length, and arrival time. It is located at *astra-sim/dataset/* folder.

### Input parameters

`\simname `{=latex}has a total of 16 parameters for various simulation configurations. *README* file provides usage instructions and examples.

-   *model\_name*: Name of the LLM model. Default value is \`gpt2'.

-   *npu\_num*: Number of NPUs in the system. Default value is 16.

-   *max\_batch*: Maximum batch size. Default value is 0, which indicates no limit.

-   *batch\_delay*: Delay of batching. Default value is 0.

-   *scheduling*: The method of scheduling. Default value is \`orca', which refers to the iteration-level scheduling technique proposed in Orca [@orca].

-   *parallel*: The method of parallelism. There are three methods: \`pipeline', \`tensor', and \`hybrid'. Default value is \`hybrid'.

-   *npu\_group*: Number of NPU groups used in hybrid parallelism. Default value is 1.

-   *npu\_mem*: Local memory size of the NPU in GB. Default value is 40.

-   *kv\_manage*: The method of KV cache management. Default value is \`vllm', which refers to the paged attention technique proposed in vLLM [@vllm].

-   *pim\_type*: The method of using PIM. There are three methods: \`none', \`local', and \`pool'. Default value is \`none', which indicates no use of PIM.

-   *sub\_batch*: The method of scheduling when using PIM. It is a flag that turns on for the sub-batch interleaving technique proposed in NeuPIMs [@neupims].

-   *dataset*: The path of the dataset.

-   *network*: The path of the network configuration file.

-   *output*: The path of the output *TSV* files.

-   *gen*: The flag that indicates the initiation phase should be skipped when enabled.

-   *fast\_run*: The flag that turns on the model compilation bypassing that facilitates the fast reproduction of evaluation 1 (simulation time: 32 hours $\rightarrow$ 20 minutes).

### Result files

`\simname `{=latex}provides three outputs.

-   *standard output*: It shows which requests are being processed in each iteration of the simulator and displays the measured throughput at regular intervals. Additionally, it provides a summary of throughput and simulation time at the end.

-   *{output\_filename}-throughput.tsv*: The file stores prompt and generation throughput at regular intervals.

-   *{output\_filename}-simulation-time.tsv*: The file stores each simulation component's simulation time in milliseconds.

## Notes

More information can be found in the *README* file of each directory.

```{=latex}
\balance
```