---
author:
- Ang Li
- Sean McLeish
- Haozhe Chen
- Nimit Kalra
- Zaiqian Chen
- Artem Gazizov
- Venkata Anoop Suhas Kumar Morisetty
- Bhavya Kailkhura
- Harshitha Menon
- Zhuang Liu
- Brian R. Bartoldson
- Tom Goldstein
- Sanae Lotfi
- Micah Goldblum
- Pavel Izmailov
bibliography:
- refs.bib
title: End-to-End Context Compression at Scale
---

\newcommand{\TODO}[1]{\textcolor{red}{\textbf{TODO} #1}}
\newcommand{\inlineimage}[1]{\raisebox{-0.05\height}{\includegraphics[height=2.5em]{#1}}}
\newcommand{\todo}{{\color{red}{todo}}\xspace}
\DeclareRobustCommand\onedot{\futurelet\@let@token\@onedot}
\def\@onedot{\ifx\@let@token.\else.\null\fi\xspace}
\def\aka{\emph{a.k.a}\onedot}
\def\Eg{\emph{A.k.a}\onedot}
\def\eg{\emph{e.g}\onedot}
\def\Eg{\emph{E.g}\onedot}
\def\ie{\emph{i.e}\onedot}
\def\Ie{\emph{I.e}\onedot}
\def\cf{\emph{c.f}\onedot}
\def\Cf{\emph{C.f}\onedot}
\def\etc{\emph{etc}\onedot}
\def\vs{\emph{vs}\onedot}
\def\wrt{w.r.t\onedot}
\def\dof{d.o.f\onedot}
\def\etal{\emph{et al}\onedot}
\newcommand{\xxcomment}[4]{\textcolor{#1}{[$^{\textsc{#2}}_{\textsc{#3}}$ #4]}}
\newcommand{\micah}[1]{\xxcomment{red}{M}{G}{#1}}
\newcommand{\df}[1]{\xxcomment{teal}{D}{Fu}{#1}}
\newcommand{\bz}[1]{\xxcomment{violet}{B}{Z}{#1}}
\newcommand{\leon}[1]{{\color[rgb]{0.505, 0, 0.921}{Leon: \footnotesize\sf[#1]}}}
\renewcommand{\paragraph}[1]{\textbf{#1}}
\renewcommand{\paragraph}[1]{\sansbf{#1}}
\newcommand{\tokcount}{350B\xspace}
\newcommand{\enctokcount}{165B\xspace}
\newcommand{\llmtokcount}{185B\xspace}
\newcommand{\names}{LCLMs\xspace}
\newcommand{\name}{LCLM\xspace}
\newcommand{\fullnames}{Latent Context Language Models\xspace}
\newcommand{\fullname}{Latent Context Language Model\xspace}
\newcommand{\nyu}{\ensuremath{\textcolor[HTML]{57068C}{\spadesuit}}}
\newcommand{\umd}{\mbox{\textcolor[HTML]{E21833}{\scalebox{0.6}{\faHeart}}}}
\newcommand{\princeton}{\ensuremath{\textcolor[HTML]{E77500}{\clubsuit}}}
\newcommand{\columbia}{\ensuremath{\textcolor[HTML]{009EFF}{\blacklozenge}}}
\newcommand{\modal}{\ensuremath{\textcolor[HTML]{00A86B}{\scalebox{1.0}{$\blacktriangle$}}}}
\newcommand{\meta}{\ensuremath{\textcolor[HTML]{0064E0}{\scalebox{0.7}{$\blacksquare$}}}}
\newcommand{\llnl}{\mbox{\textcolor[HTML]{003C71}{\raisebox{-0.15ex}{\scalebox{1.25}{$\bullet$}}}}}
\newcommand{\harvard}{\ensuremath{\textcolor[HTML]{A51C30}{\bigstar}}}
\maketitle
\blankfootnote{$^{1}$Work completed during an internship at Modal.}
\blankfootnote{$^{2}$Served solely in an advisory role for this paper. All model training and data processing were conducted entirely outside of Meta.}

# Introduction {#sec:intro}

Reasoning over long contexts is a crucial capability for state-of-the-art Large Language Models (LLMs), as it enables them to parse long documents, engage in multi-turn conversations, and perform long-horizon agentic tasks. However, in production systems, the input context, working horizon, and memory can grow to millions of tokens, making inference increasingly constrained by memory and latency due to the growth of the KV cache [@hooper2024kvquant]. Even when inputs fit within the model's maximum context window, models often struggle to reliably use information distributed across long contexts [@liu2024lost; @an2025why]. As a result, context and memory management have emerged as critical systems and modeling challenges for deploying long-horizon LLM systems.

A natural response is to reduce the size of the KV cache directly. KV cache compression methods reduce the memory footprint of cached activations by evicting cache entries [@li2024snapkv; @devoto2025expected; @kim2025kvzip; @zweiger2026fast] or training a compact cache offline [@cartridges]. However, they have important limitations for general long-context inference. Many methods still require the full context to be prefilled before compression, and some require substantial additional time or memory beyond the original prefill. Query-dependent methods produce caches that are difficult to reuse across turns [@li2024scbench], while methods that evict caches non-uniformly across attention heads and layers [@kim2025kvzip] are difficult to integrate with modern inference engines that assume a shared sequence length across the cache.

<figure id="fig:cover" data-latex-placement="t!">
<img src="figs/pareto_curve.png" />
<figcaption><strong>Our achieve high quality compression while being fast and memory efficient.</strong> We show RULER <span class="citation" data-cites="hsieh2024ruler"></span> (left) and LongBench <span class="citation" data-cites="bai2024longbench"></span> (right) accuracy vs. TTFT (time to first token) per sample on a single H200. All methods are based on the same decoder <code>Qwen3-4B-Instruct-2507</code> <span class="citation" data-cites="qwen3"></span>. We see that our models lie on a new Pareto frontier, compressing contexts much faster while maintaining high accuracy, especially for high compression ratios. KV cache compression baselines appear as vertical lines because the context is prefilled the same way regardless of compression ratio, and the eviction operation is much faster than the prefill time. </figcaption>
</figure>

Encoder-decoder soft-token methods [e.g. @refrag; @in-context-autoencoder; @auto-compressor; @e2llm; @cepe] offer a promising way to address the limitations of KV cache compression. Rather than manipulating the KV cache directly, these models encode the raw input tokens into a much shorter sequence of continuous embeddings, which are then provided to the LLM in place of the original context. In principle, this approach is highly attractive: compression is parallelizable, supported by standard LLM inference engines, and works with standard LLM decoding while extending the decoder far beyond its native context length.

However, existing soft-token compression methods typically face a trade-off: they either substantially degrade the capabilities of the base language model or depend on domain-specific training to remain effective. This motivates a fundamental question: **can we train a general, task-agnostic compressor that preserves a model's original capabilities?** To this end, we introduce *`\fullnames`{=latex}* (`\names`{=latex}), a family of encoder-decoder compressors trained end-to-end at scale. We initialize both the encoder and decoder with pre-trained models and jointly train them on up to a total of `\tokcount `{=latex}tokens for the combined encoder-decoder. Crucially, we demonstrate that `\names `{=latex}can preserve decoder performance while reducing memory used for input context.

Our contributions are summarized as follows:

1.  **Training recipe:** We develop a fine-grained training recipe for encoder-decoder context compression. Accordingly, we curate continual pre-training and finetuning data for training compressors and optimize training details across multiple training stages. See `\Cref{sec:training}`{=latex}.

2.  **Comprehensive architectural search:** We conduct a large-scale architecture search to identify optimal design choices for encoder-decoder compression models in a controlled setting. See `\Cref{sec:arch}`{=latex} and `\Cref{fig:arch_sweep}`{=latex}.

3.  **New memory/speed-performance frontier:** We train a family of models across different compression ratios at scale. Our experiments identify a Pareto frontier that effectively balances memory, latency, and downstream accuracy. We open-source this suite of models and all code. See `\Cref{sec:results}`{=latex} and `\Cref{fig:cover,fig:pareto_optimal,fig:more_benchmarks}`{=latex}.

4.  **Agent with latent context:** We create an agentic system with a high compression ratio, where the agent can select which compressed chunk to expand. This agentic harness substantially enhances the model's performance on challenging needle-in-the-haystack tasks. See `\Cref{sec:agentic}`{=latex} and `\Cref{fig:agentic_bar_plot}`{=latex}.

# Related Work {#sec:related}

Context compression approaches generally fall into three categories: hard-token, soft-token, and KV cache compression. We defer the discussion of hard-token approaches to `\Cref{app-sec:extended-rel-work}`{=latex}, since they generally underperform compared to the latter two.

**KV cache compression.** KV cache compression methods evict entries from the KV cache. Most methods rely on hand-crafted or learned heuristics to select which entries to drop.

Prompt-agnostic methods [@kim2025kvzip; @devoto2025expected; @zweiger2026fast] prune the context without knowledge of the query, whereas prompt-dependent methods such as SnapKV [@li2024snapkv] require explicit context-prompt pairs and yield query-specific caches that limit multi-turn applicability [@li2024scbench; @kim2025kvzip]. An alternative is to anticipate future queries at compression time: Cartridges [@cartridges] distills a fixed-size KV cache per corpus. However, this comes at a substantial training cost per corpus, with an ICL-quality cache requiring approximately 30 minutes on an 8$\times$H100 node for an 8B model.

Although KV cache compression methods are well-studied, they are not widely adopted in inference engines such as vLLM [@kwon2023efficient] or SGLang [@zheng2024sglang]. Compression methods such as KVzip allocate eviction budgets non-uniformly across attention heads and layers; therefore, they cannot reduce the sequence-length dimension of the KV cache. Instead, KV cache compression methods mask evicted positions in the attention computation, forfeiting the memory and throughput benefits of these paged-attention engines. Some methods further require prefill passes that are two to three times longer than the original input [@kim2025kvzip]. We note that soft token compression methods are fully compatible with these open-source inference frameworks.

**Soft-token compression.** Despite promising results, most prior soft-token methods rely on offline preprocessing and do not convincingly preserve the base model's broad in-context behavior. Prior work is mostly evaluated on domain- or task-specific finetuning that is not transferable across tasks [@GMSA; @mean-pooling; @dai2025pretraining; @refrag; @xrag]. To the best of our knowledge, these methods are not evaluated on long-context benchmarks with information-dense tasks that require fine-grained details throughout the whole context, such as RULER [@hsieh2024ruler], to demonstrate complex long-context understanding.

In `\Cref{sec:arch}`{=latex}, we discuss the architectural decisions of prior work on soft-token compression as part of our architectural search and analysis. Our work shows that, with large-scale staged training, a single online soft-token compressor can robustly handle general heterogeneous inputs while closely matching uncompressed behavior.

# The `\fullname `{=latex}Architecture

An `\name `{=latex}consists of an encoder with a pooling method that maps token chunks to soft tokens, an adapter to project to the decoder's embedding dimension, and a decoder that consumes the soft tokens as latent context. Given a sequence of input tokens $x_{1:T}$ and a compression ratio $N$, the encoder maps each contiguous block of $N$ input tokens to a single latent token. Let $W$ denote the encoder window size, i.e., the number of input tokens processed by the encoder in one forward pass. We split the input sequence into $I=\lceil T/W\rceil$ encoder windows, each containing at most $W$ tokens: $$\begin{equation}
w_i
=
x_{
    (i-1)W+1
    \; : \;
    \min(iW,T)
},
\qquad
i = 1,\dots,I.
\end{equation}$$ For each window $w_i$, the encoder computes the final hidden states $h^{(i)}_{1:|w_i|}$ for each input token in the window. A pooling operator then aggregates these hidden states into $M_i = \left\lceil \frac{|w_i|}{N} \right\rceil$ latent tokens $z^{(i)}_{1:M_i}$. When $W=N$, each encoder window contains a single compression block, yielding one latent token per forward pass. When $W=T$, the encoder processes the entire input sequence in one forward pass and produces $\lceil T/N\rceil$ latent tokens. We concatenate the latent tokens from all encoder windows to obtain the full compressed latent sequence $z_{1:M}$, where $M$ is the total number of latent tokens after compression.

Because the encoder and decoder may have different hidden dimensions, an adapter $a(\cdot)$ projects the compressed latent sequence $z_{1:M}$ from the encoder hidden dimension into the decoder hidden dimension, producing latent tokens $s_{1:M}$. The decoder then consumes these projected latent tokens in place of the corresponding input tokens.

# Training Recipe {#sec:training}

We now describe our multi-stage recipe and data for training. In contrast to prior context-compression work that trains specialized models on small-scale in-domain datasets [@GMSA; @mean-pooling; @arc-encoder; @xrag; @e2llm], our goal is to preserve the strong performance of a powerful LLM across downstream tasks. To this end, we curate three types of data: continual pre-training data, Supervised Fine-Tuning (SFT) data, and auxiliary reconstruction data.

<figure id="fig:data_example" data-latex-placement="t">
<img src="figs/data.png" style="width:100.0%" />
<figcaption><strong>Examples of the three data types used to train .</strong> We curate continual pre-training data with interleaved compressed/uncompressed blocks of tokens, supervised fine-tuning data with compressed prompts and long documents, and auxiliary reconstruction data that asks the decoder to reproduce the original context from the compressed latents.</figcaption>
</figure>

## Training Data

#### Continual pre-training dataset.

We construct a carefully curated high-quality pre-training mixture covering Common Crawl text, code, science and reasoning, long-context data, and instruction-style data. We partition each sequence into multiple segments and alternate between compressed segments and standard token segments. We then compute the next-token prediction loss on the uncompressed tokens *only*. This interleaved format differs from the simple first-half compressed, second-half trainable split used in prior work [@refrag; @in-context-autoencoder; @auto-compressor; @cepe]. By distributing compressed spans throughout the sequence, the model learns to condition on latent context at multiple positions rather than only at the beginning of the context. We provide a visual example of this data format in `\Cref{fig:data_example}`{=latex}. The exact pre-training mixture is described in `\Cref{app-subsec:cont-pre-train-data}`{=latex}, with additional details on compression formatting in `\Cref{compression-data-details}`{=latex}.

#### SFT data.

To recover performance after continual pre-training [@gao2025train] and further enhance instruction following capability with latent context, we further post-train the model with supervised fine-tuning to improve reasoning, instruction following, and long-context understanding when conditioning on compressed inputs. We construct an SFT mixture spanning three components: (i) reasoning, (ii) long-context instruction following, and (iii) general instruction following and multi-turn conversation. Some completions in the original SFT data were generated by outdated models. To improve response quality, we relabel a subset of the SFT data using `Qwen3-30B-A3B-Instruct-2507` and `Qwen3-235B-A22B-Instruct-2507`, whose model checkpoints are released under the Apache-2.0 license. We report the exact SFT mixture and other details in `\Cref{app-subsec:posttrain-data}`{=latex}.

#### Auxiliary reconstruction data.

To accelerate training and encourage the latent representations to preserve fine-grained information, we introduce an auxiliary reconstruction task. Given a sampled document, we compress the text and ask the model to repeat the original content. To increase diversity, we include a variety of sources spanning code, text, long documents, math, and LaTeX, as described in `\Cref{app-subsec:aux-recon-data}`{=latex}. To reduce prompt overfitting, we use a bank of $100$ prompt templates per source. This auxiliary task encourages the compressor not only to support semantic understanding but also to retain fine-grained details needed for tasks such as exact retrieval. We include auxiliary reconstruction data in both continual pre-training and SFT.

## Training Stages

We adopt a multi-stage training recipe to improve stability, preserve the original model's capabilities, and mitigate catastrophic forgetting. The stages progressively increase the number of trainable components, starting with adapter and encoder warm up, followed by end-to-end continual pre-training and SFT.

- **Stage 0: Adapter warmup.** We freeze both the encoder and decoder and update only the adapter.

- **Stage 1: Encoder training.** We unfreeze the encoder while keeping the decoder frozen.

- **Stage 2: End-to-end continual pre-training.** We unfreeze the decoder with a small learning rate.

- **Stage 3: Supervised fine-tuning.** We train on the SFT mixture and increase the decoder learning rate.

This staged procedure is analogous to Vision-Language Model (VLM) training pipelines that first align pre-trained representations across modalities [@llava; @tong2024cambrian] and then fine-tune for downstream tasks. For stages 0, 1, and 2, we uniformly sample from the continual pre-training mixture; for stage 3, we train on the SFT mixture. In early experiments, we consider directly training the full model end-to-end from the beginning, but find that this approach underperforms the multi-stage recipe. At the beginning of training, the decoder is not accustomed to taking in the output of the embedding model, so both encoder and decoder parameter gradients can be large, leading to model degradation. The staged approach, where we first train the adapter before unfreezing the encoder and decoder, keeps gradients smooth and training more stable, preventing this degradation. We therefore adopt the multi-stage recipe for all scaled experiments. We also consider parameter-efficient variants that keep the decoder fully frozen [@in-context-autoencoder; @xrag; @cepe] or use LoRA [@e2llm; @GMSA; @mean-pooling], but we find them to substantially underperform compared to full-parameter training in early experiments; we therefore focus on end-to-end training. Details of the continual pre-training and SFT data mixtures are provided in `\Cref{app-tab:data-mixture-pre-training}`{=latex} and `\Cref{app-tab:SFT-mixture}`{=latex}, respectively. All other training details, including token budgets, learning rates, and sequence lengths, are provided in `\Cref{tab:training_recipe}`{=latex}.

\begin{table*}[t!]\small

    \setlength{\tabcolsep}{8pt}
    \caption{\textbf{Training recipe for compression model for four different stages.} We report, for each stage, which module(s) are optimized (adapter / encoder / decoder), peak learning rates, learning-rate schedule, optimizer, global batch size (in tokens), sequence length, training-token budget, and the data source used.}
    \label{tab:training_recipe}
    \resizebox{\linewidth}{!}{
    \begin{tabular}{l|cccc}
    \toprule
     & \textbf{Stage 0} & \textbf{Stage 1} & \textbf{Stage 2} & \textbf{Stage 3} \\
     & \textbf{Adapter Training} & \textbf{Encoder Training} & \textbf{LLM Training} & \textbf{SFT} \\
    \hline
    \\[-4pt]
    \textbf{Training Config} \\
    Adapter Peak LR  & $1.0 \times 10^{-3}$ & $6.0 \times 10^{-5}$ & $6.0 \times 10^{-5}$ & $3.0 \times 10^{-5}$ \\
    Encoder Peak LR & NA & $6.0 \times 10^{-5}$ & $6.0 \times 10^{-5}$ & $3.0 \times 10^{-5}$ \\
    LLM Peak LR & NA & NA & $1.0 \times 10^{-5}$ & $3.0 \times 10^{-5}$ \\
    LR scheduler & 5\% warmup steps with cosine decay to $1.0 \times 10^{-6}$ \\
    Optimizer & AdamW ($\beta_1=0.9$, $\beta_2=0.95$) \\
    \midrule
    \textbf{Data} \\
    LLM Batch size (tokens) & 4 Million \\
    LLM Sequence Length & 16384 \\
    Encoder Tokens & $17.54$ B & $35.07$ B & $82.43$ B & $29.73$ B \\
    LLM Tokens (16x) & $21.29$ B & $42.58$ B & $100.08$ B & $21.32$ B \\
    Total Tokens (16x) & $38.83$ B & $77.65$ B & $182.51$ B & $51.05$ B \\
    Data Source & pre-training Mixture & pre-training Mixture & pre-training Mixture & SFT \\
    \bottomrule
    \end{tabular}
    }
\end{table*}

<figure id="fig:arch_sweep" data-latex-placement="t!">
<img src="figs/arch_loss.png" style="width:100.0%" />
<figcaption><strong>A from-scratch pre-training sweep identifies the best encoder-decoder compressor architecture.</strong> <strong>Left:</strong> We explore how to best train an encoder LM to compress raw text into latent vectors and a decoder LM that uses the vectors as latent context. <strong>Right:</strong> We sweep many variants of encoder-decoder compressors end-to-end with 38B training tokens at a compression ratio of <span class="math inline">16×</span> and compare pre-training loss to find the best architecture. Since loss is computed only on the uncompressed tokens, the pre-training loss is substantially lower than a standard next-token prediction loss over all tokens. Extended experiments are in .</figcaption>
</figure>

# Architectural Design Space of Latent Encoders {#sec:arch}

We conduct a controlled architecture search over encoder-decoder context compressors by pre-training `\name `{=latex}variants from scratch. Our cleanroom setting lets us isolate the effect of architectural choices without confounding from pre-trained initialization. This is particularly important for choices such as the encoder attention mask and pooling operators: for example, the pre-trained `Qwen3-Embedding-0.6B` encoder uses causal attention and EOS pooling, which could bias the architectural search. Specifically, we use `Qwen3-0.6B` as the base architecture for both the encoder and decoder, randomly initializing all weights with the same seed, and training all components end-to-end with no frozen parameters. Each variant is trained on the pre-training mixture described in `\Cref{sec:training}`{=latex} for 38B tokens at a compression ratio $N=16$. This comprehensive search covers nearly all major architectural choices explored in prior soft-token compression work, along with additional underexplored choices. We describe each design axis and its empirical effect below.

#### Pooling operator.

Prior work on context compression has primarily focused on two pooling strategies. Token-based pooling appends or prepends designated tokens, such as `EOS` or `CLS`, and uses their final hidden states as compressed representations [@refrag; @e2llm; @auto-compressor]. Mean pooling averages encoder hidden states over each block of $N$ input tokens [@in-context-autoencoder; @GMSA], a design that is common in vision and multimodal models [@gemma4_2025]. We additionally consider concatenation, a widely used alternative in vision-style architectures but less explored for context compression [@qwen3vl; @tong2024cambrian]. Concatenation preserves the $N$ sequential encoder hidden states by stacking them into a single wide vector, after which the adapter projects the resulting representation back to the LLM hidden dimension. The formal definitions of pooling operators can be found in `\Cref{app-subsec:encoder-defs}`{=latex}.

`\Cref{fig:arch_sweep}`{=latex} compares the three pooling operators in our pre-training sweep. Mean pooling consistently improves pre-training loss over token-based pooling, but is empirically indistinguishable from concatenation. This agrees with prior evidence that mean pooling can outperform token-based pooling for compression [@mean-pooling], and with findings that mean pooling can be surprisingly effective despite being a lossy operation in text embedding models [@hara2026mean]. It is also consistent with related observations in vision architectures [@chu2021conditional]. While concatenation can, in principle, preserve distinct representations that mean loses, and has been found to outperform mean pooling in other representation-learning settings [@chen2025comodo], we do not observe this advantage in our small-scale experiment. We compare these two operators at scale in `\Cref{app-fig:pool-abl-at-scale,app-subsubsec:pooling-operator}`{=latex}, where concatenation outperforms mean pooling at low compression ratios and mean pooling outperforms concatenation at high compression ratios, while the gap between the two narrows as context length increases.

#### Encoding granularity.

Ideally, one would set the encoder window size to the full context length, allowing the encoder to jointly contextualize all input tokens, analogous to ViT-style encoders [@dosovitskiy2020image] that process the entire image at once. For language inputs, however, the context length can be substantially larger, making full-context encoding prohibitive in both memory and time.

To study this trade-off, we vary the encoder window size over $W \in \{N, 256, 1024\}$ for mean pooling in `\Cref{fig:arch_sweep}`{=latex}. We find that increasing $W$ from $N\;(16)$ [@refrag; @e2llm] to $256$ leads to a large improvement, while increasing $W$ further from $256$ to $1024$ yields an additional smaller gain. Since $W=1024$ does not introduce significant memory or runtime overhead in our setting, we adopt $W=1024$ as the default. This setting enables the encoder to attend over a larger local context and produce richer compressed representations. One limitation of using $W < T$ is that the input must be chunked into multiple windows, so information crossing a boundary is split across two encoder forward passes. To address this, we also design a boundary-overlap variant that allows each window to see neighboring tokens during encoding. However, this approach does not improve pre-training loss while increasing encoder compute, so we use no overlap by default and defer details to `\Cref{app-subsec:arch-overlap}`{=latex}.

#### Encoder attention mask.

Unlike standard decoder language modeling, where causal masking is required to match autoregressive inference, context compression is applied to the prompt before decoding. The encoder can therefore use either causal or bidirectional attention, as in prefix-LMs [@raffel2020exploring]. In `\Cref{fig:arch_sweep,app-fig:adapter-ov-loss}`{=latex}, we compare these two masking choices and find that causal masking consistently achieves lower pre-training loss.

#### Adapter design.

The adapter projects encoder latents from dimension $d_{\mathrm{enc}}$ to the decoder embedding dimension $d_{\mathrm{dec}}$. Prior encoder-decoder compression work often assumes matching encoder and decoder hidden sizes, but a wide range of encoder-decoder models pair smaller encoders with larger decoders, making the adapter a non-trivial design choice. We compare a lightweight MLP projection against an attention-based adapter that adds a single self-attention layer over the latent sequence before the MLP projection. In contrast to prior findings [@GMSA], we observe in `\Cref{app-fig:adapter-ov-loss}`{=latex} that the MLP-only adapter achieves lower pre-training loss while using less compute, and we adopt it as the default. Formal definitions are provided in `\Cref{app-subsec:adapter}`{=latex}.

<figure id="fig:pareto_optimal" data-latex-placement="t!">
<img src="figs/scaling_time_memory.png" style="width:100.0%" />
<figcaption><strong>have lower TTFT and peak GPU memory as context length increases.</strong> We plot time to first token (left) and peak GPU memory (right) for a single sample. We note the baselines use the same time and memory regardless of compression ratio; hence, there is only one line per baseline method. We record all measurements on the same H200 GPU, and truncate lines when the method cannot be performed within the 141GB of memory. For peak GPU memory, our method plateaus at longer contexts for larger compression ratios: in this regime, the encoder’s batched processing dominates memory usage over the decoder. Memory begins to grow again once the decoder’s prefill memory surpasses the encoder’s activation per forward pass. </figcaption>
</figure>

#### Optimal architecture at scale. {#par:optimal-arch-at-scale}

Overall, our from-scratch pre-training experiments in `\Cref{fig:arch_sweep}`{=latex} serve as a lightweight architecture sweep, allowing us to narrow the search space to a smaller set of promising design choices. We then scale these candidates using the full training pipeline, both to identify the strongest architecture at scale and to validate that the small-scale pre-training findings remain predictive under large-scale training. For our main experiments, we use `Qwen3-Embedding-0.6B` as the encoder with `Qwen3-4B-Instruct-2507` as the decoder. Across the continual pre-training loss curves in `\Cref{app-fig:pool-abl-at-scale,app-fig:adapter-ov-loss,app-fig:large-scale-sweep}`{=latex} and the downstream ablations in `\Cref{app-subsec:arch-sweep}`{=latex}, we find that the optimal architecture at scale uses an encoder window $W=1024$, causal masking, and an MLP adapter. The choice of pooling operator depends on the compression ratio: mean pooling has a small edge over concatenation at $16\times$ compression, while concatenation slightly outperforms mean pooling at $4\times$. Since we primarily focus on $16\times$ compression in our analysis, we adopt mean pooling as the default in all subsequent experiments. Full results and per-task breakdowns are reported in `\Cref{app-sec:training-recipe-sweeps}`{=latex}.

We also study alternative model choices, including whether to initialize the encoder from an embedding model or a language model, in `\Cref{app-subsec:encoder_lm_or_emb}`{=latex}. In `\Cref{app-sec:scaling model size}`{=latex}, we analyze the performance of scaling the sizes of the encoder and decoder.

# Results: Improving Latency- and Memory-Performance Tradeoffs {#sec:results}

We predominantly focus on long-context benchmarks: we evaluate on RULER [@hsieh2024ruler], LongBench [@bai2024longbench], and LongHealth [@adams2025longhealth]. For the long-context evaluation suite, we maintain instructions as uncompressed tokens, while we compress long-context segments. We report which parts of each benchmark are compressed and which are left as standard uncompressed tokens in `\Cref{app-tab:eval-benchmarks}`{=latex}.

We measure two efficiency axes: peak memory during compression, and compression time. Peak memory includes both the encoder and decoder for a single sample at each context length. We report the mean over 30 measurements on the same H200 GPU, using 20 warmup steps to reduce kernel-level noise. We define compression time as time-to-first-token (TTFT), i.e., the time required for a method to reach the point at which it can generate the first token. We adapt the KVPress [@devoto2025expected] timing code[^1] to support our method. All methods are benchmarked using Hugging Face Transformers [@hf_transformers] implementations.

Since `\names `{=latex}process input tokens in encoder-window chunks, we can parallelize compression by batching encoder forward passes before passing the resulting soft tokens to the decoder. For all figures in this paper, we use an encoder batch size of $128$; with encoder window size $W=1024$, this corresponds to processing $131{,}072$ input tokens per batched encoder pass. We find this batch size provides a good trade-off between compression speed and encoder memory usage, improving GPU utilization without exceeding memory limits for ultra-long contexts. Moreover, because our method preserves the standard KV cache structure, prefill and generation can be further accelerated with standard inference frameworks such as vLLM [@kwon2023efficient] and SGLang [@zheng2024sglang], as well as efficient attention implementations [@linearattention; @kimilinear; @deepseekv2; @deepseekv4]. Therefore, the throughput measurements reported here are conservative relative to what optimized serving implementations could achieve.

<figure id="fig:more_benchmarks" data-latex-placement="t!">
<img src="figs/pareto_curve_extra.png" style="width:100.0%" />
<figcaption><strong>establish a new Pareto frontier on long-context benchmarks in terms of time to first token.</strong> We plot accuracy across RULER and LongHealth, and find that our compress samples faster with equivalent or higher accuracy than other compression methods. Full results can be viewed in .</figcaption>
</figure>

## Baselines

We consider a range of KV cache compression baselines, including Expected Attention [@devoto2025expected], Attention Matching [@zweiger2026fast], KVzip [@kim2025kvzip], FastKVzip [@kim2026fast], and SnapKV [@li2024snapkv]. Each method has different native support for query-aware, query-agnostic, and self-study compression configurations.

Query-aware methods, such as SnapKV, are designed to compress a context given a known query; however, the resulting compressed cache may not generalize well to unseen queries or future turns. Query-agnostic methods, such as KVzip and FastKVzip, compress a cache without access to the downstream query, often using reconstruction-style objectives such as repeating the original context. Self-study, as introduced by @cartridges, performs compression using synthetic queries to expand coverage over potential inference-time queries. These synthetic queries are typically produced by prompting the target decoder to generate questions about the context. In Attention Matching [@zweiger2026fast], the authors experiment with a mixed query-agnostic and self-study configuration and find that self-study has limited impact on the repeat-prefill prompt. Since self-study introduces additional compression-time overhead, we do not apply it to RULER.

It is also important to distinguish algorithmic and theoretical cache reduction from systems-realizable cache reduction. Some KV cache methods produce an actual smaller cache that can, in principle, reduce decoding memory and latency. Others, such as @kim2025kvzip [@kim2026fast], are evaluated by masking or subsampling entries from a full cache, or by using non-uniform eviction patterns across heads and layers. These approaches are useful for measuring the quality of a compressed cache, but they do not necessarily translate into wall-clock speedups or memory savings in standard inference engines without specialized kernels that have not been released or even implemented.

We use the KVPress [@devoto2025expected] implementation for SnapKV, KVzip, FastKVzip, and Expected Attention. We use the official implementation for Attention Matching. We discuss all baselines further in `\Cref{sec:related}`{=latex} and `\Cref{app-sec:extended-rel-work}`{=latex}.

## A New Pareto Frontier

We train `\names `{=latex}with compression ratios of $16\times$, $8\times$, and $4\times$. In `\Cref{fig:cover,fig:more_benchmarks}`{=latex}, `\names `{=latex}establish a new Pareto frontier in compression time and accuracy over KV cache compression baselines on RULER, LongBench, and LongHealth. KV cache compression methods appear as nearly vertical lines in these plots because their compression time is largely independent of the target compression ratio. These methods first materialize the full KV cache and then perform eviction, masking, or compaction, whose cost is small relative to the full-context prefill. In contrast, `\names `{=latex}reduce the sequence length before decoder prefill, so higher compression ratios directly reduce the amount of decoder-side computation and memory.

**Time and memory scaling with context length.** In `\Cref{fig:pareto_optimal}`{=latex}, we visualize compression time and peak GPU memory for our method and the baselines across context lengths from 4K to 1M. `\names `{=latex}achieve the fastest compression time and, at longer context lengths, substantially lower peak GPU memory. Attention Matching runs out of memory at 1M tokens and fails at 512K tokens due to numerical instability in the linear solver. All other methods run out of memory at 512K and 1M tokens. This efficiency stems from two properties of `\names`{=latex}: compression is performed over fixed-size encoder windows rather than through a full-context decoder prefill, and the encoder is much smaller than the decoder. Since our encoder processes at most 128K input tokens per batched forward pass, peak memory is initially dominated by encoder activations. As a result, peak memory remains nearly flat for the $16\times$ model from 128K to 512K tokens, and for the $8\times$ model from 128K to 256K tokens. At longer contexts, the decoder prefill over the compressed latent sequence becomes the dominant memory cost, causing peak memory to increase.

**Fine-Grained Compression.** In `\Cref{app-fig:gsm}`{=latex}, we use GSM8K to analyze fine-grained compression quality, where we compress the entire prompt and context. We find that `\names `{=latex}achieve the highest accuracy across all compression ratios on GSM8K, with particularly strong gains over baselines at higher compression ratios.

# Agent Scaffolding With Latent Context {#sec:agentic}

Agentic systems are commonly used to handle large amounts of information [@zhang2025rlm]. However, agents usually rely on lexical or semantic search to locate the information needed. These information retrieval methods can fail when the search keyword is not obvious before reading through the information. For example, in a large codebase, a bug reported in the \`\`dashboard login flow" may actually originate in an indirectly called entitlement module that never mentions \`\`dashboard" or \`\`login". Ideally, an agent should read all information at once before acting, but context limits of existing models prevent this from happening.

<figure id="fig:agentic_bar_plot" data-latex-placement="t!">
<img src="figs/agent.png" style="width:100.0%" />
<figcaption><strong>can use tools to retrieve compressed context and improve exact string-match accuracy.</strong> <strong>Left:</strong> We equip with a tool that retrieves compressed context and inserts it into their working context uncompressed. <strong>Right:</strong> We test the agent on challenging NIAH tasks, and the agent substantially improves exact string-match accuracy on retrieval tasks. </figcaption>
</figure>

LCLMs' more efficient long-context capacity makes it possible to place, for example, an entire code repository in an agent's context. We further enhance the agent's capability to locate precise information by allowing the `\name `{=latex}agent to expand compressed segments into raw text on demand. We segment the input into fixed-size chunks of 512 tokens, compress each chunk, and assign it an integer identifier. The model receives the entire compressed sequence in a single prompt, together with one `EXPAND` tool. At each turn, the model can decide to make a tool call of the form `EXPAND(i)`, and the tool returns the original text of the expanded segment.

We use the needle-in-a-haystack tasks from RULER as a testbed for evaluating this agentic retrieval mechanism with latent context. As shown in `\Cref{fig:agentic_bar_plot}`{=latex}, our agent substantially improves performance over the raw `\name `{=latex}with $16\times$ compressed context, and in some settings matches the performance of the original uncompressed context. These results suggest that compressed latent context can provide broad corpus-level visibility, while tool-based expansion supplies the exact fine-grained information needed. The marked success of this initial agentic attempt suggests the promise of adaptive expansion. LCLMs can skim a large quantity of context globally before deciding which relevant subset to zoom in on and read more carefully. Future work may explore new mechanisms for determining which context to expand or how to learn this behavior end-to-end.

# Conclusions {#sec:discuss}

Across long-context benchmarks, standard knowledge and instruction-following tasks, and agentic settings, our family of `\fullnames `{=latex}demonstrates that learned compression can preserve strong in-context capabilities while providing substantial efficiency gains, making it a practical building block for efficient long-context systems.

`\fullnames `{=latex}open a broad design space for future work and are naturally compatible with agentic frameworks such as Recursive Language Models [@zhang2025rlm]. More generally, `\names `{=latex}provide a promising substrate for long-horizon agents with large, persistent working memories by dramatically reducing the size of the inputs at scale. Future iterations of `\names `{=latex}could further improve the quality-efficiency trade-off by compressing inputs at multiple granularities and dynamically allocating capacity based on information density or input perplexity. Such adaptive compression could allow models to preserve fine-grained details where needed while maintaining a compact global context, reducing reliance on explicit expansion tools. Another promising direction is to extend compression beyond static input context to the model's generated state, including long chain-of-thought, tool observations, and an agent's accumulated working history, which can grow to dominate the context budget in long-horizon tasks.

# Acknowledgments

MG and ZC were supported by IBM, the Google Research Award, and the NVIDIA Academic Grant Program for Researchers. SM and TG are supported by DOE Office of Science's ASCR AI for Science initiative, the NSF TRAILS Institute (2229885), and Coefficient Giving. NK is supported by the NSF Graduate Research Fellowship.

Computing resources for this project were supported in part by the Swiss National Supercomputing Center (CSCS). Prepared in collaboration with LLNL under Contract DE-AC52-07NA27344 and supported by the LLNL-LDRD Program under Project No. 24-ERD-010 (LLNL-CONF-2019739). This manuscript has been authored by Lawrence Livermore National Security, LLC under Contract No. DE-AC52-07NA27344 with the U.S. Department of Energy. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, through solicitation DE-FOA-0003264, \`\`Advancements in Artificial Intelligence for Science," under Award Number DE-SC0025598.

Special thanks to Shengbang Tong for extensive discussions on architecture design and encoder-decoder model training. We would like to thank Preston Zhou and Eitan Borgnia for their early discussions and support for the project. We also thank Vadim Bereznyuk, John Kirchenbauer, and Neel Jain for providing feedback on the draft.

\bibliographystyle{plainnat}
\newpage
\appendix
\onecolumn

# Impact Statement {#app-sec:impact}

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be highlighted here.

# Extended Related Work {#app-sec:extended-rel-work}

#### Hard-token compression.

Hard-token methods achieve compression by deleting or rewriting input tokens. Token pruning approaches remove tokens via importance heuristics [@li2023compressing; @jiang2023llmlingua; @jiang2024longllmlingua], while summarization and rewriting methods compress by paraphrasing into fewer tokens [@chuang2024learning; @yoon2024compact]. While widely used in practice [e.g., context compaction in Claude Code and Codex; @claude_code; @codex] and very effective for reducing prompt length, these methods are intrinsically lossy: once information is dropped or paraphrased in the discrete token space, exact lexical and structural detail is unrecoverable. This limits their ability to support code fidelity, format-sensitive reasoning, and lookup-style in-context learning in long-horizon settings.

Relatedly, many agent systems externalize memory via summarization, retrieval, or structured stores. Examples include hierarchical memory management [@packer2023memgpt], long-term conversational memory [@zhong2023memorybank; @wang2023scm], and agentic memory frameworks [@xu2025amem]. While effective in extending interaction horizons, these approaches typically rely on lossy summaries or task-specific memory representations that are external to the base model.

#### KV cache compression.

KV cache compression reduces the size of the KV cache by evicting entries; most methods rely on a hand-crafted or learned heuristic to select which entries to drop. SnapKV [@li2024snapkv] compresses the cache by removing entries with low aggregated query attention on a per-head basis. SnapKV can be either prompt-agnostic, where pruning is performed without knowledge of the prompt, or prompt-dependent, where pruning is performed with knowledge of the prompt. Prompt-dependent methods have the downside of requiring labeling of the context and prompt for inputs, and pruning is highly specific to each individual input, limiting multi-turn chat applicability [@li2024scbench]. To overcome these limitations, self-study [@zweiger2026fast] can be used to generate synthetic query signals from the context itself, enabling prompt-agnostic pruning without requiring labeled prompt-context pairs.

KVzip [@kim2025kvzip] performs query-agnostic KV cache compression by using teacher-forced context reconstruction as a proxy objective, assigning each KV pair an importance score from the maximum attention it receives during reconstruction, and pruning low-scoring entries. Fast KV Compaction via Attention Matching [@zweiger2026fast] compresses the KV cache by fitting a smaller set of keys and values, together with per-entry bias terms, that match the original cache's attention outputs and attention mass on reference queries. Expected Attention [@devoto2025expected] predicts KV importance by approximating future queries with a Gaussian distribution, computing each key's expected attention in closed form, and pruning entries with the lowest expected contribution. There are many other KV cache compression techniques and we refer the reader to Awesome KV cache Compression Github [@October2001_AwesomeKVCacheCompression_2025] for a comprehensive review.

We note that KV cache compression methods also pool before eviction: both @li2024snapkv and @zweiger2026fast use max pooling, as does @devoto2025expected. @li2024snapkv further note that the choice of pooling operator does not significantly impact their performance.

It is also possible to do targeted KV cache compression offline for specific prompts. This approach applies to the setting where it is logical to expend more compute to compress the cache once, as this cost will be amortized over many user queries. @synk2025exploiting compute the full KV cache before applying top-k cache compression and reuse the smaller cache. @cartridges use self-study to learn small KV caches per corpus.

#### Soft-token compression.

A complementary direction encodes long contexts into fewer learned continuous tokens. The most successful examples of soft-token approaches to compression require training on the target context, such as prefix tuning [@li2021prefix; @gist; @lloco]. These methods can achieve comparable in-context performance to the base model with large compression ratios for the specific contexts they are trained on. Other soft-token approaches that do not require training on individual contexts are in-context autoencoders [@in-context-autoencoder] and encoder-decoder compression frameworks [@auto-compressor; @e2llm; @dai2025pretraining; @li2025500xcompressor].

#### Efficient long sequence modeling.

A complementary line of work targets the cost of long-context inference by modifying the underlying sequence model, the attention mechanism, or the positional encoding, rather than compressing the input. Early work scales Transformers to long inputs through segment-level recurrence and external memory: Transformer-XL [@transformerxl] caches hidden states from previous segments to extend the effective receptive field, and the Compressive Transformer [@rae2019compressive] further compresses older memory entries into a smaller compressed memory.

A second direction replaces softmax attention with formulations that scale linearly in sequence length. Linear-attention Transformers [@linearattention] cast attention as a kernel feature map to obtain $O(T)$ inference, state-space models such as S4 [@s4] and Mamba [@mamba] use recurrent or selective-scan formulations with constant-size hidden state, and more recent hybrids such as Kimi Linear [@kimilinear] interleave linear-attention blocks with sparse full-attention blocks to recover long-range recall while keeping most of the per-layer cost linear. These approaches lower the asymptotic cost of attention but typically trade off exact long-range recall on retrieval-style tasks.

Within the Transformer family, recent DeepSeek models compress the attention computation itself: Multi-head Latent Attention in DeepSeek-V2 [@deepseekv2] projects keys and values into a low-rank latent KV cache, and DeepSeek-V4 [@deepseekv4] introduces a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA), which consolidates every $m$ tokens of the KV cache into a single entry and then applies sparse top-$k$ selection over those entries, with Heavily Compressed Attention (HCA), which uses a much larger compression rate $m' \gg m$ but retains dense attention over the resulting entries, jointly reducing both KV memory and per-token attention FLOPs at million-token context lengths.

A separate thread extends the usable context of pre-trained models through their positional encoding: rotary position embeddings [@rope] provide relative attention with smooth extrapolation, and YaRN [@yarn] rescales RoPE frequencies to extend the context window with minimal additional training.

Finally, parallel context encoding methods such as CEPE [@cepe] take a different route: a small encoder produces dense representations of long-context chunks in parallel, and a frozen decoder consumes them through inserted cross-attention layers, exposing long contexts to the decoder without full self-attention over the raw tokens.

These approaches reduce the per-token cost of attention, the cost of positional indexing, or the effective sequence length seen by self-attention, rather than the number of tokens reaching the decoder; they are largely orthogonal to LCLMs and can in principle be composed.

# Dataset Details {#app-sec:data}

## Continual Pre-training Dataset Curation {#app-subsec:cont-pre-train-data}

In early experiments, we only train on reconstruction data and hope the model can learn generalizable patterns. However, we find that training on such data alone allows the model to reconstruct the full sequence with negligible loss, but it generalizes poorly beyond reconstruction-style tasks. In fact, it cannot perform any other tasks, even when the LLM decoder is kept frozen. This might not seem surprising, as we are essentially performing a general version of prefix tuning. Thus, the learned representation, though general for different contexts, collapses to the task of reconstruction. On the other hand, training solely on next-token prediction enables the model to leverage compressed context for downstream generation and reasoning. However, the resulting representations often lack fine-grained fidelity (e.g., exact-string details), leading to weaker reconstruction and brittleness on tasks requiring precise recovery.

Motivated by this trade-off, we train on a mixture of next-token prediction and reconstruction data. Our working hypothesis is that next-token prediction provides task-aligned learning signals that teach the decoder to operate on compressed context, while reconstruction acts as an auxiliary objective that encourages information preservation and can accelerate early-stage representation learning. Theoretically, sufficiently large-scale next-token training alone may recover fine-grained fidelity, but we find that in practice, the reconstruction data serves as an auxiliary data source that accelerates the training of the general compressor.

\begin{table*}[ht]

\small
\setlength{\tabcolsep}{5pt}
\renewcommand{\arraystretch}{1.12}
\caption{
Dataset mixture for continual pre-training. Pre/Post-Comp are token counts before/after compression.
We refer to the unique Nvidia license as ``Nvidia,'' the license can be found at \url{https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.1/blob/main/LICENSE.md}.
}
\label{app-tab:data-mixture-pre-training}
\resizebox{\linewidth}{!}{
\begin{tabular}{l l l r r r r r}
\toprule
\textbf{Data Name} &
\textbf{License} &
\textbf{Data Description} &
\textbf{Num Samples} &
\textbf{Tokens} &
\textbf{\%} \\
\cmidrule(lr){5-7}
& & & & \textbf{Pre-Comp} & \textbf{Post-Comp} & \textbf{Trainable} & \\
\midrule
Nemotron CC & Nvidia & Continual pre-training (CC Text)      & 84.84M  & 60.69B & 30.38B & 30.08B & 20.37 \\
Nemotron Code & Nvidia & Continual pre-training (Code)         & 46.85M  & 51.18B & 25.21B & 25.01B & 16.90 \\
Nemotron Reasoning & Nvidia & Continual pre-training (Reasoning)    & 16.62M  & 52.32B & 25.61B & 25.50B & 17.17 \\
Longmino & odc-by & Continual pre-training (Long-Context) & 0.548M  & 26.62B & 4.24B  & 4.23B  & 2.84  \\
SFT & cc-by-4.0 & SFT with Compressed Prompt           & 20.67M  & 42.21B & 38.02B & 37.54B & 25.49 \\
Reconstruction & Nvidia & Reconstruction from Compression      & 23.53M  & 50.75B & 25.69B & 25.18B & 17.23 \\
\midrule
\textbf{Total} & -- & & \textbf{192.06M} & \textbf{283.78B} & \textbf{149.16B} & \textbf{147.54B} & \textbf{100.00} \\
\bottomrule
\end{tabular}
}
\end{table*}

#### Curation.

Our primary sources are NVIDIA's Nemotron pre-training datasets [@nvidia2025nvidianemotronnano2; @nvidia_nemotron_nano_v3_2025]. We sample 50M data points from Nemotron's pre-training code and cc code, 50M examples from high-quality and synthetic high-quality cc text, and 50M examples from Nemotron specialized SFT mixtures that cover diverse reasoning traces. A substantial portion of the synthetic data from Nemotron consists of rewrites or generations produced by strong teacher models (e.g., `Qwen3-30B-A3B`, `Qwen3-235B-A22B` [@qwen3], `GPT-oss-120B` [@agarwal2025gpt], and `DeepSeek-R1` [@guo2025deepseek]), which alleviate the problem of forgetting. Furthermore, to expose the compressor to longer documents, we sample approximately 500K examples from the OLMo-3 longmino collection [@olmo2025olmo3]. Finally, to better preserve instruction-following and chat-template behavior, we also include Nemotron pre-training SFT data and convert it into a standardized instruction tuning and multi-turn chat format. Exact dataset composition is reported in `\Cref{app-tab:data-mixture-pre-training}`{=latex}.

## Auxiliary Reconstruction Dataset Curation {#app-subsec:aux-recon-data}

We construct reconstruction data spanning three broad categories: text, code, and *LaTeX*. Text is sampled from Nemotron CC / math, FineWiki [@penedo2025finewiki], and RedPajama ArXiv [@together2023redpajama]; code is sampled from RedPajama GitHub, The Stack v2 (we concatenate files from the repo into one document) [@lozhkov2024starcoder], and CodeParrot; and LaTeX is sampled from TexTeller-OCR. Our goal is to maximize data diversity so that the compressor retains sufficient coverage for downstream tasks that may involve any of these modalities.

## SFT Dataset Curation {#app-subsec:posttrain-data}

For reasoning, we curate math, code, and science supervision from the Nemotron SFT datasets and OLMo-3 Dolci-Think [@NemotronPostTrainingDatasetV2; @NemotronPostTrainingDatasetV1; @nvidia_nemotron_nano_v3_2025; @olmo2025olmo3].

For long-context instruction tuning, we combine long-context and RAG-style QA datasets [@he2025scaling; @jin2019pubmedqa; @liu2024chatqa; @xu2024chatqa; @yang2018hotpotqa; @liu2024raginstructboostingllmsdiverse; @joshi2017triviaqa] with long-context code instruction datasets, including SWE-Fixer [@xie2025swe], NextCoder [@aggarwal2025nextcoder], PyResBugs [@11052783], CommitPack [@muennighoff2023octopack], and RepoBench [@liu2023repobench]. Since some public datasets contain outdated or low-quality responses, we regenerate targets by labeling them with `Qwen3-235B-A22B-Instruct-2507`. In addition, we generate three synthetic long-context corpora: (1) repository summarization, using summarization question templates as instructions and concatenated repository files from The Stack v2 [@lozhkov2024starcoder] as the document; (2) document summarization, using long documents from FineWiki [@penedo2025finewiki] and RedPajama-ArXiv [@together2023redpajama]; and (3) repository code generation, where we hold out one file and ask the model to reconstruct it given the remaining repository context. All synthetic completions are generated by `Qwen3-30B-A3B-Instruct-2507`.

For instruction following, we include multi-turn conversation data from WildChat and LMSYS-Chat-1M [@zhao2024wildchat; @zheng2023lmsys], along with additional alignment-style data obtained by re-labeling with `Qwen3-30B-A3B-Instruct-2507`. We also add OLMo-3 instruction tuning, Nemotron instruction tuning, and the Tulu-3 mixture [@lambert2024tulu3]; since these datasets are already generated by strong teacher models, we keep their original responses unchanged. Finally, we retain a small amount of reconstruction data to stabilize training and mitigate drift when optimizing on compressed inputs. Exact dataset composition is reported in `\Cref{app-tab:SFT-mixture}`{=latex}.

\begin{table*}[t]

    \small
    \setlength{\tabcolsep}{5pt}
    \renewcommand{\arraystretch}{1.12}
    \caption{SFT data mixture statistics. Pre/Post-Comp are token counts before/after compression; Weight (\%) is computed over total Post-Comp tokens. During packing, some ultra-long examples are truncated out, leaving us training with 30B tokens for the encoder and 20B for the decoder.
    We refer to the unique Nvidia license as ``Nvidia,'' the license can be found at \url{https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.1/blob/main/LICENSE.md}.
    }
    \label{app-tab:SFT-mixture}
    \resizebox{\linewidth}{!}{
    \begin{tabular}{l l l r r r r r}
    \toprule
    \textbf{Data Name} &
    \textbf{License} &
    \textbf{Description} &
    \textbf{Num Samples} &
    \textbf{Tokens} &
    \textbf{(\%)} \\
    \cmidrule(lr){5-7}
    & & & & \textbf{Pre-Comp} & \textbf{Post-Comp} & \textbf{Trainable} & \\
    \midrule
    \textbf{Reasoning} & & & \textbf{2.67M} & \textbf{4.57B} & \textbf{3.87B} & \textbf{3.83B} & \textbf{17.88} \\
    \quad Nemotron Reasoning & cc-by-4.0 & Nemotron reasoning data & 2.20M & 3.90B & 3.33B & 3.31B & 15.41 \\
    \quad Dolci Think & odc-by & OLMo reasoning data & 0.36M & 0.50B & 0.40B & 0.40B & 1.87 \\
    \quad Nemotron Science & cc-by-4.0 & Nemotron science & 0.11M & 0.16B & 0.13B & 0.13B & 0.60 \\
    \midrule
    \textbf{Long-Context Instruction Tuning} & & & \textbf{7.32M} & \textbf{47.25B} & \textbf{10.63B} & \textbf{9.90B} & \textbf{49.16} \\
    \quad Repo Summarization & Apache & Synthetic repo-level summarization & 1.35M & 7.93B & 2.45B & 2.35B & 11.34 \\
    \quad Document Summarization & Apache & Synthetic long-document summarization & 1.75M & 18.49B & 3.14B & 3.09B & 14.52 \\
    \quad Repo Code Generation & Apache & Synthetic repo-level code generation & 2.02M & 7.83B & 3.54B & 3.38B & 16.36 \\
    \quad Long-Context QA & Apache & Long-context / RAG QA & 0.84M & 6.21B & 0.30B & 0.18B & 1.37 \\
    \quad Long-Context Code Instruct & Apache & Long-context code instruction tuning & 1.36M & 6.80B & 1.21B & 0.89B & 5.57 \\
    \midrule
    \textbf{General Instruction Following} & & & \textbf{5.26M} & \textbf{11.68B} & \textbf{4.52B} & \textbf{4.29B} & \textbf{20.91} \\
    \quad Multi-turn Conversation & odc-by & WildChat + LMSYS Chat 1M & 2.48M & 7.68B & 1.95B & 1.80B & 9.00 \\
    \quad Dolci Instruct & odc-by & OLMo instruction tuning & 1.46M & 2.06B & 1.76B & 1.73B & 8.15 \\
    \quad Tulu3 SFT Mixture & odc-by & Tulu3 instruction tuning & 0.84M & 0.69B & 0.40B & 0.38B & 1.83 \\
    \quad Nemotron Instruct & Nvidia & Nemotron instruction tuning & 0.35M & 1.07B & 0.32B & 0.28B & 1.46 \\
    \quad HelpSteer & CC-4.0 & Alignment & 0.13M & 0.17B & 0.10B & 0.10B & 0.48 \\
    \midrule
    \textbf{Reconstruction Data} & & & \textbf{2.08M} & \textbf{5.13B} & \textbf{2.61B} & \textbf{2.55B} & \textbf{12.05} \\
    \quad Reconstruction & Nvidia & Small reconstruction mixture & 2.08M & 5.13B & 2.61B & 2.55B & 12.05 \\
    \midrule
    \textbf{Total} & & & \textbf{17.32M} & \textbf{68.62B} & \textbf{21.63B} & \textbf{20.56B} & \textbf{100.00} \\
    \bottomrule
    \end{tabular}
    }
\end{table*}

## Compression Data Format {#compression-data-details}

#### Continual pre-training.

For the continual pre-training data, given a sequence, we split it into multiple segments, wrap every other segment with the special tokens `<|memory_start|>` and `<|memory_end|>`, and treat the wrapped spans as compressed content while leaving the remaining segments uncompressed.

For raw pre-training documents, we construct a fixed alternating pattern of wrapped and unwrapped segments, always beginning with a wrapped segment and ending with an unwrapped segment. Documents longer than 16,384 tokens retain 8,192 trainable (unwrapped) tokens, while shorter documents retain half of their tokens as trainable. The number of segment pairs depends on document length: documents shorter than 2,048 tokens receive 1 pair; documents between 2,048 and 4,096 tokens receive 1-2 pairs; documents between 4,096 and 8,192 tokens receive 2-4 pairs; documents between 8,192 and 16,384 tokens receive 4-6 pairs; and documents longer than 16,384 tokens receive 4-8 pairs. Within each sequence, the first wrapped segment contains 30%-40% of the total wrapped tokens, the last unwrapped segment contains 40%-50% of the total trainable tokens, and the remaining segments are allocated with a $\pm 20\%$ jitter. No segment is allowed to be shorter than 128 tokens.

#### Instruction tuning.

For conversational data, we apply wrapping independently to each message. With 50% probability, the entire message content is enclosed within a single pair of tags. Otherwise, with the remaining 50% probability, we insert 1-3 wrapped regions inside the message and leave the remaining tokens outside the tags. The number of wrapped regions depends on message length: messages shorter than 32 tokens are always fully wrapped; messages between 32 and 200 tokens receive 1 region; messages between 200 and 600 tokens receive 1-2 regions; and messages of 600 tokens or more receive 1-3 regions. Wrapped regions cover 80%-100% of the message tokens in aggregate, and we enforce a minimum 5% unwrapped gap whenever feasible. If the realized wrapped coverage falls below 70%, we fall back to fully wrapping the message. Since this decision is made independently for each message, multi-turn conversations naturally contain a mixture of fully wrapped turns and turns with small unwrapped spans, with a bias toward high overall coverage.

# Extended Architecture Search {#app-sec:arch}

<figure id="app-fig:pool-abl-at-scale" data-latex-placement="t!">
<img src="figs/arch.png" style="width:100.0%" />
<img src="figs/concat_vs_mean_ratios.png" style="width:80.0%" />
<figcaption><strong>Comparison between mean and concat at scale for all ratios.</strong> From , we identify mean and concat as the most promising pooling operators, with mean slightly outperforming concat in training loss. We train at larger scale and find that mean remains marginally better than concat at higher compression ratios, but loses to concat at lower compression ratios.</figcaption>
</figure>

## Formal Definitions of Pooling Operators {#app-subsec:encoder-defs}

We now give the formal definitions of pooling operators, and illustrate them visually in `\Cref{app-fig:arch_sweep}`{=latex}.

#### Token-based pooling.

The most common method is appending one or more learned special pooling tokens per latent output to each encoder window. Let $p_{i,1},\dots,p_{i,M_i}$ denote these pooling tokens. We encode $$\begin{equation}
(\tilde h^{(i)}_1,\dots,\tilde h^{(i)}_{|w_i|+M_i})
=
\mathrm{Enc}_\phi([w_i;p_{i,1};\dots;p_{i,M_i}]),
\end{equation}$$ and use the hidden states at the pooling-token positions: $$\begin{equation}
z^{(i)}_k
=
\tilde h^{(i)}_{|w_i|+k},
\qquad
k=1,\dots,M_i.
\end{equation}$$ This generalizes the common EOS-pooling strategy: when $W=N$, a single pooling token is appended for each compression block; when $W>N$, several pooling tokens summarize different regions of the same encoder window.

#### Mean pooling.

For each window $w_i$, we partition its hidden states into consecutive groups of size $N$. For latent index $k=1,\dots,M_i$, define $$\begin{equation}
G_{i,k}
=
\{(k-1)N+1,\dots,\min(kN,|w_i|)\}.
\end{equation}$$ The corresponding latent vector is the average hidden state over this group: $$\begin{equation}
z^{(i)}_k
=
\frac{1}{|G_{i,k}|}
\sum_{j\in G_{i,k}} h^{(i)}_j.
\end{equation}$$ Mean pooling directly aggregates the token-level representations assigned to each compression block. Across our architecture search, mean pooling consistently outperforms EOS-style token pooling, matching observations from prior text-compression and vision-encoder work.

#### Concat pooling.

Rather than averaging or selecting a designated token, concat pooling preserves the full per-token representation by concatenating consecutive encoder hidden states within each compression block. For window $w_i$ and latent index $k=1,\dots,M_i$, let $G_{i,k}=\{(k-1)N+1,\dots,\min(kN,|w_i|)\}$ as in mean pooling. The concatenated latent is $$\begin{equation}
z^{(i)}_k
=
\bigl[\,h^{(i)}_{j}\,\bigr]_{j\in G_{i,k}}\in\mathbb{R}^{N\,d_{\mathrm{enc}}},
\end{equation}$$ i.e., a vector of dimension $N\,d_{\mathrm{enc}}$ formed by stacking the $N$ encoder hidden states. The adapter is widened accordingly so that its first MLP layer takes inputs of dimension $N\,d_{\mathrm{enc}}$ and projects back to the decoder hidden dimension $d_{\mathrm{dec}}$, analogous to the patch-merger / spatial-concat scheme used in @qwen3vl.

<figure id="app-fig:adapter-ov-loss" data-latex-placement="t">
<img src="figs/adapter_overlap.png" />
<figcaption><strong>MLP adapters, causal masking, and no boundary overlap give the lowest pre-training loss across the architecture sweep.</strong> We sweep combinations of adapter, boundary-overlap, and encoder mask, and find that (1) causal masking is strictly better than bidirectional masking, (2) boundary overlap does not improve over non-overlapping windows while substantially increasing compute, and (3) the MLP adapter outperforms attention-based adapters. The same trends hold at scale (plot d).</figcaption>
</figure>

## Adapter Design {#app-subsec:adapter}

The adapter $a(\cdot)$ maps the compressed latent sequence $z_{1:M}\in\mathbb{R}^{M\times d_{\mathrm{enc}}}$ into a sequence $s_{1:M}\in\mathbb{R}^{M\times d_{\mathrm{dec}}}$ in the decoder embedding space. Prior encoder-decoder compression work often assumes $d_{\mathrm{enc}}=d_{\mathrm{dec}}$, but a wide range of encoder-decoder models pair smaller encoders with larger decoders, making the adapter an important design choice. We formally define the two adapter families in our sweep below.

#### MLP adapter.

The MLP adapter is a standard two-layer feed-forward network with a GELU non-linearity [@hendrycks2016gaussian] and an RMSNorm [@zhang2019root] pre-normalization, following the projection design used in LLaVA-style VLMs [@llava]: $$\begin{align}
\hat z_m \;&=\; \mathrm{RMSNorm}(z_m), \\
s_m \;&=\; W_2\,\mathrm{GELU}\!\bigl(W_1 \hat z_m + b_1\bigr) + b_2,
\qquad m = 1,\dots,M,
\end{align}$$ where $W_1\in\mathbb{R}^{d_h\times d_{\mathrm{enc}}}$ and $W_2\in\mathbb{R}^{d_{\mathrm{dec}}\times d_h}$ with hidden dimension $d_h$. Each latent token is projected independently and the adapter does not mix information across the latent sequence. When the encoder uses concat pooling (`\Cref{app-subsec:encoder-defs}`{=latex}), each $z_m$ has dimension $N\,d_{\mathrm{enc}}$ instead of $d_{\mathrm{enc}}$, so the input dimension of $W_1$ is widened to $N\,d_{\mathrm{enc}}$ and the first MLP layer is correspondingly $N\times$ larger.

#### Attention-based adapter.

The attention-based adapter adds a single multi-head self-attention layer over the latent sequence before the MLP projection, with an RMSNorm pre-normalization on the latents: $$\begin{align}
\hat z_m \;&=\; \mathrm{RMSNorm}(z_m), \\
\tilde z_{1:M} \;&=\; \mathrm{MHSA}\bigl(\hat z_{1:M}\bigr), \\
s_m \;&=\; W_2\,\mathrm{GELU}\!\bigl(W_1 \tilde z_m + b_1\bigr) + b_2,
\qquad m = 1,\dots,M.
\end{align}$$ The self-attention layer operates in the encoder hidden dimension $d_{\mathrm{enc}}$ and lets information mix across latent tokens before projection, at the cost of additional parameters and a compute term quadratic in $M$.

#### Comparison.

Across the adapter sweep in `\Cref{app-fig:adapter-ov-loss}`{=latex}, the MLP adapter achieves lower pre-training loss than the attention-based variant while adding less computation, in contrast to prior findings [@GMSA] that attention-based adapters perform better. We therefore use the MLP adapter as the default in all subsequent experiments.

## Boundary Context {#app-subsec:arch-overlap}

A potential limitation of fixed encoder windows is that information crossing a window boundary is split across two encoder forward passes. Increasing the window size $W$ reduces the frequency of such boundaries, but also increases compression-time cost. As an alternative, we implement boundary context overlapping, where each encoder window is extended with $O$ neighboring tokens while the number of latent tokens is kept unchanged.

For a target window $$w_i = x_{(i-1)W+1:\min(iW,T)},$$ we construct an overlapped encoder input $\tilde w_i$ by adding neighboring context tokens. For bidirectional attention, we include both left and right overlap when available: $$\begin{equation}
\tilde w_i
=
x_{\max(1,(i-1)W-O+1):\min(iW+O,T)}.
\end{equation}$$ We provide a visualization in the rightmost figure in `\Cref{app-fig:arch_sweep}`{=latex}. For causal attention, future context is unavailable, so we include only left overlap: $$\begin{equation}
\tilde w_i
=
x_{\max(1,(i-1)W-O+1):\min(iW,T)}.
\end{equation}$$

After encoding $\tilde w_i$, we apply pooling only to hidden states corresponding to the original non-overlapped region $w_i$, and discard pooled outputs that correspond to overlap tokens. Thus, overlap changes the encoder context available to each window but does not change the number of latent tokens passed to the decoder.

We sweep $O\in\{0,32,256\}$ for $W\in\{256,1024\}$ in `\Cref{app-fig:adapter-ov-loss}`{=latex}. Overlap does not noticeably improve pre-training loss in most settings, while increasing training cost because boundary tokens are encoded multiple times. For example, with $W=1024$ and $O=256$, each interior bidirectional window processes $1536$ tokens instead of $1024$. We therefore do not use overlap in the default architecture.

# Full-Scale Training Recipe Sweeps {#app-sec:training-recipe-sweeps}

## Pre-trained Encoder Representations {#app-subsec:encoder_lm_or_emb}

For the encoder initialization, to determine whether an embedding model's representation [@refrag; @e2llm; @cepe] is better than that of a standard language model [@GMSA; @auto-compressor; @mean-pooling], we conduct an experiment with the EOS pooling token. We can see from `\Cref{app-fig:large-scale-sweep}`{=latex} (d) that using the embedding model as the encoder outperforms using an LLM as the encoder. Note the embedding model was initialized from the LLM [@qwen3-embedding]. This shows that embedding model training provides a better representation for the encoder, though the gap gradually narrows with longer training. We also show downstream evaluation performance in `\Cref{tab:abl-encoder}`{=latex}.

<figure id="app-fig:large-scale-sweep" data-latex-placement="t">
<img src="figs/cpt_ablation.png" style="width:100.0%" />
<figcaption><strong>Architecture choices selected by the small-scale sweep continue to hold at scale.</strong> Continual pre-training loss for the full-pipeline runs at compression ratio <span class="math inline">16×</span>. Across (a) encoder window size, (b) encoder attention mask, (c) encoder initialization (LM vs. embedding), and (d) adapter family, the configuration chosen from the from-scratch sweep (<span class="math inline"><em>W</em> = 1024</span>, causal masking, embedding-initialized encoder, MLP adapter) yields the lowest pre-training loss.</figcaption>
</figure>

## LR Sweep {#app-sec:continunal pre-training lr sweep}

Language models are prone to catastrophic forgetting during continual pre-training and SFT [@luo2025empirical; @li2024revisiting]. To actively avoid forgetting, we sweep the peak learning rate for stages 2 and 3 of training. To test for catastrophic forgetting, we then benchmark the language model without compression after SFT and check that its performance is comparable to the benchmark performance of the language model before training. For stage 2, we sweep from $1\times10^{-6}$ to $1\times10^{-5}$ and find that a larger learning rate improves downstream benchmarks without causing forgetting. We show these results in `\Cref{app-fig:large-scale-sweep}`{=latex} (c). For stage 3, we sweep from $2\times10^{-5}$ to $5\times10^{-5}$. We do not observe significant differences in those SFT LRs across different models; we report the detailed numbers on downstream benchmarks in `\Cref{app-subsec:lr-sweep}`{=latex}.

<figure id="fig:stage_2_loss" data-latex-placement="t!">
<img src="figs/scaling_behavior.png" style="width:50.0%" />
<figcaption><strong>Scaling the decoder helps more than scaling the encoder, and lower compression ratios yield lower pre-training loss.</strong> <strong>Left:</strong> pre-training loss at the end of stage 2 with compression ratio <span class="math inline">16×</span> across encoder/decoder size combinations; increasing the decoder size produces a much larger drop in loss than increasing the encoder size. <strong>Right:</strong> pre-training loss at the end of stage 2 for our 0.6B-4B models across compression ratios; lower compression ratios yield lower pre-training loss. Note that the 4B-encoder and 8B-decoder runs use 256 ranks rather than 128 ranks. Because our stateful Parquet dataloader assigns work by rank, these runs complete slightly fewer optimization steps. For a clean comparison, we therefore truncate all loss curves at the same optimization step in the plot above.</figcaption>
</figure>

## Scaling Up Model Size {#app-sec:scaling model size}

To study scaling behavior, we ask two questions: (i) does a larger encoder improve compression quality, and (ii) does a larger decoder, which has a wider hidden dimension that could compress more information per token, improve performance under compression? Accordingly, we add two model configurations: (1) using the larger `Qwen3-Embedding-4B` as the encoder with the same size `Qwen3-4B-Instruct-2507` decoder; and (2) using the same size `Qwen3-Embedding-0.6B` encoder and the larger `Qwen3-8B` as the decoder. In `\Cref{fig:stage_2_loss}`{=latex} (a), we see that increasing the size of the decoder is much more beneficial in terms of pre-training loss than increasing the size of the encoder.

However, the scaling results in `\Cref{app-tab:scaling}`{=latex} are mixed. The $0.6$B encoder performs best across the RULER tasks, while the $4$B encoder performs best on the remaining evaluations. In contrast, the $8$B decoder achieves substantially lower pre-training loss but does not yield the downstream gains we expected. We suspect this is partly due to a mismatch between the training mixture and the decoder initialization: our data curation and recipe were tuned around the $4$B instruct decoder, whereas the $8$B decoder is a hybrid-thinking model and may require a different data distribution. Moreover, the $4$B instruct model already outperforms the $8$B model in the full-cache setting on several evaluations, which may limit the apparent benefit of scaling the decoder in our current setup.

\small
\setlength{\tabcolsep}{6pt}

  Enc / Dec                RULER 4K              RULER 8K             RULER 16K             LongBench           LongHealth5                 GSM8K
  ----------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  0.6B / 4B               **75.06**             **70.23**             **65.91**                 37.33   [67.50]{.underline}   [81.05]{.underline}
  0.6B / 8B                   71.20   [67.40]{.underline}   [62.22]{.underline}   [37.80]{.underline}                 64.80                 77.30
  4B / 4B       [71.30]{.underline}                 66.90                 62.00             **39.00**             **69.80**             **83.20**

  : Encoder/decoder scaling at 16$\times$ compression {#app-tab:scaling}

## Packed Sequences

Since our corpus contains many long-context examples with highly variable lengths, conventional batching would waste substantial computation on padding. To improve training throughput, we pack multiple examples into a single sequence. Because we continually pre-train an instruction-tuned model, we prevent cross-example information leakage by resetting the attention mask at example boundaries, yielding a block-diagonal attention pattern. We implement this efficiently using variable-length attention kernels [@dao2023flashattention; @dong2024flex]. `\clearpage`{=latex}

# Compute {#app-subsec:compute}

All pre-training-from-scratch experiments are run on 16 nodes of MI300A GPUs, each containing 4 GPUs. Full-scale continual pre-training experiments for 0.6B-4B models are trained on $32$ nodes of MI300A GPUs, each containing $4$ GPUs.

All evaluations are conducted on H200 GPUs.

All scaling experiments (`Qwen3-Embedding-4B` encoder, `Qwen3-8B` decoder) are trained on H200 clusters with 32 nodes to alleviate memory issues. `\clearpage`{=latex}

\FloatBarrier
\FloatBarrier
\clearpage

<figure id="app-fig:gsm" data-latex-placement="!t">
<h1 id="app-sec:benchmarks">Benchmarks and Results</h1>
<p></p>
<h2 id="gsm8k-performance-vs-compression-ratio">GSM8K Performance vs Compression Ratio</h2>
<p>The long-context benchmarks show that can compress extended contexts efficiently. We next evaluate on GSM8K <span class="citation" data-cites="cobbe2021training"></span> to test whether can also handle short, information-dense inputs, where nearly every token may be relevant to the answer. In , achieve the highest accuracy across all compression ratios on GSM8K, with particularly strong gains over baselines at higher compression ratios. These are aggressive compression settings: compression ratios of <span class="math inline">16×</span> and <span class="math inline">8×</span> remove 93.75% and 87.5% of the input tokens, respectively. Strong performance in this setting demonstrates that are not specialized only for long-context QA; they can also compress short, dense contexts, making them practical for general-purpose deployment.</p>
<img src="figs/pareto_curve_gsm8k.png" style="width:50.0%" />
<figcaption><strong>can compress small dense contexts with high accuracy.</strong> We plot GSM8K accuracy over different compression ratios and find that maintain much higher accuracy at larger compression ratios. Full data are provided in . We do not report compression time for GSM8K because the context is small enough that compression-time measurements are too noisy to support reliable conclusions.</figcaption>
</figure>

\clearpage
\FloatBarrier

## Results

\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrr}
\toprule
Model & RULER 4k & RULER 8k & RULER 16k & LB en16 & LB cn5 & LongHealth5 & GSM8K \\
\midrule
Qwen 4B Instruct (full KV) & 94.41 & 93.58 & 93.74 & 45.21 & 46.63 & 75.75 & 93.25 \\
\midrule
\textit{16x compression} \\
\midrule
LCLM 0.6b-4b (Mean)   & \textbf{75.06} & 70.23 & \textbf{65.91} & 39.08 & 31.74 & 67.50 & \textbf{81.05} \\
LCLM 0.6b-4b (Concat) & \underline{74.50} & \underline{70.30} & \underline{64.90} & 36.24 & 25.88 & \textbf{76.30} & \underline{78.90} \\
ExpAttn & 40.97 & 46.93 & 50.67 & 29.44 & 20.51 & 30.25 & 0.08 \\
SnapKV & 14.31 & 14.11 & 18.27 & 26.96 & 18.30 & 10.50 & 0.08 \\
SnapKV-QA & 20.54 & 39.09 & 61.96 & 38.52 & \textbf{42.26} & \underline{76.25} & 0.08 \\
KVzip & 62.73 & 61.73 & 57.97 & 25.86 & 17.12 & 67.00 & 0.00 \\
KVzipFast & 40.01 & 44.70 & 47.86 & 25.76 & 17.08 & 18.25 & 1.06 \\
AM-Fast & 53.09 & 55.45 & 37.17 & \textbf{40.49} & 31.51 & 70.50 & 31.39 \\
AM-Slow & 69.21 & \textbf{70.38} & 42.02 & \underline{40.26} & \underline{31.85} & 72.75 & 29.80 \\
\midrule
\textit{8x compression} \\
\midrule
LCLM 0.6b-4b (Mean)   & 85.42 & 84.48 & 82.47 & 42.23 & 34.61 & 71.75 & \underline{87.26} \\
LCLM 0.6b-4b (Concat) & 87.20 & 86.80 & 84.40 & 42.32 & 36.46 & \textbf{79.50} & \textbf{87.40} \\
ExpAttn & 65.08 & 65.28 & 65.41 & 38.13 & 30.56 & 51.50 & 4.25 \\
SnapKV & 19.64 & 19.42 & 25.39 & 32.28 & 25.18 & 13.75 & 0.15 \\
SnapKV-QA & 47.66 & 60.32 & 72.10 & 41.73 & \textbf{45.41} & \underline{76.00} & 0.15 \\
KVzip & \textbf{90.27} & \textbf{88.34} & \textbf{88.65} & 40.72 & 36.09 & 72.75 & 62.02 \\
KVzipFast & 86.08 & 84.59 & \underline{85.27} & 38.17 & 30.91 & 68.50 & 38.74 \\
AM-Fast & \underline{89.63} & 87.05 & 56.06 & \textbf{44.50} & \underline{41.52} & 74.50 & 30.40 \\
AM-Slow & 89.20 & \underline{87.05} & 59.34 & \underline{44.22} & 41.11 & 74.75 & 31.01 \\
\midrule
\textit{4x compression} \\
\midrule
LCLM 0.6b-4b (Mean)   & 91.76 & 91.03 & 89.96 & \textbf{46.04} & 40.92 & \underline{76.25} & \textbf{91.05} \\
LCLM 0.6b-4b (Concat) & 92.30 & 91.50 & 90.50 & 45.39 & 41.80 & \textbf{82.50} & \underline{89.90} \\
ExpAttn & 85.26 & 83.09 & 82.51 & 41.95 & 41.57 & 66.00 & 34.12 \\
SnapKV & 27.84 & 31.02 & 36.33 & 38.18 & 34.22 & 26.00 & 0.08 \\
SnapKV-QA & 69.24 & 76.16 & 81.06 & 43.74 & 46.19 & 75.75 & 0.08 \\
KVzip & \textbf{94.43} & \textbf{93.28} & \textbf{93.93} & 45.30 & 47.45 & 76.00 & 89.08 \\
KVzipFast & \underline{94.17} & \underline{93.15} & \underline{93.64} & 45.16 & \textbf{48.66} & \underline{76.25} & 88.40 \\
AM-Fast & 92.69 & 91.60 & 82.37 & 45.48 & \underline{48.39} & 74.50 & 32.68 \\
AM-Slow & 92.52 & 91.47 & 81.98 & \underline{45.78} & 48.07 & 73.50 & 32.37 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
Qwen 4B Instruct & 100.00 & 100.00 & 100.00 & 99.80 & 96.40 & 99.80 & 99.85 & 100.00 & 99.84 & 97.50 & 87.53 & 84.40 & 62.20 & 94.41 \\
\midrule
\textit{16x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \underline{96.40} & 89.40 & 54.60 & \textbf{85.00} & 80.80 & \underline{48.60} & \textbf{72.45} & \textbf{80.20} & 83.92 & \textbf{80.54} & \underline{88.27} & \underline{66.00} & \underline{49.60} & \textbf{75.06} \\
LCLM 0.6b-4b (Concat) & \underline{96.40} & \underline{91.80} & \underline{62.40} & \underline{83.80} & 84.20 & \textbf{54.00} & 59.50 & \underline{77.50} & 79.50 & 74.30 & \textbf{90.10} & 65.40 & 49.00 & \underline{74.45} \\
ExpAttn & \textbf{100.00} & 49.20 & 0.60 & 57.60 & 2.20 & 0.00 & 56.75 & 25.70 & 97.80 & 18.36 & 66.20 & 33.20 & 25.00 & 40.97 \\
SnapKV & 4.00 & 1.60 & 2.40 & 9.80 & 1.80 & 0.00 & 8.55 & 8.55 & 5.48 & 25.30 & 73.13 & 24.80 & 20.60 & 14.31 \\
SnapKV-QA & 16.60 & 7.20 & 2.40 & 10.20 & 4.80 & 0.00 & 9.10 & 10.50 & 15.80 & 25.14 & 57.67 & 60.60 & 47.00 & 20.54 \\
KVzip & \textbf{100.00} & 91.60 & \textbf{96.00} & 62.40 & 69.80 & 41.00 & 32.85 & 43.35 & \textbf{99.96} & 21.80 & 70.13 & 49.00 & 37.60 & 62.73 \\
KVzipFast & \textbf{100.00} & 38.80 & 31.60 & 23.00 & 65.40 & 26.00 & 7.70 & 9.80 & \underline{99.60} & 12.66 & 54.73 & 24.80 & 26.00 & 40.01 \\
AM-Fast & \textbf{100.00} & 52.80 & 7.80 & 33.80 & \underline{89.00} & 3.80 & 22.60 & 17.85 & 95.16 & \underline{76.82} & 81.40 & 63.20 & 46.00 & 53.09 \\
AM-Slow & \textbf{100.00} & \textbf{95.20} & 31.60 & 74.00 & \textbf{94.00} & 9.40 & \underline{65.40} & 56.50 & 94.96 & 74.52 & 81.40 & \textbf{70.60} & \textbf{52.20} & 69.21 \\
\midrule
\textit{8x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 98.60 & 94.20 & 64.20 & 93.00 & 98.20 & 82.20 & 92.30 & 90.70 & 87.40 & \underline{90.80} & \textbf{90.60} & 73.20 & 55.00 & 85.42 \\
LCLM 0.6b-4b (Concat) & 98.40 & 97.00 & 76.60 & 94.60 & 95.60 & 86.00 & 93.80 & 92.30 & 90.70 & \textbf{92.80} & \underline{88.40} & 74.80 & 53.20 & 87.25 \\
ExpAttn & \textbf{100.00} & 96.20 & 5.20 & 92.20 & 69.20 & 0.40 & 93.00 & 80.20 & \textbf{100.00} & 48.08 & 85.20 & 39.20 & 37.20 & 65.08 \\
SnapKV & 12.40 & 8.20 & 2.40 & 12.60 & 3.20 & 0.00 & 11.20 & 12.60 & 9.80 & 49.06 & 80.07 & 31.00 & 22.80 & 19.64 \\
SnapKV-QA & 54.40 & 97.40 & 2.40 & 46.80 & 14.40 & 2.20 & 21.25 & 79.65 & 52.60 & 43.22 & 70.20 & 78.20 & 56.80 & 47.66 \\
KVzip & \textbf{100.00} & 98.20 & \textbf{99.80} & 97.20 & \underline{98.60} & \underline{99.40} & 82.85 & 96.70 & \underline{99.96} & 77.48 & 82.67 & \textbf{79.60} & \textbf{61.00} & \textbf{90.27} \\
KVzipFast & \textbf{100.00} & \textbf{100.00} & \textbf{99.80} & 91.80 & 98.20 & \textbf{99.60} & 54.15 & 93.70 & \underline{99.96} & 66.52 & 81.27 & 76.40 & 57.60 & 86.08 \\
AM-Fast & \textbf{100.00} & \underline{99.40} & \underline{93.20} & \textbf{99.20} & 98.20 & 93.00 & \underline{96.90} & \textbf{98.45} & 96.60 & 65.90 & 84.53 & \underline{79.40} & \underline{60.40} & \underline{89.63} \\
AM-Slow & \underline{99.80} & \underline{99.40} & 90.60 & \underline{98.20} & \textbf{98.80} & 93.60 & \textbf{97.65} & \underline{97.35} & 97.04 & 64.80 & 84.13 & 78.60 & 59.60 & 89.20 \\
\midrule
\textit{4x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \underline{99.40} & 99.60 & 93.40 & 99.00 & 99.40 & 93.80 & 98.25 & 99.20 & 94.64 & 91.42 & 86.73 & 79.20 & 58.80 & 91.76 \\
LCLM 0.6b-4b (Concat) & 99.20 & \underline{99.80} & 94.40 & \underline{99.40} & \textbf{99.80} & 97.60 & 98.80 & 99.20 & 93.90 & 92.80 & 87.60 & 77.20 & 59.80 & 92.27 \\
ExpAttn & \textbf{100.00} & \underline{99.80} & 66.60 & \textbf{100.00} & 99.40 & 53.60 & 98.15 & 99.65 & \textbf{100.00} & 86.74 & 87.07 & 62.80 & 54.60 & 85.26 \\
SnapKV & 32.00 & 14.80 & 2.60 & 18.60 & 6.00 & 0.00 & 13.80 & 16.55 & 27.00 & 71.72 & 81.87 & 47.60 & 29.40 & 27.84 \\
SnapKV-QA & 95.00 & \textbf{100.00} & 2.60 & 98.60 & 30.20 & 40.40 & 54.30 & 99.35 & 85.04 & 72.14 & 78.73 & 81.60 & 62.20 & 69.24 \\
KVzip & \textbf{100.00} & 99.40 & \textbf{100.00} & \textbf{100.00} & \underline{99.60} & \underline{99.80} & \textbf{99.25} & \underline{99.80} & \underline{99.96} & \textbf{96.24} & 85.80 & \textbf{85.40} & 62.40 & \textbf{94.43} \\
KVzipFast & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & \underline{99.60} & \underline{99.80} & 98.50 & \textbf{100.00} & \underline{99.96} & \underline{95.96} & 84.40 & \underline{83.00} & \underline{63.00} & \underline{94.17} \\
AM-Fast & \textbf{100.00} & \underline{99.80} & 96.80 & 98.80 & \underline{99.60} & \textbf{100.00} & 98.75 & 99.40 & 99.08 & 78.28 & \textbf{89.40} & 81.80 & \textbf{63.20} & 92.69 \\
AM-Slow & \textbf{100.00} & \textbf{100.00} & \underline{97.60} & 99.00 & \underline{99.60} & 99.60 & \underline{99.10} & 99.20 & 98.56 & 76.98 & \underline{88.07} & 81.80 & \textbf{63.20} & 92.52 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
Qwen 4B Instruct & 100.00 & 100.00 & 100.00 & 99.80 & 97.00 & 99.40 & 99.50 & 100.00 & 99.24 & 95.24 & 87.40 & 80.40 & 58.60 & 93.58 \\
\midrule
\textit{16x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 96.00 & 90.20 & 55.20 & \textbf{83.40} & 65.40 & 36.60 & \underline{70.25} & \underline{77.65} & 80.52 & 55.44 & \textbf{95.67} & 61.80 & 44.80 & 70.23 \\
LCLM 0.6b-4b (Concat) & 97.60 & 93.20 & \underline{61.80} & \underline{82.80} & 72.80 & \underline{41.20} & 62.10 & \textbf{78.00} & 68.10 & 54.60 & \underline{94.70} & 62.60 & 44.80 & \underline{70.33} \\
ExpAttn & \textbf{100.00} & 76.60 & 1.00 & 81.20 & 2.40 & 0.00 & 69.30 & 55.85 & 99.08 & 10.52 & 60.73 & 30.40 & 23.00 & 46.93 \\
SnapKV & 8.60 & 1.20 & 2.40 & 10.00 & 0.80 & 0.00 & 9.80 & 9.80 & 5.04 & 24.78 & 78.60 & 15.40 & 17.00 & 14.11 \\
SnapKV-QA & 88.20 & 62.00 & 2.40 & 14.60 & 9.00 & 7.60 & 10.30 & 44.25 & 52.68 & 21.16 & 68.13 & \textbf{74.20} & \textbf{53.60} & 39.09 \\
KVzip & \textbf{100.00} & \textbf{98.00} & \textbf{95.80} & 64.80 & 56.40 & \textbf{45.20} & 31.25 & 51.05 & \textbf{100.00} & 8.16 & 73.80 & 43.20 & 34.80 & 61.73 \\
KVzipFast & \textbf{100.00} & 43.60 & 38.80 & 30.80 & 65.20 & 39.00 & 10.85 & 17.15 & \underline{99.96} & 7.42 & 77.73 & 23.80 & 26.80 & 44.70 \\
AM-Fast & \textbf{100.00} & 60.00 & 11.80 & 42.00 & \underline{91.24} & 8.80 & 28.55 & 31.70 & 93.56 & \textbf{72.90} & 73.47 & 62.80 & 44.00 & 55.45 \\
AM-Slow & \underline{99.20} & \underline{94.80} & 38.40 & 79.40 & \textbf{93.20} & 11.60 & \textbf{73.45} & 72.95 & 94.08 & \underline{71.22} & 75.07 & \underline{65.20} & \underline{46.40} & \textbf{70.38} \\
\midrule
\textit{8x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 98.80 & 96.40 & 71.40 & 90.80 & 96.80 & 74.60 & 90.40 & 90.95 & 89.40 & \underline{79.82} & \textbf{95.67} & 70.60 & 52.60 & 84.48 \\
LCLM 0.6b-4b (Concat) & 98.60 & 95.80 & 82.00 & 92.00 & 95.40 & 81.00 & 93.10 & 92.30 & 90.00 & \textbf{88.00} & \underline{94.60} & 71.60 & 53.60 & 86.77 \\
ExpAttn & \textbf{100.00} & 98.20 & 6.20 & \underline{97.00} & 69.60 & 0.20 & 91.95 & 94.70 & \textbf{100.00} & 34.12 & 78.73 & 42.60 & 35.40 & 65.28 \\
SnapKV & 37.80 & 2.40 & 2.40 & 11.60 & 1.60 & 0.00 & 10.40 & 10.50 & 19.56 & 34.82 & 86.33 & 16.80 & 18.20 & 19.42 \\
SnapKV-QA & \underline{99.60} & 96.60 & 2.40 & 63.20 & 22.00 & 40.40 & 24.85 & \underline{97.60} & 91.44 & 35.48 & 75.33 & \textbf{77.40} & \underline{57.80} & 60.32 \\
KVzip & \textbf{100.00} & \underline{99.40} & \textbf{99.80} & 88.80 & 97.60 & \underline{98.00} & 82.25 & \textbf{98.60} & \textbf{100.00} & 72.24 & 83.53 & 71.40 & 56.80 & \textbf{88.34} \\
KVzipFast & \textbf{100.00} & \textbf{99.80} & \textbf{99.80} & 79.80 & \textbf{98.60} & \textbf{99.00} & 59.50 & 95.50 & \textbf{100.00} & 58.46 & 83.67 & 67.40 & \textbf{58.20} & 84.59 \\
AM-Fast & 97.60 & 98.60 & 91.40 & \textbf{98.60} & 98.20 & 95.00 & \textbf{96.05} & 97.20 & 93.84 & 54.00 & 79.60 & 74.20 & 57.40 & \underline{87.05} \\
AM-Slow & 96.80 & \underline{99.40} & \underline{93.40} & 96.40 & \underline{98.40} & 95.20 & \underline{95.70} & 95.84 & \underline{94.28} & 52.88 & 80.60 & \underline{75.00} & \underline{57.80} & \underline{87.05} \\
\midrule
\textit{4x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 99.40 & \underline{99.80} & 95.60 & 98.60 & 98.40 & 91.60 & \underline{98.80} & 98.95 & 94.64 & 82.04 & \underline{90.40} & 76.20 & 59.00 & 91.03 \\
LCLM 0.6b-4b (Concat) & 99.20 & 99.60 & 96.60 & \textbf{99.20} & \underline{99.60} & 96.00 & 98.60 & 99.20 & 94.90 & 84.20 & 89.80 & 75.20 & 57.40 & 91.50 \\
ExpAttn & \textbf{100.00} & \textbf{100.00} & 60.60 & \textbf{99.20} & 98.60 & 48.60 & 97.65 & 99.80 & \textbf{100.00} & 77.74 & 85.13 & 60.80 & 52.00 & 83.09 \\
SnapKV & 67.00 & 23.60 & 2.40 & 18.80 & 3.80 & 1.00 & 18.20 & 17.90 & 47.52 & 60.22 & \textbf{92.87} & 23.00 & 27.00 & 31.02 \\
SnapKV-QA & \textbf{100.00} & \underline{99.80} & 2.40 & \underline{99.00} & 46.80 & 87.40 & 66.75 & \underline{99.95} & \underline{99.28} & 66.00 & 86.27 & 77.00 & 59.40 & 76.16 \\
KVzip & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & 98.40 & \textbf{100.00} & \underline{99.80} & \textbf{99.20} & \textbf{100.00} & \textbf{100.00} & \textbf{91.86} & 88.40 & 77.20 & 57.80 & \textbf{93.28} \\
KVzipFast & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & 98.40 & \textbf{100.00} & \textbf{100.00} & 98.75 & \underline{99.95} & \textbf{100.00} & \underline{91.84} & 86.27 & 75.60 & \textbf{60.20} & \underline{93.15} \\
AM-Fast & 99.40 & \textbf{100.00} & 98.00 & 98.60 & 99.40 & 99.20 & 96.40 & 98.90 & 96.96 & 78.62 & 86.47 & \textbf{79.20} & 59.60 & 91.60 \\
AM-Slow & \underline{99.60} & \textbf{100.00} & \underline{98.40} & 98.40 & 99.40 & 98.20 & 95.95 & 98.05 & 97.40 & 78.02 & 87.33 & \underline{78.60} & \underline{59.80} & 91.47 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
Qwen 4B Instruct & 100.00 & 100.00 & 100.00 & 99.60 & 95.40 & 99.80 & 98.95 & 99.95 & 99.52 & 88.04 & 98.93 & 79.60 & 58.80 & 93.74 \\
\midrule
\textit{16x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 95.80 & 88.00 & 57.40 & 75.60 & 62.20 & \underline{30.00} & \underline{70.70} & 73.05 & 77.32 & 24.78 & \underline{98.73} & 59.40 & 43.80 & \textbf{65.91} \\
LCLM 0.6b-4b (Concat) & 96.80 & 90.60 & \underline{65.00} & 75.60 & \textbf{65.00} & 28.80 & 63.10 & 70.30 & 58.20 & 26.60 & \textbf{98.90} & 60.80 & 43.60 & \underline{64.87} \\
ExpAttn & \textbf{100.00} & 83.00 & 1.60 & \underline{86.60} & 1.60 & 0.00 & \textbf{78.30} & \underline{76.35} & 99.32 & 8.52 & 71.07 & 28.40 & 24.00 & 50.67 \\
SnapKV & 41.80 & 4.20 & 2.40 & 11.20 & 2.00 & 0.20 & 10.00 & 10.85 & 17.60 & 16.64 & 92.20 & 12.40 & 16.00 & 18.27 \\
SnapKV-QA & \underline{99.80} & \textbf{99.40} & 2.40 & \textbf{88.40} & 11.60 & 27.00 & 30.75 & \textbf{99.15} & 93.68 & 22.22 & 98.33 & \textbf{74.80} & \textbf{58.00} & 61.96 \\
KVzip & \textbf{100.00} & \underline{92.20} & \textbf{91.60} & 55.20 & 37.60 & 18.40 & 35.70 & 53.10 & \textbf{99.96} & 3.18 & 88.87 & 41.20 & 36.60 & 57.97 \\
KVzipFast & \textbf{100.00} & 53.60 & 42.00 & 38.40 & \underline{64.40} & \textbf{44.00} & 19.05 & 25.40 & \underline{99.76} & 3.58 & 79.00 & 23.40 & 29.60 & 47.86 \\
AM-Fast & 52.05 & 33.00 & 0.20 & 28.40 & 20.60 & 0.00 & 19.40 & 23.70 & 54.92 & \textbf{55.44} & 94.67 & 58.60 & 42.20 & 37.17 \\
AM-Slow & 61.00 & 31.20 & 1.20 & 38.60 & 24.80 & 0.00 & 32.30 & 34.50 & 66.84 & \underline{52.28} & 93.33 & \underline{63.00} & \underline{47.20} & 42.02 \\
\midrule
\textit{8x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \underline{99.20} & 95.00 & 72.60 & 88.80 & \underline{94.20} & 69.20 & 88.10 & 90.10 & 89.04 & \underline{67.34} & 97.73 & 68.40 & 52.40 & 82.47 \\
LCLM 0.6b-4b (Concat) & 98.80 & 95.80 & 82.00 & 88.20 & 90.20 & 73.20 & \underline{91.00} & 90.90 & 90.00 & \textbf{77.50} & 98.50 & 69.40 & 52.20 & 84.44 \\
ExpAttn & \textbf{100.00} & 97.80 & 6.80 & \underline{97.00} & 58.00 & 0.60 & \textbf{93.65} & 96.90 & \underline{99.92} & 30.14 & 95.73 & 39.00 & 34.80 & 65.41 \\
SnapKV & 63.60 & 11.00 & 2.40 & 12.40 & 2.60 & 0.60 & 11.70 & 12.65 & 47.72 & 34.86 & 97.40 & 14.40 & 18.80 & 25.39 \\
SnapKV-QA & \textbf{100.00} & \textbf{100.00} & 2.40 & \textbf{99.20} & 27.40 & 67.60 & 65.70 & \textbf{99.90} & 99.00 & 40.72 & \textbf{99.53} & \textbf{75.40} & \textbf{60.40} & 72.10 \\
KVzip & \textbf{100.00} & \underline{99.80} & \underline{99.60} & 90.40 & 92.60 & \underline{93.00} & 83.95 & \underline{98.65} & \textbf{99.96} & 66.62 & \underline{98.67} & 71.40 & \underline{57.80} & \textbf{88.65} \\
KVzipFast & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & 76.80 & \textbf{95.60} & \textbf{96.00} & 65.85 & 96.95 & \textbf{99.96} & 58.10 & 97.07 & 68.40 & 53.80 & \underline{85.27} \\
AM-Fast & 51.20 & 60.20 & 10.40 & 58.20 & 75.40 & 13.40 & 56.45 & 58.20 & 55.24 & 67.10 & 96.20 & \underline{74.00} & 52.80 & 56.06 \\
AM-Slow & 53.80 & 62.20 & 13.00 & 70.60 & 78.60 & 17.80 & 63.25 & 66.80 & 60.84 & 65.30 & 95.20 & 71.80 & 52.20 & 59.34 \\
\midrule
\textit{4x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \underline{99.60} & 99.60 & 94.80 & 99.20 & 98.20 & 90.40 & 98.15 & 98.80 & 91.40 & 67.34 & \underline{99.53} & 75.60 & 56.80 & 89.96 \\
LCLM 0.6b-4b (Concat) & 99.20 & 99.20 & \underline{95.60} & \underline{99.60} & 99.00 & 92.40 & 98.30 & 98.70 & 92.80 & 69.90 & 98.50 & 75.60 & 57.60 & 90.49 \\
ExpAttn & \textbf{100.00} & \underline{99.80} & 57.40 & 99.00 & 98.00 & 34.60 & 98.95 & \underline{99.60} & \textbf{99.96} & 73.42 & 98.47 & 61.20 & 52.20 & 82.51 \\
SnapKV & 80.60 & 46.40 & 2.40 & 19.20 & 4.00 & 2.20 & 14.95 & 18.30 & 78.16 & 59.58 & 99.13 & 22.00 & 25.40 & 36.33 \\
SnapKV-QA & \textbf{100.00} & \underline{99.80} & 6.20 & \textbf{99.80} & 59.00 & 94.60 & 93.45 & \textbf{99.95} & \underline{99.88} & 66.60 & \textbf{99.73} & 75.40 & 59.40 & 81.06 \\
KVzip & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & 99.20 & \textbf{99.80} & \textbf{100.00} & \textbf{99.60} & \textbf{99.95} & \textbf{99.96} & \underline{85.00} & 98.80 & \underline{77.60} & \textbf{61.20} & \textbf{93.93} \\
KVzipFast & \textbf{100.00} & \textbf{100.00} & \textbf{100.00} & 98.60 & \underline{99.40} & \underline{99.80} & \underline{99.20} & \textbf{99.95} & \textbf{99.96} & \textbf{85.46} & 98.93 & 76.00 & \underline{60.00} & \underline{93.64} \\
AM-Fast & 60.00 & 94.40 & 80.20 & 93.00 & 91.60 & 91.20 & 91.60 & 92.45 & 63.56 & 79.80 & 96.46 & \textbf{78.80} & 57.80 & 82.37 \\
AM-Slow & 54.40 & 94.60 & 78.60 & 95.00 & 94.00 & 92.00 & 91.85 & 94.10 & 63.16 & 79.08 & 95.40 & \underline{77.60} & 56.00 & 81.98 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrrrrr|r}
\toprule
Model & narr & qasp & mfq & hpot & 2wiki & musq & govr & qmsm & mnews & trec & triv & samsm & pcnt & pr & lcc & rbp & AVG \\
\midrule
Qwen 4B Instruct & 30.18 & 43.60 & 46.63 & 52.05 & 36.18 & 17.26 & 30.32 & 22.91 & 23.79 & 75.50 & 85.14 & 40.03 & 3.50 & 94.91 & 65.00 & 56.36 & 45.21 \\
\midrule
\textit{16x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & 25.57 & 39.80 & \underline{47.98} & 47.96 & \textbf{38.44} & 21.17 & 28.56 & 22.42 & 23.48 & 45.00 & \textbf{89.43} & 32.50 & 3.00 & \underline{81.50} & 37.47 & 41.05 & 39.08 \\
LCLM 0.6b-4b (Concat) & 23.30 & 36.80 & \textbf{48.00} & \underline{49.00} & \underline{38.30} & 22.60 & 28.60 & 22.40 & 23.10 & 46.50 & \underline{88.20} & 32.70 & 3.50 & 37.00 & 39.90 & 39.90 & 36.24 \\
ExpAttn & 17.67 & 23.16 & 28.88 & 33.76 & 23.45 & 8.78 & 23.06 & 19.86 & 19.54 & 36.00 & 81.02 & 31.80 & \underline{7.36} & 30.32 & 28.24 & \textbf{58.14} & 29.44 \\
SnapKV & 12.97 & 13.68 & 22.14 & 20.93 & 16.13 & 5.61 & 21.61 & 17.66 & 14.70 & 32.75 & 85.87 & 36.57 & 3.25 & 17.22 & \textbf{53.42} & \underline{56.78} & 26.96 \\
SnapKV-QA & \textbf{28.36} & 28.81 & 40.37 & \textbf{53.20} & 29.33 & 17.33 & 21.57 & 22.78 & 14.06 & 35.00 & 83.92 & 37.66 & 2.37 & \textbf{92.50} & \textbf{53.42} & 55.60 & 38.52 \\
KVzip & 16.08 & 16.38 & 32.53 & 10.30 & 8.68 & 4.93 & 24.43 & 21.38 & 17.49 & 43.00 & 78.22 & 31.34 & 6.50 & 14.00 & 31.96 & 56.53 & 25.86 \\
KVzipFast & 11.96 & 18.26 & 24.60 & 23.57 & 18.08 & 4.31 & 20.15 & 20.48 & 15.03 & 34.00 & 72.78 & 34.22 & \textbf{7.44} & 8.25 & 46.89 & 52.12 & 25.76 \\
AM-Fast & \underline{26.31} & \textbf{42.64} & 45.22 & 48.31 & 30.57 & \underline{24.17} & \underline{31.59} & \underline{23.90} & \textbf{24.14} & \textbf{74.50} & 85.84 & \textbf{40.37} & 5.11 & 47.04 & 48.26 & 49.81 & \textbf{40.49} \\
AM-Slow & 24.83 & \underline{41.04} & 45.28 & 45.85 & 30.76 & \textbf{24.84} & \textbf{31.62} & \textbf{24.02} & \underline{24.08} & \underline{74.00} & 86.17 & \underline{39.60} & 5.08 & 47.38 & \underline{48.79} & 50.76 & \underline{40.26} \\
\midrule
\textit{8x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \underline{28.60} & 43.92 & \textbf{51.54} & \textbf{54.02} & \textbf{45.36} & \underline{28.05} & 29.05 & 22.85 & 23.56 & 62.00 & \underline{86.39} & 33.17 & \underline{5.45} & 71.50 & 43.57 & 46.63 & 42.23 \\
LCLM 0.6b-4b (Concat) & \textbf{30.30} & 42.70 & \underline{50.70} & 50.80 & \underline{44.30} & \textbf{28.20} & 29.20 & 22.90 & 23.90 & 62.50 & \textbf{89.00} & 33.10 & 4.60 & 70.50 & 45.20 & 49.20 & 42.32 \\
ExpAttn & 23.27 & 32.95 & 36.13 & 40.83 & 31.71 & 13.74 & 27.81 & 21.46 & 22.15 & 65.75 & 84.11 & 33.95 & 5.03 & 77.22 & 36.30 & 57.69 & 38.13 \\
SnapKV & 18.27 & 21.07 & 25.80 & 29.97 & 17.90 & 6.88 & 24.52 & 18.96 & 18.22 & 47.50 & 85.01 & 38.58 & 3.27 & 46.62 & \textbf{58.34} & 55.63 & 32.28 \\
SnapKV-QA & 28.50 & 34.45 & 45.16 & \underline{53.20} & 34.12 & 18.53 & 25.11 & 22.86 & 17.56 & 51.50 & 85.33 & 38.70 & 2.19 & \textbf{94.20} & \textbf{58.34} & \underline{58.00} & 41.73 \\
KVzip & 25.80 & 41.38 & 49.59 & 46.31 & 27.10 & 15.05 & 28.73 & 22.71 & 23.07 & 69.50 & 83.99 & 37.62 & 3.79 & 68.67 & 49.10 & \textbf{59.16} & 40.72 \\
KVzipFast & 20.72 & 35.80 & 47.81 & 45.06 & 29.11 & 15.01 & 27.96 & 23.00 & 22.73 & 66.00 & 83.63 & 36.81 & 3.33 & 41.96 & 54.86 & 56.91 & 38.17 \\
AM-Fast & 28.23 & \underline{44.29} & 49.74 & 52.61 & 36.39 & 25.60 & \underline{30.70} & \textbf{23.19} & \textbf{24.02} & \underline{73.50} & 85.27 & \textbf{39.87} & \textbf{5.67} & 86.50 & 54.38 & 52.05 & \textbf{44.50} \\
AM-Slow & 25.12 & \textbf{45.13} & 47.57 & 52.98 & 36.38 & 24.04 & \textbf{30.99} & \underline{23.07} & \underline{23.99} & \textbf{74.00} & 85.51 & \underline{39.39} & 4.05 & \underline{87.50} & \underline{55.42} & 52.34 & \underline{44.22} \\
\midrule
\textit{4x compression} \\
\midrule
LCLM 0.6b-4b (Mean) & \textbf{30.74} & \textbf{46.76} & \textbf{49.76} & 52.90 & \textbf{42.75} & \underline{28.61} & 30.12 & 22.89 & 23.83 & 68.50 & \underline{87.90} & 34.10 & \underline{6.67} & \textbf{100.00} & 52.01 & \underline{59.17} & \textbf{46.04} \\
LCLM 0.6b-4b (Concat) & 28.30 & \underline{46.00} & 48.70 & \textbf{56.80} & \underline{42.40} & \textbf{29.70} & 29.80 & 22.80 & \underline{24.00} & 66.00 & \textbf{88.50} & 34.20 & \textbf{7.50} & 98.00 & 49.20 & 54.30 & 45.39 \\
ExpAttn & 27.13 & 38.72 & 44.59 & 49.87 & 36.76 & 17.10 & 29.42 & 22.63 & 23.12 & 71.00 & 84.55 & 35.97 & 3.67 & 87.29 & 42.13 & 57.28 & 41.95 \\
SnapKV & 21.88 & 30.57 & 30.90 & 37.39 & 26.71 & 11.58 & 27.18 & 20.75 & 20.71 & 59.00 & 84.86 & \textbf{39.63} & 3.00 & 78.86 & \textbf{61.53} & 56.36 & 38.18 \\
SnapKV-QA & \underline{29.55} & 39.75 & 45.91 & 53.23 & 37.11 & 17.92 & 27.85 & \underline{23.22} & 20.13 & 66.50 & 84.94 & 38.60 & 2.96 & 93.54 & \textbf{61.53} & 57.14 & 43.74 \\
KVzip & 28.38 & 45.62 & 48.61 & \underline{54.37} & 40.26 & 18.45 & 29.86 & 23.03 & 23.86 & \underline{73.00} & 85.37 & 39.15 & 2.28 & 93.92 & 59.70 & 58.92 & 45.30 \\
KVzipFast & 28.61 & 44.93 & 48.54 & 51.89 & 36.63 & 18.63 & 30.24 & 22.98 & 23.67 & \underline{73.00} & 84.87 & 38.86 & 4.55 & 94.71 & \underline{60.82} & \textbf{59.66} & 45.16 \\
AM-Fast & 26.27 & 44.45 & \underline{49.18} & 53.99 & 37.78 & 22.66 & \textbf{30.74} & 22.83 & 23.89 & \textbf{74.50} & 83.87 & 39.06 & 5.50 & 99.00 & 58.13 & 55.85 & 45.48 \\
AM-Slow & 26.30 & 44.98 & 47.17 & 54.16 & 41.41 & 24.10 & \underline{30.57} & \textbf{23.40} & \textbf{24.05} & \textbf{74.50} & 85.19 & \underline{39.39} & 3.11 & \underline{99.17} & 58.30 & 56.67 & \underline{45.78} \\
\bottomrule
\end{tabular}
}
\small
\setlength{\tabcolsep}{6pt}

  Model                                     mfq                  dure                 vcsum                  lsht                    pr                   AVG
  ----------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  Qwen 4B Instruct                        61.60                 25.45                 11.38                 43.50                 91.23                 46.63
  *16x compression*
  LCLM 0.6b-4b (Mean)                     46.83                 23.63                 10.55                 18.67   [59.00]{.underline}                 31.74
  LCLM 0.6b-4b (Concat)               **51.00**                 23.50                 10.10                 19.30                 25.50                 25.88
  ExpAttn                                 25.52                 16.37                 11.22                 17.75                 31.67                 20.51
  SnapKV                                  20.36                 16.77                 10.44                 35.00                  8.93                 18.30
  SnapKV-QA                 [48.61]{.underline}                 21.46                 10.44             **41.50**             **89.28**             **42.26**
  KVzip                                   25.95                 16.30                 12.50                 23.75                  7.12                 17.12
  KVzipFast                               30.95                 17.04                 10.92                 19.00                  7.47                 17.08
  AM-Fast                                 48.55   [27.13]{.underline}   [13.62]{.underline}   [39.50]{.underline}                 28.76                 31.51
  AM-Slow                                 48.25             **28.02**             **13.75**                 39.25                 30.00   [31.85]{.underline}
  *8x compression*
  LCLM 0.6b-4b (Mean)                     51.27                 24.83                 10.79                 25.65                 60.50                 34.61
  LCLM 0.6b-4b (Concat)                   53.10                 24.10                 10.90                 26.70                 67.50                 36.46
  ExpAttn                                 32.75                 18.70                 11.02                 21.00   [69.32]{.underline}                 30.56
  SnapKV                                  26.07                 19.01                 10.81             **43.25**                 26.78                 25.18
  SnapKV-QA                               58.07                 23.37                 10.81                 42.50             **92.29**             **45.41**
  KVzip                                   58.26                 25.30                 11.98                 40.75                 44.17                 36.09
  KVzipFast                               57.35                 24.84                 12.77                 24.00                 35.60                 30.91
  AM-Fast                             **59.11**             **28.47**   [13.08]{.underline}   [42.75]{.underline}                 64.21   [41.52]{.underline}
  AM-Slow                   [58.88]{.underline}   [28.40]{.underline}             **13.18**   [42.75]{.underline}                 62.34                 41.11
  *4x compression*
  LCLM 0.6b-4b (Mean)                     45.52                 26.07                 11.00                 25.00   [97.00]{.underline}                 40.92
  LCLM 0.6b-4b (Concat)                   46.20                 26.80                 11.00                 30.00                 95.00                 41.80
  ExpAttn                                 54.32                 23.86                 11.63                 34.25                 83.81                 41.57
  SnapKV                                  33.48                 20.48                 11.07             **44.75**                 61.32                 34.22
  SnapKV-QA                               60.69                 24.33                 11.07                 43.50                 91.34                 46.19
  KVzip                     [62.35]{.underline}                 26.90                 11.86                 43.75                 92.37                 47.45
  KVzipFast                           **63.62**                 27.23                 11.95   [44.50]{.underline}                 96.02             **48.66**
  AM-Fast                                 58.73   [28.45]{.underline}             **12.97**                 43.75             **98.03**   [48.39]{.underline}
  AM-Slow                                 59.04             **29.09**   [12.61]{.underline}                 43.25                 96.37                 48.07

  : LongBench Chinese (cn5) per-subtask. Rows grouped by compression ratio (16x, 8x, 4x). {#app-tab:longbench-cn}

\FloatBarrier
\FloatBarrier

## LR Sweep {#app-subsec:lr-sweep}

### Continual Pre-training LR sweep

\small
\setlength{\tabcolsep}{6pt}

  Model                               RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  ---------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  $1.0 \times 10^{-6}$                   62.33                 58.75                 56.65   [33.29]{.underline}   [22.88]{.underline}                 60.25                 66.94
  $3.0 \times 10^{-6}$     [70.73]{.underline}   [65.41]{.underline}   [63.55]{.underline}                 32.89                 22.09   [62.50]{.underline}   [72.25]{.underline}
  $1.0 \times 10^{-5}$               **71.62**             **66.83**             **63.63**             **33.52**             **23.94**             **62.75**             **74.53**

  : Continual Pret-raining LR sweep summary (16 $\times$ ratio compression, bidirectional mask, mlp adapter, $O = 0$). {#tab:continual-lr-summary}

\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{Context length 4096} \\
\midrule
$1.0 \times 10^{-6}$ & 97.40 & 66.40 & 31.40 & \underline{62.00} & 47.20 & 17.40 & 71.70 & 62.10 & 84.88 & \textbf{69.16} & 88.00 & 62.80 & 49.80 & 62.33 \\
$3.0 \times 10^{-6}$ & \textbf{98.80} & \textbf{90.60} & \textbf{42.60} & \textbf{81.40} & \underline{54.40} & \underline{21.80} & \underline{84.50} & \underline{81.15} & \textbf{91.48} & 65.72 & \underline{88.07} & \underline{65.20} & \textbf{53.80} & \underline{70.73} \\
$1.0 \times 10^{-5}$ & \underline{97.80} & \underline{83.00} & \underline{39.40} & \textbf{81.40} & \textbf{64.60} & \textbf{29.80} & \textbf{85.55} & \textbf{81.30} & \underline{90.96} & \underline{68.24} & \textbf{91.07} & \textbf{66.20} & \underline{51.80} & \textbf{71.62} \\
\midrule
\textit{Context length 8192} \\
\midrule
$1.0 \times 10^{-6}$ & 96.20 & 77.80 & 28.00 & 60.80 & 40.00 & 12.40 & 60.40 & 57.95 & 84.84 & \textbf{38.70} & \textbf{96.40} & \textbf{62.20} & 48.00 & 58.75 \\
$3.0 \times 10^{-6}$ & \underline{99.00} & \underline{88.00} & \underline{40.80} & \textbf{75.60} & \underline{44.20} & \underline{14.40} & \textbf{78.55} & \textbf{76.15} & \textbf{91.40} & 34.18 & \underline{95.47} & 61.60 & \textbf{51.00} & \underline{65.41} \\
$1.0 \times 10^{-5}$ & \textbf{99.20} & \textbf{89.80} & \textbf{43.60} & \underline{73.60} & \textbf{58.60} & \textbf{19.60} & \underline{76.35} & \underline{74.10} & \underline{91.04} & \underline{36.74} & 95.33 & \underline{61.80} & \underline{49.00} & \textbf{66.83} \\
\midrule
\textit{Context length 16384} \\
\midrule
$1.0 \times 10^{-6}$ & 98.60 & 78.00 & 32.00 & 64.60 & \underline{38.40} & 8.00 & 60.75 & 58.50 & 81.96 & \textbf{12.66} & \textbf{99.53} & 56.60 & 46.80 & 56.65 \\
$3.0 \times 10^{-6}$ & \textbf{99.40} & \textbf{90.20} & \textbf{46.60} & \underline{74.00} & 38.00 & \underline{11.20} & \underline{77.40} & \textbf{77.15} & \textbf{91.24} & \underline{11.82} & \textbf{99.53} & \underline{59.00} & \underline{50.60} & \underline{63.55} \\
$1.0 \times 10^{-5}$ & \underline{98.80} & \underline{84.60} & \underline{42.00} & \textbf{75.60} & \textbf{49.80} & \textbf{12.80} & \textbf{78.40} & \underline{74.60} & \underline{89.12} & 9.88 & \underline{99.40} & \textbf{61.20} & \textbf{51.00} & \textbf{63.63} \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrrrrr|r}
\toprule
Model & narr & qasp & mfq & hpot & 2wiki & musq & govr & qmsm & mnews & trec & triv & samsm & pcnt & pr & lcc & rbp & AVG \\
\midrule
$1.0 \times 10^{-6}$ & 22.98 & 37.30 & \textbf{42.23} & \underline{37.19} & \textbf{31.51} & \textbf{14.42} & 28.61 & 22.15 & 22.72 & \textbf{31.00} & 85.35 & \textbf{33.31} & \underline{1.50} & \textbf{50.00} & 35.85 & 36.53 & \underline{33.29} \\
$3.0 \times 10^{-6}$ & \textbf{23.67} & \underline{38.81} & 41.70 & \textbf{38.09} & 29.49 & \underline{13.43} & \textbf{29.25} & \textbf{22.98} & \textbf{22.98} & \underline{30.00} & \underline{86.44} & \underline{32.39} & 0.50 & 41.50 & \underline{36.19} & \underline{38.82} & 32.89 \\
$1.0 \times 10^{-5}$ & \underline{23.32} & \textbf{40.05} & \underline{42.18} & 35.36 & \underline{31.18} & 12.64 & \underline{29.12} & \underline{22.95} & \underline{22.85} & 29.50 & \textbf{88.68} & 32.28 & \textbf{4.29} & \underline{44.00} & \textbf{38.07} & \textbf{39.92} & \textbf{33.52} \\
\bottomrule
\end{tabular}
}
\small
\setlength{\tabcolsep}{6pt}

  Model                                    mfq                  dure                 vcsum                  lsht                    pr                   AVG
  ---------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  $1.0 \times 10^{-6}$     [38.79]{.underline}                 24.13             **11.00**             **21.00**   [19.50]{.underline}   [22.88]{.underline}
  $3.0 \times 10^{-6}$                   36.39   [25.97]{.underline}                 10.85                 17.75   [19.50]{.underline}                 22.09
  $1.0 \times 10^{-5}$               **39.65**             **27.07**   [10.99]{.underline}   [19.50]{.underline}             **22.50**             **23.94**

  : Continual Pretraining LR sweep -- LongBench Chinese (cn5) per-subtask. {#tab:continual-lr-longbench-cn}

\FloatBarrier

### Post-training LR sweep

\small
\setlength{\tabcolsep}{6pt}

  Model                               RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  ---------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  $2.0 \times 10^{-5}$                   72.99                 69.08                 63.89                 37.98   [33.08]{.underline}                 66.75                 80.89
  $3.0 \times 10^{-5}$     [75.06]{.underline}   [70.23]{.underline}   [65.91]{.underline}             **39.08**                 31.74   [67.50]{.underline}             **81.05**
  $5.0 \times 10^{-5}$               **75.17**             **71.06**             **66.22**   [38.72]{.underline}             **34.16**             **69.00**   [80.97]{.underline}

  : Post-training LR sweep summary (16 $\times$ ratio compression, causal mask, mlp adapter, continual pre-training LR $= 1.0 \times 10^{-5}$). {#tab:posttrain-lr-summary}

\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{Context length 4096} \\
\midrule
$2.0 \times 10^{-5}$ & 94.60 & 89.20 & 40.60 & 82.40 & 80.20 & 46.00 & \underline{72.20} & \underline{75.55} & \underline{83.56} & 78.94 & \underline{89.40} & \textbf{66.40} & \textbf{49.80} & 72.99 \\
$3.0 \times 10^{-5}$ & \textbf{96.40} & \underline{89.40} & \underline{54.60} & \textbf{85.00} & \underline{80.80} & \underline{48.60} & \textbf{72.45} & \textbf{80.20} & \textbf{83.92} & \textbf{80.54} & 88.27 & \underline{66.00} & \underline{49.60} & \underline{75.06} \\
$5.0 \times 10^{-5}$ & \underline{96.00} & \textbf{90.00} & \textbf{58.20} & \underline{85.00} & \textbf{87.00} & \textbf{50.60} & 71.10 & 74.65 & 82.12 & \underline{79.04} & \textbf{91.93} & 64.20 & 47.40 & \textbf{75.17} \\
\midrule
\textit{Context length 8192} \\
\midrule
$2.0 \times 10^{-5}$ & 93.60 & 86.20 & 51.20 & 80.20 & \underline{67.00} & 32.20 & 68.00 & \textbf{77.70} & \textbf{80.76} & 54.82 & \textbf{95.73} & \textbf{63.40} & \textbf{47.20} & 69.08 \\
$3.0 \times 10^{-5}$ & \textbf{96.00} & \textbf{90.20} & \underline{55.20} & \textbf{83.40} & 65.40 & \underline{36.60} & \textbf{70.25} & \underline{77.65} & \underline{80.52} & \underline{55.44} & \underline{95.67} & \underline{61.80} & 44.80 & \underline{70.23} \\
$5.0 \times 10^{-5}$ & \underline{94.60} & \underline{89.80} & \textbf{57.60} & \underline{83.40} & \textbf{76.20} & \textbf{37.40} & \underline{69.25} & 75.15 & 73.32 & \textbf{63.56} & 95.33 & 61.20 & \underline{47.00} & \textbf{71.06} \\
\midrule
\textit{Context length 16384} \\
\midrule
$2.0 \times 10^{-5}$ & 92.20 & 85.80 & 47.00 & \textbf{75.60} & 55.60 & \underline{29.00} & \underline{68.10} & 68.25 & \textbf{79.44} & 24.10 & \underline{98.67} & \textbf{61.00} & \textbf{45.80} & 63.89 \\
$3.0 \times 10^{-5}$ & \underline{95.80} & \underline{88.00} & \textbf{57.40} & \underline{75.60} & \underline{62.20} & \textbf{30.00} & \textbf{70.70} & \textbf{73.05} & \underline{77.32} & \underline{24.78} & \textbf{98.73} & 59.40 & \underline{43.80} & \underline{65.91} \\
$5.0 \times 10^{-5}$ & \textbf{96.40} & \textbf{90.60} & \underline{53.80} & 74.60 & \textbf{72.40} & 28.20 & 67.50 & \underline{69.70} & 67.60 & \textbf{38.38} & 98.53 & \underline{60.20} & 43.00 & \textbf{66.22} \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrrrrr|r}
\toprule
Model & narr & qasp & mfq & hpot & 2wiki & musq & govr & qmsm & mnews & trec & triv & samsm & pcnt & pr & lcc & rbp & AVG \\
\midrule
$2.0 \times 10^{-5}$ & \underline{23.54} & \underline{40.34} & \textbf{48.52} & 46.17 & 36.93 & \underline{21.76} & \textbf{28.59} & \underline{22.21} & 22.76 & \textbf{49.00} & \underline{87.88} & \underline{32.31} & \underline{3.50} & 74.00 & 34.65 & 35.54 & 37.98 \\
$3.0 \times 10^{-5}$ & \textbf{25.57} & 39.80 & \underline{47.98} & \textbf{47.96} & \textbf{38.44} & 21.17 & \underline{28.56} & \textbf{22.42} & \textbf{23.48} & \underline{45.00} & \textbf{89.43} & \textbf{32.50} & 3.00 & \underline{81.50} & \textbf{37.47} & \textbf{41.05} & \textbf{39.08} \\
$5.0 \times 10^{-5}$ & 23.12 & \textbf{42.47} & 47.64 & \underline{46.80} & \underline{37.13} & \textbf{24.03} & 28.27 & 22.04 & \underline{22.80} & 44.00 & 87.10 & 32.15 & \textbf{4.50} & \textbf{85.50} & \underline{35.45} & \underline{36.57} & \underline{38.72} \\
\bottomrule
\end{tabular}
}
\small
\setlength{\tabcolsep}{6pt}

  Model                                    mfq                  dure                 vcsum                  lsht                    pr                   AVG
  ---------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  $2.0 \times 10^{-5}$     [47.53]{.underline}             **24.26**             **10.77**             **20.83**   [62.00]{.underline}   [33.08]{.underline}
  $3.0 \times 10^{-5}$                   46.83   [23.63]{.underline}   [10.55]{.underline}                 18.67                 59.00                 31.74
  $5.0 \times 10^{-5}$               **47.61**                 21.95                 10.38   [18.85]{.underline}             **72.00**             **34.16**

  : Post-training LR sweep -- LongBench Chinese (cn5) per-subtask. {#tab:posttrain-lr-longbench-cn}

\FloatBarrier
\clearpage
\FloatBarrier

## Architecture comparison {#app-subsec:arch-sweep}

### Pooling Operator {#app-subsubsec:pooling-operator}

\small
\setlength{\tabcolsep}{6pt}

  Model                  RULER 4k    RULER 8k   RULER 16k     LB en16      LB cn5   LongHealth5       GSM8K
  ------------------- ----------- ----------- ----------- ----------- ----------- ------------- -----------
  *4x compression*
  Mean                      91.76       91.03       89.96   **46.04**       40.92         76.25   **91.05**
  Concat                **92.30**   **91.50**   **90.50**       45.39   **41.80**     **82.50**       89.90
  *8x compression*
  Mean                      85.42       84.48       82.47       42.23       34.61         71.75       87.26
  Concat                **87.20**   **86.80**   **84.40**   **42.32**   **36.46**     **79.50**   **87.40**
  *16x compression*
  Mean                  **75.06**       70.23   **65.91**   **39.08**   **31.74**         67.50   **81.05**
  Concat                    74.50   **70.30**       64.90       36.24       25.88     **76.30**       78.90

  : Pooling operator: Mean vs Concat ($W = 1024$, causal, continual pre-training LR = $1.0 \times 10^{-5}$). {#app-tab:pool-summary}

\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{4x compression} \\
\midrule
Mean   & \textbf{99.40} & 99.60 & 93.40 & 99.00 & 99.40 & 93.80 & 98.25 & \textbf{99.20} & \textbf{94.64} & 91.42 & 86.73 & \textbf{79.20} & 58.80 & 91.76 \\
Concat & 99.20 & \textbf{99.80} & \textbf{94.40} & \textbf{99.40} & \textbf{99.80} & \textbf{97.60} & \textbf{98.80} & \textbf{99.20} & 93.90 & \textbf{92.80} & \textbf{87.60} & 77.20 & \textbf{59.80} & \textbf{92.27} \\
\midrule
\textit{8x compression} \\
\midrule
Mean   & \textbf{98.60} & 94.20 & 64.20 & 93.00 & \textbf{98.20} & 82.20 & 92.30 & 90.70 & 87.40 & 90.80 & \textbf{90.60} & 73.20 & \textbf{55.00} & 85.42 \\
Concat & 98.40 & \textbf{97.00} & \textbf{76.60} & \textbf{94.60} & 95.60 & \textbf{86.00} & \textbf{93.80} & \textbf{92.30} & \textbf{90.70} & \textbf{92.80} & 88.40 & \textbf{74.80} & 53.20 & \textbf{87.25} \\
\midrule
\textit{16x compression} \\
\midrule
Mean   & \textbf{96.40} & 89.40 & 54.60 & \textbf{85.00} & 80.80 & 48.60 & \textbf{72.45} & \textbf{80.20} & \textbf{83.92} & \textbf{80.54} & 88.27 & \textbf{66.00} & \textbf{49.60} & \textbf{75.06} \\
Concat & \textbf{96.40} & \textbf{91.80} & \textbf{62.40} & 83.80 & \textbf{84.20} & \textbf{54.00} & 59.50 & 77.50 & 79.50 & 74.30 & \textbf{90.10} & 65.40 & 49.00 & 74.45 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{4x compression} \\
\midrule
Mean   & \textbf{99.40} & \textbf{99.80} & 95.60 & 98.60 & 98.40 & 91.60 & \textbf{98.80} & 98.95 & 94.64 & 82.04 & \textbf{90.40} & \textbf{76.20} & \textbf{59.00} & 91.03 \\
Concat & 99.20 & 99.60 & \textbf{96.60} & \textbf{99.20} & \textbf{99.60} & \textbf{96.00} & 98.60 & \textbf{99.20} & \textbf{94.90} & \textbf{84.20} & 89.80 & 75.20 & 57.40 & \textbf{91.50} \\
\midrule
\textit{8x compression} \\
\midrule
Mean   & \textbf{98.80} & \textbf{96.40} & 71.40 & 90.80 & \textbf{96.80} & 74.60 & 90.40 & 90.95 & 89.40 & 79.82 & \textbf{95.67} & 70.60 & 52.60 & 84.48 \\
Concat & 98.60 & 95.80 & \textbf{82.00} & \textbf{92.00} & 95.40 & \textbf{81.00} & \textbf{93.10} & \textbf{92.30} & \textbf{90.00} & \textbf{88.00} & 94.60 & \textbf{71.60} & \textbf{53.60} & \textbf{86.77} \\
\midrule
\textit{16x compression} \\
\midrule
Mean   & 96.00 & 90.20 & 55.20 & \textbf{83.40} & 65.40 & 36.60 & \textbf{70.25} & 77.65 & \textbf{80.52} & \textbf{55.44} & \textbf{95.67} & 61.80 & \textbf{44.80} & 70.23 \\
Concat & \textbf{97.60} & \textbf{93.20} & \textbf{61.80} & 82.80 & \textbf{72.80} & \textbf{41.20} & 62.10 & \textbf{78.00} & 68.10 & 54.60 & 94.70 & \textbf{62.60} & \textbf{44.80} & \textbf{70.33} \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{4x compression} \\
\midrule
Mean   & \textbf{99.60} & \textbf{99.60} & 94.80 & 99.20 & 98.20 & 90.40 & 98.15 & \textbf{98.80} & 91.40 & 67.34 & \textbf{99.53} & \textbf{75.60} & 56.80 & 89.96 \\
Concat & 99.20 & 99.20 & \textbf{95.60} & \textbf{99.60} & \textbf{99.00} & \textbf{92.40} & \textbf{98.30} & 98.70 & \textbf{92.80} & \textbf{69.90} & 98.50 & \textbf{75.60} & \textbf{57.60} & \textbf{90.49} \\
\midrule
\textit{8x compression} \\
\midrule
Mean   & \textbf{99.20} & 95.00 & 72.60 & \textbf{88.80} & \textbf{94.20} & 69.20 & 88.10 & 90.10 & 89.04 & 67.34 & 97.73 & 68.40 & \textbf{52.40} & 82.47 \\
Concat & 98.80 & \textbf{95.80} & \textbf{82.00} & 88.20 & 90.20 & \textbf{73.20} & \textbf{91.00} & \textbf{90.90} & \textbf{90.00} & \textbf{77.50} & \textbf{98.50} & \textbf{69.40} & 52.20 & \textbf{84.44} \\
\midrule
\textit{16x compression} \\
\midrule
Mean   & 95.80 & 88.00 & 57.40 & \textbf{75.60} & 62.20 & \textbf{30.00} & \textbf{70.70} & \textbf{73.05} & \textbf{77.32} & 24.78 & 98.73 & 59.40 & \textbf{43.80} & \textbf{65.91} \\
Concat & \textbf{96.80} & \textbf{90.60} & \textbf{65.00} & \textbf{75.60} & \textbf{65.00} & 28.80 & 63.10 & 70.30 & 58.20 & \textbf{26.60} & \textbf{98.90} & \textbf{60.80} & 43.60 & 64.87 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrrrrr|r}
\toprule
Model & narr & qasp & mfq & hpot & 2wiki & musq & govr & qmsm & mnews & trec & triv & samsm & pcnt & pr & lcc & rbp & AVG \\
\midrule
\textit{4x compression} \\
\midrule
Mean   & \textbf{30.74} & \textbf{46.76} & \textbf{49.76} & 52.90 & \textbf{42.75} & 28.61 & \textbf{30.12} & \textbf{22.89} & 23.83 & \textbf{68.50} & 87.90 & 34.10 & 6.67 & \textbf{100.00} & \textbf{52.01} & \textbf{59.17} & \textbf{46.04} \\
Concat & 28.30 & 46.00 & 48.70 & \textbf{56.80} & 42.40 & \textbf{29.70} & 29.80 & 22.80 & \textbf{24.00} & 66.00 & \textbf{88.50} & \textbf{34.20} & \textbf{7.50} & 98.00 & 49.20 & 54.30 & 45.39 \\
\midrule
\textit{8x compression} \\
\midrule
Mean   & 28.60 & \textbf{43.92} & \textbf{51.54} & \textbf{54.02} & \textbf{45.36} & 28.05 & 29.05 & 22.85 & 23.56 & 62.00 & 86.39 & \textbf{33.17} & \textbf{5.45} & \textbf{71.50} & 43.57 & 46.63 & 42.23 \\
Concat & \textbf{30.30} & 42.70 & 50.70 & 50.80 & 44.30 & \textbf{28.20} & \textbf{29.20} & \textbf{22.90} & \textbf{23.90} & \textbf{62.50} & \textbf{89.00} & 33.10 & 4.60 & 70.50 & \textbf{45.20} & \textbf{49.20} & \textbf{42.32} \\
\midrule
\textit{16x compression} \\
\midrule
Mean   & \textbf{25.57} & \textbf{39.80} & 47.98 & 47.96 & \textbf{38.44} & 21.17 & 28.56 & \textbf{22.42} & \textbf{23.48} & 45.00 & \textbf{89.43} & 32.50 & 3.00 & \textbf{81.50} & 37.47 & \textbf{41.05} & \textbf{39.08} \\
Concat & 23.30 & 36.80 & \textbf{48.00} & \textbf{49.00} & 38.30 & \textbf{22.60} & \textbf{28.60} & 22.40 & 23.10 & \textbf{46.50} & 88.20 & \textbf{32.70} & \textbf{3.50} & 37.00 & \textbf{39.90} & 39.90 & 36.24 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}

  Model                       mfq        dure       vcsum        lsht          pr         AVG
  ------------------- ----------- ----------- ----------- ----------- ----------- -----------
  *4x compression*
  Mean                      45.52       26.07   **11.00**       25.00   **97.00**       40.92
  Concat                **46.20**   **26.80**   **11.00**   **30.00**       95.00   **41.80**
  *8x compression*
  Mean                      51.27   **24.83**       10.79       25.65       60.50       34.61
  Concat                **53.10**       24.10   **10.90**   **26.70**   **67.50**   **36.46**
  *16x compression*
  Mean                      46.83   **23.63**   **10.55**       18.67   **59.00**   **31.74**
  Concat                **51.00**       23.50       10.10   **19.30**       25.50       25.88

  : Pooling operator: LongBench Chinese (cn5) per-subtask. {#app-tab:pool-longbench-cn}

\FloatBarrier

### Adapter & Overlap {#app-subsubsec:adapter-overlap}

\small
\setlength{\tabcolsep}{6pt}

  Model                                      RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  ----------------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  Causal-MLP-O0                             **75.06**             **70.23**             **65.91**             **39.08**             **31.74**             **67.50**             **81.05**
  Bidirectional-MLP-O0                          71.62                 66.83                 63.63                 33.52                 23.94                 62.75                 74.53
  Bidirectional-ATTN-MLP-O256     [71.91]{.underline}   [68.11]{.underline}   [64.96]{.underline}                 36.28                 26.54   [64.00]{.underline}                 73.77
  Bidirectional-MLP-O256                        62.03                 58.66                 58.24   [36.97]{.underline}   [26.63]{.underline}                 63.25   [74.98]{.underline}

  : Adapter & overlap comparison (16 $\times$ compression, mean pooling, $W = 1024$, continual pre-training LR $= 1.0 \times 10^{-5}$). {#tab:adapter-summary}

\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrr|r}
\toprule
Model & ns1 & ns2 & ns3 & nm1 & nm2 & nm3 & nmv & nmq & vt & cwe & fwe & qa1 & qa2 & AVG \\
\midrule
\textit{Context length 4096} \\
\midrule
Causal-MLP-O0 & 96.40 & \textbf{89.40} & \textbf{54.60} & \textbf{85.00} & \textbf{80.80} & \textbf{48.60} & 72.45 & 80.20 & 83.92 & \textbf{80.54} & \underline{88.27} & \underline{66.00} & 49.60 & \textbf{75.06} \\
Bidirectional-MLP-O0 & 97.80 & \underline{83.00} & 39.40 & \underline{81.40} & \underline{64.60} & \underline{29.80} & \underline{85.55} & \underline{81.30} & 90.96 & 68.24 & \textbf{91.07} & \textbf{66.20} & \textbf{51.80} & 71.62 \\
Bidirectional-ATTN-MLP-O256 & \textbf{99.80} & 80.40 & \underline{50.20} & 78.60 & 62.20 & 26.20 & \textbf{85.70} & \textbf{81.35} & \underline{95.00} & \underline{80.00} & 87.00 & 61.80 & 46.60 & \underline{71.91} \\
Bidirectional-MLP-O256 & \underline{99.00} & 60.60 & 21.60 & 61.80 & 55.40 & 17.60 & 69.90 & 61.60 & \textbf{97.12} & 57.66 & 87.27 & \underline{66.00} & \underline{50.80} & 62.03 \\
\midrule
\textit{Context length 8192} \\
\midrule
Causal-MLP-O0 & 96.00 & \textbf{90.20} & \underline{55.20} & \textbf{83.40} & \textbf{65.40} & \textbf{36.60} & 70.25 & \textbf{77.65} & 80.52 & \textbf{55.44} & \textbf{95.67} & 61.80 & 44.80 & \textbf{70.23} \\
Bidirectional-MLP-O0 & \underline{99.20} & \underline{89.80} & 43.60 & \underline{73.60} & \underline{58.60} & 19.60 & \underline{76.35} & 74.10 & 91.04 & 36.74 & 95.33 & 61.80 & \textbf{49.00} & 66.83 \\
Bidirectional-ATTN-MLP-O256 & \textbf{100.00} & 86.00 & \textbf{58.60} & 72.00 & 52.60 & \underline{25.60} & \textbf{76.75} & \underline{74.65} & \underline{96.12} & \underline{43.78} & 91.93 & \textbf{63.20} & 44.20 & \underline{68.11} \\
Bidirectional-MLP-O256 & 98.80 & 67.20 & 25.40 & 61.20 & 47.20 & 15.60 & 60.55 & 54.15 & \textbf{96.68} & 28.76 & \underline{95.60} & \underline{62.80} & \underline{48.60} & 58.66 \\
\midrule
\textit{Context length 16384} \\
\midrule
Causal-MLP-O0 & 95.80 & \underline{88.00} & \textbf{57.40} & \underline{75.60} & \textbf{62.20} & \textbf{30.00} & 70.70 & 73.05 & 77.32 & \textbf{24.78} & 98.73 & 59.40 & 43.80 & \textbf{65.91} \\
Bidirectional-MLP-O0 & \underline{98.80} & 84.60 & 42.00 & \underline{75.60} & \underline{49.80} & 12.80 & \textbf{78.40} & \underline{74.60} & 89.12 & 9.88 & \underline{99.40} & \textbf{61.20} & \textbf{51.00} & 63.63 \\
Bidirectional-ATTN-MLP-O256 & \textbf{100.00} & \textbf{89.00} & \underline{57.00} & \textbf{78.20} & 42.40 & \underline{17.60} & \underline{78.25} & \textbf{77.40} & \underline{93.80} & 10.12 & \textbf{99.47} & 56.20 & 45.00 & \underline{64.96} \\
Bidirectional-MLP-O256 & 97.80 & 75.40 & 28.60 & 68.20 & 36.80 & 12.80 & 62.95 & 60.65 & \textbf{96.00} & \underline{12.48} & 99.20 & \underline{59.80} & \underline{46.40} & 58.24 \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\resizebox{\linewidth}{!}{%
\begin{tabular}{lrrrrrrrrrrrrrrrr|r}
\toprule
Model & narr & qasp & mfq & hpot & 2wiki & musq & govr & qmsm & mnews & trec & triv & samsm & pcnt & pr & lcc & rbp & AVG \\
\midrule
Causal-MLP-O0 & \underline{25.57} & \underline{39.80} & \textbf{47.98} & \underline{47.96} & \textbf{38.44} & \textbf{21.17} & 28.56 & 22.42 & \textbf{23.48} & \textbf{45.00} & \textbf{89.43} & \textbf{32.50} & 3.00 & \textbf{81.50} & 37.47 & 41.05 & \textbf{39.08} \\
Bidirectional-MLP-O0 & 23.32 & \textbf{40.05} & 42.18 & 35.36 & 31.18 & 12.64 & \textbf{29.12} & \textbf{22.95} & 22.85 & 29.50 & \underline{88.68} & \underline{32.28} & 4.29 & 44.00 & 38.07 & 39.92 & 33.52 \\
Bidirectional-ATTN-MLP-O256 & \textbf{27.45} & 38.87 & 42.56 & \textbf{52.31} & \underline{36.98} & \underline{20.22} & 27.98 & 22.44 & \underline{23.01} & 27.50 & 87.23 & 32.16 & \textbf{6.00} & 51.50 & \underline{41.22} & \underline{43.00} & 36.28 \\
Bidirectional-MLP-O256 & 23.20 & 39.67 & \underline{42.77} & 46.56 & 31.94 & 16.41 & \underline{28.64} & \underline{22.50} & 22.65 & \underline{40.00} & 86.67 & 31.80 & \underline{4.50} & \underline{67.50} & \textbf{41.76} & \textbf{44.89} & \underline{36.97} \\
\bottomrule
\end{tabular}
}
\scriptsize
\setlength{\tabcolsep}{3pt}

  Model                                           mfq                  dure                 vcsum                  lsht                    pr                   AVG
  ----------------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  Causal-MLP-O0                             **46.83**                 23.63                 10.55                 18.67             **59.00**             **31.74**
  Bidirectional-MLP-O0                          39.65             **27.07**   [10.99]{.underline}   [19.50]{.underline}                 22.50                 23.94
  Bidirectional-ATTN-MLP-O256     [44.35]{.underline}                 23.26                 10.75                 18.83                 35.50                 26.54
  Bidirectional-MLP-O256                        38.06   [24.02]{.underline}             **11.05**             **22.50**   [37.50]{.underline}   [26.63]{.underline}

  : Adapter & overlap comparison -- LongBench Chinese (cn5) per-subtask. {#tab:arch-longbench-cn}

\FloatBarrier
\FloatBarrier

### Mask-type Comparison

\small
\setlength{\tabcolsep}{6pt}

  Mask                         RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  --------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  Bidirectional     [71.62]{.underline}   [66.83]{.underline}   [63.63]{.underline}   [33.52]{.underline}   [23.94]{.underline}   [62.75]{.underline}   [74.53]{.underline}
  Causal                      **75.06**             **70.23**             **65.91**             **39.08**             **31.74**             **67.50**             **81.05**

  : Mask-type ablation (16 $\times$ compression, mean pooling, W=1024, continual pre-training LR $= 1.0 \times 10^{-5}$). {#tab:abl-mask}

\FloatBarrier
\FloatBarrier
\FloatBarrier

### Window-size Comparison

\small
\setlength{\tabcolsep}{6pt}

  Model                             RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  -------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  Mean W16 (causal)                    24.85                 23.87                 22.54                 29.62                 22.62                 56.50                 46.55
  Mean W256 (causal)     [58.37]{.underline}   [54.39]{.underline}   [50.18]{.underline}             **34.15**             **25.97**             **62.25**             **76.80**
  Mean W1024 (bidir)               **62.33**             **58.75**             **56.65**   [33.29]{.underline}   [22.88]{.underline}   [60.25]{.underline}   [66.94]{.underline}

  : Window-size ablation (16 $\times$ compression, mean pooling, continual pre-training LR $= 1.0 \times 10^{-6}$). W=16 and W=256 use causal attention; W=1024 uses bidirectional attention. We do not run W=1024 with causal attention at LR $= 1.0 \times 10^{-6}$. However, Table 20 shows that causal masking is strictly better than bidirectional masking, and W=1024 (bidirectional) already outperforms W=256 (causal) on RULER tasks while also achieving lower pre-training loss. We therefore conclude that W=1024 is the better setting. {#tab:abl-window}

\FloatBarrier
\FloatBarrier

### Encoder-representation Comparison

\small
\setlength{\tabcolsep}{6pt}

  Encoder init                     RULER 4k              RULER 8k             RULER 16k               LB en16                LB cn5           LongHealth5                 GSM8K
  ------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
  LLM (Instruct)        [38.29]{.underline}   [34.39]{.underline}   [33.50]{.underline}   [30.48]{.underline}   [22.73]{.underline}             **58.25**   [54.51]{.underline}
  Embed (Qwen3-Emb)               **42.00**             **38.76**             **38.12**             **31.72**             **25.32**   [58.00]{.underline}             **59.36**

  : Encoder-representation ablation (16 $\times$ compression, EOS pooling, causal, continual pre-training LR $= 1.0 \times 10^{-6}$). {#tab:abl-encoder}

\FloatBarrier
\clearpage
\FloatBarrier

## NIAH Agent results

\small
\setlength{\tabcolsep}{6pt}

  Model                        ns1         ns2         ns3         nm1         nm2         nm3         nmv         nmq         AVG
  -------------------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
  *4k context*
  LCLM 0.6B-4B               96.80       90.00       55.20       86.40       82.20       49.00       74.10       81.45       76.89
  LCLM 0.6B-4B agent     **98.20**   **97.00**   **97.60**   **97.00**   **86.60**   **89.60**   **99.55**   **86.35**   **93.99**
  *$\Delta$ agent*         *+1.40*     *+7.00*    *+42.40*    *+10.60*     *+4.40*    *+40.60*    *+25.45*     *+4.90*    *+17.10*
  *8k context*
  LCLM 0.6B-4B               96.00       91.20       55.80       84.00       66.80       37.60       70.45       78.20       72.51
  LCLM 0.6B-4B agent     **99.20**   **97.40**   **97.00**   **98.20**   **74.60**   **83.00**   **97.10**   **93.20**   **92.46**
  *$\Delta$ agent*         *+3.20*     *+6.20*    *+41.20*    *+14.20*     *+7.80*    *+45.40*    *+26.65*    *+15.00*    *+19.95*
  *16k context*
  LCLM 0.6B-4B               96.20       89.80       58.00       77.00       63.60       31.60       71.80       73.40       70.18
  LCLM 0.6B-4B agent     **98.00**   **98.00**   **98.00**   **96.00**   **65.20**   **72.00**   **96.35**   **95.75**   **89.91**
  *$\Delta$ agent*         *+1.80*     *+8.20*    *+40.00*    *+19.00*     *+1.60*    *+40.40*    *+24.55*    *+22.35*    *+19.74*

  : RULER NIAH (retrieval) per-task results for our LCLM 0.6B-4B model at 16x compression, with and without the agentic expand-or-answer loop, across context lengths. *$\Delta$ agent* is the cell-wise gain over the non-agentic baseline. {#app-tab:niah-agent}

[^1]: <https://github.com/NVIDIA/kvpress/blob/main/notebooks/speed_and_memory.ipynb>