Attention and Compression is all you need for Controllably Efficient Language Models

Source

Raw Markdown: paper_compress-attend-transformer-2025.md
PDF: paper_compress-attend-transformer-2025.pdf
Preprint: arXiv 2511.05313
OpenReview: ICLR 2026 rejected submission
Official code: rajesh-lab/cat-transformer
Official Hugging Face: CAT transformer collection
Official X thread: Jatin Prakash post
Local status snapshots: papers/compress-attend-transformer-2025/openreview-status-6rYa2BUnTt.json and authenticated X API snapshot papers/compress-attend-transformer-2025/x-thread-bicycleman15-1987895472409673972.json

Status And Credibility

This arXiv preprint was first posted on 2025-11-07, so it is current as of 2026-06-12. The current OpenReview page identifies it as Submitted to ICLR 2026 with venue id ICLR.cc/2026/Conference/Rejected_Submission. It therefore does not count as accepted-venue evidence.

The source is still included because Alex explicitly provided the X-linked source for ingest, the paper has an arXiv version, and the authors released code and Hugging Face checkpoints from the NYU/Rajesh Lab ecosystem. The source page should carry the rejected-submission caveat anywhere the paper’s claims are used. The provided X URL has been captured through the authenticated X API, including the root post, returned author self-replies in the announcement thread, one nearby author quote-post, public metrics, author metadata, media metadata, URLs, and conversation IDs. The API returned thread posts labeled 1-8, 10, and 11; no post labeled 9/ was returned in the captured author timeline or conversation lookup window.

Core Claim

Compress & Attend Transformers (CATs) are chunk-compressive language models with a test-time quality/efficiency knob. A compressor maps each chunk of tokens into a compressed representation, and a decoder generates the current chunk while attending to prior compressed chunk representations and the current chunk prefix. By changing chunk size, a single adaptive CAT model can trade quality for compute and memory at test time without retraining.

flowchart LR
  C1["chunk c1"] --> F1["compress f(c1)"]
  C2["chunk c2"] --> F2["compress f(c2)"]
  F1 --> D["decoder for later chunk"]
  F2 --> D
  Prefix["current chunk prefix"] --> D
  D --> Next["next token"]

The modeling contract is:

$$p_{\theta}(c_i \mid c_{<i}) = \prod_{j} g_{\theta}\left(x_{i,j} \mid x_{i,<j},\ f_{\theta}(c_{i-1}), \dots, f_{\theta}(c_1)\right).$$
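
The factorization above can be read as a rule for what the decoder may condition on at each position. The following is a minimal sketch of that rule only, not of the model. The helper names `split_into_chunks` and `conditioning_context` are hypothetical, compressed chunks are represented by their indices rather than by real vectors, and indices are 0-based here while the formula is 1-based.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks; the last one may be shorter."""
    return [tokens[s:s + chunk_size] for s in range(0, len(tokens), chunk_size)]


def conditioning_context(chunks, i, j):
    """Return what the decoder may see when predicting token j of chunk i.

    Indices are 0-based. The first element lists the earlier chunks, which are
    visible only through their compressed form f(c_k); the second element is
    the raw prefix x_{i,<j} of the current chunk.
    """
    compressed_chunk_ids = list(range(i))  # stand-ins for f(c_k), k < i
    raw_prefix = chunks[i][:j]             # raw tokens already seen in chunk i
    return compressed_chunk_ids, raw_prefix


tokens = list("abcdefghij")
chunks = split_into_chunks(tokens, chunk_size=4)
# Predicting the second token of the third chunk: two compressed chunks plus
# one raw prefix token are visible; raw tokens of earlier chunks are not.
context = conditioning_context(chunks, i=2, j=1)
```

The point of the sketch is that raw tokens from earlier chunks never appear in the conditioning set; they reach the decoder only through the compressor.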

Mechanism

CAT separates two roles:

a dense bidirectional transformer compressor that converts each chunk into a compressed chunk representation;
a causal dense transformer decoder that attends to compressed prior chunks plus raw tokens from the current chunk.

Training can be parallelized by interleaving raw chunks and compressed chunk representations with a custom attention mask. The mask lets each token attend only to earlier tokens in the same chunk and compressed representations of previous chunks, so the decoder can reuse key/value computations for compressed representations. At generation time, past raw chunks can be discarded and replaced by compressed chunk representations, reducing the retained KV cache by roughly the chunk size.
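
A minimal sketch of such a mask follows. It rests on two assumptions that the text above does not fix: each chunk's raw tokens are followed by a single slot holding that chunk's compressed representation, and compressed slots attend causally to themselves and earlier compressed slots. The function name `build_cat_mask` is hypothetical, and this is not the repository's implementation.

```python
def build_cat_mask(num_chunks, chunk_size):
    """Build a boolean attention mask for an interleaved raw/compressed layout.

    Assumed layout (illustrative): for each chunk i, `chunk_size` raw token
    positions followed by one compressed slot for that chunk.
    mask[q][k] is True when query position q may attend to key position k.
    """
    layout = []  # one (chunk_index, kind, offset_in_chunk) entry per position
    for i in range(num_chunks):
        for j in range(chunk_size):
            layout.append((i, "raw", j))
        layout.append((i, "comp", 0))

    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q, (qi, qkind, qj) in enumerate(layout):
        for k, (ki, kkind, kj) in enumerate(layout):
            if kkind == "comp":
                # Raw tokens see compressed slots of strictly earlier chunks.
                # Compressed slots see themselves and earlier ones (assumption).
                mask[q][k] = ki < qi if qkind == "raw" else ki <= qi
            else:
                # Raw keys are visible only to raw queries in the same chunk,
                # and only at the same or an earlier offset.
                mask[q][k] = qkind == "raw" and ki == qi and kj <= qj
    return mask, layout


mask, layout = build_cat_mask(num_chunks=2, chunk_size=2)
```

With this layout the mask is a subset of the ordinary causal mask, so a raw token never attends forward, and raw tokens of earlier chunks are reachable only through their compressed slot.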

For adaptive CATs, the model samples chunk sizes during training and receives a learnable indicator token for the active chunk size. At test time, changing that indicator changes the quality/compute tradeoff.
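
The training-time sampling and test-time selection can be sketched as below. This is an illustrative sketch under stated assumptions: the chunk sizes 4, 8, 16, and 32 are the ones this page reports for the adaptive model, while the indicator token strings, the uniform sampling, and the function names `prepare_training_example` and `prepare_inference_example` are hypothetical. How the indicator is actually fed to the model is not specified here.

```python
import random

CHUNK_SIZES = (4, 8, 16, 32)
# Hypothetical indicator token names, one learnable token per chunk size.
INDICATOR_TOKEN = {size: f"<chunk_size={size}>" for size in CHUNK_SIZES}


def chunk(tokens, chunk_size):
    """Split tokens into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[s:s + chunk_size] for s in range(0, len(tokens), chunk_size)]


def prepare_training_example(tokens, rng):
    """Sample a chunk size for this sequence (uniform sampling is an assumption)."""
    size = rng.choice(CHUNK_SIZES)
    return {"indicator": INDICATOR_TOKEN[size], "chunk_size": size,
            "chunks": chunk(tokens, size)}


def prepare_inference_example(tokens, chunk_size):
    """At test time the caller picks the chunk size, i.e. the budget knob."""
    if chunk_size not in INDICATOR_TOKEN:
        raise ValueError(f"unsupported chunk size: {chunk_size}")
    return {"indicator": INDICATOR_TOKEN[chunk_size], "chunk_size": chunk_size,
            "chunks": chunk(tokens, chunk_size)}


rng = random.Random(0)
train_example = prepare_training_example(list(range(64)), rng)
cheap = prepare_inference_example(list(range(64)), chunk_size=32)
precise = prepare_inference_example(list(range(64)), chunk_size=4)
```

The same token sequence yields 2 chunks at size 32 and 16 chunks at size 4, which is the lever behind the quality/compute tradeoff.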

Evidence And Results

The paper scales CAT from 90M to about 1B parameters and compares against dense Transformers, sparse attention, Mamba2, Gated Delta Net, and hybrid baselines.
On language modeling, common-sense reasoning, LongBench, RULER, BabiLong, and real-world in-context recall tasks, the paper reports that CAT variants outperform many efficient baselines under multiple compute/memory budgets.
In the generation benchmark, CAT is reported as $1.4$-$3.2\times$ faster than a dense Transformer while using $2.2$-$9.5\times$ less total memory as chunk size increases.
CAT keeps a single adaptive model for chunk sizes $4$, $8$, $16$, and $32$, so the budget knob is not a set of separately trained checkpoints in the main adaptive setting.
The paper reports memory-matched MQAR results to separate memory-budget effects from architecture effects.
The official repository provides pure PyTorch implementation paths for fixed and adaptive CATs, plus a CAT layer variant. The README caveats that released checkpoints are retrained variants with a slightly different config, including QK norm, and are not exactly the paper models.
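
As a rough aid for reading the memory figures, the sketch below counts retained cache entries only. It assumes one cache entry per compressed chunk and that a chunk is compressed as soon as it is complete; both are simplifying assumptions. It does not reproduce the paper's total-memory measurements, which also include factors such as parameters and activations. The function names are hypothetical.

```python
def dense_cache_entries(num_tokens):
    """A dense Transformer retains one cache entry per token seen so far."""
    return num_tokens


def cat_cache_entries(num_tokens, chunk_size):
    """Retained entries for a CAT-style cache under the stated assumptions.

    Each completed chunk contributes one compressed entry; the tokens of the
    current, incomplete chunk are kept raw.
    """
    full_chunks, current_prefix = divmod(num_tokens, chunk_size)
    return full_chunks + current_prefix


for size in (4, 8, 16, 32):
    dense = dense_cache_entries(4096)
    cat = cat_cache_entries(4096, size)
    print(f"chunk_size={size}: dense={dense}, cat={cat}, ratio={dense / cat:.1f}")
```

Under these assumptions the entry-count ratio approaches the chunk size for long sequences, which matches the "roughly the chunk size" reduction described in the Mechanism section.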

Why It Matters

CAT is useful for this wiki because it reframes long-context efficiency as a controllable chunk-compression interface rather than a fixed sparse-attention or recurrent-state choice. It keeps dense attention as the basic primitive, but changes what the decoder is allowed to retain from older chunks. That makes it a clean comparison point for memory tokens, SSM state, KV-cache compression, latent context compression, and learned hierarchy.

For time-series and world-model readers, the transferable idea is a budget knob over retained history resolution. Older observations could be compressed into coarser latent state while the current window stays high resolution. The caution is equally important: a fixed-size chunk representation has a practical information limit, and the paper itself notes large chunks can fail when the compressor cannot surface the right information.

Limitations

The OpenReview status is a rejected ICLR 2026 submission, so the source should be treated as arXiv/code evidence rather than accepted-venue evidence.
The paper evaluates language modeling, long-context understanding, and recall tasks, not numeric time-series forecasting, event streams, robotics, or action-conditioned world models.
CAT uses more parameters than the dense Transformer baseline in the main recipe; efficiency comes from shorter retained sequences, not fewer parameters.
The paper reports that CAT training is currently slower than dense Transformer training in wall-clock terms because the custom mask path is not yet well optimized.
Large chunk sizes can underperform because fixed-size chunk representations may not hold all information needed for accurate retrieval.
The adaptive knob selects a chunk-size budget, but task-dependent budget selection is still manual; the paper leaves learned data-dependent adaptivity to future work.
The released checkpoint artifacts are not exact paper checkpoints according to the official README.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state, long context, and constant updates	adjacent	Keeps current chunks raw while older chunks become compressed representations.	Needs numeric streams, event streams, control inputs, interventions, and online update evaluations.
Dynamic compute allocation	adjacent	Chunk size controls memory and compute at test time in one adaptive model.	Needs realized latency, batching, and hardware-kernel evidence in production-like serving.
Representation quality: semantic state vs dense numeric detail	warning	The paper explicitly reports that large chunks can miss retrieval-critical detail.	Need preservation probes for rare regimes, cross-channel deviations, and delayed effects.
Hierarchical modeling and compression	adjacent	Provides a simple chunk-compress-then-attend hierarchy using dense Transformer components.	Need learned boundaries and domain-specific compression units rather than fixed token chunks.

Links Into The Wiki

Open Questions

Can a CAT-style chunk knob be learned from uncertainty, event density, anomaly risk, or action relevance rather than chosen manually?
What chunk representation size is needed before multivariate time-series state stops losing rare but decision-relevant information?
Can CAT-style compression be combined with SSM or recurrent state so older chunks become persistent state instead of decoder-attended latent tokens?
How should fixed chunk boundaries change for irregular event streams, asynchronous sensors, and variable-rate observations?
Does the serving win survive optimized dense, FP8, KV-compressed, and retrieval-augmented baselines under the same latency and memory budget?
