TurboQuant

Source

Raw Markdown: paper_turboquant-2025.md
PDF: paper_turboquant-2025.pdf
Preprint: arXiv 2504.19874
OpenReview: ICLR 2026 poster
Official blog post: Google Research announcement
vLLM implementation critique: A First Comprehensive Study of TurboQuant: Accuracy and Performance
Gonzo ML discussion: post 5377, post 5384, post 5385
Review: ArxivIQ note

Status And Credibility

TurboQuant first appeared on arXiv on 2025-04-28 and is listed by OpenReview as an ICLR 2026 poster, published 2026-01-26 and last modified 2026-05-13. The paper authors are Amir Zandieh at Google Research, Majid Daliri at New York University, Majid Hadian at Google DeepMind, and Vahab Mirrokni at Google Research.

The paper is therefore more than a year old by arXiv date, but its current venue status and the 2026 Google Research announcement make it a credible, current source for vector quantization, KV-cache compression, and vector-search memory reduction. A later 2026 vLLM / Red Hat AI implementation study is credible as an external systems critique because it tests TurboQuant inside an inference framework rather than only repeating paper-style accuracy tables. No official code repository was verified during ingest; the Gonzo post links only to an unofficial implementation. X_BEARER_TOKEN was unavailable during this pass, and public search found no verified announcement thread on X from the authors or their labs.

Core Claim

TurboQuant is a data-oblivious online vector quantization method for high-dimensional vectors. It targets both mean-squared reconstruction error and inner-product distortion without learning data-dependent codebooks at indexing time.

The paper’s key technical claim is that random rotation makes coordinates behave enough like a known concentrated distribution that scalar quantization can approach the vector distortion-rate lower bound. For inner products, TurboQuant adds a one-bit Quantized Johnson-Lindenstrauss residual stage so the estimator is unbiased:

$$\mathbb{E}_{Q}\left[\langle y, Q^{-1}(Q(x)) \rangle\right] = \langle y, x \rangle.$$
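The unbiasedness property can be checked numerically for the 1-bit residual stage. The following is a minimal sketch, assuming the usual sign-of-Gaussian-projection form of QJL: the residual is stored as the sign bits of a Gaussian projection plus its norm, and the estimator rescales by sqrt(pi/2) times the norm over the number of projections. The function names `qjl_encode` and `qjl_inner_product`, the dimensions, and the use of a fresh projection per trial are illustrative assumptions, not the paper's code.

```python
import numpy as np

def qjl_encode(residual, projection):
    """Keep only the sign bits of the projected residual, plus its norm."""
    return np.sign(projection @ residual), np.linalg.norm(residual)

def qjl_inner_product(query, signs, residual_norm, projection):
    """Estimate <query, residual> from the 1-bit sketch of the residual."""
    m = projection.shape[0]
    scale = np.sqrt(np.pi / 2.0) * residual_norm / m
    return scale * np.dot(projection @ query, signs)

rng = np.random.default_rng(0)
d, m, trials = 64, 64, 4000

# Unit-norm residual and query (hypothetical test vectors).
residual = rng.standard_normal(d)
residual /= np.linalg.norm(residual)
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

estimates = np.empty(trials)
for t in range(trials):
    # Fresh Gaussian projection per trial so the average is over Q's randomness.
    projection = rng.standard_normal((m, d))
    signs, residual_norm = qjl_encode(residual, projection)
    estimates[t] = qjl_inner_product(query, signs, residual_norm, projection)

true_ip = float(query @ residual)
mean_est = float(estimates.mean())
print(f"true inner product {true_ip:.4f}, mean estimate {mean_est:.4f}")
```

Individual estimates are noisy; only their average over the random projection converges to the true inner product, which is what the expectation in the formula states.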

Method

flowchart LR
    X[High-dimensional vector x] --> R[Random rotation]
    R --> S[Scalar Lloyd-Max quantization per coordinate]
    S --> MSE[MSE-oriented reconstruction]
    X --> Res[Residual after reconstruction]
    MSE --> Res
    Res --> QJL[1-bit QJL residual sketch]
    MSE --> Prod[Inner-product estimator]
    QJL --> Prod
    Prod --> Apps[KV cache and vector search]

The MSE path rotates a vector, quantizes each coordinate with precomputed scalar codebooks for the induced Beta-like coordinate distribution, and rotates back during dequantization. The inner-product path spends one bit fewer on this MSE quantizer, then applies QJL to the residual, so the last bit of the budget removes the inner-product bias that plain MSE-optimal quantization can introduce.
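The MSE path described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the rotation is a dense Haar-random orthogonal matrix, the codebook is fitted by running 1-D Lloyd iterations on samples of a sphere coordinate (a stand-in for the paper's precomputed codebooks), and the input is assumed to be unit-norm with any norm stored separately. All function names are hypothetical.

```python
import numpy as np

def random_rotation(d, rng):
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # column sign fix for a uniform distribution

def sphere_coordinate_samples(d, n_vectors, rng):
    """Coordinates of random unit vectors: the law a rotated unit vector follows."""
    g = rng.standard_normal((n_vectors, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g.ravel()

def lloyd_max_codebook(samples, bits, iters=30):
    """Fit 2**bits scalar centroids with 1-D Lloyd iterations."""
    levels = 2 ** bits
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        nearest = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            members = samples[nearest == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids

def quantize(x, rotation, codebook):
    """Rotate, then store the nearest-centroid index of every coordinate."""
    z = rotation @ x
    return np.abs(z[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

def dequantize(codes, rotation, codebook):
    """Look up the centroids, then undo the rotation."""
    return rotation.T @ codebook[codes]

rng = np.random.default_rng(0)
d = 128
rotation = random_rotation(d, rng)
samples = sphere_coordinate_samples(d, 200, rng)  # depends on d only, not on data

# A deliberately sparse unit vector; the rotation spreads its energy out.
x = np.zeros(d)
x[:4] = 1.0
x /= np.linalg.norm(x)

mse_by_bits = {}
for bits in (1, 2, 3, 4):
    codebook = lloyd_max_codebook(samples, bits)
    codes = quantize(x, rotation, codebook)
    x_hat = dequantize(codes, rotation, codebook)
    mse_by_bits[bits] = float(np.sum((x - x_hat) ** 2))

print(mse_by_bits)
```

The codebook is built from the dimension and bit-width alone, which is the sense in which the method is data-oblivious: no pass over the indexed vectors is needed before quantizing them.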

Evidence And Results

Theory: the paper gives lower and upper bounds for MSE and inner-product distortion, and states that TurboQuant’s distortion stays within a constant factor of the distortion-rate lower bound.
Needle-in-a-Haystack: TurboQuant matches the full-precision Llama-3.1-8B-Instruct result in the paper’s long-context retrieval setup at 4x KV-cache compression.
LongBench-E: the paper reports Llama-3.1-8B-Instruct average score 50.06 for full cache and 50.06 for TurboQuant at 3.5 bits, with TurboQuant at 2.5 bits still close at 49.44.
Mistral: the paper reports full-cache LongBench-E average 49.89 and TurboQuant 2.5-bit average 49.62.
Vector search: on DBpedia and GloVe embedding experiments, TurboQuant is reported to beat Product Quantization and RaBitQ on recall while reducing quantization/indexing time to near zero in the paper’s benchmark.
Google Research blog: the announcement frames the same work as a practical compression result for KV cache and vector search, citing at least 6x key-value memory reduction in needle-style tests and up to 8x attention-logit speedup on H100 for 4-bit TurboQuant. Treat those as official narrative and benchmark claims, not independent reproduction.
vLLM / Red Hat AI critique: a vLLM integration study benchmarks BF16, FP8, and four TurboQuant KV-cache variants on long-context retrieval, reasoning, latency, throughput, and serving speed. It finds FP8 to be the best default; TurboQuant k8v4 preserves quality but offers only modest extra capacity over FP8 while hurting latency and throughput; 4bit-nc is the practical memory-pressure variant; and k3v4-nc / 3bit-nc lose accuracy and serving performance.

Practical Critique

The vLLM critique changes how the Google blog’s “extreme compression” framing should be used. TurboQuant can still be a valid compression primitive, but its storage-only design means the serving system must dequantize low-bit KV cache back to BF16 for attention. FP8 differs because it can store the cache and run attention computation in a hardware-native FP8 path.

That difference makes the practical baseline stricter. In the vLLM study, FP8 gives 2x KV-cache capacity with negligible accuracy loss and no throughput cost in the tested settings. TurboQuant k8v4 gives only a small capacity gain over FP8, while adding latency and throughput penalties. TurboQuant 4bit-nc becomes the narrow useful case: it can trade per-token speed and some accuracy for higher capacity when memory pressure dominates, especially under burst load. The more aggressive k3v4-nc and 3bit-nc variants are poor production candidates unless a target workload independently validates the accuracy hit and serving slowdown.
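The capacity comparison can be made concrete with nominal arithmetic. The sketch below is a back-of-envelope calculation under loud assumptions: the model shape is taken to be Llama-3.1-8B (32 layers, 8 KV heads, head dimension 128), the variant labels are read as key bits and value bits (so k8v4 means 8-bit keys and 4-bit values), the "-nc" suffix is not modeled, and all per-block metadata such as scales and norms is ignored. Real capacity gains in the vLLM study may therefore differ from these numbers.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, key_bits, value_bits):
    """Nominal KV-cache bytes for one token, ignoring scales, norms, and padding."""
    elements = layers * kv_heads * head_dim
    return elements * (key_bits + value_bits) / 8

# Assumed Llama-3.1-8B attention shape (grouped-query attention).
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

# (key_bits, value_bits); the TurboQuant rows are this sketch's reading of the labels.
FORMATS = {
    "bf16": (16, 16),
    "fp8": (8, 8),
    "k8v4": (8, 4),
    "4bit": (4, 4),
    "k3v4": (3, 4),
    "3bit": (3, 3),
}

baseline = kv_bytes_per_token(**SHAPE, key_bits=16, value_bits=16)
capacity = {
    name: baseline / kv_bytes_per_token(**SHAPE, key_bits=k, value_bits=v)
    for name, (k, v) in FORMATS.items()
}

for name, ratio in capacity.items():
    print(f"{name:>5}: {ratio:.2f}x tokens per byte vs BF16")
```

Under these assumptions k8v4 lands at roughly 2.67x against 2x for FP8, which is consistent with the study's description of a small capacity gain that has to pay for its dequantization cost.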

For this KB, the durable lesson is not “TurboQuant fails.” The lesson is that vector-state compression must be evaluated against the whole serving contract: memory footprint, hardware-native kernels, dequantization cost, latency, throughput, and target-task quality. Inner-product fidelity or paper-level long-context accuracy is insufficient evidence that a compressed state representation improves an always-on system.

Why It Matters

TurboQuant is useful to this KB as a clean compression primitive: it compresses high-dimensional latent vectors without a training-time or index-time codebook learning stage. That matters for always-on sequence systems because serving cost can be dominated by memory bandwidth, cache footprint, vector-search indexing, and dequantization overhead rather than by parameter count alone.

For time-series and world-model work, the direct result is not numeric forecasting evidence. The transferable idea is an interface: if a model stores latent state, retrieved memory, trajectory embeddings, or KV-like sequence state, compression should preserve the geometry used by downstream scoring, not only the average reconstruction error.
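A toy example shows why reconstruction error and scoring geometry should be measured separately. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it applies a 1-bit MSE-optimal scalar quantizer (each coordinate replaced by plus or minus the mean absolute value) to a Gaussian-like unit vector and compares true and approximate scores against random queries. The variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_queries = 2048, 512

# Stored state vector with Gaussian-like coordinates, normalized to unit length.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# 1-bit MSE-optimal scalar quantizer: every coordinate becomes +/- mean(|x_i|).
x_hat = np.mean(np.abs(x)) * np.sign(x)

# Downstream scoring: inner products against a batch of random queries.
queries = rng.standard_normal((n_queries, d))
true_scores = queries @ x
approx_scores = queries @ x_hat

# Metric 1: squared reconstruction error of the stored vector.
mse = float(np.sum((x - x_hat) ** 2))

# Metric 2: least-squares slope of approximate vs. true scores.
# A slope of 1.0 would mean scores are preserved on average.
slope = float(true_scores @ approx_scores / (true_scores @ true_scores))

print(f"reconstruction error {mse:.3f}, score slope {slope:.3f}")
```

In this setup the quantizer is the best available at 1 bit for reconstruction error, yet the score slope sits well below 1.0, so every inner product is systematically shrunk. A compression check for latent state should therefore report a scoring-level metric alongside reconstruction error.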

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state and long context	adjacent	KV-cache compression reduces retained sequence-state footprint in LLM inference benchmarks.	Needs numeric time-series, event-stream, and action-history serving tests.
Representation quality and latent-state preservation	adjacent	The method targets high-dimensional vectors and inner products, which resemble latent-state and retrieval-memory interfaces.	Need preservation probes for rare regimes, control inputs, and cross-channel deviations.
Vector search and retrieval	adjacent	Reported near-zero indexing time and strong recall against PQ/RaBitQ on embedding datasets.	Need operational vector-store latency, update, and mixed-modality benchmarks.
Dynamic compute and serving cost	warning	Compression can change the memory and indexing unit without changing model parameters.	TSFMs still need a domain unit for samples, channel-time cells, events, actions, or compressed bits.

Limitations

The main evidence is LLM KV-cache compression and embedding-vector search, not time-series forecasting, trajectory modeling, robotics, or observability telemetry.
The paper’s practical KV-cache experiments are limited to Llama and Mistral-style open models and selected long-context benchmarks.
The “zero accuracy loss” framing depends on bitwidth, task, model, hardware, and benchmark; the paper itself reports 3.5-bit equality on one LongBench-E table and small degradation at 2.5 bits.
No official production-ready implementation was verified during ingest.
The Google blog’s H100 speedup and 6x memory claims are official narrative claims and should not be treated as independent deployment evidence.
The vLLM critique reports that TurboQuant variants reduce throughput and increase per-token latency versus FP8/BF16 in its tested setup; the method’s value is narrower than the original deployment narrative and depends on memory pressure outweighing speed and quality costs.
The vLLM study notes that TurboQuant support was limited to standard-attention models in that setup; sliding-window or hybrid-attention models were not supported.
Compression that preserves inner products may still erase rare, low-energy, or action-relevant state in time-series settings unless tested directly.

Links Into The Wiki

Open Questions

Can TurboQuant-style vector quantization preserve rare-regime and intervention-relevant latent state in multivariate time-series models?
Should TSFM memory compression optimize MSE, inner products, downstream prediction loss, action value, or intervention-relevant control value?
When does compressed sequence state help end-to-end serving after hardware-native baselines, dequantization overhead, latency, and throughput are counted?
Does near-zero vector-index construction change the design space for online retrieval memory in always-on observability agents?
What bitwidth is needed before compressed latent states stop hurting anomaly detection, root-cause localization, and counterfactual rollout quality?
