The KV Cache Memory Problem
As context windows in LLMs push beyond 100k tokens, the KV cache, which stores the key and value vectors of previously processed tokens so attention need not recompute them, becomes the dominant memory consumer. For a 70B-parameter model at 32-bit precision, the cache alone can exceed 40GB during extended conversations or document processing.
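The scale of the problem follows from simple arithmetic. The sketch below estimates full-precision KV-cache size for a hypothetical 70B-class configuration with grouped-query attention; the layer count, head count, and head dimension are illustrative assumptions, not figures from the announcement:

```python
# Rough KV-cache size estimate (illustrative numbers, not official figures).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2 covers both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class config with grouped-query attention:
full = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                      seq_len=65_536, bytes_per_elem=4)  # 32-bit floats
print(f"{full / 2**30:.0f} GiB")  # prints 40 GiB
```

At 64k tokens this already reaches 40 GiB, which is why the cache, not the weights, becomes the binding constraint in long-context serving.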
Industry reports indicate that KV cache memory accounts for 80–90% of total VRAM usage in long-context scenarios. Conventional quantization methods often require storing auxiliary calibration constants or retraining the model, limiting practical gains. TurboQuant targets this directly through a data-oblivious framework that needs neither.
How TurboQuant Works: PolarQuant + QJL
TurboQuant combines two sub-algorithms. PolarQuant converts standard Cartesian key/value vectors into polar coordinates (radius and angles). Once a random rotation is applied, the angular distribution becomes statistically predictable, which eliminates the expensive per-vector normalization steps common in traditional quantizers.
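As a rough illustration of the idea, the sketch below rotates a 2D vector, converts it to polar form, and uniformly quantizes the angle. Treating coordinates pairwise, the 2D rotation, and the 3-bit width are all assumptions made for illustration, not TurboQuant's actual implementation:

```python
import math, random

def random_rotation_2d(rng):
    # Random 2D rotation matrix; a toy stand-in for the high-dimensional
    # random rotation described in the text.
    t = rng.uniform(0.0, 2 * math.pi)
    return [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]

def to_polar(x, y):
    return math.hypot(x, y), math.atan2(y, x)

def quantize_angle(theta, bits=3):
    # Uniform angular grid: a predictable (near-uniform) angle distribution
    # after rotation means no per-vector scale constant is needed.
    levels = 2 ** bits
    step = 2 * math.pi / levels
    idx = round((theta + math.pi) / step) % levels
    return idx, -math.pi + idx * step

rng = random.Random(0)
R = random_rotation_2d(rng)
x, y = 0.8, -0.3
xr = R[0][0] * x + R[0][1] * y
yr = R[1][0] * x + R[1][1] * y
r, theta = to_polar(xr, yr)
idx, theta_q = quantize_angle(theta, bits=3)
# Dequantize: radius kept (or quantized separately), angle snapped to grid.
x_hat, y_hat = r * math.cos(theta_q), r * math.sin(theta_q)
```

The rotation preserves the vector's norm exactly, so the only reconstruction error is the bounded angular snap, which is the property that lets the quantizer skip per-vector normalization.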
The second component, Quantized Johnson-Lindenstrauss (QJL), applies a 1-bit quantized transform as an error-correction layer. QJL removes quantization bias while preserving the inner-product relationships critical for attention score calculations. The pipeline is fully training-free and supports online quantization during inference — no model weights are altered.
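The inner-product-preserving property rests on a known identity for Gaussian projections: for a standard Gaussian direction s, E[(s·q)·sign(s·k)] = sqrt(2/pi)·(q·k)/||k||, so averaging sign bits of random projections and rescaling yields an unbiased inner-product estimate. The code below is a minimal toy illustration of that identity, not the library's API:

```python
import math, random

def qjl_encode(k, S):
    # 1-bit code per projection: the sign of <s, k>, plus the key's norm.
    code = [1 if sum(s_j * k_j for s_j, k_j in zip(s, k)) >= 0 else -1
            for s in S]
    return code, math.sqrt(sum(v * v for v in k))

def qjl_inner(q, code, k_norm, S):
    # E[(s.q) * sign(s.k)] = sqrt(2/pi) * (q.k) / ||k|| for Gaussian s,
    # so scaling by ||k|| * sqrt(pi/2) removes the bias.
    m = len(S)
    acc = sum(c * sum(s_j * q_j for s_j, q_j in zip(s, q))
              for s, c in zip(S, code))
    return math.sqrt(math.pi / 2) * k_norm * acc / m

rng = random.Random(0)
d, m = 8, 8000                  # toy dimension and projection count
S = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
q = [rng.uniform(-1, 1) for _ in range(d)]
k = [rng.uniform(-1, 1) for _ in range(d)]

code, k_norm = qjl_encode(k, S)
est = qjl_inner(q, code, k_norm, S)
true = sum(a * b for a, b in zip(q, k))
```

With enough projections the estimate concentrates around the true value of q·k, which is exactly the quantity attention scores depend on.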
Benchmark Results on Gemma and Mistral
Google evaluated TurboQuant on Gemma and Mistral model families across multiple long-context benchmarks:
- Needle In A Haystack: 100% retrieval accuracy maintained up to 128k tokens, matching full-precision baselines
- LongBench (3.5-bit): Score of 50.06 — identical to the full-cache baseline
- LongBench (2.5-bit): Score of 49.44 — marginal loss, still competitive
- H100 attention speedup: Up to 8x vs. 32-bit implementations
- KV cache size reduction: 6x minimum at 3–3.5 bits per value
- Vector search: Higher 1@k recall than standard Product Quantization on GloVe datasets
The method also matched or exceeded KIVI — a leading prior technique — across question-answering, code generation, and summarization tasks in LongBench.
Why Training-Free Matters for Deployment
Earlier quantization approaches typically require 1–2 extra bits of overhead per value to store scaling constants, often negating much of the intended memory savings. Many also demand fine-tuning to recover accuracy, adding weeks of compute and engineering time before a model can ship.
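To see where that overhead comes from, consider a conventional group-wise quantizer that stores a scale and zero-point constant per group. The group sizes and 16-bit constant widths below are illustrative assumptions, not a description of any specific prior method:

```python
# Effective bits/value for a group-wise quantizer that stores per-group
# scale and zero-point constants (illustrative, not TurboQuant).
def effective_bits(base_bits, group_size, scale_bits=16, zero_bits=16):
    overhead = (scale_bits + zero_bits) / group_size
    return base_bits + overhead

print(effective_bits(3, 32))  # prints 4.0 (nominal 3-bit becomes 4 bits/value)
print(effective_bits(3, 16))  # prints 5.0 (smaller groups cost even more)
```

A nominally 3-bit scheme thus lands at 4–5 effective bits per value, which is the 1–2 bits of overhead the paragraph above describes.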
TurboQuant's data-oblivious design eliminates both. The polar coordinate transformation and QJL error correction maintain attention fidelity with no overhead constants and no retraining. Integration into existing serving stacks is feasible through software updates alone. A Rust library (v0.1.1) was published on the announcement date for immediate research and benchmarking use.
Implications for AI Infrastructure
Reducing KV cache memory by 6x means longer effective context windows on existing GPU hardware, or larger models deployed on the same memory footprint. The 8x attention speedup directly lowers inference latency and operational costs in data-center environments.
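The trade-off can be made concrete with back-of-the-envelope arithmetic; the model configuration below is a hypothetical 70B-class setup with grouped-query attention, not an official figure:

```python
# Compression traded into context length at a fixed VRAM budget
# (illustrative arithmetic with an assumed model configuration).
vram_budget_gib = 40
# 2 (K,V) x 80 layers x 8 KV heads x head_dim 128 x 4 bytes (fp32):
bytes_per_token_fp32 = 2 * 80 * 8 * 128 * 4
tokens_full = vram_budget_gib * 2**30 // bytes_per_token_fp32
tokens_compressed = tokens_full * 6   # 6x smaller cache -> 6x more tokens
print(tokens_full, tokens_compressed)  # prints 65536 393216
```

At the same 40 GiB budget, a 64k-token cache stretches to roughly 384k tokens, or equivalently the freed memory can host a larger model or more concurrent requests.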
The March 24 announcement coincided with observable movements in memory-related equities, as analysts noted the potential for reduced demand for high-bandwidth memory in AI inference workloads. Early community implementations — including ports to MLX — have begun replicating the zero-accuracy-loss claims on Needle In A Haystack evaluations.
What's Next
The core TurboQuant paper, along with companion works on PolarQuant and QJL, is scheduled for formal presentation at ICLR 2026 in Rio de Janeiro and AISTATS 2026 in Tangier. Google has published the primary paper on arXiv with links to companion research. An interactive project site with compression visualizations accompanied the release.
Ongoing open-source activity around the Rust crate and framework integrations suggests rapid community validation of the reported metrics is already underway.
