At a Glance: TurboQuant reduces LLM Key-Value cache memory by at least 6x, delivers up to 8x faster attention computation, and operates at 3 bits per value, all with virtually zero accuracy loss and no retraining required. Two peer-reviewed papers underpin the system: PolarQuant (ICLR 2026) and QJL (AISTATS 2026).
What TurboQuant Does and Why It Matters
On March 24, 2026, Google Research officially released TurboQuant, a training-free compression suite designed to attack the single greatest hardware obstacle facing modern large language models: the sheer amount of GPU memory required to store conversational memory.
TurboQuant reduces the memory footprint of Key-Value (KV) caches, the mechanism LLMs use to remember previous parts of a conversation, by at least 6x while simultaneously delivering up to an 8x speedup in attention computation. For AI teams straining against VRAM limits on Nvidia A100 and H100 GPUs, this represents a fundamental shift in what is possible on existing hardware.
| Metric | Before TurboQuant → After TurboQuant |
|---|---|
| KV cache memory (70B model, 100K tokens) | 40+ GB → ~6.7 GB |
| Bits per value | 16-bit FP → 3-bit |
| Attention computation speed | 1x baseline → up to 8x |
| Calibration constants overhead | 1-2 bits per value → 0 |
| Retraining required | Yes (prior methods) → No |
| Accuracy loss | Measurable (prior methods) → Virtually zero |
The KV Cache Memory Tax | Why 100K-Token Context Breaks GPUs
As LLM context windows have expanded to 100,000 tokens and beyond, the KV cache has quietly become the dominant consumer of GPU memory. For every token a model holds in context, the KV cache must store the key and value vectors for every attention head in every layer, and those numbers compound fast.
On a 70-billion parameter model, the KV cache alone can occupy more than 40 GB of VRAM when processing long documents. A high-end data center GPU carries between 80 and 192 GB of total VRAM (see the table below), and the model weights claim a large share of it. The conversational memory of a single long-context session can therefore consume 80-90% of whatever capacity remains, leaving almost nothing for the generation process itself.
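The arithmetic behind that figure is easy to reproduce. Below is a minimal back-of-the-envelope sketch, assuming an illustrative 70B-class configuration (80 layers, 8 grouped-query KV heads of dimension 128, FP16 storage); real architectures vary:

```python
# Back-of-the-envelope KV cache size for a hypothetical 70B-class model.
# Assumed architecture (illustrative only): 80 layers, 8 KV heads of
# dimension 128 (grouped-query attention), FP16 storage.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2            # FP16
tokens = 100_000

# 2x for keys and values, stored for every layer and every KV head.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gb = per_token * tokens / 1e9

print(f"{per_token / 1024:.0f} KiB per token, {total_gb:.1f} GB at {tokens:,} tokens")
# -> 320 KiB per token, 32.8 GB at 100,000 tokens; models with more KV heads
#    or batches of concurrent sessions push well past 40 GB.
```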
| GPU | Total VRAM |
|---|---|
| Nvidia A100 (SXM) | 80 GB HBM2e |
| Nvidia H100 (SXM) | 80 GB HBM3 |
| Nvidia H200 | 141 GB HBM3e |
| AMD MI300X | 192 GB HBM3 |
| Nvidia B200 (Blackwell) | 192 GB HBM3e |
The result is a hard ceiling on long-form reasoning, multi-document analysis, and extended conversations, not because the models lack the architecture, but because the hardware cannot hold the memory. This is the bottleneck Nvidia's Blackwell B300 surge is partly designed to address through raw capacity, but TurboQuant offers an algorithmic solution that works on existing silicon.
Why Previous Quantization Methods Failed
Compressing the KV cache through quantization is not a new idea. Standard quantization reduces each stored value from 16-bit floating point to a lower-precision format, directly shrinking memory usage. The problem is that naive quantization introduces errors: the compressed value is not the same as the original, and those errors propagate through the attention computation, degrading the model's output quality.
The conventional fix is to store calibration constants alongside the compressed values: small numbers that tell the model how to decompress accurately. But those constants carry their own overhead of 1 to 2 bits per value, often negating a substantial portion of the gains from compression in the first place. On long contexts, the overhead compounds into a significant memory cost.
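To see where that 1-2 bits comes from, here is a minimal sketch of conventional group-wise quantization, assuming 3-bit symmetric codes with one FP16 scale per group of 16 values; the group size and formats are illustrative choices, not any specific library's scheme:

```python
import numpy as np

def groupwise_quantize(x, bits=3, group=16):
    """Conventional calibrated quantization: one FP16 scale per group.

    The per-group scales are the 'calibration constants' described above:
    at one 16-bit scale per 16 values, they add a full extra bit per value
    (closer to 2 bits if a zero point is stored as well).
    """
    qmax = 2 ** (bits - 1) - 1                           # 3 bits -> codes in [-3, 3]
    x = x.reshape(-1, group)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax  # per-group constant
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale.astype(np.float16)

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
codes, scales = groupwise_quantize(x)
overhead = scales.size * 16 / codes.size
print(f"payload: 3 bits/value, calibration overhead: {overhead:.1f} bits/value")
# -> payload: 3 bits/value, calibration overhead: 1.0 bits/value
```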
| Method | Compression Limitation |
|---|---|
| Naive INT8 quantization | Accuracy degradation on long contexts, visible in perplexity benchmarks |
| GPTQ / AWQ (weight-only) | Compresses weights, not the KV cache; the long-context memory bottleneck remains |
| SmoothQuant | Requires per-channel calibration constants, adds 1-2 bits overhead |
| KV cache eviction (H2O) | Drops tokens entirely, loses information, context fidelity falls |
| TurboQuant | Data-oblivious, zero calibration overhead, 3-bit with near-zero accuracy loss |
The TurboQuant Solution | Data-Oblivious Compression at 3 Bits
Instead of calibrating to the specific values being compressed, TurboQuant uses a data-oblivious strategy, a mathematical framework that works the same way regardless of what data it encounters. Because the approach does not need to adapt to the input, it completely eliminates the need for calibration constants.
The result: 3 bits per value with virtually zero accuracy loss and zero constants overhead. The compression is real, the cost is gone, and the output quality is preserved.
TurboQuant reaches a 5.3x compression ratio on FP16 values at 3 bits with zero calibration overhead, against a theoretical maximum of 5.33x (16 / 3), achieving 99.4% of the limit.
Stage 1 | PolarQuant Converts Cartesian to Polar Coordinates
PolarQuant, to be presented at ICLR 2026, handles the initial compression of KV vectors. Standard quantizers work in Cartesian coordinates (X, Y, Z values), whose uneven distributions force per-channel normalization to compress cleanly; that normalization is yet another source of overhead.
PolarQuant converts those vectors from Cartesian into polar coordinates (a radius value plus a set of angles). After a random rotation, the angular distribution of vectors becomes statistically predictable, something Google's researchers call near-uniform. Because the distribution is predictable, the system can skip the expensive per-channel normalization steps entirely.
| Step | What PolarQuant Does |
|---|---|
| 1. Input | Receives raw KV vectors in Cartesian space (standard FP16 format) |
| 2. Random rotation | Applies a random orthogonal rotation to the vector space |
| 3. Coordinate transform | Converts rotated vectors from Cartesian to polar coordinates |
| 4. Angular uniformity | Angles follow a near-uniform distribution after rotation |
| 5. Quantize | Compresses to 3-bit using uniform quantization, no per-channel calibration needed |
| 6. Output | 3-bit compressed KV cache values, zero calibration constants stored |
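Google has not published reference code alongside the announcement, but the six steps above can be sketched in a few lines. Everything below is illustrative: a full hyperspherical transform with uniform 3-bit angle codes, assuming the near-uniform angle distribution the rotation is said to produce:

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Step 2: random orthogonal rotation via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def to_polar(v):
    """Step 3: Cartesian -> hyperspherical (one radius, dim - 1 angles).

    cos(angle_i) = v[i] / ||v[i:]||; the sign of the last coordinate is
    omitted here for brevity.
    """
    tail = np.sqrt(np.cumsum(v[::-1] ** 2))[::-1]   # tail[i] = ||v[i:]||
    angles = np.arccos(np.clip(v[:-1] / np.maximum(tail[:-1], 1e-12), -1, 1))
    return tail[0], angles                           # radius, angles in [0, pi]

def quantize_angles(angles, bits=3):
    """Step 5: uniform 3-bit codes over [0, pi], no per-channel calibration."""
    levels = 2 ** bits - 1
    return np.round(angles / np.pi * levels).astype(np.uint8)

# Steps 1-6 end to end on one KV vector.
dim = 128
v = np.random.default_rng(1).standard_normal(dim)    # step 1
rotated = random_rotation(dim) @ v                   # step 2
radius, angles = to_polar(rotated)                   # steps 3-4
codes = quantize_angles(angles)                      # step 5
recovered = codes / 7 * np.pi                        # dequantize the 3-bit codes
print(f"mean angle error: {np.abs(recovered - angles).mean():.4f} rad")
```

Because the rotated angles are near-uniform on their range, a fixed uniform grid is already well matched to the data, which is precisely why no calibration constants need to be stored.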
Stage 2 | QJL Eliminates Quantization Error from Attention Scores
PolarQuant compresses effectively, but quantization always introduces some small errors. Left uncorrected, those errors would distort the attention scores, the mechanism by which a model decides which parts of its context are relevant to the current token. QJL eliminates that risk.
QJL, to be presented at AISTATS 2026, uses a 1-bit Quantized Johnson-Lindenstrauss transform as a mathematical error-corrector. The Johnson-Lindenstrauss lemma is a well-established result in dimensionality reduction guaranteeing that random projections approximately preserve distances between points. QJL applies this principle to eliminate bias in the compressed representation, ensuring that attention scores computed from TurboQuant-compressed caches are unbiased estimates of those produced by the full-precision model.
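The debiasing step can be illustrated directly. Below is a minimal sketch of a 1-bit JL inner-product estimator in the style of the QJL line of work, assuming Gaussian projections; the dimensions and function names are illustrative, not Google's implementation:

```python
import numpy as np

def encode_key(k, S):
    """Store only the sign bits of the projected key, plus its norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_score(q, sign_code, k_norm, S):
    """Unbiased estimate of <q, k> from the 1-bit key code.

    For Gaussian s: E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k> / ||k||,
    so multiplying by ||k|| * sqrt(pi/2) / m removes the quantization bias.
    The estimate concentrates around the true score as m grows.
    """
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * float(sign_code @ (S @ q))

rng = np.random.default_rng(0)
d, m = 128, 8192                       # head dim, number of projections
S = rng.standard_normal((m, d))        # shared random JL matrix
q, k = rng.standard_normal(d), rng.standard_normal(d)

code, k_norm = encode_key(k, S)        # key stored at 1 bit per projection
print(f"exact score : {q @ k:+.2f}")
print(f"1-bit score : {estimate_score(q, code, k_norm, S):+.2f}")
```

Note the asymmetry: only the key side is quantized to sign bits, the query side stays exact, and the sqrt(pi/2) factor cancels the bias that sign quantization introduces.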
| Property | PolarQuant (Stage 1) vs QJL (Stage 2) |
|---|---|
| Conference | ICLR 2026 vs AISTATS 2026 |
| Role | Primary compression vs Error correction |
| Technique | Polar coordinate transform vs 1-bit JL random projection |
| Bits used | 3 bits per value vs 1 bit per projection |
| Calibration needed | None vs None |
| Target | KV vector storage vs Attention score accuracy |
Conference Publication Timeline | ICLR and AISTATS 2026
PolarQuant paper submitted to ICLR 2026
Google Research submits the polar-coordinate KV cache quantization paper for peer review.
QJL paper accepted at AISTATS 2026
The Quantized Johnson-Lindenstrauss error-correction paper is accepted for presentation.
PolarQuant accepted at ICLR 2026
Acceptance confirmed. Oral presentation scheduled for the Vienna conference in May.
TurboQuant suite publicly released
Google Research announces TurboQuant combining PolarQuant and QJL into a single drop-in compression layer.
ICLR 2026 presentation (Vienna)
PolarQuant will be presented at ICLR in Vienna, Austria.
AISTATS 2026 presentation
QJL will be presented at AISTATS alongside the broader TurboQuant system.
What TurboQuant Means for AI Inference Teams
The immediate beneficiaries are inference teams running 70B-class models on datacenter hardware. A 6x reduction in KV cache memory translates directly to either dramatically longer context windows on the same hardware, or the ability to serve many more simultaneous users from a single GPU node, both of which have significant cost implications at scale.
For researchers and smaller teams, TurboQuant's training-free nature is equally important. Applying it to an existing model requires no retraining, no fine-tuning, and no access to the original training data. It can be dropped into an existing inference stack as a compression layer.
The 8x attention speedup compounds on top of the memory gains. Faster attention means more tokens processed per second, which directly reduces the cost-per-token of long-context inference, the metric that most AI API providers are competing on.
| Scenario | Without TurboQuant → With TurboQuant |
|---|---|
| 70B model, 100K context on A100 (80 GB) | KV cache alone exceeds VRAM → Fits with ~47 GB headroom |
| Concurrent users per H100 node | 1-2 long-context sessions → 8-12 sessions |
| Max context length on H200 (141 GB) | ~250K tokens → 1.5M+ tokens |
| Cost per million tokens (inference) | $X baseline → ~$X/6 at equivalent quality |
| Deployment complexity | Requires retraining/calibration → Drop-in compression layer |
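The session counts above follow from simple capacity arithmetic. A rough sketch reusing the per-token figure from the earlier example (the 50% KV budget and the session counts are illustrative assumptions, not published benchmarks):

```python
# Rough capacity arithmetic for the scenarios above (illustrative assumptions).
vram_gb = 80                       # H100 SXM
per_token_fp16 = 327_680           # bytes/token (80 layers, 8 KV heads, dim 128)
per_token_3bit = per_token_fp16 * 3 / 16

budget = vram_gb * 1e9 * 0.5       # assume half of VRAM is left for the KV cache
for label, per_token in [("FP16 ", per_token_fp16), ("3-bit", per_token_3bit)]:
    sessions = budget / (per_token * 100_000)
    print(f"{label}: ~{sessions:.0f} concurrent 100K-token sessions")
# -> FP16 : ~1 session, 3-bit: ~7 sessions, in line with the table above.
```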
TurboQuant vs Other Efficiency Techniques | Where It Fits
TurboQuant joins a growing body of LLM efficiency research. It is complementary to, not a replacement for, techniques like Flash Attention, paged KV caches, and speculative decoding. Understanding where each technique operates is essential for teams building optimized inference stacks.
| Technique | What It Optimizes |
|---|---|
| Flash Attention (Tri Dao) | Attention computation speed via IO-aware tiling; does not shrink the KV cache itself |
| Paged KV Cache (vLLM) | Memory allocation efficiency; reduces fragmentation but not per-token storage cost |
| Speculative Decoding | Token generation throughput via draft models; no effect on KV cache size |
| GQA / MQA | Reduces KV heads at the architecture level; requires model redesign and retraining |
| KV Cache Eviction (H2O, Scissorhands) | Drops older tokens to free memory; loses context fidelity |
| TurboQuant | Compresses KV cache storage to 3 bits; preserves all tokens and accuracy |
The key distinction: Flash Attention makes the math faster, paged KV caches manage memory allocation better, but TurboQuant makes the stored data itself smaller. Teams running autonomous coding agents or long-context retrieval systems can stack all three for compounding gains.
Competitive Landscape | Who Else Is Working on KV Cache Compression
Google is not alone in pursuing KV cache efficiency. Several research groups and companies have published competing approaches in 2025-2026, though none have matched TurboQuant's combination of zero calibration overhead, training-free deployment, and peer-reviewed mathematical guarantees.
| Organization | Approach |
|---|---|
Google Research (TurboQuant) | Data-oblivious polar quantization + JL error correction, 3-bit, zero calibration |
Meta AI (KIVI) | Per-channel INT2 KV quantization with residual correction, 2-bit aggressive |
Microsoft Research (Gear) | Low-rank approximation of KV cache with quantized residuals |
UC Berkeley (H2O) | Heavy-Hitter Oracle eviction, drops low-attention tokens to reduce cache size |
MIT (Scissorhands) | Pivotal token identification and selective KV eviction |
Alibaba DAMO (CLA) | Cross-Layer Attention sharing KV cache across transformer layers |