At a Glance: TurboQuant reduces LLM Key-Value cache memory by at least 6x, delivers up to 8x faster attention computation, and operates at 3 bits per value, all with virtually zero accuracy loss and no retraining required. Two peer-reviewed papers underpin the system: PolarQuant (ICLR 2026) and QJL (AISTATS 2026).
What TurboQuant Does and Why It Matters
On March 24, 2026, Google Research officially released TurboQuant, a training-free compression suite designed to attack the single greatest hardware obstacle facing modern large language models: the sheer amount of GPU memory required to store conversational memory.
TurboQuant reduces the memory footprint of Key-Value (KV) caches, the mechanism LLMs use to remember previous parts of a conversation, by at least 6x while simultaneously delivering up to an 8x speedup in attention computation. For AI teams straining against VRAM limits on Nvidia A100 and H100 GPUs, this represents a fundamental shift in what is possible on existing hardware.
| Metric | Before TurboQuant → After TurboQuant |
|---|---|
| KV cache memory (70B model, 100K tokens) | 40+ GB → ~6.7 GB |
| Bits per value | 16-bit FP → 3-bit |
| Attention computation speed | 1x baseline → up to 8x |
| Calibration constants overhead | 1-2 bits per value → 0 |
| Retraining required | Yes (prior methods) → No |
| Accuracy loss | Measurable (prior methods) → Virtually zero |
The KV Cache Memory Tax | Why 100K-Token Context Breaks GPUs
As LLM context windows have expanded to 100,000 tokens and beyond, the KV cache has quietly become the dominant consumer of GPU memory. For every token a model holds in context, the KV cache must store the key and value vectors for every attention head in every layer, and those numbers compound fast.
On a 70-billion parameter model, the KV cache alone can occupy more than 40 GB of VRAM when processing long documents. A high-end data center GPU carries between 80 and 192 GB of total VRAM (see the table below), and the model weights claim a large share of it. The conversational memory of a single long-context session can therefore consume 80-90% of whatever capacity remains, leaving almost nothing for the generation process itself.
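The arithmetic behind that figure is easy to reproduce. Below is a minimal back-of-the-envelope sketch, assuming an illustrative 70B-class configuration (80 layers, 8 grouped-query KV heads of dimension 128, FP16 storage); real architectures vary:

```python
# Back-of-the-envelope KV cache size for a hypothetical 70B-class model.
# Assumed architecture (illustrative only): 80 layers, 8 KV heads of
# dimension 128 (grouped-query attention), FP16 storage.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2            # FP16
tokens = 100_000

# 2x for keys and values, stored for every layer and every KV head.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gb = per_token * tokens / 1e9

print(f"{per_token / 1024:.0f} KiB per token, {total_gb:.1f} GB at {tokens:,} tokens")
# -> 320 KiB per token, 32.8 GB at 100,000 tokens; models with more KV heads
#    or batches of concurrent sessions push well past 40 GB.
```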
| GPU | Total VRAM |
|---|---|
| Nvidia A100 (SXM) | 80 GB HBM2e |
| Nvidia H100 (SXM) | 80 GB HBM3 |
| Nvidia H200 | 141 GB HBM3e |
| AMD MI300X | 192 GB HBM3 |
| Nvidia B200 (Blackwell) | 192 GB HBM3e |
The result is a hard ceiling on long-form reasoning, multi-document analysis, and extended conversations, not because the models lack the architecture, but because the hardware cannot hold the memory. This is the bottleneck Nvidia's Blackwell B300 surge is partly designed to address through raw capacity, but TurboQuant offers an algorithmic solution that works on existing silicon.
Why Previous Quantization Methods Failed
Compressing the KV cache through quantization is not a new idea. Standard quantization reduces each stored value from 16-bit floating point to a lower-precision format, directly shrinking memory usage. The problem is that naive quantization introduces errors: the compressed value is not the same as the original, and those errors propagate through the attention computation, degrading the model's output quality.
The conventional fix is to store calibration constants alongside the compressed values: small numbers that tell the model how to decompress accurately. But those constants carry their own overhead of 1 to 2 bits per value, often negating a substantial portion of the gains from compression in the first place. On long contexts, the overhead compounds into a significant memory cost.
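To see where that 1-2 bits comes from, here is a minimal sketch of conventional group-wise quantization, assuming 3-bit symmetric codes with one FP16 scale per group of 16 values; the group size and formats are illustrative choices, not any specific library's scheme:

```python
import numpy as np

def groupwise_quantize(x, bits=3, group=16):
    """Conventional calibrated quantization: one FP16 scale per group.

    The per-group scales are the 'calibration constants' described above:
    at one 16-bit scale per 16 values, they add a full extra bit per value
    (closer to 2 bits if a zero point is stored as well).
    """
    qmax = 2 ** (bits - 1) - 1                           # 3 bits -> codes in [-3, 3]
    x = x.reshape(-1, group)
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax  # per-group constant
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale.astype(np.float16)

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
codes, scales = groupwise_quantize(x)
overhead = scales.size * 16 / codes.size
print(f"payload: 3 bits/value, calibration overhead: {overhead:.1f} bits/value")
# -> payload: 3 bits/value, calibration overhead: 1.0 bits/value
```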
| Method | Compression Limitation |
|---|---|
| Naive INT8 quantization | Accuracy degradation on long contexts, visible in perplexity benchmarks |
| GPTQ / AWQ (weight-only) | Compresses weights, not the KV cache; the long-context memory bottleneck remains |
| SmoothQuant | Requires per-channel calibration constants, adds 1-2 bits overhead |
| KV cache eviction (H2O) | Drops tokens entirely, loses information, context fidelity falls |
| TurboQuant | Data-oblivious, zero calibration overhead, 3-bit with near-zero accuracy loss |
The TurboQuant Solution | Data-Oblivious Compression at 3 Bits
Instead of calibrating to the specific values being compressed, TurboQuant uses a data-oblivious strategy, a mathematical framework that works the same way regardless of what data it encounters. Because the approach does not need to adapt to the input, it completely eliminates the need for calibration constants.
The result: 3 bits per value with virtually zero accuracy loss and zero constants overhead. The compression is real, the cost is gone, and the output quality is preserved.
TurboQuant reaches a 5.3x compression ratio on FP16 values at 3 bits with zero calibration overhead, against a theoretical maximum of 5.33x (16 / 3), achieving 99.4% of the limit.
Stage 1 | PolarQuant Converts Cartesian to Polar Coordinates
PolarQuant, to be presented at ICLR 2026, handles the initial compression of KV vectors. Standard quantizers work in Cartesian coordinates (X, Y, Z values), whose uneven distributions force per-channel normalization to compress cleanly; that normalization is yet another source of overhead.
PolarQuant converts those vectors from Cartesian into polar coordinates (a radius value plus a set of angles). After a random rotation, the angular distribution of vectors becomes statistically predictable, something Google's researchers call near-uniform. Because the distribution is predictable, the system can skip the expensive per-channel normalization steps entirely.
| Step | What PolarQuant Does |
|---|---|
| 1. Input | Receives raw KV vectors in Cartesian space (standard FP16 format) |
| 2. Random rotation | Applies a random orthogonal rotation to the vector space |
| 3. Coordinate transform | Converts rotated vectors from Cartesian to polar coordinates |
| 4. Angular uniformity | Angles follow a near-uniform distribution after rotation |
| 5. Quantize | Compresses to 3-bit using uniform quantization, no per-channel calibration needed |
| 6. Output | 3-bit compressed KV cache values, zero calibration constants stored |
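Google has not published reference code alongside the announcement, but the six steps above can be sketched in a few lines. Everything below is illustrative: a full hyperspherical transform with uniform 3-bit angle codes, assuming the near-uniform angle distribution the rotation is said to produce:

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Step 2: random orthogonal rotation via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def to_polar(v):
    """Step 3: Cartesian -> hyperspherical (one radius, dim - 1 angles).

    cos(angle_i) = v[i] / ||v[i:]||; the sign of the last coordinate is
    omitted here for brevity.
    """
    tail = np.sqrt(np.cumsum(v[::-1] ** 2))[::-1]   # tail[i] = ||v[i:]||
    angles = np.arccos(np.clip(v[:-1] / np.maximum(tail[:-1], 1e-12), -1, 1))
    return tail[0], angles                           # radius, angles in [0, pi]

def quantize_angles(angles, bits=3):
    """Step 5: uniform 3-bit codes over [0, pi], no per-channel calibration."""
    levels = 2 ** bits - 1
    return np.round(angles / np.pi * levels).astype(np.uint8)

# Steps 1-6 end to end on one KV vector.
dim = 128
v = np.random.default_rng(1).standard_normal(dim)    # step 1
rotated = random_rotation(dim) @ v                   # step 2
radius, angles = to_polar(rotated)                   # steps 3-4
codes = quantize_angles(angles)                      # step 5
recovered = codes / 7 * np.pi                        # dequantize the 3-bit codes
print(f"mean angle error: {np.abs(recovered - angles).mean():.4f} rad")
```

Because the rotated angles are near-uniform on their range, a fixed uniform grid is already well matched to the data, which is precisely why no calibration constants need to be stored.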
Stage 2 | QJL Eliminates Quantization Error from Attention Scores
PolarQuant compresses effectively, but quantization always introduces some small errors. Left uncorrected, those errors would distort the attention scores, the mechanism by which a model decides which parts of its context are relevant to the current token. QJL eliminates that risk.
QJL, to be presented at AISTATS 2026, uses a 1-bit Quantized Johnson-Lindenstrauss transform as a mathematical error-corrector. The Johnson-Lindenstrauss lemma is a well-established result in dimensionality reduction guaranteeing that random projections approximately preserve distances between points. QJL applies this principle to eliminate bias in the compressed representation, ensuring that attention scores computed from TurboQuant-compressed caches are unbiased estimates of those produced by the full-precision model.
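The debiasing step can be illustrated directly. Below is a minimal sketch of a 1-bit JL inner-product estimator in the style of the QJL line of work, assuming Gaussian projections; the dimensions and function names are illustrative, not Google's implementation:

```python
import numpy as np

def encode_key(k, S):
    """Store only the sign bits of the projected key, plus its norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def estimate_score(q, sign_code, k_norm, S):
    """Unbiased estimate of <q, k> from the 1-bit key code.

    For Gaussian s: E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k> / ||k||,
    so multiplying by ||k|| * sqrt(pi/2) / m removes the quantization bias.
    The estimate concentrates around the true score as m grows.
    """
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * float(sign_code @ (S @ q))

rng = np.random.default_rng(0)
d, m = 128, 8192                       # head dim, number of projections
S = rng.standard_normal((m, d))        # shared random JL matrix
q, k = rng.standard_normal(d), rng.standard_normal(d)

code, k_norm = encode_key(k, S)        # key stored at 1 bit per projection
print(f"exact score : {q @ k:+.2f}")
print(f"1-bit score : {estimate_score(q, code, k_norm, S):+.2f}")
```

Note the asymmetry: only the key side is quantized to sign bits, the query side stays exact, and the sqrt(pi/2) factor cancels the bias that sign quantization introduces.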
| Property | PolarQuant (Stage 1) vs QJL (Stage 2) |
|---|---|
| Conference | ICLR 2026 vs AISTATS 2026 |
| Role | Primary compression vs Error correction |
| Technique | Polar coordinate transform vs 1-bit JL random projection |
| Bits used | 3 bits per value vs 1 bit per projection |
| Calibration needed | None vs None |
| Target | KV vector storage vs Attention score accuracy |
Conference Publication Timeline | ICLR and AISTATS 2026
PolarQuant paper submitted to ICLR 2026
Google Research submits the polar-coordinate KV cache quantization paper for peer review.
QJL paper accepted at AISTATS 2026
The Quantized Johnson-Lindenstrauss error-correction paper is accepted for presentation.
PolarQuant accepted at ICLR 2026
Acceptance confirmed. Oral presentation scheduled for the Vienna conference in May.
TurboQuant suite publicly released
Google Research announces TurboQuant combining PolarQuant and QJL into a single drop-in compression layer.
ICLR 2026 presentation (Vienna)
PolarQuant will be presented at ICLR in Vienna, Austria.
AISTATS 2026 presentation
QJL will be presented at AISTATS alongside the broader TurboQuant system.
What TurboQuant Means for AI Inference Teams
The immediate beneficiaries are inference teams running 70B-class models on datacenter hardware. A 6x reduction in KV cache memory translates directly to either dramatically longer context windows on the same hardware, or the ability to serve many more simultaneous users from a single GPU node, both of which have significant cost implications at scale.
For researchers and smaller teams, TurboQuant's training-free nature is equally important. Applying it to an existing model requires no retraining, no fine-tuning, and no access to the original training data. It can be dropped into an existing inference stack as a compression layer.
The 8x attention speedup compounds on top of the memory gains. Faster attention means more tokens processed per second, which directly reduces the cost-per-token of long-context inference, the metric that most AI API providers are competing on.
| Scenario | Without TurboQuant → With TurboQuant |
|---|---|
| 70B model, 100K context on A100 (80 GB) | KV cache alone exceeds VRAM → Fits with ~47 GB headroom |
| Concurrent users per H100 node | 1-2 long-context sessions → 8-12 sessions |
| Max context length on H200 (141 GB) | ~250K tokens → 1.5M+ tokens |
| Cost per million tokens (inference) | $X baseline → ~$X/6 at equivalent quality |
| Deployment complexity | Requires retraining/calibration → Drop-in compression layer |
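The session counts above follow from simple capacity arithmetic. A rough sketch reusing the per-token figure from the earlier example (the 50% KV budget and the session counts are illustrative assumptions, not published benchmarks):

```python
# Rough capacity arithmetic for the scenarios above (illustrative assumptions).
vram_gb = 80                       # H100 SXM
per_token_fp16 = 327_680           # bytes/token (80 layers, 8 KV heads, dim 128)
per_token_3bit = per_token_fp16 * 3 / 16

budget = vram_gb * 1e9 * 0.5       # assume half of VRAM is left for the KV cache
for label, per_token in [("FP16 ", per_token_fp16), ("3-bit", per_token_3bit)]:
    sessions = budget / (per_token * 100_000)
    print(f"{label}: ~{sessions:.0f} concurrent 100K-token sessions")
# -> FP16 : ~1 session, 3-bit: ~7 sessions, in line with the table above.
```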
TurboQuant vs Other Efficiency Techniques | Where It Fits
TurboQuant joins a growing body of LLM efficiency research. It is complementary to, not a replacement for, techniques like Flash Attention, paged KV caches, and speculative decoding. Understanding where each technique operates is essential for teams building optimized inference stacks.
| Technique | What It Optimizes |
|---|---|
| Flash Attention (Tri Dao) | Attention computation speed via IO-aware tiling; does not shrink the KV cache itself |
| Paged KV Cache (vLLM) | Memory allocation efficiency; reduces fragmentation but not per-token storage cost |
| Speculative Decoding | Token generation throughput via draft models; no effect on KV cache size |
| GQA / MQA | Reduces KV heads at the architecture level; requires model redesign and retraining |
| KV Cache Eviction (H2O, Scissorhands) | Drops older tokens to free memory; loses context fidelity |
| TurboQuant | Compresses KV cache storage to 3 bits; preserves all tokens and accuracy |
The key distinction: Flash Attention makes the math faster, paged KV caches manage memory allocation better, but TurboQuant makes the stored data itself smaller. Teams running autonomous coding agents or long-context retrieval systems can stack all three for compounding gains.
Competitive Landscape | Who Else Is Working on KV Cache Compression
Google is not alone in pursuing KV cache efficiency. Several research groups and companies have published competing approaches in 2025-2026, though none have matched TurboQuant's combination of zero calibration overhead, training-free deployment, and peer-reviewed mathematical guarantees.
| Organization | Approach |
|---|---|
Google Research (TurboQuant) | Data-oblivious polar quantization + JL error correction, 3-bit, zero calibration |
Meta AI (KIVI) | Per-channel INT2 KV quantization with residual correction, 2-bit aggressive |
Microsoft Research (Gear) | Low-rank approximation of KV cache with quantized residuals |
UC Berkeley (H2O) | Heavy-Hitter Oracle eviction, drops low-attention tokens to reduce cache size |
MIT (Scissorhands) | Pivotal token identification and selective KV eviction |
Alibaba DAMO (CLA) | Cross-Layer Attention sharing KV cache across transformer layers |