TECHNOLOGY

Google Research Drops

A training-free compression suite cuts KV cache size by 6× and speeds up attention computation by 8× — eliminating the biggest hardware barrier to long-context AI

March 25, 2026 · 📖 5 min read

At a Glance: TurboQuant reduces LLM Key-Value cache memory by at least 6×, delivers up to 8× faster attention computation, and operates at 3 bits per value — all with virtually zero accuracy loss and no retraining required.

What Is TurboQuant?

Google Research has officially released TurboQuant, a training-free compression suite designed to attack the single greatest hardware obstacle facing modern large language models: the sheer amount of GPU RAM required to store conversational memory.

Announced on Tuesday, March 24, 2026, TurboQuant reduces the memory footprint of Key-Value (KV) caches — the mechanism LLMs use to remember previous parts of a conversation — by at least 6×, while simultaneously delivering up to an 8× speedup in attention computation. For AI teams straining against VRAM limits, this represents a fundamental shift in what is possible on existing hardware.

The Memory Tax Problem

As LLM context windows have expanded to 100,000 tokens and beyond, the KV cache has quietly become the dominant consumer of GPU memory. For every token a model holds in context, the KV cache must store the key and value vectors for every attention head in every layer — and those numbers compound fast.

On a 70-billion parameter model, the KV cache alone can occupy more than 40 GB of VRAM when processing long documents. For context, a high-end data center GPU carries between 40 and 80 GB of total VRAM. This means the conversational memory of a single long-context session can consume 80–90% of available hardware capacity, leaving almost nothing for the model's weights or the generation process itself.
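The arithmetic behind that 40 GB figure is straightforward: the cache stores a key vector and a value vector per attention head, per layer, per token. As a rough sketch — the layer count, head count, and head dimension below are illustrative assumptions in the ballpark of a 70B-class model, not published TurboQuant figures:

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=64, head_dim=128,
                   bytes_per_val=2):
    """KV cache size: 2 tensors (K and V) per layer, one head_dim-vector
    per KV head per token, at bytes_per_val precision (2 = fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * seq_len

full = kv_cache_bytes(16_384)   # a 16k-token document, fp16 baseline
compressed = full / 6           # TurboQuant's >= 6x reduction

print(f"fp16 cache:  {full / 2**30:.1f} GiB")        # → 40.0 GiB
print(f"compressed:  {compressed / 2**30:.1f} GiB")  # → 6.7 GiB
```

Under these assumptions a single 16k-token session already fills 40 GiB — an entire data center GPU's worth of memory — which a 6× reduction brings down to under 7 GiB.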

The result is a hard ceiling on long-form reasoning, multi-document analysis, and extended conversations — not because the models lack the architecture, but because the hardware cannot hold the memory.

Why Old Quantization Approaches Failed

Compressing the KV cache through quantization is not a new idea. Standard quantization reduces each stored value from 16-bit floating point to a lower-precision format, directly shrinking memory usage. The problem is that naive quantization introduces errors — the compressed value is not the same as the original — and those errors propagate through attention computation, degrading the model's output quality.

The conventional fix is to store calibration constants alongside the compressed values: small numbers that tell the model how to decompress accurately. But those constants carry their own overhead of 1 to 2 bits per value, often negating a substantial portion of the gains from compression in the first place. On long contexts, the overhead compounds into a significant memory cost.
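To make the overhead concrete, here is a minimal sketch of conventional group-wise quantization (not TurboQuant's method): each group of values is mapped to small integers using a per-group scale and zero point, and those two constants must be stored alongside the compressed data. The group size and fp16 constants are illustrative assumptions.

```python
import numpy as np

def quantize_group(x, bits=3):
    """Naive asymmetric quantization of one group: map x onto the
    integers [0, 2^bits - 1] via a per-group scale and zero point."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo            # scale + zero point = calibration constants

def dequantize_group(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)   # one group of 32 values
q, scale, lo = quantize_group(x)
err = np.abs(dequantize_group(q, scale, lo) - x).max()  # <= scale / 2

# The hidden cost: two fp16 constants (scale, zero point) per 32 values
overhead_bits = 2 * 16 / 32                   # = 1 extra bit per value
```

At 3-bit quantization, that 1 extra bit per value is a 33% overhead on the payload itself — exactly the tax TurboQuant is designed to avoid.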

TurboQuant takes a different approach entirely.

The TurboQuant Solution: Data-Oblivious Compression

Instead of calibrating to the specific values being compressed, TurboQuant uses a data-oblivious strategy — a mathematical framework that works the same way regardless of what data it encounters. Because the approach does not need to adapt to the input, it completely eliminates the need for calibration constants.

The result: 3 bits per value with virtually zero accuracy loss and zero constants overhead. The compression is real, the cost is gone, and the output quality is preserved.

How It Works: The Two-Stage Shield

TurboQuant achieves its efficiency through two novel sub-algorithms, each targeting a different stage of the compression problem. Both are scheduled to be presented at major conferences in 2026.

Stage 1 — PolarQuant (Primary Compression)

PolarQuant, to be presented at ICLR 2026, handles the initial compression of KV vectors. Standard quantizers work in Cartesian coordinates (X, Y, Z values), which have uneven distributions that require per-channel normalization to compress cleanly — another source of overhead.

PolarQuant converts those vectors from Cartesian into polar coordinates (a radius value plus a set of angles). After a random rotation, the angular distribution of vectors becomes statistically predictable — something Google's researchers call near-uniform. Because the distribution is predictable, the system can skip the expensive per-channel normalization steps entirely. The result is fast, lightweight compression with no calibration constants required.
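The rotate-then-go-polar idea can be sketched in a few lines. This is an illustrative toy, not Google's implementation: it pairs up coordinates after a random orthogonal rotation, keeps each pair's radius (here left unquantized, a simplification), and quantizes the angle on a fixed uniform grid — no per-channel constants anywhere.

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128

# Random orthogonal rotation (QR of a Gaussian matrix). After rotating,
# a vector's direction is near-uniform on the sphere, so angles can be
# quantized on a fixed uniform grid with no calibration constants.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

def polar_quantize(v, angle_bits=3):
    """Rotate, split into (x, y) pairs, keep radius + quantized angle."""
    v = Q @ v
    x, y = v[0::2], v[1::2]
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)                            # in [-pi, pi]
    levels = 2**angle_bits
    q = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    return r, q.astype(np.uint8)

def polar_dequantize(r, q, angle_bits=3):
    theta = q / 2**angle_bits * 2 * np.pi - np.pi
    v = np.empty(2 * len(r))
    v[0::2], v[1::2] = r * np.cos(theta), r * np.sin(theta)
    return Q.T @ v                                      # undo the rotation

v = rng.normal(size=d)
r, q = polar_quantize(v)
rel_err = np.linalg.norm(polar_dequantize(r, q) - v) / np.linalg.norm(v)
```

Even in this crude sketch, 3-bit angles reconstruct the vector to within roughly 25% relative error, and the angle grid is identical for every vector — which is the property that lets the real system drop normalization entirely.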

Stage 2 — QJL (Error Correction)

PolarQuant compresses effectively, but quantization always introduces some small errors. Left uncorrected, those errors would distort the attention scores — the mechanism by which a model decides which parts of its context are relevant to the current token. QJL eliminates that risk.

QJL, to be presented at AISTATS 2026, uses a 1-bit Quantized Johnson-Lindenstrauss transform as a mathematical error-checker. The Johnson-Lindenstrauss lemma is a well-established result in dimensionality reduction guaranteeing that random projections preserve distances between points. QJL applies this principle to eliminate bias in the compressed representation, ensuring that attention scores computed from TurboQuant-compressed caches are statistically identical to those from the full-precision original model.
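The unbiasedness QJL relies on can be illustrated with the classic sign-sketch identity: for a Gaussian vector g, E[sign(g·k)(g·q)] = √(2/π)·⟨q, k⟩/‖k‖. A sketch below (again illustrative, not the paper's code) stores each key as 1 bit per sketch coordinate plus its norm, yet recovers inner products — the raw material of attention scores — without systematic bias:

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 128, 4096                 # m = sketch dimension, 1 bit per coord

S = rng.normal(size=(m, d))      # shared random Gaussian projection

def sketch_key(k):
    """Keep only the signs of the projection (1 bit each) plus ||k||."""
    return np.sign(S @ k), np.linalg.norm(k)

def est_inner(q, key_sketch):
    """Unbiased inner-product estimate from the 1-bit sketch, using
    E[sign(g.k) * (g.q)] = sqrt(2/pi) * <q, k> / ||k||."""
    signs, k_norm = key_sketch
    return np.sqrt(np.pi / 2) * k_norm / m * (signs @ (S @ q))

q = rng.normal(size=d)
k = rng.normal(size=d)
exact = q @ k
approx = est_inner(q, sketch_key(k))   # close to exact, no bias term
```

Averaging over m sketch coordinates drives the estimator's variance down as 1/m while the bias stays exactly zero — which is why attention scores computed from the compressed cache stay statistically faithful to the full-precision ones.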

What This Means for AI Builders

The immediate beneficiaries are inference teams running 70B-class models on datacenter hardware. A 6× reduction in KV cache memory translates directly to either dramatically longer context windows on the same hardware, or the ability to serve many more simultaneous users from a single GPU node — both of which have significant cost implications at scale.

For researchers and smaller teams, TurboQuant's training-free nature is equally important. Applying it to an existing model requires no retraining, no fine-tuning, and no access to the original training data. It can be dropped into an existing inference stack as a compression layer.

The 8× attention speedup compounds on top of the memory gains. Faster attention means more tokens processed per second, which directly reduces the cost-per-token of long-context inference — the metric that most AI API providers are competing on.

TurboQuant joins a growing body of efficiency research — alongside techniques like Flash Attention, paged KV caches, and speculative decoding — that collectively make long-context LLMs viable outside hyperscaler budgets.

Tags

#Google · #AI · #LLM · #TurboQuant · #PolarQuant · #QJL · #KV Cache · #Quantization · #ICLR 2026 · #AISTATS 2026 · #VRAM · #GPU


Written by

Jack Wang

AI & Technology

Part of ObjectWire coverage