OBJECTWIRE

Independent · Verified · In-Depth


FlashAttention 3 vs TurboQuant vs Paged KV Cache | How the LLM Optimization Stack Actually Works

FlashAttention 3 speeds up attention compute, TurboQuant compresses KV cache storage, Paged KV Cache eliminates memory fragmentation, and the real answer is you use all three

10 min read

The mental model: FlashAttention 3 speeds up the attention compute itself (1.5-2x over FlashAttention-2 on H100). TurboQuant compresses KV cache values to 3 bits, cutting memory roughly 5.3x (16/3). Paged KV Cache organizes that memory so long-context serving does not waste space, reducing fragmentation from 60-80% waste to under 5%. They solve different bottlenecks. In production, you stack all three.

What Each Technique Actually Does | The One-Table Summary

Before explaining how these techniques work individually, here is the fundamental distinction. Each targets a different layer of the inference pipeline. Confusing them, or treating them as competitors, misses the point entirely.

Technique | Bottleneck It Fixes
FlashAttention 3 | Attention compute speed and GPU utilization. Reorders and overlaps attention work to reduce memory traffic and maximize throughput.
TurboQuant | KV cache storage size. Compresses values from 16-bit to 3-bit using data-oblivious quantization, eliminating calibration overhead.
Paged KV Cache | KV cache memory fragmentation. Stores cache in fixed-size pages instead of one contiguous slab, enabling dynamic allocation.

Each technique targets a distinct layer of the LLM inference stack

FlashAttention 3 | Making Attention Math 2x Faster on Hopper GPUs

FlashAttention 3, developed by Tri Dao and collaborators, is the third generation of the IO-aware attention algorithm that fundamentally changed how transformers compute attention. The core insight: standard attention implementations are bottlenecked not by compute but by memory bandwidth, the cost of moving data between GPU SRAM and HBM.

FlashAttention 3 targets Hopper-class GPUs (H100, H200) with architectural features unavailable on earlier generations. The published results report 1.5-2.0x speedup over FlashAttention-2 on H100 with FP16/BF16, plus native FP8 support that unlocks substantially higher effective throughput by halving the precision while maintaining output quality through careful accumulation strategies.

Feature | FlashAttention 2 | FlashAttention 3
Target GPU architecture | Ampere (A100) and Hopper | Hopper-native (H100/H200-specific optimizations)
Speedup over FA-2 | Baseline | 1.5-2.0x on H100 (FP16/BF16)
FP8 support | No | Yes, with mixed-precision accumulation
Asynchrony | Limited | Full warp specialization with overlapped compute and memory
Block-sparse support | Basic | Improved, enabling structured sparsity patterns
FLOPS utilization on H100 | ~35-40% of theoretical peak | ~75% of theoretical peak
Kernel implementation | CUDA + Cutlass | CUDA + Cutlass + Hopper-specific TMA instructions

FlashAttention 2 vs FlashAttention 3 feature comparison

The key architectural innovation in FA-3 is warp specialization: different warps (groups of 32 GPU threads) are assigned different roles, with some warps performing computation while others handle memory transfers simultaneously. On Hopper GPUs with their Tensor Memory Accelerator (TMA), this overlap is hardware-supported, allowing FA-3 to approach 75% of the GPU's theoretical peak FLOPS, up from roughly 35-40% with FA-2.
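The algorithmic core that makes this tiling possible is the online-softmax recurrence: visit K/V in blocks, carry a running maximum and normalizer, and never materialize the full score matrix. Here is a minimal pure-Python sketch of that recurrence; dimensions and block size are arbitrary, and real kernels execute this per GPU thread block in SRAM rather than in a Python loop:

```python
import math

def naive_attention(q, K, V):
    """Reference implementation: materialize every score, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z
            for j in range(len(V[0]))]

def tiled_attention(q, K, V, block=2):
    """Online-softmax attention: visit K/V in blocks, carrying a running
    max (m), normalizer (z), and value accumulator (acc). The full score
    matrix is never stored -- the trick at the heart of FlashAttention."""
    d = len(V[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * d
    for start in range(0, len(K), block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kb]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)        # shrink old partials if the max grew
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, Vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Both functions return the same result to floating-point precision; the tiled version simply never holds more than one block of scores at a time, which is what lets the real kernel keep everything in fast SRAM.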

📊 75% FLOPS utilization on H100, up from ~35-40% with FlashAttention-2. That is not a marginal gain: it means the same GPU is doing nearly twice as much useful work per second on attention computation.

TurboQuant | Compressing the KV Cache to 3 Bits with Zero Accuracy Loss

While FlashAttention 3 makes the attention math faster, TurboQuant attacks a completely different problem: the stored data itself. Released by Google Research in March 2026, TurboQuant compresses KV cache values from 16-bit floating point to 3 bits per value using a data-oblivious quantization strategy that requires zero calibration constants and zero retraining.

The system combines two peer-reviewed techniques: PolarQuant (ICLR 2026), which converts KV vectors from Cartesian to polar coordinates for calibration-free compression, and QJL (AISTATS 2026), a 1-bit Quantized Johnson-Lindenstrauss transform that corrects quantization errors in attention scores.
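TurboQuant's actual pipeline (the polar-coordinate transform plus the quantized JL correction) is beyond a short snippet, but the basic mechanics of low-bit KV quantization can be sketched. The uniform 3-bit scheme below is purely illustrative and is not TurboQuant's algorithm; a real implementation would also pack the 3-bit codes into bytes rather than keep them as Python ints:

```python
def quantize_3bit(vec):
    """Uniform 3-bit quantization of one vector: 8 levels spanning the
    vector's min..max range. Illustrative only -- TurboQuant's actual
    scheme (polar coordinates + a quantized JL transform) differs."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 7 or 1.0                      # 2**3 - 1 = 7 intervals
    codes = [round((x - lo) / step) for x in vec]    # ints in 0..7
    return codes, lo, step

def dequantize_3bit(codes, lo, step):
    """Reconstruct approximate values from the 3-bit codes."""
    return [lo + c * step for c in codes]
```

Stored at 3 bits per value (packed), this is a 16/3 ≈ 5.3x reduction versus FP16, minus a small per-vector overhead for `lo` and `step`; the round-trip error is bounded by half a quantization step.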

TurboQuant Metric | Value
KV cache memory reduction | ~5.3x (16-bit FP to 3-bit)
Attention computation speedup | Up to 8x
Accuracy loss | Virtually zero (statistically identical attention scores)
Calibration constants required | Zero (data-oblivious approach)
Retraining required | No; drop-in compression layer
Compression ratio vs theoretical max | 99.4% of the theoretical 5.33x limit (16/3)
Conference publications | PolarQuant (ICLR 2026), QJL (AISTATS 2026)

TurboQuant performance summary from Google Research, March 2026

The critical distinction: FlashAttention 3 makes the math faster; TurboQuant makes the data smaller. On a 70B model with a 100K-token context, the KV cache alone occupies 40+ GB of VRAM in FP16. TurboQuant shrinks that to roughly 7.5 GB, freeing VRAM for longer contexts, more concurrent users, or both.
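A back-of-envelope check on numbers like these is straightforward. The helper below assumes a hypothetical 70B-class GQA configuration (80 layers, 8 KV heads, head dimension 128; not any specific model's exact dimensions), so the absolute figure differs from the 40 GB cited for the article's model, but the 16/3 compression ratio is invariant:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits):
    """Total bytes for keys + values across all layers
    (the leading 2 accounts for storing both K and V)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bits / 8
    return per_token * n_tokens

# Hypothetical 70B-class GQA config: 80 layers, 8 KV heads, head dim 128
fp16 = kv_cache_bytes(80, 8, 128, 100_000, bits=16)
q3 = kv_cache_bytes(80, 8, 128, 100_000, bits=3)
print(f"FP16: {fp16 / 2**30:.1f} GiB -> 3-bit: {q3 / 2**30:.1f} GiB "
      f"({fp16 / q3:.2f}x smaller)")
# FP16: 30.5 GiB -> 3-bit: 5.7 GiB (5.33x smaller)
```

Models with more KV heads or layers land higher (which is how a 70B model reaches 40+ GB), but the compression factor is always exactly 16/3 regardless of architecture.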

Paged KV Cache | Solving the 60-80% Memory Waste Problem

The third piece of the stack is Paged KV Cache, popularized by the vLLM project and its PagedAttention paper. It addresses neither compute speed (FlashAttention 3's territory) nor data size (TurboQuant's), but memory allocation efficiency.

In traditional LLM serving, the KV cache for each request is allocated as a single contiguous block of GPU memory. The problem: you must allocate for the maximum possible sequence length at request start, even if the actual generation ends up much shorter. Studies show this contiguous allocation wastes 60-80% of KV cache memory on average due to internal fragmentation and over-reservation.

Allocation Strategy | Memory Waste
Contiguous allocation (traditional) | 60-80% average waste due to pre-reservation for max sequence length
Paged allocation (vLLM PagedAttention) | Under 5% waste via fixed-size block allocation on demand
Improvement | Up to 5.18x throughput gain from better memory utilization

Memory waste comparison: contiguous vs paged KV cache allocation

Paged KV Cache borrows the concept of virtual memory paging from operating systems. Instead of one contiguous slab, the KV cache is stored in fixed-size pages (blocks). Pages are allocated on demand as the sequence grows and freed immediately when the request completes. This means multiple requests can share the same physical memory pool without pre-allocating for worst-case lengths.
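The page-table idea fits in a few lines. This toy allocator (names and structure are illustrative, not vLLM's actual classes) tracks only block indices; a real implementation maps them onto physical GPU memory:

```python
class PagedKVAllocator:
    """Toy block-table allocator in the spirit of vLLM's PagedAttention.
    Tracks only block indices; names and structure are illustrative."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # request id -> list of block ids
        self.lens = {}                       # request id -> tokens generated

    def append_token(self, req_id):
        """Claim a new block only when the current one fills up."""
        n = self.lens.get(req_id, 0)
        if n % self.block_size == 0:         # current block full (or first token)
            if not self.free:
                raise MemoryError("out of KV cache blocks")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lens[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the shared pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lens.pop(req_id, None)
```

No request ever reserves memory for a worst-case length: blocks are claimed one at a time as tokens are generated and returned the moment the request finishes, which is where the under-5% waste figure comes from.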

PagedAttention Feature | Detail
Block size | Configurable (typically 16-256 tokens per block)
Allocation | On demand as tokens are generated, not pre-reserved
Deallocation | Immediate on request completion, with no fragmentation residue
Sharing | Multiple requests share the physical memory pool via block-table indirection
Copy-on-write | Shared prefixes (e.g. system prompts) stored once and referenced by multiple requests
Implementation | vLLM (primary), also adopted by TensorRT-LLM, SGLang, and others
Overhead | Minimal per-block bookkeeping; negligible compute cost

PagedAttention mechanism details from the vLLM project

How They Stack | The Three-Layer Optimization Model

These three techniques are not alternatives. They are complementary layers that each address a different bottleneck in the LLM inference pipeline. A production-grade serving stack in 2026 can, and should, use all three simultaneously.

Layer | What It Optimizes
Layer 1: Compute (FlashAttention 3) | Makes the attention dot-product and softmax faster by overlapping compute with memory transfers, reaching ~75% GPU FLOPS utilization
Layer 2: Data size (TurboQuant) | Compresses KV cache values from 16-bit to 3-bit, shrinking the memory footprint ~5.3x and speeding attention by up to 8x
Layer 3: Memory allocation (Paged KV Cache) | Organizes the compressed KV cache into dynamically allocated pages, eliminating the 60-80% fragmentation waste
Combined effect | Faster math (FA-3) on smaller data (TurboQuant) stored more efficiently (Paged KV Cache). Each layer multiplies the gains of the others.

The three-layer LLM inference optimization stack

The stacking math: on a 70B model with a 100K-token context, the unoptimized KV cache alone consumes ~40 GB of VRAM, with 60-80% of allocated cache memory wasted. After stacking all three: FlashAttention 3 runs the attention ~2x faster, TurboQuant compresses the cache to ~7.5 GB, and Paged KV Cache cuts fragmentation waste to under 5%. The combined result is dramatically more concurrent users per GPU, longer sustainable context windows, and lower cost per token.
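Those multipliers compose. A quick sketch, using the figures above (16-bit to 3-bit compression, roughly 70% allocation waste before paging versus 5% after, both assumed midpoints) to estimate the gain in usable KV capacity per GPU:

```python
# Figures from the article: 40 GB FP16 cache, 16 -> 3 bit quantization,
# ~60-80% allocation waste before paging (70% assumed here), <5% after.
kv_fp16_gb = 40.0
quantized_gb = kv_fp16_gb * 3 / 16          # TurboQuant: 7.5 GB
usable_before = 1 - 0.70                    # contiguous allocation
usable_after = 1 - 0.05                     # paged allocation
capacity_gain = (kv_fp16_gb / quantized_gb) * (usable_after / usable_before)
print(f"cache: {quantized_gb:.1f} GB, usable KV capacity: ~{capacity_gain:.0f}x")
# cache: 7.5 GB, usable KV capacity: ~17x
```

That order-of-magnitude gain in tokens-of-context-per-GB is before counting FlashAttention 3's ~2x compute speedup, which applies on top of it.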

Where Each Technique Runs Best | Hardware and Framework Matrix

Technique | Hardware and Framework Requirements
FlashAttention 3 | Hopper GPUs (H100, H200) required for full benefit; falls back to FA-2 on Ampere (A100). Integrated in PyTorch, Hugging Face Transformers, vLLM, and TensorRT-LLM.
TurboQuant | Hardware-agnostic (the compression is mathematical, not silicon-dependent). Drop-in layer for any inference stack; requires no retraining or calibration data.
Paged KV Cache | Hardware-agnostic. Implemented in vLLM (primary), TensorRT-LLM, SGLang, and DeepSpeed-FastGen. Works on any CUDA GPU.

Hardware and software requirements for each optimization technique

FlashAttention 3 is the only technique with a hard hardware dependency: its Hopper-specific optimizations (warp specialization, TMA instructions, FP8 tensor cores) do not exist on older GPUs. TurboQuant and Paged KV Cache are pure software optimizations that run on any CUDA-capable GPU, making them immediately deployable on existing infrastructure.

What Each Technique Does NOT Do | Common Misconceptions

Misunderstanding the scope of each technique leads to bad architecture decisions. Here is what each one explicitly does not address.

Technique | What It Does NOT Do
FlashAttention 3 | Does not reduce the KV cache memory footprint, compress stored values, or solve memory fragmentation. It only speeds up the attention computation step.
TurboQuant | Does not speed up attention compute directly (though smaller data means less bandwidth), solve allocation fragmentation, or change the attention algorithm.
Paged KV Cache | Does not compress data, speed up attention compute, or reduce the total amount of data stored; it only changes how efficiently that data is allocated in GPU memory.

Scope limitations of each technique; these are common sources of confusion

Timeline | How We Got Here

May 2022

FlashAttention 1 released (Tri Dao et al.)

First IO-aware attention algorithm. 2-4x speedup over standard attention on Ampere GPUs. Foundational paper.

July 2023

FlashAttention-2 released

2x speedup over FA-1. Better parallelism and work partitioning. Becomes default in PyTorch and Hugging Face.

September 2023

vLLM and PagedAttention published (UC Berkeley)

Introduces paged memory management for KV cache. Memory waste drops from 60-80% to under 5%. Throughput gains up to 5.18x.

January 2024

vLLM adoption explodes

vLLM becomes the default LLM serving framework. PagedAttention adopted by TensorRT-LLM, SGLang, and others.

July 2024

FlashAttention 3 released

Hopper-native optimization. 1.5-2x over FA-2 on H100. FP8 support. ~75% theoretical FLOPS utilization.

March 2026

Google Research releases TurboQuant

PolarQuant (ICLR 2026) + QJL (AISTATS 2026) combined. 3-bit KV cache compression with zero accuracy loss and zero calibration overhead.

Competing and Adjacent Techniques | The Broader Landscape

FlashAttention, TurboQuant, and Paged KV Cache are not the only optimization techniques in production. Several other approaches address overlapping or adjacent bottlenecks.

Technique | What It Does and How It Relates
Speculative Decoding | Uses a small draft model to propose tokens that the large model verifies. Speeds up generation throughput. Complementary to all three.
GQA / MQA (Grouped/Multi-Query Attention) | Reduces the number of KV heads at the architecture level. Requires model redesign and retraining. Complementary to FA-3 and Paged KV Cache.
KV Cache Eviction (H2O, Scissorhands) | Drops low-attention tokens to reduce cache size. Trades accuracy for memory. Competes with TurboQuant as a compression alternative.
Continuous Batching | Serves multiple requests simultaneously with dynamic batch composition. Complementary to all three; standard in vLLM.
Tensor Parallelism | Splits the model across multiple GPUs. Orthogonal to all three; addresses model-too-large-for-one-GPU scenarios.
Ring Attention | Distributes attention computation across devices for ultra-long contexts. Complementary to FA-3; addresses a different scale.
KIVI | Per-channel INT2 KV quantization. Competes with TurboQuant: more aggressive compression, higher accuracy risk.

Adjacent LLM optimization techniques and how they relate to the core three

Practical Deployment | What a Production Stack Looks Like in 2026

For teams deploying LLMs at scale in 2026, the standard optimized inference stack combines all three techniques plus several adjacent optimizations. Here is the reference architecture.

Stack Layer | Component and Role
Hardware | Nvidia H100/H200 (Hopper) or B300 (Blackwell) for FA-3 support
Serving framework | vLLM or TensorRT-LLM (Paged KV Cache built in)
Attention kernel | FlashAttention 3 (automatic in vLLM on Hopper GPUs)
KV cache compression | TurboQuant (drop-in layer, no retraining)
Batching | Continuous batching (standard in vLLM)
Generation | Speculative decoding with a draft model (optional, ~2x decode speed)
Model precision | FP8 weights + 3-bit KV cache (TurboQuant)
Monitoring | Tokens/sec, TTFT, time per output token, VRAM utilization

Reference LLM inference stack for production deployment in 2026

This stack runs on hardware from Nvidia, AMD, or Google (with FA-3 specific to Nvidia Hopper/Blackwell). For teams using the Nvidia-Groq LPX inference platform, the LPU architecture handles decode differently, and the stacking model may shift as Groq's deterministic execution model changes the bottleneck profile.

💬
“The question is not which optimization to use. The question is which layer of your stack is still unoptimized. FlashAttention 3, TurboQuant, and Paged KV Cache each remove a different ceiling, and the one you are hitting right now is the one you should deploy next.”


Written by

Jack Wang

AI & Technology