FlashAttention 3 | Making Attention Math 2x Faster on Hopper GPUs
FlashAttention 3, developed by Tri Dao and collaborators, is the third generation of the IO-aware attention algorithm that fundamentally changed how transformers compute attention. The core insight: standard attention implementations are bottlenecked not by compute but by memory bandwidth, the cost of moving data between GPU SRAM and HBM.
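The IO-aware idea is concrete enough to sketch. Below is a minimal NumPy illustration (not the actual CUDA kernel) of the tiling-plus-online-softmax trick FlashAttention builds on: a single query vector is processed against K/V one tile at a time, rescaling partial results as it goes, so the full score matrix is never written to memory.

```python
import numpy as np

def tiled_attention(q, K, V, block=128):
    """Attention for one query vector, computed over K/V tiles with the
    online-softmax rescaling trick: the full score vector is never stored."""
    d = q.shape[-1]
    m = -np.inf                  # running maximum of the scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(V.shape[-1])  # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k, v = K[start:start + block], V[start:start + block]
        s = (k @ q) / np.sqrt(d)       # attention scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescales earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

# Agrees with the direct, materialize-everything computation:
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((512, 64)), rng.standard_normal((512, 64))
s = (K @ q) / np.sqrt(64)
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```

The point of the exercise: the loop reads each K/V tile from slow memory exactly once and keeps only three small accumulators, which is why the algorithm's cost is dominated by bandwidth-friendly streaming rather than by materializing a sequence-by-sequence score matrix.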
FlashAttention 3 targets Hopper-class GPUs (H100, H200), exploiting architectural features unavailable on earlier generations. The published results report a 1.5-2.0x speedup over FlashAttention-2 on H100 with FP16/BF16, plus native FP8 support, which halves the bit width relative to FP16 for substantially higher effective throughput while careful accumulation strategies maintain output quality.
| Feature | FlashAttention 2 | FlashAttention 3 |
|---|---|---|
| Target GPU architecture | Ampere (A100) and Hopper | Hopper-native (H100/H200-specific optimizations) |
| Speedup | Baseline | 1.5-2.0x over FA-2 on H100 (FP16/BF16) |
| FP8 support | No | Yes, with mixed-precision accumulation |
| Asynchrony | Limited | Full warp specialization with overlapped compute and memory |
| Block-sparse support | Basic | Improved, enabling structured sparsity patterns |
| FLOPS utilization on H100 | ~35-40% of theoretical peak | ~75% of theoretical peak |
| Kernel implementation | CUDA + CUTLASS | CUDA + CUTLASS + Hopper-specific TMA instructions |
The key architectural innovation in FA-3 is warp specialization: different warps (groups of 32 GPU threads) are assigned different roles, with some warps performing computation while others handle memory transfers simultaneously. On Hopper GPUs with their Tensor Memory Accelerator (TMA), this overlap is hardware-supported, allowing FA-3 to approach 75% of the GPU's theoretical peak FLOPS, up from roughly 35-40% with FA-2.
The critical distinction: FlashAttention 3 makes the math faster. TurboQuant makes the data smaller. On a 70B model with a 100K-token context, the KV cache occupies 40+ GB of VRAM in FP16. TurboQuant shrinks that to ~6.7 GB, freeing VRAM for either longer contexts, more concurrent users, or both.
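The 40+ GB figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a Llama-70B-style configuration with grouped-query attention — 80 layers, 8 KV heads, head dimension 128 are my assumed numbers, not from the article — so it lands in the same ballpark rather than matching exactly; real totals shift with the KV-head count, batch size, and allocator overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # 2x for keys and values, stored at every layer for every token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Assumed Llama-70B-like config: 80 layers, 8 KV heads of dim 128, 100K tokens.
fp16 = kv_cache_bytes(80, 8, 128, 100_000, 2)      # FP16: 2 bytes/element
bit3 = kv_cache_bytes(80, 8, 128, 100_000, 3 / 8)  # 3-bit: 0.375 bytes/element
print(f"FP16: {fp16 / 2**30:.1f} GiB, 3-bit: {bit3 / 2**30:.1f} GiB")
# → FP16: 30.5 GiB, 3-bit: 5.7 GiB
```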
Paged KV Cache | Solving the 60-80% Memory Waste Problem
The third piece of the stack is the paged KV cache, popularized by the vLLM project and its PagedAttention paper. It addresses neither compute speed (FlashAttention's domain) nor data size (TurboQuant's), but memory-allocation efficiency.
In traditional LLM serving, the KV cache for each request is allocated as a single contiguous block of GPU memory. The problem: memory must be reserved for the maximum possible sequence length at request start, even if the actual generation ends up much shorter. The PagedAttention paper measured that this contiguous allocation wastes 60-80% of KV cache memory on average through internal fragmentation and over-reservation.
| Allocation Strategy | Memory Waste |
|---|---|
| Contiguous allocation (traditional) | 60-80% average waste due to pre-reservation for the maximum sequence length |
| Paged allocation (vLLM PagedAttention) | Under 5% waste via fixed-size blocks allocated on demand |
| Improvement | Up to 5.18x throughput gain from better memory utilization |
Paged KV Cache borrows the concept of virtual memory paging from operating systems. Instead of one contiguous slab, the KV cache is stored in fixed-size pages (blocks). Pages are allocated on demand as the sequence grows and freed immediately when the request completes. This means multiple requests can share the same physical memory pool without pre-allocating for worst-case lengths.
| PagedAttention Feature | Detail |
|---|---|
| Block size | Configurable (typically 16-256 tokens per block) |
| Allocation | On demand as tokens are generated, not pre-reserved |
| Deallocation | Immediate on request completion, no fragmentation residue |
| Sharing | Multiple requests share a physical memory pool via block-table indirection |
| Copy-on-write | Shared prefixes (system prompts) stored once, referenced by multiple requests |
| Implementation | vLLM (primary); also adopted by TensorRT-LLM, SGLang, and others |
| Overhead | Minimal per-block bookkeeping, negligible compute cost |
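The bookkeeping behind these features fits in a toy sketch. The `PagePool` class below is illustrative — the names and structure are my own, not vLLM's internals — but it shows the three core moves: pages allocated on demand, prefixes shared via reference counting, and immediate release.

```python
class PagePool:
    """Toy paged KV-cache allocator: fixed-size pages handed out on demand,
    with reference counting so a shared prefix is stored once.
    Illustrative sketch, not vLLM's actual implementation."""
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.refs = [0] * num_pages         # reference counts for sharing

    def append_token(self, block_table):
        """Grow a request by one token; allocate a new physical page only
        when the last page is full. block_table holds (page_id, used)."""
        if block_table and block_table[-1][1] < self.page_size:
            page, used = block_table[-1]
            block_table[-1] = (page, used + 1)
        else:
            page = self.free.pop()          # raises if the pool is exhausted
            self.refs[page] = 1
            block_table.append((page, 1))

    def fork(self, block_table):
        """Share a prefix (e.g. a system prompt) copy-on-write style:
        the new request points at the same physical pages."""
        for page, _ in block_table:
            self.refs[page] += 1
        return list(block_table)

    def release(self, block_table):
        """Free pages as soon as a request completes."""
        for page, _ in block_table:
            self.refs[page] -= 1
            if self.refs[page] == 0:
                self.free.append(page)

pool = PagePool(num_pages=8, page_size=4)
req_a = []
for _ in range(6):            # 6 tokens -> 2 pages (ceil(6/4)), not a max-length slab
    pool.append_token(req_a)
req_b = pool.fork(req_a)      # shares both pages, allocates nothing
pool.release(req_a)           # pages survive: req_b still references them
print(len(pool.free))         # → 6
```

A real implementation stores the actual key/value tensors inside each physical page and hands the block table to the attention kernel, which follows the indirection instead of assuming contiguous memory.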
How They Stack | The Three-Layer Optimization Model
These three techniques are not alternatives. They are complementary layers that each address a different bottleneck in the LLM inference pipeline. A production-grade serving stack in 2026 can, and should, use all three simultaneously.
| Layer | What It Optimizes |
|---|---|
| Layer 1: Compute (FlashAttention 3) | Makes the attention dot-product and softmax faster by overlapping compute with memory transfers, achieving ~75% GPU FLOPS utilization |
| Layer 2: Data size (TurboQuant) | Compresses the KV cache values from 16-bit to 3-bit, reducing memory footprint 6x and attention bandwidth 8x |
| Layer 3: Memory allocation (Paged KV Cache) | Organizes the compressed KV cache into dynamically allocated pages, eliminating the 60-80% fragmentation waste |
| Combined effect | Faster math (FA-3) on smaller data (TurboQuant) stored more efficiently (Paged KV Cache). Each layer multiplies the gains of the others. |
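For intuition about what Layer 2 does, here is a generic group-wise 3-bit quantize/dequantize round trip. This is plain uniform quantization, not TurboQuant's published method (which the article describes as combining transform-based techniques), so treat it only as the shape of the idea.

```python
import numpy as np

def quantize_3bit(x, group=32):
    """Uniform per-group 3-bit quantization (8 levels). Generic sketch,
    NOT TurboQuant's actual algorithm."""
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 7                        # map [lo, hi] onto codes 0..7
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_3bit(codes, scale, lo):
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(1)
kv = rng.standard_normal(1024).astype(np.float32)  # stand-in for KV values
codes, scale, lo = quantize_3bit(kv)
max_err = np.abs(dequantize_3bit(codes, scale, lo) - kv).max()
# Reconstruction error is bounded by half a quantization step per group.
assert max_err <= scale.max() / 2 + 1e-6
```

One honest caveat this sketch surfaces: with 32-value groups, the per-group scale and zero point (two FP16 values) add roughly one extra bit per stored element, so practical schemes pack the 3-bit codes tightly and amortize that metadata to keep the effective compression close to the headline ratio.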
FlashAttention 3 is the only technique with a hard hardware dependency: its Hopper-specific optimizations (warp specialization, TMA instructions, FP8 tensor cores) do not exist on older GPUs. TurboQuant and Paged KV Cache are pure software optimizations that run on any CUDA-capable GPU, making them immediately deployable on existing infrastructure.
What Each Technique Does NOT Do | Common Misconceptions
Misunderstanding the scope of each technique leads to bad architecture decisions. Here is what each one explicitly does not address.
| Technique | What It Does NOT Do |
|---|---|
| FlashAttention 3 | Does not reduce KV cache memory footprint. Does not compress stored values. Does not solve memory fragmentation. Only speeds up the attention computation step. |
| TurboQuant | Does not speed up attention compute directly (though smaller data means less bandwidth). Does not solve memory allocation fragmentation. Does not change the attention algorithm. |
| Paged KV Cache | Does not compress data. Does not speed up attention compute. Does not reduce the total amount of data stored, only how efficiently it is allocated in GPU memory. |
Timeline | How We Got Here
FlashAttention 1 released (Tri Dao et al., 2022)
First IO-aware attention algorithm. 2-4x speedup over standard attention on Ampere GPUs. Foundational paper.
FlashAttention-2 released (2023)
2x speedup over FA-1. Better parallelism and work partitioning. Becomes default in PyTorch and Hugging Face.
vLLM and PagedAttention published (UC Berkeley, 2023)
Introduces paged memory management for KV cache. Memory waste drops from 60-80% to under 5%. Throughput gains up to 5.18x.
vLLM adoption explodes
vLLM becomes the default LLM serving framework. PagedAttention adopted by TensorRT-LLM, SGLang, and others.
FlashAttention 3 released (2024)
Hopper-native optimization. 1.5-2x over FA-2 on H100. FP8 support. ~75% theoretical FLOPS utilization.
Google Research releases TurboQuant
PolarQuant (ICLR 2026) + QJL (AISTATS 2026) combined. 3-bit KV cache compression with zero accuracy loss and zero calibration overhead.
Competing and Adjacent Techniques | The Broader Landscape
FlashAttention, TurboQuant, and Paged KV Cache are not the only optimization techniques in production. Several other approaches address overlapping or adjacent bottlenecks.
| Technique | What It Does and How It Relates |
|---|---|
| Speculative Decoding | Uses a small draft model to propose tokens verified by the large model. Speeds up generation throughput. Complementary to all three. |
| GQA / MQA (Grouped/Multi-Query Attention) | Reduces KV heads at the architecture level. Requires model redesign and retraining. Complementary to FA-3 and Paged KV Cache. |
| KV Cache Eviction (H2O, Scissorhands) | Drops low-attention tokens to reduce cache size. Trades accuracy for memory. Competes with TurboQuant as a compression alternative. |
| Continuous Batching | Serves multiple requests simultaneously with dynamic batch composition. Complementary to all three. Standard in vLLM. |
| Tensor Parallelism | Splits the model across multiple GPUs. Orthogonal to all three; addresses model-too-large-for-one-GPU scenarios. |
| Ring Attention | Distributes attention computation across devices for ultra-long contexts. Complementary to FA-3; addresses a different scale. |
| KIVI | Tuning-free 2-bit KV quantization (per-channel keys, per-token values). Competes with TurboQuant: more aggressive compression, higher accuracy risk. |
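Of the techniques above, speculative decoding is the easiest to sketch end to end. The toy below uses deterministic greedy "models" (plain Python functions invented for illustration); a production system verifies all draft tokens with a single batched target forward pass — which is where the speedup comes from — and uses rejection sampling over probabilities rather than exact greedy matching.

```python
def speculative_step(target, draft, seq, k=4):
    """One round of greedy speculative decoding: the cheap draft model
    proposes k tokens, the target verifies them left to right. In a real
    system the k verifications happen in one batched forward pass."""
    proposal = list(seq)
    for _ in range(k):
        proposal.append(draft(proposal))
    accepted = []
    for tok in proposal[len(seq):]:
        t = target(seq + accepted)           # target's own next token here
        accepted.append(t)
        if t != tok:                         # first disagreement: stop
            return accepted
    accepted.append(target(seq + accepted))  # bonus token on full acceptance
    return accepted

# Toy deterministic "models": next token is a function of the sequence sum.
target = lambda s: (7 * sum(s) + 3) % 11
draft = lambda s: target(s) if len(s) % 4 else (target(s) + 1) % 11  # usually right

def greedy(model, seq, n):
    seq = list(seq)
    for _ in range(n):
        seq.append(model(seq))
    return seq

out = [1, 2]
while len(out) < 14:
    out.extend(speculative_step(target, draft, out))
# Speculative output matches plain greedy decoding with the target exactly.
assert out[:14] == greedy(target, [1, 2], 12)
```

The invariant worth noticing: every emitted token is the target's own greedy choice, so the output is identical to ordinary decoding — the draft model only changes how many target evaluations can be batched per step.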
Practical Deployment | What a Production Stack Looks Like in 2026
For teams deploying LLMs at scale in 2026, the standard optimized inference stack combines all three techniques plus several adjacent optimizations. Here is the reference architecture.
| Stack Layer | Component and Role |
|---|---|
| Hardware | Nvidia H100/H200 (Hopper) or B300 (Blackwell) for FA-3 support |
| Serving framework | vLLM or TensorRT-LLM (Paged KV Cache built in) |
| Attention kernel | FlashAttention 3 (automatic in vLLM on Hopper GPUs) |
| KV cache compression | TurboQuant (drop-in layer, no retraining) |
| Batching | Continuous batching (standard in vLLM) |
| Generation | Speculative decoding with a draft model (optional, ~2x decode speed) |
| Model precision | FP8 weights + 3-bit KV cache (TurboQuant) |
| Monitoring | Tokens/sec, TTFT, time per output token, VRAM utilization |
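As a concrete starting point, a vLLM launch along these lines exercises the paged KV cache and a compressed KV cache. The model name and parameter values are illustrative, and flag names match recent vLLM releases but should be verified against your installed version; note that vLLM ships FP8 KV-cache compression — a 3-bit TurboQuant option is not an existing vLLM flag.

```shell
# Illustrative vLLM launch -- verify flags against your installed version.
# The paged KV cache is always on in vLLM; --block-size sets the page size.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 100000 \
  --block-size 16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90
```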
This stack runs on hardware from Nvidia, AMD, or Google (with FA-3 specific to Nvidia Hopper/Blackwell). For teams using the Nvidia-Groq LPX inference platform, the LPU architecture handles decode differently, and the stacking model may shift as Groq's deterministic execution model changes the bottleneck profile.