OBJECTWIRE

Independent · Verified · In-Depth


FlashAttention 3 vs TurboQuant vs Paged KV Cache | How the LLM Optimization Stack Actually Works

FlashAttention 3 speeds up attention compute, TurboQuant compresses KV cache storage, Paged KV Cache eliminates memory fragmentation, and the real answer is you use all three

April 1, 2026 · 10 min read

FlashAttention 3 | Making Attention Math 2x Faster on Hopper GPUs

FlashAttention 3, developed by Tri Dao and collaborators, is the third generation of the IO-aware attention algorithm that fundamentally changed how transformers compute attention. The core insight: standard attention implementations are bottlenecked not by compute but by memory bandwidth, the cost of moving data between GPU SRAM and HBM.
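The IO-aware trick can be sketched in plain NumPy: stream K and V through in tiles while maintaining a running maximum and running normalizer (the online-softmax recurrence), so the full N×N score matrix is never materialized. This is a conceptual sketch of the math only, not the fused GPU kernel; the function name and block size are illustrative.

```python
import numpy as np

def tiled_attention(q, K, V, block=64):
    """One query vector attending over K/V processed tile by tile.

    Keeps a running max (m), running softmax normalizer (l), and a
    running weighted sum of V rows (acc), rescaling earlier partial
    results whenever a new tile raises the max -- the same
    online-softmax recurrence FlashAttention applies per tile.
    """
    d = q.shape[-1]
    m = -np.inf          # running max of scores, for numerical stability
    l = 0.0              # running softmax normalizer
    acc = np.zeros(d)    # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k_blk = K[start:start + block]
        v_blk = V[start:start + block]
        s = k_blk @ q / np.sqrt(d)      # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)       # rescale previous partials
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l
```

The result matches naive softmax attention exactly; what changes is that each tile of K/V is touched once, which is what lets the real kernel keep working data in SRAM instead of repeatedly round-tripping to HBM.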

FlashAttention 3 targets Hopper-class GPUs (H100, H200) with architectural features unavailable on earlier generations. The published results report 1.5-2.0x speedup over FlashAttention-2 on H100 with FP16/BF16, plus native FP8 support that unlocks substantially higher effective throughput by halving the precision while maintaining output quality through careful accumulation strategies.

| Feature | FlashAttention 2 | FlashAttention 3 |
| --- | --- | --- |
| Target GPU architecture | Ampere (A100) and Hopper | Hopper-native (H100/H200-specific optimizations) |
| Speedup over FA-2 | Baseline | 1.5-2.0x on H100 (FP16/BF16) |
| FP8 support | No | Yes, with mixed-precision accumulation |
| Asynchrony | Limited | Full warp specialization with overlapped compute and memory |
| Block-sparse support | Basic | Improved, enabling structured sparsity patterns |
| FLOPS utilization on H100 | ~35-40% of theoretical peak | ~75% of theoretical peak |
| Kernel implementation | CUDA + CUTLASS | CUDA + CUTLASS + Hopper-specific TMA instructions |
FlashAttention 2 vs FlashAttention 3 feature comparison

The key architectural innovation in FA-3 is warp specialization: different warps (groups of 32 GPU threads) are assigned different roles, with some warps performing computation while others handle memory transfers simultaneously. On Hopper GPUs with their Tensor Memory Accelerator (TMA), this overlap is hardware-supported, allowing FA-3 to approach 75% of the GPU's theoretical peak FLOPS, up from roughly 35-40% with FA-2.


The critical distinction: FlashAttention 3 makes the math faster. TurboQuant makes the data smaller. On a 70B model with a 100K-token context, the KV cache occupies 40+ GB of VRAM in FP16. TurboQuant shrinks that to ~6.7 GB, freeing VRAM for either longer contexts, more concurrent users, or both.
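The arithmetic behind figures like these is simple to reproduce. The sketch below computes KV cache size from model dimensions; the 70B-class numbers (80 layers, 8 grouped-query KV heads, head dimension 128) are illustrative assumptions, and real totals depend on the model's exact attention layout plus any quantization metadata, so they will not match the article's estimates to the gigabyte.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Size of the KV cache: 2 tensors (K and V), one entry per layer,
    per KV head, per head-dimension channel, per token."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8  # bits -> bytes

# Illustrative 70B-class configuration (assumed: 80 layers, 8 GQA KV
# heads, head_dim 128), 100K-token context.
for bits, label in [(16, "FP16"), (3, "3-bit")]:
    gib = kv_cache_bytes(80, 8, 128, 100_000, bits) / 2**30
    print(f"{label}: {gib:.1f} GiB")
```

With these assumed GQA dimensions the FP16 figure lands near 30 GiB; models with more KV heads, or quantizers that store per-block scales and zero points, push the totals upward. Whatever the absolute numbers, the 16-bit to 3-bit ratio is what delivers the roughly 5-6x shrink.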

Paged KV Cache | Solving the 60-80% Memory Waste Problem

The third piece of the stack is Paged KV Cache, popularized by the vLLM project and its PagedAttention paper. It addresses neither compute speed (FlashAttention's territory) nor data size (TurboQuant's), but a third bottleneck entirely: memory-allocation efficiency.

In traditional LLM serving, the KV cache for each request is allocated as a single contiguous block of GPU memory. The problem: you must allocate for the maximum possible sequence length at request start, even if the actual generation ends up much shorter. Studies show this contiguous allocation wastes 60-80% of KV cache memory on average due to internal fragmentation and over-reservation.

| Allocation Strategy | Memory Waste |
| --- | --- |
| Contiguous allocation (traditional) | 60-80% average waste due to pre-reservation for max sequence length |
| Paged allocation (vLLM PagedAttention) | Under 5% waste via fixed-size block allocation on demand |
| Improvement | Up to 5.18x throughput gain from better memory utilization |
Memory waste comparison: contiguous vs paged KV cache allocation

Paged KV Cache borrows the concept of virtual memory paging from operating systems. Instead of one contiguous slab, the KV cache is stored in fixed-size pages (blocks). Pages are allocated on demand as the sequence grows and freed immediately when the request completes. This means multiple requests can share the same physical memory pool without pre-allocating for worst-case lengths.
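A toy allocator makes the mechanism concrete. This is a minimal sketch of the idea, not vLLM's actual implementation; the class and method names are illustrative.

```python
class PagedKVAllocator:
    """Toy paged KV-cache allocator in the spirit of PagedAttention.

    Physical memory is a pool of fixed-size blocks. Each request holds
    a block table mapping its logical block index to a physical block
    id; blocks are grabbed on demand as tokens arrive and returned to
    the free list the moment the request finishes.
    """

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> tokens cached so far

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % self.block_size == 0:   # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = length + 1

    def release(self, req_id):
        # Whole block table goes straight back to the pool: no residue.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

With `block_size=16`, a request that finishes after 40 tokens holds exactly three blocks, at most one of them partially filled, instead of a contiguous reservation sized for the maximum context.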

| PagedAttention Feature | Detail |
| --- | --- |
| Block size | Configurable (typically 16-256 tokens per block) |
| Allocation | On demand as tokens are generated, not pre-reserved |
| Deallocation | Immediate on request completion, no fragmentation residue |
| Sharing | Multiple requests share a physical memory pool via block-table indirection |
| Copy-on-write | Shared prefixes (system prompts) stored once, referenced by multiple requests |
| Implementation | vLLM (primary), also adopted by TensorRT-LLM, SGLang, and others |
| Overhead | Minimal per-block bookkeeping, negligible compute cost |
PagedAttention mechanism details from the vLLM project
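The copy-on-write sharing of prefix blocks comes down to reference counting: a shared block is only returned to the free pool when the last request referencing it releases it. A minimal sketch, with illustrative names rather than vLLM's internals:

```python
class SharedPrefixPool:
    """Toy reference counting for KV blocks shared across requests.

    A system prompt's KV blocks are stored once; each request that
    reuses the prefix bumps the refcount. A block is only truly freed
    when its count drops to zero; a request needing to modify a shared
    block would first copy it (copy-on-write).
    """

    def __init__(self):
        self.refcount = {}   # physical block id -> number of referents

    def share(self, block_ids):
        for b in block_ids:
            self.refcount[b] = self.refcount.get(b, 0) + 1

    def release(self, block_ids):
        freed = []
        for b in block_ids:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                del self.refcount[b]
                freed.append(b)   # safe to return to the free pool
        return freed
```

Two requests sharing a system prompt thus cost one copy of its KV blocks, not two; the second release is the one that actually frees them.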

How They Stack | The Three-Layer Optimization Model

These three techniques are not alternatives. They are complementary layers that each address a different bottleneck in the LLM inference pipeline. A production-grade serving stack in 2026 can, and should, use all three simultaneously.

| Layer | What It Optimizes |
| --- | --- |
| Layer 1: Compute (FlashAttention 3) | Makes the attention dot-product and softmax faster by overlapping compute with memory transfers, achieving ~75% GPU FLOPS utilization |
| Layer 2: Data size (TurboQuant) | Compresses the KV cache values from 16-bit to 3-bit, reducing memory footprint 6x and attention bandwidth 8x |
| Layer 3: Memory allocation (Paged KV Cache) | Organizes the compressed KV cache into dynamically allocated pages, eliminating 60-80% fragmentation waste |
| Combined effect | Faster math (FA-3) on smaller data (TurboQuant) stored more efficiently (Paged KV Cache). Each layer multiplies the gains of the others. |
The three-layer LLM inference optimization stack
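Back-of-envelope arithmetic shows why the layers multiply rather than merely add. Every number below is an illustrative assumption, not a measurement: a fixed VRAM budget for KV cache, an assumed per-request cache size, and the waste figures cited above.

```python
# Rough concurrent-capacity arithmetic for a fixed KV-cache budget.
# All inputs are illustrative assumptions, not benchmarks.
budget_gib = 60              # VRAM set aside for KV cache
per_req_fp16_gib = 2.5       # assumed per-request KV footprint at FP16
quant_ratio = 16 / 3         # 16-bit -> 3-bit values (TurboQuant layer)
usable_contiguous = 0.3      # ~70% average waste -> 30% usable
usable_paged = 0.95          # <5% waste with paged allocation

def capacity(usable_frac, per_req_gib):
    """How many concurrent requests fit in the usable KV budget."""
    return int(budget_gib * usable_frac // per_req_gib)

baseline = capacity(usable_contiguous, per_req_fp16_gib)
stacked = capacity(usable_paged, per_req_fp16_gib / quant_ratio)
print(f"contiguous FP16: {baseline} requests; paged + 3-bit: {stacked}")
```

Under these assumptions the quantization gain (~5.3x) and the allocation gain (~3.2x) compound to over 15x more concurrent requests from the same VRAM, before FA-3 speeds up the math on each of them.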

FlashAttention 3 is the only technique with a hard hardware dependency: its Hopper-specific optimizations (warp specialization, TMA instructions, FP8 tensor cores) do not exist on older GPUs. TurboQuant and Paged KV Cache are pure software optimizations that run on any CUDA-capable GPU, making them immediately deployable on existing infrastructure.

What Each Technique Does NOT Do | Common Misconceptions

Misunderstanding the scope of each technique leads to bad architecture decisions. Here is what each one explicitly does not address.

| Technique | What It Does NOT Do |
| --- | --- |
| FlashAttention 3 | Does not reduce KV cache memory footprint. Does not compress stored values. Does not solve memory fragmentation. Only speeds up the attention computation step. |
| TurboQuant | Does not speed up attention compute directly (though smaller data means less bandwidth). Does not solve memory allocation fragmentation. Does not change the attention algorithm. |
| Paged KV Cache | Does not compress data. Does not speed up attention compute. Does not reduce the total amount of data stored, only how efficiently it is allocated in GPU memory. |
Scope limitations of each technique; these are common sources of confusion

Timeline | How We Got Here

May 2022

FlashAttention 1 released (Tri Dao et al.)

First IO-aware attention algorithm. 2-4x speedup over standard attention on Ampere GPUs. Foundational paper.

July 2023

FlashAttention-2 released

2x speedup over FA-1. Better parallelism and work partitioning. Becomes default in PyTorch and Hugging Face.

September 2023

vLLM and PagedAttention published (UC Berkeley)

Introduces paged memory management for KV cache. Memory waste drops from 60-80% to under 5%. Throughput gains up to 5.18x.

January 2024

vLLM adoption explodes

vLLM becomes the default LLM serving framework. PagedAttention adopted by TensorRT-LLM, SGLang, and others.

July 2024

FlashAttention 3 released

Hopper-native optimization. 1.5-2x over FA-2 on H100. FP8 support. ~75% theoretical FLOPS utilization.

March 2026

Google Research releases TurboQuant

PolarQuant (ICLR 2026) + QJL (AISTATS 2026) combined. 3-bit KV cache compression with zero accuracy loss and zero calibration overhead.

Competing and Adjacent Techniques | The Broader Landscape

FlashAttention, TurboQuant, and Paged KV Cache are not the only optimization techniques in production. Several other approaches address overlapping or adjacent bottlenecks.

| Technique | What It Does and How It Relates |
| --- | --- |
| Speculative Decoding | Uses a small draft model to propose tokens verified by the large model. Speeds up generation throughput. Complementary to all three. |
| GQA / MQA (Grouped/Multi-Query Attention) | Reduces KV heads at the architecture level. Requires model redesign and retraining. Complementary to FA-3 and Paged KV Cache. |
| KV Cache Eviction (H2O, Scissorhands) | Drops low-attention tokens to reduce cache size. Trades accuracy for memory. Competes with TurboQuant as a compression alternative. |
| Continuous Batching | Serves multiple requests simultaneously with dynamic batch composition. Complementary to all three. Standard in vLLM. |
| Tensor Parallelism | Splits the model across multiple GPUs. Orthogonal to all three; addresses model-too-large-for-one-GPU scenarios. |
| Ring Attention | Distributes attention computation across devices for ultra-long contexts. Complementary to FA-3; addresses a different scale. |
| KIVI (Meta AI) | Per-channel INT2 KV quantization. Competes with TurboQuant; more aggressive compression, higher accuracy risk. |
Adjacent LLM optimization techniques and how they relate to the core three

Practical Deployment | What a Production Stack Looks Like in 2026

For teams deploying LLMs at scale in 2026, the standard optimized inference stack combines all three techniques plus several adjacent optimizations. Here is the reference architecture.

| Stack Layer | Component and Role |
| --- | --- |
| Hardware | Nvidia H100/H200 (Hopper) or B300 (Blackwell) for FA-3 support |
| Serving framework | vLLM or TensorRT-LLM (Paged KV Cache built in) |
| Attention kernel | FlashAttention 3 (automatic in vLLM on Hopper GPUs) |
| KV cache compression | TurboQuant (drop-in layer, no retraining) |
| Batching | Continuous batching (standard in vLLM) |
| Generation | Speculative decoding with draft model (optional, ~2x decode speed) |
| Model precision | FP8 weights + 3-bit KV cache (TurboQuant) |
| Monitoring | Tokens/sec, TTFT, time per output token, VRAM utilization |
Reference LLM inference stack for production deployment in 2026

This stack runs on hardware from Nvidia, AMD, or Google (with FA-3 specific to Nvidia Hopper/Blackwell). For teams using the Nvidia-Groq LPX inference platform, the LPU architecture handles decode differently, and the stacking model may shift as Groq's deterministic execution model changes the bottleneck profile.


Tags

#FlashAttention 3 · #TurboQuant · #Paged KV Cache · #vLLM · #PagedAttention · #Tri Dao · #Google Research · #PolarQuant · #QJL · #LLM Inference · #H100 · #Hopper GPU · #CUDA



Written by

Jack Wang

AI & Technology
