FlashAttention 3 | Making Attention Math 2x Faster on Hopper GPUs
FlashAttention 3, developed by Tri Dao and collaborators, is the third generation of the IO-aware attention algorithm that fundamentally changed how transformers compute attention. The core insight: standard attention implementations are bottlenecked not by compute but by memory bandwidth, the cost of moving data between GPU SRAM and HBM.
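The IO-aware idea is concrete enough to sketch. Below is a minimal NumPy illustration (not the actual CUDA kernel) of the tiling-plus-online-softmax trick FlashAttention builds on: a single query vector is processed against K/V one tile at a time, rescaling partial results as it goes, so the full score matrix is never written to memory.

```python
import numpy as np

def tiled_attention(q, K, V, block=128):
    """Attention for one query vector, computed over K/V tiles with the
    online-softmax rescaling trick: the full score vector is never stored."""
    d = q.shape[-1]
    m = -np.inf                  # running maximum of the scores seen so far
    l = 0.0                      # running softmax denominator
    acc = np.zeros(V.shape[-1])  # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        k, v = K[start:start + block], V[start:start + block]
        s = (k @ q) / np.sqrt(d)       # attention scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescales earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

# Agrees with the direct, materialize-everything computation:
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((512, 64)), rng.standard_normal((512, 64))
s = (K @ q) / np.sqrt(64)
w = np.exp(s - s.max())
ref = (w / w.sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```

The point of the exercise: the loop reads each K/V tile from slow memory exactly once and keeps only three small accumulators, which is why the algorithm's cost is dominated by bandwidth-friendly streaming rather than by materializing a sequence-by-sequence score matrix.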
FlashAttention 3 targets Hopper-class GPUs (H100, H200), exploiting architectural features unavailable on earlier generations. The published results report a 1.5-2.0x speedup over FlashAttention-2 on H100 with FP16/BF16, plus native FP8 support, which halves the bit width relative to FP16 for substantially higher effective throughput while careful accumulation strategies maintain output quality.
| Feature | FlashAttention 2 | FlashAttention 3 |
|---|---|---|
| Target GPU architecture | Ampere (A100) and Hopper | Hopper-native (H100/H200-specific optimizations) |
| Speedup | Baseline | 1.5-2.0x over FA-2 on H100 (FP16/BF16) |
| FP8 support | No | Yes, with mixed-precision accumulation |
| Asynchrony | Limited | Full warp specialization with overlapped compute and memory |
| Block-sparse support | Basic | Improved, enabling structured sparsity patterns |
| FLOPS utilization on H100 | ~35-40% of theoretical peak | ~75% of theoretical peak |
| Kernel implementation | CUDA + CUTLASS | CUDA + CUTLASS + Hopper-specific TMA instructions |
The key architectural innovation in FA-3 is warp specialization: different warps (groups of 32 GPU threads) are assigned different roles, with some warps performing computation while others handle memory transfers simultaneously. On Hopper GPUs with their Tensor Memory Accelerator (TMA), this overlap is hardware-supported, allowing FA-3 to approach 75% of the GPU's theoretical peak FLOPS, up from roughly 35-40% with FA-2.
The critical distinction: FlashAttention 3 makes the math faster. TurboQuant makes the data smaller. On a 70B model with a 100K-token context, the KV cache occupies 40+ GB of VRAM in FP16. TurboQuant shrinks that to ~6.7 GB, freeing VRAM for either longer contexts, more concurrent users, or both.
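The 40+ GB figure is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes a Llama-70B-style configuration with grouped-query attention — 80 layers, 8 KV heads, head dimension 128 are my assumed numbers, not from the article — so it lands in the same ballpark rather than matching exactly; real totals shift with the KV-head count, batch size, and allocator overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # 2x for keys and values, stored at every layer for every token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

# Assumed Llama-70B-like config: 80 layers, 8 KV heads of dim 128, 100K tokens.
fp16 = kv_cache_bytes(80, 8, 128, 100_000, 2)      # FP16: 2 bytes/element
bit3 = kv_cache_bytes(80, 8, 128, 100_000, 3 / 8)  # 3-bit: 0.375 bytes/element
print(f"FP16: {fp16 / 2**30:.1f} GiB, 3-bit: {bit3 / 2**30:.1f} GiB")
# → FP16: 30.5 GiB, 3-bit: 5.7 GiB
```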
Paged KV Cache | Solving the 60-80% Memory Waste Problem
The third piece of the stack is the paged KV cache, popularized by the vLLM project and its PagedAttention paper. It addresses neither compute speed (FlashAttention's domain) nor data size (TurboQuant's), but memory-allocation efficiency.
In traditional LLM serving, the KV cache for each request is allocated as a single contiguous block of GPU memory. The problem: memory must be reserved for the maximum possible sequence length at request start, even if the actual generation ends up much shorter. The PagedAttention paper measured that this contiguous allocation wastes 60-80% of KV cache memory on average through internal fragmentation and over-reservation.
| Allocation Strategy | Memory Waste |
|---|---|
| Contiguous allocation (traditional) | 60-80% average waste due to pre-reservation for the maximum sequence length |
| Paged allocation (vLLM PagedAttention) | Under 5% waste via fixed-size blocks allocated on demand |
| Improvement | Up to 5.18x throughput gain from better memory utilization |
Paged KV Cache borrows the concept of virtual memory paging from operating systems. Instead of one contiguous slab, the KV cache is stored in fixed-size pages (blocks). Pages are allocated on demand as the sequence grows and freed immediately when the request completes. This means multiple requests can share the same physical memory pool without pre-allocating for worst-case lengths.
| PagedAttention Feature | Detail |
|---|---|
| Block size | Configurable (typically 16-256 tokens per block) |
| Allocation | On demand as tokens are generated, not pre-reserved |
| Deallocation | Immediate on request completion, no fragmentation residue |
| Sharing | Multiple requests share a physical memory pool via block-table indirection |
| Copy-on-write | Shared prefixes (system prompts) stored once, referenced by multiple requests |
| Implementation | vLLM (primary); also adopted by TensorRT-LLM, SGLang, and others |
| Overhead | Minimal per-block bookkeeping, negligible compute cost |
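The bookkeeping behind these features fits in a toy sketch. The `PagePool` class below is illustrative — the names and structure are my own, not vLLM's internals — but it shows the three core moves: pages allocated on demand, prefixes shared via reference counting, and immediate release.

```python
class PagePool:
    """Toy paged KV-cache allocator: fixed-size pages handed out on demand,
    with reference counting so a shared prefix is stored once.
    Illustrative sketch, not vLLM's actual implementation."""
    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))  # free physical page ids
        self.refs = [0] * num_pages         # reference counts for sharing

    def append_token(self, block_table):
        """Grow a request by one token; allocate a new physical page only
        when the last page is full. block_table holds (page_id, used)."""
        if block_table and block_table[-1][1] < self.page_size:
            page, used = block_table[-1]
            block_table[-1] = (page, used + 1)
        else:
            page = self.free.pop()          # raises if the pool is exhausted
            self.refs[page] = 1
            block_table.append((page, 1))

    def fork(self, block_table):
        """Share a prefix (e.g. a system prompt) copy-on-write style:
        the new request points at the same physical pages."""
        for page, _ in block_table:
            self.refs[page] += 1
        return list(block_table)

    def release(self, block_table):
        """Free pages as soon as a request completes."""
        for page, _ in block_table:
            self.refs[page] -= 1
            if self.refs[page] == 0:
                self.free.append(page)

pool = PagePool(num_pages=8, page_size=4)
req_a = []
for _ in range(6):            # 6 tokens -> 2 pages (ceil(6/4)), not a max-length slab
    pool.append_token(req_a)
req_b = pool.fork(req_a)      # shares both pages, allocates nothing
pool.release(req_a)           # pages survive: req_b still references them
print(len(pool.free))         # → 6
```

A real implementation stores the actual key/value tensors inside each physical page and hands the block table to the attention kernel, which follows the indirection instead of assuming contiguous memory.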
How They Stack | The Three-Layer Optimization Model
These three techniques are not alternatives. They are complementary layers that each address a different bottleneck in the LLM inference pipeline. A production-grade serving stack in 2026 can, and should, use all three simultaneously.
| Layer | What It Optimizes |
|---|---|
| Layer 1: Compute (FlashAttention 3) | Makes the attention dot-product and softmax faster by overlapping compute with memory transfers, achieving ~75% GPU FLOPS utilization |
| Layer 2: Data size (TurboQuant) | Compresses the KV cache values from 16-bit to 3-bit, reducing memory footprint 6x and attention bandwidth 8x |
| Layer 3: Memory allocation (Paged KV Cache) | Organizes the compressed KV cache into dynamically allocated pages, eliminating the 60-80% fragmentation waste |
| Combined effect | Faster math (FA-3) on smaller data (TurboQuant) stored more efficiently (Paged KV Cache). Each layer multiplies the gains of the others. |
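For intuition about what Layer 2 does, here is a generic group-wise 3-bit quantize/dequantize round trip. This is plain uniform quantization, not TurboQuant's published method (which the article describes as combining transform-based techniques), so treat it only as the shape of the idea.

```python
import numpy as np

def quantize_3bit(x, group=32):
    """Uniform per-group 3-bit quantization (8 levels). Generic sketch,
    NOT TurboQuant's actual algorithm."""
    x = x.reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 7                        # map [lo, hi] onto codes 0..7
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_3bit(codes, scale, lo):
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(1)
kv = rng.standard_normal(1024).astype(np.float32)  # stand-in for KV values
codes, scale, lo = quantize_3bit(kv)
max_err = np.abs(dequantize_3bit(codes, scale, lo) - kv).max()
# Reconstruction error is bounded by half a quantization step per group.
assert max_err <= scale.max() / 2 + 1e-6
```

One honest caveat this sketch surfaces: with 32-value groups, the per-group scale and zero point (two FP16 values) add roughly one extra bit per stored element, so practical schemes pack the 3-bit codes tightly and amortize that metadata to keep the effective compression close to the headline ratio.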
FlashAttention 3 is the only technique with a hard hardware dependency: its Hopper-specific optimizations (warp specialization, TMA instructions, FP8 tensor cores) do not exist on older GPUs. TurboQuant and Paged KV Cache are pure software optimizations that run on any CUDA-capable GPU, making them immediately deployable on existing infrastructure.
What Each Technique Does NOT Do | Common Misconceptions
Misunderstanding the scope of each technique leads to bad architecture decisions. Here is what each one explicitly does not address.
| Technique | What It Does NOT Do |
|---|---|
| FlashAttention 3 | Does not reduce KV cache memory footprint. Does not compress stored values. Does not solve memory fragmentation. Only speeds up the attention computation step. |
| TurboQuant | Does not speed up attention compute directly (though smaller data means less bandwidth). Does not solve memory allocation fragmentation. Does not change the attention algorithm. |
| Paged KV Cache | Does not compress data. Does not speed up attention compute. Does not reduce the total amount of data stored, only how efficiently it is allocated in GPU memory. |
Timeline | How We Got Here
FlashAttention 1 released (Tri Dao et al., 2022)
First IO-aware attention algorithm. 2-4x speedup over standard attention on Ampere GPUs. Foundational paper.
FlashAttention-2 released (2023)
2x speedup over FA-1. Better parallelism and work partitioning. Becomes default in PyTorch and Hugging Face.
vLLM and PagedAttention published (UC Berkeley, 2023)
Introduces paged memory management for KV cache. Memory waste drops from 60-80% to under 5%. Throughput gains up to 5.18x.
vLLM adoption explodes
vLLM becomes the default LLM serving framework. PagedAttention adopted by TensorRT-LLM, SGLang, and others.
FlashAttention 3 released (2024)
Hopper-native optimization. 1.5-2x over FA-2 on H100. FP8 support. ~75% theoretical FLOPS utilization.
Google Research releases TurboQuant
PolarQuant (ICLR 2026) + QJL (AISTATS 2026) combined. 3-bit KV cache compression with zero accuracy loss and zero calibration overhead.
Competing and Adjacent Techniques | The Broader Landscape
FlashAttention, TurboQuant, and Paged KV Cache are not the only optimization techniques in production. Several other approaches address overlapping or adjacent bottlenecks.
| Technique | What It Does and How It Relates |
|---|---|
| Speculative Decoding | Uses a small draft model to propose tokens verified by the large model. Speeds up generation throughput. Complementary to all three. |
| GQA / MQA (Grouped/Multi-Query Attention) | Reduces KV heads at the architecture level. Requires model redesign and retraining. Complementary to FA-3 and Paged KV Cache. |
| KV Cache Eviction (H2O, Scissorhands) | Drops low-attention tokens to reduce cache size. Trades accuracy for memory. Competes with TurboQuant as a compression alternative. |
| Continuous Batching | Serves multiple requests simultaneously with dynamic batch composition. Complementary to all three. Standard in vLLM. |
| Tensor Parallelism | Splits the model across multiple GPUs. Orthogonal to all three; addresses model-too-large-for-one-GPU scenarios. |
| Ring Attention | Distributes attention computation across devices for ultra-long contexts. Complementary to FA-3; addresses a different scale. |
| KIVI | Tuning-free 2-bit KV quantization (per-channel keys, per-token values). Competes with TurboQuant: more aggressive compression, higher accuracy risk. |
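Of the techniques above, speculative decoding is the easiest to sketch end to end. The toy below uses deterministic greedy "models" (plain Python functions invented for illustration); a production system verifies all draft tokens with a single batched target forward pass — which is where the speedup comes from — and uses rejection sampling over probabilities rather than exact greedy matching.

```python
def speculative_step(target, draft, seq, k=4):
    """One round of greedy speculative decoding: the cheap draft model
    proposes k tokens, the target verifies them left to right. In a real
    system the k verifications happen in one batched forward pass."""
    proposal = list(seq)
    for _ in range(k):
        proposal.append(draft(proposal))
    accepted = []
    for tok in proposal[len(seq):]:
        t = target(seq + accepted)           # target's own next token here
        accepted.append(t)
        if t != tok:                         # first disagreement: stop
            return accepted
    accepted.append(target(seq + accepted))  # bonus token on full acceptance
    return accepted

# Toy deterministic "models": next token is a function of the sequence sum.
target = lambda s: (7 * sum(s) + 3) % 11
draft = lambda s: target(s) if len(s) % 4 else (target(s) + 1) % 11  # usually right

def greedy(model, seq, n):
    seq = list(seq)
    for _ in range(n):
        seq.append(model(seq))
    return seq

out = [1, 2]
while len(out) < 14:
    out.extend(speculative_step(target, draft, out))
# Speculative output matches plain greedy decoding with the target exactly.
assert out[:14] == greedy(target, [1, 2], 12)
```

The invariant worth noticing: every emitted token is the target's own greedy choice, so the output is identical to ordinary decoding — the draft model only changes how many target evaluations can be batched per step.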
Practical Deployment | What a Production Stack Looks Like in 2026
For teams deploying LLMs at scale in 2026, the standard optimized inference stack combines all three techniques plus several adjacent optimizations. Here is the reference architecture.
| Stack Layer | Component and Role |
|---|---|
| Hardware | Nvidia H100/H200 (Hopper) or B300 (Blackwell) for FA-3 support |
| Serving framework | vLLM or TensorRT-LLM (Paged KV Cache built in) |
| Attention kernel | FlashAttention 3 (automatic in vLLM on Hopper GPUs) |
| KV cache compression | TurboQuant (drop-in layer, no retraining) |
| Batching | Continuous batching (standard in vLLM) |
| Generation | Speculative decoding with a draft model (optional, ~2x decode speed) |
| Model precision | FP8 weights + 3-bit KV cache (TurboQuant) |
| Monitoring | Tokens/sec, TTFT, time per output token, VRAM utilization |
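As a concrete starting point, a vLLM launch along these lines exercises the paged KV cache and a compressed KV cache. The model name and parameter values are illustrative, and flag names match recent vLLM releases but should be verified against your installed version; note that vLLM ships FP8 KV-cache compression — a 3-bit TurboQuant option is not an existing vLLM flag.

```shell
# Illustrative vLLM launch -- verify flags against your installed version.
# The paged KV cache is always on in vLLM; --block-size sets the page size.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 100000 \
  --block-size 16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90
```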
This stack runs on hardware from Nvidia, AMD, or Google (with FA-3 specific to Nvidia Hopper/Blackwell). For teams using the Nvidia-Groq LPX inference platform, the LPU architecture handles decode differently, and the stacking model may shift as Groq's deterministic execution model changes the bottleneck profile.