At a Glance: Nvidia's DGX B300 delivers 144 PFLOPS FP4 and 2.1 TB total GPU memory in an 8-GPU system at ~14 kW. AMD's MI300X offers 192 GB HBM3 per GPU with 5.3 TB/s bandwidth, the highest single-chip memory in the group. Google's TPU v6 Trillium improves peak compute by 4.7x over TPU v5e, doubles HBM and interconnect bandwidth, and is over 67% more energy-efficient. Each chip wins a different race, and no single vendor-published benchmark covers all three.
What Each Chip Is Best At | The One-Table Summary
Before diving into specifications, here is the fundamental positioning of each chip. These are not interchangeable products. Each was designed for a different primary workload, and the “best” choice depends entirely on what you are building.
Full Specification Comparison | B300 vs MI300X vs TPU v6
The following table compiles the key published specifications from Nvidia, AMD, and Google. Where vendors report different metrics (system-level vs chip-level, FP4 vs FP8 vs BF16), we note the difference. This is the core reference table.
Memory Capacity | Why 192 GB Per GPU Matters for LLM Inference
AMD's MI300X stands out with 192 GB of HBM3 per GPU, the highest single-chip memory in this comparison. For LLM inference, memory capacity determines the largest model you can host on a single accelerator without splitting it across multiple chips via tensor parallelism.
A 70-billion-parameter model in FP16 requires approximately 140 GB of VRAM for the model weights alone. Add the KV cache for a 100K-token context window (as detailed in our TurboQuant KV cache analysis), and a single MI300X can still host the entire model plus a substantial context window on one GPU, with no tensor parallelism needed. Neither Nvidia nor Google publishes a per-chip memory figure for the B300 or TPU v6 that matches 192 GB (though the B300 system total of 2.1 TB across 8 GPUs is higher in aggregate).
This is why MI300X has found traction with inference-heavy deployments: a single GPU can host a 70B model with room to spare, while Nvidia's H100 (80 GB) and even H200 (141 GB) require multi-GPU setups for the same workload.
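A minimal sketch of that arithmetic, assuming FP16 weights (2 bytes per parameter) and grouped-query attention; the layer count, KV-head count, and head dimension below are illustrative Llama-style values, not vendor-published figures.

```python
# Back-of-envelope check: do 70B FP16 weights plus a 100K-token KV cache
# fit on a single 192 GB MI300X?

GB = 1e9  # decimal gigabytes, matching vendor memory figures

def weight_bytes(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Model weights only; FP16/BF16 is 2 bytes per parameter."""
    return n_params * bytes_per_param

def kv_cache_bytes(context_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float = 2.0,
                   batch: int = 1) -> float:
    """K and V tensors cached per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem * batch

weights = weight_bytes(70e9)                      # ~140 GB, as quoted above
kv = kv_cache_bytes(context_len=100_000, n_layers=80,
                    n_kv_heads=8, head_dim=128)   # ~33 GB with these assumed dims
print(f"weights ~{weights / GB:.0f} GB + KV cache ~{kv / GB:.0f} GB "
      f"= ~{(weights + kv) / GB:.0f} GB vs 192 GB of HBM3")
```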
Raw Compute | 144 PFLOPS FP4 and What It Actually Means
Nvidia's headline figure of 144 PFLOPS FP4 for the DGX B300 is the most aggressive throughput claim in this comparison. However, context matters: FP4 (4-bit floating point) is a low-precision format primarily useful for inference workloads where quantized models can tolerate reduced precision. Training typically runs at BF16 or FP8, where the raw PFLOPS number would be significantly lower.
When Nvidia quotes 144 PFLOPS FP4 and AMD quotes MI300X compute in FP8 or BF16, they are measuring different things. A direct comparison requires normalizing to the same precision, which neither vendor does in their marketing materials. This is the single biggest caveat in any chip-vs-chip comparison.
The precision problem: Nvidia's 144 PFLOPS FP4 headline would be roughly 72 PFLOPS at FP8 and ~36 PFLOPS at BF16 if precision scaling were linear (it is not exactly linear in practice, but the order of magnitude holds). AMD's MI300X at ~2.6 PFLOPS FP8 per GPU would be ~20.8 PFLOPS FP8 in an 8-GPU system. The gap is real, but it is not 7x; at equivalent precision it is closer to 3-4x.
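To make that normalization explicit, here is a minimal sketch assuming throughput simply halves with each doubling of precision, which, as noted above, is only approximately true on real silicon; the input figures are the vendor-quoted numbers from this section.

```python
# Normalizing the vendor throughput claims to a common precision (FP8),
# assuming linear scaling with bit width -- an approximation, as noted above.

def rescale(pflops: float, from_bits: int, to_bits: int) -> float:
    """Scale a dense-throughput figure from one precision to another."""
    return pflops * from_bits / to_bits

b300_system_fp4 = 144.0                     # Nvidia headline: 8-GPU DGX B300 at FP4
mi300x_per_gpu_fp8 = 2.6                    # AMD per-GPU figure at FP8
mi300x_system_fp8 = 8 * mi300x_per_gpu_fp8  # ~20.8 PFLOPS for an 8-GPU system

b300_system_fp8 = rescale(b300_system_fp4, from_bits=4, to_bits=8)  # ~72 PFLOPS
print(f"B300 system at FP8: ~{b300_system_fp8:.0f} PFLOPS")
print(f"MI300X system at FP8: ~{mi300x_system_fp8:.1f} PFLOPS")
print(f"gap at matched precision: ~{b300_system_fp8 / mi300x_system_fp8:.1f}x")
```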
Energy Efficiency | Google's 67% Advantage and What It Costs
Google positions TPU v6 Trillium primarily on efficiency rather than raw performance. The claim of 67%+ energy efficiency improvement over TPU v5e is significant because power cost is increasingly the dominant line item in AI infrastructure budgets, often exceeding the cost of the hardware itself over a 3-year deployment.
The critical difference: Nvidia and AMD sell hardware you deploy in your own datacenter (or a colo). Google's TPU v6 is only available on Google Cloud, where the efficiency gains are baked into Google's per-hour pricing rather than your electricity bill. This makes direct cost comparison between TPU v6 and GPU-based systems fundamentally different: you are comparing capex plus opex (GPU) versus pure opex (TPU cloud).
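One way to frame that capex-plus-opex versus pure-opex distinction is a three-year cost sketch like the one below. The ~14 kW figure matches the system power quoted at the top of this article; every other input (purchase price, PUE, electricity tariff, cloud hourly rate, utilization) is a placeholder to be replaced with your own quotes, not vendor pricing.

```python
# Capex + opex (owned GPU node) vs pure opex (cloud TPU) over a deployment.
# All numeric inputs below are placeholders -- substitute your own quotes.

HOURS_PER_YEAR = 365 * 24

def owned_node_cost(capex: float, power_kw: float, pue: float,
                    usd_per_kwh: float, years: float = 3.0) -> float:
    """Purchase price plus electricity (scaled by datacenter PUE)."""
    return capex + power_kw * pue * years * HOURS_PER_YEAR * usd_per_kwh

def cloud_accelerator_cost(usd_per_hour: float, utilization: float,
                           years: float = 3.0) -> float:
    """Pure opex: hourly rate times the hours you actually run."""
    return usd_per_hour * utilization * years * HOURS_PER_YEAR

# Hypothetical example values, not vendor pricing.
gpu_3yr = owned_node_cost(capex=400_000, power_kw=14, pue=1.3, usd_per_kwh=0.10)
tpu_3yr = cloud_accelerator_cost(usd_per_hour=20.0, utilization=0.6)
print(f"owned GPU node over 3 years: ~${gpu_3yr:,.0f}")
print(f"cloud TPU slice over 3 years: ~${tpu_3yr:,.0f}")
```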
Interconnect and Scaling | NVLink 5 vs Infinity Fabric vs ICI
For multi-chip training at scale, the interconnect between GPUs/TPUs often matters more than per-chip compute. A fast chip connected by a slow link becomes a fast chip that spends most of its time waiting for data.
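A rough way to test the "waiting for data" point is to compare an all-reduce estimate against per-step compute time. The sketch below uses the textbook ring all-reduce cost model; the gradient size, link bandwidth, and compute time are placeholder assumptions, not measured or vendor-published values.

```python
# Is a training step compute-bound or interconnect-bound? A ring all-reduce
# moves roughly 2*(N-1)/N of the gradient volume per chip.

def ring_allreduce_seconds(grad_bytes: float, n_chips: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of a ring all-reduce (ignores latency terms)."""
    return 2 * (n_chips - 1) / n_chips * grad_bytes / link_bytes_per_s

# Placeholder assumptions: 70B BF16 gradients (~140 GB), 8 chips,
# 900 GB/s effective per-chip fabric, 0.5 s of compute per step.
comm_s = ring_allreduce_seconds(grad_bytes=140e9, n_chips=8, link_bytes_per_s=900e9)
compute_s = 0.5
print(f"all-reduce ~{comm_s:.2f}s vs compute ~{compute_s:.2f}s per step; "
      f"interconnect-bound: {comm_s > compute_s}")
```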
Google's advantage here is architectural: TPU pods scale to 9,216 chips in a single interconnected fabric without requiring external InfiniBand switches. Nvidia and AMD systems require expensive InfiniBand or Ethernet networking hardware to scale beyond a single 8-GPU node, adding cost, complexity, and latency. For organizations building 10,000+ chip clusters, this difference is substantial.
Software Stack | CUDA Dominance vs ROCm Momentum vs JAX Lock-in
Hardware specifications only matter if the software can use them. The software ecosystem is often the deciding factor in chip selection, especially for teams with existing codebases and trained engineers.
The practical reality: most AI teams have CUDA expertise and CUDA-optimized code. Switching to ROCm or JAX/TPU carries real migration costs. This is why Nvidia maintains >80% market share despite AMD and Google offering competitive hardware: CUDA's moat is the ecosystem, not the silicon.
Practical Decision Framework | Which Chip for Which Workload
Based on the published specifications and positioning from all three vendors, here is how to choose.
The Benchmark Problem | Why No Fair Comparison Exists
A critical caveat underpins this entire comparison: there is no single third-party benchmark that tests all three chips on the same workload under the same conditions.
The best available proxy is MLPerf, the industry's closest thing to a standardized AI benchmark. Nvidia dominates MLPerf submissions. AMD has submitted MI300X results that show competitive inference performance on Llama 2 workloads. Google submits TPU results but often in categories or configurations that do not directly overlap with GPU submissions.
If you want a truly rigorous comparison, the best next step is to benchmark all three on one workload, such as MLPerf inference on Llama 3 70B, measuring tokens per second per dollar. No vendor publishes this figure, which tells you everything about how competitive the market actually is.
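A minimal sketch of that tokens-per-second-per-dollar figure of merit; the throughput and hourly-cost numbers below are placeholders to be filled in with your own measurements and pricing, not results for any real system.

```python
# Tokens per second per dollar: the normalizing metric proposed above.
# Fill in your own measured throughput and fully loaded hourly cost.

def tokens_per_sec_per_dollar(tokens_per_sec: float, usd_per_hour: float) -> float:
    """Measured decode throughput divided by the hourly cost of the system."""
    return tokens_per_sec / usd_per_hour

# Placeholder measurements on one fixed workload (e.g. Llama 3 70B inference).
measurements = {
    "system_a": (100_000, 80.0),  # tokens/sec, USD/hour -- hypothetical
    "system_b": (60_000, 45.0),
}
for name, (tps, cost) in measurements.items():
    print(f"{name}: {tokens_per_sec_per_dollar(tps, cost):,.0f} tokens/sec per $/hr")
```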
What Is Coming Next | B300 Ultra, MI350X, TPU v7
Related Coverage Across ObjectWire
This comparison connects to a broader set of AI infrastructure reporting across ObjectWire. The following articles provide deeper context on specific aspects of the chip war.