Full Specification Comparison | B300 vs MI300X vs TPU v6
The following table compiles every published specification from Nvidia, AMD, and Google. Where vendors report different metrics (system-level vs chip-level, FP4 vs FP8 vs BF16), we note the difference. This is the core reference table.
| Specification | Nvidia DGX B300 (8x Blackwell Ultra SXM) |
|---|---|
| Peak compute (FP4) | 144 PFLOPS (system, 8 GPUs) |
| Peak compute per GPU (FP4) | ~18 PFLOPS |
| GPU memory per chip | ~262 GB HBM3e (estimated, 2.1 TB / 8) |
| Total system memory | 2.1 TB HBM3e |
| Memory bandwidth per GPU | Not disclosed (HBM3e class, likely ~8 TB/s) |
| Interconnect | NVLink 5, 1.8 TB/s bisection bandwidth |
| System power | ~14 kW (full DGX B300 node) |
| Process node | TSMC 4NP |
| Availability | H2 2026 (announced GTC March 2026) |
| Software stack | CUDA, cuDNN, TensorRT, NeMo, Triton |
| Specification | AMD Instinct MI300X |
|---|---|
| Peak compute (FP8) | ~2.6 PFLOPS per GPU |
| Peak compute (BF16) | ~1.3 PFLOPS per GPU |
| GPU memory per chip | 192 GB HBM3 |
| Memory bandwidth | 5.3 TB/s per GPU |
| Interconnect | Infinity Fabric, 896 GB/s GPU-to-GPU |
| TDP | 750W per GPU |
| Process node | TSMC 5nm (compute) + 6nm (I/O) |
| Availability | Shipping since Q4 2023 |
| Software stack | ROCm, PyTorch (native), JAX, vLLM |
| Specification | Google TPU v6 / Trillium |
|---|---|
| Peak compute improvement | 4.7x over TPU v5e per chip |
| HBM capacity | 2x TPU v5e (exact GB not publicly disclosed) |
| Interconnect bandwidth | 2x ICI bandwidth vs TPU v5e |
| Energy efficiency | 67%+ more efficient than TPU v5e |
| Pod scale | Up to 9,216 chips per TPU v6 pod (Trillium) |
| Process node | Not publicly disclosed |
| Availability | Google Cloud (preview H2 2025, GA 2026) |
| Software stack | JAX, TensorFlow, PyTorch/XLA, Pathways |
Memory Capacity | Why 192 GB Per GPU Matters for LLM Inference
AMD's MI300X stands out with 192 GB of HBM3 per GPU, the highest single-chip memory in this comparison. For LLM inference, memory capacity determines the largest model you can host on a single accelerator without splitting it across multiple chips via tensor parallelism.
A 70-billion-parameter model in FP16 requires approximately 140 GB of VRAM for the model weights alone. Add the KV cache for a 100K-token context window (as detailed in our TurboQuant KV cache analysis), and a single MI300X can still host the entire model plus a substantial context window on one GPU, with no tensor parallelism needed. Neither the B300 nor the TPU v6 publishes a per-chip memory figure that matches 192 GB (though the B300 system total of 2.1 TB across 8 GPUs is higher in aggregate).
| Model Size | VRAM Required (FP16 Weights Only) |
|---|---|
| 7B parameters (Llama 2 7B) | ~14 GB |
| 13B parameters | ~26 GB |
| 34B parameters (Code Llama 34B) | ~68 GB |
| 70B parameters (Llama 3 70B) | ~140 GB |
| 180B parameters (Falcon 180B) | ~360 GB |
| 405B parameters (Llama 3.1 405B) | ~810 GB |
This is why MI300X has found traction with inference-heavy deployments: a single GPU can host a 70B model with room to spare, while Nvidia's H100 (80 GB) and even H200 (141 GB) require multi-GPU setups for the same workload.
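The arithmetic behind the table is simple enough to sketch. The snippet below estimates FP16 weight memory (2 bytes per parameter) and the FP16 KV-cache footprint for a 100K-token context. The Llama-3-70B-style shape used here (80 layers, GQA with 8 KV heads of dimension 128) is an illustrative assumption, not a vendor figure.

```python
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """FP16/BF16 store 2 bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Per token, each layer caches one K and one V vector per KV head."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# Illustrative Llama-3-70B-style shape: 80 layers, 8 KV heads (GQA), head_dim 128
weights = weights_gb(70)               # ~140 GB
kv = kv_cache_gb(100_000, 80, 8, 128)  # ~32.8 GB
print(f"{weights:.0f} GB weights + {kv:.1f} GB KV cache "
      f"= {weights + kv:.1f} GB (MI300X capacity: 192 GB)")
```

Under these assumptions, the full 70B model plus a 100K-token cache lands around 173 GB, which is why a single 192 GB MI300X can serve it without tensor parallelism.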
Raw Compute | 144 PFLOPS FP4 and What It Actually Means
Nvidia's headline figure of 144 PFLOPS FP4 for the DGX B300 is the most aggressive throughput claim in this comparison. However, context matters: FP4 (4-bit floating point) is a low-precision format primarily useful for inference workloads where quantized models can tolerate reduced precision. Training typically runs at BF16 or FP8, where the raw PFLOPS number would be significantly lower.
| Precision Format | Typical Use Case |
|---|---|
| FP4 (4-bit) | Quantized inference, post-training compression |
| FP8 (8-bit) | Mixed-precision training and inference, emerging standard |
| BF16 (16-bit brain float) | Standard training precision, widely supported |
| FP16 (16-bit) | Training and inference, legacy standard |
| FP32 (32-bit) | Scientific computing, loss calculation, optimizer states |
| TF32 (TensorFloat-32) | Nvidia-specific training acceleration, 19-bit internal |
When Nvidia quotes 144 PFLOPS FP4 and AMD quotes MI300X compute in FP8 or BF16, they are measuring different things. A direct comparison requires normalizing to the same precision, which neither vendor does in their marketing materials. This is the single biggest caveat in any chip-vs-chip comparison.
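One rough way to put the vendors on a common axis is the rule of thumb that dense tensor-core throughput roughly doubles each time precision halves. This is an assumption, not a measurement: actual ratios depend on sparsity features, tensor-core support per format, and memory bandwidth. A sketch:

```python
def rescale_pflops(pflops: float, from_bits: int, to_bits: int) -> float:
    """Rule-of-thumb conversion: throughput scales inversely with bit width."""
    return pflops * from_bits / to_bits

b300_fp4_per_gpu = 18.0  # PFLOPS per GPU, from the spec table above

# Rough normalization of Nvidia's FP4 claim to AMD's reported formats
b300_fp8_est = rescale_pflops(b300_fp4_per_gpu, 4, 8)    # ~9 PFLOPS
b300_bf16_est = rescale_pflops(b300_fp4_per_gpu, 4, 16)  # ~4.5 PFLOPS

print(f"B300 est. FP8:  {b300_fp8_est:.1f} PFLOPS vs MI300X FP8 ~2.6")
print(f"B300 est. BF16: {b300_bf16_est:.1f} PFLOPS vs MI300X BF16 ~1.3")
```

Even under this crude normalization, the headline 144 PFLOPS shrinks to a much smaller (though still leading) per-GPU figure at training precisions.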
The critical difference: Nvidia and AMD sell hardware you deploy in your own datacenter (or a colo). Google's TPU v6 is only available on Google Cloud, where the efficiency gains are baked into Google's per-hour pricing rather than your electricity bill. This makes direct cost comparison between TPU v6 and GPU-based systems fundamentally different: you are comparing capex plus opex (GPU) versus pure opex (TPU cloud).
Interconnect and Scaling | NVLink 5 vs Infinity Fabric vs ICI
For multi-chip training at scale, the interconnect between GPUs/TPUs often matters more than per-chip compute. A fast chip connected by a slow link becomes a fast chip that spends most of its time waiting for data.
| Interconnect | Bandwidth and Scale |
|---|---|
| Nvidia NVLink 5 (B300) | 1.8 TB/s bisection bandwidth, 8 GPUs per node, scales via NVSwitch + InfiniBand/Spectrum-X across nodes |
| AMD Infinity Fabric (MI300X) | 896 GB/s GPU-to-GPU, 8 GPUs per node via OAM, scales via InfiniBand or RoCE across nodes |
| Google ICI (TPU v6) | 2x ICI bandwidth vs v5e (absolute figure undisclosed), native pod scaling to 9,216 chips without external networking |
Google's advantage here is architectural: TPU pods scale to 9,216 chips in a single interconnected fabric without requiring external InfiniBand switches. Nvidia and AMD systems require expensive InfiniBand or Ethernet networking hardware to scale beyond a single 8-GPU node, adding cost, complexity, and latency. For organizations building 10,000+ chip clusters, this difference is substantial.
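Link bandwidth can be turned into a rough communication-time estimate. In an ideal ring all-reduce over n devices, each link carries about 2(n-1)/n of the payload, so gradient sync time scales with payload size over bandwidth. The sketch below reuses the per-GPU link figures from the table; the 140 GB payload (BF16 gradients for a 70B model) is an illustrative assumption.

```python
def ring_allreduce_seconds(payload_gb: float, n_devices: int,
                           link_gb_per_s: float) -> float:
    """Ideal ring all-reduce: each link carries ~2*(n-1)/n of the payload."""
    return 2 * (n_devices - 1) / n_devices * payload_gb / link_gb_per_s

payload = 140.0  # BF16 gradients for a 70B model (illustrative)
t_mi300x = ring_allreduce_seconds(payload, 8, 896)   # ~0.27 s per sync
t_b300 = ring_allreduce_seconds(payload, 8, 1800)    # ~0.14 s per sync
print(f"MI300X node: {t_mi300x:.2f} s, B300 node: {t_b300:.2f} s")
```

These are best-case figures that ignore latency, protocol overhead, and compute/communication overlap, but they show why link bandwidth, not per-chip FLOPS, often bounds multi-chip training throughput.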
Software Stack | CUDA Dominance vs ROCm Momentum vs JAX Lock-in
Hardware specifications only matter if the software can use them. The software ecosystem is often the deciding factor in chip selection, especially for teams with existing codebases and trained engineers.
| Platform | Software Ecosystem Assessment |
|---|---|
| Nvidia CUDA | Dominant ecosystem. PyTorch, TensorFlow, JAX, TensorRT, Triton, NeMo, vLLM all CUDA-first. Largest library of optimized kernels. Deepest talent pool. |
| AMD ROCm | Rapidly improving. PyTorch native support since 2023. vLLM, DeepSpeed, and Hugging Face TGI now support ROCm. Smaller kernel library. Growing but smaller talent pool. |
| Google JAX/XLA | Best-in-class for TPU. JAX + Pathways for large-scale training. PyTorch/XLA bridge exists but adds friction. Locked to Google Cloud. Smallest external talent pool. |
The practical reality: most AI teams have CUDA expertise and CUDA-optimized code. Switching to ROCm or JAX/TPU carries real migration costs. This is why Nvidia maintains >80% market share despite AMD and Google offering competitive hardware: CUDA's moat is the ecosystem, not the silicon.
Practical Decision Framework | Which Chip for Which Workload
Based on the published specifications and positioning from all three vendors, here is how to choose.
| Scenario | Recommended Chip and Why |
|---|---|
| Building a 1,000+ GPU AI factory for frontier model training | Nvidia B300. Highest aggregate PFLOPS, NVLink 5 scaling, CUDA ecosystem, superpod reference architectures. |
| Hosting a 70B model for production inference on fewest GPUs | AMD MI300X. 192 GB per GPU means single-chip hosting with KV cache headroom. ROCm + vLLM stack is production-ready. |
| Cloud-native training with managed infrastructure | Google TPU v6. Best efficiency, pod-scale interconnect, JAX/Pathways optimization, no hardware procurement. |
| Mixed training + inference on same hardware | Nvidia B300. Most flexible across precision formats (FP4-FP32), largest software compatibility. |
| Budget-constrained inference at scale | AMD MI300X. Lower per-GPU cost than B300, competitive inference throughput, strong ROCm/vLLM support. |
| Research with Google-published models (Gemini, PaLM) | Google TPU v6. TPU-native model checkpoints, zero porting friction, Google Cloud research credits. |
The Benchmark Problem | Why No Fair Comparison Exists
A critical caveat underpins this entire comparison: there is no single third-party benchmark that tests all three chips on the same workload under the same conditions.
| Challenge | Why It Prevents Fair Comparison |
|---|---|
| Different precision metrics | Nvidia leads with FP4, AMD reports FP8/BF16, Google reports relative improvement vs TPU v5e |
| System-level vs chip-level | Nvidia quotes DGX B300 (8 GPUs), AMD quotes per-GPU, Google quotes per-pod or per-chip improvement ratios |
| MLPerf participation varies | Nvidia submits aggressively to MLPerf, AMD submits selectively, Google submits for TPU but not all categories |
| Closed vs open systems | TPU v6 only runs on Google Cloud, making independent benchmarking impossible |
| Workload selection bias | Each vendor benchmarks on workloads that favor their architecture |
| Software optimization differences | CUDA kernels are more mature than ROCm kernels for the same operations, making hardware comparison unfair |
The best available proxy is MLPerf, the industry's closest thing to a standardized AI benchmark. Nvidia dominates MLPerf submissions. AMD has submitted MI300X results that show competitive inference performance on Llama 2 workloads. Google submits TPU results but often in categories or configurations that do not directly overlap with GPU submissions.
Related Coverage Across ObjectWire
This comparison connects to a broader set of AI infrastructure reporting across ObjectWire. The following articles provide deeper context on specific aspects of the chip war.
📰 Related Stories
- Nvidia | News, Coverage, and Analysis Hub
- April 2026 · Nvidia Blackwell B300 | Data Center Demand Surge 2026
- March 2026 · Nvidia Groq $20B Deal | LPX Inference Platform at GTC 2026
- March 2026 · TurboQuant KV Cache Compression | 6x Memory, 8x Speed
- March 2026 · Nvidia $4B Photonics | Lumentum, Coherent, AI Bottleneck