OBJECTWIRE

Independent · Verified · In-Depth


Nvidia B300 vs AMD MI300X vs Google TPU v6 | 2026 AI Chip Specs, Workloads, Cost Comparison

DGX B300 leads on raw PFLOPS, MI300X leads on single-GPU VRAM, TPU v6 Trillium leads on energy efficiency, and none of them are directly comparable without caveats

April 1, 2026 · 📖 11 min read

Full Specification Comparison | B300 vs MI300X vs TPU v6

The following tables compile the key published specifications from Nvidia, AMD, and Google. Where vendors report different metrics (system-level vs chip-level, FP4 vs FP8 vs BF16), we note the difference. These are the core reference tables.

| Specification | Nvidia DGX B300 (8x Blackwell Ultra SXM) |
| --- | --- |
| Peak compute (FP4) | 144 PFLOPS (system, 8 GPUs) |
| Peak compute per GPU (FP4) | ~18 PFLOPS |
| GPU memory per chip | ~262 GB HBM3e (estimated, 2.1 TB / 8) |
| Total system memory | 2.1 TB HBM3e |
| Memory bandwidth per GPU | Not disclosed (HBM3e class, likely ~8 TB/s) |
| Interconnect | NVLink 5, 1.8 TB/s bisection bandwidth |
| System power | ~14 kW (full DGX B300 node) |
| Process node | TSMC 4NP |
| Availability | H2 2026 (announced GTC March 2026) |
| Software stack | CUDA, cuDNN, TensorRT, NeMo, Triton |

Nvidia DGX B300 specifications from GTC 2026 announcements
| Specification | AMD Instinct MI300X |
| --- | --- |
| Peak compute (FP8) | ~2.6 PFLOPS per GPU |
| Peak compute (BF16) | ~1.3 PFLOPS per GPU |
| GPU memory per chip | 192 GB HBM3 |
| Memory bandwidth | 5.3 TB/s per GPU |
| Interconnect | Infinity Fabric, 896 GB/s GPU-to-GPU |
| TDP | 750 W per GPU |
| Process node | TSMC 5nm (compute) + 6nm (I/O) |
| Availability | Shipping since Q4 2023 |
| Software stack | ROCm, PyTorch (native), JAX, vLLM |

AMD Instinct MI300X specifications from AMD published data
| Specification | Google TPU v6 / Trillium |
| --- | --- |
| Peak compute improvement | 4.7x over TPU v5e per chip |
| HBM capacity | 2x TPU v5e (exact GB not publicly disclosed) |
| Interconnect bandwidth | 2x ICI bandwidth vs TPU v5e |
| Energy efficiency | 67%+ more efficient than TPU v5e |
| Pod scale | Up to 9,216 chips per TPU v6 pod (Trillium) |
| Process node | Not publicly disclosed |
| Availability | Google Cloud (preview H2 2025, GA 2026) |
| Software stack | JAX, TensorFlow, PyTorch/XLA, Pathways |

Google TPU v6 Trillium specifications from Google Cloud announcements

Memory Capacity | Why 192 GB Per GPU Matters for LLM Inference

AMD's MI300X stands out with 192 GB of HBM3 per GPU, the highest single-chip memory in this comparison. For LLM inference, memory capacity determines the largest model you can host on a single accelerator without splitting it across multiple chips via tensor parallelism.

A 70-billion-parameter model in FP16 requires approximately 140 GB of VRAM for the model weights alone. Add the KV cache for a 100K-token context window (as detailed in our TurboQuant KV cache analysis), and a single MI300X can host the entire model plus a substantial context window on one GPU, with no tensor parallelism needed. Neither the B300 nor the TPU v6 publishes a per-chip memory figure that matches 192 GB (though the B300 system total of 2.1 TB across 8 GPUs is higher in aggregate).

| Model Size | VRAM Required (FP16 Weights Only) |
| --- | --- |
| 7B parameters (Llama 3 7B) | ~14 GB |
| 13B parameters | ~26 GB |
| 34B parameters (Code Llama 34B) | ~68 GB |
| 70B parameters (Llama 3 70B) | ~140 GB |
| 180B parameters (Falcon 180B) | ~360 GB |
| 405B parameters (Llama 3.1 405B) | ~810 GB |

Approximate VRAM requirements for FP16 model weights, excluding KV cache and activation memory
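The figures above follow from simple arithmetic, and the same arithmetic extends to the KV cache. A minimal sketch — the Llama 3 70B architecture values used in the example (80 layers, 8 KV heads via grouped-query attention, head dimension 128) are assumptions based on Meta's published configuration, not part of the vendor specs above:

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """VRAM for model weights alone; FP16/BF16 stores 2 bytes per parameter."""
    return params_billions * bytes_per_param  # billions of params x bytes = GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per KV head."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9

# Llama 3 70B in FP16 (assumed: 80 layers, 8 KV heads, head_dim 128)
weights = weight_vram_gb(70)           # 140.0 GB of weights
kv = kv_cache_gb(80, 8, 128, 100_000)  # ~32.8 GB for a 100K-token context
print(f"{weights:.0f} GB weights + {kv:.1f} GB KV cache = {weights + kv:.0f} GB")
```

Under these assumptions, weights plus a 100K-token KV cache total roughly 173 GB, which is why the workload fits inside a single MI300X's 192 GB with headroom to spare.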

This is why MI300X has found traction with inference-heavy deployments: a single GPU can host a 70B model with room to spare, while Nvidia's H100 (80 GB) and even H200 (141 GB) require multi-GPU setups for the same workload.

Raw Compute | 144 PFLOPS FP4 and What It Actually Means

Nvidia's headline figure of 144 PFLOPS FP4 for the DGX B300 is the most aggressive throughput claim in this comparison. However, context matters: FP4 (4-bit floating point) is a low-precision format primarily useful for inference workloads where quantized models can tolerate reduced precision. Training typically runs at BF16 or FP8, where the raw PFLOPS number would be significantly lower.

| Precision Format | Typical Use Case |
| --- | --- |
| FP4 (4-bit) | Quantized inference, post-training compression |
| FP8 (8-bit) | Mixed-precision training and inference, emerging standard |
| BF16 (16-bit brain float) | Standard training precision, widely supported |
| FP16 (16-bit) | Training and inference, legacy standard |
| FP32 (32-bit) | Scientific computing, loss calculation, optimizer states |
| TF32 (TensorFloat-32) | Nvidia-specific training acceleration, 19-bit internal |

Floating-point precision formats used in AI workloads

When Nvidia quotes 144 PFLOPS FP4 and AMD quotes MI300X compute in FP8 or BF16, they are measuring different things. A direct comparison requires normalizing to the same precision, which neither vendor does in their marketing materials. This is the single biggest caveat in any chip-vs-chip comparison.
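One rough way to put the vendors' numbers on a common footing is the rule of thumb that peak tensor throughput roughly doubles each time precision width halves. Real silicon frequently deviates (sparsity features and per-format tensor-core support break the linear scaling), so treat this as a first-order sketch of the normalization problem, not a real conversion:

```python
def scale_pflops(pflops: float, from_bits: int, to_bits: int) -> float:
    """First-order normalization: assume peak throughput scales inversely
    with precision width. Real hardware often deviates from this rule."""
    return pflops * from_bits / to_bits

# Nvidia's 144 PFLOPS FP4 system figure, restated at FP8 under this assumption
system_fp8 = scale_pflops(144, 4, 8)  # 72 PFLOPS for the 8-GPU system
per_gpu_fp8 = system_fp8 / 8          # 9 PFLOPS per GPU
```

Even this crude estimate shows why precision must be stated alongside every PFLOPS claim: the same silicon reads as "144" or "72" PFLOPS depending purely on the format quoted.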

📊 The critical difference: Nvidia and AMD sell hardware you deploy in your own datacenter (or a colo). Google's TPU v6 is only available on Google Cloud, where the efficiency gains are baked into Google's per-hour pricing rather than your electricity bill. This makes direct cost comparison between TPU v6 and GPU-based systems fundamentally different: you are comparing capex + opex (GPU) versus pure opex (TPU cloud).
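That capex-versus-opex split can be made concrete with a back-of-the-envelope cost sketch. Every number fed into the example below (hardware price, electricity rate, cloud hourly rate, PUE overhead) is an illustrative placeholder, not a vendor quote:

```python
def owned_gpu_cost_usd(capex_usd: float, power_kw: float, hours: float,
                       usd_per_kwh: float = 0.10, pue: float = 1.5) -> float:
    """Owned hardware: up-front capex plus electricity, with a PUE-style
    datacenter overhead multiplier. Ignores staff, networking, cooling capex."""
    return capex_usd + power_kw * pue * hours * usd_per_kwh

def cloud_cost_usd(usd_per_hour: float, hours: float) -> float:
    """Cloud accelerators: pure opex; efficiency is baked into the rate."""
    return usd_per_hour * hours

# One year (8,760 h) of a ~14 kW node, at purely illustrative rates
year = 8_760
print(owned_gpu_cost_usd(capex_usd=400_000, power_kw=14, hours=year))
print(cloud_cost_usd(usd_per_hour=50, hours=year))
```

The structural point survives any choice of placeholder numbers: the owned-hardware curve starts high and grows slowly, the cloud curve starts at zero and grows linearly, so utilization over the amortization window decides which is cheaper.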

Interconnect and Scaling | NVLink 5 vs Infinity Fabric vs ICI

For multi-chip training at scale, the interconnect between GPUs/TPUs often matters more than per-chip compute. A fast chip connected by a slow link becomes a fast chip that spends most of its time waiting for data.

| Interconnect | Bandwidth and Scale |
| --- | --- |
| Nvidia NVLink 5 (B300) | 1.8 TB/s bisection bandwidth, 8 GPUs per node, scales via NVSwitch + InfiniBand/Spectrum-X across nodes |
| AMD Infinity Fabric (MI300X) | 896 GB/s GPU-to-GPU, 8 GPUs per node via OAM, scales via InfiniBand or RoCE across nodes |
| Google ICI (TPU v6) | 2x ICI bandwidth vs v5e (absolute figure undisclosed), native pod scaling to 9,216 chips without external networking |

Multi-chip interconnect comparison

Google's advantage here is architectural: TPU pods scale to 9,216 chips in a single interconnected fabric without requiring external InfiniBand switches. Nvidia and AMD systems require expensive InfiniBand or Ethernet networking hardware to scale beyond a single 8-GPU node, adding cost, complexity, and latency. For organizations building 10,000+ chip clusters, this difference is substantial.
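The "fast chip waiting on a slow link" point can be quantified with the standard ring all-reduce cost model, in which each device moves roughly 2(n-1)/n of the gradient payload over its slowest link. This is an idealized sketch — it ignores latency, protocol overhead, and compute/communication overlap — and the two bandwidth figures reused from the table above are not strictly like-for-like (GPU-to-GPU vs bisection bandwidth), which is itself an instance of the comparison problem:

```python
def ring_allreduce_seconds(payload_gb: float, n_devices: int,
                           link_gb_per_s: float) -> float:
    """Bandwidth term of ring all-reduce: 2*(n-1)/n * payload / link speed."""
    return 2 * (n_devices - 1) / n_devices * payload_gb / link_gb_per_s

# Syncing 140 GB of FP16 gradients (a 70B model) across one 8-accelerator node
t_slow = ring_allreduce_seconds(140, 8, 896)    # Infinity Fabric class link
t_fast = ring_allreduce_seconds(140, 8, 1800)   # NVLink 5 class bandwidth
```

Under this model, each gradient synchronization step takes a fixed wall-clock slice regardless of how fast the chips compute, so as per-chip FLOPS grow, the link's share of step time grows with it.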

Software Stack | CUDA Dominance vs ROCm Momentum vs JAX Lock-in

Hardware specifications only matter if the software can use them. The software ecosystem is often the deciding factor in chip selection, especially for teams with existing codebases and trained engineers.

| Platform | Software Ecosystem Assessment |
| --- | --- |
| Nvidia CUDA | Dominant ecosystem. PyTorch, TensorFlow, JAX, TensorRT, Triton, NeMo, vLLM all CUDA-first. Largest library of optimized kernels. Deepest talent pool. |
| AMD ROCm | Rapidly improving. PyTorch native support since 2023. vLLM, DeepSpeed, and Hugging Face TGI now support ROCm. Smaller kernel library. Growing but smaller talent pool. |
| Google JAX/XLA | Best-in-class for TPU. JAX + Pathways for large-scale training. PyTorch/XLA bridge exists but adds friction. Locked to Google Cloud. Smallest external talent pool. |

Software ecosystem comparison for AI workloads

The practical reality: most AI teams have CUDA expertise and CUDA-optimized code, and switching to ROCm or JAX/TPU carries real migration costs. This is why Nvidia maintains >80% market share despite AMD and Google offering competitive hardware: CUDA's moat is the ecosystem, not the silicon.

Practical Decision Framework | Which Chip for Which Workload

Based on the published specifications and positioning from all three vendors, here is how to choose.

| Scenario | Recommended Chip and Why |
| --- | --- |
| Building a 1,000+ GPU AI factory for frontier model training | Nvidia B300. Highest aggregate PFLOPS, NVLink 5 scaling, CUDA ecosystem, superpod reference architectures. |
| Hosting a 70B model for production inference on fewest GPUs | AMD MI300X. 192 GB per GPU means single-chip hosting with KV cache headroom. ROCm + vLLM stack is production-ready. |
| Cloud-native training with managed infrastructure | Google TPU v6. Best efficiency, pod-scale interconnect, JAX/Pathways optimization, no hardware procurement. |
| Mixed training + inference on same hardware | Nvidia B300. Most flexible across precision formats (FP4-FP32), largest software compatibility. |
| Budget-constrained inference at scale | AMD MI300X. Lower per-GPU cost than B300, competitive inference throughput, strong ROCm/vLLM support. |
| Research with Google-published models (Gemini, PaLM) | Google TPU v6. TPU-native model checkpoints, zero porting friction, Google Cloud research credits. |

Decision framework based on workload requirements

The Benchmark Problem | Why No Fair Comparison Exists

A critical caveat underpins this entire comparison: there is no single third-party benchmark that tests all three chips on the same workload under the same conditions.

| Challenge | Why It Prevents Fair Comparison |
| --- | --- |
| Different precision metrics | Nvidia leads with FP4, AMD reports FP8/BF16, Google reports relative improvement vs TPU v5e |
| System-level vs chip-level | Nvidia quotes DGX B300 (8 GPUs), AMD quotes per-GPU, Google quotes per-pod or per-chip improvement ratios |
| MLPerf participation varies | Nvidia submits aggressively to MLPerf, AMD submits selectively, Google submits for TPU but not all categories |
| Closed vs open systems | TPU v6 only runs on Google Cloud, making independent benchmarking impossible |
| Workload selection bias | Each vendor benchmarks on workloads that favor their architecture |
| Software optimization differences | CUDA kernels are more mature than ROCm kernels for the same operations, making hardware comparison unfair |

Structural barriers to fair AI chip benchmarking

The best available proxy is MLPerf, the industry's closest thing to a standardized AI benchmark. Nvidia dominates MLPerf submissions. AMD has submitted MI300X results that show competitive inference performance on Llama 2 workloads. Google submits TPU results but often in categories or configurations that do not directly overlap with GPU submissions.


Tags

#Nvidia#AMD#Google#Blackwell B300#MI300X#TPU v6#Trillium#AI Chips#Benchmark#DGX B300#HBM3#CUDA#ROCm#MLPerf




Written by

ObjectWire Technology Desk

AI Infrastructure
