H100 vs GB200 NVL72 Training Benchmarks – Power, TCO, and Reliability Analysis, Software Improvement Over Time

https://semianalysis.com/2025/08/20/h100-vs-gb200-nvl72-training-benchmarks/
Frontier model training demands a careful analysis of cost, efficiency, and reliability across AI hardware. This report benchmarks training on over 2,000 H100 GPUs, measuring Model FLOPS Utilization (MFU), total cost of ownership (TCO), and energy per token for models such as Llama 3 and GPT-3. It then compares these results against the newer GB200 NVL72 architecture, highlighting the reliability issues and software immaturity that currently prevent large-scale training runs on the new platform. While the GB200 software ecosystem is expected to mature, the H100 and Google TPUs remain the primary systems for completing frontier-scale training today.
0 points by will22 | 1 month ago

Comments (0)
