Llama 405B 506 tokens/second on an H200

  • not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

  • And this is why nobody submits MLPerf results against NVIDIA.

  • Significant further optimizations, FP8 in particular. (A rough FP8 sketch follows below.)
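
To make the eight-GPU comparison concrete, here is a minimal sketch of the two schemes, using NumPy matrices as stand-ins for per-GPU work. Everything in it (the function names, shapes, and the NUM_GPUS constant) is an illustrative assumption, not NVIDIA's implementation:

```python
# Illustrative contrast of the two ways to split a model across eight GPUs.
# NumPy matmuls stand in for per-GPU work; names and shapes are assumptions.
import numpy as np

NUM_GPUS = 8

# --- Tensor parallelism: every GPU holds a slice of EVERY layer. ---
# One weight matrix is split column-wise; each GPU computes a partial
# result for the same token, and the pieces are concatenated
# (an all-gather on real hardware).
def tensor_parallel_matmul(x, weight):
    shards = np.split(weight, NUM_GPUS, axis=1)  # one column shard per GPU
    partials = [x @ w for w in shards]           # concurrent on real GPUs
    return np.concatenate(partials, axis=-1)     # all-gather step

# --- Pipeline parallelism: every GPU holds ALL of SOME layers. ---
# Layers are divided into contiguous stages; activations flow stage to
# stage, so different GPUs work on different tokens or microbatches.
def pipeline_forward(x, layers):
    stages = np.array_split(np.arange(len(layers)), NUM_GPUS)
    for stage in stages:          # each stage lives on one GPU
        for i in stage:
            x = x @ layers[i]     # then activations hop to the next GPU
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))
weight = rng.standard_normal((64, 64))
layers = [rng.standard_normal((64, 64)) for _ in range(16)]

# Sharded result matches the unsharded matmul exactly.
assert np.allclose(tensor_parallel_matmul(x, weight), x @ weight)
print(pipeline_forward(x, layers).shape)
```

The trade-off the article's table is probing: tensor parallelism keeps every GPU busy on each token but pays an all-gather per layer, while pipeline parallelism communicates only at stage boundaries but needs enough in-flight microbatches to keep all eight GPUs occupied.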
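And a rough simulation of the per-tensor FP8 (e4m3) quantization such an optimization involves. This is a sketch under the assumption of per-tensor scaling; it ignores subnormals and exponent clipping, and it is not TensorRT-LLM's actual FP8 path:

```python
# Rough NumPy simulation of per-tensor FP8 (e4m3) quantization.
# Assumption-laden sketch: real FP8 kernels run on hardware FP8 types.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_fp8_e4m3(tensor):
    """Scale into the e4m3 range and round to a 3-bit mantissa grid."""
    scale = np.max(np.abs(tensor)) / E4M3_MAX  # per-tensor scale factor
    scaled = tensor / scale
    # Spacing between representable e4m3 values near x is 2^(exponent - 3).
    exp = np.floor(np.log2(np.abs(scaled) + 1e-30))
    step = np.exp2(exp - 3)
    return np.round(scaled / step) * step, scale

def dequantize(quantized, scale):
    return quantized * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_fp8_e4m3(weights)
print("max abs round-trip error:", np.abs(weights - dequantize(q, s)).max())
```

The payoff is that FP8 halves weight and activation bandwidth relative to FP16, at the cost of the small round-trip error the script prints.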