Llama 405B 506 tokens/second on an H200

  • not "an H200", "In the table above, tensor parallelism is compared to pipeline parallelism with each across eight GPUs"

  • And this is why nobody submits MLPerf results against NVIDIA.

  • Significant further optimizations, FP8 in particular. (A rough FP8 sketch follows below.)
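
To make the eight-GPU comparison concrete, here is a minimal sketch of the two schemes, using NumPy matrices as stand-ins for per-GPU work. Everything in it (the function names, shapes, and the NUM_GPUS constant) is an illustrative assumption, not NVIDIA's implementation:

```python
# Illustrative contrast of the two ways to split a model across eight GPUs.
# NumPy matmuls stand in for per-GPU work; names and shapes are assumptions.
import numpy as np

NUM_GPUS = 8

# --- Tensor parallelism: every GPU holds a slice of EVERY layer. ---
# One weight matrix is split column-wise; each GPU computes a partial
# result for the same token, and the pieces are concatenated
# (an all-gather on real hardware).
def tensor_parallel_matmul(x, weight):
    shards = np.split(weight, NUM_GPUS, axis=1)  # one column shard per GPU
    partials = [x @ w for w in shards]           # concurrent on real GPUs
    return np.concatenate(partials, axis=-1)     # all-gather step

# --- Pipeline parallelism: every GPU holds ALL of SOME layers. ---
# Layers are divided into contiguous stages; activations flow stage to
# stage, so different GPUs work on different tokens or microbatches.
def pipeline_forward(x, layers):
    stages = np.array_split(np.arange(len(layers)), NUM_GPUS)
    for stage in stages:          # each stage lives on one GPU
        for i in stage:
            x = x @ layers[i]     # then activations hop to the next GPU
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))
weight = rng.standard_normal((64, 64))
layers = [rng.standard_normal((64, 64)) for _ in range(16)]

# Sharded result matches the unsharded matmul exactly.
assert np.allclose(tensor_parallel_matmul(x, weight), x @ weight)
print(pipeline_forward(x, layers).shape)
```

The trade-off the article's table is probing: tensor parallelism keeps every GPU busy on each token but pays an all-gather per layer, while pipeline parallelism communicates only at stage boundaries but needs enough in-flight microbatches to keep all eight GPUs occupied.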
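And a rough simulation of the per-tensor FP8 (e4m3) quantization such an optimization involves. This is a sketch under the assumption of per-tensor scaling; it ignores subnormals and exponent clipping, and it is not TensorRT-LLM's actual FP8 path:

```python
# Rough NumPy simulation of per-tensor FP8 (e4m3) quantization.
# Assumption-laden sketch: real FP8 kernels run on hardware FP8 types.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize_fp8_e4m3(tensor):
    """Scale into the e4m3 range and round to a 3-bit mantissa grid."""
    scale = np.max(np.abs(tensor)) / E4M3_MAX  # per-tensor scale factor
    scaled = tensor / scale
    # Spacing between representable e4m3 values near x is 2^(exponent - 3).
    exp = np.floor(np.log2(np.abs(scaled) + 1e-30))
    step = np.exp2(exp - 3)
    return np.round(scaled / step) * step, scale

def dequantize(quantized, scale):
    return quantized * scale

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_fp8_e4m3(weights)
print("max abs round-trip error:", np.abs(weights - dequantize(q, s)).max())
```

The payoff is that FP8 halves weight and activation bandwidth relative to FP16, at the cost of the small round-trip error the script prints.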