Show HN: How-to guide on training Llama-405B using PyTorch distributed APIs

  • The guide does not say how efficient this run was in terms of GPU utilization (achieved TOPS / theoretical peak TOPS).

  • Let me know if there are any questions or suggestions!

    Feel free to open an issue on GitHub; contributions are welcome too.
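
    On the utilization question raised earlier in the thread: a rough way to estimate that ratio is model FLOPs utilization (MFU). The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass, which ignores attention cost and any activation recomputation. The function name and every number in the example call are hypothetical placeholders, not measurements from the guide.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU as achieved training FLOP/s over aggregate peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    which leaves out attention FLOPs and activation recomputation.
    """
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # All values below are hypothetical placeholders; substitute the
    # measured throughput and the vendor's peak figure for your hardware.
    mfu = model_flops_utilization(
        n_params=405e9,            # parameter count of the model
        tokens_per_second=1000.0,  # measured cluster-wide throughput (assumed)
        n_gpus=64,                 # number of GPUs in the job (assumed)
        peak_flops_per_gpu=1e15,   # per-GPU peak for the dtype used (assumed)
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

    Plugging in the logged tokens/s and the datasheet peak for the training dtype gives a number that can be compared across runs.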