This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.
This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes it's environment-related, sometimes shared storage, sometimes just a slightly faulty IB cable.
What kind of failures are you typically concerned with here?
300 L40s? What's this, 1998?
Hey, nice to see this here!
I'm the primary author so happy to answer any questions you might have!