Do any Heroku users have a way of making the platform more robust? Simple redundancy wouldn't have helped in this recent case. I suppose having apps in multiple regions might have done the trick, but then you have to maintain a hot spare and manually spin it up when needed. Also, our business has customers exclusively in the EU, so a US server wouldn't make sense, and there are GDPR issues with having servers in the US.
Anyhow, I'm just looking for some advice, and really something to tell my manager about how we plan to ensure this doesn't happen again. But I don't think that's possible.
1.5 hours of complete downtime for our apps now.
We're getting "App boot timeout" errors for every request and "One or more of these arguments were missing: uid, gid, gateway, somaxconn, event_fd, out_fd" for scheduled tasks. The incident report has been up for almost four hours and it's not getting better: https://status.heroku.com/incidents/2590