I feel like an idiot. MS featured my Azure startup today, quoting me about overall stability etc (which has been the case for us, until today). They then proceeded to go down, taking all our production systems with them.
(yes we do have AWS, too)
Sigh.
The idea of cloud storage being down is less of an issue - I don't like it, but I understand it. What bothers me about this is:
1. I was never notified of the outage. I noticed it myself when attempting to log into one of my VMs and then started looking for status updates. Sadly, the best status updates I got were here on Hacker News.
2. When my servers did come back up, at least one of my IP addresses had changed, which meant I had to update all of the relevant DNS entries (which, as everyone here no doubt knows, can take up to 48 hours to propagate). I was never notified of this change in any way.
So, the worst part about this is that zero communication has come out of Microsoft - we first started seeing issues on Sunday and filed a ticket, had that ticket open while this larger outage happened, and haven't gotten a single email saying there's an outage. I found out about it from, sigh, BuzzFeed.
Question - are AWS or GCE better at proactively messaging when there's an outage?
I run a site that monitors cloud service availability. Based on VMs and Blob storage containers I maintain and monitor, the outage affected every US Azure region with 1-2 hours of downtime: https://cloudharmony.com/status-for-azure
That status table with all the randomly located green checks is painful to look at... I guess a green check in the 'Global' column implies a green check in all location specific columns? But what about all the rows which have no Global green check, but most columns are still empty? Are those regions where the service is not deployed? Can we gray out those boxes or something if they are 'N/A'?
Also, funny if you try to zoom out in Chrome to see the whole thing, the row headers get out of alignment.
Why would I want to 'X' out specific rows/columns in the table? It was so complicated to begin with, someone thought adding more complication through end-user customization was a good idea? I just noticed, you can even expand some of the rows...
Seriously, a status page should tell you either "It's up" or "What's down". It's not even showing history over time, this is just a snapshot. The text at the top directly contradicts the icons in the table, making the whole thing even more ridiculous.
The footnote at the bottom is the best, "The Australia Regions are available only to customers with billing addresses in Australia and New Zealand." Thanks for that useful nugget! /s
The most damaging part to me is that "All good! Everything is running great." message on the status page.
Mistakes happen, services go down, I can get over that. What matters is how it's dealt with. At the moment I would not want to be an Azure customer dealing with 9 hours+ downtime whilst MS are saying everything is great. At the very least change it to "Having some issues" or similar!
The postmortem for this should make for a good read. How does storage go down in eleven regions at once?
Our sites have been down for more than 3 hours now.
EDIT2: Now the databases are down, this is costing us a lot of money. EDIT: Just went up again.
It would be great if anyone knows how to mitigate these - what can I do to protect myself against this in the future? (Except leave Azure)
So much for the idea of 99.999% uptime with the magical "cloud" buzzword. I noticed during this downtime in North America that Word Online wasn't functioning as my daughter tried to use it to do some homework.
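For a sense of scale, here is a rough sketch of what those nines actually allow. The function names are mine, not from any SLA document, and it assumes a plain 365-day year.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1.0 - availability_pct / 100.0)


def availability_after_outage(outage_minutes: float, period_days: float = 365.0) -> float:
    """Availability (in percent) over the period if the only downtime is one outage."""
    total_minutes = period_days * MINUTES_PER_DAY
    if not 0.0 <= outage_minutes <= total_minutes:
        raise ValueError("outage_minutes must fit inside the period")
    return 100.0 * (1.0 - outage_minutes / total_minutes)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes per year")
```

Five nines works out to about 5.26 minutes per year, and three nines to about 8.76 hours. A single 9-hour outage with no other downtime leaves you at roughly 99.897% for the year.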
You should try Google's App Engine (paid premium account) tech support when your critical files disappear. Can't be any worse than this ... That's the problem with these hosted cloud solutions, your systems are at the mercy of the bad tech support. Try explaining that to your own customers ...
Actual link to status page: http://azure.microsoft.com/en-us/status/#current
(not that convenient to copy paste the OP link from a mobile device)
Microsoft are refusing to help us with our downed servers because we don't have a support contract. The outage is their issue not ours!!
As more and more services and apps depend on 'the cloud', I'm wondering, how many of them would survive a major cloud outage: the cloud company going bankrupt, stock market crash or economic meltdown, a malware exploiting a major server-side bug (like heartbleed or shellshock, but worse) wiping or encrypting the data on the infrastructure/user machines.
How much of the user's data would be forever lost in such an event ?
The other aspect is privacy - in theory, all users' data can be stored and accessed forever, e.g. 20 years from now, when the reincarnation of someone like Stalin comes to power.
Anyway, the point I'm trying to make is that we should design our services or apps with this in mind - the cloud can and will fail from time to time, maybe forever. So, if possible, use the cloud as a 'bonus' feature, a means to back up data, and store users' data offline so that when the dark day comes the user at least still has their data.
Reality call: ANY and ALL Cloud services, be it Google, Azure, AWS etc, will be down for hours at some point every few years.
Regretting the decision to go with Azure. Talk about terrible timing. We have media outlets interested in our site, we send info and the site is dead. Talk about a crap first impression.
Our VMs and websites on USEast are unreachable, however our storage seems to be working fine. There is something very backwards with how they are communicating this outage.
This may be greater than just West Europe. I personally have servers in US East that are unreachable, and there are a few reports from others of partial unavailability for their US-based servers.
I wonder how many customers Azure just lost due to their unexpected 2 day fiasco
We've been noticing ups and downs for the last few hours of our VM powering an important database in West Europe.
Seriously considering another layer above azure to mitigate this in the future. Very disappointing to see.
At least initially their status indicated they're handling the problem, but lately it's just been "All Good", and they said on Twitter that they resolved it, but it's not at 100% yet: http://azure.microsoft.com/en-us/status/
Oh no, did we break the status page too? Sorry Azure team, really didn't mean to pile on!
Yup! Azure websites and Storage are down in multiple regions.
Storage, Websites and Visual Studio Online - Multiple Regions - Partial Service Interruption (5 mins ago)
Starting at 19 Nov 2014 00:52 UTC we are experiencing a connectivity issue to Azure Services including Storage, Websites and Visual Studio Online. The next update will be provided in 60 minutes.
Well, their status page is telling lies.
Storage is the source of the outage, and most of the services rely on it, so they are all impacted.
Still down even 2 hours later, regardless of the status page saying it's OK.
Judging by how cloud services "frequently" go down when everything is normal, it makes me wonder what would happen in case of a real problem (volcano eruption, social unrest, nuclear disaster, alien invasion ...). I still don't get the cloud infatuation, and no you don't have to get off my lawn, I'm "only" 36 (yeah I know, in IT I'm already a dinosaur).
My VMs are down. This must be something major.
Seems to be back up now, my site (https://ian.sh) was down for a while.
There really isn't anything I can do either. My VM isn't back up yet. I'd go to sleep and just expect it to be online in the morning (when it really matters), but I'm afraid a drive won't reattach or something like that. Meanwhile, twiddling thumbs...hit F5...twiddle thumbs...
The page cannot be displayed because an internal server error has occurred.
Their error pages are less graceful than mine.
Come on! Cut them some slack... they probably aren't very experienced at managing their Linux servers! ;)
This is why I keep a refurbished server-class machine handy as a working backup, so that you can restore onto it if the service is not back within a few minutes. Or keep another copy of the VM/DB with another provider, like Rackspace or something.
Do you run multi-region or maybe multi-provider setups? How do you migrate your instances from failed regions to healthy ones? How do you route users to the healthy regions? DNS? Do you think anycast could be an alternative?
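One common shape for the "route users to the healthy region" part is a health check that walks a priority-ordered list of endpoints and picks the first one that answers. Below is a minimal sketch of that idea, not anyone's production setup: the endpoint URLs are placeholders, `http_probe` and `pick_healthy` are names made up for illustration, and the step that actually repoints DNS (or a load balancer) at the chosen endpoint is left out because it depends entirely on your provider.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False


def pick_healthy(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint (in priority order) that passes the probe."""
    for url in endpoints:
        if probe(url):
            return url
    return None


if __name__ == "__main__":
    # Hypothetical health-check URLs, primary region first.
    regions = [
        "https://us-east.example.com/health",
        "https://west-europe.example.com/health",
        "https://other-provider.example.com/health",
    ]
    print(pick_healthy(regions) or "nothing is healthy")
```

Run from somewhere outside the provider you are checking, otherwise the checker goes down with everything else. If DNS is the switch, a low TTL on the record matters more than the check itself.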
My website, my web application for member management + my clients are down :s, I really don't like this...
Didn't receive any calls yet, but I don't think that will take long.
This is nice. My web site server IP was changed when the server came back up. So now I have to update all of the site DNS settings.
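If the IP can change under you without notice, one cheap safeguard is a script that compares what DNS currently returns against the address the server actually has. A minimal sketch, assuming IPv4 A records only; the hostnames and the 203.0.113.x addresses are documentation placeholders, and `resolve_ipv4` and `stale_hostnames` are illustrative names.

```python
import socket
from typing import Callable, Iterable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        # Covers socket.gaierror (lookup failed) and related errors.
        return []


def stale_hostnames(
    current_ip: str,
    hostnames: Iterable[str],
    resolve: Callable[[str], List[str]] = resolve_ipv4,
) -> List[str]:
    """Return the hostnames whose DNS answer does not include current_ip."""
    return [name for name in hostnames if current_ip not in resolve(name)]


if __name__ == "__main__":
    # Placeholder values: substitute the server's real public IP and your records.
    for name in stale_hostnames("203.0.113.10", ["www.example.com", "api.example.com"]):
        print(f"{name} does not point at the current IP")
```

This only tells you what your own resolver sees, which may be a cached answer, so treat a clean result with some suspicion right after a change.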
Disgusting Virtual service
Disgusting management interface
Abysmal support
Way to fuck up a mustard sandwich Microsoftie
We moved everything we had away from that Virus named Azure.
My VM is still down (US East). Is anyone else still experiencing issues?
"Everything is running great"
i hope so lol
So did anyone receive a call "The cloud is down"? Or at least an e-mail?
Every time this happens, ask yourself... Are you outage-proof? Do you have a rational reason to believe that internally-managed infrastructure would never have a problem like this?
I'm guessing this is the reason the site I was trying to load is down ... http://www.dotnetrocks.com/
Maybe they unknowingly upgraded to Intel's latest SSDs in their storage array. https://news.ycombinator.com/item?id=8626928
This outage exposes the clowns that actually chose Azure as their cloud provider. If you use AMZN and it goes down, at least you're in good company, with the likes of Netflix, Twitter, Instagram, and so on. It's like yeah, I'm big like they are. So what, it went down, so is Netflix.
What does your client/customer think of you being on Azure? That you chose the crappy solution because your low-tech infrastructure still uses windows, which does not carry a lot of tech cred.
Azure support is probably the worst I have ever dealt with. When my account (and the service itself) stopped working, I didn't receive any email. When I tried to sign in, all I got was some generic error saying "There's something wrong with your account". My services of course were down and I couldn't do ANYTHING. I contacted support, only to learn that my account had been blocked (!) because there was some suspicious (!!) activity going on. What the... No, they couldn't tell me what exactly it was. I exchanged emails back and forth with support for several days and learned nothing new; my account and services were still disabled and I was more than pissed off. Since that day I have hated Azure and I advise everyone against using it, because such a situation is absolutely unacceptable.