Why it pays to allow failure to occur in your infrastructure – The AWS S3 failure


On the last day of February 2017, a failure occurred in the AWS platform. It was not an outage per se, but its effect was seen as one. The terminology was:

AWS services and customer applications depending on S3 will continue to experience high error rates.

S3 in the us-east-1 region was not working properly, but it was not offline. This affected a large number of sites and services, such as:

  • Adobe’s services
  • Amazon’s Twitch
  • Atlassian’s Bitbucket and HipChat
  • Airbnb
  • Autodesk Live and Cloud Rendering
  • Buffer
  • Business Insider
  • Carto
  • Chef
  • Citrix
  • Clarifai
  • Codecademy
  • Coindesk
  • Convo
  • Coursera
  • Cracked
  • Docker
  • Elastic
  • Expedia
  • Expensify
  • FanDuel
  • FiftyThree
  • Heroku
  • Kickstarter
  • Slack
  • Pinterest
  • Time Inc
  • the U.S. Securities and Exchange Commission (SEC)
  • PagerDuty

and, ironically, isitdownrightnow.com

This kicked off a dialogue about how to avoid failures of AWS primitives by designing for failure: load balancing, multiple regions and Availability Zones, and so on.
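
To make the idea concrete, here is a minimal sketch of one such design-for-failure pattern: reading an object from a primary S3 bucket and falling back to a replica in another region. The bucket names and regions are hypothetical, and the sketch assumes cross-region replication is already copying objects to the replica.

  import boto3
  import botocore.exceptions

  # Hypothetical buckets; assumes cross-region replication already mirrors
  # objects from the primary bucket to the replica.
  BUCKETS = [
      ("example-assets-us-east-1", "us-east-1"),   # primary
      ("example-assets-us-west-2", "us-west-2"),   # replica
  ]

  def fetch_object(key):
      """Try the primary bucket first, then fall back to the replica."""
      last_error = None
      for bucket, region in BUCKETS:
          client = boto3.client("s3", region_name=region)
          try:
              return client.get_object(Bucket=bucket, Key=key)["Body"].read()
          except (botocore.exceptions.ClientError,
                  botocore.exceptions.EndpointConnectionError) as err:
              last_error = err   # note the failure and try the next region
      raise last_error           # every region failed

The point is not the specific code, but that the fallback has to be designed and paid for up front; it does not come free with the platform.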

During this discourse came the aha moment: wouldn’t it be easier to make people more accepting of downtime and outages than to build a highly resilient infrastructure? If you convince people that outages are a fact of life and that they should simply wait them out, then you don’t have to make the system as reliable. You just have to ensure there is no data loss and that outages are not too frequent.

So what is too frequent? How do you know how low a bar you can set? The secret is in user experience analysis and design (UX design): what is the maximum amount of inconvenience a user will tolerate before choosing an alternative? When looking at website responsiveness, there are three key thresholds.

  • 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
  • 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
  • 10 seconds is about the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable, since users will then not know what to expect.
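
As a rough illustration of how these thresholds might be applied, the sketch below classifies measured response times against them, for example when analyzing access logs or deciding what feedback a UI owes the user. The threshold values come from the list above; the latencies in the example are made up.

  # Response-time thresholds from the list above, in seconds.
  THRESHOLDS = [
      (0.1, "instant: just display the result"),
      (1.0, "noticeable: flow of thought preserved, no special feedback needed"),
      (10.0, "attention limit: show a busy indicator to keep focus on the dialogue"),
  ]

  def classify_latency(seconds):
      """Map a measured response time onto the three UX thresholds."""
      for limit, verdict in THRESHOLDS:
          if seconds <= limit:
              return verdict
      return "too slow: show progress and an estimated completion time"

  # Example with made-up latencies, e.g. pulled from an access log.
  for latency in (0.05, 0.8, 4.2, 27.0):
      print(f"{latency:>5.2f}s -> {classify_latency(latency)}")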

In this instance, it is not the UI design that affects responsiveness, but other factors such as core infrastructure availability. The maximum amount of inconvenience a user is willing to tolerate is the UX threshold. When a failure happens, how does it affect the user, for how long, and how often?

If you map out the UX threshold and what it takes, from an infrastructure perspective, to stay within it, you arrive at the minimum build requirements for an environment that still keeps user retention positive.
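
A back-of-the-envelope way to do that mapping is to turn the tolerated outage pattern into an availability target, then compare it against what each infrastructure option can deliver. The numbers below are hypothetical examples, not recommendations.

  # Turn a tolerated outage pattern into a required availability figure.
  MINUTES_PER_MONTH = 30 * 24 * 60

  def required_availability(outages_per_month, minutes_per_outage):
      """Availability needed to stay within a tolerated downtime budget."""
      downtime = outages_per_month * minutes_per_outage
      return 1 - downtime / MINUTES_PER_MONTH

  # Hypothetical: users will put up with two 30-minute outages a month.
  target = required_availability(outages_per_month=2, minutes_per_outage=30)
  print(f"Required availability: {target:.4%}")   # ~99.86%, short of "three nines"

If that target sits below what the platform already provides, that is the financial argument for building nothing more; if it sits above, the gap is what the extra resilience has to buy.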

Is keeping the lights on good enough? Or do you need to spend more to build an indestructible, highly available infrastructure? Or is it, like most environments, somewhere in between? I don’t believe that lowering availability standards and convincing people that it is normal is a good money-saving solution; it inherently feels deceitful and shifty. But the truth is that this is becoming a trend, especially among companies that offer free services. Design decisions are made based on financial factors just as much as technological ones.

A key factor in the normalization of failure is how many companies are affected. If the failure affects many companies, it is easy to throw up your hands and say, “Nothing we can do. See all the other companies affected? This is out of our control.” It is blame transference: “It’s AWS, not us!”

Naivety is not a defense for downtime. If something goes down because of a core infrastructure failure, it is by design: a decision was made to prioritize lower costs over availability.

News organizations and users will not remember a specific company’s outage so much as they will remember that it was caused by AWS, or whatever platform it ran on. In an era where blame transference can be used to save money, moving to the cloud can remove the requirement to design for the high availability that would normally be considered standard in traditional infrastructure.

Humpty Dumpty sat on a wall. Humpty Dumpty had a great regional failure. All the king’s horses and all the king’s support staff couldn’t get Humpty back online to meet SLAs.