February went out with a whimper for many companies that use Amazon Web Services’ S3 cloud storage, thanks to a widespread service outage caused by an incorrect command entered by an S3 team member. The four-hour incident affected news organizations, enterprise chat apps like Slack, and many more. AWS owns 40% of the cloud services market, which means that when it has a bad day, many others go down with it.
In the summary of the S3 service disruption, Amazon stated:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
As is often the case, the least reliable part of the system was the human one. “Fat finger” errors are costly. When it comes to network outages, an estimated 60% of unplanned downtime is attributed to human error during device configuration.
With Uplogix, you can reduce unplanned downtime with the SurgicalRollback™ feature, a built-in safety net that lets you recover quickly from failed configuration changes and minimize their impact.
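To make the safety-net idea concrete, here is a minimal Python sketch of a generic confirm-or-rollback pattern: snapshot the known-good configuration, apply the change, and keep it only if a check passes. It is an illustration of the general concept, not Uplogix's implementation of SurgicalRollback; the function names, the dictionary-based configuration, and the `still_reachable` health check are all hypothetical.

```python
from typing import Callable, Dict

Config = Dict[str, str]


def apply_with_rollback(
    running: Config,
    changes: Config,
    confirm: Callable[[Config], bool],
) -> Config:
    """Apply changes, then revert them unless confirm() approves the result."""
    snapshot = dict(running)            # copy of the known-good configuration
    candidate = {**running, **changes}  # configuration with the changes applied
    if confirm(candidate):
        return candidate                # change confirmed: keep it
    return snapshot                     # not confirmed: restore the snapshot


def still_reachable(cfg: Config) -> bool:
    # Hypothetical health check: the management interface must stay enabled.
    return cfg.get("mgmt_interface") == "up"


running = {"mgmt_interface": "up", "hostname": "edge-1"}
bad_change = {"mgmt_interface": "down"}  # a "fat finger" mistake
good_change = {"hostname": "edge-2"}     # a harmless change

after_bad = apply_with_rollback(running, bad_change, still_reachable)
after_good = apply_with_rollback(running, good_change, still_reachable)
```

In this sketch the mistaken change is discarded and the device ends up back on its original settings, while the harmless change is kept. A real device would also need a timeout so that a change which locks the operator out is reverted automatically.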