February went out with a whimper for many companies that use Amazon Web Services’ S3 cloud storage, thanks to a wide-scale outage caused by an incorrect command entered by an S3 team member. The four-hour incident affected news organizations, enterprise chat apps like Slack, and many more. AWS holds 40% of the cloud services market, which means that when it has a bad day, many others go down with it.

In the summary of the S3 service disruption, Amazon stated:

The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

As is often the case, the most unreliable part of the system was the human involvement. “Fat finger” errors are costly. When it comes to network outages, an estimated 60% of unplanned downtime is attributed to human error during device configuration.

With Uplogix, you can reduce unplanned downtime thanks to a built-in safety net: the SurgicalRollback™ feature. It lets you recover quickly from failed configuration changes and minimize their impact.

SurgicalRollback™ combines fine-grained configuration differencing with a unique “production confirmation”-based approach to changes. Any change made is followed by a prompt for confirmation by the technician who initiated it. If no confirmation is received (e.g., because the change brought down the network and the technician’s access with it), the change is precisely and automatically rolled back, avoiding a network outage.
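
The confirm-or-roll-back idea described above can be illustrated with a minimal Python sketch. This is not Uplogix's implementation. The function names, the list-of-lines configuration model, and the `confirm` callback (a stand-in for the technician's prompt) are all assumptions made for illustration. The differencing uses Python's standard `difflib` module.

```python
import difflib
from typing import Callable, List, Tuple


def apply_with_confirmation(
    running: List[str],
    candidate: List[str],
    confirm: Callable[[float], bool],
    timeout_s: float = 30.0,
) -> Tuple[bool, List[str]]:
    """Apply `candidate` to `running`; undo it unless `confirm` returns True.

    `confirm` is a hypothetical callback that waits up to `timeout_s`
    seconds for the technician and returns True only if they confirmed.
    """
    # Record the line-level delta between the two configurations.
    delta = list(difflib.ndiff(running, candidate))
    changes = [d for d in delta if d.startswith(("+ ", "- "))]

    # Apply the change to the (simulated) device configuration.
    running[:] = candidate

    if confirm(timeout_s):
        return True, changes

    # No confirmation: rebuild the pre-change configuration from the delta.
    running[:] = list(difflib.restore(delta, 1))
    return False, changes


if __name__ == "__main__":
    config = ["hostname edge1", "interface eth0", "ip address 10.0.0.1/24"]
    new = ["hostname edge1", "interface eth0", "ip address 10.0.0.2/24"]
    # Simulate a technician who lost access and never confirms.
    kept, changes = apply_with_confirmation(config, new, lambda _t: False)
    print("kept:", kept)
    print("config after:", config)
```

In this sketch a missing confirmation is treated the same as an explicit "no", which mirrors the case where the change itself cut off the technician's access.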

Another benefit of getting humans out of the system is security. People are the cause of most security breaches. They skip steps trying to save time, or they get distracted and leave tasks undone or incomplete, introducing vulnerabilities into the network. Using a machine to handle some of the basic network management tasks means that jobs happen the same way every time, exactly as the run book says.

Amazon, it seems, had an established playbook and an authorized person, but in the end it came down to PEBCAK (problem exists between keyboard and chair).