New thinking on incident response

New thinking on incident response includes quality-of-life metrics

Airbnb matches thousands of travelers with over 2 million rental property listings on its website daily. Reliability is key to the success of a company valued at over $30 billion. A recent interview with a site reliability engineer at Airbnb covered some key metrics for incident response in this high-visibility, high-uptime operation.

Beyond traditional incident response metrics like mean-time-to-resolve and mean-time-to-acknowledge, Airbnb is looking at bigger-picture, quality-of-life metrics like work/life balance. For example, how often are on-call engineers being alerted at 2 a.m.? Are they able to live healthy, productive lives, or are they spending their “off” hours responding to issues? Airbnb sees this as an important metric when it comes to competing for a talented and scarce workforce.
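To make these metrics concrete, here is a minimal Python sketch of how a team might compute mean-time-to-acknowledge, mean-time-to-resolve, and the share of pages that land outside working hours. The `Incident` record, the `summarize` helper, and the 9-to-18 weekday window are all assumptions made for illustration; they do not describe Airbnb's or Uplogix's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when an engineer acknowledged it
    resolved: datetime      # when the issue was resolved


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def is_off_hours(ts, start_hour=9, end_hour=18):
    """True on weekends or outside the assumed start_hour..end_hour workday."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def summarize(incidents):
    """Return MTTA, MTTR, and off-hours paging figures for the incidents."""
    mtta = mean_delta([i.acknowledged - i.opened for i in incidents])
    mttr = mean_delta([i.resolved - i.opened for i in incidents])
    off_hours = sum(1 for i in incidents if is_off_hours(i.opened))
    return {
        "mtta": mtta,
        "mttr": mttr,
        "off_hours_pages": off_hours,
        "off_hours_share": off_hours / len(incidents),
    }
```

Tracking the off-hours share alongside the time-based averages is what turns a response-speed report into a quality-of-life report: two teams with the same mean-time-to-resolve can have very different 2 a.m. paging loads.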

Uplogix has a stake in the incident response discussion. First, one of our favorite metrics that we impact with our out-of-band solution is mean-time-to-innocence. One of the key initial pieces of information when solving a problem is knowing (with confidence, not just blind finger-pointing) who actually owns the problem. From its position in the network stack, yet independent of the network itself, Uplogix can rapidly and automatically troubleshoot issues and report exactly where the issue resides.

Beyond this ability to correctly point the finger at the guilty party, Uplogix can proactively take recovery steps when the issue is with a device it manages. Based on your runbook steps, Uplogix can proceed through escalating efforts, from clearing the service module or cycling the interface to rebooting the device or cycling its power.
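The escalating runbook idea can be sketched in a few lines of Python. This is only an illustration of the pattern (try the least disruptive step first, re-check health, escalate if needed); the `run_runbook` function and the step names are hypothetical and are not the Uplogix rule syntax or API.

```python
from typing import Callable, List, Tuple

# A runbook step is a label plus a callable that performs the recovery action.
Step = Tuple[str, Callable[[], None]]


def run_runbook(steps: List[Step], is_healthy: Callable[[], bool]):
    """Run steps in order until the health check passes.

    Returns the names of the steps that were attempted and whether the
    device was healthy once the runbook finished.
    """
    attempted = []
    for name, action in steps:
        if is_healthy():
            break  # recovered; do not escalate further
        action()
        attempted.append(name)
    return attempted, is_healthy()


if __name__ == "__main__":
    # Simulated device that only recovers after a reboot (illustrative).
    device = {"healthy": False}
    steps = [
        ("clear service module", lambda: None),
        ("cycle interface", lambda: None),
        ("reboot", lambda: device.update(healthy=True)),
        ("cycle power", lambda: device.update(healthy=True)),
    ]
    print(run_runbook(steps, lambda: device["healthy"]))
```

Ordering the steps from least to most disruptive means the power cycle is only reached when every gentler option has already failed.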

All of this is triggered by monitoring that runs up to 10 times more frequently than traditional SNMP polling (the Uplogix default is every 30 seconds) and gathers richer data. During an outage, Uplogix can also backfill centralized tools over an out-of-band link.
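As a quick check on the "10 times" figure, the arithmetic can be worked through as follows. The 30-second interval comes from the text above; the 5-minute SNMP polling cycle is an assumption chosen for comparison, since the post does not state which SNMP interval it has in mind.

```python
def polls_per_hour(interval_seconds: float) -> float:
    """Number of polls completed in one hour at a fixed interval."""
    return 3600 / interval_seconds


UPLOGIX_INTERVAL_S = 30   # default stated in the post
SNMP_INTERVAL_S = 300     # assumption: a 5-minute SNMP polling cycle

ratio = polls_per_hour(UPLOGIX_INTERVAL_S) / polls_per_hour(SNMP_INTERVAL_S)
print(f"{polls_per_hour(UPLOGIX_INTERVAL_S):.0f} vs "
      f"{polls_per_hour(SNMP_INTERVAL_S):.0f} polls per hour ({ratio:.0f}x)")
```

Under that assumption, a fault that appears just after a poll goes unseen for at most 30 seconds instead of up to 5 minutes.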

When you start looking at the broader set of incident response metrics, add Uplogix into your discussion. You might just find some more quality time to book a relaxing vacation on Airbnb…
