CHAOS CARNIVAL

Survival Guide: Black Swan Events

Featuring Emily Arnott and Jake Englund

Abstract

In SRE, a core message is that failure is inevitable. No matter how much you prepare, there will always be incidents you can't foresee. That doesn't make preparation useless, though. This talk focuses on one extremely valuable type of preparedness: having backups and restoration processes for the worst disasters. When your system suffers a total outage, an effective option is often to switch to a backup system before trying to solve the underlying issue, which restores service as quickly as possible. However, simply creating backup systems isn't enough.
 
This talk exposes the complacency and blind spots that often surround backup systems. Many organizations feel reassured by having created backups but aren't actually prepared to use them. The talk offers practical advice on improving backup systems for organizations of all sizes, looking at how to make them more reliable, more robust, and more resilient, based on the definitions given by Dr. David D. Woods. To keep the advice accessible, the talk avoids deep technical detail and focuses instead on mindsets and strategies.
 
Black swan events are highly impactful incidents that are so unlikely or unimaginable that no effort is made to prepare for them. You'll learn how to run thought experiments about "meteor strikes" and other worst-case scenarios, such as ransomware, so that you feel ready for problems you can't yet imagine. You'll also see how backup systems can still be useful in such disasters. This is how you build a resilient backup system: one that can still cope with what falls outside your expectations.

Speakers

Emily Arnott
Community Relations Manager

Emily is the Community Relations Manager at Blameless, where she fosters a place for discussing the latest in SRE. She has presented talks at SREcon, Conf42, and Chaos Carnival, and has written many articles on a wide range of SRE topics for the Blameless blog.

Jake Englund
Sr. Site Reliability Engineer

Jake is a Senior Site Reliability Engineer at Blameless. He is fascinated by the unique challenges, and the innovative solutions, that come with scaling web services by orders of magnitude.
