DevOpsDays Boston 2017 - There is No Root Cause... by Matthew Boeckman

Channel: Confreaks
Video Link: https://www.youtube.com/watch?v=PZHoMG4iNMQ



Duration: 29:30


DevOpsDays Boston 2017 - There is No Root Cause: Emergent Behavior in Complex Systems by Matthew Boeckman

What went wrong? Why does this always happen? How can we ensure it Never Happens Again? For most of the internet age, engineering teams have focused on finding the cause of an outage. A belief existed, and persists, that every error or behavior can be traced back to a single causal entity. Root Cause Analysis is conducted in service of finding that entity and correcting it. By doing so, we have been taught, we prevent recurrence of the error in question.

Much of RCA thinking comes from manufacturing and electrical systems, where simple causality can exist: a fuse that fails often is caused by poor wiring. In computing environments, there is rarely so simple a cause. Even the simplest application contains dependencies, logic, bottlenecks, and inefficiency. By wrapping that application in an operating system, on a server, on a network, on the internet, managed by process and actioned by people, we add enough complexity to force us to reconsider the Root Cause Analysis approach.

Modern tools and practices, like DevOps, enable engineering teams to adopt significant complexity at relatively low operational cost. Once unthinkable, microservice architecture in a public cloud environment is now a common choice for new software projects. Consider, for a moment, the layers of complexity captured in that decision. Now consider how opaque the agents in those systems are to the operators (us).

Emergence is a phenomenon whereby larger entities arise through interactions among smaller or simpler entities. In theory, complex systems exhibit highly unpredictable behavior and generate surprising patterns. In practice, teams operating complex engineering systems always see deeply interrelated causality - a blend of people, process, and the systems themselves. So why do we still focus our after-action analysis on a Single Cause?

In this talk, we’ll explore these conflicting realities for incident management teams. Attendees will learn about the differences between Root Cause Analysis and other techniques such as the Postmortem. While this is a technical talk with examples of both simple and complex infrastructures, much time will be spent considering the impact of people and process on those same systems. Attendees will leave with some actionable ideas to bring back to their teams to improve their own after-action analysis activities.



