Learning from Incidents
Aligning to a DevOps culture has given many organisations a distinct competitive advantage in their marketplace - especially those that started changing their thinking early, as Seek did. Frequent daily deployments, teams owning what they build, the ability to iterate and deliver products faster, and a greater emphasis on collaboration with much less of "that's not my job" have delivered many benefits. But there are flipsides to this rapid rate of change and, depending on your perspective, how you capitalise on them could be the next big advantage you can take.
When teams gain greater autonomy to make technology choices, the amount of diversification in your enterprise grows rapidly - especially when you are on the bleeding edge of what the major cloud providers are releasing. This increase in diversification places greater cognitive load on the people building and operating the system, to the point where holding a mental model of your systems becomes impossible. Incidents and failure will still be a part of normal system functioning and still just as complex, but they will be more asynchronous, which makes the reverberations of failure through the system more difficult to diagnose. How you embrace failure in this greater field of diversification, learn from it and use it, is what will set you apart.
This presentation will discuss how Seek has dealt with and collated extensive amounts of data on "Normal Accidents" over the last several years. We will demonstrate how incident analysis and the involvement of teams in post-mortem rituals have paved the way for many to start viewing our diverse software stack as the Socio-Technical system it is, and how appreciating the "Human Factors" elements of incidents is important to building greater resiliency in the system. We will discuss how involving technology people in incident investigation and facilitation leads to richer data that can be fed back into the delivery cycle to continuously improve the reliability and resiliency of your products. We will also discuss the traps and pitfalls to avoid, such as obsessing over the Root Cause, and why the “5 Whys” technique of incident analysis can be flawed.
Outline/Structure of the Talk
The presentation is built using an HTML5 infinite canvas (similar to Prezi) that will draw in and engage the audience. It is a highly visual presentation based on real-world experiences and is structured into 6 main parts, specifically:
1) How Seek used to learn from incidents, i.e. it didn't
2) How several months of serious incidents led to a knee-jerk response to wrestle back the perception of "control" of software engineering
3) How Incident Analysis techniques were initially mired in early-20th-century thinking and led to poor understanding of, and poor learning from, our complex system
4) How we started to think better about our thinking and view incidents differently
5) How we placed greater emphasis on supporting our people
6) Close out on Resilience Engineering and how following its principles could lead to greater success in building more diverse and compelling products for our customers
The audience will learn new ways of thinking about and managing complex systems. In particular, they will learn techniques for drawing extensive information out of incidents, and why doing so is valuable, to further enhance and grow the learning culture within their organisations - alongside such practices as Site Reliability and Chaos Engineering.
The ultimate goal of this talk is to encourage others to engineer for resiliency as their software stacks diversify and grow more complex, so that they can continue to deliver value to their customers.
This talk is aimed at technology leaders in an organisation who are responsible for building and managing complex software systems.
Prerequisites for Attendees
Ideally, participants will have a degree of familiarity with incident management and reporting, and some experience of taking part in incident facilitation and war rooms. This is your "basic level" of experience.
Participants with a keen interest in Resilience Engineering, Human Factors, Complex and Cognitive Systems Engineering, and Safety 1 versus Safety 2 thinking will derive much more value, as this talk will present many of these themes within the real, "warts and all" context of Seek's experience in dealing with incidents over the past few years.
Familiarity with published works by Erik Hollnagel, Charles Perrow, Richard Cook, John Allspaw, David Woods and Sidney Dekker will also be a bonus.