I’d imagine most people reading this are aware of Root Cause Analysis (RCA) meetings. Maybe you know them better as postmortems. If you’ve not attended one or heard of the concept before, RCAs are meetings held after an incident that resulted in an unexpected outcome in a system. They can cover bugs, human error or third-party outages. The goal of an RCA is to understand exactly what happened and then dig into why it happened, so that a repeat of the incident can be prevented in future. It’s my strong belief that these should be blameless. I’m not going to get into that in this post, but Atlassian have a great guide on how to run an RCA in this manner.
My partner, an ecologist, told me how annoying it is to have to report close calls: incidents where something bad almost happened. I learned this wasn’t a practice unique to her company, but a requirement of the Health and Safety Executive. It got me thinking. As an engineer, I’ve been in the position numerous times where something bad almost happened, but was rectified just before it got out into the wild to cause real damage. A few big examples come to mind: realising a database migration script isn’t quite right as it’s running; revoking a change before users get a chance to interact with it; cancelling a build before it gets a chance to produce some destructive outcome. Had other people experienced the effects of these mistakes, we’d have looked into them and held an RCA meeting. But they didn’t, so nothing happened.
Thinking about it, it makes sense to treat almost-incidents as incidents. We should investigate and fix the process that led up to them. Just because nothing bad happened doesn’t mean there isn’t an opportunity to fix the process, allowing us to continue our mission of delivering high-quality software at speed.