Recently a friend sent me a link to a blog post called “The myth of the Root Cause” (http://blog.scalyr.com/2016/10/the-myth-of-the-root-cause/). This post makes some valid points, especially the point that as our systems become more sophisticated, so do our problems. Often it is hard to find a single root cause. In fact, there might not even be one!
In the blog post “Focus on Analysis: the end of root cause” (https://victorops.com/blog/focus-on-analysis-the-end-of-root-cause), the author (Matthew Boeckman) claims that focusing on finding a root cause might even be the wrong frame of mind when analyzing an incident. If there is no such thing as a root cause, then searching for it may be considered harmful. And again, as our systems become more sophisticated, so do our problems.
Is root cause analysis in IT dead? That brings me to my next question.
What do we actually mean by "root cause"?
I believe some of the misunderstanding comes from what we actually mean by “root cause”. Dictionaries certainly don’t help in this regard (just have a look for yourself), but the few definitions that are out there share some common understandings.
Here is what we identified as the three key points that define a root cause:
- A root cause is the deepest, most fundamental, earliest cause that led to a problem.
- A root cause should be meaningful to the (RCA) process. Some definitions suggest a root cause should be something that is also fixable and that removal of the root cause should lead to a removal of the problem.
- There can be multiple root causes for a single problem.
Though most definitions agree that we should allow for multiple root causes, most people are surprised when I tell them that. Why? If we require that every problem has a single root cause, then it is always possible to find one: the big bang is the root cause of all problems. But how is that useful? That is why most definitions of “root cause” include the second point, which then necessarily leads to the third point.
Why do we search for root causes in the first place?
We want the solutions that we bring to the table to have a maximum ROI. A superficial fix may solve the problem, but may not prevent a slight variation on the same problem from occurring. Sound business logic.
Even though it seems the IT industry at large is struggling with the concept of RCA, the need for it keeps growing. Through all the Agile and DevOps transformations of the last decade we have seen an enormous increase in the dynamic nature of IT landscapes. Most incidents nowadays stay contained, but if they don’t, they can be extremely hard to trace back to their origins.
Can we automate root cause analysis?
It depends on how you define “root cause”. Some definitions make it pretty hard. For example, if we require a root cause to be fixable, then we have to teach a system what is fixable and what is not. However, at StackState we use a definition that stays true to the three key points of what a root cause is, while at the same time allowing automation.
Here is our definition for root cause:
"A cause whose direct or indirect effect is significantly/meaningfully correlated with the problem and whose direct causes do not significantly/meaningfully correlate with the problem. A problem may have multiple root causes whose combined effect resulted in the problem. Cause and problem (effect) may be further refined to mean either an event or state."
StackState builds a model out of your IT environment. That means StackState knows that if your internal HTTP proxy is not functioning then your microservices are not going to be able to communicate with each other and that that will cause several critical business functions to fail. This is an example of what we call a causal model.
Given a directed graph that represents the causal model, we can then traverse it backwards from the problem, against the direction of cause and effect, until we hit vertices (events or states) that no longer yield a significant correlation with the problem. This traversal may branch off into a search for multiple root causes. We should allow teleports in this traversal, so that we can find correlations that do not pass strictly through the edges of the graph.
With this definition, what is in the model and what is significant/meaningful is up to the user, but root causes can be assigned automatically.
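To make the traversal a bit more concrete, here is a minimal Python sketch. It is not how StackState implements it: the graph, the correlation scores, the 0.5 threshold and the name find_root_causes are all made up for illustration, and teleports are left out. A vertex is reported as a root cause when it correlates with the problem but none of its direct causes do.

```python
from collections import deque


def find_root_causes(causes_of, correlation, problem, threshold=0.5):
    """Walk the causal graph backwards from `problem`.

    causes_of:   dict mapping a vertex to the vertices that directly cause it
    correlation: dict mapping a vertex to its (assumed, precomputed)
                 correlation with the problem
    """
    roots = []
    seen = {problem}
    queue = deque([problem])
    while queue:
        vertex = queue.popleft()
        # Direct causes that still correlate significantly with the problem.
        significant = [c for c in causes_of.get(vertex, [])
                       if correlation.get(c, 0.0) >= threshold]
        # No significant direct causes left: this is as deep as we can go.
        if not significant and vertex != problem:
            roots.append(vertex)
        for cause in significant:
            if cause not in seen:
                seen.add(cause)
                queue.append(cause)
    return roots


# Hypothetical model, loosely based on the HTTP proxy example.
causes_of = {
    "checkout fails": ["microservices cannot communicate"],
    "microservices cannot communicate": ["http proxy down", "dns cache stale",
                                         "connection pool full"],
    "http proxy down": ["proxy host disk full", "config reload"],
}
correlation = {
    "microservices cannot communicate": 0.9,
    "http proxy down": 0.8,
    "dns cache stale": 0.1,
    "connection pool full": 0.6,
    "proxy host disk full": 0.7,
    "config reload": 0.2,
}

root_causes = find_root_causes(causes_of, correlation, "checkout fails")
print(root_causes)
```

In this toy model the search branches and ends up with two root causes, which is exactly what the third key point allows for. What counts as significant is decided by the threshold, so it stays up to the user.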
What is a cause?
At StackState we have defined causes as being either states or events. This innovation turns out to be an incredibly useful way of defining a cause. An event refers to a point or interval of time that interacts with the real world. For example, a network traffic spike or a bad transaction. A state refers to a property of an object, for example a server being down or a connection pool that is nearly full.
Both events and states can be part of a chain of cause and effect. They are also interchangeable:
- An event can cause another event: a commercial during a popular sports event may cause a network traffic spike.
- An event can also cause a state: a network traffic spike may cause a server to go down.
- A state can cause another state: that server being down may cause the payment server to go down as well.
- And finally a state can also cause an event: a server going down can cause a connection timeout.
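The four combinations above can be written down as data. The sketch below is only an illustration in Python; the class names Event, State and CausalLink are assumptions for this example and not StackState's data model.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Event:
    """Something that happens at a point or interval in time."""
    description: str


@dataclass(frozen=True)
class State:
    """A property of an object, such as a server being down."""
    subject: str
    value: str


Cause = Union[Event, State]


@dataclass(frozen=True)
class CausalLink:
    cause: Cause
    effect: Cause


def kind(cause):
    """Return 'event' or 'state' for a cause."""
    return type(cause).__name__.lower()


commercial = Event("commercial during a popular sports event")
spike = Event("network traffic spike")
server_down = State("server", "down")
payment_down = State("payment server", "down")
timeout = Event("connection timeout")

chain = [
    CausalLink(commercial, spike),         # event -> event
    CausalLink(spike, server_down),        # event -> state
    CausalLink(server_down, payment_down), # state -> state
    CausalLink(server_down, timeout),      # state -> event
]

combinations = {(kind(link.cause), kind(link.effect)) for link in chain}
print(sorted(combinations))
```

Because a link accepts either type on both sides, events and states can be mixed freely within one chain of cause and effect.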
Each state and possible state transition is known in advance. We call that a state machine. A simple example would be that a server can either be up or down, but complicated state machines can be constructed that predefine not only multiple states, but also the conditions for the state transitions.
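As a rough sketch of such a state machine in Python: the class below, the heartbeat age it looks at and the 30 second limit are all hypothetical choices for this example. The point is only that both the states and the conditions for moving between them are defined up front.

```python
class StateMachine:
    def __init__(self, initial, transitions):
        # transitions maps (from_state, to_state) to a condition function
        self.state = initial
        self.transitions = transitions

    def step(self, observation):
        """Take the first allowed transition whose condition holds."""
        for (source, target), condition in self.transitions.items():
            if source == self.state and condition(observation):
                self.state = target
                break
        return self.state


# A server is either up or down, decided by the age (in seconds)
# of its last heartbeat. The 30 second limit is an assumption.
transitions = {
    ("up", "down"): lambda heartbeat_age: heartbeat_age > 30,
    ("down", "up"): lambda heartbeat_age: heartbeat_age <= 30,
}

server = StateMachine("up", transitions)
history = [server.step(age) for age in (5, 45, 50, 2)]
print(history)
```

A more complicated machine would simply list more states and more conditions in the same table.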
The great thing about state machines is that they give us a way to reason about the world in terms of the possibilities. For example, an absence cannot exist in the actual world. That would be a paradox. But it might be precisely what we would want to call a root cause. So how can we reason about something being absent? State machines! Think about it, how do you know a glass is empty if you have not defined that as a state in a state machine?
There is more interplay between states and events. Each transition between states can be captured as an event. That means that events can be used to verify state machines or reverse engineer them.
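Here is a small Python sketch of that interplay, under the assumption that we have a list of sampled states for one component. The function names and the sample data are invented for illustration. Every change between two consecutive samples becomes a transition event, and those events can then be checked against the transitions a state machine allows, or collected to reconstruct one.

```python
def transition_events(observed_states):
    """Turn a sequence of sampled states into (from, to) transition events."""
    events = []
    for previous, current in zip(observed_states, observed_states[1:]):
        if previous != current:
            events.append((previous, current))
    return events


def violations(events, allowed):
    """Return the events that the predefined state machine does not allow."""
    return [event for event in events if event not in allowed]


samples = ["up", "up", "down", "down", "up", "degraded"]
events = transition_events(samples)

# Verify: compare observed events with the transitions we defined.
allowed = {("up", "down"), ("down", "up")}
unexpected = violations(events, allowed)

# Reverse engineer: the set of observed events is a candidate machine.
inferred = set(events)

print(events)
print(unexpected)
```

An unexpected transition either points to a problem in the environment or to a state machine that is missing a state.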
You can start to see why we think it is important to have both. While events explain what happens, states (and their state machines) describe what can happen. Both are incredibly interesting from an RCA point of view.
Contextual root causes
What is significant/meaningful for one stakeholder in one context might not be significant/meaningful for another stakeholder in another context. For example, when a bug causes downtime, the developer who coded the bug might consider the complexity of the codebase the root cause, while for the QA engineer the root cause is the fact that the release notes were not complete.
Having multiple root causes from multiple angles may yield multiple solutions that all have a great ROI. If we had insisted on a single root cause, we might have had endless frustrating debates on what the true root cause is, and we could have missed the fact that multiple small solutions would yield a better ROI than a single big solution. Perhaps we need to start thinking about contextual root causes.
Spatial and temporal root causes
Given that we allow multiple root causes for different aspects of a problem, we can also predefine different types of root causes. For example, spatial and temporal root causes. A spatial root cause tells you what is broken, either right now or at some specific time, that is causing the problem. A temporal root cause tells you what happened to cause the problem.
Let me give you two examples:
Problem: User cannot make payment.
Spatial root cause: Accounts database unresponsive.
Temporal root cause: Select query on account with 10m transactions.
The spatial root cause tells us where the problem occurred (the database), whereas the temporal root cause tells us why the problem occurred (a slow query). Keep in mind that the same temporal root cause could also have caused a problem in an entirely different part of the stack, for example the HTTP server might have run out of memory. Having just the temporal root cause may not be interesting, whereas adding the spatial root cause may tell you exactly what you need to know.
Problem: Challenger space shuttle exploded.
Spatial root cause: Broken seal around O-Ring.
Temporal root cause: Decision to go ahead with launch after discussion on potential O-Ring failure.
Of course no post about root cause analysis can be complete without mentioning the infamous Challenger disaster. As you may remember, the actual explosion was caused by a broken O-Ring, which set off a chain reaction that led to the explosion. However, further RCA found that the deeper problem was that NASA decided to go ahead with the launch anyway after concerns about the O-Ring were raised.
The limits of automated RCA
An even deeper root cause of the Challenger explosion was established. At the time NASA’s culture was such that risking the mission was considered preferable to the PR disaster that would have resulted from postponing the mission (again) at such a late stage. Though this could theoretically be modeled as a state of the NASA organization, metaphysical concepts such as organizational culture are not something machines will reason about in the next decade(s) or so.
So while our definition does allow this root cause to be assigned, as long as it is not directly or indirectly observable (and thus neither temporal nor spatial), finding it is unfortunately not going to be automated any time soon. It would therefore perhaps be more accurate to call automated RCA assisted RCA.
Business dictates that there is always going to be a need to get to the roots of problems. Not doing some type of RCA after an incident simply means that you have not verified whether or not a solution with a better ROI exists. Instead of getting rid of RCA we need to get rid of the idea of a one-to-one mapping between problem and root cause and start thinking about contextual root causes.
It is shocking to find how little insight companies still have nowadays into what they have running in production, let alone other environments. We need to invest in building better models of our IT environments. Doing so vastly increases the speed with which RCA can be performed. By building better models alone there is still so much to gain in the field of RCA, so instead of declaring it dead let’s just acknowledge how immature RCA in IT really is. In fact, we’re just getting started!