It probably sounds familiar: You detect an issue with the end-user experience, perhaps a performance degradation or an outage. Once you've discovered the root cause of the problem, it's not difficult to fix. The real pain point is the time between those two events. Too often it's left to you to find the root cause by assembling teams, processes, and tools, and piecing together siloed data fragments. Every minute spent looking for the root cause costs your company money and, even worse, damages its reputation. This post covers which evidence you need to collect to accelerate your root cause analysis process.
I don't have to tell you that modern IT infrastructures are complex, typically combining physical, virtual, and cloud components. Many IT organizations use DevOps, Continuous Delivery, and Agile approaches to enable continuous improvement, which only adds to that complexity.
On top of that, enterprises invest in a variety of application monitoring tools (e.g. Dynatrace, AppDynamics, and New Relic), data lakes (e.g. Splunk and Elastic), and infrastructure monitoring tools (e.g. Nagios and Zabbix) to manage and monitor their services across on-premises and cloud environments.
In my job, I meet a lot of customers who have invested heavily in a data lake, and for good reason: with increasing complexity and a growing number of tools in use, there's a tremendous need to have all the data in one place.
Recently a friend sent me a link to a blog post called “The myth of the Root Cause” (http://blog.scalyr.com/2016/10/the-myth-of-the-root-cause/). The post makes some valid points, especially that as our systems become more sophisticated, so do our problems. Often it is hard to find a single root cause. In fact, there might not even be one!