It probably sounds familiar: you detect an issue with the end-user experience (perhaps a performance degradation or an outage). Once you’ve discovered the root cause of the problem, it’s not difficult to fix. The real pain point is the time between those two events. Often it is left to the user to find the root cause by assembling teams, processes, and tools, and by piecing together siloed data fragments. Every minute spent looking for the root cause costs your company money and, even worse, damages its reputation. This post covers which evidence you need to collect to accelerate your root cause analysis process.
You can prevent possible outages by defining an explicit, formal disaster-detection strategy. Once formalized, the procedure opens up many opportunities for automation. The strategy must enforce the collection of evidence (metrics, events, traces, incidents, component versions) in your tools. The collected evidence should be sufficient to explain previously unknown issues and to reduce the time an automated root cause analysis takes. This boils down to a few specifics:
- Analyze the most critical possible issues and create a strategy to gather the necessary metrics (what-if analysis)
- Collect the metrics, events, etc. that can explain those issues in your tooling
- Gather additional information from all tools, including application-level logs, so you can correlate it and find cause-effect relationships
- Create automated procedures that correlate evidence, flag deviations from the norm, and detect how issues cascade from one component to the next
Responsibility for the first bullet point falls on the IT department. A single platform should preferably handle the last three. This “central collection and processing platform” would integrate and process all the data. Otherwise, we lose too much time looking in different tools and involving different teams.
Collecting The Evidence
Detecting all issues isn’t possible from a pragmatic perspective. However, the most critical issues can be detected with the most obvious metrics or events, such as:
- customer-centric service response time
- network loss
- available disk space
- error count
- errors in logs
- message build-up in queues
- environment changes, such as upgrades or deployments
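As a rough illustration of how a few of the signals above could be turned into automated checks, here is a minimal Python sketch. The metric names, the threshold values, and the `breached` helper are all hypothetical and chosen only for this example; real thresholds depend on your own services and tooling.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str                      # metric name (hypothetical)
    threshold: float               # illustrative limit, not a recommendation
    higher_is_worse: bool = True   # False for metrics like free disk space


# Hypothetical checks loosely based on the signals listed above.
CHECKS = [
    Check("response_time_ms", 500.0),
    Check("packet_loss_pct", 1.0),
    Check("free_disk_pct", 10.0, higher_is_worse=False),
    Check("error_count", 50.0),
    Check("queue_depth", 1000.0),
]


def breached(checks, sample):
    """Return the names of checks whose current value crosses its threshold."""
    result = []
    for check in checks:
        value = sample.get(check.name)
        if value is None:
            continue  # metric was not collected in this sample
        if check.higher_is_worse and value > check.threshold:
            result.append(check.name)
        elif not check.higher_is_worse and value < check.threshold:
            result.append(check.name)
    return result


# One made-up sample of current values.
sample = {
    "response_time_ms": 820.0,
    "packet_loss_pct": 0.2,
    "free_disk_pct": 7.5,
    "error_count": 12.0,
}
print(breached(CHECKS, sample))  # ['response_time_ms', 'free_disk_pct']
```

Static thresholds like these are only a starting point: they catch the obvious cases, while the correlation work described below is what explains them.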
Many customers use agents that monitor each VM and send OS metrics (such as CPU, memory, disk I/O, and network I/O) to a log aggregation tool (such as Splunk or Elasticsearch). Those metrics are good low-level infrastructure evidence, but they don’t immediately answer the question: "Is there a problem from a business or end-user perspective?" They are mostly used to confirm or reject theories about what went wrong.
To find the root cause and detect issues that span multiple teams, hosts, etc., you must correlate user-facing, mission-critical monitoring evidence with the low-level infrastructure and application-level evidence. The ability to visualize the complete business chain, from on-premises to cloud and from legacy to microservices, is critical to accomplishing this task. It's the only way you can understand which components might impact the business service and rule out those that don’t.
A Concrete Use Case
Imagine, for instance, that there is a response delay for the user-facing component 'Business X'. As an operator, I would like to see which other components called by 'Business X' also experience the delay. In other words, I’d like to see how this delay propagates. Understanding the components and their relationships is crucial; otherwise, valuable time is wasted searching for things that are unrelated to the current problem. Those connections allow us to see related services with one click, instead of sending emails to other teams or searching in Splunk to find what is correlated.
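To make the idea of following a delay through related components more concrete, here is a small Python sketch. It assumes the call relationships are already available as a simple mapping and that we already know which components are currently slow. The topology, the component names other than 'Business X', and the `delay_chain` function are invented for illustration and do not represent any particular product's API.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it calls.
CALLS = {
    "Business X": ["Auth", "Orders"],
    "Orders": ["Inventory", "Payments"],
    "Payments": ["Bank Gateway"],
    "Auth": [],
    "Inventory": [],
    "Bank Gateway": [],
}

# Hypothetical set of components currently showing a response delay.
DELAYED = {"Business X", "Orders", "Payments", "Bank Gateway"}


def delay_chain(start, calls, delayed):
    """Breadth-first walk from `start`, following only delayed components."""
    if start not in delayed:
        return []
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee in delayed and callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append(callee)
    return order


print(delay_chain("Business X", CALLS, DELAYED))
# ['Business X', 'Orders', 'Payments', 'Bank Gateway']
```

In this made-up example, 'Auth' and 'Inventory' drop out immediately because they show no delay, which leaves a short chain of candidates to inspect.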
Once a group of components with cascading delays is found, I consider them one by one and rule out those that are not correlated. I do this by looking at all the collected metrics and events to discover deviations that can explain the delay.
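One simple way to look for such deviations is to compare each metric's current value against its recent baseline. The sketch below uses a z-score for this. That is an assumption on our part rather than a prescribed method, and the metric names, sample values, and the threshold of 3 standard deviations are purely illustrative.

```python
from statistics import mean, stdev


def deviates(baseline, current, z_threshold=3.0):
    """True if `current` lies more than `z_threshold` standard deviations
    from the baseline mean. `baseline` needs at least two data points."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) / sigma > z_threshold


def suspicious_metrics(baselines, current_values, z_threshold=3.0):
    """Return the sorted names of metrics whose current value deviates."""
    return sorted(
        name
        for name, history in baselines.items()
        if name in current_values
        and deviates(history, current_values[name], z_threshold)
    )


# Made-up recent history for one component in the delay chain.
baselines = {
    "cpu_pct": [40, 42, 41, 39, 43],
    "db_pool_wait_ms": [5, 6, 5, 7, 6],
}
current_values = {"cpu_pct": 44, "db_pool_wait_ms": 60}

print(suspicious_metrics(baselines, current_values))  # ['db_pool_wait_ms']
```

Here the CPU reading stays within normal variation while the hypothetical connection-pool wait time stands out, so the latter becomes the first theory to check.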
Integrating multiple sources of information and adding relationships increases the likelihood of finding the root cause or developing a realistic theory based on evidence and data. Understanding relationships between components, both logical and physical, is crucial. Combining this insight with all available telemetry (from multiple data sources) provides a unified view, which allows you to reduce the mean time to repair (MTTR) drastically. To learn more about root cause analysis, we recommend reading our post 'Is Root Cause Analysis Dead or Are We Just Getting Started?'.
Do you want to learn more about how StackState helps to automate your insight and root cause analysis? Request a free guided today to get a better understanding of our solution.