StackState Blog

Automate incident investigation to save money and become proactive

Posted by Mark Bakker on Mar 1, 2017 4:16:42 PM

How many hours did your best engineers spend investigating incidents and problems last month? Do those engineers get a big round of applause when they solve an issue? Most likely the answers are “a lot” and “yes”…

Problem and incident investigation is hard because you usually have to search through multiple tools, correlate the data from all of them, and then interpret it.

At StackState we believe in automating not only the release cycle of your software but everything, including the hard parts such as troubleshooting and incident investigation.

Some companies try to help with incident investigation by correlating the incidents reported by different tools: they check whether the incidents contain the same message and whether they were reported in the same time frame. This type of analytics does help during incidents, but it usually takes quite some time.
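
To make that correlation approach concrete, here is a minimal sketch in Python. It is not how any particular product works: the `Incident` record, the `correlate` function and the five-minute window are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record as reported by a monitoring tool."""
    tool: str
    message: str
    reported_at: datetime


def correlate(incidents, window=timedelta(minutes=5)):
    """Group incidents that share a message and were reported close together.

    Two incidents land in the same group when their normalized messages are
    equal and the gap to the previous incident in the group is at most `window`.
    """
    by_message = defaultdict(list)
    for incident in incidents:
        # Normalize so that "Disk full" and "disk full " count as the same message.
        by_message[incident.message.strip().lower()].append(incident)

    groups = []
    for same_message in by_message.values():
        same_message.sort(key=lambda i: i.reported_at)
        current = [same_message[0]]
        for incident in same_message[1:]:
            if incident.reported_at - current[-1].reported_at <= window:
                current.append(incident)
            else:
                groups.append(current)
                current = [incident]
        groups.append(current)
    return groups
```

Note that this only tells you which reports probably describe the same symptom. It says nothing about where the problem comes from, which is why the investigation itself still takes time.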

As an engineer I have carried out lots of these investigations. Most of the time they follow the same pattern:

  1. There is a bottleneck somewhere (the hotspot).
  2. You search for bottlenecks in all dependencies of the hotspot (slow service calls, high disk usage, high CPU usage, increasing memory usage, slow queries, etc.)
  3. When you find a bottleneck, you repeat these steps from there until you reach the root cause of the problem.
  4. You fix the root cause.
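
The pattern above is essentially a walk through a dependency graph. The Python sketch below shows one way to express it; it is an illustration, not StackState's implementation. The `dependencies` mapping, the `is_bottleneck` check and the component names are hypothetical.

```python
from typing import Callable, Dict, List, Set


def find_root_causes(
    hotspot: str,
    dependencies: Dict[str, List[str]],
    is_bottleneck: Callable[[str], bool],
) -> List[str]:
    """Follow bottlenecks down the dependency graph, starting at the hotspot.

    A component whose dependencies show no bottleneck is reported as a
    root cause candidate (steps 1 to 3 of the pattern above).
    """
    root_causes: List[str] = []
    visited: Set[str] = set()

    def visit(component: str) -> None:
        if component in visited:  # guards against cyclic dependencies
            return
        visited.add(component)
        # Step 2: look for bottlenecks in all dependencies of this component.
        slow = [d for d in dependencies.get(component, []) if is_bottleneck(d)]
        if not slow:
            # Nothing further down is misbehaving, so the trail ends here.
            root_causes.append(component)
            return
        # Step 3: repeat the search from every bottleneck that was found.
        for dep in slow:
            visit(dep)

    visit(hotspot)
    return root_causes


# Hypothetical stack: a web shop that depends on two services.
dependencies = {
    "webshop": ["checkout-service", "search-service"],
    "checkout-service": ["orders-db"],
    "orders-db": ["disk-1"],
}
unhealthy = {"webshop", "checkout-service", "orders-db", "disk-1"}
candidates = find_root_causes("webshop", dependencies, lambda c: c in unhealthy)
```

In this example the walk goes from the web shop through the checkout service and its database and ends at `disk-1`. Step 4, fixing the root cause, remains a job for the engineer.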

At StackState we automated this process to reduce the time needed to find the root cause of any change or failure. However, it is always better to prevent these issues from impacting your end users, and thus your business, in the first place. That’s why we use anomaly detection combined with graph analytics to become more proactive. This catches an issue at an early stage, before it becomes the root cause of an outage, so instead of reacting in panic you can prevent the outage from happening at all.

All of this helps ensure service availability, stability and performance, which means happy end users and a happy business department. But above all: it will save you money!

PS - we're running a webinar, 'Container Monitoring at scale', on March 8, 2017.

Topics: Dev/Ops, ITSM, ITOA

Our Mission

To simplify the lives of IT managers, operators and developers

To accomplish this, we created an Algorithmic IT Operations platform. It aggregates information from a multitude of sources and existing Dev/Ops tools to provide a unique insight into the health of the entire IT stack and to find root causes of problems across tools, teams and departments. It also lets you validate the effects of changes before applying them, not only for one type of system but for the full stack, regardless of its size.
