StackState Blog

Monitor Mesos with StackState - Part 2

Posted by Mark Bakker on Mar 22, 2017 3:00:02 PM



This post is part 2 in a 4-part series about Container Monitoring. Post 1 dives into some of the new challenges containers and microservices create and the information you should focus on. This article describes how to monitor your Mesos cluster.


Apache Mesos is a distributed systems kernel at the heart of the Mesosphere DC/OS and is designed for operations at very large scale. It abstracts the entire data center into a single pool of computing resources, simplifying running distributed systems at scale.


Mesos supports different types of workloads to build a truly modern application. These distributed workloads include container orchestration (like Mesos containers, Docker and Kubernetes), analytics (Spark), big data technologies (Kafka and Cassandra) and much more.


Mesos monitoring challenge

Mesos can help you move resources from your cluster to the critical applications that need them, but understanding what is happening inside your Mesos cluster can be a challenge. One of the most important (and most hated) tasks in systems operations is troubleshooting. Despite all your experience and knowledge, systems will break. As your landscape becomes containerized, it is even more important to know where to look to find out what is failing and how it affects critical services.


In a typical Mesos environment, you run several services such as Marathon (container orchestration) and Docker, along with distributed applications such as HBase, MongoDB and Spark. Your tasks have dependencies on these services and on each other. StackState helps you understand and monitor these dependencies.


All these technologies produce different types of metrics. It's not efficient to monitor and control each technology component of your cluster with a different tool. When a container suddenly breaks you will receive an overload of alerts. Endless troubleshooting will follow. How are you going to deal with this challenge? 


Monitor Mesos with StackState

StackState makes it easy to aggregate a variety of relevant metrics and checks from your Mesos master, its slaves and tasks. Just provide the Mesos API endpoint to StackState or install the agent.
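To give a feel for the raw data behind such an integration, here is a minimal Python sketch that reads the `/metrics/snapshot` endpoint Mesos itself exposes and picks out a few cluster-level gauges. It is not StackState's integration code; the function names and the selection of metrics are assumptions made for illustration, and the base URL (for example `http://mesos-master.example:5050`) is a placeholder for your own master.

```python
import json
from urllib.request import urlopen


def fetch_snapshot(base_url, timeout=5.0):
    """Fetch the metrics snapshot (a flat JSON object) from a Mesos master."""
    with urlopen(f"{base_url}/metrics/snapshot", timeout=timeout) as resp:
        return json.load(resp)


def summarize_master(snapshot):
    """Pick a few cluster-level gauges out of a master snapshot.

    The short names on the left are hypothetical; the keys on the right
    are metric names reported by the Mesos master. Missing keys map to None.
    """
    keys = {
        "tasks_running": "master/tasks_running",
        "tasks_failed": "master/tasks_failed",
        "cpus_percent": "master/cpus_percent",
        "mem_percent": "master/mem_percent",
        "slaves_active": "master/slaves_active",
    }
    return {name: snapshot.get(key) for name, key in keys.items()}
```

Calling `summarize_master(fetch_snapshot("http://mesos-master.example:5050"))` against a reachable master would return a small dictionary that is easy to forward to whatever monitoring tool you use.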


The example below shows a visualization of a containerized environment. It displays the health state of each component and its underlying dependencies. If something breaks, you immediately see the cause of the problem.


 Real-time visualization of your container landscape
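The idea of reading a root cause off a dependency view can be sketched in a few lines of Python. This is a simplified illustration, not StackState's algorithm: it assumes each component has a boolean health state and a list of components it depends on, and it treats an unhealthy component as a likely root cause when none of its own dependencies are unhealthy. All component names are hypothetical.

```python
def root_causes(health, depends_on):
    """Return unhealthy components whose dependencies are all healthy.

    health:     dict mapping component name -> True (healthy) / False
    depends_on: dict mapping component name -> list of dependency names
    Dependencies with unknown health are assumed healthy.
    """
    causes = []
    for name, ok in health.items():
        if ok:
            continue
        deps = depends_on.get(name, [])
        if all(health.get(dep, True) for dep in deps):
            causes.append(name)
    return sorted(causes)


# Hypothetical landscape: a web shop depends on a database,
# which runs on a Mesos slave that has failed.
health = {
    "webshop": False,
    "orders-db": False,
    "mesos-slave-3": False,
    "marathon": True,
}
depends_on = {
    "webshop": ["orders-db", "marathon"],
    "orders-db": ["mesos-slave-3"],
    "mesos-slave-3": [],
}
print(root_causes(health, depends_on))
```

Three components are red here, but only one of them has no failing dependency of its own, which is why a dependency-aware view can cut through an overload of alerts.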


StackState is also able to analyze log files and relevant events. This helps you to quickly investigate the root cause of problems. In the example below you see some slow queries from a database running on Mesos.




Analyze events with StackState


StackState not only shows and monitors all your dependencies; it also has advanced analytical capabilities that let you reason about the whole model of your IT stack. For example, you can query which Mesos batch tasks will not be ready in time and could therefore affect your primary business processes. Calculating the correct interest each day on time (ETA 8:00 AM), for instance, is an important process that should not fail because one job took too long. StackState notifies you when such problems are about to occur.
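A check of the "will this batch task be ready in time?" kind can be approximated as follows. This Python sketch is not StackState's query language; it assumes you know each task's start time and the fraction of work completed, and it uses a plain linear extrapolation, which is a simplification. The function names and task names are hypothetical.

```python
from datetime import datetime


def projected_finish(started, now, fraction_done):
    """Linearly extrapolate a finish time from the progress made so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - started
    return started + elapsed / fraction_done


def late_tasks(tasks, now, deadline):
    """Return names of tasks projected to finish after the deadline.

    tasks: dict mapping task name -> (start time, fraction done)
    """
    return sorted(
        name
        for name, (started, fraction_done) in tasks.items()
        if projected_finish(started, now, fraction_done) > deadline
    )


# Hypothetical batch run that must be finished by 8:00 AM.
now = datetime(2017, 3, 22, 7, 0)
deadline = datetime(2017, 3, 22, 8, 0)
tasks = {
    "interest-calculation": (datetime(2017, 3, 22, 6, 0), 0.25),
    "ledger-export": (datetime(2017, 3, 22, 6, 0), 0.8),
}
print(late_tasks(tasks, now, deadline))
```

In this example the interest calculation is a quarter done after one hour, so it is projected to finish well after the deadline, while the export is on track.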



Connecting Mesos to StackState gives you the ability to:

  • Visualize your Mesos cluster performance
  • Correlate the performance of Mesos with the rest of your applications
  • Find all tasks (including containers)
  • Correlate all tasks (based on connectivity between them)
  • Report metrics for a Mesos slave and master
  • Correlate all services running in Mesos with other components in your IT stack
  • Parse application or service logs to extract and ship relevant events to StackState
  • Run big data graph analytics


StackState offers the advanced functionality you need to monitor your containerized environment, including automated service discovery, root cause analysis, anomaly detection and the correlation of metrics, logs and events in one place.



Topics: Dev/Ops, Integrations, Monitoring, Virtualization, Technologies, ITOA
