Container Monitoring
Why, how, and what to look out for
Overview
Why monitor your containers?
Challenges with container monitoring
Despite the popularity of containers for modern application deployment, the 2018 CNCF survey found that 34% of respondents cited monitoring as one of their top challenges in adopting containers. Compared with a traditional monitoring solution for virtualized infrastructure, a container monitoring system brings its own unique challenges:
- Containers are ephemeral
- Containers share resources
- Insufficient tooling
What to look for in a container monitoring system
A good monitoring system needs to provide an overview of your entire application as well as relevant information on each component. Here’s what to consider when selecting a container monitoring solution:
- Can you see how the entire application is performing, with respect to both the business as well as the technical platform?
- Can you correlate events and logs to spot abnormalities, so you can respond proactively or reactively and minimize damage?
- Can you drill down to each component and layer to isolate and identify the source of failure?
- How easy is it to add instrumentation to your code?
- How easy is it to configure alarms, alerts, and automation?
- Can you display, analyze, and alarm on any set of acquired metrics and logs from different data sources?
Options for monitoring solutions
To speed up development cycles and build governance into their continuous integration and continuous delivery (CI/CD) pipelines, teams can build reactive tooling and scripts into their standard DevOps orchestration using metrics from their monitoring solution, or leverage community projects such as Autopilot from Portworx. Autopilot is a monitor-and-react engine: it watches the metrics of the applications it monitors and, when certain conditions in those metrics are met, reacts by altering the application's runtime environment. Built to monitor stateful applications deployed on Kubernetes, it is a good example of a project designed to make it easier for businesses to make metric-based decisions and strengthen their operational resiliency.
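For illustration, here is a minimal sketch of that monitor-and-react loop in Python, assuming a reachable Prometheus endpoint and a placeholder scale_up() action; Autopilot itself is configured declaratively through Kubernetes rules rather than with scripts like this.

```python
"""Minimal monitor-and-react loop (illustrative only).

Assumes a Prometheus server at PROM_URL and a hypothetical scale_up()
reaction; Portworx Autopilot expresses the same idea declaratively
through Kubernetes rules rather than a polling script.
"""
import time

import requests

PROM_URL = "http://prometheus.example.internal:9090"  # assumed endpoint
QUERY = "100 * (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes)"
THRESHOLD = 80.0  # react when volume usage exceeds 80%


def current_usage_percent() -> float:
    """Query Prometheus for the current worst-case volume usage percentage."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    # Take the worst volume; an empty result means there is nothing to act on.
    return max((float(r["value"][1]) for r in results), default=0.0)


def scale_up() -> None:
    """Placeholder for the reaction, e.g. expanding a volume via the Kubernetes API."""
    print("Threshold crossed - triggering volume expansion")


if __name__ == "__main__":
    while True:
        if current_usage_percent() > THRESHOLD:
            scale_up()
        time.sleep(60)  # poll once a minute
```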
Businesses that have invested in chaos engineering need to isolate and profile failure domains in order to supplement their risk-resilience tooling. Chaos Monkey, originally created by Netflix, randomly terminates virtual machine instances and containers running inside your production environment, exposing engineers to failures more frequently and pushing them to build resilient services. But injecting failures into a system without a good set of metrics to track them leaves you with nothing but chaos.
The open source time-series visualization suite Grafana is a good choice in many scenarios because of its long list of supported data sources, including Prometheus, Graphite, and Amazon CloudWatch, among others. While Grafana's core competency used to be visualizing and alerting on metrics, recently added support for log analytics through data sources such as InfluxDB and Elasticsearch lets users correlate metric data with log events, improving root cause analysis. Besides Grafana, many paid container monitoring solutions exist, but they usually require specific agents or data collection protocols.
For containerized applications running on AWS, a similar cross-data-source monitoring experience is available with Amazon CloudWatch Container Insights. By automatically collecting and storing metrics and logs from sources such as Fluentd and Docker stats, Container Insights gives you visibility into your container clusters and the applications running on them.
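As a sketch of what consuming those metrics can look like, the following snippet uses boto3 to pull a Container Insights CPU metric. The cluster name is illustrative, and the namespace and metric names follow Container Insights conventions but should be verified against your own account.

```python
"""Pull a Container Insights metric with boto3 (sketch).

Assumes Container Insights is already enabled on a cluster named
"demo-cluster"; adjust the namespace, metric, and dimensions to match
what your account actually publishes.
"""
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.datetime.now(datetime.timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "demo-cluster"},
        {"Name": "Namespace", "Value": "default"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,              # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```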
How does container monitoring work?
Tying together disparate metric sources requires a robust infrastructure monitoring platform that can provide a single pane of glass for data from various sources. It also requires careful thought and planning from your application development team to ensure that data can be correlated for easy end-to-end debugging.
Conceptually, containerized applications are monitored much like traditional applications: you need data from different layers of the stack for different purposes. You measure and collect metrics at both the container and infrastructure layers for resource management tasks such as scaling, and you need application-specific data, application performance management, and tracing information for troubleshooting.
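If your team uses Prometheus, application-layer instrumentation can be as light as the following sketch with the prometheus_client library; the metric names and port are illustrative.

```python
"""Application-layer instrumentation sketch using prometheus_client.

Assumes Prometheus scrapes the /metrics endpoint exposed on port 8000;
the metric names are illustrative.
"""
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("orders_processed_total", "Orders processed", ["status"])
LATENCY = Histogram("order_processing_seconds", "Time spent processing an order")


@LATENCY.time()
def process_order() -> None:
    """Stand-in for real business logic."""
    time.sleep(random.uniform(0.01, 0.1))
    REQUESTS.labels(status="ok").inc()


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for the scraper
    while True:
        process_order()
```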
Robust observability and easy debugging are key to building a microservices architecture, and as your system grows, so does the complexity of monitoring the entire application. In a microservices architecture, you build independent services that each serve a single purpose and communicate with other services to perform larger tasks.
Using an SDK approach, you can build networking functions such as service-to-service communication, traffic control, and retry logic into each service, in addition to its business logic. However, these tasks become very complex when you have hundreds of services written in different programming languages and owned by different teams. This is where a service mesh comes into the picture.
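To make the burden concrete, here is a sketch of the kind of retry-and-timeout logic each service would otherwise carry in its own code (the endpoint URL is made up); this is exactly the sort of networking concern a service mesh moves into the sidecar proxy.

```python
"""Retry and timeout logic embedded in application code (the "SDK approach").

A service mesh factors this out into the sidecar proxy; the downstream URL
here is purely illustrative.
"""
import time

import requests


def call_with_retries(url: str, attempts: int = 3, timeout: float = 2.0) -> requests.Response:
    """Call a downstream service with a timeout and exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(2 ** attempt * 0.1)  # back off: 0.1s, 0.2s, 0.4s
    raise RuntimeError(f"downstream call failed after {attempts} attempts") from last_error


if __name__ == "__main__":
    print(call_with_retries("http://inventory.internal/api/stock").status_code)
```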
Service mesh
A service mesh manages the communication layer of your microservices system by deploying a software component alongside each service as a sidecar proxy. This component handles networking functions, traffic control, tracing, and logging. Each service has its own proxy, and together the proxies form the mesh. Services never communicate with each other directly, only with their own proxy; service-to-service communication takes place between the proxies.
Build a network of microservices
With a service mesh, you can build a network of microservices and move most of the network layer out of the services themselves, which streamlines management and observability. Ultimately, if problems occur, it enables you to identify the source quickly. Here are some of the benefits of using a service mesh:
- Ensure consistent communication among services.
- Provide complete visibility of end-to-end communication.
- Control traffic throughout the application, including load balancing, scaling, and traffic routing during deployments.
- Provide insight into metrics, logging, and tracing throughout the stack.
- Move distributed-architecture concerns such as retries, rate limiting, and circuit breaking out of the business logic and into the networking layer.
- Remain platform and language independent, so you can build services in any programming language you wish.
Envoy
One of the most popular implementations of the sidecar proxy for a service mesh is Envoy. Envoy is a high-performance C++ distributed proxy originally built at Lyft. Envoy runs alongside every service and provides common networking features in a platform-agnostic manner. It is self-contained and has a very small footprint (8 MB). It supports advanced load balancing features including automatic retries, circuit breaking, rate limiting, and so on. As service traffic flows through the mesh, Envoy provides deep observability of L7 (Application Layer) traffic, native support for distributed tracing, and wire-level observability of the system.
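To make the sidecar idea concrete, here is a deliberately simplified Python sketch of a proxy that forwards requests to a local service, stamps a request ID for tracing, and records latency. Envoy does all of this (and far more) in production-grade C++, so treat this only as an illustration of the pattern; the ports and header name are arbitrary.

```python
"""Toy sidecar proxy (teaching sketch only; Envoy is the production answer).

Listens on port 15001, forwards GET requests to the local service on port
8080, stamps a request ID header for tracing, and records latency. Error
handling for failed upstream calls is omitted for brevity.
"""
import time
import urllib.request
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://127.0.0.1:8080"  # the service this sidecar fronts


class SidecarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        request_id = str(uuid.uuid4())  # trace context for this hop
        started = time.perf_counter()
        req = urllib.request.Request(
            UPSTREAM + self.path, headers={"x-request-id": request_id}
        )
        with urllib.request.urlopen(req, timeout=5) as upstream:
            body = upstream.read()
            status = upstream.status
        elapsed_ms = (time.perf_counter() - started) * 1000
        # In a real mesh these observations are exported as metrics and traces.
        print(f"{self.path} -> {status} in {elapsed_ms:.1f} ms (id={request_id})")

        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 15001), SidecarHandler).serve_forever()
```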
One last tip
The purpose of monitoring your application is to collect operational data in the form of logs, metrics, events, and traces so you can identify and respond to issues quickly and minimize disruptions. If your monitoring system is working well, you should be able to easily create alarms and trigger automated actions when thresholds are crossed. In most cases static alarms are sufficient, but applications that exhibit organic growth or cyclical, seasonal behavior (such as requests that peak during the day and taper off at night) require more thought and knowledge to define an effective static threshold. The common challenge is finding the right value: set it too loose and problems slip through unnoticed; set it too tight and you drown in false alerts.
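As a concrete example of a static alarm, the following boto3 sketch creates a CloudWatch alarm on an ECS service's average CPU utilization; the cluster, service, and SNS topic names are placeholders, and the Threshold and EvaluationPeriods values are exactly the numbers that are hard to get right for seasonal workloads.

```python
"""Create a static-threshold CloudWatch alarm with boto3 (sketch; names are placeholders).

Alarms when the average CPU utilization of an ECS service stays above 80%
for three consecutive 5-minute periods, then notifies an SNS topic.
"""
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="web-service-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "demo-cluster"},
        {"Name": "ServiceName", "Value": "web"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```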
The way to solve this puzzle is with machine learning. Many monitoring tools now incorporate this capability to learn the normal baseline and recognize anomalous behavior in the data as it arises. Such a tool can adapt to metric trends and seasonality, continuously track the dynamic behavior of your systems and applications, and automatically adjust to situations such as time-of-day utilization peaks. With machine learning, you can identify runtime issues sooner and reduce system and application downtime.
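The idea behind such baselining can be illustrated with a deliberately simple sketch: keep a rolling window of recent samples and flag values that fall far outside the learned band. Real anomaly-detection features handle trend and seasonality with far more sophistication, so this is only a conceptual illustration.

```python
"""Minimal baseline-and-deviation check (illustrative, not any vendor's algorithm).

Learns a rolling mean and standard deviation from recent samples and flags
points that fall outside the expected band.
"""
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    def __init__(self, window: int = 60, band_width: float = 3.0):
        self.samples = deque(maxlen=window)  # recent history is the baseline
        self.band_width = band_width         # how many std-devs count as normal

    def is_anomalous(self, value: float) -> bool:
        anomalous = False
        if len(self.samples) >= 10:           # wait until the baseline is meaningful
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = abs(value - mu) > self.band_width * max(sigma, 1e-9)
        self.samples.append(value)
        return anomalous


if __name__ == "__main__":
    detector = BaselineDetector()
    for v in [50, 52, 51, 49, 53, 50, 48, 52, 51, 50, 95]:
        print(v, "anomaly" if detector.is_anomalous(v) else "ok")
```

In production you would rely on your monitoring tool's built-in anomaly detection rather than rolling your own, but the principle of learning a baseline and alarming on deviations is the same.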