Using OpsWatch to Create a Single Pane of Prometheus Metrics from Multiple Non-Native Sources

By Philipp Hellmich, Team Lead, AWS Cloud Services – Arvato Systems
By Patrick Robinson, DevOps Engineer – Arvato Systems
By Kobi Biton, Specialist Solutions Architect – AWS


Monitoring a multi-cloud IT landscape is becoming increasingly important. Having optimal insight into all aspects of your service enables you to keep it running and reduce downtime by determining the cause of an issue faster. A typical application has several layers of information that need to be combined in order to get a complete picture.

Classic monitoring systems can quickly become complex and maintenance-intensive in dynamic cloud environments. Further, they are often inflexible, not built for the cloud, and difficult to use at scale. Software-as-a-service (SaaS) solutions can be expensive in large-scale deployments and create additional external dependencies.

Arvato Systems, an AWS Advanced Tier Services Partner with Competencies in DevOps and Migration, was faced with this challenge in its managed services operations as well as in consulting engagements with customers across different industries.

Prometheus is an open-source systems monitoring and alerting toolkit which many companies and organizations have adopted. The Arvato Systems OpsWatch solution helps to bridge the gap between Prometheus native and non-native Amazon CloudWatch metrics and Amazon GuardDuty events.

OpsWatch does this by consuming metrics and then transforming and enriching them. The enriched metrics can be displayed in dashboards, trigger alerts, or both. This gives the operator a single pane of Prometheus with powerful monitoring capabilities.

In this post, we will describe how to integrate CloudWatch and other Amazon Web Services (AWS) data sources into a typical container and Prometheus-based monitoring world.

Challenges of a Single Pane

The team at Arvato Systems often observes customers adopting multiple tools to monitor their various cloud services. Running multiple tools adds cost and complexity and requires cross-domain expertise. In most cases, the result is a high total cost of ownership (TCO).

While Prometheus integrations such as YACE (Yet Another CloudWatch Exporter) and CloudWatch Exporter exist, they do not scale well and cannot use the full feature set of the source services.

The architecture of those tools involves multiple inefficient calls to the relevant AWS APIs (CloudWatch, for example), capturing all data points with a pull rather than a push mechanism. This often creates high concurrency on the AWS APIs, which can lead to throttling and rate limiting and thus delays the data you need.

Proposed Solution

To meet internal requirements as a managed service provider, as well as customers’ needs for a fully managed single pane monitoring system, Arvato Systems developed OpsWatch, a solution based on the core principles described in the following sections.

Scalability

Compared to the alternatives listed above, OpsWatch reverses the process: metrics are pushed to OpsWatch in near real time, as they are generated, using Amazon CloudWatch metric streams.

The architecture is decoupled and can scale and process events in near real time. OpsWatch instances can be connected to any Prometheus instance, and this pluggable architecture allows customers to integrate OpsWatch with minimal setup.
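
To make the push model more concrete, the following Python sketch shows what a receiver might do with a single delivery. It is not OpsWatch code. It assumes the Kinesis Data Firehose HTTP endpoint delivery format (base64-encoded records plus a request ID) and the CloudWatch metric stream JSON output (one JSON object per line); the function names and the in-memory queue are hypothetical, and field names should be checked against the AWS documentation.

```python
import base64
import json
import time
from collections import deque


def decode_delivery(body):
    """Decode one Firehose HTTP endpoint delivery into a list of metric dicts."""
    metrics = []
    for record in body.get("records", []):
        payload = base64.b64decode(record["data"]).decode("utf-8")
        # Assumption: the metric stream uses JSON output, one object per line.
        for line in payload.splitlines():
            if line.strip():
                metrics.append(json.loads(line))
    return metrics


def handle_delivery(body, queue):
    """Enqueue every metric for decoupled processing and acknowledge the delivery."""
    for metric in decode_delivery(body):
        queue.append(metric)
    # Assumption: the acknowledgement echoes the request ID with a millisecond timestamp.
    return {"requestId": body["requestId"], "timestamp": int(time.time() * 1000)}
```

Because the handler only decodes and enqueues, the expensive work (enrichment and lookups) can happen later in separate workers, which is the decoupling described above.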

Metric Enrichment

Every CloudWatch metric is a combination of metric name and one or more dimensions. For example, ‘CPUUtilization’ is a metric name and ‘InstanceId’ is a dimension.

In many cases, metric values are used as input for alerting rules, which fire when thresholds are breached and resolve when values return to normal.

It’s important that alerts fire based on rules that can draw on multiple dimensions backed by multiple data points. OpsWatch enriches metrics with additional dimensions (tags, instance information, domain names), which allows the operator to significantly improve the quality of the metric and of any corresponding alert.

Metric Correlation

Certain metrics and their corresponding alerts are not effective unless they are correlated with additional metrics (for example, AWS service limits). OpsWatch obtains those additional metrics by querying the AWS APIs, allowing the operator to view a correlated result.

For more detailed information, refer to the “Amazon RDS Database Connections” example in the usage examples section later in this post.

Solution Overview

OpsWatch receives all metrics through an HTTPS endpoint and queues them for further processing. Message processors check whether the corresponding metadata, such as tags and other information, is available in the Redis cache. If it is not, requests for this additional information are sent to the corresponding worker queues.

The metric is returned to the original queue in a waiting state until all additional data is available. Once this data is complete, the metric is stored in the Redis cache with an appropriate time to live (TTL). Prometheus servers query the exporter endpoint to fetch all metrics, and OpsWatch exporters validate these requests and make sure only the corresponding metrics are presented.
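
The cache-check and requeue flow described above can be sketched in Python as follows. This is an illustration only: the `TtlCache` class is a small in-memory stand-in for the Redis cache, and the key layout, the `resource_id` field, and the 300-second default TTL are assumptions rather than OpsWatch internals.

```python
import time
from collections import deque


class TtlCache:
    """Tiny in-memory stand-in for the Redis cache (assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            return None  # missing or expired
        return entry[0]

    def set(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl)


def process(metric, cache, metric_queue, lookup_queue, ttl=300):
    """Enrich one metric, or park it and request the missing metadata."""
    metadata = cache.get(("meta", metric["resource_id"]))
    if metadata is None:
        lookup_queue.append(metric["resource_id"])  # ask a data worker for details
        metric_queue.append(metric)                 # metric waits for the metadata
        return False
    enriched = dict(metric, labels=metadata)
    cache.set(("metric", metric["name"], metric["resource_id"]), enriched, ttl)
    return True
```

A data worker would consume `lookup_queue`, fetch the details, and write them under the `("meta", resource_id)` key; the next pass over the waiting metric then succeeds.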

Figure 1 – High-level metrics flow.

For Prometheus, Arvato Systems provides a comprehensive list of rules based on years of operational experience running workloads on AWS. These rules use the enriched metrics to provide the operations team at Arvato Systems (as well as its customers) a way to observe their AWS workload health in Prometheus.

Furthermore, Grafana dashboards and integration into systems like Thanos allow customers to observe CloudWatch data in a multi-region, multi-account, and multi-cloud world. It is also possible to combine this data with non-CloudWatch metrics from containers or classical instances.

Figure 2 – Architecture diagram of the solution.

The following stages depict the “life of a metric” from ingest to display:

Event source: Amazon GuardDuty and Amazon CloudWatch metrics are supported as event sources.
Event transport: GuardDuty events are delivered through an Amazon Simple Notification Service (SNS) subscription to an HTTPS endpoint, and OpsWatch handles the subscription confirmation automatically. CloudWatch metric streams send their data to Amazon Kinesis Data Firehose, which also uses HTTPS as the target.
Event in queue: The first step is simply to enqueue each metric or event contained in the message into the respective queue for decoupled processing.
Main workers: CloudWatch metrics and GuardDuty events from the queue are transformed and enriched with metadata obtained through the various supported methods. On a cache miss, the workers request fresh data.
Specific data worker: Each type of worker group takes the requests from its queue and looks up details according to its task, typically either inside the customer account or from Arvato Systems’ customer database.
Metric generation: Data and its corresponding metadata are persisted into Amazon ElastiCache for further usage.
Metric export: When queried, the exporter returns the ElastiCache state in Prometheus format, while making sure the customer is correctly identified.
Prometheus integration flexibility: OpsWatch supports multiple Prometheus integration options per customers’ needs:
- Arvato Systems shared Prometheus environment, a typical choice for managed service customers.
- Integration into an existing customer-owned Kubernetes environment, powered by a customer self-managed Prometheus deployment.
- Integration into any existing Amazon Managed Service for Prometheus workspace.
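
As an illustration of the metric export stage, the sketch below renders cached samples in the Prometheus text exposition format and returns only the samples that belong to the requesting customer. The escaping rules for label values follow the Prometheus format; using the `asy_customer` label as the tenant filter and the function names are assumptions made for this example.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as: metric_name{label="value",...} value"""
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_for_customer(samples, customer):
    """Render only the samples whose asy_customer label matches the caller."""
    return "".join(
        render_sample(name, labels, value) + "\n"
        for name, labels, value in samples
        if labels.get("asy_customer") == customer
    )
```

Filtering at render time keeps the cache shared across customers while each Prometheus server still sees only its own series.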

Diving Deep: Usage Examples

You can find more examples in Arvato Systems’ Git repo.

1. High CPU Utilization (Simple Rule with Labels)

Let’s begin with a simple example. Amazon Elastic Compute Cloud (Amazon EC2) provides metrics like ‘CPUUtilization’ with dimensions like ‘InstanceId.’ OpsWatch provides all metrics using a consistent naming scheme:

aws_cloudwatch_NAMESPACE_METRIC_DIMENSION_1_DIMENSION_N

In this case, the metric will be named aws_cloudwatch_EC2_CPUUtilization_InstanceId
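
The naming scheme can be expressed as a small Python helper. This is a sketch of the pattern shown above, not OpsWatch code: it assumes that the `AWS/` prefix of the CloudWatch namespace is dropped, that dimensions are appended in the order given, and that characters not allowed in Prometheus metric names are replaced with underscores.

```python
import re


def metric_name(namespace, metric, dimensions):
    """Build aws_cloudwatch_NAMESPACE_METRIC_DIMENSION_1_DIMENSION_N."""
    short_namespace = namespace.split("/")[-1]  # "AWS/EC2" -> "EC2"
    raw = "_".join(["aws_cloudwatch", short_namespace, metric, *dimensions])
    # Assumption: anything outside [a-zA-Z0-9_:] becomes an underscore.
    return re.sub(r"[^a-zA-Z0-9_:]", "_", raw)


print(metric_name("AWS/EC2", "CPUUtilization", ["InstanceId"]))
```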

The corresponding metric looks like this in the CloudWatch console:

Figure 3 – Example ‘CPUUtilization’ metric shown in CloudWatch.

The same data will look like this in Prometheus:

Figure 4 – Example ‘CPUUtilization’ metric shown in Prometheus.

As you can see, the Prometheus graph looks similar to the one in CloudWatch. The additional labels under the graph show the benefit of the enriched data; for example, ec2_instance_type, ec2_lifecycle, and many more.

The following table shows a list of labels which are automatically created for each metric. There are lots of generic labels (with prefixes like aws, asy, dimension, tag, and overwrite) which are used for all metrics. In addition, OpsWatch provides specific labels like ec2_… for the EC2 service.

OpsWatch supports these service-specific labels for a growing list of services like AWS Certificate Manager, Amazon CloudFront, Amazon Relational Database Service (Amazon RDS), and others. These labels give operators the right context for each metric.

Label	Sample Value	Purpose
aws_account_id	123456	AWS account ID
aws_region	eu-central-1	AWS region
aws_service	Amazon EC2	AWS service
asy_customer	Example	Arvato Systems customer name
asy_…	.	Internal labels for managed service offering
dimension_InstanceId	i-01617b7c7ab0717f0	.
dimension_…	.	Other CloudWatch dimensions
ec2_instance_type	m6i.large	EC2 instance type
ec2_…	.	Other service-specific labels
tag_name	exampleinstance	Name tag of the instance
tag_…	.	Other tags like cost center, department, project, or other custom tag
stack_name	examplestack	CloudFormation stack name
overwrite_…	.	Special labels to ensure the source value is not overwritten; for example, if the Prometheus instance is already using the aws_account_id label
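
A hedged Python sketch of how such a label set might be assembled from the prefixes in the table. The lowercasing of tag keys (so that the Name tag becomes tag_name), the character replacement, and the choice of which labels receive an overwrite_ copy are assumptions inferred from the sample values and rules in this post, not documented OpsWatch behavior.

```python
import re


def _label_key(raw):
    """Replace characters that are not valid in a Prometheus label name."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", raw)


def build_labels(account_id, region, customer, dimensions, tags, service_labels):
    """Combine generic, dimension, tag, and service-specific labels."""
    labels = {
        "aws_account_id": account_id,
        "aws_region": region,
        "asy_customer": customer,
    }
    for key, value in dimensions.items():
        labels["dimension_" + _label_key(key)] = value
    for key, value in tags.items():
        labels["tag_" + _label_key(key).lower()] = value
    labels.update(service_labels)  # e.g. ec2_instance_type
    # Assumption: these three labels get a copy that survives relabeling.
    for key in ("aws_account_id", "aws_region", "asy_customer"):
        labels["overwrite_" + key] = labels[key]
    return labels
```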

The OpsWatch rule looks like this:

name: AwsCloudwatchEc2CPUUtilizationHigh
expr: round(aws_cloudwatch_EC2_CPUUtilization_InstanceId) > 90
# Alert only after the condition has held for one hour
for: 1h
labels:
  severity: warning
annotations:
  description: >-
    EC2 CPU utilization at {{$value}}% for instance
    {{ $labels.dimension_InstanceId }} in customer
    {{ $labels.overwrite_asy_customer }} account
    {{ $labels.overwrite_aws_account_id }} region
    {{ $labels.overwrite_aws_region }}
  summary: EC2 CPU utilization greater than 90%

2. Amazon EC2 Credit Balance (Prediction Rule)

Burstable instances should not have a CPU credit balance below a defined threshold. A low value in the CPU credit balance metric should therefore raise an alarm.

If the instance is running in unlimited credit mode, however, the instance is not throttled even if the credits are exhausted. Therefore, it’s helpful to know in which mode the EC2 instance is operating in order to create a useful alerting rule.

The Prometheus linear prediction function (predict_linear) lets you spot problems before they cause an issue.

name: AwsCloudwatchEc2CPUCreditBalancePrediction
# Extrapolate the last 30 minutes two hours ahead; skip instances in unlimited mode
expr: round(predict_linear(aws_cloudwatch_EC2_CPUCreditBalance_InstanceId{ec2_credits_mode!="unlimited"}[30m], 3600 * 2)) < 10
for: 30m
labels:
  severity: warning
annotations:
  description: >-
    EC2 CPU Credit balance will be at {{$value}} for instance
    {{ $labels.dimension_InstanceId }} in customer
    {{ $labels.overwrite_asy_customer }} account
    {{ $labels.overwrite_aws_account_id }} region
    {{ $labels.overwrite_aws_region }}
  runbook_url: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/burstable-credits-baseline-concepts.html
  summary: EC2 CPU Credit balance will be below 10 in 2 hours
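
To show the idea behind the prediction in this rule, here is a minimal Python sketch of a least-squares linear extrapolation over the samples in a window. It is a simplification: Prometheus extrapolates from the rule evaluation time, whereas this sketch extrapolates from the last sample, and the sample data is invented for illustration.

```python
def predict_linear(samples, horizon_seconds):
    """Fit a line to (timestamp_seconds, value) pairs and extrapolate it.

    Simplification: predicts the value horizon_seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + horizon_seconds)


# Hypothetical credit balance: 60 credits, dropping by 2 every 5 minutes for 30 minutes.
window = [(i * 300, 60 - 2 * i) for i in range(7)]
predicted = predict_linear(window, 3600 * 2)
would_alert = round(predicted) < 10
```

With this invented data the balance still sits at 48 credits, well above the threshold, yet the trend reaches roughly zero within two hours, so the condition is already true. That is the early warning the rule aims for.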

3. Amazon RDS Database Connections (API Describe Rule)

Another example is active Amazon RDS database connections. Databases have a limited number of allowed connections. The active database connection metric itself only shows the current number of active connections, but there are no metrics for the current limit (which can be dynamic in the case of Amazon Aurora Serverless, user defined, or dependent on its instance type).

Again, it’s difficult to create a meaningful alerting rule without knowing the maximum number of connections. Therefore, OpsWatch provides an additional metric called aws_apidescribe_RDS_MaxConnections_DBInstanceIdentifier, which can be used to calculate the percentage of used connections. The cluster-level rule below uses the corresponding DBClusterIdentifier variant of this metric.

name: AwsCloudwatchRdsClusterMaxConnections
# Used connections as a percentage of the limit reported by the describe API
expr: round(aws_cloudwatch_RDS_DatabaseConnections_DBClusterIdentifier / aws_apidescribe_RDS_MaxConnections_DBClusterIdentifier * 100) > 90
for: 15m
labels:
  severity: warning
annotations:
  description: >-
    RDS connection count is {{$value}}% for DB cluster
    {{ $labels.dimension_DBClusterIdentifier }} in customer
    {{ $labels.overwrite_asy_customer }} account
    {{ $labels.overwrite_aws_account_id }} region
    {{ $labels.overwrite_aws_region }}
  runbook_url: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html#RDS_Limits.MaxConnections
  summary: RDS connection count is more than 90%
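
The arithmetic behind this rule can be sketched in Python: current connections are divided by the limit for the same cluster, and only clusters present on both sides produce a result, which mirrors how the division of two metrics pairs up matching series. The function names and sample numbers are hypothetical, and Python's round() treats exact halves differently from the PromQL round function.

```python
def connection_usage_percent(connections, max_connections):
    """Return used connections as a rounded percentage of the limit."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return round(connections / max_connections * 100)


def clusters_over_threshold(current, limits, threshold=90):
    """Return {cluster: percent} for clusters above the threshold."""
    result = {}
    for cluster, connections in current.items():
        if cluster not in limits:
            continue  # no limit known, so no correlated result
        percent = connection_usage_percent(connections, limits[cluster])
        if percent > threshold:
            result[cluster] = percent
    return result
```

For example, a cluster with 950 of 1,000 allowed connections is reported at 95 percent, while a cluster without a known limit is skipped instead of raising a misleading alert.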

Summary

In this post, we have demonstrated how Arvato Systems approached the challenge of monitoring dynamic cloud environments. OpsWatch combines the open-source Prometheus ecosystem with Arvato Systems’ years of experience managing production environments for customers.

For further information, visit the Arvato Systems website.


Arvato Systems – AWS Partner Spotlight

Arvato Systems is an AWS Partner and service integrator with a strong footprint in full-service managed services, DevOps, and migration.

Contact Arvato Systems | Partner Overview

AWS Partner Network (APN) Blog