AWS Cloud Operations & Migrations Blog

Enhancing workload observability using Amazon CloudWatch Embedded Metric Format

Builders who run their workloads on AWS have many needs. In order to best serve their own customers, they need access to a reliable platform on which to run those workloads. They need flexible compute options, scalable data storage, and robust networking. They must make their workloads both scalable and highly available.

Builders also want to monitor and analyze their workloads. They want to identify and troubleshoot issues as soon as possible. They also want to perform resource planning to ensure that their customers have a great experience. Finally, builders want to optimize their workloads to get the best performance out of the AWS services they use, at the lowest cost.

Monitoring mechanisms

Traditionally, monitoring has been structured into three separate categories, using three different mechanisms.

The first category is the classic health check: run a test function or command that performs a specific check, and if it fails, alert someone responsible. Using AWS Lambda, customers can run remote health checks and use Amazon SNS to deliver notifications to any number of subscribers.
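
Here's a minimal sketch of what such a check might look like in Python; the environment variable names and the probed endpoint are placeholders, not part of any particular service:

# Hypothetical Lambda health check: probe an endpoint and notify an SNS topic on failure.
import os
import urllib.request

import boto3

sns = boto3.client("sns")

def handler(event, context):
    url = os.environ["HEALTH_CHECK_URL"]        # placeholder, e.g. https://example.com/health
    topic_arn = os.environ["ALERT_TOPIC_ARN"]   # placeholder SNS topic for alerts
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            healthy = response.status == 200
    except Exception:
        healthy = False
    if not healthy:
        sns.publish(TopicArn=topic_arn,
                    Subject="Health check failed",
                    Message=f"Health check against {url} failed")
    return {"healthy": healthy}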

The second category is metrics: builders instrument their code—often by hand—using models such as counters, timers, and gauges. They then either make those values available for collection (the Prometheus model) or dispatch them to a collector for aggregation and presentation. Amazon CloudWatch is purpose-built for receiving metrics from customer workloads and then aggregating and displaying them in graphs and dashboards. And using CloudWatch Alarms, customers can set metric thresholds to trigger alerts and dispatch them via Amazon SNS. Many builders now use metric values as the primary basis for determining their workloads’ health, making traditional health checks (the first category) obsolete.
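
To make this second category concrete, here's a sketch of a hand-instrumented counter and timer pushed to CloudWatch with the PutMetricData API (the namespace, metric, and dimension names are illustrative):

# Hypothetical hand-instrumented metrics published via the CloudWatch PutMetricData API.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_data(
    Namespace="MyService",                      # illustrative namespace
    MetricData=[
        {
            "MetricName": "Requests",           # a counter
            "Dimensions": [{"Name": "Operation", "Value": "Store"}],
            "Value": 1,
            "Unit": "Count",
        },
        {
            "MetricName": "ProcessingLatency",  # a timer
            "Dimensions": [{"Name": "Operation", "Value": "Store"}],
            "Value": 137.52,
            "Unit": "Milliseconds",
        },
    ],
)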

The third category is logs: logs are invaluable for observability. They provide builders continuous information about how their workloads are behaving. Logs help builders investigate issues whose causes can’t be readily determined from metrics alone. Logs also provide a way to derive and recover important metrics after the fact to aid in troubleshooting and analysis. Logs are also important for analytics. Many customers keep logs for months or even years, so they can deduce trends and make predictions to further their own business.

A challenge with metrics is that they tend to be great for real-time insights, but they are not always ideal for in-depth analysis or ad hoc queries. Granularity is often coarse—on the order of seconds or minutes—which can cause infrequent but significant spikes to be lost through averaging. Challenges in storage and database technology make it difficult to associate metrics with more than a few dimensions (up to 10 in CloudWatch metrics). And metrics tend to be rolled up into coarser granularities as they age, making deep insights even harder to derive with the passage of time.

What if builders had a way to massively improve the way they can observe their workloads, without making sacrifices in data granularity or richness? What if that solution could support high-cardinality data to meet the analytical needs of the most demanding customers? What if the solution could do all these things while making customers’ architecture less complex and reducing their reliance on multiple third-party tools?

Embedded Metric Format: A New Way to Observe Your Workloads

Amazon recently introduced a way to both unify and simplify how builders instrument their software while gaining powerful analytical capabilities. By sending logs in the new Embedded Metric Format, builders can now easily create custom metrics without having to use multiple libraries or maintain separate code.

Using Embedded Metric Format is easy. You can use the open-source JavaScript or Python client libraries to record an event with whatever metrics, dimensions, and metadata you want. Alternatively, you can submit logs in the format directly to the CloudWatch Logs API.

The format itself is simple:

{
  "_aws": {
    "Timestamp": 1565375354953123,     # UNIX timestamp (milliseconds since epoch)
    "CloudWatchMetrics": [             # Hints to CloudWatch, by namespace
      {
        "Namespace": "aws-embedded-metrics",
        "Dimensions": [                # List of dimension keys, up to 10 per set.
          [ "Operation" ],             # Values will be taken from top-level properties.
          [ "Operation", "Partition" ]
        ],
        "Metrics": [                   # List of metric names and optional units, referenced from top-level values
          { "Name": "Requests" },
          { "Name": "ProcessingLatency", "Unit": "Milliseconds" }
        ]
      }
    ]
  },
  "Message": "Completed processing",                 
  "CustomerName": "Globex Corp",                      
  "Requests": 1,                                       # Metric value (see above)
  "Operation": "Store",                                # Dimension key/value (see above)
  "Partition": "4",                                    # Dimension key/value (see above)
  "ProcessingLatency": 137.52,                         # Metric value (see above)
  "RequestId": "a4110ca8-4139-444d-ab95-ae8fe230aadc"
}

The _aws property is a metadata object that tells CloudWatch which fields are metrics and across which dimensions those metrics should be aggregated. It contains two subproperties, Timestamp and CloudWatchMetrics.

Timestamp indicates the time, as a UNIX timestamp, at which the event occurred (usually the current time) with millisecond precision.

Each entry in the CloudWatchMetrics list specifies a namespace, the dimension sets to aggregate across, and the metrics to extract; each resulting CloudWatch metric is uniquely defined by its namespace, metric name, and dimension combination. You can also specify a unit of measurement—the CloudWatch API documentation has a list of supported units.

All other properties in the JSON object are free for you to use as you like. Their values can be numbers, strings, or objects. If used as a metric, a number can represent a counter (in this case, Requests), a timer (ProcessingLatency), or a gauge. Strings can represent log messages or any attribute of the event you’d like to record. The example above records a customer name (Globex Corp), an operation name (Store), a partition (the string "4"), a request ID, and the processing latency associated with the request. You can even embed objects as properties. Regardless of type, all of these properties can be accessed, filtered, or matched in CloudWatch Logs Insights.
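
To make the mapping concrete, here's a sketch of how the document above could be assembled and emitted from Python. In a Lambda function, writing the JSON to standard output is enough for CloudWatch to extract the metrics:

# Build the Embedded Metric Format document shown above and write it to stdout.
import json
import time

emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),   # milliseconds since epoch
        "CloudWatchMetrics": [
            {
                "Namespace": "aws-embedded-metrics",
                "Dimensions": [["Operation"], ["Operation", "Partition"]],
                "Metrics": [
                    {"Name": "Requests"},
                    {"Name": "ProcessingLatency", "Unit": "Milliseconds"},
                ],
            }
        ],
    },
    "Message": "Completed processing",
    "CustomerName": "Globex Corp",
    "Requests": 1,
    "Operation": "Store",
    "Partition": "4",
    "ProcessingLatency": 137.52,
    "RequestId": "a4110ca8-4139-444d-ab95-ae8fe230aadc",
}

print(json.dumps(emf_event))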

Client libraries

For builders’ convenience, Amazon supplies JavaScript (Node.js) and Python client libraries today, with other libraries coming soon. These libraries make it simple (but are not required) to emit events in the Embedded Metric Format. They’re designed around decorators and wrapper functions, so existing callers do not have to change.

Here’s a JavaScript example:

const { metricScope, Unit } = require("aws-embedded-metrics");

const handleEvent = metricScope(metrics => (arg) => {
    // Do something with arg here, then:
    metrics.putMetric("Requests", 1, "Count");
    metrics.putMetric("ProcessingLatency", 137.52, Unit.Milliseconds);
    metrics.setProperty("Message", "Completed processing");
    metrics.setProperty("RequestId", "a4110ca8-4139-444d-ab95-ae8fe230aadc");
    metrics.setProperty("CustomerName", "Globex Corp");
    metrics.setProperty("RequestUri", "/v1/catalog");
    metrics.setProperty("ServerVersion", "1.2.4");
    metrics.putDimensions({ Operation: "Store" });
    metrics.putDimensions({ Operation: "Store", Partition: "4" });
});

// handle event and enqueue message to CloudWatch Logs
handleEvent(arg); 

Here’s a Python example:

from aws_embedded_metrics import metric_scope

@metric_scope
def handle_event(arg, metrics):
    # Do something with arg here, then:
    metrics.put_metric("Requests", 1, "Count")
    metrics.put_metric("ProcessingLatency", 137.52, "Milliseconds")
    metrics.set_property("Message", "Completed processing")
    metrics.set_property("RequestId", "a4110ca8-4139-444d-ab95-ae8fe230aadc")
    metrics.set_property("CustomerName", "Globex Corp")
    metrics.set_property("RequestUri", "/v1/catalog")
    metrics.set_property("ServerVersion", "1.2.4")
    metrics.put_dimensions({ "Operation": "Store" })
    metrics.put_dimensions({ "Operation": "Store", "Partition": "4" })

# handle event and enqueue message to CloudWatch Logs
handle_event(arg)

The client libraries also let you configure the CloudWatch Logs log group and stream names (in Lambda functions, these are chosen automatically). In addition, they add some default dimensions, including ServiceName, ServiceType, and LogGroupName, which link the metrics back to the application and the log streams that generated them.
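
Here's a rough sketch of that configuration using the Python library's configuration interface (check the library README for the exact settings and their environment-variable equivalents, such as AWS_EMF_LOG_GROUP_NAME):

# Sketch: configure the aws_embedded_metrics library outside of Lambda.
from aws_embedded_metrics.config import get_config

Config = get_config()
Config.service_name = "MyService"               # becomes the ServiceName dimension
Config.service_type = "AWS::EC2::Instance"      # becomes the ServiceType dimension
Config.log_group_name = "/service/MyService"    # CloudWatch Logs group to write to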

Getting started is easy. In Lambda functions, you can use the provided client libraries to create and send Embedded Metric Format logs to CloudWatch. For other compute platforms (Amazon EC2, Amazon ECS, Amazon EKS, or on-premises), the CloudWatch Agent provides an integration. Otherwise, builders can call the CloudWatch Logs PutLogEvents API directly.

Details about the client libraries, the CloudWatch Agent, and the CloudWatch Logs PutLogEvents API can be found in the documentation.
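
For example, a direct submission with the PutLogEvents API might look like the following sketch. It assumes the log group and stream already exist, and it omits error handling and the log-format header described in the PutLogEvents documentation for Embedded Metric Format:

# Sketch: send a single Embedded Metric Format event straight to CloudWatch Logs.
import json
import time

import boto3

logs = boto3.client("logs")

emf_event = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "aws-embedded-metrics",
            "Dimensions": [["Operation"]],
            "Metrics": [{"Name": "Requests"}],
        }],
    },
    "Operation": "Store",
    "Requests": 1,
}

logs.put_log_events(
    logGroupName="/service/MyService",
    logStreamName="manual-emf-example",          # illustrative stream name
    logEvents=[{"timestamp": int(time.time() * 1000), "message": json.dumps(emf_event)}],
)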

Viewing Metrics

Once you’ve begun emitting Embedded Metric Format logs to CloudWatch Logs, you can immediately start using them in CloudWatch.

To illustrate, I’ve written a program that simulates requests from various customers whose data is served from different partitions. Customer request rates vary, and response latency varies by partition and customer. During execution, the program records metrics via the aws-embedded-metrics client library.

Once I start my program, I can see the logs begin to flow into CloudWatch Logs. I’ve specified a useful log group name (/service/MyService). Since I’m running the program on an Amazon EC2 instance, the log stream name is automatically set to the instance ID. The library also adds some useful properties to the logs, including the service name, the service type (AWS::EC2::Instance, since I’m running on an EC2 instance), and various instance properties, as shown in the following screenshot.

CloudWatch Logs entries

Now that the logs are flowing, I can switch over to CloudWatch Metrics.

First, I go to the aws-embedded-metrics namespace, as shown in the following screenshot:

Select aws-embedded-metrics

Then, I select the dimension set against which I want to plot metrics, as shown in the following screenshot:

Select metric dimensions

I see all of the metric dimension values associated with my program, as shown in the following screenshot. They’re auto-filled for me, even though all my program did was emit log entries!

Metric dimension values

Now I can start to plot some graphs, as shown in the following screenshot. Response latency affects my users the most, and so I’ll graph the ProcessingLatency by partition, and add a nice title:

Request latency by partition

I can now immediately take advantage of all the features CloudWatch Metrics has to offer, including metric math, anomaly detection, alarm thresholds, and much more.
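
For example, an alarm on the extracted latency metric might be sketched like this (the threshold, alarm name, and SNS topic are illustrative):

# Sketch: alarm when average ProcessingLatency for the Store operation exceeds 500 ms.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="StoreProcessingLatencyHigh",
    Namespace="aws-embedded-metrics",
    MetricName="ProcessingLatency",
    Dimensions=[{"Name": "Operation", "Value": "Store"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=500.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic ARN
)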

Deep analysis with CloudWatch Logs Insights

Suppose I have a CloudWatch metric called DiskFreeBytes whose dimensions are:

  • The EC2 instance ID
  • The OS mount point or drive letter

The cardinality of that metric is roughly the number of instances I’ve ever run, multiplied by the number of mount points or drive letters on each. If I’ve launched 100 different Amazon EC2 instances, and each of those has had two mount points, the metric’s cardinality would be 200.

Now, suppose that I have a CloudWatch metric called ResponseLatency whose dimensions are:

  • The server’s instance ID
  • The customer ID

Also, suppose that I have 100 servers and one million customers. The cardinality of that metric would be 100 million.

In CloudWatch Metrics, each unique combination of dimension values results in a new time-series metric. If the cardinality of my metric is 200, as in the first example, 200 unique metrics are created. But if the cardinality of my metric is 100 million, as in the second example, 100 million unique metrics would be created! This could have significant cost and performance implications. And if the metric is relatively sparse—that is, there’s not a value for a given dimension combination at every time interval—I might incur a cost to store that metric anyway.

Sparse data also makes graphs difficult to read: instead of lines or curves, you see only scattered points. So, to minimize cost, improve performance, and keep metrics useful, it’s a good idea to record metrics that have relatively low cardinality.

The advice I typically give to customers is that if the possible values for a key could fit on a screen or two in a single column, it’s a good candidate for a metric dimension. Otherwise, it’s best to leave it as a property.

For example, I might host data for thousands of customers (a good candidate for a property), but have a much smaller number of partitions (a good candidate for a dimension). Similarly, my service might handle thousands of different URLs (best as a property), but only a few categories of goods (best as a dimension).
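
Using the Python client library, that guidance translates to a sketch like this: keep the low-cardinality partition as a dimension, and record the high-cardinality customer name only as a property.

# Sketch: low-cardinality values become dimensions; high-cardinality values stay properties.
from aws_embedded_metrics import metric_scope

@metric_scope
def handle_request(customer_name, partition, latency_ms, metrics):
    metrics.put_dimensions({ "Partition": partition })       # a handful of partitions: good dimension
    metrics.set_property("CustomerName", customer_name)      # thousands of customers: property only
    metrics.put_metric("ProcessingLatency", latency_ms, "Milliseconds")

handle_request("Globex Corp", "4", 137.52)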

With CloudWatch Logs Insights, your analysis isn’t limited to the dimensions you’ve recorded. Suppose you run a multi-tenant service with millions of customers, and a customer reports a performance issue. You want to solve the problem, but average performance looks normal. Ordinarily, this would be challenging and time-consuming to investigate. With CloudWatch Logs Insights, you can quickly and easily find the problem—even if it’s only impacting a single customer.
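
As a rough sketch, here's how I could run a CloudWatch Logs Insights query that surfaces per-customer latency, no matter how many customers there are; the log group and field names follow the earlier examples:

# Sketch: find the slowest customers with a CloudWatch Logs Insights query.
import time

import boto3

logs = boto3.client("logs")

query = """
fields @timestamp, CustomerName, ProcessingLatency
| filter ispresent(ProcessingLatency)
| stats avg(ProcessingLatency) as AvgLatency by CustomerName
| sort AvgLatency desc
| limit 20
"""

response = logs.start_query(
    logGroupName="/service/MyService",
    startTime=int(time.time()) - 3600,   # the last hour
    endTime=int(time.time()),
    queryString=query,
)

# Results are not available immediately; poll until the query status is Complete.
results = logs.get_query_results(queryId=response["queryId"])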

Recording as properties

If I go to CloudWatch Logs Insights and select the log group associated with my program, I immediately see a sample of the most recent logs. Since the Embedded Metric Format is JSON, CloudWatch Logs Insights discovers the fields automatically, without my help.

CloudWatch Logs Insights parsed fields

Suppose some of my customers report that they’re experiencing slow performance. I start an investigation. Is the problem limited to a particular customer? I chart latency by customer, as shown in the following screenshot:

Latency by customer

Sure enough, Globex is having problems. But Acme’s performance is also suffering.

I investigate what might be the cause, as shown in the following screenshot. Is it the partition Globex and Acme are on?

Partition count

Acme and Globex are on different partitions, so that’s not necessarily the problem. Maybe it’s an operation? I search for the worst-performing operations for these two customers, as shown in the following screenshot:

Highest latency operations by customer

It looks like the purchaseCart operation for these two customers is associated with the highest latency. How bad is it relative to the rest of my customers?

PurchaseCart latency

Maybe it’s not the purchaseCart operation alone, since Initech is also affected by it, as shown in the preceding screenshot. What else could it be? Perhaps it was caused by a recent update? I compare the performance across software versions, as shown in the following screenshot:

ServerVersion latency

I’ve found the culprit! Latency nearly doubled when I updated some instances of the program from version 1.2.4 to 1.2.5. This gives me the data I need to initiate a rollback. Afterwards, I can investigate the root cause of the performance regression.
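
A query along the following lines reproduces that comparison; it assumes the ProcessingLatency and ServerVersion fields recorded by the earlier examples:

# Sketch: compare latency across software versions with CloudWatch Logs Insights.
version_comparison_query = """
filter ispresent(ProcessingLatency)
| stats avg(ProcessingLatency) as AvgLatency, count(*) as Requests by ServerVersion
| sort ServerVersion asc
"""
# Run with logs.start_query(...) as in the earlier sketch.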

Conclusion

With CloudWatch Embedded Metric Format, builders gain access to comprehensive real-time metrics for their workloads. Since every event is captured in the log, no fidelity is lost—even brief but important events can be investigated. Every event can be enriched with high-cardinality properties. Builders can perform complex analyses to gain insights into trends and correlations. Finally, they can minimize Mean Time to Recovery (MTTR) by quickly drilling down into the root causes of performance issues and errors. I encourage you to get started with it today.

About the Author

Michael Fischer is a Senior Specialist Solutions Architect at Amazon Web Services. He focuses on helping customers modernize, scale, and monitor their workloads. Michael has an extensive background in systems programming, monitoring, and observability. His hobbies include world travel, diving, and playing the drums.