Observability using native Amazon CloudWatch and AWS X-Ray for serverless modern applications

Introduction

In this blog post, we will share how you can use AWS-native observability tools to measure the current state of your modern serverless applications and how to get started with minimal effort. We will review tools like Amazon CloudWatch and AWS X-Ray and explore how these services can help you instrument your application for full observability of logs, metrics, and traces. You will learn about the seamless integration with other AWS services and how easily you can enable additional capabilities such as custom dashboards, alarms, and anomaly detection. All this is achieved while meeting your compliance and security requirements. To make sure you have the required knowledge to get started, we will also discuss training options that help you get the most out of the AWS Observability tools.

Figure 1: AWS Observability best practices with native tools

We will walk you through the different mechanisms and best practices in the following sections:

Ease of getting started with no coding

AWS Observability tools offer the fastest and simplest option to get started with observability for your serverless application. Our tools are specifically designed to work with AWS services and can be set up and configured easily. CloudWatch allows you to aggregate logs, collect metrics, and build up service maps of your applications.

If you are using AWS Lambda, use Amazon CloudWatch Lambda Insights to start getting metrics about your functions. Getting set up is quick and requires no additional coding. You just need to enable Enhanced Monitoring for each function, either in the Lambda console or programmatically. This provides key metrics such as invocation rate, duration, and error count that give you insight into your function’s performance. Additionally, it provides system-level metrics such as CPU time, memory usage, and network performance. To visualize these metrics, you can create custom dashboards that tailor the data to your needs. The following image shows a dashboard of a single Lambda function’s metrics.

Figure 2: CloudWatch Lambda Insights Single Function Performance Monitoring Dashboard

If you need to review recent events in more detail, the most recent 1,000 invocations and the most recent 1,000 application logs are readily available at the bottom of the page.

Figure 3: CloudWatch Lambda Insights invocations view.

If any invocations stand out and you want to review the related logs, you can bring up the Amazon CloudWatch Logs Insights page. CloudWatch Logs Insights lets you search and analyze your log data by running queries. When you select the invocation results that you want to investigate, CloudWatch Logs Insights opens with a pre-built query for fast analysis of those results. It also offers a set of sample queries that you can customize to your needs. Used together, CloudWatch Lambda Insights and CloudWatch Logs Insights can reduce the mean time to detect the root cause of a problem. Watch this demo on Using CloudWatch Lambda Insights to explore what else this tool can do.

Figure 4: CloudWatch Logs Insights query results showing a Lambda function log histogram.

If your serverless application is deployed with the AWS Serverless Application Model (AWS SAM), you can enable metrics with minimal configuration. This allows you to use Amazon CloudWatch Application Insights, where you can explore machine learning generated dashboards that surface potential problems in your monitored application, such as metric anomalies and log errors. You can watch the demo introduction to CloudWatch Application Insights to get more familiar with its advantages. The following snapshot shows a problem that CloudWatch Application Insights detected in an AWS Step Functions application, presented with the root cause analysis and the error logs captured.

Figure 5: CloudWatch Application Insights dashboard showing an automatically detected issue on Step Functions application with AWS SAM.

If your application is running serverless containers with Amazon Elastic Container Service (Amazon ECS) and AWS Fargate, getting started with metrics and logs only requires enabling Amazon CloudWatch Container Insights. CloudWatch Container Insights provides real-time visibility into the performance and health of your containerized workloads and their underlying infrastructure. You can review task, service, and cluster level metrics. It integrates with CloudWatch Logs Insights, allowing you to access performance logs for deep dives. The following figure shows a CloudWatch Container Insights dashboard with the performance of a cluster.

Figure 6: CloudWatch Container Insights dashboard for ECS Fargate cluster performance metrics.

If you are running your serverless containers on Amazon Elastic Kubernetes Service (Amazon EKS) with Fargate, you can also leverage CloudWatch Container Insights. To set up this feature, install and configure a CloudWatch agent or the AWS Distro for OpenTelemetry agent. For more information about the Distro for OpenTelemetry, visit its official website.

Figure 7: CloudWatch Container Insights dashboard for EKS Fargate cluster performance metrics.

CloudWatch Container Insights offers a Container Map that shows a graphical representation of the logical layout of the container cluster. To help you quickly identify memory or CPU utilization outliers, the map colors each node as a heatmap. In the following image, several namespaces are shown that are part of the same cluster. The cluster node is turning yellow because its memory utilization percentage is the highest compared to the other nodes.

Figure 8: CloudWatch Container Insights Map View for EKS Fargate cluster

Additionally, CloudWatch Container Insights provides a Resource view that lets you switch from the map view to a list view. This view integrates seamlessly with metrics and logs. Clicking View logs or View dashboard brings up the logs or the metrics charts for the selected resource, respectively. The following image shows a list of the resources of an EKS cluster and their CPU and memory utilization.

Figure 9: EKS Cluster Resource view showing CPU and Memory Utilization. Highlighted in green are the View logs and View dashboard buttons that allow you to navigate a level deeper into specific logs in CloudWatch Logs Insights or to the metrics view.

All the tools that we reviewed in this section of the blog post work with metrics and logs that AWS services are publishing to CloudWatch by default. This offers you a quick starting point to gather critical data points for your application with the least effort and without the need to code.

Figure 10: Summary of the CloudWatch vended metrics and logs options for each of the serverless services.

In the next section, we will review how you can get additional insights about your application if you instrument it to get custom metrics, custom logs and tracing.

Customizing observability with instrumentation to fit your needs

As you expand your application performance monitoring capabilities, you will need to gain insight beyond what the default metrics and default logs provide. AWS-native Observability tools can meet your special needs and allow you to create custom metrics, collect custom logs and set up distributed tracing. We will review how to instrument your application to take observability to the next level.

Instrumenting in AWS Lambda

Logs

Instrumenting application logs can be achieved by adding a couple of lines of code. For instance, in Node.js you can use the methods on the console object or any logging library that writes to stdout or stderr. Lambda automatically sends these function logs to CloudWatch without any additional configuration.
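As a minimal sketch of this approach, the following Python handler (Python is used here only for illustration; the same idea applies to the console object in Node.js) writes one JSON object per line to stdout. The field names (`level`, `message`, `order_id`) and the shape of the incoming event are assumptions made for this example, not part of any AWS API.

```python
import json
import sys
import time


def log(level: str, message: str, **fields) -> str:
    """Write one JSON log line to stdout and return it."""
    record = {
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "level": level,
        "message": message,
    }
    record.update(fields)
    line = json.dumps(record)
    # Lambda forwards anything written to stdout or stderr to CloudWatch Logs.
    print(line, file=sys.stdout)
    return line


def handler(event, context):
    # `order_id` is a hypothetical field used only for illustration.
    log("INFO", "order received", order_id=event.get("order_id"))
    return {"statusCode": 200}
```

Emitting JSON rather than free text keeps every field individually queryable later in CloudWatch Logs Insights.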

Metrics

The best practice for collecting custom metrics from Lambda is to publish them asynchronously through CloudWatch Logs using the Embedded Metric Format (EMF). EMF allows CloudWatch Logs to automatically extract metrics from your logs, enabling the creation of alarms and dashboards. Using EMF frees the Lambda function from making synchronous calls to the PutMetricData API, which would extend its execution time. Reducing the function’s execution time decreases runtime cost and improves performance.

Amazon provides open-source client libraries for Node.js, Python, Java and C# to make it simpler to start with EMF. Alternatively, you can generate the logs manually by following the defined JSON format and sending them with the PutLogEvents API.
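To make the JSON format more concrete, here is a hedged sketch that builds an EMF record by hand and prints it from a Lambda handler, where stdout is already delivered to CloudWatch Logs. The namespace `MyApp`, the metric `OrdersProcessed`, the `Service` dimension, and the `requestId` property are all hypothetical names chosen for this example. In practice, the client libraries mentioned above handle these details for you.

```python
import json
import time


def emf_record(namespace, metric_name, value, unit, dimensions, properties=None):
    """Build one Embedded Metric Format record as a plain dict."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set, listing the keys defined below.
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by the metric name.
        metric_name: value,
    }
    record.update(dimensions)        # dimension values, also top level
    record.update(properties or {})  # extra context, not turned into metrics
    return record


def handler(event, context):
    record = emf_record(
        namespace="MyApp",  # hypothetical namespace
        metric_name="OrdersProcessed",
        value=1,
        unit="Count",
        dimensions={"Service": "checkout"},
        properties={"requestId": getattr(context, "aws_request_id", "unknown")},
    )
    # Printing is all that is needed in Lambda: no PutMetricData call is made.
    print(json.dumps(record))
    return record
```

Only the keys named under `Dimensions` and `Metrics` become CloudWatch metrics. Everything else, such as `requestId`, stays in the log event as searchable context.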

This blog post on how to operate Lambda logging and custom metrics can help you get started. It reviews the most important metrics that Lambda vends by default and then shows how to use custom metrics with EMF to obtain application-specific information, which gives you deeper insight than performance-related data alone.

To get started with EMF, the AWS Serverless Observability Workshop offers hands-on instructions to learn more about the processing of logging metrics with EMF in Lambda.

Traces

Getting started with X-Ray traces in Lambda only requires enabling Active tracing in the function configuration, either in the console or programmatically. The Lambda service then records traces for the function's invocations. To propagate the trace downstream to other Lambda functions or AWS services, you need to add code. Two options are available:

· Distro for OpenTelemetry to configure Lambda tracing

· X-Ray SDK (Node.js, Python, Java, among others)
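To illustrate what these libraries automate, the sketch below shows the trace context itself. It assumes that the Lambda runtime exposes the current tracing header in the `_X_AMZN_TRACE_ID` environment variable and that downstream HTTP services read the `X-Amzn-Trace-Id` header. The helper names are hypothetical. This is not a replacement for the X-Ray SDK or the Distro for OpenTelemetry, which also record subsegments and send trace data for you.

```python
import os


def parse_trace_header(header: str) -> dict:
    """Split a tracing header such as 'Root=...;Parent=...;Sampled=1' into fields."""
    fields = {}
    for part in header.split(";"):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            fields[key.strip()] = value.strip()
    return fields


def downstream_headers(environ=None) -> dict:
    """Return HTTP headers that carry the current trace context, if any."""
    environ = os.environ if environ is None else environ
    header = environ.get("_X_AMZN_TRACE_ID", "")
    return {"X-Amzn-Trace-Id": header} if header else {}
```

A function would merge `downstream_headers()` into the headers of its outgoing HTTP request. The SDKs do the equivalent automatically when they patch your HTTP and AWS clients.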

Powertools for AWS Lambda to simplify and accelerate best practice adoption

We recommend that development teams implement observability best practices for Lambda using Powertools for AWS Lambda, an open-source project from Amazon. It provides utilities for structured logging, EMF metrics, and distributed tracing that simplify writing observability code and adhering to best practices.

For example, if you want to add a correlation ID to your application logs, Powertools simplifies this process without affecting your function’s performance. Powertools is available for Python, Java, TypeScript and .NET.

Instrumenting serverless containers (ECS/EKS with Fargate)

For the Kubernetes control plane, Amazon EKS integrates with CloudWatch Logs without requiring you to install a CloudWatch agent. Amazon EKS control plane logging delivers audit and diagnostic logs directly to CloudWatch Logs, which helps you secure and operate your clusters.

CloudWatch Container Insights enables you to collect, analyze and visualize the performance of your containerized applications running on Amazon Elastic Container Service, Amazon EKS and AWS Fargate. Container Insights also provides diagnostic information, such as CrashLoopBackOff errors in Kubernetes pods, to help you isolate, investigate and resolve issues faster.

To collect data and logs from disparate sources, unify them, and route them to multiple destinations such as CloudWatch Logs, Amazon S3, Amazon Relational Database Service (Amazon RDS), etc., we recommend Fluent Bit. Fluent Bit is a lightweight, open-source, multi-platform log processor that is fully compatible with Docker and Kubernetes. Because of its significant performance gains, we recommend Fluent Bit as the default log solution for CloudWatch Container Insights.

To learn more about Fluent Bit integration in CloudWatch Container Insights for Amazon EKS, check this blog post.

In the next section, we will review additional out-of-the-box capabilities that can be obtained by using the native observability tools.

Advanced out-of-the-box capabilities

AWS Observability tools come with advanced capabilities out-of-the-box such as automated and custom dashboards, alarms, anomaly detection and integration with third-party tools.

To perform operational remediation on your application, we recommend using CloudWatch alarms that can be triggered on configurable thresholds based on your application metrics. For example, you may create an alarm that will scale out your Amazon ECS cluster based on the CPU Utilization reaching 80%.
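As a hedged sketch of such an alarm, the following Python code builds the parameters for the CloudWatch `PutMetricAlarm` API and, separately, applies them with boto3. The alarm name, the three one-minute evaluation periods, and the `action_arn` (for example, the ARN of a scaling policy you have already created) are assumptions for illustration. The sketch watches a single ECS service through the `ClusterName` and `ServiceName` dimensions.

```python
def cpu_alarm_params(cluster: str, service: str, action_arn: str,
                     threshold: float = 80.0) -> dict:
    """Build PutMetricAlarm parameters for average ECS service CPU utilization."""
    return {
        "AlarmName": f"{cluster}-{service}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,            # seconds per datapoint
        "EvaluationPeriods": 3,  # assumption: 3 consecutive breaching minutes
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [action_arn],
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to review and unit test before anything is created in your account.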

To identify bottlenecks in your serverless applications, we recommend using X-Ray to generate a service map. The service map is a visual representation of the tracing results that you can use to identify nodes where errors are occurring, connections with high latency, or requests that were unsuccessful.

The following image shows the elements of a ServiceLens Service Map. This map simplifies identifying any problem with a specific node of your application. If you select a Lambda function, like the one shown in the figure, you can get more granular metrics. The Latency, Requests, and Faults charts are displayed, and you can navigate to the details of specific traces or to the logs captured. This helps you correlate any issues identified in the map with application logs and quickly determine a possible root cause and remediation.

Figure 11: CloudWatch ServiceLens Service Map showing the details of a Lambda function

To narrow down specific traces, navigate to View Traces. This takes you to the Traces console, where you can query and filter traces. Depending on your filters, a set of traces is shown at the bottom of the console. When you click on any of the trace IDs, the console brings up the trace map, the segments timeline, and the logs for the selected trace. This is a powerful view: you can correlate trace results and logs in a single place and quickly identify potential root causes for the issue being investigated.

Figure 12: Trace details showing the segments time and logs in a single view for potential correlation.

To learn more about this example and how to set up the ServiceLens Service Map, you may visit this blog post for well-architected serverless applications or the One Observability Workshop ServiceLens section.

In the next section, we will discuss how you can achieve these results while optimizing costs, and how to get more detail about how your spend is distributed across the different CloudWatch tools.

Cost Efficiency with AWS-native Observability tools

AWS-native Observability tools offer a “pay as you go” pricing model that lets you start using the toolset without a long-term commitment or licensing. This gives you the flexibility to adjust cost to your budget. Let’s review the cost optimization options for metrics, logs and traces.

Metrics

When working with metrics, basic monitoring metrics are part of the AWS Free Usage Tier. If you enable Enhanced Monitoring for Lambda to use CloudWatch Lambda Insights, or create custom business and application metrics, there is a charge associated with each metric created.

The custom metrics are prorated by the hour, hence if your Lambda function gets invoked less than once per hour, you will only be billed for the hours that it is invoked.

Metrics are created for each namespace, metric name and dimension set that you set up. Pay attention to how granular you define your dimensions since each new combination will be recorded as a new metric.

For example, if you publish custom metrics with the following properties:

Dimensions: Env=Prod, Geo=EMEA, Unit: Count, Timestamp: 2023-09-14T12:30:00Z, Value: 120
Dimensions: Env=Dev, Geo=AMER, Unit: Count, Timestamp: 2023-09-14T12:31:00Z, Value: 130
Dimensions: Env=Prod, Geo=LATAM, Unit: Count, Timestamp: 2023-09-14T12:32:00Z, Value: 93
Dimensions: Env=Staging, Geo=APAC, Unit: Count, Timestamp: 2023-09-14T12:33:00Z, Value: 91

This will be counted as 4 different metrics because each datapoint has a unique dimension set.
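The counting rule above can be sketched in a few lines of Python. The namespace `MyApp` and the metric name `Requests` are assumptions, since the example datapoints do not name them. The function is only a way to reason about cardinality, not a billing calculator.

```python
def count_distinct_metrics(namespace: str, metric_name: str, dimension_sets) -> int:
    """Count unique (namespace, metric name, dimension set) combinations."""
    unique = {
        (namespace, metric_name, frozenset(dims.items()))
        for dims in dimension_sets
    }
    return len(unique)


# The four datapoints from the example above, reduced to their dimensions.
datapoints = [
    {"Env": "Prod", "Geo": "EMEA"},
    {"Env": "Dev", "Geo": "AMER"},
    {"Env": "Prod", "Geo": "LATAM"},
    {"Env": "Staging", "Geo": "APAC"},
]

distinct = count_distinct_metrics("MyApp", "Requests", datapoints)
```

Publishing a second datapoint with `Env=Prod, Geo=EMEA` would not change the count, while adding a new dimension such as a per-customer ID would create one more metric for every customer.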

For high-cardinality systems, one option to optimize the cost of custom metrics is to keep the high-cardinality information in properties other than the dimension sets. The Embedded Metric Format (EMF) allows you to log any additional properties alongside the metric. This keeps the original high-cardinality context in CloudWatch Logs, where it is accessible through queries in CloudWatch Logs Insights.

Another cost saving opportunity is to identify metrics that are not monitored regularly but are still needed for root cause analysis or isolated scenarios. Store these metrics only as logs and query them only when needed. You pay only for the logs ingested and the queries run on demand, not for the metrics. If you need to turn those logs into metrics later on, metric filters allow you to generate metrics from the log data. Turning on a metric filter for the time needed and then deleting it can help you avoid unnecessary custom metrics usage and cost.
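The following sketch shows how a temporary metric filter could be switched on and off with boto3. The log group name, filter name, namespace `MyApp`, metric name `ErrorCount`, and the JSON field `level` are hypothetical. The pattern assumes that your logs are JSON objects with a `level` field.

```python
def metric_filter_params(log_group: str, filter_name: str) -> dict:
    """Build PutMetricFilter parameters that count JSON log events with level ERROR."""
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        # Matches JSON log events whose `level` field equals "ERROR".
        "filterPattern": '{ $.level = "ERROR" }',
        "metricTransformations": [
            {
                "metricName": "ErrorCount",   # hypothetical metric name
                "metricNamespace": "MyApp",   # hypothetical namespace
                "metricValue": "1",           # add 1 for every matching event
            }
        ],
    }


def enable_filter(params: dict) -> None:
    """Start generating the metric. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    boto3.client("logs").put_metric_filter(**params)


def disable_filter(params: dict) -> None:
    """Delete the filter so that no further custom metric datapoints are produced."""
    import boto3

    boto3.client("logs").delete_metric_filter(
        logGroupName=params["logGroupName"],
        filterName=params["filterName"],
    )
```

A metric filter only applies to log events ingested after it is created, so enable it before the period you want to observe.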

Logs

To optimize the cost of CloudWatch Logs, keep an eye on the log groups with the highest log ingestion charges. This allows you to identify opportunities to evaluate your logging levels and adjust them as necessary. Identify your top log groups by following the steps in this AWS Knowledge Center article.

Keeping log retention policies up to date allows you to reduce unnecessary costs for log archival. By default, the CloudWatch log retention policy is set to Never Expire, so you need to change this configuration if you are not required to keep logs indefinitely. This blog post shows an automated method for keeping the retention policies of your log groups up to date.
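As a minimal sketch of that idea (not the method from the linked blog post), the code below finds log groups that still have no retention setting and applies one with boto3. The 30-day default is an assumption, so pick the value that your compliance requirements dictate.

```python
def groups_without_retention(log_groups) -> list:
    """Return the names of log groups that never expire.

    `log_groups` is a list of dicts shaped like the `logGroups` entries
    returned by DescribeLogGroups; groups that never expire have no
    `retentionInDays` key.
    """
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]


def apply_retention(days: int = 30) -> list:
    """Set a retention policy on every group that lacks one. Requires boto3."""
    import boto3  # imported lazily so the helper above runs without the AWS SDK

    logs = boto3.client("logs")
    updated = []
    for page in logs.get_paginator("describe_log_groups").paginate():
        for name in groups_without_retention(page["logGroups"]):
            logs.put_retention_policy(logGroupName=name, retentionInDays=days)
            updated.append(name)
    return updated
```

Running `groups_without_retention` on its own first gives you a dry-run list to review before any policy is changed.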

When running queries in CloudWatch Logs Insights, you are charged for the amount of data scanned. To lower your analysis costs, run queries over short time ranges and only against the log groups you need.

Traces

To optimize cost for AWS X-Ray traces, choose an appropriate sampling rule for each application or environment. Extensive tracing increases your monthly cost, so any decision to increase the sampling rate needs to be evaluated for each use case. For instance, in environments with typically low traffic, like your development environment, a high sampling rate that captures any potential issue might be a cost-effective option. In high-traffic environments, a lower sampling rate will still provide enough data to capture potential problems.
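To reason about this trade-off, the following sketch estimates how many requests per second would be traced under a rule made of a reservoir and a fixed rate. The default values mirror our understanding of the X-Ray default rule (the first request each second plus 5% of additional requests). Treat them as assumptions, and treat the result as an approximation for a single rule applied to steady traffic.

```python
def expected_traces_per_second(requests_per_second: float,
                               reservoir: int = 1,
                               fixed_rate: float = 0.05) -> float:
    """Estimate sampled requests per second for one sampling rule."""
    # Requests up to the reservoir size are always sampled.
    guaranteed = min(requests_per_second, reservoir)
    # The fixed rate applies only to requests beyond the reservoir.
    remainder = max(requests_per_second - reservoir, 0)
    return guaranteed + remainder * fixed_rate


# Compare a high-traffic service under two candidate rates (hypothetical traffic).
high_traffic_default = expected_traces_per_second(1000)
high_traffic_full = expected_traces_per_second(1000, fixed_rate=1.0)
```

Multiplying the estimate by the number of seconds in a month gives a rough trace volume that you can compare against current X-Ray pricing for your Region.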

AWS Billing console tools

Being aware of any spike or change in cost before the end of the billing cycle helps you take timely remediation actions. Setting up billing alerts for CloudWatch and X-Ray will notify you if your charges exceed a specific targeted budget.

If costs have increased, AWS Cost and Usage Reports (CUR) give you the level of detail needed to understand your CloudWatch charges, find what caused the increase, and adjust as necessary. To get started with Cost and Usage Reports, AWS offers a CUR query library that includes a sample query for CloudWatch costs.

In the next section, we will discuss how you may leverage CloudWatch native tools to improve your security posture while keeping compliance with different standards and requirements.

Compliance, Security and Data Protection

A key aspect of compliance and security within CloudWatch is its ability to collect, aggregate, and summarize logs and metrics from your containerized, microservices, and serverless applications on AWS. Container Insights is available for Amazon ECS, Amazon EKS, and Kubernetes platforms on Amazon Elastic Compute Cloud (Amazon EC2), and CloudWatch Logs is available for Lambda functions. By centralizing this data, CloudWatch dashboards give you a comprehensive and consolidated view of your infrastructure and applications, making it easier to detect and respond to potential security threats and compliance violations.

Here are a couple of case studies on how our customers have used CloudWatch along with other AWS serverless services:

Thomson Reuters: Reducing Failover Time from 30 Minutes to 3 Minutes Using Amazon CloudWatch

Indusface: Cybersecurity Provider Indusface Guarantees 99.99% Application Firewall Uptime for Business-Critical Applications on AWS

CloudWatch also provides real-time notifications and automated responses when specific thresholds or conditions are met, such as unusual spikes in traffic, resource utilization, or API call patterns for compliance management and security monitoring. This proactive approach enables you to identify and remediate potential security risks and compliance issues before they escalate into more significant problems. Furthermore, CloudWatch supports integration with AWS Config, a service that continuously monitors and records AWS resource configurations, providing you with the visibility and insights needed to assess your overall compliance posture and detect any deviations from established baselines.

X-Ray, as a robust distributed tracing system, allows developers to analyze and debug their serverless applications in real time and monitor both cloud cost and performance metrics by providing end-to-end visibility into requests and responses. However, this level of access to application data necessitates stringent security measures and adherence to compliance standards.

CloudWatch and X-Ray protect your data with multiple layers of protection, starting with robust access control mechanisms. Using AWS Identity and Access Management (IAM), you can define granular permissions and policies that dictate who can access CloudWatch and X-Ray data and perform specific actions, such as creating or modifying alarms, viewing metrics and traces, or accessing logs. This fine-grained access control helps ensure that only authorized personnel can access sensitive data, reducing the risk of unauthorized access and potential data breaches.
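As an illustration of such a policy, the sketch below builds a read-only IAM policy document for someone who needs to view metrics, query logs, and inspect traces but must not change alarms or configuration. The statement IDs and the choice of actions are assumptions for this example. In a real account, you would typically narrow `Resource` to specific log groups.

```python
import json


def read_only_observability_policy() -> dict:
    """Build an IAM policy document that grants view-only observability access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadMetricsAndAlarms",
                "Effect": "Allow",
                "Action": [
                    "cloudwatch:GetMetricData",
                    "cloudwatch:ListMetrics",
                    "cloudwatch:DescribeAlarms",
                ],
                "Resource": "*",
            },
            {
                "Sid": "QueryLogs",
                "Effect": "Allow",
                "Action": ["logs:StartQuery", "logs:GetQueryResults"],
                "Resource": "*",  # consider restricting to specific log group ARNs
            },
            {
                "Sid": "ReadTraces",
                "Effect": "Allow",
                "Action": ["xray:GetTraceSummaries", "xray:BatchGetTraces"],
                "Resource": "*",
            },
        ],
    }


# Serialize the document so it can be attached to a role or user.
policy_json = json.dumps(read_only_observability_policy(), indent=2)
```

Because the document contains no `Put*` or `Delete*` actions, a principal with only this policy can investigate an incident but cannot modify alarms, log groups, or sampling rules.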

Both CloudWatch and X-Ray support encryption of data at rest and in transit, using AWS Key Management Service (AWS KMS) to protect sensitive information from unauthorized access and to adhere to industry best practices for data security. CloudWatch and X-Ray are also part of multiple AWS compliance programs, which include HIPAA, SOC, PCI and others. To learn more about compliance and security in CloudWatch and X-Ray, see Compliance validation for Amazon CloudWatch, Security in Amazon CloudWatch, Compliance validation for AWS X-Ray and Security in AWS X-Ray in the AWS General Reference.

Data protection goes hand in hand with compliance and security. CloudWatch Logs data protection helps you detect sensitive data using pattern matching and machine learning models. Using managed data identifiers, CloudWatch Logs can detect sensitive data, including PII and PHI, such as credit card numbers, financial and medical information, AWS secret keys, device IP or MAC addresses, and passport numbers for a particular country or region.

To detect sensitive data in CloudWatch Logs, you need to create a data protection policy. Once the policy is set up, you can view the sensitive data detections and their count in CloudWatch Log Groups as shown below.

Figure 13: Sensitive data detected in CloudWatch Logs

Now, when you query the Log Groups using CloudWatch Logs Insights, you can see that the sensitive data has been masked, as shown in the figure below. Here, only the customer full name, email and credit card have been configured as sensitive data in the data protection policy.

Figure 14: CloudWatch Logs Insights ‘Masked’ Sensitive data
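For readers who prefer to create the data protection policy programmatically, here is a hedged sketch of what the policy document could look like. It reflects our understanding of the policy format: an audit statement and a de-identify statement that share the same list of managed data identifiers. The identifier names `EmailAddress` and `CreditCardNumber` are assumptions to verify against the current list of managed data identifiers before use.

```python
import json

# Assumed ARN prefix for AWS managed data identifiers.
IDENTIFIER_ARN_PREFIX = "arn:aws:dataprotection::aws:data-identifier/"


def data_protection_policy(identifiers) -> dict:
    """Build a policy document that audits and masks the given identifiers."""
    arns = [IDENTIFIER_ARN_PREFIX + name for name in identifiers]
    return {
        "Name": "data-protection-policy",
        "Description": "Audit and mask sensitive fields",
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "audit-policy",
                "DataIdentifier": arns,
                # An empty FindingsDestination means findings are not exported.
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {
                "Sid": "redact-policy",
                "DataIdentifier": arns,
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }


def apply_policy(log_group: str, policy: dict) -> None:
    """Attach the policy to a log group. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without the AWS SDK

    boto3.client("logs").put_data_protection_policy(
        logGroupIdentifier=log_group,
        policyDocument=json.dumps(policy),
    )
```

For example, `data_protection_policy(["EmailAddress", "CreditCardNumber"])` would produce a document that audits and masks those two kinds of data in the log group it is attached to.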

To learn more about the Data Protection implementation in detail, you may visit the CloudWatch Data Protection workshop.

In the final section, we will walk through the different options available for you and your teams to learn more about CloudWatch, X-Ray, and AWS-native observability and take your skills to the next level.

How to get started – Accessible Training Resources

To make learning CloudWatch and X-Ray easier and more accessible, AWS provides the following resources.

CloudWatch and X-Ray official documentation is an invaluable starting point. It provides a comprehensive overview of CloudWatch, its features, and how to get started with the service. The documentation is regularly updated to reflect the latest changes and improvements, ensuring that users are always equipped with the most accurate and up-to-date information.

The AWS Observability course on Skill Builder delves into the core principles of monitoring, how to use CloudWatch to generate alarms, and how to visualize and analyze metrics.

For learners who prefer a more social and collaborative approach, attending AWS events, such as AWS re:Invent, AWS Summits, and AWSome Days, can be highly beneficial. These events provide an opportunity to learn about the latest advancements in CloudWatch, X-Ray and other AWS services, attend workshops, and network with fellow professionals. Additionally, joining AWS user groups and participating in online forums, such as the AWS Developer Community, can help learners stay informed about the latest trends, best practices, and real-world use cases.

AWS Training and Certification is a fee-based classroom training approach with several courses tailored to different roles, such as developers, architects, and operations, ensuring that learners can acquire the knowledge and skills relevant to their job responsibilities. AWS Training and Certification classes are provided globally in different languages, and the training programs and offerings are built by AWS experts.

AWS Training Partners (ATP) are an alternate approach if you already have a training partner in place or would like to incorporate one for on-demand classroom and digital offerings. ATPs are selected by AWS to provide AWS-authored training and will help develop your skills to migrate or build cloud-native applications on AWS.

For hands-on learning, which can be either led by AWS Solutions Architects or self-paced, we provide the AWS One Observability Workshop and EKS Observability, which offer comprehensive step-by-step guides on the deployment and setup of AWS resources.

To check out the best practices and FAQs, see Best Practices, FAQs – Amazon CloudWatch, and AWS X-Ray – FAQ.

Last but not least, CloudWatch Quick Start is a one-stop guide to all the learning and training resources on how cloud-native observability tools like CloudWatch and X-Ray work for infrastructure, serverless, application monitoring and more.

Conclusion

AWS-native observability tools offer the fastest and easiest path into observability and the adoption of its best practices. The integrations and out-of-the-box capabilities that CloudWatch and X-Ray offer give you the tools to understand your serverless applications from day one. As your workloads grow, the AWS Observability tools have the customization features needed to simplify the complexities of your serverless deployment and tell you whether the different areas of your application are delivering the desired results. All this is achieved while complying with the security and regulatory requirements of each industry segment. To get started with the AWS Observability tools, there are plenty of resources, online courses, trainings, blog posts, and hands-on workshops that will help your teams gain the confidence and knowledge to improve your application performance and meet your business objectives.

AWS Cloud Operations & Migrations Blog