Taking the first step
Introduction
Monitoring and observability are critical components for ensuring the availability, performance, reliability, and security of your cloud-based workloads and data.
- Monitoring involves the systematic collection and analysis of data, such as metrics, logs, and traces, to track the health and efficiency of cloud resources and to support reactive incident management.
- Observability focuses on understanding the internal state of a system through dynamic, real-time insights, allowing for proactive issue identification and resolution.
AWS offers a range of tools and services for both monitoring and observability. They can be used to collect data, analyze metrics, and create alarms to notify you of issues. In addition, they can provide logs and metrics that you can use to identify and troubleshoot the root cause of problems.
These services integrate with more than 120 other AWS services (including EC2, EKS, ECS, Lambda, and S3) and partners, as well as with a wide range of third-party observability and cloud management tools that use near real-time feeds of AWS-native telemetry.
This guide will help you select the AWS monitoring and observability services and tools that are the best fit for your needs and your organization.
In this four-minute clip from his re:Invent 2023 presentation, senior AWS worldwide specialist Toshal Dudhwala outlines how to build an observability strategy.
Purpose
Help determine which AWS monitoring and observability services are the best fit for your organization.
Last updated
January 12, 2024
Understand
To choose the right AWS monitoring and observability tools for your needs, it may help to first understand the range of options available to you and how the main services fit together.
Start with your three key data sources: logs, metrics, and traces. The data from those sources can be consumed using Amazon CloudWatch, AWS X-Ray, or AWS Distro for OpenTelemetry (ADOT) agents.
Here’s when you might use each of these data collection sources:
- Use Amazon CloudWatch to collect custom metrics from your own applications to monitor operational performance, troubleshoot issues, and spot trends. You can also use the CloudWatch agent to collect logs, metrics, and traces. In addition, you can use open source tools such as Fluentd or Fluent Bit to collect logs and send them to CloudWatch Logs.
- Use AWS X-Ray to perform distributed tracing across multiple applications and systems to help find latency in a system and target it for improvement. You can use the CloudWatch agent to collect traces and send them to X-Ray.
- Use AWS Distro for OpenTelemetry to collect metrics and traces.
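As a rough illustration of the first option, the following Python sketch assembles the request parameters for one custom metric. It assumes you publish with the boto3 `put_metric_data` call. The namespace, metric name, and dimension are hypothetical, and the helper functions are illustrative names, not part of any AWS SDK.

```python
from datetime import datetime, timezone


def build_put_metric_data_kwargs(namespace, name, value, unit="None", dimensions=None):
    """Build keyword arguments shaped for boto3's CloudWatch put_metric_data call."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
    }
    if dimensions:
        # CloudWatch expects dimensions as a list of Name/Value pairs.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return {"Namespace": namespace, "MetricData": [datum]}


def publish(kwargs):
    """Send the metric to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_data(**kwargs)


if __name__ == "__main__":
    kwargs = build_put_metric_data_kwargs(
        namespace="ExampleApp/Checkout",  # hypothetical namespace
        name="OrderLatency",  # hypothetical metric name
        value=182.5,
        unit="Milliseconds",
        dimensions={"Environment": "staging"},  # hypothetical dimension
    )
    print(kwargs)
```

Keeping the payload builder separate from the network call makes the metric shape easy to check without AWS credentials.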
Instrumentation
There are two major categories of instrumentation available within AWS monitoring and observability services: AWS Native Services and Open Source Managed Services.
- AWS Native Services include Amazon CloudWatch and AWS X-Ray. CloudWatch offers key features such as Container Insights, Lambda Insights, Contributor Insights, and Application Insights, which help you contextualize your data for insights and analysis.
- Open Source Managed Services include Amazon Managed Service for Prometheus (a managed monitoring service based on and compatible with the popular Prometheus open source monitoring and alerting solution), Amazon OpenSearch Service, and AWS Distro for OpenTelemetry (which supports not only AWS X-Ray but also Jaeger and Zipkin tracing).
Visualization and analysis
The data you collect and ingest with these AWS services can be visualized and analyzed using the Amazon CloudWatch Service Map, the AWS X-Ray trace map, Amazon Managed Grafana, and Amazon CloudWatch Logs Insights.
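For example, a CloudWatch Logs Insights query is a pipeline of commands separated by `|`. The sketch below is an illustration rather than a recommended query: it builds a query that returns the most recent log events containing ERROR, together with parameters shaped for the boto3 `start_query` call. The log group name and the helper function are hypothetical.

```python
import time


def build_error_query_kwargs(log_group, minutes=60, limit=20, now=None):
    """Build keyword arguments shaped for boto3's CloudWatch Logs start_query call."""
    end = int(time.time()) if now is None else int(now)
    query = (
        "fields @timestamp, @message"
        " | filter @message like /ERROR/"
        " | sort @timestamp desc"
        f" | limit {limit}"
    )
    return {
        "logGroupName": log_group,
        "startTime": end - minutes * 60,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


if __name__ == "__main__":
    # "/example/app" is a hypothetical log group name.
    params = build_error_query_kwargs("/example/app", minutes=30, limit=5)
    print(params["queryString"])
    # With boto3 and credentials available, the parameters could be passed as:
    #   boto3.client("logs").start_query(**params)
```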
Other services
Other services important to monitoring and observability include:
- AWS Config provides a detailed view of your resource configurations in your AWS account. This view includes the relationship between your resources and the past configurations of your resources, so you can see how the relationships and configurations of your resources change over time. If you are using AWS Config rules, AWS Config evaluates your resource configurations for desired settings.
- AWS CloudTrail helps you enable operational and risk auditing, governance, and compliance by recording actions taken by a user, role, or AWS service as events. Events include actions taken in the AWS Management Console, AWS Command Line Interface, and AWS SDKs and APIs.
In addition, you can select from a range of machine learning and analytics services to gain further benefit from your monitoring and observability data.
Consider
Choosing the right monitoring and observability services on AWS depends on your specific requirements and use cases. Here are some criteria to consider when making your decision.
- Monitoring service capabilities
- Ease of integration
- Data retention and storage
- Scalability
- Alerting and notification
- Cost
- Customization and extensibility
- Security and compliance
- Machine learning and analytics
- Global reach
Monitoring service capabilities
Consider whether the service provides a comprehensive set of tools that encompass metrics, logs, and traces. Metrics offer quantitative data on system performance, logs provide detailed event information, and traces allow you to follow transactions across your infrastructure.
Also assess whether the service supports diverse data types and formats. Additionally, look for advanced features such as anomaly detection, machine learning-driven insights, and the ability to correlate data from different sources. A well-rounded solution should enable holistic visibility into your AWS environment, aiding in efficient troubleshooting, performance optimization, and proactive problem resolution. The more versatile and integrated the service capabilities, the better equipped you are to gain deep insights into your applications and infrastructure. Review the AWS Observability section of the Management and Governance Cloud Environment Guide (part of the AWS Well-Architected Framework) for more details on service capabilities.
Ease of integration
Assess how seamlessly the service integrates with your existing AWS infrastructure, applications, and deployment processes.
Look for compatibility with popular programming languages, frameworks, and third-party tools that your organization uses. Also evaluate the availability of SDKs, APIs, and plugins that simplify the integration process. Better integration can facilitate the collection and analysis of data without imposing significant overhead on your applications.
Additionally, consider whether the service supports common protocols for data ingestion. Services that provide better integration can ensure a smoother onboarding experience, allowing your team to more quickly start monitoring and gaining valuable insights into your AWS environment.
Data retention and storage
Data retention and storage capabilities are pivotal considerations in selecting AWS monitoring and observability services. For any service you are considering, examine policies on storing and retaining historical data, as well as scalability to handle increasing data volumes over time.
Assess whether the service supports long-term storage of metrics, logs, and traces, enabling you to perform retrospective analysis and meet compliance requirements. Consider also the ease with which you can access and retrieve archived data.
The service (or services) you use should strike a balance between providing sufficient retention periods for meaningful trend analysis and managing storage costs effectively. A clear understanding of data retention and storage policies is important when considering how your monitoring setup aligns with both operational needs and regulatory obligations.
Scalability
Evaluate how well the service can scale alongside your evolving infrastructure and growing workloads. A scalable solution should seamlessly handle increases in data volume, user activity, and the complexity of your applications.
Consider the elasticity of the service, its ability to accommodate spikes in demand, and whether it supports auto-scaling features to adapt to changing requirements dynamically. Robust scalability ensures that your monitoring system remains responsive and effective, providing timely insights even as your AWS environment expands.
By choosing a service with strong scalability, you can confidently support the continuous growth of your applications and infrastructure without compromising on performance or incurring unnecessary operational challenges.
Alerting and notification
Assess the alerting capabilities of the service, including the ability to set up alerts based on predefined thresholds, anomalies, or specific events. Look for flexibility in configuring alert conditions and the ease of managing notification channels such as email, SMS, or integrations with collaboration tools.
The service (or services) you choose should provide timely and actionable alerts, enabling your team to respond promptly to potential issues. Consider features such as escalation policies and the ability to acknowledge or suppress alerts.
Integration with popular incident management platforms can enhance the overall incident response workflow. Prioritize a monitoring service that empowers your team to proactively address issues, minimizing downtime and ensuring the continuous health of your AWS environment.
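To make threshold-based alerting concrete, here is a minimal Python sketch of an "M out of N" evaluation, in which an alarm fires when at least M of the last N datapoints breach a threshold. It is a simplified illustration that ignores missing-data handling, not a description of how any particular service evaluates alarms internally. The function name and sample values are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return "ALARM", "OK", or "INSUFFICIENT_DATA" for the most recent window."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    window = datapoints[-evaluation_periods:]
    # Count how many datapoints in the window exceed the threshold.
    breaches = sum(1 for value in window if value > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"


if __name__ == "__main__":
    cpu_percent = [41, 55, 83, 91, 78, 88]  # illustrative sample values
    # Three of the last five datapoints exceed 80.
    print(alarm_state(cpu_percent, threshold=80, evaluation_periods=5, datapoints_to_alarm=3))
    print(alarm_state(cpu_percent, threshold=80, evaluation_periods=5, datapoints_to_alarm=4))
```

Requiring several breaching datapoints rather than one is a common way to reduce noisy alerts from short spikes.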
Cost
Understand the pricing model of each service, considering factors such as data volume, storage, and any additional features. Review the pricing documentation for any service you are considering (for example, the billing and cost summary for Amazon CloudWatch).
Evaluate whether the pricing structure aligns with your budget and usage patterns. Some services may offer a pay-as-you-go model, while others may have tiered pricing or subscription plans. Consider the potential impact of all costs – including data transfer fees or charges for accessing historical data.
Additionally, assess whether the pricing scales efficiently with the growth of your infrastructure. A clear understanding of costs ensures that your monitoring solution remains cost-effective without compromising on essential features, allowing you to optimize your budget while meeting your operational requirements on AWS.
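The difference between flat pay-as-you-go pricing and tiered pricing can be sketched with a small calculation. The rates and tier boundaries below are hypothetical placeholders chosen for easy arithmetic, not actual AWS prices. Substitute the figures from the pricing page of the service you are evaluating.

```python
def flat_cost(units, price_per_unit):
    """Pay-as-you-go: every unit costs the same."""
    return units * price_per_unit


def tiered_cost(units, tiers):
    """Tiered pricing.

    tiers is an ascending list of (upper_bound, price_per_unit) pairs, where
    upper_bound is the cumulative unit count at which the tier ends and None
    means the tier has no upper limit.
    """
    cost = 0.0
    lower = 0
    for upper, price in tiers:
        if units <= lower:
            break
        top = units if upper is None else min(units, upper)
        cost += (top - lower) * price
        lower = top
    return cost


if __name__ == "__main__":
    # Hypothetical tiers: first 100 units, next 900 units, then everything above 1,000.
    tiers = [(100, 0.50), (1000, 0.25), (None, 0.125)]
    print(tiered_cost(1500, tiers))  # 100*0.50 + 900*0.25 + 500*0.125 = 337.5
    print(flat_cost(1500, 0.25))     # 1500*0.25 = 375.0
```

Running the same usage forecast through both models shows at what volume one pricing structure becomes cheaper than the other.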
Customization and extensibility
Assess whether the service allows you to tailor dashboards, reports, and alerts to meet your needs. Look for the flexibility to create custom metrics, queries, and visualizations. Integration with third-party tools and support for common APIs enhance the service's extensibility. Evaluate whether the monitoring solution can adapt to the unique needs of your applications and infrastructure.
A highly customizable and extensible service empowers your team to fine-tune monitoring parameters, adapt to evolving use cases, and integrate seamlessly with your existing workflows and tools. Prioritize solutions that provide a high degree of configurability, allowing you to optimize monitoring for your specific AWS environment and operational preferences.
Security and compliance
Evaluate how well a service adheres to AWS security best practices for data confidentiality, integrity, and availability. Check for features such as encryption in transit and at rest, access controls, and secure authentication mechanisms. Assess whether the service supports compliance with relevant regulations and standards applicable to your industry.
Look for audit trail capabilities and the ability to generate compliance reports. The goal is to safeguard sensitive data and keep your monitoring practices aligned with regulatory requirements.
Prioritize services that provide a robust security posture, enabling your organization to maintain a secure and compliant AWS environment while gaining insights into your applications and infrastructure.
Machine learning and analytics
Evaluate whether the service uses machine learning to provide advanced insights, anomaly detection, and predictive analytics. Look for features that automatically identify patterns, trends, and potential issues within your data.
A robust machine learning component can enhance the accuracy of anomaly detection, reducing false positives and improving the overall effectiveness of your monitoring system. Additionally, consider the depth of analytics provided, such as root cause analysis and trend forecasting. A service with strong machine learning and analytics capabilities empowers your team to proactively address issues, optimize performance, and gain deeper insights into the behavior of your AWS applications and infrastructure.
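As a point of comparison when evaluating these features, the sketch below shows the simplest statistical form of anomaly detection: learn a baseline band from historical values, then flag new values that fall outside it. This is a fixed mean plus or minus k standard deviations band, a deliberately basic stand-in and not the model used by any AWS service. The function name and sample values are hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size, band_width=2.0):
    """Flag values after the baseline that fall outside mean +/- band_width * stdev.

    baseline_size must be at least 2 so that a standard deviation can be computed.
    Returns a list of (index, value) pairs.
    """
    baseline = values[:baseline_size]
    center = mean(baseline)
    spread = stdev(baseline)
    lower = center - band_width * spread
    upper = center + band_width * spread
    return [
        (index, value)
        for index, value in enumerate(values[baseline_size:], start=baseline_size)
        if not lower <= value <= upper
    ]


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99, 100, 150, 101]  # illustrative sample values
    # The first six values form the baseline; only the spike at index 6 is flagged.
    print(find_anomalies(latency_ms, baseline_size=6))
```

A static band like this produces false positives for metrics with daily or weekly cycles, which is one reason to look for services that model seasonality and trend.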
Global reach
Global reach is a critical criterion for AWS monitoring and observability services, particularly if your infrastructure is distributed across multiple Regions. Assess whether the monitoring service provides visibility into the performance and health of your resources across different AWS Regions.
Consider the ability to aggregate and analyze data from diverse geographical locations, ensuring a comprehensive understanding of your global infrastructure. Look for features that support centralized management and monitoring, allowing you to efficiently oversee operations on a global scale.
A service with strong global reach ensures that you can maintain consistent monitoring practices, troubleshoot issues, and optimize performance seamlessly across the entire spectrum of your AWS deployment, irrespective of geographical boundaries. This capability is particularly valuable for organizations with a geographically distributed or multicloud infrastructure.
Choose
Now that you know the criteria by which you will be evaluating your monitoring and observability options, you are ready to choose which AWS monitoring and observability service(s) may be a good fit for your organizational requirements.
The following table highlights which services are optimized for which circumstances. Use the table to help determine the service that is the best fit for your organization and use case.
Track the health and performance of various resources and applications using tools such as Amazon CloudWatch, which enable you to set up alarms for specific thresholds and receive notifications when issues arise.
- Amazon CloudWatch: Provides monitoring and management for AWS resources and applications.
- Amazon CloudWatch Logs: Collects, stores, and monitors logs from various AWS resources and applications.
- Amazon EventBridge: Monitors events in AWS resources and triggers automated responses through CloudWatch rules.
Interpret and optimize application behavior using tools such as AWS X-Ray and CloudWatch Synthetics, which help trace user requests, identify performance bottlenecks, and provide consistent user experiences.
- Amazon CloudWatch Application Signals: Use CloudWatch Application Signals (currently in Preview release) to automatically instrument your applications on AWS so that you can monitor current application health and track long-term application performance against your business objectives.
- Amazon Managed Service for Prometheus: A serverless, Prometheus-compatible monitoring service for container metrics that makes it easier to securely monitor container environments at scale.
- AWS X-Ray: Helps trace and analyze requests as they travel through applications, identifying bottlenecks and interpreting application behavior.
- Amazon CloudWatch Synthetics: Monitors the availability and latency of web applications by running tests that simulate user interactions.
Provides insights into the state of infrastructure resources using services such as CloudWatch Metrics and Container Insights to help monitor resource utilization, automate scaling, and support efficient resource management.
- Amazon CloudWatch Metrics: Collects and monitors metrics from various AWS services, allowing you to track the health and performance of resources.
- Amazon CloudWatch Container Insights: Provides monitoring and insights for containerized applications running on Amazon ECS, EKS, and Kubernetes clusters.
Collect, store, and analyze log data from various sources, with services such as CloudWatch Logs Insights and Amazon OpenSearch Service to support troubleshooting, pattern identification, and data visualization.
- Amazon CloudWatch Logs Anomaly Detection: Use log anomaly detection to scan the log events ingested into the log group and find anomalies in the log data. Anomaly detection uses machine learning and pattern recognition to establish baselines of typical log content.
- Amazon CloudWatch Logs Insights: Enables interactive searching and analyzing of log data to identify patterns and troubleshoot issues.
- Amazon Managed Grafana: A fully managed and secure data visualization service that you can use to instantly query, correlate, and visualize operational metrics, logs, and traces from multiple sources.
- Amazon OpenSearch Service: A managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud.
- Ingests and processes real-time data streams, suitable for collecting and analyzing large volumes of data.
Helps you detect and respond to security threats, monitor compliance with best practices, and maintain a secure and audit-ready environment.
- Detects threats and suspicious activities in your AWS environment by analyzing logs and network traffic.
- AWS Config: Tracks changes to AWS resources and helps assess resource configuration compliance.
- AWS CloudTrail: Records API activity and helps with auditing and compliance monitoring.
Capture and analyze network traffic with services such as AWS Network Firewall and features such as Amazon VPC Flow Logs, which enhance visibility into network behavior and enable better security and performance management.
- Amazon CloudWatch Network Monitor: Provides visibility into the performance of the network connecting your AWS hosted applications to your on-premises destinations and allows you to identify the source of any network performance degradation within minutes.
- Amazon CloudWatch Internet Monitor: Provides visibility into how internet issues impact the performance and availability between your applications hosted on AWS and your end users.
- Amazon VPC Flow Logs: Captures information about the IP traffic going to and from network interfaces in VPCs.
- AWS Network Firewall: Provides intrusion detection and prevention capabilities for VPCs.
Understand interactions between various components in distributed applications and get assistance in collecting and visualizing traces to diagnose issues and improve application performance.
- AWS Distro for OpenTelemetry: Offers a distribution of the OpenTelemetry project for collecting traces from applications.
- AWS X-Ray: Helps trace and analyze requests as they travel through applications, identifying bottlenecks and understanding application behavior.
- Amazon CloudWatch Application Signals (Preview): Provides a unified, application-centric view of your applications, services, and dependencies, and helps you monitor and triage application health.
Hybrid and multicloud observability collects and consumes multiple data sources such as logs, metrics, and traces; processes and contextualizes the data; and monitors, alarms, visualizes, and analyzes the data for insights from modern applications, workloads, and infrastructures whether they are on-premises, hybrid, containerized, multicloud, or open source.
- Amazon CloudWatch (hybrid and multicloud support): Supports data querying from multiple sources, enabling you to gain visibility across your hybrid and multicloud metrics in a single view.
Use
You should now have a clear understanding of what each AWS monitoring and observability service (and the supporting AWS tools and services) does, and which might be right for you.
To help you explore and learn more about each of the available AWS observability services, we have provided a pathway for each service. The following section provides links to in-depth documentation, hands-on tutorials, and resources to get you started.
Amazon CloudWatch
Getting Started with Amazon CloudWatch
Monitor your AWS resources and the applications you run on AWS in real time using Amazon CloudWatch. You can use CloudWatch to collect and track metrics, which are variables you can measure for your resources and applications.
Getting started with Amazon CloudWatch Metrics
This guide discusses basic monitoring and detailed monitoring, how to graph metrics, and how to use CloudWatch anomaly detection.
Set up Container Insights on Amazon EKS and Kubernetes
Set up the Amazon CloudWatch Observability EKS add-on and ADOT on your EKS cluster to send metrics to CloudWatch. You will also learn how to set up Fluent Bit or Fluentd to send logs to CloudWatch Logs.
Getting started with Amazon CloudWatch Application Insights
Learn how to use the console to enable CloudWatch Application Insights to manage your applications for monitoring.
Using Container Insights
Learn how CloudWatch Container Insights collects, aggregates, and summarizes metrics and logs from your containerized applications and microservices.
Setting up Container Insights on Amazon ECS
Learn to configure cluster and service level metrics, deploy ADOT to collect EC2 instance level metrics, and set up FireLens to send logs to CloudWatch Logs.
Amazon CloudWatch Application Insights
Getting started with Amazon CloudWatch Application Signals
In this guide, you will learn how to automatically instrument your applications on AWS so that you can monitor current application health and track long-term application performance against your business objectives.
Amazon CloudWatch Application Signals for automatic instrumentation of your applications
This blog post provides an in-depth walk-through of the AWS Management Console for Amazon CloudWatch Application Signals, demonstrating how to collect telemetry for your EKS clusters.
How to monitor application health using SLOs with Amazon CloudWatch Application Signals
This blog post demonstrates how Amazon CloudWatch Application Signals enables you to automatically instrument and operate applications on AWS to track application performance against your most important objectives.
Amazon CloudWatch Lambda Insights
Introducing CloudWatch Lambda Insights
Learn how to create a few “Hello World” Lambda functions and monitor them using Lambda Insights. You will be using the AWS CDK to deploy the architecture.
Using Amazon CloudWatch Lambda Insights to Improve Operational Visibility
Learn how to use Lambda Insights to provide simple and convenient operational oversight and visibility into the behavior of your AWS Lambda functions.
Amazon CloudWatch Logs
Getting started with Amazon CloudWatch Logs
Learn how to install the unified CloudWatch agent and how to configure metrics collection with AWS CloudFormation.
Analyzing log data with CloudWatch Logs Insights
This guide demonstrates how to get started with Logs Insights queries, visualize log data in graphs, and add queries to your dashboard.
Amazon CloudWatch Logs Insights – Fast, Interactive Log Analytics
Use Logs Insights to utilize the data points, patterns, trends, and insights present in all the various logs created by AWS services to understand how your applications and AWS resources are behaving, identify room for improvement, and address operational issues.
Amazon CloudWatch Synthetics
Using synthetic monitoring
This guide demonstrates how to create canaries (configurable scripts that run on a schedule) and provides sample code for canary scripts.
Secure monitoring of user workflow experience using Amazon CloudWatch Synthetics and AWS Secrets Manager
Learn how to create, deploy, and monitor synthetic monitoring solutions using Amazon CloudWatch Synthetics.
Amazon EventBridge
Getting started with Amazon EventBridge
Learn to create a basic rule to route events to a target.
Archive and replay Amazon EventBridge events
Create a function to use as the target for the EventBridge rule using the Lambda console.
Log the state of an Amazon EC2 instance using EventBridge
Create an AWS Lambda function to log state changes for an Amazon EC2 instance. You will log the launch of any new EC2 instance.
Building an event-driven application with Amazon EventBridge
Learn how to build and deploy an event-driven application using the AWS Serverless Application Model (SAM) CLI.
AWS CloudTrail
Getting started with AWS CloudTrail
AWS CloudTrail is an AWS service that helps you enable operational and risk auditing, governance, and compliance of your AWS account. Here's how to get started with it.
Review AWS account activity
Learn how to review the AWS API activity in your AWS account for services that support CloudTrail.
Create a trail
Learn how to create a trail to log AWS API activity in all Regions including data and Insights events.
AWS CloudTrail Log Monitoring workshop
Learn how to integrate CloudTrail logs into CloudWatch and use features such as CloudWatch Logs Insights, CloudWatch Metric Filters, CloudWatch Metric Alarms, and CloudWatch Dashboards.
AWS CloudTrail best practices
Best practices for using CloudTrail to enable auditing across your organization.
AWS Config
Getting started with AWS Config
AWS Config provides a detailed view of the configuration of AWS resources in your AWS account. This explains how to get started using it.
Setting up AWS Config (console)
Learn how to set up AWS Config in your AWS accounts using the AWS Management Console.
Setting up AWS Config with the AWS CLI
Learn how to set up AWS Config in your AWS accounts using the AWS CLI.
Amazon Managed Grafana
Getting started with Amazon Managed Grafana
Learn how to get started with Amazon Managed Grafana and create your first workspace and then connect to the Grafana console in that workspace.
Amazon Managed Grafana - Getting Started
Learn how to integrate with Amazon Managed Service for Prometheus and how to create custom dashboards.
Visualize and gain insights into your AWS cost and usage with Amazon Managed Grafana
Learn how to visualize and analyze your AWS cost and usage data with Amazon Managed Grafana.
Amazon Managed Service for Prometheus
Getting started with Amazon Managed Service for Prometheus
Create Amazon Managed Service for Prometheus workspaces, set up the ingestion of Prometheus metrics to those workspaces, and query those metrics.
Container Insights Prometheus metrics monitoring
Learn how to automate the discovery of Prometheus metrics from containerized workloads using CloudWatch Container Insights.
Amazon Managed Service for Prometheus FAQs
Frequently asked questions about Amazon Managed Service for Prometheus.
Amazon OpenSearch Service
Getting started with Amazon OpenSearch Service
Use Amazon OpenSearch Service to create and configure a test domain. An OpenSearch Service domain is synonymous with an OpenSearch cluster.
Getting started with Amazon OpenSearch Serverless
This tutorial walks you through the basic steps to get an Amazon OpenSearch Serverless search collection up and running quickly.
Creating and searching for documents in Amazon OpenSearch Service
Learn how to create and search for a document in Amazon OpenSearch Service.
Getting started with Amazon OpenSearch Ingestion
Learn how to use Amazon OpenSearch Ingestion to ingest data into a domain and also a collection.
SIEM on Amazon OpenSearch Service Workshop
Build a security log analysis platform on Amazon OpenSearch Service and get started with building a cost-efficient solution for log ingestion, analysis and dashboarding.
AWS Distro for OpenTelemetry
Getting Started with the AWS Distro for OpenTelemetry (ADOT) Collector
Walk through the steps to build the ADOT Collector locally.
AWS Distro for OpenTelemetry JavaScript
Learn how to instrument your JavaScript applications and send correlated metrics to various AWS monitoring solutions.
AWS Distro for OpenTelemetry Python
This guide will demonstrate how to instrument your Python applications and send correlated metrics to various AWS monitoring solutions.
AWS X-Ray
Getting started with AWS X-Ray
This guide will walk you through launching a sample application. Then you will learn how to instrument your application and explore other services that are integrated with X-Ray.
One Observability Workshop
This workshop provides hands-on experience with the wide variety of tools AWS offers for monitoring and observability, including AWS X-Ray and ADOT.
Application logging and monitoring using AWS X-Ray
Learn how AWS X-Ray collects data about requests that your application serves, and how it helps you view, filter, and gain insights into that data to identify issues and opportunities for optimization.
Explore
Explore solutions to help you implement monitoring and observability on AWS.
Explore whitepapers to help you get started, learn best practices, and understand your monitoring and observability options.
Explore additional architectural guidance covering common use cases for monitoring and observability services.