What is observability?
“Is my system up or down?” “Is it fast or slow as experienced by my end users?” “What KPIs and SLAs should we establish, and how do we know if they’re being met?” When you’re operating at cloud speed and scale, you can’t afford to fly blind: you need to be able to answer a wide range of operational and business questions like these. You need to be able to spot problems as they arise (ideally before they disrupt the customer experience), respond quickly, and resolve them as quickly as possible. To achieve this insight, you need observable systems.
“Observability” describes how well you can understand what is happening in a system, often by instrumenting it to collect metrics, logs, or traces. In the cloud, observability can be hard to achieve due to sheer system complexity. Whether in data centers or in the cloud, to achieve operational excellence and meet business objectives, you need to understand how your systems are performing. Observability solutions enable you to collect and analyze data from applications and infrastructure so that you can understand their internal states and be alerted to, troubleshoot, and resolve issues with application availability and performance to improve the end-user experience.
What is the difference between observability and monitoring?
Though the term “monitoring” is sometimes defined as different from observability, monitoring is an activity that makes a system observable, alongside other activities like tracing and logging. You’ll often see monitoring, tracing, and logging described as the “three pillars of observability.” However, there are also other tools that help you achieve observability, such as profilers and AIOps, discussed below.
What does observability help me do?
Observability enables you to detect and investigate problems.
Timely detection of a problem (ideally before it affects end users) is the first step in observability. Detection should be proactive and multi-faceted, including alarms when performance thresholds are breached, synthetic testing, and anomaly detection. A common performance metric is mean time to detect (MTTD). You can improve MTTD with a number of activities and tools:
Monitoring tools record performance statistics over time so that usage patterns can be identified. Monitoring agents record selected metrics at set intervals and store the resulting data in a time-series format.
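As a rough illustration of the sampling loop described above, here is a minimal sketch of an in-process agent that reads one metric at a fixed interval and keeps the results as a time series. The `MetricSampler` class and its parameters are hypothetical names chosen for this example; real monitoring agents also handle batching, shipping data to a backend, and failure recovery.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MetricSampler:
    """Hypothetical agent: samples one metric at a fixed interval."""

    def __init__(self, read_metric: Callable[[], float],
                 interval_s: float, max_points: int = 1000) -> None:
        self.read_metric = read_metric
        self.interval_s = interval_s
        # Bounded buffer of (timestamp, value) pairs, oldest first.
        self.series: Deque[Tuple[float, float]] = deque(maxlen=max_points)

    def sample_once(self, now: float) -> None:
        self.series.append((now, self.read_metric()))

    def run(self, samples: int) -> None:
        for i in range(samples):
            self.sample_once(time.time())
            if i < samples - 1:
                time.sleep(self.interval_s)


# Stand-in for a real reading such as CPU utilization (assumed values).
readings = iter([10.0, 12.5, 11.0])
sampler = MetricSampler(lambda: next(readings), interval_s=0.0)
sampler.run(samples=3)
values = [value for _, value in sampler.series]
print(values)
```

Storing each value with its timestamp is what lets later tooling identify usage patterns over time.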
Application Performance Monitoring
Application Performance Monitoring (APM) lets you monitor the end-to-end customer experience, from browsers and mobile devices through the various layers of the application stack. APM begins with front-end monitoring: measuring the experience of customers from the browser or mobile device. At the heart of APM are application discovery, tracing, and diagnostics: the ability to identify which part of an application is causing performance issues and quickly pinpoint the reason.
When something goes wrong, you want timely alerts. However, too-sensitive detection can lead to alarm fatigue, so alert management is also key.
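One common way to balance timely alerts against alarm fatigue is to fire only after several consecutive data points breach a threshold, so that a single spike does not page anyone. The sketch below assumes that approach; the function name, the 500 ms threshold, and the sample latencies are all illustrative.

```python
from typing import Iterable, List


def evaluate_alarm(values: Iterable[float], threshold: float,
                   required_breaches: int) -> List[bool]:
    """Return, for each data point, whether the alarm would be firing.

    The alarm fires only once `required_breaches` consecutive points
    exceed `threshold`, and clears on the first point that does not.
    """
    states: List[bool] = []
    streak = 0
    for value in values:
        streak = streak + 1 if value > threshold else 0
        states.append(streak >= required_breaches)
    return states


# Assumed latency samples in milliseconds: one isolated spike, then a
# sustained breach.
latency_ms = [120, 510, 130, 520, 530, 540, 110]
states = evaluate_alarm(latency_ms, threshold=500, required_breaches=3)
print(states)
```

Here the isolated 510 ms spike is ignored, and the alarm fires only on the third consecutive breach (540 ms). Raising `required_breaches` reduces noise at the cost of slower detection.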
AIOps and anomaly detection
A new generation of tools is now bringing the power of artificial intelligence and machine learning to bear on observability, using machine learning models to identify anomalous application behavior and surface critical issues before they cause outages or service disruptions.
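To make the idea of anomaly detection concrete, the following sketch flags points that deviate sharply from a rolling baseline using a z-score. This is a simple statistical stand-in chosen for illustration, not the machine learning models these tools actually use; the window size, threshold, and sample data are assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 5,
                   z_threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is far from the preceding window."""
    anomalies: List[int] = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: a z-score is undefined, so skip this point.
            continue
        if abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


# Assumed request rates with one obvious spike at index 6.
requests_per_s = [100, 102, 98, 101, 99, 100, 250, 101, 100]
anomalies = find_anomalies(requests_per_s)
print(anomalies)
```

Note that the spike inflates the baseline for the points that follow it, which is one reason production systems tend to use more robust models than a plain rolling z-score.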
Infrastructure monitoring lets you correlate metrics and logs from an infrastructure stack to understand and resolve the root causes of performance issues.
Digital experience monitoring
Digital experience monitoring (DEM) provides insights into the experience of the end user engaging with the system by collecting activity from their browser, mobile app, or voice interaction. Synthetic transactions involve creating scripts to emulate end-user behavior when interacting with a system, so that it can be monitored and tested even when not under real load; this includes checking the availability of a website or API to receive requests from different points of presence around the world. Real user monitoring (RUM) complements synthetic testing by recording the sessions of actual users, and can be combined with automated A/B testing.
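A synthetic transaction can be sketched as a scripted sequence of steps, each timed and checked for success. In the example below, `run_synthetic_check`, the step names, and the paths are hypothetical, and `fake_send` stands in for a real HTTP client so the sketch runs without a network.

```python
import time
from typing import Callable, List, NamedTuple, Sequence, Tuple


class StepResult(NamedTuple):
    name: str
    ok: bool
    latency_ms: float


def run_synthetic_check(steps: Sequence[Tuple[str, str]],
                        send: Callable[[str], int]) -> List[StepResult]:
    """Run scripted steps in order; `send` returns an HTTP status code."""
    results: List[StepResult] = []
    for name, path in steps:
        start = time.perf_counter()
        try:
            ok = 200 <= send(path) < 400
        except Exception:
            ok = False
        latency_ms = (time.perf_counter() - start) * 1000
        results.append(StepResult(name, ok, latency_ms))
        if not ok:
            break  # Later steps depend on this one, so stop here.
    return results


def fake_send(path: str) -> int:
    # Stand-in for a real request; pretends checkout is failing.
    return {"/": 200, "/cart": 200, "/checkout": 503}.get(path, 404)


steps = [("home", "/"), ("add to cart", "/cart"),
         ("checkout", "/checkout"), ("receipt", "/receipt")]
results = run_synthetic_check(steps, fake_send)
for result in results:
    print(result.name, "ok" if result.ok else "FAILED")
```

Because the script runs on a schedule rather than waiting for real traffic, a failing checkout step like this one can be caught even when no customers are active.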
Profiling tools take a sample of measurements at regular intervals. For example, central processing units (CPUs) are commonly profiled by taking timed-interval samples of the on-CPU code paths.
Telemetry is the instrumentation of systems (usually via monitoring agents) so that they can collect data on how those systems are performing. Once telemetry is in place, a system starts producing data which can be monitored. However, different teams within a company may use different tools, which has led to a proliferation of monitoring agents that must be included in a company’s code base, and to the need to re-instrument whenever you decide to use different or additional tools. The OpenTelemetry project makes it possible to instrument applications just once and send correlated metrics and traces to multiple monitoring solutions.
Investigation is the most time-expensive phase of an operational event. When things are going wrong, it can be difficult to understand what is most important to fix. Using multiple observability sources together can help you investigate quickly to understand the root cause, but to do this effectively you need to correlate data across metrics, logs, and traces.
Tracing records system events, such as an HTTP request from a client. In distributed tracing, details captured about the event include the path of the request across multiple services/applications, along with metrics about the request such as latency at each step of the way.
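The structure of a trace can be sketched with a minimal in-process tracer: every span shares the trace ID of the request, points to its parent span, and records its own latency. The `Tracer` and `Span` classes and the service names here are hypothetical. A real distributed tracer would also propagate the trace context across process boundaries, typically in request headers, which this single-process sketch does not do.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    duration_ms: float = 0.0


class Tracer:
    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            # Child spans inherit the trace ID; a root span creates one.
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
        )
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("GET /checkout") as root:
    with tracer.span("inventory-service"):
        time.sleep(0.01)  # Simulated downstream call.
    with tracer.span("payment-service"):
        time.sleep(0.02)  # Simulated downstream call.

for s in tracer.finished:
    print(f"{s.name:20s} trace={s.trace_id[:8]} {s.duration_ms:.1f} ms")
```

Grouping the finished spans by `trace_id` and following `parent_id` reconstructs the path of the request and shows where the latency was spent.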
Observability, especially at cloud scale, can generate huge volumes of data that become difficult for humans to parse. Visualization tools help to quickly make sense of data by correlating observability data into intuitive graphic displays.
When do I use observability?
Understand application health and performance to improve customer experience
The main goal of observability is to know what is going on – anywhere and everywhere – in your system, so that you can ensure the best possible experience for your end users. You want to detect problems quickly, investigate them efficiently, and remediate them as soon as possible to minimize downtime and other disruptions to your customers; a common metric is mean time to recovery (MTTR).
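Metrics like MTTR, and the related mean time to detect (MTTD), are simple averages over incident timestamps. The sketch below uses made-up incidents and assumes MTTR is measured from the start of the problem to its resolution; some teams measure it from the moment of detection instead.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):
    started: datetime
    detected: datetime
    resolved: datetime


def mean_delta(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time from the start of a problem to its detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from the start of a problem to its resolution."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Hypothetical incident records.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4),
             datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 10),
             datetime(2024, 1, 9, 15, 0)),
]
print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

For these two incidents, MTTD is 7 minutes and MTTR is 45 minutes.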
Improve developer productivity
Traditional debugging – by analyzing logs or instrumenting breakpoints into code – is tedious, repetitive, and time-consuming, and it doesn’t scale well for production applications or those built using a microservices or serverless architecture. To analyze performance across distributed applications, developers need correlated metrics and traces to identify user impact from any source, and to find broken or expensive code paths as quickly as possible. They need to do all this without having to re-instrument their code whenever they want to add new observability tools to their kit. The right suite of observability tools can help developers code and test better and faster.
Improve operational effectiveness and efficiency
Observability can help you find performance improvements in your cloud fleet that in turn let you reduce costs. For example, across thousands or hundreds of thousands of instances, a small percentage improvement in how much CPU an application uses can add up to millions of dollars in savings. Similarly, by using observability to understand and predict your future capacity needs, you can take advantage of the cost savings available from Reserved Instance and Spot pricing.
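A back-of-the-envelope calculation shows how a small improvement scales. All figures below are assumptions chosen for illustration (fleet size, hourly price, and a 3% CPU reduction that allows a proportionally smaller fleet); they are not AWS pricing.

```python
HOURS_PER_YEAR = 24 * 365


def annual_savings(instances: int, hourly_cost_usd: float,
                   improvement: float) -> float:
    """Yearly savings if the fleet can shrink by `improvement` (0 to 1)."""
    fleet_cost = instances * hourly_cost_usd * HOURS_PER_YEAR
    return fleet_cost * improvement


# Assumed: 100,000 instances at $0.10 per hour, 3% less CPU needed.
savings = annual_savings(100_000, 0.10, 0.03)
print(f"${savings:,.0f} per year")
```

Under these assumptions the fleet costs about $87.6 million per year, so a 3% improvement is worth roughly $2.6 million annually.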
What observability solutions does AWS offer?
Our AWS-native observability solutions have been developed from the ground up to observe other AWS services, to operate at cloud scale, and to provide enterprise-level security.
CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization.
Perform distributed tracing across multiple applications and systems to help find latency in a system and target it for improvement.
Spot the most CPU-intensive code paths in an application using flame graphs, and optimize your code to improve performance and reduce infrastructure costs.
Automatically ingests operational data from your AWS applications and applies machine learning models informed by years of Amazon.com and AWS operational excellence to identify anomalous application behavior and surface critical issues before they cause outages or service disruptions.
We offer services based on and fully compatible with popular open source observability software. You can continue using familiar tools you're already invested in, while avoiding the undifferentiated heavy lifting of scaling and security.
A managed monitoring service based on and compatible with Prometheus, the popular open source monitoring and alerting solution optimized for container environments. Use the Prometheus query language (PromQL) to monitor the performance of containerized workloads.
Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).
Mapbox is an open-source mapping platform for custom-designed maps that reaches more than 300 million people each month. Mapbox uses Amazon CloudWatch for ingestion of multiple data sources—including native AWS metrics, custom metrics, and logs—as well as monitoring and visualization of key workloads and resource optimization.
“We were looking to consolidate all our monitoring, logging, metrics, and alerting under one tool. CloudWatch has helped us alleviate the operational burden to set up, configure, and learn third-party systems. Our teams use CloudWatch extensively to monitor error rates and status codes for multiple high-profile workloads. We also use CloudWatch to automate Auto Scaling actions, allowing us to optimize the cost of Amazon EC2 instance types powering our Amazon ECS clusters. CloudWatch Events enable us to provide utilization and pricing information to teams so they can audit account security, trigger AWS Lambda actions for compliance and security use cases, and schedule our resources using the cloud. CloudWatch enables next-level automation and expands the capacity of each individual.”
Emily McAfee, Platform Engineering Manager - Mapbox
Pushpay’s purpose is to bring people together by strengthening community, connection, and belonging. We build world-class giving and mobile app publishing solutions to help organizations grow their communities.
“Our current log analytics solution requires setup and maintenance overhead, has differing retention requirements, and is cost prohibitive, making it impossible for our Engineering team to be able to access and query logs in both development and test environments. With CloudWatch Logs Insights, we are now able to query logs within CloudWatch Logs reducing operational complexity. Pay per query gives us flexibility to scale at our own pace and our engineers can begin to consume and query logs without waiting for the setup, integration, and ingestion to take place with our current solution. We also benefit from viewing metrics and logs allowing faster troubleshooting. Logs Insights is an effective and inexpensive solution for our engineers to monitor their applications and perform log diving all from a single AWS console.”
Peter Goodman, Director Site Reliability Engineering - Pushpay
SendGrid is a provider of cloud email and sends more than 40 billion emails each month for more than 69,000 paying customers. SendGrid adopted Amazon CloudWatch early in its migration to AWS in order to gain system visibility, operational insights, and resource optimization.
“CloudWatch allows us to collect metrics from AWS services such as Amazon EC2, Amazon Kinesis, Amazon DynamoDB, and Amazon API Gateway, as well as logs from AWS Lambda functions. We appreciated being able to integrate natively, without the need for a self-managed stack or third-party SaaS vendor. This helped us start alerting, auto scaling, and capacity planning very quickly. Being able to address our primary use cases quickly and simply made CloudWatch a preferred solution.”
Joshua Barratt, Architect II - SendGrid
Learn observability hands-on
Check out the interactive and immersive One Observability Workshop and get hands-on using Amazon CloudWatch and AWS X-Ray. In the workshop, you will deploy a complex microservices application and set up monitoring and observability in a modern environment. You will come away with a clear understanding of logging, metrics, container and serverless monitoring, and tracing techniques.