AWS Cloud Operations Blog

Category: Monitoring and observability

Using Generative AI to Gain Insights into CloudWatch Logs

Have you ever been investigating a problem, opened a log file, and thought, “I have no idea what I am looking at. If only I could get a summary of the data”? Observability and log data play an important role in maintaining operational excellence and ensuring the reliability of your applications and services. […]

AWS named as a Challenger in the 2024 Gartner Magic Quadrant for Observability Platforms

AWS has been named as a Challenger in the 2024 Gartner Magic Quadrant for Observability Platforms, previously known as Gartner Application Performance Monitoring (APM) and Observability Magic Quadrant. This report assesses vendors based on their Ability to Execute and Completeness of Vision. Compared to the previous year, AWS has moved up higher on the Ability […]

Improve Amazon Bedrock Observability with Amazon CloudWatch AppSignals

As Generative AI applications evolve rapidly, demand is growing for more granular observability into applications that use Large Language Models (LLMs). Specifically, customers want visibility into: prompt metrics such as token usage, costs, and model IDs for individual transactions and operations, apart from service-level aggregations; output quality factors including potential toxicity, harm, truncation […]

Observability Matters at Brightcove with AWS GameDay

Today, we’re pleased to announce the general availability of the Observability Matters on Amazon Web Services GameDay. AWS GameDay is a gamified learning event that challenges participants to use AWS solutions to solve real-world technical problems in a team-based setting. Unlike traditional workshops, GameDays are open-ended and non-prescriptive to give participants the freedom to explore and think outside […]

Centralize observability with Amazon Managed Grafana Enterprise plugins

Observability is critical to maintaining the health and performance of any distributed system. Organizations rely on data from diverse sources, including AWS services as well as third-party independent software vendors (ISVs), to gain insights into their systems' health. Establishing secure connections to these diverse data sources enables visualization and analysis of observability data […]

Use Amazon CloudWatch Contributor Insights for general analysis of Apache logs

Customers build, deploy, and maintain millions of web applications on AWS, and many customers deploy these applications using the Apache web application server. Web application performance is a key metric in modern enterprise applications. On AWS, customers leverage Amazon CloudWatch to monitor response times and uptime and to provide SLAs. Engineering teams that run large-scale applications […]

Get Disk Utilization of Your Fleet Using AWS Systems Manager Custom Inventory Types

Some of my customers need assistance while operating their Amazon Elastic Compute Cloud (Amazon EC2) infrastructure. They need to: Review the disk usage of various volumes/disks within an EC2 instance. To do it in a scalable way, one does not need to access the instance either through a Remote Desktop Session (RDP) or use […]

Automate CloudWatch Dashboard creation for your AWS Elemental MediaPackage and AWS Elemental MediaLive

Monitoring the health and performance of your media services is critical to ensuring a seamless viewing experience for your customers. Amazon CloudWatch provides powerful monitoring capabilities for Amazon Web Services (AWS) resources. Setting up comprehensive dashboards can be a time-consuming process, especially for organizations managing a large number of resources across multiple regions. The Automatic CloudWatch […]

Improve application reliability with effective SLOs

At AWS, we consider reliability as a capability of services to withstand major disruptions within acceptable degradation parameters and to recover within an acceptable timeframe. Service reliability goes beyond traditional disciplines, such as availability and performance, to achieve its goal. Components of a system or application will eventually fail over time. Like our CTO Werner Vogels […]

Respond to CloudWatch Alarms with Amazon Bedrock Insights

When operating complex, distributed systems in the cloud, quickly identifying the root cause of issues and resolving incidents can be a daunting task. Troubleshooting often involves sifting through metrics, logs, and traces from multiple AWS services, making it challenging to gain a comprehensive understanding of the problem. So how can you streamline this process […]