AWS Cloud Operations Blog

Advanced analytics using Amazon CloudWatch Logs Insights

Advanced analytics using Amazon CloudWatch Logs Insights

Effective log management and analysis are critical for maintaining robust, secure, and high-performing systems. Amazon CloudWatch Logs Insights has long been a powerful tool for searching, filtering, and analyzing log data across multiple log groups. The addition of OpenSearch Piped Processing Language (PPL) and OpenSearch SQL language query support offers greater flexibility and familiarity in […]

Enhance your AIOps: Introducing Amazon CloudWatch and Application Signals MCP servers

Modern architectures generate vast amounts of observability data across metrics, logs, and traces. When issues arise, teams spend hours—sometimes days—manually correlating information across multiple dashboards to identify root causes, directly impacting MTTR and productivity. Amazon CloudWatch Application Signals addresses this challenge by providing deep application visibility through automatic instrumentation, capturing key metrics like latency, error […]

Gain visibility of AWS backup activities using Amazon Managed Grafana

AWS Backup is a comprehensive service that simplifies the process of centralizing and automating data protection across various AWS services, both in the cloud and on-premises, all managed seamlessly. Organizations have different requirements and want to track their backup, copy and restore activities across AWS cloud resources. Currently, in order to view status of resource […]

Best practices for analyzing AWS Config recording frequencies

Best practices for analyzing AWS Config recording frequencies

AWS Config tracks configuration changes across your AWS resources and AWS Organizations. AWS Config uses the configuration recorder to detect changes and records them as configuration items (CIs). As your infrastructure grows and becomes more complex, choosing the appropriate recording frequency becomes critical for maintaining operational visibility, meeting compliance requirements, and supporting your security posture. Since the launch of the periodic recording […]

Centralized Multi-Account Application Resilience Assessment Using AWS Resilience Hub

Centralized Multi-Account Application Resilience Assessment Using AWS Resilience Hub

Introduction As organizations scale their cloud environments across multiple AWS accounts and regions, managing and accessing resilience becomes increasingly complex. Traditional approaches of evaluating resilience separately for each workload, account, or region can lead to inefficiencies, inconsistencies, and coverage gaps. This challenge is particularly pronounced in distributed architectures utilizing various Infrastructure as Code (IaC) tools […]

Optimize querying AWS CloudTrail logs with partitioning in Amazon Athena

Organizations leveraging AWS CloudTrail to audit API access encounter a common challenge: CloudTrail data volume grows proportionally with AWS infrastructure expansion. A multi-account AWS organization generating millions of API calls daily can quickly amass terabytes of CloudTrail logs. When security teams conduct incident investigations or account activity audits, querying these logs in Amazon Athena becomes […]

Learn from AWS Fault Injection Service team’s approach to Game Days

Learn from AWS Fault Injection Service team’s approach to Game Days

In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]

Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 2

Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 2

In practice: SLO monitoring with CloudWatch Application Signals In the previous post, we’ve shared the basic concepts and benefits of burn rate monitoring. In this post, we, the Amazon Product Search team, will share anecdotes from our migration from an in-house solution to CloudWatch Application Signals, and introduce how we actually implement monitoring and dashboards. […]

Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 1

Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 1

In theory: SLO concepts applied to Amazon Product Search In this series of posts, we will show you how we, the Amazon Product Search team, monitor key systems using Service Level Objectives (SLOs) and share our migration journey from an in-house solution to Amazon CloudWatch Application Signals. Amazon Product Search is a large distributed system […]

Using Amazon Bedrock and Amazon Nova for AI-Powered Incident Response

In today’s cloud-native world, incident response teams face overwhelming challenges. When critical applications fail, engineers must sift through mountains of observability data across multiple services; all while under intense pressure to restore service quickly. This manual correlation process is time-consuming, error-prone, and often delays resolution, resulting in extended outages and frustrated customers. Traditional monitoring tools […]