AWS Cloud Operations Blog
Category: *Post Types
Enhance your AIOps: Introducing Amazon CloudWatch and Application Signals MCP servers
Modern architectures generate vast amounts of observability data across metrics, logs, and traces. When issues arise, teams spend hours—sometimes days—manually correlating information across multiple dashboards to identify root causes, directly impacting MTTR and productivity. Amazon CloudWatch Application Signals addresses this challenge by providing deep application visibility through automatic instrumentation, capturing key metrics like latency, error […]
Best practices for analyzing AWS Config recording frequencies
AWS Config tracks configuration changes across your AWS resources and AWS Organizations. AWS Config uses the configuration recorder to detect changes and records them as configuration items (CIs). As your infrastructure grows and becomes more complex, choosing the appropriate recording frequency becomes critical for maintaining operational visibility, meeting compliance requirements, and supporting your security posture. Since the launch of the periodic recording […]
Optimize querying AWS CloudTrail logs with partitioning in Amazon Athena
Organizations leveraging AWS CloudTrail to audit API access encounter a common challenge: CloudTrail data volume grows proportionally with AWS infrastructure expansion. A multi-account AWS organization generating millions of API calls daily can quickly amass terabytes of CloudTrail logs. When security teams conduct incident investigations or account activity audits, querying these logs in Amazon Athena becomes […]
Learn from AWS Fault Injection Service team’s approach to Game Days
In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]
Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 2
In practice: SLO monitoring with CloudWatch Application Signals In the previous post, we’ve shared the basic concepts and benefits of burn rate monitoring. In this post, we, the Amazon Product Search team, will share anecdotes from our migration from an in-house solution to CloudWatch Application Signals, and introduce how we actually implement monitoring and dashboards. […]
Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 1
In theory: SLO concepts applied to Amazon Product Search In this series of posts, we will show you how we, the Amazon Product Search team, monitor key systems using Service Level Objectives (SLOs) and share our migration journey from an in-house solution to Amazon CloudWatch Application Signals. Amazon Product Search is a large distributed system […]
Using Amazon Bedrock and Amazon Nova for AI-Powered Incident Response
In today’s cloud-native world, incident response teams face overwhelming challenges. When critical applications fail, engineers must sift through mountains of observability data across multiple services; all while under intense pressure to restore service quickly. This manual correlation process is time-consuming, error-prone, and often delays resolution, resulting in extended outages and frustrated customers. Traditional monitoring tools […]
Simulating partial failures with AWS Fault Injection Service
Modern distributed systems must be resilient to unexpected disruptions to maintain availability, performance, and stability. Chaos engineering helps teams uncover hidden weaknesses by deliberately injecting faults into a system and observing how it recovers. While traditional testing validates expected behavior, chaos engineering tests system resilience during failures. AWS Fault Injection Service (AWS FIS) is a […]
Observing Agentic AI workloads using Amazon CloudWatch agent
Introduction As the adoption of agentic AI applications continues to grow, ensuring the reliability, performance, and overall observability of these systems becomes increasingly critical. Agentic AI applications, powered by large language models (LLM) and integrated with various data sources and APIs, can quickly become complex, making it challenging to gain visibility into their inner workings […]
Best practices for utilizing AWS Systems Manager with AWS Fault Injection Service
Introduction In today’s cloud-centric world, ensuring the resilience of mission-critical applications is paramount. The ability to withstand and recover from unexpected failures, including degradation of cloud provider services, can mean the difference between seamless operation and costly downtime. This is where the powerful combination of AWS Systems Manager (SSM) and AWS Fault Injection Service (AWS […]