AWS Cloud Operations Blog
Optimizing metrics ingestion with Amazon Managed Service for Prometheus
Managing metrics collection at scale in complex cloud environments presents significant challenges for organizations, particularly when it comes to controlling costs and maintaining operational efficiency. As the volume of metrics grows exponentially with the expansion of container deployments and other cloud-native workloads, customers often struggle to balance comprehensive monitoring with resource optimization. This can lead […]
AWS Organizations launches account state information for granular account lifecycle management
AWS Organizations enables customers to centrally manage their AWS accounts. Many customers prefer to automate account creation, and they can use the CreateAccount API to build an account vending pipeline. This pipeline standardizes the deployment of policies, roles, and resources across new accounts while managing the complete lifecycle through eventual account closure. Through this […]
AWS Systems Manager Run Command now supports interpolating parameters into environment variables
Today we are introducing an important enhancement to AWS Systems Manager (SSM) Documents: environment variable interpolation when processing parameters. This feature, now available in schema version 2.2 with AWS Systems Manager Agent v3.3.2746.0 or later, simplifies document execution by ensuring parameter values are treated as literal strings, eliminating unexpected behavior and streamlining your automation processes. […]
Advanced analytics using Amazon CloudWatch Logs Insights
Effective log management and analysis are critical for maintaining robust, secure, and high-performing systems. Amazon CloudWatch Logs Insights has long been a powerful tool for searching, filtering, and analyzing log data across multiple log groups. The addition of OpenSearch Piped Processing Language (PPL) and OpenSearch SQL language query support offers greater flexibility and familiarity in […]
Enhance your AIOps: Introducing Amazon CloudWatch and Application Signals MCP servers
Modern architectures generate vast amounts of observability data across metrics, logs, and traces. When issues arise, teams spend hours—sometimes days—manually correlating information across multiple dashboards to identify root causes, directly impacting MTTR and productivity. Amazon CloudWatch Application Signals addresses this challenge by providing deep application visibility through automatic instrumentation, capturing key metrics like latency, error […]
Gain visibility of AWS Backup activities using Amazon Managed Grafana
AWS Backup is a comprehensive service that centralizes and automates data protection across various AWS services, both in the cloud and on-premises. Organizations have different requirements and want to track their backup, copy, and restore activities across AWS cloud resources. Currently, to view the status of resource […]
Best practices for analyzing AWS Config recording frequencies
AWS Config tracks configuration changes across your AWS resources and AWS Organizations. AWS Config uses the configuration recorder to detect changes and records them as configuration items (CIs). As your infrastructure grows and becomes more complex, choosing the appropriate recording frequency becomes critical for maintaining operational visibility, meeting compliance requirements, and supporting your security posture. Since the launch of the periodic recording […]
Centralized Multi-Account Application Resilience Assessment Using AWS Resilience Hub
As organizations scale their cloud environments across multiple AWS accounts and regions, managing and assessing resilience becomes increasingly complex. Traditional approaches of evaluating resilience separately for each workload, account, or region can lead to inefficiencies, inconsistencies, and coverage gaps. This challenge is particularly pronounced in distributed architectures utilizing various Infrastructure as Code (IaC) tools […]
Optimize querying AWS CloudTrail logs with partitioning in Amazon Athena
Organizations leveraging AWS CloudTrail to audit API access encounter a common challenge: CloudTrail data volume grows proportionally with AWS infrastructure expansion. A multi-account AWS organization generating millions of API calls daily can quickly amass terabytes of CloudTrail logs. When security teams conduct incident investigations or account activity audits, querying these logs in Amazon Athena becomes […]
Learn from AWS Fault Injection Service team’s approach to Game Days
In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]