AWS Cloud Operations Blog
Automating Amazon CloudWatch Alarm Cleanup at Scale
Do you have thousands of Amazon CloudWatch alarms across AWS Regions and want to quickly identify which ones are low-value or misconfigured? Are you looking for ways to identify alarms that have been in the ‘ALARM’ or ‘INSUFFICIENT_DATA’ state for several days and need to be revisited? Do you need a cleanup mechanism to review low-value alarms across Regions and delete them periodically to optimize alarm costs?
In this blog post, we’ll explore how you can deploy a cleanup mechanism for low-value CloudWatch alarms at scale across Regions in an AWS account. The mechanism helps you optimize CloudWatch alarm costs by identifying different kinds of misconfigured or low-value alarms. Alarms that have remained in the ALARM or INSUFFICIENT_DATA state continuously for several days are categorized as stale alarms. Alarms that don’t take any actions and are not referenced by a composite alarm might be of low value. We encourage you to review those alarms and either confirm that they are useful or delete them if you find you don’t need them.
Deploying the solution
This solution and associated resources are available for you to deploy into your own AWS account as an AWS CloudFormation template.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
What will the CloudFormation template deploy?
The CloudFormation template will deploy the following resources into the AWS account:
- AWS Identity and Access Management (IAM) role for AWS Lambda
  - CloudWatchAlarmHealthCheckerRole
  - Allows writing to CloudWatch logs and S3 bucket, and access to the CloudWatch APIs
- AWS Lambda function
  - CloudWatchAlarmHealthCheckerpy-<stackname>
- Amazon CloudWatch Log group for the Lambda function
  - normal naming convention, i.e., /aws/lambda/<function name>
  - set to seven days retention
  - created explicitly here so that they will be deleted with stack deletion
- Amazon S3 bucket to upload the ‘suspicious alarms’ and ‘alarms to delete’ spreadsheets
  - S3BucketName
  - S3 bucket has DeletionPolicy set to retain
- S3 bucket policy to allow access from Lambda
How to deploy the CloudFormation template
- Download the yaml file.
- Navigate to the CloudFormation console in your AWS Account.
- Choose Create stack.
- Choose Template is ready, then Upload a template file, and select the yaml file that you just downloaded.
- Choose Next.
- Give the stack a name (maximum length 30 characters).
- Enter a name for the S3 bucket that the stack will create to store the alarm spreadsheets, and select Next.
- Add tags if desired, and select Next.
- Scroll to Capabilities at the bottom of the screen, check the box “I acknowledge that AWS CloudFormation might create IAM resources with custom names,” and choose Create stack.
- Wait for the stack creation to complete.
- Navigate to Lambda console > Functions.
- Select the Lambda function called CloudWatchAlarmHealthCheckerpy-<stackname>.
- Scroll down to the Code section of the Lambda function and select Test.
- Configure a test event with the input below.
Figure 1: Lambda function input JSON
Use the sample JSON below and input your S3 bucket.
{
"nodata_days": 7,
"stale_days": 30,
"disabled_actions_days": 60,
"max_iterations": 1800,
"operating_mode":"report_only",
"regions": ["us-east-1", "us-west-1", "us-east-2", "us-west-2", "eu-west-1"]
}
You can change ‘nodata_days’, ‘stale_days’, and ‘disabled_actions_days’ to tune the suspicious alarms list for your use case. Set ‘max_iterations’ based on the number of alarms and metrics in the account, and ‘regions’ based on the Regions in which the account has alarms. By default, ‘operating_mode’ is set to ‘report_only’ so that running this Lambda function does not delete alarms by accident; this mode lets you first review all the alarms in the spreadsheets. After your review, set ‘operating_mode’ to ‘actual_deletion’ to delete all the alarms listed in the ‘alarms_to_delete’ spreadsheets.
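If you prefer to run the function from a script instead of the console Test button, the event can be assembled and checked in code first. The following is a minimal sketch, not part of the published solution: `build_event` and `invoke_checker` are hypothetical helper names, the validation rules are assumptions based only on the parameters described above, and the function name you pass in must match the one created by your stack.

```python
import json

# The two operating modes named in this post.
ALLOWED_MODES = {"report_only", "actual_deletion"}


def build_event(regions, nodata_days=7, stale_days=30,
                disabled_actions_days=60, max_iterations=1800,
                operating_mode="report_only"):
    """Build the test event from this post, rejecting obviously bad values.

    Hypothetical helper: the checks below are assumptions, not the
    solution's own validation.
    """
    if operating_mode not in ALLOWED_MODES:
        raise ValueError(f"operating_mode must be one of {sorted(ALLOWED_MODES)}")
    if not regions:
        raise ValueError("regions must list at least one AWS Region")
    numeric = {
        "nodata_days": nodata_days,
        "stale_days": stale_days,
        "disabled_actions_days": disabled_actions_days,
        "max_iterations": max_iterations,
    }
    for name, value in numeric.items():
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer")
    return {**numeric, "operating_mode": operating_mode, "regions": list(regions)}


def invoke_checker(function_name, event, region_name="us-east-1"):
    """Invoke the deployed function asynchronously and return the HTTP status."""
    import boto3  # imported lazily so build_event works without boto3 installed

    client = boto3.client("lambda", region_name=region_name)
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: avoids client-side read timeouts
        Payload=json.dumps(event).encode("utf-8"),
    )
    return response["StatusCode"]
```

Keeping ‘report_only’ as the default argument mirrors the safety default described above: deletion only happens when ‘actual_deletion’ is passed explicitly.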
Once you run this solution in your account, it outputs spreadsheets for you to review and uploads them to an S3 bucket of your choice. The spreadsheets list the alarms that are ready for deletion and the alarms that are suspicious and need your review.
- A spreadsheet listing, per Region, the alarms that are ready for deletion. These alarms can likely be deleted because they reference a metric that does not exist, either because the metric is no longer emitted or because it was misspelled at alarm creation time.
- Another spreadsheet listing, per Region, all suspicious alarms: alarms that are stale or that might be of low value.
Figure 2: Spreadsheets uploaded to S3 bucket by the solution
The suspicious alarms spreadsheet lists the alarms in your account, across Regions, that:
- have been in the ‘ALARM’ state for more than ‘stale_days’ with no data
- have been in the ‘INSUFFICIENT_DATA’ state for more than ‘nodata_days’
- do not have any action associated and do not have a parent
- have had their actions continuously disabled for more than ‘disabled_actions_days’
You configure ‘stale_days’, ‘nodata_days’, and ‘disabled_actions_days’ in the Lambda function test event shown earlier in this post. The suspicious alarms list is for you to review and decide whether cleanup is warranted. Alarms that have no associated action and no parent alarm are listed as suspicious rather than marked for deletion, because their state changes could still be monitored through Amazon EventBridge.
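To make the suspicious-alarm criteria concrete, here is a minimal sketch of how one alarm could be checked against them. It is an illustration under stated assumptions, not the solution's actual code: `suspicious_reasons` is a hypothetical function, the alarm dictionary uses the field names returned by the CloudWatch `describe_alarms` API, `child_alarm_names` is assumed to be a set you build from the rules of your composite alarms, and `actions_disabled_since` is assumed to be derived separately (for example from alarm history), because `describe_alarms` does not report when actions were disabled.

```python
from datetime import datetime, timedelta, timezone


def suspicious_reasons(alarm, now, nodata_days, stale_days,
                       disabled_actions_days, child_alarm_names,
                       actions_disabled_since=None):
    """Return the list of reasons (possibly empty) an alarm looks suspicious."""
    reasons = []
    time_in_state = now - alarm["StateUpdatedTimestamp"]
    state = alarm["StateValue"]

    # Stale: stuck in ALARM or INSUFFICIENT_DATA beyond the configured window.
    if state == "ALARM" and time_in_state > timedelta(days=stale_days):
        reasons.append("stale_alarm_state")
    if state == "INSUFFICIENT_DATA" and time_in_state > timedelta(days=nodata_days):
        reasons.append("stale_insufficient_data")

    # Low value: no action of any kind and no composite alarm referencing it.
    has_actions = any(
        alarm.get(key)
        for key in ("AlarmActions", "OKActions", "InsufficientDataActions")
    )
    if not has_actions and alarm["AlarmName"] not in child_alarm_names:
        reasons.append("no_action_no_parent")

    # Actions disabled for longer than the configured window (assumed input).
    if (not alarm.get("ActionsEnabled", True)
            and actions_disabled_since is not None
            and now - actions_disabled_since > timedelta(days=disabled_actions_days)):
        reasons.append("actions_disabled")

    return reasons
```

An alarm can match more than one criterion, which is why the sketch returns a list instead of a single flag.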
The alarms to delete spreadsheet contains an alarm if it:
- references a namespace that is invalid or does not exist
- targets an unknown metric
- references a dimension that does not exist for the metric
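The three deletion checks can be sketched as a lookup against the metrics that CloudWatch currently knows about. This is a hedged illustration, not the solution's implementation: `deletion_reason` and `load_known_metrics` are hypothetical names, the alarm dictionary again follows the `describe_alarms` field names, and the sketch skips metric-math alarms entirely. Note that the `list_metrics` API only returns metrics that have reported data in the past two weeks, so a metric that has been silent for longer would look missing under this approach.

```python
def _dimension_key(dimensions):
    """Normalize a list of {'Name', 'Value'} dicts into a hashable key."""
    return frozenset((d["Name"], d["Value"]) for d in dimensions)


def deletion_reason(alarm, known_metrics):
    """Return why an alarm looks deletable, or None if its metric exists.

    known_metrics maps namespace -> metric name -> set of dimension keys.
    """
    namespace = alarm.get("Namespace")
    metric_name = alarm.get("MetricName")
    if namespace is None or metric_name is None:
        return None  # metric-math alarm: out of scope for this sketch
    if namespace not in known_metrics:
        return "unknown_namespace"
    if metric_name not in known_metrics[namespace]:
        return "unknown_metric"
    if _dimension_key(alarm.get("Dimensions", [])) not in known_metrics[namespace][metric_name]:
        return "unknown_dimensions"
    return None


def load_known_metrics(region_name):
    """Build the known_metrics lookup for one Region from list_metrics."""
    import boto3  # imported lazily so deletion_reason works without boto3

    client = boto3.client("cloudwatch", region_name=region_name)
    known = {}
    for page in client.get_paginator("list_metrics").paginate():
        for metric in page["Metrics"]:
            (known.setdefault(metric["Namespace"], {})
                  .setdefault(metric["MetricName"], set())
                  .add(_dimension_key(metric["Dimensions"])))
    return known
```

Checking namespace, then metric name, then dimensions in that order reports the broadest mismatch first, which makes a misspelled namespace easy to tell apart from a stale dimension value.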
Costs
There is a cost associated with using this solution because it stores data in an S3 bucket and runs Lambda code that makes API calls. The cost should be minimal. For example, running the solution against an account with 100,000 alarms and 300,000 metrics costs no more than a few cents.
All pricing details are available on the Amazon S3 and AWS Lambda pages.
Cleanup
If you decide that you no longer want to keep the Lambda function and associated resources, you can navigate to CloudFormation in the AWS Console, choose the stack (you will have named it when you deployed it), and choose Delete. All of the resources will be deleted except the S3 bucket, which has its deletion policy set to retain.
Should you want to add this cleanup mechanism back in at any point, you can create a stack again from the CloudFormation yaml.
Conclusion
You can use this solution to get a better understanding of which of your CloudWatch alarms are of low value or obsolete, and then take action to delete them. You can run it once and review the alarms for deletion, or run it periodically using Amazon EventBridge. Either way, you can quickly identify and delete low-value or obsolete alarms across Regions and save costs.
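For the periodic option, one possible setup is an EventBridge scheduled rule that invokes the function with the same event used in the console test. The sketch below is an assumption-laden example, not something the CloudFormation template creates: the helper names, the rule name you pass in, the target ID, and the weekly `rate(7 days)` schedule are all illustrative choices.

```python
import json


def build_schedule_requests(rule_name, function_arn, event, schedule="rate(7 days)"):
    """Build the keyword arguments for EventBridge put_rule and put_targets."""
    put_rule = {
        "Name": rule_name,
        "ScheduleExpression": schedule,
        "State": "ENABLED",
    }
    put_targets = {
        "Rule": rule_name,
        "Targets": [{
            "Id": "alarm-cleanup",        # illustrative target ID
            "Arn": function_arn,
            "Input": json.dumps(event),   # constant JSON passed to the function
        }],
    }
    return put_rule, put_targets


def schedule_checker(rule_name, function_arn, event, region_name="us-east-1"):
    """Create the rule, allow it to invoke the function, and attach the target."""
    import boto3  # imported lazily so build_schedule_requests works without boto3

    events = boto3.client("events", region_name=region_name)
    lambda_client = boto3.client("lambda", region_name=region_name)
    put_rule, put_targets = build_schedule_requests(rule_name, function_arn, event)

    rule_arn = events.put_rule(**put_rule)["RuleArn"]
    # EventBridge needs resource-based permission to invoke the function.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(**put_targets)
    return rule_arn
```

If you schedule the function this way, keeping ‘operating_mode’ set to ‘report_only’ in the scheduled event preserves the review step before any deletion.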