AWS Config for resource housekeeping and cost optimization
This guest blog post is contributed by Bradley Segobiano, a Lead Software Engineer at Genesys. Bradley works with the DevOps team and helps developer teams build and run a stable and highly available application platform.
The elasticity that cloud computing provides is a powerful enabler of innovation. But as new infrastructure is deployed, it is important to balance agility with governance. This blog post shares how Genesys uses AWS Config to implement an efficient process for resource management and compliance for its Genesys Cloud platform.
About Genesys Cloud
Genesys Cloud is deployed on AWS across multiple regions to serve customers around the world. The platform consists of hundreds of microservices and scales significantly depending on load and time of day. Deployments are fully baked sets of AWS CloudFormation stacks and Amazon Machine Images (AMIs), with a few thousand deploys per week across Production, Development, and Test environments. Genesys Cloud development teams are responsible for their own infrastructure: they write the application code, choose the AWS resources they need, write their own AWS CloudFormation templates, and deploy and manage everything in production.
At Genesys, our teams have total freedom to experiment in AWS using a development account. For example, our developers may need to create an Auto Scaling group to test an application or prototype an Amazon CloudWatch alarm. This allows them to innovate quickly and continue improving their applications. While this approach is great for fostering an environment of innovation, it does create some unique challenges. The flexibility it provides and the frequency of changes it introduces create an interesting problem: how can we identify, flag, and clean up unwanted resources in those accounts?
It’s important to understand why it is so critical to track these unused resources. Clutter can lead to confusion and to slower results for AWS API calls, because list and describe calls return more pages of results. Unused resources also consume Service Quotas. Exhausting the quota on a particular Amazon EC2 instance type could prevent your Auto Scaling group from scaling, for example. Identifying and cleaning up resources also reduces your AWS cost, because idle resources that accumulate over time can quickly increase it. Lastly, leaving unused resources running or provisioned can increase the attack surface of an account and present a significant security risk.
AWS Config in Action
To address our resource management and compliance requirements, we used AWS Config for several key reasons:
- Configuration-based rules keep the compliance state in sync with the current status of AWS resources by running every time a resource configuration changes.
- Configuration-based rules include a Configuration Item (CI) in the AWS Lambda invocation event. This CI essentially describes the resulting state of the AWS resource. Often the combination of the Configuration Item Diff and the Configuration Item provides enough context to determine compliance without needing to make any AWS API calls.
- The advanced query feature can be used when the Configuration Item and Configuration Item Diff provide inadequate data to determine compliance state.
- The AWS Config rules and related Genesys Cloud tooling are all serverless.
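As an illustration of the advanced query bullet above, here is a minimal sketch of paging through an AWS Config advanced query from Python. It is not the Genesys tooling: the query text, the function name instances_outside_asg, and the assumption that each result carries a tags list of key/value objects are ours. The client is passed in as a parameter (for example, a boto3 "config" client) so the paging logic can be exercised without AWS credentials.

```python
import json

# Tag that AWS adds to instances launched by an Auto Scaling group.
ASG_TAG = "aws:autoscaling:groupName"

# Hypothetical query: every recorded EC2 instance with its tags.
QUERY = (
    "SELECT resourceId, tags "
    "WHERE resourceType = 'AWS::EC2::Instance'"
)


def instances_outside_asg(config_client, expression=QUERY):
    """Return IDs of recorded EC2 instances that lack the Auto Scaling tag.

    config_client is anything exposing select_resource_config(), such as
    boto3.client("config"). Each entry in "Results" is a JSON string.
    """
    missing = []
    kwargs = {"Expression": expression}
    while True:
        page = config_client.select_resource_config(**kwargs)
        for raw in page.get("Results", []):
            item = json.loads(raw)
            # Assumption: tags come back as a list of {"key": ..., "value": ...}.
            tag_keys = {tag.get("key") for tag in item.get("tags", [])}
            if ASG_TAG not in tag_keys:
                missing.append(item["resourceId"])
        token = page.get("NextToken")
        if not token:
            return missing
        kwargs["NextToken"] = token  # request the next page of results
```

Because the query runs against data AWS Config has already recorded, a check like this avoids direct describe calls against EC2 or Auto Scaling.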
We deployed a set of custom AWS Config rules to each AWS account and Region. All of the accounts authorize a centralized account as the aggregator. Each rule determines the compliance of a single AWS resource type, so each resource type is mapped to one rule. A rule might contain multiple conditions, and a resource is non-compliant if any of those conditions is met.
An example rule:
All instances must be inside an Auto Scaling group
Here’s one example of how we implemented an AWS Config rule. Our earlier approach used the AWS Auto Scaling API operations to describe all Auto Scaling groups in the account. Then, we used the EC2 API operations to describe all instances. We compared both results to extract the difference between the lists. This required a large number of describe API calls to gather the information necessary to determine compliance state. Additionally, while these describe calls were being made and the lists were being built, scaling events kept happening, which immediately caused the results to drift from the real state of the infrastructure. Some resources that were not compliant at the time of the last scan might no longer exist in the account due to scale-in. Because of this, we needed to add verification steps between scans to validate results for cleanup. This added more time and complexity to the process.
We used AWS Config to simplify this process. A configuration change-based rule is applied to the EC2 instance resource type. The rule checks for changes to the tag: ‘aws:autoscaling:groupName’. AWS injects this tag onto instances when they are part of an Auto Scaling group, so it exists on all instances in an Auto Scaling group. There are some key advantages to this approach:
- This requires zero additional API calls.
- The AWS Config service handles cleanup when a resource is deleted.
- Rules are simple and easy to write.
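To make the tag check concrete, here is a minimal sketch of what such a custom rule could look like as a Lambda function. It is an illustration under stated assumptions, not the actual Genesys rule: the function names are ours, the Configuration Item's tags are assumed to arrive as a key-to-value map, and oversized configuration items (which arrive as a summary only) are not handled.

```python
import json

ASG_TAG = "aws:autoscaling:groupName"
DELETED_STATUSES = {"ResourceDeleted", "ResourceDeletedNotRecorded"}


def evaluate_instance(configuration_item):
    """Decide compliance from the Configuration Item alone (no API calls)."""
    if configuration_item.get("resourceType") != "AWS::EC2::Instance":
        return "NOT_APPLICABLE"
    if configuration_item.get("configurationItemStatus") in DELETED_STATUSES:
        return "NOT_APPLICABLE"
    # Assumption: tags are delivered as a {key: value} map on the item.
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if ASG_TAG in tags else "NON_COMPLIANT"


def lambda_handler(event, context):
    """Entry point AWS Config invokes when a recorded resource changes."""
    import boto3  # imported lazily so evaluate_instance is testable without the SDK

    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]
    boto3.client("config").put_evaluations(
        Evaluations=[
            {
                "ComplianceResourceType": item["resourceType"],
                "ComplianceResourceId": item["resourceId"],
                "ComplianceType": evaluate_instance(item),
                "OrderingTimestamp": item["configurationItemCaptureTime"],
            }
        ],
        ResultToken=event["resultToken"],
    )
```

Keeping the decision in a pure function such as evaluate_instance means the rule can be unit tested with plain dictionaries, and the only AWS call left is the one that reports the result back to AWS Config.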
How does it work? When a resource changes compliance status in AWS Config, AWS Config emits a ‘ComplianceChangeNotification’ event. A centralized Lambda function subscribes to the AWS Config topics from a different AWS account and records data in a DynamoDB table.
Here is what happens when a resource changes state from compliant -> non-compliant:
- A Lambda function writes a record to a DynamoDB table with a termination time in the future.
- A Lambda function notifies the owner that the resource is no longer compliant.
- After the termination time, the resource is deleted.
When a resource changes state from non-compliant -> compliant, or from any state to NOT_APPLICABLE:
- The Lambda function removes the record from the DynamoDB Table.
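The two transitions above can be sketched as a small handler. This is an illustration only: a plain dictionary stands in for the DynamoDB table, owner notification is omitted, the function names are hypothetical, and the message fields used (resourceId, resourceType, newEvaluationResult.complianceType) are our reading of the AWS Config notification format. The five-day grace period is taken from the walkthrough later in this post.

```python
import json
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=5)  # time the owner has before cleanup


def apply_compliance_change(message, table, now=None):
    """Update the termination table for one ComplianceChangeNotification.

    table is a dict keyed by resource ID, standing in for DynamoDB.
    Returns "scheduled", "cleared", or "ignored".
    """
    now = now or datetime.now(timezone.utc)
    resource_id = message["resourceId"]
    new_state = message["newEvaluationResult"]["complianceType"]

    if new_state == "NON_COMPLIANT":
        # Record a termination time in the future.
        table[resource_id] = {
            "resource_type": message["resourceType"],
            "terminate_at": (now + GRACE_PERIOD).isoformat(),
        }
        return "scheduled"
    if new_state in ("COMPLIANT", "NOT_APPLICABLE"):
        # The resource was fixed or no longer exists: drop any pending record.
        table.pop(resource_id, None)
        return "cleared"
    return "ignored"  # e.g. INSUFFICIENT_DATA


def handle_sns_event(event, table):
    """Process every SNS record delivered to the Lambda function."""
    results = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        if message.get("messageType") == "ComplianceChangeNotification":
            results.append(apply_compliance_change(message, table))
    return results
```

In a real deployment the dictionary writes would become DynamoDB put and delete operations, and the "scheduled" branch is where the owner notification would be sent.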
It is also important to mention that, in this process, several additional Lambda functions run periodically to review records in DynamoDB, notify owners before deletion, and then delete resources that are past their termination date.
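A periodic review of this kind can be sketched as a function that sorts pending records into "warn the owner" and "delete now" buckets. The record shape (an ISO 8601 terminate_at string keyed by resource ID), the name plan_sweep, and the one-day warning window are assumptions made for illustration; the post does not specify them.

```python
from datetime import datetime, timedelta, timezone


def plan_sweep(records, now, warn_window=timedelta(days=1)):
    """Split pending records into resources to warn about and to delete.

    records maps resource ID -> {"terminate_at": ISO 8601 timestamp}.
    Returns (to_warn, to_delete) as lists of resource IDs.
    """
    to_warn, to_delete = [], []
    for resource_id, record in sorted(records.items()):
        terminate_at = datetime.fromisoformat(record["terminate_at"])
        if terminate_at <= now:
            to_delete.append(resource_id)  # past the termination date
        elif terminate_at - now <= warn_window:
            to_warn.append(resource_id)  # deletion is imminent
    return to_warn, to_delete


# Example with three pending records evaluated at a fixed time.
now = datetime(2024, 1, 6, tzinfo=timezone.utc)
pending = {
    "i-aaa": {"terminate_at": "2024-01-05T00:00:00+00:00"},
    "i-bbb": {"terminate_at": "2024-01-06T12:00:00+00:00"},
    "i-ccc": {"terminate_at": "2024-01-10T00:00:00+00:00"},
}
to_warn, to_delete = plan_sweep(pending, now)
```

Separating the planning step from the actual notify and delete calls keeps the scheduled Lambda functions easy to test and makes a dry-run mode straightforward.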
Here is an example of how the process would work step by step:
- Instance i-12345abcd is detached from an Auto Scaling group and the tag ‘aws:autoscaling:groupName’ is removed.
- AWS Config records the tag change because of the Auto Scaling group removal.
- The resource is marked non-compliant in AWS Config.
- AWS Config emits a ComplianceChangeNotification to an Amazon SNS topic, and the Lambda function writes a record to the DynamoDB table to shut down the instance in 5 days.
- Instance i-12345abcd is attached back to its Auto Scaling group.
- AWS Config records the change to the tag ‘aws:autoscaling:groupName’.
- The resource is marked compliant in AWS Config.
- AWS Config emits a ComplianceChangeNotification to the SNS topic, and the Lambda function removes the termination record from the DynamoDB table.
This implementation of AWS Config has allowed us to simplify the process of tracking, validating, and cleaning up resources in our development environment. We plan to continue leveraging AWS Config and augment this solution in the future by combining the features of AWS Config with other AWS services like Lambda and DynamoDB.
About the Author
Bradley Segobiano is a Lead Software Engineer with the Infrastructure team at Genesys. He has spent the last 5 years building CI/CD pipelines and infrastructure management applications on AWS. Bradley works closely with service teams to help them leverage the latest AWS technologies. He has built tools used to reduce AWS spend, enforce automated governance, and keep AWS accounts in good standing, and has focused on migrating existing services to AWS Lambda and DynamoDB.