Amazon CloudWatch (CloudWatch) is a service used by Amazon Web Services (AWS) customers to monitor their cloud resources and applications running on AWS. Customers use Amazon CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in AWS resources. CloudWatch provides system-wide visibility into resource utilization, application performance, and operational health.
For the Amazon CloudWatch team, performance and speed are critical in empowering customers to identify and address issues. “Our goal is to reduce the time it takes for us to get metrics to our customers, so they can react faster to any issues in their systems,” says Sebastian Rodriguez, software development manager for the CloudWatch monitoring team.
The CloudWatch team moved its metrics storage to the Amazon DynamoDB (DynamoDB) database service in 2014 in an effort to improve efficiency over a solution based on relational databases. “We considered several data stores, but we chose Amazon DynamoDB for its scalability and performance,” says Rodriguez.
However, the team struggled to manage the lifecycle of its tables and items, including creation, monitoring, and deletion. DynamoDB did not offer lifecycle management as a built-in feature at the time, so the team had to handle it manually. “We had to build, deploy, and maintain our own solution to manage the lifecycle of the tables,” says Rodriguez. “We had to constantly monitor the solution to make sure it was creating and deleting metric data in the tables, and it was a time-consuming process for us.”
With a rapidly growing customer base and region footprint, the CloudWatch team needed a simpler solution for managing the lifecycle of its data tables. “The operational burden of our services was increasing as we added customers, traffic, and new AWS regions, so we needed to accommodate that growth in the most cost-effective and efficient way possible,” Rodriguez says.
The CloudWatch team’s lifecycle-management problems were solved in 2017, when DynamoDB launched Time to Live (TTL), a feature that lets users define when items in a table expire so that they are automatically deleted from the database.
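To make the mechanism concrete, the following is a minimal Python sketch of how an application might use TTL. It is not the CloudWatch team's implementation. The attribute names (metric_name, sample_time, sample_value, expires_at), the 15-day retention period, and the helper function names are all hypothetical. The sketch relies on two documented TTL behaviors: the expiry attribute must be a Number holding a Unix epoch time in seconds, and expired items are removed by a background process rather than at the exact moment of expiry. The only AWS call shown is the boto3 client method update_time_to_live, and boto3 is imported only inside the function that uses it.

```python
SECONDS_PER_DAY = 86_400


def expiry_epoch(now_epoch, retention_days):
    """Return the Unix epoch second at which an item should expire."""
    return int(now_epoch) + retention_days * SECONDS_PER_DAY


def build_metric_item(metric_name, sample_time, sample_value,
                      retention_days, ttl_attribute="expires_at"):
    """Build an item in DynamoDB's low-level attribute-value format.

    All attribute names here are hypothetical. The TTL attribute is
    stored as a Number ("N") containing epoch seconds.
    """
    return {
        "metric_name": {"S": metric_name},
        "sample_time": {"N": str(int(sample_time))},
        "sample_value": {"N": str(sample_value)},
        ttl_attribute: {"N": str(expiry_epoch(sample_time, retention_days))},
    }


def ttl_specification(ttl_attribute="expires_at"):
    """Return the TimeToLiveSpecification payload for update_time_to_live."""
    return {"Enabled": True, "AttributeName": ttl_attribute}


def enable_ttl(table_name, ttl_attribute="expires_at"):
    """Enable TTL on an existing table (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without it

    client = boto3.client("dynamodb")
    return client.update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification=ttl_specification(ttl_attribute),
    )


def is_expired(item, now_epoch, ttl_attribute="expires_at"):
    """True if the item's TTL has passed, even if it is not yet deleted."""
    return int(item[ttl_attribute]["N"]) <= int(now_epoch)
```

Because deletion happens in the background, a reader of the table may still see items whose expiry time has passed; a check like is_expired is one way to filter them out on read.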
The CloudWatch team now keeps all its items in a single DynamoDB table and relies on TTL to manage them, which lets the team retrieve data more efficiently because fewer tables need to be accessed. “The process is very simple now,” says Rodriguez. “We only have to create one table to manage the data, and we don’t have to spend time manually creating an application to remove expired items. DynamoDB was able to scale for us to have a single table with millions of IOPS.”
Adopting DynamoDB TTL cut the CloudWatch team’s operational burden: the team no longer has to manage table deletion manually, and it needs less provisioned capacity. “Because we can automate the process of deleting items from tables, we reduced our overall provisioned throughput for tables by 75 percent,” says Rodriguez. “And with that reduction, data-retrieval latencies were also reduced by up to 10 percent.”
The team has also been able to reduce its costs thanks to smaller table sizes. “We were pleased to see that TTL provided us the disk-space cost savings we expected,” says Rodriguez. “But it was a pleasant surprise that we were able to simplify our architecture, dramatically reduce throughput for our tables, and ultimately save millions of dollars annually.”
Most importantly, DynamoDB TTL enables the CloudWatch team to deliver the low latency its customers expect from the monitoring platform. “In our previous environment, we had to process multiple tables of data, aggregate them, and then return the data to our customers,” Rodriguez says. “Now, with Amazon DynamoDB TTL, we only need one table and we don’t have to manage the data-expiration process. The entire data-storage process is simpler, and our customers get their CloudWatch metrics faster.”
Learn more about Amazon DynamoDB TTL.