How CrescoNet Optimized Their Architecture and Reduced Their AWS Bill by Over 40%
This blog was written in partnership with Michael Peterson of CrescoNet.
In this blog, you will learn how CrescoNet has employed both basic and advanced techniques to reduce their costs without compromising the performance, scalability or reliability of their critical data pipeline.
The challenge for CrescoNet
CrescoNet is a leading integrator of smart metering solutions using cutting-edge technologies across the electricity, water, and gas industries. CrescoNet operates in an environment where they are required to meet strict SLAs for their customers and regulators. By partnering with AWS, CrescoNet has developed a leading-edge solution that provides them with the reliability, performance and scalability required to process over 4.5 billion meter readings each day, making them one of the largest consumers of data streaming services in the Australian energy sector. CrescoNet integrates a rapidly growing fleet of electricity meters into a pipeline in which readings are ingested, validated, interpolated, and delivered to the regulated energy market. With increasing adoption of rooftop solar and electric vehicles, smart meters are a key part of the Consumer Electricity Resources (CER) landscape. Australia is in a period of rapid expansion in the deployment of smart meter technology, and this is expected to drive significant scale and cost through the CrescoNet meter-to-cash data pipeline.
In this context, the challenge for CrescoNet is threefold:
- Controlling cost – Operating at this scale means that even small inefficiencies can have an outsized influence on the overall cost.
- Scalability – Ensuring the pipeline can scale to meet the significant expected growth.
- Resilience – Maintaining throughput and reliability to consistently meet business SLAs. Missing data delivery SLAs has a direct impact on CrescoNet revenue.
How CrescoNet is optimizing their spend
To address these challenges, CrescoNet and AWS took a deliberate, three-step approach to optimizing for cost:
- Identify – Amazon QuickSight dashboards driven by AWS Cost and Usage Report (CUR) data, together with observability data from Prometheus and Grafana, were used to identify, quantify and validate each cost saving opportunity.
- Assess – For each cost saving opportunity, the projected benefits were assessed holistically, identifying any tradeoffs between cost, effort, performance, scalability and reliability.
- Implement and iterate – After implementing changes, CrescoNet continued a data driven approach to validate results and iterate as necessary.
Through this approach, CrescoNet was able to reduce their overall AWS spend by over 40%, with an 80% reduction in the cost of their meter-to-cash data pipeline. In the following sections, we will run through the top eight optimizations implemented by CrescoNet.
1. Transitioning from AWS Glue to Amazon EMR on EKS
Early in their journey, AWS Glue was the ideal service for data processing – serverless, scalable and reliable. AWS Glue allowed CrescoNet to establish their pipeline, deliver value early and scale to meet growing demand of the business and industry. As their usage and costs increased, CrescoNet identified Amazon EMR on Amazon EKS as a more cost-effective option. EMR on EKS allowed CrescoNet to take full advantage of the Amazon EMR runtime efficiencies, as well as take more control over the container orchestration and infrastructure provisioning with Amazon EKS.
CrescoNet was also able to take advantage of Amazon EC2 Spot Instances, which let you use spare EC2 capacity in the AWS cloud at up to a 90% discount compared to On-Demand prices. CrescoNet uses a wide range of Amazon EC2 instance families to ensure that they have appropriate resilience in the event of a Spot interruption. Transitioning to Amazon EMR on EKS, however, required CrescoNet to develop and mature their technical expertise in building and operating a Kubernetes cluster with Apache Spark, as well as build a solution to replace the native bookmarking features in AWS Glue.
With this change, CrescoNet was able to reduce the cost of their ETL compute by over USD $50,000/month, without sacrificing performance. CrescoNet continues to operate AWS Glue ETL jobs where it makes sense, and takes advantage of the AWS Glue catalog to store technical metadata for their data in the data lake.
2. Adopting Apache Iceberg
As interval, scalar and event messages transition through the data pipeline, data is progressively stored in Amazon S3. Interval data is, by nature, composed of a high volume of relatively small messages arriving almost constantly, and this causes significant overhead in both the performance and the cost of data retrieval from S3. In addition to the data pipeline regularly fetching large numbers of files from S3, Amazon Athena queries run by customers across the data were slow and expensive.
CrescoNet currently stores over 450 TB of meter data in S3, and while storage costs on Amazon S3 remained relatively low (priced per GB-month), uploading and retrieving large volumes of small files (i.e. performing PUT, COPY, POST, LIST, GET and SELECT requests, charged per 1,000 requests) incurred a large cost. In fact, when using compressed Parquet files with a Hive table format in S3 for a large number of small files, the costs for S3 GET and PUT operations were 5 times higher than the storage costs. This small file approach also impacted performance, with significant overhead in retrieving, decrypting and scanning large numbers of small files.
Figure 1: Relative costs of API request charges compared to storage costs.
CrescoNet modified their pipeline to use Apache Iceberg on S3, using compaction to provide a quick and simple way to regularly merge the high volume of small files into a smaller number of larger files (around 128 MB each). Compaction allowed CrescoNet to achieve a 27-fold reduction in S3 GET and PUT costs, while also improving performance.
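To illustrate why compaction cuts request counts so sharply, here is a minimal sketch of greedy grouping of small files into outputs of roughly 128 MB. It is not CrescoNet's implementation, and in practice Iceberg's own compaction does this work. The function name `plan_compaction` and the input file sizes are hypothetical.

```python
from typing import List

TARGET_BYTES = 128 * 1024 * 1024  # target output size (~128 MB, from the text)


def plan_compaction(file_sizes: List[int], target: int = TARGET_BYTES) -> List[List[int]]:
    """Greedily group files, in order, so each group totals at most `target` bytes.

    A single file larger than `target` ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 10,000 small files of 1 MiB each.
small_files = [1024 * 1024] * 10_000
groups = plan_compaction(small_files)
print(f"{len(small_files)} small files -> {len(groups)} compacted files")
```

Because S3 requests are charged per request rather than per byte, a full scan of the compacted layout needs one GET per group instead of one per original file.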
3. AWS Lambda optimization – runtime and right sizing
CrescoNet makes heavy use of AWS Lambda functions to process messages. AWS Lambda allows CrescoNet to run code without provisioning or managing the underlying infrastructure. CrescoNet executes over 110 million requests per month across 87 Lambda functions, consuming 12.3 million GB-seconds of compute resource. While AWS Lambda is already a very cost-effective solution for CrescoNet, CrescoNet used Amazon CloudWatch metrics to identify functions that were either over-provisioned or operating with legacy runtime environments.
By addressing these issues, CrescoNet was able to reduce their Lambda costs while also improving performance. After upgrading their runtime from Python 3.8 and 3.9 to 3.11, AWS Lambda execution times fell by 30% on average, often enabling a reduction in memory allocation.
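Lambda compute is billed in GB-seconds, so shorter durations and smaller memory allocations multiply together. The sketch below works through that arithmetic for a single hypothetical function. The invocation count, duration and memory sizes are illustrative assumptions, and only the 30% duration reduction comes from the text.

```python
def gb_seconds(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Billed compute: invocations x duration (seconds) x memory (GB)."""
    return invocations * (avg_duration_ms / 1000.0) * (memory_mb / 1024.0)


# Hypothetical function: 1 million invocations per month, 400 ms average, 1024 MB.
before = gb_seconds(1_000_000, 400, 1024)

# Same function after the runtime upgrade: 30% shorter duration (from the text)
# and memory halved to 512 MB (an assumption for illustration).
after = gb_seconds(1_000_000, 400 * 0.7, 512)

saving = 1 - after / before
print(f"GB-seconds before: {before:,.0f}, after: {after:,.0f}, saving: {saving:.0%}")
```

The per-request charge is unaffected by either change, so the overall saving on a real bill would be somewhat lower than the compute saving alone.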
4. Amazon DynamoDB Time to Live (TTL)
CrescoNet has developed an application that brokers interval delivery with the market regulator. This application uses a combination of Amazon Aurora and Amazon DynamoDB, with over 20 TB of data stored in DynamoDB tables for fast, low latency retrieval for users. After reviewing data access patterns, CrescoNet identified that data is retrieved less and less frequently as it ages, and that the requirement for low-latency access diminishes for data older than one to two months. CrescoNet applied a Time to Live (TTL) value on items in tables to remove infrequently accessed data, and modified their access patterns, continuing to retrieve “hot” recent data from DynamoDB, while retrieving “warm” and “cold” data from the data lake.
Introducing this TTL allowed CrescoNet to reduce the amount of data stored in DynamoDB by 66%, reducing storage cost by $3,000/month. Because TTL deletions do not consume write throughput, removing items this way also reduced write-throughput costs by 10%. CrescoNet was able to find a good balance between maintaining low latency access to frequently accessed data, while reducing overall storage costs.
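DynamoDB TTL works from a numeric attribute holding an expiry time in epoch seconds. The following is a minimal sketch of stamping items with such an attribute and routing reads between the "hot" store and the data lake. The attribute name `expires_at`, the 60-day retention and the item fields are assumptions, not details from CrescoNet's application.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 60  # assumption: roughly the one to two months mentioned above


def with_ttl(item: dict, written_at: datetime, retention_days: int = RETENTION_DAYS) -> dict:
    """Return a copy of `item` with a hypothetical `expires_at` attribute (epoch seconds)."""
    expires = written_at + timedelta(days=retention_days)
    return {**item, "expires_at": int(expires.timestamp())}


def is_hot(item: dict, now: datetime) -> bool:
    """True while the item is inside its retention window and should be read from DynamoDB."""
    return item["expires_at"] > int(now.timestamp())


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
item = with_ttl({"meter_id": "M-001", "interval": "2024-01-01T00:00Z"}, written)

print(is_hot(item, written + timedelta(days=30)))  # still in the hot window
print(is_hot(item, written + timedelta(days=61)))  # past expiry: read from the data lake
```

Expired items are removed in the background rather than at the exact expiry time, so a read-side check like `is_hot` is a reasonable guard against serving items that have expired but not yet been deleted.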
5. Managed Streaming for Apache Kafka optimizations
Apache Kafka is a fundamental part of the data pipeline, with every message delivered to multiple consumers through Kafka topics. CrescoNet uses Amazon Managed Streaming for Apache Kafka (MSK), taking advantage of the scalability, performance, and reliability of Kafka without having to maintain and manage the underlying infrastructure. CrescoNet identified two cost saving opportunities with Amazon MSK:
- Tiered storage – Tiered storage for Amazon MSK stores streaming data in a performance-optimized primary storage tier, before automatically moving data into the low-cost storage tier at topic retention limits. CrescoNet now uses tiered storage to improve performance (faster partition rebalancing), scalability (virtually unlimited storage), and reliability (longer duration safety buffers), while also reducing the cost of message retention.
- Consolidation of non-production instances – Amazon MSK pricing is based on broker instance usage at one-second resolution as well as data storage. By consolidating from multiple non-production instances into a shared instance, CrescoNet was able to reduce the per broker costs of the service and benefit from enhanced operational efficiency.
6. Batching Amazon Simple Queue Service messages
CrescoNet makes extensive use of Amazon Simple Queue Service (SQS) queues, processing over 140 million messages per day. SQS is priced on a per request basis, and by compressing and aggregating multiple message payloads into a single request, CrescoNet was able to significantly reduce the number of requests made to SQS. This simple change allowed CrescoNet to reduce their SQS costs by a factor of more than 20.
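The idea of compressing and aggregating payloads can be sketched as follows. This is an illustration under stated assumptions rather than CrescoNet's code: the JSON, gzip and base64 encoding, the helper names and the 256 KiB size budget are all assumptions, and the budget should be checked against current SQS quotas.

```python
import base64
import gzip
import json
from typing import Iterable, List

MAX_BODY_BYTES = 256 * 1024  # assumed per-message size budget


def encode(payloads: List[dict]) -> str:
    """Serialize, gzip and base64-encode a list of payloads into one message body."""
    raw = json.dumps(payloads, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode(body: str) -> List[dict]:
    """Inverse of `encode`, for use on the consumer side."""
    return json.loads(gzip.decompress(base64.b64decode(body)))


def batch(payloads: Iterable[dict], max_bytes: int = MAX_BODY_BYTES) -> List[str]:
    """Pack payloads, in order, into as few encoded bodies as fit the size budget.

    Re-encodes on every step for simplicity; a single payload that is larger
    than `max_bytes` on its own is still emitted as one oversize body.
    """
    bodies: List[str] = []
    current: List[dict] = []
    for payload in payloads:
        candidate = current + [payload]
        if current and len(encode(candidate)) > max_bytes:
            bodies.append(encode(current))
            current = [payload]
        else:
            current = candidate
    if current:
        bodies.append(encode(current))
    return bodies


# Hypothetical meter readings.
readings = [{"meter_id": f"M-{i:05d}", "kwh": i % 7} for i in range(500)]
bodies = batch(readings)
restored = [r for body in bodies for r in decode(body)]
print(f"{len(readings)} payloads sent as {len(bodies)} message(s)")
```

Each body would then be sent as a single SQS message, so the request count falls roughly in proportion to the number of payloads packed per body.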
7. Maximizing cost savings and flexibility with RDS Reserved Instances (RI)
Following the successful implementation of the new intelligent Meter Data Management (iMDM) solution, CrescoNet consulted the recommendations provided in the AWS Billing and Cost Management console to determine the ideal purchase amount of Amazon RDS Reserved Instances (RIs). CrescoNet opted for size-flexible RIs where possible, giving them the ability to resize their database instances within the same instance family without losing the benefits of their RIs. CrescoNet’s purchase of RIs resulted in significant savings for their database workloads.
8. Turn off non-production workloads outside business hours to lock in savings
The Instance Scheduler on AWS solution automates the starting and stopping of various AWS services. CrescoNet uses it to ensure that non-production and development compute and database workloads are automatically stopped outside of business hours.
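The decision behind such a schedule is simple enough to sketch. The function below is not part of the Instance Scheduler on AWS solution. It is a hypothetical illustration that assumes a Monday to Friday, 08:00 to 18:00 running window.

```python
from datetime import datetime, time


def should_run(now: datetime, start: time = time(8, 0), stop: time = time(18, 0)) -> bool:
    """True if a non-production workload should be running at local time `now`."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and start <= now.time() < stop


# Under the assumed window, workloads run 5 days x 10 hours out of a 168-hour week.
running_hours = 5 * 10
hours_saved = 1 - running_hours / (7 * 24)

print(should_run(datetime(2024, 1, 1, 9, 0)))   # Monday morning
print(should_run(datetime(2024, 1, 6, 9, 0)))   # Saturday morning
print(f"Share of weekly hours stopped: {hours_saved:.0%}")
```

For resources billed by running time, the share of hours stopped gives a rough upper bound on the compute saving. Storage attached to stopped instances and databases continues to be charged.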
Results
By adopting these measures, CrescoNet was able to reduce their AWS bill for the meter-to-cash process, without compromising performance, scalability or reliability of their workloads. Not only has CrescoNet reduced their spend significantly, they’ve been able to increase performance in a number of areas while also introducing new capabilities into their platform.
This post has demonstrated how CrescoNet was able to significantly reduce their costs, through a mix of techniques. The AWS account team, together with AWS Enterprise Support, worked closely with the customer to identify and implement these techniques, which enabled the customer to scale to meet the changing needs of their business.