Optimizing costs in Amazon RDS
One of the key benefits of AWS is the flexibility to provision the number of resources an application needs and scale up or down as requirements change. This requires monitoring your current resource utilization and having a policy to take actions when required. Without these proactive measures in place, you can end up under-provisioning or over-provisioning your resources. Under-provisioning a resource has a direct impact on performance, so it gets escalated and corrected. However, over-provisioned resources don’t have a direct functional impact that results in application disruption, so it’s often missed. As a result, you may find yourself with over-allocated resources and need to downsize and optimize.
This post provides recommendations for optimizing your Amazon Relational Database Service (Amazon RDS) footprint for your cloud infrastructure using best practices.
This Amazon RDS cost-optimization approach consists of the following key steps:
- Tag and track resource utilization.
- Define the utilization policies for your Amazon RDS resources.
- Educate the owners and implement the policies.
- Learn and optimize the policies and processes.
The following diagram illustrates this workflow.
The following sections provide more details for each step.
Tagging and tracking the resource utilization
The first step towards cost-optimization is to properly tag your resources and start tracking their utilization. AWS provides the capability to easily tag any of your RDS instances. To do so, locate your instance on the Amazon RDS console. On the Tags tab, add the tags you need to clearly identify the owner of the RDS instance. For more information about tagging, see Tagging Amazon RDS resources.
You can add any tags that can help identify your Amazon RDS resources, such as application name and organization group, but the following two tags are recommended:
- Database owner
- Application owner
This information helps quickly identify the owners and alert them about the opportunity for cost-optimization.
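As a sketch, the two recommended tags can be applied programmatically with boto3 instead of through the console (the tag keys `DatabaseOwner` and `ApplicationOwner`, the helper names, and the default Region are illustrative, not a prescribed convention):

```python
def ownership_tags(db_owner, app_owner):
    """Build the two recommended tags in the shape the Tags API expects."""
    return [
        {"Key": "DatabaseOwner", "Value": db_owner},
        {"Key": "ApplicationOwner", "Value": app_owner},
    ]

def tag_rds_instance(instance_arn, db_owner, app_owner, region="us-east-1"):
    """Apply the ownership tags to an RDS instance identified by its ARN."""
    import boto3  # imported here so ownership_tags stays usable offline
    rds = boto3.client("rds", region_name=region)
    rds.add_tags_to_resource(
        ResourceName=instance_arn,
        Tags=ownership_tags(db_owner, app_owner),
    )
```

The same tag list also works with `create_db_instance`, so new instances can be tagged at creation time rather than retroactively.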
You don’t need to do anything to track your resource utilization because Amazon RDS automatically sends metrics to Amazon CloudWatch every minute for each active database instance. There are no additional charges for Amazon RDS metrics in CloudWatch. For more information about these metrics, see Overview of monitoring Amazon RDS.
This integration of Amazon RDS with CloudWatch makes it easy to track resource utilization. The next step is to develop a policy that helps you identify the opportunities for cost-optimization. The following section focuses on the policies.
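As an illustration of that integration, the following sketch pulls the peak `CPUUtilization` for one instance over a recent window (the function names, 14-day window, hourly period, and 30% threshold are assumptions matching the policies discussed next, not fixed values):

```python
from datetime import datetime, timedelta

def max_cpu_utilization(instance_id, days=14, region="us-east-1"):
    """Return the maximum CPUUtilization (percent) over the last `days` days."""
    import boto3  # imported here so is_under_utilized stays testable offline
    cw = boto3.client("cloudwatch", region_name=region)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,              # one data point per hour
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else None

def is_under_utilized(max_cpu, threshold=30.0):
    """Flag instances whose peak CPU stays under the policy threshold."""
    return max_cpu is not None and max_cpu < threshold
```

Using the maximum statistic rather than the average avoids flagging instances whose load is spiky but low on average.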
Defining the utilization policies for your Amazon RDS resources
In this step, stakeholders define policies to identify under-utilized RDS instances and define the actions to take when these instances are identified. These policies cover three key Amazon RDS aspects:
- Read replicas
- Unused instances
- Primary instance
The following sections provide recommendations for each policy.
Read replica policy
Amazon RDS read replicas provide enhanced scalability and durability for RDS DB instances. These read replicas provide the capability to scale read traffic horizontally, which is particularly beneficial for read-heavy database workloads.
In Amazon RDS, Multi-AZ and read replicas are two different types of instances. The standby instance created for Multi-AZ deployment is not accessible and is only used for high availability. On the other hand, in Amazon Aurora, the Multi-AZ standby is just another read replica that is accessible. So for high availability of an Aurora cluster, one read replica is required even if it’s unused. For all other Amazon RDS and Aurora read replicas, you need to evaluate CPU and I/O utilization to determine if they are actually required. When making read replica decisions, consider the following criteria:
- If primary instance utilization for I/O and CPU usage is under 30% constantly, don’t spin up a read replica.
- If the CPU and I/O capacity of the read replica is under 30% constantly, explore the possibility of using a smaller instance size. If the primary instance has capacity, you can also consider transferring the load to the primary and shut down the read replica.
For example, if you’re using a read replica of type db.r4.4xlarge (16 vCPUs) that is 30% utilized, you should consider downsizing to db.r4.2xlarge (8 vCPUs). Now that you have half the number of vCPUs, your CPU utilization is expected to increase to around 60–70%. This leaves a buffer of around 30% for unexpected spikes and future organic increases in traffic.
I/O throughput is also very important because each instance type supports a maximum I/O bandwidth. When you scale down, the supported I/O bandwidth also drops, so it’s important to make sure that your current requirements are met by the reduced bandwidth. For example, when you downsize from db.r4.4xlarge to db.r4.2xlarge, the available maximum bandwidth drops from 7,000 Mbps to 3,500 Mbps. For more information, see Hardware specifications for DB instance classes.
For read replicas in a production environment, the 30% threshold is recommended. For non-production environments that are used for functional testing only, the utilization threshold can be made more aggressive at 50%. This means any read replica with CPU utilization less than 50% can be a candidate for right-sizing to a smaller instance type.
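The downsizing arithmetic above can be captured in a small helper. It assumes load scales linearly with vCPU count, which is only a rough first-order estimate; the function name is illustrative:

```python
def projected_cpu_pct(current_cpu_pct, current_vcpus, target_vcpus):
    """Estimate post-downsize CPU utilization, assuming load scales
    linearly with vCPU count (a rough first-order model only)."""
    return current_cpu_pct * current_vcpus / target_vcpus

# db.r4.4xlarge (16 vCPUs) at 30% moving to db.r4.2xlarge (8 vCPUs):
projected_cpu_pct(30.0, 16, 8)  # -> 60.0
```

A projection near 100% would signal that the smaller instance type leaves no headroom for spikes, so the move should be skipped.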
Unused instances policy
Unused RDS instances add to the overall cost and don’t add any value. It’s recommended to identify all unused instances and shut them down as per the policy defined by your organization. Instances are sometimes created in non-production environments for quick testing and never cleaned up after the work is complete. These unused instances stay idle and unnecessarily add to the cost. To identify unused instances, consider the following criteria:
- No database connections for 1 month (or less, depending on your requirements)
- CPU utilization and I/O are less than 5% constantly
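These criteria can be checked against CloudWatch. In the sketch below, the fetch of the `DatabaseConnections` metric uses boto3, while the decision itself is a pure function; the names, 30-day window, and thresholds are illustrative defaults:

```python
def max_db_connections(instance_id, days=30, region="us-east-1"):
    """Peak DatabaseConnections over the last `days` days, or None if no data."""
    import boto3  # imported here so is_unused stays testable offline
    from datetime import datetime, timedelta
    cw = boto3.client("cloudwatch", region_name=region)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return max(p["Maximum"] for p in points) if points else None

def is_unused(max_connections, max_cpu_pct, max_io_pct,
              cpu_threshold=5.0, io_threshold=5.0):
    """Apply the unused-instance criteria: no connections, <5% CPU and I/O."""
    return (max_connections == 0
            and max_cpu_pct < cpu_threshold
            and max_io_pct < io_threshold)
```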
After you identify the under-utilized RDS instances, you should inform the owners. If they don’t take any action, you need to escalate the issue. You can consider the following steps:
- Send an email to the database and application owners to shut down the instance.
- Include a deadline in the email to shut down the instance or get an exception.
- Send a reminder and escalate to management when the deadline expires.
It’s a good practice to create a DB snapshot before deleting any RDS instance. This makes sure that if you need to restore the RDS instance, you can do so using the last snapshot.
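The RDS delete API can take that final snapshot for you. The following sketch shows one way to do it with boto3; the snapshot naming scheme and function names are illustrative:

```python
from datetime import datetime

def final_snapshot_id(instance_id, when):
    """Derive a timestamped snapshot name (naming scheme is illustrative)."""
    return f"{instance_id}-final-{when:%Y%m%d%H%M}"

def delete_with_final_snapshot(instance_id, region="us-east-1"):
    """Delete an RDS instance, letting Amazon RDS take a final snapshot first."""
    import boto3  # imported here so final_snapshot_id stays testable offline
    rds = boto3.client("rds", region_name=region)
    snapshot_id = final_snapshot_id(instance_id, datetime.utcnow())
    rds.delete_db_instance(
        DBInstanceIdentifier=instance_id,
        SkipFinalSnapshot=False,                 # keep a restore point
        FinalDBSnapshotIdentifier=snapshot_id,
    )
    return snapshot_id
```

Keep in mind that manual and final snapshots are never deleted automatically, so the snapshots themselves should also fall under your retention policy.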
Primary instances policy
The primary RDS instance handles read and write traffic for your application. Therefore, it’s essential that it’s sized correctly to meet your application requirements. At the same time, you don’t want to leave it under-utilized. To identify under-utilized primary instances, look for instances with CPU utilization less than 30% and I/O less than 30% constantly.
After you identify the under-utilized instances, complete the following steps:
- Notify the database and application owners to right-size the instance.
- Include a deadline to right-size the instance or get an exception.
- Send a reminder and escalate when the deadline expires.
For non-production primary instances used for functional testing and not related to performance validation, you can increase the I/O and CPU utilization threshold for right-sizing from under 30% to under 50%. This allows you to identify more instances that are candidates for right-sizing.
Educating the owners and implementing the policies
After you define the policies, it’s important to educate the database and application owners about these cost-optimization best practices. They should be aware of the importance of these policies and how they impact the cost of their portfolio.
With all stakeholders onboard, you should start implementing the policies across the organization. For a small Amazon RDS database fleet, this may be a once-a-month manual assessment. For bigger organizations, it can be a daily or weekly automated process that checks the metrics and generates email notifications to alert the owners.
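An automated assessment like that could start from a sketch such as the following, which walks the fleet and collects instances whose peak CPU stays under the policy threshold (function names, window, and threshold are assumptions; notification delivery is left out):

```python
def flag_candidates(peak_by_instance, cpu_threshold=30.0):
    """Pure selection step: keep instances whose peak CPU is under threshold."""
    return [name for name, peak in peak_by_instance.items()
            if peak is not None and peak < cpu_threshold]

def scan_fleet(region="us-east-1", cpu_threshold=30.0, days=14):
    """Collect under-utilized instance IDs across the Region's RDS fleet."""
    import boto3  # imported here so flag_candidates stays testable offline
    from datetime import datetime, timedelta
    rds = boto3.client("rds", region_name=region)
    cw = boto3.client("cloudwatch", region_name=region)
    peaks = {}
    for page in rds.get_paginator("describe_db_instances").paginate():
        for db in page["DBInstances"]:
            name = db["DBInstanceIdentifier"]
            resp = cw.get_metric_statistics(
                Namespace="AWS/RDS", MetricName="CPUUtilization",
                Dimensions=[{"Name": "DBInstanceIdentifier", "Value": name}],
                StartTime=datetime.utcnow() - timedelta(days=days),
                EndTime=datetime.utcnow(),
                Period=3600, Statistics=["Maximum"])
            points = resp["Datapoints"]
            peaks[name] = max(p["Maximum"] for p in points) if points else None
    return flag_candidates(peaks, cpu_threshold)
```

A scheduled job (for example, an AWS Lambda function on an Amazon EventBridge schedule) could run this scan and email the tagged owners.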
Learning and optimizing the policies and processes
After you implement this process, it’s important to continue to monitor and identify areas for improvement. This makes sure that you continuously evolve the policies and processes to match your workloads and organizational needs. In this phase, you may consider implementing automation to autoscale and right-size the instances in your fleet as your workload changes. Changing the instance type results in downtime, so it’s important to plan the change so that it has minimal or no impact on your workloads.
In addition to the RDS instance utilization policy, you can optimize cost in several other areas. In this section, we discuss some additional recommendations that you can consider based on your application requirements.
Consider reviewing the Amazon RDS backup lifecycle policy and removing manually and automatically created database backup snapshots based on your organization’s compliance and retention policies. Automatic backups are deleted as per the retention policy you define, but manual backups are never deleted automatically. For more information, see Working with backups.
In Aurora, you don’t control the type of storage, but for the rest of the RDS engines, you choose the Amazon Elastic Block Store (Amazon EBS) volume type to use with your instance. Working with many different customers, we see the majority of dev and staging workloads perform well with General Purpose SSD storage. For I/O-intensive workloads, Provisioned IOPS SSD storage is certainly a good option, but for other workloads, General Purpose SSD should work fine. If your General Purpose EBS volume size is less than 1 TB, it’s important to understand the concept of EBS burst mode. Burst mode enables you to reach up to 3,000 IOPS as long as you have I/O credits available. As soon as the I/O credits are consumed, the IOPS drop to the volume’s baseline IOPS limit. For more information, see Understanding Burst vs. Baseline Performance with Amazon RDS and GP2. You should regularly monitor IOPS utilization to ensure optimal performance.
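Amazon RDS publishes a `BurstBalance` CloudWatch metric (percent of gp2 I/O credits remaining), which makes that monitoring straightforward. The sketch below flags instances that dip low on credits; the function names, 7-day window, and 20% threshold are illustrative:

```python
def min_burst_balance(instance_id, days=7, region="us-east-1"):
    """Lowest BurstBalance (percent of gp2 I/O credits left) over `days` days."""
    import boto3  # imported here so burst_credits_low stays testable offline
    from datetime import datetime, timedelta
    cw = boto3.client("cloudwatch", region_name=region)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS", MetricName="BurstBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=datetime.utcnow() - timedelta(days=days),
        EndTime=datetime.utcnow(),
        Period=3600, Statistics=["Minimum"])
    points = resp["Datapoints"]
    return min(p["Minimum"] for p in points) if points else None

def burst_credits_low(min_balance_pct, threshold=20.0):
    """Flag volumes at risk of dropping to baseline IOPS."""
    return min_balance_pct is not None and min_balance_pct < threshold
```

A repeatedly low balance suggests either a larger General Purpose volume or a move to Provisioned IOPS, rather than a downsize.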
Available RDS instance memory is essential for database performance, but the decision to downsize can’t be based on memory utilization alone. This is because a significant part of the instance memory is allocated to internal database buffers (the SGA in Oracle, shared buffers in Aurora PostgreSQL). As a result, even an idle RDS for Oracle instance may show 70% of memory used even though there are no connections. Similarly, in Aurora PostgreSQL, shared_buffers is configured to use around 75% of the available memory by default, so even an idle instance shows used memory.
The database engines rely on available memory to cache data blocks, and this cache helps speed up queries. If your application needs to meet a specific low-latency SLA for queries, downgrading the instance type can have an impact. For example, when you downsize from db.r4.4xlarge to db.r4.2xlarge, the available memory drops from 122 GB to 61 GB. This results in a smaller cache for the database, so the database engine needs to read more pages from storage. Because a fetch from storage is slower than a fetch from cache, query times may increase. Also note that with a smaller cache, storage I/O increases, so the application may need more IOPS. It’s important for application owners to evaluate the impact on latency-sensitive applications before downsizing instances in production. In Amazon Aurora, you pay for the I/O your database consumes, so you should also analyze the I/O cost impact before deciding to downgrade the instance type.
Amazon RDS provides the flexibility to choose the instance type you need for your database workloads. Each instance type supports a certain number of CPUs, memory, EBS bandwidth, and network performance. The application owner should choose the instance type based on workload requirements. For example, for CPU-intensive workloads, an M* family instance is better suited, whereas for a memory-intensive workload, the R* family is better. As discussed in the previous section, you should only change instance types after carefully looking at your requirements. Because the majority of database workloads are memory intensive, you should evaluate using the latest offering in R* and X* family instances. For more information, see Amazon RDS Instance Types.
RDS instance policies summary
The following table provides a summary of the sample policies discussed in this post.
| Policy | Environments | RDS Instance Stats | Action |
| --- | --- | --- | --- |
| Read replica | All | CPU utilization < 30% and I/O throughput < 30% | Transfer load to the primary and shut down, or downsize. |
| Under-utilized instances | Production | No connections for 1 month, CPU utilization < 5%, and I/O throughput < 5% | Alert the owner and escalate if no action is taken within a given time window. |
| Under-utilized instances | Non-production | No connections for 1 month, CPU utilization < 5%, and I/O throughput < 5% | Alert the owner and escalate if no action is taken within a given time window. Take a snapshot and shut down if no action is taken within the given time. |
| Right-size instances | Production | CPU utilization < 30% and I/O throughput < 30% | Alert the owner and escalate if no action is taken within a given time window. |
| Right-size instances | Non-production | CPU utilization < 50% and I/O throughput < 50% | Alert the owner and escalate if no action is taken within a given time window. Take a snapshot and downsize if no action is taken within the given time. |
CPU utilization and I/O throughput metrics are available in CloudWatch. The CPU utilization metric (CPUUtilization) should be the maximum CPU utilization for the period you monitor. The I/O throughput is the sum of read throughput (ReadThroughput) and write throughput (WriteThroughput), compared to the maximum allowed instance I/O throughput. For more information about the maximum allowed throughput, see Hardware specifications for DB instance classes.
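That comparison is essentially a unit conversion: the CloudWatch throughput metrics are reported in bytes per second, while the instance specs quote bandwidth in megabits per second. A minimal sketch (the helper name is illustrative):

```python
def io_utilization_pct(read_bytes_per_s, write_bytes_per_s, max_bandwidth_mbps):
    """Percent of the instance's maximum EBS bandwidth currently consumed.

    ReadThroughput and WriteThroughput are reported in bytes/second; the
    instance specs quote bandwidth in megabits/second, so convert first.
    """
    max_bytes_per_s = max_bandwidth_mbps * 1_000_000 / 8  # megabits/s -> bytes/s
    return 100.0 * (read_bytes_per_s + write_bytes_per_s) / max_bytes_per_s
```

For example, a db.r4.2xlarge (3,500 Mbps maximum) sustaining 100 MB/s of reads and 75 MB/s of writes sits at 40% of its bandwidth.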
The goal of the recommendations provided in this post is to help you optimize the cost of your RDS instances. It’s important to keep in mind that every application is unique in its architecture and has different usage patterns and SLAs. It’s therefore essential to always validate and review changes before deploying them in a production environment. For non-production database instances, you can put automation in place to shut them down or right-size them. However, for production instances, it’s recommended to work with the application owners to understand their usage pattern prior to right-sizing or stopping the instances.
If you have any questions or comments, please post your thoughts in the comments section.
About the Authors
Samujjwal Roy is a Principal Consultant with the Professional Services team at Amazon Web Services. He has been with Amazon for 16+ years and has led migration projects for internal and external Amazon customers to move their on-premises database environment to AWS Cloud database solutions.
Yaser Raja is a Senior Consultant with the Professional Services team at Amazon Web Services. He works with customers to build scalable, highly available, and secure solutions in the AWS Cloud. His focus area is homogeneous and heterogeneous migrations of on-premises databases to Amazon RDS and Aurora PostgreSQL.
Li Liu is a Database Cloud Architect with Amazon Web Services. She helps customers to migrate their databases to AWS cloud.