Making better decisions about Amazon RDS with Amazon CloudWatch metrics
If you are using Amazon Relational Database Service (Amazon RDS), you may wonder how to determine the best time to modify your instance configuration, such as the instance class, storage size, or storage type. Amazon RDS supports various database engines, including MySQL, PostgreSQL, SQL Server, Oracle, and Amazon Aurora, and Amazon CloudWatch can monitor all of them. These CloudWatch metrics not only guide you to the optimal instance class, but also help you choose appropriate storage sizes and types. This post discusses how to use CloudWatch metrics to determine Amazon RDS modifications for optimal database performance.
CPU and memory consumption
In Amazon RDS, you can monitor CPU by using the CloudWatch metrics CPUUtilization, CPUCreditUsage, and CPUCreditBalance. All Amazon RDS instance types support CPUUtilization; CPUCreditUsage and CPUCreditBalance are only applicable to burstable general-purpose performance instances.
CPUCreditUsage is defined as the number of CPU credits spent by the instance for CPU utilization. CPU credits govern the ability to burst above baseline level for burstable performance instances. A CPU credit provides the performance of a full CPU core running at 100% utilization for one minute.
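As a rough illustration of how a credit balance drains, the balance falls whenever credits spent per minute exceed credits earned. The vCPU count and earn rate below are assumptions for a small burstable instance, not figures from this post:

```python
def credit_change_per_minute(vcpus: int, utilization_pct: float,
                             earn_rate_per_hour: float) -> float:
    """Net CPU-credit change per minute for a burstable instance.

    One credit buys one full core at 100% utilization for one minute,
    so credits spent per minute = vcpus * (utilization / 100).
    """
    spent = vcpus * (utilization_pct / 100.0)
    earned = earn_rate_per_hour / 60.0
    return earned - spent

# Assumed figures: 2 vCPUs earning 12 credits per hour (t3.micro-like).
# At a sustained 60% utilization the balance drains by 1 credit/minute,
# so CPUCreditBalance eventually reaches zero and the instance is
# throttled to its baseline.
print(credit_change_per_minute(2, 60, 12))  # -> -1.0
```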
CPUUtilization shows the percent CPU utilization of the instance. Random spikes in CPU consumption may not hamper database performance, but sustained high CPU can hinder upcoming database requests. Depending on the overall database workload, high CPU (70%–90%) on your Amazon RDS instance can degrade overall performance. If a bad or unexpected query, or an unusually high workload, causes a high value of CPUUtilization, you might move to a larger instance class. Amazon RDS Performance Insights helps detect bad SQL queries that consume a high amount of CPU. For more information, see Using Performance Insights to Analyze Performance of Amazon Aurora PostgreSQL on YouTube.
The following CloudWatch graph shows a pattern of high CPU consumption. The CPU is consistently high for a long duration. This sawtooth pattern is a good indication that you should upgrade the Amazon RDS instance to a higher instance class.
Memory is another important metric that determines the performance of the Amazon RDS and helps to make decisions regarding Amazon RDS configurations. Amazon RDS supports the following memory-related metrics:
- FreeableMemory – The amount of physical memory the system isn’t using, plus the buffer or page cache memory that is free and available. If you have configured the database workload optimally and the low FreeableMemory isn’t caused by one or more bad queries, a pattern of low FreeableMemory suggests you should scale up to an Amazon RDS instance class with a higher memory allocation. When making decisions based on FreeableMemory, it’s important to also look at enhanced monitoring metrics, especially Cached. For more information, see Enhanced Monitoring.
- SwapUsage – The amount of swap space used on the DB instance. In Linux-hosted databases, a high value of SwapUsage typically suggests that the instance is memory-deficient.
Disk space consumption
Amazon RDS Oracle, MySQL, MariaDB, and PostgreSQL engines support up to 64 TiB of storage space. Amazon Aurora storage automatically grows in 10 GB increments, up to 64 TiB. Amazon RDS engines also support storage auto scaling, which automatically increases the storage by 5 GiB or 10% of currently allocated storage, whichever is higher. The CloudWatch metric FreeStorageSpace measures the amount of available storage space on an instance. As the amount of data increases, you see a decline on the FreeStorageSpace graph. When FreeStorageSpace is around 10%–15% of allocated storage, it’s a good time to scale storage. A sudden spike in storage consumption suggests that you should look at the database workload. Heavy write activity, detailed logging, or large numbers of transactional logs are significant contributors to lower free storage.
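The two rules of thumb above can be sketched as simple calculations. This is a minimal illustration of the post's guidance; the exact 15% alert threshold is an assumption within the 10%–15% range mentioned:

```python
def autoscaling_increment_gib(allocated_gib: float) -> float:
    # Per the post: storage auto scaling grows storage by 5 GiB or
    # 10% of currently allocated storage, whichever is higher.
    return max(5.0, 0.10 * allocated_gib)

def should_scale_storage(free_gib: float, allocated_gib: float,
                         threshold_pct: float = 15.0) -> bool:
    # Flag when FreeStorageSpace falls to roughly 10%-15% of allocated
    # storage (threshold is an assumed value in that range).
    return (free_gib / allocated_gib) * 100.0 <= threshold_pct

print(autoscaling_increment_gib(100))   # -> 10.0 (10% beats the 5 GiB floor)
print(should_scale_storage(12, 100))    # -> True (only 12% free)
```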
The following graph shows an Amazon RDS PostgreSQL instance’s FreeStorageSpace metric. Free storage dropped approximately 90% in 20 minutes. While troubleshooting this issue, we found that the parameter log_min_duration_statement was set to 0, which means every SQL statement was being logged and filling the transactional log files. These troubleshooting steps and CloudWatch graphs help you decide when to tune the database engine or scale instance storage.
Database connections
The DatabaseConnections metric shows the number of database connections in use. For an optimal workload, the number of current connections should not exceed approximately 80% of your maximum connections. The max_connections parameter determines the maximum number of connections permitted in Amazon RDS. You can modify it in the parameter group. For more information, see Working with DB Parameter Groups.
The default value of this parameter depends on the total RAM of the instance. For example, for Amazon RDS MySQL instances, the default value is derived by the formula {DBInstanceClassMemory/12582880}.
You should move to an Amazon RDS instance class with more RAM if the number of database connections is consistently around 80% of max_connections. This ensures that Amazon RDS can support a higher number of database connections. For example, for an Amazon RDS PostgreSQL instance hosted on a db.t2.small instance class, the formula LEAST({DBInstanceClassMemory/9531392}, 5000) sets max_connections to 198 by default.
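The defaults above can be reproduced from the engine formulas. DBInstanceClassMemory is the memory available to the engine, somewhat less than the instance class's nominal RAM; the byte value below is an assumed figure for a db.t2.small, not one published in this post:

```python
def mysql_default_max_connections(db_instance_class_memory_bytes: int) -> int:
    # RDS MySQL default: {DBInstanceClassMemory/12582880}
    return db_instance_class_memory_bytes // 12582880

def postgres_default_max_connections(db_instance_class_memory_bytes: int) -> int:
    # RDS PostgreSQL default: LEAST({DBInstanceClassMemory/9531392}, 5000)
    return min(db_instance_class_memory_bytes // 9531392, 5000)

# Assumed DBInstanceClassMemory for a db.t2.small (2 GiB nominal RAM,
# minus operating-system and RDS overhead).
memory_bytes = 1_890_000_000
print(postgres_default_max_connections(memory_bytes))  # -> 198
```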
You can find the value of this parameter through the Amazon RDS console, the AWS Command Line Interface (AWS CLI), or by connecting to the instance. See the following code:
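A sketch of both approaches follows; the parameter group name is a placeholder, not a value from this post:

```
-- Connected to the instance (PostgreSQL):
SHOW max_connections;

# Or via the AWS CLI, querying the instance's parameter group:
aws rds describe-db-parameters \
    --db-parameter-group-name <your-parameter-group> \
    --query "Parameters[?ParameterName=='max_connections']"
```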
The following CloudWatch graph shows that database connections exceeded the max_connections value multiple times between 12:40 and 12:48.
When the number of DB connections exceeds max_connections, you receive the following error message:
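The exact message depends on the engine; for PostgreSQL, it typically looks like the following:

```
FATAL: remaining connection slots are reserved for non-replication superuser connections
```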
In this situation, you should determine the reason behind the high number of database connections. If there are no errors in the workload, you should consider scaling up to an Amazon RDS instance class with more memory.
I/O operations per second (IOPS) metrics
Storage type and size govern IOPS allocation in Amazon RDS SQL Server, Oracle, MySQL, MariaDB, and PostgreSQL instances. With General Purpose SSD storage, baseline IOPS are calculated as three times the amount of storage in GiB. For optimal instance performance, the sum of ReadIOPS and WriteIOPS should be less than the allocated IOPS.
Beyond burst capacity, sustained IOPS usage above the baseline may lead to performance degradation, which manifests as increased DiskQueueDepth. If the total IOPS workload is consistently 80%–90% of the baseline IOPS, consider modifying the instance to obtain higher IOPS capacity. You can achieve this through a few different methods: increasing General Purpose SSD storage, changing the storage type to Provisioned IOPS, or using Aurora.
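The 80%–90% guidance above can be expressed as a small check; the exact threshold choice is an assumption within that range:

```python
def gp2_baseline_iops(storage_gib: float) -> float:
    # General Purpose SSD (gp2) baseline: 3 IOPS per GiB of storage.
    return 3.0 * storage_gib

def iops_headroom_exhausted(read_iops: float, write_iops: float,
                            baseline: float,
                            threshold_pct: float = 80.0) -> bool:
    # True when ReadIOPS + WriteIOPS consumes at least threshold_pct of
    # the baseline, i.e. it's time to consider more IOPS capacity.
    return read_iops + write_iops >= baseline * threshold_pct / 100.0

baseline = gp2_baseline_iops(100)  # 100 GiB gp2 -> 300 IOPS
print(iops_headroom_exhausted(240, 480, baseline))  # -> True
```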
Increasing General Purpose SSD storage
You can increase your General Purpose SSD storage so that the instance gets three times the new storage amount in GiB as baseline IOPS. Though you can create MySQL, MariaDB, Oracle, and PostgreSQL Amazon RDS DB instances with up to 64 TiB of storage, the maximum baseline performance you can achieve is 16,000 IOPS. This means that for a 5.34 TiB to 64 TiB storage volume, the instance has a maximum 16,000 IOPS baseline performance. If you see that ReadIOPS contributes more toward total IOPS consumption, you should move to a higher instance class with more RAM. When the database working set is almost all in memory, ReadIOPS should be small and stable.
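With the 16,000 IOPS ceiling factored in, the baseline calculation becomes a capped version of the 3-IOPS-per-GiB rule described above:

```python
def gp2_baseline_iops_capped(storage_gib: float) -> float:
    # gp2 baseline is 3 IOPS per GiB, capped at 16,000 IOPS; per the
    # post, the cap is reached at roughly 5.34 TiB of allocated storage.
    return min(3.0 * storage_gib, 16000.0)

print(gp2_baseline_iops_capped(100))        # -> 300.0
print(gp2_baseline_iops_capped(64 * 1024))  # 64 TiB: still 16000.0
```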
In the following example, an Amazon RDS PostgreSQL instance is configured with 100 GiB of GP2 storage, which provides a 300 IOPS baseline with burst capability for an extended period. As the following graphs show, at 03/27 12:48, WriteIOPS was 480 and ReadIOPS was 240. The total (720) was far beyond the baseline capacity of 300. This caused high WriteLatency and high DiskQueueDepth.
The following graph shows the WriteIOPS value of 480 at 12:48.
The following graph shows the ReadIOPS value of 240 at 12:48.
The following graph shows a high WriteLatency of 78 ms at 12:48.
The following graph shows a high DiskQueueDepth of 38 at 12:48.
Changing the storage type to Provisioned IOPS
If the instance requires more than 16,000 baseline IOPS, or needs low I/O latency and consistent I/O throughput, consider changing your storage type to Provisioned IOPS. For MariaDB, MySQL, Oracle, and PostgreSQL, you can choose Provisioned IOPS in the 1,000–80,000 range.
Using Amazon Aurora
Consider using Amazon Aurora if you don’t want database IOPS performance limited by the size or type of a storage volume. Aurora doesn’t have the same type of IOPS limit; you don’t have to manage, provision, or expand IOPS capacity. Instance size primarily determines the transactional and compute performance of an Aurora workload. The maximum number of IOPS depends on the read/write throughput limit of the Aurora instance: you are throttled by the instance’s throughput limit, not by IOPS. For more information, see Choosing the DB Instance Class.
Aurora is a MySQL- and PostgreSQL-compatible relational database with a distributed, fault-tolerant, and self-healing storage system. Aurora storage automatically scales up to 64 TiB. Aurora offers up to 15 read replicas, compared to Amazon RDS engines, which provide up to five replicas in a Region.
Throughput metrics
An Amazon RDS instance has two types of throughput limits: instance-level and EBS volume-level limits. You can monitor instance-level throughput with the metrics WriteThroughput and ReadThroughput. WriteThroughput is the average number of bytes written to disk per second. ReadThroughput is the average number of bytes read from disk per second. For example, a db.m4.16xlarge instance class supports a maximum throughput of 1,250 MB/s. The EBS volume throughput limit is 250 MiB/s for GP2 storage (based on a 16 KiB I/O size) and 1,000 MiB/s for the Provisioned IOPS storage type. If you experience degraded performance due to a throughput bottleneck, validate both of these limits and modify the instance as needed.
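Because the workload hits whichever of the two ceilings is lower, a quick way to reason about a throughput bottleneck (using the example figures above) is:

```python
def effective_throughput_limit(instance_limit_mib_s: float,
                               volume_limit_mib_s: float) -> float:
    # Throughput is bounded by the lower of the instance-level limit
    # and the EBS volume-level limit.
    return min(instance_limit_mib_s, volume_limit_mib_s)

# db.m4.16xlarge (~1,250 MB/s) on gp2 storage (250 MiB/s): the volume,
# not the instance, is the bottleneck.
print(effective_throughput_limit(1250, 250))  # -> 250
```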
Amazon RDS Performance Insights
Performance Insights monitors your database instance workload so you can monitor and troubleshoot database performance. With the help of database load and wait event data, you get a complete picture of the state of the instance. You can use this data for modifications in the instance or workload for overall better database performance.
The following CloudWatch graph shows a high CommitLatency of 68 ms at 15:33 in an Aurora PostgreSQL instance.
The following graph shows a high IO:XactSync wait between 15:33 and 15:45.
Looking at Performance Insights, you can see that at the time of high CommitLatency, the IO:XactSync wait event was high too. This wait event is associated with CommitLatency and represents the time spent waiting for a transaction commit to be made durable. It occurs when a session is waiting for writes to stable storage, and most often arises when there is a high rate of commit activity on the system. During this wait, Aurora is waiting for Aurora storage to acknowledge persistence. The storage persistence might be competing for CPU with CPU-intensive database workloads. To alleviate this, you can reduce those workloads or scale up to a DB instance with more vCPUs.
Summary
This post discussed CloudWatch metrics for Amazon RDS and Performance Insights, and how you can use them to make decisions about your database. These metrics help you decide on compute and storage scaling, database engine performance tuning, and workload modifications.
The post also reviewed the storage types that Amazon RDS offers and how Amazon Aurora works differently from Amazon RDS instances with EBS volumes. This knowledge can help you troubleshoot, evaluate, and decide on Amazon RDS modifications.
About the Author
Vivek Singh is a Senior Database Specialist with AWS focusing on Amazon RDS/Aurora PostgreSQL engines. He works with enterprise customers providing technical assistance on PostgreSQL operational performance and sharing database best practices.