AWS Database Blog

Capture and diagnose I/O bottlenecks on Amazon RDS for SQL Server

In our previous post, Capture and tune resource utilization metrics for Amazon RDS for SQL Server, we demonstrated how to use Amazon RDS Enhanced Monitoring and Amazon RDS Performance Insights to diagnose and debug CPU utilization bottlenecks for Amazon Relational Database Service (Amazon RDS) for SQL Server. We discussed the importance of capturing database utilization metrics as a baseline. A baseline allows for a better understanding of workload characteristics and usage trends.

Aside from CPU and memory, I/O performance is critical for overall database performance. It’s important to understand the I/O requirements of a SQL Server workload, which depend on various factors like query access patterns, database schema, and the state of database maintenance. Understanding your workload’s I/O patterns can guide you in selecting the optimal storage type for your RDS instance, balancing performance needs with cost-effectiveness.

In this post, we demonstrate how you can use Amazon RDS monitoring tools along with SQL Server monitoring capabilities to capture, diagnose, and resolve I/O issues on an RDS for SQL Server instance.

Solution overview

Let’s look at a typical online transaction processing (OLTP) application with a large number of users running concurrent transactions. It performs reads and writes to and from the data and log files, with various I/O access patterns on Amazon RDS, and uses temporary objects in tempdb.

When I/O needs are not met, you might see performance degradation and query timeouts during heavy workloads. We assume you have already captured and reviewed the performance metrics to rule out CPU and memory bottlenecks; this post focuses on detecting and evaluating potential I/O resource constraints for RDS for SQL Server.

I/O bottlenecks can be detected by checking how quickly I/O requests are processed, understanding I/O latencies and throughput, identifying the top SQL statements contributing to high I/O activity, and understanding the physical and logical design of the database. These metrics can be collected from various SQL Server dynamic management views (DMVs), I/O wait types, perfmon counters, and logs. There are also database design considerations that impact I/O optimization, which we discuss later in this post.

Identify I/O bottlenecks

Before we dive into understanding I/O bottlenecks, let’s examine the configuration of an RDS instance, including the instance class, storage type, and storage size. This information serves as a guide to identify the IOPS and throughput limits of the instance.

For this post, we use a db.r5.xlarge instance class with a 150 GB gp2 volume type. This volume configuration gives us a baseline of 450 IOPS (3 IOPS per GiB × 150 GiB) and can burst up to 3,000 IOPS for a limited time, depending on the accumulated I/O credit balance. With gp2 volumes, the baseline IOPS is tightly coupled with storage size and scales with storage. The workload consists of the AdventureWorks sample database simulating concurrent users running stored procedures that issue simple to complex SQL statements.

When this OLTP workload was run on Amazon RDS, we experienced degraded performance.

We used Amazon CloudWatch, Enhanced Monitoring, and Performance Insights to complete a bottleneck analysis and diagnosis, as discussed in Capture and tune resource utilization metrics for Amazon RDS for SQL Server. The initial analysis ruled out CPU and memory as bottlenecks and highlighted I/O bottlenecks.

With these tools, you can monitor storage metrics such as DiskQueueDepth, ReadLatency, WriteLatency, ReadIOPS, WriteIOPS, ReadThroughput, and WriteThroughput to determine if there are I/O issues. For more details, refer to How do I troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck in my Amazon RDS instance?

From Performance Insights, we see the RDS instance reach the gp2 maximum burstable IOPS of 3,000 during peak load, as shown in the following figure. The high I/O usage persists for an extended period. When the burst credits are exhausted, IOPS are throttled to the baseline value, which can cause performance issues.

Figure 1: I/O bottleneck

Performance can also be impacted by the size of individual I/O requests. Large I/O sizes can breach the volume’s throughput limit, causing I/O bottlenecks and slowness due to inadequate I/O resources even when the IOPS threshold is not breached, as can be seen in Figure 3. Using a gp3 volume as an example, we have the following characteristics:

  • Baseline disk throughput is 125 MiB/s
  • Assume SQL Server performs I/O with an average request size of 64 KB
  • Baseline IOPS is 3,000

Let’s do the math: if SQL Server performs I/O with an average size of 64 KB, it would take only about 2,000 IOPS (125 ÷ 0.0625) to reach the baseline throughput of 125 MiB/s, which is lower than the baseline of 3,000 IOPS.

Looking further into the Performance Insights dashboard, we see that the majority of wait events are I/O related. In Figure 2, the query is waiting on PAGEIOLATCH_SH waits. You can also identify the specific SQL statements driving I/O utilization on the instance from Performance Insights. For more information about I/O waits, refer to Troubleshoot slow SQL Server performance caused by I/O issues.

Figure 2: Performance Insights
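
You can corroborate what Performance Insights shows from inside SQL Server. The following is a minimal sketch of a query against sys.dm_os_wait_stats that summarizes I/O-related waits accumulated since the last restart (or since the wait statistics were last cleared); the list of wait types to filter on is an assumption you can adjust for your workload:

-- Summarize I/O-related waits; counters are cumulative since restart.
SELECT
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    wait_time_ms - signal_wait_time_ms AS resource_wait_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'PAGEIOLATCH%'
   OR wait_type IN (N'WRITELOG', N'IO_COMPLETION', N'ASYNC_IO_COMPLETION')
ORDER BY wait_time_ms DESC;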

Using the metrics in Performance Insights, we noticed high network throughput during query execution. This prompted us to investigate further.

Figure 3: Performance Insights Metrics

We use the dynamic management function (DMF) sys.dm_io_virtual_file_stats to identify file-level latency on Amazon RDS. sys.dm_io_virtual_file_stats captures the I/O count and latency metrics for each database file on the instance. This can help you identify a particular database that is facing an I/O issue, especially when you have multiple databases within an instance. In Figure 4, we see I/O stalls for database ID 6 (this is the database with performance issues for this post). Using the same DMF, we also identified that tempdb usage is contributing to the additional load we have seen so far, as shown in Figure 5. We use the db.r5.xlarge instance class, so the tempdb files are located on an Amazon Elastic Block Store (Amazon EBS) volume.

Figure 4: Microsoft SQL Server DMV – User Database

Figure 5: Microsoft SQL Server DMV – TempDB
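
For reference, the following is a minimal sketch of the kind of query you can run against sys.dm_io_virtual_file_stats to derive per-file average latencies; it is not necessarily the exact query behind the figures, the join to sys.master_files and the selected columns are illustrative, and the counters are cumulative since the last restart:

-- Per-file I/O counts and average latencies (milliseconds), cumulative since restart.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.name AS logical_file_name,
    mf.type_desc AS file_type,
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_latency_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms,
    vfs.io_stall AS total_io_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
    AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall DESC;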

Analyze I/O bottlenecks

Based on the data we’ve collected, we can start an analysis and compare the data against the configuration and limits of the current RDS instance. As mentioned earlier, we are using the gp2 volume type. We see that during peak load, the IOPS generated on the RDS instance reach the burstable limit of the gp2 volume. We also see that the top wait type captured in Performance Insights is PAGEIOLATCH_SH. Looking further into sys.dm_io_virtual_file_stats, we can identify the database that generates the most I/O requests and shows I/O stalls.

This data points to an I/O bottleneck on the RDS instance. We also observed that tempdb is busy and contributing to I/O and network utilization because the tempdb files reside on an EBS volume, indicating a tempdb bottleneck.

Resolve I/O bottlenecks

Addressing bottlenecks on a database typically requires multiple approaches. The first and foremost is to ensure that the database design, schema, and maintenance are optimal for best performance. We provide some database design practices later in this post that you can adopt before opting to scale your resources. You can also look at optimizing the infrastructure architecture for performance. In this scenario, the I/O bottleneck is caused by throttling at the IOPS limit of the Amazon RDS storage. To address this issue, you can modify the RDS instance volume type from gp2 to gp3. This change does not cause downtime. gp3 has a baseline of 3,000 IOPS and allows you to scale IOPS independently from storage, so you don’t need to increase storage to get more IOPS. For this use case, we set the gp3 IOPS to 6,000. Comparing the old configuration to the new one shows that the RDS instance can now perform more IOPS than before, improving performance.

Figure 6: IOPS metrics for GP2 vs GP3 EBS storage

Comparing the same stress test (using the SQLQueryStress tool) across the gp2 and gp3 configurations shows improved runtimes with the gp3 volume for our workload. Figure 7 shows that total execution time and execution time per iteration improved with the gp3 EBS storage type compared to gp2.

Figure 7: SQLQueryStress results of gp2 vs gp3

Amazon RDS offers magnetic (HDD), gp2, gp3, io1, and io2 storage types that cater to different requirements. Each storage type has its own performance and threshold limits. For example, the gp3 storage type on Amazon RDS for SQL Server allows you to scale throughput and IOPS to a maximum of 1,000 MiB per second and 16,000 IOPS, whereas io2 storage has a maximum limit of 4,000 MiB per second and 256,000 IOPS. For more information, refer to Amazon RDS DB instance storage. This is one of the benefits of Amazon RDS: it allows you to scale to meet your workload requirements as you grow instead of investing for peak loads upfront.

The following are some database design areas to focus on that reduce the occurrence of I/O bottlenecks:

  • Database partitioning can help reduce the number of database pages being retrieved by selecting only a subset of the data instead of the entire table. This is particularly helpful in situations where your queries align with the partition key. For example, if your query selects all records for a particular month (filter criteria), your table can be partitioned using a date column (the filter column in the query). You can also compress data in one or more partitions, further reducing the amount of data retrieved by the SQL engine, lowering IOPS and throughput requirements, and therefore improving performance.
  • Data compression can be considered in situations where a large number of records is being retrieved and causing I/O bottlenecks. SQL Server supports page-level and row-level compression, and if the tables are partitioned, you can define a different compression type for each partition. For example, partitions that contain historical data and are accessed infrequently can use page compression, while partitions that contain more recent data can use row compression. There is a trade-off when using compression: it can introduce additional CPU overhead in exchange for reading fewer pages, leading to lower I/O requirements.

Let’s look at an example of the impact of compression on the I/O profile of this post’s workload. We have a large table with a few GBs of data. We run one simple SQL statement against this table in both the uncompressed and compressed states. There is a significant difference in the logical reads and read-ahead reads counters, with I/O being much lower after compression. We saw higher CPU time but lower elapsed time overall, improving query performance, as shown in the following table.

Compression | Logical reads | CPU time (ms) | Elapsed time (ms)
No          | 726,385       | 66,594        | 176,378
Yes         | 38,102        | 70,719        | 172,491
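
As a reference point, the following is a minimal sketch of how you might evaluate and apply page compression; the table name dbo.SalesOrderHistory and its columns are hypothetical, and you should test compression on a non-production copy first:

-- Estimate space savings before committing to page compression.
EXEC sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'SalesOrderHistory',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = N'PAGE';

-- Apply page compression to the table (all partitions).
ALTER TABLE dbo.SalesOrderHistory
REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Re-run the workload query with I/O statistics to compare logical reads.
SET STATISTICS IO ON;
SELECT SalesOrderID, OrderDate, TotalDue
FROM dbo.SalesOrderHistory
WHERE OrderDate >= '20230101';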
  • The fill factor setting can also impact I/O performance. By default, the fill factor value is set to 0, which means the database page is fully filled (100%) while writing data. This is great in situations where inserts happen at the end of the table, for example when the clustered index key column is an IDENTITY column. However, it may not be ideal where inserts are random. Random inserts with a high fill factor can cause page splits and increase I/O requirements and storage size.
  • Page splits cause logical fragmentation of the indexes. When a page split occurs, the database engine moves half of the records in a page to a new page to make space for new inserts. This leaves both pages half-full, leading to additional storage usage and more I/O operations to read the same amount of data into memory, which causes performance issues and I/O bottlenecks. You can address this with the fill factor setting while rebuilding the indexes to redistribute the data, as the sketch after this list shows. For more information, refer to Optimize index maintenance to improve query performance and reduce resource consumption.
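
The following is a minimal sketch of checking index fragmentation and page density and then rebuilding an index with a lower fill factor; the table and index names are hypothetical, and the fill factor of 90 is only an example that you should tune for your insert pattern:

-- Check logical fragmentation and page density for a table's indexes.
SELECT
    OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.index_type_desc,
    ips.avg_fragmentation_in_percent,
    ips.avg_page_space_used_in_percent,
    ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.SalesOrderHistory'), NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
    AND i.index_id = ips.index_id;

-- Rebuild an index with a lower fill factor to leave room for random inserts.
ALTER INDEX IX_SalesOrderHistory_CustomerID
ON dbo.SalesOrderHistory
REBUILD WITH (FILLFACTOR = 90);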

For an application that requires high I/O performance and low latency for tempdb, you can use instance classes that provide local NVMe storage physically attached to the host computer. Tempdb files are automatically placed on the local NVMe drive during instance creation. For more information, refer to Instance store support for the tempdb database on Amazon RDS for SQL Server. You can use various DMVs to monitor your tempdb usage.
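
The following is a minimal sketch of two such queries; the unit conversion assumes 8 KB pages, and the TOP (10) cutoff is arbitrary:

-- Current tempdb space usage by allocation type (MB).
SELECT
    SUM(user_object_reserved_page_count) * 8 / 1024.0 AS user_objects_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb,
    SUM(version_store_reserved_page_count) * 8 / 1024.0 AS version_store_mb,
    SUM(unallocated_extent_page_count) * 8 / 1024.0 AS free_space_mb
FROM tempdb.sys.dm_db_file_space_usage;

-- Sessions currently allocating the most tempdb space.
SELECT TOP (10)
    tsu.session_id,
    SUM(tsu.user_objects_alloc_page_count + tsu.internal_objects_alloc_page_count) * 8 / 1024.0 AS allocated_mb
FROM tempdb.sys.dm_db_task_space_usage AS tsu
GROUP BY tsu.session_id
ORDER BY allocated_mb DESC;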

Because we found in our test workload that tempdb was also contributing to the I/O bottleneck, we changed the RDS instance type from db.r5.xlarge to db.r5d.xlarge, which supports locally attached NVMe storage and the placement of tempdb files on it. This change reduced network throughput usage compared to before.

Figure 8: Performance Insights for network metrics

Running the same stress test (using the SQLQueryStress tool), total runtime and per-iteration execution times improved when the instance type was changed from db.r5.xlarge to db.r5d.xlarge, as shown in Figure 9.

Figure 9: SQLQueryStress comparison of tempdb performance

Besides right-sizing your RDS instance, table and index design as well as optimized code can help to minimize I/O bottlenecks. The following are some common examples:

  • Reducing I/O by creating indexes based on query access patterns
  • Selecting the smallest data type for table columns that meets requirements
  • Keeping the table statistics updated
  • Avoiding functions that wrap columns in join or WHERE clause predicates, which can force index scans (see the example after this list)
  • Avoiding implicit data type conversion
  • Avoiding a wildcard (%) search where the wildcard is at the beginning of the string in a LIKE operator
  • Avoiding retrieving all columns (SELECT *)
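
To illustrate the point about functions in predicates, the following is a minimal before-and-after sketch; the table, columns, and the assumed index on OrderDate are hypothetical:

-- Non-sargable: wrapping the column in a function prevents an index seek.
SELECT SalesOrderID, TotalDue
FROM dbo.SalesOrderHistory
WHERE YEAR(OrderDate) = 2023;

-- Sargable rewrite: a range predicate on the bare column lets the optimizer seek the index.
SELECT SalesOrderID, TotalDue
FROM dbo.SalesOrderHistory
WHERE OrderDate >= '20230101'
  AND OrderDate < '20240101';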

Conclusion

In this post, we showed you how to use Amazon RDS Performance Insights to identify and diagnose I/O performance-related issues. We also showed you how some of the I/O issues can be resolved through proper sizing and instance class usage. Additionally, we discussed other performance optimization strategies that can serve as guidance to improve the I/O performance of your RDS instance.

If you have any comments or questions, leave them in the comments section.


About the Author

Rita Ladda is a Microsoft GTM Specialist Senior Solution Architect at Amazon Web Services with over 20 years of experience in many Microsoft technologies. She specializes in designing database solutions in SQL Server and other databases. She provides architectural guidance to customers in the migration and modernization of their Microsoft workloads to AWS.

Barry Ooi is a Senior Database Specialist Solution Architect at AWS. His expertise is in designing, building, and implementing data platforms using cloud-native services for customers as part of their journey on AWS. His areas of interest include data analytics and visualization.