Scaling SQL Server performance beyond 1M transactions per minute with Amazon FSx
In this blog post, we introduce a strategy for scaling Microsoft SQL Server deployments on Amazon Web Services (AWS) that uses Amazon FSx, a service that provides fully managed, high-performance file systems in the cloud. This strategy increases SQL Server performance on AWS, providing 2 to 3 times as many Transactions per Minute (TPM) as previously achieved results. Our approach also results in a more cost-effective solution by providing a lower cost per transaction.
Overall, our strategy can be used by customers looking for cost-effective ways of running the highest-performance SQL Server database deployments on the cloud.
In 2021, our team released a blog post that set a new benchmark for SQL Server performance on AWS. The authors used memory-optimized Amazon Elastic Compute Cloud (Amazon EC2) R5b.24xlarge instances capable of up to 60 Mbps of Amazon Elastic Block Store (EBS) bandwidth and 260,000 input/output operations per second (IOPS). To take full advantage of this instance’s IO capabilities, a 3.5 TB HammerDB TPCC test database was stored on two io2 Block Express (io2-BE) volumes, arranged in RAID-0 configuration. Performance tests using HammerDB with this configuration achieved nearly 830,000 TPM.
In the same blog post, the authors showed that hosting the database on the Amazon EC2 instance store volume improves performance by approximately 20%. However, instance store volumes are ephemeral, providing only temporary block-level storage that is lost if you stop the instance. This makes instance store volumes suitable for tempdb storage, while both EBS volumes and Amazon FSx file systems provide durable and scalable storage solutions for hosting your databases.
Newer instance families, such as x2iedn and r6idn, offer significant performance improvements over the r5b.24xlarge that was used in the blog post. These improvements lead to about 30% of performance increase, resulting in close to 1 million TPM. However, these performance gains come at a cost. The increased number of virtual CPUs (vCPUs) require additional SQL Server licensing, and the need to use RAID-0 across 3 io2-BE volumes to saturate the instance throughput also adds to the EBS storage costs.
In this blog post, we will show how the innovative use of Amazon FSx enables you to achieve higher SQL Server performance than is possible with EBS or instance store volumes alone.
2. Test Configuration
For our SQL Server performance testing on Windows, we used Amazon FSx file systems compatible with Windows, namely Amazon FSx for NetApp ONTAP (FSx for ONTAP) and Amazon FSx for Windows File Server. We configured each Amazon FSx file system with 26 TB SSD storage, 80,000 IOPS, and 2 GBps throughput. FSx for ONTAP offers throughput up to 4 GBps and 160,000 IOPS, and FSx for Windows offers throughput up to 12 GBps and 350,000 IOPS, but these offerings are only available in certain AWS Regions at the time of this writing, so we excluded these configurations from our testing. (We plan on following up these performance tests using the higher throughput capacity levels.)
For our performance tests, we used Amazon’s license included Microsoft Windows Server 2019 with SQL Server 2019 Enterprise edition in the AWS Virginia (us-east-1) Region. We used r5dn.24xlarge EC2 instances, which offer 768 GB of RAM, 96 vCPUs, 100 Gbps of network bandwidth, and four 900 GB NVMe SSD instance store volumes. We selected this instance because it is comparable in terms of vCPU and RAM with the r5b.24xlarge used in the original blog post, but offers increased network throughput, which is important, since we used network-attached storage for the database.
For the tempdb, we created a RAID-0 volume over four instance store volumes, which are available on r5dn.24xlarge EC2 instances. We also used io2-BE EBS volumes provisioned with 64,000 IOPS for SQL Server database log files. We stored the database data files on Amazon FSx file systems.
We used HammerDB, a widely used benchmarking tool, to simulate a database workload. HammerDB’s OLTP workload was the basis of our performance testing because similar workloads are common in SQL Server migrations to AWS. We generated a test database that includes 30,000 Data Warehouses, which is about 3 TB in size.
HammerDB virtual users are simulated users that put a load on the database for performance testing. To gauge the peak performance of a system, it’s advisable to begin with a few virtual users and incrementally increase this number until the database TPM hits a plateau. When we increase the number of virtual users, the performance metric will grow until it reaches a saturation point, in which the growth stops, or even declines.
For each configuration, we chose the following number of virtual users: 256, 362, 512, 724, and 1024. In order to achieve reliable and consistent results, we conducted three separate tests for each virtual user load point. We then calculated the average of these tests.
3. Performance testing using FSx for NetApp ONTAP
FSx for ONTAP offers two protocols – iSCSI and SMB. We used both protocols in our performance testing.
3.1. Using FSx for ONTAP with iSCSI protocol
We attached iSCSI interfaces, provided by four separate FSx for ONTAP instances, as drives to the EC2 instance. These drives were then striped in a RAID-0 configuration using the Windows Storage Spaces feature. To enhance performance of iSCSI drives, we enabled MPIO and assigned four paths to each drive. We show the setup for this scenario in Figure 1.
HammerDB performance test results for this configuration are presented in Table 1 and Figure 2. As indicated by the results, this configuration enabled us to double the performance detailed in the original blog post. The plateau point was reached with 724 virtual users in this setup, meaning any further increase in the load led to diminishing performance.
3.2. Using FSx for ONTAP with SMB protocol
FSx for ONTAP also supports the SMB protocol, a client-server communication protocol used to share access to files over the network. Unlike the iSCSI protocol, SMB doesn’t present virtual drives, but instead offers file shares. Therefore, we couldn’t use RAID-0 striping to enhance performance like we did in the previous scenario.
Instead of volume striping, we used a feature of SQL Server that distributes database files in a file group across multiple volumes or file shares. We distributed the primary file group across four file shares presented by four FSx ONTAP file systems.
In a RAID-0 scenario, each record of a table is split across multiple underlying drives. However, when we distribute the primary file group across several shares, each table record allocated to this file group is stored entirely on a single SMB share. All the records of a table will be evenly distributed across multiple shares. This setup resembles RAID-0. We illustrate this configuration in Figure 3.
We present the performance test results from HammerDB for this setup in Table 2 and Figure 4. As shown by the results, this configuration outperformed the one we discussed earlier in Section 3.1 above. As the load increases, the performance also improves, albeit at a decelerated pace, reaching a plateau at 1,024 virtual users.
4. Performance testing using FSx for Windows File Server
FSx for Windows File Server supports only the SMB protocol, so for our testing we used similar configuration to the one we presented on Figure 3, with the exception that we used FSx for Windows File Server instead of FSx for ONTAP, as shown on Figure 5.
When running SQL Server using this configuration for storage, we achieved the performance results presented in Table 3 and Figure 6.
We achieved over 2 million TPM – almost 3 times more than the best result achieved with IO2-Block Express EBS volumes mentioned in the original blog post.
5. Price/performance comparison
Table 4 summarizes results of our performance testing across the three configurations previously discussed, plus the original configuration using io2-BE EBS volumes (from the original blog post). We also provide cost estimates for each configuration. When computing the cost for Amazon FSx, we did not consider backup cost or duplication savings, since these do not apply to database workloads.
The performance for configurations using FSx for ONTAP used the steady-state performance values, which, for dynamic workloads like SQL server, typically 10-15% lower than results provided in Tables 1 and 2. The compute cost is based upon EC2 instance r5dn.24xlarge with license included Windows Server and SQL Server Enterprise Edition in the us-east-1 (N. Virginia) AWS Region, valid at the time of writing.
Adding four Amazon FSx filesystems increased the total cost of the system. However, because of the increased performance, it reduced the overall cost per transaction.
FSx for ONTAP, when used with the SMB interface, offered the best price-to-performance ratio. FSx for Windows File Server provided the highest overall performance but comes at the greatest total cost. However, when we account for the superior SQL Server performance achieved with this configuration, the cost per 1,000 TPM is comparable to that of FSx for ONTAP using the SMB interface.
6. Extending the scope of performance testing
Our analysis would be incomplete if we did not consider newer AWS instance types, specifically, the x2iedn memory-optimized family of EC2 instances, which are used for a wide range of large-scale memory-intensive applications. This family offers 32:1 ratio of memory to vCPU on all sizes, scaling up to 4 TB of RAM, which makes them attractive to use for SQL Server workloads. For comparison with r5dn.24xlarge instance, used in all our previous tests, we opted for the x2iedn.24xlarge and x2iedn.32xlarge, with 2 TB and 4 TB of RAM, respectively.
Using a 3.5 TB database on EC2 instances with RAM close to or exceeding the size of the database might not provide an adequate load to test the IO subsystem. Therefore, we generated a HammerDB database with 75,000 warehouses, which, at 8.5 TB, allows us to stress the IO subsystem. We also aimed to further increase the SQL Server load by increasing the number of virtual users to 2,048.
For the storage subsystem, we selected the best performing configuration with four FSx for Windows File Servers in a configuration matching the one on Figure 8. Results of our tests are presented in Table 5 and Figure 7.
SQL Server on the r5dn.24xlarge instance reached a peak of about 1.5 million TPM for the 8.5 TB database. The increased amount of RAM on x2iedn family was the determining factor in allowing SQL Server on these instances to surpass the 2 million TPM mark. With the 8.5 TB database, the extra memory on x2iedn family, compared to 768 GB available on r5dn.24xlarge, indeed made a significant difference.
Price-to-performance analysis for this set of tests provide in Table 6. The EC2 instance x2iedn.24xlarge emerged as a clear winner in terms of cost per 1,000 TPM, although it offered a marginally lower performance as compared to the larger x2iedn.32xlarge EC2 instance. To truly benefit from the 4 TB of RAM available on the x2iedn.32xlarge EC2 instance, we may need to consider an even larger database.
Another interesting observation, illustrated in Figure 8, is that as we transitioned to instances with increased RAM, the IOPS and throughput consumed from the underlying FSx for Windows File Server decreased. This is because SQL Server completed more operations in the cache, reducing the demand on the filesystem.
Our innovative use of multiple Amazon FSx file systems enabled us to achieve storage performance significantly higher than what a single filesystem might offer. This led to a significant increase in SQL Server performance. Although the use of multiple Amazon FSx systems increases the cost of the deployment, the resulting boost in SQL Server performance reduces the overall cost per transaction to levels comparable to or even lower than other configurations.
While FSx for ONTAP with the iSCSI interface showed somewhat lower performance in our SQL Server testing, using RAID-0 offers a simpler deployment pathway as we do not have to distribute the primary file group across four volumes.
We performed our tests using four Amazon FSx file systems, but “four” is not a “magic number” for this case. You may achieve a significant boost in SQL Server performance using just two or three Amazon FSx filesystems depending on the EC2 instances used for hosting the SQL Server and your specific performance and cost requirements.
This method provides another option for our customers to manage high-performance SQL Server workloads. Specifically, our strategy can be used by customers looking for cost-effective ways of running the highest-performance SQL Server database deployments in the cloud.
AWS has significantly more services, and more features within those services, than any other cloud
provider, making it faster, easier, and more cost effective to move your existing applications to the cloud
and build nearly anything you can imagine. Give your Microsoft applications the infrastructure they
need to drive the business outcomes you want. Visit our .NET on AWS and AWS Database blogs for
additional guidance and options for your Microsoft workloads. Contact us to start your migration and
modernization journey today.