AWS Compute Blog
Reduce your Microsoft licensing costs by upgrading to 4th generation AMD processors
This post is written by Jeremy Girven, Solutions Architect at AWS and Chase Lindeman, Senior Specialist Solutions Architect at AWS.
Amazon Web Services (AWS) and AMD have collaborated since 2018 to deliver cost-effective performance for a broad variety of Microsoft workloads, such as Microsoft SQL Server, Microsoft Exchange Server, Microsoft SharePoint Server, the Microsoft System Center suite, Active Directory, and many other Microsoft workload use cases. This post shows how the performance improvements of the latest generation AMD-powered Amazon Elastic Compute Cloud (Amazon EC2) instances can help you reduce licensing costs on Microsoft workloads running on AWS.
AWS has been running Microsoft workloads for over 16 years. The most common of these workloads are those running Microsoft Windows Server and Microsoft SQL Server. Both can be brought to AWS using the Bring Your Own License (BYOL) or License Included (provided by AWS) licensing models. Many BYOL licensing restrictions require workloads to run on dedicated tenancy using Dedicated Hosts. For these workloads, a license is needed to cover each physical core of the Dedicated Host (for example, if the Dedicated Host has 96 physical cores, 96 licenses are necessary to cover the host). For License Included EC2 instances, the cost of the associated Microsoft licenses is a per-vCPU fee bundled into the total price of the EC2 instance.
Regardless of which licensing option works best for you, the licensing cost is directly related to the number of virtual cores (vCPUs) or physical cores used by your workloads. Using high-performance processors can reduce the total number of cores necessary to run a workload. Reducing the total number of cores in turn reduces your total cost of ownership (TCO) by reducing the number of licenses. One option for running Microsoft workloads on AWS is EC2 instances that use fourth generation AMD processors.
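To make the relationship between core count and licensing concrete, the following Python sketch counts licenses under the two models described above. It is an illustration only: the function names and the per-vCPU fee are hypothetical placeholders, not AWS or Microsoft prices.

```python
def byol_host_licenses(physical_cores: int) -> int:
    """BYOL on a Dedicated Host: one license per physical core of the host."""
    return physical_cores


def license_included_fee(vcpus: int, fee_per_vcpu_hour: float, hours: float) -> float:
    """License Included: the license portion is a per-vCPU fee in the hourly price."""
    return vcpus * fee_per_vcpu_hour * hours


# Example from the text: a 96-core Dedicated Host needs 96 licenses.
host_licenses = byol_host_licenses(96)

# Hypothetical fee (0.05 per vCPU-hour over 730 hours), used only to show
# that the license portion scales linearly with the vCPU count.
fee_32_vcpus = license_included_fee(32, 0.05, 730)
fee_16_vcpus = license_included_fee(16, 0.05, 730)

print(host_licenses, fee_32_vcpus, fee_16_vcpus)
```

In both models, halving the core count halves the license count or fee, which is why per-core performance matters for licensing cost.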
The AWS Nitro EC2 instance families using fourth generation AMD EPYC processors are M7a, C7a, R7a, and Hpc7a. These fourth generation AMD EC2 instances use DDR5 memory to deliver 2.25x more memory bandwidth and up to 50% higher performance as compared with previous generation AMD EC2 instances. For performance-per-watt improvements across integer performance, floating point, and natural language processing (NLP) throughput, these fourth generation AMD EPYC processors offer up to 2.7x greater results than those of previous generation AMD EC2 instances.
AMD has published performance testing that compares the General Purpose M7a instances with the previous generation M6a instances. We wanted to expand that testing to Compute Optimized and Memory Optimized EC2 instances to see whether the results hold true for other instance families.
In the following sections, we describe our performance testing methodologies and review our results.
Method 1: CPU calculation speed
The following is the configuration of the EC2 instances used for testing:
- Instance Types: C6a.large and C7a.large (2 vCPUs, 4 GiB Memory, and 30 GiB (3000 IOPS, 125 MB/s) GP3 EBS volume)
- Operating System: Microsoft Windows Server 2022 Datacenter (10.0.20348 N/A Build 20348)
- Installed Software: AWS device drivers (NVMe 1.5.1 & ENA 2.7.0), Amazon EC2 Launch Agent v2 (2.0.1981.0), Amazon SSM Agent (3.3.551.0), and PowerShell 7.4.5 (all non-essential software has been removed)
- AWS Region and AZ: us-west-2 / us-west-2a (usw2-az1)
We performed a simple yet CPU-intensive math test by calculating prime numbers in the range of 2 through 10,000 using PowerShell (version 7 or later is needed because the script uses ForEach-Object -Parallel). The test runs in a loop ten times, which allows us to average the processing time over all the runs. The following is the code used for testing:
Function Start-PrimeNumberTest {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $True)][Int32]$TestRunLimit, #The number of times the test will run in a loop
        [Parameter(Mandatory = $True)][Int32]$UpperNumberRange #The upper number of the range to find prime numbers in (larger the number the longer it takes to process)
    )
    $DoCount = 0
    $NumberRange = 2..$UpperNumberRange
    [System.Collections.ArrayList]$TimeArray = @()
    [System.Collections.ArrayList]$OutputArray = @()
    #Use one parallel thread per logical processor (vCPU)
    $vCPUCount = Get-CimInstance -ClassName 'Win32_Processor' | Select-Object -ExpandProperty 'NumberOfLogicalProcessors'
    Do {
        #Time one full pass over the number range
        $Time = Measure-Command {
            $Range = $NumberRange
            $Count = 0
            $Range | ForEach-Object -Parallel {
                $Number = $_
                $Divisor = [Math]::Sqrt($Number)
                #Try each candidate divisor from 2 up to the square root
                2..$Divisor | ForEach-Object {
                    If ($Number % $_ -eq 0) {
                        $Prime = $False
                    } Else {
                        $Prime = $True
                    }
                }
                If ($Prime) {
                    $Count++
                    If ($Count % 10 -eq 0) {
                        $Null
                    }
                }
            } -ThrottleLimit $vCPUCount
        }
        $DoCount++
        #Record the duration of this run, then pause before the next one
        [void]$TimeArray.Add($Time.TotalSeconds)
        Start-Sleep -Seconds 5
    } Until ($DoCount -eq $TestRunLimit)
    #Summarize the run times across all runs
    $Output = $TimeArray | Measure-Object -Average -Maximum -Minimum | Select-Object -Property 'Count', 'Average', 'Maximum', 'Minimum'
    [void]$OutputArray.Add("Number of runs : $($Output.Count)")
    [void]$OutputArray.Add("Average time to complete (seconds) : $($Output.Average)")
    [void]$OutputArray.Add("Maximum time to complete (seconds) : $($Output.Maximum)")
    [void]$OutputArray.Add("Minimum time to complete (seconds) : $($Output.Minimum)")
    Write-Output $Output
}
To run the code, invoke the function and specify the Test Run Limit and Upper Number Range. For example, the following code mimics our test by finding prime numbers up to 10,000 and running the test 10 times:
Start-PrimeNumberTest -TestRunLimit 10 -UpperNumberRange 10000
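For readers who want to try the same measurement pattern outside of PowerShell, the following is a minimal Python sketch of a timed prime-counting loop that reports the count, average, maximum, and minimum run times. It is not the code used for the results in this post: it is single-threaded, the function names are our own, and its absolute timings are not comparable with the PowerShell test.

```python
import math
import statistics
import time


def count_primes(upper: int) -> int:
    """Count primes in 2..upper using trial division up to the square root."""
    count = 0
    for number in range(2, upper + 1):
        limit = math.isqrt(number)
        # A number is prime when no divisor in 2..sqrt(number) divides it.
        if all(number % divisor != 0 for divisor in range(2, limit + 1)):
            count += 1
    return count


def run_prime_test(test_run_limit: int, upper: int) -> dict:
    """Time count_primes over several runs and summarize the durations."""
    durations = []
    for _ in range(test_run_limit):
        start = time.perf_counter()
        count_primes(upper)
        durations.append(time.perf_counter() - start)
    return {
        "count": len(durations),
        "average": statistics.mean(durations),
        "maximum": max(durations),
        "minimum": min(durations),
    }


summary = run_prime_test(test_run_limit=3, upper=10_000)
print(summary)
```

Running the same script on two instance types and comparing the averages gives a rough, single-core view of the difference between them.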
Test results: CPU calculation speed
Figure 1. C7a.large and C6a.large performance results over ten tests
Although this is a simple CPU performance test, it demonstrates a clear performance advantage of using the latest generation of AMD-powered instances as compared with previous generations:
- Slowest test: The slowest run on the C7a.large was still over seven seconds faster than the quickest run on the C6a.large. This is a difference of more than 25% in the worst-case scenario for the C7a.large.
- Fastest test: The C7a.large completed over 13 seconds faster than the C6a.large, showing a 47% faster processing time.
- Average: There is an 11 second difference in processing time between the two instances. The C7a.large is averaging over 38% faster than the C6a.large.
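The comparisons above can be expressed as a short calculation. The following Python sketch assumes that "percent faster" means the reduction in run time relative to the slower instance. The durations in the example are hypothetical values chosen for illustration, not our measured results.

```python
def compare_runs(baseline_seconds: float, candidate_seconds: float) -> tuple:
    """Return (seconds saved, percent faster) of the candidate over the baseline."""
    seconds_saved = baseline_seconds - candidate_seconds
    percent_faster = seconds_saved / baseline_seconds * 100
    return seconds_saved, percent_faster


# Hypothetical durations: a 30-second baseline run and an 18-second candidate run.
saved, percent = compare_runs(baseline_seconds=30.0, candidate_seconds=18.0)
print(f"{saved:.1f} seconds saved, {percent:.1f}% faster")
```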
Price-performance
The latest generation of AMD instances is more expensive than the previous generation. However, when we account for the performance delta between the two instances, using the average test duration and the On-Demand price of both instances in us-west-2, the C7a.large cost $0.000957791 per run to process the workload while the C6a.large cost $0.001352344. The C6a.large costs approximately $0.0004 more per run to process the same workload. Although that might sound small, for a workload that runs once every second this cost delta is greater than $12,000 over a 1-year period. These results show the value of using the latest generation of AMD-powered instances, especially with CPU-bound workloads.
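The following Python sketch works through the arithmetic using the per-run costs quoted above. The cost_per_run helper and its example inputs are hypothetical and only show how a per-run cost can be derived from an hourly price and a run duration. The annual figure assumes the workload runs once every second for a full year, which is an assumption of this sketch.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cost_per_run(hourly_price_usd: float, run_seconds: float) -> float:
    """Cost of one run: the per-second price multiplied by the run duration."""
    return hourly_price_usd / 3600 * run_seconds


# Hypothetical inputs for illustration: $0.18 per hour and a 20-second run.
example_cost = cost_per_run(hourly_price_usd=0.18, run_seconds=20.0)

# Per-run costs quoted in the text.
c7a_cost_per_run = 0.000957791
c6a_cost_per_run = 0.001352344

delta_per_run = c6a_cost_per_run - c7a_cost_per_run
percent_cheaper = delta_per_run / c6a_cost_per_run * 100

# Assumption: one run per second, every second of the year.
annual_delta = delta_per_run * SECONDS_PER_YEAR

print(f"Delta per run: ${delta_per_run:.9f}")
print(f"C7a.large is {percent_cheaper:.1f}% cheaper per run")
print(f"Annual delta at one run per second: ${annual_delta:,.0f}")
```

Although the C7a.large has the higher hourly price, its shorter run time makes each run cheaper.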
Method 2: SQL Server performance
We wanted our second testing method to focus more on real-world applications related to Microsoft workloads. For this test, we wanted to measure SQL Server performance.
SQL Server can be tested with an open-source load testing tool called HammerDB. SQL Server is primarily used for OLTP workloads, so we used the TPROC-C benchmark from HammerDB because it is specifically tailored for OLTP database testing.
The following is the configuration of the EC2 instances used for testing:
- Instance Types: R7a.8xlarge and R6a.8xlarge (32 vCPUs, 256 GiB Memory)
- Storage: io2 EBS volumes with 40,000 IOPS (EC2 instance maximum)
- SQL Server: Microsoft SQL Server 2022 (RTM-CU14) (KB5038325) – 16.0.4135.4 (X64) Jul 10 2024 14:09:09 Copyright (C) 2022 Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows Server 2022 Datacenter 10.0 <X64> (Build 20348: ) (Hypervisor)
- Maximum Server Memory: 240 GB
- Database File Size: 220 GB
- Database Data Size: 2000 warehouses (~200 GB)
- MAXDOP: 1
HammerDB creates a test database based on “warehouses.” Each warehouse is approximately 100 MB of data. Our test server used 2000 warehouses, leaving approximately 20 GB for overhead in the 220 GB database file size. The total database size was also purposely sized smaller than the total memory allocated to our SQL Server. This allows SQL Server to cache as much of the database as possible in memory to avoid latency reading from disk.
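The sizing described above can be checked with a few lines of Python. The values come from the configuration list, and the warehouse size is the approximate 100 MB figure, so the results are estimates.

```python
WAREHOUSE_SIZE_MB = 100      # approximate size of one HammerDB warehouse
warehouses = 2000
database_file_gb = 220
max_server_memory_gb = 240

# Approximate data size in GB (decimal units, 1 GB = 1000 MB).
data_gb = warehouses * WAREHOUSE_SIZE_MB / 1000

# Space left in the database file for overhead.
file_headroom_gb = database_file_gb - data_gb

# The data set is intentionally smaller than SQL Server's memory limit,
# so most of it can be cached in memory.
fits_in_memory = data_gb < max_server_memory_gb

print(data_gb, file_headroom_gb, fits_in_memory)
```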
HammerDB uses “virtual users” to apply load to the database. Our testing on each EC2 instance started with a small load of 32 virtual users, matching the number of virtual users to the number of vCPUs. Each test used a warmup time of five minutes followed by five minutes of processing. Then, the number of virtual users was increased on a logarithmic scale to apply a larger performance load on the servers. Testing continued until we saw a decline in the total Transactions Per Minute (TPM). Three full runs were completed on each EC2 instance to create an average TPM at each level of virtual users.
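The ramp-up procedure can be sketched as a loop that raises the virtual user count, averages several runs at each level, and stops once the average TPM declines. The following Python sketch is hypothetical: the doubling growth factor is an assumption (the exact user levels are not listed here), and fake_tpm is a stand-in for a real HammerDB run rather than a HammerDB API.

```python
import statistics
from typing import Callable


def ramp_until_decline(
    run_test: Callable[[int], float],
    start_users: int = 32,
    growth_factor: float = 2.0,   # assumption: double the users at each level
    runs_per_level: int = 3,
    max_levels: int = 20,
) -> list:
    """Return (virtual_users, mean_tpm) per level, stopping after the first decline."""
    results = []
    users = start_users
    for _ in range(max_levels):
        mean_tpm = statistics.mean(run_test(users) for _ in range(runs_per_level))
        results.append((users, mean_tpm))
        # Stop once the average TPM drops below the previous level.
        if len(results) > 1 and mean_tpm < results[-2][1]:
            break
        users = int(users * growth_factor)
    return results


def fake_tpm(users: int) -> float:
    """Hypothetical stand-in for a benchmark run: TPM rises, then falls past 256 users."""
    return 1000.0 * users if users <= 256 else 200_000.0


levels = ramp_until_decline(fake_tpm)
peak_users, peak_tpm = max(levels, key=lambda level: level[1])
print(levels)
print(f"Peak average TPM {peak_tpm:.0f} at {peak_users} virtual users")
```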
Test results: SQL Server performance
Figure 2. R7a.8xlarge and R6a.8xlarge average TPM
Figure 3. R7a.8xlarge and R6a.8xlarge average TPM
The R7a.8xlarge consistently outperformed the R6a.8xlarge, even on tests with low load. The most notable difference was a 34% increase in TPM at peak performance. These results are similar to the 32% difference that AMD published when testing the M7a.8xlarge and M6a.8xlarge instances using another OLTP benchmark, TPROC-E.
Cost savings
Our test results are good news if you’re running SQL Server workloads. The ability to process more transactions with the same number of vCPUs translates into needing fewer vCPUs to run your current workloads, thereby lowering the total number of SQL Server licenses in your environment. With SQL Server Enterprise Edition licensing costing over $15,000 per 2-core pack as of this writing, reducing the number of licensed cores could save you hundreds of thousands of dollars in total cost of ownership.
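As a rough illustration of how a throughput gain can translate into license savings, the following Python sketch estimates the vCPUs needed after a 34% gain and the corresponding reduction in 2-core packs. It assumes throughput scales linearly with vCPU count and uses $15,000 as a lower bound for the pack price. Both are simplifications, and EC2 instance sizes come in fixed steps, so the computed vCPU count might not match an available size.

```python
import math

PACK_PRICE_USD = 15_000   # lower bound from the text: "over $15,000 per 2-core pack"
CORES_PER_PACK = 2


def vcpus_needed(current_vcpus: int, throughput_gain: float) -> int:
    """vCPUs needed for the same throughput, assuming linear scaling."""
    return math.ceil(current_vcpus / (1 + throughput_gain))


def packs_required(cores: int) -> int:
    """Number of 2-core packs needed to cover the given core count."""
    return math.ceil(cores / CORES_PER_PACK)


current_vcpus = 32
reduced_vcpus = vcpus_needed(current_vcpus, throughput_gain=0.34)
packs_saved = packs_required(current_vcpus) - packs_required(reduced_vcpus)
estimated_savings = packs_saved * PACK_PRICE_USD

print(f"{current_vcpus} vCPUs -> {reduced_vcpus} vCPUs")
print(f"{packs_saved} fewer packs, at least ${estimated_savings:,} saved")
```

Applied across a fleet of SQL Server instances, savings of this size per server add up quickly.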
Conclusion
When evaluating the cost of workloads that are licensed by CPU, such as Microsoft workloads, our results show that looking at instance price alone isn’t the best way to select instances for your workloads. Commercial software such as Microsoft Windows Server or SQL Server is typically licensed at the vCPU level or at the physical core level (BYOL). When dealing with CPU-bound workloads, choosing the instance with the highest performance-to-price ratio is the best evaluation method.