Instance sizes in the Amazon EC2 Hpc7 family – a different experience
In an earlier post, we discussed performance results from the new Amazon EC2 Hpc7g instances, powered by the AWS Graviton3E processor. Check out that earlier post to get the scoop on a full range of applications we tested in our labs and with our customers and partners.
The hardware behind these instances has 64 physical cores, 128 GiB of DDR5 memory, and 200 Gbps of network performance with Elastic Fabric Adapter (EFA), optimized for traffic between instances in the same VPC. These instances (and many others in Amazon EC2) are possible because of the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor that makes their performance virtually indistinguishable from bare metal.
Hpc7g is the first Amazon EC2 HPC instance offering with multiple instance sizes, but this is quite different from the experience of getting smaller instances from other non-HPC instance families. Today, we want to take a moment to explore why this is different, and how it helps.
In this post we’ll provide details on what “instance sizes” mean for Hpc7g (and, going forward, 7th generation HPC instances generally) and show how you can benefit when you’re aiming to extract the absolute best performance you can get from them.
Hpc7g is available in the following instance sizes with the specs shown in Table 1.
The only difference among the different sizes is the number of available physical cores – the rest of the specs (RAM, memory bandwidth, CPU frequency, and network bandwidth) remain the same.
The price is also the same.
Why is that helpful? The smaller sizes (with 32 and 16 cores) increase the memory per core and memory bandwidth per core, and – as you’ll see later in this post – that can have a serious impact on your code’s performance. The benefit is especially tangible when you’re using commercial software that’s licensed on a per-core basis.
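To make that concrete, here’s a quick back-of-the-envelope calculation using the specs described above (every size keeps the full 128 GiB of RAM; only the core count changes):

```python
# Memory per core for each Hpc7g size. All sizes share the same
# 128 GiB of RAM, so fewer cores means more memory per core.
RAM_GIB = 128

sizes = {
    "Hpc7g.16xlarge": 64,
    "Hpc7g.8xlarge": 32,
    "Hpc7g.4xlarge": 16,
}

for name, cores in sizes.items():
    print(f"{name}: {cores} cores, {RAM_GIB / cores:.0f} GiB per core")
```

The same doubling applies to memory bandwidth per core, since the full instance bandwidth is shared by fewer cores on the smaller sizes.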
These different sizes provide an easy way for customers to use the Amazon EC2 Optimize CPU options feature on the Hpc7g instances. This enables customers to choose from a range of instance sizes to target maximum performance per instance or maximum performance per core for their HPC workloads.
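As a sketch of what that looks like in practice (assumptions: this uses the `CpuOptions` parameter of the EC2 `RunInstances` API via boto3, and the AMI ID below is a hypothetical placeholder – substitute your own):

```python
# Request parameters for launching an Hpc7g instance with a reduced
# visible core count via the EC2 "Optimize CPUs" feature. Pass this
# dict to boto3's ec2.run_instances(**params) in an environment with
# configured AWS credentials. The AMI ID is a hypothetical placeholder.
params = {
    "ImageId": "ami-0123456789abcdef0",  # placeholder, not a real AMI
    "InstanceType": "hpc7g.16xlarge",
    "MinCount": 1,
    "MaxCount": 1,
    # Expose only 32 of the 64 physical cores; Graviton3E cores
    # have one hardware thread each.
    "CpuOptions": {"CoreCount": 32, "ThreadsPerCore": 1},
}
print(params["CpuOptions"])
```

Choosing the Hpc7g.8xlarge size gives you the same 32-core result directly, without needing to set `CpuOptions` yourself.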
Since the only variable among the instance sizes is the number of physical cores – with applications getting access to the full memory and network performance – the per-hour instance cost is the same for all the sizes.
As we mentioned, a lot of HPC customers need to customize the number of CPU cores on an instance to optimize the licensing costs of their application software. Often, they’re aiming to provide enough RAM and memory bandwidth per core for memory-intensive workloads.
CFD/CAE applications fit this model. They’re hungry for maximum performance per core because they’re limited to running on a fixed number of cores due to domain decomposition limitations and, sometimes, software licensing costs. It’s also typical for these codes to have license costs significantly higher than those of the underlying hardware, so customers are smart to want the best performance possible from these applications to maximize the bang for their buck.
Let’s look at some workloads where these variables can impact the outcomes in various ways.
OpenFOAM is one of the most widely used computational fluid dynamics (CFD) packages and helps companies in a range of industries (automotive, aerospace, energy, and life-sciences) conduct research and design new products.
Figure 1 shows the per-core-scaling solver performance for a 4M cell motorbike case on a single instance of Hpc7g.16xlarge. The x-axis represents the number of cores and the y-axis represents solver performance (higher is better).
As shown in the figure, the performance starts to flatten beyond 16 to 24 cores per instance. This is mainly due to memory-bandwidth limitations – this code and test case can saturate the instance-level memory bandwidth at lower core counts, beyond which the performance gains diminish.
Figure 2 shows the solver performance for the larger 100M cell motorbike case across the three different Hpc7g instance sizes (Hpc7g.16xlarge, Hpc7g.8xlarge, Hpc7g.4xlarge). As before, the y-axis is performance (higher is better) and the x-axis is the number of cores – but this time spanning multiple instances.
As shown in the figure, for a fixed core count the absolute performance is higher on the smaller instance sizes. For example, at 128 cores Hpc7g.8xlarge has 1.68x better performance than Hpc7g.16xlarge, and Hpc7g.4xlarge has 2.64x better performance than Hpc7g.16xlarge. This is pretty much exactly what we’d expect if the x-axis were, in fact, tracking memory bandwidth.
Now the next obvious question is: but what about cost? The smaller sizes (Hpc7g.8xlarge – 32 cores per instance and Hpc7g.4xlarge – 16 cores per instance) mean we’re running the job on more instances to achieve the same core count as the Hpc7g.16xlarge – 64 cores per instance. So, won’t that increase the cost? The answer may not be what you’d expect.
Looking at cost per simulation
Figure 3 shows the solver cost-per-simulation for the larger 100M cell motorbike case across the three different Hpc7g instance sizes (Hpc7g.16xlarge, Hpc7g.8xlarge, Hpc7g.4xlarge) using the Amazon EC2 On-Demand pricing.
Since the only difference between the instance sizes is the number of physical cores – with the other specs kept the same – the per-hour instance cost is the same for all the sizes.
The x-axis is the number of cores (spanning multiple instances) and the y-axis is cost-per-simulation (lower is better). For a fixed core count, the absolute cost-per-simulation is higher on the smaller instance sizes. For example, at 128 cores Hpc7g.8xlarge is 1.19x higher cost than Hpc7g.16xlarge and Hpc7g.4xlarge is 1.5x higher cost than Hpc7g.16xlarge.
So, the cost-per-simulation is higher for the smaller instance sizes – but it buys a significantly larger increase in performance.
Now let’s look at both the performance and cost numbers together.
Figure 4 shows the performance-increase and cost-increase ratio for the motorbike 100M cell case at a fixed 128-core count across the different Hpc7g instance sizes.
The blue bar indicates the performance increase and the orange bar indicates the cost increase. As shown in the figure, at 128 cores Hpc7g.8xlarge provides a 1.68x increase in performance with a 1.19x increase in cost over Hpc7g.16xlarge, and Hpc7g.4xlarge provides a 2.64x increase in performance with a 1.51x increase in cost over Hpc7g.16xlarge.
Thus for this case at 128 cores the Hpc7g.8xlarge and Hpc7g.4xlarge instance sizes provide a better price-performance ratio over the Hpc7g.16xlarge.
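These ratios fall out of simple arithmetic. A quick sketch, using the instance sizes and the measured speedups from the figures, and assuming (as noted above) that the per-hour price is identical across sizes:

```python
# Cost ratio for a fixed 128-core job: the smaller sizes need more
# instances, but the job finishes faster, so the cost scales as
# (instance-count ratio) / (speedup). Speedups are the measured
# values from Figure 4.
CORES = 128
BASELINE = 64  # cores per Hpc7g.16xlarge

cases = {
    # size: (cores per instance, measured speedup vs Hpc7g.16xlarge)
    "Hpc7g.8xlarge": (32, 1.68),
    "Hpc7g.4xlarge": (16, 2.64),
}

ratios = {}
for name, (cpi, speedup) in cases.items():
    instance_ratio = (CORES // cpi) / (CORES // BASELINE)
    ratios[name] = instance_ratio / speedup
    print(f"{name}: cost ratio {ratios[name]:.2f}x vs Hpc7g.16xlarge")
```

Running this reproduces the roughly 1.19x and 1.5x cost figures above: doubling the instance count only raises the cost by the factor left over after the speedup is taken into account.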
In this post we provided an overview of the different Hpc7g instance sizes along with their instance specifications. We used a known memory bandwidth-sensitive CFD workload (using OpenFOAM) and explained how customers can benefit from using the different instance sizes to maximize the performance per core by choosing the smaller instance sizes, while keeping all the other engineering specs (and the price per instance) the same.
This led to a modest increase in cost but a disproportionately significant improvement in the overall performance, which means you get the result sooner, and net-net, with a better price-performance ratio. This is meaningful for memory-intensive and license-constrained use cases, so we think this is a feature you should be paying attention to.
The result only sounds unintuitive. If you have any thoughts about this, or you want to discuss this in depth, reach out to us at firstname.lastname@example.org. We’d love to know how this helps you.