Bare metal performance with the AWS Nitro System
This post was contributed by Matt Koop, Principal Solutions Architect, HPC
High Performance Computing (HPC) is known as a domain where applications are well-optimized to get the highest performance possible on a platform. Unsurprisingly, a common question when moving a workload to AWS is how performance will differ from an existing on-premises “bare metal” platform. This post shows that the performance differential between “bare metal” instances and instances that use the AWS Nitro hypervisor is negligible for the evaluated HPC workloads.
AWS Nitro System
The AWS Nitro System is a combination of purpose-built hardware and software designed to provide performance and security. Recent generation instances, including instance families popular with HPC workloads such as c5, c5n, m5zn, c6gn, and many others, are based on the Nitro System. As shown in Figure 1, the AWS Nitro System is composed of three main components: Nitro cards, the Nitro security chip, and the Nitro hypervisor. Nitro cards provide controllers for the VPC data plane (network access), Amazon Elastic Block Store (Amazon EBS) access, and instance storage (local NVMe), as well as overall coordination for the host. Offloading these capabilities to the Nitro cards removes the need to use host processor resources to implement these functions and also offers security benefits. The Nitro security chip provides a hardware root of trust and secure boot, among other features that help with system security. The Nitro hypervisor is a lightweight hypervisor that manages memory and CPU allocation.
With this design, the host system no longer has direct access to AWS resources. Only the hardened Nitro cards can access other resources, and each of those cards provides software-defined hardware devices that are the only access points from the host. Because I/O accesses are handled by the Nitro cards, the last component, the Nitro hypervisor, can be lightweight and have minimal impact on workloads running on the host. The Nitro hypervisor has only the necessary functions, with a design goal of being quiescent: it should never activate unless it is doing work for an instance that requested it. This also means that no background tasks consume resources when the hypervisor is not needed.
The Nitro System architecture also allows AWS to offer instances that provide direct access to the “bare metal” of the host. Since their initial introduction in 2017, many instance families have offered *.metal variants, which provide direct access to the underlying hardware with no hypervisor. As in the case where the Nitro hypervisor is used, the Nitro cards are still the only access points to resources outside of the host. These instances are most commonly used for workloads that cannot run in a virtualized environment due to licensing requirements, or for those that need specific hardware features only provided through direct access.
Having both “bare metal” instances and Nitro virtualized instances available provides a way to measure the differential between HPC application performance on bare metal and on the AWS Nitro hypervisor.
Performance Comparison: Nitro Hypervisor vs. Bare Metal
The Amazon EC2 User Guide offers the following statement as part of a description of the Nitro hypervisor: “A lightweight hypervisor that manages memory and CPU allocation and delivers performance that is indistinguishable from bare metal for most workloads.” Given the intense performance demands that HPC applications place on the underlying hardware, do they also fit into the “most workloads” description?
We compare the c5n.18xlarge and c5n.metal instance types in this post. The C5n instance family is commonly used for HPC applications and is based on the Intel Xeon Platinum 8000 series (Skylake-SP) processor with a sustained all-core Turbo CPU clock speed of up to 3.5 GHz. Both the c5n.18xlarge and the c5n.metal instances provide two CPUs with 18 physical cores each. Both instances have 100 Gbps networking and Elastic Fabric Adapter (EFA) to run HPC applications at scale. Both instances use the same underlying hardware; the only difference is that the non-metal instance (c5n.18xlarge) uses the Nitro hypervisor.
We evaluated four different workloads: Weather Research and Forecasting (WRF) Model (weather forecasting), OpenFOAM (computational fluid dynamics), GROMACS (molecular dynamics), and High Performance Linpack (synthetic matrix solver). These were selected to show applications with different characteristics. In each case, we ran the application at a scale of 16 instances (576 cores, at 36 physical cores per instance) using AWS ParallelCluster and FSx for Lustre as the shared filesystem.
WRF (Weather Research and Forecasting Model): WRF is one of the most widely used numerical weather prediction (NWP) models, with over 48,000 registered users spanning over 160 countries. The benchmark case used for this study is the CONUS 2.5km (version for WRF v4). As with other weather models, this workload includes I/O-heavy portions that read initial conditions and generate output. The performance metric used is the rate of simulation, based on the total wall-clock time that includes both I/O and compute time.
GROMACS: GROMACS is a molecular dynamics (MD) package designed for simulations of proteins, lipids, and nucleic acids. For this evaluation, we run the benchRIB input set (Ribosome in water with 2M atoms) from the Max Planck Institute for Biophysical Chemistry. The performance number used in comparisons is the ‘ns/day’ metric that GROMACS reports at the completion of a run.
OpenFOAM: OpenFOAM is an open-source computational fluid dynamics package. For this evaluation, we use a scaled-up (15 million cell) version of the 4 million cell motorbike case that is part of the standard OpenFOAM v2012 tutorial suite. The performance metric used for comparison is the total time for all iterations, converted to iterations per hour.
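The conversion from total run time to iterations per hour can be sketched as follows. This is only an illustration of the arithmetic: the function name and the sample numbers (500 iterations in 900 seconds) are hypothetical and are not measurements from this study.

```python
def iterations_per_hour(num_iterations: int, total_seconds: float) -> float:
    """Convert a run's total wall-clock time into an iterations-per-hour rate."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    # 3600 seconds per hour: scale the per-second rate up to a per-hour rate.
    return num_iterations * 3600.0 / total_seconds


if __name__ == "__main__":
    # Hypothetical run: 500 solver iterations completed in 900 seconds.
    rate = iterations_per_hour(500, 900.0)
    print(f"{rate:.1f} iterations/hour")
```

Expressing the result as a rate rather than a time means that a higher number is better, which keeps it consistent with the other metrics used here (ns/day, GFLOPS).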
High Performance Linpack (HPL): HPL is a software package that solves a dense linear system in double-precision floating point. It forms the basis for the rankings of the TOP500 list, which ranks the “fastest supercomputers in the world.” Although this is not an end-application, we have included it for completeness. For these runs, the optimized HPL implementation provided by Intel in the oneAPI package was used. The performance metric used for comparison is the GFLOPS value reported at completion of the run.
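For readers unfamiliar with how a GFLOPS figure relates to run time, the sketch below shows the conventional calculation, assuming the usual operation count for an LU-based solve of an N x N system (2/3 N^3 + 3/2 N^2). The function name, problem size, and run time are hypothetical; the benchmark prints this value itself, so this is only meant to show where the number comes from.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimate GFLOPS for solving a dense n x n system in `seconds`.

    Assumes the conventional LU-solve operation count:
    (2/3) * n^3 + (3/2) * n^2 floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical run: N = 100,000 solved in 100 seconds.
    print(f"{hpl_gflops(100_000, 100.0):.1f} GFLOPS")
```

Because the operation count is fixed for a given N, comparing GFLOPS between two configurations at the same problem size is equivalent to comparing their run times.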
As shown in Figure 2, the normalized performance of the metal instance and the full-sized virtual instance is nearly identical. In every evaluated case, the two instance types perform within 1% of each other. Each data point is the average of 3 separate runs for each configuration.
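One way to compute such a normalized comparison is sketched below: average the runs for each configuration, divide the virtualized average by the bare metal average, and express the gap as a percentage. The function names and the sample values are made up for illustration and are not the results plotted in Figure 2; the exact normalization used for the figure is an assumption.

```python
from statistics import mean
from typing import Sequence


def normalized_performance(virtual_runs: Sequence[float],
                           metal_runs: Sequence[float]) -> float:
    """Mean virtualized result divided by mean bare metal result (1.0 = parity)."""
    return mean(virtual_runs) / mean(metal_runs)


def percent_differential(virtual_runs: Sequence[float],
                         metal_runs: Sequence[float]) -> float:
    """Absolute gap between the two configurations, as a percentage of bare metal."""
    return abs(normalized_performance(virtual_runs, metal_runs) - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical metric values (higher is better), three runs per configuration.
    metal = [100.0, 101.0, 99.0]
    virtual = [99.5, 100.5, 99.4]
    print(f"normalized: {normalized_performance(virtual, metal):.3f}")
    print(f"differential: {percent_differential(virtual, metal):.2f}%")
```

Normalizing to the bare metal result lets workloads with different units (ns/day, iterations per hour, GFLOPS) share a single chart.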
This result shows that for these evaluations, the performance achieved by using the AWS Nitro hypervisor is indistinguishable from that of using the bare metal instance type.
HPC applications utilize every bit of performance that an underlying system can provide. As we have shown in this post, the AWS Nitro System is able to deliver performance at that level while retaining the benefits of virtualized hardware, including faster provisioning times. For these applications, the performance differential between using the Nitro hypervisor (c5n.18xlarge) and c5n.metal is negligible.
If you are interested in learning more about the AWS Nitro System, we have a number of other resources. These include two past re:Invent presentations, ‘Evolution of Nitro System’ and ‘Nitro Deep Dive’, that provide background on how AWS designed and built Nitro. James Hamilton, VP/Distinguished Engineer, has also shared insights about AWS Nitro on his blog. The AWS Nitro System page offers additional resources and details on the security benefits of the system.