Running the Harmonie numerical weather prediction model on AWS
This post was contributed by Jacob Poulsen, Senior HPC Researcher at DMI, and Karthik Raman, Senior HPC Solutions Architect at AWS.
The Danish Meteorological Institute (DMI) is responsible for running atmospheric, climate and ocean models covering the Kingdom of Denmark. We worked with DMI to port a full numerical weather prediction (NWP) cycling dataflow, built on the Harmonie NWP model, to AWS and to run it there. You can find a report of the porting and operational experience in the ACCORD community newsletter.
In this blog post, we expand on that report and present the initial timing results from running the forecast component of the Harmonie model on AWS. We present these as-is timing results alongside as-is timings obtained on two Intel Xeon based supercomputing systems, a Cray XC40 and a Cray XC50.
The Danish Meteorological Institute (DMI) is the national meteorological service of Denmark, responsible for running atmospheric, climate and ocean models covering the Kingdom of Denmark. DMI's activities, which include both national obligations and research, are centered around numerical weather prediction (NWP). DMI is a member of the United Weather Centre, and the NWP workflow studied here stems from the United Weather Centre West joint effort towards a common NWP setup. All 10 members of the United Weather Centre and the larger NWP model development community ACCORD rely on the same underlying NWP model, which we refer to here as the Harmonie model.
We recently published a report in the ACCORD newsletter that describes the experience of porting a full NWP cycling dataflow with the Harmonie model to AWS and running it there. The report also discusses the increasing relevance of cloud computing for operational NWP and its numerous use cases in weather forecasting and climate research.
For our evaluation we used a production-size workload with a grid size of 1920x1620x90 at 2 km resolution. This workload is currently being developed as a joint effort towards a common setup for numerical weather prediction for the United Weather Centre West. A typical production forecast length is 48H with hourly output. The target time for the forecast is 40 minutes, with additional post-processing required to fit an hourly run cycle. For the purposes of this evaluation, we reduced the forecast length to 1H. The success metric was to produce a 1H forecast in 50 seconds.
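The 50-second metric is consistent with dividing the 40-minute production budget evenly across the 48 forecast hours. The short Python sketch below shows that arithmetic. It assumes run time scales linearly with forecast length and ignores start-up and output overheads, which is a simplification for illustration and not something the evaluation states. The function name is hypothetical.

```python
def per_hour_budget_seconds(target_minutes: float, forecast_hours: int) -> float:
    """Wall-clock seconds available per simulated forecast hour.

    Assumes the total time budget is spread evenly over the forecast hours
    (an illustrative simplification, not a statement from the evaluation).
    """
    return target_minutes * 60.0 / forecast_hours


# Values taken from the text: 40-minute target for a 48H forecast.
budget = per_hour_budget_seconds(target_minutes=40, forecast_hours=48)
print(f"{budget:.0f} s per forecast hour")  # prints "50 s per forecast hour"
```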
Amazon EC2 C5 instances are built on the AWS Nitro system and are powered by AWS-custom Intel(R) Xeon(R) Platinum 8000 series processors. The C5n instances use the fourth generation of the custom Nitro card and the Elastic Network Adapter (ENA) device to deliver 100 Gbps of network throughput to a single instance. These instances are ideal for network-intensive HPC workloads. The C5n.18xlarge instance (36 physical cores, Intel Skylake based) supports Elastic Fabric Adapter (EFA), which enables running applications with high levels of inter-node communication using MPI at scale on AWS.
The Cray XC40 system used in this study consists of 36-core Intel(R) Xeon(R) CPU E5-2695-V4 @ 2.10GHz nodes, and the Cray XC50 system consists of 36-core Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz nodes. Both systems use the proprietary Cray Aries interconnect and Cray Sonexion storage systems for their Lustre file systems.
As NWP models benefit from high-speed networking, we evaluated the Harmonie model performance on C5n.18xlarge instances with EFA on AWS and compared it to two Cray machines, a Cray XC40 and a Cray XC50, both with the Cray Aries interconnect.
AWS deployment architecture
As shown in Figure 1, we used AWS ParallelCluster to launch the cluster of Amazon EC2 instances. This is an AWS-supported open-source cluster management tool that makes it easy to deploy and manage HPC clusters on AWS. With AWS ParallelCluster, the user only needs to create a simple text configuration file that describes the cluster resources. ParallelCluster then uses this file to provision compute, storage and networking capabilities automatically.
For running Harmonie on AWS, we launched a cluster using a custom Amazon Machine Image (AMI) configured with all the components needed to launch and run Harmonie. The compute, storage and networking requirements are defined in the ParallelCluster configuration file. For compute we used the c5n.18xlarge instance type with EFA enabled. For storage we enabled Amazon FSx for Lustre for parallel I/O and an Amazon S3 bucket to store the Harmonie input data. We also enabled the Slurm scheduler for job submission. With that configuration, the user experience is close to what users know from on-premises systems.
Optimal performance settings
The goal of our evaluation was to understand the feasibility and performance of running a production-class numerical weather prediction (NWP) model as-is on AWS. As such, we have not focused on detailed optimizations on any of the systems.
Compiler selection and tuning: We evaluated the performance of running Harmonie on AWS using GCC 8.4 vs. the Intel compiler and found that the Intel compiler performed up to 20% better. The Intel compiler (version 2021.2) was used on AWS c5n.18xlarge and the Intel compiler (version 17.0.3.053/126.96.36.199) on the Cray XC40/XC50 systems.
Hybrid MPI x OpenMP: The model was run with MPI only on all three systems, with 1 MPI rank per core. The Harmonie code supports hybrid MPI and OpenMP parallelization. Finding the right balance of MPI ranks to OpenMP threads is an optimization exercise we leave for the future.
I/O Performance: Harmonie is very sensitive to I/O performance. The code has an option to enable a separate I/O server and allocate MPI tasks that are dedicated to I/O. We enabled the I/O server option on all the systems evaluated. Based on experiments, we allocated 4 dedicated MPI tasks for I/O, as that was optimal across all the systems. Furthermore, to improve I/O performance on AWS, we used Amazon FSx for Lustre, which is a fully managed high-performance Lustre file system service. The Lustre file systems on the Cray XC40 and Cray XC50 both run on Sonexion storage systems.
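To make the MPI-only layout concrete, the Python sketch below counts ranks for a given number of nodes. It uses the 36 cores per node and the 4 I/O server tasks mentioned above. It assumes the I/O tasks are taken out of the one-rank-per-core total rather than added on top, which the text does not specify, and the function name is hypothetical.

```python
CORES_PER_NODE = 36  # from the text: 36 physical cores per node on all systems
IO_SERVER_TASKS = 4  # from the text: 4 dedicated MPI tasks for I/O


def rank_layout(nodes: int) -> tuple:
    """Return (total_ranks, compute_ranks) for an MPI-only run.

    Assumption (not stated in the text): the I/O server tasks are part of
    the one-rank-per-core total, so they reduce the compute rank count.
    """
    total_ranks = nodes * CORES_PER_NODE
    compute_ranks = total_ranks - IO_SERVER_TASKS
    return total_ranks, compute_ranks


# Node counts that appear in the evaluation.
for nodes in (86, 145):
    total, compute = rank_layout(nodes)
    print(f"{nodes} nodes: {total} ranks, {compute} compute + {IO_SERVER_TASKS} I/O")
```

Under this assumption the I/O server takes well under 1% of the ranks at these node counts, so the compute decomposition is essentially determined by the node count.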
Porting and reproducibility
The whole production suite has been built on AWS, and the main HPC components (forecast and data assimilation (4dvar)) have been scientifically tested and benchmarked. By benchmarked we mean that we collected the initial as-is runtimes, without any tuning effort on any of the three systems. This initial benchmarking demonstrated AWS performance that is on par with on-premises platforms.
The binaries and their third-party library dependencies have been wrapped in a self-contained Harmonie AMI (Amazon Machine Image) and a corresponding AWS ParallelCluster configuration.
Harmonie forecast performance
Figure 2 shows the performance of the 1H operational forecast using the Harmonie model across AWS, the Cray XC40 and the Cray XC50. The X-axis shows the number of instances (nodes) and the corresponding decomposition used for each run. The Y-axis is the time to solution, so lower is better. Comparing across the systems, AWS c5n.18xlarge (Intel Skylake based) has a 43% performance advantage over the Cray XC40 (Intel Broadwell based) at 145 instances (nodes) and is 16% better than the Cray XC50 (Intel Skylake based) at 86 instances (nodes). There were not enough free nodes on the Cray XC50 machine to scale beyond 86 nodes at the time of this evaluation. Since all the systems evaluated have 36 physical cores per node, it is straightforward to compare performance on a per-node basis.
We need between 65 and 77 c5n.18xlarge instances to achieve the target time (50 seconds) for the 1H forecast. Notably, the model scaled on AWS up to 145 c5n.18xlarge instances using EFA, with a parallel efficiency of around 70%.
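For readers who want to reproduce a parallel efficiency figure from their own timings, the Python sketch below uses the common strong-scaling definition: measured speedup divided by ideal speedup relative to a baseline run. The post does not state which baseline or exact formula was used, so this is one conventional reading. The timings in the example are made-up placeholders, not measurements from this evaluation.

```python
def parallel_efficiency(base_nodes: int, base_seconds: float,
                        nodes: int, seconds: float) -> float:
    """Strong-scaling efficiency relative to a baseline run.

    efficiency = (base_seconds / seconds) / (nodes / base_nodes)
    A value of 1.0 means perfect scaling from the baseline.
    """
    speedup = base_seconds / seconds
    ideal_speedup = nodes / base_nodes
    return speedup / ideal_speedup


# Placeholder numbers for illustration only (NOT measured Harmonie timings):
# 4x the nodes giving a 3.2x shorter run time.
eff = parallel_efficiency(base_nodes=10, base_seconds=400.0,
                          nodes=40, seconds=125.0)
print(f"parallel efficiency: {eff:.0%}")
```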
In this blog post, we have demonstrated the performance of running a large-scale, production-class, tightly coupled NWP simulation using Harmonie on Amazon EC2 instances. Running the NWP forecast simulations on C5n instances gives up to a 43% performance advantage over the Cray XC40 and up to a 16% advantage over the Cray XC50. The high network bandwidth provided by EFA helps run the Harmonie model at scale and achieve a turn-around time 37% faster than the target set for the 1H forecast.
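A percentage like "37% faster than the target" can be read in two ways, and the post does not say which one is meant. The Python sketch below converts the stated figure and the 50-second target into an implied run time under each reading. Both function names are hypothetical, and the results are derived arithmetic, not reported measurements.

```python
TARGET_SECONDS = 50.0  # from the text: target time for the 1H forecast
GAIN = 0.37            # from the text: "37% faster" than the target


def time_if_less_wall_clock(target: float, gain: float) -> float:
    """Reading 1: the run takes `gain` less wall-clock time than the target."""
    return target * (1.0 - gain)


def time_if_higher_rate(target: float, gain: float) -> float:
    """Reading 2: the run proceeds at (1 + gain) times the target rate."""
    return target / (1.0 + gain)


print(f"less wall-clock time: {time_if_less_wall_clock(TARGET_SECONDS, GAIN):.1f} s")
print(f"higher rate:          {time_if_higher_rate(TARGET_SECONDS, GAIN):.1f} s")
```

Under either reading the implied run time is comfortably inside the 50-second target.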
We encourage you to test your NWP model on Amazon EC2 instances. Detailed instructions on obtaining the Harmonie AMI and deployment on AWS using AWS ParallelCluster are provided in the GitHub repository HarmonieAWS. Reach out to us if you have any questions!
Jacob is a Sr. HPC Performance specialist at DMI with more than 20 years of experience in scientific computing, focusing on performance analysis and optimization of applications in the fields of weather, ocean and climate modeling.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.