How Maxar builds short duration ‘bursty’ HPC workloads on AWS at scale
This post was contributed by Christopher Cassidy, Maxar Principal DevOps Engineer, Stefan Cecelski, PhD, Maxar Principal Data Scientist, Travis Hartman, Maxar Director of Weather and Climate, Scott Ma, AWS Sr Solutions Architect, Luke Wells, AWS Sr Technical Account Manager
High performance computing (HPC) has been key to solving the most complex problems in every industry and has been steadily changing the way we work and live. From weather forecasting to genome mapping to the search for extraterrestrial intelligence, HPC is helping to push the boundaries of what’s possible with advanced computing technologies.
Maxar’s WeatherDesk℠ leverages these advanced computing technologies to deliver weather forecasts faster to customers, enabling them to make better-informed business decisions. WeatherDesk builds HPC solutions on AWS to provide access to global numerical weather forecasts to stay ahead of emerging conditions that affect agriculture production, commodity trading, and financial markets. These forecasts are also vital for protecting critical infrastructure like power grids around the world, energy exploration and production, and even transportation. The WeatherDesk platform provides access to data services, web applications, and information reports to customers around the clock via a software-as-a-service (SaaS) suite of offerings designed for specific personas: data scientists and developers, researchers, and executives and operators, respectively.
Maxar uses a number of HPC services like Elastic Fabric Adapter (EFA), the AWS Nitro System and AWS ParallelCluster to deliver their solutions to their customers. All of this allows Maxar to scale HPC applications to tens of thousands of CPUs with the reliability, scalability, and agility of AWS that would otherwise be extremely difficult to achieve.
In this post, we will discuss how Maxar deploys all these tools to run short duration HPC workloads using the “fail fast” software development technique.
HPC workloads come in all different shapes and sizes, but can generally be divided into two categories based on the degree of interaction between the concurrently running parallel processes:
- Loosely coupled workloads are those where the parallel processes don’t strongly interact with each other in the course of a simulation.
- Tightly coupled workloads are those where the parallel processes run simultaneously and regularly exchange information between cooperating processes at each step of a simulation.
Solutions like Maxar’s WeatherDesk platform that use HPC for numerical weather prediction are tightly coupled due to the complexity of the calculations that go into making a numerical weather forecast. Mainly, the codependency of global weather parameters and computing algorithms requires over two billion calculations per second spread across hundreds of Amazon Elastic Compute Cloud (Amazon EC2) instances within an AWS HPC cluster environment. Each of these calculations depends on the others, so reliably exchanging large amounts of data in the least time is important.
Additionally, HPC workloads – including Maxar’s WeatherDesk – are often constrained in other ways that can impact the final solution:
- HPC workloads are often very bursty, performing computations only a few hours per day. This requires a large number of cores for a short time when they run.
- Time-bound workloads must complete on a specific schedule with the exact number of instances or physical cores available throughout that time period.
- Spot Instances require workloads to “checkpoint” their progress to handle interruptions, but not all workloads can. Tightly coupled applications, like weather simulation, typically find this hard.
Workloads often prefer homogeneous instance types, or at least instances with the same underlying architecture, and, as we discussed, they need reliable, fast, high-throughput connectivity between instances.
Maxar needs to ensure they’re able to launch the HPC clusters when needed, using the suite of architectures given the problem set, and they must dynamically adjust (or fail fast) to minimize downstream impact.
With these constraints in mind, Maxar built a solution on AWS that uses hpc6a instances. These instances are powered by 3rd generation AMD EPYC processors and offer up to 65% better price performance over comparable Amazon EC2 x86 based compute-optimized instances.
Their solution uses AWS ParallelCluster, an open-source cluster management tool that makes it easy for you to deploy and manage HPC clusters on AWS. ParallelCluster uses EFA, cluster placement groups, and supports On-Demand Capacity Reservations (ODCRs) – important technologies and techniques for creating a highly performant and flexible HPC solution.
They also use AWS CloudFormation to provision the head node and then use shell scripts to dynamically adjust input parameters to an AWS ParallelCluster CLI invocation which builds a cluster of over 25,000 physical cores … and they do this several times a day!
The ODCR allows Maxar to reserve large numbers of a specific instance type in a specific Availability Zone (AZ), using a cluster placement group, for the duration of their workload. This can happen at any time of any day and without entering into a one-year or three-year commitment.
Based on the response from the synchronous CreateCapacityReservation API call, Maxar can pivot to other instance types, AZs, or even other AWS Regions based on proximity to the datasets and other variables. For Maxar, pivoting to a similar instance type (or two) in the same AZ and Region provided enough flexibility (and resiliency) to boost both performance and confidence.
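As a rough sketch of this fail-fast pattern (the helper name, instance types, counts, and duration below are illustrative, not Maxar’s actual script), a shell function can walk an ordered list of candidate instance types and stop at the first one the synchronous reservation call accepts:

```shell
#!/bin/bash
# Hypothetical fail-fast helper: try to reserve capacity for each candidate
# instance type in turn, print the first type that succeeds, and return
# non-zero if none of them have capacity.
try_reservations() {
  local az="$1" count="$2" minutes="$3"
  shift 3
  local itype
  for itype in "$@"; do
    # create-capacity-reservation is synchronous: it either grants the
    # capacity or fails immediately (e.g. InsufficientInstanceCapacity)
    if aws ec2 create-capacity-reservation \
        --instance-type "$itype" \
        --instance-platform Linux/UNIX \
        --availability-zone "$az" \
        --instance-count "$count" \
        --end-date-type limited \
        --end-date "$(date -u -d "+${minutes} minutes" +%Y-%m-%dT%H:%M:%SZ)" \
        >/dev/null 2>&1; then
      echo "$itype"
      return 0
    fi
  done
  return 1   # no candidate had capacity: fail fast and alert upstream
}

# Example: prefer c6g.16xlarge, fall back to c6g.8xlarge in us-east-1a
# reserved=$(try_reservations us-east-1a 256 120 c6g.16xlarge c6g.8xlarge)
```

Ordering the candidate list from most to least preferred means the pivot costs one extra API call per unavailable type, rather than a missed forecast window.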
Let’s walk through a sample CloudFormation template that incorporates the newly supported AWS ParallelCluster resources.
AWS CloudFormation provisions and configures AWS resources for you, so that you don’t have to individually create and configure them and determine resource dependencies. In our solution we’ll use c6g instances powered by Arm-based AWS Graviton Processors.
In this walkthrough we’ll take you through four parts, including one that’s optional:
- Deploying the solution
- Running the workload
- Results of the workload (optional)
- Cluster cleanup
For this to work in your own AWS account, you’ll need:
- A list of instance types suitable for the workload (e.g. if the workload requires a GPU, g5, g5g, or g4dn might be suitable). In this example, we will use c6g.8xlarge and c6g.16xlarge.
- A list of availability zones that would meet the proximity requirement to the data for your workload.
- The duration of the workload and the number of instances required.
Deploying the solution
You can use our one-click Launch Stack link to deploy the CloudFormation template with all the necessary components for this solution. The solution is made up of:
- CloudFormation custom resources to create ODCRs
- A ParallelCluster head node with Slurm Workload Manager
- Security groups
- Cluster placement group
The input parameter NodeInstanceTypes is a comma-delimited list of the instance types for which you want to request an On-Demand Capacity Reservation (ODCR).
The input parameter NodeInstanceDurations is a comma-delimited list of reservation durations, in minutes, with one entry per instance type in NodeInstanceTypes.
The input parameter NodeInstanceCounts is a comma-delimited list of the number of nodes to reserve, with one entry per instance type in NodeInstanceTypes.
The input parameter KeyName specifies the EC2 key pair you would like to use.
Be careful to ensure you specify a key pair you can access as you’ll need this later on!
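To illustrate how the comma-delimited parameters line up (the values below are examples, not defaults from the template), each position in NodeInstanceDurations and NodeInstanceCounts pairs with the same position in NodeInstanceTypes:

```shell
#!/bin/bash
# Illustrative parameter values -- substitute your own candidates
NODE_INSTANCE_TYPES="c6g.16xlarge,c6g.8xlarge"
NODE_INSTANCE_DURATIONS="120,120"   # minutes, one entry per type
NODE_INSTANCE_COUNTS="256,512"      # nodes, one entry per type

# Split each comma-delimited parameter into an array
IFS=',' read -r -a types     <<< "$NODE_INSTANCE_TYPES"
IFS=',' read -r -a durations <<< "$NODE_INSTANCE_DURATIONS"
IFS=',' read -r -a counts    <<< "$NODE_INSTANCE_COUNTS"

# The three lists must be the same length so entries pair up by position
[ "${#types[@]}" -eq "${#durations[@]}" ] && [ "${#types[@]}" -eq "${#counts[@]}" ] \
  || { echo "parameter lists are misaligned" >&2; exit 1; }

for i in "${!types[@]}"; do
  echo "try ${types[$i]}: ${counts[$i]} nodes for ${durations[$i]} minutes"
done
```

Keeping the three lists aligned is what lets the solution fall back from the first instance type to the next while still knowing how many nodes, and for how long, to reserve.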
Running the workload
The workload is based on the WRF on AWS HPC workshop. This uses the Weather Research and Forecasting (WRF) model – a mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting applications. It runs on AWS Graviton processors with AWS ParallelCluster.
Once you’ve logged in to the head node, you can execute some commands to start a weather forecast simulation. Note that this will take approximately 2 hours to complete: one hour to download and prepare the data and software, and another hour to complete the simulation:
sudo -i
cd /shared
touch /var/log/user-data.log
./rundemo.sh &
tail -f /var/log/user-data.log
The shell script implements the ODCR logic by using the create-capacity-reservation API through the AWS Command Line Interface (CLI) to reserve the capacity you need. The script reserves a specific instance type in a specific AZ, which is tied to the cluster placement group. With ODCRs, the capacity becomes available and billing starts as soon as Amazon EC2 provisions the Capacity Reservation.
If the ODCR request fails, the script will pivot to alternative instance types until the ODCR request is successful. It’s also possible to loop through different AZs and regions if those are viable options for your workload.
Once the ODCR is successful, the script updates the instance type and then kicks off the HPC workload.
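A minimal sketch of that last step (the helper name, placeholder token, file names, and cluster name here are hypothetical): once a reservation succeeds, substitute the reserved instance type into a ParallelCluster configuration template, then build the cluster from the rendered file:

```shell
#!/bin/bash
# Hypothetical helper: replace a placeholder token in a cluster config
# template with the instance type that was actually reserved.
render_config() {
  local reserved_type="$1" template="$2"
  sed "s/@INSTANCE_TYPE@/${reserved_type}/g" "$template"
}

# Usage once the ODCR attempt returns a type, e.g. c6g.16xlarge:
# render_config c6g.16xlarge cluster-template.yaml > cluster.yaml
# pcluster create-cluster --cluster-name wrf-cluster \
#   --cluster-configuration cluster.yaml
```

Templating the instance type this way is what makes the pivot cheap: only one token in the cluster configuration changes, and everything else (placement group, networking, shared storage) stays identical.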
Navigating to the Capacity Reservations section of the Amazon EC2 console, you can see the Capacity Reservations that are actively being used by the solution.
Visualize the results of the workload (optional)
NICE DCV is a remote visualization technology that enables you to securely connect to graphic-intensive 3D applications hosted on a remote server.
To see the results, you’ll need to connect to the head node remotely through NICE DCV using the AWS ParallelCluster Python-based cluster management tool (pcluster).
To do this, open a terminal from your local machine and run these commands to install the AWS ParallelCluster CLI and connect to the head node. You’ll need to specify the PEM file that corresponds to the EC2 key-pair you selected as an input parameter to your CloudFormation stack:
# run this pip3 command if you do not already have the pcluster cluster management tool installed
pip3 install aws-parallelcluster
# connect to the head node over NICE DCV (substitute your cluster name and key file)
pcluster dcv-connect -n <cluster-name> --key-path <path/to/key.pem>
Once you’re connected to the head node, launch a terminal, and run the following command to start ncview to visualize your results:
cd $WRFWORK
ncview wrfout*
Here are a couple of screenshots from the simulation results.
Cluster cleanup
To remove the cluster, go to the AWS CloudFormation console and delete the CloudFormation stack that you created in this solution.
With this solution, an ODCR reserves sufficient Amazon EC2 instances for the burstiness of the Maxar WeatherDesk workloads that run – at most – a few hours a day. They’re able to do this efficiently and cost-effectively without the interruptions from EC2 Spot or the need to sign up for a one-year or three-year commitment.
This method also introduces the concept of failing fast – pivoting to other HPC-optimized EC2 instance types as needed.
In the case of Maxar’s WeatherDesk, speed is everything…but so is dependability. If a workload can’t run because resources aren’t available, a quick pivot is essential so that Maxar’s customers still receive the critical Earth Intelligence information they need to operate their business, mission, or operation.
With a few lines of code, layering ODCR workflows into their solution gave them the flexibility they need at the scale necessary to meet their timeliness requirements.
Building Maxar’s WeatherDesk on AWS means they can rely on foundational concepts of reliability, scalability, and agility to address the rigid and demanding needs of weather workflows. By leveraging hpc6a instances, Maxar further reduced the runtime of its award-winning numerical weather prediction HPC workload by 35% while also creating efficiency gains.
Using ODCRs gave them peace of mind regarding capacity, because they can dynamically shift EC2 instance types and still deliver on time, while keeping their compute costs low. Since Maxar implemented this solution on AWS, their cost-to-run has shrunk by more than 50%, and they’ve been able to take advantage of numerous AWS offerings that help reduce costs even further, while also increasing resiliency.
These efficiency gains result in a solution that is more attractive to more users, particularly in terms of cost, ease of access, and dependability – all of which are critical for supporting our customers’ decision making for their business, mission, and operation and delivering on Maxar’s purpose, For A Better World.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.