AWS HPC Blog

A library of HPC Applications Best Practices on AWS

Being an HPC specialist at AWS comes with some critical responsibilities. Key among them is helping customers run their applications as fast as possible, and in the most cost-efficient way, too. We aim to help customers find the most appropriate services for every workload, with optimal drivers, settings, and options to get great outcomes.

But the number of HPC-related services (and their capabilities) grows fast enough that it's not always easy to be sure you're using AWS effectively.

So today we're announcing a resource containing the best practices from our HPC Specialist Solutions Architect (SSA) community to help you get the most from running your workloads on AWS. We're hosting these in a GitHub repository that is publicly available starting today. In addition to application best practices, the repo also contains CloudFormation templates to create clusters, and launch scripts (with some benchmark results) for selected applications.

We expect to update and expand the list of HPC applications in the repo regularly, based on your feedback and participation from our teams. You can propose new applications that would benefit from being included using GitHub Issues.

Background

Today AWS has more than 750 different instance types, but only some of them are helpful for HPC applications. Even if you include just the HPC-specific instances and the instance types that support EFA (the Elastic Fabric Adapter), there are more than a hundred options.
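
If you're curious which instance types support EFA in your Region, the AWS CLI can list them. Here's a minimal sketch; it assumes you have the AWS CLI configured with credentials and a default Region:

# List the instance types that support EFA in the current Region.
aws ec2 describe-instance-types \
  --filters Name=network-info.efa-supported,Values=true \
  --query 'InstanceTypes[].InstanceType' \
  --output text | tr '\t' '\n' | sort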

Customers often benchmark their HPC applications on AWS to understand which of these instance types is best for their code, and to ensure that the way the application is installed and run is aligned with their business needs (e.g., runs as fast as possible, lowest price, and so forth).

Sometimes customers wonder whether they are running a specific application ‘properly’ – are they achieving the best performance?

Sometimes it's more than idle interest – customers often request benchmarks, with help from our teams, as part of formal procurement exercises or before they start a Proof of Concept (PoC).

Running HPC application benchmarks properly – and at scale – is a complex task. It requires preparation, experience, and strong domain knowledge. It's even more complex if you have to leave your comfort zone and run applications in a new, possibly unfamiliar environment, because this can call for a deep understanding of how the new infrastructure works.

This is why we've created the resource we're launching today. It's maintained by AWS HPC Specialist Solutions Architects, who will take care of updating and improving it as our services evolve, new application versions are released, or better ways to run applications are found.

We’re starting with the most common applications for the computer-aided engineering (CAE) community. These are the codes most requested by our customers.

Our goals for HPC Applications Best Practices on AWS

Here’s what we’re aiming to accomplish with this initiative:

  1. Make sure our customers can achieve the best price/performance using our infrastructure and services for their HPC applications.
  2. Ensure you have references (like time to completion) and datapoints (benchmark metrics) for running these applications using public datasets.
  3. Share general guidance, settings, and tips that can apply to other applications, too.
  4. Lower the level of cloud expertise needed to run these workloads.

While the repo isn’t a supported product or service, we’re aiming to give you access to our best thinking – and our experience – on the topics it covers.

Let’s tour the GitHub repo

At launch, we’ve structured the repo as follows:

/apps contains a folder for the best practices for each of the included applications. Within this folder you’ll find:

  • an example launch script, and in some cases several, covering different architectures (x86 vs GPUs vs Graviton) or different application versions (when they require different settings in the launch script). The example launch scripts are working examples that can be run with only minor changes, but we also know every end-user has their own peculiarities and might want to run a given application in a custom way. In that case, we'd recommend starting from the working example provided and adapting it to your specific needs (there's an illustrative sketch after this list).
  • short documentation about the best practice (a README.md file plus a few assets). It typically comprises an introduction, tips and tricks, a deep dive into the architectural choices and the most important application and environment settings for tuning performance, and finally the benchmark results, presented in one or more charts.
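
To give you a feel for the shape of these launch scripts, here's a minimal, hypothetical sketch of a Slurm batch script for an MPI application. The job name, module name, binary, and task counts are all placeholders, not values from the repo; the real scripts layer application-specific settings on top of a skeleton like this:

#!/bin/bash
#SBATCH --job-name=example-app        # placeholder job name
#SBATCH --nodes=2                     # number of compute nodes to use
#SBATCH --ntasks-per-node=96          # MPI ranks per node; match your instance's physical cores
#SBATCH --exclusive                   # don't share nodes with other jobs

# Load an MPI stack; module names vary from cluster to cluster (assumption).
module load openmpi

# A common tip on EFA-enabled clusters: ask libfabric for the EFA provider.
export FI_PROVIDER=efa

# Run the (placeholder) MPI application across all allocated ranks.
mpirun ./example_app input.dat

You'd submit it with sbatch in the usual way, then tailor the directives and environment to the application you're running.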

/docs contains documentation, images, and charts. This doesn't replace the official application documentation, but complements it with explanations of our architectural choices and details about the specific application and environment settings we've used.

/ParallelCluster contains simple example config files for building an HPC cluster using AWS ParallelCluster. As we release new services or features for managing HPC resources, we'll update this section accordingly. We've included some automated CloudFormation-based procedures to deploy a basic cluster (using ParallelCluster) in selected AWS Regions. This structure will change over time as our capabilities grow and as we gather new material.
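
As an illustration of what these config files look like, here's a minimal, hypothetical ParallelCluster 3 cluster definition, plus the command that creates a cluster from it. The Region, subnet ID, key pair name, and instance choices are all placeholders, not values taken from the repo:

# Write a minimal cluster config; every resource ID below is a placeholder.
cat > cluster-config.yaml <<'EOF'
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0   # placeholder subnet
  Ssh:
    KeyName: my-key                      # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c5n
          InstanceType: c5n.18xlarge     # an EFA-capable instance type
          MinCount: 0
          MaxCount: 8
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0     # placeholder subnet
        PlacementGroup:
          Enabled: true                  # keep tightly coupled jobs close together
EOF

# Create the cluster from the config (requires the aws-parallelcluster CLI).
pcluster create-cluster --cluster-name demo-hpc --cluster-configuration cluster-config.yaml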

Each application included in the repository comes with a slightly different set of assets (for example, launch scripts, documentation, and performance charts), but there's a minimum set you'll see for every application.

Beginning with the /apps folder, you’ll find the list of Best Practices available in the repo.

Figure 1. The list of available best practices you can find in the GitHub repository under /apps/

Inside any application directory, you’ll find one or more launch scripts that you can use as-is – or customize based on your needs. In some cases, the launch scripts are in sub-directories, broken down by CPU or GPU architectures.

Figure 2. A typical example of the assets you can expect to find for an application.

You'll also find simple documentation, typically in a README.md file (or additional documents linked from it).

Figure 3. An example of the documentation available.

The documentation is not meant to be exhaustive – it focuses on what's relevant for running the application optimally on AWS. For general-purpose application documentation (like end-user or admin guides), refer to the official guides that come with the app itself.

Typically, the documentation in the repo will include the list of versions and architectures we've tested, application installation tips, general tips and key settings for tuning performance, and some performance-related information with charts and metrics for the most relevant instance types.

Figure 4. An example of the performance charts we will provide.

How to use these best practices

If you already have a cluster up and running, you can try these best practices by cloning this repository:

git clone https://github.com/aws-samples/hpc-applications.git 

If you don't have an existing HPC cluster, or if you want to deploy a new one just for these tests, you can follow the guide to launch a new cluster with AWS ParallelCluster. We've included a few simple CloudFormation templates that help you create a new test HPC cluster with just a few clicks. If you want to assemble a more complex testing environment using modular templates designed to work together, check out the HPC Recipes Library, which is also available as a GitHub repo (and is explained in detail in the blog post we published when we announced it).

To deploy one of the one-click stacks, select your preferred AWS Region from the ones shown in the table, and click the corresponding launch button. You'll be asked a few questions about networking and storage. If you don't know how to answer them, just leave the default value, AUTO: the one-click deployment procedure will take care of creating everything your HPC cluster needs to run properly.

Figure 5. The links to the 1-Click CloudFormation templates.

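If you prefer a terminal to the launch button, the same kind of stack can be created with the AWS CLI. The template URL below is a placeholder; use the one behind the launch button for your chosen Region. Omitting the parameters should leave them at the template defaults (AUTO), as in the console flow:

# Deploy the cluster stack from the CLI; the template URL is a placeholder.
aws cloudformation create-stack \
  --stack-name hpc-best-practices-cluster \
  --template-url https://example-bucket.s3.amazonaws.com/cluster-template.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM

# Wait for creation to finish before heading to the Outputs tab.
aws cloudformation wait stack-create-complete --stack-name hpc-best-practices-cluster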

After the CloudFormation stack is deployed, you can go to the Outputs tab in the CloudFormation console and click the SystemManagerUrl link. This lets you securely access the head node using AWS Systems Manager (SSM), without needing a password or certificate. You'll find a clone of the GitHub HPC Applications Best Practices repo under /fsx on the cluster.

Figure 6. The Outputs tab in the CloudFormation console shows a link to securely connect to the cluster through AWS Systems Manager (SSM), without needing a password or certificate.

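If you'd rather connect from your own terminal than through the console link, Session Manager has a CLI equivalent. It requires the Session Manager plugin for the AWS CLI, and the instance ID below is a placeholder for your cluster's head node:

# Open a shell on the head node via SSM; replace the instance ID with your head node's.
aws ssm start-session --target i-0123456789abcdef0

# Once connected, the best-practices repo clone is under /fsx.
ls /fsx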

Conclusion

Fine-tuning your HPC applications to run well on any cluster is a complex task. We aim to keep this repository up to date with future versions of common applications and AWS services, to give you the best possible experience with the fewest steps.

We’d love to receive your feedback (using GitHub Issues) to let us know if this is useful for you, or if you need something else to support your business.