AWS HPC Blog

Introducing a community recipe library for HPC infrastructure on AWS

We want to make it easier for customers to extend and build on AWS using tools like AWS ParallelCluster, Amazon FSx for Lustre, and some of the hundreds of other AWS services that customers often use to make discoveries from their data or simulations.

Recently, we introduced you to new capabilities in ParallelCluster that provide a neat way to create self-documenting infrastructure – mainly by first writing the specs into AWS CloudFormation templates, then letting CloudFormation build the infrastructure for you behind the scenes. This lets you put your cluster definition under version control, and even manage it from continuous integration systems.

Today we’re making available a community library of patterns that build on that functionality. HPC Recipes for AWS is a public repository on GitHub that hosts interoperable CloudFormation templates designed to work together to build complete HPC environments. We believe this library will help customers achieve feature-rich, reliable HPC deployments that are ready to run diverse workloads – regardless of where they’re starting from.

In this post, we’ll provide some background about this new project, and take you for a tour of some of its components. Finally, we’ll show you how to cook up a cluster from these recipes in just a few minutes.

Background

By design, ParallelCluster makes it straightforward to create and manage HPC clusters. It handles the undifferentiated heavy lifting to orchestrate the compute, networking, and storage resources you need (assembled from around a dozen AWS services) into a coherent whole. Customers tell us that most common paths through cluster creation and management are well-served by ParallelCluster through its declarative cluster template file, reliable resource management, and an optional web user interface.

However, customers also told us about two papercuts that they needed help with.

First: it can be complicated to incorporate external resources – like user authentication, existing filesystems, or relational databases for job accounting – into ParallelCluster. The “right way” usually involves custom scripts, home-grown CloudFormation templates, or documented steps and workarounds.

Second: Most of these problems have been solved before, sometimes by AWS solution architects and sometimes by the community at large. Why aren’t these solutions discoverable and reusable? ParallelCluster has made the cluster part of HPC easier, but there can be a steep learning curve involved to stand up the related parts of the infrastructure.

We agreed.

So, in ParallelCluster 3.6, we built support for CloudFormation, so anyone could write templates to create and manage ParallelCluster clusters. These templates could sit beside other templates that launch surrounding and supporting resources.
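To give you a flavor of what that looks like, here is a minimal sketch of a template that declares a cluster through the ParallelCluster CloudFormation custom resource provider. The resource type, property names, subnet parameters, and instance types below reflect our reading of that interface and are placeholders to adapt – check the ParallelCluster documentation and the recipe library for the exact, current syntax.

# Sketch only: a small ParallelCluster cluster declared in CloudFormation.
# Assumes the ParallelCluster custom resource provider is already deployed
# (it supplies the ServiceToken value passed in below).
Parameters:
  ClusterProviderServiceToken:
    Type: String
    Description: ServiceToken output from the ParallelCluster custom resource provider stack
  HeadNodeSubnetId:
    Type: AWS::EC2::Subnet::Id
  ComputeNodeSubnetId:
    Type: AWS::EC2::Subnet::Id

Resources:
  DemoCluster:
    Type: Custom::PclusterCluster
    Properties:
      ServiceToken: !Ref ClusterProviderServiceToken
      ClusterName: demo-cluster
      ClusterConfiguration:
        Image:
          Os: alinux2
        HeadNode:
          InstanceType: c5.large
          Networking:
            SubnetId: !Ref HeadNodeSubnetId
        Scheduling:
          Scheduler: slurm
          SlurmQueues:
            - Name: general
              ComputeResources:
                - Name: compute
                  InstanceType: c5.large
                  MinCount: 0
                  MaxCount: 4
              Networking:
                SubnetIds:
                  - !Ref ComputeNodeSubnetId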

But this opened up a new challenge: There are many ways to set up these resources, all quite valid. We wanted builders to be comfortable leveraging each other’s work, but without continuously reinventing each other’s wheels. We needed some mechanisms for making CloudFormation templates modular, interoperable, and shareable. And that led us to the work we’re sharing today.

Introducing… HPC Recipes for AWS

HPC Recipes for AWS is a public GitHub repo with over 20 recipes, organized into themes like networking, storage, user environments, and vertical integrations. Each recipe features downloadable assets, along with documentation and metadata to assist with discovery and attribution. The repository is managed by the HPC engineering team, with contributions from AWS solutions architects and our customers.

In the repo, you’ll find two general kinds of recipe:

  1. Modular templates – these launch specific services or enable particular configurations, and can be stitched together with other templates. To that end, they share a common naming convention for their parameters (inputs) and outputs.
  2. One-click launchable stacks – these combine multiple building blocks from the modular templates catalog, and often launch complete working clusters that you can adopt, tweak, or make your own.

Modular templates are great for standing up pieces of functionality that might be used by several other services or clusters. A great example is the serverless SQL database you can use to power Slurm’s accounting features. Or it could be larger: setting up an Amazon FSx for Lustre filesystem that’s going to be shared between multiple clusters.
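To give a flavor of how a modular stack like that gets consumed, the accounting database’s endpoint, administrator user, and secret typically surface as stack outputs, which then map onto the Slurm settings in your cluster configuration. Here is a hedged sketch of that fragment – the endpoint, user name, and secret ARN are placeholders standing in for the database stack’s outputs:

# Fragment of a ParallelCluster cluster configuration wiring in Slurm accounting.
# All values below are placeholders that would come from the database stack's outputs.
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Database:
      Uri: slurm-accounting.cluster-abc123.us-east-1.rds.amazonaws.com:3306
      UserName: clusteradmin
      PasswordSecretArn: arn:aws:secretsmanager:us-east-1:111122223333:secret:slurm-db-AbCdEf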

One-click launchable stacks are more opinionated assemblies of the modular components. They can be launched quickly, usually after you make some minor customization choices. For instance, you can launch AWS ParallelCluster UI with an HPC cluster ready-to-go. Or, you can bring up a specifically-tuned HPC cluster to try out the latest AWS Graviton3E CPUs in our Hpc7g instances.

We think these recipes will help you quickly find a pattern that closely resembles your own setup to get started with. Afterwards, you can take a stack and customize it to make it your own. In any case, you can make tactical changes as your needs evolve over time, because … it’s the cloud.

Fairness is nice, and useful too.

Our repo is designed to be F.A.I.R. software to help ensure it is broadly useful:

  • Findable: the recipes are organized into categories and clearly named, and are tagged by their relevant technologies and attributions so they’re easier to discover.
  • Accessible: the collection is hosted in a public GitHub repository under an MIT-0 license, which is standard for many AWS open-source projects. We make it even more accessible by mirroring assets from its recipes to Amazon S3, which gives every file an HTTPS URL that can be used with CloudFormation and other services. This means they can be imported into other recipes or embedded in quick-create links.
  • Interoperable: recipes are written in AWS CloudFormation or AWS CDK. They follow standards (where available) and best practices for naming and design. There is a good-faith effort to use clear, standard names for parameters, outputs, and exports.
  • Reusable: there is a growing number of modular infrastructure recipes. We intend that these can be used directly, but also imported into other recipes (even outside this collection). Furthermore, each recipe is documented clearly enough to serve as a teaching instrument, which encourages modification and adaptation.

It turns out that being FAIR can also make a project quite useful. We’ll see that next, as we explore how some of its recipes (and the repo’s design) work together to simplify HPC on AWS.

Off to the Kitchen

Let’s head to a virtual test kitchen to compare three recipes for cooking up a cluster with a persistent Lustre filesystem backed by Amazon Simple Storage Service (Amazon S3).

The first shows a common case, where you want to use resources provisioned separately from your cluster. Using this, you’ll need to look up several values and input them into a launch template. This is the most versatile option if you’re a frequent CloudFormation user, but can lead to some repetitive tasks. It’s useful to understand how this works – especially if you’re likely to borrow stacks from other repos from time to time.

The second shows what can be done with infrastructure stacks that are written to work together – like the ones in the recipe library.

Finally, the third recipe demonstrates how to connect separate stacks into a single streamlined launch template. This is the fastest way to get started with a complete environment, but you’ll probably want to customize these to adapt to your needs before you get too serious.

As we go through the recipes, we’ll call out how key features of HPC Recipes for AWS and AWS CloudFormation enable these designs.

Recipe 1: cluster by hand

Our first recipe uses modular templates that create supporting infrastructure, then has us configure the cluster by hand with their outputs. Found at training/try_recipes_1, it involves several steps:

  1. Create an HPC-ready VPC using the net/hpc_basic recipe. There are several fields in this template, but the only one you need to set is Availability Zone. When the stack has been created, look up the VPC and subnet IDs in its outputs. They will be named VPC, DefaultPublicSubnet, and DefaultPrivateSubnet.
  2. Provision an Amazon S3 bucket to back the FSx for Lustre filesystem with storage/s3_demo.
  3. Now, create a persistent Lustre filesystem with a data repository association using storage/fsx_lustre_s3_dra. Put the Amazon S3 bucket name in DataRepositoryPath, formatted as an S3 URL (for example, s3://your-bucket-name). Select the networking stack VPC in VpcId and its private subnet for SubnetId.
  4. Once the filesystem and other resources are created, go to the cluster recipe, and choose Launch stack to create an HPC system using outputs from all the supporting stacks.

To accomplish this last step, you’ll need to do additional output-to-parameter mappings (Figure 1):

  • Select the public subnet from your network stack for HeadNodeSubnetId. This is where the head node will launch. The private subnet goes in ComputeNodeSubnetId; this is for the compute nodes.
  • Go to the Outputs tab in the filesystem stack. Find the value for FSxLustreFilesystemId and use it for your cluster’s FilesystemId setting. Use the value for FSxLustreSecurityGroupId for the FilesystemSecurityGroupId setting.
  • Finally, choose the operating system, architecture, number of compute instances, and Lustre filesystem size and finish launching the stack.
Figure 1. Cluster launch template requests outputs from other CloudFormation stacks

In around 10-15 minutes, the cluster will be ready to use.

Go to stack outputs and choose SystemManagerUrl to log into the cluster with AWS Systems Manager (SSM). Once you’re in, you can view queues and run jobs. If you upload some data to the S3 bucket, it’ll show up in the Lustre shared filesystem (and vice versa).

To shut down the cluster and its resources, delete the stacks in reverse order from when they were created. Start with the cluster stack, then the FSx storage stack, followed by the Amazon S3 stack, and finally, the network stack.

You might also notice that each recipe has a quick-create link that launches the AWS CloudFormation console. These are enabled by the automatic mirror of recipe assets to the Amazon S3 bucket we mentioned earlier.

Here is a template for creating a quick-create link:

https://console.aws.amazon.com/cloudformation/home?region=REGION#/stacks/create/review?stackName=OPTIONAL-STACK-NAME&templateURL=HPC-RECIPES-S3-URL

All recipes that include a CloudFormation template can be embedded this way. You can learn more about quick-create links in the CloudFormation User Guide and the documentation for this repo.
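For instance, a filled-in link could look like the following. The region, stack name, and template URL here are illustrative placeholders – copy the actual quick-create link or mirrored asset URL from the recipe’s README:

https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?stackName=hpc-networking&templateURL=https://<recipes-mirror-bucket>.s3.amazonaws.com/net/hpc_basic/assets/main.yaml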

Figure 2. Complex outputs-to-parameter mappings between CloudFormation stacks.

Turning to the actual cluster recipe, there is a challenge with its design – the configuration is pretty simple, yet we still had to specify the VPC and subnet twice and consult multiple stack outputs to launch it (Figure 2).

What if we need to integrate multi-user support or Slurm accounting (each their own stack with more sophisticated networking needs)? What if we also have additional filesystems, security groups, or IAM policies? We’d probably need a pen and paper to keep all the parameter mappings straight! It might also be challenging to remember how resources from the various stacks are related to one another when we need to make updates or delete them.

Recipe 2: Cluster using Imports

The next approach, training/try_recipes_2, improves on the first by using CloudFormation’s ImportValue intrinsic function to bring in information from existing stacks. With this design, you provide the names of the stacks that provide networking and filesystem support (Figure 3). Then the cluster CloudFormation template imports values from their outputs.

Figure 3. Simplified templates with CloudFormation imports

Let’s see how it works in practice:

  1. Create an HPC network stack with net/hpc_basic.
  2. Next, create an Amazon S3 bucket using the storage/s3_demo recipe.
  3. Stand up an Amazon FSx for Lustre filesystem using the “alternative import stack” in the storage/fsx_lustre_s3_dra recipe. Instead of picking a VPC and subnet, just input the name of the networking stack from step 1 into NetworkStackNameParameter.
  4. When the FSx for Lustre stack is ready, go to the cluster recipe and choose Launch stack. It will ask for the names of your networking and storage stacks. Then, it’ll use them to automatically import key configuration details like the VPC, subnets, and filesystem ID. That’s a lot less typing!
  5. Last, choose your operating system, architecture, number of compute instances, and size of the Lustre filesystem and complete the stack launch.

After a few minutes, the cluster will be accessible via AWS Systems Manager, just like in the first recipe. To shut down the system and delete its dependencies, delete the stacks in reverse order from when they were created, beginning with the cluster stack.

As you can see, this approach streamlines the cluster creation process because you don’t have to look up parameter values – instead they are simply imported when the stack launches.

Besides providing a simplified end-user experience, there are two other benefits to this design. First, you can swap out implementations of the modular stacks with your own CloudFormation templates. They just have to follow the parameter and export naming conventions expected by the other stacks. Second, this design helps promote sustainable infrastructure practices by organizing your deployment into logical units – like we recommend in the AWS Cloud Development Kit (CDK) best practices guide.

To get started with CloudFormation imports, read over the template for this recipe, as well as that of dependencies like the networking recipe. Notice how exports from the dependency stacks get imported by name into the cluster template using the Sub and ImportValue intrinsic functions.
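If you haven’t used these functions before, here is a hedged sketch of the pattern – the export and parameter names are illustrative rather than the recipes’ exact ones. The dependency stack exports a value under a name derived from its own stack name, and the consuming stack rebuilds that name from a parameter and imports it:

# Fragment 1 – networking (dependency) stack: export an output under a
# predictable name derived from the stack's own name.
Outputs:
  DefaultPrivateSubnet:
    Value: !Ref PrivateSubnet            # subnet resource defined elsewhere in that stack
    Export:
      Name: !Sub '${AWS::StackName}-DefaultPrivateSubnet'

# Fragment 2 – cluster (consuming) stack: accept the dependency's stack name
# as a parameter, then rebuild the export name and import the value.
Parameters:
  NetworkStackNameParameter:
    Type: String
    Description: Name of the networking stack whose exports this template imports

# ...then, wherever a subnet ID is needed:
#   SubnetId:
#     Fn::ImportValue:
#       Fn::Sub: '${NetworkStackNameParameter}-DefaultPrivateSubnet'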

Recipe 3: Automatic Cluster

In our final recipe, training/try_recipes_3, we demonstrate a one-click launchable stack. The only required input is the Availability Zone you want to use. Everything else is automatically configured (Figure 4).

Using it is much simpler:

  1. Go to the recipe and choose Launch stack. You will be asked to select an Availability Zone, then choose the operating system, architecture, number of compute instances, and Lustre filesystem size. In a few minutes, the cluster will be ready to use. Go to stack outputs and navigate to SystemManagerUrl to log in using a web console.

Getting rid of the various HPC resources is just as straightforward. Delete the main stack and CloudFormation will take care of shutting everything down in the correct order. If you want to keep some of the provisioned resources, you can go to the CloudFormation console, find the relevant component stack, and enable termination protection before deleting the parent stack(s).

This approach relies on an important CloudFormation capability called “nested stacks”. With these, you can create resources by loading CloudFormation code from an S3 URL. In the case of this recipe, code for those resources comes from other HPC recipes. It is quite an opinionated way of doing things, but provides a direct path for anyone to offer reproducible deployments of complex infrastructure for demonstrations, proofs of concept, or training.

Figure 4. Nested stacks can enable 1-step deployments of complex infrastructure

To learn more about nested stacks, have a look at this recipe’s template file. Pay special attention to how TemplateURL is used to import other HPC Recipes for AWS, and how dynamic references are used to link stack outputs and parameters.
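As a hedged sketch of the shape of that pattern, a parent template can declare each recipe as an AWS::CloudFormation::Stack resource, point TemplateURL at the mirrored asset in Amazon S3, and pass one nested stack’s outputs into another’s parameters with GetAtt. The URLs and parameter names below are placeholders, not the recipe’s actual values:

# Sketch: a parent stack composing two nested stacks. Template URLs and
# parameter names are placeholders.
Parameters:
  AvailabilityZone:
    Type: AWS::EC2::AvailabilityZone::Name

Resources:
  Networking:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://<recipes-mirror-bucket>.s3.amazonaws.com/net/hpc_basic/assets/main.yaml
      Parameters:
        AvailabilityZone: !Ref AvailabilityZone

  Cluster:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://<recipes-mirror-bucket>.s3.amazonaws.com/<cluster-recipe>/assets/main.yaml
      Parameters:
        # Feed the networking stack's outputs straight into the cluster stack.
        HeadNodeSubnetId: !GetAtt Networking.Outputs.DefaultPublicSubnet
        ComputeNodeSubnetId: !GetAtt Networking.Outputs.DefaultPrivateSubnet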

Conclusion

AWS HPC systems often depend on other AWS resources, like filesystems, networking, databases, or directory services. It can be complicated to set up all the dependencies and ensure they work together.

HPC Recipes for AWS is a growing collection of modular and all-in-one infrastructure recipes that helps you accomplish exactly that. You can learn from them, use them directly to configure and launch infrastructure, and extend or remix them to your own ends.

We invite you to try launching an HPC cluster with one of these recipes today, explore the repository in greater detail, and contribute new recipes or improvements. You might also consider starring the repository on GitHub so you can be informed of updates and new recipes.

Finally, if this new resource improves your workflow or helps you solve harder problems than you could before, reach out to us at ask-hpc@amazon.com and let us know about it.