AWS HPC Blog

Three recipes you don’t want to miss for AWS Parallel Computing Service

In December 2024, just before the holidays, we announced general availability of AWS CloudFormation support in AWS Parallel Computing Service – our newest managed HPC service, which makes it easy for you to deploy and scale your HPC workloads, and your scientific and engineering models, in the cloud.

We think this is a big deal, but if you’ve not experienced Infrastructure as Code (IaC) before, there’s a good chance you’re scratching your head wondering what all the fuss is about. IaC means that you express your physical machine architecture as readable, editable, version-controlled code. The cloud itself picks up the heavy lifting to turn that code into infrastructure – often in minutes. And usually without any involvement from you.

Today we want to highlight some great examples of IaC from our HPC Recipes Library, some of which will deliver you a cluster faster than a pizza delivery. The pizza is still useful, though, because you can munch on it while you ponder the cluster recipes and decide which parts of our code you want to adopt for your own work. The Recipes Library is open-source, and we hope you make use of that to get to the right result sooner.

What has AWS CloudFormation ever done for us?

But before we get to the recipes, it’s worth taking a minute to understand the other benefits you get from using an IaC approach to building your clusters.

The addition of CloudFormation support to AWS Parallel Computing Service (AWS PCS) marks a significant step forward in making HPC more accessible. Instead of navigating complex configuration steps, you can now use recipes or stack sets to deploy complete HPC environments with just a few commands or clicks.
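If you’re curious what “a few commands” looks like in practice, here’s a minimal sketch using boto3, the AWS SDK for Python. The stack name and template URL are placeholders for whichever recipe you choose:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Placeholder values: point TemplateURL at the S3 location of the recipe
# template you've picked from the HPC Recipes Library.
stack = cfn.create_stack(
    StackName="pcs-getting-started",
    TemplateURL="https://example-bucket.s3.amazonaws.com/pcs-cluster.yaml",
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Block until CloudFormation reports the stack is up.
cfn.get_waiter("stack_create_complete").wait(StackName="pcs-getting-started")
print("Deployed:", stack["StackId"])
```

Launching the same template from the console through a quick-create link amounts to the same CreateStack call – the point is that the whole deployment is a single, repeatable operation.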

This highlights simplicity as the first obvious benefit of IaC. Because we manage the infrastructure using the same rigor we apply to application code, we can deploy it like an application, too. Consumers of an app don’t need to understand the source to use the app.

Every cluster deployment also produces identical results. This eliminates the “it works on my cluster” problem and ensures your development, testing, and production environments match exactly. When you need to recreate an environment – whether for disaster recovery, scaling to a new AWS Region, or demonstrating the reproducibility of a published result – you can be confident in the outcome.

Getting that level of predictability is also beneficial for meeting tight security and compliance standards. If a human needs to manually implement the 23 critical steps for protecting a cluster from the bad guys, there’s a good chance they’ll miss one of them one day, mainly because humans aren’t great at this sort of monotony.

You may have heard us say that “security is job zero at Amazon”. It means it’s the gating factor that comes before everything else when we’re prioritizing our work. IaC lets you bring that kind of obsessiveness to your HPC environments, because you can define security configurations once and deploy them consistently across everything you build. You can write security best practices as reusable code components, deployed everywhere.

You can also keep a complete audit trail of infrastructure changes through version control. Since you’re now documenting security controls and compliance requirements in code, you can automatically generate compliance reports. Your code, and all its versions, is now comprehensive system documentation – you can show it to auditors and stakeholders. It doesn’t stop there: you can automate security tests in the same way you automate performance regression tests. It’s just code now.
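As a small, hypothetical example of a security test written as code, this boto3 sketch scans a deployed stack’s security groups for SSH left open to the internet. The stack name is a placeholder, and the filter relies on the tags CloudFormation applies to the resources it creates:

```python
import boto3

ec2 = boto3.client("ec2")

# Find the security groups belonging to one stack, using the tag that
# CloudFormation adds automatically. "pcs-getting-started" is a placeholder.
groups = ec2.describe_security_groups(
    Filters=[{
        "Name": "tag:aws:cloudformation:stack-name",
        "Values": ["pcs-getting-started"],
    }]
)["SecurityGroups"]

# Flag any ingress rule that leaves SSH open to the whole internet.
for sg in groups:
    for rule in sg["IpPermissions"]:
        if rule.get("FromPort") == 22 and any(
            r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
        ):
            print(f"WARNING: {sg['GroupId']} allows SSH from 0.0.0.0/0")
```

Run a check like this in a pipeline after every deployment and you have a compliance control that never gets bored.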

Three recipes from our HPC kitchen

Now let’s look at three powerful recipes from the HPC Recipes Library that show different aspects of Parallel Computing Service.

Architecturally, they’re all based on the same pattern as the Getting Started cluster from the PCS user guide, so we’ll spend some time describing that first, and then cover the differences which make the other two distinct.

Getting Started: your first HPC cluster

The Getting Started recipe replicates our popular tutorial experience – without the assembly steps. This recipe gives you a basic but fully functional cluster that’s perfect for learning and testing. It’s an ideal starting point if you’re new to PCS or want to validate your environment.

Figure 1 – The Getting Started cluster provides a fully-functional HPC environment, complete with shared file systems, a job scheduler, a login node, and an elastic fleet of compute resources for running jobs.

Every cluster spun up with these recipes creates a new Virtual Private Cloud (VPC), which is the environment in which we build the cluster. You don’t need a new VPC just to use PCS – we’ve just included a new, clean, juiced-up one in these recipes mainly so we don’t bump into unnecessary limits when we’re spinning up the cluster. It’s also a good idea to initially run these recipes as a user with Administrator privileges, at least until you finish experimenting and turn to narrowing things down to fit your environment and its particular security (or compliance) needs.

Scheduler – Every cluster launched comes with its own managed Slurm controller, running in an AWS service account, which means we’re managing it and – of course – looking after updates and upgrades for you when they come along.

Compute nodes – We pre-configure every Getting Started cluster with two compute node groups (CNGs). The first spins up a single login node so you have something to connect to for compiling code or submitting jobs. The second CNG is more elastic: it contains from zero to four compute nodes. The number of nodes will depend on the jobs you submit to Slurm, but will default to zero – ideal for a cloud HPC environment, which really should consume as little as possible when there are no jobs to run.

This recipe uses very small instances (c6i.xlarge), each with 4 vCPUs – making this a 16 vCPU cluster at full size. Its main purpose is to help you learn about the knobs and dials in PCS, so we aimed for small, low-cost instances. You can easily change this, though, by editing the compute node group definition in the PCS console. You can also clone a compute node group in the PCS console (like in Figure 2), changing the instance types it draws from along the way. Amazon Elastic Compute Cloud (Amazon EC2) has hundreds of instance types to choose from, depending on your needs.
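You don’t have to use the console for this, either – node groups are just API objects. Here’s a rough sketch of creating the same elastic node group with the PCS API via boto3; every identifier and ARN below is a placeholder, so check the PCS API reference for the exact fields your setup needs:

```python
import boto3

pcs = boto3.client("pcs")

# A sketch of the elastic node group described above: zero instances when
# idle, up to four under load. All IDs and ARNs here are placeholders.
pcs.create_compute_node_group(
    clusterIdentifier="pcs_0123456789",
    computeNodeGroupName="compute-1",
    subnetIds=["subnet-0123456789abcdef0"],
    customLaunchTemplate={"id": "lt-0123456789abcdef0", "version": "1"},
    iamInstanceProfileArn="arn:aws:iam::111122223333:instance-profile/pcs-node",
    scalingConfiguration={"minInstanceCount": 0, "maxInstanceCount": 4},
    instanceConfigs=[{"instanceType": "c6i.xlarge"}],
)
```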

Networking – Where possible, we use instances that support Elastic Fabric Adapter (EFA), which is our custom-built network for HPC and ML applications. It’s not usually offered on smaller instance sizes, so the 4-vCPU instances used by some recipes won’t offer you the blazing-fast performance you might be looking for. But “4-vCPU instances” should have given you the idea that performance wasn’t our motivation when choosing them for this demonstration purpose. Larger instances do support EFA, and where they do, we use it – and we provision instances in Cluster Placement Groups so your compute resources are physically close to each other, to minimize latency.
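If you end up writing your own launch template for a node group, this hedged boto3 sketch shows the two ingredients we just mentioned – an EFA network interface and a cluster placement group (the names are illustrative):

```python
import boto3

ec2 = boto3.client("ec2")

# Ask EC2 to pack instances physically close together.
ec2.create_placement_group(GroupName="pcs-compute-cpg", Strategy="cluster")

# A launch template that places nodes in that group and enables EFA on the
# primary network interface (only valid on EFA-capable instance types).
ec2.create_launch_template(
    LaunchTemplateName="pcs-compute-lt",
    LaunchTemplateData={
        "Placement": {"GroupName": "pcs-compute-cpg"},
        "NetworkInterfaces": [
            {"DeviceIndex": 0, "InterfaceType": "efa"},
        ],
    },
)
```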

Figure 2 – You can clone compute node groups in the PCS console. This is helpful if you’re experimenting with a new idea or planning to create a new job queue which needs similar (but not identical) resources to your other queues.

Storage – The recipe also creates two filesystems for your data. User data (/home) lives on an Amazon Elastic File System (Amazon EFS) NFS v4 filesystem, which has great performance for this purpose. Scratch data can use Amazon FSx for Lustre (‘persistent’ type, mounted as /fsx). If the 1.2 TB sample Lustre filesystem isn’t enough for you, it’s easy to expand in the Amazon FSx console. You can also tweak the throughput and metadata IOPS while you’re there. Yes, that’s pretty cool.
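For instance, growing that scratch filesystem is a single API call – here’s a hypothetical boto3 version, where the filesystem ID is a placeholder and Lustre capacity grows in fixed increments:

```python
import boto3

fsx = boto3.client("fsx")

# Grow the sample scratch filesystem from 1.2 TB (1200 GiB) to 2400 GiB.
# The filesystem ID is a placeholder; find yours in the FSx console.
fsx.update_file_system(
    FileSystemId="fs-0123456789abcdef0",
    StorageCapacity=2400,
)
```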

SSH access – When the cluster comes up, you can choose to download the SSH key pair the recipe created for you, or you can connect to the login node using AWS Systems Manager, which offers you a terminal in your web browser right from the Amazon EC2 console.

AMD-powered HPC: industry standard x86 processors

The Try AMD recipe deploys a complete HPC environment built on AMD Milan processors – the current mainstream standard in high performance computing.

It has all the same architectural features as the Getting Started cluster, but differs in a few key aspects.

Like it says on the lid: this cluster is AMD-based, so the login node and all the compute nodes use AMD CPUs.

Next, it’s limited to using the us-east-2 Region in Ohio, since this is one of the Regions that provides access to the Hpc7a instance family.

This cluster has two Slurm partitions: small and large. The small partition sends jobs to nodes supplied by a c7a.xlarge node group. These are quite small (and very low-cost) instances, without Elastic Fabric Adapter (EFA) networking – great for kicking the tires while you get used to how PCS works. The large partition sends work to a node group that uses hpc7a.48xlarge instances. These have EFA built in (dual-rail, in fact, delivering 300 Gbit/s), and come with 96 cores and 768 GB of RAM. And as we discussed earlier, we provision instances in tight Cluster Placement Groups, so they’re physically close to each other.
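Once you’re logged in, a quick (hypothetical) smoke test of the two partitions might look like this sketch – it assumes you run it on the cluster’s login node, where the Slurm client tools live:

```python
import subprocess

# Submit a trivial job to each partition; Slurm will scale nodes up for us.
for partition in ("small", "large"):
    subprocess.run(
        ["sbatch", f"--partition={partition}", "--wrap", "hostname"],
        check=True,
    )

# Watch the queue: jobs on 'large' may pend while EC2 launches hpc7a nodes.
subprocess.run(["squeue"], check=True)
```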

The last thing to say is that Hpc7a instances require allow-listing to use, so if your AWS account isn’t already enabled for this EC2 family, you’ll probably need to raise a support ticket, or contact your AWS account manager or solutions architect. If you submit jobs to the large queue and don’t see any instances spinning up to run them, chances are you’re not allow-listed, or you’re exceeding a quota limit. Either way, the AWS teams we just mentioned will help you figure it out.

Computing with Arm64 on AWS Graviton

Arm-based processors are making big inroads into every corner of computing. In fact, in December at re:Invent, Dave Brown revealed that Graviton system deployments made up more than half of the growth of Amazon EC2 over the last two years.

The Try Graviton recipe gets you an Arm-based cluster built on serious, full-sized processors which offer an impressive combination of performance, cost, and energy efficiency. Instances using AWS Graviton cost around 40% less than comparable instances running the same workloads. They also use around 60% less energy, which helps thousands of customers with their carbon-reduction strategies.

Like the other two recipes we’ve just described, the cluster comes with all the same architectural features for storage, login nodes (a c7g.2xlarge in this case), and SSH access.

The compute node group supporting the Slurm queue uses hpc7g.16xlarge instances – these are built from 64-core Graviton3E processors and have 128 GB of RAM.

It also comes with Open MPI 5, which supports the Elastic Fabric Adapter (EFA), our custom-built network for HPC and ML applications. It’s the same fabric – and the same adapter – that we use across our whole Amazon EC2 fleet; it’s no different for Graviton.

Cleaning up

Pay some attention to the clean-up details in each recipe’s README.md file. This will ensure you don’t leave something behind that costs you money after you’ve finished experimenting.

Start in minutes

Ready to try these yourself? We’ve done everything we can to make it easy. Visit the HPC Recipes Library on GitHub and choose one of these recipes to get started. In about 20 minutes – yes, faster than most pizza deliveries – you can have an HPC environment up and running ready to do some serious work.

Each recipe is thoroughly documented and includes step-by-step deployment instructions. And they’re designed to be adopted, forked, and modified to quickly fit your needs, and the demands of your users.

The best part? You can experiment with different configurations and architectures without the usual learning curve and setup time. Just choose your recipe, deploy with AWS CloudFormation, and start computing.

Let us know how you get on.

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research and engineering community needed to bring about the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.