AWS HPC Blog

You told us we needed to re-think HPC in the cloud. So we did.

A few weeks ago we launched a new managed service called AWS Parallel Computing Service (AWS PCS) to make it easier for customers to run and scale high performance computing (HPC) workloads on AWS using Slurm.

For some time, customers have been telling us they wanted a cloud-native HPC solution that combines the flexibility and scalability of AWS with the familiar tools and workflows of traditional HPC environments. And they wanted it to be a managed service that lifted from them the burden of owning the systems integration involved in assembling the huge number of complex parts that make up an HPC cluster.

Today, we want to walk you through what we’ve done, why it matters, and how we think about solving for the kinds of workloads you bring to us. We’ll cover a little extra ground, too, because we’re conscious that this announcement has generated a high degree of interest. And that means many readers might be looking at the cloud for the first time, or perhaps coming back to hear about what’s changed.

A cluster is a special kind of pattern

AWS PCS abstracts and deconstructs a traditional HPC cluster into cloud-native primitives, reimagining the concept of a cluster.

The cluster control plane, including the job scheduler, runs in an AWS-managed environment. This lets us handle updates, scaling, and management of the scheduler and control plane – significantly reducing the undifferentiated heavy lifting for customers.

Meanwhile, that control plane provisions compute resources in your account, giving your users direct access to – and control over – their compute instances. In HPC this is essential, because it’s the fine-grained control over how processes launch on compute nodes that lets end-users extract amazing performance.

The key components of PCS should be quite familiar:

  • Clusters: the overall logical grouping of resources, providing a unified view of the HPC environment
  • Node groups: sets of Amazon Elastic Compute Cloud (Amazon EC2) instances with defined scaling rules, allowing for flexible and heterogeneous compute environments
  • Queues: job queues that map to compute resources supplied by node groups, enabling sophisticated job scheduling and resource allocation strategies

This base-level abstraction allows for a high degree of customization and optimization, and lets you build really complex (and very complete) environments when you match it with other services from the compute, networking, and storage teams in AWS.

You can create multiple node groups with different instance types, AMIs, purchasing options (On-Demand, Spot, Reserved Instances), and scaling policies. Then you can map these node groups to support different queues – creating a tailored environment for various workload types and user groups.
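To make the shape of those building blocks concrete, here’s a minimal sketch using the boto3 pcs client. Every name, ID, and ARN is a hypothetical placeholder, and the parameter shapes are our best reading of the PCS API at launch rather than a definitive reference – check the API documentation before relying on them.

```python
# Sketch only: assembling the three PCS building blocks with the boto3 "pcs"
# client. All names, IDs, and ARNs below are hypothetical placeholders, and the
# parameter shapes are assumptions based on the PCS API at launch.
import boto3

pcs = boto3.client("pcs")

# 1. The cluster: the managed Slurm control plane.
pcs.create_cluster(
    clusterName="demo",
    scheduler={"type": "SLURM", "version": "23.11"},
    size="SMALL",
    networking={
        "subnetIds": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
    },
)
# Wait for the cluster to become active (for example by polling get_cluster)
# before creating node groups against it.

# 2. A node group: EC2 instances plus the rules for scaling them.
node_group = pcs.create_compute_node_group(
    clusterIdentifier="demo",
    computeNodeGroupName="cpu-nodes",
    subnetIds=["subnet-0123456789abcdef0"],
    customLaunchTemplate={"id": "lt-0123456789abcdef0", "version": "1"},
    iamInstanceProfileArn="arn:aws:iam::111122223333:instance-profile/pcs-nodes",
    scalingConfiguration={"minInstanceCount": 0, "maxInstanceCount": 16},
    instanceConfigs=[{"instanceType": "hpc7a.96xlarge"}],
)

# 3. A queue that maps submitted jobs onto that node group. The response shape
#    used here (computeNodeGroup -> id) is also an assumption.
pcs.create_queue(
    clusterIdentifier="demo",
    queueName="compute",
    computeNodeGroupConfigurations=[
        {"computeNodeGroupId": node_group["computeNodeGroup"]["id"]}
    ],
)
```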

There’s also no head node. While we’ve abstracted this away (it remains alive in the control plane), you can still connect to the cluster in a traditional way using login nodes. But – as members of just another node group – login nodes can come in any variety. You might choose to have a pilot-light login node running on the tiniest instances imaginable, just so you can submit jobs or check on their progress using scripts from your favorite tool bag. Or you might elect to establish an on-demand set of login nodes powered by serious graphics GPUs so you can crunch and visualize petabytes of data.
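As an example of the kind of script you might keep in that tool bag, here’s a small, hypothetical Python helper that shells out to the standard Slurm commands from a login node – nothing in it is PCS-specific.

```python
# A small, hypothetical helper for a login node: submit a batch script and
# list the current user's jobs using the standard Slurm command line.
import getpass
import subprocess


def submit(job_script: str) -> str:
    """Submit a batch script with sbatch and return Slurm's response line."""
    out = subprocess.run(["sbatch", job_script],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # e.g. "Submitted batch job 1234"


def my_jobs() -> list[list[str]]:
    """Return [job_id, name, state, elapsed] for each of the user's jobs."""
    out = subprocess.run(
        ["squeue", "-u", getpass.getuser(), "-h", "-o", "%i %j %T %M"],
        capture_output=True, text=True, check=True)
    return [line.split(None, 3) for line in out.stdout.splitlines()]


if __name__ == "__main__":
    print(submit("solve.sbatch"))  # "solve.sbatch" is a placeholder script
    for job in my_jobs():
        print(job)
```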

PCS is not alone

PCS deeply integrates with other AWS services too, which means you can use the broader AWS environment to fill out a comprehensive HPC solution.

PCS works with various storage options including Amazon Elastic File System (Amazon EFS) for home directories and general-purpose storage, Amazon FSx for Lustre for high-performance scratch space, and Amazon Simple Storage Service (Amazon S3) for durable, long-term storage. These services, in turn, integrate with data movement options like Amazon File Cache or AWS DataSync. The combination of just these five services will let you integrate with your on-premises data back at home base, and create a linkage between your data movement and your jobs that reflects everything that’s unique about your workloads.
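To make that linkage concrete, here’s a minimal sketch of a job step that pulls an input set from S3 onto FSx for Lustre scratch and copies results back afterwards. The bucket name, object keys, and the /fsx mount point are hypothetical placeholders.

```python
#!/usr/bin/env python3
# Minimal sketch of a job step that stages data between S3 and scratch.
# The bucket, keys, and /fsx mount point below are hypothetical placeholders.
import os
import subprocess

import boto3

BUCKET = "my-research-data"       # hypothetical bucket
SCRATCH = "/fsx/scratch/run-42"   # hypothetical scratch directory on FSx for Lustre

s3 = boto3.client("s3")
os.makedirs(SCRATCH, exist_ok=True)

# Pull the input deck from durable storage onto fast scratch.
s3.download_file(BUCKET, "inputs/case.tar.gz", f"{SCRATCH}/case.tar.gz")
subprocess.run(["tar", "xzf", f"{SCRATCH}/case.tar.gz", "-C", SCRATCH], check=True)

# ... run the solver against SCRATCH here ...

# Push results back to S3 for long-term keeping; scratch can then be recycled.
s3.upload_file(f"{SCRATCH}/results.h5", BUCKET, "results/run-42/results.h5")
```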

Amazon VPC provides the high-level networking structure, and we use Elastic Fabric Adapter (EFA) to deliver low-latency, high-throughput inter-node communications. This is critical for tightly-coupled codes and bandwidth-sensitive I/O. And because EFA presents as a libfabric provider, you can use things like Open MPI, Intel MPI, and NCCL out of the box, without changing any code.
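As a tiny illustration of “out of the box”, the mpi4py sketch below performs an allreduce across ranks. Run with an MPI library built against libfabric on EFA-enabled instances, the very same code travels over EFA with no source changes.

```python
#!/usr/bin/env python3
# Minimal mpi4py sketch: an allreduce across all ranks. When launched with an
# MPI library built against libfabric (e.g. Open MPI) on EFA-enabled instances,
# the traffic rides the EFA provider without any change to this code.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes its rank number; the total should be size*(size-1)/2.
total = comm.allreduce(rank, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks, allreduce total = {total}")
```

Inside a batch job you’d launch it the usual way, for example with srun or mpirun.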

Identity management can be satisfied in several ways – most directly with AWS Identity and Access Management (IAM). But you could also choose to use a managed Active Directory service, or connect through VPNs or private network connections back to on-premises systems to bind with the rest of your IT infrastructure. That can make logging into the cluster completely transparent.

Compute

Compute is probably the easiest conversation to have regarding PCS. It supports the full range of Amazon EC2 instance types – including the latest CPU and GPU options. This means you can choose the most cost-effective and performant instances for each specific workload if you want. That could be CPU-based instances for simulation workloads, GPU-accelerated instances for machine learning and visualization, or high-memory instances for data analytics.

There are (currently) more than 800 different instance types in EC2, so you’re not lacking for variety. But on the CPU front: we’ve tried to make the decision easier with a group of specifically HPC-focused instances which are the best performing and lowest cost you’ll find in the cloud for (probably) most of your workloads (and, conveniently, we called them Hpc6a, Hpc7a, Hpc6id, and Hpc7g so you don’t have to look very hard).

Open, familiar, and straightforward

PCS aims to provide a familiar environment for HPC users while still being able to benefit from cloudy things like elasticity and managed infrastructure.

You can bring your existing Slurm-compatible applications and job scripts over to PCS with few – if any – changes, which should make it much easier to kick the tires of the cloud. At the same time, you’ll immediately have access to dynamic scaling for your cluster based on workload demand. In many cases, this can reduce your costs. But at a bare minimum, we hope your users will enjoy waiting around less for their jobs to start (and finish) – probably the most profound lever you can pull for improving innovation in your company.
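For a sense of how little typically changes, here’s a minimal, hypothetical batch script of the sort that tends to port as-is. Slurm reads the #SBATCH directives whatever the interpreter, so often the only thing to touch is a site-specific detail like the partition name (here an assumed queue called compute).

```python
#!/usr/bin/env python3
#SBATCH --job-name=hello-pcs
#SBATCH --partition=compute        # hypothetical partition/queue name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#
# Submitted with `sbatch hello.py`, exactly as it would be on an
# on-premises Slurm cluster.
import os
import socket

print(f"Job {os.environ.get('SLURM_JOB_ID')} running on {socket.gethostname()}")
```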

The supported operating systems are the ones you know well already – RHEL, Ubuntu, Rocky – and of course our own Amazon Linux. These deliver the same experience you’re used to virtually anywhere else.

Your code runs in standards-based environments, and if you’re using MPI or NCCL – you just keep doing that. Under the covers, EFA takes care of moving torrents of packets quickly and with a reliability that means you won’t be able to tell that you’re not running on InfiniBand. You don’t have to believe us on this score, either. In Patel et al, one of our customers found that “… we see equally good parallel efficiency (strong and weak scaling) using Ethernet-based Elastic Fabric Adapter (EFA) interconnect versus Infiniband”.

More than a hundred of those 800 Amazon EC2 instance types we mentioned come with an EFA interface. You don’t get that kind of variety – which the cloud is famous for – by plumbing a single-purpose interconnect into some parts of your data centers just for the HPC people to use – that just creates islands which, over time, become fragmented from one another. EFA uses the cloud’s properties (large scale, massively redundant pathways) as the lever for delivering great performance scaling, rather than treating them as a curse. That’s why we’re so committed to EFA, and to making sure it’s widely supported.

We’ll keep putting in the effort to make Spack work well with PCS, too. Spack gives you an easy installation path for more than 8,000 packages. While most of these are open-source, a good number of them (like ACfL from Arm, or the NVIDIA HPC SDK) are commercially-offered packages that you’re probably using in your development environments already.

Manage a cluster like a cloud

AWS has been a separate business from Amazon for over 18 years now. What we’ve learned over these many years is that it takes a number of tools to satisfy customers’ needs for managing their environments. Every customer – and sometimes every user – has their own idea of what success is, and they measure inputs and outputs very differently from one another.

Internally, we like to say that there’s no compression algorithm for experience. So in PCS we’re gradually building out an approach that brings what we’ve learned from helping millions of different customers trade management problems for tools.

PCS automates updates and patching for you. Not just so you can be free of bugs or performance regressions – though you can – but also to improve your security posture. That’s a non-trivial lift in regulated environments where we very often find HPC.

PCS also just works with dozens of AWS services for automation, monitoring, logging, and cost management. That means you can put budget alerts on node groups to keep track of the consumption going on in your R&D groups, without needing to crunch numbers manually. You can even have important budget milestones sent to your phone as push notifications, or to Slack channels where your ops teams live.
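As one hedged illustration, the sketch below uses boto3 to create a monthly cost budget that alerts at 80% of the limit via an SNS topic (which can fan out to push notifications or a Slack webhook). The account ID, amount, and topic ARN are placeholders, and scoping the budget to a particular node group would normally be done with cost-allocation tags on its instances.

```python
# Sketch: a monthly cost budget that raises an alert at 80% of spend, sent to
# an SNS topic. Account ID, amount, and topic ARN are hypothetical placeholders.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="111122223333",
    Budget={
        "BudgetName": "cfd-node-group-monthly",
        "BudgetLimit": {"Amount": "5000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # To track a single node group, filter on a cost-allocation tag
        # applied to its instances.
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {
                    "SubscriptionType": "SNS",
                    "Address": "arn:aws:sns:us-east-1:111122223333:hpc-budget-alerts",
                }
            ],
        }
    ],
)
```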

This is an area you can expect to see quite a bit of movement on as PCS evolves – because we’ll be gaining experience as we go and figuring out how to solve gritty management problems to save you time and stress. But this is also where we’ll benefit the most from feedback you can give us.

Solving for … everyone

PCS, by its nature, is flexible enough to deploy clusters purpose-built for any workload. We’ve worked backwards from auto makers doing CFD, research labs working on drug discovery, and partners building all-in research solutions for higher education.

Coming soon, PCS will support AWS CloudFormation, which means you’ll begin finding full-stack deployments for different use-cases packaged up as recipes you can launch, without needing to understand the fine-grained details. We’re excited for that aspect of PCS because it’ll make the abstraction model even more powerful by mapping different problems to different solutions, and make those solutions more easily shareable. We’ll be kicking that off with our HPC Recipes Library when the time comes.

Using infrastructure as code techniques distills the pattern of your cluster and its interconnections with other services and facilities, and expresses it in code – code that’s reviewable, auditable, and subject to version control. There’s no better way to focus laser-like on the security aspects of your design and then relentlessly iterate and tighten the screws (just a little more) when you become aware of a change in the environment you operate in.

HPC admins can use PCS right now to simplify cluster management, reduce operational overhead, and better meet their compliance or security goals.

The small amount of time it takes to walk through our getting started tutorial to stand up a fully functioning cluster is nothing compared to the 12-18 months it takes to plan, procure, and deploy a traditional, on-premises cluster in your own data center.

For similar reasons, we’re already hearing from ISV partners that PCS solves a complexity management problem for them. Nathan Albrighton, the CEO of Ronin, told us that PCS “… greatly simplifies our ability to build and operate HPC environments using APIs and elevates the HPC capabilities we offer to our customers”. They can now afford to file away the infrastructure management piece from their own stacks and leave that to PCS.

This leaves ISVs and partners free to focus on what they do best – whether that’s incredible drug design platforms or best-in-class computer-aided engineering tools – so they can offer them to wider audiences beyond their traditional base. Or just make life easier for their customers, which is the larger goal we all have.

Science, not servers

In the end, it’s really all about the science and engineering teams who are solving the world’s hardest problems. The humans who work in these jobs are truly not very scalable. Training them takes decades, even before they dive deeper and specialize for another decade or two. We know we’re not going to make the breakthrough discoveries we need soon enough by waiting for the next crop to put away their skateboards and finish school.

That’s why PCS has a very big mission – to make those scientists and engineers using HPC in the cloud the most productive people in their fields. As a community, we’ve almost never solved that by making IT more complex, or by squeezing ever more users into a fixed-size, limited cluster … and expecting them to solve the contention problems themselves.

Letting these rare species experiment more freely and iterate at the pace of their own ideas is what allowed the scientific community to meet the very sudden and pressing need for a new vaccine during the Covid-19 pandemic. This, for us, crystallized the benefit that cloud computing can bring to research.

HPC overall still needs more streamlining. Codes are still generally hard to deploy, and require a lot of specialization to use. Fixing that isn’t a simple task (or one that will probably even be truly “done”). But by relentlessly taking on the uninteresting maintenance and management chores under the hood, we’re freeing up the practitioners and experts to chip away at that rock, too. We call it “working backwards” – it’s key to our design process. While we’re mentioning that, we’re truly thankful to the many customers who took the time to beta test PCS and be part of the working backwards journey. They inspire us – and they’ve guided us.

So if you try PCS, let us know about your use case. We’re not just eager to see what you create. We’re even more eager to hear your feedback – good or bad – so we can make AWS Parallel Computing Service an even better solution for everyone in this community – so they can prosecute their own mission to solve pressing, urgent, and difficult problems. Get in touch with us.

Matt Vaughn

Matt Vaughn is a Principal Developer Advocate for HPC and scientific computing. He has a background in life sciences and building user-friendly HPC and cloud systems for long-tail users. When not in front of his laptop, he’s drawing, reading, traveling the world, or playing with the nearest dog.

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring on the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.

Nikhil Tahalramani

Nikhil Tahalramani is a Senior Product Manager with a passion for making HPC feel like a walk in the park—minus the bugs. When he's not orchestrating smooth HPC experiences on AWS, you can find him smashing tennis balls with questionable precision, reading non-fiction, and enjoying waterfront views.

Tarun Mathur

Tarun Mathur is a Senior Product Manager covering HPC and scientific computing. His goal is for customers running HPC workloads to have a smooth orchestration experience on AWS. Outside work, he enjoys Brazilian jiu-jitsu, mountaineering, and searching for the best NYC rooftop bar.