AWS HPC Blog

New: Introducing AWS ParallelCluster 3

Running HPC workloads, like computational fluid dynamics (CFD), molecular dynamics, or weather forecasting, typically involves a lot of moving parts. You need hundreds or thousands of compute cores, a job scheduler to keep them fed, a shared file system that’s tuned for throughput or IOPS (or both), loads of libraries, a fast network, and a head node to make sense of all this. These are just the table stakes, too, because when you move to the cloud, you’re expecting to do more ambitious things – most likely because you’re a researcher with a problem to solve and a lab full of colleagues waiting for the answer.

Since 2018, AWS ParallelCluster has simplified the orchestration of HPC environments and helped researchers and engineers tackle some of the most ambitious problems facing the world today. Watching customers discover what “infrastructure as code” means in the context of HPC has really propelled us to find new ways to delight them. When a single shell command can create a complex thing like an HPC cluster, and a Lustre file system, and a visualization studio, it leads to more people trying cloud than ever before, and they’re asking us for new functionality.

So today we’re announcing AWS ParallelCluster 3. Customers, systems integrators, and other builders have told us they want to build end-to-end “recipes” for HPC, spanning the whole gamut from infrastructure to middleware, libraries, and runtime codes. They also explained to us their need for an API-like interface so they can interact with ParallelCluster programmatically to create interfaces and services for their users. As we’re known for doing, we worked backwards from this feedback, using thousands of conversations with customers to create what we’re showing you today.

There are a lot of changes you’ll notice – large and small. Here are some highlights before we dive deeper later in this post:

  • A new flexible AWS ParallelCluster API – This simplifies building solutions and interfaces on top of ParallelCluster, or including your cluster’s lifecycle as part of a pipeline. We’ve also changed the CLI to match, so scripted or event-driven workflows are easy.
  • Build custom AMIs with EC2 Image Builder – Support for custom AMIs in ParallelCluster has grown from a feature in 2018 into a mainstream process now. With the introduction of EC2 Image Builder, we now have a way to automate this process without anyone needing to invent the automation. This will make clusters using custom AMIs faster to scale because it front-loads the image creation stage. It’ll improve reliability too, and you’ll find it easier to stay patched and even harder to mess up your security posture.
  • A new configuration file format – ParallelCluster configurations now use a YAML format, and each one defines just one cluster. Along with several other changes, we think this will make it easier to keep your cluster configurations organized and readable.
  • Simplified network configuration options – we’ve streamlined support for networking to enable the use of private, pre-existing Route 53 zones and provided some more flexibility for how we use Elastic IPs.
  • Finer-grained IAM permissions – we’ve changed how we do permissions. We let you specify an IAM role or an Instance Profile, and we let you do that separately for the head node and the compute nodes. We also support IAM permissions boundaries at creation time for organizations that require specific limits when roles are applied.
  • Runtime customization scripts – you can now tweak the pre- and post-install scripts separately for the compute nodes on a live running cluster, and they’ll get updated when you issue the ‘pcluster update’ command.

These features simplify initial cluster setup and ensure easier organization and reproducibility of clusters, saving customers time as they build out custom environments.

A change to some current features

Back in June of 2020, we announced the deprecation of support for the Son of Grid Engine (SGE) and Torque job schedulers. We also added clear warnings to the scheduler configuration options in our documentation that while you can still choose SGE or Torque, it’s probably not a good idea, as the official date for deprecation of support would be December 31, 2021.

We took this decision a year ago because the open-source software (OSS) repositories for these two projects had no community updates for several years. This makes them higher risk as vectors for attack because “no updates” also means “no patches” for vulnerabilities that are discovered. With every ParallelCluster 2.x release we’ve worked harder (and harder) to tighten the net around these packages to ensure we meet your expectations of AWS in the shared responsibility model. But with ParallelCluster 3, we’re shifting to only directly supporting schedulers with their own viable support models.

Any clusters created with SGE and Torque as their configured scheduler won’t stop working, and ParallelCluster 2.11 patch releases will continue to include these components until December 31, 2021. However, AWS will not support customers with issues related to SGE and Torque past December 31, 2021.

Finally, with the announced end-of-life for CentOS 8 on December 31, 2021, we are also dropping support for this operating system. Similar to the schedulers, ParallelCluster 2.11 patch releases after that date will no longer include support for CentOS 8.

ParallelCluster Support Policy

With our release of ParallelCluster 2.11.0 in June we changed our support policy, and with ParallelCluster 3 it’s worth restating what that is and how it affects you.

ParallelCluster 2.11 is the last minor release of the 2.x series and is feature-stable. It will continue to receive bug and critical security fixes in the form of patch releases (2.11.x) until December 31, 2022. These patch releases will be provided every six months, unless a critical fix requires a more immediate response. While we aim to provide consistency and predictability in our feature-stable releases, note that this does not extend to product components that reach an end-of-life state. For example, in the case of SGE, Torque, and the CentOS 8 operating system, these features are all marked for end-of-life on December 31, 2021, and will not be included in 2.11.x releases beyond that date.

ParallelCluster 3.0 is eligible for support until March 31, 2023, which is 18 months from today.

ParallelCluster will continue to be enhanced with new features, which will be included in minor releases (e.g., 3.1, 3.2 …). Each of these will extend the support window by 18 months from its release date. This is subject to the following:

  • Bugs and security issues will be addressed in a minor release (e.g., 3.1) unless severity requires a more immediate patch release (e.g., 3.1.1).
  • To receive bug and security fixes you must upgrade to a minor or patch release in which these fixes are provided.
  • To receive feature enhancements, you need to upgrade to the most recent version of ParallelCluster 3.

More details are on our support policy page in ParallelCluster’s documentation.

We encourage you to upgrade to ParallelCluster 3 so you’re fully supported before ParallelCluster 2 reaches end-of-life next year. However, existing AMIs, cookbooks, and PyPI artifacts from previous versions will be available indefinitely.

Updated Configuration Options

ParallelCluster configurations now use a YAML format, and each configuration file contains the definition for just one cluster. Cluster definitions will also now use the “multiple-queues” style of syntax, which we first introduced in ParallelCluster 2.9.0, even if you’re only starting with one compute queue. We think these changes will make it easier to keep your cluster configurations organized and readable. Here’s an example (with the new format on the right):

[Figure: side-by-side comparison of a multi-queue cluster configuration in the ParallelCluster 2 INI format and the ParallelCluster 3 YAML format]
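In text form, a minimal cluster definition in the new YAML format looks roughly like this; the instance types, subnet ID, and key pair name below are illustrative placeholders rather than defaults:

```yaml
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0      # placeholder subnet
  Ssh:
    KeyName: my-key-pair                    # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: compute
          InstanceType: c5.xlarge
          MinCount: 0                       # scale down to zero when idle
          MaxCount: 10                      # upper limit for this compute resource
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
```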

Some configuration settings are no longer available:

extra_json, additional_cfn_template, template_url, custom_chef_cookbook, custom_node_package, s3_read_resource, s3_write_resource, initial_count, compute_subnet_cidr

Whilst some others have changed names:

  • min_queue_size is now MinCount
  • max_queue_size is now MaxCount

Across the entire product, we’ve incorporated inclusive language, so we no longer refer to a ‘master node’, but instead to a ‘head node’ (and that extends to names for environment variables like MASTER_IP, which is now PCLUSTER_HEAD_NODE_IP).

Custom image building

AWS ParallelCluster 3 offers a new streamlined AMI creation and management feature built on top of EC2 Image Builder, using the new ‘pcluster build-image’ command. This allows you to specify sets of image build components that are layered on top of ParallelCluster-provided AMIs or your own images to create pipelines for building those custom AMIs. In fact, it’s possible to manage the whole lifecycle of images, with list-images, describe-image, and delete-image commands. And you can see the official ParallelCluster images included in our releases using list-official-images.
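As a sketch, a typical image lifecycle from the command line looks something like this; the image ID and configuration file name are made-up placeholders:

```bash
# Build a custom AMI from a ParallelCluster image configuration (an EC2 Image Builder
# pipeline is created and run for you behind the scenes)
pcluster build-image --image-id my-custom-image --image-configuration image.yaml

# Track the build and inspect the finished image
pcluster list-images --image-status AVAILABLE
pcluster describe-image --image-id my-custom-image

# See which AMIs ship with this ParallelCluster release, and clean up your own
pcluster list-official-images
pcluster delete-image --image-id my-custom-image
```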

With these new features you can create your own build components or use build components shared and published by others – inside or outside your organization. These components can then be reused, modified, and mixed to suit different workloads or to create new AMIs compatible with different releases of ParallelCluster. An ISV or open-source group could publish images with pre-installed applications, or a central IT organization might publish images with specific security settings.

You can also elect to automatically enable package updates or operating system security updates to keep AMIs up-to-date and secure. For customers with long running cluster configurations, this seemingly small change will remove a lot of unnecessary work and worry.

Additionally, ParallelCluster now offers a more flexible approach to running pre- and post-install scripts around the main bootstrap action when you create your clusters. You can specify different custom bootstrap action scripts for the head node and the compute nodes using the OnNodeStart and OnNodeConfigured parameters in the HeadNode and Scheduling sections.
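For example, a cluster configuration could wire up different scripts for the head node and a compute queue along these lines; the S3 paths are placeholders:

```yaml
HeadNode:
  CustomActions:
    OnNodeStart:                    # runs before the main bootstrap action
      Script: s3://my-bucket/head-pre-install.sh
      Args:
        - arg1
    OnNodeConfigured:               # runs once the node has been configured
      Script: s3://my-bucket/head-post-install.sh
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      CustomActions:
        OnNodeStart:
          Script: s3://my-bucket/compute-pre-install.sh
        OnNodeConfigured:
          Script: s3://my-bucket/compute-post-install.sh
```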

Finally, AMIs created inside of ParallelCluster are automatically tagged to help you organize all custom images created.

ParallelCluster’s API and CLI just got more interesting

ParallelCluster 3 comes with a new API so you can create more complex, scripted workflows. You can now manage and deploy clusters through HTTP endpoints with Amazon API Gateway. This opens new possibilities for scripted or event-driven workflows, like creating a new cluster when a dataset is ingested or automating a reaction for those moments when your existing infrastructure won’t meet some sudden needs.

The API also makes it easier for builders like ISVs, Systems Integrators or HPC admins to use AWS ParallelCluster as a building block in their HPC workflows and create their own extensible solutions, services, and customized front-end interfaces.

ParallelCluster’s command line interface (CLI), ‘pcluster’, has also been redesigned for compatibility with this API, and includes a new JSON-output option. That means you can write code that parses CLI responses precisely. You can combine this with AWS CloudShell to push even the automation scripts into the cloud. All of this makes it easier for customers to implement familiar building-block steps using the CLI or the API.
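As a rough sketch of the kind of scripting this enables (the cluster name is a placeholder, and you should check the CLI reference for the exact response fields):

```bash
# Create a cluster from a YAML definition; the CLI returns JSON describing the request
pcluster create-cluster \
  --cluster-name demo-cluster \
  --cluster-configuration cluster.yaml

# Poll the cluster status and branch on it from a script
status=$(pcluster describe-cluster --cluster-name demo-cluster | jq -r '.clusterStatus')
if [ "$status" = "CREATE_COMPLETE" ]; then
  echo "demo-cluster is ready for jobs"
fi
```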

Simplified networking configuration

ParallelCluster 3 gives you more freedom in the networking configuration of your clusters.

If your organizational policy doesn’t allow the creation of a Route 53 zone, you can now provide an existing private Route 53 zone (created outside of ParallelCluster) for ParallelCluster to use.

Instead of ParallelCluster creating a new Elastic IP address for your head node, you can associate an existing Elastic IP address, which means you can keep the same IP address assigned to your head node even after you delete the cluster.
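A sketch of how both options appear in the cluster configuration; the hosted zone ID and Elastic IP below are placeholders:

```yaml
HeadNode:
  Networking:
    SubnetId: subnet-0123456789abcdef0
    ElasticIp: 203.0.113.10                # an existing Elastic IP to associate (placeholder)
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    Dns:
      HostedZoneId: Z0123456789EXAMPLE     # an existing private Route 53 zone (placeholder)
```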

Finally, we’ve added expanded options around automated subnet creation by adding a prompt for Availability Zone choice in pcluster configure. This should make it easier to build configurations that are more portable across Availability Zones, so you can take advantage of other services or resources (or even better EC2 Spot pricing).

IAM, Permissions and Naming Conventions

Previous versions of ParallelCluster spun up clusters with default Instance Profiles and our own CloudFormation naming conventions for clusters. ParallelCluster 3 allows you to use an existing Instance Profile for cluster creation and Image Builder. In the same vein, you can now use a different IAM role for the head node than for the compute nodes, which is useful for certain jobs (in a specific Slurm partition, for example) that require restricted IAM permissions.
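A sketch of what that separation looks like in the configuration; the role and instance profile ARNs are placeholders:

```yaml
HeadNode:
  Iam:
    InstanceRole: arn:aws:iam::111122223333:role/HeadNodeRole   # placeholder role
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: restricted-queue
      Iam:
        InstanceProfile: arn:aws:iam::111122223333:instance-profile/ComputeProfile   # placeholder profile
```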

We’ve also removed the ‘parallelcluster-‘ prefix from the CloudFormation stacks we create for you which means more control in the naming conventions at your site.  If your organization also adheres to IAM permission boundaries, we’re happy to report that ParallelCluster 3 allows you to pass the ARN of the IAM policy you would like to use as the permissions boundary for all roles created by ParallelCluster.

Getting started with ParallelCluster 3

As always, AWS ParallelCluster is available at no additional cost, and you pay only for the AWS resources needed to run your applications. The easiest way to get started is to download ParallelCluster 3 on the platform of your choice and try it out with your workload.
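One common route, assuming a Python environment with pip available, is to install the CLI from PyPI and then walk through an interactive configuration:

```bash
# Install the ParallelCluster 3 CLI into a virtual environment
python3 -m venv ~/pcluster-env
source ~/pcluster-env/bin/activate
pip install "aws-parallelcluster"

# Confirm the version, then generate a cluster configuration interactively
pcluster version
pcluster configure --config cluster.yaml
```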

To make that easier, we’ve built a hands-on, step-by-step workshop which takes you through several ways to launch it, including with AWS CloudShell and AWS Cloud9. You can also see a video of a discussion with Nathan Stornetta, Senior Product Manager for ParallelCluster, on our HPC Tech Shorts channel. Detailed documentation for AWS ParallelCluster is located on our documentation pages, which now cover both ParallelCluster 2 and ParallelCluster 3.

Finally, keep an eye out for more supporting material and information on new AWS ParallelCluster 3 features on this blog channel, our HPC Tech Shorts, and our workshops site as they become available.

We’re excited about this launch and keen to know what discoveries it enables. Don’t hesitate to tell us what you like, and help us understand how to make ParallelCluster even better by giving us feedback at our GitHub repository.

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring on the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.

Rye Robinson

Rye Robinson is a Global Solutions Architect at Amazon Web Services, specializing in high performance computing, primarily in the Life Sciences industry. He has experience with HPC, cluster computing, and distributed file systems. He loves to help customers leverage new technologies to solve their problems and is an evangelist for HPC as well as Life Sciences.