AWS HPC Blog

Minimize HPC compute costs with all-or-nothing instance launching

Matt Vaughn, Principal Developer Advocate, AWS HPC Developer Relations
Sean Smith, HPC Senior Specialist Solution Architect, AWS WWS

Dynamic autoscaling is one of the best features of AWS ParallelCluster, enabling HPC cluster size and cost to adapt to the volume of work at hand.

The Slurm job scheduler in ParallelCluster can be configured to scale from zero up to a maximum number of Amazon Elastic Compute Cloud (Amazon EC2) instances, or nodes, based on how many jobs are in the queue.

However, there is a case where this default behavior can lead to unwanted costs.

In the normal course of operation, nodes are automatically shut down when the backlog of jobs is completed. However, when you launch Slurm multi-node jobs (sbatch jobs), some instances might launch and then sit idle. This happens because they're waiting on the exact number of instances required to run the job, and the remainder can't launch because there isn't enough overall capacity available at the time.

This blog post shows you how to configure ParallelCluster to use an all-or-nothing instance launch strategy, which means no instances in a multi-node job will launch unless all the instances can launch together. This helps to prevent idle EC2 resources, and the costs associated with them.

Overview

To explore this issue in detail, let us assume a ParallelCluster using the Slurm scheduler. It's configured with a queue named c6i that manages up to 192 x c6i.32xlarge instances. If you submit a series of 24 single-node jobs to this queue, Slurm will launch new on-demand instances and run jobs on them. This will happen as long as there is enough EC2 capacity for that instance type and you are within your AWS Service Quota.
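
A queue like this might be declared in the cluster configuration roughly as follows. This is a minimal sketch: the compute resource name and the subnet ID are placeholders, not values from this post.

Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: c6i
      ComputeResources:
        - Name: c6i-32xlarge           # placeholder name
          InstanceType: c6i.32xlarge
          MinCount: 0                  # scale to zero when the queue is idle
          MaxCount: 192                # upper bound for this queue
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0   # placeholder

With that queue in place, submitting the 24 single-node jobs looks like this: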

for N in {1..24}
do
    sbatch -N 1 -p c6i job_script.sh
done
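
While those jobs run, you can watch the queue drain and the partition scale with standard Slurm commands (shown here purely for illustration):

$ squeue -p c6i    # jobs pending and running in the c6i partition
$ sinfo -p c6i     # node states for the c6i partition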

Importantly, if all 24 instances can’t launch, Slurm will reuse nodes that do launch to complete the work. Compute charges will only accrue for instances that launch. Contrast this to the case below where a single job needs 24 nodes to even start running.

$ sbatch -N 24 -p c6i job_script.sh

When you submit a large multi-node job like this, Slurm will begin provisioning nodes, withholding work from them until they’re all ready. If there’s sufficient capacity, and you’re within your service quota, all 24 instances will launch (more or less) at the same time.

Otherwise, instances will launch incrementally as they become available, which may happen only gradually if the constraint is EC2 capacity at that moment. If the constraint is your service quota, the remaining instances won't launch at all until you are back below your quota. Meanwhile, instances that do launch will sit idle and incur charges. This happens because ParallelCluster configures Slurm to use a best-effort cloud scaling strategy by default (see Figure 1A).

Figure 1: Adopting all-or-nothing instance scaling helps avoid job delays and unexpected EC2 charges.

To prevent this situation, you can configure Slurm to use an all-or-nothing batch launching strategy (Figure 1B). Behind the scenes, instead of asking EC2 to launch up to 24 instances at once, Slurm will request exactly 24 instances. If there isn't enough capacity, or the request would exceed your service quota, the launch request will fail, which Slurm interprets as nodes failing to power up. The job will then fail to start, which can be remedied by resubmitting it after securing additional capacity, changing the job configuration, or choosing a more appropriate queue. Meanwhile, no compute resources sit idle.
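
Conceptually, this is the difference between a flexible and a fixed instance count in an EC2 launch request. The AWS CLI calls below are only an illustration of those two request shapes, not what ParallelCluster runs internally, and the AMI ID is a placeholder:

# Best-effort: launch anywhere between 1 and 24 instances
$ aws ec2 run-instances --image-id ami-0123456789abcdef0 \
    --instance-type c6i.32xlarge --count 1:24

# All-or-nothing: launch exactly 24 instances, or fail the whole request
$ aws ec2 run-instances --image-id ami-0123456789abcdef0 \
    --instance-type c6i.32xlarge --count 24:24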

Configuring ParallelCluster

Currently, the all-or-nothing setting is not directly exposed, but there is a way to enable it. To do that, you’ll need to edit the Slurm ResumeProgram configuration file. You can do this by logging into the cluster head node and editing it directly.

$ pcluster ssh -n CLUSTER_NAME
$ sudo su -
# echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf

This will work fine on a running cluster, but you can (and should) take the opportunity to capture the change as infrastructure as code, so the setting persists between cluster launches.

First, create a shell script called all_or_nothing_batch.sh containing this:

#!/bin/bash
echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf

Upload it to an Amazon Simple Storage Service (Amazon S3) bucket, say “PCLUSTER_S3_BUCKET”, that's accessible to your ParallelCluster.

$ aws s3 cp all_or_nothing_batch.sh s3://PCLUSTER_S3_BUCKET/
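
The head node needs read access to that bucket to fetch the script at boot time. One way to grant it (an illustration, assuming you manage permissions through the cluster configuration rather than an existing instance role) is the HeadNode Iam S3Access setting:

HeadNode:
  [...]
  Iam:
    S3Access:
      - BucketName: PCLUSTER_S3_BUCKET
        EnableWriteAccess: false   # read-only access is enough for the script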

Now, you can take advantage of ParallelCluster’s CustomActions configuration, which lets you define scripts to run during the node provisioning lifecycle. Add a CustomActions stanza to your cluster’s HeadNode configuration like this:

HeadNode:
  [...]
  CustomActions:
    OnNodeConfigured:
      Script: s3://PCLUSTER_S3_BUCKET/all_or_nothing_batch.sh

This tells ParallelCluster to run the all_or_nothing_batch.sh configuration script once the cluster head node is booted up and ready for service.

Now try creating a cluster with the new configuration:

$ pcluster create-cluster -n CLUSTER_NAME -c cluster_configuration.yml

Be aware that you can’t update an existing cluster after changing its CustomActions – you need to create a new cluster for the new configuration to take effect.
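
Cluster creation takes several minutes. You can track its progress with the describe-cluster command and wait for the cluster status to reach CREATE_COMPLETE:

$ pcluster describe-cluster -n CLUSTER_NAME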

Caveats

Like any scaling strategy, the all-or-nothing approach has limitations. Among the most notable are:

  • Job launches that fail due to capacity or quota issues have to be explicitly re-queued by the user. This is normal behavior for a traditional HPC cluster, but users may not expect capacity limits when using a cloud-powered resource.
  • All-or-nothing can only be used for batches of the same instance type. For example, a job that needs 2 x c6i.32xlarge and 2 x p4d.24xlarge instances can’t be managed by all-or-nothing scaling.
  • Slurm groups instance requests together for efficiency. As a result, batch jobs submitted in rapid succession might be chunked into one large instance request, causing all of them to fail if that combined request hits a capacity limit. Waiting around a minute between job submissions helps prevent this from happening (see the sketch after this list).
  • Slurm applies all-or-nothing configuration to the entire cluster. That means there’s no way to configure it on a per-queue basis.
  • By default, up to 500 instances can be launched at once. If you try to launch, for example, 750, the first 500 will be requested in one batch, followed by another batch of 250. It’s possible to run into insufficient capacity to launch the second batch, leaving the first 500 instances idle until capacity becomes available. To prevent this from happening, consider scaling in batches of 500 nodes or fewer.
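
As a small illustration of the spacing suggestion above, you could pause between multi-node submissions like this. The 60-second interval is a rough rule of thumb from the point above, not a documented threshold:

for N in {1..4}
do
    sbatch -N 24 -p c6i job_script.sh
    sleep 60    # give Slurm time to issue a separate launch request per job
done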

Despite the trade-offs, all-or-nothing scaling is a solid approach for dealing with capacity limitations, especially when you are using specific instance types that are in high demand, or you’re requesting an exceptional number of them at once.

Conclusion

Autoscaling with ParallelCluster can be interrupted by a failure to launch enough compute instances for multi-node jobs, which can lead to delayed work, idle capacity, and unexpected charges.

One way of preventing this is to configure your cluster to use all-or-nothing instance launching. There are a few limits to this approach, but in general, it’s a strategy you should consider if you’re running ParallelCluster at scale and have many multi-node job types. You can learn more on this topic by visiting the ParallelCluster wiki.

Matt Vaughn

Matt Vaughn is a Principal Developer Advocate for HPC and scientific computing. He has a background in life sciences and building user-friendly HPC and cloud systems for long-tail users. When not in front of his laptop, he’s drawing, reading, traveling the world, or playing with the nearest dog.

Sean Smith

Sean Smith is a Sr Specialist Solution Architect at AWS for HPC and generative AI. Prior to that, Sean worked as a Software Engineer on AWS Batch and CfnCluster, becoming the first engineer on the team that created AWS ParallelCluster.