Save up to 90% using EC2 Spot, even for long-running HPC jobs

Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances enable AWS customers to save up to 90% on their compute costs by using spare EC2 capacity. They're popular with HPC customers, who consume large amounts of compute every day. But long-running HPC jobs often can't survive an EC2 Spot interruption, which makes it hard for them to benefit from this low-cost, high-volume compute.

Amazon EC2 Spot interruptions do come with a 2-minute warning, though, which you can detect from within the running instance or externally using Amazon EventBridge. Some HPC tools are starting to handle these interruptions by saving their state and allowing the next instance to resume work from that point.
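
For example, here's a minimal sketch of in-instance detection, polling the instance metadata service for the interruption notice (this assumes IMDSv2 is enabled on the instance):

    import time
    import urllib.error
    import urllib.request

    IMDS = "http://169.254.169.254/latest"

    def imds_token():
        # IMDSv2 requires a session token for every metadata read.
        req = urllib.request.Request(
            IMDS + "/api/token", method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"})
        return urllib.request.urlopen(req).read().decode()

    def interruption_pending():
        # /spot/instance-action returns 404 until the 2-minute
        # interruption notice has been issued for this instance.
        req = urllib.request.Request(
            IMDS + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": imds_token()})
        try:
            urllib.request.urlopen(req)
            return True
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return False
            raise

    while not interruption_pending():
        time.sleep(5)  # poll every few seconds
    # ...trigger the checkpoint here...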

But what about the much larger number of HPC tools that don't have this capability? Can we bring them into the mix and make them fault-tolerant without changing the applications? AWS technology partners such as MemVerge offer solutions for checkpointing at the VM level. They capture the process tree and the working directory (including local temp files) and write them to shared storage. By handling this at the VM level, they let your existing HPC tools run to completion unaware of Spot interruptions, which opens savings opportunities for AWS customers using EC2 Spot Instances.

In this post, we’ll show you how these new technologies can help you optimize HPC compute costs by using EC2 Spot Instances. Depending on your environment, you may not need to modify your applications to do it. We’ll cover the underlying technologies, their operational implications, and their limitations.

Outline of the checkpoint/restore process

Figure 1 describes the process. When a Spot interruption happens, EC2 issues an interruption warning event (delivered through Amazon EventBridge). A monitoring script inside the instance detects this from the instance metadata (1) and triggers a checkpoint, which is stored on a shared file system (2). The HPC scheduler detects that the execution node was terminated (3) and requeues the job (4); the HPC tools themselves remain unaware of the Spot interruption. The scheduler calls the EC2 API to launch a new Spot Instance (5). The new instance detects that a checkpoint exists (6) and restores it, instead of starting the job from scratch. The job then continues, communicating with the license server (7) for the duration of the run.

Figure 1 – The checkpoint / restore process, which uses the Spot interruption notification to force a checkpoint to take place and recycles the job onto a new instance when one is available.

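If you'd rather detect interruptions externally, you can match the warning with an Amazon EventBridge rule. Here's a minimal boto3 sketch; the Lambda ARN is a placeholder for whatever notifies your scheduler:

    import json
    import boto3

    events = boto3.client("events")

    # Match the 2-minute Spot interruption warning for any instance.
    events.put_rule(
        Name="spot-interruption-warning",
        EventPattern=json.dumps({
            "source": ["aws.ec2"],
            "detail-type": ["EC2 Spot Instance Interruption Warning"],
        }))

    # Route matching events to a target that notifies the scheduler.
    events.put_targets(
        Rule="spot-interruption-warning",
        Targets=[{
            "Id": "notify-scheduler",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:notify",
        }])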

How we tested this

This blog post is the result of running MemVerge with an HPC code over a few weeks, incrementally increasing the complexity of the solution. Our goal was to minimize changes to the existing HPC workflow. Where changes are required, we'll call them out as we go.

Our test introduced one new element at each step so we could identify which steps impacted the runtime the most. We repeated each step at least 15 times to measure variability in the results:

  1. Baseline: test a “clean” copy of the environment (no additional checkpointing tools installed)
  2. Test with the checkpointing tool binaries installed, but not running
  3. Test with the tool running, but without creating checkpoints
  4. Test with a checkpoint and a restore on the same node
  5. Test with a checkpoint and a restore on a different node
  6. Create multiple checkpoints in the same run, resuming each one on a different node

With two exceptions (which we'll discuss in a minute), we didn't experience longer runtimes from checkpointing. We chose to limit testing to a single worker node; see “Workloads spanning multiple nodes” near the end of this post for more on that.

Operational considerations

We learned a lot from this lab work – mainly things that you can work around with adequate planning.

For the sake of what follows, we’ll refer to the first EC2 Spot Instance that gets interrupted as the old node. The new instance launched to continue the job is the new node.

Compute

You must restart your job on a new host with a similar configuration to the old host. This includes the CPU manufacturer (AMD, Intel, Arm), the memory size, and the CPU generation. You can handle this by diversifying across conforming instance types before launching the job, for example by selecting them with the ec2-instance-selector tool. Your HPC tool may check for specific instruction-set support and rely on it; if that instruction set isn't available on the new node, the job may fail to complete. It's possible that you can move up to newer generations with backwards-compatible instruction sets, but we didn't test that.
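
If you'd rather do this selection programmatically, here's a boto3 sketch that filters the instance-type catalog for types conforming to the old node; the architecture, vCPU, and memory values shown are illustrative:

    import boto3

    ec2 = boto3.client("ec2")

    def conforming_instance_types(arch="x86_64", vcpus=16, min_mem_mib=65536):
        # Keep instance types matching the old node's CPU architecture,
        # vCPU count, and (at least) its memory size.
        matches = []
        for page in ec2.get_paginator("describe_instance_types").paginate():
            for itype in page["InstanceTypes"]:
                if (arch in itype["ProcessorInfo"]["SupportedArchitectures"]
                        and itype["VCpuInfo"]["DefaultVCpus"] == vcpus
                        and itype["MemoryInfo"]["SizeInMiB"] >= min_mem_mib):
                    matches.append(itype["InstanceType"])
        return matches

    print(conforming_instance_types())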

Since instances can be interrupted, it’s best to run each worker in its own instance. This limits the number of impacted workers. It also prevents them from competing for network bandwidth when they need to write the checkpoint.

Scheduler integration

Since the scheduler is the one restarting the interrupted job, it’s important to understand its role in this.

Most schedulers offer a mechanism for pre-execution scripts (to set up the environment for execution) and post-execution scripts (to clean up or capture logs). Pre-execution scripts are repeated on the new node too, so they'll need to detect the existence of a checkpoint and restore it. Check with your technology provider whether the tool captures environment variables; if not, those may need to move into your pre-execution script.
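
As an illustration, the restore logic in a pre-execution script might look like this sketch. The checkpoint path, the JOB_ID variable, and the restore-tool command are hypothetical stand-ins for your scheduler's and checkpointing tool's equivalents:

    import os
    import subprocess
    import sys

    # Hypothetical: one checkpoint directory per job on shared storage.
    ckpt_dir = os.path.join("/shared/checkpoints", os.environ["JOB_ID"])

    if os.path.isdir(ckpt_dir):
        # A checkpoint exists, so restore it instead of starting fresh.
        sys.exit(subprocess.call(["restore-tool", "--from", ckpt_dir]))

    # No checkpoint: fall through and let the job start from scratch.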

Another aspect of scheduler integration is the locality of the workload. While this is relevant to all HPC jobs, it's worth highlighting in this context: if the old node ran in one Availability Zone (AZ), the new node really should be launched in the same AZ. This allows low-latency communication with the shared storage or other workers, and it saves you from paying cross-AZ data transfer charges.
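
One way to pin the new node to the old node's AZ is to launch it into the same subnet. A boto3 sketch, with placeholder IDs:

    import boto3

    ec2 = boto3.client("ec2")

    # Launching into the old node's subnet keeps the replacement in
    # the same AZ as the shared storage and the other workers.
    ec2.run_instances(
        MinCount=1, MaxCount=1,
        LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0"},
        SubnetId="subnet-0123456789abcdef0",  # subnet of the old node
        InstanceMarketOptions={"MarketType": "spot"})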

And, of course, the scheduler should not attempt to resume the job if a license isn’t available to run it (more on licensing later).

Shared storage – performance

The checkpoint has to be stored on shared storage so it can outlive the old node and be restored on the new node. Spot interruptions will generate short bursts of high-throughput writes to that storage. The checkpointing process needs to complete within the 2-minute warning, which is why incremental checkpoints are preferred. The size of a checkpoint depends on the process memory size and the size of the working directory that needs to be backed up.
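
A quick back-of-envelope check helps with sizing; the figures below are illustrative:

    # Can a full checkpoint complete within the 2-minute warning?
    checkpoint_gib = 32        # process memory plus working directory
    throughput_gib_s = 0.5     # sustained write throughput per node

    write_seconds = checkpoint_gib / throughput_gib_s
    print(write_seconds)       # 64 s fits; anything over 120 s won't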

In a distributed HPC cluster running Spot Instances, Spot Placement Score (SPS) can help you find the optimal Spot configuration and make informed availability decisions, specifically which Availability Zone(s) to use, as highlighted in the blog about SPS. However, you should still account for storage performance sizing when running a multi-machine workload. You'll need a shared volume to host these checkpoints, and it has to be sized to handle a higher rate of interruptions.
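
You can query SPS with the GetSpotPlacementScores API. A boto3 sketch, with illustrative instance types and capacity:

    import boto3

    ec2 = boto3.client("ec2")

    # Score the likelihood of getting 50 instances, per AZ.
    resp = ec2.get_spot_placement_scores(
        InstanceTypes=["c6i.8xlarge", "c6a.8xlarge", "c5.9xlarge"],
        TargetCapacity=50,
        SingleAvailabilityZone=True,
        RegionNames=["us-east-1"])

    for score in resp["SpotPlacementScores"]:
        print(score["AvailabilityZoneId"], score["Score"])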

If storage is a performance bottleneck, some checkpoints might fail to complete within the 2 minutes. Your job will then have to be restarted or fall back to the previous checkpoint.

Shared storage – permissions

To capture the entire process tree of the job, the checkpoint software needs root permissions, which means checkpoint writes will come from the root user. Since most HPC environments use root_squash for security reasons, you'll need to decide how to allow these writes. You can set up a dedicated volume for checkpoints with no_root_squash, or map the root user to another user. We recommend consulting your security team, because any root access should be thoroughly reviewed.

Shared storage – capacity

Beyond the obvious capacity requirements for the checkpoints, you'll need to consider how you clean up old checkpoints from successfully completed jobs, so storage consumption doesn't grow over time. You can use the post-execution script to handle that: it doesn't run on the old node, where the job never completed, but it does run on the new node once the job finishes. You might, however, want to keep checkpoints available for a short period for troubleshooting. If so, consider queueing them for periodic cleanup.
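
For example, a periodic cleanup job might look like this sketch; the checkpoint path and retention window are illustrative:

    import shutil
    import time
    from pathlib import Path

    CKPT_ROOT = Path("/shared/checkpoints")  # illustrative path
    MAX_AGE_S = 7 * 24 * 3600                # keep for 7 days

    # Remove checkpoints old enough that they're no longer useful
    # for troubleshooting.
    now = time.time()
    for ckpt in CKPT_ROOT.iterdir():
        if ckpt.is_dir() and now - ckpt.stat().st_mtime > MAX_AGE_S:
            shutil.rmtree(ckpt)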

Licensing impact

If your HPC job relies on a license checked out from a license server, you'll need to consider how restarts impact the license count. Licenses bound to MAC addresses may cause you to double up on checked-out licenses until the first one is released.

Some licensed tools allow the client to set a keep-alive interval, forcing it to communicate with the license server to keep the license assignment. Setting the keep-alive to below the time it takes to restart the job can resolve this. However, setting it too low can overwhelm the license server with requests.

Another possible approach is to have the MAC address move with the job between nodes. You can achieve this using a second Elastic Network Interface (ENI) on the old node: ENIs can be migrated to another node, which avoids the MAC address change. You can do this in your pre-execution script, for example, or with a Lambda function called by the scheduler, which confines access to this permission (“least privilege”, one of our guiding principles for designing anything in AWS).
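
Here's a boto3 sketch of the ENI move; all IDs are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # The license-bound MAC address lives on this secondary ENI.
    eni = ec2.describe_network_interfaces(
        NetworkInterfaceIds=["eni-0123456789abcdef0"])["NetworkInterfaces"][0]

    # Detach from the interrupted instance if it's still attached.
    if "Attachment" in eni:
        ec2.detach_network_interface(
            AttachmentId=eni["Attachment"]["AttachmentId"], Force=True)
        ec2.get_waiter("network_interface_available").wait(
            NetworkInterfaceIds=[eni["NetworkInterfaceId"]])

    # Attach it as a secondary interface on the new node.
    ec2.attach_network_interface(
        NetworkInterfaceId=eni["NetworkInterfaceId"],
        InstanceId="i-0123456789abcdef0",  # the new node
        DeviceIndex=1)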

Incremental checkpoints

To minimize network traffic and storage throughput bottlenecks, you can use incremental checkpoints.

By periodically checkpointing your process, you reduce the amount of data that needs to be written during the 2-minute Spot interruption warning. The tradeoff is that you'll write more data in total, because your memory changes throughout the job's execution: writing more frequently means you'll write data in checkpoints 1, 2, and 3 that's not relevant to the last checkpoint (the one that's restored).

It's hard to offer a good “rule of thumb” for this, so we recommend measuring the tradeoff for your specific application, considering these questions (a worked example follows the list):

  1. What is the cost of a longer recovery point (less frequent checkpoints)? A high-cost license may make you prefer to checkpoint more frequently, to minimize the amount of work that can be lost in a worst-case scenario.
  2. How long does a checkpoint take? The tool might need to freeze the process to take a consistent checkpoint, which means your job takes longer as you add checkpoints. Here too you'll need to balance protection against lost work with the impact on time to results. In our lab, a 32 GB machine completed a checkpoint in 3-4 seconds (negligible).
  3. How much memory does the job use? The larger the memory model, the more data you'll need to write. At some point the data volume becomes too great to write in two minutes, and incremental checkpoints become mandatory.
  4. How long is your job? This defines your worst-case scenario for losing work. For a job that takes 5 days to complete, checkpointing daily means you may lose a full day of work. If that's not acceptable, consider more frequent checkpoints.
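
To make the tradeoff concrete, here's a small worked example using the 5-day job above and the 3-4 second checkpoint cost we measured in our lab:

    # Checkpoint overhead vs expected lost work for a 5-day job.
    job_hours = 5 * 24
    checkpoint_cost_s = 4  # freeze time per checkpoint (our lab)

    for interval_hours in (24, 6, 1):
        n_checkpoints = job_hours // interval_hours
        overhead_s = n_checkpoints * checkpoint_cost_s
        # On average, an interruption loses half an interval of work.
        expected_loss_hours = interval_hours / 2
        print(interval_hours, overhead_s, expected_loss_hours)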

Handling Spot interruptions

You can minimize the impact of Spot Instance interruptions by following Spot best practices: diversify across instance types, use a capacity allocation strategy such as price-capacity-optimized, and use Spot placement scores. As highlighted in this Spot best practices blog, these also help optimize your Spot Instances usage and apply equally to HPC workloads.
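
For example, a diversified Spot request using the price-capacity-optimized allocation strategy might look like this boto3 sketch; the launch template ID and instance types are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # Ask for 10 Spot instances, diversified across several types,
    # letting EC2 pick pools by price and available capacity.
    ec2.create_fleet(
        Type="instant",
        TargetCapacitySpecification={
            "TotalTargetCapacity": 10,
            "DefaultTargetCapacityType": "spot",
        },
        SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
        LaunchTemplateConfigs=[{
            "LaunchTemplateSpecification": {
                "LaunchTemplateId": "lt-0123456789abcdef0",
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": t}
                for t in ("c6i.8xlarge", "c6a.8xlarge", "c5.9xlarge")
            ],
        }])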

Many customers need to meet a Service Level Agreement (SLA) for jobs to complete; this calls for a trade-off between cost and the SLA. An interrupted job that's at risk of missing its SLA can be requeued and restarted on On-Demand Instances in order to meet it. When requesting On-Demand Instances, make sure to diversify your fleet request across multiple instance types. You can also explore low-cost HPC-specific instances at On-Demand pricing if jobs run intermittently, or Savings Plans to further reduce the cost if the jobs need to run regularly.

Checkpointing tool-specific constraints

The checkpointing tool may come with additional requirements (like the need to run as root to capture the process tree), and it may also change how you run your job.

These may include running jobs under a dedicated cgroup, container, or namespace, which adds time to start and tear down the job. In our test we measured a 20-second average setup time to create a new namespace, which was acceptable for a long-running job.

Other requirements might include specific user permissions for running your tools, which will need to be evaluated against your security guidelines.

Some tools offer the ability to get ahead of the Spot interruption and move the workload before it happens. This is an alternative to incremental checkpoints, but you can also use the two techniques together.

Workloads spanning multiple nodes

While we didn’t test workloads spanning multiple hosts in this lab, we will offer two insights:

In most cases, long tightly-coupled workloads are not suitable for Spot Instances. In tightly-coupled workloads, workers rely on one another, so the interruption of just one instance may leave the rest of the cluster stuck, wasting time and resources. HPC instances (such as Hpc7a, Hpc7g, or Hpc6id) offer lower cost and more cores in a single host, reducing this challenge; they were designed for exactly this kind of workload. Savings Plans are a great option for a discount in scenarios where the workload can benefit from a 1-year or 3-year commitment.

If your workers are loosely-coupled, you should still look at the head node that manages them. (1) This node should not run on Spot, so it isn't interrupted. (2) You should also check how this node reacts to a worker being interrupted and returning with a new hostname. This is tool-specific, and you might need to do some testing to validate it.

Conclusion

Checkpoint/restore solutions offer a new way of reducing the cost of long-running HPC jobs, but they introduce some new requirements. For competitive markets like semiconductors and healthcare, this can mean faster time to market without a similar increase in the R&D budget.

It takes some planning and implementation work to meet your specific software needs. For more information, visit AWS Marketplace to explore partner solutions for using EC2 Spot with your HPC workload, or reach out to us at ask-hpc@amazon.com.