This Guidance shows how to improve the resiliency and reduce the costs of long-running high performance computing (HPC) and electronic design automation (EDA) jobs through checkpoint and restore capabilities. Some HPC and EDA jobs run for hours or days without built-in resiliency mechanisms, so any interruption forces them to restart from the beginning. These jobs have historically been unable to take advantage of the cost benefits of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, despite their high memory requirements. With this Guidance, AWS customers can adopt the MemVerge Memory Machine™ Cloud Edition (MMCE) agent, which improves resiliency and reduces costs by taking checkpoints throughout a job and enabling the job to resume from the last checkpoint in the event of an Amazon EC2 Spot Instance interruption. Implementing these capabilities allows a greater percentage of EDA jobs to benefit from the cost savings offered by Amazon EC2 Spot Instances.
Architecture Diagram
Step 1
Users submit long-running HPC jobs to the scheduler.
Step 2
The scheduler launches an Amazon Elastic Compute Cloud (Amazon EC2) Spot Instance to run the job. The MMCE checkpointing agent is pre-installed to generate periodic incremental checkpoints.
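The periodic-checkpointing setup in this step can be sketched as a small wrapper that builds the command line used to launch the job under the agent. Note that `mmce_agent` and its flags are illustrative placeholders, not the actual MMCE CLI; consult MemVerge's documentation for the real interface.

```python
# Sketch: wrap an HPC job command so a checkpointing agent takes
# periodic incremental checkpoints to a shared storage volume.
# NOTE: "mmce_agent", "--checkpoint-dir", and "--checkpoint-interval"
# are hypothetical placeholders, not the real MMCE CLI.

def build_job_command(job_cmd, checkpoint_dir, interval_minutes=15):
    """Return the full command list that runs job_cmd under the agent."""
    return [
        "mmce_agent", "run",
        "--checkpoint-dir", checkpoint_dir,            # on shared storage
        "--checkpoint-interval", f"{interval_minutes}m",
        "--",                                          # end of agent options
    ] + list(job_cmd)

cmd = build_job_command(["./simulate", "--input", "design.v"],
                        "/shared/ckpt/job-42")
print(" ".join(cmd))
```

The scheduler's job script would execute this command instead of the raw tool invocation, which is what keeps checkpointing transparent to the scheduler itself.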
Step 3
Amazon EC2 provides a two-minute advance notice before interrupting a Spot Instance.
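On the instance, the two-minute notice surfaces through the instance metadata service: the `spot/instance-action` document returns HTTP 404 until an interruption is scheduled, then returns a small JSON document with the action and time. A minimal polling sketch (using IMDSv1-style unauthenticated requests for brevity; production instances configured for IMDSv2 would first fetch a session token):

```python
import json
import urllib.error
import urllib.request

# Instance metadata endpoint for Spot interruption notices.
IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(body: str):
    """Extract (action, time) from the instance-action JSON document."""
    doc = json.loads(body)
    return doc["action"], doc["time"]

def interruption_pending(timeout=1):
    """Return (action, time) if an interruption is scheduled, else None.

    A 404 response means no interruption notice has been issued yet.
    """
    try:
        with urllib.request.urlopen(IMDS_URL, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

A monitoring loop would call `interruption_pending()` every few seconds and, on a non-`None` result, ask the agent to take its final checkpoint.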
Step 4
When a Spot Instance interruption is detected, the MMCE agent creates an incremental checkpoint that captures the process state, memory, and working directory, and stores it on a separate shared storage volume.
Step 5
The scheduler launches a new Spot Instance to restart the job. Checkpointing is transparent to the scheduler: no scheduler integration is required to handle the checkpoint.
Step 6
Before launching the tool, the job checks for an existing checkpoint. If one is found, MMCE is called to restore it, allowing the job to resume from the last checkpoint instead of restarting from the beginning.
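The restore-or-start decision in this step can be sketched as follows. Both the `mmce_agent` subcommands and the `state.img` marker file are hypothetical placeholders for whatever the MMCE agent actually writes; only the branching logic is the point.

```python
import os

def build_launch_command(checkpoint_dir, job_cmd):
    """Choose between restoring a checkpoint and starting the job fresh.

    NOTE: "mmce_agent restore/run" and the "state.img" marker are
    illustrative placeholders, not the real MMCE CLI or file layout.
    """
    marker = os.path.join(checkpoint_dir, "state.img")
    if os.path.exists(marker):
        # Resume from the last checkpoint instead of restarting.
        return ["mmce_agent", "restore", "--checkpoint-dir", checkpoint_dir]
    # No checkpoint yet: run the job under the agent from the beginning.
    return (["mmce_agent", "run", "--checkpoint-dir", checkpoint_dir, "--"]
            + list(job_cmd))
```

Because the check happens inside the job script itself, the scheduler simply resubmits the job and the right behavior falls out automatically.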
Step 7
An Amazon EventBridge scheduled event triggers an AWS Lambda function, which submits an HPC job to clean up old checkpoints and reduce storage space.
Get Started
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
The HPC scheduler monitors the Amazon EC2 instance running the job, launching a new instance to replace it when interrupted and logging the failure of the initial instance. It then calculates the job's complete runtime by combining the time spent on the first and second instances. This allows users to observe the impact of the interruption, broken down into "net" runtime (actual CPU time) and "wall-clock" time (total time from submission to completion).
-
Security
The automated cleanup of old checkpoints triggered by the Lambda function helps manage the attack surface area by removing unnecessary data storage.
-
Reliability
An Amazon EC2 Spot Instance interruption triggers the creation of a checkpoint of the workload's state on a shared storage volume. The HPC scheduler then detects the failure and launches a new instance to resume the job. Rather than restarting from the beginning, the job loads the checkpoint and continues from the point of interruption.
-
Performance Efficiency
This Guidance allows customers to migrate a greater portion of their EDA and HPC workloads to Amazon EC2 Spot Instances. The cost savings realized through the use of Spot Instances helps these customers to use faster compute instances, which may have previously been cost-prohibitive. This allows them to reduce the overall runtime of their workloads.
-
Cost Optimization
Amazon EC2 Spot Instances allow users to realize cost savings without the constraints of upfront financial commitments or instance type limitations. Additionally, the Lambda function responsible for cleaning up obsolete checkpoints helps ensure the storage costs remain aligned with the number of jobs currently running. This prevents the accumulation of unnecessary checkpoint data over time.
-
Sustainability
By using Amazon EC2 Spot Instances and high-performance shared storage services, this Guidance distributes responsibility for sustainability between the user and AWS: AWS is responsible for the underlying infrastructure and its environmental impact, while the customer can focus on optimizing their workloads. The ability to quantify the performance and cost benefits of using Amazon EC2 Spot Instances and optimized storage allows users to better understand the environmental impact of their workloads. Also, Amazon EC2 Spot Instances help maximize the use of AWS compute resources, reducing the overall resource requirements and downstream environmental impacts. Finally, the integration of checkpoint and restore capabilities, coupled with the use of high-performance shared storage, enables workloads to resume from checkpoints, minimizing the need to rerun jobs from the start and reducing overall resource consumption.
Related Content
Save up to 90% using EC2 Spot, even for long-running HPC jobs
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.