Guidance for Improving High Performance Computing Resiliency on AWS
How it works
These technical details feature an architecture diagram to illustrate how to effectively use this solution. The architecture diagram shows the key components and their interactions, providing an overview of the architecture's structure and functionality step-by-step.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
The HPC scheduler monitors the Amazon EC2 instance running the job. When that instance is interrupted, the scheduler logs the failure and launches a new instance to replace it. It then calculates the job's complete runtime by combining the time spent on the first and second instances. This lets users observe the impact of the interruption, broken down into "net" runtime (actual CPU time) and "wall-clock" time (total time from submission to completion).
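The runtime breakdown described above can be sketched as follows. This is a minimal illustration, not the scheduler's actual accounting: `RunSegment`, `summarize_job`, and the timestamps are hypothetical, and net runtime is approximated here by the time the job spent running on each instance rather than by measured CPU time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class RunSegment:
    """One stretch of execution on a single instance (hypothetical record)."""
    instance_id: str
    started: datetime
    ended: datetime


def summarize_job(submitted: datetime, segments: List[RunSegment]) -> Dict[str, object]:
    """Combine per-instance segments into net and wall-clock runtime."""
    # Net runtime: sum of the time actually spent running on each instance.
    net = sum((s.ended - s.started for s in segments), timedelta())
    # Wall-clock time: from submission until the last segment finished.
    wall = max(s.ended for s in segments) - submitted
    return {
        "net_runtime": net,
        "wall_clock": wall,
        "overhead": wall - net,              # queueing, relaunch, and restore time
        "interruptions": len(segments) - 1,  # each extra segment implies one interruption
    }


# Example: a job interrupted once and resumed on a second instance.
submitted = datetime(2024, 1, 1, 10, 0)
segments = [
    RunSegment("i-first", datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 35)),
    RunSegment("i-second", datetime(2024, 1, 1, 10, 40), datetime(2024, 1, 1, 11, 10)),
]
summary = summarize_job(submitted, segments)
```

In this example the gap between the two segments and the initial queueing delay show up as overhead, which is the cost of the interruption that the net figure alone would hide.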
Security
The Lambda function triggers automated cleanup of old checkpoints, which reduces the attack surface by removing stored data that is no longer needed.
Reliability
When an Amazon EC2 Spot Instance is interrupted, a checkpoint of the workload's state is written to a shared storage volume. The HPC scheduler then detects the failure and launches a new instance to resume the job. Rather than restarting from the beginning, the job loads the checkpoint and continues from the point of interruption.
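The checkpoint-and-resume flow can be illustrated with a small application-level sketch. It is a stand-in under stated assumptions, not the mechanism this Guidance deploys: the function names, the JSON state format, and the `interrupted` callback (which here simulates the Spot interruption signal) are all hypothetical, and the checkpoint path would point at the shared storage volume in a real setup.

```python
import json
import os
import tempfile
from typing import Callable, Dict


def save_checkpoint(path: str, state: Dict[str, int]) -> None:
    """Write state atomically so a half-written checkpoint is never loaded."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path: str) -> Dict[str, int]:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run_job(path: str, steps: int, interrupted: Callable[[], bool]) -> Dict[str, int]:
    """Run (or resume) a toy workload that sums the step indices."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if interrupted():
            # Simulated interruption: persist progress and stop.
            save_checkpoint(path, state)
            return state
        state["total"] += state["step"]
        state["step"] += 1
    save_checkpoint(path, state)
    return state
```

A second call to `run_job` with the same path picks up at the saved step instead of step 0, which mirrors the replacement instance resuming the job rather than rerunning it.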
Performance Efficiency
This Guidance allows customers to migrate a greater portion of their EDA and HPC workloads to Amazon EC2 Spot Instances. The cost savings from Spot Instances help these customers use faster compute instances that may previously have been cost-prohibitive, which reduces the overall runtime of their workloads.
Cost Optimization
Amazon EC2 Spot Instances allow users to realize cost savings without the constraints of upfront financial commitments or instance type limitations. Additionally, the Lambda function that cleans up obsolete checkpoints helps keep storage costs aligned with the number of jobs currently running, which prevents unnecessary checkpoint data from accumulating over time.
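The selection logic such a cleanup function might apply can be sketched as below. This is an illustration only: the `checkpoints/<job_id>/<file>` key layout, the `select_obsolete` helper, and the 24-hour grace period are assumptions made for the example, and the storage listing and deletion calls that a real Lambda handler would need are omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


def select_obsolete(
    checkpoints: Dict[str, datetime],
    active_jobs: Set[str],
    now: datetime,
    grace: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return checkpoint keys that belong to inactive jobs and are past the grace period."""
    obsolete = []
    for key, modified in checkpoints.items():
        job_id = key.split("/")[1]  # assumes checkpoints/<job_id>/<file>
        if job_id not in active_jobs and now - modified > grace:
            obsolete.append(key)
    return sorted(obsolete)


# Example listing: job-2 is still running, job-3 finished only an hour ago.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
listing = {
    "checkpoints/job-1/state.bin": now - timedelta(days=2),
    "checkpoints/job-2/state.bin": now - timedelta(days=2),
    "checkpoints/job-3/state.bin": now - timedelta(hours=1),
}
to_delete = select_obsolete(listing, active_jobs={"job-2"}, now=now)
```

Keeping the selection separate from the deletion makes the rule easy to test, and the grace period guards against removing a checkpoint for a job that is about to be resumed.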
Sustainability
By using Amazon EC2 Spot Instances and high-performance shared storage services, this Guidance distributes responsibility for sustainability between the user and AWS: AWS is responsible for the underlying infrastructure and its environmental impact, while the customer can focus on optimizing their workloads. The ability to quantify the performance and cost benefits of Amazon EC2 Spot Instances and optimized storage allows users to better understand the environmental impact of their workloads. Amazon EC2 Spot Instances also help maximize the use of AWS compute resources, reducing overall resource requirements and downstream environmental impacts. Finally, the integration of checkpoint and restore capabilities with high-performance shared storage enables workloads to resume from checkpoints, minimizing the need to rerun jobs from the start and reducing overall resource consumption.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.