This Guidance shows how to improve the resiliency and reduce the costs of long-running high performance computing (HPC) and electronic design automation (EDA) jobs through checkpoint and restore capabilities. Some HPC and EDA jobs run for hours or days without built-in resiliency mechanisms, so any interruption forces them to restart from the beginning. These jobs have historically been unable to take advantage of the cost benefits of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, despite their high memory requirements. With this Guidance, AWS customers can adopt the MemVerge Memory Machine™ Cloud Edition (MMCE) agent, which improves resiliency and reduces costs by taking checkpoints throughout a job and enabling the job to resume from the last checkpoint in the event of an Amazon EC2 Spot Instance interruption. Implementing these capabilities allows a greater percentage of EDA jobs to benefit from the cost savings offered by Amazon EC2 Spot Instances.
Architecture Diagram
Step 1
Users submit long-running HPC jobs to the scheduler.
Step 2
The scheduler launches an Amazon Elastic Compute Cloud (Amazon EC2) Spot Instance to run the job. The MMCE checkpointing agent is pre-installed to generate periodic incremental checkpoints.
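The periodic-checkpointing setup in this step can be sketched as a small wrapper that builds the command line used to launch the job under the agent. Note that `mmce_agent` and its flags are illustrative placeholders, not the actual MMCE CLI; consult MemVerge's documentation for the real interface.

```python
# Sketch: wrap an HPC job command so a checkpointing agent takes
# periodic incremental checkpoints to a shared storage volume.
# NOTE: "mmce_agent", "--checkpoint-dir", and "--checkpoint-interval"
# are hypothetical placeholders, not the real MMCE CLI.

def build_job_command(job_cmd, checkpoint_dir, interval_minutes=15):
    """Return the full command list that runs job_cmd under the agent."""
    return [
        "mmce_agent", "run",
        "--checkpoint-dir", checkpoint_dir,            # on shared storage
        "--checkpoint-interval", f"{interval_minutes}m",
        "--",                                          # end of agent options
    ] + list(job_cmd)

cmd = build_job_command(["./simulate", "--input", "design.v"],
                        "/shared/ckpt/job-42")
print(" ".join(cmd))
```

The scheduler's job script would execute this command instead of the raw tool invocation, which is what keeps checkpointing transparent to the scheduler itself.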
Step 3
Amazon EC2 provides a two-minute advance notice before interrupting a Spot Instance.
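On the instance, the two-minute notice surfaces through the instance metadata service: the `spot/instance-action` document returns HTTP 404 until an interruption is scheduled, then returns a small JSON document with the action and time. A minimal polling sketch (using IMDSv1-style unauthenticated requests for brevity; production instances configured for IMDSv2 would first fetch a session token):

```python
import json
import urllib.error
import urllib.request

# Instance metadata endpoint for Spot interruption notices.
IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(body: str):
    """Extract (action, time) from the instance-action JSON document."""
    doc = json.loads(body)
    return doc["action"], doc["time"]

def interruption_pending(timeout=1):
    """Return (action, time) if an interruption is scheduled, else None.

    A 404 response means no interruption notice has been issued yet.
    """
    try:
        with urllib.request.urlopen(IMDS_URL, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```

A monitoring loop would call `interruption_pending()` every few seconds and, on a non-`None` result, ask the agent to take its final checkpoint.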
Step 4
When a Spot Instance interruption is detected, the MMCE agent creates an incremental checkpoint that captures the process state, memory, and working directory, and stores it on a separate shared storage volume.
Step 5
The scheduler launches a new Spot Instance to restart the job. Checkpointing is transparent to the scheduler: no scheduler integration is required to handle the checkpoint.
Step 6
Before launching the tool, the job checks for an existing checkpoint. If one is found, MMCE is called to restore it, allowing the job to resume from the last checkpoint instead of restarting from the beginning.
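The restore-or-start decision in this step can be sketched as follows. Both the `mmce_agent` subcommands and the `state.img` marker file are hypothetical placeholders for whatever the MMCE agent actually writes; only the branching logic is the point.

```python
import os

def build_launch_command(checkpoint_dir, job_cmd):
    """Choose between restoring a checkpoint and starting the job fresh.

    NOTE: "mmce_agent restore/run" and the "state.img" marker are
    illustrative placeholders, not the real MMCE CLI or file layout.
    """
    marker = os.path.join(checkpoint_dir, "state.img")
    if os.path.exists(marker):
        # Resume from the last checkpoint instead of restarting.
        return ["mmce_agent", "restore", "--checkpoint-dir", checkpoint_dir]
    # No checkpoint yet: run the job under the agent from the beginning.
    return (["mmce_agent", "run", "--checkpoint-dir", checkpoint_dir, "--"]
            + list(job_cmd))
```

Because the check happens inside the job script itself, the scheduler simply resubmits the job and the right behavior falls out automatically.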
Step 7
An Amazon EventBridge scheduled event triggers an AWS Lambda function, which submits an HPC job to clean up old checkpoints and reduce storage space.
Get Started
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
The HPC scheduler monitors the Amazon EC2 instance running the job, launching a new instance to replace it when interrupted and logging the failure of the initial instance. It then calculates the job's complete runtime by combining the time spent on the first and second instances. This allows users to observe the impact of the interruption, broken down into "net" runtime (actual CPU time) and "wall-clock" time (total time from submission to completion).
-
Security
The automated cleanup of old checkpoints triggered by the Lambda function helps manage the attack surface area by removing unnecessary data storage.
-
Reliability
An Amazon EC2 Spot Instance interruption triggers the creation of a checkpoint of the workload's state on a shared storage volume. The HPC scheduler then detects the failure and launches a new instance to resume the job. Rather than restarting from the beginning, the job loads the checkpoint and continues from the point of interruption.
-
Performance Efficiency
This Guidance allows customers to migrate a greater portion of their EDA and HPC workloads to Amazon EC2 Spot Instances. The cost savings realized through the use of Spot Instances helps these customers to use faster compute instances, which may have previously been cost-prohibitive. This allows them to reduce the overall runtime of their workloads.
-
Cost Optimization
Amazon EC2 Spot Instances allow users to realize cost savings without the constraints of upfront financial commitments or instance type limitations. Additionally, the Lambda function responsible for cleaning up obsolete checkpoints helps ensure the storage costs remain aligned with the number of jobs currently running. This prevents the accumulation of unnecessary checkpoint data over time.
-
Sustainability
By using Amazon EC2 Spot Instances and high-performance shared storage services, this Guidance distributes responsibility for sustainability between the user and AWS: AWS is responsible for the underlying infrastructure and its environmental impact, while the customer can focus on optimizing their workloads. The ability to quantify the performance and cost benefits of using Amazon EC2 Spot Instances and optimized storage allows users to better understand the environmental impact of their workloads. Also, Amazon EC2 Spot Instances help maximize the use of AWS compute resources, reducing the overall resource requirements and downstream environmental impacts. Finally, the integration of checkpoint and restore capabilities, coupled with the use of high-performance shared storage, enables workloads to resume from checkpoints, minimizing the need to rerun jobs from the start and reducing overall resource consumption.
Related Content
Save up to 90% using EC2 Spot, even for long-running HPC jobs
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.