AWS HPC Blog

Cost-optimization on Spot Instances using checkpoint for Ansys LS-DYNA

This post was written by Dnyanesh Digraskar, Sr. Partner Solutions Architect, HPC, and Amit Varde, Sr. Partner Development Manager.

Organizations migrate their high performance computing (HPC) workloads from on-premises infrastructure to Amazon Web Services (AWS) for advantages such as high availability, elastic capacity, and the latest processor, storage, and networking technologies, all with pay-as-you-go pricing. These benefits empower engineering teams to scale compute- and memory-intensive workloads such as Finite Element Analysis (FEA) effectively, reducing costs and achieving faster time-to-results.

Because the major portion of the cost of running FEA workloads on AWS comes from Amazon EC2 instance usage, Amazon EC2 Spot Instances offer a cost-effective architectural choice. Spot Instances let you take advantage of unused EC2 capacity, and are available at up to a 90% discount compared to On-Demand Instance prices.

In this post, we describe how engineers can run fault-tolerant FEA workloads on Spot Instances using Ansys LS-DYNA’s checkpointing and auto-restart utility, and continue to leverage the cost benefits of Amazon EC2 Spot Instances.

How Spot Instances work

Spot Instances are spare EC2 compute capacity in the AWS Cloud, available at steep discounts off On-Demand Instance prices. In exchange for the discount, Spot Instances come with a simple rule: they are interruptible and must be returned when EC2 needs the capacity back. This spare capacity exists because demand for the 375+ instance types across 77 Availability Zones (AZs) and 24 Regions is unpredictable at any given time. Rather than letting that capacity sit idle and unused, AWS makes it available for purchase as Spot Instances.

The location and amount of spare capacity available at any given moment is dynamic and continually changes in real time. This is why it is important for Spot Instance customers to only run workloads that are truly interruption tolerant. Additionally, Spot Instance workloads should be flexible, meaning they can be shifted in real time to where the spare capacity currently is (or otherwise be paused until spare capacity is available again). For more information on how Spot Instances work, refer to this whitepaper.
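The interruption notice that Spot workloads react to is published through the EC2 instance metadata service (IMDS): the spot/instance-action document appears roughly two minutes before the instance is reclaimed, and the endpoint returns nothing otherwise. A minimal IMDSv2 check might look like the following sketch (it only returns a notice when run on an EC2 Spot Instance):

```shell
IMDS="http://169.254.169.254/latest"

# A pending notice is a small JSON document containing an "action" key.
interruption_pending() {
  [[ "$1" == *'"action"'* ]]
}

# Fetch the notice with IMDSv2 (session token first); -m caps the wait so
# this fails fast when run off-EC2. Empty output means no notice is pending.
fetch_notice() {
  local token
  token=$(curl -sf -m 2 -X PUT "$IMDS/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 300") || return 1
  curl -sf -m 2 -H "X-aws-ec2-metadata-token: $token" \
       "$IMDS/meta-data/spot/instance-action"
}

# Example (on an EC2 Spot Instance):
#   notice=$(fetch_notice) && interruption_pending "$notice" && echo "reclaiming soon"
```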

Solution overview

The Ansys LS-DYNA checkpointing utility, known as the lsdyna-spotless toolkit, monitors simulation jobs on Spot Instances using the mechanism shown in the following architecture diagram (Figure 1).

Figure 1: Architecture for monitoring spot interruptions and restarting jobs.

The flow represented in the above architecture is explained in the high-level steps below:

  1. User submits a single (or a set of) Ansys LS-DYNA jobs on the cluster head node.
  2. Each job is split in multiple Message Passing Interface (MPI) tasks based on the user input.
  3. The monitor daemon, poll, is dispatched to each compute node to poll the EC2 instance metadata for an interruption notice on behalf of all the MPI tasks.
  4. On receiving the interruption notice, a checkpoint of the running simulation is created and saved on the /shared drive, which is accessible to both the head and compute nodes.
  5. The job restarter daemon, job-restarter, resubmits the job to the cluster queue when the desired capacity is available again.
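The checkpoint-and-requeue portion of this flow can be sketched as follows. This is an illustrative reading of steps 4 and 5, not the shipped poll/job-restarter daemons: the "sw1." sense switch is LS-DYNA's documented way of requesting a restart dump, while `scontrol requeue` is used here as a stand-in for the toolkit's own resubmission logic:

```shell
# Hypothetical handler for a spot interruption notice. SCONTROL is
# overridable (e.g. SCONTROL=echo) for dry runs off-cluster.
handle_interruption() {
  local jobdir=$1 jobid=$2
  # "sw1." tells LS-DYNA to write a restart dump (d3dump*) and stop
  echo "sw1." > "$jobdir/switch"
  # hand the job back to SLURM; it resumes from the dump once capacity returns
  ${SCONTROL:-scontrol} requeue "$jobid"
}
```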

Now that you have an overview of the utility and the commands involved, let us review the detailed steps for setting up the Ansys LS-DYNA simulation environment with the lsdyna-spotless utility, and its impact on overall instance costs.

Simulation environment setup

The LS-DYNA simulation environment can be set up on the cluster head node using the following steps:

  1. Download and install the latest Ansys LS-DYNA version on the head node. Version R12 is used for this blog. Download the toolkit from this GitHub repository, and unzip the downloaded package.
  2. Review the customization options provided in the env-vars.sh script.
  3. Set the MPPDYNA variable to the path of the Ansys LS-DYNA executable.
  4. Update the variables for the license server and SLURM queue with the appropriate IP address and queue name, respectively:

$ export LSTC_LICENSE_SERVER="IP-address-license-server"
$ export SQQUEUE="your-SLURM-queue-name"

  5. Source the script after modification to set the environment variables:

$ source env-vars.sh

  6. Copy all the tools distributed within the package to the binary directory created by the script in Step 2:

$ cp * /shared/ansys/bin

Launching jobs

After setting up the necessary environment variables, let us now look at the commands used to submit fault-tolerant Ansys LS-DYNA jobs with the checkpointing utility.

Each job is assumed to be in its own uniquely named directory with its own SLURM job script. The job script assumes that the main input deck for Ansys LS-DYNA is named main.k. If the name is different, either change it or create a soft link.
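For example, assuming your real input deck is named crash_model.k (a hypothetical name), a soft link avoids renaming the file:

```shell
# point main.k at the real input deck without renaming it
ln -s crash_model.k main.k   # spotq.slurm will now find main.k
```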

To start the job, run the following command:

$ start-jobs 2 72 spotq.slurm job-1 job-2 job-3

This will submit three Ansys LS-DYNA MPI jobs, each with 2 nodes and 72 tasks per node, for a total of 144 tasks per job.
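Internally, start-jobs presumably loops over the job directories and submits each one to SLURM with the requested geometry. The following function is an illustrative sketch of that behavior, not the shipped tool; SBATCH is overridable so the sketch can be dry-run without a scheduler:

```shell
# Illustrative sketch of start-jobs.
# Usage: start_jobs_sketch <nodes> <tasks-per-node> <script> <jobdir>...
start_jobs_sketch() {
  local sbatch_cmd=${SBATCH:-sbatch}   # set SBATCH=echo for a dry run
  local nodes=$1 tpn=$2 script=$3; shift 3
  local dir
  for dir in "$@"; do
    # submit from inside each job directory so main.k and d3dump* are found
    ( cd "$dir" && $sbatch_cmd -N "$nodes" --ntasks-per-node="$tpn" "$script" )
  done
}
```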

Following is the sample spotq.slurm job script for the SLURM scheduler that comes with the toolkit package:

#!/bin/bash 
#SBATCH -J job # Job name
#SBATCH -o job.%j.out # Name of stdout output file

INPUTDECK="main.k"

if ls d3dump* 1>/dev/null 2>&1; then
    # restart from the most recent restart dump (d3dumpNN)
    mode="r=$(ls -t d3dump* | head -1 | cut -c1-8)"
    op="restart"
else
    # no dump present: fresh start from the input deck
    mode="i=$INPUTDECK"
    op="start"
fi

# create/overwrite checkpoint command file
echo "sw1." >switch

# launch monitor tasks
job_file=$(scontrol show job $SLURM_JOB_ID | awk -F= '/Command=/{print $2}')
srun --overcommit --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 $SQDIR/bin/poll "$SLURM_JOB_ID" "$SLURM_SUBMIT_DIR" "$job_file" &>/dev/null &

# Launch MPI-based executable
echo -e "$SLURM_SUBMIT_DIR ${op}ed: $(date) | $(date +%s)" >>$SQDIR/var/timings.log
srun --mpi=pmix_v3 --overcommit $MPPDYNA $mode
echo -e "$SLURM_SUBMIT_DIR stopped: $(date) | $(date +%s)" >>$SQDIR/var/timings.log

Jobs can be stopped by running the command:

$ stop-jobs

Impact on turnaround times and cost

We launched a set of 10 jobs on 2 c5.18xlarge Spot Instances, for a total of 144 MPI tasks per job, using the lsdyna-spotless toolkit. The utility monitors each job for a change of status, that is, whether or not it was interrupted. By analyzing the start and stop times of a job, excluding the intervals where it was interrupted, you can derive its effective runtime. Comparing this to the elapsed wall clock time gives the overhead caused by Spot Instance interruptions. A utility (calc-timing) is provided for a simple tabulation of the timings data. Following is its output for the set of 10 jobs:

$ calc-timing ../var/timings.old
job /shared/lstc/neon finished in 1543 seconds, after interrupt(s).
job /shared/lstc/neon-9 finished in 1308 seconds, uninterrupted.
job /shared/lstc/neon-8 finished in 1333 seconds, uninterrupted.
job /shared/lstc/neon-1 finished in 1478 seconds, after interrupt(s).
job /shared/lstc/neon-3 finished in 1279 seconds, uninterrupted.
job /shared/lstc/neon-2 finished in 1537 seconds, after interrupt(s).
job /shared/lstc/neon-5 finished in 1313 seconds, uninterrupted.
job /shared/lstc/neon-4 finished in 1295 seconds, uninterrupted.
job /shared/lstc/neon-7 finished in 1334 seconds, uninterrupted.
job /shared/lstc/neon-6 finished in 1304 seconds, uninterrupted.
10 jobs finished and they finished in 13724 seconds in CPU time.
Wall clock time: 14669 seconds elapsed.

These interruptions are user-injected with the EC2 metadata mock tool, and are not representative of the actual performance and availability of Spot Instances.
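The timings.log entries written by spotq.slurm (directory, started/restarted/stopped stamps, epoch seconds) make this bookkeeping straightforward to reproduce. The following awk sketch mirrors the calc-timing idea under that assumed log format; the shipped utility may differ:

```shell
# Sum the (stopped - started/restarted) intervals per job directory, so
# time a job spent waiting in the queue after an interruption is excluded.
# Log lines look like: "<dir> started: <date> | <epoch-seconds>"
effective_runtime() {
  awk -F'|' '
    /started:/ { split($0, f, " "); start[f[1]] = $2 + 0 }   # matches restarted: too
    /stopped:/ { split($0, f, " "); total[f[1]] += $2 - start[f[1]] }
    END { for (j in total) printf "job %s ran for %d seconds\n", j, total[j] }
  ' "$1"
}
```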

Out of the 10 jobs, 3 experienced interruptions, which increased their runtime by an average of 11% per job. However, this 11% added turnaround time translates to about a 60% price reduction compared to running the jobs on On-Demand Instances, as shown in the following figure (Figure 2). Turnaround time is represented by solid blue columns, and total instance cost for the simulation set is represented by solid red triangles.

Figure 2: Comparison of turnaround times and cost for running a set of 10 Ansys LS-DYNA jobs on Amazon EC2 On-Demand and Spot Instances.

Simulation costs highlighted in the above figure reflect the On-Demand and Spot Instance costs in the N. Virginia (us-east-1) Region, and you should consider license costs separately.
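The trade-off reduces to simple arithmetic: relative cost is roughly (1 + runtime overhead) x (1 - Spot discount). The discount used below is an assumed figure chosen to be consistent with the observed ~60% savings; actual Spot discounts vary by instance type, Availability Zone, and time:

```shell
overhead=0.11    # extra wall-clock time caused by interruptions (measured above)
discount=0.64    # assumed Spot discount off the On-Demand price (illustrative)

# relative cost of the Spot run vs. the same jobs on On-Demand
awk -v o="$overhead" -v d="$discount" \
    'BEGIN { printf "Spot run costs %.0f%% of the On-Demand run\n", (1 + o) * (1 - d) * 100 }'
```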

Cluster configuration and AWS services

The HPC cluster used for running the test cases in this blog is built on the AWS services highlighted below. Cases are run on Amazon EC2 compute-optimized C5 Spot Instances; c5.18xlarge instances, with 36 physical cores, are based on 3.4 GHz Intel Skylake-SP processors. An Amazon Elastic File System (Amazon EFS) drive is attached to the head node for storing application files that are also shared with the compute nodes. The simulation checkpoint information gathered from the compute nodes is saved on the Amazon EFS drive for later use.

Deployment of Ansys LS-DYNA on AWS is described in the following architecture diagram (Figure 3).

Figure 3: Architecture for deploying LS-DYNA on AWS using AWS ParallelCluster.

AWS ParallelCluster, an AWS-supported open-source cluster management tool, is used to deploy and manage HPC clusters. The workload manager SLURM, which runs on AWS ParallelCluster, is used as the job scheduler, and is a prerequisite for running the Ansys LS-DYNA checkpointing utility. Amazon Elastic File System (Amazon EFS) is an elastic file system that lets you share file data without provisioning or managing storage. Shared Amazon EFS drives are ideal for storing application checkpoint data that can be seamlessly accessed by all the compute nodes of the cluster.

Additional best practices for Spot Instances

While the Ansys LS-DYNA checkpointing utility described above actively monitors Spot Instance interruptions and resubmits interrupted jobs to the cluster queue, we recommend architecting the simulation environment with additional fault-tolerance measures when possible.

One such method is to diversify the EC2 instance types used for running Ansys LS-DYNA simulations. This can be achieved using the AWS ParallelCluster’s multiple queues and multiple instance types capability, as described in this blog. As the EC2 capacity and demand vary per instance type, diversifying instance types in the HPC cluster gives you flexibility to submit and shift simulations to a queue with potentially lower interruptions.
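As an illustration, with AWS ParallelCluster 3 multiple Spot queues with different instance types can be declared directly in the cluster configuration; queue and resource names in the fragment below are illustrative:

```yaml
# Fragment of an AWS ParallelCluster 3 cluster config (illustrative names)
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: spot-c5
      CapacityType: SPOT
      ComputeResources:
        - Name: c5-18xl
          InstanceType: c5.18xlarge
          MinCount: 0
          MaxCount: 4
    - Name: spot-c5n        # second queue with a different instance type
      CapacityType: SPOT
      ComputeResources:
        - Name: c5n-18xl
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 4
```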

Conclusion

Engineers can efficiently run fault-tolerant Ansys LS-DYNA simulations on Amazon EC2 Spot Instances, achieving up to 60% cost savings compared to running their workloads on On-Demand Instances.

This post demonstrates how Ansys LS-DYNA's new checkpointing utility allows engineers to run their FEA workloads on Spot Instances by accounting for any instance interruptions. The utility can also automatically resubmit the jobs to the HPC cluster queue when the instance capacity is added back.

If you are interested in running FEA or other HPC workloads on Amazon EC2 Spot Instances using Ansys LS-DYNA, more information can be found on the Ansys LS-DYNA solutions page.

Dnyanesh Digraskar

Dnyanesh Digraskar is a Principal HPC Partner Solutions Architect at AWS. He leads the HPC implementation strategy with AWS ISV partners to help them build scalable well-architected solutions. He has more than fifteen years of expertise in the areas of CFD, CAE, Numerical Simulations, and HPC. Dnyanesh holds a Master’s degree in Mechanical Engineering from University of Massachusetts, Amherst.

Amit Varde

Amit Varde is a Partner Development Manager focused on High Performance Computing (HPC) at Amazon Web Services. He has nearly 20 years of experience in semiconductor product design, architecting HPC clusters, and developing Electronic Design Automation (EDA) Software. Prior to joining AWS, he was the Director of Product Management at ClioSoft, Inc focused on providing team productivity, collaboration, and IP management solutions for semiconductor design teams. Amit holds a Masters of Science in Computer Engineering from Syracuse University.