AWS HPC Blog

Accelerating drug discovery with Amazon EC2 Spot Instances

This post was contributed by Cristian Măgherușan-Stanciu, Sr. Specialist Solution Architect, EC2 Spot, with contributions from Christian Kniep, Sr. Developer Advocate for HPC and AWS Batch at AWS, Carlos Manzanedo Rueda, Principal Solutions Architect, EC2 Spot at AWS, Ludvig Nordstrom, Principal Solutions Architect at AWS, Vytautas Gapsys, project group leader at the Max Planck Institute for Biophysical Chemistry, and Carsten Kutzner, staff scientist at the Max Planck Institute for Biophysical Chemistry.

This post is part of a series covering how we have been working with a team of researchers at the Max Planck Institute for Biophysical Chemistry, helping them leverage the cloud for drug research applications in the pharmaceutical industry.

In this post, we’ll focus on how the team at Max Planck obtained thousands of EC2 Spot Instances spread across multiple AWS Regions for running their compute-intensive simulations in a cost-effective manner, and how their solution will be enhanced further using the new Spot Placement Score API.

Computer Aided Drug Design in the cloud

The drug research and development process usually starts with a very large number of potentially promising compounds. From this virtually infinite chemical space, it’s the researcher’s goal to identify potent molecules that might be life-saving. These compounds are then gradually filtered through a multi-stage selection process until eventually a small subset of them is synthesized and thoroughly tested before being approved for use.

After identifying a potential drug candidate (the “lead compound”), the aim is to further optimize this lead into an actual active molecule. Computational methods based on molecular dynamics simulations help here by efficiently reducing the search space to only a few hundred candidates. These can then be processed and tested in the later stages, which are increasingly laborious – and expensive.

Computer aided drug design (CADD) is increasingly used in the early drug discovery stage, and thanks to advancements in technology, highly accurate and computationally-intensive methods can be used to select the best possible candidates. This includes a class of methods using molecular dynamics where we simulate the protein-ligand interaction at the atomic level.

These early drug discovery simulations are usually performed on-premises using large supercomputers shared by multiple research and development institutions. Building that kind of infrastructure takes years, and once it’s built, it’s expensive to maintain, has limited capacity, and serves many other users – which means it can take a long time to get results.

AWS can offer massive capacity that is provisioned, and charged for, only for the duration of a simulation. Besides the lower costs and reduced time to provision capacity, the cloud also offers increased flexibility through multiple instance types, families, and purchasing options. This flexibility means researchers can experiment with many of the available options to find the best fit for each application, empowering them to achieve the best possible trade-off between time-to-results and cost for each simulation.

Running GROMACS at scale on EC2 Spot Instances

EC2 Spot Instances enable AWS customers to request unused EC2 capacity at steep discounts – up to 90% compared to On-Demand prices. They’re a great fit for many stateless, fault-tolerant, or flexible workloads, and are especially suited for loosely coupled, computationally-intensive applications running over hundreds or thousands of instances. In these cases, Spot savings can add up to significant amounts of money, which can make an otherwise cost-prohibitive workload feasible.

Spot uses capacity pools, which are sets of unused EC2 instances with the same instance type and operating system running within an Availability Zone. When EC2 needs this capacity for another customer, instances are claimed back, with (at least) a two-minute warning.
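
When an interruption is coming, EC2 publishes a notice through the instance metadata service, which a workload can watch for and use as a signal to checkpoint and drain. Here’s a minimal sketch in bash, run on the instance itself, that polls the metadata endpoint for the Spot interruption notice using IMDSv2 (the reaction is workload-specific and left as a placeholder):

TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
# The spot/instance-action endpoint returns HTTP 404 until EC2 schedules
# the instance for reclaim, then returns a small JSON document.
while true; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -H "X-aws-ec2-metadata-token: ${TOKEN}" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action")
  if [ "${STATUS}" = "200" ]; then
    echo "Spot interruption notice received - checkpointing and draining"
    break  # placeholder: trigger the workload's checkpoint logic here
  fi
  sleep 5
done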

To be successful with Spot, it helps to be flexible – especially when it comes to your preferred instance types. Diversification across multiple Spot capacity pools means EC2 can provision new instances from other capacity pools in the event of Spot interruptions in a specific pool. Your workload can then resume on the new instances and continue on, often without any visible impact.

For most workloads, Spot diversification is achieved by using multiple instance types and tapping into all the Availability Zones within a Region; the more Availability Zones and instance types, the better the chance to get the desired Spot capacity, and the lower the frequency of Spot interruptions.
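
As an illustration, here’s a minimal sketch of what such a diversified request could look like with EC2 Fleet and the capacity-optimized Spot allocation strategy. The launch template name and the instance mix are hypothetical, and per-Availability-Zone subnet overrides are omitted for brevity:

$ cat > fleet.json <<'EOF'
{
    "SpotOptions": { "AllocationStrategy": "capacity-optimized" },
    "TargetCapacitySpecification": {
        "TotalTargetCapacity": 100,
        "DefaultTargetCapacityType": "spot"
    },
    "LaunchTemplateConfigs": [{
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "gromacs-worker",
            "Version": "1"
        },
        "Overrides": [
            { "InstanceType": "g4dn.2xlarge" },
            { "InstanceType": "g4dn.4xlarge" },
            { "InstanceType": "c5.4xlarge" },
            { "InstanceType": "c5.24xlarge" }
        ]
    }],
    "Type": "request"
}
EOF
$ aws ec2 create-fleet --cli-input-json file://fleet.json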

The Max Planck research team was interested in using EC2 Spot to provision thousands of instances for running their computationally-intensive simulations. Their GROMACS workload has a few characteristics that make it a great fit for Spot:

  • It’s loosely coupled and instance type flexible – it runs well on CPUs and GPUs.
  • It’s Region flexible – there’s relatively little input data and output data that need to be moved from one place to the next.
  • The acceptable time to get the end results is flexible – it can be measured in hours, days or even more than a week depending on the simulation. Time-flexible workloads like this often present trade-offs between cost and time-to-results.
  • It can implement checkpointing – a job can resume quickly after a Spot interruption, which matters for compute-heavy workloads like molecular dynamics, where a task might take hours or days to compute (see the sketch after this list).
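
For GROMACS specifically, checkpointing is built into gmx mdrun: -cpi resumes from a checkpoint file if one is present, -cpt sets the checkpoint interval in minutes, and -maxh makes the run wind down cleanly after a given number of hours. A minimal sketch, with hypothetical file names:

$ gmx mdrun -s topol.tpr -cpi state.cpt -cpt 15 -maxh 23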

In our previous blogs from this series, the Max Planck research team showed benchmark results across multiple instance types, and found that the most cost-effective instance types for their workload are g4dn.xlarge, g4dn.2xlarge, and g4dn.4xlarge. They could also use a larger set of instance types, though with varying cost-efficiency. Their results are summarized in Table 1.

Table 1 – Runtime and price per job (On-Demand and Spot prices at the time of benchmarking) for a 107k-atom system on a selection of instance types.
Instance type    $/job (On-Demand)    $/job (Spot)    Runtime
g4dn.xlarge      $50.40               $16.80          12.9 h
g4dn.2xlarge     $48.00               $16.20           8.5 h
g4dn.4xlarge     $57.60               $19.20           6.6 h
g4dn.8xlarge     $87.60               $28.20           6.0 h
g4dn.16xlarge    $165.60              $51.60           6.0 h
p4dn.24xlarge    $159.60              $49.80           6.1 h
c5.4xlarge       $117.00              $46.20          26.4 h
c5.24xlarge      $182.40              $70.80           8.0 h
c5a.24xlarge     $223.20              $94.80           9.6 h

Considering the workload’s regional flexibility and its large capacity needs, we helped the team run the simulations across multiple AWS Regions in parallel using a tool called ‘HyperBatch’. This solution, designed by an AWS Solutions Architect, runs AWS Batch across multiple AWS Regions to secure the required capacity by leveraging a large number of Spot capacity pools.

Depending on the trade-off between cost and time-to-results the Max Planck research team wanted to achieve for a given simulation, they had two options for securing their workload’s Spot capacity:

  • for the lowest possible cost, they could run only on the preferred G4dn GPU instance types – but this doesn’t offer much diversification. Since G4dn instances are popular for many workloads, including HPC, deep learning, and graphics rendering, they can often be in short supply in the Spot capacity pools. That can increase the rate of interruptions, which might stretch the simulation time to multiple days – not always workable.
  • for a faster time-to-results, the team could use a highly-diversified mix of instance types, including a variety of EC2 compute-optimized instances. Increasing the diversification of instances makes more compute capacity available overall, so the simulations can start – and finish – sooner.

For this simulation run, the team optimized for shorter time-to-results and used a mix of C5 and G4dn instance types of various sizes, as per the Spot diversification best practices.

Figure 1 shows the distribution across six different AWS Regions. Figure 2 shows the breakdown by instance type.

Figure 1 – Instance distribution by AWS Region.

Figure 2 – EC2 Instance distribution by instance type.

You can find out more about this simulation run and its outcomes in this previous blog from the series.

Spot Placement Score API

In the past, customers running large-scale workloads have requested AWS guidance to select the right AWS Regions and Availability Zones, so they had the best shot at getting the capacity they needed.

Given the large scale of these runs, EC2 Spot is in most cases the most efficient way to run these workloads.

AWS recently launched Spot Placement Score, a new API that allows customers to determine, in a self-service manner, the set of AWS Regions able to deliver enough Spot capacity for their workload. With Spot Placement Score, customers input their Spot capacity requirements in the form of instance types and a target capacity, and get back a scored list of Regions or Availability Zones, as in the example below:

{
    "InstanceTypes": [
        "c5.2xlarge",
        "c5.4xlarge",
        "c5.9xlarge",
        "c5.12xlarge",
        "c5.18xlarge",
        "c5d.24xlarge",
        "c5d.2xlarge",
        "c5d.4xlarge",
        "c5d.9xlarge",
        "c5d.12xlarge",
        "c5d.18xlarge",
        "c5d.24xlarge",
        "c5n.2xlarge",
        "c5n.18xlarge",
        "g4dn.2xlarge",
        "g4dn.4xlarge",
        "g4dn.8xlarge"
    ],
    "SingleAvailabilityZone": false,
    "TargetCapacity": 1000
}

Note: As the documentation explains, the Spot Placement Score API limits the TargetCapacity you can specify to roughly the number of Spot Instances you’ve previously launched in your AWS account, so you may receive errors with the above capacity figure if you haven’t been running thousands of Spot Instances before.

Save the above JSON snippet into a file named input.json and retrieve the Spot Placement Scores using the AWS command line tool:

$ aws ec2 get-spot-placement-scores --region us-east-1 --cli-input-json file://input.json

The response looks like the code block below, with each AWS Region given a score between 1 and 10 indicating how likely it is that you will obtain the required Spot capacity there:

{
    "SpotPlacementScores": [
        {
            "Region": "us-east-1",
            "Score": 9
        },
        {
            "Region": "us-east-2",
            "Score": 9
        },
        {
            "Region": "ap-northeast-1",
            "Score": 7
        },
        ...
    ]
}

Based on this output, you can then request Spot capacity in the Regions or Availability Zones with the highest scores – a high score (9 or 10) indicates that the request is very likely to succeed in provisioning the required capacity.
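
For example, a small script could feed the highest-scoring Region straight into the next capacity request. Here’s a minimal sketch that assumes jq is installed and reuses the hypothetical fleet.json from the diversification example above:

$ BEST_REGION=$(aws ec2 get-spot-placement-scores \
      --region us-east-1 --cli-input-json file://input.json \
      | jq -r '.SpotPlacementScores | max_by(.Score) | .Region')
$ aws ec2 create-fleet --region "${BEST_REGION}" --cli-input-json file://fleet.json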

Summary and future work

In this post we showed how the team at Max Planck provisioned a large number of EC2 Spot Instances over multiple AWS Regions for their drug research simulations using the HyperBatch solution with a manual configuration. We also explained how the new Spot Placement Score API enables the team to implement a more flexible solution going forward.

With the Spot Placement Score API integration, HyperBatch can automatically tap into multiple Spot capacity pools across even more AWS Regions than those configured manually before. This will further reduce the rate of Spot interruptions experienced during the simulations, and will react in real time to changes in capacity by automatically adjusting the weights of instance types across AWS Regions.
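
As a hypothetical illustration of that re-weighting step – this is a sketch, not HyperBatch’s actual implementation – the returned scores can be normalized into per-Region weights with a little jq:

$ aws ec2 get-spot-placement-scores --region us-east-1 \
      --cli-input-json file://input.json \
      | jq '.SpotPlacementScores as $s
            | ($s | map(.Score) | add) as $total
            | $s | map({Region, Weight: (.Score / $total)})'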

The subsequent simulations by the team at Max Planck will also test other scenarios, like running only on GPU instances for the lowest possible cost, but with a potentially longer time-to-results. The team will also consider the newly released G5 instance types, which deliver up to 3x better performance for graphics-intensive applications and machine learning inference, and up to 3.3x higher performance for machine learning training, compared to Amazon EC2 G4dn instances. We want to find out if G5 instances will improve the performance of GROMACS, and we’ll be sure to post here when we have some results.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

Cristian Măgherușan-Stanciu

Cristian is a Senior Specialist Solution Architect for EC2 Spot with AWS. He has a DevOps, Consulting, and Open Source development background, has been using AWS since 2013 and Spot since 2015 while at Nokia and Here Technologies. His current focus is helping AWS customers optimize their costs by adopting Spot and Graviton2 instances.

Christian Kniep

Christian is a Senior Developer Advocate for HPC & Batch with AWS and gained his HPC experience in the automotive sector. After learning about containers he pivoted to work for startups and Docker Inc before joining AWS. He is passionate about bringing HPC to the masses by simplifying the experience and introducing new technologies like containers.