Running a 3.2M vCPU HPC Workload on AWS with YellowDog
This post was contributed by Colin Bridger, Principal HPC GTM Specialist, and Simon Ponsford, CEO at YellowDog.
Historically, advances in fields such as meteorology, healthcare, and engineering were achieved through large investments in on-premises computing infrastructure. Upfront capital investment and operational complexity have been the accepted norm of large-scale HPC research. These barriers to deploying HPC technologies limited the pace at which smaller companies could progress.
In recent work, OMass Therapeutics, a biotechnology company identifying medicines against highly validated target ecosystems, used YellowDog on AWS to analyze and screen 337 million compounds in 7 hours, a task that would have taken two months using an on-premises HPC cluster. YellowDog, based in Bristol in the UK, ran the drug discovery application on an extremely large, multi-Region cluster in AWS with the AWS ‘pay-as-you-go’ pricing model. It provided a central, unified interface to monitor and manage AWS Region selection, compute provisioning, job allocation, and execution. To prove out the solution, YellowDog scheduled 200,000 molecular docking analyses across two Regions and completed that workload in 65 minutes. This enabled scientists to start work on analysis the same day, significantly accelerating the drug discovery process.
In this post, we’ll discuss the AWS and YellowDog services we deployed, and the mechanisms used to scale to 3.2 million vCPUs in 33 minutes using multiple EC2 instance types across multiple Regions. Once analysis started, the utilization rate was 95%.
Overview of solution
YellowDog democratizes HPC with a comprehensive cloud workload management solution that is available to all customers, at any scale, anywhere in the world. It’s cloud native and schedules compute nodes based on application characteristics, rather than the constraints of a fixed on-premises HPC cluster. This allows you to manage clusters using the YellowDog scheduler or third-party schedulers from one control pane, so you have a single view on cost and department consumption across applications. It can also provision resources across multiple Regions, Availability Zones, instance types and machine sizes.
The YellowDog platform runs on Amazon EC2, using Amazon Elastic Kubernetes Service (Amazon EKS). It uses Elastic Load Balancing to manage access to the cluster. Amazon Elastic Block Store (Amazon EBS) is used for the foundational services (Kafka, Artemis and Zookeeper) that require persistent storage. AWS CloudTrail and Amazon CloudWatch are used for monitoring.
In this configuration, we also used Amazon EC2 Spot Fleets in “maintain mode” to acquire and maintain Spot Instances. These fleets are configured with multiple instance overrides (and subnets) and use a “capacity-optimized and order-prioritized” allocation strategy. Finally, Amazon Simple Storage Service (Amazon S3) is used for data transfer and access.
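As a rough illustration of how the settings described above might map onto a Spot Fleet request, the sketch below builds a request configuration as a plain Python dictionary. It is an assumption about the shape of such a request, not YellowDog's actual configuration: the function name, role ARN, launch template ID, and subnet IDs are hypothetical placeholders. The key names follow the EC2 Spot Fleet request structure, where "maintain" is the request type and "capacityOptimizedPrioritized" is the allocation strategy that combines capacity optimization with a priority order.

```python
from __future__ import annotations

from typing import Any


def build_spot_fleet_config(
    fleet_role_arn: str,
    launch_template_id: str,
    instance_types: list[str],
    subnet_ids: list[str],
    target_capacity: int,
) -> dict[str, Any]:
    """Build a Spot Fleet request config; instance_types is ordered by preference."""
    # One override per (instance type, subnet) pair. A lower Priority value
    # means a higher preference for that instance type.
    overrides = [
        {
            "InstanceType": instance_type,
            "SubnetId": subnet_id,
            "Priority": float(rank),
        }
        for rank, instance_type in enumerate(instance_types)
        for subnet_id in subnet_ids
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        # "maintain" asks the fleet to replace interrupted Spot Instances.
        "Type": "maintain",
        "AllocationStrategy": "capacityOptimizedPrioritized",
        "TargetCapacity": target_capacity,
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
    }


# Hypothetical identifiers, for illustration only.
config = build_spot_fleet_config(
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",
    launch_template_id="lt-0123456789abcdef0",
    instance_types=["c5.24xlarge", "m5.24xlarge", "c5.9xlarge"],
    subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],
    target_capacity=100,
)
print(len(config["LaunchTemplateConfigs"][0]["Overrides"]), "overrides")
```

A dictionary like this could be passed as the SpotFleetRequestConfig argument to the EC2 request_spot_fleet call in boto3. The call itself is left out so the sketch runs without AWS credentials.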
To execute the run, YellowDog launched 46,733 Spot Instances, utilizing 24 Amazon EC2 Fleets and 8 different instance types. In total, we provisioned 3.2 million vCPUs and achieved over 95% utilization for the duration of our run. YellowDog scheduled 200,000 molecular docking analyses across two geographically dispersed Regions, in North America and Europe. The combination of Docker containers, AutoDock Vina, and Open Babel was used to orchestrate, analyze, and score hit compounds.
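As a quick sanity check on these figures, the short calculation below derives the average number of vCPUs per instance. It uses only the numbers quoted in this post. Because "3.2 million" is a rounded figure, the result is approximate. The 36 and 96 vCPU sizes used for comparison are the published sizes of the c5.9xlarge and c5.24xlarge instance types named later in this post.

```python
TOTAL_VCPUS = 3_200_000    # "3.2 million vCPUs" (rounded figure from the post)
TOTAL_INSTANCES = 46_733   # Spot Instances launched during the run

# Published vCPU counts for the smallest and largest sizes named in the post.
VCPUS_C5_9XLARGE = 36
VCPUS_C5_24XLARGE = 96

avg_vcpus_per_instance = TOTAL_VCPUS / TOTAL_INSTANCES
print(f"Average vCPUs per instance: {avg_vcpus_per_instance:.1f}")

# The average should fall between the smallest and largest instance sizes.
print(VCPUS_C5_9XLARGE < avg_vcpus_per_instance < VCPUS_C5_24XLARGE)
```

An average of roughly 68 vCPUs per instance is consistent with a fleet that mixes 9xlarge and 24xlarge sizes.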
Our batch ran for 65 minutes and, notably, the utilization of the instances kept pace with provisioning. Furthermore, when Spot Instances were reclaimed, they were immediately replaced, with no disruption to workload execution.
The YellowDog AWS Cluster implementation relies on a repeatable, structured set of processes:
Technology that can be clustered: The YellowDog platform is inherently suited to building clusters. It uses Kubernetes to manage all deployed services, from open-source components (e.g., Mongo, Zookeeper, Kafka) to native foundational services such as the scheduler and compute service contained within the YellowDog platform. This means that all services, like provisioning and scheduling, scale linearly with the work that needs to be executed.
Intrinsic load balancing: To ensure efficient utilization of available capacity, tasks are submitted to internal queues to be picked up by the next available instance. This allows flexible, horizontal scaling as service capacity is increased, which maximizes throughput. This design reframes the problem of scaling distributed computing as one of servicing large numbers of requests.
Cloud native scheduling and meta scheduling: The YellowDog scheduler, which coordinates and manages a workload, is a pull-based scheduler. It responds to registration requests and status updates from workers on cloud instances, against a queue of work. The scheduler can manage jobs that last fractions of a second or many hours, and it reallocates jobs when workers are terminated due to Spot Instance reclamations. It has flexible auto-scaling algorithms to ensure that work-demand matches worker-supply. This also means that when instances are withdrawn in one location, they can be compensated for in another. In addition, the scheduler can orchestrate and manage third-party tools, such as Slurm or IBM LSF, to give a single view on cost and department consumption across applications.
Use of heterogeneous resources: The YellowDog scheduler is designed to either use heterogeneous compute within the same cluster or combine compute from heterogeneous clusters in a single ‘demand vs. supply’ paradigm. This heterogeneous compute can be virtual or bare-metal instances, can have different lifecycles (Spot Instances, On-Demand, or Reserved Instances), and can include both cloud and on-premises resources.
Object Store service: YellowDog’s object store service is used to manage data transfer, which is uploaded to Amazon S3 object storage prior to workload execution. It manages access during job submission and completion. The object store service, which is integrated with the YellowDog scheduler, provides a secure and recoverable mechanism to move data. This includes collection from the individual workers, as well as making data available to them.
Lightweight configuration: The configuration required for running workloads on the YellowDog platform is exceptionally lightweight. Users configure keyrings, which store cloud credentials using AES-256 encryption. They can define Region-specific details, such as security groups, subnet IDs, and IAM roles, and they can also define the instance provisioning strategy. In this run, we used a split provision strategy, which keeps compute even across multiple sources. We used two Spot Fleets and specified eight different Amazon EC2 instance types, ranging from large c5.24xlarge and m5.24xlarge to smaller c5.9xlarge and m5.24xlarge instances. The instance types were ordered by preference to best match the requirements of the workload.
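To make the pull-based scheduling idea described above more concrete, here is a minimal, single-process sketch. It is not the YellowDog scheduler or its API. The class and method names (PullScheduler, request_work, worker_lost) are hypothetical, and a real system would handle registration, status updates, and queues across a network. The sketch only shows the core pattern: workers ask for work, and a task held by a reclaimed worker goes back on the queue.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class PullScheduler:
    """Toy pull-based scheduler: workers request tasks from a shared queue."""

    pending: deque = field(default_factory=deque)
    in_flight: dict = field(default_factory=dict)  # worker_id -> task
    completed: list = field(default_factory=list)

    def submit(self, task: str) -> None:
        self.pending.append(task)

    def request_work(self, worker_id: str) -> str | None:
        """Called by a worker when it registers or becomes idle."""
        if worker_id in self.in_flight or not self.pending:
            return None
        task = self.pending.popleft()
        self.in_flight[worker_id] = task
        return task

    def report_done(self, worker_id: str) -> None:
        """Called by a worker when its current task has finished."""
        task = self.in_flight.pop(worker_id, None)
        if task is not None:
            self.completed.append(task)

    def worker_lost(self, worker_id: str) -> None:
        """Called when a worker disappears, e.g. after a Spot reclamation."""
        task = self.in_flight.pop(worker_id, None)
        if task is not None:
            # Put the task at the front so it is retried next.
            self.pending.appendleft(task)


scheduler = PullScheduler()
for i in range(3):
    scheduler.submit(f"dock-{i}")

scheduler.request_work("worker-a")  # worker-a takes dock-0
scheduler.request_work("worker-b")  # worker-b takes dock-1
scheduler.worker_lost("worker-a")   # dock-0 returns to the queue
scheduler.report_done("worker-b")   # dock-1 completes
print(scheduler.request_work("worker-c"))  # worker-c picks up dock-0
```

Because workers pull tasks rather than having tasks pushed to them, adding or losing workers changes only how quickly the queue drains, not how tasks are assigned.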
The run had 1 million vCPUs working within 7 minutes. The 2 million mark was achieved within 11 minutes. At these milestones, over 95% of vCPUs were processing data at full utilization. Within 33 minutes, the 3.2 million vCPUs were processing the workload, evenly split amongst the Amazon Elastic Compute Cloud (Amazon EC2) fleets.
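The even split across fleets can be illustrated with a small sketch of a split provisioning policy. This is an assumption about how such a policy could work, not YellowDog's implementation. The function names and the fleet names are hypothetical. The first function divides a target instance count evenly across sources, and the second works out how many instances each source still needs to reach its even share.

```python
from __future__ import annotations


def split_evenly(target: int, sources: list[str]) -> dict[str, int]:
    """Divide a target instance count as evenly as possible across sources."""
    if not sources:
        raise ValueError("at least one source is required")
    base, remainder = divmod(target, len(sources))
    # The first `remainder` sources each take one extra instance.
    return {
        source: base + (1 if index < remainder else 0)
        for index, source in enumerate(sources)
    }


def top_up(target: int, running: dict[str, int]) -> dict[str, int]:
    """Return how many instances to add per source to reach an even share."""
    desired = split_evenly(target, list(running))
    return {
        source: max(0, desired[source] - count)
        for source, count in running.items()
    }


# Hypothetical fleet names; the instance count is the figure quoted in the post.
fleets = ["fleet-north-america", "fleet-europe"]
print(split_evenly(46_733, fleets))

# If one fleet lags behind, request more there until the split is even again.
print(top_up(10, {"fleet-north-america": 3, "fleet-europe": 7}))
```

This sketch only ever adds capacity. A fuller policy would also decide when to release instances from a source that holds more than its share.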
OMass Therapeutics used this platform at scale to accelerate and expand its virtual screening capabilities. With access to this on-demand supercomputer, the researchers at OMass were able to analyze and screen 337 million compounds in 7 hours. Replicating that on their on-premises systems would have taken two months. The benefit of getting a drug to sick people months ahead of schedule is clear, as we can all appreciate in current times.
Dr Giles Brown, VP and Head of Medicinal Chemistry, OMass Therapeutics, said: “The professional and timely manner of this collaboration has enabled OMass to rapidly screen a novel target and help the company in its ambition to build a pipeline of small molecule therapeutics. The scale and speed of the platform is something OMass would not be able to replicate.”
We’re excited to help organizations accomplish more with on-demand access to powerful computing resources, intelligent provisioning capabilities and cloud-based meta scheduling. What’s particularly exciting is that we’re able to unblock access to computing power as a limiting factor, leaving only the problems customers are trying to solve, and their imagination for finding new ones.
Doing work at this global magnitude enables organizations to solve global problems. Vital vaccination research helps tackle diseases like COVID-19. Identifying possible evacuation routes in response to severe weather events saves lives. Whatever the application, there are endless possibilities that come from accessing computing power at immediate scale.
Some of the content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.