How Thermo Fisher Scientific Accelerated Cryo-EM using AWS ParallelCluster
Thermo Fisher Scientific is a world leader in serving science, including compute-intensive Cryogenic Electron Microscopy (Cryo-EM) workloads that process large volumes of images and movies. Scientists use Cryo-EM to determine the 3-dimensional structure of biomolecules at near-atomic resolution, a process that generates terabytes of images and movies on Thermo Fisher instruments. Reducing the total cost to results, time spent, and operational overhead of iteratively processing these data sets is critical to improving the speed and quality of medicinal chemistry results.
In this blog post, we’ll walk you through the process of building a successful Cryo-EM benchmarking pilot using AWS ParallelCluster, Amazon FSx for Lustre, and cryoSPARC (from Structura Biotechnology) and explain some of our design decisions along the way.
Cryo-EM was the subject of the Nobel Prize in Chemistry in 2017, and was used to produce some of the most valuable 3D structures of important drug targets, including the spike protein of SARS-CoV-2. In Cryo-EM, protein samples are flash-frozen in vitreous ice and imaged with a transmission electron microscope, yielding 3D digital representations of microscopic structures in near-native states. The instruments produce terabytes of data per sample, and the data resolution increases with each new generation of instruments. That data needs to be processed using High Performance Computing (HPC) resources, with a complex pipeline accelerated by one or more GPUs. The pipeline is partly interactive, requiring scientists to be involved, and the results require 3D visualization, too.
Simplifying Complex Compute Requirements Using AWS ParallelCluster
The HPC needs of Cryo-EM are difficult to solve on-premises. That’s because Cryo-EM analysis requires flexible access to compute resources, since the speed of any stage may not scale predictably with the number of GPUs per node, depending on the specifics of the processing pipeline. Rather than relying on aging and expensive hardware in on-premises data centers, Cryo-EM workloads on AWS can just take advantage of the elasticity of Amazon Elastic Compute Cloud (EC2) resources.
AWS ParallelCluster is an open-source cluster management tool that uses infrastructure as code to provision clusters of Amazon EC2 instances that you configure to match the needs of each step in the workflow. Scientists can experiment to find the optimal balance between cost and performance, benchmarking against the number and type of GPUs per worker node. There’s no additional cost associated with ParallelCluster; you only pay for the underlying resources provisioned by the framework. It scales resources up and down to match what’s needed, alleviating the pain point of the resource contention of traditional on-premises clusters during peak processing times.
High Performance Collaboration and Cost-Efficient File Storage with FSx for Lustre
In addition to compute flexibility, scientists increasingly need to share data sets and results with geographically distant colleagues. Transferring large data sets across networks with limited bandwidth causes delays of days or weeks, and does not provide controls for who accesses the data after it’s transferred. Storing that data in Amazon Simple Storage Service (Amazon S3) allows global sharing with a few clicks, with granular access controls to fit the business and scientific management, security, and governance requirements.
To make GPU nodes perform as fast as possible, it is critical to have fast-caching file storage to keep the GPUs fed. For Cryo-EM, sequential I/O performance is particularly relevant because large particle stacks need to pass in and out of the GPU's memory. Amazon FSx for Lustre is a fully managed AWS service that delivers a high-performance parallel file system optimized for HPC workloads, and it integrates with both ParallelCluster and Amazon S3. Combining S3 and FSx for Lustre gives users the best of both worlds in terms of cost and performance. Since FSx for Lustre is shared between all the compute nodes, it further reduces costs by not requiring compute nodes that come with local storage. In short, FSx for Lustre is a cost-effective and performant shared drive that provides rapid read/write access to temporary Cryo-EM files and speeds up Cryo-EM reconstructions.
Benchmarks using cryoSPARC
CryoSPARC is a complete solution for Cryo-EM data processing, developed by Structura Biotechnology Inc. for use in research and drug discovery. It keeps the compute footprint small by quickly processing raw data for 3D reconstruction. It features unique algorithms that address issues of flexibility and perform particularly well for therapeutically relevant targets, like membrane proteins. Thermo Fisher automated their deployment of cryoSPARC on AWS using a configuration file that includes execution of a post-install script, which sets up a Cryo-EM processing environment together with the infrastructure. With a single command, hardware and software stand ready for Cryo-EM processing.
This blog post walks through the technical considerations and the “why” behind the configuration elements in the solution. For additional prerequisites, clean-up and step-by-step instructions to reproduce this solution in your own account, refer to the AWS Samples GitHub repository.
Networking, Security, and Compute Availability Prerequisites
A typical use of a default VPC has public and private subnets balanced across multiple Availability Zones (AZs). However, HPC clusters (like ParallelCluster) usually prefer a single-AZ so they can keep communication latency low and use Cluster Placement Groups. For the compute nodes, you can create a large private subnet with a relatively large number of IP addresses. Then, you can create a public subnet with minimal IP addresses, since it will only contain the head node.
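To make the subnet sizing concrete, the sketch below computes how many addresses a subnet of a given prefix length leaves for instances. It relies on the fact that AWS reserves five addresses in every subnet (the first four and the last one). The function name `usable_ips` and the example prefix lengths are illustrative assumptions, not values taken from the repository.

```shell
#!/usr/bin/env bash
# Usable IPv4 addresses in an AWS subnet with the given prefix length.
# AWS reserves 5 addresses per subnet (the first four and the last one).
usable_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Illustrative sizes (assumptions, adjust to your own VPC plan):
#   usable_ips 20   # large private subnet for the compute nodes
#   usable_ips 28   # minimal public subnet for the head node
```

A /28 is the smallest subnet AWS allows, which is why it is a natural choice for a subnet that only hosts the head node.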
HPC EC2 instances like the P4d family aren't available in every AZ. That means we need to determine which AZ in a given Region has all the compute families we need. We can do that with the AWS CLI describe-instance-type-offerings command. The easiest way to do this is to log in to AWS CloudShell, which provides a shell environment that is ready to issue AWS CLI commands within a few minutes. After the CloudShell environment is provisioned, copy and paste the following command into the shell, replacing the bracketed placeholder with your desired Region.
aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --region <region> \
    --filters Name=instance-type,Values=p4d.24xlarge \
    --query "InstanceTypeOfferings[*].Location" \
    --output text
The output will show you which AZs have the instances you described in the input parameters.
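Because the cluster uses several instance families, it can help to run the same query once per instance type and keep only the AZs that appear in every result. The following is a minimal sketch of that idea. The function names `azs_for_type` and `common_azs` are hypothetical, and the g4dn and c5n instance sizes in the usage comment are assumptions (the post names only the families).

```shell
#!/usr/bin/env bash
# List the AZs in a Region that offer one instance type.
# Usage: azs_for_type <region> <instance-type>
azs_for_type() {
  aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --region "$1" \
    --filters "Name=instance-type,Values=$2" \
    --query "InstanceTypeOfferings[*].Location" \
    --output text
}

# Print the AZs that appear in every argument.
# Each argument is a whitespace-separated list of AZ names.
common_azs() {
  local n=$#
  local list
  for list in "$@"; do
    # Deduplicate within one list so each AZ counts once per list.
    printf '%s\n' $list | sort -u
  done | sort | uniq -c | awk -v n="$n" '$1 == n && $2 != "" { print $2 }'
}

# Example (needs AWS credentials; the instance sizes are assumptions):
#   common_azs "$(azs_for_type us-east-1 p4d.24xlarge)" \
#              "$(azs_for_type us-east-1 g4dn.12xlarge)" \
#              "$(azs_for_type us-east-1 c5n.18xlarge)"
```

Any AZ printed by `common_azs` is a candidate for the single-AZ subnets described earlier.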
You’ll also need an EC2 SSH key-pair as an input parameter to the ParallelCluster configuration, to enable SSH access to the head node once the cluster is ready.
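If you do not already have a key pair, one way to create it is with the AWS CLI `create-key-pair` command. This is a sketch only: the function name `create_key_pair` and the choice to store the private key as `<key-name>.pem` in the current directory are assumptions.

```shell
#!/usr/bin/env bash
# Create an EC2 key pair and save the private key with owner-only permissions.
# Usage: create_key_pair <key-name> <region>
create_key_pair() {
  local key_name=$1
  local region=$2
  (
    # Restrictive umask so the key file is never group- or world-readable.
    umask 077
    aws ec2 create-key-pair \
      --key-name "$key_name" \
      --region "$region" \
      --query 'KeyMaterial' \
      --output text > "${key_name}.pem"
  ) && chmod 400 "${key_name}.pem"
}

# Example (needs AWS credentials; the key name is a placeholder):
#   create_key_pair cryoem-pilot us-east-1
```

The key name you pass here is the value that goes into the key pair placeholder of the ParallelCluster configuration.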
The data-transfer mechanism to move data from the instruments into Amazon S3 depends on the connectivity in the lab environment and the volume of data to be transferred. We recommend AWS DataSync, which automates secure data transfer from on-premises into the cloud with minimal development effort. AWS Storage Gateway (specifically, Amazon S3 File Gateway) is another viable option, especially if lab connectivity is limited or if you need continued two-way access from on-premises to the transferred data sets. Both DataSync and Storage Gateway can be bandwidth-throttled to protect your non-HPC business-critical needs.
Alternatively, you can use the AWS CLI to transfer individual files, or use a partner solution to get started quickly.
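For the AWS CLI route, a single-file upload could look like the sketch below. The function name `upload_movie`, the bucket name, and the key prefix are hypothetical. The commented `max_bandwidth` setting is the AWS CLI's own S3 transfer throttle and is separate from the DataSync and Storage Gateway throttling mentioned above.

```shell
#!/usr/bin/env bash
# Upload one file to s3://<bucket>/<prefix>/<file name>.
# Usage: upload_movie <local-file> <bucket> <prefix>
upload_movie() {
  local file=$1
  local bucket=$2
  local prefix=$3
  aws s3 cp "$file" "s3://${bucket}/${prefix}/$(basename "$file")"
}

# Optional: cap the bandwidth used by `aws s3` transfers (value is an example).
#   aws configure set default.s3.max_bandwidth 50MB/s

# Example (needs AWS credentials; names are placeholders):
#   upload_movie /data/session01/movie_0001.tiff my-cryoem-bucket raw
```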
AWS Identity and Access Management (IAM) Permissions
While ParallelCluster creates its own least-privilege roles and policies by default, many enterprises limit their AWS account users' access to IAM actions. ParallelCluster therefore also supports using pre-created IAM resources, which you can ask your IT services team to create for you. The required permissions and roles are listed in the ParallelCluster documentation to help you get started quickly.
You can provision your FSx file system as persistent or scratch. Persistent file systems can automatically export data back to Amazon S3, but scratch file systems don’t. The example in this GitHub repo uses scratch, since it is provisioning a benchmark environment rather than a production environment. If you want to integrate a data export task into the ParallelCluster job scheduler so that every time a job completes, a data export is run transparently in the background, this requires additional IAM Policy statements to be attached to the instance profile of the head node. The policy is in the file FSxLustreDataRepoTasksPolicy.yml in this GitHub repo. Make sure the role that you’re using to execute your ParallelCluster provisioning includes this policy if you intend to run the export.
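As an illustration of what such an export step might run, the sketch below wraps the AWS CLI `fsx create-data-repository-task` command. It is not the repository's implementation: the function name `export_results`, the file system ID, and the path are placeholders, and how you hook the call into the scheduler (for example, at the end of a job script) is left open.

```shell
#!/usr/bin/env bash
# Start an FSx for Lustre data repository task that exports a path
# on the file system back to the linked S3 bucket.
# Usage: export_results <file-system-id> <path-relative-to-mount>
export_results() {
  local fs_id=$1
  local path=$2
  aws fsx create-data-repository-task \
    --file-system-id "$fs_id" \
    --type EXPORT_TO_REPOSITORY \
    --paths "$path" \
    --report Enabled=false
}

# Example (needs AWS credentials and the IAM policy described above;
# the ID and path are placeholders):
#   export_results fs-0123456789abcdef0 results
```

The call returns as soon as the task is created, so the export itself proceeds in the background while the next job starts.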
Compute Cluster Configuration and End User Access
The architecture described in this blog post allows you to iteratively test on multiple EC2 compute families and break the workflow into small, medium, and large queues to create a configuration that optimizes each stage for cost and performance. Referencing Figure 1, Queue 1 contains g4dn GPU nodes, Queue 2 contains p4d nodes, and Queue 3 contains utility compute-optimized c5n nodes.
CryoSPARC runs a web application and a MongoDB database on the head node, which you can access over a reverse SSH tunnel to the head node, or by using AWS Systems Manager Session Manager. The web application allows you to interactively choose particles during the workflow and access visualizations during (and after) processing.
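As one possible way to reach the web application from a workstation, the sketch below opens a plain local port forward with `ssh -L`; this is a simpler variant than a reverse tunnel and may differ from the setup in the repository. The function name `open_tunnel`, the `ec2-user` login (the Amazon Linux default), and port 39000 (assumed here to be the cryoSPARC base port) are assumptions you should check against your own deployment.

```shell
#!/usr/bin/env bash
# Forward a local port to the same port on the head node.
# Usage: open_tunnel <private-key-file> <head-node-address> [port]
open_tunnel() {
  local key=$1
  local host=$2
  local port=${3:-39000}   # assumed cryoSPARC web port; override if different
  # -N: do not run a remote command, only forward the port.
  ssh -i "$key" -N -L "${port}:localhost:${port}" "ec2-user@${host}"
}

# Example (address and key file are placeholders):
#   open_tunnel cryoem-pilot.pem 203.0.113.10
# Then browse to http://localhost:39000 while the tunnel is open.
```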
For the head node, you should choose an instance large enough to support the cryoSPARC application and small enough to minimize costs. The example uses a c5.4xlarge, but you can choose another instance type to fit your usage pattern.
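To show how the head node and the three queues could map onto a configuration file, the sketch below writes an illustrative fragment in the ParallelCluster 3 YAML schema. It is an assumption-laden outline, not the file from the repository: the queue names, instance sizes other than c5.4xlarge and p4d.24xlarge, the `MaxCount` values, the operating system, and the angle-bracket placeholders are all hypothetical.

```shell
#!/usr/bin/env bash
# Write an illustrative ParallelCluster 3 configuration fragment.
# Usage: write_cluster_config <output-file>
write_cluster_config() {
  cat > "$1" <<'EOF'
Region: <region>
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.4xlarge
  Networking:
    SubnetId: <public-subnet-id>
  Ssh:
    KeyName: <key-pair-name>
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-small            # Queue 1: g4dn GPU nodes
      ComputeResources:
        - Name: g4dn
          InstanceType: g4dn.12xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - <private-subnet-id>
    - Name: gpu-large            # Queue 2: p4d GPU nodes
      ComputeResources:
        - Name: p4d
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 2
      Networking:
        SubnetIds:
          - <private-subnet-id>
        PlacementGroup:
          Enabled: true
    - Name: cpu-utility          # Queue 3: compute-optimized c5n nodes
      ComputeResources:
        - Name: c5n
          InstanceType: c5n.18xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - <private-subnet-id>
EOF
}

# Example:
#   write_cluster_config cluster-config.yaml
```

Setting `MinCount: 0` lets each queue scale down to zero nodes when idle, which is what keeps a benchmarking cluster inexpensive between runs.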
Installation and Storage Considerations
The ParallelCluster post-install script executes the cryoSPARC installation with the license key you provide, and installs it to a shared volume that can range from 1 to 500 GB. We found that 100 GB was sufficient for our use case. AWS ParallelCluster attaches the EBS volume to the head node, and exposes it to the compute nodes as an NFS mount. If clusters will be repeatedly de-provisioned and re-provisioned, you could take advantage of AWS ParallelCluster's support for EC2 Image Builder to create your own custom AMIs with the software stack already installed. This reduces provisioning time and avoids repeating the post-install actions.
We also spin up a 12 TB Amazon FSx for Lustre scratch file system for the benchmarking workload due to its relatively temporary lifespan.
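The two storage choices described here could be expressed in the ParallelCluster 3 `SharedStorage` section roughly as follows. This is a sketch under assumptions: the storage names, mount directories, the `SCRATCH_2` deployment type, and the bucket placeholder are hypothetical, while the 100 GB and 12 TB sizes come from the text (12000 GiB is the nearest valid FSx for Lustre capacity).

```shell
#!/usr/bin/env bash
# Write an illustrative SharedStorage fragment for a ParallelCluster 3 config.
# Usage: write_storage_config <output-file>
write_storage_config() {
  cat > "$1" <<'EOF'
SharedStorage:
  - Name: cryosparc-install      # EBS volume, shared from the head node over NFS
    StorageType: Ebs
    MountDir: /shared
    EbsSettings:
      Size: 100
  - Name: cryoem-scratch         # FSx for Lustre scratch file system
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      StorageCapacity: 12000
      DeploymentType: SCRATCH_2
      ImportPath: s3://<bucket-name>
EOF
}

# Example:
#   write_storage_config storage-fragment.yaml
```

`ImportPath` links the file system to the S3 bucket, so objects already uploaded from the instruments appear under the mount directory.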
To reproduce the ParallelCluster setup described in this blog post, use the networking and security prerequisites we mentioned before to update the configuration file placeholders with subnet IDs, EC2 key pair name, cryoSPARC license, and S3 bucket name. Then follow the deployment instructions in the README file.
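Filling in the placeholders can be scripted. The sketch below substitutes values from environment variables into a copy of the configuration file with `sed`. The placeholder tokens, variable names, and the function name `fill_config` are hypothetical; match them to the tokens actually used in the repository's file.

```shell
#!/usr/bin/env bash
# Replace placeholder tokens in a config template with environment values.
# Fails with a message if a required variable is unset or empty.
# Usage: fill_config <template-file> <output-file>
fill_config() {
  local template=$1
  local out=$2
  sed -e "s|<public-subnet-id>|${PUBLIC_SUBNET_ID:?set PUBLIC_SUBNET_ID}|g" \
      -e "s|<private-subnet-id>|${PRIVATE_SUBNET_ID:?set PRIVATE_SUBNET_ID}|g" \
      -e "s|<key-pair-name>|${KEY_NAME:?set KEY_NAME}|g" \
      -e "s|<license-id>|${CRYOSPARC_LICENSE:?set CRYOSPARC_LICENSE}|g" \
      -e "s|<bucket-name>|${S3_BUCKET:?set S3_BUCKET}|g" \
      "$template" > "$out"
}

# Example, followed by cluster creation with the ParallelCluster 3 CLI
# (the cluster name is a placeholder; follow the README for the exact steps):
#   fill_config config-template.yaml cluster-config.yaml
#   pcluster create-cluster --cluster-name cryoem-pilot \
#       --cluster-configuration cluster-config.yaml
```

Keeping the license key in an environment variable rather than in the template avoids committing it to version control.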
Once your cluster has been provisioned, you are ready to run your cryoSPARC jobs as described in Structura’s documentation. Benchmarking results were published in a Thermo Fisher white paper, Cryo-EM processing at the pace of medicinal chemistry on AWS. To hear the team talk about their results live, watch the Lab Roots Webinar: Keeping up with the chemists – Cloud processing for pharma cryo-EM on YouTube.
Using repeatable, secure, flexible, and iterative deployments via AWS ParallelCluster, Thermo Fisher reduced the processing time on public data sets from 4 weeks on-premises to 4 days on AWS. The total cost for a benchmarking execution was under $800, compared to buying and managing hundreds of thousands of dollars in on-premises hardware. Finally, the resolution of the resulting particle images was increased, because the team was able to use the latest hardware available from AWS.
Use the resources provided to benchmark your own data sets, and use the flexibility, performance, and cost optimization of AWS to accelerate the speed of science. Don’t forget to clean up your cluster using the instructions in the GitHub README file to avoid unintended charges in your account.
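As a reminder of what clean-up can look like, the sketch below wraps the ParallelCluster 3 `delete-cluster` command. The README remains the authoritative procedure; the function name `delete_cluster` and the cluster name are placeholders, and resources created outside the cluster (such as the S3 bucket) are not removed by this command.

```shell
#!/usr/bin/env bash
# Delete a ParallelCluster 3 cluster and the resources it provisioned.
# Usage: delete_cluster <cluster-name> <region>
delete_cluster() {
  local name=$1
  local region=$2
  pcluster delete-cluster --cluster-name "$name" --region "$region"
}

# Example (the cluster name is a placeholder):
#   delete_cluster cryoem-pilot us-east-1
```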