AWS Public Sector Blog
Structural biologists learning cryo-electron microscopy can access educational resources powered by AWS
Guest post by Michael A. Cianfrocco, Ph.D., University of Michigan
Cryo-electron microscopy (cryo-EM) — the technology that won the 2017 Nobel Prize in Chemistry — lets scientists peer into the molecular “machines” at work in our cells in ways that were previously impossible. Using advanced microscopes, cryo-EM captures images of proteins flash-frozen in vitreous ice, revealing their 3D structure in near-native states. The field of cryo-EM is growing rapidly, and the National Institutes of Health (NIH) invested in creating three national cryo-EM centers to increase access to these highly advanced microscopes.
Defining the 3D structures of these molecules is critical to understanding fundamental biological processes and developing new medicines. Capturing images of the proteins requires enormous data collection and data processing power; a typical dataset for one project is several terabytes.
Even as access to this technology improves, many researchers are still limited by computing bottlenecks. The cryo-EM field needs to provide more hands-on training in how to process such large datasets. Amazon Web Services (AWS) allows us to provide training, broadening the impact of this important structural biology technology.
The University of Michigan Life Sciences Institute was an early adopter of cryo-EM. As our cryo-EM lab continues to expand, we are focusing on advancing the technology we use and opening the field for fellow researchers, both inside and outside of our university.
Each summer, the university hosts a hands-on workshop that lets researchers from around the world learn how to use common image processing packages for analyzing cryo-EM data. Participants bring a project they are working on and the associated multiple terabytes of data. The workshop requires 40 individual cryo-EM data processing workstations and data ingestion of 100-200 terabytes. AWS provides the large computing infrastructure and flexibility that allows this workshop to run smoothly.
In the workshop, participants are exposed to multiple cryo-EM software packages such as cryoSPARC, RELION, cisTEM, SPHIRE, Rosetta, and Phenix. Because some software packages use Central Processing Units (CPUs) whereas others use Graphics Processing Units (GPUs), each participant is provided with a dedicated g3.16xlarge instance on Amazon Elastic Compute Cloud (Amazon EC2) (4 x NVIDIA M60 GPU, 64 vCPUs, 488 GB CPU RAM). During the five-day workshop, the user has unrestricted access and can run any data analysis routines.
Workshop participants need web access to each g3.16xlarge instance to run cryoSPARC. Using AWS security group features, each instance is exposed only to a specific set of IP addresses. This type of access is common and straightforward for AWS users; however, typical high performance computing facilities in both academic and industry settings do not allow direct web access to GPU nodes. AWS made this step convenient, whereas most other environments would be unable to offer 40 GPU nodes with public IP access.
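As a rough illustration, the per-participant firewall rule described above can be expressed as the permission payload passed to EC2's AuthorizeSecurityGroupIngress API (for example, via boto3's `authorize_security_group_ingress`). This is a sketch under assumptions, not the workshop's actual configuration; the port number assumes cryoSPARC's default web interface port of 39000.

```python
# Hypothetical sketch: admit web traffic to one instance only from a
# single participant's IP address, via an EC2 security group ingress rule.
CRYOSPARC_PORT = 39000  # cryoSPARC's default web interface port (assumption)

def ingress_permission(participant_ip: str, port: int = CRYOSPARC_PORT) -> dict:
    """Build an ingress rule admitting only one /32 address on one port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": f"{participant_ip}/32"}],
    }

# The actual API call (not executed here) would look roughly like:
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-...", IpPermissions=[ingress_permission(ip)])
```

Scoping each rule to a /32 address is what keeps 40 publicly reachable GPU nodes from being open to the whole internet.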
Beyond providing dedicated computing instances pre-loaded with tutorial data, the workshop lets participants bring their own data. This poses a data upload and management challenge, as each participant can bring up to 10 terabytes of data. Globus, which uses GridFTP to transfer arbitrarily large datasets between registered endpoints, handles upload and management. To move data into AWS, the Globus storage connector for Amazon Simple Storage Service (Amazon S3) uploads user data directly into an S3 bucket, and access opens a week before the workshop. Users install a local Globus endpoint on their laptop or server, then upload their data into a user-specific directory.
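One way to enforce the user-specific directories mentioned above is an IAM policy that confines each participant to their own prefix in the upload bucket. The bucket name, prefix layout, and the idea of per-user policies here are illustrative assumptions, not the workshop's actual Globus/S3 setup.

```python
# Hypothetical sketch: an IAM policy scoping one participant to a single
# prefix ("directory") in the shared S3 upload bucket.
def user_upload_policy(bucket: str, username: str) -> dict:
    """Build an IAM policy allowing one user to use only their own prefix."""
    prefix = f"uploads/{username}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Let the user list only their own "directory"
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{prefix}*"}},
            },
            {   # Read/write objects under that prefix only
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }
```

With a policy like this attached per user, uploads from different participants land in separate prefixes and cannot collide.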
To perform the final step of data orchestration and to launch the Amazon EC2 instances, we leverage the AWS command line tools. Specifically, we built new functionality into the existing cryoem-cloud-tools software package, which we previously used to build hybrid computing architectures for cryo-EM data processing. Using this software, we launch 40 g3.16xlarge instances, each with 10 TB of local storage. On each machine, we download two terabytes of tutorial data in addition to any user-provided data.
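The launch step above can be sketched as the parameters for EC2's RunInstances API (for example, boto3's `run_instances`), with the 10 TB scratch volume expressed as a block device mapping. The AMI ID and device name are placeholders; the real workshop launches go through the cryoem-cloud-tools package rather than this hand-written payload.

```python
# Hypothetical sketch: launch parameters for 40 g3.16xlarge instances,
# each with ~10 TB of attached EBS storage for cryo-EM datasets.
TEN_TB_GIB = 10_000  # EBS volume sizes are specified in GiB

def launch_params(ami_id: str, count: int = 40) -> dict:
    """Build the RunInstances request for the workshop fleet."""
    return {
        "ImageId": ami_id,            # placeholder AMI, pre-loaded with software
        "InstanceType": "g3.16xlarge",  # 4x NVIDIA M60, 64 vCPUs, 488 GB RAM
        "MinCount": count,
        "MaxCount": count,
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/sdb",  # placeholder device name
                "Ebs": {"VolumeSize": TEN_TB_GIB, "VolumeType": "gp2"},
            }
        ],
    }

# The actual call (not executed here) would look roughly like:
#   boto3.client("ec2").run_instances(**launch_params("ami-..."))
```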
At the end of the workshop, participants leave with hands-on exposure to leading software packages and expert-level advice; many leave with near-publication-quality structures in hand. Multiple participants reported feeling empowered to process their own data and tackle new problems. With significant computing resources at their disposal, participants could analyze data and get real-time feedback from experts, advancing projects in days instead of weeks.