AWS HPC Blog

Stion – a Software as a Service for Cryo-EM data processing on AWS

This post was written by Swapnil Bhatkar, Cloud Engineer, NREL, in collaboration with Edward Eng, Ph.D. and Micah Rapp, Ph.D., both SEMC/NYSBC, and Evan Bollig, Ph.D. and Aniket Deshpande, both AWS.

Introduction

Cryo-electron microscopy (Cryo-EM) technology allows biomedical researchers to image frozen biological molecules, such as proteins, viruses, and nucleic acids, and to obtain structures that were impossible to solve with previous methods. Cryo-EM requires both large, expensive electron microscopes and substantial high performance computing (HPC) resources to process microscope imagery and extract three-dimensional structures from it. The compute and storage infrastructure needed to support these workloads is often prohibitively expensive for individual researchers and small labs that process only a handful of Cryo-EM projects per year: compute and storage hardware, GPUs, and staffing costs can amount to $500,000 or more. This is where cloud services from Amazon Web Services (AWS) can help.

To overcome these challenges, the New York Structural Biology Center (NYSBC) built Stion, a web application that provides biomedical researchers with on-demand access to GPU instances on AWS for processing Cryo-EM data. The main objective of Stion is to reduce infrastructure overhead for researchers so that they can focus on the science. Stion provides the building blocks for end-to-end data processing, including educational tutorials to improve skills and access to the scalable computational resources required for Cryo-EM processing. Taken together, this open-source platform lowers the barrier to entry by giving new users a pathway to become familiar with Cryo-EM cloud computing, and gives biomedical researchers a framework for learning to process their own Cryo-EM data.

Why AWS?

We chose AWS for a few reasons. First, the AWS global infrastructure allows researchers all over the globe to launch AWS instances within minutes without having to maintain any physical infrastructure. They can run complex workloads through the browser from anywhere in the world with a stable internet connection.

Second, we did an in-depth analysis of both short-term and long-term pricing across cloud providers for compute, storage, and networking, which are the backbone of our application. After a detailed comparative analysis and benchmarking against our ideal pricing models, we concluded that AWS provides the lowest all-in pricing for on-demand compute instances, object storage, and file storage.

Finally, one of the major reasons for selecting AWS as our go-to cloud vendor is the extensive support it provides to customers. AWS offers plenty of online resources, such as technical guides and in-depth documentation for every product and service, but AWS Support really stands out because of its well-trained personnel in every domain.

Solution overview

Stion provides a dedicated GPU sandbox pre-loaded with software packages such as CryoSPARC, RELION, Appion-Protomo, and EMAN2. Each package comes with preloaded datasets so researchers can start processing data immediately.

Architecture design is a key element to build a data processing pipeline on the cloud. We built a hybrid architecture where some of the resources are hosted on-premises and most of the resources are hosted on AWS Cloud. Figure 1 shows the on-prem and AWS resources, and Table 1 provides an accompanying description of the data analysis workflow.

Figure 1. Stion architecture diagram, including the workflow for data transfer and processing, as described in Table 1.

Table 1. The steps in the data analysis workflow.

Step Description
0 User onboarding. End users bring their own account. See the section that follows for more details.
1 (Optional) Copy data from on-premises to Amazon Elastic File System using AWS DataSync. For tutorials, test data is already on the instance.
2 License CryoSPARC. All academic researchers need to obtain a valid license ID from the software vendor before launching the CryoSPARC sandbox.
3 Enable storage services. Once the instance is launched, 2 EBS volumes and an Elastic File System are attached to the instance.
4 Enable monitoring. EC2 instances constantly send built-in metrics such as CPU utilization, EBS read/write operations, and status checks to Amazon CloudWatch, an AWS monitoring service that collects and tracks metrics, logs, and events in real time from different AWS resources. CloudWatch alarms send notifications to SNS topics, which trigger an AWS Lambda function that in turn sends alerts to the NYSBC Stion Slack channel using the Slack API.
5 Copy data to Amazon S3. All of the data, including raw frames, movies, maps and processed data are asynchronously copied to an Amazon S3 bucket, which provides longer term data resiliency and automatic lifecycle management at a low cost.
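
The alert path in step 4 (CloudWatch alarm → SNS → Lambda → Slack) can be sketched as a small Lambda handler. This is an illustration rather than Stion's actual function: the SLACK_WEBHOOK_URL environment variable name and the message format are our assumptions.

```python
import json
import os
import urllib.request

def format_alarm_message(event):
    """Extract the CloudWatch alarm details from the SNS event payload."""
    record = event["Records"][0]["Sns"]
    alarm = json.loads(record["Message"])  # SNS wraps the alarm JSON as a string
    return (f":warning: {alarm['AlarmName']} is {alarm['NewStateValue']}: "
            f"{alarm['NewStateReason']}")

def handler(event, context):
    """Lambda entry point: post the alarm summary to a Slack channel."""
    payload = {"text": format_alarm_message(event)}
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # assumed: a Slack incoming webhook
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Subscribing this function to the SNS topic that the CloudWatch alarms publish to completes the chain.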

Since Cryo-EM datasets are large, we chose the AWS DataSync agent to manage transfers from on-premises storage. One of the main advantages of AWS DataSync is the detailed log it provides at the end of every transfer, including total data transferred, number of files copied, and throughput. These are crucial metrics when moving approximately 3-5 TB per transfer, because they let us actively monitor transfers and troubleshoot issues.

A second item to note is that Stion leverages cross-account access to launch resources within your own AWS account. In step 2, when you register the CryoSPARC license in the Stion web application, the web application utilizes the boto3 Python library and assumes a role to generate temporary security credentials such as Access Key ID, Secret Access Key, and Session Token. These temporary credentials are managed by the AWS Security Token Service (STS), and are used to launch the CryoSPARC instance via the Amazon EC2 API in your account.
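
The assume-role flow described above can be sketched with boto3. The function names and parameters here are illustrative; only sts.assume_role and ec2.run_instances are real API calls, and the role ARN and AMI ID would come from Stion's own records.

```python
def credentials_to_session_kwargs(sts_response):
    """Map an AssumeRole response to boto3 session keyword arguments."""
    creds = sts_response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }

def launch_cryosparc_instance(role_arn, ami_id, instance_type="p2.8xlarge"):
    """Assume the cross-account role, then launch the sandbox in the user's account."""
    import boto3  # imported here so the helper above stays dependency-free

    sts = boto3.client("sts")
    resp = sts.assume_role(RoleArn=role_arn, RoleSessionName="stion-launch")
    # Temporary STS credentials scope all further calls to the user's account
    session = boto3.session.Session(**credentials_to_session_kwargs(resp))
    ec2 = session.client("ec2")
    result = ec2.run_instances(ImageId=ami_id, InstanceType=instance_type,
                               MinCount=1, MaxCount=1)
    return result["Instances"][0]["InstanceId"]
```

The temporary credentials expire on their own, so nothing long-lived is ever stored by the web application.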

With respect to storage configuration, two Amazon EBS volumes are attached to the instance: one backs the Amazon Machine Image, with a pre-configured OS and all Cryo-EM applications, and the second holds test datasets and scratch space. An Amazon Elastic File System (Amazon EFS) file system is also mounted on instances. EFS provides scalable performance and shared access to data across multiple instances using the NFS protocol, allowing concurrent data processing on both CryoSPARC and RELION instances. EFS can also be mounted on on-premises servers over VPN for local processing of data.

User Onboarding

User onboarding is one of the most important steps in the entire workflow. It sets up your AWS environment (networking, security groups, and compute and storage resources). These resources are the building blocks for processing data in the cloud. Stion automates the creation of AWS resources in our users' accounts, saving researchers valuable time so they can get to data analysis faster.

The mechanism for creating resources in a researcher's AWS account is a cross-account IAM role: an identity with a defined set of permissions and policies that determine what a third party such as the New York Structural Biology Center ("NYSBC") can and cannot do in the researcher's AWS account. This role allows NYSBC to securely access specific AWS resources, call APIs on behalf of the end user, and deploy and provision infrastructure without managing any long-lived keys. From a security standpoint, this approach is highly recommended because neither party has to share credentials, and the end user can audit who is accessing their AWS account using AWS CloudTrail.

User onboarding is a one-time process for each AWS account and takes less than 5 minutes to complete. AWS resources are created in the end user's account using AWS CloudFormation via a custom launch stack URL. Once registered on Stion, users receive an acknowledgment email with a URL for the onboarding portal. With a single click, they are redirected to the AWS CloudFormation management console, where they can initiate the creation of the necessary AWS resources.
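
As a sketch of what such an onboarding stack sets up, the cross-account role's trust policy might look like the following. The account ID and external ID are placeholders, and the external-ID condition is a common hardening practice we assume rather than a documented detail of Stion's template.

```python
import json

# Placeholders -- the real values are filled in by the onboarding stack
NYSBC_ACCOUNT_ID = "111122223333"
EXTERNAL_ID = "stion-example-external-id"

def build_trust_policy(trusted_account_id, external_id):
    """Trust policy allowing a third-party account to assume this role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
            "Action": "sts:AssumeRole",
            # An external ID guards against the confused-deputy problem
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }

print(json.dumps(build_trust_policy(NYSBC_ACCOUNT_ID, EXTERNAL_ID), indent=2))
```

The permissions policy attached to the same role would then limit NYSBC to the EC2, EFS, and monitoring actions the workflow needs.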

Stion Features

Auto-shutdown of instances

It is a common scenario in cloud computing that end users forget to shut down their virtual machines when they are done, and monthly bills rise quickly, especially when running costly AWS GPU instances under the on-demand pricing model.

To avoid over-spending, we introduced an auto-shutdown feature that automatically shuts down an EC2 instance 8 hours after a user launches it or restarts an existing one. The date and time of the auto-shutdown are displayed on the Stion dashboard. This feature optimizes costs and keeps the monthly EC2 bill in check for end users. In practice, we find that 8 hours is longer than the average time needed to process an end-to-end Cryo-EM workflow on a standard dataset with a GPU.
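
The auto-shutdown logic can be sketched as follows. Only the 8-hour window comes from the text; how Stion actually schedules the check (for example, a periodic Lambda function) is our assumption.

```python
from datetime import datetime, timedelta, timezone

SHUTDOWN_AFTER = timedelta(hours=8)

def shutdown_deadline(start_time):
    """Return the time at which a freshly (re)started instance is stopped."""
    return start_time + SHUTDOWN_AFTER

def stop_if_expired(ec2, instance_id, start_time, now=None):
    """Stop the instance once its 8-hour window has elapsed.

    `ec2` is a boto3 EC2 client. Returns True if a stop was issued.
    """
    now = now or datetime.now(timezone.utc)
    if now >= shutdown_deadline(start_time):
        ec2.stop_instances(InstanceIds=[instance_id])
        return True
    return False
```

A restart simply records a new start time, which is why manually stopping and starting an instance pushes the deadline out by another 8 hours.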

Users also have the ability to stop their instances whenever they want by clicking the ‘Stop Instance’ button. However, they must ensure that jobs complete before hitting the stop button.

Figure 2. Stion abstracts the complexity of AWS and provides users essential metadata for control/access.

Reminder emails

For business continuity and a smoother experience, reminder emails are sent 1 hour before an instance is automatically shut down. Occasionally, when users are working on larger datasets, jobs can require more than an hour to complete. In such cases, we encourage users not to schedule long-running jobs after receiving a reminder: instance shutdown kills all running jobs as well as any queued jobs. However, manually stopping and restarting an instance resets the timeout to 8 hours from the new start time.

Web-based shell

One of the major features of Stion is access to the sandbox directly from a browser. This is made possible using AWS Systems Manager Session Manager, a utility for managing Amazon EC2 instances through an interactive, one-click, browser-based shell.

From a security standpoint, users no longer have to open SSH ports, maintain bastion hosts, or distribute SSH keys. They simply access the instance using IAM policies created by AWS CloudFormation during onboarding.

This policy is attached to the instance at launch, enabling access whenever the instance is in the 'Running' state. No desktop or mobile client downloads are required.

Launch, Start and Stop but no termination of instances

End users can launch, start, and stop EC2 instances directly from the Stion dashboard without having to log in to the AWS Management Console. While calling the APIs, the Stion application assumes the role in the end user's account and generates temporary credentials to launch, stop, or start an instance. Since the AWS Management Console has more than 150 services and thousands of options, it can be difficult for a new user to navigate and perform the necessary actions. Thus, Stion distills the experience to a basic set of metadata and actions that users can perform, as shown in Figure 2.

To protect against accidental instance termination and data deletion, we intentionally exclude a 'Terminate Instance' option from the dashboard. End users who are new to AWS might not understand the difference between stopping and terminating an instance: terminating an instance deletes the CryoSPARC and RELION software from the account, including the CryoSPARC database and RELION job submission scripts. This can impact other researchers when an instance is shared between lab members. Raw data and final processed data remain in the EFS file system unless the end user deletes them manually, but intermediate results are at risk on termination. When an end user no longer wants the Cryo-EM sandbox, they can terminate the instance manually from their EC2 management console.

Example costs for a Cryo-EM analysis

NYSBC, being a non-profit research institution, doesn't charge end users anything; we have made Stion available to all biomedical researchers. Costs incurred while processing data using AWS resources are paid directly to Amazon Web Services from the end user's AWS account.

Many research labs are concerned about the potential costs of using Stion. Here we provide a cost analysis for an example Cryo-EM project: analyzing 5 TB of data for 8 hours on a single p2.8xlarge EC2 instance, with the data then archived to Amazon S3 Glacier Deep Archive for a period of 2 years, after which it is permanently deleted from Glacier Deep Archive.

Assumptions:

  1. 5 TB of data is copied from on-premises to EFS using AWS DataSync
  2. Data on EBS volumes is stored for a month and deleted once the instance is terminated after a month
  3. Data is stored in the Elastic File System for 2 days and then moved to S3 Standard-Infrequent Access (S3 Standard-IA) storage
  4. Data is stored in S3 Standard-IA for a month and then moved to S3 Glacier Deep Archive using lifecycle policies

Table 2. Pricing for processing a dataset

Resource | Pricing | Usage | Approximate cost
p2.8xlarge EC2 instance | $7.20 per hour | 8 hours | $58
EBS volumes (gp2) x 2 | $0.10 per GB-month | 100 GB for 1 month | $10
Elastic File System (Standard) | $0.30 per GB-month | 10 TB for 2 days | $205
S3 Standard-Infrequent Access | $0.0125 per GB-month | 10 TB for 1 month | $128
S3 Glacier Deep Archive | $0.00099 per GB-month | 10 TB for 2 years | $288
Total costs | | | $775

To summarize, the total cost of a single Cryo-EM project on a standard dataset is approximately $775, which is significantly less than buying and maintaining on-premises hardware.
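
Most of the table's line items can be reproduced with simple proration arithmetic, taking 1 TB = 1,024 GB and a 30-day month:

```python
GB_PER_TB = 1024

ec2 = 7.2 * 8                            # p2.8xlarge at $7.20/hour for 8 hours
ebs = 0.10 * 100 * 1                     # 100 GB of gp2 for 1 month
efs = 0.30 * 10 * GB_PER_TB * (2 / 30)   # 10 TB for 2 days of a 30-day month
s3_ia = 0.0125 * 10 * GB_PER_TB          # 10 TB in Standard-IA for 1 month
glacier = 0.00099 * 10 * GB_PER_TB * 24  # 10 TB in Deep Archive for 24 months
# glacier works out to roughly $243; the table's $288 figure likely folds in
# charges not itemized here (our assumption, e.g. request and transfer fees)
```

The EC2 figure dominates for short projects, while the archival tiers dominate once data is retained for years.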

Research labs can set up budget alerts on specific services, such as monthly Amazon EC2 or Amazon EFS usage, using AWS Budgets to keep track of spend on a daily or monthly basis. Researchers can submit detailed reports to their PIs directly from the AWS Billing dashboard. If multiple researchers use shared AWS resources, spend can be broken down per researcher using cost allocation tags. For example, S3 buckets owned by a researcher can carry tags such as name and project, which can then be filtered at a granular level in the cost management service.
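
Per-tag spend like this can also be queried programmatically through the Cost Explorer API. A sketch, assuming a 'project' cost allocation tag has been activated in the Billing console:

```python
def cost_query_params(start, end, tag_key="project"):
    """Build the Cost Explorer request for per-tag monthly spend."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # dates as YYYY-MM-DD
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

def monthly_cost_by_project(start, end):
    """Query Cost Explorer for spend grouped by the 'project' tag."""
    import boto3  # imported lazily so the request builder above needs no SDK

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(**cost_query_params(start, end))
```

The same grouping can be switched to a per-researcher 'name' tag by passing a different tag_key.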

On a related note, AWS offers a program for academic and research institutions that route at least 80% of their data egress out of the AWS Cloud through an approved National Research and Education Network (NREN), such as Internet2. This program waives data egress fees of up to 15 percent of an institution's total monthly AWS spend.

Use-Case 1: Collaborating on COVID-19

COVID-19, the disease caused by the novel coronavirus SARS-CoV-2 (Zhou et al., Nature 2020), has killed more than 2.3 million people worldwide since its emergence in late 2019 (WHO). The development of vaccines and antibody therapeutics was aided greatly by the rapid determination of near-atomic Cryo-EM structures of the viral spike glycoprotein and neutralizing antibodies. The Lawrence Shapiro lab at Columbia University, in collaboration with the Aaron Diamond AIDS Research Center and the Kwong lab at the NIH's Vaccine Research Center, has solved structures of more than a dozen spike complexes (Liu et al., Nature 2020; Zhou et al., Cell Host & Microbe 2020; Cerutti et al., bioRxiv 2021a; Rapp et al., bioRxiv 2021; Cerutti et al., bioRxiv 2021b). This body of work was accompanied by a deluge of Cryo-EM data, an increase of at least two orders of magnitude over what the lab was used to, which quickly overwhelmed the available computational resources. The need for immediate, high-performance, scalable resources was exacerbated by the relative inexperience of many of those engaged in data processing, as well as by the emergence of novel SARS-CoV-2 strains that multiplied the number of complexes still to be studied.

Stion was used to address these challenges and allowed a quick, seamless transition from on-site workstations to cloud computing. As part of the COVID-19 HPC Consortium, AWS provided technical support and promotional credits toward the use of AWS services to develop Stion. The first step was to develop a cost projection based on total compute time and data storage. This proved difficult, as it is hard to determine in advance how many GPU-hours it will take to reach a high-resolution structure for any dataset that is not a known, highly studied sample. Significant time was also spent planning how data would be moved from the microscope to the AWS instance. In the first month of the pandemic, Stion allowed the Shapiro lab to process approximately 50 TB of data, nearly as much as it had processed in the entirety of the previous year. As the number of samples obtained from COVID-19 patients increased, we easily scaled the AWS environment to include a second 16-GPU instance that could be used as a backup when the first was overutilized. The environment was also expanded to include other processing packages, in particular RELION, which allowed us to process data from the very beginning of the pipeline, with motion correction, all the way through advanced post-processing with Bayesian polishing.

Use-Case 2: Cloud-enabled workshops

A major thrust for Stion is conducting workshops in the cloud and training researchers to process data independently in the AWS sandbox. Before moving to the cloud, NYSBC heavily utilized on-premises infrastructure for such workshops. However, local resources would become congested as workshop participants submitted jobs to the cluster, and the spiking load affected day-to-day data ingestion and processing pipelines for regular users.

Conducting workshops on the cloud has several advantages. The first is elasticity. We have the ability to easily scale out and scale in EC2 instances with an Auto Scaling Group to match the number of participants. Elastic scalability allows us to cost-effectively provide an individual sandbox to each of our participants so that they can have complete control over compute resources as well as their data.

Another major advantage of the cloud is capability at scale. End users come from diverse backgrounds, with different compute capabilities at their home institutions. Some may have GPU workstations in their labs; others might be limited to a quad-core PC. Some may share a workstation with other lab members, while others might not have a workstation at all. Others still may have limited access to their institution's HPC clusters. Tapping into the capacity of the cloud, we can offer an environment where every participant has identical compute and storage resources, and ensure users have access to the latest-generation GPUs for optimal throughput. There is no waiting in line to run jobs or stressing over the acquisition of new hardware. From an IT support perspective, consolidating all users into a homogeneous pool simplifies operations and reduces real-time troubleshooting.

In April 2021, the Simons Electron Microscopy Center (SEMC) at NYSBC offered a 3-day workshop focused on the theory and practice of tomographic methods, including two hands-on processing sessions utilizing popular academic software suites: Appion-Protomo and EMAN2. Since it was a virtual event, more than 250 participants from all over the globe, especially the US, UK, and Asia-Pacific, registered. To ensure compute capacity for all participants, capacity reservations were created in the workshop account a day in advance, requesting T3 and G4 instances in specific Availability Zones for at least 100 researchers plus instructors and staff.
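
Reserving capacity ahead of a workshop can be sketched with the EC2 CreateCapacityReservation API. The instance types, Availability Zone, and counts below are illustrative, not the workshop's exact values:

```python
def reserve_workshop_capacity(ec2, plan):
    """Create one On-Demand Capacity Reservation per (type, AZ, count) tuple.

    `ec2` is a boto3 EC2 client; returns the reservation IDs.
    """
    ids = []
    for instance_type, az, count in plan:
        resp = ec2.create_capacity_reservation(
            InstanceType=instance_type,
            InstancePlatform="Linux/UNIX",
            AvailabilityZone=az,
            InstanceCount=count,
        )
        ids.append(resp["CapacityReservation"]["CapacityReservationId"])
    return ids

# Example plan (illustrative counts and AZ, not the workshop's real numbers)
WORKSHOP_PLAN = [
    ("t3.xlarge", "us-east-1a", 50),    # Appion-Protomo CPU sandboxes
    ("g4dn.xlarge", "us-east-1a", 50),  # EMAN2 GPU sandboxes
]
```

With the reservations in place, instance launches in the matching AZ and type draw from the reserved capacity instead of competing with on-demand supply.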

Our mission was to provide every researcher with a dedicated sandbox to learn and apply processing techniques required for Cryo-EM and Cryo-ET, no matter what part of the world they live in. For those 2 hands-on sessions, 500 instances were effortlessly launched on-demand:

  • 250 compute instances for Appion-Protomo
  • 250 GPU instances for EMAN2

By Region:

  • 300 instances in us-east-1
  • 100 instances in us-east-2 and
  • 100 instances in us-west-2

Conclusion

As researchers adapt to working remotely and growing global collaborations, the use of cloud infrastructure is proving essential. Our solution, Stion, facilitates remote science and bridges the gap between biomedical researchers and computational infrastructure for Cryo-EM data processing on-demand.

By building on top of AWS, Stion lowers the barrier to entry into Cryo-EM for biomedical researchers in underserved communities, removing the need for on-premises IT infrastructure and dedicated IT professionals. Stion also accelerates Cryo-EM research on demand at lower cost.

At present, Stion offers several processing workflows with public AMIs and tutorials for CryoSPARC, RELION, Appion-Protomo (Noble et al 2015) and EMAN2 (Bell et al. 2018). No hardware knowledge is required to get started, and researchers get on-demand access via a web browser. Furthermore, Stion effectively manages cloud resources in our collaborators’ accounts without compromising on security with the help of cross-account IAM roles.

Disclaimer: The content and opinions in this blog are those of the third-party authors and AWS is not responsible for the content or accuracy of this blog.

Swapnil Bhatkar

Swapnil Bhatkar is a cloud engineer in the Advanced Computing team at the National Renewable Energy Laboratory (NREL), U.S. Department of Energy. Prior to NREL, he worked at the New York Structural Biology Center as an AWS Solutions Architect and HPC Systems Engineer. He previously led the AWS cloud initiative for the National Center for CryoEM Access and Training (NCCAT), funded by NIH. He developed Stion and was responsible for building a robust cloud-based CryoEM data processing pipeline for biomedical researchers. He is an active member of the global AWS Community Builders program, which provides technical mentorship and guidance to help individuals and companies get started with the cloud.

Edward Eng, Ph.D.

Edward Eng, Ph.D., leads the operations team at the Simons Electron Microscopy Center, a world-leading cryoEM facility, and is the manager of NCCAT, an NIH cryoEM service center, both at NYSBC. The national service center program allows him to engage with scientists in an open and collaborative forum to advance biomedical research. By bringing the best practices in the field to assist researchers, he acts as a champion of cryoEM. His mission is to lower the barriers of access to cryoEM technology and cross-train researchers to have accelerated impact at their home institutions.

Micah Rapp, Ph.D.

Micah Rapp, Ph.D., is a recent graduate of Columbia University, where he was a graduate student working jointly in Larry Shapiro’s lab in the Department of Biochemistry and Molecular Biophysics and with Clint Potter and Bridget Carragher at the Simons Electron Microscopy Center. He is interested in studying macromolecular complexes in their native state, focusing on visualizing cell adhesion assemblies using cryo-EM, with experience in both single particle analysis and cryo-electron tomography. He has served as a teaching assistant for the SEMC Winter EM Course since 2018.

Evan Bollig

Evan Bollig, Ph.D., is a senior specialist solutions architect for HPC with AWS. Prior to AWS, Evan supported thousands of users at the Minnesota Supercomputing Institute in research computing and spearheaded efforts around cloud-HPC integrations. Evan has developed cloud-native infrastructures for production clinical genomics pipelines, led the creation and operation of a secure cloud enclave for controlled-access data research, and continues to be a longtime proponent for open source (SourceForge and GitHub—user: bollig).

Aniket Deshpande

Aniket Deshpande is a senior GTM specialist for HPC in Healthcare and Life Sciences at AWS. Aniket has more than a decade of experience in the biopharma and clinical informatics space, where he has developed and commercialized clinical-grade software solutions and services for genomics, molecular diagnostics, and translational research. Prior to AWS, Aniket worked in various technical roles at DNAnexus, Qiagen, Knome, Pacific Biosciences, and Novartis.