AWS Public Sector Blog

Five things to consider when moving your research workflows to AWS

Research is done differently in the cloud than in an on-premises lab. Research labs looking to move computational research to the cloud should start with their workflows.

Workflows are well-developed sets of computer analyses that turn raw data into results that researchers can publish. Even though analysis methods can overlap, workflows vary by lab and project to suit the research question.

Researchers can benefit from moving research workflows to Amazon Web Services (AWS). The move is also a good time to consider workflow changes that increase reproducibility, portability, and collaboration, and that enable faster and more scalable analyses.

There are common themes across computational research workflows that researchers should consider as they begin to move their research workflows to AWS. These include:

Storage and backup. Often, labs work with a master copy of the data and multiple working copies. This approach can cause trouble when the master data is updated but the copies are not. On AWS, data that you wish to keep should be stored on Amazon Simple Storage Service (Amazon S3). The storage itself maintains multiple copies to provide high durability. However, Amazon S3 is an object store, not a file system. To compute using data on Amazon S3, copy it to the file system on Amazon Elastic Compute Cloud (Amazon EC2) instances and save the results back to Amazon S3. You can also set up a desktop client such as Cyberduck to transfer files between your Windows or Mac computer and Amazon S3.
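The copy, compute, and save-back pattern described above can be sketched in Python. This is a minimal illustration, not a recommended implementation: the function names and the line-counting "analysis" are hypothetical stand-ins, and `s3` is assumed to be a boto3 S3 client (or any object with the same `download_file` and `upload_file` methods).

```python
import tempfile
from pathlib import Path


def count_lines(path: Path) -> str:
    """Stand-in analysis step (hypothetical): count the lines in a file."""
    with path.open() as handle:
        return f"{sum(1 for _ in handle)}\n"


def run_on_s3_object(s3, bucket: str, in_key: str, out_key: str) -> None:
    """Copy one object to local disk, analyze it, and save the result to S3.

    `s3` is assumed to behave like the client returned by
    boto3.client("s3").
    """
    with tempfile.TemporaryDirectory() as workdir:
        local_in = Path(workdir) / "input.dat"
        local_out = Path(workdir) / "result.txt"

        # 1. Copy the object from Amazon S3 to the instance's file system.
        s3.download_file(bucket, in_key, str(local_in))

        # 2. Compute against the local copy.
        local_out.write_text(count_lines(local_in))

        # 3. Save the result back to Amazon S3, which holds the durable copy.
        s3.upload_file(str(local_out), bucket, out_key)
```

On an EC2 instance with credentials configured, a call might look like `run_on_s3_object(boto3.client("s3"), "my-lab-bucket", "raw/subject01.csv", "results/subject01.txt")`, where the bucket and key names are placeholders.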

Amazon S3 is also available in a range of storage classes that make it less expensive to save data that is not accessed frequently. You can use lifecycle rules to move data from one Amazon S3 storage class to another automatically, for example to archive data once it reaches a certain age.
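As a rough sketch of what such a lifecycle rule could look like, the Python function below builds a configuration that moves objects under a prefix to cheaper storage classes as they age. The helper name, the rule ID, and the day thresholds are assumptions chosen for illustration. Note that these transitions are driven by object age rather than by how recently the data was read.

```python
def archive_rule(prefix: str, ia_after_days: int = 30,
                 glacier_after_days: int = 180) -> dict:
    """Build a lifecycle configuration for objects under `prefix`.

    Objects move to STANDARD_IA after `ia_after_days` and to GLACIER
    after `glacier_after_days`, counted from object creation.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("glacier_after_days must exceed ia_after_days")
    rule_id = "archive-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }
```

With boto3, the result could be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="my-lab-bucket", LifecycleConfiguration=archive_rule("raw/"))`, where the bucket name is a placeholder.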

Storage is a key enabler of research. Baylor College of Medicine used Amazon S3 to store more than 1 PB of genomic data and share it with hundreds of scientists, instead of mailing encrypted hard drives.

Compute infrastructure. In most labs, the computers you use partially dictate what your workflow looks like, or even what you can do. For example, perhaps you are using a cluster configured with the Message Passing Interface (MPI) to run your computational chemistry simulation when a shared memory multiprocessor with two GPUs would be faster. With AWS, you create the computational infrastructure for your specific workflow rather than figuring out how to make something work on what you have. You start computers when you need them and stop them when you don’t. Instead of sharing one powerful server within your lab, you create a few different configurations for different purposes, and then start and stop them as needed.
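One way to make "start when needed, stop when done" hard to forget is to wrap it in a context manager. The sketch below assumes `ec2` is a boto3 EC2 client (or an object with the same methods) and that the instances already exist. The name `running` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def running(ec2, instance_ids):
    """Start the given EC2 instances and stop them again on exit.

    `ec2` is assumed to behave like boto3.client("ec2").
    """
    ec2.start_instances(InstanceIds=instance_ids)
    try:
        # Block until the instances report the "running" state.
        ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)
        yield instance_ids
    finally:
        # Stop the instances even if the analysis raised an error,
        # so they do not keep accruing compute charges.
        ec2.stop_instances(InstanceIds=instance_ids)
```

An analysis could then run inside `with running(boto3.client("ec2"), ["i-..."]):`, using the IDs of instances configured for that particular workload.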

Serverless computing. Serverless computing is running software without having to create or maintain the computer it runs on. This becomes important as research workflows become more advanced. AWS Lambda gives you the ability to execute small programs when needed. AWS Batch allows you to run containerized parallel jobs without creating or tearing down a cluster. For example, AWS Lambda can watch an Amazon S3 bucket and email users when a research assistant adds new subject data. AWS Batch can run preprocessing and quality assurance scripts on this subject data. You pay for only the computation time that your code uses. Putting the two services together, AWS Lambda can start AWS Batch jobs and notify the group if there is new data available to review.
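To make the Lambda-starts-Batch idea concrete, here is a hedged sketch of a Lambda handler that reacts to an Amazon S3 "object created" event and submits one AWS Batch job per new object. The queue name, job definition name, and `preprocess.sh` command are hypothetical and would need to match resources you have created. Sending the notification (for example through Amazon SNS) is left out.

```python
import re
from urllib.parse import unquote_plus

JOB_QUEUE = "preprocess-queue"    # hypothetical AWS Batch job queue
JOB_DEFINITION = "preprocess-qa"  # hypothetical AWS Batch job definition


def jobs_from_s3_event(event: dict) -> list:
    """Translate an S3 event notification into submit_job arguments."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Batch job names allow only letters, digits, hyphens, underscores.
        safe = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:100]
        jobs.append({
            "jobName": f"qa-{safe}",
            "jobQueue": JOB_QUEUE,
            "jobDefinition": JOB_DEFINITION,
            "containerOverrides": {
                "command": ["preprocess.sh", f"s3://{bucket}/{key}"],
            },
        })
    return jobs


def handler(event, context):
    """Lambda entry point: submit one Batch job per new S3 object."""
    import boto3  # included in the AWS Lambda Python runtime

    batch = boto3.client("batch")
    job_ids = [batch.submit_job(**job)["jobId"]
               for job in jobs_from_s3_event(event)]
    return {"submitted": job_ids}
```

Keeping the event parsing in its own function means it can be checked locally with a sample event before anything is deployed.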

AWS Step Functions can link multiple AWS services together, which is useful when you need to scale up to process thousands of scans, genomes, or assessments in the same way with minimal user intervention.

Workflow orchestration. Often there is a sequence of steps in a workflow, executed one after the other, potentially with some manual quality assurance steps in between. When one step fails, you need to correct the failure and redo all the steps that depend upon it.

For reproducibility, you script these steps. However, more sophisticated workflow tools (especially popular in genomics) allow you to run independent steps in parallel and pick up at the correct place after a failure. See the Common Workflow Language (CWL) and Nextflow for general-purpose workflow tools.
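The two properties mentioned above, running independent steps in parallel and resuming after a failure, can be illustrated with a toy runner in Python. This is only a sketch of the idea and is not how CWL or Nextflow are implemented. The function and argument names are hypothetical, and it requires Python 3.9 or later for `graphlib`.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_workflow(steps, depends_on, done=None, max_workers=4):
    """Run steps in dependency order, in parallel where possible.

    steps:      dict of step name -> zero-argument callable
    depends_on: dict of step name -> set of step names it needs first
    done:       set of finished step names, updated in place; pass the
                same set to a later call to skip completed steps.
    """
    done = set() if done is None else done
    sorter = TopologicalSorter(
        {name: depends_on.get(name, set()) for name in steps})
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # steps whose inputs are complete
            futures = {name: pool.submit(steps[name])
                       for name in ready if name not in done}
            errors = []
            for name, future in futures.items():
                try:
                    future.result()
                    done.add(name)  # record success for a later resume
                except Exception as exc:
                    errors.append((name, exc))
            if errors:
                name, exc = errors[0]
                raise RuntimeError(f"step {name!r} failed") from exc
            sorter.done(*ready)
    return done
```

After a failure, fixing the broken step and calling `run_workflow` again with the same `done` set reruns only that step and the steps that depend on it. Real workflow tools typically add persistent state, caching of outputs, and cluster or cloud executors on top of this basic idea.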

When moving to AWS, choose a workflow orchestration language that makes it simple to test and run your workflows locally and to cost-effectively launch AWS resources. For example, Nextflow can target on-premises clusters, AWS ParallelCluster, or AWS Batch through configuration parameters. If each step in the workflow is a container (necessary to use AWS Batch), the scientific code will be the same on all these platforms, and the workflow can take advantage of using the most cost-effective instance type for each stage of computation.

The scalable workflow orchestration provided by Nextflow and AWS Batch enabled researchers at Fred Hutchinson to reduce the time to process data on more than 15,000 microbiome samples from seven years to seven days, and easily share their workflow with other scientists.

Security. In labs where researchers copy data to their own computers and work independently, permissions for data may not be very complicated. When working in an environment where researchers will be sharing data and launching their own compute resources, it is useful to think about giving each person in the lab access to only the resources they need and removing access when they finish. AWS has the features to support the needs of the most secure applications through services that protect your data, manage identities and permissions, automate compliance checks, and actively monitor threats. Once you identify what compliance standards apply to your data, you can make sure that you use the right AWS infrastructure to support them.
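As one illustration of giving each person access to only the resources they need, the sketch below builds an IAM policy document that limits a lab member to a single project prefix in an Amazon S3 bucket. The function name, statement IDs, and the choice of actions are assumptions for illustration. A real policy should be reviewed against the compliance standards that apply to your data.

```python
import json


def project_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy limited to one project prefix in a bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the project prefix.
                "Sid": "ListProjectPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringLike": {"s3:prefix": [f"{prefix}/*"]}
                },
            },
            {
                # Allow reading and writing objects under that prefix.
                "Sid": "ReadWriteProjectObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


def policy_json(bucket: str, prefix: str) -> str:
    """Serialize the policy in the JSON form that IAM accepts."""
    return json.dumps(project_policy(bucket, prefix), indent=2)
```

The resulting JSON could be attached to an IAM user, group, or role for the duration of a project and detached when the person finishes.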

Want to learn more? Register for a virtual session on July 21, 2:00-4:00PM ET, hosted by AWS and Internet2, designed to introduce researchers to AWS cloud computing.

Attending PEARC20? Stop by the AWS virtual booth and schedule time to connect with an AWS representative.

Learn more about research and technical computing on AWS, read more stories on research on AWS, and contact us.

Tara Madhyastha, Ph.D.

Dr. Tara Madhyastha is a principal research scientist at Amazon Web Services (AWS) and affiliate faculty in the department of psychology at the University of Washington. Trained in high performance computing, she is an interdisciplinary scientist, with contributions in the fields of computer science, educational technology and psychology. For the last ten years, she worked in neuroimaging, developing new methods to study changes to cognitive networks that occur with aging and neurodegenerative disease.