AWS Architecture Monthly
A selection of the best new technical content from AWS
Table of Contents
- Ask an Expert: Ryan Ulaszek, WW Tech Leader - Genomics & Lisa McFerrin, PhD, WW Leader - Bioinformatics
- Executive Brief: Genomics on AWS: Accelerating scientific discoveries and powering business agility
- Case Study: Fred Hutch Microbiome Researchers Use AWS to Perform Seven Years of Compute Time in Seven Days
- Quick Start: For rapid deployment
- Blog: NIH’s Sequence Read Archive, the world’s largest genome sequence repository: Openly accessible on AWS
- Solutions: Genomics Secondary Analysis Using AWS Step Functions and AWS Batch
- Reference Architecture: Genomics data transfer, analytics, and machine learning reference architecture
- Case Study: Lifebit Powers Collaborative Research Environment for Genomics England on AWS
- Quick Start: Illumina DRAGEN on AWS
- Executive Brief: Genomic data security and compliance on the AWS Cloud
- Solutions: Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena
- Reference Architecture: Genomics report pipeline reference architecture
- Blog: Broad Institute gnomAD data now accessible on the Registry of Open Data on AWS
- Quick Start: Workflow orchestration for genomics analysis on AWS
- Solutions: Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker
- Reference Architecture: Research data lake ingestion pipeline reference architecture
Ask an Expert
What are some general architecture patterns for genomics in the cloud?
Genomics data generated by research, biopharma, and healthcare organizations is beginning to outpace their ability to cost-effectively store, manage, and analyze it on-premises. As a result, we are increasingly seeing the following four architectural trends that use AWS services to move genomics workloads to the cloud.
- Data transfer and storage. Organizations are shifting to solutions like AWS DataSync to manage their large-scale data transfers. DataSync provides durable, high-throughput data transfer. It can also throttle data to manage network bandwidth. DataSync handles common tasks, minimizing your IT operational burden. Customers use Amazon Simple Storage Service (Amazon S3) because it provides scalable, highly available, and secure storage. It preserves data long term and keeps it highly available for downstream analysis.
- Workflow automation for secondary analysis. Once in the cloud, raw genomic sequencing data undergoes “secondary analysis.” Each sample is processed through orchestrated tasks such as sequence alignment, variant calling, annotation, and quality control. To process these tasks, organizations use AWS Step Functions for serverless orchestration and AWS Batch to provision resources to optimize processing time and cost.
- Data aggregation and governance. To gain insights from data, customers combine sample data and layer inputs such as functional annotations or clinical fields. AWS Glue, Amazon Athena, and AWS Lake Formation are often used to integrate different data sources, enable accessibility and querying, and process data while keeping sensitive data secure. AWS Glue prepares and catalogs data. Athena provides querying for cohort creation. Lake Formation layers data access controls to meet data governance needs and comply with regulatory standards.
- Tertiary analysis and machine learning. Genomics data can be combined with other data modalities to predict disease risk and drug response or inform clinical decision making. Amazon Redshift and Athena are commonly used for variant storage and query. Amazon QuickSight and Jupyter notebooks in Amazon SageMaker are used to query and visualize data. Amazon SageMaker helps build, train, and deploy machine learning models to potentially discover relationships between biomarkers and patient populations to improve treatments and outcomes for patients.
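The workflow automation pattern above can be sketched as a state machine definition. The following is a minimal illustration, not the Solution's actual implementation: it builds an Amazon States Language document in which each secondary analysis step submits one AWS Batch job and waits for it to finish. The queue name, job definition names, and retry settings are hypothetical placeholders chosen for the example.

```python
import json

# Hypothetical identifiers: these queue and job-definition names are
# placeholders for illustration, not values from the article.
JOB_QUEUE = "genomics-spot-queue"
STEPS = [
    ("Alignment", "align-reads"),
    ("VariantCalling", "call-variants"),
    ("Annotation", "annotate-variants"),
    ("QualityControl", "run-qc"),
]


def build_state_machine(steps, job_queue):
    """Chain one AWS Batch job per step into an Amazon States Language dict."""
    states = {}
    for index, (state_name, job_definition) in enumerate(steps):
        state = {
            "Type": "Task",
            # Step Functions service integration: submit a Batch job and
            # wait for it to complete before moving on.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": state_name,
                "JobDefinition": job_definition,
                "JobQueue": job_queue,
            },
            # Retry a failed task with exponential backoff (assumed values).
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
        }
        # Every state points to the next one; the last state ends the run.
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1][0]
        else:
            state["End"] = True
        states[state_name] = state
    return {
        "Comment": "Secondary analysis sketch",
        "StartAt": steps[0][0],
        "States": states,
    }


definition = build_state_machine(STEPS, JOB_QUEUE)
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of document a Step Functions state machine is created from. Keeping the step list as data makes it straightforward to add or reorder tasks without touching the orchestration logic.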
What are some considerations when putting together an AWS architecture to solve business problems specifically for genomics customers?
Realizing the potential of genomics in precision medicine requires data analytics capabilities to increase knowledge of biology and disease, identify new targets for medicines, improve patient selection in clinical trials, and inform treatment strategies to optimize therapeutic benefit for patients.
However, organizations can be challenged by 1) genomics data outgrowing on-premises storage, 2) limited access to local high performance computing (HPC) clusters, and 3) inconsistent processes to manage complex workflows. Key considerations that address these concerns map well to the five pillars of the AWS Well-Architected Framework:
- Security is our highest priority when creating an AWS architecture. Genomics data is generally considered the most private of personal data. Privacy, reliability, and security are critical to data creation, collection and processing, and storage and transfer. Amazon S3 and AWS Identity and Access Management (IAM) help maintain a strong security posture. They provide specific controls for authorizing data access, defining data governance, and establishing and maintaining data encryption. Amazon CloudWatch provides logs and events. AWS CloudTrail maintains audit logs.
- Operational excellence. The genomics field is still emerging. Many tools are open-source and distributed via code and container repositories like GitHub and Docker Hub. These tools are frequently used by research and development for biomarker discovery, drug development, and association studies. Services like AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline allow you to automate change management through continuous integration/continuous delivery (CI/CD). Amazon Elastic Container Registry (ECR) and Amazon Elastic Container Service (ECS) help store, manage, share, and deploy container images.
- Reliability. Step Functions allows you to handle task failures in bioinformatics workflows, and AWS Batch scales horizontally to meet sample processing demand. Call caching and fault-tolerant pipelines further prevent reprocessing of compute- and time-intensive work that would be costly to rerun, such as aligning and mapping whole genome sequences during secondary analysis for variant detection.
- Performance efficiency. As organizations experience higher volumes and velocity of data generated from genomic sequencers, large-scale data transfer is shifting to solutions like DataSync for durable, high-throughput data transfer. Right-sizing your tools also allows you to select the Amazon Elastic Compute Cloud (Amazon EC2) instance type for optimal performance. Tertiary analysis, like joint genotyping, single-cell analysis, and genome-wide association studies, can have extensive memory requirements to parse hundreds of thousands to millions of biomarkers. Using efficient data formats such as Apache Parquet files and distributed compute via Amazon EMR allows for improved memory management and computation times.
- Cost optimization for compute, storage, and data transfer is a key consideration for organizations. Configuring object lifecycle rules in Amazon S3 optimizes storage costs based on access patterns and storage requirements. Larger files that are rarely accessed can be moved to S3 Glacier Deep Archive for long-term storage and archiving. Amazon EC2 Spot Instances reduce costs by running genomics data processing and analysis on spare EC2 capacity at a discount compared with On-Demand pricing. This is particularly valuable for secondary analysis workflows. Currently, the AWS Open Data Sponsorship Program covers the cost of storage for publicly available high-value cloud-optimized datasets. This democratizes access to data and encourages development of communities benefiting from shared data access. It also promotes development of cloud-native techniques, formats, and tools that lower the cost of working with the data. There are already over 70 genomics and life science datasets available within the Registry of Open Data on AWS.
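As an illustration of the storage tiering described under cost optimization, the sketch below builds an Amazon S3 lifecycle configuration that moves objects under a prefix to an infrequent-access class and later to S3 Glacier Deep Archive. The prefix and the 30-day and 180-day thresholds are assumptions made for the example; real values depend on your access patterns. The dictionary follows the shape that the boto3 S3 client accepts as `LifecycleConfiguration` in `put_bucket_lifecycle_configuration`, but no AWS call is made here.

```python
import json


def build_lifecycle_configuration(prefix, infrequent_after_days, archive_after_days):
    """Return an S3 lifecycle configuration with two storage-class transitions."""
    if archive_after_days <= infrequent_after_days:
        raise ValueError("archive transition must come after the infrequent-access one")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",
                # Apply the rule only to objects under this key prefix.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Rarely read after initial processing: cheaper storage.
                    {"Days": infrequent_after_days, "StorageClass": "STANDARD_IA"},
                    # Long-term retention: lowest-cost archive class.
                    {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


# Hypothetical prefix and thresholds, chosen only for illustration.
lifecycle = build_lifecycle_configuration("raw-fastq/", 30, 180)
print(json.dumps(lifecycle, indent=2))
```

Generating the rule from a function makes it easy to apply different thresholds to raw reads, aligned files, and results, which typically have different access patterns.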
Do you see different trends in genomics workflows in cloud versus on-premises?
Organizations running genomic workloads on-premises are typically capacity constrained, which can lead to significant wait times in the available queues and more complicated resource management. When moving to the cloud, most organizations choose to transition to AWS Batch for task and resource management. AWS Batch provides on-demand and spot queues that offer a serverless, pay-for-what-you-use approach to scaling capacity for increased throughput. Amazon FSx for Lustre is also regularly integrated for data staging and I/O management, which mimics the network file system of an on-premises cluster for data access.
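To make the on-demand and spot queue choice concrete, here is a minimal sketch that builds a job request for one sample and routes it to one of two queues. The queue names, the `SAMPLE_ID` environment variable, and the retry counts are hypothetical. The dictionary keys mirror the keyword arguments of the boto3 Batch client's `submit_job` call, but the example only constructs the request and does not contact AWS.

```python
# Hypothetical queue names, used only for illustration.
SPOT_QUEUE = "genomics-spot-queue"
ON_DEMAND_QUEUE = "genomics-on-demand-queue"


def build_batch_job_request(sample_id, tool, vcpus, memory_mib, interruptible=True):
    """Build submit_job-style arguments for processing one sample."""
    # Work that tolerates interruption goes to the cheaper spot queue.
    queue = SPOT_QUEUE if interruptible else ON_DEMAND_QUEUE
    return {
        "jobName": f"{tool}-{sample_id}",
        "jobQueue": queue,
        # Assumes a job definition registered under the tool's name.
        "jobDefinition": tool,
        "containerOverrides": {
            # Right-size each task instead of using one fixed node shape.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
            "environment": [{"name": "SAMPLE_ID", "value": sample_id}],
        },
        # Allow extra attempts on spot, where instances can be reclaimed.
        "retryStrategy": {"attempts": 3 if interruptible else 1},
    }


request = build_batch_job_request("sample-001", "align-reads", vcpus=16, memory_mib=65536)
print(request["jobName"], "->", request["jobQueue"])
```

With configured credentials and existing queues, a request like this could be passed as `client.submit_job(**request)`. Sizing each task separately is what replaces the fixed node shapes of an on-premises cluster.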
When migrating secondary analysis to the cloud, teams often use AWS compute instance types to optimize their tasks for cost and performance. Teams also move to containers and CI/CD to build and deploy tools independently through their own deployment pipelines. This minimizes the consequence of change and improves operational efficiency.
Downstream genomic analysis frequently uses R and Python languages and community-driven packages. Researchers on-premises usually manage installation and environment dependencies on their own computers, or they’ll work with their IT department for version updates. When installed at user, group, or organization levels, application dependencies can be difficult to manage and may break workloads. On AWS, it’s fairly easy to migrate to containers that integrate all the tool dependencies. Then you can build, package, and deploy tools independently in Docker images, which minimizes issues for researchers.
What's your outlook for genomics, and what role will cloud play in future development efforts?
Genomics research has led to remarkable advances across industries. This includes accelerating drug discovery, enhancing clinical trials, improving decision support for precision medicine, enabling population sequencing, and powering sustainable and diverse agriculture. Using patients’ molecular signatures in research and clinical care will speed up the research and development process, reduce costs, improve patient satisfaction, and ultimately drive the efficacy of clinical trials with up to a 2x higher success rate.
Because sequencing has gotten cheaper over the last decade, population genomics is on the rise, and millions of individuals have been sequenced around the world. The resulting increase in data has provided larger and more diverse datasets, so we can ask more complex questions about how genes may influence health. This drives progress to improve the prevention, diagnosis, and treatment of a range of illnesses, including cancer and rare genetic diseases. These practices are being similarly applied in agriculture. Genomics is powering sustainable and diverse initiatives that allow farmers to improve plant and crop yield in addition to animal breeding practices.
Genomic analysis and interpretation requires researchers to collaborate with peers, the scientific community, and institutions. The large datasets being analyzed require integrated multi-modal datasets and knowledge bases, intensive computational power, big data analytics, and machine learning at scale.
The cloud enables genomics to innovate by making data and methods more findable, accessible, interoperable, and reusable. The storage and compute services on AWS reduce the time between sequencing and interpretation, with secure and seamless sharing capabilities plus cost-effective infrastructure. Data and tool repositories allow for easier access to existing resources that can be readily deployed in replicate environments, accelerating the modern study of genomics.
Ryan Ulaszek is Worldwide Tech Lead, Genomics at AWS where he oversees technical initiatives in genomics and acts as liaison between the AWS Service teams and the technical field community, worldwide. Ryan has worked as a genomics specialist in AWS for three years, helping AWS life science customers architect genomics solutions in the Amazon Web Services (AWS) Cloud. Ryan has worked as a software engineer for over twenty years in numerous life science companies, including Human Longevity, founded by J. Craig Venter, where he was the principal architect. He also worked within Amazon as a Senior Software Engineer, leading architecture projects that spanned across teams in Amazon Fresh and Amazon Retail. Ryan holds patents in life science and software engineering and five AWS Solutions Architecture certifications, including the AWS Solutions Architect Professional and AWS DevOps Engineer Professional certifications.
Worldwide Lead - Bioinformatics
Lisa McFerrin is bioinformatics lead at AWS, where she drives initiatives supporting the genomics, healthcare, and life science industries as part of worldwide business development. Prior to AWS, Lisa worked at Fred Hutchinson Cancer Research Center where she specialized in the development of software and methods that bridge genomic and clinical data to advance the understanding of cancer biology and improve patient care. In these roles, she facilitates collaborative and reproducible research in order to lower the barriers in communication, analysis, and sharing of data, knowledge and methods. Lisa has a background in math and computer science and obtained her PhD in Bioinformatics from North Carolina State University.
- Genomics - July 2021
- 5G - June 2021
- Travel and Hospitality - May 2021
- Biopharma - April 2021
- Semiconductor Design - March 2021
- Manufacturing - February 2021
- Open Source - Nov/Dec 2020
- AWS Solutions – October 2020
- Robotics – September 2020
- Agriculture – August 2020
- Advertising & Marketing – July 2020
- Media & Entertainment – June 2020
- Education – May 2020
- Automotive – April 2020
- Data Lakes – March 2020
- Healthcare – February 2020
- AWS re:Invent – January 2020
- Manufacturing – Nov/Dec 2019
- Financial Services – October 2019
- Games – September 2019
- Serverless – August 2019
- Machine Learning – July 2019
- Internet of Things – June 2019
AWS Architecture Monthly provides new and curated content about architecting in the AWS Cloud. Our goal is to provide you with the best new technical content from AWS, from in-depth tutorials and whitepapers to customer videos and trending articles. We also interview industry experts who provide unique perspectives about the month’s theme and its related AWS services and solutions.