What does this AWS Solutions Implementation do?

The Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution creates a scalable environment in AWS to prepare genomic data for large-scale analysis and perform interactive queries against a genomics data lake. This solution can help IT infrastructure architects, administrators, data scientists, software engineers, and DevOps professionals build, package, and deploy libraries used for genomics data conversion; provision data ingestion pipelines for genomics data preparation and cataloging; and run interactive queries against a genomics data lake.

Data outputs from secondary analysis can be large and complex. For example, Variant Call Format (VCF) files must be converted to file formats optimized for big data (like Parquet) and incorporated into existing genomics datasets. A data catalog must be updated with the appropriate schema and version so that users can find the data they need and operate within a defined, semantically consistent data model. Annotation datasets and phenotypic data must be processed, cataloged, and ingested into an existing data lake in order to build a cohort, aggregate the data, and enrich the result set with data from annotation sources. Data governance and fine-grained data access controls are necessary to secure the data while still providing sufficient data access for research and informatics communities. The Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution simplifies this process.
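
As a rough illustration of the conversion step described above, the following Python sketch flattens the fixed columns of VCF data lines into one record per alternate allele, which is the kind of tabular shape a columnar format such as Parquet expects. This is not the solution's own conversion library; the function name and output field names are assumptions made for illustration, and writing the records to Parquet (for example, with a library such as pyarrow) is left out.

```python
from typing import Dict, Iterator, List


def parse_vcf_lines(lines: List[str]) -> Iterator[Dict[str, object]]:
    """Yield one flat record per alternate allele from VCF text lines.

    Only the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) are
    read; the output field names are illustrative, not the solution's schema.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information (##) and header (#CHROM) lines
        chrom, pos, variant_id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # ALT may list several alleles
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "id": None if variant_id == "." else variant_id,
                "ref": ref,
                "alt": allele,
            }


# Made-up sample input, for illustration only.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\trs1\tA\tG,T\t50\tPASS\t.",
]
records = list(parse_vcf_lines(sample))
for record in records:
    print(record)
```

Splitting multi-allelic sites into one row per allele is a design choice made here to keep each row simple to filter and join; a real pipeline would also carry the INFO and per-sample genotype columns.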

This solution provides a genomics data lake in Amazon Simple Storage Service (Amazon S3) and sets up genomics and annotation ingestion pipelines that use AWS Glue ETL jobs and crawlers to populate it. The solution demonstrates how to use Amazon Athena for data analysis and interpretation on top of the genomics data lake and creates a drug response report from within a Jupyter notebook.
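
To give a sense of what an interactive query against the data lake might look like from a notebook, the sketch below builds the arguments for the boto3 Athena `start_query_execution` call. The database, table, and column names (`chrom`, `pos`) and the S3 output location are hypothetical placeholders, not names taken from the solution; substitute whatever your data catalog actually contains.

```python
import re
from typing import Any, Dict

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_athena_request(database: str, table: str, chrom: str,
                         start: int, end: int,
                         output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for athena.start_query_execution.

    The table and column names are assumptions for illustration only.
    """
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not chrom.isalnum():
        raise ValueError(f"unexpected chromosome label: {chrom!r}")
    query = (
        f"SELECT * FROM {table} "
        f"WHERE chrom = '{chrom}' "
        f"AND pos BETWEEN {int(start)} AND {int(end)}"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def submit(request: Dict[str, Any]) -> str:
    """Submit the query with boto3 and return its execution ID.

    Requires boto3 and AWS credentials; not called in this sketch.
    """
    import boto3  # imported lazily so the builder works without boto3

    athena = boto3.client("athena")
    return athena.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names throughout.
request = build_athena_request(
    database="genomicsdb",
    table="variants",
    chrom="1",
    start=100,
    end=200,
    output_location="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

Keeping the request builder separate from the submission call makes the query text easy to inspect or test before anything runs in your account.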

AWS Solutions Implementation overview

The diagram below presents the architecture you can automatically deploy using the solution's implementation guide and accompanying AWS CloudFormation template.

Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution architecture

The AWS CloudFormation template creates four CloudFormation stacks in your AWS account, including a setup stack that installs the solution. The other three stacks are a landing zone (zone) stack containing the common solution resources and artifacts, a deployment pipeline (pipe) stack defining the solution's CI/CD pipeline, and a codebase (code) stack providing the ETL scripts, jobs, crawlers, a data catalog, and notebook resources.

The setup stack creates an AWS CodeBuild project containing the setup.sh script. This script creates the remaining CloudFormation stacks and provides the source code for both the AWS CodeCommit pipe repository and the code repository.

The landing zone (zone) stack creates the CodeCommit pipe repository. After the landing zone (zone) stack completes its setup, the setup.sh script pushes source code to the CodeCommit pipe repository.

The deployment pipeline (pipe) stack creates the CodeCommit code repository, an Amazon CloudWatch event, and the CodePipeline code pipeline. After the deployment pipeline (pipe) stack completes its setup, the setup.sh script pushes source code to the CodeCommit code repository.

The CodePipeline (code) pipeline deploys the codebase (code) CloudFormation stack. After the AWS CodePipeline pipelines complete their setup, the resources deployed in your account include Amazon Simple Storage Service (Amazon S3) buckets for storing object access logs, build artifacts, and data in your data lake; CodeCommit repositories for source code; an AWS CodeBuild project for building code artifacts (for example, third-party libraries used for data processing); an AWS CodePipeline pipeline for automating builds and deployment of resources; example AWS Glue jobs, crawlers, and a data catalog; and an Amazon SageMaker Jupyter notebook instance. 
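
The order in which the four stacks come up can be summarized with a small dependency sketch. It assumes the sequence described above (setup first, then zone, then pipe, then code) and models it with Python's standard-library `graphlib`; the dictionary and function names are illustrative and are not part of the solution's source code.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set

# Each stack maps to the stacks that must finish before it is deployed,
# following the sequence described in the text above.
STACK_DEPENDENCIES: Dict[str, Set[str]] = {
    "setup": set(),     # runs setup.sh in an AWS CodeBuild project
    "zone": {"setup"},  # creates the CodeCommit pipe repository
    "pipe": {"zone"},   # creates the code repository and the pipeline
    "code": {"pipe"},   # deployed by the CodePipeline code pipeline
}


def deployment_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the stacks in an order that respects their dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


print(deployment_order(STACK_DEPENDENCIES))
# prints ['setup', 'zone', 'pipe', 'code']
```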

Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena

Version 1.0.1
Last updated: 09/2020
Author: AWS

Estimated deployment time: 30 min

Features

Provide a scalable environment in AWS for large-scale genomics analysis

Create a scalable environment in AWS to prepare genomic data for large-scale analysis and perform interactive queries against a genomics data lake.

Leverage infrastructure as code best practices

Rapidly evolve the solution using infrastructure as code (IaC) principles and best practices.

Leverage continuous integration and continuous delivery (CI/CD)

Use AWS CodeCommit source code repositories and AWS CodePipeline to build and deploy updates to data preparation jobs and crawlers, data lake configurations, and Jupyter notebooks.

Modify your genomics data preparation pipelines and Jupyter notebooks for analysis

Modify the solution to fit your particular needs, for example, by adding new AWS Glue jobs and crawlers or new Jupyter notebooks for data analysis. Each change is tracked by the CI/CD pipeline, facilitating change control management, rollbacks, and auditing.
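
As one hedged example of such a modification, the sketch below generates a CloudFormation resource fragment for an additional AWS Glue crawler. `AWS::Glue::Crawler` and its `Role`, `DatabaseName`, and `Targets` properties are part of CloudFormation, but the logical ID, role ARN, database name, and S3 path are hypothetical placeholders, and the layout of the solution's own templates is not shown here.

```python
import json
from typing import Any, Dict


def glue_crawler_resource(role_arn: str, database_name: str,
                          s3_path: str) -> Dict[str, Any]:
    """Return a CloudFormation resource definition for an AWS Glue crawler."""
    return {
        "Type": "AWS::Glue::Crawler",
        "Properties": {
            "Role": role_arn,
            "DatabaseName": database_name,
            "Targets": {"S3Targets": [{"Path": s3_path}]},
        },
    }


# All names and values below are placeholders, not taken from the solution.
template_fragment = {
    "Resources": {
        "ExampleAnnotationCrawler": glue_crawler_resource(
            role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
            database_name="genomicsdb",
            s3_path="s3://example-data-lake-bucket/annotations/",
        )
    }
}
print(json.dumps(template_fragment, indent=2))
```

Committing a change like this to the code repository would let the CI/CD pipeline pick it up, so the new crawler is deployed and tracked the same way as the existing resources.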