What does this AWS Solutions Implementation do?
The Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution creates a scalable environment in AWS to prepare genomic data for large-scale analysis and perform interactive queries against a genomics data lake. This solution can help IT infrastructure architects, administrators, data scientists, software engineers, and DevOps professionals build, package, and deploy libraries used for genomics data conversion; provision data ingestion pipelines for genomics data preparation and cataloging; and run interactive queries against a genomics data lake.
Data outputs from secondary analysis can be large and complex. For example, Variant Call Format (VCF) files must be converted to file formats optimized for big data (such as Apache Parquet) and incorporated into existing genomics datasets. A data catalog must be updated with the appropriate schema and version so that users can find the data they need and operate within a defined, semantically consistent data model. Annotation datasets and phenotypic data must be processed, cataloged, and ingested into an existing data lake in order to build a cohort, aggregate the data, and enrich the result set with data from annotation sources. Data governance and fine-grained access controls are necessary to secure the data while still providing sufficient access for research and informatics communities. The Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution simplifies this process.
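To make the conversion step more concrete, here is a minimal sketch of the flattening that usually precedes writing a columnar format: VCF text lines become one flat record per variant and sample. It is an illustration only and does not reproduce the solution's own conversion libraries or Glue jobs. The function name vcf_rows, the output field names, and the example data are hypothetical, and the sketch reads only the GT (genotype) key from the FORMAT column.

```python
from typing import Dict, Iterator, List


def vcf_rows(lines: List[str]) -> Iterator[Dict[str, object]]:
    """Yield one flat record per (variant, sample) pair from VCF text lines."""
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names follow the eight fixed columns and FORMAT.
            samples = fields[9:]
            continue
        chrom, pos, variant_id, ref, alt = fields[:5]
        format_keys = fields[8].split(":") if len(fields) > 8 else []
        for sample, value in zip(samples, fields[9:]):
            call = dict(zip(format_keys, value.split(":")))
            yield {
                "sample_id": sample,
                "chrom": chrom,
                "pos": int(pos),
                "id": variant_id,
                "ref": ref,
                "alt": alt,
                "genotype": call.get("GT"),
            }


# Made-up example input: one variant, two samples (S1 and S2).
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t16050075\tvar1\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1",
]
rows = list(vcf_rows(EXAMPLE_VCF))
```

Records shaped like these could then be written to Parquet by whichever engine is in use (for example, a Spark job); that step is outside this sketch.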
This solution provides a genomics data lake in Amazon Simple Storage Service (Amazon S3) and sets up genomics and annotation ingestion pipelines that use AWS Glue ETL jobs and crawlers to populate it. The solution demonstrates how to use Amazon Athena for data analysis and interpretation on top of the genomics data lake, and creates a drug response report from within a Jupyter notebook.
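As a rough illustration of querying the data lake with Athena from Python, such as from a notebook, the sketch below starts a query and polls until it reaches a terminal state. It assumes an Athena client object such as the one returned by boto3.client("athena"). The helper name run_athena_query is hypothetical, and the database name, SQL text, and S3 output location are placeholders you would supply. It is not taken from the solution's notebook.

```python
import time
from typing import Any


def run_athena_query(client: Any, sql: str, database: str, output_s3: str,
                     poll_seconds: float = 1.0) -> str:
    """Start an Athena query, wait for a terminal state, return its execution ID."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING
```

With boto3 installed and AWS credentials configured, the client argument would be boto3.client("athena"), and the returned ID can be passed to the client's get_query_results call to read the rows.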
AWS Solutions Implementation overview
The diagram below presents the architecture you can automatically deploy using the solution's implementation guide and accompanying AWS CloudFormation template.
Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena solution architecture
The AWS CloudFormation template creates four CloudFormation stacks in your AWS account. The first is a setup stack that installs the solution. The other three are a landing zone (zone) stack containing the common solution resources and artifacts, a deployment pipeline (pipe) stack defining the solution's CI/CD pipeline, and a codebase (code) stack providing the ETL scripts, jobs, crawlers, a data catalog, and notebook resources.
The setup stack creates an AWS CodeBuild project containing the setup.sh script. This script creates the remaining CloudFormation stacks and provides the source code for both the AWS CodeCommit pipe repository and the code repository.
The landing zone (zone) stack creates the CodeCommit pipe repository. After the landing zone (zone) stack completes its setup, the setup.sh script pushes source code to the CodeCommit pipe repository.
The deployment pipeline (pipe) stack creates the CodeCommit code repository, an Amazon CloudWatch event, and the CodePipeline code pipeline. After the deployment pipeline (pipe) stack completes its setup, the setup.sh script pushes source code to the CodeCommit code repository.
The CodePipeline (code) pipeline deploys the codebase (code) CloudFormation stack. After the AWS CodePipeline pipelines complete their setup, the resources deployed in your account include Amazon Simple Storage Service (Amazon S3) buckets for storing object access logs, build artifacts, and data in your data lake; CodeCommit repositories for source code; an AWS CodeBuild project for building code artifacts (for example, third-party libraries used for data processing); an AWS CodePipeline pipeline for automating builds and deployment of resources; example AWS Glue jobs, crawlers, and a data catalog; and an Amazon SageMaker Jupyter notebook instance.
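The deployment sequence described above can be summarized as a small dependency graph. This is only a reading aid, not part of the solution: the step labels are informal, and placing the pipe stack after the zone stack is an assumption taken from the order of the narrative, which does not state that dependency explicitly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the steps that must finish before it can start.
STEPS = {
    "setup stack": set(),
    "zone stack": {"setup stack"},
    "push source to pipe repository": {"zone stack"},
    "pipe stack": {"zone stack"},  # assumption: inferred from narrative order
    "push source to code repository": {"pipe stack"},
    "code stack": {"push source to code repository"},
}

# One valid ordering that respects every dependency listed above.
order = list(TopologicalSorter(STEPS).static_order())
for step in order:
    print(step)
```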
Provide a scalable environment in AWS for large-scale genomics analysis
Leverage infrastructure as code best practices
Leverage continuous integration and continuous delivery (CI/CD)
Modify your genomics data preparation pipelines and Jupyter notebooks for analysis