This guidance helps users prepare genomic, clinical, mutation, expression, and imaging data for large-scale analysis and perform interactive queries against a data lake. It includes infrastructure-as-code automation, ingestion pipelines to transform the data, and notebooks and dashboards for interactive analysis. This guidance was built in collaboration with BioTeam.
- Ingest, format, and catalog data from The Cancer Genome Atlas (TCGA) Program. The raw data is pulled from the Registry of Open Data on AWS (RODA) through the TCGA API. The data is transformed in an AWS Glue Extract, Transform, and Load (ETL) job and cataloged by a Glue Crawler, making it available for query in Amazon Athena.
- Data from The Cancer Imaging Archive (TCIA) is ingested, formatted, and cataloged. The data is transformed in an AWS Glue ETL job and cataloged by a Glue Crawler.
- Data from the 1000 Genomes project and ClinVar is ingested, formatted, and cataloged, pulling the raw data from the RODA on Amazon Simple Storage Service (Amazon S3). The datasets are transformed in AWS Glue ETL jobs and cataloged by Glue Crawlers.
- Research scientists analyze the multi-modal data through a visual interface in Amazon QuickSight. The data is cached in a SPICE (Super-fast, Parallel, In-memory Calculation Engine) database, optimizing query performance.
- Data Scientists analyze the data with code using Jupyter notebooks provided through Amazon SageMaker notebook environments.
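As a sketch of the last step above, a data scientist could query the cataloged data lake from a notebook through Amazon Athena. The database and table names below (`genomicsdb`, `tcga_clinical`) and the results bucket are assumptions for illustration; substitute the names your deployment creates:

```python
def build_survival_query(database: str, table: str, limit: int = 10) -> str:
    """Build a simple Athena SQL query over the cataloged clinical data."""
    return (
        f'SELECT case_id, project_id, vital_status '
        f'FROM "{database}"."{table}" '
        f'LIMIT {limit}'
    )

def run_athena_query(sql: str, output_s3: str, region: str = "us-east-1") -> str:
    """Submit the query to Athena and return the query execution id."""
    import boto3  # imported here so the helper above stays dependency-free
    athena = boto3.client("athena", region_name=region)
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]

# Example (requires AWS credentials and the deployed data lake):
# qid = run_athena_query(build_survival_query("genomicsdb", "tcga_clinical"),
#                        "s3://my-athena-results/")
```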
- Creates an AWS CodeBuild project containing the setup.sh script. This script creates the remaining AWS CloudFormation stacks and pushes source code to the repositories they create.
- The landing zone (zone) stack creates the AWS CodeCommit pipe repository. After the landing zone (zone) stack completes its setup, the setup.sh script pushes source code to the CodeCommit pipe repository.
- The deployment pipeline (pipe) stack creates the CodeCommit code repository, an Amazon CloudWatch event, and the AWS CodePipeline code pipeline. After the deployment pipeline (pipe) stack completes its setup, the setup.sh script pushes source code to the CodeCommit code repository.
- The CodePipeline (code) pipeline deploys the codebase (genomics and imaging) CloudFormation stacks. After the AWS CodePipeline pipelines complete their setup, the resources deployed in your account include Amazon Simple Storage Service (Amazon S3) buckets for storing object access logs, build artifacts, and data in your data lake; CodeCommit repositories for source code; an AWS CodeBuild project for building code artifacts; an AWS CodePipeline pipeline for automating builds and deployment of resources; example AWS Glue jobs, crawlers, and a data catalog; and an Amazon SageMaker Jupyter notebook instance.
- The imaging stack outputs a hyperlink to a CloudFormation quick start, which can be launched to deploy the Amazon QuickSight (quicksight) stack. The QuickSight stack creates the AWS Identity and Access Management (IAM) and QuickSight resources necessary to interactively explore the multi-omics dataset.
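After deployment, you can verify that the stacks described above reached a healthy state. A minimal sketch using boto3; the stack name prefix is an assumption, so use whatever project name you chose at setup:

```python
def healthy(statuses):
    """Return the subset of (name, status) pairs that finished successfully."""
    ok = {"CREATE_COMPLETE", "UPDATE_COMPLETE"}
    return [(name, status) for name, status in statuses if status in ok]

def stack_statuses(name_prefix):
    """List (StackName, StackStatus) for stacks whose name starts with name_prefix."""
    import boto3  # requires AWS credentials when actually called
    cfn = boto3.client("cloudformation")
    pages = cfn.get_paginator("describe_stacks").paginate()
    return [(s["StackName"], s["StackStatus"])
            for page in pages for s in page["Stacks"]
            if s["StackName"].startswith(name_prefix)]

# Example: healthy(stack_statuses("multimodal-"))
```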
This guidance uses AWS CodeBuild and AWS CodePipeline to build, package, and deploy everything needed in the solution to transform Variant Call Format (VCF) files with Hail and work with multi-modal and multi-omic data from the datasets in The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA). Code changes made in the solution AWS CodeCommit repository will be deployed through the provided CodePipeline deployment pipeline.
This guidance uses role-based access with IAM, and all buckets have encryption enabled, are private, and block public access. The data catalog in AWS Glue has encryption enabled, and all metadata written by AWS Glue to Amazon S3 is encrypted. All roles are defined with least privilege, and all communications between services stay within the customer account. Administrators can control data access for Jupyter notebooks, Amazon Athena, and Amazon QuickSight through the provided IAM roles.
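The "block public access" posture described above corresponds to the four S3 block-public-access settings shown below. This is a hedged sketch, not the guidance's own code, and the bucket name is a placeholder:

```python
def public_access_block_config():
    """All four settings must be True to fully block public access."""
    return {
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    }

def enforce_private_bucket(bucket_name):
    """Apply the block-public-access configuration to one bucket."""
    import boto3  # requires AWS credentials when actually called
    boto3.client("s3").put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=public_access_block_config(),
    )

# Example: enforce_private_bucket("my-datalake-bucket")
```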
AWS Glue, Amazon S3, and Amazon Athena are all serverless and scale data access performance as your data volume increases. AWS Glue provisions, configures, and scales the resources required to run your data integration jobs, and because Amazon Athena is serverless, you can quickly query your data without having to set up and manage any servers or data warehouses. Amazon QuickSight SPICE in-memory storage scales your data exploration to thousands of users.
By using serverless technologies, you only provision the exact resources you use. Each AWS Glue job provisions a Spark cluster on demand to transform data and de-provisions the resources when the job completes. If you choose to add new TCGA datasets, you can add new AWS Glue jobs and AWS Glue crawlers that will also provision resources on demand. Amazon Athena automatically executes queries in parallel, so most results come back within seconds.
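As an illustration of the on-demand model above, starting one of the Glue ETL jobs from code provisions its Spark resources only for the duration of the run. The job name and argument keys below are hypothetical; the deployed jobs define their own:

```python
def glue_job_arguments(input_path, output_path):
    """Glue passes these as command-line arguments to the job script."""
    return {"--input_path": input_path, "--output_path": output_path}

def start_etl_job(job_name, arguments):
    """Start a Glue job run; Glue provisions and tears down the cluster."""
    import boto3  # requires AWS credentials when actually called
    resp = boto3.client("glue").start_job_run(
        JobName=job_name, Arguments=arguments)
    return resp["JobRunId"]

# Example:
# start_etl_job("tcga-expression-etl",
#               glue_job_arguments("s3://raw/tcga/", "s3://lake/tcga/"))
```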
By using serverless technologies that scale on demand, you only pay for the resources you use. To further optimize cost, you can stop the notebook environments in Amazon SageMaker when they are not in use. The Amazon QuickSight dashboard is also deployed through a separate AWS CloudFormation template, so if you don’t intend to use the visualization dashboard, you can choose not to deploy it to save costs.
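The notebook-stopping advice above can be automated. A hedged sketch: the instance name is a placeholder, and only an InService instance can be stopped:

```python
def should_stop(status):
    """A notebook instance can only be stopped while it is InService."""
    return status == "InService"

def stop_if_running(instance_name):
    """Stop a SageMaker notebook instance. Stopped instances stop accruing
    compute charges (the attached EBS volume still incurs storage cost)."""
    import boto3  # requires AWS credentials when actually called
    sm = boto3.client("sagemaker")
    status = sm.describe_notebook_instance(
        NotebookInstanceName=instance_name)["NotebookInstanceStatus"]
    if should_stop(status):
        sm.stop_notebook_instance(NotebookInstanceName=instance_name)

# Example: stop_if_running("multimodal-analysis-notebook")
```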
By extensively using managed services and dynamic scaling, you minimize the environmental impact of the backend services. A critical component for sustainability is to maximize the usage of notebook server instances, as covered in the performance and cost pillars. Stop the notebook environments when not in use.
This architecture chose AWS Glue for the Extract, Transform, and Load (ETL) work needed to ingest, prepare, and catalog the solution's datasets for query performance. You can add new AWS Glue Jobs and Glue Crawlers to ingest new The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) datasets, as needed. You can also add new jobs and crawlers to ingest, prepare, and catalog your own proprietary datasets.
This architecture chose Amazon SageMaker Notebooks to provide a Jupyter notebook environment for analysis. You can add new notebooks to the existing environment or create new environments. If you prefer RStudio to Jupyter notebooks, you can use RStudio on Amazon SageMaker.
This architecture chose Amazon QuickSight to provide interactive dashboards for data visualization and exploration. The QuickSight dashboard setup is through a separate AWS CloudFormation template, so if you don’t intend to use the dashboard, you don’t have to provision it. In QuickSight, you can create your own analyses, explore additional filters or visualizations, and share datasets and analyses with colleagues.
This repository creates a scalable environment in AWS to prepare genomic, clinical, mutation, expression and imaging data for large-scale analysis and perform interactive queries against a data lake. The solution demonstrates how to 1) build, package, and deploy libraries used for genomics data conversion, 2) provision serverless data ingestion pipelines for multi-modal data preparation and cataloging, 3) visualize and explore clinical data through an interactive interface, and 4) run interactive analytic queries against a multi-modal data lake.
BioTeam is a life sciences IT consulting company passionate about accelerating scientific discovery by closing the gap between what scientists want to do with data—and what they can do. Working at the intersection of science, data and technology since 2002, BioTeam has the interdisciplinary capabilities to apply strategies, advanced technologies, and IT services that solve the most challenging research, technical, and operational problems. Skilled at translating scientific needs into powerful scientific data ecosystems, we take pride in our ability to partner with a broad range of leaders in life sciences research, from biotech startups to the largest global pharmaceutical companies, from federal government agencies to academic research institutions.
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.