Overview
How it works
Architecture
Prepare genomic, clinical, mutation, expression and imaging data for large-scale analysis and query against a data lake.

CI/CD
Prepare genomic, clinical, mutation, expression and imaging data for large-scale analysis and query against a data lake.

Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Additional Considerations
Data Transformation
This architecture uses AWS Glue for the Extract, Transform, and Load (ETL) jobs that ingest, prepare, and catalog the datasets in the solution so they can be queried efficiently. You can add new AWS Glue Jobs and AWS Glue Crawlers to ingest additional datasets from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), as needed. You can also add new jobs and crawlers to ingest, prepare, and catalog your own proprietary datasets.
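As a rough illustration of adding a crawler for your own dataset, the following sketch builds the arguments for the AWS Glue CreateCrawler API and, optionally, submits them with boto3. It is not taken from the Guidance's repository. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the helper functions `build_crawler_request` and `register_crawler` are invented for this example.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateCrawler API call."""
    if not s3_path.startswith("s3://"):
        raise ValueError(f"expected an s3:// path, got {s3_path!r}")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_crawler(request: Dict[str, Any]) -> None:
    """Create the crawler and start its first run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only; replace them with your own resources.
request = build_crawler_request(
    name="my-proprietary-dataset-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_datalake_db",
    s3_path="s3://example-bucket/proprietary/clinical/",
)
# register_crawler(request)  # uncomment to run against your own AWS account
```

Keeping the request builder separate from the API call lets you review or unit-test the crawler definition before anything is created in your account.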
Data Analysis
This architecture uses SageMaker Notebooks to provide a Jupyter notebook environment for analysis. You can add new notebooks to the existing environment or create new environments. If you prefer RStudio to Jupyter notebooks, you can use RStudio on Amazon SageMaker.
Data Visualization
This architecture uses QuickSight to provide interactive dashboards for data visualization and exploration. The QuickSight dashboard is set up through a separate CloudFormation template, so you don’t have to provision it if you don’t intend to use the dashboard. In QuickSight, you can create your own analyses, explore additional filters or visualizations, and share datasets and analyses with colleagues.
Deploy with confidence
This repository creates a scalable environment in AWS to prepare genomic, clinical, mutation, expression, and imaging data for large-scale analysis and to run interactive queries against a data lake. The solution demonstrates how to (1) use HealthOmics Variant Store and Annotation Store to store genomic variant data and annotation data, (2) provision serverless data ingestion pipelines for multi-modal data preparation and cataloging, (3) visualize and explore clinical data through an interactive interface, and (4) run interactive analytic queries against a multi-modal data lake using Amazon Athena and Amazon SageMaker.
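To make the interactive-query step more concrete, here is a minimal sketch of submitting a query to Amazon Athena with boto3. It is an illustration under stated assumptions, not the code shipped in the repository. The database name, table name, and results bucket are hypothetical, and the helpers `build_athena_request` and `run_query` are invented for this example.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for the Athena StartQueryExecution API call."""
    if not output_location.startswith("s3://"):
        raise ValueError(f"expected an s3:// location, got {output_location!r}")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical database, table, and bucket names for illustration only.
athena_request = build_athena_request(
    sql="SELECT * FROM example_clinical_table LIMIT 10",
    database="example_datalake_db",
    output_location="s3://example-bucket/athena-results/",
)
# execution_id = run_query(athena_request)  # uncomment to run against your own AWS account
```

Athena runs queries asynchronously, so the returned execution ID would then be used to poll for completion and fetch results.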
A detailed guide is provided so you can experiment with and use the Guidance within your AWS account. The guide examines each stage of working with the Guidance, including deployment, usage, and cleanup, to prepare it for deployment.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Related Content
Guidance
Guidance for Multi-Modal Data Analysis with Health AI and ML Services on AWS
This Guidance demonstrates how to set up an end-to-end framework to analyze multimodal healthcare and life sciences (HCLS) data.