Guidance for Multi-Modal Data Analysis with AWS Health and ML Services

This Guidance demonstrates how to set up an end-to-end framework to analyze multimodal healthcare and life sciences (HCLS) data. It analyzes this data using purpose-built healthcare and life sciences services (such as AWS HealthOmics, AWS HealthLake, and AWS HealthImaging) together with machine learning (ML) and analytics services (such as Amazon SageMaker, Amazon Athena, and Amazon QuickSight). It ingests raw HCLS data formats such as variant call format (VCF), Fast Healthcare Interoperability Resources (FHIR), and Digital Imaging and Communications in Medicine (DICOM), and provides a zero-ETL (extract, transform, load) architecture for customers who want to run their data analysis at scale on AWS.

The architecture shows how to store, transform, and analyze linked genomic, clinical, and medical imaging data of patients. The Guidance is demonstrated on a coherent synthetic patient dataset covering multiple disease scenarios, released by MITRE and available on the Registry of Open Data on AWS. The Guidance then trains an ML model to predict patient outcomes. It also includes an interactive dashboard that visualizes summary statistics of the data and ML model reports, and that can be customized for each user persona.

Please note: [Disclaimer]

Architecture Diagram

Guidance Architecture Diagram for Multi-Modal Data Analysis with AWS Health and ML Services

Step 1
Ingest genomic data from Amazon Simple Storage Service (Amazon S3) or Registry of Open Data on AWS (RODA) to AWS HealthOmics.

Use HealthOmics Reference store for reference genome data, such as Fast-All (FASTA), and HealthOmics Sequence store for sequence data, such as FASTQ, Binary Alignment Map (BAM), and Compressed Reference-oriented Alignment Map (CRAM).

Use HealthOmics Variant store for variant call format (VCF) files and HealthOmics Annotation store for annotation files. To run private or Ready2Run workflows, use HealthOmics Workflows.
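As a rough illustration of the variant ingestion in Step 1, the sketch below builds the request for the boto3 `omics` client's `start_variant_import_job` call. The store name, role ARN, and S3 URIs are placeholders, and the helper names are hypothetical rather than taken from the Guidance sample code. The 1,000-source check mirrors the default limit cited in the Reliability section.

```python
from __future__ import annotations

from typing import Any

# Default number of sources per Variant import job, as cited in this Guidance.
MAX_SOURCES_PER_JOB = 1000


def build_variant_import_request(
    store_name: str, role_arn: str, vcf_uris: list[str]
) -> dict[str, Any]:
    """Build keyword arguments for omics.start_variant_import_job."""
    if not vcf_uris:
        raise ValueError("at least one VCF source is required")
    if len(vcf_uris) > MAX_SOURCES_PER_JOB:
        raise ValueError(
            f"{len(vcf_uris)} sources exceed the default limit of {MAX_SOURCES_PER_JOB}"
        )
    not_s3 = [uri for uri in vcf_uris if not uri.startswith("s3://")]
    if not_s3:
        raise ValueError(f"sources must be S3 URIs, got: {not_s3[:3]}")
    return {
        "destinationName": store_name,
        "roleArn": role_arn,
        "items": [{"source": uri} for uri in vcf_uris],
    }


def start_variant_import(omics_client: Any, request: dict[str, Any]) -> str:
    """Start the import job and return its job ID.

    omics_client is expected to be boto3.client("omics").
    """
    return omics_client.start_variant_import_job(**request)["jobId"]
```

Passing the client in as an argument keeps the request-building logic testable without AWS credentials.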

Step 2
Ingest Fast Healthcare Interoperability Resources (FHIR) data to AWS HealthLake.

Step 3
Ingest Digital Imaging and Communications in Medicine (DICOM) images to AWS HealthImaging, and read them into in-memory Insight Toolkit (ITK) image objects through API calls.

Step 4
View tables from HealthOmics and HealthLake as resources in AWS Lake Formation.

Step 5
Query the tables with Amazon Athena.
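A minimal sketch of Step 5, assuming a boto3 `athena` client is passed in: it starts a query, then polls until Athena reports a terminal state. The database name, SQL text, and S3 output location are supplied by the caller and are not defined by this Guidance. The polling interval and retry cap are arbitrary illustrative values.

```python
import time
from typing import Any

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_query(
    athena: Any,
    sql: str,
    database: str,
    output_s3_uri: str,
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> tuple:
    """Run a query and return (query_execution_id, final_state).

    athena is expected to be boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )
    query_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")
```

Once the state is `SUCCEEDED`, results can be read with the client's `get_query_results` call or from the output location in Amazon S3.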

Step 6
Generate brain masks with the Medical Open Network for AI (MONAI) segmentation model. Use Amazon SageMaker Preprocessing to parallelize radiomic feature computation for each image representation.
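To make the idea of per-image feature computation under a mask concrete, here is a small stand-in sketch. It is not the Guidance's implementation: it computes only simple first-order statistics over flattened voxel lists, and it uses a local thread pool in place of SageMaker to show the fan-out over images. The function names and the input layout are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean, pvariance
from typing import Dict, Sequence, Tuple


def first_order_features(voxels: Sequence[float], mask: Sequence[int]) -> dict:
    """Compute simple intensity statistics over voxels where the mask is non-zero."""
    if len(voxels) != len(mask):
        raise ValueError("voxels and mask must have the same length")
    selected = [v for v, m in zip(voxels, mask) if m]
    if not selected:
        raise ValueError("mask selects no voxels")
    return {
        "mean": fmean(selected),
        "variance": pvariance(selected),
        "min": min(selected),
        "max": max(selected),
        "voxel_count": len(selected),
    }


def features_for_all(
    images: Dict[str, Tuple[Sequence[float], Sequence[int]]]
) -> Dict[str, dict]:
    """Compute features for every (voxels, mask) pair, one task per image."""
    names = list(images)
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda name: first_order_features(*images[name]), names)
        return dict(zip(names, results))
```

Because each image is processed independently, the same function could be applied per input file inside a distributed processing job.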

Step 7
Build visualization dashboards with Amazon QuickSight.

Step 8
Store the multimodal feature set in Amazon SageMaker Feature Store.
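As a hedged sketch of Step 8, the helper below converts one patient's features into the record shape accepted by the boto3 `sagemaker-featurestore-runtime` client's `put_record` call. The feature names and the feature group name are hypothetical. The feature group itself is assumed to already exist, with its record identifier and event time features included in the input dictionary.

```python
from typing import Any, Dict, List


def to_feature_record(features: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a feature dict to Feature Store's list-of-name/value form.

    Features whose value is None are omitted from the record.
    """
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
        if value is not None
    ]


def put_patient_features(
    featurestore_runtime: Any, feature_group_name: str, features: Dict[str, Any]
) -> None:
    """Write one record.

    featurestore_runtime is expected to be
    boto3.client("sagemaker-featurestore-runtime").
    """
    featurestore_runtime.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_feature_record(features),
    )
```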

Step 9
Build and train ML models on multimodal features with SageMaker AutoGluon-Tabular.

Step 10
Deploy the model as an endpoint for real-time inference.
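Once an endpoint exists, real-time inference might look like the sketch below, which uses the boto3 `sagemaker-runtime` client's `invoke_endpoint` call. The endpoint name is a placeholder, and both the `text/csv` request format and the plain-text response are assumptions that depend on the serving container actually deployed.

```python
from typing import Any, Sequence


def to_csv_row(values: Sequence[Any]) -> str:
    """Serialize one feature vector as a single CSV row."""
    return ",".join(str(value) for value in values)


def predict(runtime: Any, endpoint_name: str, feature_values: Sequence[Any]) -> str:
    """Send one row to the endpoint and return the raw response text.

    runtime is expected to be boto3.client("sagemaker-runtime").
    """
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_row(feature_values).encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8")
```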

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

HealthOmics integrates with Amazon EventBridge and provides notifications for actions such as Variant or Annotation store creation and deletion, as well as the start and completion of data import jobs. You can overlay rules and handling targets onto this Guidance to monitor and respond to incidents that may occur, such as repeated import failures.
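One way such a rule could be set up is sketched below with the boto3 `events` client's `put_rule` and `put_targets` calls. The `detail-type` strings and the `status` field in the event pattern are assumptions about the shape of HealthOmics events and should be checked against the HealthOmics EventBridge documentation before use. The rule name and target ARN are placeholders.

```python
import json
from typing import Any


def import_failure_pattern() -> dict:
    """Event pattern intended to match failed HealthOmics import jobs.

    The detail-type values and the detail.status field are assumed, not verified.
    """
    return {
        "source": ["aws.omics"],
        "detail-type": [
            "Variant Import Job Status Change",
            "Annotation Import Job Status Change",
        ],
        "detail": {"status": ["FAILED"]},
    }


def create_failure_rule(events: Any, rule_name: str, target_arn: str) -> None:
    """Create the rule and attach one target (for example, an SNS topic ARN).

    events is expected to be boto3.client("events").
    """
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(import_failure_pattern()),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "notify-on-import-failure", "Arn": target_arn}],
    )
```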

Read the Operational Excellence whitepaper
Security

HealthImaging enforces the use of AWS Key Management Service (AWS KMS) encryption: it does not allow the creation of an unencrypted data store. In addition, encryption at rest and in transit is supported by HealthOmics, HealthLake, Amazon SageMaker, Athena, QuickSight, Lake Formation, and Amazon S3. This Guidance uses AWS-owned keys, but customers can bring their own keys if needed.

Read the Security whitepaper
Reliability

When deploying this Guidance in an environment with pre-existing HealthOmics resources, you should be aware of HealthOmics Analytics quotas. This Guidance creates 1 Variant store and 1 Annotation store. By default, HealthOmics has a limit of 10 Variant stores and 10 Annotation stores. There are also default limits on the number of import jobs to HealthOmics Analytics stores and on the file sizes they can handle. The default limit is 5 concurrent Variant or Annotation store import jobs. This Guidance uses 1 Variant import job and 1 Annotation import job. Variant import jobs have a default limit of 1,000 sources, each with a limit of 20 GB. The example variant data used by this Guidance consists of about 800 variant files, each about 1 GB. Annotation import jobs have a default limit of 1 source of up to 20 GB. The example annotation data in this Guidance is a single file of about 10 GB.
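The quota arithmetic above can be checked before submitting a job. The sketch below encodes only the default limits stated in this section; the class and function names are hypothetical, and actual account quotas may differ if they have been adjusted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ImportLimits:
    max_sources: int
    max_gb_per_source: float


# Default limits as stated in this Guidance.
VARIANT_LIMITS = ImportLimits(max_sources=1000, max_gb_per_source=20.0)
ANNOTATION_LIMITS = ImportLimits(max_sources=1, max_gb_per_source=20.0)
MAX_CONCURRENT_IMPORT_JOBS = 5


def check_import(source_sizes_gb: Sequence[float], limits: ImportLimits) -> List[str]:
    """Return a list of limit violations; an empty list means the job fits."""
    problems = []
    if len(source_sizes_gb) > limits.max_sources:
        problems.append(
            f"{len(source_sizes_gb)} sources exceed the limit of {limits.max_sources}"
        )
    oversized = [size for size in source_sizes_gb if size > limits.max_gb_per_source]
    if oversized:
        problems.append(
            f"{len(oversized)} source(s) exceed {limits.max_gb_per_source} GB"
        )
    return problems
```

For the example data, `check_import([1.0] * 800, VARIANT_LIMITS)` and `check_import([10.0], ANNOTATION_LIMITS)` both return an empty list, so the two import jobs used by this Guidance fit within the defaults.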

Read the Reliability whitepaper
Performance Efficiency

The data in HealthLake is automatically available through Lake Formation. This allows customers to create organizational units (OUs) of users and then grant row and column-level access to those users depending on their data access requirements.
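As an illustration of column-level access, the sketch below builds a request for the boto3 `lakeformation` client's `grant_permissions` call that grants `SELECT` on a subset of columns. The principal ARN, database, table, and column names are placeholders. Row-level filtering is handled separately in Lake Formation through data filters and is not shown here.

```python
from typing import Any, Dict, Sequence


def column_grant_request(
    principal_arn: str, database: str, table: str, columns: Sequence[str]
) -> Dict[str, Any]:
    """Build keyword arguments for lakeformation.grant_permissions."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_column_access(lakeformation: Any, request: Dict[str, Any]) -> None:
    """Apply the grant.

    lakeformation is expected to be boto3.client("lakeformation").
    """
    lakeformation.grant_permissions(**request)
```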

Read the Performance Efficiency whitepaper
Cost Optimization

HealthLake automatically transforms your clinical data and makes it available in your data catalog, so you can run SQL queries on it directly. This eliminates the need to export HealthLake data and pay the associated data transfer costs.

Read the Cost Optimization whitepaper
Sustainability

By establishing a centralized data lake for all modalities, this Guidance removes the need to create redundant data. Data stores provided by HealthLake, HealthOmics, and HealthImaging become the single source of truth for each of their respective data types. Lake Formation can govern and filter each data type to provide users with the appropriate access to data without duplication. Similarly, you can create common database constructs, such as “views” in Athena, to support multiple analysis use cases without data replication.
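A view of this kind can be defined with a `CREATE OR REPLACE VIEW` statement submitted to Athena like any other query. The helper below only assembles the statement; the view name and the `SELECT` body are supplied by the caller, and any table or column names used with it are hypothetical rather than part of this Guidance.

```python
import re

# Conservative check so the view name cannot carry extra SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_view_sql(view_name: str, select_sql: str) -> str:
    """Return a CREATE OR REPLACE VIEW statement wrapping the given SELECT."""
    if not _IDENTIFIER.match(view_name):
        raise ValueError(f"invalid view name: {view_name!r}")
    body = select_sql.strip().rstrip(";")
    if not body.upper().startswith(("SELECT", "WITH")):
        raise ValueError("view body must be a SELECT or WITH query")
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"
```

Because a view stores only the query, several analyses can share one definition while the underlying data stays in its original store.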

Read the Sustainability whitepaper

Implementation Resources

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open sample code on GitHub

Related Content

Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS
