Simplifying Multi-modal & Multi-omics Analysis with AWS for Health
New AWS for Health Guidance: Multi-modal and Multi-omics
The new era of personalized health relies on data to guide more customized patient treatments, therapeutics, and diagnoses. Genomics sits at the core of personalized health, and by taking into account the individual variability among people and diseases, clinicians can create more personalized care journeys and targeted treatments. Across clinical and research disciplines, combining and analyzing different modalities including multiple molecular data types and imaging data is powering a more holistic view of patients and more robust insights into an area of study.
A great example of this is the work being done by Philips to incorporate multi-modal data into its Philips HealthSuite Platform, which was recently presented at the 2022 AWS Industry Innovators: Healthcare & Life Sciences event. To help determine the best treatment options on an individualized basis, Philips created a platform on AWS that integrates the different modalities of medical data involved in cancer treatment, including genomic, imaging, digital pathology, and clinical data. As a result, leading healthcare organizations like MD Anderson Cancer Center can now run more data-driven, personalized oncology treatments and clinical trial matching.
While the promise of multi-modal and multi-omics is becoming evident, the integration and analysis of varying forms of structured and unstructured data pose a unique set of challenges, including:
- Addressing the influx of diverse data types and formats
- Extracting insights from unstructured data, such as voice and imaging
- Ingesting, normalizing, structuring, and formatting differing data types for consumption
- Creating cohorts and defining relevant data subsets
To reduce barriers for handling and analyzing multi-modal and multi-omics data, AWS for Health has released the new Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS.
It is a prescriptive deep dive on how to prepare genomic, clinical, mutation, expression, and imaging data for large-scale analysis, and how to perform interactive queries, using The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) as example datasets. The ETL code provided in this guidance can be customized to ingest and transform additional datasets.
This comprehensive guidance provides step-by-step instructions and recommendations for:
- optimizing data formats and structures,
- querying and accessing data from different sources with ease, and
- integrating and analyzing genomics data together with other omics (for example, epigenomics, proteomics, transcriptomics, metabolomics) as well as other modalities of data (for example, X-rays, health records, recorded audio, wearables data).
Following the six pillars of the AWS Well-Architected Framework, the guidance is designed to help healthcare and life sciences organizations build a secure, resilient, and scalable environment on AWS. It describes how to prepare genomic, clinical, mutation, expression, and imaging data for large-scale analysis and perform interactive queries against a data lake.
The modern data architecture (Image 1) in this guidance demonstrates how to ingest common multi-omics data sets into a centralized data lake and work with that data using Amazon Athena and low-code Jupyter Notebooks. There are example ingestion pipelines for clinical, mutation, gene expression, and copy number data (TCGA), imaging metadata (TCIA), genomic variant call data (1000 Genomes), annotation data (ClinVar), and data from an individual Variant Call Format (VCF) file.
Image 1: AWS for Health Guidance: The Modern Data Architecture
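To make the ingestion step more concrete, here is a minimal Python sketch of the kind of flattening an ingestion pipeline might apply to a VCF file before writing it to a data lake. It is not the ETL code shipped with the guidance; the function names and the output column names are assumptions chosen for illustration. It relies only on the eight fixed, tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and emits one row per alternate allele.

```python
from typing import Dict, Iterable, List


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO field ("KEY=VALUE;FLAG") into a dict.

    Flag entries without a value are mapped to the string "true".
    """
    result: Dict[str, str] = {}
    if info == ".":  # "." means the field is empty
        return result
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else "true"
    return result


def flatten_vcf_line(line: str) -> List[Dict[str, object]]:
    """Turn one VCF data line into one row per alternate allele."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated VCF columns")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_map = parse_info(info)
    rows = []
    for allele in alt.split(","):  # multi-allelic sites list ALT as "G,T"
        rows.append({
            "chrom": chrom,
            "pos": int(pos),
            "id": None if variant_id == "." else variant_id,
            "ref": ref,
            "alt": allele,
            "qual": None if qual == "." else float(qual),
            "filter": filt,
            "info": info_map,
        })
    return rows


def flatten_vcf(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Skip header lines (starting with "#") and flatten the rest."""
    rows: List[Dict[str, object]] = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rows.extend(flatten_vcf_line(line))
    return rows
```

Rows in this shape can be handed to a columnar writer of your choice; splitting multi-allelic sites into separate rows keeps later per-variant queries simple, at the cost of repeating the shared fields.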
This guidance demonstrates how to:
- Build, package, and deploy libraries
- Provision serverless data ingestion pipelines for multi-modal data preparation and cataloging
- Visualize and explore clinical data through an interactive interface
- Run interactive analytics queries against a multi-modal data lake
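As an illustration of the last item, the following Python sketch builds a simple cohort query that joins clinical records with imaging metadata and optionally submits it to Amazon Athena through boto3. The database, table, and column names used in the example are hypothetical placeholders, not the names created by the guidance; substitute whatever your own data catalog contains. The `start_query_execution` call is the standard boto3 Athena client method and requires AWS credentials plus an S3 location for query results.

```python
import re

# Conservative pattern for database, table, and column names.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name: str) -> str:
    """Reject names that are not plain identifiers before interpolation."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_cohort_query(database: str, clinical_table: str,
                       imaging_table: str, patient_key: str,
                       limit: int = 100) -> str:
    """Build SQL joining clinical and imaging tables on a patient key."""
    db = _check_identifier(database)
    clinical = _check_identifier(clinical_table)
    imaging = _check_identifier(imaging_table)
    key = _check_identifier(patient_key)
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (
        f'SELECT c.*, i.* '
        f'FROM "{db}"."{clinical}" c '
        f'JOIN "{db}"."{imaging}" i '
        f'ON c."{key}" = i."{key}" '
        f'LIMIT {int(limit)}'
    )


def submit_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return the query execution ID.

    boto3 is imported lazily so the query builder works without it.
    """
    import boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

For example, `build_cohort_query("multiomics", "clinical_patient", "tcia_image_metadata", "patient_id", limit=10)` returns a query string that can be pasted into the Athena console or passed to `submit_query` together with an S3 output location of your own.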
This guidance was built in collaboration with BioTeam, an AWS for Health featured consulting partner. BioTeam is a scientific IT consulting company with expertise in applying strategies, advanced technologies, and IT services to solve the most challenging research, technical, and operational problems in the life sciences. They can help implement and customize this guidance to ingest custom datasets.
The full guidance is now available here: Guidance for Multi-Omics and Multi-Modal Data Integration and Analysis on AWS
Additional AWS Resources for Multi-modal and Multi-omics:
- Building Scalable Machine Learning Pipelines for Multimodal Health Data on AWS
- Training Machine Learning Models on Multimodal Health Data with Amazon SageMaker
- Enabling the aggregation and analysis of The Cancer Genome Atlas using AWS Glue and Amazon Athena
- Building predictive disease models using Amazon SageMaker with Amazon HealthLake normalized data
- Building a Real World Evidence Platform on AWS