Multimodal Data Analysis with AWS Health and Machine Learning Services

In this blog post, we show how you can use AWS purpose-built health care and life sciences (HCLS), machine learning (ML), and analytics services to simplify storage and analysis across genomic, health records, and medical imaging data for precision health use cases. The included reference architecture is built on AWS HealthOmics, AWS HealthImaging, and AWS HealthLake, which enable you to store these data modalities with a few clicks. You can also create governed databases and tables with AWS Lake Formation, which allows querying across multiple modalities using Amazon Athena. You can then build, train, and deploy ML models with Amazon SageMaker to make real-time, personalized inferences on patient outcomes. Finally, you can build custom, interactive dashboards to visualize multimodal data across individual patients and cohorts using Amazon QuickSight.

HCLS customers are seeing rapid growth in patient-level data. This data is increasing in both size and diversity, with modalities that include genomic, clinical, medical imaging, medical claims, and sensor data. While multimodal data offers a comprehensive view that can improve patient outcomes and care, analyzing multiple modalities at scale to build precision health applications is challenging. First, each modality requires distinct storage infrastructure, like Fast Healthcare Interoperability Resources (FHIR) for clinical records, Digital Imaging and Communications in Medicine (DICOM) for medical imaging, and custom databases for genomic variant and annotation data in Variant Call Format (VCF) files. Second, not all storage modalities are accessible via common query languages like SQL, making it difficult to execute analytical queries across data types. Third, tooling for data science and machine learning is typically not built to handle the domain-specific data infrastructures or data types presented by these modalities, which hinders comprehensive analytics. Finally, customers wishing to pilot precision health initiatives have difficulty accessing a coherent dataset across all modalities with enough data points to support ML development and benchmarking.

Here, we show how AWS addresses these challenges to simplify and accelerate the use of multimodal datasets in HCLS. To do this, we demonstrate importing, querying, and training ML models using the Synthea Coherent Data Set, a fully synthetic dataset from MITRE available via the AWS Open Data program. This dataset provides FHIR resources, DICOM images, and genomic data with coherent linking across all modalities for patients diagnosed with cardiovascular disease. For further details, refer to the Guidance for Multi-Modal Data Analysis with Health AI and ML Services on AWS and the associated code repository, which includes several example notebooks covering each section of the framework.

Benefits of end-to-end multimodal data analysis

Healthcare customers today often face two challenges with their data. First, they have vast amounts of data from years of encounters that is rich with insights that they want to use for machine learning, advanced analytics, and improving patient outcomes. However, such data is often separated in organizational silos, stored in different formats, duplicative, and non-coherent. Second, they want to make their data actionable. Using historical healthcare data for training models or finding historic trends is important, but customers often want to use those insights with patients in their facilities today, not yesterday. This means using real-time data, and requires a way to securely store and manage real-time data for their organizations using purpose-built data stores optimized for the type, scale, and velocity of those data modalities.

Importantly, each data modality has its own special considerations for how to transform and store it to optimize analytical queries and be ML-ready. Data scientists and engineers can spend months on one modality developing custom extract, transform, and load (ETL) processes, data harmonization, and data warehouse design. Working with multiple modalities simultaneously increases complexity by orders of magnitude, adding cost in terms of time and development effort.

With storage and analytics in AWS that is purpose-built for key HCLS data modalities (health records, genomic variants, medical imaging), you can go from raw data to training ML models with a few clicks and in minutes instead of months. AWS removes the undifferentiated heavy lifting needed to ingest, secure, query, and analyze across these data at scale.

AWS Health, machine learning, and analytics services and key capabilities

AWS services overall are pay-as-you-go, with no annual subscriptions or upfront costs to get started. AWS HealthOmics, AWS HealthImaging, and AWS HealthLake are purpose-built managed services for health and life sciences use cases that offer “zero-ETL” data stores: simply provide your data in its raw form (VCF, DICOM, FHIR), start an import job, and query and analyze the data in minutes. Unlike on-premises or traditional data center deployments, there is no upfront planning required on how much storage to reserve or how you plan to use the data. HealthOmics, HealthImaging, and HealthLake data stores scale according to your needs and provide easy access for downstream analysis.

AWS HealthOmics enables you to transform genomic, transcriptomic, and other omics data into insights. AWS HealthImaging simplifies storing, transforming, and analyzing medical images in the cloud at petabyte scale. AWS HealthLake facilitates securely storing, transforming, transacting, and analyzing health records data in minutes for patients and populations. These services are designed to securely store and manage data from different modalities in a cost optimized and highly available environment. Each service provides a transactional layer (for real-time application needs) using APIs, as well as an analytics layer (for advanced querying and data analysis). Leveraging these services, either individually or together, simplifies HCLS data use in analytics, reporting, and other downstream applications.

You can securely access and govern these data and their corresponding purpose-built data stores in your AWS account’s data catalog via AWS Lake Formation, which builds, manages, and secures data lakes. This makes multimodal data available for a wide range of analytics services, like Amazon Athena, a serverless analytics service that enables SQL queries on petabyte-scale data. For further downstream analysis, you can build, train, and deploy machine learning models on fully managed infrastructure with Amazon SageMaker. Finally, you can use Amazon QuickSight to unify and view trends and patterns in the data by creating interactive data visualization dashboards that scale to hundreds of thousands of users without the need to set up, configure, or manage your own servers.

Multimodal Analysis on AWS

In Figure 1, we present the architecture for building a scalable solution to ingest HCLS data from multiple modalities, manage data with purpose-built storage, preprocess data with managed services, create interactive dashboards for data visualization, and build ML models for actionable insights.

Figure 1: Architecture for storing, integrating, and analyzing multimodal HCLS data with purpose-built services on AWS.

In the following sections, we describe the high-level steps to realize this end-to-end multimodal analysis framework.

Store data

First, ingest each data type from Amazon Simple Storage Service (S3) into the corresponding purpose-built AWS service (steps 1-3 of Figure 1).

Using the Synthea Coherent Data Set, we generated genomic, clinical, and imaging data ready for import into appropriate data stores provided by AWS HealthOmics, AWS HealthLake, and AWS HealthImaging. This includes genomic variants as VCF files for roughly 800 individuals, a human reference genome (FASTA file), genomic variant annotations from ClinVar, FHIR R4 bulk data bundles for electronic health records (EHR) for about 1300 individuals, and 300 DICOM imaging study files. These data are publicly available in an S3 bucket named:

guidance-multimodal-hcls-healthai-machinelearning

Using AWS HealthOmics, the reference genome is imported into a HealthOmics Reference store, VCF files into a HealthOmics Variant store, and annotations into a HealthOmics Annotation store. Using AWS HealthLake, the FHIR bulk data bundles are streamed or bulk-loaded into a HealthLake data store. Using AWS HealthImaging, the DICOM files are imported into a HealthImaging data store.
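As a rough illustration of the import step, the sketch below builds the request parameters for three of the import jobs with the AWS SDK for Python (boto3). The key prefixes inside the bucket, the job name, the data store identifiers, and the IAM role ARN are all placeholders assumed for illustration; check the code repository for the actual layout. Depending on your setup, you may also need to copy the objects into a bucket in your own account before importing.

```python
import uuid

# Public bucket named in this post; the key prefixes used below are assumptions.
PUBLIC_BUCKET = "guidance-multimodal-hcls-healthai-machinelearning"


def s3_uri(prefix, bucket=PUBLIC_BUCKET):
    """Return an S3 URI for a prefix, normalised to end with a single slash."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


def fhir_import_request(datastore_id, role_arn, output_uri, kms_key_id):
    """Keyword arguments for the HealthLake start_fhir_import_job call."""
    return {
        "JobName": "synthea-fhir-import",  # hypothetical job name
        "DatastoreId": datastore_id,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": s3_uri("fhir")},  # hypothetical prefix
        "JobOutputDataConfig": {
            "S3Configuration": {"S3Uri": output_uri, "KmsKeyId": kms_key_id}
        },
        "ClientToken": str(uuid.uuid4()),
    }


def dicom_import_request(datastore_id, role_arn, output_uri):
    """Keyword arguments for the HealthImaging start_dicom_import_job call."""
    return {
        "datastoreId": datastore_id,
        "dataAccessRoleArn": role_arn,
        "inputS3Uri": s3_uri("dicom"),  # hypothetical prefix
        "outputS3Uri": output_uri,
        "clientToken": str(uuid.uuid4()),
    }


def variant_import_request(variant_store_name, role_arn, vcf_uris):
    """Keyword arguments for the HealthOmics start_variant_import_job call."""
    return {
        "destinationName": variant_store_name,
        "roleArn": role_arn,
        "items": [{"source": uri} for uri in vcf_uris],  # one entry per VCF file
    }


def submit(fhir_req, dicom_req, variant_req):
    """Submit the jobs. Requires AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the builders above run without the SDK

    boto3.client("healthlake").start_fhir_import_job(**fhir_req)
    boto3.client("medical-imaging").start_dicom_import_job(**dicom_req)
    boto3.client("omics").start_variant_import_job(**variant_req)
```

Keeping the request builders separate from the submit step makes the parameters easy to inspect before any job is started.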

Note that each data store is purpose built for a specific data type, each aligning to standard data formats found in HCLS.

Preprocess and analyze data

Once data is imported and any needed preprocessing is complete, each modality is available as tables in AWS Lake Formation, which you link to databases and query for analysis (step 4 of Figure 1).

You can use Amazon Athena to run queries across data stores and extract relevant features from the tables stored in AWS Lake Formation (step 5 of Figure 1). This enables you to derive key information from multiple modalities at once and get a more comprehensive view of patients. You then store the features identified by your multimodal data queries in a SageMaker Feature Store using FeatureGroups (step 8 of Figure 1), which are used to train and test ML models.
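To make the idea of a cross-modality query concrete, here is a minimal sketch of a SQL join that produces one feature row per patient. The table and column names (patient, variants, imaging, patient_id, sample_id, gene, mean_intensity) are hypothetical and do not reflect the actual schemas created by the services. So that the sketch runs locally, an in-memory SQLite database with a few made-up rows stands in for Athena; the last function shows how the same query string could be submitted through the Athena API.

```python
import sqlite3

# One row per patient, combining clinical, genomic, and imaging columns.
FEATURE_QUERY = """
SELECT p.patient_id,
       p.age,
       COUNT(DISTINCT v.gene) AS n_variant_genes,
       MAX(i.mean_intensity)  AS mean_intensity
FROM patient AS p
LEFT JOIN variants AS v ON v.sample_id = p.patient_id
LEFT JOIN imaging  AS i ON i.patient_id = p.patient_id
GROUP BY p.patient_id, p.age
ORDER BY p.patient_id
"""


def demo_rows():
    """Run the query against toy in-memory tables (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient  (patient_id TEXT, age INTEGER);
        CREATE TABLE variants (sample_id TEXT, gene TEXT);
        CREATE TABLE imaging  (patient_id TEXT, mean_intensity REAL);
        INSERT INTO patient  VALUES ('p1', 61), ('p2', 47);
        INSERT INTO variants VALUES ('p1', 'GENE_A'), ('p1', 'GENE_B'), ('p1', 'GENE_A');
        INSERT INTO imaging  VALUES ('p1', 0.5);
        """
    )
    rows = conn.execute(FEATURE_QUERY).fetchall()
    conn.close()
    return rows


def run_on_athena(database, output_uri):
    """Submit the same query to Athena. Requires AWS credentials; not called here."""
    import boto3  # imported lazily so demo_rows() works without the SDK

    response = boto3.client("athena").start_query_execution(
        QueryString=FEATURE_QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_uri},
    )
    return response["QueryExecutionId"]
```

The LEFT JOINs keep patients who are missing a modality, which matters here because the dataset does not cover every individual in every modality.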

When HealthImaging stores DICOM files, it extracts metadata like patient, imaging study, or series information from the file header to simplify search. You can also extract important image features by processing the ImageFrame data (pixels) associated with each DICOM file. To get this data at scale, use the HealthImaging GetImageFrame API. For example, you can use the AWS SDK for Python, the PyRadiomics package, and Amazon SageMaker Processing to retrieve pixel information and apply a Medical Open Network for AI (MONAI) segmentation model to generate radiomic features. These features are also stored in a SageMaker Feature Store for further analysis (step 6 of Figure 1).
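The sketch below illustrates the shape of this step under simplifying assumptions. It computes a handful of first-order statistics over the pixels selected by a segmentation mask and formats them as a Feature Store record. The statistics are a plain-Python stand-in for what PyRadiomics computes, not its implementation; the mask would come from the segmentation model; and the feature names (img_ prefix, patient_id, event_time) are hypothetical. The image frame returned by GetImageFrame is encoded, and decoding it into pixels is outside the scope of this sketch.

```python
import statistics


def fetch_image_frame(datastore_id, image_set_id, image_frame_id):
    """Return the encoded frame bytes. Requires AWS credentials; not called here."""
    import boto3  # imported lazily so the helpers below run without the SDK

    response = boto3.client("medical-imaging").get_image_frame(
        datastoreId=datastore_id,
        imageSetId=image_set_id,
        imageFrameInformation={"imageFrameId": image_frame_id},
    )
    return response["imageFrameBlob"].read()  # still encoded; decode before use


def first_order_features(pixels, mask):
    """First-order statistics over the intensities where the mask is truthy.

    Both arguments are flat sequences of equal length.
    """
    values = [p for p, m in zip(pixels, mask) if m]
    if not values:
        raise ValueError("mask selects no pixels")
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "minimum": min(values),
        "maximum": max(values),
        "energy": sum(v * v for v in values),  # sum of squared intensities
    }


def to_feature_record(patient_id, features, event_time):
    """Format features as the Record list expected by Feature Store put_record."""
    record = [
        {"FeatureName": "patient_id", "ValueAsString": patient_id},
        {"FeatureName": "event_time", "ValueAsString": event_time},
    ]
    for name, value in sorted(features.items()):
        record.append({"FeatureName": f"img_{name}", "ValueAsString": str(value)})
    return record


def put_record(feature_group_name, record):
    """Write one record to a feature group. Requires AWS credentials."""
    import boto3

    boto3.client("sagemaker-featurestore-runtime").put_record(
        FeatureGroupName=feature_group_name, Record=record
    )
```

For example, first_order_features([1, 2, 3, 4], [0, 1, 1, 1]) summarizes only the last three pixels, and the result can be passed straight to to_feature_record.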

Build, train, and deploy machine learning models

You can use Amazon SageMaker to build, train, and deploy ML models to derive further insights from multimodal data. For example, using multimodal features stored in SageMaker Feature Store, we trained a model that predicts the occurrence of stroke, hypertension, coronary heart disease, and Alzheimer’s disease for patients diagnosed with cardiovascular disease (step 9 of Figure 1). To do this, we used AutoGluon, an AutoML algorithm offered by Amazon SageMaker, which trains a multi-layer stack ensemble model for classification and regression tasks. We then deployed the trained model as an endpoint for real-time or batch inference on the test dataset (step 10 of Figure 1). Overall, we found that models trained on features from all three data types (genomics, imaging, clinical) had higher predictive capability than models trained on only a single data type.
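One way to set up the comparison between single-modality and multimodal models is sketched below, assuming the open source AutoGluon tabular package and pandas DataFrames as input. This is not the training code from the repository. The column-prefix naming scheme (clin_, gen_, img_) is hypothetical, and the sketch trains one predictor per outcome label.

```python
# Hypothetical naming scheme: each feature column carries a modality prefix.
MODALITY_PREFIXES = {"clinical": "clin_", "genomic": "gen_", "imaging": "img_"}


def columns_for(modalities, all_columns):
    """Select the feature columns that belong to the requested modalities."""
    prefixes = tuple(MODALITY_PREFIXES[m] for m in modalities)
    return [c for c in all_columns if c.startswith(prefixes)]


def train_and_evaluate(train_df, test_df, label, modalities):
    """Fit AutoGluon on a subset of modalities and return its test metrics.

    train_df and test_df are pandas DataFrames that contain the feature
    columns and the label column. Requires the autogluon package.
    """
    from autogluon.tabular import TabularPredictor  # imported lazily

    cols = columns_for(modalities, list(train_df.columns)) + [label]
    predictor = TabularPredictor(label=label).fit(train_df[cols])
    return predictor.evaluate(test_df[cols])


def compare_modalities(train_df, test_df, label):
    """Evaluate each single modality and all three together."""
    settings = [[m] for m in MODALITY_PREFIXES] + [list(MODALITY_PREFIXES)]
    return {
        "+".join(s): train_and_evaluate(train_df, test_df, label, s)
        for s in settings
    }
```

Calling compare_modalities with a label column such as a stroke indicator would return one metrics dictionary per setting, which makes the single-modality and combined results directly comparable.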

Visualize data

Data visualization dashboards provide visual interfaces that stakeholders like clinicians, bioinformaticians, and radiologists can use to identify and interpret trends, patterns, and outliers in patients’ data. You can unify diverse data types and build interactive dashboards with Amazon QuickSight (step 7 of Figure 1).

Here, we built two dashboards – one for population-level analysis and another for patient-level analysis – each using data combined from AWS HealthOmics, AWS HealthImaging, and AWS HealthLake.

The dashboard for population-level analysis lets practitioners across different domains (e.g., clinicians, bioinformaticians, radiologists) get a comprehensive view of patients at the population or cohort level. It includes the following:

Clinical Analysis – This provides an overview of patients’ clinical data at the population level, including their demographic information, encounters, diagnoses, procedures, and insurance claims.

Figure 2: Population-level data visualization dashboard for clinical data.

Genomic Analysis – This provides an overview of genomic data at the population level, including types of genes, clinical significance of those genes, and distribution of cases.

Figure 3: Population-level data visualization dashboard for genomic data.

Medical Imaging Analysis – This provides an overview of medical imaging data at the population level, including first-order statistics describing the distribution of voxel intensities within the image region segmented by the MONAI model.

Figure 4: Population-level data visualization dashboard for imaging data.

The dashboard for patient-level analysis offers a single, interactive visual interface to help clinicians get a complete view of a patient across multiple data modalities (clinical, genomic, and medical imaging). Selecting a Patient ID from the dashboard menu automatically filters the underlying data and generates visualizations across multiple data types.

Figure 5: Patient-level data visualization dashboard for clinical, genomic, and imaging data.

Conclusion and next steps

In this blog post, we showed how you can store, process, and analyze genomic, health records, and medical imaging data using purpose-built AWS services to accelerate precision health.

With a few clicks, you have the resources you need to import data and perform integrative analyses across multiple data modalities. The concepts and service integration patterns provide a template that enables going from raw data to insights, like cohort identification and ML-assisted clinical decision support, within minutes. This template also mitigates the overhead of infrastructure design, maintenance, and management.

This easy-to-deploy, automated, and scalable framework can be extended to suit your specific needs. To try this out in your AWS environment, our example use case with the Synthea Coherent Data Set is available as an open source GitHub repository. The repository contains example notebooks that provide further technical details for import operations, data preprocessing, model evaluation, benchmarking, and more.

To learn more about this end-to-end framework for multimodal analysis, check out our published Guidance in the AWS Solutions Library.