AWS for Industries

Building Scalable Machine Learning Pipelines for Multimodal Health Data on AWS

This post was co-authored by Olivia Choudhury, PhD, Partner Solutions Architect; Michael Hsieh, Senior AI/ML Specialist Solutions Architect; and Andy Schuetz, PhD, Sr. Partner Solutions Architect.

Healthcare and life sciences organizations use machine learning (ML) to enable precision medicine, anticipate patient preferences, detect disease, improve care quality, and understand inequities. Rapid growth in health information technologies has made patient-level data available from an increasingly diverse set of data modalities. Further, research has shown that the utility and accuracy of ML models can be improved by incorporating data from multiple data domains [1]. Intuitively, this is understandable, as we are providing our models a more complete view of the individuals and settings we look to describe.

Applying ML to diverse health datasets, known as Multimodal Machine Learning (Multimodal ML), is an active area of research and development. Analyzing linked patient-level data from diverse data modalities, such as genomics and medical imaging, promises to accelerate improvements in patient care. However, performing analysis of even a single modality at scale has been challenging in on-premises environments. On-premises processing of multiple modalities of unstructured data has commonly been intractable due to the distinct infrastructure requirements of different modalities (such as FPGA and GPU requirements). Yet, with AWS, you can readily deploy purpose-built pipelines and scale them to meet your needs, paying only for what you use.

In this two-part blog post, we demonstrate how to build a scalable, cloud architecture for Multimodal ML on health data. As an example, we will deploy a Multimodal ML pipeline to analyze the Non-Small Cell Lung Cancer (NSCLC) Radiogenomics data set, which consists of RNA sequence data, clinical data (reflective of EHR data), medical images, and human annotations of those images [2]. Although for the given use case, we predict survival outcome of patients diagnosed with NSCLC, Multimodal ML models can be applied to other applications, including, but not limited to, personalized treatment, clinical decision support, and drug response prediction.

In this first blog post, we step through data acquisition, data processing, and feature construction for each data modality. In the next blog post, we train a model to predict patient survival using the pooled set of features drawn from all of the data modalities, and contrast the results with models trained on one modality. While we present an application to genomic, clinical, and medical imaging data, the approach and architecture are applicable to a broad set of health data ML use-cases and frameworks.

Overview of solution

Our architecture uses Amazon S3 to store the input datasets, and specialized pipelines to process and transform each data modality, yielding features suitable for model training, as shown in Figure 1.


Figure 1: Architecture for integrating and analyzing multimodal health data. 

For genomic data, we leverage secondary analysis workflows, such as Illumina DRAGEN (Dynamic Read Analysis for GENomics) Bio-IT platform [3], that support commonly-used analytic techniques, like mapping and alignment of DNA and RNA fragments, detection of variants, assessment of quality, and quantification of RNA expression. The NSCLC Radiogenomic data consists of paired-end RNA-sequenced data from samples of surgically excised tumor tissue. S3 provides a secure, durable, and scalable storage location, suitable for analysis of large-scale genomic data. S3 can be used to store input data, including sequenced reads and reference genomes, intermediate files, and output data. We conduct secondary analysis of sequenced reads to quantify gene expression level. The output is in tabular form with expression level for each gene per sample. The subject-level, quantitative expression of genes can then be used as features for model training.


Figure 2: Example of a secondary analysis pipeline for DNA and RNA-sequenced data. Input data (reference genome, gene annotations, sequenced reads) is stored in Amazon S3 and retrieved for various stages of processing, including sequence alignment, variant calling, and quantification of gene expression level. Output files are also stored in Amazon S3 for easy access during downstream analysis. 

The NSCLC Radiogenomic clinical data is in a structured tabular form, as is common for EHR data extracts and health insurance claims data. It consists of demographic (gender, ethnicity) and health-behavior (smoking history) information, cancer recurrence status, histology, histopathological grading, pathological TNM staging, and survival outcome. We process this data with Amazon SageMaker Data Wrangler, a purpose-built interactive tool for data enrichment and advanced feature engineering.

For imaging data in this work, we use the Computed Tomography (CT) series and the corresponding tumor segmentations in the NSCLC Radiogenomic imaging dataset to create patient-level 3-dimensional radiomic features that explain the size, shape and visual attributes of the tumors observed in the study subject’s lungs.

Medical imaging data is commonly stored in the DICOM file format, a standard that combines metadata and pixel data in a single object. For a volumetric scan, such as a lung CT scan, each cross-section slice is typically stored as an individual DICOM file. However, for ML purposes, analyzing 3-dimensional data provides a more holistic view of the region of interest (ROI), and thus better predictive value. We therefore convert the scans from DICOM format into NIfTI format: we download the DICOM files, store them in S3, then use Amazon SageMaker Processing to perform the transformation. Specifically, for each subject and study, we launch a SageMaker Processing job with a custom container to read the 2D DICOM slice files for both the CT scan and tumor segmentation, combine them into 3D volumes, save the volumes in NIfTI format, and write the NIfTI objects back to S3.

With the medical imaging data in volumetric format, we compute radiomic features describing the tumor region in the same SageMaker Processing job. We use AWS Step Functions to orchestrate the processing for the entire imaging dataset in a scalable and fault-tolerant fashion.

Finally, the features engineered from each data modality are written to Amazon SageMaker Feature Store, a purpose-built repository for storing ML features. Feature preparation and model training are performed using Amazon SageMaker Studio.

Figure 3: Illustration of medical imaging processing pipeline. The DICOM files are stored in an Amazon S3 bucket. Processing steps include 2D slices to 3D volumes conversion, CT and segmentation masks alignment, radiomic feature extraction within the tumor region, and feature ingestion to Amazon SageMaker Feature Store.



The prerequisites for this walkthrough are:

  • An AWS account with permissions to provision Amazon SageMaker, Amazon S3, AWS Step Functions, Amazon EC2, and Amazon Athena.
  • A VPC with at least one public subnet, and one private subnet routed to a NAT gateway.
  • For this example, we use the Illumina DRAGEN platform for secondary analysis of next generation sequencing data, which requires the following:
    • A subscription to DRAGEN AMI on AWS Marketplace.
    • For deployment, you must select the Amazon EC2 F1 instance type in a supported AWS Region.
    • Ensure that your vCPU limit permits the recommended 16 vCPU for f1.4xlarge EC2 instance type. If not, request a limit increase.
  •  The present medical imaging pipeline with Step Functions will launch 50 simultaneous SageMaker Processing jobs of the ml.r5.large instance type. Ensure that your account quota permits 50 ml.r5.large SageMaker Processing jobs. If not, request a limit increase.
  • Access to the code repository that accompanies this blog.

Running each of the steps outlined in this blog post should cost around $13 in AWS services.

Create step section

SageMaker Studio

We are using SageMaker Studio to work with data, author Jupyter notebooks, and access the EC2 instance used in the genomics pipeline. To get started, follow the Standard Setup procedure using AWS Identity and Access Management (IAM) to onboard to SageMaker Studio in your account. For simplicity, as shown in Figure 4, select Create role for the Execution role, and do not specify any S3 buckets explicitly. Permit SageMaker access to your S3 objects with the tag “sagemaker” and value “true”. In this way, only your input and output data on S3 will be accessible.


Figure 4: Configuration of IAM role for SageMaker.
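
Because the execution role configured this way can only reach objects that carry the “sagemaker” tag with value “true”, each input object you upload needs that tag. The following is a minimal sketch of how the tag could be applied, assuming a boto3 S3 client is passed in. The function names are hypothetical and are not part of the accompanying repository.

```python
def sagemaker_tagging() -> dict:
    """Return the Tagging argument that marks an S3 object with sagemaker=true."""
    return {"TagSet": [{"Key": "sagemaker", "Value": "true"}]}


def tag_for_sagemaker(s3_client, bucket: str, key: str) -> None:
    """Apply the sagemaker=true tag to one existing S3 object.

    s3_client is expected to be a boto3 S3 client, for example
    boto3.client("s3"). It is injected so that this sketch has no
    dependency on AWS credentials at import time.
    """
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging=sagemaker_tagging(),
    )
```

Note that put_object_tagging replaces the full tag set on the object, so any existing tags you want to keep would need to be included in the TagSet as well.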

In the Network section, choose to onboard SageMaker Studio in your VPC, and specify the private subnet. Also, set the AppNetworkAccessType to be VpcOnly, to disable direct access from the internet.

Select Submit to create the Studio environment, and wait a few moments for it to be provisioned. After the SageMaker Studio IDE becomes available, select the Git icon on the left-hand tool bar, and clone the repository that accompanies this blog.

By default, SageMaker Studio notebooks, local data, and model artifacts are all encrypted with AWS managed customer master keys (CMKs). In the present example, we are working with deidentified data. However, when working with Protected Health Information (PHI), it is important to encrypt all data at rest and in transit, apply narrowly defined roles, and limit access in accordance with the principles of least privilege. You can find further guidance on best practices in the white paper Architecting for HIPAA Security and Compliance.

Genomic pipeline

For this demonstration, we are using the Illumina DRAGEN platform for secondary analysis. DRAGEN offers accurate and ultra-fast next generation sequencing secondary analysis pipelines for whole genome, RNA-seq, exome, methylome, and cancer data. It uses FPGA-based Amazon EC2 F1 instances to provide hardware-accelerated implementations of genomic analysis algorithms. We run the DRAGEN RNA pipeline to map the reads to the human hg19 reference genome [2]. We also use gene annotation files (GTF/GFF format) to align reads to splice junctions and quantify gene expression level. The GTF file for the hg19 genome can be downloaded from the GENCODE project. The steps to launch an instance for the DRAGEN RNA pipeline are as follows:

  1. Subscribe to DRAGEN AMI in the AWS Marketplace
  2. Use the AMI to launch a new f1.4xlarge instance in the same VPC and private subnet used to onboard SageMaker Studio
  3. Use default 100 GB Amazon EBS gp2 volume for storage
  4. Skip the step for adding tags
  5. Use a security group that allows all TCP traffic from within the private subnet

To access the newly launched instance, upload the EC2 key pair to SageMaker Studio by clicking the Upload Files icon, and then launch a new terminal from the File menu. Then SSH to the private IP address of the EC2 instance from the SageMaker Studio terminal.

The RNA-seq data is accessible as SRA (Sequence Read Archive) data. The SRA files are hosted on AWS through the Registry of Open Data and are publicly accessible. Further details of the RNA-seq data can be found at the National Center for Biotechnology Information (NCBI) [4]. Download the SRA files and use fastq-dump from SRA-Tools to convert them to FASTQ format by executing the commands below on the EC2 instance. For paired-end reads, this generates two FASTQ files whose reads carry the “.1” and “.2” suffixes.

aws s3 cp s3://sra-pub-run-odp/sra/<ACCESSION-ID>/<ACCESSION-ID> <ACCESSION-ID>.sra

fastq-dump -I --split-files <ACCESSION-ID>.sra

Download the human hg19 reference genome to the EC2 instance, and create a reference hash table specific to the DRAGEN RNA pipeline.

aws s3 cp s3://ngi-igenomes/igenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa hg19.fa

dragen --build-hash-table true --ht-build-rna-hashtable true --ht-reference hg19.fa --output-directory <HASHTABLE DIRECTORY>

Use the FASTQ files, reference hash table, and GTF file to run the gene expression quantification module of the DRAGEN RNA pipeline.

dragen --enable-rna=true --enable-duplicate-marking=true --enable-rna-quantification=true --enable-bam-indexing=true --alt-aware=true --output-format=bam --enable-map-align-output=true --enable-map-align=true --enable-sort=true --annotation-file=<GTF FILE > --ref-dir=<HASHTABLE DIRECTORY> --output-directory=. --output-file-prefix=<PREFIX> --input-qname-suffix-delimiter=. -1 <FASTQ_FILE READ1> -2 <FASTQ_FILE READ2> --RGID <ID> --RGSM <ID>
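
When many samples need to be quantified, it can help to assemble this command programmatically rather than editing it by hand. The sketch below builds the same argument list in Python using only the flags shown above. The function name and the example file names are hypothetical placeholders.

```python
import shlex


def build_dragen_rna_command(gtf_file, hashtable_dir, prefix,
                             fastq_read1, fastq_read2, read_group):
    """Return the DRAGEN RNA quantification command as an argument list."""
    return [
        "dragen",
        "--enable-rna=true",
        "--enable-duplicate-marking=true",
        "--enable-rna-quantification=true",
        "--enable-bam-indexing=true",
        "--alt-aware=true",
        "--output-format=bam",
        "--enable-map-align-output=true",
        "--enable-map-align=true",
        "--enable-sort=true",
        f"--annotation-file={gtf_file}",
        f"--ref-dir={hashtable_dir}",
        "--output-directory=.",
        f"--output-file-prefix={prefix}",
        "--input-qname-suffix-delimiter=.",
        "-1", fastq_read1,
        "-2", fastq_read2,
        # The same identifier is used for the read group ID and sample name.
        "--RGID", read_group,
        "--RGSM", read_group,
    ]


# Placeholder file names for illustration only.
command = build_dragen_rna_command(
    gtf_file="annotation.gtf",
    hashtable_dir="hashtable",
    prefix="sample",
    fastq_read1="sample_1.fastq",
    fastq_read2="sample_2.fastq",
    read_group="sample",
)
print(shlex.join(command))
```

On the EC2 instance, the resulting list could be passed to subprocess.run(command, check=True) so that a failed DRAGEN run raises an error instead of passing silently.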

Although we used the DRAGEN AMI from AWS Marketplace for this demonstration, for production-scale data analysis, you can launch the DRAGEN Quick Start [5]. It allows submitting jobs from both AWS Command Line Interface and AWS Batch console.

The DRAGEN quantification command generates an output file in tabular form with the expression level for each gene per sample. As shown in Figure 5, each row corresponds to a case ID or patient and each column represents the quantified expression level of a particular gene. This quantification result is used as a feature vector for training the multimodal ML model.


Figure 5: Output file generated by the RNA-seq pipeline. It quantifies expression level of a gene (column) for each patient (row).

Store this output file, containing quantified gene expression levels in CSV format, in S3.
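
The row-per-patient, column-per-gene layout described above can be sketched as follows. This is an illustrative example only: it assumes the per-sample expression values have already been parsed into dictionaries, and the choice to fill genes missing from a sample with 0.0 is an assumption, not something prescribed by the pipeline. The function names are hypothetical.

```python
import csv
import io


def build_expression_table(per_sample):
    """Return (header, rows) with one row per case and one column per gene.

    per_sample maps a case ID to a {gene: expression level} dictionary.
    """
    genes = sorted({gene for expression in per_sample.values() for gene in expression})
    header = ["Case_ID"] + genes
    rows = [
        # Assumption: a gene absent from a sample is recorded as 0.0.
        [case_id] + [per_sample[case_id].get(gene, 0.0) for gene in genes]
        for case_id in sorted(per_sample)
    ]
    return header, rows


def to_csv(header, rows):
    """Serialize the table to CSV text, ready to be uploaded to S3."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()
```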


Clinical pipeline

The de-identified clinical data can be downloaded in CSV format directly from The Cancer Imaging Archive (TCIA) [6], and stored in S3, as shown below.

curl -o Clinical-Data.csv

aws s3 cp Clinical-Data.csv <S3-BUCKET-CLINICAL-DATA>

Ingest the data by creating a new data flow from within SageMaker Studio. Select the option to import data from S3, select the clinical data file, and choose Import dataset.

We are using SageMaker Data Wrangler for pre-processing clinical data. Our example data needs minimal preparation, so we’ll use just a few of SageMaker Data Wrangler’s many features. Namely, we identify data bias such as gender bias using SageMaker Clarify (Figure 6), explore data leakage towards the target variable Survival Status (Figure 7), and conduct feature transformation.


Figure 6: Understanding the gender bias in the dataset using SageMaker Clarify, which is built into Amazon SageMaker Data Wrangler.


Figure 7: Checking for data leakage in the features using Target Leakage analysis in SageMaker Data Wrangler.


Figure 8: Interactively encoding the Survival Status column from string to [0, 1] in SageMaker Data Wrangler.

Apply the following transforms to the data, all within SageMaker Data Wrangler:

  • One-hot encoding using Encode categorical on all the categorical columns, as described in Amazon SageMaker Transform Data
  • Dropping columns that contain target leakage and that relate to imaging date
  • Ordinally encoding the target column Survival Status using Encode categorical (Figure 8)
  • Dropping rows where Weight (lbs) or Pack Years columns have values Not Collected using Custom Transform
  • Filling missing values with zeros using Handle missing
  • Adding event time column as required by SageMaker Feature Store for versioning purposes
    • To add the event time column, select Transforms in the Prepare tab, and define a new transform as a Custom formula. Enter the Spark SQL expression for a timestamp, namely current_timestamp(), as the formula, and set the Output column name to timestamp. This appends a column to our dataset in which every row contains a timestamp.
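
For readers who want to see what these transforms do to the data, the following is a rough plain-Python sketch of the same steps applied to a list of row dictionaries. It is not how SageMaker Data Wrangler implements them. The function name, the column list passed as categorical_columns, and the mapping passed as target_codes are all assumptions for illustration; the leakage-related column drops are omitted because they depend on the analysis results.

```python
from datetime import datetime, timezone

NOT_COLLECTED = "Not Collected"


def prepare_clinical_rows(rows, categorical_columns, target_column, target_codes):
    """Apply the transforms listed above to a list of row dictionaries."""
    # Drop rows where Weight (lbs) or Pack Years was not collected.
    kept = [
        row for row in rows
        if NOT_COLLECTED not in (row.get("Weight (lbs)"), row.get("Pack Years"))
    ]
    # Gather the categories per column so every row gets identical one-hot columns.
    categories = {
        col: sorted({row[col] for row in kept if row.get(col)})
        for col in categorical_columns
    }
    # One event time for the whole batch, as required by Feature Store.
    event_time = datetime.now(timezone.utc).isoformat()
    prepared = []
    for row in kept:
        out = {}
        for col, value in row.items():
            if col in categories:
                # One-hot encode categorical columns.
                for category in categories[col]:
                    out[f"{col}_{category}"] = 1 if value == category else 0
            elif col == target_column:
                # Ordinal encoding of the target column.
                out[col] = target_codes[value]
            else:
                # Fill missing values with zeros.
                out[col] = value if value not in (None, "") else 0
        out["timestamp"] = event_time
        prepared.append(out)
    return prepared
```

A call might look like prepare_clinical_rows(rows, ["Gender"], "Survival Status", {"Alive": 0, "Dead": 1}), where the label-to-code mapping is an assumed example.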

Next, select the Export tab and follow the Data Wrangler flow to Export the features to Feature Store. SageMaker automatically creates a Jupyter notebook to handle the export job. However, a few changes must be made before running the notebook. Modify the preexisting cell shown below to assign the record_identifier_name and event_time_feature_name variables to the corresponding column names in our data.

record_identifier_name = "Case_ID"
if record_identifier_name is None:
   raise RuntimeError("Select a column name as the feature group identifier.")

event_time_feature_name = "timestamp"
if event_time_feature_name is None:
   raise RuntimeError("Select a column name as the event time feature name.")

We will only use an offline Feature Store, because the present example focuses on model training. Disable the online store by updating the preexisting cell as follows.

# controls if online store is enabled. Enabling the online store allows quick access to
# the latest value for a Record via the GetRecord API.
enable_online_store = False
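
The generated notebook handles feature group creation for you. For orientation only, the sketch below shows how the same choices (record identifier, event time column, and an offline-only store) map onto the request fields of the SageMaker CreateFeatureGroup API. The function name is hypothetical, and the feature group name, S3 URI, role ARN, and feature definitions are placeholders you would supply.

```python
def offline_feature_group_request(name, s3_uri, role_arn, feature_definitions):
    """Build keyword arguments for an offline-only feature group.

    feature_definitions is a list such as
    [{"FeatureName": "Case_ID", "FeatureType": "String"}, ...].
    """
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": "Case_ID",
        "EventTimeFeatureName": "timestamp",
        "FeatureDefinitions": feature_definitions,
        # The online store is disabled because we only train models here.
        "OnlineStoreConfig": {"EnableOnlineStore": False},
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": s3_uri}},
        "RoleArn": role_arn,
    }
```

With a boto3 SageMaker client, the result could be passed as client.create_feature_group(**request).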

Next, run all cells in the notebook, and when the processing is complete, you will see output indicating the Feature Store location of the processed features (Figure 9).

'ProcessingOutputConfig': {'Outputs': [{'OutputName': '6e72c81c-d6c7-41bc-9557-e20e93ce2af5.default', 'FeatureStoreOutput': {'FeatureGroupName': 'clincal-feature-group-xx-xxx-xxxx-xxxx'}, 'AppManaged': True}]}


Figure 9: The processed features are stored in a feature group, which appears in the list of feature groups in the Feature Store tab.

Medical imaging pipelines

Data download

Downloading the medical imaging data requires a “.tcia” manifest file, which can be downloaded from The Cancer Imaging Archive (TCIA) [6]. Download the image data to a volume attached to your SageMaker Studio using an alternative, Linux-friendly NBIA data retriever CLI tool. First, download and build the data retriever tool.

cd SageMaker

sudo yum install golang

git clone ./data_retriever

cd data_retriever/

chmod +x


Then, run the data retriever tool to download all data described in the manifest file to the volume attached to SageMaker Studio.

mkdir data

cd data

../data_retriever/nbia_cli_linux_amd64 -i <path to manifest file>NSCLC_Radiogenomics-11-10-2020Version3.tcia -o . -p 3

The data retriever creates a directory nsclc_radiogenomics, and within that, a subdirectory for each subject containing the corresponding medical image study data. A given study consists of a directory of one or more imaging series and JSON files with metadata describing each imaging series. Within each imaging series directory, there are one or more 2D DICOM files of CT/PET scans, or a single DICOM file for the tumor segmentation. The tumor segmentation DICOM file consists of a 3D segmentation object created by software and reviewed by thoracic radiologists.

After the download has completed, copy the data to an S3 bucket using the AWS CLI.

aws s3 mb s3://multimodal-image-data/

aws s3 sync ./nsclc_radiogenomics/ s3://multimodal-image-data/nsclc_radiogenomics/

With the data on S3, we are ready to perform the required image data processing and ML at scale.

DICOM processing

In this example, we focus only on the CT scans that are accompanied by a tumor segmentation. Thus, for each patient and study, read through the JSON metadata files to determine whether the study has both a CT scan imaging series and a segmentation object. If so, convert each CT scan imaging series, downloaded as a set of 2D DICOM files, to a single 3D NIfTI file. The DICOM-to-NIfTI conversion uses a Python package called dcmstack, which reads in all the DICOM files, sorts them according to spatial slice location, and stacks the slices to create a 3D volume. The 3D volume is then written out in NIfTI format with the NiBabel Python package. For each tumor segmentation DICOM object, use the Pydicom package to read in the 3D array, reorient the volume to match that of the corresponding CT scan, and save the output as a NIfTI file. This complete process is implemented in the Python script imaging/src/ in the repository.
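
The sort-and-stack idea behind the conversion can be illustrated without any imaging libraries. The sketch below is not dcmstack's implementation. It assumes axial slices, so that the third component of ImagePositionPatient gives the slice location, and it represents each slice as a plain dictionary with hypothetical keys.

```python
def stack_slices(slices):
    """Order 2D slices by spatial location and stack them into a 3D volume.

    Each slice is a dict with:
      "ImagePositionPatient": (x, y, z) position of the slice
      "pixels": 2D list of pixel values
    Returns (volume, z_positions), where volume[i] is the i-th slice.
    """
    # Assumption: axial acquisition, so z alone determines the slice order.
    ordered = sorted(slices, key=lambda s: s["ImagePositionPatient"][2])
    volume = [s["pixels"] for s in ordered]
    z_positions = [s["ImagePositionPatient"][2] for s in ordered]
    return volume, z_positions
```

Sorting by position rather than by file name matters because DICOM file names do not reliably reflect the anatomical order of the slices.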

Note that the segmentation objects for some studies were saved to a DICOM object with empty slices cropped out. This results in a mismatch between the number of slices in the CT scan and the corresponding segmentation object. To address this, match the value in the ImagePositionPatient DICOM attribute to align the tumor segmentation to the corresponding location in the CT scan, and pad the segmentation with zeros so that it has an identical number of slices. This process has also been implemented in the imaging/src/ script.
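
A simplified version of this alignment is sketched below. It is an illustration under stated assumptions, not the repository's code: slice locations are taken to be the z component of ImagePositionPatient, positions are matched within a small tolerance, and volumes are plain nested lists. The function name and tolerance value are hypothetical.

```python
def pad_segmentation(ct_z, seg_z, seg_slices, tolerance=1e-3):
    """Pad a cropped segmentation with empty slices to match the CT scan.

    ct_z:       z position of every CT slice, in CT slice order
    seg_z:      z position of every segmentation slice
    seg_slices: 2D masks, one per entry in seg_z
    Returns one mask per CT slice, all zeros where no segmentation exists.
    """
    if not seg_slices:
        raise ValueError("segmentation has no slices")
    n_rows, n_cols = len(seg_slices[0]), len(seg_slices[0][0])
    padded = []
    for z in ct_z:
        # Find the segmentation slice located at this CT slice position.
        match = next(
            (i for i, sz in enumerate(seg_z) if abs(sz - z) <= tolerance), None
        )
        if match is None:
            padded.append([[0] * n_cols for _ in range(n_rows)])
        else:
            padded.append(seg_slices[match])
    return padded
```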

Figure 10 shows example views of overlaying the tumor mask in yellow with transparency on the CT scan for a study (case ID R01-093).


Figure 10: Example visualization of a CT scan, with lung tumor mask overlaid in yellow.

Radiomic feature extraction

We are using the pyradiomics library [7] to compute the radiomic features describing the tumors in the annotated CT scans. Using the library, extract 120 radiomic features from 8 classes, such as statistical representations of the distribution and co-occurrence of intensity within the tumorous region of interest, and shape-based measurements describing the tumor morphologically. The computation of the radiomic features is performed volumetrically by providing the converted NIfTI images to the RadiomicsFeatureExtractor class in the compute_features function in the imaging/src/ script.
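
To give a feel for what a radiomic feature is, the toy sketch below computes a few simple statistics over the voxels inside a tumor mask. It is not pyradiomics and covers only a tiny fraction of what that library computes; the function name, feature names, and nested-list volume representation are assumptions for illustration.

```python
def masked_intensity_features(volume, mask, voxel_volume_mm3=1.0):
    """Compute simple intensity and size statistics inside a binary mask.

    volume and mask are 3D nested lists of identical shape, indexed
    [slice][row][column]. Voxels where mask is non-zero belong to the tumor.
    """
    values = [
        volume[z][y][x]
        for z in range(len(volume))
        for y in range(len(volume[z]))
        for x in range(len(volume[z][y]))
        if mask[z][y][x]
    ]
    if not values:
        raise ValueError("mask selects no voxels")
    return {
        "voxel_count": len(values),
        # Size of the region, given the physical volume of one voxel.
        "volume_mm3": len(values) * voxel_volume_mm3,
        "mean": sum(values) / len(values),
        "minimum": min(values),
        "maximum": max(values),
        # Energy here is the sum of squared intensities in the region.
        "energy": sum(v * v for v in values),
    }
```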

Once the features are computed, they are programmatically written to SageMaker Feature Store, as detailed in the Multimodal Feature Store section later in this post.

Scaling with Step Functions and SageMaker Processing

The two Python scripts imaging/src/ and imaging/src/ can be used to process a single study by specifying the study ID in the dataset. To scale to hundreds or thousands of studies, as is typically seen in clinical research, use SageMaker Processing for scalable and elastic execution of the scripts, and AWS Step Functions as an orchestration layer for processing the studies in parallel.

SageMaker Processing runs our code in a Docker container image on fully managed infrastructure. Now, build a Docker container image, defined in imaging/src/Dockerfile, containing the two Python scripts above and the runtime requirements (specified in a requirements.txt file), using sagemaker-studio-image-build in SageMaker Studio. The following code snippet can be found in imaging/preprocess-imaging-data.ipynb.

cd src/

!sm-docker build . --repository medical-image-processing-smstudio:1.0

At the end of this command, the Docker container image is built and pushed to Amazon Elastic Container Registry (Amazon ECR) at:

Image URI: <account-id>.dkr.ecr.<region>

Then, specify a SageMaker Processing job in the Step Functions state machine that uses one ml.r5.large instance with a 5 GB disk volume, which provides enough RAM and disk space to process one DICOM study.

The Step Functions state machine design is shown in Figure 11. You can find the state machine definition in nsclc-radiogenomics-imaging-workflow.json in the repository.


Figure 11: State machine workflow as rendered in AWS Step Functions console.

Design a Map state to execute the same steps for each entry of an array in the state input, that is, for each imaging study. Set MaxConcurrency to 50 in the Map state to define the maximum number of simultaneous processing jobs. To make the state machine fault tolerant, add an error-handling Catch field to route any errors within the DICOM/NIfTI conversion to a Fallback step. This prevents the entire execution from failing due to an unforeseen error, such as a corrupted DICOM study or segmentation, in one or more studies. A second fault-tolerance mechanism is retry logic with backoff to handle the ThrottlingException (Rate exceeded) that may occur at launch when Step Functions submits 50 concurrent SageMaker Processing jobs.
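
The shape of such a Map state can be sketched as a Python dictionary that serializes to Amazon States Language. This is a simplified fragment rather than the definition shipped in the repository: the state names, the retry interval, backoff rate, and attempt count are assumptions, the error name used for throttling is assumed to match how the exception surfaces in your account, and most of the required processing job parameters are omitted.

```python
import json


def build_map_state(max_concurrency=50):
    """Return a Map state that fans out one processing job per subject."""
    processing_task = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sagemaker:createProcessingJob.sync",
        "Parameters": {
            # Only the resource sizing is shown; the job name, role,
            # container image, inputs, and outputs are omitted here.
            "ProcessingResources": {
                "ClusterConfig": {
                    "InstanceCount": 1,
                    "InstanceType": "ml.r5.large",
                    "VolumeSizeInGB": 5,
                }
            }
        },
        # Retry with exponential backoff when job submission is throttled.
        "Retry": [{
            "ErrorEquals": ["ThrottlingException"],
            "IntervalSeconds": 5,
            "BackoffRate": 2.0,
            "MaxAttempts": 5,
        }],
        # Route any other failure to Fallback so one bad study
        # does not fail the whole execution.
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Fallback"}],
        "End": True,
    }
    return {
        "Type": "Map",
        "ItemsPath": "$.Subject",
        "MaxConcurrency": max_concurrency,
        "Iterator": {
            "StartAt": "ProcessStudy",
            "States": {
                "ProcessStudy": processing_task,
                "Fallback": {"Type": "Pass", "End": True},
            },
        },
        "End": True,
    }


print(json.dumps(build_map_state(), indent=2))
```

Listing the Retry rule before the Catch rule reflects the evaluation order: retries are attempted first, and the catcher applies only once they are exhausted or the error does not match a retrier.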

The following code snippet in imaging/preprocess-imaging-data.ipynb executes the workflow for all studies, and saves the radiomic features to the specified feature store.

import boto3
import json
import uuid

sfn = boto3.client('stepfunctions')

stateMachineArn = 'arn:aws:states:<region>:<accountId>:stateMachine:nsclc-radiogenomics-imaging-workflow'

# Unique suffix so that each execution and processing job gets a distinct name
suffix = uuid.uuid1().hex

payload = {
  "PreprocessingJobName": "dcm-nifti-conversion-%s" % suffix,
  "FeatureStoreName": "nsclc-radiogenomics-imaging-feature-group",
  "OfflineStoreS3Uri": "s3://<S3-BUCKET-IMAGE-DATA-PROCESSED>/nsclc_radiogenomics-multimodal-imaging-featurestore",
  "Subject": [
    "R01-001", "R01-002", ...., "R01-162", "R01-163"
  ]
}

# Start one execution; the Map state fans out over the entries in "Subject"
sfn.start_execution(stateMachineArn=stateMachineArn, name=suffix, input=json.dumps(payload))

Conversion and feature extraction for a single subject takes about 5 minutes. With MaxConcurrency = 50 in our execution, we completed the processing of the entire dataset in just 20 minutes, instead of the 162 × 5 = 810 minutes it would have taken had we run the studies serially.
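
The timing comparison can be reproduced with a back-of-the-envelope estimate. The sketch below assumes that jobs run in full waves of equal duration and ignores job startup overhead, so it is an approximation rather than a model of how the Map state actually schedules work. The function name is hypothetical.

```python
import math


def estimated_minutes(n_subjects, minutes_per_subject, max_concurrency):
    """Estimate wall-clock time when subjects are processed in waves."""
    waves = math.ceil(n_subjects / max_concurrency)
    return waves * minutes_per_subject


serial = estimated_minutes(162, 5, max_concurrency=1)
parallel = estimated_minutes(162, 5, max_concurrency=50)
print(serial, parallel)
```

With 162 subjects and 50 concurrent jobs, the work fits in four waves, which is consistent with the observed 20 minutes.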

Cleaning up

To avoid incurring future charges, delete the resources (EC2 instance, SageMaker services, data on S3, containers on Amazon ECR, Step Functions, and AWS KMS keys) created in the steps above, unless you plan to continue to follow along in part two of this two-part blog series.


In this blog post, we’ve shown how to deploy a set of data analysis pipelines to efficiently process data from diverse, unstructured data modalities. By leveraging fully managed AWS services, we’ve streamlined the setup for processing multimodal data at scale.

In the next blog post of this two-part series, we construct a multimodal feature store using SageMaker Feature Store. We then train models on the features from each data modality individually, and train a model on all of the modalities pooled together. We conclude by comparing the performance of the respective models, and quantifying how the multimodal model used information from each modality to make predictions.

Leveraging multimodal data promises better ML models for healthcare and life sciences, and subsequently improved care delivery and patient outcomes. We encourage you to continue to follow us in the next blog post, and to experiment with the example we’ve presented here in the meantime.

To learn more about healthcare & life sciences on AWS, visit


[1] Huang, Shih-Cheng, et al. “Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines.” NPJ Digital Medicine 3.1 (2020): 1-9.
[2] Bakr, Shaimaa, et al. “A radiogenomic dataset of non-small cell lung cancer.” Scientific Data 5.1 (2018): 1-9.
[4] Gene Expression Omnibus:
[6] Bakr, S. et al. The Cancer Imaging Archive (2017).
[7] van Griethuysen, J. J. M., et al. “Computational Radiomics System to Decode the Radiographic Phenotype.” Cancer Research (2017).

Olivia Choudhury


Olivia Choudhury, PhD, is a Partner Solutions Architect at AWS. She helps partners, in the Healthcare and Life Sciences domain, design, develop, and scale state-of-the-art solutions leveraging AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Andy Schuetz


Andy Schuetz, PhD, is a Sr. Startup Solutions Architect at AWS, where he focuses on helping customers deliver Healthcare and Life Sciences solutions. When Andy’s not building things at AWS, he prefers to be riding a mountain bike.

Michael Hsieh


Michael Hsieh is a Senior AI/ML Specialist Solutions Architect. He works with HCLS customers to advance their ML journey with AWS technologies and his expertise in medical imaging. As a Seattle transplant, he loves exploring the great mother nature the city has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at Shilshole Bay.