AWS for Industries
Scalable Medical Computer Vision Model Training with Amazon SageMaker Part 1
Computer vision (CV) and machine learning (ML) have been playing an increasingly important role in the field of medical imaging and radiology over the past decade. CV is the field that uses computers and programs to replicate human vision abilities such as recognition, classification, detection, and measurement. Thanks to the advancement of deep learning (DL), a sub-category of ML, CV models trained with DL have been shown to surpass human cognitive ability in many domains, including medicine.
Medical imaging, the technique of imaging and visualizing a region of interest for clinical and diagnostic purposes, is among the fields most influenced by the recent trend of applying highly accurate CV models trained with DL. Highly accurate and automated CV models that perform tasks such as pattern classification, abnormality detection, and tissue segmentation have been used by clinicians in medical domains such as neuroradiology, cardiology, and pulmonology to augment their clinical routines for more reliable, faster, and better patient care.
Training a CV model with DL in a medical domain, however, is uniquely challenging for data scientists and ML engineers compared to training CV models in other domains. Training a medical CV model requires a scalable compute and storage infrastructure for the following reasons:
- Medical imaging data presents much more sophisticated and subtle visual cues for most tasks. It requires a complex neural network architecture and a large amount of data.
- Medical imaging data standards are complex in nature as they need to carry much more information for medical and clinical use. ML training on such complex data requires lots of customized coding on top of an existing DL framework.
- Medical imaging data is significantly larger in size, attributed to high resolution and multiple dimensions (3D and 4D are common) beyond flat 2D images.
- The use of multiple imaging modalities in diagnosis, prognosis, and ML modeling is widely adopted in the medical domain.
A major challenge for model training that arises from these factors is the scalability of compute and storage. A critical topic is how data scientists and ML engineers conduct model training in the cloud in a manageable timeframe.
In this two-part blog series, we show you how we scale a multi-modality MRI brain tumor segmentation training workload on terabytes of data from 90 hours to four hours. In this first post, we describe the challenges when training a medical CV model at scale and level set on distributed training terminology using a multi-modality brain MRI dataset. In part two, we propose a scalable solution and step through the architecture, code, and benchmark result.
We used a multi-modal MRI brain tumor image segmentation benchmark dataset named BraTS. The MRI imaging sequences included in this dataset are T2-FLAIR, T1, T1Gd, and T2-weighted (Figure 1). The four modalities are concatenated as the fourth dimension (channel) on top of the X, Y, and Z dimensions. There are 750 patients (484 training and 266 testing) from BraTS’16 and ’17. The target of this dataset is to segment glioma tumors and their subregions. The tumor can be partitioned as peritumoral edema, GD-enhancing tumor, and the necrotic (or non-enhancing) tumor core (Figure 2). The tumor can also be delineated as a whole tumor (WT), tumor core (TC), and enhancing tumor (ET) (Figure 2), which is the label definition we used in this work. Human expert annotations are provided with the dataset. This dataset is provided by the Medical Segmentation Decathlon and can be downloaded from the AWS Open Data Registry.
Figure 1. Multi-modality tumor dataset BraTS. From top to bottom: T2-FLAIR, T1, T1Gd, and T2-weighted. All four modalities provide complementary information for identifying tumor regions. Case BraTS_001.
Figure 2. Labels are defined as: 1—peritumoral edema; 2—GD-enhancing tumor, 3—necrotic/non-enhancing tumor core. Whole tumor (WT) is defined as (1+2+3); tumor core (TC) is (2+3); enhancing tumor (ET) is 2 alone.
Individual files are saved as standard GNU zip (gzip) compressed archives of the multi-dimensional neuroimaging NIfTI format (for example, labelsTr/BRATS_001.nii.gz). On average, each image or label pair takes up about 9.6 MB on disk (~135 MB in decompressed .nii form). The 484 training pairs have a disk footprint of 4.65 GB (~69 GB when decompressed).
What is the problem we are trying to solve?
Training a CV model on a dataset this size does not require significant compute infrastructure. The entire dataset can be loaded into the memory of moderate CPU hosts, decompressed to .nii files, decoded to PyTorch tensors (.pt), and served in large batches to the GPU for model computation. Seconds per epoch, a measure of model training efficiency, comes to about 300 seconds using a standard Dataset on an ml.p3.2xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance (8 vCPUs and 1 NVIDIA Tesla V100 GPU). It can be reduced to approximately 20 to 40 seconds with caching techniques in the data loading step. We can, for example, write the decompressed, decoded, and transformed dataset to disk using PersistentDataset or cache it in memory using CacheDataset. Dataset, PersistentDataset, and CacheDataset are from MONAI, a medical imaging domain-specific library that offers features and functionality for medical imaging data.
However, to reflect real-world scenarios, we augmented the BraTS dataset to 48,884 image/label pairs. It is not practical to train a semantic segmentation model on a dataset this size (~450 GB in .nii.gz, ~6.5 TB decompressed to .nii format, ~1.5 TB decoded to PyTorch tensors) using one GPU device. Even when doubling the batch size to 16, the training loop is quite slow on this new dataset. The average epoch takes around 31,000 seconds (8 hours 37 minutes) running on the ml.p3.2xlarge instance. A 10-epoch training job thus takes close to four days to finish. Vertically scaling the compute instance (using a bigger Amazon EC2 instance) does not solve this problem. Writing the transformed dataset to disk via PersistentDataset, or caching it in memory via CacheDataset, is not possible because of the size of the dataset when decompressed and decoded during training.
Fortunately, Amazon SageMaker Training offers a managed data parallelism option with near-linear scaling efficiency, achieving fast time-to-train with minimal code changes. Data parallelism is a method for training models on large datasets by distributing the data across different nodes, with a copy of the model loaded on each node. The SageMaker distributed data parallel library uses MPI (Message Passing Interface) and Amazon Elastic Fabric Adapter (EFA) for inter-node communication and NCCL (NVIDIA Collective Communication Library) for intra-node communication. The library is optimized for Amazon ML training instances such as ml.p4d.24xlarge. SageMaker takes care of all the resource configuration needed, so you don't have to set up your own distributed training environment.
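Enabling the library is a matter of setting the distribution option on a SageMaker estimator. The configuration sketch below shows the general shape; the entry point script, IAM role, framework versions, and S3 path are placeholders, not the values used in this work.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # hypothetical training script
    role="<SageMakerExecutionRole>",   # placeholder IAM role ARN
    instance_type="ml.p4d.24xlarge",
    instance_count=2,                  # data parallelism across 16 GPUs
    framework_version="1.8.1",         # illustrative version pair
    py_version="py36",
    # Turn on the SageMaker distributed data parallel library.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# estimator.fit({"train": "s3://<bucket>/brats/train"})  # hypothetical input
```

Inside the training script, the library exposes a drop-in replacement for PyTorch's DistributedDataParallel wrapper, so existing single-GPU code needs only a few lines of change.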
Bottlenecks associated with model training on GPU
A typical neural network training loop consists of four operations:
- Disk I/O: Loading packets of training/validation data from storage to CPU memory.
- Data Preprocessing: A chain of transformations on a batch of data points (or entire dataset) handled by one or more CPUs (for example, image cropping, video sampling, word tokenization). Note that preprocessing operations are either static (for example, image resizing or reorienting) or randomized (for example, shuffling, RandSpatialCropd, or RandFlipd). CPUs are not optimized for image manipulation tasks and thus data preprocessing often causes a bottleneck in CV workloads.
- CPU to GPU Transfer: The physical shipment of batches of processed data from CPU memory to GPU memory.
- Model Computation: Forward and backward propagation. This step is rarely the cause of a bottleneck. Accelerators (GPUs) continue to improve, and CPU hosts are usually unable to keep up with them.
When the training dataset is large (100s of GBs), disk I/O or data preprocessing, or both, can become bottlenecks.
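One way to find out which of these operations dominates is to wall-clock the data-wait and model-computation portions of the loop separately. The sketch below does this with a tiny synthetic PyTorch pipeline; the model, array sizes, and batch shape are illustrative stand-ins for a real 3D segmentation workload.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 64 tiny 4-channel volumes with dummy class labels.
dataset = TensorDataset(torch.randn(64, 4, 16, 16, 16),
                        torch.zeros(64, dtype=torch.long))
loader = DataLoader(dataset, batch_size=8)

model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(4 * 16 * 16 * 16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

load_time = compute_time = 0.0
end = time.perf_counter()
for x, y in loader:
    load_time += time.perf_counter() - end       # disk I/O + preprocessing wait
    start = time.perf_counter()
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    compute_time += time.perf_counter() - start  # forward/backward propagation
    end = time.perf_counter()

print(f"data wait: {load_time:.3f}s, compute: {compute_time:.3f}s")
```

If the data-wait total dominates, the loop is I/O- or preprocessing-bound rather than compute-bound, and faster GPUs will not help.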
Disk I/O: If the training dataset is stored remotely (for example, on Amazon Simple Storage Service [Amazon S3] or Amazon Elastic Block Store [Amazon EBS]), I/O can pose a networking challenge. Furthermore, if the training loop is orchestrated across multiple nodes, multiple machines start requesting data concurrently and I/O becomes a bottleneck. We will see how Amazon FSx for Lustre helped us scale data transfer throughput in our second post.
Static Preprocessing: Oftentimes, static transformations like decoding images, converting to tensors, resizing, reorienting, and many other augmentations are computationally expensive and time consuming. Moreover, if the training dataset is saved on disk as compressed archives, as with medical datasets such as BraTS, static preprocessing is even more time consuming. For example, loading a NIfTI image is seven times slower when it is archived: loading a .nii.gz file takes about 850 ms, whereas loading a .nii file takes about 120 ms.
Clara's SmartCacheDataset and MONAI's monai.data.CacheDataset let you reuse such computations by caching the results of static transformations in CPU memory or on disk. The cached data is then reused by subsequent epochs instead of being recomputed within each iteration. However, this strategy only works if the transformed dataset fits in CPU memory, or on disk alongside the raw dataset and model artifacts. If you plan on running multiple experiments (for example, neural architecture search or hyperparameter optimization), and those experiments share the same static transformations, consider removing static preprocessing from the training loop altogether. For example, you can take advantage of the fully managed SageMaker Processing to host your static preprocessing and save the results on Amazon S3, which we discuss in part two of this series.
Randomized Preprocessing: The computation of randomized transformations (such as random crops and flips) is not reusable. For randomized transformations, consider horizontal scaling. That is, distribute the compute of randomized transformations across multiple CPU workers. Fortunately, most deep learning frameworks offer horizontal scaling in their data loaders. However, you still need to tune the number of CPU workers yourself. Having too few CPU workers leads to underutilization of CPU memory. Alternatively, having too many CPU workers leads to out-of-memory exceptions and errors since each worker runs its own process with its own memory footprint. If data preprocessing remains a bottleneck, consider offloading randomized transformations to GPU. NVIDIA’s Data Loading Library (DALI) is a data loading and preprocessing library that accelerates training loops by moving data preprocessing to GPU. This previous blog post discusses ways to optimize I/O for GPU performance.
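The sketch below illustrates this horizontal scaling with a plain PyTorch Dataset whose per-fetch random flip stands in for randomized MONAI transforms such as RandFlipd; the class name, array sizes, and worker count are illustrative choices, not tuned values.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomFlipDataset(Dataset):
    """Hypothetical stand-in: applies a randomized transform on every fetch,
    analogous to MONAI's RandFlipd on MRI volumes."""

    def __init__(self, n=32):
        # (case, modality, X, Y, Z) with tiny illustrative dimensions.
        self.volumes = torch.randn(n, 4, 8, 8, 8)

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        vol = self.volumes[idx]
        if torch.rand(1).item() < 0.5:       # randomized, so not cacheable
            vol = torch.flip(vol, dims=[1])  # flip along the X axis
        return vol

# num_workers spreads the randomized transform across CPU processes; tune it
# to the host, since too few workers starve the GPU and too many risk
# out-of-memory errors (each worker is a separate process).
loader = DataLoader(RandomFlipDataset(), batch_size=8, num_workers=2)
batch = next(iter(loader))
print(tuple(batch.shape))  # (8, 4, 8, 8, 8)
```

Because the flip decision is drawn fresh on every fetch, caching its output would freeze the augmentation; distributing the work across workers is the appropriate lever instead.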
In this first part of the blog series, we described the challenges commonly faced in the medical domain when training large-scale CV models on terabytes of medical imaging data. We also reviewed data parallelism and the bottlenecks associated with model training on GPUs. In part two, we will build on this post and demonstrate a cloud-native solution to train a tumor segmentation model on multiple terabytes of training data. You will learn how we scale out to reduce the training time from 90 hours to four hours.
With the solution proposed in our next post, data scientists working in the healthcare and life science domain can train medical imaging models with larger datasets and richer information from multi-modality imaging techniques. ML teams can iterate faster over model tuning, resulting in more accurate models. This, in turn, helps researchers, radiologists, clinicians, and healthcare providers understand disease patterns better while providing more personalized treatment.