Predicting diabetic patient readmission using multi-model training on Amazon SageMaker Pipelines
In 2013, the International Diabetes Federation (IDF) estimated that approximately 382 million people had diabetes worldwide. By 2035, this was predicted to rise to 592 million. Diabetes is a major chronic disease that often results in hospital readmissions due to multiple factors.
Approximately 33% of all health care spending in 2009 went to hospital care. An estimated $25 billion is spent on preventable hospital readmissions that result from medical errors and complications, poor discharge procedures, and lack of integrated follow-up care. If hospitals can predict diabetic patient readmission, medical practitioners can provide additional, personalized care to their patients to pre-empt this possible readmission, potentially saving costs, time, and lives.
In this blog post, learn how to use machine learning (ML) from Amazon Web Services (AWS) to create a solution that can predict hospital readmission – in this case, of diabetic patients – based on data inputs like number of procedures and medications, number of diagnoses, and admission type, among other features.
Predicting hospital readmission for diabetic patients with machine learning: Solution overview
This solution uses the UCI Diabetes 130-US hospitals for years 1999-2008 dataset, from which we derive insights using machine learning. The dataset contains over one hundred thousand observations with more than 55 features representing patient and hospital outcomes. The data is de-identified and was captured from clinical care at 130 US hospitals and integrated delivery networks.
Building an ML pipeline has multiple steps, including data preparation, model building, training and tuning, model deployment, and post deployment operations like model monitoring. Although data scientists have tools for each of these steps, they can be challenged by reliability, repeatability, scalability, and collaboration.
Amazon SageMaker addresses these challenges by removing the undifferentiated heavy lifting of managing the infrastructure of the solution, so data scientists and ML engineers can move faster. In this blog post, we use SageMaker capabilities like Amazon SageMaker Data Wrangler and Amazon SageMaker Pipelines to accelerate solutioning for our ML problem. We suggest reading the linked documentation for these services before proceeding for a smoother deployment experience. Note: This solution is not intended for production deployment, but it can serve as a reference model architecture. The ML model can be further improved by using other feature engineering techniques, algorithms, and hyperparameters.
For this walkthrough, you should have the following prerequisites:
- An AWS account
- Onboard to an Amazon SageMaker Domain.
- Create an Amazon SageMaker notebook instance in your AWS account. Make sure your Jupyter instance has the necessary AWS Identity and Access Management (IAM) permissions. You can clone the GitHub repository ai-ml-sagemaker-multi-model-pipeline when you create the instance.
Walkthrough to deploy and test the solution using Jupyter notebook diabetes-project.ipynb
- Log in to your AWS account. In the AWS Management Console, search for and select Amazon SageMaker. In the SageMaker dashboard, under Notebook, select Notebook instances. Open the Jupyter instance that you created in the Prerequisites section.
- Open the Jupyter notebook ai-ml-sagemaker-multi-model-pipeline/diabetes-project.ipynb and execute its steps in order. The notebook walks you through the sections below, which we outline here but which are explained in full, with the actual commands, within the notebook.
Prepare the dataset collection
- Create an Amazon Simple Storage Service (Amazon S3) bucket.
- Upload the UCI Diabetes 130-US hospitals for years 1999-2008 dataset to the Amazon S3 bucket.
- Create the IAM role needed for creating the SageMaker image.
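The dataset preparation steps above can be sketched with boto3 as follows. The bucket name, key prefix, and file path in this sketch are placeholders for illustration, not the values the notebook uses.

```python
# Sketch of the dataset upload step. The bucket name and key prefix are
# placeholder values; substitute your own when running the notebook.
def dataset_s3_uri(bucket: str, prefix: str, filename: str = "diabetic_data.csv") -> str:
    """Build the S3 URI where the raw UCI dataset will live."""
    return f"s3://{bucket}/{prefix}/{filename}"

def upload_dataset(bucket: str, prefix: str, local_path: str = "diabetic_data.csv"):
    """Create the bucket (if needed) and upload the raw CSV."""
    import boto3  # imported here so the sketch stays importable without AWS credentials
    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=bucket)  # us-east-1; other Regions need CreateBucketConfiguration
    s3.upload_file(local_path, bucket, f"{prefix}/{local_path}")

print(dataset_s3_uri("my-diabetes-demo-bucket", "raw"))
# -> s3://my-diabetes-demo-bucket/raw/diabetic_data.csv
```

The resulting S3 URI is what later pipeline steps reference as their input location.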
Prepare the Decision Tree custom Docker image
- Create the Docker image with your Decision Tree algorithm. Push the Docker image to your Amazon Elastic Container Registry (Amazon ECR) repository. Then, make the Docker image accessible from SageMaker.
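The build-and-push flow typically looks like the following AWS CLI and Docker commands. The account ID, Region, and repository name below are placeholder values; the notebook's own commands may differ.

```shell
# Placeholder values: replace with your own account ID, Region, and repo name.
REGION=us-east-1
ACCOUNT=123456789012
REPO=decision-tree

# Authenticate Docker to your private Amazon ECR registry
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# Create the repository (one time), then build, tag, and push the image
aws ecr create-repository --repository-name "$REPO" --region "$REGION"
docker build -t "$REPO" .
docker tag "$REPO:latest" "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```

Once pushed, the image URI (the `…amazonaws.com/$REPO:latest` string) is what you pass to SageMaker when defining the training step.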
Define and start the SageMaker pipeline
Figure 1. SageMaker Pipelines steps to create the classification model that can predict diabetic patient readmission based on the example dataset.
The SageMaker pipeline should complete in approximately 20-25 minutes. The pipeline comprises multiple ML steps, which Figure 1 depicts for reference. These steps, described below, occur automatically as you move through the Jupyter notebook.
1. Feature Engineering (DataWranglerProcess step): This step runs a SageMaker Data Wrangler flow that performs the following data transformations on the dataset of over 101K rows and 50 columns. This step automatically:
a. Moves the readmitted column to the beginning. This is the column to be predicted in the classification problem, i.e., whether a patient will be readmitted (1) or not (0).
b. Converts the readmitted column value to 0 if it is NO and to 1 if it is <30 or >30.
c. Drops columns that have minimal to zero predictive power based on the Data Wrangler Data Quality and Insights Report, e.g., payer_code and encounter_id.
d. Groups values into finite categories using a Python custom transform in the following columns: diag_1, diag_2, diag_3, admission_type_id, admission_source_id, and discharge_disposition_id.
e. Fills missing values in columns diag_1, diag_2, and diag_3, and replaces strings in column race.
f. Drops duplicates, balances the data using the Synthetic Minority Oversampling Technique (SMOTE), and one-hot encodes the following columns: race, gender, age, diag_1, diag_2, diag_3, max_glu_serum, A1Cresult, metformin, repaglinide, pioglitazone, rosiglitazone, insulin, change, diabetesMed, admission_type_id, discharge_disposition_id, admission_source_id.
2. Preprocessing (Preprocess step): This step reads the transformed data from the DataWranglerProcess step, randomizes it, and splits it into train (70%), validation (10%), and test (20%) sets.
3. DT Model Train & Tune (DTreeHPTune step): This step trains a scikit-learn Decision Tree model on the train data with hyperparameter tuning.
4. XGBoost Model Train & Tune (XGBHPTune step): This step trains an XGBoost model on the train data with hyperparameter tuning.
5. Model Evaluation (DTreeEval and XGBEval steps): These steps evaluate the generated Decision Tree and XGBoost models, respectively, using the test data. The Area Under the Curve – Receiver Operating Characteristic (AUC-ROC) score and accuracy are used as performance metrics.
6. Register Top Model (DTreeReg-RegisterModel or XGBReg-RegisterModel step): This step registers the winning model, that is, the model with the higher AUC-ROC score, into the model registry.
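Two of the steps above, encoding the readmitted target (step 1b) and the 70/10/20 split (step 2), can be illustrated with a minimal, self-contained sketch. This is not the notebook's exact code, just the logic it performs.

```python
import random

def encode_readmitted(value: str) -> int:
    """Map the raw readmitted values to a binary target: NO -> 0, <30 or >30 -> 1."""
    return 0 if value == "NO" else 1

def split_rows(rows, seed=42):
    """Shuffle and split rows into 70% train, 10% validation, 20% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for a reproducible split
    n = len(rows)
    n_train = int(0.7 * n)
    n_val = int(0.1 * n)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

labels = [encode_readmitted(v) for v in ["NO", "<30", ">30", "NO"]]
train, val, test = split_rows(range(100))
print(labels, len(train), len(val), len(test))
# -> [0, 1, 1, 0] 70 10 20
```

In the actual pipeline, the equivalent transformations run inside the Data Wrangler flow and the Preprocess step on the full 101K-row dataset.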
Approve the top performing model in SageMaker Model Registry
The top performing model in the registry has a status of Pending by default. In this section, we update the model status to Approved in the model registry.
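The approval itself is a single API call. The sketch below wraps boto3's `update_model_package` operation; the ARN argument is a placeholder, and the validation helper is our own addition for illustration.

```python
# Valid values of ModelApprovalStatus in the SageMaker API.
VALID_STATUSES = {"Approved", "Rejected", "PendingManualApproval"}

def set_approval_status(model_package_arn: str, status: str = "Approved"):
    """Update a registered model package's approval status in the model registry."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    import boto3  # imported here so the sketch stays importable without AWS access
    boto3.client("sagemaker").update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus=status,
    )

print(sorted(VALID_STATUSES))
# -> ['Approved', 'PendingManualApproval', 'Rejected']
```

You can perform the same status change through the SageMaker Studio UI, which is what this section of the notebook walks through.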
Deploy the SageMaker inference endpoint
Deploy an inference endpoint containing the approved model in the model registry.
Run predictions on model
Once the inference endpoint has been deployed, you can run inference against it. Test the model using two diabetic patients with the following profile summaries:
1. Patient is a Caucasian female age 60-70, who has spent five days in the hospital under emergency care in the current encounter. Prior to this encounter, patient has spent zero days in outpatient care, zero days in emergency care, and seven days in inpatient care. Sixty-four laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, or rosiglitazone, and insulin prescription is steady. If this solution is deployed correctly, the model predicts that this diabetic patient is likely to be readmitted to the hospital (model inference output is 1).
2. Patient is a Caucasian female age 70-80, who has spent three days in the hospital under elective care in the current encounter. Prior to this encounter, the patient has spent zero days in outpatient, emergency, and inpatient care. Nineteen laboratory procedures have been performed on the patient. The patient is not using metformin, repaglinide, pioglitazone, rosiglitazone, or insulin. If this solution is deployed correctly, the model predicts that this diabetic patient is not likely to be readmitted to the hospital (model inference output is 0).
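Invoking the endpoint can be sketched as follows. The endpoint name and the short feature vector are illustrative placeholders; the notebook sends the full, one-hot-encoded feature row that matches the training data.

```python
# Sketch of calling the deployed SageMaker inference endpoint with a CSV payload.
def build_csv_payload(features) -> str:
    """Serialize one patient's feature vector as the CSV body the endpoint expects."""
    return ",".join(str(f) for f in features)

def predict_readmission(endpoint_name: str, features) -> str:
    """Call the SageMaker runtime endpoint and return its raw response body."""
    import boto3  # imported here so the sketch stays importable without AWS access
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=build_csv_payload(features),
    )
    return response["Body"].read().decode("utf-8")

print(build_csv_payload([5, 64, 0, 0, 7]))
# -> 5,64,0,0,7
```

A returned value of 1 corresponds to a predicted readmission, and 0 to no predicted readmission, matching the two patient profiles above.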
Clean up
To avoid incurring future charges, delete the created resources, such as the Amazon S3 bucket, Amazon ECR repository, and SageMaker Studio. Prior to deleting SageMaker Studio, make sure to delete the SageMaker model and endpoint resources from the SageMaker console. Finally, delete the Jupyter instance containing the notebook from the SageMaker console.
Conclusion
This blog post walks through how to generate an ML model that predicts diabetic patient hospital readmission using SageMaker Pipelines. This post also shows how to transform raw data for model training as part of your ML pipeline using SageMaker Data Wrangler. Though this walkthrough focuses on diabetic patient hospital readmission, it can be adapted to predict other readmission factors as well.
Read more about AWS for healthcare:
- How to modernize legacy HL7 data in Amazon HealthLake
- New research shows EU and UK healthcare sectors could save 14.4 billion euros with AWS
- Getting started with healthcare data lakes: Using microservices
- Breaking down patient data silos in UK healthcare with serverless cloud technology
- How to deploy HL7-based provider notifications on AWS Cloud
Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.
Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.