AWS Big Data Blog

Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning

Ujjwal Ratan is a Solutions Architect with Amazon Web Services

The Hospital Readmissions Reduction Program (HRRP) was included as part of the Affordable Care Act to improve quality of care and lower healthcare spending. A patient visit to a hospital may count as a readmission if the patient is admitted to a hospital within 30 days of being discharged from an earlier hospital stay. This should be easy to measure, right? Wrong.

Unfortunately, it gets more complicated than this. Not all readmissions can be prevented, as some of them are part of an overall care plan for the patient. There are also factors beyond the hospital’s control that may cause a readmission. The Centers for Medicare & Medicaid Services (CMS) recognized the complexities of measuring readmission rates and came up with a set of measures to evaluate providers.

There is still a long way to go for hospitals to be effective in preventing unplanned readmissions. Recognizing factors affecting readmissions is an important first step, but it is also important to draw out patterns in readmission data by aggregating information from multiple clinical and non-clinical hospital systems.

Moreover, most analysis algorithms rely on financial data, which omits the clinical nuances applicable to a readmission pattern. The data sets also contain a lot of redundant information, such as patient demographics and historical data. All this creates a massive data analysis challenge that may take months to solve using conventional means.

In this post, I show how to apply advanced analytics concepts like pattern analysis and machine learning to do risk stratification for patient cohorts.

The role of Amazon ML

There have been multiple global scientific studies on scalable models for predicting readmissions with high accuracy. Some of them, like comparison of models for predicting early hospital readmissions and predicting hospital readmissions in the Medicare population, are great examples.

Readmission records demonstrate patterns in data that can be used in a prediction algorithm. These patterns can be separated as outliers that are used to identify patient cohorts with high risk. Attribute correlation helps to identify the significant features that affect readmission risk in a patient. This risk stratification is enabled by categorizing patient attributes into numerical, categorical, and text attributes and applying statistical methods like standard deviation, median analysis, and the chi-squared test. These data sets are used to build statistical models that identify patients demonstrating characteristics consistent with readmissions, so that the necessary steps can be taken to prevent them.
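To make the chi-squared step more concrete, here is a minimal sketch that computes the statistic for a 2x2 contingency table of one categorical attribute against the readmission outcome. The helper name chiSquared2x2 and the counts are hypothetical, made up purely for illustration; they are not taken from any data set.

```javascript
// Pearson chi-squared statistic for a 2x2 contingency table.
// table[i][j] = number of patients with attribute value i and outcome j.
function chiSquared2x2(table) {
  const rowTotals = table.map(function (row) { return row[0] + row[1]; });
  const colTotals = [table[0][0] + table[1][0], table[0][1] + table[1][1]];
  const total = rowTotals[0] + rowTotals[1];
  let statistic = 0;
  for (let i = 0; i < 2; i++) {
    for (let j = 0; j < 2; j++) {
      // Expected count if the attribute and the outcome were independent.
      const expected = (rowTotals[i] * colTotals[j]) / total;
      const diff = table[i][j] - expected;
      statistic += (diff * diff) / expected;
    }
  }
  return statistic;
}

// Hypothetical counts: rows = attribute present / absent,
// columns = readmitted / not readmitted.
const exampleTable = [[30, 70], [10, 90]];
console.log(chiSquared2x2(exampleTable)); // 12.5
```

A larger statistic suggests a stronger association between the attribute and the outcome. With one degree of freedom, values above roughly 3.84 are significant at the 5% level.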

Amazon Machine Learning (Amazon ML) provides visual tools and wizards that guide users in creating complex ML models in minutes. You can also interact with it using the AWS CLI and API to integrate the power of ML with other applications. Based on the chosen target attribute in Amazon ML, you can build ML models like a binary classification model that predicts between states of 0 or 1 or a numeric regression model that predicts numerical values based on certain correlated attributes.

Creating an ML model for readmission prediction

The following diagram represents a reference architecture for building a scalable ML platform on AWS.

  1. The first step is to get the data into Amazon S3, the object storage service from AWS.
  2. Amazon Redshift acts as the database for the huge amounts of structured clinical data. The data is loaded into Amazon Redshift tables and is massaged to make it more meaningful as a data source for an ML model.
  3. A binary classification ML model is created using Amazon ML, with Amazon Redshift as the data source. A real-time endpoint is also created to allow real-time querying for the ML model.
  4. Amazon Cognito is used for secure federated access to the Amazon ML real-time endpoint.
  5. A static website is created on S3. This website hosts the end-user-facing application, which queries the Amazon ML endpoint in real time.

The architecture above is just one of the ways in which you can use AWS for building machine learning applications. You can vary this architecture and add services such as Amazon Elastic MapReduce (Amazon EMR) if your use case involves large volumes of unstructured data, or build a business intelligence (BI) reporting interface for analysis of predicted metrics. AWS provides a range of services that act as building blocks for the use case you want to build.

Prerequisite: Start with a data set

The first step in creating an accurate model is to choose the right data set to build and train the model. For the purposes of this post, I am using a publicly available diabetes data set from the University of California, Irvine (UCI). The data set consists of 101,766 rows and represents 10 years of clinical care records from 130 US hospitals and integrated delivery networks. It includes over 50 features (attributes) representing patient and hospital outcomes. The data set can be downloaded from the UCI website. The hosted zip file consists of two CSV files. The first file, diabetic_data.csv, is the actual data set, and the second file, IDs_mapping.csv, is the master data for admission_type_id, discharge_disposition_id, and admission_source_id.

Amazon ML automatically splits source data sets into two parts. The first part is used to train the ML model and the second part is used to evaluate the ML model’s accuracy. In this case, seventy percent of the source data is used to train the ML model and thirty percent is used to evaluate it. This is represented in the data rearrangement attribute as shown below:

ML model training data set:

  {
    "splitting": {
      "percentBegin": 0,
      "percentEnd": 70,
      "strategy": "random",
      "complement": false,
      "strategyParams": {
        "randomSeed": ""
      }
    }
  }

ML model evaluation data set:

  {
    "splitting": {
      "percentBegin": 70,
      "percentEnd": 100,
      "strategy": "random",
      "complement": false,
      "strategyParams": {
        "randomSeed": ""
      }
    }
  }
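If you create the data sources through the API rather than the console, the two data rearrangement strings can be generated from a single helper. The following is a minimal sketch: buildSplit is a hypothetical helper name, the seed value is a placeholder, and reusing the same seed for both slices is my assumption for keeping the two random slices consistent with each other.

```javascript
// Build the data rearrangement JSON string for one slice of the source data.
// The field names mirror the fragments shown above.
function buildSplit(percentBegin, percentEnd, randomSeed) {
  return JSON.stringify({
    splitting: {
      percentBegin: percentBegin,
      percentEnd: percentEnd,
      strategy: "random",
      complement: false,
      strategyParams: { randomSeed: randomSeed }
    }
  });
}

const seed = "example-seed"; // hypothetical placeholder value
const trainingRearrangement = buildSplit(0, 70, seed);     // first 70 percent
const evaluationRearrangement = buildSplit(70, 100, seed); // remaining 30 percent
console.log(trainingRearrangement);
console.log(evaluationRearrangement);
```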

The accuracy of an ML model generally improves when more data is used to train it. The data set I’m using in this post is too limited for building a comprehensive ML model, but the methodology can be replicated with larger data sets.

Prepare the data and move it into Amazon S3

For an ML model to be effective, you should prepare the data so that it provides the right patterns to the model. The data set should have good coverage for relevant features, be low in unwanted “noise” or variance, and be as complete as possible with correct labels.

Use the Amazon Redshift database to prepare the data set. To begin, copy the data into an S3 bucket named diabetesdata. The bucket consists of four CSV files:

You can list the bucket contents by running the following command in the AWS CLI:

aws s3 ls s3://diabetesdata

Following this, create the necessary tables in Amazon Redshift to process the data in the CSV files: three master tables and one transaction table.

The transaction table consists of lookup IDs which act as foreign keys (FK) from the above master tables. It also has a primary key “encounter_id” and multiple columns that act as features for the ML model. The createredshifttables.sql script is executed to create the above tables.

After the necessary tables are created, start loading them with data. You can use the Amazon Redshift COPY command to copy the data from the files on S3 into the respective Amazon Redshift tables. The following script template shows the format of the COPY command used. For this demo, the cluster has an IAM role attached to it that has access to the S3 bucket. For more information about allowing a Redshift cluster to access other AWS services, see this documentation.

COPY diabetes_data FROM 's3://<S3 file path>' IAM_ROLE '<RedshiftClusterRoleArn>'
DELIMITER ',' IGNOREHEADER 1;

/* If you are using the default IAM role with your cluster, you can replace the ARN with default, as below. */

COPY diabetes_data FROM 's3://<S3 file path>' IAM_ROLE default
DELIMITER ',' IGNOREHEADER 1;

The loaddata.sql script is executed for the data loading step.

Modify the data set in Amazon Redshift

The next step is to make some changes to the data set to make it less noisy and suitable for the ML model that you create later. There are various things you can do as part of this clean up, such as updating incomplete values and grouping attributes into categories. For example, age can be grouped into young, adult or old based on age ranges.
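The post performs this clean up in SQL inside Amazon Redshift. Purely to illustrate the bucketing logic, here is a small JavaScript sketch. It assumes that age values arrive as ten-year brackets such as "[70-80)", and the cut-offs of 30 and 60 are hypothetical choices of mine, not the ones used in ModifyData.sql.

```javascript
// Map an age bracket such as "[40-50)" to a coarse category.
// The cut-offs (30 and 60) are illustrative assumptions.
function ageGroup(ageRange) {
  const match = /^\[(\d+)-(\d+)\)$/.exec(ageRange);
  if (!match) {
    return null; // unknown or malformed value
  }
  const lowerBound = Number(match[1]);
  if (lowerBound < 30) {
    return "young";
  }
  if (lowerBound < 60) {
    return "adult";
  }
  return "old";
}

console.log(ageGroup("[0-10)"));  // young
console.log(ageGroup("[40-50)")); // adult
console.log(ageGroup("[70-80)")); // old
```

Collapsing ten brackets into three categories gives the model fewer, better-populated values to learn from, at the cost of some resolution.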

For the target attribute for your ML model, create a custom attribute called readmission_result, with a value of “Yes” or “No” based on conditions in the readmitted attribute. To see all the changes made to the data, see the ModifyData.sql script.
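As a rough sketch of how such a target attribute can be derived, the function below maps a readmitted value to "Yes" or "No". It assumes the readmitted column holds the values "<30", ">30", and "NO", and that only a readmission within 30 days counts as "Yes". Both are assumptions on my part; the actual condition is the one defined in ModifyData.sql.

```javascript
// Derive the binary target from the readmitted attribute.
// Assumption: only "<30" (readmitted within 30 days) maps to "Yes".
function readmissionResult(readmitted) {
  return readmitted === "<30" ? "Yes" : "No";
}

const sampleValues = ["<30", ">30", "NO"];
sampleValues.forEach(function (value) {
  console.log(value + " -> " + readmissionResult(value));
});
```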

Finally, the complete modified data set is dumped into a new table, diabetes_data_modified, which acts as a source for the ML model. Notice the new custom column readmission_result, which is your target attribute for the ML model.

Create a data source for Amazon ML and build the ML model

Next, create an Amazon ML data source, choosing Amazon Redshift as the source. This can be easily done through the console or through the CreateDataSourceFromRedshift API operation by specifying the Redshift parameters like Cluster Name, Database Name, username, password, role and the SQL query. The IAM role for Amazon Redshift as a data source is easily populated, as shown in the screenshot below.

You need the entire data set for the ML model, so use the following query for the data source:

SELECT * FROM diabetes_data_modified

This can be modified with column names and WHERE clauses to build different data sets for training the ML model.

The steps to create a binary classification ML model are covered in detail in the Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift blog post.

Amazon ML provides two types of predictions that you can try. The first one is a batch prediction that can be generated through the console or the CreateBatchPrediction API operation. The result of the batch prediction is stored in an Amazon S3 bucket and can be used to build reports for end users (such as a monthly report of actual versus predicted values).

You can also use the ML model to generate a real-time prediction. To enable real-time predictions, create an endpoint for the ML model either through the console or using the CreateRealTimeEndpoint API operation.

After it’s created, you can query this endpoint in real time to get a response from Amazon ML, as shown in the following CLI screenshot.

Build the end user application

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

You can create such an application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on the traffic routed to your application. The following is a screenshot from the application:

The application allows end users to select certain patient parameters and then makes a call to the predict API. The results are displayed in real time in the results pane.

I made use of the AWS SDK for JavaScript to build this application. The SDK can be added to your script using the following code:

<script src=""></script>

Use Amazon Cognito for secure access

To authenticate the Amazon ML API request, you can use Amazon Cognito, which allows secure access to the Amazon ML endpoint without embedding AWS security credentials in the application. To enable this, create an identity pool in Amazon Cognito.

Amazon Cognito creates a new role in IAM. You need to allow this new IAM role to interact with Amazon ML by attaching the AmazonMachineLearningRealTimePredictionOnlyAccess policy to the role. This IAM policy allows the application to query the Amazon ML endpoint.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["machinelearning:Predict"],
      "Resource": "*"
    }
  ]
}

Next, initialize credential objects, as shown in the code below:

var parameters = {
    AccountId: "AWS Account ID",
    RoleArn: "ARN for the role created by Amazon Cognito",
    IdentityPoolId: "The identity pool ID created in Amazon Cognito"
};
// set the Amazon Cognito region
AWS.config.region = 'us-east-1';
// initialize the Credentials object with the parameters
AWS.config.credentials = new AWS.CognitoIdentityCredentials(parameters);

Call the Amazon ML endpoint using the API

Create the function callApi() to make a call to the Amazon ML endpoint. The steps in the callApi() function involve building the object that forms a part of the parameters sent to the Amazon ML endpoint, as shown in the code below:

var machinelearning = new AWS.MachineLearning({apiVersion: '2014-12-12'});
var params = {
    MLModelId: '<ML model ID>',
    PredictEndpoint: '<ML model real-time endpoint>',
    // Record holds the attribute name/value pairs selected by the end user
    Record: {}
};
var request = machinelearning.predict(params);
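The object mentioned above is the record of patient attributes. As a minimal sketch of how it could be assembled from the values selected in the page, the helper below converts every selection to a string, because the predict API takes the record as a map of strings to strings, and skips empty selections. The helper name buildPredictParams and the attribute names in the usage example are hypothetical; use the column names of your own data source.

```javascript
// Build the params object passed to machinelearning.predict().
// Record values are sent as strings, so each selection is converted with String().
function buildPredictParams(modelId, endpointUrl, selections) {
  const record = {};
  Object.keys(selections).forEach(function (name) {
    const value = selections[name];
    // Leave out attributes the user did not fill in.
    if (value !== undefined && value !== null && value !== "") {
      record[name] = String(value);
    }
  });
  return {
    MLModelId: modelId,
    PredictEndpoint: endpointUrl,
    Record: record
  };
}

// Hypothetical attribute names and values, for illustration only.
const exampleParams = buildPredictParams(
  "<ML model ID>",
  "<ML model real-time endpoint>",
  { age: "old", num_medications: 12, race: "" }
);
console.log(JSON.stringify(exampleParams.Record)); // {"age":"old","num_medications":"12"}
```

The resulting object can be passed directly to machinelearning.predict().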

The API call returns a JSON object that includes, among other things, the predictedLabel and predictedScores parameters, as shown in the code below:

{
    "Prediction": {
        "details": {
            "Algorithm": "SGD",
            "PredictiveModelType": "BINARY"
        },
        "predictedLabel": "1",
        "predictedScores": {
            "1": 0.5548262000083923
        }
    }
}

The predictedScores parameter contains a score between 0 and 1 for the predicted label, which you can convert into a percentage:

			finalScore = Math.round(predictedScore * 100);
			resultMessage = finalScore + "%";
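Putting the last two steps together, the following sketch pulls the label and score out of a response shaped like the sample above and formats the score as a percentage. The helper name formatPrediction is hypothetical, and the code assumes that predictedScores is keyed by the predicted label, as it is in the sample.

```javascript
// Extract the predicted label and its score from a predict() response.
function formatPrediction(response) {
  const prediction = response.Prediction;
  const label = prediction.predictedLabel;
  const score = prediction.predictedScores[label];
  return {
    label: label,
    percent: Math.round(score * 100) + "%"
  };
}

// Response copied from the sample shown earlier.
const sampleResponse = {
  Prediction: {
    details: { Algorithm: "SGD", PredictiveModelType: "BINARY" },
    predictedLabel: "1",
    predictedScores: { "1": 0.5548262000083923 }
  }
};
console.log(formatPrediction(sampleResponse)); // { label: '1', percent: '55%' }
```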

The complete code for this sample application is uploaded to the PredictReadmission_AML GitHub repo for reference and can be used to create more sophisticated machine learning applications using Amazon ML.


The power of machine learning opens new avenues for advanced analytics in healthcare. With new means of gathering data that range from sensors mounted on medical devices to medical images and everything in between, the complexities demonstrated by these varied data sets are pushing the boundaries of conventional analysis techniques.

The advent of cloud computing has made it possible for researchers to take up the challenging task of synthesizing these data sets and drawing insights that were previously out of reach.

We are still at the beginning of this journey and there are, of course, challenges that we have to overcome. The availability of quality data sets, which is the starting point of any good analysis, is still a major hurdle. Regulations like the Health Insurance Portability and Accountability Act of 1996 (HIPAA) make it difficult to obtain medical records with Protected Health Information (PHI). The good news is that this is changing with initiatives like AWS Public Data Sets, which hosts a variety of public data sets that anyone can use.

At the end of the day, all this analysis and research is for one cause: to improve the quality of human lives. I hope this is, and will continue to be, the greatest motivation to overcome any challenge.

If you have any questions or suggestions, please comment below.
_ _ _ _ _

Do you want to be part of the conversation? Join AWS developers, enthusiasts, and healthcare professionals as we discuss building smart healthcare applications on AWS in Seattle on August 31.

Seattle AWS Big Data Meetup (Wednesday, August 31, 2016)

