AWS for Industries

Predict patient health outcomes using OHDSI and machine learning on AWS

Build a machine learning model to predict the likelihood of stroke in a patient with newly diagnosed atrial fibrillation

In healthcare, patient outcome prediction is a critical step in improving the effectiveness of care delivery while reducing its overall cost.  Being able to accurately forecast what will happen next to patients, at scale, is key to improving population and individual patient health.  Health prediction is also important as our healthcare reimbursement models move their focus from services delivered to overall value.  The process of patient-level prediction is built on observational health data and the ability to use advanced analytics and machine learning technology.

Attaining these prediction capabilities is important to many segments of healthcare: providers who are accountable for the care of their patient populations, payers seeking opportunities to intervene and influence their members’ health decisions, life sciences organizations focused on producing real-world evidence to drive the development of new products, and researchers using prediction to better understand relationships between phenotypes.  Our ability to harness large observational health databases through analytics and machine learning, powered by a massive increase in computational power, opens up unprecedented opportunities to improve patient care.


The Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) initiative and community are working toward the goal of improving outcomes for patients by producing data standards and open-source tools to store and analyze observational health data. The OHDSI community includes hundreds of collaborators from commercial, academic, non-profit, and government healthcare organizations.  These collaborators produce and maintain the open-source toolset used by thousands of healthcare professionals globally.

Using the OHDSI tools, you can visualize the health of a population, identify and analyze cohorts of patients based on complex criteria, analyze and visualize the treatment pathways taken for a condition, analyze incidence rates for various conditions, and estimate the effect of treatments on patients with certain conditions. You can also train machine learning models to predict patient health outcomes.

In this blog post, we walk through how to use the OHDSIonAWS architecture and automation to deploy a full-featured, enterprise class, scalable and secure OHDSI architecture on AWS.  We then train and apply a prediction model using synthetic observational health data.  Below is a dashboard produced by OHDSI Achilles showing summary demographics of the sample population we are using for this study.  It visualizes population statistics for the 2.3M person De-SynPUF dataset provided by CMS and synthesized from de-identified Medicare claims data.  In order to follow along with this blog post, you’ll need an AWS account with permissions to deploy an Amazon Virtual Private Cloud (VPC), AWS Elastic Beanstalk, Amazon Redshift, and Amazon Relational Database Service (RDS).

OHDSI application architecture on AWS

Before deploying an application on AWS that transmits, processes, or stores protected health information (PHI) or personally identifiable information (PII), address your organization’s compliance concerns. Make sure that you have worked with your internal compliance and legal team to ensure compliance with the laws and regulations that govern your organization. To understand how you can use AWS services as a part of your overall compliance program, see the AWS HIPAA Compliance whitepaper. With that said, we paid careful attention to the HIPAA control set during the design of this solution.

OHDSI Project | Description
Atlas | A web interface that allows you to perform complex patient and population studies on OMOP CDM databases.
OMOP CDM | The Observational Medical Outcomes Partnership Common Data Model is an open source, industry standard data model for capturing patient-level observational health data.
Athena | A collection of medical ontologies, or “vocabularies,” that are mapped to the OMOP CDM format.
Achilles | Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES): descriptive statistics and data quality checks on an OMOP CDM database.
PatientLevelPrediction | An R package that builds and validates patient-level predictive models using data in the OMOP Common Data Model format.
Population Level Effect Estimation | An R package that estimates and compares the effect of medical treatments on patient outcomes.

The OHDSIonAWS architecture implements the OMOP Common Data Model on Amazon Redshift and the Atlas web application in Elastic Beanstalk.  Many of the more advanced OHDSI tools are provided as R libraries, so an RStudio and Jupyter Notebooks server running on an EC2 instance is also deployed.  RStudio is an integrated development environment (IDE) for working with R. It can be licensed either commercially or under AGPLv3.  This RStudio server is built with the OHDSI Patient Level Prediction, Population Level Effect Estimation, and many other libraries pre-installed and ready to use with the observational health data in Amazon Redshift.  Below is a block diagram of this system.  More details and documentation are available in the OHDSIonAWS GitHub repository.  There is also a condensed deployment of the OHDSI toolset for AWS called OHDSI-in-a-Box that is useful for personal learning and training environments.

To use the OHDSI toolset, you must have an OMOP Common Data Model (CDM) formatted data source.  This data typically comes from electronic medical records, claims, labs, and other sources and is used privately by an organization or group of collaborators.  The OHDSIonAWS architecture allows you to load your own data, but for learning, training, and demonstration purposes, it also allows you to deploy sample datasets of synthetic, public domain data from CMS DE-SynPUF and the Synthea project.  These sample datasets are each offered in sizes of one thousand, one hundred thousand, and over two million patients.

Using your data in OHDSI

The OHDSI tool set is built to work with data in the OMOP Common Data Model (CDM), an industry standard, open source data model used for observational health data.  From the OHDSI website:

“The OMOP Common Data Model allows for the systematic analysis of disparate observational databases. The concept behind this approach is to transform data contained within those databases into a common format (data model) as well as a common representation (terminologies, vocabularies, coding schemes), and then perform systematic analyses using a library of standard analytic routines that have been written based on the common format.”

To use the OHDSI tools on your own observational health data, you must first transform it into the OMOP format.  This is done through a process called extract, transform, and load, commonly called ETL for short.  OHDSI provides several tools for mapping and transforming your patient health data into OMOP format, and you can learn more about this process by looking at the OHDSI ETL best practices Wiki and slides from an OMOP ETL training session.  The OHDSI ETL toolset is also included in the OHDSI-in-a-Box image, which helps you quickly deploy and get started using them.
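
As a rough illustration of the transform step, the sketch below maps a tiny, made-up source table onto a few columns of the OMOP CDM person table.  The source column names (patient_key, sex, dob) and the toOmopPerson helper are hypothetical and exist only for this example; the gender concept IDs 8507 (male) and 8532 (female) are the standard OMOP concepts.  A real ETL also has to handle vocabularies, visits, conditions, and data quality checks, which is what the OHDSI ETL tools are designed to help with.

```r
# Hypothetical source extract; the column names are assumptions for illustration.
source_patients <- data.frame(
  patient_key = c("A-001", "A-002", "A-003"),
  sex = c("F", "M", "U"),
  dob = as.Date(c("1948-03-14", "1951-11-02", "1939-07-30")),
  stringsAsFactors = FALSE
)

# Map source rows onto a subset of the OMOP CDM person table columns.
toOmopPerson <- function(src) {
  # Standard OMOP gender concepts: 8507 = male, 8532 = female.
  # Anything else is mapped to 0 (no matching concept).
  gender_map <- c(M = 8507L, F = 8532L)
  gender_concept_id <- unname(gender_map[src$sex])
  gender_concept_id[is.na(gender_concept_id)] <- 0L
  data.frame(
    person_id = seq_len(nrow(src)),
    gender_concept_id = gender_concept_id,
    year_of_birth = as.integer(format(src$dob, "%Y")),
    person_source_value = src$patient_key,
    gender_source_value = src$sex,
    stringsAsFactors = FALSE
  )
}

person <- toOmopPerson(source_patients)
print(person)
```

Keeping the original identifiers in the source value columns makes it possible to trace an OMOP row back to the record it came from.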

Once you have your health data transformed into the OMOP Common Data Model, you can use the instructions in the OHDSIonAWS GitHub repository to automatically load it as you deploy your OHDSIonAWS environment.

Deploying OHDSIonAWS

The entire OHDSIonAWS architecture is automatically deployed by using an AWS CloudFormation template.  In the OHDSIonAWS GitHub repository you can find links to deploy OHDSIonAWS in various AWS Regions, full documentation of all of the parameters, a guide for on-going operation, and the source code for all of the CloudFormation templates.  When you deploy your OHDSIonAWS environment, you must include both the CMSDESynPUF23m and the CMSDESynPUF100k datasets to follow the rest of this blog post.  This provides you with the 2.3 million and 100 thousand patient versions of the CMS DE-SynPUF dataset, respectively.  You can include other data sources as well if you like.  Also, be sure that you choose the option to include ‘example studies’ in Atlas.

Once the CloudFormation template deployment completes, you are able to access your Atlas and RStudio environments using the URL provided in the CloudFormation Outputs tab of the parent stack, as shown below.

It is also possible to follow along with this study using OHDSI-in-a-Box; however, you need to use the smaller SynPUF100k data source for both training and prediction.

Building your health outcome prediction ML model

The OHDSI toolset allows you to execute many types of analysis on OMOP-formatted patient datasets, including input data quality analysis, patient cohort identification, patient cohort characterization, patient treatment pathways, incidence rates, and population-level effect estimation.  For this blog post, however, we focus on OHDSI’s patient-level outcome prediction capabilities.

Now that your environment is deployed, you can begin using the OHDSI tools to develop patient-level prediction models.  In order to do that, you first must define the prediction problem by specifying three elements.

  • The group of patients you want to be able to apply the prediction model to, termed the target population. An example target population is patients newly diagnosed with atrial fibrillation.  The target population also defines an index date for the prediction. In this example, the index date is the date of the initial atrial fibrillation diagnosis.
  • The outcome you wish to predict. An example of an outcome is stroke.
  • The time period, relative to the index date, in which you want to predict the occurrence of the outcome, termed the time-at-risk (TAR).  An example of TAR is 1 day until 365 days following index (that is, in the next year).
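
To make the three elements concrete, here is a minimal sketch that records them in a plain R list and checks whether an outcome date falls inside the time-at-risk window.  The predictionProblem list, the timeAtRisk and inTimeAtRisk helpers, and the dates are all hypothetical and are not part of the OHDSI packages; in the actual study these settings are captured in Atlas.

```r
# Hypothetical structure describing the prediction problem; names are illustrative.
predictionProblem <- list(
  target = "Patients newly diagnosed with atrial fibrillation",
  outcome = "Ischemic stroke",
  riskWindowStart = 1,   # days after the index date
  riskWindowEnd = 365    # days after the index date
)

# Compute the calendar window in which an outcome counts for a given index date.
timeAtRisk <- function(indexDate, problem) {
  list(
    start = indexDate + problem$riskWindowStart,
    end = indexDate + problem$riskWindowEnd
  )
}

# Does an outcome date fall inside the time-at-risk window?
inTimeAtRisk <- function(outcomeDate, indexDate, problem) {
  window <- timeAtRisk(indexDate, problem)
  outcomeDate >= window$start && outcomeDate <= window$end
}

indexDate <- as.Date("2009-03-01")  # made-up diagnosis date
window <- timeAtRisk(indexDate, predictionProblem)
print(window)
print(inTimeAtRisk(as.Date("2009-09-15"), indexDate, predictionProblem))
print(inTimeAtRisk(as.Date("2010-06-01"), indexDate, predictionProblem))
```

An outcome that occurs on the index date itself, or more than 365 days later, falls outside the window and would not count as a positive for this prediction problem.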

We use the OHDSI toolset to identify target and outcome patient cohorts within our sample data.  Then, we generate a machine learning model that can predict which patients are at risk of moving from the target cohort to the outcome cohort during a particular TAR.  For an overview of the types of prediction problems that can be handled with the OHDSI Patient-Level Prediction framework, see chapter 14 of the Book of OHDSI.

For this example, we follow a sample study from the vignette Building patient-level predictive models by Jenna Reps, Martijn J. Schuemie, Patrick B. Ryan, and Peter R. Rijnbeek.  This study aims to predict whether a patient with atrial fibrillation will have an Ischemic Stroke.  From the document:

“We will apply the PatientLevelPrediction package to observational healthcare data to address the following patient-level prediction question: Amongst patients who are newly diagnosed with Atrial Fibrillation, which patients will go on to have Ischemic Stroke within 1 year?”

For more information on patient-level prediction with OHDSI, see the tutorial videos on the OHDSI website.  Also, for an introduction to the concept of using a standardized framework for predictions, see Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data by Jenna Reps, Martijn Schuemie, Marc Suchard, Patrick Ryan, and Peter Rijnbeek. The process we will follow is outlined in the diagram below.

We define this prediction study using the Atlas web interface.  Choose the link for Atlas provided in the Outputs tab of your master stack in the AWS CloudFormation Management Console.  If you chose to enable Atlas authentication during deployment, then choose the sign in link in the top-right corner of Atlas and log in with one of the user names and passwords you provided.

As a part of the OHDSIonAWS deployment, various example studies have been included to help users get started learning the OHDSI toolset.  On the left side of the Atlas interface, choose Prediction and then choose the CHADS2 study.

As you can see, part of defining a prediction study is specifying one or more target and outcome cohorts.  The cohort definitions shown here have been imported with the study, but we need to apply these definitions to our data sources to identify the patients that meet the criteria before actually executing the prediction study.  The process of identifying these patients is called ‘cohort generation.’

Choose Cohort Definitions on the left menu and then choose the target cohort (T: patients who are newly diagnosed with Atrial fibrillation).  This is the target cohort for our prediction study, meaning that they are the cohort of patients for which we want to make a prediction.

On the Definition tab, you can see the criteria being used to define our cohort of patients with atrial fibrillation as shown in the image below.  You can learn to use Atlas to create cohorts like this by watching the Cohort Definition and Phenotyping tutorial on the OHDSI website.

Now, we need to use this definition to identify the patients within our data sources that meet these criteria.  Below is the cohort generation process.

  • Choose the Generation tab.
  • Choose the Generate buttons next to the CMSDESynPUF23m data source and next to the CMSDESynPUF100k data source.
  • Wait for generation to finish. This process takes a few minutes and identifies all of the patients within the CMS DE-SynPUF 2.3 million patient dataset and the CMS DE-SynPUF 100,000 person dataset that meet the criteria for atrial fibrillation. Once it’s completed, you’ll see the number of patients meeting these criteria that were identified from each dataset.
  • Now, choose the blue ‘x’ icon in the top right to close this cohort.

Repeat the same cohort generation process, with the CMSDESynPUF23m and CMSDESynPUF100k data sources, for the Ischemic stroke cohort.

Now, we can download the prediction package for our study and execute it in RStudio.

  • Choose Prediction on the left menu
  • Choose the CHADS2 study
  • Choose the Utilities tab
  • Choose the Review & Download button
  • Scroll to the bottom of the page and inside the Download Study Package box give your study package a name like ‘CHADS2’ as shown in the image below. The study package contains the ML algorithm selection, parameter definitions, and OMOP data pointers we need to generate a machine learning model to predict whether patients with Atrial Fibrillation will have an Ischemic Stroke.
  • Choose the Download button. This downloads a ZIP formatted R package to your local computer.

We continue the rest of the prediction process in RStudio.  Follow the link for RStudio provided in the Outputs tab of your master stack in the AWS CloudFormation Management Console.  Now log into RStudio using a user name and password you provided as a parameter when you deployed your OHDSIonAWS environment.  You are greeted with the RStudio Server user interface, as shown below.

In the file browser in the lower-right, choose the New Folder button and create a folder called PLP.  You may have to choose the refresh icon on the right side of the files window to see the new folder after you create it.  Now, choose the PLP folder to open it, then choose the Upload button in the file browser and choose the prediction study package we downloaded earlier.  It has a file name similar to

After this file is uploaded, it will automatically be extracted into the PLP directory.  Now, choose the file called CHADS2.Rproj (if you named your PLP project ‘CHADS2’), then confirm that you want to open the project by choosing Yes, and then decline to save the current workspace image by choosing Don’t Save.

Now, we need to build and install the custom PatientLevelPrediction R library that was generated by Atlas to support our study.  This library contains R code that uses our chosen cohorts, ML algorithms, and parameters to call the OHDSI PatientLevelPrediction library. Choose the Build tab at the top right, and then choose the Install and Restart button.  This will install and load the R library for our study, as shown below.

Once the library is loaded, choose the Home directory icon and then choose the file called ConnectionDetails.R to open it.  This file contains all of the connection information for your Amazon Redshift data warehouse and all of the OMOP schemas you deployed.  We use this information to tell the OHDSI R libraries how to read from our OMOP database in Amazon Redshift.

Now, choose the PLP folder again, then the folder called extras, and then the file inside of it called CodeToRun.R.  At the top of this file, you find USER INPUTS that we need to populate for our study code to run correctly.  Use the information contained in the ConnectionDetails.R file to populate the “Details for connecting to the server” variables as shown below.  The cdmDatabaseSchema and cohortDatabaseSchema should match the name of the data source on which we want to train our ML model, in our case, CMSDESynPUF23m.  Your user inputs should look similar to the below code.

# The folder where the study intermediate and result files will be written:
outputFolder <- "./CHADS2Results"

# Specify where the temporary files (used by the ff package) will be created:
options(fftempdir = "~/fftemp/")

# Details for connecting to the server:
dbms <- "redshift"
user <- 'master'
pw <- 'yourOHDSIpassword1'
server <- ''
port <- '5439'

connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms = dbms,
  server = server,
  user = user,
  password = pw,
  port = port
)

# Add the database containing the OMOP CDM data
cdmDatabaseSchema <- 'CMSDESynPUF23m'
# Add a shareable name for the database containing the OMOP CDM data
cdmDatabaseName <- 'mycdm'
# Add a database with read/write access as this is where the cohorts will be generated
cohortDatabaseSchema <- 'CMSDESynPUF23mresults'
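
Because the same inputs are reused later with a different data source, it can help to derive the schema names in one place and to check that no connection setting was left blank.  The sketch below is only an illustration: schemasFor and checkInputs are hypothetical helpers, and the assumption that the results schema is the data source name followed by "results" is inferred from the two examples in this post.

```r
# Hypothetical helper: derive the schema names used in this post from a data source name.
# Assumes the results schema is the data source name plus "results".
schemasFor <- function(dataSource) {
  list(
    cdmDatabaseSchema = dataSource,
    cohortDatabaseSchema = paste0(dataSource, "results")
  )
}

# Hypothetical helper: fail early if a required connection setting was left blank.
checkInputs <- function(server, user, pw) {
  values <- c(server = server, user = user, pw = pw)
  blank <- names(values)[!nzchar(values)]
  if (length(blank) > 0) {
    stop("Please fill in: ", paste(blank, collapse = ", "))
  }
  invisible(TRUE)
}

training <- schemasFor("CMSDESynPUF23m")
print(training$cdmDatabaseSchema)
print(training$cohortDatabaseSchema)
```

Calling checkInputs(server, user, pw) before creating the connection details would stop the script with a readable message if, for example, the server value was still an empty string.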

Now that we’ve configured the User Inputs, we need to tell PatientLevelPrediction what kind of outputs we want it to produce.  In addition to an ML model for prediction, the OHDSI PatientLevelPrediction package can also generate a variety of useful documentation, as detailed in section 14.6 of the Book of OHDSI.  Set the parameters as shown below and uncomment the viewShiny() line.  These ‘T’ or ‘F’ parameters indicate the kinds of outputs that you want from PatientLevelPrediction.  This configuration indicates that we want to output our ML model as well as an RShiny dashboard of its performance and text documents describing the details of the model.

execute(connectionDetails = connectionDetails,
        cdmDatabaseSchema = cdmDatabaseSchema,
        cdmDatabaseName = cdmDatabaseName,
        cohortDatabaseSchema = cohortDatabaseSchema,
        cohortTable = cohortTable,
        outputFolder = outputFolder,
        createProtocol = T,
        createCohorts = T,
        runAnalyses = T,
        createResultsDoc = T,
        packageResults = F,
        createValidationPackage = F,
        minCellCount = 5,
        createShiny = T,
        createJournalDocument = T,
        analysisIdDocument = 1)

# if you ran execute with: createShiny = T
# Uncomment and run the next line to see the shiny app:
viewShiny()

Once you have filled in the correct parameters, press CTRL-A on Windows or CMD-A on macOS to select all of the code.  Then press the Run button in the upper-right area of the code editor window.  Now the PatientLevelPrediction OHDSI package begins training the three machine learning models we specified in our study.  All of these use a Lasso Logistic regression model with the same model parameters. They are labeled Analysis_1, Analysis_2, and Analysis_3.  Each of these models will be trained on the same 245,787 person target population of newly diagnosed patients with atrial fibrillation and 482,100 person outcome population of patients with Ischemic stroke events, but the three models differ in the number of features, or “covariates,” they consider.  Analysis_1 considers the fewest covariates, Analysis_2 considers a few more, and Analysis_3 considers the most.  You can see the details of the covariates included in each model by looking at the PatientLevelPrediction study definition in Atlas.  You will also see them in the model dashboard that’s output from the PatientLevelPrediction training process.  Once the models are trained, we are able to compare and understand how the inclusion of additional covariates impacts the overall performance of our ML models.
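
The idea of comparing models that differ only in their covariates can be sketched with base R on simulated data.  This is not the study's method: it swaps in plain, unpenalized logistic regression (glm) for lasso, and the covariates, effect sizes, and patient counts are invented purely for illustration.

```r
# Simulated data only: three covariates with made-up effects on a binary outcome.
set.seed(42)
n <- 2000
age <- rnorm(n, mean = 70, sd = 8)
hypertension <- rbinom(n, 1, 0.5)
diabetes <- rbinom(n, 1, 0.3)
logit <- -3 + 0.03 * (age - 70) + 0.8 * hypertension + 0.6 * diabetes
stroke <- rbinom(n, 1, plogis(logit))
patients <- data.frame(stroke, age, hypertension, diabetes)

# Plain (unpenalized) logistic regression as a stand-in for lasso.
# Each model adds covariates to the previous one.
fit1 <- glm(stroke ~ age, family = binomial, data = patients)
fit2 <- glm(stroke ~ age + hypertension, family = binomial, data = patients)
fit3 <- glm(stroke ~ age + hypertension + diabetes, family = binomial, data = patients)

# Residual deviance: lower means a closer fit to the training data.
deviances <- c(analysis1 = deviance(fit1),
               analysis2 = deviance(fit2),
               analysis3 = deviance(fit3))
print(deviances)
```

A closer fit on the training data does not by itself mean better predictions on new patients, which is why the study also evaluates the trained models on held-out and external data.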

The training process takes about 35 minutes using a single dc2.large Amazon Redshift node and a t3.xlarge instance for RStudio.  This performance is based on a default configuration. You can adjust parameters like the number of Amazon Redshift nodes and the RStudio instance size to arrive at the right balance of cost vs performance.

After the training is complete, you should see a window pop-up that contains the results of the machine learning model training.  If you don’t, check to see if your browser’s pop-up blocker prevented it.  This RShiny application allows us to browse important information about the models we’ve just trained.  On the Summary screen, in the Results tab, shown below, first pick Analysis_3.  Analysis_3 is the machine learning model that considered the largest number of covariates.

Now, choose the Performance screen on the left side, and then select the Discrimination tab.  Here you can see the ROC plot, the Precision/Recall plot, and other graphs that help us understand the model’s overall performance, as shown below.  If you are interested in seeing how the performance of each model varies, return to the Summary tab, select a different model, and look at its performance graphs.  You can see these models increase in accuracy as we consider a greater number of covariates.
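
The ROC plot is commonly summarized by the area under the curve (AUC): the probability that a randomly chosen patient who had the outcome received a higher score than a randomly chosen patient who did not.  The sketch below computes it in base R with the rank-sum formulation.  The aucFromScores helper and the scores and labels are made up for illustration and are not output from the study.

```r
# Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.
aucFromScores <- function(scores, labels) {
  positives <- labels == 1
  nPos <- sum(positives)
  nNeg <- sum(!positives)
  ranks <- rank(scores)  # average ranks handle tied scores
  (sum(ranks[positives]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Made-up predicted risks and observed outcomes (1 = stroke, 0 = no stroke).
scores <- c(0.05, 0.10, 0.15, 0.22, 0.30, 0.35, 0.40, 0.60)
labels <- c(0,    0,    1,    0,    0,    1,    1,    1)
auc <- aucFromScores(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two groups.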

Our machine learning model outputs the probability of the outcome, a value between 0 and 1, when we input data about a patient who meets the criteria of having recently been diagnosed with atrial fibrillation.  To make a decision, we have to select a threshold.  If the predicted score for a patient meets or exceeds this threshold value, we interpret that as a predicted positive, and if it’s below our threshold, we interpret it as a predicted negative.  Next, return to the Summary tab (top) while remaining on the Performance screen (left).  Here you can see a variety of performance measures relative to a given Threshold value.

Move the Threshold value slider until the threshold value reaches approximately 0.20.  From the performance dashboard, you can see that this threshold produces a Positive Predictive Value (PPV) of about 24%, as shown below.  This is roughly one and a half times the Incidence of about 15%.  This means that, using the ML model produced by Analysis_3, a newly diagnosed atrial fibrillation patient with a prediction score greater than 0.20 is at about 50% higher risk of having an Ischemic stroke than the average newly diagnosed atrial fibrillation patient in this population.
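
The relationship between threshold, PPV, and incidence can be reproduced with a few lines of base R.  The thresholdSummary helper and the 20 scores and labels below are made up for illustration; they are not the study's data, so the resulting numbers differ from the dashboard.

```r
# Summarize how a score threshold performs on a set of scored patients.
thresholdSummary <- function(scores, labels, threshold) {
  predictedPositive <- scores >= threshold
  truePositives <- sum(predictedPositive & labels == 1)
  ppv <- truePositives / sum(predictedPositive)
  incidence <- mean(labels == 1)
  list(
    ppv = ppv,
    sensitivity = truePositives / sum(labels == 1),
    incidence = incidence,
    lift = ppv / incidence  # risk among predicted positives relative to the cohort average
  )
}

# Illustrative values only: 20 patients, 3 of whom had the outcome.
scores <- c(0.02, 0.03, 0.05, 0.06, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13,
            0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.21, 0.25, 0.30, 0.40)
labels <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
            0, 0, 0, 0, 0, 0, 0, 1, 0, 1)
summaryAt20 <- thresholdSummary(scores, labels, threshold = 0.20)
str(summaryAt20)
```

Applying the same ratio to the figures quoted above gives 0.24 / 0.15, or about 1.6, which is where the "roughly one and a half times" comparison comes from.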

Using your ML model to make health outcome predictions

Now that we’ve created our prediction model, understood its performance, and chosen a threshold value, let’s apply it to a separate patient dataset and get predictions for each patient.  Close the Multiple PLP Viewer window and paste the below R code at the bottom of your CodeToRun.R file.

# External Validation with CMS SynPUF 100k persons dataset

# Add the database containing the OMOP CDM data
cdmDatabaseSchema <- 'CMSDESynPUF100k'
# Add a shareable name for the database containing the OMOP CDM data
cdmDatabaseName <- 'mycdm'
# Add a database with read/write access as this is where the cohorts will be generated
cohortDatabaseSchema <- 'CMSDESynPUF100kresults'

plpResult <- PatientLevelPrediction::loadPlpResult("~/PLP/CHADS2Results/Analysis_3/plpResult")

results <- PatientLevelPrediction::externalValidatePlp(plpResult = plpResult,
                                                       validationSchemaTarget = cohortDatabaseSchema,
                                                       validationSchemaOutcome = cohortDatabaseSchema,
                                                       validationSchemaCdm = cdmDatabaseSchema,
                                                       databaseNames = cdmDatabaseName,
                                                       validationTableTarget = "cohort",
                                                       validationTableOutcome = "cohort",
                                                       validationIdTarget = NULL,
                                                       validationIdOutcome = NULL,
                                                       oracleTempSchema = NULL,
                                                       verbosity = "DEBUG",
                                                       keepPrediction = T,
                                                       sampleSize = NULL)

threshold <- 0.20231
prediction <- results$validation$mycdm$prediction
# Print every patient whose predicted risk meets or exceeds the threshold
for (i in seq_len(nrow(prediction))) {
  if (prediction$value[i] >= threshold) {
    print(paste0("Person ID: ", prediction$subjectId[i], "     Value: ", prediction$value[i]))
  }
}

We now use the machine learning model we trained using the CMS SynPUF 2.3 million patient dataset to make predictions about the CMS SynPUF 100,000 patient dataset.  This demonstrates the generalization of our ML model, meaning that we can apply our ML model trained on one dataset to a patient population represented by another dataset.  Select the newly pasted code in RStudio and choose the Run button.

The OHDSI PatientLevelPrediction library will now use the same cohort definition for atrial fibrillation we defined earlier to identify patients that meet these criteria within the CMSDESynPUF100k data source. It then makes a prediction about their risk of having an Ischemic stroke using the ML model we created earlier.  For each patient in the atrial fibrillation cohort, it assigns a score between 0 and 1.  We then use the threshold we determined earlier, 0.2, to identify the patients whom the model identifies as having an approximately 50% higher risk of Ischemic stroke than other atrial fibrillation patients.  You can see that the model is able to identify 3 such patients out of a population of 3,804 total patients, as shown below.

Terminating the OHDSI environment

Once you’re finished performing your analyses, it’s easy to terminate the OHDSIonAWS environment.  Simply go to the AWS CloudFormation Management Console, select the parent Stack (the one without ‘NESTED’ above its name), and choose the Delete button as shown in the image below.  This will begin the process of deleting all of the OHDSI resources from your AWS account.

Cost of deploying this environment

The smallest environment for OHDSIonAWS can be run for a reasonable monthly cost and is able to store and analyze about 150 GB of observational health data.  OHDSIonAWS can also be scaled up to handle dozens of concurrent users analyzing petabytes of observational health data.  These OHDSI environments don’t have to be long-running.  You can use the OHDSIonAWS automation to deploy environments in just a couple of hours, develop and run your analysis over a few days, save your results, and then terminate the entire environment.  This elastic, cloud model is able to yield tremendous cost savings over running similar infrastructure within a traditional data center.


The OHDSI toolset, powered by the OMOP Common Data Model, provides an incredible resource for analysis of observational health data. And the free, open source nature of OHDSI ensures that their tools are available to everyone.  This also makes it possible to share models across data sites and therefore perform extensive external validation.  The cloud has also democratized the underlying technology needed to power tools like OHDSI, removing economic barriers and making it available to organizations of any size, and even to individuals.  Cloud data, analytics, and compute services, along with automation, also remove technical knowledge barriers enabling the quick deployment of OHDSI without needing an IT background.

The applications of machine learning models generated by the PatientLevelPrediction project are broad, enabling health risk identification by providers, life sciences organizations, payers, and many others.  These capabilities further underscore the importance of machine learning in its ability to enable the large-scale analysis of data in a way that has been previously unattainable.  The application of health outcome prediction is a critical part of identifying debilitating and expensive conditions earlier to lessen their impact on the patient and our overall health economy.

Now that you’ve gained an understanding of how to use the advanced capabilities of the OHDSI toolset, you can begin the process of converting your organization’s health data into the OMOP Common Data Model.  This enables you to gain deep insights into your real patient or member data.  Another advantage of OHDSI’s standardized toolset is that it enables the analytics you develop to be run by anyone with their data in the OMOP format.  That means that you can broaden the scope of your studies to include more patient data by collaborating with other institutions without needing to gain access to the underlying personal health data.  You can continue to learn more about OHDSI at the website or from the Book of OHDSI at

Learn more about AWS for Healthcare

James Wiggins

James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.

Jenna Reps

Jenna Reps is a Senior Epidemiology Informaticist at Janssen Research and Development, where she focuses on developing novel solutions to personalise risk prediction. Jenna’s areas of expertise include applying machine learning and data mining techniques to develop solutions for various healthcare problems. She is currently working within the patient-level prediction OHDSI workgroup with the aim of developing open source and user-friendly software for developing risk models using datasets in the OMOP Common Data Model format.