Create a model for predicting orthopedic pathology using Amazon SageMaker
Artificial intelligence (AI) and machine learning (ML) are gaining momentum in the healthcare industry, especially in healthcare imaging. The Amazon SageMaker approach to ML presents promising potential in the healthcare field. ML is considered a horizontal enabling layer applicable across industries. Within healthcare, an ML prediction can serve a role analogous to a radiology or lab report: a key input toward an eventual diagnosis.
This blog post uses the UCI ML Vertebral Column dataset, which supports using ML in orthopedics to automate the prediction of spinal pathology conditions. This technology presents an opportunity to minimize the number of visits and/or prescriptions by shortening diagnosis time and applying rejection-option techniques in ML, leaving the difficult cases to the experts, such as orthopedists. Disk hernia and spondylolisthesis, the two diagnoses in the datasets, are among the spinal pathologies that can cause musculoskeletal pain disorders. Computer-aided diagnostic systems that use ML techniques offer an opportunity to identify and treat at-risk patients objectively and effectively, and thereby to minimize opioid prescriptions for pain disorders.
For this blog post, I downloaded these datasets to present an example of predicting whether a person has a normal or abnormal spinal pathology (Hernia or Spondylolisthesis) based on characteristics, or features, of their vertebral column. Preliminary diagnostic tools that take these features into account have high false positive rates: MRIs used to detect containment of lumbar disc herniation carry a ~33% false positive rate, and diagnostic spinal blocks (injections) carry a false positive rate of 22% to 47%. (Note: These rates will be used as a baseline when we evaluate the ML model.)
These datasets present both a multiclass and binary classification problem.
Creating an ML model in Amazon SageMaker for pathology prediction
In this post, we create two models, a multi-class categorical classification model and a binary classification model, and we evaluate both. The multi-class model predicts whether a person has Normal, Disk Hernia, or Spondylolisthesis pathology. The binary model predicts a binary response: 0 – Normal or 1 – Abnormal.
Here are the high-level steps we will follow for this example:
- Prepare your Amazon SageMaker Jupyter notebook.
- Load a dataset from Amazon Simple Storage Service (S3) using Amazon SageMaker.
- Estimate a model using the Amazon SageMaker XGBoost (eXtreme Gradient Boosting) algorithm.
- Host the model on Amazon SageMaker to make ongoing predictions.
- Generate final predictions on the test data set.
Setup
Download the first notebook and upload it to your SageMaker instance to follow along with this blog post. Let’s start by specifying the following:
- Specify the Amazon SageMaker role Amazon Resource Name (ARN) used to give learning and hosting access to your data. Note, if more than one role is required for notebook instances, training, and/or hosting, the boto3 call should be replaced with the appropriate full Amazon SageMaker role ARN string.
- Specify the Amazon S3 bucket that will be used for training and storing model objects.
You’ll also install liac-arff because the dataset is distributed in Attribute-Relation File Format (ARFF).
Now you need to import the relevant Python libraries that we’ll use throughout the analysis.
Let’s define the Amazon S3 bucket used for the example.
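Here is a minimal setup sketch. The bucket name and key prefix are placeholders for this example, not values from the original notebook:

```python
# In a notebook cell, install the ARFF reader first:
# !pip install liac-arff

import boto3
import sagemaker
from sagemaker import get_execution_role

# IAM role that gives SageMaker learning and hosting access to your data.
role = get_execution_role()

# S3 bucket and key prefix for training data and model artifacts
# (placeholders -- replace with your own).
bucket = "<your-s3-bucket-name>"
prefix = "sagemaker/ortho-xgboost"

region = boto3.Session().region_name
```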
Data
The hosted zip file (“vertebral_column_data”) contains four files, two of which hold the actual data and attributes: column_2C_weka.arff for binary classification and column_3C_weka.arff for categorical classification. Column names are included in both files. The dataset consists of 310 rows representing 310 patient records.
Classes and attributes
The dataset includes six biomechanical attributes of the patient plus the outcome, or pathology. The attributes describe the vertebral column (the group of vertebrae, intervertebral discs, nerves, muscles, medulla, and joints). Each patient has six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): angle of pelvic incidence (PI), angle of pelvic tilt (PT), lumbar lordosis angle, sacral slope (SS), pelvic radius, and grade of spondylolisthesis (grade of slipping). There is also a class, or diagnosis, for each patient: either binary, Normal (NO) and Abnormal (AB), or multiclass, Disk Hernia (DH), Spondylolisthesis (SL), and Normal (NO).
Prepare
To get the data into Amazon S3 in a format that XGBoost can read, I extracted the relevant files from the zipped file, converted them to CSV, and added them to the Amazon S3 bucket so that Amazon SageMaker can read them.
Extract files and read data pre-conversion
Conversion
Now we take the extracted file and convert it into .csv files in the proper format (Ortho_dataset.csv for binary and Ortho_dataset_2.csv for multi-class). XGBoost requires binary attributes to be classified as 0 and 1. Therefore, in the binary classification file, I replaced “Abnormal” and “Normal” with “1” and “0” in the class variable column ‘diagnosis’, and in the multi-class categorical classification file, I replaced “Normal”, “Hernia”, and “Spondylolisthesis” with “0,” “1,” “2,” respectively.
See the following Python script to read the .arff file and convert it to .csv format.
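The original script isn’t reproduced here, but the following sketch shows one way to do the extraction and conversion with liac-arff and pandas. The zip and ARFF file names follow the dataset; I assume the label attribute is named class inside the ARFF files and rename it to diagnosis to match the rest of the post:

```python
import zipfile

import arff        # provided by the liac-arff package
import pandas as pd

# Extract the two data files from the downloaded zip.
with zipfile.ZipFile("vertebral_column_data.zip") as z:
    z.extract("column_2C_weka.arff")
    z.extract("column_3C_weka.arff")

def arff_to_csv(arff_path, csv_path, class_map):
    """Load an ARFF file, encode the class labels as integers, write a CSV."""
    with open(arff_path) as f:
        raw = arff.load(f)
    columns = [name for name, _ in raw["attributes"]]
    df = pd.DataFrame(raw["data"], columns=columns)
    df = df.rename(columns={"class": "diagnosis"})
    df["diagnosis"] = df["diagnosis"].map(class_map)
    df.to_csv(csv_path, index=False)
    return df

# Binary file: Abnormal -> 1, Normal -> 0.
arff_to_csv("column_2C_weka.arff", "Ortho_dataset.csv",
            {"Abnormal": 1, "Normal": 0})

# Multiclass file: Normal -> 0, Hernia -> 1, Spondylolisthesis -> 2.
arff_to_csv("column_3C_weka.arff", "Ortho_dataset_2.csv",
            {"Normal": 0, "Hernia": 1, "Spondylolisthesis": 2})
```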
Data Exploration
Now we’ll explore the dataset to understand the size of data, the various fields, the values that different features take, and the distribution of target values.
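A few pandas calls are enough for this first pass; this sketch assumes the multiclass CSV produced in the conversion step:

```python
import pandas as pd

# Load the converted multiclass file and take a first look.
df = pd.read_csv("Ortho_dataset_2.csv")

print(df.shape)                        # number of rows and columns
print(df.dtypes)                       # all six features should be numeric
print(df.describe())                   # per-feature summary statistics
print(df["diagnosis"].value_counts())  # distribution of the target values
```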
Data exploration and transformations
More data generally yields a more effective ML model and potentially higher accuracy. Because the dataset we’re using in this blog post is quite limited, I didn’t remove any features. The same methodology can be applied to larger datasets.
Data histograms and correlation
Here we can visualize the data to see the spread of data within each feature in a histogram and scatter matrix. The scatter-plot matrix displays the correlation between pairs of variables. The matrix makes it easy to look at all pairwise correlations in one place.
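For example, with pandas and matplotlib (the figure sizes and bin counts are arbitrary choices):

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

features = df.drop(columns=["diagnosis"])

# One histogram per feature to see each distribution.
features.hist(figsize=(10, 8), bins=20)
plt.tight_layout()
plt.show()

# Pairwise scatter plots with density estimates on the diagonal.
scatter_matrix(features, figsize=(12, 12), diagonal="kde")
plt.show()
```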
Data Description
Let’s talk about the data. At a high level, we can see:
- There are 7 columns and 217 rows in the training data
- There are 7 columns and 93 rows in the test data
- Diagnosis is the target field
Specifics on the features:
- 6 out of 6 features are numeric
Target variable:
- diagnosis: Multiclass: whether the patient has Hernia, Spondylolisthesis, or is Normal; Binary: whether or not the patient has an abnormal spine condition
Training
For our first training algorithm we use the xgboost algorithm. xgboost is an extremely popular, open-source package for gradient boosted trees. It’s computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let’s start with a simple xgboost model, trained using the Amazon SageMaker managed, distributed training framework.
First we’ll need to specify training parameters. These include the following:
- The role to use
- Our training job name
- The xgboost algorithm container
- Training instance type and count
- S3 location for training data
- S3 location for output data
- Algorithm hyperparameters
The supported training input formats are CSV and libsvm. For CSV input, the algorithm assumes that the input is delimiter-separated (the separator is detected automatically using Python’s built-in sniffer tool), has no header line, and has the label in the first column. The scoring output format is CSV. Our data is already in CSV format, so we’ll arrange the dataset the way Amazon SageMaker XGBoost expects: we’ll keep the target field in the first column and the remaining features in the following columns, and we’ll remove the header line. We’ll also split the data into separate training and validation sets. Finally, we’ll store the data in our S3 bucket.
Split the data into 70% training and 15% validation and save it before calling XGBoost
Upload training and validation data sets in the S3 bucket with prefix below (i.e., ‘train/’)
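Here is a sketch of both steps, continuing from the earlier sketches. The 70/15 split follows the description above; treating the remaining 15% as a hold-out set, and the random seed, are my assumptions:

```python
import numpy as np

# Label first, features after: the layout SageMaker XGBoost expects for CSV.
ordered = pd.concat([df["diagnosis"], df.drop(columns=["diagnosis"])], axis=1)

# Shuffle, then split 70% / 15% / 15%.
train_df, val_df, test_df = np.split(
    ordered.sample(frac=1, random_state=42),
    [int(0.7 * len(ordered)), int(0.85 * len(ordered))],
)

# No header, no index: plain label-first CSV.
train_df.to_csv("train.csv", header=False, index=False)
val_df.to_csv("validation.csv", header=False, index=False)

# Stage the files in S3 under the 'train/' and 'validation/' prefixes.
s3 = boto3.Session().resource("s3")
s3.Bucket(bucket).Object(f"{prefix}/train/train.csv").upload_file("train.csv")
s3.Bucket(bucket).Object(f"{prefix}/validation/validation.csv").upload_file("validation.csv")
```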
Specify parameters based on the model (see the training sketch after this list):
- Multiclass: objective: “multi:softmax”, num_class: “3”
- Binary: objective: “binary:logistic”, eval_metric: “error@t” (where t is the score threshold of error)
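Putting those parameters together, here is a minimal training sketch using the SageMaker Python SDK (v2). The container version, instance type, and num_round are illustrative choices, not values from the original notebook:

```python
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Look up the SageMaker-managed XGBoost container for this Region.
container = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=sagemaker.Session(),
)

# Multiclass variant; for the binary model use objective="binary:logistic"
# and eval_metric="error@0.40" instead, and drop num_class.
xgb.set_hyperparameters(objective="multi:softmax", num_class=3, num_round=100)

xgb.fit({
    "train": TrainingInput(f"s3://{bucket}/{prefix}/train", content_type="csv"),
    "validation": TrainingInput(f"s3://{bucket}/{prefix}/validation", content_type="csv"),
})
```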
Hosting
Now that we’ve trained the xgboost algorithm on our data, let’s set up a model that can later be hosted. We will do the following:
- Point to the scoring container.
- Point to the model.tar.gz that came from training.
- Create the hosting model.
After we set up a model, we can configure what our hosting endpoints should be. Here we specify the following:
- EC2 instance type to use for hosting.
- Initial number of instances.
- Our hosting model name.
Create endpoint
Finally, we create the endpoint that serves up the model by specifying the name and configuration defined earlier. The end result is an endpoint that can be validated and incorporated into production applications. This takes about 7-11 minutes to complete.
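With the SDK, one deploy call performs all three steps (model, endpoint configuration, and endpoint); the instance type here is again an assumption:

```python
from sagemaker.serializers import CSVSerializer

predictor = xgb.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    serializer=CSVSerializer(),  # send feature rows to the endpoint as CSV
)
```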
Prediction
The training job produced a model; we’ll now use it to predict values.
Generate predictions on train, validation, and test sets
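Here is a sketch of batch scoring against the endpoint. The helper, batch size, and comma-separated response parsing are assumptions that may need adjusting for your container version:

```python
import numpy as np

def predict_rows(predictor, features, batch_size=50):
    """Score feature rows (no label column) in batches and parse the results."""
    predictions = []
    for i in range(0, len(features), batch_size):
        batch = features[i : i + batch_size]
        result = predictor.predict(batch).decode("utf-8")
        predictions.extend(float(p) for p in result.strip().split(","))
    return np.array(predictions)

# The label sits in the first column, so drop it before scoring.
train_preds = predict_rows(predictor, train_df.values[:, 1:])
val_preds = predict_rows(predictor, val_df.values[:, 1:])
test_preds = predict_rows(predictor, test_df.values[:, 1:])
```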
Evaluate model accuracy for multiclass categorical
There are many ways to compare the performance of a machine learning model.
For multiclass models, we typically use the F1 measure rather than the AUC score (area under the ROC curve), which is typically used for binary models. The F1 measure combines the precision and recall of each class in the model. The score ranges from 0 to 1; the higher the score, the better the accuracy of the model. For example, an F1 score of ~0.9 indicates a better model than a score of 0.7.
Other evaluation metrics include Sensitivity, or true positive rate, and Precision, or positive predictive value. This will be covered more in depth in the binary classification example.
This ML model received an average F1 score of ~0.9.
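As a sketch, such a score can be reproduced with scikit-learn’s macro-averaged F1 on the scored validation set from the previous step:

```python
from sklearn.metrics import f1_score

y_true = val_df.values[:, 0].astype(int)   # labels are in the first column

print(f1_score(y_true, val_preds.astype(int), average="macro"))
```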
Confusion matrix
You can also delve into the performance for each class by looking at the confusion matrix.
The confusion matrix provides a visual representation of the performance based on the accuracy of the multiclass classification predictive model. In this table you can find the percentage of true positives and false positives.
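For example, scikit-learn can produce the matrix from the scored validation set (the class names follow the 0/1/2 encoding used in the conversion step):

```python
from sklearn.metrics import confusion_matrix

labels = ["Normal", "Hernia", "Spondylolisthesis"]   # 0, 1, 2

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, val_preds.astype(int))
print(pd.DataFrame(cm, index=labels, columns=labels))
```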
For example, you can see that the class (or diagnosis) Spondylolisthesis had a high accuracy rate (97%), with 146 cases out of 150 cases predicted correctly in the evaluation dataset. The F1 score of 0.97 is also relatively high. However, class Hernia had a lower F1 score of 0.85 showing that the model confused it with the Normal pathology. For more information on multiclass model evaluation and insights, go to Multiclass Model Insights: https://docs.aws.amazon.com/machine-learning/latest/dg/multiclass-model-insights.html.
Once you are done, delete the endpoints by running the following command:
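With the SDK objects from the sketches above, the equivalent call is:

```python
# Stop incurring charges once you're done experimenting.
predictor.delete_endpoint()
```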
Binary classification model
Now, we’re going to show the Binary Classification model, which yields a binary response (0 or 1); in this case, 0 is normal and 1 is abnormal. The evaluation produces four statistics (true positives, false positives, true negatives, and false negatives) in a similar confusion matrix.
Setup, data, and training
Setup for binary is similar to the setup for multiclass categorical. You can begin by downloading the notebook with Part 2 for this section. However, there are a few differences, which I will call out for you.
Extracting data files:
Conversion:
Ultimately the data set will have 310 rows, with 210 as abnormal and 100 as normal.
When training, the hyperparameters will be different.
Specify parameters based on the model (example: objective: binary:logistic, eval_metric: error@0.40).
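Continuing the earlier training sketch, only the hyperparameters change (num_round is again an illustrative choice):

```python
# Binary variant: logistic objective with the classification-error metric
# evaluated at a score threshold of 0.40.
xgb.set_hyperparameters(objective="binary:logistic",
                        eval_metric="error@0.40",
                        num_round=100)
```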
Evaluation metrics for the binary classification model
For the binary classification the model uses AUC, area under the curve, as the score. AUC is a metric used to measure the quality of a binary classification ML model. It ranges from 0.5 to 1; the higher the AUC score, the better the ML model quality. In this case, you can also adjust the score threshold.
Compute performance metrics on the training, validation, test data sets
I looked at a plot of the True Positive Rate vs. the False Positive Rate based on varying thresholds. The goal is a high True Positive Rate (TPR, or Sensitivity) and a low False Positive Rate (FPR, or Fall-out), and thereby a higher AUC.
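Here is a sketch with scikit-learn, reusing the names from the earlier scoring helper and assuming val_preds now holds the binary model’s probability scores:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, val_preds)
print("AUC:", roc_auc_score(y_true, val_preds))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")  # AUC 0.5 baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```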
Other model evaluation metrics (each computed in the sketch after this list):
- F1 Score: Harmonic mean of the precision and recall
- Sensitivity, hit rate, recall, or true positive rate
- Specificity or true negative rate
- Precision or positive predictive value
- Negative predictive value (NPV)
- Fall out or false positive rate (FPR)
- False negative rate (FNR)
- False discovery rate (FDR)
- Overall accuracy
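All of these follow directly from the four confusion-matrix counts. Here is a sketch at the 0.3 cutoff discussed below:

```python
from sklearn.metrics import confusion_matrix

threshold = 0.3
y_pred = (val_preds >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)     # positive predictive value
npv = tn / (tn + fn)           # negative predictive value
fall_out = fp / (fp + tn)      # false positive rate
fnr = fn / (fn + tp)           # false negative rate
fdr = fp / (fp + tp)           # false discovery rate
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```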
Here I can adjust the threshold to increase sensitivity and minimize the FNR in order to have close to zero false negatives.
In a false negative case, a patient is diagnosed too late; costs and treatment can then be more aggressive, and the delay could put the patient’s life in danger. Therefore, I erred toward minimizing false negatives, pushing them close to zero.
In this case, I adjusted the threshold to a cutoff of 0.3, which yields a 4.3% false negative rate with a high accuracy rate (89%) and a low error rate (11%). The error percentage indicates the rate at which the model made a prediction mistake. At this cutoff, the false positive rate sits at 25%. Comparing this false positive (FP) rate to the industry baselines of MRI (33% FP) and diagnostic blocks (22%-47% FP), the Amazon SageMaker-based model, with this dataset, produces results in the range of the tools typically used.
Conclusion
As you can see in this blog post, the binary “abnormal” and “normal” pathology classification in Orthopedics can yield a decision-support system that labels critical cases (as opposed to categorically classifying them into exact pathologies). The ML filters leave the complex and critical cases for the human expert, such as an orthopedic surgeon. In addition, this approach provides factors for guidelines on when to prescribe opioids, thus narrowing the pool for opioid prescriptions. We are at the beginning of exploring the ways that ML can advance healthcare diagnosis. Look for more advancements as we continue to get more data and learn.
About the Author
Sunaina Ahuja Rajani manages Video Games, Media, and Toys GLs, as well as events and seasonality, on Prime Now, a two-hour delivery e-commerce channel. She received a BSc and a Master’s in the fields of healthcare and cognitive neuroscience. She is from Texas and has lived in Utah, D.C., NYC, Cambridge, and now Seattle. Sunaina loves to read the Wall Street Journal and healthcare and tech news, and enjoys dancing, biking, investing, and the Matrix movies.