AWS for Industries

No-Code ML Approach to Predict Heart Disease with Amazon SageMaker Canvas

Understanding human biology and disease progression remains a critical challenge in Life Sciences and Healthcare, with potentially life-saving implications. Amazon SageMaker Canvas offers a powerful, no-code solution for building predictive models based on biomedical, laboratory, and health data that can help identify disease biomarkers.

The World Health Organization reports that cardiovascular diseases (CVDs) cause approximately 17.9 million deaths each year, often beginning with subtle symptoms like chest pain that go unnoticed until it is too late. This statistic underscores the urgent need for new approaches to early disease detection and intervention.

Advances in machine learning now let Life Sciences and Healthcare professionals turn complex medical data into actionable insights without extensive coding expertise. Our example heart disease prediction use case demonstrates how. By using SageMaker Canvas, you can:

  • Quickly analyze complex tabular biomedical data
  • Build accurate predictive models for early disease detection
  • Iterate and improve models with ease
  • Potentially improve patient outcomes through earlier interventions

Using a dataset from the UCI Machine Learning Repository, we’ll walk you through the process of data preparation, model building, and result analysis. You’ll see firsthand how SageMaker Canvas streamlines the entire machine learning workflow, from data transformation to production, in a secure, collaborative environment.

Exploring the Dataset: Understanding Heart Disease Indicators

Before we dive into building our model, let’s talk about data preparation, often the most time-consuming part of any machine learning project. This is where Amazon SageMaker Data Wrangler comes in. It’s a powerful service that reduces the time it takes to prepare tabular, image, and text data, cutting weeks of work down to minutes.

SageMaker Data Wrangler offers a visual interface and leverages natural language processing, making it straightforward for both beginners and experienced practitioners to clean and transform their data. Whether you’re dealing with missing values, outliers, or need to engineer new features, SageMaker Data Wrangler streamlines these tasks with its intuitive design.

In our heart disease prediction project, we’ll use SageMaker Data Wrangler to quickly assess our dataset’s quality and prepare it for model training. This service will help us identify any data issues and confirm our features are ready for machine learning, all without writing a single line of code.

We’ll begin by uploading and importing our dataset with SageMaker Data Wrangler, and then generate a Data Quality and Insights Report. This report automatically checks for issues like missing values, duplicate rows, and anomalies such as outliers or class imbalance, so you can quickly apply domain expertise to prepare data for machine learning (ML) model training. The visual data flow below creates the Data Quality and Insights Report and exports the data to a SageMaker Canvas dataset (Figure 1). The report summary follows (Figure 2).

Figure 1 – SageMaker Data Wrangler helps visualize and create data transforms. This data flow creates a Data Quality and Insights Report to investigate important aspects of our data and also exports the data as a Canvas dataset.

Figure 2 – A summary of our dataset’s statistics and warnings, to review potential issues, biases, or imbalances.

Let’s view the report summary for our data and confirm there are no high priority warnings to investigate.

The report offers a comprehensive view of the heart disease dataset. Among other details, it identifies 16 features, no duplicate rows, and no high-priority warnings to address.
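For readers who want to see what checks of this kind involve, here is a minimal sketch in plain Python. It is not the code SageMaker Data Wrangler runs. The function name `quality_summary` and the sample rows are hypothetical and only illustrate counting missing values, duplicate rows, and target class balance.

```python
from collections import Counter


def quality_summary(rows, target):
    """Summarize missing values, duplicate rows, and target class balance."""
    columns = list(rows[0].keys()) if rows else []
    # Count empty or None cells per column.
    missing = {c: sum(1 for r in rows if r.get(c) in (None, "")) for c in columns}
    # A row counts as a duplicate if an identical row appeared earlier.
    seen = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in seen.values())
    # Share of each target class, ignoring rows with a missing target.
    classes = Counter(r[target] for r in rows if r.get(target) not in (None, ""))
    total = sum(classes.values())
    class_balance = {k: v / total for k, v in classes.items()} if total else {}
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "missing": missing,
        "duplicate_rows": duplicate_rows,
        "class_balance": class_balance,
    }


# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
sample = [
    {"age": 63, "chol": 233, "cp": "typical angina"},
    {"age": 67, "chol": 286, "cp": "asymptomatic"},
    {"age": 67, "chol": 286, "cp": "asymptomatic"},
    {"age": 41, "chol": None, "cp": "atypical angina"},
]
summary = quality_summary(sample, target="cp")
print(summary["duplicate_rows"], summary["missing"]["chol"])  # prints: 1 1
```

The actual report covers more than this sketch does, including outlier detection and prioritized warnings.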

Key insights reveal:

Our target column, chest pain (cp), has four classes. The report provides a histogram of its most frequent values (Figure 3). We can also view cp in relation to the cholesterol (chol) column: the lower plot shows the feature distribution, and the upper plot shows the corresponding target class frequency (Figure 4).

Figure 3 – Frequency of values

Figure 4 – Data exploration reveals potential correlations between cholesterol levels and chest pain categories. The top chart shows the target class frequency for chest pain (cp) types across different cholesterol levels. The bottom chart is a histogram of cholesterol (chol) values, overlaid with a line graph of chest pain categories.
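The idea behind this chart can be approximated outside the report by grouping a numeric feature into bins and counting target classes per bin. The sketch below assumes a bin width of 50 and uses hypothetical sample rows. The helper name `class_counts_by_bin` is ours, not part of any AWS API.

```python
from collections import Counter, defaultdict


def class_counts_by_bin(rows, feature, target, bin_width):
    """Count target classes within fixed-width bins of a numeric feature."""
    table = defaultdict(Counter)
    for row in rows:
        value = row.get(feature)
        if value is None:
            continue  # skip rows with a missing feature value
        lower = int(value // bin_width) * bin_width  # lower edge of the bin
        table[lower][row[target]] += 1
    return {lower: dict(counts) for lower, counts in sorted(table.items())}


# Hypothetical rows; the values are illustrative only.
rows = [
    {"chol": 233, "cp": "typical angina"},
    {"chol": 286, "cp": "asymptomatic"},
    {"chol": 250, "cp": "non-anginal"},
    {"chol": 204, "cp": "atypical angina"},
    {"chol": 269, "cp": "asymptomatic"},
]
bins = class_counts_by_bin(rows, feature="chol", target="cp", bin_width=50)
for lower, counts in bins.items():
    print(f"chol {lower}-{lower + 49}: {counts}")
```

Each key is the lower edge of a cholesterol bin, and each value holds the chest pain class counts that fall inside it.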

This analysis sets the stage for making informed decisions in data preprocessing, feature engineering, and model selection within SageMaker Canvas. We can now confidently proceed with building a predictive model for heart disease.

Model building and iteration

After exporting the data to a SageMaker Canvas dataset, we can build our model with either a Quick build or a Standard build. We used a Quick build, which prioritizes speed over accuracy and is ideal for initial exploration. We chose cp as our target column and used the suggested model type, 3+ category prediction.

Figure 5 – In the Build tab we initiate the model building by specifying the target column as chest pain (cp) and selecting Quick build.

Once the Quick build completes, the Analyze tab shows the model’s accuracy and how much each column contributes to its predictions.

Figure 6 – The Analyze tab panel displays model accuracy, feature importance scores, and a visual representation of column impact on predictions.

Next, the advanced metrics show the averages for accuracy, F1, precision, recall, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Recall is often a critical metric because it measures the model’s ability to correctly identify all actual positive cases (for example, patients with a disease). Precision is also vital, because it measures the proportion of positive identifications that were actually correct. AUC-ROC provides a single metric that captures the model’s ability to discriminate between positive and negative classes across all classification thresholds. AUC-ROC is specifically used for binary classification problems. If your problem is a multi-class classification or regression, SageMaker Canvas will not report AUC-ROC.
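To make these definitions concrete, the following sketch computes accuracy, precision, recall, F1, and AUC-ROC for a binary problem in plain Python. It is not how SageMaker Canvas computes its metrics internally. The labels and scores are made up, and the function names are ours. AUC is computed here as the fraction of positive/negative pairs in which the positive case receives the higher score.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive case outscores a random negative case."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels, predictions, and scores for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.1, 0.35]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print({k: round(v, 3) for k, v in metrics.items()}, round(auc, 4))
```

In this toy example recall is 0.5 because two of the four positive cases are missed, while precision is higher (about 0.667) because only one of the three positive predictions is wrong.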

Figure 7 – The Advanced Metrics tab shows accuracy, recall, precision, and average AUC-ROC.

Based on these scores, we want to improve the model’s performance. Because we are concerned with early detection of CVDs, we explore reducing the prediction categories to two, chest pain or no chest pain, which turns the problem into a binary classification.

Feature engineering using Amazon SageMaker Data Wrangler

To do this, we return to SageMaker Data Wrangler and convert the cp column into a binary target using Chat for data prep, which generates the required code from a natural language request. We ask it to:

“Help me convert the “cp” column data to a binary 0 or 1 based on converting the categories as follow as typical angina=1, atypical angina=1, non-anginal=1, asymptomatic =0”

Figure 8 – Using natural language, we can create data transforms using Chat for data prep. This creates the code to perform the feature engineering step that transforms the categories into a binary choice.

Chat for data prep generates the code for us. We add it to our steps and name the transformation step transform_cp. The cp column now holds the binary values we asked for in the chat, without our having to write the code ourselves.
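The generated code is not reproduced in this post. As a rough idea of what such a transform does, here is a plain Python sketch of the same mapping. It assumes the cp values are stored as the text labels used in the prompt. The function `transform_cp` below is our own illustration, not the output of Chat for data prep.

```python
# Mapping taken from the prompt: any anginal or non-anginal pain becomes 1,
# asymptomatic becomes 0.
CP_TO_BINARY = {
    "typical angina": 1,
    "atypical angina": 1,
    "non-anginal": 1,
    "asymptomatic": 0,
}


def transform_cp(rows, column="cp"):
    """Return copies of the rows with the cp category replaced by 0 or 1."""
    transformed = []
    for row in rows:
        label = str(row[column]).strip().lower()
        if label not in CP_TO_BINARY:
            # Fail loudly instead of silently guessing a class.
            raise ValueError(f"Unexpected {column} value: {row[column]!r}")
        transformed.append({**row, column: CP_TO_BINARY[label]})
    return transformed


# Hypothetical rows for illustration.
rows = [
    {"age": 63, "cp": "typical angina"},
    {"age": 67, "cp": "asymptomatic"},
    {"age": 41, "cp": "Non-anginal"},
]
binary_rows = transform_cp(rows)
print([r["cp"] for r in binary_rows])  # prints: [1, 0, 1]
```

Raising an error on unknown labels is a design choice for this sketch: it surfaces unexpected categories before they reach model training.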

We now add the transform_cp transformation step and create a model based on the data exported after that step.

Figure 9 – The code outputted by Chat for data prep is added, which creates the transformation step

Figure 10 – The original data flow is updated with the newly added transform_cp step and with an export to a new SageMaker Canvas dataset, to begin model building.

We choose cp as our target column again and this time perform a Standard build, because we want to prioritize accuracy over speed.

Analysis of results and potential impact

After the Standard Build, the model’s accuracy, precision, and recall have improved. Let’s review the advanced metrics to see what improvements have been made.

Figure 11 – The Advanced Metrics tab shows the performance of the model. By all metrics, this model is improved over the previous model.

We have moved from a multi-class model to a binary classification model that predicts the presence of chest pain as an early indicator of heart disease. The new model achieves a precision of 0.825, a recall of 0.800, and an AUC of 0.846. Higher recall means fewer patients with chest pain are missed, and higher precision means fewer false positives, which makes the predictions more reliable as a screening aid. This improvement was achieved with minimal data science expertise, and the results can now be shared with the data science team for further model refinement and investigation.

Empowering Healthcare with Amazon SageMaker Canvas: Conclusion and Future Directions

With the intuitive data preparation and transformation capabilities of Amazon SageMaker Canvas, you can clean, normalize, and engineer features from your dataset. SageMaker Canvas then builds and selects the best performing model for your problem type. We demonstrated how to use SageMaker Canvas to build a predictive model for heart disease, focusing on chest pain as a key symptom, to help accelerate detection and decision making.

Amazon SageMaker Canvas democratizes machine learning for life sciences and healthcare teams, allowing you to focus on interpreting results and making informed decisions rather than grappling with complex data science techniques.

Contact an AWS Healthcare Representative or Life Sciences Representative to learn how we can help accelerate your business.

James Gaines

James Gaines is a Senior Solutions Architect for Healthcare and Life Sciences at AWS. He has a background in highly regulated environments, including the Department of Defense and pharmaceutical industry. James holds all active AWS Certifications and specializes in cloud migrations, application modernization, and advanced analytics to drive innovation in Healthcare and Life Sciences.