Human-in-the-loop review of model explanations with Amazon SageMaker Clarify and Amazon A2I

Domain experts are increasingly using machine learning (ML) to make faster decisions that lead to better customer outcomes across industries including healthcare, financial services, and many more. ML can provide higher accuracy at lower cost, whereas expert oversight can ensure validation and continuous improvement of sensitive applications like disease diagnosis, credit risk management, and fraud detection. Organizations are looking to combine ML technology with human review to introduce higher efficiency and transparency in their processes.

Regulatory compliance may require companies to provide justifications for decisions made by ML. Similarly, internal compliance teams may want to interpret a model’s behavior when validating decisions based on model predictions. For example, underwriters want to understand why a particular loan application was flagged as suspicious by the model. AWS customers want to scale such interpretable systems with a large number of models supported by a workforce of human reviewers.

In this post, we use Amazon SageMaker Clarify to provide explanations of individual predictions and Amazon Augmented AI (Amazon A2I) to create a human-in-the-loop workflow and validate specific outcomes below a threshold on an income classification use case.

Explaining individual predictions with a human review can have the following technical challenges:

Advanced ML algorithms learn non-linear relationships between the input features, and traditional feature attribution methods like partial dependence plots can’t explain the contribution of each feature for every individual prediction
Data science teams must seamlessly translate technical model explanations to business users for validation

SageMaker Clarify and Amazon A2I

Clarify provides ML developers with greater visibility into their data and models so they can identify potential bias and explain predictions. SHAP (SHapley Additive exPlanations), based on the concept of a Shapley value from the field of cooperative game theory, works well for both aggregate and individual model explanations. The Kernel SHAP algorithm is model agnostic, and Clarify uses a scalable and efficient implementation of Kernel SHAP.

Amazon A2I makes it easy to build the workflows required for human review at your desired scale and removes the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers. You can send model predictions and individual SHAP values from Clarify for review to internal compliance teams and customer-facing employees via Amazon A2I.

Together, Clarify and Amazon A2I can complete the loop from producing individual explanations to validating outcomes via human review and generating feedback for further improvement.

Solution overview

We use Amazon SageMaker to build and train an XGBoost model performing census income classification in a Jupyter notebook environment. We then use SageMaker batch transform to run inference on a batch of test data, and use Clarify to explain individual predictions on the same batch. Next, we set up Amazon A2I to create a human-in-the-loop review with a workforce. We then extract the SHAP values from the Clarify output and trigger the Amazon A2I review for predictions under a specific threshold. We present the human reviewers a plot of SHAP values for each review instance and allow them to validate the prediction outcome. The feedback from the reviewers is stored back to Amazon Simple Storage Service (Amazon S3) and made available for the next model training cycle.

The following diagram illustrates this architecture.

Let’s walk through this architecture to understand the details:

1. Train an XGBoost model using the training data stored in Amazon S3. The trained model is stored in another S3 bucket.
2. Upon arrival of inference data, select a batch of records and perform the following:
    a. Set up the SageMaker batch transform endpoint using the model created in Step 1.
    b. Send the batch of inference requests to the endpoint.
    c. Run explainability analysis on the same batch, using Clarify, to generate SHAP values for each feature, both globally and for individual predictions.
    d. Filter the negative outcomes from the prediction results and generate the plot of SHAP values for those outcomes. Store the plots in an Amazon S3 location, to be picked up by the Amazon A2I template.
    e. Create a human review loop using Amazon A2I and supply the outcome and the plot of SHAP values to the Amazon A2I task template.
    f. Human reviewers look at the prediction and the SHAP values to understand the reason for the negative outcome and verify that the model is making decisions without any inherent bias.
3. Capture all the human-reviewed results and use them as ground truth labeled data for retraining purposes.
4. Supply the next batch of records and repeat all the sub-steps in Step 2.

The batch size of 100 is taken for demonstration purposes only. We can send all the data at once as well.

Dataset

We use the UCI Adult Population Dataset. This dataset contains 45,222 rows (32,561 for training and 12,661 for testing). Each data instance has 14 features concerning demographic characteristics of individuals like age, workclass, education, marital-status, sex, and ethnic group. The dataset provides the target variable showing if the individual earns more or less than $50,000. The dataset also provides different files for training and testing.

Data preparation

As part of data preprocessing, we apply label encoding on the target variable to denote 1 as the person earning more than $50,000 and 0 as the person earning less than $50,000. Similarly, categorical features like sex, workclass, education, marital-status, relationship, and ethnic group are label encoded.

We create a small batch of 100 records out of the test data, which we use for performing batch inference using SageMaker batch transform, and use the same batch for generating the SHAP values using Clarify. All these files are then uploaded to Amazon S3.

Model training

We train an XGBoost model using the built-in XGBoost algorithm container provided by SageMaker. SageMaker provides many built-in algorithms that you can use for training. It also provides built-in containers for popular deep learning frameworks like TensorFlow, PyTorch, and MXNet, and supports bringing your own custom containers for training and inference.

Inference

After training the model, it’s time to set up a batch transform job for making predictions in batches. For this post, we send a small batch of records to the endpoint based on the batch size we defined in the data preparation phase. Based on the use case, we can change the batch size accordingly.

Generate explanations for model predictions

We start the explainability analysis using Clarify. Clarify sets up an ephemeral shadow endpoint and runs a SageMaker Processing job to perform batch inference on the shadow endpoint to calculate the SHAP values.

To calculate Shapley values, Clarify generates several new instances between the baseline and the given instance, in which the absence of a feature is modeled by setting the feature value to that of the baseline, and the presence of a feature is modeled by setting the feature value to that of the given instance. Therefore, the absence of all features corresponds to the baseline and the presence of all features corresponds to the given instance. Clarify computes local and global SHAP values for the given input, which is a batch of records in our case.
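
The baseline-substitution idea described above can be illustrated with a minimal sketch. It computes exact Shapley values by enumerating every feature coalition for a hypothetical three-feature toy model; `shapley_values` and `toy_model` are illustrative names, not part of Clarify. Kernel SHAP, which Clarify uses, approximates the same quantity by sampling coalitions rather than enumerating them, so this is not Clarify’s implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(instance)
    features = range(n)

    def value(coalition):
        # Present features take the instance value, absent ones the baseline value.
        synthetic = [instance[j] if j in coalition else baseline[j] for j in features]
        return predict(synthetic)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

In this toy case the attributions add up to the difference between the model output for the instance and for the baseline, which is the property that lets a plot of SHAP values start at the expected value and end at the prediction.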

The explainability results are stored in an Amazon S3 location specified while setting up the explainability analysis job.

Clarify computes global SHAP values, showing the relative importance of all the features in the dataset, and produces reports in HTML, PDF, and notebook formats. A separate output file is generated that contains SHAP values for individual data instances. It also produces an analysis.json file, which contains the global SHAP values and the expected value in JSON format. We use this expected value as the base value when generating the plots for SHAP values.

Post-processing the predictions and explainability results

Next, we download the batch transform results, the CSV file containing SHAP values, and the analysis.json file. With these locally available, we create a single Pandas DataFrame that contains the SHAP values and their corresponding predictions. Here we convert the probability score of each prediction to a binary output based on a threshold of 0.5. This sets all the predictions with probability scores less than or equal to the threshold value to 0 and predictions with probability scores greater than the threshold value to 1. We can change the threshold from 0.5 to any other value between 0 and 1 as per our use case. We refer to 0 as a negative outcome and 1 as a positive outcome for the rest of the post.

The following code demonstrates our postprocessing:

import json

import pandas as pd
from sagemaker.s3 import S3Downloader

# read the SHAP values
S3Downloader.download(s3_uri=explainability_output_path + "/explanations_shap", local_path="output")
shap_values_df = pd.read_csv("output/out.csv")

# read the inference results
S3Downloader.download(s3_uri=transformer_s3_output_path, local_path="output")
predictions_df = pd.read_csv("output/test_features_mini_batch.csv.out", header=None)
predictions_df = predictions_df.round(5)

# get the base expected value to be used to plot SHAP values
S3Downloader.download(s3_uri=explainability_output_path + "/analysis.json", local_path="output")
with open('output/analysis.json') as json_file:
    data = json.load(json_file)
    base_value = data['explanations']['kernel_shap']['label0']['expected_value']

print("base value: ", base_value)

predictions_df.columns = ['Probability_Score']

# join the probability score and SHAP values together in a single data frame
prediction_shap_df = pd.concat([predictions_df, shap_values_df], axis=1)

# create a new column 'Prediction', converting the probability score to either 1 or 0
prediction_shap_df.insert(0, 'Prediction', (prediction_shap_df['Probability_Score'] > 0.5).astype(int))

# add an index column based on the batch; used later to merge the A2I results with the ground truth
prediction_shap_df['row_num'] = test_features_mini_batch.index

Set up a human review workflow

Next, we set up a human review workflow with Amazon A2I to review all the negative outcomes along with their SHAP values and probability scores. This increases transparency and trust in the whole ML lifecycle, because the explanations from Clarify let the human reviewer check whether a negative outcome relies heavily on certain features, and whether the probability score was close enough to the threshold that the outcome could easily have been positive.

Create the Amazon A2I worker task template

To start, we create a worker task template with the tabular data. Amazon A2I supports a variety of worker task templates for use cases covering images, audio, tabular data, and many more. For more examples of worker task templates, see the GitHub repo.

Before you proceed further, make sure the SageMaker role has the required permissions to run tasks related to Amazon A2I. For more information, see Prerequisites to Using Augmented AI.

After all the prerequisites are met, we create the worker task template. The template has the following columns:

Row number (a unique identifier for each record, to be used later for preparing the ground truth data for retraining)
Predicted outcome (0 or 1)
Probability score on the outcome (a higher value indicates a higher probability of a positive outcome)
SHAP value plot for the outcome (showing the top three features that influenced the model prediction)
Agree or disagree with the rating
Reason for change

Create the workflow definition for Amazon A2I

After we create the template, we need to create a workflow definition. The workflow definition allows us to specify the following:

The workforce that our tasks are sent to
The instructions that our workforce receives (called a worker task template)
Where our output data is stored
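
The three items above map onto the request for the SageMaker `create_flow_definition` API. The following is a minimal sketch that only assembles the request; the helper name, ARNs, bucket, task title, and description are placeholder assumptions, and the actual call is left commented out because it needs AWS credentials and existing resources.

```python
def build_flow_definition_request(name, role_arn, workteam_arn, task_ui_arn, s3_output_path):
    """Assemble keyword arguments for the SageMaker create_flow_definition call."""
    return {
        "FlowDefinitionName": name,
        "RoleArn": role_arn,
        "HumanLoopConfig": {
            "WorkteamArn": workteam_arn,      # the workforce that receives the tasks
            "HumanTaskUiArn": task_ui_arn,    # the worker task template
            "TaskCount": 1,                   # number of reviewers per task
            "TaskTitle": "Review model prediction and SHAP values",
            "TaskDescription": "Agree or disagree with the predicted outcome",
        },
        "OutputConfig": {"S3OutputPath": s3_output_path},  # where results are stored
    }


# All identifiers below are hypothetical placeholders.
request = build_flow_definition_request(
    name="shap-review-flow",
    role_arn="arn:aws:iam::111122223333:role/example-sagemaker-role",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example-team",
    task_ui_arn="arn:aws:sagemaker:us-east-1:111122223333:human-task-ui/example-template",
    s3_output_path="s3://example-bucket/a2i-results",
)

# import boto3
# sagemaker_client = boto3.client("sagemaker")
# flow_definition_arn = sagemaker_client.create_flow_definition(**request)["FlowDefinitionArn"]
```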

Prepare the data for the human review workflow

As we discussed earlier, we send all the negative outcomes for human review. We filter the records with negative outcomes, generate a plot of SHAP values for those records, and upload the plots to Amazon S3. To generate the plot of SHAP values, we use the open-source SHAP library. In the plot, we only show the top three features contributing the most to the outcome.

We also modify the same Pandas DataFrame used during postprocessing by appending the Amazon S3 URIs for each uploaded plot of SHAP values, because we need to supply these paths in the Amazon A2I worker task template. See the following code:

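import os

import matplotlib.pyplot as plt
import shap
from sagemaker.s3 import S3Uploader

column_list = list(test_features_mini_batch.columns)

# make sure the local folder for the plots exists
os.makedirs('shap_images', exist_ok=True)

s3_uris = []
for i in range(len(negative_outcomes_df)):
    # look up the original feature row through row_num, because positions in the
    # filtered DataFrame no longer match positions in the full batch
    row_num = negative_outcomes_df['row_num'].iloc[i]
    explanation_obj = shap.Explanation(
        values=negative_outcomes_df.iloc[i, 2:-1].to_numpy(dtype=float),
        base_values=base_value,
        data=test_features_mini_batch.loc[row_num].to_numpy(),
        feature_names=column_list,
    )
    # max_display=4 shows the top three features plus one bar for all remaining features
    shap.plots.waterfall(shap_values=explanation_obj, max_display=4, show=False)
    img_name = 'shap-' + str(i) + '.png'
    plt.savefig('shap_images/' + img_name, bbox_inches='tight')
    plt.close()
    s3_uri = S3Uploader.upload('shap_images/' + img_name, 's3://{}/{}/shap_images'.format(bucket, prefix))
    s3_uris.append(s3_uri)

negative_outcomes_df['shap_image_s3_uri'] = s3_uris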
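
The filtering step that produces the negative outcomes is not shown in the snippet above. As a rough sketch, assuming the column layout from the postprocessing step (Prediction, Probability_Score, one SHAP column per feature, row_num), it could look like the following. The small `demo_df` and its values are made up for illustration and stand in for `prediction_shap_df`.

```python
import pandas as pd

# Hypothetical stand-in for prediction_shap_df with four SHAP columns.
demo_df = pd.DataFrame({
    "Prediction": [0, 1, 0],
    "Probability_Score": [0.13, 0.81, 0.48],
    "Age": [-0.9, 0.4, 0.1],
    "Capital Gain": [0.5, 1.2, 0.6],
    "Relationship": [0.3, 0.2, 0.4],
    "Sex": [-0.1, 0.05, -0.7],
    "row_num": [0, 1, 2],
})
shap_columns = ["Age", "Capital Gain", "Relationship", "Sex"]

# Keep only the negative outcomes; copy so that new columns can be added safely.
negatives = demo_df[demo_df["Prediction"] == 0].copy()

# Rank features by absolute SHAP value and keep the three largest per record.
negatives["top_features"] = [
    list(row.abs().sort_values(ascending=False).index[:3])
    for _, row in negatives[shap_columns].iterrows()
]
print(negatives[["row_num", "Probability_Score", "top_features"]])
```

Ranking by absolute value matters here because a feature that strongly pushes the score down is as relevant to the reviewer as one that pushes it up.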

Start the human review loop

Now we have everything ready to start the human review loop. We send a small batch of three records to each of the reviewers so that they can analyze the individual predictions and compare the SHAP values across predictions to understand if the model is biased towards certain attributes. This comparison capability greatly improves the transparency and trust in the model because reviewers also get a sense of model feature attribution across predictions. They can also share their observations in the comments section in the same UI.
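
One way to group the records into batches of three is sketched below. It builds the keyword arguments for the `start_human_loop` call of the `sagemaker-a2i-runtime` client, one request per batch. The helper name, the `Pairs` key, the record fields, and the ARN are assumptions for illustration; the input keys have to match whatever the worker task template actually reads.

```python
import json
import uuid


def build_human_loop_requests(records, flow_definition_arn, batch_size=3):
    """Group records into batches and build one start_human_loop request per batch."""
    requests = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        requests.append({
            # Human loop names must be unique, lowercase, and at most 63 characters.
            "HumanLoopName": "shap-review-" + uuid.uuid4().hex,
            "FlowDefinitionArn": flow_definition_arn,
            # "Pairs" is a hypothetical key that the task template would iterate over.
            "HumanLoopInput": {"InputContent": json.dumps({"Pairs": chunk})},
        })
    return requests


# Hypothetical records; in the post these would come from the negative outcomes DataFrame.
records = [
    {
        "row_num": n,
        "prediction": 0,
        "probability": 0.1,
        "shap_image_s3_uri": "s3://example-bucket/shap_images/shap-{}.png".format(n),
    }
    for n in range(7)
]
requests = build_human_loop_requests(
    records, "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/shap-review-flow"
)

# import boto3
# a2i = boto3.client("sagemaker-a2i-runtime")
# for request in requests:
#     a2i.start_human_loop(**request)
```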

Understanding the plots for SHAP values

The following screenshot shows a sample task UI that the reviewers can work on.

From the preceding image, the reviewer sees three rows, each with a negative outcome instance, along with its probability score and plot of SHAP values. In the plot, features that push the prediction score higher (to the right) are shown in red, and those pushing the prediction lower are in blue. So the red ones are pushing the prediction towards a positive outcome and the blue ones are pushing it towards a negative outcome. The value E[f(x)] is the expected model output (the SHAP baseline) calculated by Clarify, and f(x) is the model output for this prediction, which equals the baseline plus the sum of the SHAP values. The units on the X-axis are log-odds units, so negative values indicate probabilities of less than 0.5 (negative outcome) that the person makes over $50,000 annually. Along the Y-axis, we see the top three features contributing the most to the outcome. We also see some values in gray text next to the feature names. These are the label-encoded values of these top features as provided during inference.

In the first row, the probability score is 0.13, which is very low, and the top three features are Capital Gain, Relationship, and Marital Status. The value of the Capital Gain feature is 0, the value of Marital Status is 1, which corresponds to Married-AF-spouse, and the value of Relationship is 5, which corresponds to Wife. All three features are pushing the probability score towards a positive outcome, but because other less significant features collectively pull the prediction towards a negative outcome, the model gives a very low probability score.

The second row shows another negative outcome instance detail. The probability score is 0.48, which is very close to the threshold (0.5), which indicates that it could have been a positive outcome if there were a slightly higher attribution by any feature towards the positive outcome. Again, the same three features Capital Gain, Relationship, and Marital Status are influencing the model’s prediction the most and pushing it towards a positive outcome, but not enough to cross over the threshold.

The third row shows a different trend, however. The probability score is very low and the Age feature has the highest attribution, but it pushes the prediction towards a negative outcome. Capital Gain comes second and moves the model’s prediction in the opposite direction by a good margin, but the third feature, Sex, pulls it back slightly. In the end, the model gives a very low probability score, indicating a negative outcome. Upon careful observation, we can see that although the Age and Capital Gain features are shown as big contributors, the X-axis only ranges between -7.5 and -5.5. So, even when the feature attributions look quite prominent, they move the model’s output within a narrow band of about 2 log-odds units, all of it far below the threshold.
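
To make the log-odds scale of these plots concrete, the following sketch converts log-odds to probabilities with the logistic function and rebuilds a model output from a baseline and SHAP values. The base value and SHAP values in the example are made up; only the -7.5 and -5.5 axis limits come from the third row discussed above.

```python
import math


def sigmoid(log_odds):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def model_output(base_value, shap_values):
    """f(x) in log-odds: the expected value plus the sum of all SHAP values."""
    return base_value + sum(shap_values)


# Axis limits from the third row: both ends map to probabilities far below 0.5.
p_low = sigmoid(-7.5)
p_high = sigmoid(-5.5)
print(round(p_low, 4), round(p_high, 4))

# Hypothetical base value and SHAP values for a single record.
f_x = model_output(-1.2, [0.8, 0.5, -0.2, -1.0])
print(f_x, sigmoid(f_x))
```

A log-odds value of 0 corresponds to a probability of exactly 0.5, which is why any record whose f(x) lands left of zero is a negative outcome at the default threshold. The two axis limits correspond to probabilities of roughly 0.0006 and 0.004.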

The human reviewer can carefully analyze each of the predictions and understand the trends of feature importance for each prediction and also across predictions. Based on their observations, they can decide whether to change the decision and provide the change reason in the template itself. Amazon A2I automatically records all the responses and stores them in JSON format to an Amazon S3 location.

Prepare the ground truth data based on Amazon A2I results

Next, we download the Amazon A2I result data and merge it with the batch data to generate the ground truth data. For this post, all the negative outcomes have been reviewed by human reviewers, so they can be treated as ground truth for retraining purposes.
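
A rough sketch of this merge follows. The top-level `inputContent`, `humanAnswers`, and `answerContent` fields follow the Amazon A2I output format, but everything inside them depends on the worker task template: the `Pairs` list and the `decision-<row_num>` answer keys used here are assumptions for illustration, as are the sample values.

```python
import pandas as pd


def reviewed_labels(a2i_outputs):
    """Turn A2I output documents into one ground truth label per reviewed row."""
    rows = []
    for output in a2i_outputs:
        answers = output["humanAnswers"][0]["answerContent"]
        for pair in output["inputContent"]["Pairs"]:
            row_num = pair["row_num"]
            # Hypothetical answer key: "agree" or "disagree" for each reviewed row.
            disagreed = answers.get("decision-{}".format(row_num)) == "disagree"
            # A disagreement flips the binary prediction; otherwise it is kept.
            label = 1 - pair["prediction"] if disagreed else pair["prediction"]
            rows.append({"row_num": row_num, "label": label})
    return pd.DataFrame(rows)


# Hypothetical batch features and one downloaded A2I output document.
batch_df = pd.DataFrame({"row_num": [0, 1, 2], "Age": [39, 50, 38]})
a2i_outputs = [{
    "inputContent": {"Pairs": [{"row_num": 0, "prediction": 0},
                               {"row_num": 2, "prediction": 0}]},
    "humanAnswers": [{"answerContent": {"decision-0": "agree",
                                        "decision-2": "disagree"}}],
}]

# The inner join keeps only the rows that a human actually reviewed.
ground_truth_df = batch_df.merge(reviewed_labels(a2i_outputs), on="row_num", how="inner")
print(ground_truth_df)
```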

The complete code related to this blog post can be found on the GitHub repo.

Clean up

To avoid incurring costs, delete the endpoint created as part of the code example.

Conclusion

This post demonstrated how we can use Clarify and Amazon A2I to improve transparency and bring trust into the ML lifecycle, in which Clarify provides details on which feature contributed the most to a particular outcome and Amazon A2I provides a human-in-the-loop feature to review the predictions and their SHAP attributions.

As a next step, test out this approach with your own dataset and use case, taking inspiration from the code to make the ML solutions more transparent.

For more information on Clarify, check out the whitepaper Amazon AI Fairness and Explainability.

About the Author

Vikesh Pandey is a Machine Learning Specialist Solutions Architect at AWS, helping customers in the Nordics and the wider EMEA region design and build ML solutions. Outside of work, Vikesh enjoys trying out different cuisines and playing outdoor sports.

Hasan Poonawala is a Machine Learning Specialist Solutions Architect at AWS, based in London, UK. Hasan helps customers design and deploy machine learning applications in production on AWS. He is passionate about the use of machine learning to solve business problems across various industries. In his spare time, Hasan loves to explore nature outdoors and spend time with friends and family.
