Training models with unequal economic error costs using Amazon SageMaker
Many companies are turning to machine learning (ML) to improve customer and business outcomes. They use the power of ML models built over “big data” to identify patterns and find correlations. Then they can identify appropriate approaches or predict likely outcomes based on data about new instances. However, as ML models are approximations of the real world, some of these predictions will likely be in error.
In some applications all types of prediction errors are truly equal in impact. In other applications, one kind of error can be much more costly or consequential than another – measured in absolute or relative terms, in dollars, time, or something else. For example, predicting someone does not have breast cancer when they do (a false negative error) will, according to medical estimates, likely have much greater cost or consequences than the reverse error. We may even be willing to tolerate more false positive errors if we sufficiently reduce the false negatives to compensate.
In this blog post, we address applications with unequal error costs with the goal of reducing undesirable errors while providing greater transparency to the trade-offs being made. We show you how to train a model in Amazon SageMaker for a binary classification problem in which the costs of different kinds of misclassification are very different. To explore this tradeoff, we show you how to write a custom loss function – the metric that evaluates how well a model makes predictions – that incorporates asymmetric misclassification costs. We then show you how to train an Amazon SageMaker Build Your Own Model using that loss function. Further, we show how to evaluate the errors made by the model and how to compare models trained with different relative costs so that you can identify the model with the best economic outcome overall.
The advantage of this approach is that it makes an explicit link between an ML model’s outcomes and errors and the business’ framework for decision-making. This approach requires the business to explicitly state its cost matrix, based on the specific actions to be taken on the predictions. The business can then evaluate the economic consequences of the model predictions on their overall processes, the actions taken based on the predictions, and their associated costs. This evaluation process moves well beyond simply assessing the classification results of the model. This approach can drive challenging discussions in the business, and force differing implicit decisions and valuations onto the table for open discussion and agreement.
Background and solution overview
Although model training always aims to minimize errors, most models are trained to assume that all types of errors are equal. However, what if we know that the costs of different types of errors are not equal? For example, let’s take a sample model trained on UCI’s breast cancer diagnostic data set.1 Clearly, a false positive prediction (predicting this person has breast cancer, when they do not) has very different consequences than a false negative prediction (predicting this person does not have breast cancer, when they do). In the first case, the consequence is an extra round of screening. In the second case, the cancer might be at a more advanced stage before it’s discovered. To quantify these consequences they are often discussed in terms of their relative cost, which then allows trade-offs to be made. While we can debate what the exact costs of a false negative or a false positive prediction should be, we believe we’d all agree that they’re not the same – although ML models are generally trained as if they are.
We can use a custom cost function to evaluate a model and see the economic impact of the errors the model is making (utility analysis). Elkan2 showed that applying a cost function to the results of a model can be used to compensate for imbalanced samples when used in standard Bayesian and decision tree learning methods (for example: few loan defaults, versus a large sample of repaid loans). The custom function can also be used to perform this same compensation.
We can also have the model “shift” its predictions in a fashion that reflects the difference in cost, by providing the costs of different types of errors to the model during training using a custom loss function. So, for example, in the breast cancer example we’d like the model to make fewer false negative errors and are willing to accept more false positives to achieve that goal. We may even be willing to give up some “correct” predictions in order to have fewer false negatives. At least, we’d like to understand the trade-offs we can make here. In our example, we’ll use costs from the healthcare industry.3,4
In addition, we’d like to understand in how many cases the model’s predictions are “almost” predicted as something else. For example, binary models use a cutoff (say, 0.5) to classify a score as “True” or “False.” How many of our cases were in fact very close to the cut-off? Are our false negatives classified that way because their score was 0.499999? These details can’t be seen in the usual representations of confusion matrices or AUC measures. To help address these questions, we have developed a novel, graphical representation of the model predictions that allows us to examine these details, without depending on a specific threshold.
In fact, there are likely cases where a model trained to avoid specific types of errors would begin to specialize in differentiating errors. Imagine a neural network that’s been trained to believe that all misrecognitions of street signs are equal. 5 Now, imagine a neural network that’s been trained that misrecognizing a stop sign as a sign for speed limit 45 mph is a far worse error than confusing two speed limit signs. It’s reasonable to expect that the neural network would begin to recognize different features. We believe this is a promising research direction.
We use Amazon SageMaker to build and host our model. Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. We author and analyze the model in a Jupyter notebook hosted on an Amazon SageMaker notebook instance, then build and deploy an endpoint for online predictions, using its “Bring Your Own Model” capability.
Note that while the terms “cost function” and “loss function” are often used interchangeably, we differentiate between them in this post, and provide examples of each:
- We use a “loss function” to train the model. Here, we specify the different weights of different kinds of errors. The relative weight of the errors is of most importance here.
- We use a “cost function” to evaluate the economic impact of the model. For the cost function, we can specify the cost (or value) of correct predictions, as well as the cost of errors. Here, dollar costs are most appropriately used.
This distinction allows us to further refine the model’s behavior or to reflect differing influences from different constituencies. Although in this model we’ll use the same set of costs (quality adjusted life years, QALY) for both functions, you could, for example, use relative QALY for the loss function, and costs of providing care for the cost function.
We’ll break up this problem into three parts:
- In “Defining a custom loss function,” we show how to build a custom loss function that weights different errors unequally. The relative costs of the prediction errors are provided as hyperparameters at runtime, allowing the effects of different weightings on the model to be explored and compared. We build and demonstrate the use of a custom cost function to evaluate our “vanilla” model, which is trained to assume that all errors are equal.
- In “Training the model,” we demonstrate how to train a model by using the custom loss function. We emulate and extend a sample notebook, which uses the UCI breast cancer diagnostic data set.
- In “Analyzing the results,” we show how we can compare the models to better understand the distribution of predictions as compared to our threshold. We’ll see that by training the model to avoid certain kinds of errors, we’ll affect the distributions so that the model differentiates more effectively between its positive and negative predictions.
We are building our own model and not using one of the Amazon SageMaker built-in algorithms. This means that we can make use of the Amazon SageMaker ability to train any custom model as long as it’s packaged in a Docker container with the image of that container available in Amazon Elastic Container Registry (Amazon ECR). For details on how to train a custom model on Amazon SageMaker, see this post or the various sample notebooks available.
To set up the environment necessary to run this example in your own AWS account, first follow Steps 0 and 1 in this previously published blog post to set up an Amazon SageMaker instance. Then, as in Step 2, open a terminal to clone our GitHub repo, https://github.com/aws-samples/amazon-sagemaker-custom-loss-function, into your Amazon SageMaker notebook instance.
The repo contains a directory named “container” that has all the components necessary to build and use a Docker image of the algorithm we run in this blog post. You can find more information on the individual components in this Amazon SageMaker sample notebook. For our purposes, there are two files that are most relevant and contain all the information to run our workload.
- This file describes how to build your Docker container image. Here you can define the dependencies of your code (for example, which language you are using, such as Python), what packages your code needs (for example, TensorFlow), and so on. More details can be found here.
- custom_loss/train. This file is executed when Amazon SageMaker runs the container for training. It contains the Python code that defines the binary classifier model, the custom loss function used to train the model, and the Keras training job. We describe this code in more detail later.
The notebook then imports libraries, creates some helper functions, imports the breast cancer data set, standardizes it, and exports training and test sets to Amazon S3 for later use by Amazon SageMaker training.
Defining a custom loss function
We now construct a loss function that weighs false positive errors differently from false negatives one. To do this, we build a binary classifier in Keras to use Keras’ ability to accommodate user-defined loss functions.
To build a loss function in Keras, we define a Python function that takes model predictions and ground-truth as arguments and returns a scalar value. In the custom function, we input the cost associated with a false negative error (fn_cost) and with a false positive error (fp_cost). Note that internally the loss function must use Keras backend functions to perform any calculations.
The following function defines the loss of a single prediction as the difference between the prediction’s ground-truth class and the predicted value weighted by the cost associated with misclassifying an observation from that ground-truth class. The total loss is the unweighted average of all of these losses. This is a relatively simple loss function, but building upon this foundation, more complex, situation-specific benefit and cost structures can be constructed and used to train models.
Training the model
Since we are using Amazon SageMaker to train a custom model, all of the code related to building and training the model is located in a Docker container image stored in Amazon ECR. The code shown here is an example of the code contained in the Docker container image.
The files containing the actual model code (and custom loss function, mirroring the copy shown earlier) as well as all the files necessary to create the Docker container image and push it to Amazon ECR are located in the repository associated with this blog post.
We construct and train three models so we can compare the predictions of various models using Keras’ built-in loss function as well as our custom loss function. We use a binary classification model that predicts the probability that a tumor is malignant.
The three models are:
- A binary classification model that uses Keras’ built-in binary cross-entropy loss with equal weights for false negative and false positive errors.
- A binary classification model that uses the custom loss function defined previously with false negatives weighted 5 times as heavily as false positives.
- A binary classification model that uses the custom loss function defined previously with false negatives weighted 200 times as heavily as false positives.
The costs used in the last model’s loss function are based upon the medical literature.3,4 The costs of screening are measured in QALYs. One QALY is defined as one year of life in full health (1 year x 1.0 health). For example, if an individual is at half health, that is, 0.5 of full health, then one year of life for that individual is equal to 0.5 QALYs (1 year x 0.5 health). Two years of life for that individual is worth 1 QALY (2 years x 0.5 health).
Here, a true negative outcome is measured as the baseline of costs, that is, all other costs in the table are measured relative to a patient without breast cancer that tests negative for breast cancer. A woman with breast cancer that tests negative loses 2.52 QALYs relative to the baseline, and a woman without breast cancer that tests positive loses 0.0128767 QALYs (or about 4.7 days) relative to the baseline. A QALY has an estimated economic value of $100,000 USD. So these values can also be translated into dollar costs by multiplying the cost in QALYs by $100,000 USD. Given these values, a false negative error is about 200 times more costly than a false positive one. See the medical literature referenced in the introduction for more detail surrounding these costs.
The middle model value of 5 was chosen for demonstration purposes.
With these costs in hand, we can now estimate the model. Estimating the parameters of a model in Keras is a three-step process:
- Defining the model.
- Compiling the model.
- Training the model.
Defining the model architecture
First, we define the structure of the model. In this case, the model consists of a single node in a single layer. That is, for each model that follows, we add a single Dense layer with a single unit that takes a linear combination of features and passes that linear combination to a sigmoid function that outputs a value between 0 and 1. Again, the actual executable version of the code is in the Docker container, but is shown here for illustrative purposes. We’ll provide the relative weights in a later step.
Next, let’s compile the models. Compiling a model refers to configuring the learning process. We need to specify the optimization algorithm and the loss function that we will use to train the model.
This is the step in which we incorporate our custom loss function and relative weights into the model training process.
Training the model
Now we’re ready to train the models. To do this, we call the fit method and provide the training data, number of epochs, and batch size. Whether you use a built-in or a custom loss function, the code is the same in this step.
Building the Docker image
We then execute a shell script (
build_and_push.sh) that builds the Docker image that contains the custom loss function and model code and pushing image to Amazon Elastic Container Registry (ECR). The “image_name” defined at the top of this notebook is the name that will be assigned to the repository in ECR that contains this image.
As mentioned previously, we perform the actual training of the binary classifier by packaging the model definition and training code in a Docker container and using the Amazon SageMaker bring-your-own-model training functionality to estimate the model’s parameters.
The following code blocks train three versions of the classifier:
- One with Keras’ built-in
- One with a custom loss function that weighs false negatives 5 times more heavily than false positives.
- One with a custom loss function that weighs false negatives 200 times more heavily than false positives.
We first create and execute an Amazon SageMaker training job for built-in loss function, that is, Keras’s binary cross-entropy loss function. By passing the
loss_function_type set to
builtin, Amazon SageMaker knows to use Keras’s binary cross-entropy loss function. Then, we create and execute an Amazon SageMaker training job for the custom 5:1 loss function, that is, custom loss with false negatives being 5 times more costly than false positives. By passing the
loss_function_type set to custom and
fn_cost set to 5 and
fp_cost set to 1, respectively, Amazon SageMaker knows to use the custom loss function with the specified misclassification costs.
Finally, we create and execute the
Medical model the same way, but with
fn_cost set to
fp_cost set to
After training the model, Amazon SageMaker uploads the trained model artifact to the Amazon S3 bucket we specifed in the output_path parameter in the training jobs. We now load the model artifacts from S3 and make predictions with all three models variants, then compare results.
Analyzing the results
What characteristics are we generally looking for in a well-performing model?
- There should be a small number of false negatives, that is, a small number of malignant tumors classified as benign.
- Predictions should cluster closely around ground truth values, that is, predictions should cluster closely around 0 and 1.
Keep in mind as you rerun this notebook that the data set used is small (569 instances), and therefore the test set is even smaller (143 instances). Because of this, the exact distribution of predictions and prediction errors of the model may vary from run to run due to sampling error. Despite this, the following general results hold across model runs.
Accuracy and the ROC Curve
First, we’ll show traditional measures of the model.
By these traditional measures, the two custom loss function models do not perform as well as the built-in loss function (by a small margin).
However: accuracy is less relevant in judging the quality of these models. In fact, accuracy may be lowest in the “best” model because we are willing to have more false positives as long as we decrease the number of false negatives sufficiently.
Looking at the ROC curve and AUC score for these three models, all models appear very similar according to these measures. However, neither of these metrics show us how the predicted scores are distributed over the [0, 1] interval so we are not able to determine where those predictions are clustered.
Keep in mind that the cost of a false negative is increasing as we move through these three models. That implies that the number of false negatives is likely to decrease in each successive model.
What does this imply for the values in these classification reports? It implies that the negative class (benign) should have higher precision and that the positive class (malignant) should have higher recall. (Remember that precision = tp / (tp + fp); recall = tp / (tp + fn).)
Remember that for our classification problem we are classifying tumors as benign or malignant. According to the costs reported in the medical literature cited previously, a false negative is much more costly than a false positive. Because of that, we want to classify all malignant tumors as such and are not bothered by that resulting in more false positive predictions (to a point). Therefore, for the negative class (benign), we care more about having a high precision, and for the positive class (malignant), we care more about having a high recall.
Built-in Loss Function
5:1 Custom Loss Function
Medical Custom Loss Function
These classification reports show that we’ve achieved our goal: the medical model has the highest precision for benign, and the highest recall for malignant.
What this implies is that when using the medical model, we are least likely to falsely classify a malignant tumor as benign, and we are most likely to identify all malignant tumors as malignant.
Looking at the detail of these reports allows us to see that the medical model is the “best” of these three models, despite having the lowest F1-score, and the lowest average precision and recall.
To better understand the errors, a better tool is the confusion matrix.
Since our goal is to reduce the number of false negatives, the model with the fewest false negatives is “best,” provided that the increase in false positives is not excessive.
As we move through these three confusion matrices, the cost of a false negative relative to a false positive increases. As such, we expect the number of false negatives to decrease and the number of false positives to increase. However, the number of true positives and true negatives might also shift because we’re training the model to weight differently than before.
Built-in loss function
5:1 Custom Loss Function
Medical Custom Loss Function
We can see from the results that modifying the loss function values provided when training the model allows us to shift the balance between the categories of error. Using different weightings for the relative cost has a significant impact on the errors, and moves some of the other predictions as well. An interesting direction for future research is to explore the changing features that are identified by the model in support of these prediction differences.
This gives us a powerful lever to influence the model based on the moral, ethical, or economic impacts of the decisions we make about the relative weights of the different errors.
Custom confusion matrix
A specific observation is classified as positive or negative by comparing its score to a threshold. Intuitively, the further away the score is from the threshold chosen, the higher is the assumed probability that the prediction is correct (assuming that the threshold value is well-chosen).
When comparing the model’s prediction and the threshold used for dividing classes, it’s possible that the values are very close to the threshold. In the extreme, the difference in values between a “true” or a “false” could be less than the error between two different readings of an input sensor or measurement; or even, less than the rounding error of a floating point library. In the worst case, the majority of the scores for our observations could be clustered quite close to the threshold. These “close confusions” are not visible in the confusion matrix, or in the previous F1-scores or ROC curves.
Intuitively, it’s desirable to have the majority of the scores further away from the threshold, or, conversely, to identify the threshold based on gaps in the distribution of scores. (In cartography, for example, the Jenks’ natural breaks method is frequently used to address the same problem.) The following graphs give us a tool to explore the relationship of the scores to the threshold.
Each of the following sets of distribution plots shows the actual scores for each sample in the confusion matrix. In each set, the top histogram plots the distribution of predicted scores for all actual negatives, that is, predicted scores for benign tumors (the top row of the confusion matrix). The bottom histogram plots predicted scores for actual positives (the bottom row).
The correctly classified observations on each plot are colored blue, and the incorrectly classified observations are colored orange. The threshold value of 0.5, used in other functions in this notebook, is used for coloring the plots. However, this threshold choice does NOT affect the actual scores or shape or level of the plots, only the coloring. Another threshold could be chosen, and the results in this section would still hold.
In the charts below, a “good” distribution is one in which the predictions are largely grouped around the 0 and 1 points. More specifically, a “good” set of histograms would have the actual positives largely clustered around 1 with few false negatives, that is, few orange points. We would like to see the actual negatives clustered around 0, but for this use case we are willing to accept a prediction spread over the support with false positives as long as this gets us a small number of false negatives with predictions clustered around 1 for the actual positives.
We can see from these plots that as we increase the ratio, the distributions shift. As the ratio between the error types increases, the model pushes a larger number of samples to the extremes, essentially becoming much more discriminatory.
We can also see that few of our actual positives have scores close to the cutoff. The model is demonstrating increased “certainty” in its classifications of the positives.
We now calculate the expected value (economic value) of each of the three classification models. The expected value captures the probability-weighted loss expressed in US dollars that an individual patient is expected to suffer if given a specific diagnostic test. The diagnostic test with the highest expected value is considered the “best” under this metric. The expected value is stated in US dollars.
For an explanation of QALY and the dollar values associated with testing outcomes defined in the following cell, see the discussion of screening costs earlier in this blog post.
Note that this section reflects the value of all four possible test outcomes—true and false negatives, as well as true and false positives.
The binary classifier trained with the custom loss function based upon the costs reported in the medical literature is the least costly of the three.
Note that while we used QALY values to train the model and also to evaluate the economic cost, it is not necessary to use the same values. This provides a second powerful lever to use in influencing, understanding and evaluating the model.
Now that we have demonstrated how to train the classifier with a custom loss function and inspected the results, feel free to experiment with different relative values for FN and FP, and with different costs, and explore the impact.
In the example worked in this blog post, we’ve shown how to use a custom loss function to modify the balance of FN and FP errors. We’ve shown that we can impact that balance separately from the costs of different kinds of treatment plans applied to each set of predictions.
There are several ways in which this work can be extended. Promising avenues include:
- Using the relative loss and the economic costs as hyperparameters and exploring the hyperparameter space to find optimal trade-offs.
- Exploring different or more complex cost functions, including making the costs dependent on specific features within an observation.
- Further exploration and understanding of how the model changes with different relative costs, and the features most relevant to those changes.
We’ve shown the approach applied to a binary classification problem, however, it is generalizable to multiclass classification problems. Designing the input cost matrix that accurately (or adequately) reflects the costs of different kinds of errors or misclassifications is more challenging. For example, identifying a stop sign as a 45 mph sign will probably not have the same cost or consequences as the reverse. However, a model trained with this understanding could provide a better overall economic value than one trained to simply maximize precision, recall, F1-score, or AUC.
In this blog post, we’ve shown the power of using a custom loss function to represent the true impacts of different kinds of errors. The custom loss function allows us to choose the relative balance of the types of errors made by the model, and to evaluate the economic impact of changing that balance. Visualizing the resulting score distributions lets us evaluate the discriminatory power of the model. We can also evaluate the costs and tradeoffs of different approaches for the different predictions and their errors. This combination gives the business a powerful new tool to link machine learning to business results, providing greater transparency to the trade-offs being made.
- Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Elkan, Charles. “The Foundations of Cost-Sensitive Learning.” In International Joint Conference on Artificial Intelligence, 17:973–978. Lawrence Erlbaum Associates Ltd, 2001.
- Wu, Yirong, Craig K. Abbey, Xianqiao Chen, Jie Liu, David C. Page, Oguzhan Alagoz, Peggy Peissig, Adedayo A. Onitilo, and Elizabeth S. Burnside. “Developing a Utility Decision Framework to Evaluate Predictive Models in Breast Cancer Risk Estimation.” Journal of Medical Imaging 2, no. 4 (October 2015). https://doi.org/10.1117/1.JMI.2.4.041005.
- Abbey, Craig K., Yirong Wu, Elizabeth S. Burnside, Adam Wunderlich, Frank W. Samuelson, and John M. Boone. “A Utility/Cost Analysis of Breast Cancer Risk Prediction Algorithms.” Proceedings of SPIE–the International Society for Optical Engineering 9787 (February 27, 2016).
- Eykholt, Kevin, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. “Robust Physical-World Attacks on Deep Learning Models.” ArXiv:1707.08945 [Cs], July 27, 2017. http://arxiv.org/abs/1707.08945.
About the Authors
Veronika Megler, PhD, is a senior consultant for AWS Professional Services. She enjoys adapting innovative big data, AI and ML technologies to help customers solve new problems, and to solve old problems more efficiently and effectively.
Scott Gregoire is a Data Scientist with AWS Professional Services. He holds a PhD in Economics from the University of Texas at Austin and has advised clients in sectors ranging from international finance to retail. Currently, he is working with customers to develop innovative machine learning solutions on AWS.