AWS Quantum Technologies Blog

Explainable AI using expressive Boolean formulas

This post was contributed by Gili Rosenberg, Kyle Brubaker, Martin Schuetz, Grant Salton, Jason Zhu, and Helmut Katzgraber from the Amazon Quantum Solutions Lab, and Elton Zhu, Serdar Kadıoğlu, and Sima Borujeni from the Fidelity Center for Applied Technology (FCAT) and the AI Centre of Excellence at Fidelity Investments.

Customers routinely develop large, sophisticated machine learning (ML) models to accomplish a variety of tasks. These models tend to be extremely complex by design (sometimes with billions of parameters), and hence their inner workings are difficult to understand and interpret. However, in many customer applications explainability is desired or even required by industry regulations, especially in high-stakes situations such as finance or healthcare. For example, a model that approves credit requests or diagnoses a condition or disease needs to be interpretable, because important personal decisions are made based on its output.

In this post, we discuss recent joint work by scientists from the Amazon Quantum Solutions Lab (QSL) and the Fidelity Center for Applied Technology (FCAT) to train interpretable ML models based on expressive Boolean formulas. We start by introducing expressive Boolean formulas, define the problem we set out to solve, describe the native local solver we developed for this problem (with and without non-local moves), and explain how a quantum computer could be used to accelerate the training by quickly proposing non-local moves. We conclude with the main takeaways from our work.

Explainable AI

Explainable AI (XAI) is a branch of ML that aims to explain or interpret the decisions of ML models. Broadly speaking, there are two prevalent approaches to XAI:

  1. Post hoc explanation of black-box models (Explainable ML)—Methods that attempt to explain the decisions of a model after they are made. These approaches (such as LIME and SHAP) are typically model agnostic and can be applied to arbitrarily complex models, such as the ones commonly used in deep learning. Some issues that can occur with these methods are a lack of robustness, such as being easily fooled by adversarial attacks, and ambiguity, where multiple “explanations” may exist.
  2. Training interpretable models (Interpretable ML)—Models that are inherently interpretable, meaning it is possible to check and understand their inner workings. Common examples include decision trees and linear regression. However, a reputation for interpretability can be misleading: it's easy to find simple rules for which a huge decision tree is required to represent them (see Figure 1).

Our work falls under the latter approach — we introduce an interpretable ML model. Note that interpretable ML models can generally be used as standalone interpretable models, or they can be used to explain black-box models.

Figure 1: The rule y=AtLeast3(f0,f1,f2,f3,f4) can be grasped easily, but the equivalent decision tree representation of this rule is complex.

Expressive Boolean Formulas

The advantages of interpretable ML models for high-stakes decision-making motivated FCAT to explore this area further. Quantum computers can be used to solve combinatorial optimization problems and may one day outperform classical computers on such problems. Given the QSL’s interest and expertise in applying quantum computers, it was natural to look for interpretable ML models that might be sped up by having access to a fast optimizer.

We settled on what we call “expressive Boolean formulas” (or just “formulas”). These formulas consist of literals and operators. Literals are variables f_i or negated variables ∼f_i, where f_i is the i-th feature in the input data. Operators are operations that are performed on two or more literals, such as And(f0, f1, ∼f2) (see Figure 2 for an illustrative example). Some operators are parameterized; for example, AtLeast2(f0, f1, f2) returns true only if at least two of the literals are true.

Figure 2: A simple expressive Boolean formula. This formula contains six literals and two operators. It can also be stated as And(Choose2(a,b,c,d),∼e,f).

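To make this concrete, here is a minimal sketch (plain Python, not the BoolXAI API shown later) of how the formula in Figure 2 could be evaluated on one row of binary features. The function names are ours, for illustration only, and we take Choose2 to mean "exactly two of the arguments are true":

def choose2(*literals):
    # Choose2: true when exactly two of the literals are true
    return sum(literals) == 2

def figure2_rule(a, b, c, d, e, f):
    # And(Choose2(a, b, c, d), ~e, f): all three sub-expressions must hold
    return choose2(a, b, c, d) and not e and bool(f)

print(figure2_rule(a=1, b=1, c=0, d=0, e=0, f=1))  # True
print(figure2_rule(a=1, b=1, c=1, d=0, e=0, f=1))  # False: three literals under Choose2 are true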

A formula defines a rule that can be used as the basis for a binary classifier. Given a set of inputs, the rule is evaluated to produce a binary result: the prediction of the ML model for that set of inputs. We hypothesized that expressive Boolean formulas are more expressive than many other interpretable ML models, due to their flexible structure and relatively wide vocabulary (number of operators). See Figure 3 for a comparison of the complexity (for formulas, the number of operators and literals) required to fully represent various rules. Of the methods compared, ours (denoted by “Rule”) requires significantly lower complexity to fully represent the given rules, across varying numbers of literals under the operator.

Figure 3: A comparison of the complexity required to represent rules of the form AtLeast3, AtMost3, and Choose3, while varying the number of literals included under the operator. “CNF” is a conjunctive normal form formula (an And of Ors), “DT” is a decision tree as implemented in scikit-learn, and “Rule” is an expressive Boolean formula as defined in this work.

The Problem

The problem that we seek to solve is the following: given a set of binarized input data and binarized labels, how do we determine the rule that best fits the data (inputs) and labels (outputs)? This should be thought of as training the ML model – an interpretable classifier. In this case, the training is done by solving a combinatorial optimization problem. Our main objective is thus to come up with a solver that will do this efficiently (see Figure 4).

Figure 4: Given binarized data and binarized labels, the solver trains the interpretable classifier by solving a combinatorial optimization problem.

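Stated a bit more schematically (this is a summary of the setup, not the exact notation from our paper): given binarized data $X$ and binarized labels $y$, we search over expressive Boolean formulas $R$ for

$$ \max_{R} \; \mathrm{score}\big(R(X),\, y\big) \quad \text{subject to} \quad \mathrm{complexity}(R) \le k, $$

where the score is a classification metric (in our experiments, balanced accuracy), $R(X)$ denotes the rule's predictions on the rows of $X$, and the complexity is the total number of literals and operators in the formula, capped at a chosen maximum $k$.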

Native Local Solver

Below we describe how our native local solver works. Here, “native” refers to optimization in the natural search space for the problem; for us, that is the space of feasible expressive Boolean formulas. This is helpful because it ensures that the solver does not spend precious time sifting through infeasible solutions. This approach is in contrast to reformulating the problem in a fixed format such as MaxSAT, ILP (integer linear programming), or QUBO (quadratic unconstrained binary optimization), which would be difficult (if not impossible) and often requires searching a much larger space containing many infeasible solutions. “Local” refers to exploration of the search space via stochastic search, i.e., by performing a series of moves that make relatively small changes to the current configuration.

Local moves in this context can be, for example, removing a literal (see Figure 5), adding a literal, and so on. The native local solver runs a series of local moves, until a good enough (ideally near-optimal) solution is found.

Figure 5: An example local move — remove literal d.

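As a rough illustration, the core loop of such a solver could look like the following sketch. The helper callables score (for example, balanced accuracy on the training data) and propose_local_move (returning a modified copy of a rule with a literal added, removed, or changed) are hypothetical stand-ins, not the BoolXAI API:

import random

def local_search(initial_rule, score, propose_local_move, num_iterations=1000, seed=42):
    # Stochastic local search over feasible expressive Boolean formulas (sketch)
    rng = random.Random(seed)
    best_rule, best_score = initial_rule, score(initial_rule)
    for _ in range(num_iterations):
        candidate = propose_local_move(best_rule, rng)  # small, feasible change
        candidate_score = score(candidate)
        if candidate_score >= best_score:  # greedy acceptance; an annealing-style rule could also be used
            best_rule, best_score = candidate, candidate_score
    return best_rule, best_score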

Adding Non-local Moves

Imagine that you have access to a hardware accelerator, classical or quantum, that can solve combinatorial optimization problems extremely fast. How could you use it to potentially speed up the solver? One option is to propose larger, “non-local” moves that are computationally expensive, but can be determined (proposed) by solving the same type of optimization problem that the accelerator solves. If we can perform such non-local moves faster than getting the same improvement via local moves, then we may expect to see an advantage.

But what are non-local moves, as they apply to expressive Boolean formulas? The idea is to choose an operator from the existing formula, and then to optimize the subtree underneath it (see Figure 6).

Figure 6: An example non-local move — swap node with a subtree of depth one. This move replaces an operator (or a literal, not shown) with a chosen operator (in this case, Or).

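Schematically, a single non-local move could be interleaved with the local moves along the following lines. Everything in this sketch is a hypothetical stand-in for the actual implementation: build_subtree_ilp encodes “find the best subtree at this node” as an ILP/QUBO, and solve_ilp sends that problem to a fast optimizer (classical today, potentially quantum in the future):

def non_local_move(rule, rng, build_subtree_ilp, solve_ilp):
    # Pick an operator node in the current formula and re-optimize the subtree beneath it (sketch)
    node = rng.choice(rule.operator_nodes())  # hypothetical rule API
    new_subtree = solve_ilp(build_subtree_ilp(node))
    return rule.with_subtree_replaced(node, new_subtree)  # hypothetical rule API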

It turns out that we can formulate the search for non-local moves as an ILP/QUBO, which can be solved by quantum computers (and on which quantum computers may one day outperform classical ones). As an example, we'll briefly outline how to construct an ILP formulation for finding the best Or rule. We start by observing that, for binary features, the rule Or(f0, f1) can be equivalently expressed as:
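$$ f_0 + f_1 \ge 1. $$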

We can then define an optimization problem to find the smallest subset of features to include in the Or rule to achieve perfect accuracy:
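$$
\begin{aligned}
\min_{b \in \{0,1\}^n} \quad & \lVert b \rVert_0 \\
\text{subject to} \quad & X_P\, b \ge 1, \\
& X_N\, b = 0,
\end{aligned}
$$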

where b is a vector of indicator variables indicating whether each feature should be included in the rule (i.e., b_i = 1 if feature f_i is included and b_i = 0 otherwise), X_P is a matrix containing only the rows labeled as “positive” (y_i = 1), X_N is a matrix containing only the rows labeled as “negative” (y_i = 0), and 0 and 1 are vectors containing only zeros and ones, respectively.

In our paper, we extend this formulation in various ways, which we outline below:

  • Negated features – We allow negated features ~f_i to be included in the rule by adding an additional indicator vector ~b for the negated features.
  • Imperfect classification – In practice, we typically do not expect to be able to achieve perfect accuracy. With this in mind, we introduce a vector of “error” indicator variables η indicating whether each data row is misclassified. We change the constraints so that when the error variable corresponding to a particular sample is 1, the corresponding constraint is always true by construction, effectively deactivating that constraint. Accordingly, we also change the objective function so that it minimizes the number of errors, given by ||η||_0. To deal with unbalanced datasets, we allow the positive and negative error terms to be weighted differently, by introducing the weights w_P and w_N (respectively).
  • Controlling the complexity – To control the complexity of the rule, we add a regularization term, as well as an explicit constraint on the number of literals.

Putting these pieces together, the complete ILP formulation for finding the best Or rule takes roughly the following form; for the precise statement, formulations for other types of rules, and a recipe for converting ILPs to QUBOs, please see our paper:
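$$
\begin{aligned}
\min_{b,\,\tilde b,\,\eta_P,\,\eta_N} \quad & w_P \lVert \eta_P \rVert_0 \;+\; w_N \lVert \eta_N \rVert_0 \;+\; \lambda \left( \lVert b \rVert_0 + \lVert \tilde b \rVert_0 \right) \\
\text{subject to} \quad & X_P\, b + (1 - X_P)\, \tilde b + \eta_P \ge 1, \\
& X_N\, b + (1 - X_N)\, \tilde b \le k\, \eta_N, \\
& \lVert b \rVert_0 + \lVert \tilde b \rVert_0 \le k, \\
& b, \tilde b \in \{0,1\}^n, \quad \eta_P \in \{0,1\}^{|P|}, \quad \eta_N \in \{0,1\}^{|N|}.
\end{aligned}
$$

Here ~b (written $\tilde b$ above) indicates which negated features are included, η_P and η_N flag misclassified positive and negative rows, λ is the regularization weight, and k is the maximum allowed number of literals; the factor k on the right-hand side of the second constraint deactivates it when the corresponding error variable is 1, since at most k literals can be selected. For binary vectors, ||·||_0 is simply the sum of the entries.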

Results

In our paper, we benchmarked native local optimization with and without non-local moves on several well-known datasets; for example rules found for each dataset, see Table 1. Since these datasets are imbalanced, we chose balanced accuracy as the performance metric. Balanced accuracy is defined as the average of the true positive rate (sensitivity) and the true negative rate (specificity):
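$$ \text{balanced accuracy} \;=\; \frac{1}{2}\left( \frac{TP}{TP + FN} \;+\; \frac{TN}{TN + FP} \right), $$

where TP, TN, FP, and FN count the true positives, true negatives, false positives, and false negatives, respectively.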

For a comparison of the balanced accuracy achieved with and without non-local moves, see Figure 7. We see that the inclusion of non-local moves provides a benefit at the higher complexity value of 30, but not at the lower value of 3. The complexity here is an upper bound on the total number of literals and operators. Note that increasing the complexity generally yields diminishing returns.

Table 1: Example rules obtained by the native local solver for each dataset. The variable names in the rules were obtained from the original datasets.

Figure 7: Balanced accuracy vs. number of iterations, with non-local moves (“Non-local”) and without (“Local”), for the Credit Risk dataset.

BoolXAI – Code Package

We will be open-sourcing part of our code. BoolXAI is a package that finds expressive Boolean formulas via native local optimization. Here’s a quick-start example that trains a binary classifier:

import numpy as np
from sklearn.metrics import balanced_accuracy_score

from boolxai import BoolXAI

# Create some random toy data for binary classification
# Note: X and y must both be binarized
rng = np.random.default_rng(seed=42)
X = rng.choice([0, 1], size=(100, 10))
y = rng.choice([0, 1], size=100)

# Rule classifier
rule_classifier = BoolXAI.RuleClassifier(random_state=42)

# Learn the best rule
rule_classifier.fit(X, y)

The rule can then be printed:

best_rule = rule_classifier.best_rule_
best_score = rule_classifier.best_score_
print(best_rule)

which gives output of the following form (this particular rule was learned on a real, binarized dataset rather than the random toy data above):

'And(~[926.96<=worst area<1269.0], ~[worst radius>=23.682], ~[worst compactness>=0.4478], ~[1269.0<=worst area<1673.0], ~[781.18<=worst area<926.96])'

or plotted, where feature_names is a list of human-readable names for the features:

best_rule.plot(feature_names)

which displays:

Figure 8: The BoolXAI rule plot generated by the code in this post.

Conclusion

We started this blog post by introducing the concept of expressive Boolean formulas and explained how they can be used to form an interpretable ML model for binary classification. We continued by explaining how one can train such a model using native local optimization, and how a quantum computer could be used to accelerate the training by fast proposal of non-local moves. Finally, we previewed BoolXAI, an open-source package that can be used to do all of this (though not, currently, the non-local moves). For further technical details on this work, see our recently published paper.

If you work with complex models in regulated industries and/or involving high-stakes decisions, and would like to understand more about how explainable AI with expressive Boolean formulas could apply to your business, reach out to the Amazon QSL to start a conversation.