Assessing the Stability of Interpretable Models
Abstract
Interpretable classification models are built with the purpose of providing a comprehensible description of the decision logic to an external oversight agent. When considered in isolation, a decision tree, a set of classification rules, or a linear model is widely recognized as human-interpretable. However, such models are generated as part of a larger KDD process, which, in particular, comprises data collection and filtering. Selection bias in data collection or in data preprocessing may affect the model learned. Although model induction algorithms are designed to generalize, they pursue optimization of predictive accuracy. It remains unclear how interpretability is impacted instead. We conduct an experimental analysis to investigate whether interpretable models are able to cope with data selection bias as far as interpretability is concerned.
1 Introduction
Interpretable machine learning models aim at trading off predictive performance against human comprehensibility and verifiability. They are also used to explain the global logic of inscrutable black-box machine learning models [guidotti2018survey]. This is achieved by a form of reverse engineering, where interpretable models are trained on a (typically random) sample of the population. If the interpretable model can accurately reproduce the black-box decisions, it can be used as a surrogate model of the black box. The KDD process of learning an interpretable model includes a number of design choices:

on the set of features to use (feature selection). A black box uses a set of features which may not be completely known, hence reverse engineering it must consider which features to use for the surrogate model;

on the subset of data to use (instance selection). Instance generation in black-box explanation can be purely random [ribeiro2016should], or adopt refined approaches, e.g., genetic algorithms [guidotti2018local];

on the machine learning model to use (model selection), on the specific learning algorithm, and on its parameters. An experimental phase is typically part of the design, with the purpose of selecting the most accurate model.
Such a process must be accountable, namely the interpretable (surrogate) model must be able to provide "a satisfactory answer [about black-box decisions] to an external oversight agent" (IEEE Glossary of Ethics of Autonomous and Intelligent Systems: ethicsinaction.ieee.org). However, since the above design choices include a number of elements subject to randomness, the process may end up with unstable results, i.e., variations in training data and/or design choices may lead to different interpretable models and decision explanations. Stability of interpretable models is then a key property towards accountability of machine learning (black-box) decision making.
We present an experimental study of the stability of interpretable classification models with respect to the three design choices above. We consider decision trees, rule-based classifiers, and linear models, which are widely agreed to provide explanations of their decisions that are easily interpretable by humans [freitas2014comprehensible, DBLP:journals/dss/HuysmansDMVB11]. We conclude that, in order to pursue accountability, the learning process of an interpretable model should comprise a stability impact assessment, which is currently missing in guidelines and best practices.
2 Related Work
Stability is a property of the output of a learning process. The representation of the output can be intensional (a classifier) or extensional (its predictions). Extensional stability of classifier predictions was modelled by [DBLP:journals/ml/Turney95] through a measure of agreement among predictions. He proposed a repeated 2-fold cross-validation approach. At each repetition, two classifiers are built on the two folds, and tested on artificially generated instances from a population distribution. The agreement measure is the percentage of instances on which the predictions of the two classifiers coincide. The average agreement over the runs is the final estimate of stability of the learning process. Agreement is a semantic measure, and it has the advantage of being classifier-agnostic. Related to the measurement of extensional stability is the bias-variance decomposition of the error of classifiers [DBLP:books/lib/HastieTF09]. With increasing model complexity, bias is reduced and variance is increased, at the risk of overfitting. This would suggest that less interpretable models are also more unstable and overfitted. On the theoretical side, [DBLP:journals/jmlr/BousquetE02] proved that generalization error can be bounded by (the expectation of) stability.
Measures of interpretability of classifiers must, however, necessarily be syntactic, since this is the level at which humans interface with models. This paper therefore concentrates on the intensional stability of a learning process. One of the early studies regards the impact of training set size on the accuracy of decision trees [OatesJ97], showing that the best performance can be achieved with sufficiently many data, after which there is no benefit in adding more. Variability of classifiers due to random noise in data has been tackled by adopting statistical tests for validating split tests at decision nodes [DBLP:journals/jcst/KatzSRO14], or by adopting split methods that account for almost equal split attributes (sources of instability) [DBLP:conf/kdd/LiB02]. Finally, decision tree simplification is another class of approaches that trade off accuracy with simplicity [Breslow:1997:SDT:976289.976290]. Studies of the intensional stability of feature selection methods have considered variability in the set of features selected [DBLP:journals/kais/KalousisPH07, Nogueira016]. Measures of stability include average Jaccard similarity and Pearson’s correlation among all pairs of feature subsets selected from different training sets generated using cross-validation, jackknife or bootstrap. As pointed out by [DBLP:journals/kais/KalousisPH07], intensional instability of feature selection does not necessarily imply extensional instability of the final classifier, due to redundant features. In summary, an experimental study of the intensional instability of interpretable models under variation of the learning process design choices is missing in the literature. This is becoming relevant in the context of black-box explanation, where an early attempt at studying robustness of single explanations is [DBLP:journals/corr/abs180608049].
3 Setting the Stage
Interpretable models. Interpretability is the ability to explain or to provide meaning in terms understandable to a human [guidotti2018survey]. Decision trees, rule-based classifiers, and linear models are acknowledged as being interpretable classification models. Decision trees (DT) consist of a tree graph with internal nodes representing tests on predictive features, and leaf nodes assigning a class label to instances reaching the leaf (see, e.g., Figure 1 (a)). A path from the root to a leaf represents an explanation of the decision at the leaf in terms of a conjunction of test conditions. We consider the two most widely adopted learning algorithms: CART (Classification and Regression Trees) [breiman1984classification] as implemented by the scikit-learn Python library (http://scikit-learn.org), and C4.5 [quinlan1993c4] as implemented by the computationally efficient YaDT (Yet another Decision Tree) system (http://pages.di.unipi.it/ruggieri/software) [ruggieri2004yadt]. C4.5 performs multi-way univariate splits and it includes tree simplification (error-based pruning). We do not consider the split condition of [DBLP:conf/kdd/LiB02], designed for stability, since it produces disjunctive test conditions, thus leading to a more expressive language.
Rule-based (RB) classifiers consist of a set of classification rules, typically in the form of if-then rules stating the class label for a given conjunctive condition on the predictive feature values (see Figure 1 (b)). In this work, we consider the FOIL (First Order Inductive Learner) [quinlan1993foil] and CPAR (Classification based on Predictive Association Rules) [yin2003cpar] algorithms, as implemented by the LUCS-KDD library (https://cgi.csc.liv.ac.uk/frans/KDD/Software). The former generates a very small number of rules, but has lower accuracy than the latter. Similarly to DTs, and for space reasons, we restrict to sets of conjunctive classification rules. Another natural choice would have been RIPPER [DBLP:conf/icml/Cohen95], which unfortunately produces ordered sequences of conjunctive rules. That is, we compare DT and RB classifiers with the same expressivity.
Linear models (LM) consist of the sign and the magnitude of the contribution of feature values (or ranges) to a class label (encoded as an integer), as stated by coefficients in a linear formula (see Figure 1 (c)). If the contribution is positive (resp., negative), the value of the feature increases (resp., decreases) the probability of the model’s decision. We focus on three algorithms for linear models: Linear Regression (LINREG) [yan2009linear], and its regularized forms RIDGE [tikhonov1963solution] and LASSO [tibshirani1996regression], as implemented by the scikit-learn library. They are commonly used in black-box explanation approaches [kononenko2010efficient, ribeiro2016should].
Measuring interpretability and stability. Several syntactic measures of interpretability have been considered in the literature. Structural measures (SM) look at models in isolation, and quantify the degree of syntactic (intensional) interpretability of a model by resorting to model complexity. Stability is quantified through the deviation of the measure distribution over models learned from different samples of the population. Comparative measures (CM) look at pairs of models, and quantify the syntactic similarity between the two models. Stability is quantified by the mean value over all pairs of models learned from different samples of the population. Measures common to decision trees, rule-based classifiers and linear models include:

number of features (SM) used: for DT the features used in at least one split node, for RB those used in at least one rule, for LM the features with nonzero coefficient. (While YaDT and LUCS-KDD work directly on discrete features, the scikit-learn algorithms require binarization of such features; nevertheless, we count the number of original features.)

Jaccard coefficient (CM): the ratio of the number of features shared by two models over the total number of features used by at least one of the two models.

sample Pearson’s (CM) correlation coefficient [Nogueira016]: the Pearson’s coefficient over the 0/1 vectors of features used by the two models.
Measures specific to a model type include:
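The two comparative measures above can be illustrated with a minimal sketch. It assumes each model is summarized by the set of original features it uses; the function names, the toy feature names, and the handling of the degenerate case (a model using none or all of the features, where Pearson's coefficient is undefined) are assumptions made for illustration only.

```python
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    """Shared features over features used by at least one of the two models."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pearson_01(a, b, all_features):
    """Pearson's correlation between the 0/1 feature-usage vectors of two models."""
    x = [1 if f in a else 0 for f in all_features]
    y = [1 if f in b else 0 for f in all_features]
    n = len(all_features)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    if vx == 0 or vy == 0:
        # Undefined for constant vectors; this fallback is an assumption.
        return 1.0 if x == y else 0.0
    return cov / sqrt(vx * vy)


def all_pairs_mean(feature_sets, measure):
    """Average a comparative measure over all unordered pairs of models."""
    pairs = list(combinations(feature_sets, 2))
    return sum(measure(a, b) for a, b in pairs) / len(pairs)


# Hypothetical feature sets used by three models learned on different samples.
features = ["age", "education", "hours", "occupation", "sex"]
models = [{"age", "education"}, {"age", "hours"}, {"age", "education", "sex"}]
print(all_pairs_mean(models, jaccard))
print(all_pairs_mean(models, lambda a, b: pearson_01(a, b, features)))
```

Averaging over all unordered pairs mirrors the way comparative measures are turned into a stability estimate in the text.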

for decision trees: number of nodes (SM).

for rule-based classifiers: number of rules (SM) and size of rules (SM), namely the total number of conjuncts in the if-part of rules.

for linear models: Kendall’s $\tau$ (CM) rank correlation of coefficients.
In summary, for structural measures, one aims at low mean values (interpretability) and low deviation (stability). For comparative measures, one aims at high mean values (stability) and low deviation (few extreme outlier models).
Finally, in order to investigate the relationship between model stability, prediction accuracy, and overfitting, we will also compute the F1-scores of models on the training set and on the test set, and their relative difference, which represents a measure of overfitting.
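As a small worked example of the overfitting measure, the sketch below computes a binary F1-score and the relative difference between training and test scores. The normalization by the training score and the restriction to a single positive class are assumptions; the text does not spell out either choice.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1-score of the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_overfitting(f1_train, f1_test):
    """Relative drop from training to test F1 (assumed normalization: training score)."""
    return (f1_train - f1_test) / f1_train if f1_train else 0.0


# Toy predictions: one true positive, one false positive, one false negative.
print(f1_score_binary([1, 1, 0, 0], [1, 0, 0, 1]))
print(relative_overfitting(0.9, 0.72))
```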
Feature and instance selection. Feature selection (FS) [Guyon2006] and instance selection (IS) [DBLP:journals/air/OlveraLopezCTK10] are beneficial in removing noise and redundancies, in reducing the data collection effort, in balancing the data distribution, and in speeding up model learning. They are expected to enhance model interpretability by reducing the number of features and by preventing overfitting. Both techniques are widely used in reverse engineering of black-box models. We consider the following standard methods, as provided by the scikit-learn library (http://scikit-learn.org/stable/modules/feature_selection). For feature selection:

RFE (Recursive Feature Elimination): given an external estimator that assigns weights to features (a decision tree by default), it greedily removes the least important feature until a given number of features is left (we consider half of the total number of features);

SKB (Select K Best) removes all but the $k$ top scoring features according to the ANOVA F-value of the features (default: $k = 10$);

SP (Select Percentile) removes all but a user-specified top scoring percentage of features with respect to the ANOVA F-value (default: 10%).
For instance selection, we consider the following methods, as provided by the imbalanced-learn library (http://contrib.scikit-learn.org/imbalanced-learn):
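To make the scoring behind SKB and SP concrete, the following sketch computes a one-way ANOVA F-value per feature in plain Python and keeps the top-scoring ones. It is an illustration of the idea, not the scikit-learn implementation; the function names and toy data are hypothetical, and RFE is not covered because it depends on an external estimator.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F-value of a numeric feature with respect to the class labels."""
    groups = defaultdict(list)
    for v, c in zip(values, labels):
        groups[c].append(v)
    n, k = len(values), len(groups)  # assumes at least two classes
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(columns, labels, k):
    """Names of the k features with the highest F-value, best first."""
    scores = {name: anova_f(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: x1 separates the classes well, x3 moderately, x2 hardly at all.
labels = [0, 0, 1, 1]
columns = {
    "x1": [1.0, 1.2, 3.0, 3.1],
    "x2": [0.5, 2.0, 0.6, 2.1],
    "x3": [1.0, 2.0, 2.5, 3.5],
}
print(select_k_best(columns, labels, 2))
```

Selecting a percentile instead of a fixed count only changes how `k` is derived from the total number of features.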

RUS (Random Under Sampling) undersamples the majority class by randomly picking a subset of its instances;

ROS (Random Over Sampling) oversamples the minority class by replicating instances of that class at random with replacement;

SMOTE (Synthetic Minority Oversampling Technique) oversamples the minority class by generating instances along the line segment between an instance of the minority class and one of its $k$ nearest neighbors (default: $k = 5$).
We restrict here to class balancing and random sampling methods, because they are widely adopted in black-box explanation approaches [craven1994using, guidotti2018local].
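The three instance selection methods can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: all classes are balanced to the size of the smallest (RUS) or largest (ROS) class, and the SMOTE step is shown only as the interpolation between an instance and an already chosen neighbor, leaving out the nearest-neighbor search. All names are hypothetical.

```python
import random
from collections import defaultdict


def split_by_class(X, y):
    """Group instances by class label."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    return by_class


def random_under_sample(X, y, rng):
    """Keep a random subset of each class, as large as the smallest class."""
    by_class = split_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    out = [(row, c) for c, rows in by_class.items() for row in rng.sample(rows, target)]
    return [r for r, _ in out], [c for _, c in out]


def random_over_sample(X, y, rng):
    """Replicate random instances (with replacement) up to the largest class size."""
    by_class = split_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    out = []
    for c, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out.extend((row, c) for row in rows + extra)
    return [r for r, _ in out], [c for _, c in out]


def smote_point(x, neighbor, rng):
    """Synthetic instance at a random position on the segment from x to neighbor."""
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0, 0, 0, 0, 1]
print(random_under_sample(X, y, rng))
print(random_over_sample(X, y, rng))
print(smote_point([0.0, 0.0], [2.0, 4.0], rng))
```

Because each method draws random instances, the resulting training set (and hence the learned model) changes with the random seed, which is one of the sources of instability studied here.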
4 Evaluation Framework
Interpretable models are the end products of an articulated KDD process. We will evaluate the impact of process design on their intensional stability. To this end, we consider the following steps, which motivate the procedure of Algorithm 1.
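| dataset | adult | anneal | census | clean1 | clean2 | coil | cover | credit | sonar | soybean |
|---|---|---|---|---|---|---|---|---|---|---|
| instances | 48,842 | 898 | 299,285 | 476 | 6,598 | 9,822 | 581,012 | 1,000 | 208 | 683 |
| features | 14 | 38 | 40 | 166 | 166 | 85 | 54 | 20 | 60 | 35 |
| class values | 2 | 6 | 2 | 2 | 2 | 2 | 7 | 2 | 2 | 19 |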
First, any observational research project must account for variability/bias in data collection [DBLP:conf/ijcai/DanksL17]. Following standard methodology for estimating accuracy of classifiers [kohavi1995study, kim2009estimating], we adopt 5 repetitions of 10-fold stratified cross-validation to account for variations in the data. At each repetition, all the available data is split into 10 folds. For each fold, the process described next is applied on the other 9 folds used as training data, with that fold used as test data (test-set quantities are denoted by a hat). This is formalized in the two outer loops at lines 2–16 of Algorithm 1.
Second, the impact of preprocessing steps is evaluated by considering no preprocessing, feature selection, instance selection, and possibly combinations of them. We consider a set of preprocessing methods, which includes no modification at all. The inner loop at lines 6–16 of Algorithm 1 iterates over this set for the current fold at the current repetition. A preprocessing method is applied to the training data, and then the model is learned from the processed data. In Algorithm 1, the learned models are stored in a set. Moreover, lines 13–16 keep track of the predictive performance and of the degree of overfitting on the test data (the held-out fold).
Third, measures of interpretability, performance, and overfitting of the learned models must be aggregated over the 50 models (5 repetitions, 10 models each) of each preprocessing method. Performance, overfitting, and structural measures (SM) are aggregated using the mean value (lines 18–21). Comparative measures (CM) are aggregated by taking the all-pairs average (lines 22–23). Both loops are inside the loop at lines 17–23 that iterates over the set of preprocessing algorithms.
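The three steps above can be summarized in a compact sketch. It is not the authors' algorithm: it covers only a single structural measure (mean and deviation over repetitions × folds), and the learner, the preprocessing callables, and all names are hypothetical placeholders. Performance, overfitting, and comparative measures would be collected and aggregated in the same loops.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def stratified_folds(y, n_folds, rng):
    """Assign instance indices to folds, keeping class proportions roughly equal."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


def evaluate(X, y, preprocessors, learn, structural, repetitions=5, n_folds=10, seed=0):
    """Mean and deviation of a structural measure per preprocessing method."""
    rng = random.Random(seed)
    values = defaultdict(list)
    for _ in range(repetitions):                             # repetitions
        for test_fold in stratified_folds(y, n_folds, rng):  # folds
            held_out = set(test_fold)
            train = [i for i in range(len(y)) if i not in held_out]
            X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
            for name, prep in preprocessors.items():         # preprocessing methods
                X_p, y_p = prep(X_tr, y_tr)
                model = learn(X_p, y_p)
                values[name].append(structural(model))
    return {name: (mean(v), pstdev(v)) for name, v in values.items()}


def toy_learn(X, y):
    """Hypothetical learner: 'uses' a feature if its class means differ by more than 1."""
    used = set()
    for j in range(len(X[0])):
        m0 = mean(row[j] for row, c in zip(X, y) if c == 0)
        m1 = mean(row[j] for row, c in zip(X, y) if c == 1)
        if abs(m0 - m1) > 1.0:
            used.add(j)
    return used


rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[c * 3 + rng.random(), rng.random()] for c in y]  # feature 0 is informative
result = evaluate(X, y, {"none": lambda X, y: (X, y)}, toy_learn, len)
print(result)
```

Here the structural measure is the number of features used (`len` of the returned set); with the default settings, 50 models are learned per preprocessing method, as in the text.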
The results of the above framework are intended to support a number of accountability questions that the data analyst should answer before deploying a classification model, namely, how sensitive is the interpretability of a classification model to changes: in data collection? in feature selection? in instance selection? in model selection?
5 Experiments
We run experiments on a selection of ten small and medium sized datasets widely referenced for classification tasks and publicly available from the UCI ML repository. Table 1 shows summary statistics on the datasets: instances are in the range 208–581K, features in 14–166, and number of classes in 2–19. The framework of Algorithm 1 has been implemented in Python (source code and datasets available at a URL hidden for blind review) by integrating external libraries (YaDT and LUCS-KDD) through wrappers of inputs/outputs. The software has been designed to be extensible to additional models, preprocessing methods, and interpretability measures. Unless specified otherwise, parameters of algorithms are the defaults in their original systems (C4.5: split = Gain Ratio, stop criterion m = 2, pruning = ebp (error-based); CART: split = Gini, min_samples_split = 2, min_samples_leaf = 1, max_depth = None; CPAR: delta = 0.05, alpha = 0.3, gain_similarity_ratio = 0.99, min_gain_thr = 0.7; FOIL: min_gain_thr = 0.7; LASSO: alpha = 1.0; RIDGE: alpha = 1.0).
Common measures. Let us start by focusing on the number of features used by a classification model. Figure 2 considers the census dataset. Left plots report on DT models (CART and C4.5), middle plots on RB models (CPAR and FOIL), and right plots on LM models (LASSO and RIDGE; we omit LINREG for space reasons as it behaves as RIDGE). Each plot shows the boxplots for no preprocessing, for 3 feature selection (FS) methods (SKB, SP, and RFE), and for 3 instance selection (IS) methods (ROS, SMT, i.e., SMOTE, and RUS). Feature selection methods reduce the total number of features used by the classification model, as one would expect, thus improving the interpretability measure. Moreover, since redundant/noisy features are removed as well, this also reduces deviation over the 50 folds, thus improving stability. Instance selection has a similar beneficial effect on deviation, but in some cases (LASSO and C4.5) it increases the number of features. However, for a coarse-grained measure such as the number of features, low variability provides a distorted indication of stability. In fact, two models may still largely differ in the set of features used while the number of such features is the same for both models. Jaccard similarity or Pearson’s correlation among all pairs of feature sets across the 50 folds of training data can better measure variability of the set of features used by a classifier. Figure 3 reports Pearson’s correlation for the census dataset. We omit the Jaccard measure for lack of space and because it yields similar patterns. Linear models are stable, independently of the preprocessing method. In fact, Pearson’s correlation is always very close to 1. For rule-based models, FS also leads to stable models. Finally, IS increases the deviation of Pearson’s correlation for rule-based and decision tree classifiers. This means that extreme outlier models (in terms of feature vectors) become more frequent.
Statistical comparison of models’ stability. The non-parametric Friedman test compares the average ranks of learning methods over multiple datasets w.r.t. an evaluation measure, in our case Pearson’s correlation. The null hypothesis that all methods are equivalent is rejected. The comparison of the ranks of all methods against each other can be visually represented as shown in Figure 4 (see [Demsar06] for details). The post-hoc Nemenyi test is used to connect methods that are not significantly different from each other. Linear models have the best ranks. For a fixed classifier, models obtained using feature selection preprocessing rank better than models obtained without it. Instance selection methods and decision trees have the lowest ranks, i.e., they are the most unstable with respect to the set of features used by the learned model.
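The ranking procedure can be sketched as follows, assuming one score per method and dataset (higher is better). The formulas are the standard Friedman chi-square statistic and Nemenyi critical difference as presented in [Demsar06]; the function names and the toy scores are hypothetical, no tie correction is applied to the statistic, and the critical value `q_alpha` must be looked up in the Studentized range table (2.343 is the tabulated value for three methods at the 0.05 level).

```python
from math import sqrt


def average_ranks(scores):
    """Mean rank of each method; scores[d][m] is the score of method m on dataset d."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        i = 0
        while i < n_methods:
            j = i
            while j + 1 < n_methods and row[order[j + 1]] == row[order[i]]:
                j += 1
            tied_rank = (i + j) / 2 + 1  # tied methods share the average position
            for pos in range(i, j + 1):
                totals[order[pos]] += tied_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(q_alpha, k, n):
    """Critical difference: two methods differ if their average ranks differ by more."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Toy stability scores of three methods on four datasets.
scores = [[0.90, 0.80, 0.70], [0.95, 0.85, 0.60], [0.80, 0.82, 0.50], [0.90, 0.70, 0.65]]
print(average_ranks(scores))
print(friedman_statistic(scores))
print(nemenyi_cd(2.343, k=3, n=4))
```

In a critical difference diagram, methods whose average ranks differ by less than the critical difference are connected by a bar.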
Stability-interpretability. We summarize the relation between interpretability and stability through the scatter density plots in Figure 5, where Pearson’s correlation (stability) is plotted against the ratio of the number of used features over the total number of features (interpretability). There are 4 scatter plots. Each point represents an experiment (50 folds). From left to right and top to bottom: experiments for all datasets/classifiers/preprocessing, experiments for all datasets and classifiers but only those with no preprocessing, experiments for all datasets and classifiers but only those with feature selection preprocessing, and experiments for only linear model classifiers. Numbers on top of the scatter plots are the linear correlation and, in parentheses, the p-value of such correlation. The top left plot does not highlight a correlation between the measures of stability and interpretability, in general. Restricting to experiments with no preprocessing increases the correlation (higher stability means lower interpretability). Feature selection does not impact the correlation. Finally, the bottom right plot shows some positive correlation for linear models at the 95% significance level.
Stability-accuracy. Figure 6 compares the ranks of the various models w.r.t. the F1 measure averaged over the 50 experimental folds. Ranks are approximately the reverse of those for Pearson’s correlation shown in Figure 4. Decision trees and rule-based classifiers are the best performing. Linear models are at the bottom of the ranking. The adoption of instance selection does not improve the ranks of classifiers. In summary, for the interpretable models considered here, stability and accuracy are contrasting objectives, which then require a trade-off analysis.
Stability-overfitting. Let us now contrast stability with overfitting. Figure 7 reports scatter plots of stability vs overfitting, defined as the relative difference of F1-score between training and test set averaged over the 50 folds. A negative correlation is clearly observed and statistically significant: higher Pearson’s correlation (stability) is associated with smaller overfitting (generalizability). This is more apparent in experiments with no preprocessing (right in Figure 7). This is somewhat expected, due to the bias-variance decomposition [DBLP:books/lib/HastieTF09]. In summary, stability and overfitting appear to be contrasting objectives.
Model-specific measures. When restricting to specific classifiers, finer-grained measures of interpretability can be adopted. Let us start by considering the number of nodes in decision trees (for the tree depth measure, we obtain similar findings). We study the relation between interpretability and stability by exponentially varying the stopping parameter $m$ in tree construction from 2 (default value) to half of the size of the dataset. Such a parameter stops node splitting during tree construction if the number of cases at the node is below the threshold $m$. Thus, we can control the maximum size of a decision tree. Figure 8 shows the scatter plot of the mean number of nodes vs the standard deviation of the number of nodes over the 50 experimental folds. A statistically significant positive correlation is clearly visible, especially when restricting to a dataset in isolation (experiments with the two largest datasets are shown in different colors).
For rule-based classifiers, Figure 9 shows the stability-interpretability relation in terms of number of rules (left) and size of rules (right). Each point has as coordinates the standard deviation (x-axis) and the mean (y-axis) of the number/size of rules over the 50 experimental folds. Basically, the two plots are RB-specific versions of the density scatter plots in Figure 5. Contrasting the two figures, there is now a larger statistically significant positive correlation between stability and interpretability. The correlation for the finer-grained measure of size of rules is smaller than for the coarser-grained measure of number of rules, which is somewhat expected.
Finally, let us consider linear models. Kendall’s $\tau$ measures the rank correlation of two sets of features, where the rank of a feature is calculated w.r.t. the descending absolute value of its coefficient. Figure 10 reports the boxplots of the $\tau$ values over the 50 experimental folds for a few datasets and methods. LASSO is generally more stable than RIDGE (high values of $\tau$), due to the fact that it uses fewer features. Feature selection increases the variability of the measure (extreme outlier models) for RIDGE, but not for LASSO. Vice versa, IS increases variability for LASSO, but not for RIDGE.
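A minimal sketch of this comparative measure follows. It ranks features by descending absolute coefficient and computes Kendall's $\tau$ from concordant and discordant pairs. The assumptions are that both models are described over the same features, that ties in coefficient magnitude are broken by insertion order, and that the coefficient values shown are purely illustrative.

```python
from itertools import combinations


def ranks_by_abs_coefficient(coefs):
    """Rank features by descending |coefficient| (rank 1 = largest magnitude)."""
    order = sorted(coefs, key=lambda f: -abs(coefs[f]))
    return {f: r for r, f in enumerate(order, start=1)}


def kendall_tau(coefs_a, coefs_b):
    """Kendall's tau between the coefficient rankings of two linear models."""
    ra = ranks_by_abs_coefficient(coefs_a)
    rb = ranks_by_abs_coefficient(coefs_b)
    concordant = discordant = 0
    for f, g in combinations(ra, 2):
        s = (ra[f] - ra[g]) * (rb[f] - rb[g])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(ra) * (len(ra) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical coefficients of two linear models learned on different folds.
model_a = {"age": 0.8, "hours": -0.5, "education": 0.1}
model_b = {"age": 0.6, "hours": 0.05, "education": -0.3}
print(kendall_tau(model_a, model_b))
```

Only the magnitude of a coefficient enters the ranking, so a sign flip between two models is not penalized by this measure.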
Discussion. Experimental results highlight a tension between optimizing predictive accuracy on one side, and intensional stability of interpretable classifiers on the other side. Stability and generalizability appear to be common goals, or, stated otherwise, stability and overfitting appear to be contrasting objectives. Also, stability and interpretability appear to be slightly positively correlated. Existing approaches for improving generalizability of classifiers, however, cannot always be applied to interpretable models. Aggregation methods (e.g., bagging, boosting, random forests) produce models that are widely agreed to be difficult to interpret. Thus, we claim that the data analyst should conduct a stability impact assessment together with predictive performance analysis in order to alleviate the tension between the two objectives. Such a stability impact assessment amounts to analysing the empirical distribution of the relevant interpretability measures under variation of the design choices.
6 Conclusion
Our main contributions consist of a framework for intensional stability impact assessment, and experiments parametric to several preprocessing methods and classification algorithms. The approach is implemented, released as open source, and extensible to new classifiers, methods, and measures. Experimental results show that the studied interpretable models exhibit considerable variability in terms of structural and comparative measures. Interpretability of linear models appears to be more stable than for other models, but at the expense of lower accuracy. Decision trees, on the other hand, exhibit more variability, but they are more accurate. Stability is clearly negatively correlated with accuracy and with overfitting. However, no other generally valid pattern can be drawn.
Several extensions of the approach are possible. First, for the sake of space, we considered only a limited number of interpretable models, preprocessing methods, datasets, and measures. E.g., the comparative measure of tree edit distance [DBLP:conf/sisap/SchwarzPA17] is even more fine-grained than decision tree size. Second, with the exception of Figure 8, we did not consider parameters of the learning algorithms and preprocessing methods. This would add a further loop to Algorithm 1, where parameters are optimized over a parameter space (uniformly, greedily, etc.). Third, we considered only objective measures of interpretability and stability. A lab experiment can test subjective measures (legibility, understandability) on a pool of actual users.