Machine Learning Explanations with Topological Data Analysis

Topology provides an alternative perspective from traditional tools for understanding shape and structure of an object. With modern advances of the computational aspects of topology, these rich theories of shape can be applied to sparse and high dimensional data, spurring the field of Topological Data Analysis (TDA). Mapper and Persistent homology are the two most popular methods in the field of TDA, but the field is nascent and rich with exciting new ideas.

The goal of this work is to introduce and explore the use of Mapper (Singh, Memoli, and Carlsson 2007). We will demonstrate a novel application of Mapper to the domain of explanatory machine learning and show how Mapper can be used to understand not just the shape of data, but also understand a machine learning model.

Topological Building Blocks

Both Mapper and Persistent Homology make use of simplicial complexes as the tool for understanding complicated shape. A simplicial complex is a generalization of a graph, with a few special features. The most particular feature is that simplicial complexes can contain higher order analogs of vertices and edges, referred to as simplices. Simplices can be the familiar vertices and edges of a graph, or triangles drawn between 3 vertices, tetrahedron between 4 vertices, and higher still.

Most often, simplicial complexes are built from the nerve of a cover. Intuitively named, a cover of a data set is a collection of subsets of the data such that every data point is in at least one of the subsets. Formally, we say a cover {U_i} of a data set X is satisfies the condition that for any x ∈ X, there exists at least on U_i ∈ {U_i} such that x ∈ U_i. In practice, we often have that each point is contained in multiple cover elements. The nerve is a simplicial complex created from a cover by collapsing each cover element into vertices and connecting vertices when the cover elements had points in common. If a point was included in two cover elements U_i and U_j, then the vertices σ_i, σ_j would have an edge drawn between them, denoted σ_ij. We continue this process to higher order intersections to create higher order simplices.

The image below shows a space shaped like a 2-dimensional kidney bean with a cover of 4 elements. The nerve is drawn on the right. Each cover element becomes a node of the same color, pairwise intersections become edges between those nodes, and the three-way intersections become the triangles. Because this cover was particularly nice, the resulting simplicial complex now provides a combinatorial representation of the kidney bean that is topologically equivalent to the original shape. In practice, we have little guarantee that the cover will satisfy the conditions for topological equivalence, but we can use heuristics to generate nice covers that produce close representations.

The transition of topological ideas from theory to applied has been enabled by new methods for building simplicial complexes from data. Persistent Homology and Mapper are manifestations of these advances and represent two different ways to build and study covers. Persistent homology builds many different covers on the same data set and studies properties of the entire set of covers. Mapper constructs a single cover in a supervised manner and provides a representation suitable for exploratory data analysis.

In this work, we will focus on understanding the exploratory power of Mapper and how we can apply it to understand a machine learning model. For an exceptional introduction to persistent homology, we recommend (Ghrist 2008) and for other demonstrations of using Mapper for understanding a data set, see (Lum et al. 2013).

Mapper

It is often a reasonable assumption that complex and high dimensional data exists on a much lower dimensional manifold or stratified space. Mapper can be thought of as a method of estimating and capturing this essential high dimensional structure (Singh, Memoli, and Carlsson 2007). It encodes information about relationships between regions of the data and the simplicial complex produced can be both aesthetically pleasing and useful for exploring such relationships.

The construction of Mapper makes use of two particular ideas to generate its cover, the lens and the pullback cover. The lens is a secondary representation of the data that encodes some interested aspect or information about the data set. This can be as simple as taking a subset of important columns. In this work, we interpret a machine learning model as a form of dimensionality reduction on the data, taking the predicted probabilities and decision functions as the important representation. The lens is usually where the analyst or scientist can embed domain knowledge into the construction, in our case, the representation is learned from labels. This lets the user build Mapper with respect to some information of interest and is why we say Mapper is semi-supervised.

After creating a lens, we generate a cover of this space. This cover is shown in the first image of the sequence below. In practice, the lens is chosen to be a low dimensional space, and because of this, constructing a cover is not too difficult. A common first choice of cover is to use overlapping intervals or hypercubes. We will discuss how do design a cover specifically for the output of a machine learning model a little later.

A cover of the data set is found by computing the pullback cover of our constructed lens cover. The pullback of the lens cover is found by taking each cover element in the lens, finding the associated data points in the original data. Because points that are close together in the lens space are not necessarily close together in the input data space, we must refine the pulled back cover elements by applying a clustering algorithm. If our data space was a continuous space rather than discrete data, the refinement process would consist of splitting the cover element into connected components. The second image in the sequence above shows how one cover element in the lens space becomes three cover elements in the pullback cover. The third image shows the complete pullback cover of all four lens cover elements.

The final step of Mapper is to compute the nerve of our constructed pullback cover. The result is a simplicial complex that represents the data set while capturing information about the lens representation of the data. Encoded in the simplicial complex is relationships between observations and relationships between groups of observations. These relationships are determined by nearness in the lens space and by differences in the data space. The image above shows the nerve constructed in this example.

Data and Model

For our demonstration of applying Mapper to understand a machine learning model, we will build a classifier of leaf types using the UCI Leaf Data Set (Silva, Marcal, and Silva 2013). We chose the leaf data set because it has images included with the structured data, which is useful for illustrative purposes. As the modeling is not the emphasis of this work, we applied only a modest amount of energy into model selection and training and we are satisfied with the accuracy obtained. This is a relatively small data set and so to demonstrate that Mapper is apt for analysis of data orders of magnitude larger in both dimension and size, we will later apply Mapper to understanding a model trained on the Fashion MNIST data set (Xiao, Rasul, and Vollgraf 2017).

Data

The leaf data set has 30 different types of leaves, but for our purposes, we will use the 10 largest classes, excluding Engl. Oak, Maple, Silver Poplar, Geranium, and Hawthorn. These five classes have very different characteristics from the rest and thus excluding them simplifies the demonstration here. The figure below shows examples from each of the 10 chosen classes. The color of the title will be used to differentiate between leaf classes in representations used later in the demonstration.

The data used for our prediction model is a structured representation of these images with 14 expert engineered features such as lobedness, elongation, and smoothness. Many of the images could easily be differentiated by eye, but some classes can be easily confused.

Model

For the classification we train a Logistic Regression model to 91.5% test set accuracy using Scikit-Learn (Pedregosa et al. 2011). Before training we applied normalization using standard scaling and applied dimensionality reduction using PCA. Using grid search and a stratified 3-fold cross validation, we found the best model using an l₁ penalty with parameter c = 0.9.

Mapper for explaining machine learning

Mapper has an often-overlooked limitation requiring the lens to have low dimension so that this space can be feasibly covered with a grid. To sample a 2-D space with a resolution of 10 cover elements per dimension requires a grid with 10² units. However, for a typical machine learning classifier with 10 classes, the same resolution grid would have 10¹⁰ units, which is computationally intractable.

To apply Mapper to understanding machine learning models, we adapted the typical cover approach by designing a new type of cover specifically for machine learning classifiers. To construct the cover, we rank and threshold the predicted probabilities for each observation and generate a cover element for similarly ranked observations. When considering a model with m classes, our approach is parameterized the top p predictions, and threshold ϵ (always taken as 1/m). We create mp cover elements, with each cover element denoted by C_ij, where i is the class and j is the rank of that class in an observation’s predicted probabilities. An observation x belongs to cover element C_ij if the jth highest predicted probability of x is greater than ϵ, and the class corresponding to this predicted probability is i. This allows us to work directly with the spaces generated by the model and avoid using dimensionality reduction on the prediction space.

As an example, suppose the most likely class for leaf ids #12 and #76 are both Spindle, then both leaves are added to C_{(Spindle), 1}, Spindle’s top cover element. If the second most likely class for #12 is Chestnut, then it would be placed in C_{(Chestnut), 2}, Chestnut’s second cover element as well. Suppose the second most likely class for #76 is Cork Oak, but the predicted probability is less than our threshold of 1/m. Then #76 will not be added to another cover element. These two observations would then produce two cover elements, one with two points and one with just one.

The graphs above show two different ways of generating a cover of our lens. The first shows a typical approach of reducing the high dimensional space into 2 dimensions using PCA and then using a hypercube cover. To make the visualization easier to read, the cubes are shown with no overlap. Notice how many of the hypercubes in the traditional cover (on the left) are empty. In high dimensions this issue becomes even more pronounced. On the right, we demonstrate our new covering technique applied to the first observation in the data set. In this case, the leaf would be placed into 3 cover elements, C_(Hazel), 1, C_{(Cork Oak), 2}, and C_{(Spindle), 3}.

Instance Based Explanation

Instance based explanation is a particular type of explainable machine learning that focuses on explaining predictions through other observations. We can treat the machine learning model as a black box and explore the relationship between the input space and the prediction space using Mapper, requiring only the original data and the predicted probabilities for each observation. Because of this, the technique is generalizable to most machine learning techniques and is suitable for use by non-experts. We envision a system like what will be described would be suitable for end users to understand model predictions without needing a deep understanding of the specific modeling technique.

After looking at the global structure of Mapper, we’ll explore one particular leaf and try to use Mapper to understand what the model thinks of the leaf. We’ll introduce the idea of an escape route, a sequence of leaves that demonstrate how a leaf of interest is similar or different from the other nodes around it. After demonstrating out the routes can be used for explaining leaves, we apply the method to understanding a model trained on the Fashion MNIST data set.

A Global View

The graph below shows the Mapper constructed using the predicted probabilities from the trained model and the particular cover described above. In this graph, the color of each node represents the majority class in that node. Because of the way the Mapper is constructed, each node represents a heterogenous set of leaves that are the model considers similar and each edge represents a relationship between nodes. Remember that because of the overlapping nature of the cover, each leaf could and probably will be in multiple different nodes.

In the implementation, we use Affinity Propagation clustering with damping factor of 0.5 and compute the nerve only considering 2-way intersections so edges are the highest dimensional simplices. For exploratory and exposition purposed, we’ve only kept the largest connected component of the graph. In this case, any disconnected pieces were isolated nodes.

Referring back to the grid of leaves above, we see that the pointer leaves (Hackberry, Primrose, Chestnut, and Saucer Magnolia) comprise most of the lower portion of the graph while rounder leaves (Hazel, Bougainvillea, and Birch) show up primarily at the top of the graph. The Spindle and Cork Oak join the two groups and judging from the images, do share characteristics with both.

Local Instances

Suppose we are interested why our model decided that the particular Chestnut shown below was in fact a Chestnut. Judging from the image alone, I might have easily thought it was a Saucer Magnolia or Hazel. This particular Chestnut was chosen for demonstration because, while the classification was correct, the confidence in prediction was not very large. In fact, there were 4 classes with predicted probability greater than 1/m (with m = 10).

We’re immediately curious which nodes of Mapper contain this Chestnut, where in the graph does this Chestnut live? In the series of graphs below, we show three different ways of visualizing the location of our Chestnut of interest. These three graphs show different levels of information about the instance. The first shows only the nodes that contain this particular Chestnut. The middle graph shows all nodes that contain any Chestnut leaves at all. The third graph shows the same region of Chestnut nodes, but also shows all of neighboring nodes, colored by their majority leaf type. The edges of the third graph are also colored based on the majority class comprising that edge.

It’s interesting that this particular leaf is in a disconnected region of Chestnuts. Maybe this one is peculiar in some way?

Below we see all of the leaves that are in the same two nodes as the Chestnut of interest. Surprisingly, we see that the Chestnut is the only Chestnut in this subset. It is not surprising to see the Primrose and Spindle leaves in the same node, as they appear similar with an untrained eye.

Confusion Regions

The nearby regions in the graph are one way to visualize the relationships between different observations and instances. Considering we are exploring a machine learning model, we can obtain more specific information from the Mapper as well. Remember that there were 4 different classes that had relatively high predicted probability on this Chestnut. Maybe these regions of Mapper can tell us something about the observation?

The table below shows example leaves from the 4 top predicted classes. These are Chestnut, Cork Oak, Primrose, and Spindle. While some of the Primrose have a very distinct tail or circular shape, by looking at multiple examples from the class, we can see some representatives that could cause confusion. We’ll dig much deeper into this idea while looking at escape routes.

The graphs below show the Mapper regions for each of the 4 classes shown above. The color of the node matches the labels in the grid above.

Escape Routes

Escape routes are way of exploring how a particular class or observation is related to another class by visualizing the sequence of nodes and edges between them in the Mapper. Each class creates a region in the graph, and by drawing paths between regions, we see gradual transformations. An escape route is a short path through the topological space starting in a region dominated by one class and terminating at a node that contains no observations of that class. This allows the user to understand how subtle changes to the classifier’s input effects the model’s output prediction.

An escape route is calculated by finding the shortest path tree starting at a given node in a region. The tree is pruned to exclude all hops occurring after the path has entered other regions. We define the starting region as a subgraph of the topology containing all nodes with observations whose ground truth label is the same as the Chestnut of interest. Terminating regions are regions connected to the start region containing no observations of the user selected class. Escape routes allow us to present the user with example observations, allowing the user explanation to be conveyed in terms of the underlying observations, instead of a more abstract statistical or feature-based representations.

Below, we begin exploring escape routes by first looking at the neighboring regions to the Chestnut class to gain a sense for what classes are deemed similar based on the lens and covers. These neighbors are found by first finding all nodes that have at least one Chestnut, then considering the majority class from each of the neighboring nodes containing no Chestnut instances.

The figures below show the Chestnut region, along with the 3 adjacent classes, Saucer Magnolia, Primrose, and Hackberry. Above the graphs is one representative of each class chosen at random, for illustrative purposes.

The figure below shows many more examples from each of the neighbor classes. When viewing many of them, it’s easy to see their similarities. Only the most teardrop-like Hackberry and Primrose would be easy for the untrained eye to correctly classify.

Escape routes are computed by finding the shortest paths from a node with the leaf of interest to one of the neighboring regions. The figure below shows a directed tree of the escape routes for our leaf of interest. The root node is enlarged and the arrows depict direction, with color encoding the majority class in the edge.

Intuitively, these paths can be thought of as the fastest ways out of the space containing our observation. These paths are interesting to rationalize as a series of transformations to turn our observation into an observation of another class. The steps along the path capture small changes in the leaf shapes, slowly transforming until the leaves are unrecognizable as the original class.

Below is a visualization of one escape route from the Chestnut to a Saucer Magnolia region. The graph on the left shows the route of five nodes.

In the middle and right table, we show two different ways of visualizing the escape route. The middle table depicts the leaves in each node along the path, while the right table shows the leaves in each of the edges. For each of these tables, the path starts at the top and continues down, row by row. Each image is annotated with its class multiplicity in the node, and the id of the particular image being shown.

Both of these representations provide a great deal of information, and the verdict is still out on which works better in a particular context. Watch the transformation of Chestnut as we walk the path. In the beginning, the Chestnut of interest is the only Chestnut present in the node. Walking through Primrose region, we get back to a region with Chestnut in it, but this time, the Chestnut shown (id 70) looks completely different than our original leaf. We can see though how the Saucer Magnolia and Primrose are very similar in pointedness.

The next path we look at paints a very different picture. The first two steps along the path are the same, but now, when the path enters back into a region containing Chestnut, these Chestnuts are completely different looking. Why is that? These are very similar to the Primrose and Hackberry that abut the Chestnuts. This path shows a gradual transformation from rounder (similar to Spindle) to pointier (similar to Hackberry).

Scaling Up

Now that we’ve seen how Mapper could be used for exploring a model on small data, let’s see how it works in a larger scale case. For this, we have chosen to use the exciting Fashion MNIST data set (Xiao, Rasul, and Vollgraf 2017). It is a drop-in replacement for the ubiquitous Digits MNIST data set, but much harder and more interesting to work with. To build a classification model, we first reduce the data to 100 dimensions using PCA and then train a Logistic Regression model with l₁ regularization. This results in 84.8% accuracy on the designated test set. In this example, we’ll demonstrate how the traditional hypercube cover scheme can be used for exploring machine learning models. To do this, we reduce the decision boundary of the model to 2 dimensions using UMAP (McInnes and Healy 2018) and then apply a 30 by 30 hypercube grid with 50% overlap. For constructing the Mapper, we make use of Kepler Mapper, an open source Mapper implementation (Veen and Saul 2017).

The largest connected component of the resulting Mapper is shown above. The result validates that the model is getting confused about similar parts of the data as we would. The lines between Shirt, T-shirt/top and Pullover is vague while both Bag and Trouser classes are mostly separate. Dresses and and Trouser have some common intersection where there is confusion about where the rompers and overalls belong. We see that Sandals and bags have a small connection where bulky sandals might be confused with strappy bags. The regions of Sneaker, Ankle boot, and Sandal are tightly linked together, but separate from the regions of shirts, jackets, tops, and sweaters.

To briefly demonstrate the explanatory power of this Mapper representation, we’ll look at the borders between the Sneaker, Ankle boot, and Sandal regions. To understand the relationships between the three, we’ve chosen a path starting in the Sneaker region, heading to the Sandal the region, and passing through the region of angle boots. Same as the escape routes before, a single representative from each class is shown and the multiplicity of that class is indicated in the corner of each image.

Tracking the path above, we see each of the nodes along the path contain instances from each of the three classes, but the first few have considerably more Sneakers than later nodes. In the first node, all three classes show low cut examples. As we walk down the path, the composition of node changes to be Ankle boot dominant and both the Sneaker and Sandals are seen to have higher ankle cut. As the path continues, transitioning from primarily Ankle boot nodes to primarily Sandal nodes, the heel for both grows and the amount of material in the Sandals shrinks, the defining features of Sandal become more prominent. When in the center of the Sandal region, both the Ankle boots and Sandals that they both contain some with heels.

This example shows how we can relate regions in the data space of Fashion images to how the model views them. The model sees the relationships between many of these images in more nuanced ways than just the 10 classes, and the visualization we create provides us with a map for exploring these relationships.

Conclusion

We have seen how Mapper can construct an interactive representation of the data and predictions manifold. Using this representation, we can extract insights into what the model is thinking, where it gets confused, . By connecting and compressing the input and output spaces of the model, we gain a keen understanding of the data and how the model interprets the data.

We saw how Mapper provides an information dense representation of a data and model suitable for exploring complex relationships and interactions between instances. We have demonstrated a few novel ways of using Mapper to understand a machine learning model. These explanations come in the form of examples and can help illuminate why certain predictions were made and what differentiates particular instances from others.

In closing, we note that this application of Mapper to machine learning is one of many new ways to use Mapper. There are many other great examples of Mapper applied in interesting contexts from studying genetics to voting patterns (Kamruzzaman, Kalyanaraman, and Krishnamoorthy 2016), (Lum et al. 2013). Recently, Mapper and TDA in general have been used to help understand machine learning failure modes (Carlsson, Carlsson, and Vejdemo-Johansson 2018) or detect adversarial examples (Gebhart and Schrater 2017). We hope that by expanding the available cover schemes and formalizing methods for exploring Mapper a wider audience will see the applicability of Mapper.

Acknowledgements

The research and development involved with this project was conducted under the Laboratory Directed Research and Development Program at PNNL. All visualizations were built using Matplotlib and Networkx (Hunter 2007) (Hagberg, Schult, and Swart 2008). Scikit-Learn was used for all modeling (Pedregosa et al. 2011). Numpy and Scipy provided the backbone of all computations (Oliphant 2006) (Jones et al. 2001).

References

Carlsson, Leo, Gunnar Carlsson, and Mikael Vejdemo-Johansson. 2018. “Fibres of Failure: Classifying Errors in Predictive Processes.” arXiv Preprint arXiv:1803.00384.

Gebhart, Thomas, and Paul Schrater. 2017. “Adversary Detection in Neural Networks via Persistent Homology.” arXiv Preprint arXiv:1711.10056.

Ghrist, Robert. 2008. “Barcodes: The Persistent Topology of Data.” Bulletin of the American Mathematical Society 45 (1): 61–75.

Hagberg, Aric A., Daniel A. Schult, and Pieter J. Swart. 2008. “Exploring Network Structure, Dynamics, and Function Using Networkx.” In Proceedings of the 7th Python in Science Conference, edited by Gaël Varoquaux, Travis Vaught, and Jarrod Millman, 11–15. Pasadena, CA USA.

Hunter, J. D. 2007. “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering 9 (3). IEEE COMPUTER SOC: 90–95.

Jones, Eric, Travis Oliphant, Pearu Peterson, and others. 2001. “SciPy: Open Source Scientific Tools for Python.” http://www.scipy.org/.

Kamruzzaman, Methun, Ananth Kalyanaraman, and Bala Krishnamoorthy. 2016. “Characterizing the Role of Environment on Phenotypic Traits Using Topological Data Analysis.” In Proceedings of the 7th Acm International Conference on Bioinformatics, Computational Biology, and Health Informatics, 487–88. ACM.

Lum, Pek Y, Gurjeet Singh, Alan Lehman, Tigran Ishkanov, Mikael Vejdemo-Johansson, Muthu Alagappan, John Carlsson, and Gunnar Carlsson. 2013. “Extracting Insights from the Shape of Complex Data Using Topology.” Scientific Reports 3. Nature Publishing Group: srep01236.

McInnes, L., and J. Healy. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” ArXiv E-Prints, February.

Oliphant, Travis E. 2006. A Guide to Numpy. Vol. 1.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.

Silva, Pedro FB, Andre RS Marcal, and Rubim M Almeida da Silva. 2013. “Evaluation of Features for Leaf Discrimination.” In International Conference Image Analysis and Recognition, 197–204. Springer.

Singh, Gurjeet, Facundo Memoli, and Gunnar Carlsson. 2007. “Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition.” The Eurographics Association.

Veen, Hendrik Jacob van, and Nathaniel Saul. 2017. “KeplerMapper: Python Implemention of Mapper for Visualizing Structure and Shape of High-Dimensional Data.” http://doi.org/10.5281/zenodo.1054444.

Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017. “Fashion-Mnist: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.” August 28, 2017. http://arxiv.org/abs/cs.LG/1708.07747.

Around About

Machine Learning Explanations
with
Topological Data Analysis

Nathaniel Saul^1, 2 and Dustin L. Arendt¹

¹Visual Analytics Group,
Pacific Northwest National Laboratory

²Department of Mathematics and Statistics,
Washington State University Vancouver

Introduction