Introduction to Topological Data Analysis

A very basic introduction to Topological Data Analysis

Nathaniel Saul

The traditional scientific method is to create a hypothesis, design an experiment, collect specific data, and then analyze the results. Data collection has become so cheap that data is often gathered before we even know what to look for. The scientific method is being flipped on its head. Now the hard part of the process is sifting through the piles of data and figuring out what questions to ask. New methods of analysis are emerging to help us explore these troves of data and eke out insights.

A topologist can’t tell the difference between a coffee cup and a donut.

The fundamental assumption in topology is that connectivity is more important than distance. This inability to tell a cup from a donut is a large part of why topology is useful for analyzing data. Instead of getting wrapped up in the myopic details of how far apart two points are, topology is concerned with qualitative questions, like how many holes an object has or how many pieces it is made of. Essentially, topology is a way to explore the shape of data without concern for things like which metric to use.

A common question when analyzing data is whether it contains any distinct clusters. This question can also be posed naturally in the topological setting. Suppose the point cloud data was sampled from some sort of high-dimensional surface (or multiple surfaces). If we can estimate that surface with a simplicial complex, we can use the complex to answer many questions (of the topological sort) about the data. The main point of persistent homology and Mapper is to build simplicial complexes that estimate these underlying surfaces well.
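
As a rough illustration of the clustering question, the sketch below links any two points that lie within some threshold of each other and counts the connected pieces of the resulting graph. This is only the simplest version of the idea, not persistent homology or Mapper themselves. The function name `count_components`, the threshold `epsilon`, and the sample points are all assumptions made up for this example.

```python
from math import dist


def count_components(points, epsilon):
    """Count connected pieces of the graph linking points within `epsilon`."""
    # Union-find: every point starts out as its own piece.
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the pieces of every pair of points that are close enough.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= epsilon:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Hypothetical point cloud: three points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]

clusters_small = count_components(points, epsilon=0.5)
clusters_large = count_components(points, epsilon=10.0)
print(clusters_small, clusters_large)
```

With the small threshold the points fall into two pieces, and with the large one they merge into a single piece. The answer depends on the scale we pick, which is one reason methods that look across many scales at once are attractive.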

A graph is a fundamental object of study in both mathematics and computer science. Graphs have found countless applications because they lend themselves easily to modeling many concepts. We can extend graphs to higher dimensions by considering not just two-way interactions between vertices, but more complicated interactions as well. By adding triangles, tetrahedra, and their higher-dimensional analogues, we can model complex relationships that cannot be modeled by graphs. The resulting object is called a simplicial complex.

For example, suppose we want to model communication within groups. With a graph, we might say that each person in the group is a vertex and each edge represents two people talking. What if three people were in a room all having a conversation with each other? Would you add three edges between them? How would this be different from three people having three one-on-one conversations? If we used a simplicial complex instead of a graph, we could capture this difference by including a triangle to represent the three-way interaction. Besides being able to capture higher-dimensional interactions, simplicial complexes are useful for modeling surfaces and objects.
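
To make the conversation example concrete, here is a minimal sketch that stores a simplicial complex as a set of vertex sets. The helper names `closure` and `edges_of` and the people `ann`, `bob`, and `cat` are invented for illustration. It is one simple way to represent a complex, not the way any particular library does it.

```python
from itertools import combinations


def closure(simplices):
    """Return every non-empty face of the given simplices as a set of frozensets."""
    faces = set()
    for simplex in simplices:
        vertices = sorted(simplex)
        for size in range(1, len(vertices) + 1):
            for face in combinations(vertices, size):
                faces.add(frozenset(face))
    return faces


def edges_of(complex_):
    """Keep only the two-vertex simplices, which is all a graph can see."""
    return {simplex for simplex in complex_ if len(simplex) == 2}


# Three separate one-on-one conversations: three edges, no triangle.
pairwise = closure([("ann", "bob"), ("bob", "cat"), ("ann", "cat")])

# One three-way conversation: the triangle together with all of its faces.
group = closure([("ann", "bob", "cat")])

triangle = frozenset({"ann", "bob", "cat"})
same_graph = edges_of(pairwise) == edges_of(group)
print(same_graph, triangle in pairwise, triangle in group)
```

Both situations produce the same three edges, so a graph alone cannot tell them apart. Only the second complex contains the triangle.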

A fundamental question in topological data analysis is how to construct simplicial complexes from point cloud data. Once these simplicial complexes (or sequences of simplicial complexes) are constructed, we can compute the topological features of the complex. It becomes easy to ask how many holes there are or how many distinct pieces there are.
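
For a small complex, counting pieces and holes can be sketched in a few lines. The example below assumes a complex with only vertices, edges, and triangles, and counts with coefficients mod 2. It uses the standard rank formulas: the number of pieces is the number of vertices minus the rank of the edge boundary matrix, and the number of holes is the number of edges minus the ranks of both boundary matrices. The function names `boundary_rank_mod2` and `betti_numbers` are made up for this sketch, and real software uses far more efficient algorithms.

```python
from itertools import combinations


def boundary_rank_mod2(simplices, faces):
    """Rank over GF(2) of the boundary matrix taking `simplices` to `faces`."""
    index = {face: i for i, face in enumerate(faces)}
    basis = {}  # highest set bit -> reduced column, stored as an integer bitmask
    for simplex in simplices:
        # The boundary of a simplex is the sum of its faces one dimension down.
        column = 0
        for face in combinations(simplex, len(simplex) - 1):
            column |= 1 << index[face]
        # Reduce the column against the basis found so far (XOR is addition mod 2).
        while column:
            top = column.bit_length() - 1
            if top not in basis:
                basis[top] = column
                break
            column ^= basis[top]
    return len(basis)


def betti_numbers(vertices, edges, triangles):
    """Return (number of pieces, number of holes) of a 2-dimensional complex."""
    verts = [(v,) for v in sorted(vertices)]
    edges = [tuple(sorted(e)) for e in edges]
    triangles = [tuple(sorted(t)) for t in triangles]
    rank1 = boundary_rank_mod2(edges, verts)
    rank2 = boundary_rank_mod2(triangles, edges)
    return len(verts) - rank1, len(edges) - rank1 - rank2


triangle_edges = [(0, 1), (1, 2), (0, 2)]
hollow = betti_numbers([0, 1, 2], triangle_edges, [])
filled = betti_numbers([0, 1, 2], triangle_edges, [(0, 1, 2)])
print(hollow, filled)
```

The hollow triangle has one piece and one hole, while filling in the triangle closes the hole.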
