Topological methods have the potential of exploring the shape of datasets across different scales by analyzing topological features (connected components, holes) that persist at different spatial resolutions. Given a distance defined on a dataset, we propose a hierarchical topological clustering algorithm that creates clusters from the components of an associated one parameter family of simplicial complexes. Cluster configurations that persist for large parameter ranges are considered dominant. Persistent clusters formed by one or a few points are candidates to be meaningful outliers, that is, outliers with a particular significance for the dataset under study (such as key genes or trade partners), rather than noise.
We demonstrate through MATLAB tests the potential of the algorithm on datasets of moderate size in which multiclustering and outliers are relevant, such as swarm motion data, gene and microscope data for cancer research, compressed images, and trade volume data, using different distances (Euclidean distance, Wasserstein distance, Fermat distance).
Unlike DBSCAN, this strategy is suitable to understand multiclustering phenomena in hierarchical data and does not require blind parameter choices. While popular methods like K-means or K-medoids fail to identify non convex clusters, the proposed algorithm creates clusters of any shape. Moreover, its stability under noise is quantifiable through distances defined on persistence diagrams, a property not enjoyed by standard hierarchical clustering. Unlike linkage strategies such as centroid or median, this algorithm can be implemented with any distance choice, not just Euclidean distances. The complexity of the algorithm is bounded from above by MN2, N being the amount of data and M the resolution levels.
Published in https://doi.org/10.1007/s00500-025-11028-6
https://rdcu.be/e2L7C