CDC: A Clustering Algorithm using Local Direction Centrality

CDC is a novel boundary-seeking clustering algorithm for data with heterogeneous density and weak connectivity. We developed a CDC toolkit for versatile clustering applications, including but not limited to scRNA-seq clustering, CyTOF analysis, speech recognition, and face image recognition.
Published in Protocols & Methods

Heterogeneous density and weak connectivity are two common obstacles that heavily affect the accuracy and effectiveness of cluster analysis. Existing methods have difficulty identifying dense and sparse clusters simultaneously and separating weakly connected clusters. In this work, we propose a clustering algorithm named CDC that measures direction centrality locally, which helps it handle data with heterogeneous density and weak connectivity. The core idea is to first detect the boundary points of clusters, and then connect the internal points within the enclosed cages formed by the surrounding boundary points. Specifically, an internal point of a cluster tends to be surrounded by its k-nearest neighbors (KNNs) in all directions, while the neighbors of a boundary point fall only within a limited directional range. Taking advantage of this difference, we measure local centrality by calculating the directional uniformity of the KNNs to distinguish internal points from boundary points. Hence, CDC can avoid cross-cluster connections and separate weakly connected clusters effectively. Meanwhile, it preserves the completeness of sparse clusters, since it uses KNN to search for neighboring points, and the KNN search is independent of point density.
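The idea of directional uniformity can be illustrated with a minimal 2D sketch. It assumes one plausible score: the variance of the angular gaps between the directions to a point's KNNs. The function name `direction_centrality`, the brute-force neighbor search, and the choice of variance are illustrative assumptions, not the toolkit's implementation. A low score suggests neighbors spread evenly around the point (internal); a high score suggests neighbors confined to a narrow range (boundary).

```python
import math

def direction_centrality(points, k):
    """Hypothetical helper: for each 2D point, return the variance of the
    angular gaps between the directions to its k nearest neighbors."""
    scores = []
    for i, (px, py) in enumerate(points):
        # Brute-force KNN: sort all other points by squared Euclidean distance.
        others = [q for j, q in enumerate(points) if j != i]
        others.sort(key=lambda q: (q[0] - px) ** 2 + (q[1] - py) ** 2)
        neighbors = others[:k]
        # Direction to each neighbor as an angle in [0, 2*pi), sorted.
        angles = sorted(
            math.atan2(qy - py, qx - px) % (2 * math.pi)
            for qx, qy in neighbors
        )
        # Gaps between consecutive directions; the last gap wraps around.
        gaps = [angles[(j + 1) % k] - angles[j] for j in range(k)]
        gaps[-1] += 2 * math.pi
        # Perfectly uniform directions give gaps of 2*pi/k and variance 0.
        mean_gap = 2 * math.pi / k
        scores.append(sum((g - mean_gap) ** 2 for g in gaps) / k)
    return scores

# Toy data (assumed for illustration): a 3x3 grid of points.
grid = [(x, y) for x in range(3) for y in range(3)]
scores = direction_centrality(grid, k=4)
center_score, corner_score = scores[4], scores[0]
```

On this toy grid the center point has neighbors in four evenly spaced directions, so its score is close to zero, while a corner point sees all its neighbors within a quarter turn and scores much higher. Thresholding such a score would be one way to split points into internal and boundary sets; how the threshold is chosen is left open here.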

                      

Fig. 1 Illustration of CDC algorithmic principle

To validate its effectiveness, we compared CDC with a total of 38 specialized and versatile baselines on 47 datasets derived from different fields, including 15 scRNA-seq, two CyTOF, two speaker corpora, eight UCI, one handwritten image, one face image and 17 synthetic datasets. The results demonstrated that CDC attains superior clustering accuracy and robust outcomes in a time-efficient manner, and showed its potential in various applications. Moreover, we investigated dimension expansion and noise elimination methods, analyzed parameter sensitivity, and designed adaptive methods for parameter setting.

Fig. 2 Six typical applications of CDC, and overview of the standard preprocessing pipeline and clustering results for the identification of cell types from scRNA-seq datasets

CDC is of general significance and has potential beyond identifying cell types and recognizing speaker voices and face images. It could be a promising technique for segmenting cell images, exploring the spatial living patterns of species, and revealing the aggregation distributions of geographic objects. However, CDC may fail to handle data with manifold structure directly, since the detected boundary points cannot constrain the internal connections in all directions in the feature space. Using dimension reduction techniques such as UMAP to embed the data into a suitable dimension can broaden the applicability of CDC.

The CDC code in MATLAB, R and Python, along with the toolkit of six applications, can be downloaded from https://github.com/ZPGuiGroupWhu/ClusteringDirectionCentrality and https://zenodo.org/record/7029720#.YwuFsuxByZw. Digital Object Identifier (DOI): 10.5281/zenodo.7029720.
