scMoMaT jointly performs single cell mosaic integration and multi-modal bio-marker detection

Single-cell sequencing lets researchers study an organism one cell at a time. As the technology has advanced, it has become possible to measure the cells in a tissue across multiple data modalities. scATAC-seq profiles chromatin accessibility within a cell, while scRNA-seq profiles gene expression. Some technologies measure two or more modalities simultaneously from the same cell: CITE-seq jointly measures protein abundance and gene expression, while SNARE-seq jointly measures gene expression and chromatin accessibility. With such multi-modality single-cell datasets now abundant, a computational challenge emerges: integrating the different modalities to construct a holistic picture of the cell population within a tissue. This is the challenge of single-cell data integration.

Single-cell data integration has been widely studied in recent years. The problem is usually divided into three basic scenarios: horizontal integration, vertical integration, and diagonal integration. Real-world datasets tend to be a complex combination of the three, which is termed mosaic integration. However, existing methods mainly target the three basic cases, and very few generalize to mosaic integration. Motivated by this gap, we propose scMoMaT, a mosaic integration method based on matrix tri-factorization. Given a multi-batch, multi-modality single-cell dataset, scMoMaT learns a unified cell representation and cell-type-specific bio-markers across modalities at the same time.

Figure 1. Graphical illustration of scMoMaT. a. Given multiple single-cell data matrices, scMoMaT learns unified cell factors and feature factors that can be used for bio-marker detection. b. scMoMaT factorizes each data matrix into cell and feature factors, an association matrix, and cell and feature biases.

For each single-cell data matrix, scMoMaT factorizes it into a cell factor, a feature factor, and an association matrix. When multiple data matrices are provided, scMoMaT encodes the relationships across matrices by forcing the same cells or features to share the same factor wherever they appear. The learned cell factors encode cross-modality information and are free of batch effects. The learned feature factors, in turn, assign high scores to features that serve as bio-markers of different cell types.
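To make the factor-sharing idea concrete, here is a minimal NumPy sketch of a tri-factorization over a mosaic layout. It is an illustration under stated assumptions, not the scMoMaT implementation: the function name `tri_factorize`, the toy batch and modality names, the squared-error loss, and the plain gradient descent are all choices made for this example, and the bias terms and any constraints on the factors are omitted. Each matrix is approximated as `C[b] @ A[b, m] @ F[m].T`, where all matrices from batch `b` reuse the cell factor `C[b]` and all matrices of modality `m` reuse the feature factor `F[m]`.

```python
import numpy as np

def tri_factorize(mats, n_factors=3, n_iter=1000, lr=5e-4, seed=0):
    """Fit X[b, m] ~ C[b] @ A[b, m] @ F[m].T by plain gradient descent.

    mats maps (batch, modality) -> array of shape
    (cells in batch, features in modality).
    Illustrative only: no biases, no constraints, no mini-batching.
    """
    rng = np.random.default_rng(seed)
    C, F, A = {}, {}, {}
    for (b, m), X in mats.items():
        # One cell factor per batch and one feature factor per modality,
        # created once and shared by every matrix that involves them.
        if b not in C:
            C[b] = rng.normal(scale=0.1, size=(X.shape[0], n_factors))
        if m not in F:
            F[m] = rng.normal(scale=0.1, size=(X.shape[1], n_factors))
        A[b, m] = np.eye(n_factors)  # one association matrix per data matrix

    losses = []
    for _ in range(n_iter):
        grad_C = {b: np.zeros_like(v) for b, v in C.items()}
        grad_F = {m: np.zeros_like(v) for m, v in F.items()}
        grad_A = {}
        loss = 0.0
        for (b, m), X in mats.items():
            R = C[b] @ A[b, m] @ F[m].T - X  # reconstruction residual
            loss += 0.5 * np.sum(R ** 2)
            # Shared factors accumulate gradients from every matrix they touch.
            grad_C[b] += R @ F[m] @ A[b, m].T
            grad_F[m] += R.T @ C[b] @ A[b, m]
            grad_A[b, m] = C[b].T @ R @ F[m]
        for b in C:
            C[b] -= lr * grad_C[b]
        for m in F:
            F[m] -= lr * grad_F[m]
        for key in A:
            A[key] -= lr * grad_A[key]
        losses.append(loss)  # loss measured before this step's update
    return C, F, A, losses

# Synthetic mosaic layout (made-up data): batch1 is measured in both
# modalities, batch2 only in "rna", batch3 only in "atac".
rng = np.random.default_rng(1)
n_cells = {"batch1": 40, "batch2": 30, "batch3": 25}
n_feats = {"rna": 20, "atac": 15}
true_C = {b: rng.normal(size=(n, 3)) for b, n in n_cells.items()}
true_F = {m: rng.normal(size=(n, 3)) for m, n in n_feats.items()}
layout = [("batch1", "rna"), ("batch1", "atac"),
          ("batch2", "rna"), ("batch3", "atac")]
mats = {(b, m): true_C[b] @ true_F[m].T for b, m in layout}

C, F, A, losses = tri_factorize(mats)
print(f"loss: {losses[0]:.1f} -> {losses[-1]:.1f}")
```

In this toy layout, batch2 and batch3 share no matrix directly, yet their cell factors end up in a common space because both are tied to batch1 through the shared `rna` and `atac` feature factors. That chaining is what lets a shared-factor model handle mosaic layouts.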

To validate the performance and generalization ability of scMoMaT, we tested it on various real-world data integration scenarios. One of them is a human PBMC dataset that includes 4 batches of cells measured across 3 data modalities (chromatin accessibility, gene expression, and protein abundance). The UMAP visualization of the cell factors shows that scMoMaT successfully removes the batch effect while preserving cell type identity. We further reannotated the cell types using the bio-markers that scMoMaT discovered. The resulting annotation has higher resolution and is more consistent across data matrices than the annotation from the original data paper.
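One simple way to read candidate markers off a learned feature factor is to rank the features by their score within each factor. The sketch below assumes exactly that rule. The helper name `top_features_per_factor`, the ranking rule, and the score matrix are illustrative assumptions rather than the scMoMaT package API, and the numbers are made up for the example.

```python
import numpy as np

def top_features_per_factor(feature_factor, feature_names, k=3):
    """Return the k highest-scoring feature names for each factor.

    feature_factor: array of shape (n_features, n_factors).
    Hypothetical helper for illustration; not part of the scMoMaT package.
    """
    scores = np.asarray(feature_factor, dtype=float)
    names = np.asarray(feature_names)
    # Sort each column in descending order and keep the first k row indices.
    order = np.argsort(-scores, axis=0)[:k]
    return {j: names[order[:, j]].tolist() for j in range(scores.shape[1])}

# Toy feature factor with made-up scores: 6 genes x 3 factors.
genes = ["CD3E", "CD19", "MS4A1", "CD14", "LYZ", "NKG7"]
scores = np.array([
    [0.9, 0.0, 0.1],
    [0.0, 0.8, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.7],
    [0.1, 0.0, 0.9],
    [0.3, 0.0, 0.2],
])
markers = top_features_per_factor(scores, genes, k=2)
print(markers)
```

Each factor's top-ranked features can then be compared against known marker genes to suggest which cell type the factor represents.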

Figure 2. Test results of scMoMaT on the human PBMC dataset. a. Layout of data matrices in the dataset. b,c. UMAP visualization of cell factors, where cells are colored by (b) cell type in the original data paper, (c) batch. d. KNN agreement scores of the scMoMaT annotation compared to the annotation in the original data paper. e. scMoMaT cell type annotation.

We further validated scMoMaT on a mouse brain cortex dataset, a human bone marrow dataset, and a mouse spleen dataset. In addition, we quantitatively measured the performance of scMoMaT and compared it with baseline methods on simulated datasets. The test datasets cover various types of integration scenarios, and the results together show the superior performance of scMoMaT compared to existing methods.

As more multi-modality single-cell data becomes available, we envision scMoMaT becoming an effective tool for harnessing multi-modality information and learning a unified view of the cell population. Detecting bio-markers jointly with learning cell representations also points to the possibility of learning cross-modality relationships and understanding cell regulation mechanisms as part of the integration task.

Full text is available at: https://www.nature.com/articles/s41467-023-36066-2

The scMoMaT package is available at: https://github.com/PeterZZQ/scMoMaT
