Single-cell sequencing technologies make it possible to study an organism at single-cell resolution. As these technologies advance, researchers can now profile cells in a tissue across multiple data modalities. scATAC-seq profiles the chromatin accessibility of a cell, while scRNA-seq profiles its gene expression. Other technologies measure two or more modalities simultaneously in the same cell: CITE-seq jointly measures protein abundance and gene expression, while SNARE-seq jointly measures gene expression and chromatin accessibility. With the abundance of such multi-modality single-cell datasets, computational challenges emerge in integrating the different modalities to construct a holistic picture of the cell population within a tissue, a task termed single-cell data integration.
In recent years, single-cell data integration has been widely studied in the community. The task is commonly categorized into three basic integration scenarios: horizontal integration, vertical integration, and diagonal integration. Real-world integration tasks, however, tend to be complex combinations of the three, termed mosaic integration. Existing methods mainly address the three basic cases, and very few generalize to mosaic integration. Motivated by this generalization gap, we propose scMoMaT, a mosaic integration method based on matrix tri-factorization. Given a multi-batch, multi-modality single-cell dataset, scMoMaT simultaneously learns a unified cell representation and cell-type-specific biomarkers across modalities.
scMoMaT factorizes each single-cell data matrix into a cell factor, a feature factor, and an association matrix. When multiple data matrices are provided, scMoMaT encodes the cross-matrix relationships by constraining matrices that share the same cells or the same features to share the corresponding factor. The learned cell factors encode cross-modality information and are free of batch effects. The learned feature factors, in turn, assign scores to features, and high-scoring features correspond to biomarkers of the different cell type identities.
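The coupling idea above can be sketched in a toy NumPy example. This is a simplified illustration, not scMoMaT's actual implementation (which uses additional terms and training stages described in the paper): each matrix X is approximated as C @ S @ F.T, and two batches that measure the same features are coupled by sharing the feature factor F. The dimensions, learning rate, and plain gradient descent here are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # latent dimension (chosen arbitrarily for this sketch)

# Two synthetic batches that share the same 50 features (e.g. genes):
X1 = rng.random((100, 50))  # batch 1: 100 cells x 50 features
X2 = rng.random((80, 50))   # batch 2:  80 cells x 50 features

# Per-batch cell factors C and association matrices S, plus ONE feature
# factor F shared by both batches -- the sharing is what couples them.
C1, C2 = 0.1 * rng.random((100, d)), 0.1 * rng.random((80, d))
S1, S2 = 0.1 * rng.random((d, d)), 0.1 * rng.random((d, d))
F = 0.1 * rng.random((50, d))

def loss():
    # Squared-Frobenius reconstruction error summed over both batches.
    return (np.linalg.norm(C1 @ S1 @ F.T - X1) ** 2
            + np.linalg.norm(C2 @ S2 @ F.T - X2) ** 2)

init_loss = loss()
lr = 1e-3
for _ in range(300):
    R1 = C1 @ S1 @ F.T - X1  # residual of batch 1
    R2 = C2 @ S2 @ F.T - X2  # residual of batch 2
    # Gradient steps for the reconstruction loss (constant factors
    # absorbed into the learning rate).
    C1 -= lr * (R1 @ F @ S1.T)
    C2 -= lr * (R2 @ F @ S2.T)
    S1 -= lr * (C1.T @ R1 @ F)
    S2 -= lr * (C2.T @ R2 @ F)
    # F receives gradients from BOTH batches, tying their factorizations
    # together; the cell factors C1, C2 stay batch-specific.
    F -= lr * (R1.T @ C1 @ S1 + R2.T @ C2 @ S2)

final_loss = loss()
```

In this sketch the rows of C1 and C2 are the learned cell representations, expressed in the same latent space because both reconstructions go through the shared F.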
To validate the performance and generalization ability of scMoMaT, we tested it on a variety of real-world integration scenarios. We first tested scMoMaT on a human PBMC dataset comprising 4 batches of cells measured across 3 data modalities (chromatin accessibility, gene expression, and protein abundance). UMAP visualization of the cell factors shows that scMoMaT successfully removes the batch effect while preserving cell type identity. We further re-annotated the cell types using the biomarkers scMoMaT discovered, and the result shows higher annotation resolution and better annotation consistency across data matrices compared to the annotation in the original data paper.
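The biomarker-driven re-annotation relies on ranking features by their factor scores. The snippet below is a hypothetical, simplified illustration of that idea: given a learned feature factor, take the top-loading features of each latent factor as candidate markers. The feature names, dimensions, and `top_k` cutoff are invented for the example; scMoMaT's actual marker-scoring procedure is described in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learned feature factor: 50 features x 5 latent factors.
feature_names = np.array([f"gene_{i}" for i in range(50)])
F = rng.random((50, 5))

# For each latent factor, rank features by their loading; the top-scoring
# features serve as candidate biomarkers for the cell population that the
# factor represents.
top_k = 3
markers = {
    f: feature_names[np.argsort(F[:, f])[::-1][:top_k]].tolist()
    for f in range(F.shape[1])
}
```

Cross-referencing such ranked lists against known marker genes is what allows clusters to be annotated (or re-annotated) consistently across data matrices.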
We further validated scMoMaT on a mouse brain cortex dataset, a human bone marrow dataset, and a mouse spleen dataset. In addition, we quantitatively measured the performance of scMoMaT against baseline methods on simulated datasets. These test datasets cover various integration scenarios, and the results collectively show that scMoMaT outperforms existing methods.
With more and more multi-modality single-cell data becoming available, we envision that scMoMaT will become an effective tool for harnessing multi-modality information and learning a unified view of the cell population. The exploration of detecting biomarkers jointly with learning cell representations also points to the possibility of jointly learning cross-modality relationships and understanding cell regulatory mechanisms within integration tasks.
Full text is available at: https://www.nature.com/articles/s41467-023-36066-2
The scMoMaT package is available at: https://github.com/PeterZZQ/scMoMaT