Introducing scCross: a deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in-silico exploration

scCross is a deep generative model for integrating single-cell multi-omics data. It enables efficient cross-modal data generation, multi-omics data simulation, and in-silico perturbations within and across modalities.

The introduction of single-cell sequencing technology marks a new era in biological research, allowing scientists to analyze cellular heterogeneity with unprecedented detail. This advancement reveals complex cellular dynamics and has had significant impacts on fields such as cancer biology, neurobiology, and drug discovery. However, the data generated by these technologies are often highly complex and diverse, leading many existing computational tools to provide a limited perspective focused on specific data modalities. This limitation hinders a comprehensive understanding of the cellular landscape.

Challenges in Multi-Omics Data Integration and Generation

Integrating single-cell multi-omics data effectively remains a significant challenge. Many existing methods depend on matched multi-omics datasets, which are often difficult to obtain, and this dependence limits the scope of analysis. As a result, unmatched data are integrated poorly, and noise and information loss are hard to manage. Even methods designed to handle multiple data modalities face persistent difficulties, such as extracting features common to all modalities and handling nonlinear transformations between them. The uneven availability of different omics data types complicates matters further; for instance, single-cell epigenomics data are often far less accessible than their transcriptomics counterparts. This scarcity hinders multi-omics analysis and limits the biological insight that can be drawn from it. Together, these challenges call for more robust and flexible approaches to multi-omics data integration and generation.

Developing Integrated Methods for Integration, Generation, Perturbation, and Downstream Analysis

To address these challenges, we propose scCross. The method integrates single-cell multi-omics data and is distinguished by its ability to generate cross-modal single-cell data. This capability bridges data-rich and data-scarce modalities, allowing a more comprehensive depiction of cellular states. scCross also provides high-fidelity simulation of single-cell multi-omics data and supports computational perturbations. Because these perturbations build on the integrated data, they make it possible to run virtual experiments on cellular interventions and to explore potential strategies for cellular manipulation. By offering insight into cross-modal cellular dynamics, scCross extends what single-cell multi-omics research can do with the data already available.

Integrating Multi-Omics Using Deep Generative Frameworks

scCross is built on a deep generative framework that combines variational autoencoders (VAEs) and generative adversarial networks (GANs). This framework supports the integration of single-cell multi-omics data, cross-modal data generation, multi-omics data simulation, and computational perturbations within and across modalities. Training begins with one VAE per modality, which captures low-dimensional cell embeddings enriched with gene set vectors for additional biological information. These embeddings are then integrated into a common latent space, where a Jensen-Shannon (JS) divergence loss is applied to minimize differences between the data distributions of the different omics. GANs are subsequently employed to fuse the modalities within this joint latent space. To further refine the integration, mutual nearest neighbor (MNN) cell pairs serve as anchors that guide the alignment, so that embeddings of the same or similar cells from different modalities remain close in the joint latent space. This MNN-guided alignment coordinates the distributions of the modalities and makes the integration more robust and accurate.
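Two of the ingredients named above, the JS divergence and MNN anchor pairs, can be illustrated with a minimal sketch. This is not the scCross implementation: the function names `js_divergence` and `mutual_nearest_neighbors` are hypothetical, the JS divergence is computed for two small discrete distributions only (the post does not say how the loss is estimated in the model), and the embeddings are made-up 2D points. Plain Python is used to keep the example dependency-free.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def mutual_nearest_neighbors(emb_a, emb_b, k=1):
    """Return (i, j) pairs where cell i of A and cell j of B are each
    among the other's k nearest neighbors (Euclidean distance)."""

    def knn(queries, refs):
        neighbors = []
        for query in queries:
            order = sorted(range(len(refs)),
                           key=lambda j: math.dist(query, refs[j]))
            neighbors.append(set(order[:k]))
        return neighbors

    a_to_b = knn(emb_a, emb_b)
    b_to_a = knn(emb_b, emb_a)
    return [(i, j)
            for i in range(len(emb_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]


# Toy latent embeddings (hypothetical values) for two modalities.
rna_cells = [[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]]
atac_cells = [[0.2, -0.1], [5.1, 4.8], [4.9, 5.3]]
anchors = mutual_nearest_neighbors(rna_cells, atac_cells)
print(anchors)
```

In this toy case the third RNA cell has a nearest ATAC neighbor, but that neighbor prefers a different RNA cell, so the pair is not mutual and is not used as an anchor. Requiring mutual agreement is what makes such pairs conservative alignment guides.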

Cross-Modal Generation Using Bidirectional Alignment

Beyond the integration of single-cell multi-omics data, the model also enables cross-modal single-cell data generation and perturbations. The bidirectional aligner is essential for this process, decoding shared latent embeddings into different modalities. Once trained, the model can generate single-cell data across modalities by encoding data from one modality into the latent space and then decoding it into another. Additionally, it simulates multi-omics data generation and performs computational perturbations both within and across modalities, uncovering potential regulatory changes in cellular states. By consolidating single-cell multi-omics data into a unified latent space and supporting cross-modal integration, scCross lays the foundation for a wide range of single-cell multi-omics applications, particularly in scenarios where certain omics data are limited or unavailable. 
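The encode-then-decode route described above can be sketched with toy components. Everything here is an assumption made for illustration: `ToyModality`, `cross_generate`, and `perturb_and_generate` are hypothetical names rather than the scCross API, small fixed linear maps stand in for trained encoders and decoders, and the perturbation is applied to the input profile before encoding, which the post does not specify.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyModality:
    """Hypothetical stand-in for one modality's trained encoder and decoder."""

    def __init__(self, enc_weights, dec_weights):
        self.enc_weights = enc_weights  # shape: latent_dim x n_features
        self.dec_weights = dec_weights  # shape: n_features x latent_dim

    def encode(self, profile):
        return matvec(self.enc_weights, profile)

    def decode(self, latent):
        return matvec(self.dec_weights, latent)


def cross_generate(source, target, profile):
    """Encode a cell with the source modality, decode with the target."""
    return target.decode(source.encode(profile))


def perturb_and_generate(source, target, profile, feature_index, new_value):
    """Set one input feature to a new value, then generate the target modality."""
    perturbed = list(profile)
    perturbed[feature_index] = new_value
    return cross_generate(source, target, perturbed)


# Made-up weights: 3 genes and 2 peaks sharing a 2D latent space.
rna = ToyModality(enc_weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
                  dec_weights=[[1.0, 0.0], [0.0, 0.5], [0.0, 0.5]])
atac = ToyModality(enc_weights=[[1.0, 0.0], [0.0, 2.0]],
                   dec_weights=[[1.0, 0.0], [0.0, 0.5]])

cell = [2.0, 1.0, 3.0]
generated_atac = cross_generate(rna, atac, cell)
knockout_atac = perturb_and_generate(rna, atac, cell, feature_index=0, new_value=0.0)
print(generated_atac, knockout_atac)
```

Comparing `generated_atac` with `knockout_atac` shows how an in-silico change in one modality propagates through the shared latent space into another. In a trained model the two outputs would differ in ways learned from data rather than fixed by hand-picked weights.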

Fig. 1: Overview of the scCross method. scCross employs variational autoencoders for each modality to capture latent cell embeddings for different single-cell omics. During single-cell data integration, the method incorporates biological priors, such as gene set matrices, as additional features. It then uses additional variational autoencoders and a bidirectional aligner to merge these enriched embeddings into a shared latent space z. The bidirectional aligner is crucial for cross-modal generation, with brown arrows indicating the transition from scRNA-seq to scATAC-seq. Mutual nearest neighbor priors ensure alignment accuracy. A discriminator maintains integration across omics while ensuring the generated data’s completeness and consistency. scCross provides a robust toolkit for single-cell data integration, supporting cross-modal data generation, single-cell data enhancement, multi-omics simulation, and computational perturbations, offering great flexibility in addressing various single-cell multi-omics challenges.

Validation of scCross

We validated scCross across diverse datasets encompassing various single-cell omics. The results indicate that scCross performs effectively in single-cell multi-omics data integration, cross-modal generation, multi-modal simulation, and computational perturbation tasks, as confirmed by multiple metrics and downstream analyses. These findings suggest that scCross is a valuable tool for facilitating single-cell multi-omic explorations and enhancing data utilization, supporting researchers in gaining deeper insights into single-cell multi-omics and cross-modal cellular dynamics.

Conclusion

The scCross method offers significant potential for the single-cell research community, addressing challenges that may be difficult to overcome with existing approaches. Its unique features and reliable performance make it a valuable tool for researchers engaged in single-cell multi-omics analysis. scCross facilitates the integration of different modalities, supports comprehensive data generation, and enables detailed simulation and perturbation, which could advance the study of complex biological systems. We encourage researchers to explore scCross and consider its application in their studies. For further details, please refer to our paper in Genome Biology (https://doi.org/10.1186/s13059-024-03338-z). 

