A unified model for interpretable latent embedding of multi-sample, multi-condition single-cell data

The advent of single-cell technologies has allowed us not only to generate comprehensive atlases of cell types in their 'normal' states, but also to understand how these cells change under various conditions. Here, we introduce GEDI, a framework to enable the analysis of these datasets.

I remember that one of the first articles my supervisor, Dr. Hamed S. Najafabadi, shared with me was the PLIER paper [1]. This work demonstrates how to incorporate prior information from various gene sets, for example pathways or gene-regulatory networks, to make the latent factors obtained from gene-expression analysis interpretable. Inspired by this, we wanted to extend some of these ideas to the analysis of single-cell RNA-seq data, but we soon realized that multiple challenges lay ahead.

One of these obstacles is the presence of technical or biological variability among samples, as shown by several groups that developed computational methods for the integration of single-cell RNA-seq data [2,3]. Although integration was possible, we realized that applying it affected other analyses, such as the estimation of pathway activities. The same influence could be seen in other downstream tasks. For example, differential expression across conditions depends on both integration and cell type/cluster identification, yet the interplay between these steps is typically ignored, and the analysis is limited to the discrete clusters identified. We reasoned that a model that unifies these concepts and performs all these tasks in a single step could be useful, and that is how Gene Expression Decomposition and Integration (GEDI) was born!

Fig. 1. Schematic representation of a two-sample single-cell analysis: cells from each sample are distributed near a unique manifold determined by the sample-specific decoder functions ψ1 and ψ2 (each dot represents one cell, with coordinates representing gene expression measurements). These invertible functions provide a mapping (represented by grey arrows) from the biological state of each cell (b) to the observed gene expression profile of the cell in each sample.

Our work starts from the premise that in a multi-sample, multi-condition single-cell dataset, cells are embedded in manifolds that can vary across samples due to technical or biological variations (Fig. 1). In GEDI, the gene expression manifold of each sample is modeled as a hyperplane or hyperellipsoid, defined by a reference set of principal axes and sample-specific transformations of these axes. The variations to the reference frame can be expressed as a probabilistic function of sample-level variables, such as a change in disease status. This formulation enables us to quantify how changes in sample-level covariates influence the gene expression of any given cell state, leading to a transcriptomic vector field for a sample-level variable (Fig. 2).
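To make the idea of a shared reference frame with sample-specific transformations concrete, here is a minimal sketch of the hyperplane case. It is not GEDI's implementation: the names `o`, `Z`, `dZ`, `sample_axes`, `decode` and `encode` are hypothetical, the dependence on the sample-level variable `h` is assumed to be linear, and the axes are random numbers rather than fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_factors = 50, 3

# Reference frame shared by all samples: an offset o and principal axes Z.
o = rng.normal(size=n_genes)
Z = rng.normal(size=(n_genes, n_factors))

# Hypothetical effect of one sample-level variable h on the axes (assumed linear).
dZ = 0.1 * rng.normal(size=(n_genes, n_factors))


def sample_axes(h, noise_scale=0.0):
    """Sample-specific axes: reference axes shifted by h, plus optional random deviation."""
    return Z + h * dZ + noise_scale * rng.normal(size=Z.shape)


def decode(b, Q):
    """Map a biological state b (n_factors,) to expression (n_genes,) for axes Q."""
    return o + Q @ b


def encode(y, Q):
    """Invert decode on the hyperplane by least squares (Q has full column rank)."""
    b, *_ = np.linalg.lstsq(Q, y - o, rcond=None)
    return b


Q1 = sample_axes(h=0.0)  # e.g. a sample from a healthy individual
Q2 = sample_axes(h=1.0)  # e.g. a sample from a patient

b = np.array([1.0, -0.5, 2.0])         # one biological state
y1, y2 = decode(b, Q1), decode(b, Q2)  # two different expression profiles
b_from_y1 = encode(y1, Q1)             # both recover the same state b
b_from_y2 = encode(y2, Q2)
```

The same state `b` yields different expression profiles in the two samples, and inverting each sample's own decoder brings both back to the same coordinates, which is the sense in which the samples are integrated in this toy setting.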

Fig. 2. The derivative of ψ with respect to the sample-level variable h forms a vector field, representing the change in expression of each cell at the biological state b as h changes (differential expression).
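As a worked example of this derivative, assume (purely for illustration) a linear decoder whose axes shift linearly with h. The symbols o (a gene-wise offset), Z (the reference axes) and ΔZ (the effect of h on the axes) are assumptions of this sketch, not notation taken from the paper:

```latex
\psi(b; h) = o + (Z + h\,\Delta Z)\, b
\quad\Longrightarrow\quad
\frac{\partial \psi(b; h)}{\partial h} = \Delta Z\, b
```

Under these assumptions the differential expression vector depends on the cell state b, so each position on the manifold gets its own vector, and no discrete clustering is needed to evaluate it.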

The reference set of principal axes can also be expressed as a probabilistic function of gene-level variables, such as gene-set membership of pathways, cell signatures and regulatory factor targets, providing interpretability to the axes identified. If using prior information about regulatory networks, GEDI can calculate the activity gradient of regulatory factors and compare them to the transcriptomic vector fields of sample-level variables.
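The sketch below illustrates one simple way such a comparison could work, under strong assumptions that are ours rather than the paper's: the axes are modeled as `Z = C @ A` plus noise, where `C` is a hypothetical binary gene-by-factor target matrix, `A` is estimated by ordinary least squares, the activity of factor `t` at state `b` is taken to be `A[t] @ b` (so its gradient with respect to `b` is the row `A[t]`), and a condition effect is given as a direction in the same latent coordinates. All values are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_tfs, n_factors = 200, 5, 3

# Hypothetical prior: C[g, t] = 1 if gene g is a target of regulatory factor t.
C = (rng.random((n_genes, n_tfs)) < 0.2).astype(float)

# Simulated axes that are partly explained by the prior: Z = C A + noise.
A_true = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [-1.0, 0.0, 1.0],
])
Z = C @ A_true + 0.1 * rng.normal(size=(n_genes, n_factors))

# Least-squares estimate of A from the axes and the prior.
A_hat, *_ = np.linalg.lstsq(C, Z, rcond=None)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical direction of a condition effect in latent coordinates.
condition_direction = np.array([0.1, 0.0, 1.0])

# Row t of A_hat is the activity gradient of factor t under the assumption above.
scores = np.array([cosine(A_hat[t], condition_direction) for t in range(n_tfs)])
best_tf = int(np.argmax(scores))
```

Here the factor with index 2 is constructed to point almost exactly along the condition effect, so it receives the highest score. With real data the scores would have to be evaluated per cell state, because the vector field changes from one state to another.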

In the paper, we illustrate the different capabilities of GEDI. Using several datasets, we show that GEDI can capture biological or technical variability and is competitive with other top-performing integration tools. When analyzing a single-cell atlas of PBMCs that includes healthy individuals as well as mild and severe COVID-19 cases, GEDI performs cluster-free differential gene expression analysis along the continuum of cell states, obtaining a transcriptomic vector field that describes the differences between COVID-19 patients and healthy individuals. When prior information about gene regulatory networks is also included, GEDI identifies transcription factors whose activity gradient points in the same direction as the transcriptomic vector of COVID-19 in specific cell subpopulations.

Finally, we show that GEDI can also be applied to modalities where the biological quantity of interest is a ratio between two observations, such as splicing or mRNA stability. This capability allows us to perform the analyses described above, including dimensionality reduction, sample harmonization, and gene-regulatory network analysis, on the latent space of ratio-based modalities.
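As a rough illustration of what a ratio-based modality looks like as input, the sketch below turns two count matrices (for example, reads supporting inclusion and exclusion of a splicing event) into log-ratios and reduces them with a plain SVD. The pseudocount, the function name `log_ratio`, the simulated counts, and the use of SVD are all assumptions made for this example; GEDI's own treatment of paired observations may differ.

```python
import numpy as np


def log_ratio(numerator, denominator, pseudocount=1.0):
    """Log-ratio of two count arrays; the pseudocount keeps zero counts finite."""
    return np.log((numerator + pseudocount) / (denominator + pseudocount))


rng = np.random.default_rng(2)
n_events, n_cells = 100, 40

# Simulated paired counts per event and cell (e.g. inclusion vs. exclusion reads).
included = rng.poisson(5.0, size=(n_events, n_cells)).astype(float)
excluded = rng.poisson(5.0, size=(n_events, n_cells)).astype(float)

# Ratio on an unbounded scale, centered per event.
Y = log_ratio(included, excluded)
Y_centered = Y - Y.mean(axis=1, keepdims=True)

# Plain SVD as a stand-in for a latent-variable model of the ratios.
U, s, Vt = np.linalg.svd(Y_centered, full_matrices=False)
k = 5
embedding = (s[:k, None] * Vt[:k]).T  # one row of k coordinates per cell
```

Working on the log-ratio rather than on either count matrix alone keeps the embedding focused on the relative quantity (for instance, exon inclusion) instead of the overall abundance of the gene.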

To learn more about these and other results, you can find the paper here: https://www.nature.com/articles/s41467-024-50963-0

And you can find GEDI here: https://github.com/csglab/GEDI/tree/main

References:

  1. Mao, W., Zaslavsky, E., Hartmann, B. M., Sealfon, S. C., & Chikina, M. (2019). Pathway-level information extractor (PLIER) for gene expression data. Nature Methods, 16(7), 607-610.
  2. Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411-420.
  3. Haghverdi, L., Lun, A. T., Morgan, M. D., & Marioni, J. C. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 36(5), 421-427.