Behind the Paper

Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters

In this paper, we provide a framework for (1) identifying data distortions in projection from a high-dimensional to two-dimensional (2D) space and (2) optimizing hyperparameter settings in a 2D dimension-reduction method.

Published in Protocols & Methods, Genetics & Genomics, and Statistics

Feb 29, 2024

Jingyi Jessica Li and Christy Lee


Consider a 3-dimensional (3D) globe vs a 2-dimensional (2D) map. It is impossible to represent an entire globe accurately in only 2D; distances may not be accurate, and the sizes of some countries may be distorted. Typically, landmasses at the edge of the map, like Antarctica, are the most distorted. Despite these distortions, 2D maps are useful for everyday purposes; students or the common traveler to the main continents will not be affected by the distortion in Antarctica, but an intrepid traveler to the poles will certainly require a different map.

Similarly, the representation of single-cell genomics data often requires moving from a high-dimensional space to a 2D space, a so-called 2D embedding. As with the conversion of the globe, this can induce distortions: the 2D post-embedding space may not accurately represent the pre-embedding space. Adding to the problem, popular 2D embedding methods, like t-SNE and UMAP, are sensitive to hyperparameter selection. While general guidelines exist to tailor hyperparameters like perplexity and n.neighbors to the size of the dataset, these guidelines do not help answer the underlying question: which parts of the visualization are misleading?

Similar to cartographers selecting which landmasses to recreate faithfully and which to distort, researchers must prioritize which aspects of the pre-embedding space are most important to preserve post-embedding. Common uses of 2D visualization include annotation and analysis of cell trajectories and clusters. Although cell trajectories and clusters are generally calculated in the high-dimensional space, their results are often visualized through 2D embeddings, in which cells with similar gene expression are expected to be close to each other. Therefore, we concluded that the most important aspect of preservation is the position of cells relative to each other.

These ideas formed the motivation for scDEED, a single-cell dubious embedding detector (Fig. 1). The key idea is that a cell’s pre- and post-embedding neighbors should be similar. It is worth noting that the pre-embedding space is typically 20- to 50-dimensional in single-cell data analysis, usually the principal component space. For each cell, we calculate a reliability score that reflects the agreement between the neighbors found in the 2D-embedding space and those found in the pre-embedding space. Cells whose neighbors have been drastically changed through the embedding process are called ‘dubious’; such a cell’s relative location is misleading and does not reflect where the cell should be based on the pre-embedding space. Identifying these cells provides a mechanism to optimize hyperparameters: select the setting that results in the fewest dubious cell embeddings.
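To make the reliability score concrete, here is a minimal Python sketch of one simplified reading of it; it is not the scDEED implementation. It assumes that, for each cell, we take the closest 50% of cells in the 2D embedding and compute the Pearson correlation (the statistic named in Fig. 1) between the distances to those cells in the 2D space and the distances to the same cells in the pre-embedding space. The function names `pearson` and `reliability_scores`, the `frac` parameter, and the toy data are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # Assumption of this sketch: a constant distance vector is uninformative.
        return 0.0
    return cov / math.sqrt(vx * vy)


def reliability_scores(pre, emb, frac=0.5):
    """One score per cell (simplified reading, not the scDEED code).

    pre: list of pre-embedding coordinates (e.g. principal components).
    emb: list of 2D embedding coordinates, in the same cell order.
    """
    n = len(pre)
    k = max(2, int(frac * (n - 1)))  # size of the "closest 50%" neighborhood
    scores = []
    for i in range(n):
        # Rank all other cells by their distance to cell i in the 2D embedding.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(emb[i], emb[j]))
        nearest = others[:k]
        d_emb = [math.dist(emb[i], emb[j]) for j in nearest]
        d_pre = [math.dist(pre[i], pre[j]) for j in nearest]
        scores.append(pearson(d_emb, d_pre))
    return scores


# Toy data: 20 cells lying on a plane inside a 3D "pre-embedding" space.
rng = random.Random(0)
pre = [(rng.random(), rng.random(), 0.0) for _ in range(20)]
faithful = [(x, y) for x, y, _ in pre]  # embedding that keeps every distance
scrambled = faithful[:]
rng.shuffle(scrambled)                  # embedding that misplaces the cells

good_scores = reliability_scores(pre, faithful)
bad_scores = reliability_scores(pre, scrambled)
print(min(good_scores), sum(bad_scores) / len(bad_scores))
```

In this toy setting the faithful embedding scores close to 1 for every cell, while the scrambled embedding scores much lower on average, which is the kind of contrast a dubious-embedding detector relies on.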

**Fig. 1. Illustration of the two functionalities of scDEED.** Functionality I decides whether each cell has a *trustworthy* or *dubious* embedding by calculating a *reliability score*, which is defined as the Pearson correlation between the cell’s distances to its closest 50% neighboring cells in the 2D-embedding space and the same cell’s distances to its closest 50% neighboring cells in the pre-embedding space. Enabled by functionality I, functionality II optimizes the hyperparameter setting of an embedding method (e.g. t-SNE and UMAP) by minimizing the number of dubious embeddings.
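Functionality II can be sketched in a few lines of Python. This is only an illustration of the selection rule "choose the setting with the fewest dubious embeddings". How a cell is declared dubious is not spelled out in this post, so the fixed `cutoff` below is a hypothetical placeholder, as are the function names and the made-up scores.

```python
def count_dubious(scores, cutoff):
    """Number of cells whose reliability score falls below `cutoff`."""
    return sum(1 for s in scores if s < cutoff)


def pick_setting(scores_by_setting, cutoff):
    """Return (best_setting, counts), where best_setting has the fewest dubious cells.

    scores_by_setting maps a hyperparameter value (e.g. a t-SNE perplexity)
    to the per-cell reliability scores obtained with that value.
    """
    counts = {h: count_dubious(s, cutoff) for h, s in scores_by_setting.items()}
    # Assumption of this sketch: ties go to the smaller hyperparameter value.
    best = min(counts, key=lambda h: (counts[h], h))
    return best, counts


# Made-up scores for four cells under three candidate perplexities.
toy_scores = {
    30: [0.9, 0.2, 0.1, 0.8],
    100: [0.9, 0.7, 0.1, 0.8],
    300: [0.9, 0.7, 0.6, 0.8],
}
best, counts = pick_setting(toy_scores, cutoff=0.5)
print(best, counts)
```

With these toy numbers the largest perplexity wins because none of its four scores falls below the placeholder cutoff. In practice each list would come from re-running the embedding at that setting and re-scoring every cell.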

In our paper, we use a variety of datasets to show how the identification of dubious cells and optimization of hyperparameters can aid analysis. For example, in the original visualization of the single-cell RNA-seq Hydra dataset [1], the neuron ectodermal 1 (neuron ec1) cells are split into two clusters, one that scDEED marked as dubious and the other as trustworthy (Fig. 2a). As confirmed by the similarity in gene expression (Fig. 2c) and the single cell type assigned by the authors, these two clusters are not biologically distinct, making their separation in the t-SNE plot misleading. Further, if we compare the neuron ec1 cells to their neighboring clusters, like the highlighted ectodermal epithelial cells (ecEP_sc), the gene expression is very different, which is counterintuitive given their proximity in the visualization. However, under the optimized perplexity found by scDEED (Fig. 2b), the neuron ec1 cell type is unified, further supporting that the original split of the cell type into clusters was a result of hyperparameter settings. Additionally, the neuron ec1 and ecEP_sc cells are now far apart, which is more appropriate given their differences in gene expression. This highlights two uses for scDEED: identification of dubious cells can help discern cells whose embedding positions are misleading, and optimization of hyperparameters can result in a more trustworthy visualization.

**Fig. 2. Evaluation of t-SNE embeddings optimized by scDEED on the Hydra dataset. a–b,** Comparative t-SNE plots with the *ecEP_sc (ectodermal epithelial_single cell)*, *trustworthy cell embeddings in neuron ec1*, and *dubious cell embeddings in neuron ec1* highlighted, at the original perplexity 40 (a) and the perplexity 230 optimized by scDEED (b). **c,** Gene expression heatmap of the highlighted cells in a and b, where the cells are ordered by the default hierarchical clustering found by the R function heatmap.2().

An interesting application is RNA velocity [2], a downstream analysis task that relies on visualization. RNA velocity uses the amounts of unspliced and spliced mRNA to estimate gene velocity, that is, the change in gene expression. The estimated gene velocity can be used to calculate the predicted gene expression at a future time point, which can be visualized with an arrow from the cell to the cell’s predicted state. For large datasets, it is not reasonable to plot each cell’s velocity vector; rather, cells are grouped based on their 2D embeddings, and their velocity vectors are aggregated. Changes to the 2D embedding will not affect the estimated gene velocities or predicted expression for the individual cells, but they will change the cell grouping for vector field calculations, and therefore affect the visualized RNA velocities and the analysis. Using scDEED to optimize the t-SNE perplexity greatly enhanced the agreement among neighboring cells and provided clearer RNA velocity results (Fig. 3b) than using the default hyperparameter value (Fig. 3a). Additionally, the vectors are not exaggerated for the mature granules, an expected result because the cells are fully differentiated. Optimization of the hyperparameter enhanced only existing cell trajectories.

**Fig. 3. Velocity analysis of the dentate gyrus dataset** [3]**. a–b,** Velocity visualization using the embeddings at the original t-SNE perplexity of 30 (a) and the perplexity of 450 optimized by scDEED (b), with the default Velocyto [2] settings. Abbreviations for cell types are as follows: Neuro1: Neuroblast 1; Neuro2: Neuroblast 2; nIPC: neuronal intermediate progenitor cells.
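The dependence of the velocity plot on the embedding can be illustrated with a small Python sketch. It is a simplified stand-in, not Velocyto's method: it assumes cells are grouped by square grid bins in the 2D embedding and that each bin's arrow is the plain average of its cells' arrows. The per-cell arrows are held fixed across the two embeddings to isolate the effect of grouping. The function name `grid_velocities`, the `cell_size` parameter, and the toy coordinates are hypothetical.

```python
import math
from collections import defaultdict


def grid_velocities(emb, vel, cell_size):
    """Average the 2D velocity arrows of cells that share a square grid bin.

    emb: list of (x, y) embedding coordinates, one per cell.
    vel: list of (vx, vy) arrows, one per cell, in the same order.
    Returns a dict mapping bin index (column, row) to the mean arrow.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running sum of vx, vy, and count
    for (x, y), (vx, vy) in zip(emb, vel):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        acc = sums[key]
        acc[0] += vx
        acc[1] += vy
        acc[2] += 1
    return {key: (sx / n, sy / n) for key, (sx, sy, n) in sums.items()}


# Four toy cells: two moving right, two moving left.
vel = [(1, 0), (1, 0), (-1, 0), (-1, 0)]

# Embedding A places cells with similar arrows in the same bin.
emb_a = [(0.2, 0.2), (0.4, 0.4), (1.2, 0.2), (1.4, 0.4)]
# Embedding B mixes cells with opposite arrows in each bin.
emb_b = [(0.2, 0.2), (1.2, 0.2), (0.4, 0.4), (1.4, 0.4)]

field_a = grid_velocities(emb_a, vel, cell_size=1.0)
field_b = grid_velocities(emb_b, vel, cell_size=1.0)
print(field_a)
print(field_b)
```

The per-cell arrows are identical in both cases, yet embedding A shows two clear opposing flows while embedding B averages them away. This is why a misleading embedding can blur a velocity field even when the underlying estimates are unchanged.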

Recent work [4,5] has highlighted geometric qualities, like geodesics, manifolds, and distance, that cannot be fully recreated because the pre- and post-embedding spaces are not homeomorphic. scDEED can help reduce the inconsistencies by finding hyperparameter settings that accurately capture mid-range cell-cell relationships for as many cells as possible and by identifying cells whose mid-range neighbors have drastically changed. We hope that scDEED can be used as an add-on to existing analysis pipelines to provide a more trustworthy 2D visualization. It is worth pointing out that scDEED does not measure the preservation of all aspects of the data; just as cartographers deemed it most important to preserve the 5 main continents, we chose to prioritize the relative location of cells. With some adjustments to the definition of the reliability score (one per cell embedding), researchers interested in preserving other qualities of the pre-embedding space may still find the framework of scDEED useful.

References


