Behind the paper: Deep generative AI models analyzing circulating orphan non-coding RNAs enable detection of early-stage lung cancer
Published in Cancer, Computational Sciences, and General & Internal Medicine

The past decade has seen deep learning move into applications historically considered to lie solely within the realm of human intelligence. From board games to media generation, our perception of the power of artificial intelligence (AI) has been revolutionized. The full potential of AI in clinical genomics, however, has yet to be realized. While AI has been useful for discovering individual biomarkers, its use in diagnosis has been limited, and linear, more explainable machine learning algorithms have remained the toolkit of choice for most molecular diagnostic applications. Assays built on sparse biomarkers such as oncRNAs, fragmentomics, and DNA methylation patterns measure a large number of molecules with extensive variability and a low signal-to-noise ratio. The pattern recognition power of deep learning and AI can overcome these sparsity barriers.
While regularized linear models are sufficiently powerful when only a small number of biomarkers is involved, they fail to generalize when a non-linear combination of biomarkers distinguishes the phenotype or when there are major differences among the samples (e.g., technical variations). Methods such as XGBoost can discover non-linear patterns, yet they still fail to resolve these technical variations. In liquid biopsy, where the biomarkers have a sparse representation (especially in early-stage disease) and blood samples carry significant technical variation, a robust deep learning model can improve performance and generalizability.
We previously described a group of cancer-emergent small RNAs, orphan non-coding RNAs (oncRNAs), which are present in cancer tissues and largely absent in healthy normal tissues (Fish et al. 2018, Nature Medicine). OncRNAs emerge as a result of chromatin rearrangements in tumorigenesis and therefore represent a novel opportunity for early detection of cancer. The dynamics of transcription and of blood-based measurements, however, require a method capable of resolving sparsity by leveraging non-linear relationships and disentangling biological from technical variations.
Here, we leveraged the power of generative AI in a custom-built framework, a semi-supervised, multi-objective twin variational auto-encoder called “Orion”. Orion learns the parameters of a zero-inflated negative binomial distribution to model the dataset through a variational Bayes framework. A contrastive objective, the triplet margin loss, allows us to constrain the low-dimensional representation so that technical variations are removed while biological variations are emphasized. The supervised component leverages the generative process, which can draw an unlimited number of samples from the low-dimensional representation of the data. This process allows us to train a robust classifier that generalizes to samples with low-frequency deviations from the mean.
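To make the two objectives named above concrete, here is a minimal sketch in plain Python of a zero-inflated negative binomial (ZINB) log-probability and a triplet margin loss. It is not Orion's implementation: the parameter names (`mu`, `theta`, `pi`), the mean/inverse-dispersion parameterization, the Euclidean distance, and the way triplets are chosen are all assumptions made for illustration.

```python
import math


def nb_log_prob(k, mu, theta):
    """Log-probability of count k under a negative binomial with
    mean mu and inverse dispersion theta (an assumed parameterization)."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_log_prob(k, mu, theta, pi):
    """Log-probability of count k under a ZINB: with probability pi the
    count is a structural zero, otherwise it is drawn from the NB."""
    nb = nb_log_prob(k, mu, theta)
    if k == 0:
        # A zero can come from the dropout component or from the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative
    Euclidean distances in the embedding space."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical usage: the ZINB probabilities sum to (approximately) one.
total = sum(math.exp(zinb_log_prob(k, mu=2.0, theta=1.5, pi=0.3)) for k in range(500))

# Hypothetical triplet: the positive is far from the anchor and the
# negative is close, so the loss is large and would push them apart.
loss = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0)
```

One plausible way to use such a loss for the purpose described above, stated here as an assumption rather than as the published design, is to pick the positive as a sample with the same biological label as the anchor but a different technical origin, and the negative as a sample with a different biological label. Minimizing the loss then pulls together samples that differ only technically.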
We used a large dataset of small RNA sequencing (small RNA-seq) profiles collected from the blood of 1,050 individuals, comprising patients with non-small cell lung cancer (NSCLC) and controls. We annotated oncRNAs using independent datasets of non-small cell lung cancer and healthy controls from The Cancer Genome Atlas (TCGA). Our deep generative AI model, using the same samples and the same oncRNAs, demonstrated superior performance compared to other methods, including the k-nearest-neighbor classifier, support vector machine classifier, ElasticNet, and XGBoost. Our model not only excelled in overall performance but also showed better threshold generalizability to held-out datasets and improved analytical properties.
For example, on the cross-validated dataset Orion had a sensitivity of 94% (95% C.I. 91%–96%) at 90% specificity, while XGBoost had a sensitivity of 84% (95% C.I. 80%–88%). On the held-out validation set, Orion was better calibrated: both its specificity and its sensitivity remained close to their cross-validated values (specificity of 87% and sensitivity of 93%), while the other methods were skewed towards a higher specificity (98%–99%) and significantly lower sensitivity (24%–71%). Orion scores for control samples sourced from different suppliers varied less than those of the other methods, suggesting a successful removal of technical variations. In addition, Orion showed more favorable properties with respect to variations in sequencing depth and in limit-of-detection benchmarks. We also showed that Orion can detect different subtypes of NSCLC, distinguishing squamous cell carcinoma from adenocarcinoma using serum samples from patients. Given the possibility of tumor subtype transition in response to different therapies, monitoring tumors with liquid histology may allow more patients to benefit from targeted therapy of emerging tumor populations.
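The idea of reporting sensitivity at a fixed specificity, and then checking whether the chosen threshold carries over to held-out samples, can be sketched as follows. This is an illustrative sketch only: the function names and all scores are made up, and it assumes that a higher score means "more likely cancer" and that a sample is called positive when its score exceeds the threshold.

```python
def sensitivity_at_specificity(control_scores, cancer_scores, target_specificity=0.90):
    """Find the lowest threshold that reaches the target specificity on the
    controls and report the sensitivity it gives on the cancer samples."""
    # Specificity only changes at control scores, so those are the candidates.
    for threshold in sorted(control_scores):
        specificity = sum(s <= threshold for s in control_scores) / len(control_scores)
        if specificity >= target_specificity:
            sensitivity = sum(s > threshold for s in cancer_scores) / len(cancer_scores)
            return threshold, specificity, sensitivity
    raise ValueError("target specificity cannot be reached")


def apply_threshold(threshold, control_scores, cancer_scores):
    """Evaluate a frozen threshold on a separate (e.g. held-out) set."""
    specificity = sum(s <= threshold for s in control_scores) / len(control_scores)
    sensitivity = sum(s > threshold for s in cancer_scores) / len(cancer_scores)
    return specificity, sensitivity


# Made-up training scores, for illustration only.
train_controls = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.9]
train_cancers = [0.3, 0.5, 0.6, 0.7, 0.8, 0.95, 0.4, 0.85, 0.55, 0.65]
threshold, train_spec, train_sens = sensitivity_at_specificity(train_controls, train_cancers)

# Made-up held-out scores: the threshold is frozen, not re-tuned.
heldout_spec, heldout_sens = apply_threshold(threshold, [0.1, 0.2, 0.5, 0.3], [0.6, 0.4])
```

A model whose scores shift between cohorts will see its held-out specificity and sensitivity drift away from the training values at the frozen threshold, which is the kind of skew described above for the comparison methods.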
Our results reveal that our generative AI framework surpasses commonly used methods for clinical genomics applications such as liquid biopsy and liquid histology. We showed how multi-objective neural networks can resolve the limitations of existing off-the-shelf solutions, and our approach could be extended to other modalities as well as joint embedding of multiple modalities.

We annotated lung cancer oncRNAs using the small RNA-seq datasets of NSCLC and adjacent normal tissues from The Cancer Genome Atlas (TCGA). We quantified the expression of these oncRNAs in a large dataset of serum samples obtained from 1,050 individuals with NSCLC or without a history of cancer. We used a generative AI model to detect the presence of NSCLC and its subtype. The model uses variational Bayes to model oncRNA expression. Through generative sampling, we train a robust classifier with superior performance and analytical properties compared to existing off-the-shelf models.
Nature Communications
An open access, multidisciplinary journal dedicated to publishing high-quality research in all areas of the biological, health, physical, chemical and Earth sciences.