Behind the paper: Deep generative AI models analyzing circulating orphan non-coding RNAs enable detection of early-stage lung cancer

Our recent study in Nature Communications highlights the success of a custom generative AI framework in detecting non-small cell lung cancer and its subtypes from serum-derived orphan non-coding RNAs.

The past decade saw deep learning move into applications historically considered to be solely within the realm of human intelligence. From board games to media generation, our perception of the power of artificial intelligence (AI) has been revolutionized. The full potential of AI in clinical genomics, however, has yet to be realized. While AI has been useful for discovering individual biomarkers, its use in diagnosis has been limited. Linear and more explainable machine learning algorithms have remained the toolkit of choice for most molecular diagnostic applications. Assays for sparse biomarkers such as oncRNAs, fragmentomics, and DNA methylation patterns measure a large number of molecules with extensive variability and a low signal-to-noise ratio. The pattern recognition power of deep learning can help overcome these sparsity barriers.

While regularized linear models are sufficiently powerful when a small number of biomarkers carries the signal, they fail to generalize when a non-linear combination of biomarkers distinguishes the phenotype or when there are major differences among the samples (e.g., technical variations). Methods such as XGBoost can discover non-linear patterns, yet they still fail to resolve these technical variations. In liquid biopsy, where the biomarkers have a sparse representation (especially in early-stage disease) and blood samples have significant technical variations, a robust deep learning model can improve performance and generalizability.
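To make the non-linearity point concrete, here is a toy sketch in plain Python. It is not taken from the study: the two binary "biomarkers" and the XOR-style phenotype are hypothetical, chosen only because no linear rule can separate them. A brute-force search over linear decision rules tops out at 75% accuracy, while a rule that uses the interaction between the two markers classifies every sample correctly.

```python
from itertools import product

# Hypothetical toy data: two binary biomarkers; the phenotype is positive
# when exactly one of them is present (an XOR pattern).
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def accuracy(rule):
    """Fraction of toy samples for which rule(x1, x2) matches the label."""
    return sum(rule(x1, x2) == label for (x1, x2), label in samples) / len(samples)


def best_linear_accuracy(grid=range(-3, 4)):
    """Best accuracy of any rule w1*x1 + w2*x2 + b > 0 over an integer grid."""
    return max(
        accuracy(lambda x1, x2: int(w1 * x1 + w2 * x2 + b > 0))
        for w1, w2, b in product(grid, repeat=3)
    )


def interaction_rule(x1, x2):
    """Non-linear rule: positive when the two markers disagree."""
    return int(x1 != x2)


print(best_linear_accuracy())      # 0.75
print(accuracy(interaction_rule))  # 1.0
```

The grid search is only a stand-in for a fitted linear model. The 75% ceiling holds for any weights, because XOR is not linearly separable.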

We previously described a group of cancer-emergent small RNAs, orphan non-coding RNAs (oncRNAs), which are present in cancer tissues and largely absent in healthy normal tissues (Fish et al. 2018, Nature Medicine). OncRNAs emerge as a result of chromatin rearrangements in tumorigenesis and therefore represent a novel opportunity for early detection of cancer. The dynamics of transcription and blood-based measurements, however, require a method capable of resolving sparsity by leveraging non-linear relationships and disentangling biological from technical variations.

Here, we leveraged the power of generative AI in a custom-built framework, a semi-supervised multi-objective twin variational auto-encoder called “Orion”. Orion learns the parameters of a zero-inflated negative binomial distribution to model the dataset through a variational Bayes framework. A contrastive objective, the triplet margin loss, lets us constrain the low-dimensional representation to remove technical variations while emphasizing biological variations. The supervised component leverages the generative process, which can draw an unlimited number of samples from the low-dimensional representation of the data. This process allows us to train a robust classifier that generalizes to samples with low-frequency deviations from the mean.
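The three ingredients named above can be sketched in a few lines of plain Python. This is a minimal illustration using textbook definitions, not Orion's actual code: the function names, the mean/inverse-dispersion parameterization of the negative binomial, the Euclidean distance in the triplet loss, and the Gaussian latent space are all assumptions made for the example.

```python
import math
import random


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Zero-inflated negative binomial: with probability pi the count is a
    structural zero, otherwise it is drawn from the negative binomial."""
    if x == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the positive than to the negative
    by at least `margin` (Euclidean distance); positive otherwise."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


def sample_latent(mean, log_var, rng):
    """Reparameterized draw z = mean + std * eps, eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# Hypothetical 2-D embeddings: the anchor sits near its positive, far from its negative.
print(triplet_margin_loss((0, 0), (0, 1), (3, 4)))  # 0.0, already separated
print(triplet_margin_loss((0, 0), (3, 4), (0, 1)))  # 5.0, penalized

# Repeated draws from one sample's latent distribution.
rng = random.Random(0)
draws = [sample_latent([0.5, -1.0], [0.0, 0.0], rng) for _ in range(3)]
```

One plausible way to use the triplet loss for this purpose, stated here as an assumption, is to pick the positive as a sample with the same biological label as the anchor but from a different batch or supplier, and the negative as a sample with a different label. Minimizing the loss then pulls same-label samples together regardless of technical origin.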

We used a large small RNA-seq dataset collected from the blood of 1,050 individuals with non-small cell lung cancer (NSCLC) and controls. We annotated oncRNAs using independent datasets of non-small cell lung cancer and healthy controls from The Cancer Genome Atlas (TCGA). Our deep generative AI model, using the same samples and the same oncRNAs, outperformed the other methods we tested: a k-nearest-neighbor classifier, a support vector machine classifier, ElasticNet, and XGBoost. Our model not only excelled in overall performance but also showed better threshold generalizability to held-out datasets and improved analytical properties.

For example, on the cross-validated dataset, Orion had a sensitivity of 94% (95% C.I. 91%–96%) at 90% specificity, whereas XGBoost had a sensitivity of 84% (95% C.I. 80%–88%). On the held-out validation set, Orion was better calibrated: both its specificity and its sensitivity remained similar (specificity of 87% and sensitivity of 93%), while other methods were skewed towards a higher specificity (98%–99%) and significantly lower sensitivity (24%–71%). Orion scores for control samples sourced from different suppliers were less variable than those of other methods, suggesting a successful removal of technical variations. In addition, Orion showed more favorable properties with respect to variations in sequencing depth and in the limit of detection benchmarks. We also showed that Orion can detect different subtypes of NSCLC, distinguishing squamous cell carcinoma from adenocarcinoma using serum samples from patients. Given the possibility of tumor subtype transition in response to different therapies, monitoring tumors with liquid histology may allow more patients to benefit from targeted therapy of emerging tumor populations.
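For readers unfamiliar with "sensitivity at 90% specificity" and threshold generalizability, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the study's evaluation code: the scores are made-up numbers, the function names are hypothetical, and confidence intervals are not computed. The cutoff is fixed on the training controls and then applied unchanged to held-out samples.

```python
import math


def threshold_at_specificity(control_scores, specificity=0.90):
    """Smallest score cutoff such that at least `specificity` of the
    controls score at or below it (and are therefore called negative)."""
    ranked = sorted(control_scores)
    # Small epsilon guards against floating-point overshoot before ceil.
    k = max(1, math.ceil(specificity * len(ranked) - 1e-9))
    return ranked[k - 1]


def sensitivity(case_scores, threshold):
    """Fraction of cancer samples scoring above the cutoff."""
    return sum(s > threshold for s in case_scores) / len(case_scores)


def specificity_at(control_scores, threshold):
    """Fraction of control samples scoring at or below the cutoff."""
    return sum(s <= threshold for s in control_scores) / len(control_scores)


# Hypothetical model scores (higher means more cancer-like).
train_controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]
train_cases = [0.30, 0.50, 0.60, 0.70, 0.90]
heldout_controls = [0.10, 0.20, 0.30, 0.50]
heldout_cases = [0.40, 0.55, 0.65, 0.95]

cutoff = threshold_at_specificity(train_controls, 0.90)
print(cutoff, sensitivity(train_cases, cutoff))  # 0.45 0.8

# The same cutoff applied to held-out data: a well-calibrated model keeps
# both numbers close to their training values.
print(specificity_at(heldout_controls, cutoff),
      sensitivity(heldout_cases, cutoff))        # 0.75 0.75
```

A model whose held-out specificity drifts far above the target while sensitivity collapses, as reported above for the comparison methods, has a score distribution that shifted between datasets, so a cutoff chosen on one no longer means the same thing on the other.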

Our results reveal that our generative AI framework surpasses commonly used methods for clinical genomics applications such as liquid biopsy and liquid histology. We showed how multi-objective neural networks can resolve the limitations of existing off-the-shelf solutions, and our approach could be extended to other modalities as well as joint embedding of multiple modalities.

We annotated lung cancer oncRNAs using the small RNA-seq datasets of NSCLC and adjacent normal tissues from The Cancer Genome Atlas (TCGA). We quantified the expression of these oncRNAs in a large dataset of serum samples obtained from 1,050 individuals with NSCLC or without a history of cancer. We used a generative AI model to detect the presence of NSCLC and its subtype. Our generative AI model uses variational Bayes to model oncRNA expression. Through generative sampling, we train a robust classifier with superior performance and analytical properties compared to existing off-the-shelf models.
