The past decade has seen deep learning move into applications historically considered to lie solely within the realm of human intelligence. From board games to media generation, our perception of the power of artificial intelligence (AI) has been revolutionized. The full potential of AI in clinical genomics, however, has yet to be realized. While AI has been useful in discovery, for example in identifying individual biomarkers, its use in diagnosis has been limited. Linear and more explainable machine learning algorithms have remained the toolkit of choice for most molecular diagnostic applications. Sparse biomarkers such as oncRNAs, fragmentomics, and DNA methylation patterns involve measuring a large number of molecules with extensive variability and a low signal-to-noise ratio. The pattern recognition power of deep learning and AI can overcome these sparsity barriers.
While regularized linear models are sufficiently powerful when a small number of biomarkers carries the signal, they fail to generalize when a non-linear combination of biomarkers distinguishes the phenotype or when there are major differences among the samples (e.g., technical variations). Methods such as XGBoost can discover non-linear patterns, yet they still fail to resolve these technical variations. In liquid biopsy, where the biomarkers have a sparse representation (especially in early-stage disease) and blood samples carry significant technical variations, a robust deep learning model can improve performance and generalizability.
We previously described a group of cancer-emergent small RNAs, orphan non-coding RNAs (oncRNAs), which are present in cancer tissues and largely absent in healthy normal tissues (Fish et al. 2018, Nature Medicine). OncRNAs emerge as a result of chromatin rearrangements in tumorigenesis and therefore represent a novel opportunity for early detection of cancer. The dynamics of transcription and of blood-based measurements, however, require a method capable of resolving sparsity by leveraging non-linear relationships and disentangling biological from technical variations.
Here, we leveraged the power of generative AI in a custom-built framework, a semi-supervised multi-objective twin variational auto-encoder called “Orion”. Orion learns the parameters of a zero-inflated negative binomial distribution to model the dataset through a variational Bayes framework. A contrastive objective, the triplet margin loss, allows us to constrain the low-dimensional representation so that technical variations are removed while biological variations are emphasized. The supervised component leverages the generative process, which can draw an unlimited number of samples from the low-dimensional representation of the data. This process allows us to train a robust classifier that generalizes to samples with low-frequency deviations from the mean.
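As a minimal illustration of two of the objectives named above, the sketch below writes out the log-probability of a zero-inflated negative binomial (ZINB) count and a triplet margin loss on embedding vectors, using only the Python standard library. It is not Orion's implementation. The mean/inverse-dispersion parameterization (`mu`, `theta`), the dropout probability `pi`, the Euclidean distance, the default margin, and the triplet assignment (positive = same phenotype from a different batch, negative = different phenotype) are all assumptions made for illustration.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse dispersion theta (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a ZINB: with probability pi the count
    is a structural zero, otherwise it is drawn from the negative binomial."""
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from either the dropout component or the NB itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative distances.
    The loss is zero once the negative is at least `margin` farther away."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


if __name__ == "__main__":
    # Hypothetical values: reconstruction term for one oncRNA count.
    print(-zinb_log_pmf(0, mu=2.0, theta=1.5, pi=0.3))
    # Hypothetical 2-D embeddings: same phenotype, different supplier
    # (positive) versus different phenotype (negative).
    print(triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
```

In a multi-objective setting, terms like these would be summed with weights alongside the variational and classification losses. The weights and the sampling scheme for triplets are design choices not specified here.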
We used a large small-RNA sequencing dataset collected from the blood of 1,050 individuals, comprising patients with non-small cell lung cancer (NSCLC) and controls. We annotated oncRNAs using independent datasets of non-small cell lung cancer and healthy controls from The Cancer Genome Atlas (TCGA). Our deep generative AI model, using the same samples and the same oncRNAs, demonstrated superior performance compared to other methods, namely the k-nearest-neighbor classifier, support vector machine classifier, ElasticNet, and XGBoost. Our model not only excelled in overall performance but also showed better threshold generalizability to held-out datasets and improved analytical properties.
For example, on the cross-validated dataset Orion had a sensitivity of 94% (95% C.I. 91%–96%) at 90% specificity, whereas XGBoost had a sensitivity of 84% (95% C.I. 80%–88%). On the held-out validation set, Orion was better calibrated: both its specificity and its sensitivity remained similar (specificity of 87% and sensitivity of 93%), while other methods were skewed towards a higher specificity (98%–99%) and significantly lower sensitivity (24%–71%). Orion scores for control samples sourced from different suppliers varied less than those of other methods, suggesting a successful removal of technical variations. In addition, Orion showed more favorable properties with respect to variations in sequencing depth and in limit-of-detection benchmarks. We also showed that Orion can detect different subtypes of NSCLC, distinguishing squamous cell carcinoma from adenocarcinoma using serum samples of patients. Given the possibility of tumor subtype transition in response to different therapies, monitoring tumors with liquid histology may allow more patients to benefit from targeted therapy of emerging tumor populations.
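The notion of threshold generalizability used above can be made concrete with a small sketch: a score cutoff is fixed at a target specificity on training controls and then applied unchanged to held-out samples. The function names, the convention that scores above the cutoff are called positive, and all scores below are hypothetical and chosen only for illustration. They are not data from this study.

```python
import math


def threshold_at_specificity(control_scores, target_specificity=0.90):
    """Smallest cutoff such that at least `target_specificity` of the given
    controls score at or below it (scores above the cutoff are called positive)."""
    ordered = sorted(control_scores)
    # Number of controls that must be called negative; the small epsilon
    # guards against floating-point overshoot in the product.
    k = math.ceil(target_specificity * len(ordered) - 1e-9)
    k = min(max(k, 1), len(ordered))
    return ordered[k - 1]


def sensitivity(case_scores, threshold):
    """Fraction of cases scoring above the cutoff."""
    return sum(s > threshold for s in case_scores) / len(case_scores)


def specificity(control_scores, threshold):
    """Fraction of controls scoring at or below the cutoff."""
    return sum(s <= threshold for s in control_scores) / len(control_scores)


if __name__ == "__main__":
    # Hypothetical model scores, for illustration only.
    train_controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
    heldout_controls = [0.10, 0.20, 0.50, 0.30]
    heldout_cases = [0.30, 0.50, 0.60, 0.70, 0.80]

    cutoff = threshold_at_specificity(train_controls, 0.90)
    # A well-calibrated model keeps both numbers close to their training values.
    print(specificity(heldout_controls, cutoff), sensitivity(heldout_cases, cutoff))
```

A model whose held-out specificity drifts far above the target while sensitivity collapses has scores that shifted between cohorts, which is the failure pattern described for the comparison methods.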
Our results reveal that our generative AI framework surpasses commonly used methods for clinical genomics applications such as liquid biopsy and liquid histology. We showed how multi-objective neural networks can resolve the limitations of existing off-the-shelf solutions, and our approach could be extended to other modalities as well as joint embedding of multiple modalities.