Machine learning interpretability for cancer transcriptome analysis
Published in Computational Sciences
Abstract
I performed differential gene expression (DEG) analysis on transcriptome data of acinar cell carcinoma of the prostate gland from male patients (tumor n=197, control n=49). A total of 1314 putative biomarkers (log2FC ≥ 1.5 and padj ≤ 0.05) were identified. In addition, I reviewed commercial molecular assays for prostate cancer diagnosis and listed 91 known biomarkers. Of these, ERG (log2FC=2.66, padj=1.60e-17) is the only gene that is also among my putative biomarkers. I also identified as potential biomarkers 20 genes whose fold change (FC) was ≥ 50 and whose average TPM in control samples was ≤ 10. Finally, I showed that a decision tree can be used for biomarker selection and, moreover, can assist in the interpretability of results, because the learnt rules are intuitive to read.
Keywords : prostate cancer, transcriptome, biomarkers, decision trees, interpretability.
Introduction
Early detection of prostate cancer significantly improves treatment outcomes and survival rates, emphasizing the need for diagnostic biomarkers that can be applied in different cohorts. Several commercial assays already use gene expression to diagnose prostate cancer [1], e.g. Decipher, Oncotype DX (GPS), Prolaris (CCP / CCR), PTEN, ProMark and TMPRSS2-ERG Fusion. In this work, I obtained the DEGs from prostate cancer data and compared them with the list of biomarkers already in experimental use.
Methods
The cohort builder of the GDC portal (https://portal.gdc.cancer.gov/), accessed on 2026-04-04, was used to obtain data from prostate cancer cases (Experimental Strategy = RNA-Seq, Data Format = tsv, Data Category = transcriptome profiling, Workflow Type = STAR - Counts, Sex at Birth = male, Primary Diagnosis = acinar cell carcinoma, Primary Site = prostate gland). Only RNA-seq data were downloaded. Of the 246 samples, 49 were from normal tissue and 197 from tumor tissue.
The R package DESeq2 [8] was used to obtain differentially expressed genes (log2FC ≥ 1.5 and padj ≤ 0.05) for samples of Normal versus Tumor tissue types. A GitHub repository with data and scripts to reproduce the results is available at: https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/
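The thresholding step described above can be sketched in a few lines. The analysis itself was run in R with DESeq2; the Python below is only an illustration, and the column names (`log2FoldChange`, `padj`) and the rows other than ERG are assumptions made for the example. DESeq2 can report a missing adjusted p-value for some genes, so the sketch treats a missing `padj` as "not selected".

```python
import math

# Hypothetical rows mimicking a DESeq2 results table.
# Only the ERG values come from the text; the other rows are invented.
results = [
    {"gene": "ERG", "log2FoldChange": 2.66, "padj": 1.60e-17},
    {"gene": "GENE_A", "log2FoldChange": 0.40, "padj": 1.0e-06},       # effect too small
    {"gene": "GENE_B", "log2FoldChange": 3.10, "padj": 0.20},          # not significant
    {"gene": "GENE_C", "log2FoldChange": 1.90, "padj": float("nan")},  # padj missing
]


def select_degs(rows, min_log2fc=1.5, max_padj=0.05):
    """Return gene names with log2FC >= min_log2fc and padj <= max_padj."""
    selected = []
    for row in rows:
        padj = row["padj"]
        if padj is None or math.isnan(padj):
            continue  # a missing adjusted p-value cannot pass the filter
        if row["log2FoldChange"] >= min_log2fc and padj <= max_padj:
            selected.append(row["gene"])
    return selected


print(select_degs(results))
```

Note that the filter is one-sided (up-regulated in tumor), matching the thresholds as they are written in the text.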
Results
Tumor Genes
I identified 1314 differentially expressed tumor genes (Tumor versus Normal; log2FC ≥ 1.5 and padj ≤ 0.05; see Supplemental Table S1). A principal component analysis (PCA) plot based on these genes shows good separation of tumor and normal samples, with 15% and 12% of the variance explained by PC1 and PC2, respectively (see Figure 1).

Figure 1. PCA plot of RNA-seq samples with DEGs for tumor and normal samples.
Among the DEGs, ERG (log2FC=2.66, padj=1.60e-17) is the only gene among my putative biomarkers that is already used for prostate cancer diagnosis (see Table 1).

Figure 2. Boxplot of normalized read counts (transcripts per million, TPM) for Tumor versus Normal samples. Significance cutpoints = [0, 0.0001, 0.001, 0.01, 0.05, Inf], with symbols ["****", "***", "**", "*", "ns"].
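To make the cutpoint notation of the Figure 2 caption concrete, here is a minimal sketch of how a p-value could be mapped to its symbol. It assumes right-closed intervals (for example, p = 0.05 still receives "*"); the function name is hypothetical and the plotting library actually used may treat the boundaries differently.

```python
from bisect import bisect_left

# Upper bounds of the first four intervals; anything above 0.05 is "ns".
UPPER_BOUNDS = [0.0001, 0.001, 0.01, 0.05]
SYMBOLS = ["****", "***", "**", "*", "ns"]


def significance_symbol(p_value):
    """Map a p-value to its symbol, assuming intervals of the form (lower, upper]."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # bisect_left returns the index of the first bound >= p_value
    return SYMBOLS[bisect_left(UPPER_BOUNDS, p_value)]


for p in (1.60e-17, 0.0005, 0.03, 0.2):
    print(p, significance_symbol(p))
```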
Table 1 - Biomarker genes for prostate cancer diagnosis assays.
| Biomarker | Assay/Type | Genes | Publication |
|---|---|---|---|
| Decipher | Genomic classifier (22-gene RNA) | LASP1, IQGAP3, NFIB, S1PR4, THBS2, ANO7, PCDH7, MYBPC1, EPPK1, TSBP, PBX1, NUSAP1, ZWILCH, UBE2C, CAMK2N1, RABGAP1, PCAT-32, GLYATL1P4/PCAT-80, TNFRSF19 | [2] |
| Oncotype DX (GPS) | RT-PCR score (17 genes: 12 cancer + 5 reference) | ARF1, ATP5E, AZGP1, BGN, CLTC, COL1A1, FAM13C1, FLNC, GPS1, GSN, GSTM2, KLK2, PGK1, SFRP4, SRD5A2, TPM2, TPX2 | [3] |
| Prolaris (CCP / CCR) | Cell-cycle gene panel (31 genes; 16-gene version) | ASPM, CDC2, CDCA8, CDKN3, DTL, FOXM1, KIAA0101, NUSAP1, PRC1, TK1, CLTC, PSMA1, RPL4, RPS29, SLC25A3, UBA52, ASF1B, BIRC5, BUB1B, C18orf24, CDC20, CDCA3, CENPF, CENPM, CEP55, DLGAP5, KIF11, KIF20A, MCM10, ORC6L, PBK, PLK1, PTTG1, RAD51, RAD54L, RRM2, TOP2A, MMADHC, MRFAP1, PPP2CA, PSMC1, RPL8, RPL13A, RPL37, RPL38, TXNL1 | [4] |
| PTEN | Protein/gene loss (IHC, FISH, NGS) | PTEN | [5] |
| ProMark | Proteomic assay (8-protein panel) | DERL1, CUL2, SMAD4, PDSS2, HSPA9, FUS, pS6, YBX1 | [6] |
| TMPRSS2-ERG Fusion | Genomic fusion (TMPRSS2::ERG) | TMPRSS2, ERG | [7] |
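The comparison between the DEGs and the assay genes amounts to a set intersection. The sketch below copies the gene lists from Table 1 and shows that their union contains 91 distinct entries (NUSAP1 and CLTC each appear in two panels). The `toy_degs` list is a hypothetical stand-in for the 1314-gene list in Supplemental Table S1; ERG is included because it is the Table 1 gene reported with log2FC=2.66.

```python
# Gene lists copied from Table 1; keys are the assay names.
PANELS = {
    "Decipher": [
        "LASP1", "IQGAP3", "NFIB", "S1PR4", "THBS2", "ANO7", "PCDH7", "MYBPC1",
        "EPPK1", "TSBP", "PBX1", "NUSAP1", "ZWILCH", "UBE2C", "CAMK2N1",
        "RABGAP1", "PCAT-32", "GLYATL1P4/PCAT-80", "TNFRSF19",
    ],
    "Oncotype DX (GPS)": [
        "ARF1", "ATP5E", "AZGP1", "BGN", "CLTC", "COL1A1", "FAM13C1", "FLNC",
        "GPS1", "GSN", "GSTM2", "KLK2", "PGK1", "SFRP4", "SRD5A2", "TPM2", "TPX2",
    ],
    "Prolaris (CCP / CCR)": [
        "ASPM", "CDC2", "CDCA8", "CDKN3", "DTL", "FOXM1", "KIAA0101", "NUSAP1",
        "PRC1", "TK1", "CLTC", "PSMA1", "RPL4", "RPS29", "SLC25A3", "UBA52",
        "ASF1B", "BIRC5", "BUB1B", "C18orf24", "CDC20", "CDCA3", "CENPF",
        "CENPM", "CEP55", "DLGAP5", "KIF11", "KIF20A", "MCM10", "ORC6L", "PBK",
        "PLK1", "PTTG1", "RAD51", "RAD54L", "RRM2", "TOP2A", "MMADHC", "MRFAP1",
        "PPP2CA", "PSMC1", "RPL8", "RPL13A", "RPL37", "RPL38", "TXNL1",
    ],
    "PTEN": ["PTEN"],
    "ProMark": ["DERL1", "CUL2", "SMAD4", "PDSS2", "HSPA9", "FUS", "pS6", "YBX1"],
    "TMPRSS2-ERG Fusion": ["TMPRSS2", "ERG"],
}


def known_biomarkers(panels):
    """Union of all panel genes; a gene shared by two assays is counted once."""
    return set().union(*panels.values())


def overlap_with_degs(deg_genes, panels):
    """Sorted list of DEGs that also appear in at least one assay panel."""
    return sorted(set(deg_genes) & known_biomarkers(panels))


# Hypothetical DEG subset for illustration only.
toy_degs = ["ERG", "DEFA5", "SCN1A", "LINC00993"]
print(len(known_biomarkers(PANELS)))
print(overlap_with_degs(toy_degs, PANELS))
```

Matching here is by exact gene symbol, so aliases (for example CDC2 versus CDK1) would need to be harmonized before a real comparison.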
Biomarkers
In addition to the DEGs, genes whose fold change (FC) was ≥ 50 and whose average TPM in control samples was ≤ 10 were identified as potential treatment targets (see Table 2).
Table 2 - Genes with fold change ≥ 50 and average TPM in control samples ≤ 10.
| Gene name | Fold change | Avg normal | Std normal | Avg tumor | Std tumor | Reference |
|---|---|---|---|---|---|---|
| SCN1A | 58.60 | 0.00 | 0.01 | 0.25 | 2.18 | |
| GC | 53.93 | 0.06 | 0.15 | 3.21 | 20.61 | [9] |
| ANKRD30A | 107.03 | 0.01 | 0.04 | 1.58 | 7.72 | |
| OOSP2 | 51.12 | 0.00 | 0.01 | 0.04 | 0.22 | |
| FEZF2 | 89.64 | 0.01 | 0.02 | 0.54 | 4.78 | |
| CDC20B | 52.09 | 0.12 | 0.26 | 6.15 | 74.30 | |
| DEFA5 | 1,959.56 | 0.10 | 0.24 | 189.00 | 2,480.52 | |
| DEFA6 | 932.12 | 0.06 | 0.13 | 55.49 | 693.42 | |
| DEFA6 | 55.71 | 3.28 | 5.30 | 182.98 | 1,385.58 | |
| SNORA74A | 132.91 | 0.22 | 0.42 | 29.16 | 243.45 | [10] |
| RN7SKP9 | 63.39 | 0.03 | 0.08 | 2.04 | 15.88 | |
| SNORD17 | 137.17 | 1.86 | 2.23 | 255.81 | 2,368.35 | |
| SNORA74B | 86.41 | 0.37 | 0.56 | 31.75 | 278.76 | |
| CCNJP2 | 64.18 | 0.00 | 0.01 | 0.18 | 1.39 | |
| AC004485.1 | 66.68 | 0.01 | 0.02 | 0.37 | 3.08 | |
| OR52Y1P | 83.32 | 0.01 | 0.02 | 0.47 | 3.03 | |
| LINC00993 | 185.85 | 0.04 | 0.12 | 6.57 | 31.61 | |
| VN1R53P | 69.26 | 0.00 | 0.03 | 0.34 | 1.71 | |
| TUSC7 | 66.01 | 0.00 | 0.01 | 0.16 | 1.19 | |
| Y_RNA | 130.07 | 0.02 | 0.12 | 3.17 | 18.08 | [11] |
* Fold change = Tumor/Normal, Avg normal = Mean(Normal), Std normal = Std(Normal), Avg tumor = Mean(Tumor), Std tumor = Std(Tumor). Values are in TPM. References are from a PubMed search: gene_name + "prostate cancer"[Title/Abstract] + biomarker.
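The selection rule behind this table (fold change ≥ 50 with an average control TPM ≤ 10) can be sketched as follows. The gene names and TPM values are invented for illustration, and skipping genes whose control mean is exactly zero is an assumption of this sketch, since the text does not say how that case was handled.

```python
from statistics import mean


def fold_change_candidates(tpm_by_gene, min_fc=50.0, max_normal_mean=10.0):
    """Select genes by mean-TPM fold change (Tumor/Normal) and low control expression.

    tpm_by_gene maps a gene name to a pair (normal_tpms, tumor_tpms).
    """
    selected = {}
    for gene, (normal_tpms, tumor_tpms) in tpm_by_gene.items():
        avg_normal = mean(normal_tpms)
        avg_tumor = mean(tumor_tpms)
        if avg_normal <= 0:
            continue  # assumption: fold change is undefined without control signal
        fold_change = avg_tumor / avg_normal
        if fold_change >= min_fc and avg_normal <= max_normal_mean:
            selected[gene] = round(fold_change, 2)
    return selected


# Invented TPM values, not taken from the study.
toy_tpm = {
    "GENE_X": ([0.5, 0.5], [40.0, 20.0]),        # FC = 60, low in controls
    "GENE_Y": ([20.0, 20.0], [2000.0, 2000.0]),  # FC = 100, but controls too high
    "GENE_Z": ([1.0, 1.0], [5.0, 5.0]),          # FC = 5, too small
    "GENE_W": ([0.0, 0.0], [1.0, 1.0]),          # control mean is zero, skipped
}
print(fold_change_candidates(toy_tpm))
```

Because the tumor standard deviations in the table are often several times the tumor mean, a mean-based fold change like this one can be driven by a few samples, which is worth checking per gene.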
Interpretability studies
Decision tree
A decision tree was constructed with data from all tumor genes. Gene expression values were discretized into three categories (Low, Medium and High), and these categories were used to fit a decision tree model for the prediction of Tissue Type (Tumor/Normal). From Fig. 3 we can read that the initial distribution of the n=246 samples is 20% Normal and 80% Tumor. Moreover, the rules can be read intuitively: IF the expression of GSTP1 is low THEN the model classifies the sample as Tumor. However, if GSTP1 expression is high or medium and TRGC1 expression is low, the model classifies the sample as Normal. The complete set of rules is given in Table 3.

Figure 3. Decision tree models built from prostate cancer data for the prediction of Tissue Type (Tumor/Normal). ENSG00000084207.18 (GSTP1), ENSG00000211689.7 (TRGC1), ENSG00000124233.12(SEMG1)
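The text does not state how expression values were binned into Low, Medium and High. As one possible reading, the sketch below splits each gene's values at its tertiles; the tertile choice and the function name are assumptions, not the procedure used in the study.

```python
from statistics import quantiles


def discretize(values):
    """Label each value low/medium/high using the tertiles of the same values."""
    low_cut, high_cut = quantiles(values, n=3)  # two cut points for three bins
    labels = []
    for value in values:
        if value <= low_cut:
            labels.append("low")
        elif value <= high_cut:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Invented expression values for a single gene across nine samples.
print(discretize([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Binning per gene keeps the rules readable (a split becomes "GSTP1 is low"), at the cost of discarding the magnitude of the difference within a bin.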
Table 3. Decision tree rules for the prediction of Tissue Type (Tumor/Normal).
| Rule | Tissue Type prediction | n | loss | Normal | Tumor | leaf node |
|---|---|---|---|---|---|---|
| root | Tumor | 246 | 49 | 0.20 | 0.80 | |
| IF GSTP1=high,medium THEN | Normal | 38 | 14 | 0.63 | 0.37 | |
| IF GSTP1=high,medium AND TRGC1=low THEN | Normal | 27 | 6 | 0.77 | 0.23 | * |
| IF GSTP1=high,medium AND TRGC1=high,medium THEN | Tumor | 11 | 3 | 0.27 | 0.73 | * |
| IF GSTP1=low THEN | Tumor | 208 | 25 | 0.12 | 0.88 | |
| IF GSTP1=low AND SEMG1=high,medium THEN | Normal | 12 | 2 | 0.83 | 0.17 | * |
| IF GSTP1=low AND SEMG1=low THEN | Tumor | 196 | 15 | 0.08 | 0.92 | * |
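Because the tree has only four leaves, its rules fit in a short function. The sketch below encodes each leaf with the majority class implied by its Normal/Tumor proportions; the function name and the lower-case level labels are assumptions. The leaf sizes (27 + 11 + 12 + 196) add up to the 246 samples of the cohort, and the leaf losses add up to the number of samples misclassified on the data used to fit the tree.

```python
LEVELS = ("low", "medium", "high")


def predict_tissue(gstp1, trgc1, semg1):
    """Apply the four leaf rules of the tree to discretized expression levels."""
    for level in (gstp1, trgc1, semg1):
        if level not in LEVELS:
            raise ValueError(f"unknown expression level: {level!r}")
    if gstp1 in ("high", "medium"):
        # GSTP1 still expressed: TRGC1 decides
        return "Normal" if trgc1 == "low" else "Tumor"
    # GSTP1 low: SEMG1 decides
    return "Normal" if semg1 in ("high", "medium") else "Tumor"


# (n, loss) for the four leaves, read from the rule table.
LEAVES = [(27, 6), (11, 3), (12, 2), (196, 15)]
total_samples = sum(n for n, _ in LEAVES)
total_errors = sum(loss for _, loss in LEAVES)

print(predict_tissue("low", "low", "low"))
print(total_samples, total_errors)
```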
The decision tree performance was assessed with a confusion matrix computed from a model fitted on all data and used to predict the tissue type of the same data [Accuracy: 0.90, 95% CI: (0.85, 0.93), No Information Rate: 0.80, P-Value [Acc > NIR]: 6.12e-05]. In addition, a confusion matrix was constructed from a model fitted on a training set (75% of the samples, randomly selected) and evaluated on the testing set (the remaining 25%) [Accuracy: 0.97, 95% CI: (0.89, 0.99), No Information Rate: 0.85, P-Value [Acc > NIR]: 0.004] (see Table 4).
Table 4. Confusion matrix from predicted versus actual data. Whole dataset and training versus testing set were assessed.
| Data set | Prediction | Reference: Normal | Reference: Tumor |
|---|---|---|---|
| Whole data | Normal | 31 | 8 |
| Whole data | Tumor | 18 | 189 |
| Training versus testing sets | Normal | 31 | 8 |
| Training versus testing sets | Tumor | 18 | 189 |
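The whole-data statistics can be re-derived from the four counts 31, 8, 18 and 189. The sketch below computes the accuracy and the no-information rate (the share of the larger reference class), and compares them with a one-sided exact binomial test. Treating the reported P-Value [Acc > NIR] as this particular test is an assumption of the sketch, and the function name is hypothetical.

```python
from math import comb

# Whole-data counts keyed by (prediction, reference).
COUNTS = {
    ("Normal", "Normal"): 31,
    ("Normal", "Tumor"): 8,
    ("Tumor", "Normal"): 18,
    ("Tumor", "Tumor"): 189,
}


def accuracy_summary(counts):
    """Return (accuracy, no-information rate, one-sided binomial p-value)."""
    total = sum(counts.values())
    correct = sum(n for (pred, ref), n in counts.items() if pred == ref)
    reference_totals = {}
    for (_, ref), n in counts.items():
        reference_totals[ref] = reference_totals.get(ref, 0) + n
    nir = max(reference_totals.values()) / total
    # P(X >= correct) for X ~ Binomial(total, nir)
    p_value = sum(
        comb(total, k) * nir**k * (1 - nir) ** (total - k)
        for k in range(correct, total + 1)
    )
    return correct / total, nir, p_value


accuracy, nir, p_value = accuracy_summary(COUNTS)
print(f"accuracy={accuracy:.3f} NIR={nir:.3f} p={p_value:.2e}")
```

The output should land close to the accuracy and no-information rate quoted in the text. An accuracy measured on the same samples used to fit the tree is optimistic, which is why the held-out split is the more informative of the two evaluations.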
Conclusions
The challenge of identifying biomarkers for cancer treatment seems to lie in the construction of cohorts from a bioinformatics perspective. What matters for individualized treatment, however, is to have information that can move the research forward. The differentially expressed genes and selected biomarkers can drive the construction of cohorts for further research. I also showed that decision tree rules are intuitive to read, can be used for biomarker selection and, moreover, assist in the interpretability of results.
I searched PubMed for the genes automatically selected by the decision tree in relation to prostate cancer, using the query (gene_name + "prostate cancer"[Title/Abstract] + biomarker) AND (("2026"[Date - Publication] : "2026"[Date - Publication])), and found that the GSTP1 gene has publications implicating it in prostate cancer in 2026 [12], [13], [14].
Declarations
Ethics approval and consent to participate
This research used publicly available, non-identifiable human transcriptome data already published on the GDC portal (https://portal.gdc.cancer.gov/), accessed on 2026-04-04.
Declaration of interest statement
The author declares he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Consent for publication
I hereby provide consent for the publication of the manuscript “A decision tree model for intuitive interpretation of transcriptome analysis of prostate cancer”, including any accompanying images or data contained within the manuscript.
Availability of data and materials
The data that support the findings of this study are openly available in the GDC portal. A GitHub repository with data and scripts to reproduce the results is available at: https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/
Authors' contributions
Felipe Leal Valentim (FLV) conceived and implemented the project.
Funding declaration
FLV's post-doctoral position was funded by FUSP, but this work was performed independently.
References
[1] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.
[2] Erho N, Crisan A, Vergara IA, Mitra AP, Ghadessi M, Buerki C, Bergstralh EJ, Kollmeyer T, Fink S, Haddad Z, Zimmermann B, Sierocinski T, Ballman KV, Triche TJ, Black PC, Karnes RJ, Klee G, Davicioni E, Jenkins RB. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS One. 2013 Jun 24;8(6):e66855. doi: 10.1371/journal.pone.0066855. PMID: 23826159; PMCID: PMC3691249.
[3] Knezevic D, Goddard AD, Natraj N, Cherbavaz DB, Clark-Langone KM, Snable J, Watson D, Falzarano SM, Magi-Galluzzi C, Klein EA, Quale C. Analytical validation of the Oncotype DX prostate cancer assay - a clinical RT-PCR assay optimized for prostate needle biopsies. BMC Genomics. 2013 Oct 8;14:690. doi: 10.1186/1471-2164-14-690. PMID: 24103217; PMCID: PMC4007703.
[4] Kuhl V, Clegg W, Meek S, Lenz L, Flake DD 2nd, Ronan T, Kornilov M, Horsch D, Scheer M, Farber D, Zalaznick H, Cussenot O, Compérat E, Cancel-Tassin G, Wild PJ, Chun FK, Mandel P, Moinfar F, Cohen T, Delee S, Kronenwett R, Doedt J. Development and validation of a cell cycle progression signature for decentralized testing of men with prostate cancer. Biomark Med. 2022 Apr;16(6):449-459. doi: 10.2217/bmm-2021-0479. Epub 2022 Mar 24. PMID: 35321552.
[5] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.
[6] Roth JA, Ramsey SD, Carlson JJ. Cost-Effectiveness of a Biopsy-Based 8-Protein Prostate Cancer Prognostic Assay to Optimize Treatment Decision Making in Gleason 3 + 3 and 3 + 4 Early Stage Prostate Cancer. Oncologist. 2015 Dec;20(12):1355-64. doi: 10.1634/theoncologist.2015-0214. Epub 2015 Oct 19. PMID: 26482553; PMCID: PMC4679086.
[7] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.
[8] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281; PMCID: PMC4302049.
[9] Trummer O, Langsenlehner U, Krenn-Pilko S, Pieber TR, Obermayer-Pietsch B, Gerger A, Renner W, Langsenlehner T. Vitamin D and prostate cancer prognosis: a Mendelian randomization study. World J Urol. 2016 Apr;34(4):607-11. doi: 10.1007/s00345-015-1646-9. Epub 2015 Jul 25. PMID: 26209090.
[10] Sanders I, Holdenrieder S, Walgenbach-Brünagel G, von Ruecker A, Kristiansen G, Müller SC, Ellinger J. Evaluation of reference genes for the analysis of serum miRNA in patients with prostate cancer, bladder cancer and renal cell carcinoma. Int J Urol. 2012 Nov;19(11):1017-25. doi: 10.1111/j.1442-2042.2012.03082.x. Epub 2012 Jul 12. PMID: 22788411.
[11] Schut IC, Waterfall PM, Ross M, O'Sullivan C, Miller WR, Habib FK, Bayne CW. MUC1 expression, splice variant and short form transcription (MUC1/Z, MUC1/Y) in prostate cell lines and tissue. BJU Int. 2003 Feb;91(3):278-83. doi: 10.1046/j.1464-410x.2003.03062.x. PMID: 12581019.
[12] De Vrieze M, Zhang N, Seibold P, Gerhäuser C, Albers P, Krilaviciute A. Clinical validity of circulating tumor DNA as a diagnostic biomarker for prostate cancer: a systematic review. Cancer Epidemiol Biomarkers Prev. 2026 Mar 13. doi: 10.1158/1055-9965.EPI-25-1820. Epub ahead of print. PMID: 41824537.
[13] Huang Y, Mao J, Li X. Emerging biomarkers in prostate cancer diagnosis and treatment: Insights into genetic, RNA and metabolic markers (Review). Int J Oncol. 2026 Feb;68(2):15. doi: 10.3892/ijo.2025.5828. Epub 2025 Dec 5. PMID: 41347816; PMCID: PMC12716904.
[14] Ren Z, Liu X, Zhang J, Song M, Yang Q, Li C, Liu D. Network pharmacology research integrating LC-MS/MS, machine learning, molecular docking, and dynamics simulation: key biomarkers and potential mechanisms of Phellinus igniarius against prostate cancer. In Silico Pharmacol. 2026 Feb 17;14(1):65. doi: 10.1007/s40203-025-00511-5. PMID: 41717432; PMCID: PMC12913839.