Machine learning interpretability for cancer transcriptome analysis

This brief report demonstrates how machine learning methods can enhance interpretability and assist decision-making around biomarker selection. I applied decision trees to the prostate cancer transcriptome data from the Genomic Data Commons (GDC) Cancer Portal.

Published in Computational Sciences


Abstract

I performed differentially expressed gene (DEG) analysis on transcriptome data of acinar cell carcinoma of the prostate gland from male patients (tumor n=197, control n=49). In total, 1314 putative biomarkers (log2FC ≥ 1.5 and padj ≤ 0.05) were identified. In addition, I reviewed commercial molecular assays for prostate cancer diagnosis and listed 91 known biomarkers. Of these, GRP (ERG log2FC=2.66, padj=1.60e-17) is the only gene among the putative biomarkers. I also identified as potential biomarkers 20 genes whose fold change (FC) was ≥ 50 and whose average TPM in control samples was ≤ 10. Finally, I showed that a decision tree can be used for biomarker selection and, moreover, can assist in the interpretability of results, because the learnt rules are intuitive to read.

Keywords : prostate cancer, transcriptome, biomarkers, decision trees, interpretability.

Introduction

Early detection of prostate cancer significantly improves treatment outcomes and survival rates, emphasizing the need for diagnostic biomarkers that can be applied across different cohorts. Several commercial assays already use gene expression to diagnose prostate cancer [1], e.g. Decipher, Oncotype DX (GPS), Prolaris (CCP / CCR), PTEN, ProMark and TMPRSS2-ERG Fusion. In this work, I obtained the DEGs from prostate cancer data and compared them with the list of biomarkers already in experimental use.

Methods

The cohort builder of the GDC portal (https://portal.gdc.cancer.gov/), accessed on 2026-04-04, was used to obtain data from prostate cancer cases (Experimental Strategy = RNA-Seq, Data Format = tsv, Data Category = transcriptome profiling, Workflow Type = STAR - Counts, Sex at Birth = male, Primary Diagnosis = acinar cell carcinoma, Primary Site = prostate gland). Only RNA-seq data were downloaded. Among the 246 samples, 49 were from normal tissue and 197 from tumor tissue.


The R package DESeq2 [8] was used to obtain differentially expressed genes (log2FC ≥ 1.5 and padj ≤ 0.05) between the Normal and Tumor tissue types. A GitHub repository with data and scripts to reproduce the results is available at:
https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/
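The thresholding step can be illustrated with a minimal sketch. It is written in Python rather than R, and it is not the script from the repository: the CSV excerpt, the gene names and the function `select_degs` are hypothetical. Only the two cut-offs (log2FC ≥ 1.5 and padj ≤ 0.05) and the values in the first row are taken from the text.

```python
import csv
import io

# Hypothetical excerpt of a DESeq2 results table exported to CSV.
# GENE_1 reuses the log2FC and padj quoted in the text; other rows are made up.
RESULTS_CSV = """gene,log2FoldChange,padj
GENE_1,2.66,1.60e-17
GENE_2,0.40,0.30
GENE_3,1.80,0.20
GENE_4,NA,NA
"""

LOG2FC_MIN = 1.5  # log2FC >= 1.5, as stated in the Methods
PADJ_MAX = 0.05   # padj <= 0.05, as stated in the Methods


def select_degs(csv_text, log2fc_min=LOG2FC_MIN, padj_max=PADJ_MAX):
    """Return the gene names that pass both thresholds."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # DESeq2 can report NA for filtered genes; skipping them is an assumption.
        if "NA" in (row["log2FoldChange"], row["padj"]):
            continue
        log2fc = float(row["log2FoldChange"])
        padj = float(row["padj"])
        if log2fc >= log2fc_min and padj <= padj_max:
            selected.append(row["gene"])
    return selected


degs = select_degs(RESULTS_CSV)
print(degs)
```

Note that the filter is one-sided, as in the text: only genes with higher expression in tumor samples are retained.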

Results 

Tumor Genes

I identified 1314 differentially expressed tumor genes (Tumor versus Normal; log2FC ≥ 1.5 and padj ≤ 0.05; see Supplemental Table S1). A principal component analysis (PCA) plot based on these genes shows good separation of tumor and normal samples, with PC1 and PC2 explaining 15% and 12% of the variance, respectively (see Figure 1).

Figure 1. PCA plot of RNA-seq samples with DEGs  for tumor and normal samples. 

 

Among the DEGs, GRP (ERG log2FC=2.66, padj=1.60e-17) is the only putative biomarker that is already used for prostate cancer diagnosis (see Table 1).

 

Figure 2. Boxplot of normalized read counts (transcripts per million, TPM) for Tumor versus Normal samples. Significance cutpoints = [0, 0.0001, 0.001, 0.01, 0.05, Inf], symbols = ["****", "***", "**", "*", "ns"].
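As a minimal sketch of how the cutpoints in the Figure 2 caption translate a p-value into a symbol: the function name `significance_symbol` is hypothetical, and treating each interval as closed on the right (so that p = 0.05 still maps to "*") is an assumption, since the caption does not say how boundaries are handled.

```python
from bisect import bisect_left

# Cutpoints and symbols as listed in the Figure 2 caption.
CUTPOINTS = [0, 0.0001, 0.001, 0.01, 0.05, float("inf")]
SYMBOLS = ["****", "***", "**", "*", "ns"]


def significance_symbol(p_value):
    """Map a p-value to its symbol.

    Interval i is taken as (CUTPOINTS[i], CUTPOINTS[i + 1]]; the
    right-closed boundary is an assumption, not stated in the caption.
    """
    if p_value < 0:
        raise ValueError("p-value must be non-negative")
    # Search among the upper bounds of the five intervals.
    index = bisect_left(CUTPOINTS[1:], p_value)
    return SYMBOLS[index]


for p in (1.60e-17, 0.0005, 0.03, 0.2):
    print(p, significance_symbol(p))
```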




Table 1 - Biomarker genes for prostate cancer diagnosis assays.

| Biomarker | Assay/Type | Genes | Publication |
|---|---|---|---|
| Decipher | Genomic classifier (22-gene RNA) | LASP1, IQGAP3, NFIB, S1PR4, THBS2, ANO7, PCDH7, MYBPC1, EPPK1, TSBP, PBX1, NUSAP1, ZWILCH, UBE2C, CAMK2N1, RABGAP1, PCAT-32, GLYATL1P4/PCAT-80, TNFRSF19 | [2] |
| Oncotype DX (GPS) | RT-PCR score (17 genes: 12 cancer + 5 reference) | ARF1, ATP5E, AZGP1, BGN, CLTC, COL1A1, FAM13C1, FLNC, GPS1, GSN, GSTM2, KLK2, PGK1, SFRP4, SRD5A2, TPM2, TPX2 | [3] |
| Prolaris (CCP / CCR) | Cell-cycle gene panel (31 genes; 16-gene version) | ASPM, CDC2, CDCA8, CDKN3, DTL, FOXM1, KIAA0101, NUSAP1, PRC1, TK1, CLTC, PSMA1, RPL4, RPS29, SLC25A3, UBA52, ASF1B, BIRC5, BUB1B, C18orf24, CDC20, CDCA3, CENPF, CENPM, CEP55, DLGAP5, KIF11, KIF20A, MCM10, ORC6L, PBK, PLK1, PTTG1, RAD51, RAD54L, RRM2, TOP2A, MMADHC, MRFAP1, PPP2CA, PSMC1, RPL8, RPL13A, RPL37, RPL38, TXNL1 | [4] |
| PTEN | Protein/gene loss (IHC, FISH, NGS) | PTEN | [5] |
| ProMark | Proteomic assay (8-protein panel) | DERL1, CUL2, SMAD4, PDSS2, HSPA9, FUS, pS6, YBX1 | [6] |
| TMPRSS2-ERG Fusion | Genomic fusion (TMPRSS2::ERG) | TMPRSS2, ERG | [7] |



Biomarkers

In addition to the DEGs, genes whose fold change (FC) was ≥ 50 and whose average TPM in control samples was ≤ 10 were identified as potential treatment targets (see Table 2).

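Table 2. Genes with fold change ≥ 50 and average TPM of control samples ≤ 10.

| Gene name | Fold change | Avg normal | Std normal | Avg tumor | Std tumor | Reference |
|---|---|---|---|---|---|---|
| SCN1A | 58.60 | 0.00 | 0.01 | 0.25 | 2.18 | |
| GC | 53.93 | 0.06 | 0.15 | 3.21 | 20.61 | [9] |
| ANKRD30A | 107.03 | 0.01 | 0.04 | 1.58 | 7.72 | |
| OOSP2 | 51.12 | 0.00 | 0.01 | 0.04 | 0.22 | |
| FEZF2 | 89.64 | 0.01 | 0.02 | 0.54 | 4.78 | |
| CDC20B | 52.09 | 0.12 | 0.26 | 6.15 | 74.30 | |
| DEFA5 | 1,959.56 | 0.10 | 0.24 | 189.00 | 2,480.52 | |
| DEFA6 | 932.12 | 0.06 | 0.13 | 55.49 | 693.42 | |
| DEFA6 | 55.71 | 3.28 | 5.30 | 182.98 | 1,385.58 | |
| SNORA74A | 132.91 | 0.22 | 0.42 | 29.16 | 243.45 | [10] |
| RN7SKP9 | 63.39 | 0.03 | 0.08 | 2.04 | 15.88 | |
| SNORD17 | 137.17 | 1.86 | 2.23 | 255.81 | 2,368.35 | |
| SNORA74B | 86.41 | 0.37 | 0.56 | 31.75 | 278.76 | |
| CCNJP2 | 64.18 | 0.00 | 0.01 | 0.18 | 1.39 | |
| AC004485.1 | 66.68 | 0.01 | 0.02 | 0.37 | 3.08 | |
| OR52Y1P | 83.32 | 0.01 | 0.02 | 0.47 | 3.03 | |
| LINC00993 | 185.85 | 0.04 | 0.12 | 6.57 | 31.61 | |
| VN1R53P | 69.26 | 0.00 | 0.03 | 0.34 | 1.71 | |
| TUSC7 | 66.01 | 0.00 | 0.01 | 0.16 | 1.19 | |
| Y_RNA | 130.07 | 0.02 | 0.12 | 3.17 | 18.08 | [11] |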

* Fold change = Avg tumor / Avg normal; Avg normal = Mean(Normal); Std normal = Std(Normal); Avg tumor = Mean(Tumor); Std tumor = Std(Tumor). References come from a PubMed search: gene_name + "prostate cancer"[Title/Abstract] + biomarker.
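The selection criterion for this table can be sketched as follows. This is an illustration under stated assumptions, not the repository code: the TPM values and gene names are invented, the helper names `summarize` and `select_candidates` are hypothetical, and skipping genes with a zero mean in normal samples (where the ratio is undefined) is my own choice.

```python
from statistics import mean, stdev

FC_MIN = 50.0          # fold change >= 50, as stated in the text
AVG_NORMAL_MAX = 10.0  # average TPM of control samples <= 10, as stated


def summarize(normal_tpm, tumor_tpm):
    """Compute the per-gene columns of the table from TPM values."""
    avg_normal = mean(normal_tpm)
    avg_tumor = mean(tumor_tpm)
    return {
        "avg_normal": avg_normal,
        "std_normal": stdev(normal_tpm),
        "avg_tumor": avg_tumor,
        "std_tumor": stdev(tumor_tpm),
        # Assumption: the ratio is left undefined when the normal mean is zero.
        "fold_change": avg_tumor / avg_normal if avg_normal > 0 else None,
    }


def select_candidates(tpm_by_gene):
    """Return genes with FC >= FC_MIN and average normal TPM <= AVG_NORMAL_MAX."""
    selected = []
    for gene, (normal_tpm, tumor_tpm) in tpm_by_gene.items():
        stats = summarize(normal_tpm, tumor_tpm)
        fold_change = stats["fold_change"]
        if fold_change is None:
            continue
        if fold_change >= FC_MIN and stats["avg_normal"] <= AVG_NORMAL_MAX:
            selected.append(gene)
    return selected


# Invented toy data: gene -> (normal TPM values, tumor TPM values).
toy = {
    "GENE_A": ([1.0, 3.0], [100.0, 140.0]),      # FC 60, low in normal: kept
    "GENE_B": ([20.0, 40.0], [2000.0, 4000.0]),  # FC 100, but normal mean > 10
    "GENE_C": ([1.0, 1.0], [10.0, 30.0]),        # FC 20: below threshold
    "GENE_D": ([0.0, 0.0], [5.0, 7.0]),          # zero normal mean: skipped
}
candidates = select_candidates(toy)
print(candidates)
```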

 

Interpretability studies

Decision tree

A decision tree was constructed with data from all tumor genes. Gene expression values were converted to discrete categories (Low, Medium and High), and these categories were used to fit a decision tree model for the prediction of Tissue Type (Tumor/Normal). From Fig. 3 we can read that the initial distribution of the n=246 samples is 20% Normal and 80% Tumor. Moreover, the rules can be read intuitively: IF the expression of GSTP1 is high or medium and the expression of TRGC1 is high or medium, THEN the model classifies the sample as Tumor. However, if GSTP1 is high or medium and TRGC1 expression is low, the model classifies the sample as Normal. The complete set of rules can be read in Table 3.

Figure 3. Decision tree model built from prostate cancer data for the prediction of Tissue Type (Tumor/Normal). Gene identifiers: ENSG00000084207.18 (GSTP1), ENSG00000211689.7 (TRGC1), ENSG00000124233.12 (SEMG1).
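The text does not state how expression values were binned into Low, Medium and High. The sketch below assumes tertiles of each gene's expression across samples, purely for illustration; the function `make_discretizer` and the toy values are hypothetical.

```python
from statistics import quantiles


def make_discretizer(values):
    """Build a function that bins one gene's expression into three categories.

    Assumption: the cut-offs are the tertiles of the observed values.
    The report does not specify the binning rule that was actually used.
    """
    low_cut, high_cut = quantiles(values, n=3)

    def discretize(x):
        if x <= low_cut:
            return "low"
        if x <= high_cut:
            return "medium"
        return "high"

    return discretize


# Invented expression values for one gene across nine samples.
expression = [1, 2, 3, 4, 5, 6, 7, 8, 9]
discretize = make_discretizer(expression)
categories = [discretize(x) for x in expression]
print(categories)
```

Fitting the cut-offs per gene keeps the categories comparable between genes with very different expression ranges, which is one reason a quantile-based rule is a common choice.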

 

Table 3. Decision tree rules for the prediction of Tissue Type (Tumor/Normal). Leaf nodes are marked with *.

| Rule | Tissue Type prediction | n | loss | Normal | Tumor | leaf node |
|---|---|---|---|---|---|---|
| root | Tumor | 246 | 49 | 0.20 | 0.80 | |
| IF GSTP1 = high or medium THEN | Normal | 38 | 14 | 0.63 | 0.37 | |
| IF GSTP1 = high or medium AND TRGC1 = low THEN | Normal | 27 | 6 | 0.77 | 0.23 | * |
| IF GSTP1 = high or medium AND TRGC1 = high or medium THEN | Tumor | 11 | 3 | 0.27 | 0.73 | * |
| IF GSTP1 = low THEN | Tumor | 208 | 25 | 0.12 | 0.88 | |
| IF GSTP1 = low AND SEMG1 = high or medium THEN | Normal | 12 | 2 | 0.83 | 0.17 | * |
| IF GSTP1 = low AND SEMG1 = low THEN | Tumor | 196 | 15 | 0.08 | 0.92 | * |
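To show how directly such rules translate into code, here is a minimal sketch that encodes the four leaf rules as a function. The function name `predict_tissue_type` is hypothetical, and each leaf is assigned its majority class according to the Normal and Tumor proportions in Table 3 and the description in the text.

```python
LEVELS = ("low", "medium", "high")


def predict_tissue_type(gstp1, trgc1, semg1):
    """Classify a sample from the discretized expression of three genes.

    Each leaf returns the majority class given by the class proportions
    of the corresponding rule.
    """
    for level in (gstp1, trgc1, semg1):
        if level not in LEVELS:
            raise ValueError(f"unknown expression level: {level!r}")

    if gstp1 in ("high", "medium"):
        # Branch on TRGC1: low -> Normal, high or medium -> Tumor.
        return "Normal" if trgc1 == "low" else "Tumor"

    # GSTP1 is low; branch on SEMG1: low -> Tumor, high or medium -> Normal.
    return "Tumor" if semg1 == "low" else "Normal"


print(predict_tissue_type("high", "low", "low"))   # GSTP1 high, TRGC1 low
print(predict_tissue_type("low", "high", "low"))   # GSTP1 low, SEMG1 low
```

Only the genes on the path to a leaf influence the result: TRGC1 is ignored when GSTP1 is low, and SEMG1 is ignored when GSTP1 is high or medium.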

 

The performance of the decision tree was assessed with confusion matrices. First, a model fitted on all data was used to predict the tissue type of the same data [Accuracy: 0.90, 95% CI: (0.85, 0.93), No Information Rate: 0.80, P-Value [Acc > NIR]: 6.12e-05]. Second, a model fitted on a training set (75% randomly selected samples) was used to predict the testing set (the remaining 25% of the data) [Accuracy: 0.97, 95% CI: (0.89, 0.99), No Information Rate: 0.85, P-Value [Acc > NIR]: 0.004] (see Table 4).

Table 4. Confusion matrix from predicted versus actual data. Whole dataset and training versus testing set were assessed.

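| Data set | Prediction | Reference: Normal | Reference: Tumor |
|---|---|---|---|
| Whole data | Normal | 31 | 8 |
| Whole data | Tumor | 18 | 189 |
| Training versus testing sets | Normal | 31 | 8 |
| Training versus testing sets | Tumor | 18 | 189 |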
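The summary statistics can be recomputed from the whole-data counts in Table 4 with a short sketch. Two things here are assumptions rather than statements from the report: the counts are read as rows = prediction and columns = reference, and the P-Value [Acc > NIR] is computed as a one-sided exact binomial test of the number of correct predictions against the no-information rate.

```python
from math import comb

# Whole-data counts from Table 4, keyed as (prediction, reference).
# Reading the flattened counts in this orientation is an assumption.
matrix = {
    ("Normal", "Normal"): 31,
    ("Normal", "Tumor"): 8,
    ("Tumor", "Normal"): 18,
    ("Tumor", "Tumor"): 189,
}

total = sum(matrix.values())
correct = matrix[("Normal", "Normal")] + matrix[("Tumor", "Tumor")]
accuracy = correct / total

# No-information rate: the share of the largest reference class.
reference_totals = {}
for (_, reference), count in matrix.items():
    reference_totals[reference] = reference_totals.get(reference, 0) + count
nir = max(reference_totals.values()) / total

# One-sided exact binomial test: P(X >= correct) with success probability nir.
p_value = sum(
    comb(total, k) * nir**k * (1 - nir) ** (total - k)
    for k in range(correct, total + 1)
)

print(f"n={total} accuracy={accuracy:.3f} NIR={nir:.3f} p={p_value:.2e}")
```

Comparing accuracy against the no-information rate matters here because the classes are imbalanced: always predicting Tumor would already be right for about 80% of the samples.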

Conclusions

From a bioinformatics perspective, the challenge of identifying biomarkers for cancer treatment seems to lie in the construction of cohorts. What matters for individualized treatment, however, is having information that can move the research forward. The differentially expressed genes and selected biomarkers can guide the construction of cohorts for further research. I also showed that decision tree rules are intuitive to read, can be used for biomarker selection and, moreover, assist in the interpretability of results.

I searched PubMed for the genes automatically selected by the decision tree in relation to prostate cancer, using the query (gene_name + "prostate cancer"[Title/Abstract] + biomarker) AND (("2026"[Date - Publication] : "2026"[Date - Publication])), and found that the GSTP1 gene has publications implicating it in prostate cancer in 2026 [11], [12], [13].

Declarations

Ethics approval and consent to participate

This research used publicly available, non-identifiable human transcriptome data already published on the GDC portal (https://portal.gdc.cancer.gov/), accessed on 2026-04-04.

Declaration of interest statement

The author declares he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. 

Consent for publication

I hereby provide consent for the publication of the manuscript “A decision tree model for intuitive interpretation of transcriptome analysis of prostate cancer”, including any accompanying images or data contained within the manuscript. 

Availability of data and materials 

The data that support the findings of this study are openly available in the GDC portal. A GitHub repository with data and scripts to reproduce the results is available at: https://github.com/datasciencebioinformatics/BiomarkerIdentification_ProstateCancer/

Authors' contributions

Felipe Leal Valentim (FLV) conceived and implemented the project.

Funding declaration

FLV's postdoctoral position was funded by FUSP, but this work was performed independently.

References

[1] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.


[2] Erho N, Crisan A, Vergara IA, Mitra AP, Ghadessi M, Buerki C, Bergstralh EJ, Kollmeyer T, Fink S, Haddad Z, Zimmermann B, Sierocinski T, Ballman KV, Triche TJ, Black PC, Karnes RJ, Klee G, Davicioni E, Jenkins RB. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS One. 2013 Jun 24;8(6):e66855. doi: 10.1371/journal.pone.0066855. PMID: 23826159; PMCID: PMC3691249.


[3] Knezevic D, Goddard AD, Natraj N, Cherbavaz DB, Clark-Langone KM, Snable J, Watson D, Falzarano SM, Magi-Galluzzi C, Klein EA, Quale C. Analytical validation of the Oncotype DX prostate cancer assay - a clinical RT-PCR assay optimized for prostate needle biopsies. BMC Genomics. 2013 Oct 8;14:690. doi: 10.1186/1471-2164-14-690. PMID: 24103217; PMCID: PMC4007703.


[4] Kuhl V, Clegg W, Meek S, Lenz L, Flake DD 2nd, Ronan T, Kornilov M, Horsch D, Scheer M, Farber D, Zalaznick H, Cussenot O, Compérat E, Cancel-Tassin G, Wild PJ, Chun FK, Mandel P, Moinfar F, Cohen T, Delee S, Kronenwett R, Doedt J. Development and validation of a cell cycle progression signature for decentralized testing of men with prostate cancer. Biomark Med. 2022 Apr;16(6):449-459. doi: 10.2217/bmm-2021-0479. Epub 2022 Mar 24. PMID: 35321552.


[5] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.


[6] Roth JA, Ramsey SD, Carlson JJ. Cost-Effectiveness of a Biopsy-Based 8-Protein Prostate Cancer Prognostic Assay to Optimize Treatment Decision Making in Gleason 3 + 3 and 3 + 4 Early Stage Prostate Cancer. Oncologist. 2015 Dec;20(12):1355-64. doi: 10.1634/theoncologist.2015-0214. Epub 2015 Oct 19. PMID: 26482553; PMCID: PMC4679086.


[7] Hamed NW, Elbeljihy HS, Hussin SA, Fouda RM, Oy EK, W Magar R. From prostate specific antigen to genomic signatures: Advances in biomarkers for prostate cancer diagnosis and prognosis. Transl Oncol. 2026 Apr;66:102719. doi: 10.1016/j.tranon.2026.102719. Epub 2026 Feb 27. PMID: 41762538; PMCID: PMC12963913.


[8] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281; PMCID: PMC4302049.  


[9] Trummer O, Langsenlehner U, Krenn-Pilko S, Pieber TR, Obermayer-Pietsch B, Gerger A, Renner W, Langsenlehner T. Vitamin D and prostate cancer prognosis: a Mendelian randomization study. World J Urol. 2016 Apr;34(4):607-11. doi: 10.1007/s00345-015-1646-9. Epub 2015 Jul 25. PMID: 26209090. 


[10] Sanders I, Holdenrieder S, Walgenbach-Brünagel G, von Ruecker A, Kristiansen G, Müller SC, Ellinger J. Evaluation of reference genes for the analysis of serum miRNA in patients with prostate cancer, bladder cancer and renal cell carcinoma. Int J Urol. 2012 Nov;19(11):1017-25. doi: 10.1111/j.1442-2042.2012.03082.x. Epub 2012 Jul 12. PMID: 22788411. 


[11] Schut IC, Waterfall PM, Ross M, O'Sullivan C, Miller WR, Habib FK, Bayne CW. MUC1 expression, splice variant and short form transcription (MUC1/Z, MUC1/Y) in prostate cell lines and tissue. BJU Int. 2003 Feb;91(3):278-83. doi: 10.1046/j.1464-410x.2003.03062.x. PMID: 12581019.


[12] De Vrieze M, Zhang N, Seibold P, Gerhäuser C, Albers P, Krilaviciute A. Clinical validity of circulating tumor DNA as a diagnostic biomarker for prostate cancer: a systematic review. Cancer Epidemiol Biomarkers Prev. 2026 Mar 13. doi: 10.1158/1055-9965.EPI-25-1820. Epub ahead of print. PMID: 41824537.


[13] Huang Y, Mao J, Li X. Emerging biomarkers in prostate cancer diagnosis and treatment: Insights into genetic, RNA and metabolic markers (Review). Int J Oncol. 2026 Feb;68(2):15. doi: 10.3892/ijo.2025.5828. Epub 2025 Dec 5. PMID: 41347816; PMCID: PMC12716904.


[14] Ren Z, Liu X, Zhang J, Song M, Yang Q, Li C, Liu D. Network pharmacology research integrating LC-MS/MS, machine learning, molecular docking, and dynamics simulation: key biomarkers and potential mechanisms of Phellinus igniarius against prostate cancer. In Silico Pharmacol. 2026 Feb 17;14(1):65. doi: 10.1007/s40203-025-00511-5. PMID: 41717432; PMCID: PMC12913839.
