Colorectal cancer risk stratification on histological slides based on survival curves predicted by deep learning

Published in Computational Sciences

Current risk stratification is imperfect 

Colorectal cancer (CRC) is still one of the most frequently occurring cancer types. Ideally, every CRC patient would receive exactly the right amount of therapy: enough to achieve a high cure rate, but not so much that it causes unnecessary side effects. Unfortunately, current risk stratification methods are far from perfect. While they provide a basic level of risk assessment that can be used for therapy selection, they often overlook the nuanced nature of the disease. Some tumors have the capacity to spread even in the early stages, necessitating aggressive treatments such as chemotherapy. Conversely, some tumors exhibit limited spreading potential, and patients may be able to forgo adjuvant chemotherapy even when diagnosed at later stages. This calls for more precise and individualized risk stratification methods that use specific biomarkers.

The potential of deep learning

Deep learning (DL) has emerged as a promising tool for discovering additional risk factors in histological images. Several studies have already explored DL's potential for estimating the risk of CRC patients from whole-slide images, with results suggesting that this is feasible. While these studies share a common overall approach, they differ in how the individual steps are implemented. Most predict a risk score that serves as the basis for stratifying risk groups. However, these studies often fall short in assessing how well their approaches generalize to new, independent cohorts. This makes it hard to conclude which approach is the most promising for accurate and robust prognosis prediction, a prerequisite for successful clinical implementation, since risk stratification must work in every clinic.

A novel approach: predicting survival curves

Instead of predicting a single risk score, we extended previous studies by predicting a five-year survival curve (Figure 1). This curve, we believed, could convey more information about an individual's likely disease course than a single risk score, potentially refining CRC risk stratification within the clinically high-risk (CHR) and low-risk (CLR) subgroups that are currently used in the clinic.

Figure 1: Image analysis pipeline. The pipeline derives an image-based mortality score from H&E slides via deep learning survival curve prediction. In step I, after an H&E slide is segmented into image tiles, a subtyper assigns each tile to one of nine colorectal tissue type classes, and only tiles of the tissue type(s) of interest are analyzed further. In step II, the image tiles are reduced to simplified tile features by a pre-trained feature extractor. In step III, all tile features are aggregated to slide features by an attention mechanism. In step IV, the slide features are used to predict the patient’s survival curve. The mortality score then aggregates the survival curve into a single value.
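To make step III more concrete, here is a minimal sketch of attention-based pooling in plain Python. It is an illustration under assumptions, not the implementation used in the study: the function name `attention_pool` is hypothetical, and the attention scores are passed in directly, whereas a trained model would compute them from the tile features with a small learned network.

```python
import math


def attention_pool(tile_features, attention_logits):
    """Aggregate tile feature vectors into one slide feature vector.

    tile_features: list of equal-length feature vectors, one per tile.
    attention_logits: one unnormalized attention score per tile
    (assumed given here; a real model would learn to predict them).
    """
    if not tile_features or len(tile_features) != len(attention_logits):
        raise ValueError("need at least one tile and one logit per tile")
    # Softmax over tiles, shifted by the maximum for numerical stability.
    top = max(attention_logits)
    exps = [math.exp(a - top) for a in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The slide feature is the attention-weighted sum of the tile features.
    dim = len(tile_features[0])
    slide = [
        sum(w * feats[d] for w, feats in zip(weights, tile_features))
        for d in range(dim)
    ]
    return slide, weights


# Three toy tiles with two features each and equal attention scores.
tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slide_features, weights = attention_pool(tiles, [0.0, 0.0, 0.0])
```

With equal scores every tile contributes equally, so the slide feature is simply the mean of the tile features. When one tile receives a much higher score, the slide feature moves toward that tile, which is how the mechanism can focus on a few informative regions of a slide.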

The strengths of our study 

Our study leveraged four cohorts, enabling comprehensive testing of our approach. We employed a modular pipeline that builds on those constructed in previous studies. This allowed an in-depth analysis of the impact of variations within the pipeline, including the input tissue (e.g. tumor vs. non-tumorous tissue) and the feature extractor. We compared our strategy directly with the binary approach that predicts a single risk score.

Take-home message

Our survival curves predicted patient survival about as well as the binary approach, although calibration was a challenge (Figure 2). In general, the separation between patients with better and worse prognosis was clearer in the CHR subcohorts and in the larger cohorts (DACHS, MCO). These findings were largely independent of the variations we tested and, moreover, independent of whether we based the risk refinement on our survival curves or on a single risk score.

Figure 2: Comparison of the survival curve (curve) and binary approach (binary). a) C-indices on the four CHR test sets, b) IBS/BS on the four CHR test sets, c) C-indices on the four CLR test sets, d) IBS/BS on the four CLR test sets, for all investigated feature extractors and the ensembles. Tumor tissue was used as input tissue in all cases. Note that for the single risk score prediction (binary), time-independent C-indices and the Brier score were calculated. Arrows indicate whether high or low values are better. The dashed line in each subfigure represents random performance. For the C-index, a value above 0.5 is better than random; for the IBS/BS, a value below 0.25 is better than random. 95% confidence intervals are shown. BS Brier score, CHR clinical high risk, CI confidence interval, CLR clinical low risk, IBS integrated Brier score.
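The random baselines of 0.5 and 0.25 in Figure 2 can be reproduced with a small sketch. The code below is a simplified illustration, not the evaluation code of the study: it implements Harrell's C-index for right-censored data and a plain Brier score without the censoring weights that an integrated Brier score normally uses. The function names and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    A pair is comparable when the patient with the shorter follow-up time
    had an observed event. A higher risk should go with a shorter survival.
    Ties in risk count as half concordant; ties in time are skipped.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def brier_score(outcomes, predicted_probs):
    """Mean squared difference between binary outcomes and predictions."""
    squared = [(p - y) ** 2 for y, p in zip(outcomes, predicted_probs)]
    return sum(squared) / len(squared)


# Toy cohort: follow-up in months, event flag (1 = death observed).
times = [5, 12, 30, 48, 60]
events = [1, 1, 0, 1, 0]
well_ranked = [0.9, 0.7, 0.5, 0.3, 0.1]  # risk falls as survival grows
uninformative = [0.5] * 5                # same prediction for everyone

c_good = concordance_index(times, events, well_ranked)
c_flat = concordance_index(times, events, uninformative)
bs_flat = brier_score(events, uninformative)
```

On this toy cohort the well-ranked risks reach a C-index of 1.0, while the uninformative predictor, which assigns 0.5 to every patient, yields a C-index of 0.5 and a Brier score of 0.25. These are the values marked by the dashed lines.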

Finding 1: complexity vs. benefit

Our central motivation for predicting survival curves was their potential for capturing time-dependent hazard functions. A single risk score treats the hazard of dying within the first five years as constant, while our survival-curve-based approach was trained to predict monthly hazards. Our results, however, showed that our approach learned a roughly constant hazard within the first five years. The greater complexity of the proposed method was, therefore, of no direct benefit for this particular task. This could be different for other tasks, however.
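The relationship between monthly hazards and a survival curve can be sketched in a few lines. This is a generic discrete-time survival calculation under assumptions, not the network or loss used in the study. In particular, `mortality_score` is a hypothetical aggregation (one minus the mean survival probability) chosen only for illustration, and the hazard values are made up.

```python
def survival_curve(monthly_hazards):
    """Turn discrete monthly hazards into survival probabilities.

    monthly_hazards[k] is the probability of dying in month k + 1, given
    survival up to the start of that month. The survival probability after
    month t is the running product of (1 - hazard).
    """
    curve, alive = [], 1.0
    for hazard in monthly_hazards:
        if not 0.0 <= hazard <= 1.0:
            raise ValueError("hazards must lie in [0, 1]")
        alive *= 1.0 - hazard
        curve.append(alive)
    return curve


def mortality_score(curve):
    """Hypothetical aggregation: one minus the mean survival probability."""
    return 1.0 - sum(curve) / len(curve)


# Five years in monthly steps: a constant hazard versus a front-loaded one.
constant = survival_curve([0.01] * 60)
front_loaded = survival_curve([0.03] * 20 + [0.0] * 40)
```

Both toy curves end at a similar five-year survival probability, but the front-loaded curve drops earlier and therefore gets a higher mortality score under this assumed aggregation. A model that only ever predicts a constant hazard cannot represent the second shape, which is why time-dependent hazards were the motivation for the approach.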

Finding 2: generalization challenges

Both the survival curve approach and the binary approach exhibited limited generalization. We think that this is a valuable finding, especially with regard to the feature extractor. We included feature extractors similar to those used in previous studies, as well as models trained on large amounts of histological data in a self-supervised manner. The latter are usually considered to generalize well even though, or precisely because, they are not fine-tuned. They did not generalize significantly better than, for instance, a ResNet18 pre-trained on ImageNet. We also observed that many models performed worse on specific cohorts even if they worked very well on others. Performance strongly depended on the individual cohort. Largely independent of the input tissue, feature extractor or survival network we employed, the models generally performed well on two cohorts and worse on the other two, pointing to systematic differences between the cohorts.

Finding 3: ensemble models for robustness

Prompted by these observations, we included ensembles of models with different feature extractors in our analysis. They showed comparatively robust performance across cohorts and may therefore be a way to compensate for outliers in individual model/cohort combinations.
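One simple way to build such an ensemble is to average the survival curves predicted by the individual models. The sketch below assumes this month-by-month averaging. How the ensembles in the study were actually combined is not described here, so the function `ensemble_curve` and the numbers are purely illustrative.

```python
def ensemble_curve(curves):
    """Average survival curves from several models, time point by time point."""
    if not curves:
        raise ValueError("need at least one curve")
    length = len(curves[0])
    if any(len(curve) != length for curve in curves):
        raise ValueError("all curves must cover the same time points")
    return [sum(curve[t] for curve in curves) / len(curves) for t in range(length)]


# Hypothetical predictions for one patient from three feature extractors.
model_a = [0.98, 0.95, 0.90]
model_b = [0.97, 0.93, 0.88]
outlier = [0.80, 0.60, 0.40]  # one model disagrees strongly on this cohort

averaged = ensemble_curve([model_a, model_b, outlier])
```

The averaged curve lies between the individual predictions, so a single model that behaves badly on one cohort has a damped effect on the final estimate.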

Finding 4: comparison and combination with clinical data

Compared to the image models, Cox proportional hazards models fitted on known clinical risk factors resulted in better-generalizing biomarkers with broadly the same performance. A combination of both did not result in significantly better performance. These results suggest that the relevant prognostic information in the clinical data and in the histological image features may overlap, which should be investigated in more detail.

Finding 5: putting our study in context

Although DL-based image analysis approaches in CRC survival studies have been diverse, our results seem to converge, suggesting that survival is influenced not only by tumor biology but also by various factors unrelated to tumor tissue morphology.


Our results suggest that DL-based image analysis of histopathological slides, combined with the prediction of patient survival curves, can further stratify CRC patient prognosis within the currently used risk groups, to a similar extent as a standard binary risk classification. However, unlike a clinical classifier, none of the investigated DL image analysis models or ensembles performed equally well on all cohorts. Further attempts must be made to improve model generalization. In our experience, such studies must be conducted in an interdisciplinary working group, even though this poses its own challenges, since they profit greatly from the complementary expertise of data scientists and clinicians.
