Current risk stratification is imperfect
Colorectal cancer (CRC) remains one of the most frequently occurring cancer types. Ideally, every CRC patient would receive exactly the right amount of therapy: enough to achieve a high cure rate while avoiding unnecessary side effects. Unfortunately, current risk stratification methods are far from perfect. While they provide a basic level of risk assessment that can be used for therapy selection, they often overlook the nuanced nature of the disease. Some tumors have the capacity to spread even in early stages, necessitating aggressive treatment such as chemotherapy. Conversely, some tumors exhibit limited spreading potential, and patients may be able to forgo adjuvant chemotherapy even when diagnosed at later stages. This calls for more precise and individualized risk stratification methods based on specific biomarkers.
The potential of deep learning
Deep learning (DL) has emerged as a promising tool for discovering additional risk factors in histological images. Several studies have already explored DL's potential for estimating the risk of CRC patients from whole-slide images, suggesting that this is feasible. These studies share a common overall approach but differ in its details. Most predict a risk score that serves as the basis for stratifying patients into risk groups. However, they often fall short in assessing how well their approaches generalize to new, independent cohorts. This makes it hard to conclude which approach is most promising for accurate and robust prognosis prediction, a prerequisite for successful clinical implementation, since risk stratification must work in every clinic.
A novel approach: predicting survival curves
Instead of predicting a single risk score, we extended previous studies by predicting a five-year survival curve (Figure 1). We hypothesized that this curve could convey more information about an individual's likely disease course than a single risk score, potentially refining CRC risk stratification within the clinically high-risk (CHR) and clinically low-risk (CLR) subgroups that are currently used in the clinic.
The strengths of our study
Our study leveraged four cohorts, enabling comprehensive testing of our approach. We employed a modular pipeline that builds on those constructed in previous studies. This allowed an in-depth analysis of the impact of variations within the pipeline, including input tissue (e.g. tumor vs non-tumorous tissue) and feature extractors. We compared our strategy directly with the binary approach that predicts a single risk score.
Our survival curves predicted patient survival about as well as the binary approach, although calibration was a challenge (Figure 2). In general, the difference between patients with good and poor prognosis was clearer in the CHR subcohorts and in the larger cohorts (DACHS, MCO). These findings were largely independent of the variations we tested and, moreover, of whether we based the risk refinement on our survival curves or on a single risk score.
Finding 1: complexity vs. benefit
Our central motivation for predicting survival curves was their potential for capturing time-dependent hazard functions. A single risk score treats the hazard of dying within the first five years as constant, while our survival-curve-based approach was trained to predict monthly hazards. Our results, however, showed that our approach learned an approximately constant hazard within the first five years. The greater complexity of the proposed method was, therefore, of no direct benefit for this particular task. This could be different for other tasks, however.
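To make the relationship between monthly hazards and a survival curve concrete, the following minimal sketch may help. It is an illustration only and not the code used in our study: it assumes the standard discrete-time formulation in which each monthly hazard is the probability of an event in that month given survival up to it, and the function names (`survival_curve`, `constant_hazard_equivalent`) and the example hazard values are hypothetical.

```python
def survival_curve(monthly_hazards):
    """Convert monthly hazards into survival probabilities.

    Assumption: each hazard is the conditional probability of an event in
    that month, given survival up to the start of the month. The curve value
    after month t is then the product of (1 - hazard) over months 1..t.
    """
    curve = []
    surviving = 1.0
    for hazard in monthly_hazards:
        if not 0.0 <= hazard <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        surviving *= 1.0 - hazard
        curve.append(surviving)
    return curve


def constant_hazard_equivalent(curve):
    """Constant monthly hazard that yields the same final survival probability."""
    months = len(curve)
    return 1.0 - curve[-1] ** (1.0 / months)


# Illustrative values only: a constant hazard over 60 months (five years) ...
constant_curve = survival_curve([0.01] * 60)

# ... versus a time-dependent hazard that is higher in the first year.
varying_curve = survival_curve([0.02] * 12 + [0.005] * 48)
```

If the predicted hazards are nearly the same in every month, the whole curve is determined by a single number (the value returned by `constant_hazard_equivalent`), which is why a survival curve then carries little information beyond a single risk score.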
Finding 2: generalization challenges
Both the survival curve approach and the binary approach exhibited limited generalization. We consider this a valuable finding, especially with regard to the feature extractor. We included feature extractors similar to those used in previous studies as well as models trained on large amounts of histological data in a self-supervised manner, which are usually considered to generalize well even though, or precisely because, they are not fine-tuned. These did not generalize significantly better than, for instance, a ResNet18 pre-trained on ImageNet. We also observed that many models performed worse on specific cohorts even if they worked very well on others, so performance depends strongly on the individual cohort. Largely independent of the input tissue, feature extractor, or survival network we employed, the models generally performed well on two cohorts and worse on the other two, pointing to systematic differences between the cohorts.
Finding 3: ensemble models for robustness
Prompted by these observations, we included ensembles of models with different feature extractors in our analysis. They showed comparatively robust performance across cohorts and may therefore be a way to compensate for outliers in individual model/cohort combinations.
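One simple way to build such an ensemble is to average the per-patient predictions of the individual models. The sketch below assumes plain averaging of risk scores, which is an assumption made for illustration and not necessarily the combination rule used in our study. The function name `ensemble_risk`, the model names, and the scores are hypothetical.

```python
from statistics import mean


def ensemble_risk(per_model_scores):
    """Average each patient's risk score across models.

    per_model_scores maps a model name to a dict {patient_id: risk_score}.
    Only patients scored by every model are kept, so that each average is
    taken over the same set of models.
    """
    if not per_model_scores:
        raise ValueError("at least one model is required")
    score_dicts = list(per_model_scores.values())
    shared_patients = set(score_dicts[0])
    for scores in score_dicts[1:]:
        shared_patients &= set(scores)
    return {
        patient: mean(scores[patient] for scores in score_dicts)
        for patient in sorted(shared_patients)
    }


# Hypothetical scores from three models with different feature extractors.
scores = {
    "extractor_a": {"p1": 0.2, "p2": 0.8, "p3": 0.5},
    "extractor_b": {"p1": 0.4, "p2": 0.7, "p3": 0.1},
    "extractor_c": {"p1": 0.9, "p2": 0.6},  # p3 was not scored by this model
}
combined = ensemble_risk(scores)
```

In this toy example, `extractor_c` assigns patient `p1` a much higher risk than the other two models, and averaging pulls the combined score back toward the majority, which is the kind of outlier compensation described above.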
Finding 4: comparison and combination with clinical data
Compared to the image models, Cox proportional hazards models fitted on known clinical risk factors yielded better-generalizing biomarkers with broadly the same performance. Combining both did not result in significantly better performance. These results suggest that the relevant prognostic information in the clinical data and in the histological image features may overlap, which should be investigated in more detail.
Finding 5: putting our study in context
Although DL-based image analysis approaches in CRC survival studies have been diverse, our results seem to converge, suggesting that survival is influenced not only by tumor biology but also by various factors unrelated to tumor tissue morphology.
Our results suggest that DL-based image analysis of histopathological slides combined with the prediction of patient survival curves can further stratify CRC patient prognosis within the currently used risk groups, to a similar extent as a standard binary risk classification. However, unlike a clinical classifier, none of the investigated DL image analysis models or ensembles performed equally well on all cohorts. Further efforts are needed to improve model generalization. In our experience, such studies should be conducted in an interdisciplinary working group, even though this poses its own challenges, since they profit greatly from the complementary expertise of data scientists and clinicians.