The last two decades have been a golden age for genomics, particularly in oncology. Foundational tumor sequencing projects, such as The Cancer Genome Atlas, illuminated the molecular drivers of cancer and provided a basis for rationally designed medicines to target those drivers. Molecularly targeted therapies are flourishing and have reduced mortality rates in several lethal cancer types.
However, significantly less research at scale is available correlating these genomic data with clinical data from the electronic health record (EHR). A principal reason for this gap is that EHR data are often stored in free-text notes and siloed datasets, so curating them for research requires significant manual effort. Although several studies have used MSK-IMPACT and similar assays to uncover mechanisms of resistance to therapy and to characterize genomic alterations associated with phenotypes such as progression on specific therapies or metastasis to specific organs, these studies have generally been limited in size by the need for manual curation of clinical data from free-text notes.
To study the relationship between tumor genotypes and EHR-derived phenotypes at scale, for tens of thousands of patients or more, automated methods are needed. Natural language processing (NLP) in oncology has a complicated history, with once-promising platforms such as IBM Watson achieving mixed results at analyzing patient charts. However, we thought a fresh attempt was warranted for two reasons: (1) the application of neural networks to EHR text, especially after the recent success of transformers such as BERT and GPT in general NLP, represented a major improvement in the field, and (2) a recent large investment in a manually curated dataset provided a rich corpus for training and validating NLP models.
We evaluated several NLP methods and found that transformers, particularly those pretrained on clinical text, performed best at annotating key features such as metastatic sites of disease, progression, and prior treatment, with the accuracy of the best methods approaching that of manual curation. We then applied our annotations to records of patients with tumor genomic profiling by MSK-IMPACT. We combined the NLP annotations and genomic data with other, normally siloed data from tumor registry, pharmaceutical prescription, self-reported demographic, and institutional outcome sources to create a clinicogenomic, harmonized, oncologic real-world dataset (MSK-CHORD).
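The harmonization step described above, in which NLP-derived annotations are joined with normally siloed structured sources on a shared patient identifier, can be sketched as follows. This is a minimal illustration: the column names, patient IDs, and values are hypothetical stand-ins, not the real MSK-CHORD schema.

```python
import pandas as pd

# Hypothetical stand-ins for three normally siloed sources.
# All column names and values are illustrative only.
nlp_annotations = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "metastatic_site": ["liver", "none", "bone"],
    "progression": [True, False, True],
})
genomic = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "mutations": [["TP53"], ["KRAS", "SETD2"], ["EGFR"]],
})
registry = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3"],
    "cancer_type": ["lung", "colorectal", "prostate"],
})

# Harmonize on the shared patient identifier: one row per patient,
# with annotation, genomic, and registry columns side by side.
chord_like = (
    nlp_annotations
    .merge(genomic, on="patient_id", how="inner")
    .merge(registry, on="patient_id", how="inner")
)
print(chord_like.shape)  # (3, 5): three patients, five combined columns
```

In practice, linking records across institutional systems is far harder than a pair of merges suggests (identifier reconciliation, deduplication, privacy constraints), which is part of what made the engineering effort described below so substantial.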
It's easy to write, in retrospect, that we made MSK-CHORD, but this was a massive team effort that took several years without certainty of success. Chris Fong, the lead engineer on our clinical data mining team, worked with IT, the privacy office, and clinician collaborators to create the data pipelines from our EHR to a secure high-performance computing environment and then, on the data-delivery end, with members of the cBioPortal team to make the data easily accessible and visualizable. Karl Pichotta, our lead NLP scientist, created sophisticated systems to test multiple NLP methods on the same data in parallel and troubleshot the many problems we faced along the way. Thinh Tran, a graduate student (who has since successfully completed her PhD), and Anisha Luthra, a data scientist, built many of our initial models and benchmarked their performance. Many other members of the Schultz lab helped create pipelines and models, use the data to build models of patient outcomes, or make the data more accessible.
Beyond our core group, an army of curators, organized and funded by a multimillion-dollar investment from the AACR Project GENIE Biopharma Collaborative, produced the curated data used for NLP training and validation. Clinical, biostatistical, and scientific colleagues across MSK helped us design downstream analyses using MSK-CHORD. Colleagues in academia and industry outside MSK also played a major role. When we found a promising association between SETD2 mutations and better immunotherapy response, Kenneth Kehl, a leader in the field of clinical NLP at Dana-Farber Cancer Institute (DFCI), helped us corroborate our results using DFCI data. Collaborators at Caris Life Sciences did the same with their massive real-world dataset.
Even more broadly, all this research was only possible because oncologists and patients at MSK and around the world contributed tissue for sequencing and data for analysis, in the hope of better treatments today and the possibility of even better treatments tomorrow. It was possible because of private and public funding, government regulations driving the digitization of health records, and growing acceptance that real-world data are a valuable substrate for making scientific discoveries.
With MSK-CHORD, we made several discoveries not possible with smaller datasets. We also showed, using multiple lines of analysis, the importance of multimodal data for predicting overall survival and the impact of genomics on metastasis. These are covered in detail in the paper.
However, these findings are just the beginning of what is possible. We have made a 24,950-patient instance of MSK-CHORD publicly available, and an instance of more than 90,000 patients, updated daily, is available within MSK. General medical research cohorts such as All of Us and the UK Biobank have already transformed our understanding of disease using real-world data. It is our hope that as AI annotation for cancer phenotypes becomes more common, MSK-CHORD will be just one of many oncologic datasets that can be used to study the nuances of cancer in diverse populations and treatment settings.
So here’s to the next two decades—a golden age of multimodal, AI-powered real-world data science to more fully understand how cancer afflicts its host, and strategies to overcome it.