How large language models can help scale citizen science in chronic disease research

We have observed a consistent desire among persons with chronic diseases to tell their story and be heard. Engaging them more in health research may demand new ways for capturing and analyzing such free-text narratives. Large language models can streamline the analysis and lead to novel insights.

Living with a chronic disease

Multiple Sclerosis (MS) is a chronic illness that affects people at a relatively young age. The majority of those affected are women. With a typical onset age between 20 and 40 years, MS impacts daily life beyond health and healthcare. Living with MS means adjusting nearly all aspects of life to the disease, including work, partnership, leisure activities, parenthood, and living situation. Therefore, living with and managing a chronic illness such as MS requires a long-term perspective and a comprehensive view of different aspects of life.

Providing such a comprehensive perspective on living with a chronic illness is one of the key aims of the Swiss Multiple Sclerosis Registry (SMSR).1 The SMSR was launched in 2016 through the initiative of persons with MS and the Swiss Multiple Sclerosis Society. The SMSR is a nationwide, citizen science project that includes adult persons with MS who reside or receive care in Switzerland. This study intends to examine life with MS and to conduct research together with those who live with this chronic disease. So far, 2,800 persons with MS out of an estimated 18,000 who live in Switzerland have contributed their data. 

Giving people with MS a voice 

In line with the SMSR’s citizen science strategy, our team of researchers regularly engages with persons with MS, who contribute to the science by proposing topics, by sharing their expertise and experiences, and by co-shaping the research strategy. Throughout our interactions, we have seen a consistent desire from participants to share their story and to be heard. In fact, many participants cited these two factors as strong motivators for contributing to the SMSR, despite the effort involved. We ask our participants to complete fairly extensive surveys twice a year. 

Our team took the participants’ desire to "tell their own story" and "be heard" to heart and began exploring how data collection could better meet these wishes. Specifically, we started experimenting with different approaches for collecting free-text information about living with the disease. In an initial study, we analyzed free-text diaries collected during a dedicated "A week in the lives of people with MS" campaign.2 The success of this diary campaign and the team’s growing confidence and expertise in analyzing such free-text data provided further motivation to leverage open-ended questions in surveys about the experiences of persons with MS during the first Covid-19 lockdown in Switzerland3 and about important life events ("My life with MS").4

The "My life with MS" survey4 marked a particularly important milestone in the evolution of the SMSR. This survey was co-created in two workshops involving 30 persons with MS. In this study, participants could select up to six important events since their MS onset. They could further describe these events and their impact on their lives through a series of guided, open-ended questions. The event types were broadly defined to complement the predominantly clinical view of the disease course that focuses on diagnosis, treatment, and disease progression.

Building on life course descriptions of more than 1,000 participants, our analysis identified other daily-life topics in addition to the clinical milestones, namely work, partnership, and childbirth. In this sense, the "My life with MS" study not only established proof-of-principle for large-scale, citizen science-based research, it also generated important new insights about the lived experience of persons with chronic conditions. Yet, the pre-processing of these free-text descriptions was challenging, mainly because of the study participants’ extensive use of abbreviations and their tendency to write short, note-like sentences to describe their life events.   

The present study: theories of persons with MS about the causes of their disease

The positive feedback from the "My life with MS" participants provided the necessary motivation to further refine and evaluate the methodological concept of collecting and analyzing free-text data at scale. The present research article, published in the Nature Portfolio journal Communications Medicine5, approached a difficult topic that was also inspired by discussions with SMSR participants: How do persons with MS explain the onset of their own disease? Despite recent scientific progress, the cause of MS remains elusive. From the perspective of persons with MS, this leaves much room for speculation. The SMSR therefore devised a new "Theories of MS" study that included open-ended questions about people's theories on the cause of MS and on their individual risk factors.

The launch of this "Theories of MS" study coincided with the increasing power and accessibility of large language models. An excellent example of this accessibility is the popular "Hugging Face" machine learning platform for sharing models, data, and applications. These readily available, powerful large language models have recently been incorporated into topic modeling, an established natural language processing technique, to provide more interpretable results and to adapt to different types and sizes of text data. For the analysis, we used the novel Python library BERTopic.6 By employing several large language models under the hood, BERTopic first converts text into numerical representations. Next, it uses a clustering technique to group similar pieces of text together, facilitating the identification of underlying topics.
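To make the "convert text into numbers, then group similar text" idea concrete, here is a minimal sketch using only the Python standard library. It is not the BERTopic implementation and not the study's analysis code: simple word counts stand in for language-model representations, a greedy similarity threshold stands in for the clustering step, and the function names, the threshold value, and the toy answers are all invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Turn a text into a word-count vector (a crude stand-in for an LLM embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Greedy grouping: join the most similar existing group, else start a new one.

    The threshold of 0.3 is an arbitrary assumption for this toy example.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [text indices]}
    for i, text in enumerate(texts):
        vec = embed(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # add the new text's word counts
            best["members"].append(i)
    return clusters


def top_words(cluster_, n=3):
    """Most frequent words in a group, as a rough topic label."""
    return [word for word, _ in cluster_["centroid"].most_common(n)]


# Invented toy answers, not SMSR data.
toy_answers = [
    "stress at work",
    "a lot of stress at work and at home",
    "my mother has MS",
    "my mother and my aunt have MS",
]
groups = cluster(toy_answers)
for g in groups:
    print(g["members"], top_words(g))
```

With real survey responses, word counts are far too coarse, which is one reason why language-model representations make topic modeling more useful for short, note-like answers.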

In our "Theories of MS" study, we analyzed free-text responses from 486 persons with MS about what they thought had caused their MS. As expected, persons with MS mentioned numerous personal theories about the origin of their disease. These theories were further classified into four high-level categories: physical health, mental health, risk factors established in the scientific literature, and fate/coincidence.

Two aspects stood out from the study results. First, there was a large diversity of risk factors mentioned by persons with MS, often in combination. This finding reflects the lingering uncertainty about the causal pathways for MS onset. To date, the most convincing scientific support exists for two specific risk factors: infection with the Epstein-Barr virus (especially during later adolescence, when an infection often manifests itself as mononucleosis)7 and familial aggregation8, i.e., a genetic contribution. These two established risk factors were mentioned by 10.9% (Epstein-Barr virus) and 27.4% (familial aggregation) of participants.

Second, our study found that the two most frequently mentioned theories concerned mental health topics, namely mental distress (31.5%) and stress (29.8%). The frequent mention of mental health aspects as potential MS risk factors, often in combination with other topics, was quite surprising because the scientific evidence for a mental health influence on MS onset is rather limited. This finding clearly emphasizes the importance of communication between healthcare professionals and persons with MS on the pathogenesis of MS, the scientific evidence base, and the importance of having an open dialogue about mental health.
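Because participants often named several theories in one answer, percentages like those above are shares of respondents and do not need to add up to 100. The following minimal sketch shows how such shares can be computed once each answer has been assigned to one or more topics. The function name and the toy assignments are invented for illustration and are not taken from the study data.

```python
from collections import Counter


def topic_prevalence(assignments):
    """Share of respondents (in %) who mention each topic at least once.

    `assignments` maps a respondent id to the set of topics found in that
    person's answer. One answer can carry several topics, so the shares
    may sum to more than 100.
    """
    n = len(assignments)
    if n == 0:
        return {}
    counts = Counter(
        topic for topics in assignments.values() for topic in set(topics)
    )
    return {topic: round(100 * k / n, 1) for topic, k in counts.most_common()}


# Invented toy assignments, not SMSR data.
toy_assignments = {
    "p1": {"stress", "familial aggregation"},
    "p2": {"stress"},
    "p3": {"Epstein-Barr virus"},
    "p4": {"stress", "Epstein-Barr virus"},
}
prevalence = topic_prevalence(toy_assignments)
print(prevalence)
```

For orientation, with 486 respondents a share of 31.5% corresponds to roughly 153 people.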

Convergence of quantitative and qualitative research: a vision for future health studies

The research examples from the SMSR demonstrate how free-text-based survey studies can build a bridge between qualitative research aimed at describing the diversity of topics ("what exists") and quantitative analyses ("and how often"). When study participants are guided by specific questions, natural language processing tools can help to extract meaningful patterns and provide indications of their frequency. However, pre-processing and analysis are still quite burdensome. All of the studies mentioned involved significant effort in cleaning and preparing the free-text information, as extensively described in the respective methods sections.

At the same time, typing lengthy survey responses can be very tiring for participants, especially when they use mobile devices or have difficulty with fine motor skills. This often leads to very short descriptions or the extensive use of abbreviations, which in turn increases the burden of data pre-processing. However, recent technological developments offer a way to mitigate these challenges. The emergence of publicly available large language models for speech-to-text transcription (most notably OpenAI's Whisper models) has made it easier and more convenient for participants to complete open-ended questions. Pilot research by our group has also revealed that transcribed text can be analyzed more efficiently because spelling errors are automatically corrected and participants refrain from using abbreviations in their responses. State-of-the-art large language models are also increasingly adept at transcribing various languages and medical terminology, although manual checks and post-transcription corrections may still be necessary, for example if the speaker has a strong accent or uses very rare expressions.
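As an illustration of how transcription and manual checks could fit together, here is a minimal sketch that assumes the open-source `openai-whisper` Python package. It is not our pilot pipeline: the model size "base", the review threshold of -1.0, the helper names, and the example segments are assumptions made for illustration. The sketch transcribes one spoken answer and then lists the segments whose average token log-probability is low, so that a person can check them.

```python
def transcribe_answer(audio_path, model_name="base"):
    """Transcribe one spoken answer with the `openai-whisper` package.

    The package is imported lazily so that the review helper below can be
    used, and tested, without it being installed.
    """
    import whisper

    model = whisper.load_model(model_name)
    # The result is a dict that includes the full "text" and a list of "segments".
    return model.transcribe(audio_path)


def segments_to_review(result, min_avg_logprob=-1.0):
    """Return the segments that the model was comparatively unsure about.

    Each Whisper segment carries an "avg_logprob" value. The cut-off of -1.0
    is an assumption for this sketch and would need tuning on real data.
    """
    return [
        {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
        for s in result.get("segments", [])
        if s["avg_logprob"] < min_avg_logprob
    ]


# Hand-written stand-in for a transcription result, not real participant data.
example_result = {
    "text": " I was diagnosed in spring. Then I started a new therapy.",
    "segments": [
        {"start": 0.0, "end": 2.1, "text": " I was diagnosed in spring.", "avg_logprob": -0.2},
        {"start": 2.1, "end": 4.8, "text": " Then I started a new therapy.", "avg_logprob": -1.4},
    ],
}
flagged = segments_to_review(example_result)
print(flagged)
```

In practice, a call such as `transcribe_answer("answer_001.wav")` (a hypothetical file name) would supply the result dictionary, and only the flagged segments would need to be listened to again.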

Conclusion

To conclude, recent technological innovations point a way forward for survey-based health studies: combining traditional surveys with structured open-ended questions and speech-to-text transcription. Some challenges remain, including data privacy and security, a possible digital divide, and the substantial effort still required for data pre-processing. The risk factor study and the previous open-text investigations conducted in the SMSR clearly point to the scientific potential of collecting free-text data at scale.

References

  1. Puhan, M. A. et al. A digitally facilitated citizen-science driven approach accelerates participant recruitment and increases study population diversity. Swiss Med. Wkly https://doi.org/10.4414/smw.2018.14623 (2018).
  2. Sieber, C. et al. Electronic Health Diary Campaigns to Complement Longitudinal Assessments in Persons With Multiple Sclerosis: Nested Observational Study. JMIR Mhealth Uhealth 10, e38709 (2022).
  3. Chiavi, D. et al. The Real-World Experiences of Persons With Multiple Sclerosis During the First COVID-19 Lockdown: Application of Natural Language Processing. JMIR Med Inform 10, e37945 (2022).
  4. Haag, C. et al. Blending citizen science with natural language processing and machine learning: Understanding the experience of living with multiple sclerosis. PLOS Digit Health 2, e0000305 (2023).
  5. Haag, C. et al. Natural language processing analysis of the theories of people with multiple sclerosis about causes of their disease. Commun. Med. 4, 122 (2024).
  6. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv [cs.CL] (2022).
  7. Bjornevik, K. et al. Longitudinal analysis reveals high prevalence of Epstein-Barr virus associated with multiple sclerosis. Science 375, 296–301 (2022).
  8. Ascherio, A. & Munger, K. L. Epidemiology of Multiple Sclerosis: From Risk Factors to Prevention—An Update. Semin. Neurol. 36, 103–114 (2016).
