Estimating COVID-19 prevalence using web search activity

Web search activity has been used to estimate the prevalence of infectious diseases, such as influenza. However, developing models for a novel disease is a different and perhaps more challenging task.

In our paper "Tracking COVID-19 using online search", published in npj Digital Medicine, we show how Google search data can be used to develop complementary public health surveillance methods for COVID-19. We present results for a multilingual and multicultural selection of 8 countries: United States (US), United Kingdom (UK), Australia, Canada, France, Italy, Greece, and South Africa. Our analysis covers the period from October 2019 to the end of May 2020, i.e. the first wave(s) of the COVID-19 pandemic.

Using the symptom profile of COVID-19, we identify related web searches and compute a COVID-19 score (Figure 1, blue line). We also compare it with a historical average (2011-2018; Figure 1, dashed line). As web searches can be influenced by public interest, which is often reflected in the media coverage of a topic, we also develop a method that attempts to reduce this effect (Figure 1, black line). This latter scoring function provides on average an early warning of 16.7 (10.2-23.2) and 22.1 (17.4-26.9) days compared to confirmed COVID-19 cases and deaths, respectively.

Figure 1. Unsupervised COVID-19 scores based on web search activity.
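As a rough illustration of how a symptom-based score of this kind could be computed (a minimal sketch, not the exact procedure from the paper), the Python snippet below scales each symptom's search frequency to [0, 1], combines the scaled series with per-symptom weights, and subtracts a historical baseline. The symptom names, weights, frequencies, and baseline values are all hypothetical, and min-max scaling with a weighted average is an assumption made for this example.

```python
def min_max(series):
    """Scale a series to the [0, 1] range (all zeros if the series is constant)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]


def symptom_score(frequencies, weights):
    """Weighted average of min-max scaled search frequencies, one value per day.

    frequencies: dict mapping symptom -> list of daily search frequencies
    weights:     dict mapping symptom -> non-negative weight (hypothetical values)
    """
    total = sum(weights.values())
    scaled = {s: min_max(f) for s, f in frequencies.items()}
    n_days = len(next(iter(scaled.values())))
    return [
        sum(weights[s] * scaled[s][t] for s in scaled) / total
        for t in range(n_days)
    ]


# Hypothetical toy data: four days of search frequencies for three symptoms.
frequencies = {
    "fever": [1, 2, 3, 5],
    "cough": [2, 2, 4, 6],
    "loss of smell": [0, 0, 1, 4],
}
weights = {"fever": 0.6, "cough": 0.7, "loss of smell": 0.3}  # assumed weights

score = symptom_score(frequencies, weights)

# Hypothetical historical average for the same calendar days.
baseline = [0.1, 0.1, 0.1, 0.1]
excess = [s - b for s, b in zip(score, baseline)]
```

Comparing `score` with `baseline` mirrors the comparison between the blue and dashed lines in Figure 1. The adjustment for media coverage (black line) is not reproduced here.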

We then transfer a COVID-19 incidence model from one country to another. We first train a regression model for Italy (one of the first major hotspots in Europe) using web searches and confirmed cases and then transfer it to the other countries (Figure 2), similarly to previous work focusing on influenza-like illness. The transfer learning approach is not affected (as much) by media coverage as it is based on supervised learning. It corroborates our previous findings from the unsupervised approach albeit with a further delay of about 5 days.

Figure 2. COVID-19 incidence scores (standardised) based on a transfer learning method. The source model is based on data from Italy.
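The transfer idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the model from the paper: it assumes a single aggregated search feature per country, fits an ordinary least-squares line on the source country, and applies it to the target country after standardising each country's search series separately. All numbers and variable names are hypothetical.

```python
from statistics import mean, pstdev


def standardise(xs):
    """Convert a series to z-scores (all zeros if the series is constant)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs] if s > 0 else [0.0] * len(xs)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


# Hypothetical source-country data (searches and confirmed cases).
source_searches = [1.0, 2.0, 4.0, 7.0, 9.0]
source_cases = [10.0, 30.0, 70.0, 130.0, 170.0]

# Hypothetical target-country searches; no case counts are used for the target.
target_searches = [0.5, 0.6, 1.0, 1.8, 2.6]

# Train on the source country using standardised search frequencies.
slope, intercept = fit_line(standardise(source_searches), source_cases)

# Transfer: apply the source model to the standardised target series.
target_estimate = [slope * z + intercept for z in standardise(target_searches)]
```

Standardising each country separately is one simple way to put search frequencies from different populations on a comparable scale. Because the output is expressed in the source country's units, it is best read as a relative trend, which is consistent with the standardised scores shown in Figure 2.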

We then conduct a regression analysis to uncover important search terms, estimating confirmed cases based on web searches. To reduce clinical reporting bias, we use a joint data set from 4 English-speaking countries (US, UK, Australia, and Canada). We were among the first to indicate that there is a relationship between clinical COVID-19 indicators and the symptoms of anosmia (loss of the sense of smell), ageusia (loss of the sense of taste), and skin rash.

A limitation of our study is that, in contrast to past efforts, it was hard to evaluate the accuracy of our approach because clinical indicators were (and are) not necessarily representative of disease prevalence. However, when we compared our COVID-19 scores in England with prevalence estimates obtained from a COVID-19 swabbing scheme (Royal College of General Practitioners), which also covered non-COVID-19 cases and could therefore provide a more representative statistic, we found strong correlations (above 0.80). We also assessed the hypothesis that the COVID-19 outbreak in Italy, which was the first major outbreak in Europe, might have Granger-caused an increase in the frequency of search terms elsewhere. We concluded that more than 70% of the search terms we used were not affected by the events in Italy. For the remaining terms, two points should be kept in mind: (a) our method reduces the influence of news coverage, and (b) Granger causality might be misleading in this case, because COVID-19 could have emerged at the same time in Italy and other locations (especially in Europe).
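For readers who want to try a comparison of this kind on their own data, the Python sketch below computes a Pearson correlation between a search-based score and a reference series, and also looks for the lead (in days) at which the correlation is highest. The lead search is an illustrative addition that is neither the Granger-causality test nor necessarily how the early-warning figures in the paper were derived. The two series are made up, with the reference series constructed as a copy of the score delayed by two days.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def best_lead(score, reference, max_lead):
    """Return (lead, r): the number of days by which `score` leads `reference`
    that gives the highest correlation, searching leads 0..max_lead."""
    best = (0, pearson(score, reference))
    for lead in range(1, max_lead + 1):
        # Pair score on day t with the reference on day t + lead.
        r = pearson(score[:-lead], reference[lead:])
        if r > best[1]:
            best = (lead, r)
    return best


# Hypothetical daily search-based score.
score = [0, 1, 3, 6, 8, 7, 5, 3, 2, 1]
# Hypothetical reference series: the same shape, delayed by two days.
reference = [0, 0, 0, 1, 3, 6, 8, 7, 5, 3]

lead, r = best_lead(score, reference, max_lead=4)
```

With real data the series are noisier, so the lead with the highest correlation should be treated as indicative rather than exact.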

Since March 2020, we have been sending our COVID-19 scores to Public Health England (PHE) on a weekly basis. These are included in PHE’s syndromic surveillance reports and have been used as a complementary early-warning resource for epidemiological monitoring and planning.

To find out more details about our methodology and outcomes, read our open-access article.
