In early 2020, as COVID-19 began its relentless spread across the globe, the urgency to understand and contain the virus was palpable. Governments and researchers raced against time to gather data, develop models, and implement policies to mitigate the spread of the disease. However, much of the available data was aggregated at the population level, lacking the granularity needed to inform more nuanced epidemiological models. It became clear to us that a different approach was needed—one that could provide a more detailed, individual-level perspective on the pandemic.
Our project began in April 2020. Our goal was to develop a COVID-19 forecasting model to help estimate future epidemic trends and assess individuals’ risk of infection and critical illness. We tried several classic models, including ARIMA, SIR, and SEIR, fitted to the daily numbers of cases and recoveries, but none of them yielded good results. With the limited data available in the early days of the pandemic, we realized it was unlikely that we could make reliable forecasts. In July 2020, we decided to focus on Taiwanese data, which we believed to be the most informative and detailed individual-level data available, since we could not find better data from any other country. However, the Taiwanese information was spread across multiple sources, so we needed to collect and organize it into a structured format.
Our recently published article, “A Structured Course of Disease Dataset with Contact Tracing Information in Taiwan for COVID-19 Modelling,” represents the culmination of this effort. It offers a comprehensive, structured dataset of COVID-19 cases in Taiwan, meticulously compiled from open sources and daily reports. The dataset spans 579 confirmed cases from January to November 2020, covering a period when Taiwan was managing to keep the virus largely at bay while the rest of the world struggled with surging cases.
The creation of this dataset was born out of necessity. Early epidemiological models, though useful, often fell short in accuracy due to their reliance on population-level data. These models struggled to capture the complexity of the virus’s spread and the varied responses of different individuals. We recognized that by collecting and analyzing data at the individual level—tracking the course of the disease for each patient, including their symptoms, contact history, and outcomes—we could offer a more precise tool for predicting and managing the pandemic’s trajectory.
However, assembling such a dataset was far from easy. The challenges ranged from limited manpower to technical issues with data collection and integration, as well as the question of how to organize the data. The Taiwan CDC’s daily news reports presented different information depending on the stage of the outbreak. For instance, in the early days, the Taiwan CDC focused on reporting patients’ course of disease, such as confirmation and recovery dates, as well as their travel histories. After the first community outbreak in April 2020, the CDC began emphasizing contact tracing information. We had to iteratively design a structure that could capture all of this information. We spent countless hours parsing reports, standardizing formats, and checking data accuracy. Our efforts were guided by the principle that the dataset should be both comprehensive and accessible, providing a valuable resource for researchers and policymakers alike. Along the way we also used the data for forecasting, which led to a paper published in Scientific Reports titled “Taiwan ended third COVID-19 community outbreak as forecasted”. All of this meant that it took us until January 2024 to complete and report on the dataset.
One of the unique features of our dataset is the detailed contact tracing information it includes. This data allowed us to map interactions between infected and uninfected individuals, offering insights into how the virus spread within specific communities. These contact networks are more than just data points; they tell the story of transmission, providing a visual and analytical tool that could help design more effective interventions for future outbreaks.
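To give a sense of how contact tracing records can be turned into a contact network, here is a minimal Python sketch. It is an illustration only, not the pipeline we used: the case identifiers, the links between them, and the helper names `build_contact_graph` and `clusters` are all hypothetical and do not reflect the actual field names or contents of the dataset.

```python
from collections import defaultdict, deque

# Hypothetical records: (case_id, earlier cases this case had contact with).
# These IDs and links are invented for illustration, not taken from the dataset.
records = [
    ("case_A", []),
    ("case_B", ["case_A"]),
    ("case_C", ["case_A"]),
    ("case_D", ["case_C"]),
    ("case_E", []),
]

def build_contact_graph(records):
    """Return an undirected adjacency map: case_id -> set of contacted case_ids."""
    graph = defaultdict(set)
    for case_id, contacts in records:
        graph[case_id]  # touching the key ensures isolated cases appear as nodes
        for other in contacts:
            graph[case_id].add(other)
            graph[other].add(case_id)
    return dict(graph)

def clusters(graph):
    """Group cases into connected components using breadth-first search."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        queue, component = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(graph[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(sorted(component))
    return result

graph = build_contact_graph(records)
case_clusters = clusters(graph)
```

In this toy example, the four linked cases end up in one cluster and the unlinked case forms a cluster of its own. With real records, the same grouping would be a starting point for looking at how transmission moved within a community.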
In addition to the contact networks, our dataset includes detailed information on each patient’s course of disease—when symptoms first appeared, when they were confirmed, and their eventual outcomes. We also collected daily summary data at the population level, providing broader context for the individual cases. Together, these elements form a rich tapestry of information that can be used to refine epidemiological models and inform public health decisions.
The implications of our work extend beyond COVID-19. By making this dataset publicly available, we hope to empower researchers to develop new models and analyses that could be applied to future pandemics or other public health crises. Our dataset adheres to the FAIR principles—Findable, Accessible, Interoperable, and Reusable—ensuring that it can be easily accessed and used by researchers worldwide.
The pandemic has underscored the critical role of data in public health, and our dataset represents a step forward in the ongoing effort to understand and control infectious diseases. Our hope is that future policymakers and researchers can use the models and tools developed from this dataset to quickly implement evidence-based interventions during health crises. As we continue to battle the consequences of the pandemic and prepare for future public health challenges, we believe that datasets like ours will be invaluable.
Acknowledgments go to all who contributed to this dataset and to the broader scientific community, which will continue to carry this work forward. Together, we can face future challenges with the confidence that comes from shared knowledge and collaboration.