Behind the scenes of the massive COVID-19 case report dataset – lessons from the unreadiness of artificial intelligence

Published in Research Data

We’ve just published a massive line-list dataset in Scientific Data (https://doi.org/10.1038/s41597-021-00844-8), containing 14k+ COVID-19 cases with detailed mobility and epidemiological information (28 data fields). The dataset is still updated on a bi-weekly basis.

More than 20 research assistants are engaged in this labor-intensive work, curating 28 different features from the paragraph-long report that the health authorities disclose online for each case.
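To make the curation target concrete, here is a minimal sketch of what one curated record might look like. The field names below are illustrative placeholders, not the dataset’s actual 28 columns; the real schema is documented in the data descriptor linked above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified record shape; the published dataset's actual
# 28 fields are described in the Scientific Data paper linked above.
@dataclass
class CaseRecord:
    case_id: str
    region: str                             # reporting health authority
    date_confirmed: Optional[str] = None    # ISO 8601 date string
    date_symptom_onset: Optional[str] = None
    contact_scenario: Optional[str] = None  # free-text close-contact context
    transit_stops: List[str] = field(default_factory=list)  # interim transits
    source_text: str = ""                   # the original report paragraph

record = CaseRecord(
    case_id="example-0001",
    region="Example City",
    date_confirmed="2021-01-15",
    transit_stops=["train station", "supermarket"],
    source_text="Case 1, confirmed on Jan 15, visited ... before diagnosis.",
)
```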

This is not an easy job. 

Some of the features, e.g., the interim transits in a case’s mobility trajectory, the close-contact scenario, and when symptoms were first noticed, have proven difficult to interpret accurately. Disputes arise between coders, and extra help is often needed to resolve them.

We thought about seeking help from artificial intelligence to extract the information in the first place. But only now, a whole year after the outbreak, have we developed a first usable version of the algorithm, achieving ~80% accuracy on some of the easier fields. The machine-extracted results are still far from immediately publishable, but they are already helping to speed up the human coding process.
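The post does not publish the extractor itself, but the way machine output speeds up coding is easy to sketch: the algorithm pre-fills each field with a best guess, and the human coder confirms or corrects it instead of reading the whole paragraph from scratch. A minimal, hypothetical sketch follows; the regexes and field names are placeholders, not our actual pipeline.

```python
import re
from typing import Dict, Optional

def machine_prefill(report: str) -> Dict[str, Optional[str]]:
    """Stand-in for the real extractor (not published here): pre-fills
    a couple of 'easy' fields with best guesses, which may be wrong."""
    date = re.search(r"confirmed on (\w+ \d+)", report)
    age = re.search(r"(\d+)-year-old", report)
    return {
        "date_confirmed": date.group(1) if date else None,
        "age": age.group(1) if age else None,
    }

def assisted_coding(report: str) -> Dict[str, str]:
    """The coder confirms or corrects each machine guess instead of
    extracting every field from the paragraph by hand."""
    coded = {}
    for name, guess in machine_prefill(report).items():
        answer = input(f"{name} = {guess!r}? (Enter to accept, or type a fix): ")
        coded[name] = answer or (guess or "")
    return coded
```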

Don’t get us wrong. We (and our collaborators) are not amateurs in NLP. In fact, our collaborators at Dalian University of Technology are among the best NLP scientists working on the Chinese language. So why doesn’t artificial intelligence help?

We think the reasons are twofold.

First, coding these online reports is a highly non-standard task. Although the task is, in essence, just to extract entities and times, the problem is not that simple. Different types of entities are mixed together, the coupling of entities and times (e.g., the mobility trajectory and the epidemiological timeline) is complex, and the language used by authorities in different regions is ... different. Algorithms trained on, and performing well on, standard datasets are simply “unacclimatized” to these non-standard data.
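To illustrate the coupling problem: a mobility sentence pairs each time with a place or event, and extracting entities and times independently (as off-the-shelf NER and temporal taggers do) loses exactly that pairing. A toy sketch on an invented, heavily simplified English rendering of a report (real reports are in Chinese and far less regular):

```python
import re
from typing import List, Tuple

# Invented, heavily simplified report sentence; real reports are in
# Chinese, much longer, and far less regular than this.
report = ("On Jan 10 the patient took a train to Example City, "
          "on Jan 12 visited a supermarket, and on Jan 14 developed fever.")

def extract_trajectory(text: str) -> List[Tuple[str, str]]:
    """Pair each date with the event clause that follows it; tagging
    dates and places independently would lose this pairing."""
    pattern = re.compile(r"[Oo]n (\w+ \d+)\s+(?:the patient\s+)?([^,.]+)")
    return pattern.findall(text)

print(extract_trajectory(report))
# [('Jan 10', 'took a train to Example City'),
#  ('Jan 12', 'visited a supermarket'),
#  ('Jan 14', 'developed fever')]
```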

Second, the machine needs to learn from humans, who must feed it enough samples to learn from. In our case, 14k case reports seem like a massive number to a human but are just an appetizer for the algorithm.

Now that we (luckily) have our first-version algorithm running, the research assistants can finally catch their breath. But what we really hope is that they can sleep easily at night, without worrying about more work waiting when they wake up in the morning.
