The COVID-19 Connection: Breathing, cough, and speech audio dataset for respiratory healthcare

Can COVID-19 be detected through breathing, cough, and speech sounds? The Coswara dataset is designed to answer this and other key questions, accelerating research on point-of-care respiratory healthcare.
Published in Research Data

During the outbreak of the pandemic in March-April 2020, when humanity faced unprecedented challenges, our group considered how speech and audio signal processing might help address the growing number of challenges in the health sector. Early health reports of COVID-19 patients highlighted impairments in the respiratory system, with dry cough, breathlessness, and fatigue as symptoms indicating the onset of COVID-19 infection. Visible lung damage was also observed in radiographic imaging of the chest of COVID-19 patients, using computed tomography and X-ray scans. Connecting the dots, we hypothesized that COVID-19 infection of the respiratory system might manifest as distinct changes in the sounds it produces.

The art of listening to respiratory sounds, such as lung sounds, is not new. In the 1820s, R. T. H. Laennec, a French physician, described the methodology of using a stethoscope-like instrument in his work “A treatise on the diseases of the chest and on mediate auscultation”. The methodology has stood the test of time and is practiced even now. Respiratory ailments such as chronic obstructive pulmonary disease (COPD), asthma, and pneumonia can cause inflammation or obstruction in the bronchial tubes of the lungs. This results in distinct sounds, such as wheezing or crackling noises, while breathing in and out. When a doctor presses a stethoscope to our chest, they are listening for these sounds.

Intrigued by this, we chose to employ computational methods to examine the respiratory sounds (breathing, cough, and speech) of individuals with and without COVID-19 infection. This required a large dataset of sound samples drawn from both groups, and no such dataset was available in March-April 2020.

As per a 2019 World Health Organization report, chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, accounting for 6% of the global death toll. Further, lower respiratory infections and lung cancer occupy the fourth and sixth spots in this list, respectively. India, which accounts for close to 18% of the global population, bore approximately 32% of the global disability-adjusted life years (DALYs, an estimate of overall disease burden) from these diseases in 2016. As we head into the future, it is evident that regular monitoring of respiratory health will be essential for living a healthy life amidst concerns about air pollution and pandemics. But how accessible are respiratory health monitoring methodologies? Often the preliminary screening with a stethoscope is not sufficient, and follow-ups such as radiographic imaging of the chest using computed tomography (CT), X-ray (CXR), or ultrasound are recommended. For the final diagnosis, specific molecular tests have been designed to detect the unique constituents of specific bacteria or viruses in blood, sputum, or nasal swabs. Essentially, all of this requires the patient to visit a clinic or hospital.

Seeing scope for innovation, a few research studies have experimented with computer algorithms that take cough sounds recorded via mobile phones as input and predict the nature of the individual’s illness, such as asthma, wet or dry cough, or tuberculosis. Interestingly, these methods show enough promise to warrant detailed clinical trials.

In April 2020, we launched a study featuring three stages (Fig. 1). In stage-1, we created a large audio dataset composed of breathing, cough, and speech sound samples along with symptoms data. The data was collected via crowdsourcing: anyone with an internet connection could go to a website, fill in a form, and record their respiratory sound samples using a microphone, readily available in a smartphone or a laptop. The website link was shared on social media platforms and in hospitals and academic institutions. With physical distancing norms in place and limited understanding of the COVID-19 disease, it was challenging to persuade COVID-19 patients to volunteer to record their data. Obtaining ethics approval from hospitals required us to understand several new kinds of medical documentation, which we rarely come across when designing an engineering experiment!

Thanks to the interest and support of doctors, we were able to obtain ethics approval from multiple hospitals. This increased the number of COVID-19 patients participating in the data recording exercise. The study also gained popularity in local circles in several cities, thereby reaching individuals spanning a broad range of age groups, respiratory health conditions, and ethnicities. A special mention goes to the media coverage, which continually helped spread information about the study.

By April 2022, we had collected data from ~2600 individuals, amounting to ~60 hrs of respiratory sound recordings. The collected data was curated via automated checks and human listening and has now been released as open access, enabling its use for non-commercial research across the world. The dataset is referred to as the Coswara Dataset, and it collates respiratory sound samples, health symptoms, and demographic data of human subjects drawn mainly from India, with a few subjects from other countries.

Fig. 1: Illustration of the three stages in the Coswara Dataset creation and exploration project.

Stage-2 of the study focused on designing computational methods to analyze the interplay between COVID-19 infection and respiratory sound signals. To this end, we analyzed the acoustic features of the sound samples. These features quantify the spectral content and temporal variations of a sound signal, as illustrated in Fig. 2. We implemented algorithms that process the acoustic features and classify the associated sound signals into categories of interest. These algorithms are drawn from the field of machine learning, which offers powerful tools for data analysis. The implemented methods take the breathing, cough, and speech sound samples of a human subject as input and output a COVID-19 probability score, which indicates how likely it is that the individual has COVID-19. The methods were validated on multiple held-out test sets and, to our pleasant surprise, we found COVID-19 detection performance significantly better than chance.
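To make the idea of acoustic features and a probability score concrete, here is a minimal sketch in plain Python. It is not the pipeline used in the study: the frame length, the Hann window, the two summary features (spectral centroid and log energy), and the logistic model are all assumptions chosen for illustration, the input is a synthetic tone rather than a Coswara recording, and the model parameters are untrained placeholders. In practice a numerical library would replace the naive DFT.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def magnitude_spectrum(frame):
    """Hann-windowed magnitude spectrum of one frame (bins 0..N/2, naive DFT)."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, frame_len=256, hop=128):
    """Time-frequency representation: one magnitude spectrum per frame."""
    return [magnitude_spectrum(f) for f in frame_signal(signal, frame_len, hop)]


def spectral_centroid(spectrum, sample_rate, frame_len):
    """Magnitude-weighted mean frequency (Hz) of one spectrum."""
    total = sum(spectrum)
    if total == 0.0:
        return 0.0
    bin_hz = sample_rate / frame_len
    return sum(k * bin_hz * m for k, m in enumerate(spectrum)) / total


def probability_score(features, weights, bias):
    """Logistic model: squash a weighted feature sum into a value in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic stand-in for a recording: 0.25 s of a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(2000)]

spec = spectrogram(tone)
centroids = [spectral_centroid(s, SAMPLE_RATE, 256) for s in spec]
mean_centroid_khz = sum(centroids) / len(centroids) / 1000.0
mean_log_energy = sum(math.log(sum(m * m for m in s) + 1e-12)
                      for s in spec) / len(spec)

# Placeholder (untrained) parameters: zero weights give the chance-level 0.5.
score = probability_score([mean_centroid_khz, mean_log_energy], [0.0, 0.0], 0.0)
```

With real data, the weights and bias would be learned from labeled recordings, and the feature set would typically be much richer than the two summary values used here.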

With the goal of encouraging further research on analyzing COVID-19 respiratory sounds, we launched two global challenges, Diagnosing COVID-19 using Acoustics (DiCOVA), on designing computational models for detecting COVID-19 from respiratory sound samples. As part of each challenge, we released development and blind test datasets along with baseline computational models, and invited the global research community to design approaches that surpass the baseline performance on the blind test dataset. The challenges garnered interest in the speech and audio research community, and a special session was organized at the Interspeech Conference, a flagship conference of the International Speech Communication Association, to present the findings from multiple teams. The special session also drew interest from industry participants.
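Comparing a model against a baseline on a blind test set requires a single number that summarizes detection performance. One common choice for probability scores is the area under the ROC curve (AUC), where 0.5 corresponds to chance. The sketch below assumes that metric for illustration; the text above does not specify which metric was used, and the labels and scores are made-up toy values, not challenge results.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half. 0.5 corresponds to chance, 1.0 to perfect ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy labels (1 = positive) and scores, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ranked correctly
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based computation is the usual choice for larger test sets.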

Fig. 2: Illustration of (a) speech (vowel [a] as in made), (b) cough, and (c) breathing sound signals recorded using a microphone. The top row shows the time-domain signal and the bottom row shows the spectrogram. The red regions in the spectrogram highlight frequencies and time instants of high intensity.

In Sept. 2020, the World Health Organization stressed the need for innovative point-of-care testing (POCT) solutions that provide rapidly deployable, geographically scalable, and cost-effective methodologies to screen or test for COVID-19. In a POCT methodology, the “data does the traveling” instead of the patient. In stage-3, the focus was to design such a solution based on respiratory sound and symptom analysis. With a mobile phone connected to the internet, a user can go to a website and record their respiratory sound samples and symptoms data. In a few seconds, this data is analyzed on a server, and a notification stating the user's COVID-19 probability score is displayed on their phone. A pictorial illustration is provided in Fig. 3. We have now designed such a solution, and it is available here. The tool is publicly accessible; however, a clinical trial remains to be completed. At a broader level, success in POCT can revolutionize the way we monitor and diagnose respiratory health. Imagine that the next time you speak on your phone, it advises you to gargle tonight because, by analyzing your voice while you talked to your friend, it has detected an infection in the respiratory tract.
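The server-side step of this flow can be sketched roughly as follows. This is a hypothetical outline, not the deployed tool: the `Submission` type, the `screen` function, the plain averaging of per-sound scores, and the message wording are all assumptions made for illustration, and `stub_scorer` stands in for a trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

REQUIRED_SOUNDS = ("breathing", "cough", "speech")


@dataclass
class Submission:
    """What the user's phone uploads: audio samples per category plus symptoms."""
    recordings: Dict[str, List[float]]
    # Symptoms travel with the audio; this sketch does not use them for scoring.
    symptoms: List[str] = field(default_factory=list)


def screen(submission: Submission,
           scorer: Callable[[str, List[float]], float]) -> str:
    """Score each recording, fuse the scores, and build the notification text."""
    missing = [name for name in REQUIRED_SOUNDS
               if not submission.recordings.get(name)]
    if missing:
        return "Please re-record: " + ", ".join(missing)
    per_sound = [scorer(name, submission.recordings[name])
                 for name in REQUIRED_SOUNDS]
    fused = sum(per_sound) / len(per_sound)  # assumption: plain averaging
    return f"COVID-19 probability score: {fused:.2f}"


def stub_scorer(category: str, samples: List[float]) -> float:
    """Stand-in for a trained model; always returns the chance-level 0.5."""
    return 0.5
```

Calling `screen` with all three recordings present returns the score message, while a submission missing a category returns a prompt to re-record it, so incomplete uploads never reach the model.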

Fig. 3: Point-of-care testing strategy for respiratory health screening.

A detailed presentation on the Coswara dataset creation and validation process, and results is now available as a data descriptor paper here. We encourage you to read it and hope this dataset will help answer key questions in the quest to design POCT solutions for analyzing respiratory health conditions.

A video demonstrating the developed POCT solution is posted here.
