Behind the Paper

The COVID-19 Connection: Breathing, cough, and speech audio dataset for respiratory healthcare

Can COVID-19 be detected through breathing, cough, and speech sounds? The Coswara dataset is designed to answer this and other key questions, accelerating research on point-of-care respiratory healthcare.

Published in Research Data

Jul 08, 2023

Neeraj Kumar Sharma

Assistant Professor, Indian Institute of Technology Guwahati, Guwahati

During the outbreak of the pandemic in March-April 2020, when humanity faced unprecedented challenges, our group pondered how speech and audio signal processing could help address the growing number of challenges in the health sector. Early health reports of COVID-19 patients highlighted impairments in the respiratory system, with dry cough, breathlessness, and fatigue as symptoms indicating the onset of infection. Visible lung damage was also observed in radiographic imaging of the chest, using computed tomography and X-ray scans, in COVID-19 patients. Connecting the dots, we hypothesized that a COVID-19 infection of the respiratory system might manifest as distinct changes in the different sounds the system produces.

The art of listening to respiratory sound signals, such as lung sounds, is not new. In the 1820s, R. T. H. Laennec, a French physician, described the methodology of using a stethoscope-like instrument in his thesis, “A treatise on the diseases of the chest and on mediate auscultation”. The methodology has stood the test of time and is practiced even now. Respiratory ailments such as chronic obstructive pulmonary disease (COPD), asthma, and pneumonia can cause inflammation or obstruction in the bronchial tubes of the lungs. This results in distinct sounds, such as wheezing or crackling noises, while breathing in and out. When a doctor presses the stethoscope to our chest, they are listening for these sounds.

Intrigued by this, we chose to employ computational methods to examine the respiratory sounds (breathing, coughs, and speech) of individuals both with and without COVID-19 infection. This required a large dataset of sound samples drawn from both groups. No such dataset was available in March-April 2020.

As per a 2019 World Health Organization report, chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, accounting for 6% of the global death toll. Further, lower respiratory infections and lung cancer occupy the fourth and sixth spots in this list, respectively. India, which accounts for close to 18% of the global population, in 2016 bore approximately 32% of the global disability-adjusted life years (DALYs, an estimate of overall disease burden) from these diseases. As we head into the future, it is evident that regular monitoring of respiratory health will be essential for living a healthy life amid concerns about air pollution and pandemics. But how accessible are respiratory health monitoring methodologies? Often the preliminary screening with a stethoscope is not sufficient, and follow-ups such as radiographic imaging of the chest using computed tomography (CT), X-ray, or ultrasound are recommended. For the final diagnosis, specific molecular tests have been designed to detect the unique constituents of specific bacteria or viruses in blood, sputum, or nasal swabs. Essentially, all of this requires the patient to visit a clinic or hospital.

Identifying the scope for innovation, a few research studies have experimented with computer algorithms that take cough sounds recorded via mobile phones as input and offer predictions on the nature of the individual’s illness, such as asthma, wet or dry cough, or tuberculosis. Interestingly, these methods show enough promise to justify detailed clinical trials.

In April 2020, we launched a study featuring three stages (Fig. 1). In stage-1, we created a large audio dataset composed of breathing, cough, and speech sound samples along with symptoms data. The data was collected via crowdsourcing. Anyone with an internet connection could go to a website, fill in a form, and record their respiratory sound samples using a microphone, which is readily available in a smartphone or laptop. The website link was shared on social media platforms and with hospitals and academic institutions. With physical distancing norms in place and limited knowledge about the COVID-19 disease, it was challenging to persuade COVID-19 patients to volunteer to record their data. Obtaining ethics approval from hospitals required us to understand several kinds of medical documentation that we rarely come across when designing an engineering experiment!

Thanks to the interest and support of doctors, we received ethics approval from multiple hospitals. This increased the number of COVID-19 patients participating in the data recording exercise. The study also gained popularity in local circles in several cities, which helped us reach individuals spanning a broad range of age groups, respiratory health conditions, and ethnicities. Special mention goes to the media coverage, which continually helped spread information about the study.

By April 2022, we had collected data from ~2600 individuals, amounting to ~60 hrs of respiratory sound recordings. The collected data was curated via automated checks and human listening and has now been released as open access, enabling its use for non-commercial research across the world. The dataset, referred to as the Coswara Dataset, collates respiratory sound samples, health symptoms, and demographic data of human subjects, drawn mostly from India with a few subjects from other countries.

Fig. 1: Illustration of the three stages in the Coswara Dataset creation and exploration project.

Stage-2 of the study focused on designing computational methods to analyze the interplay between COVID-19 infection and the respiratory sound signals. To this end, we analyzed the acoustic features of the sound samples. Acoustic features help quantify the spectral content and temporal variations in a sound signal, as illustrated in Fig. 2. We implemented algorithms that process the acoustic features and classify the associated sound signals into categories of interest. These algorithms are drawn from the field of machine learning, which offers powerful tools for data analysis. The implemented methods take the breathing, cough, and speech sound samples of a human subject as input and output a COVID-19 probability score. This score suggests how likely it is that the individual has COVID-19. The methods were validated on multiple held-out test sets and, to our pleasant surprise, we found COVID-19 detection performance significantly better than chance.
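To make the idea of a "probability score" concrete, here is a minimal sketch of a classifier that maps a feature vector to a score between 0 and 1. It is an illustration only, not the models used in the study: logistic regression is a stand-in technique, the two-dimensional features are synthetic random numbers rather than real acoustic features, and all function and variable names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function, mapping any real number to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability score for one feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(features, labels, lr=0.1, epochs=300):
    """Fit logistic regression with plain batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic stand-in data (an assumption, NOT Coswara features):
# class 0 is centred at (-1, -1), class 1 at (+1, +1).
random.seed(0)
negatives = [[random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)] for _ in range(100)]
positives = [[random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)] for _ in range(100)]
features = negatives + positives
labels = [0] * 100 + [1] * 100

weights, bias = train_logistic(features, labels)
score_high = predict_proba(weights, bias, [2.0, 2.0])    # deep in class-1 territory
score_low = predict_proba(weights, bias, [-2.0, -2.0])   # deep in class-0 territory
print(f"score near class 1: {score_high:.3f}, near class 0: {score_low:.3f}")
```

In a real pipeline the feature vectors would be derived from the recorded audio, and the model would be evaluated on held-out subjects rather than on points chosen by hand.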

To encourage further research on analyzing COVID-19 respiratory sounds, we launched two global challenges, titled Diagnosing COVID-19 using Acoustics (DiCOVA), on designing computational models for detecting COVID-19 from respiratory sound samples. As part of each challenge, we released development and blind test datasets along with baseline computational models, and invited the global research community to design approaches that surpass the baseline performance on the blind test dataset. The challenges garnered interest in the speech and audio research community, and a special session was organized at the Interspeech Conference, a flagship conference of the International Speech Communication Association, to present the findings from multiple teams. The special session also drew interest from industry participants.
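Comparing a model against a baseline, or against chance, needs a metric. The text above does not name one, so the following is a sketch under an assumption: it uses the area under the ROC curve (AUC), a common choice for binary detection tasks, where 0.5 corresponds to chance and 1.0 to perfect ranking. The labels and scores are made-up numbers, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive, 0 = negative) and model scores for six subjects.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.35, 0.6]

auc = roc_auc(labels, scores)
chance = roc_auc(labels, [0.5] * len(labels))  # a constant score carries no information
print(f"model AUC = {auc:.3f}, constant-score AUC = {chance:.3f}")
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; a rank-based computation would be preferable for large test sets.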

Fig. 2: Illustration of (a) speech (vowel [a] as in made), (b) cough, and (c) breathing sound signals recorded using a microphone. The top row shows the time-domain signal and the bottom row shows the spectrogram. The red regions in the spectrogram highlight frequencies and time instants of high intensity.
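For readers curious how a spectrogram like the one in Fig. 2 is obtained from a time-domain signal, the sketch below computes one with a short-time Fourier transform using only the Python standard library. It is a teaching illustration, not the processing used for the figure: the input is a synthetic 1000 Hz tone, and the sample rate, frame length, and hop size are arbitrary assumptions.

```python
import cmath
import math


def hann(n):
    """Hann window of length n, used to taper each frame before the transform."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=256, hop=128):
    """Power spectrogram: one list of frequency-bin powers per time frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        power = []
        # Naive DFT over the non-negative frequencies (fine for short signals).
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            power.append(abs(acc) ** 2)
        frames.append(power)
    return frames


# Synthetic test signal (an assumption): 0.1 s of a 1000 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame_len = 256
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]

spec = spectrogram(tone, frame_len=frame_len, hop=128)
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
peak_hz = [b * sample_rate / frame_len for b in peak_bins]
print(f"{len(spec)} frames, peak frequency per frame (Hz): {peak_hz}")
```

Each row of `spec` corresponds to one column of a spectrogram image; plotting the values on a color scale, with time on the horizontal axis and frequency on the vertical axis, gives the familiar picture. In practice a fast Fourier transform from a numerical library would replace the naive loop.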

In Sept. 2020, the World Health Organization stressed the need for innovative point-of-care testing (POCT) solutions that can provide rapidly deployable, geographically scalable, and cost-effective methodologies to screen or test for COVID-19. In a POCT methodology, the “data does the traveling” instead of the patient. In stage-3, the focus was to design such a solution based on respiratory sound and symptom analysis. Using a mobile phone connected to the internet, a user can go to a website and record their respiratory sound samples and symptoms data. In a few seconds, this data is analyzed on a server and a notification stating the user's COVID-19 probability score is displayed on their phone. A pictorial illustration is provided in Fig. 3. We have now designed such a solution and made it available here. The tool is publicly accessible; however, a clinical trial remains to be completed. At a broader level, success in POCT can revolutionize the way we monitor and diagnose respiratory health. Imagine that the next time you talk to a friend on your phone, the phone advises you to gargle tonight because, by analyzing your voice, it has detected an infection in your respiratory tract.

Fig. 3: Point-of-care testing strategy for respiratory health screening.

A detailed presentation of the Coswara dataset creation and validation process, along with the results, is now available as a data descriptor paper here. We encourage you to read it and hope this dataset will help answer key questions in the quest to design POCT solutions for analyzing respiratory health conditions.

A video demonstrating the developed POCT solution is posted here.
