Giving autonomous vehicles the ability to hear

Cameras and radar are vital, but what about hearing? Discover how our US8K_AV dataset allows autonomous vehicles to "hear" and interpret urban sounds efficiently on embedded systems. From sirens to children playing, this enhances situational awareness and safety for next-gen smart city mobility.

How it started: Sound, autonomy, and urban intelligence

When we imagine autonomous vehicles, most of us picture cameras, radars, and LIDAR sensors scanning the road ahead. What we rarely consider is hearing — a crucial sensory modality in human driving. In many situations, human drivers also rely on sounds like honking, sirens, barking dogs, or children playing to make safe driving decisions.

This project was born from a simple yet powerful question:

 Can autonomous vehicles benefit from environmental sound recognition, especially in urban environments within smart cities?

And if so, could we make it work on low-cost embedded systems, like the Raspberry Pi?


The motivation: Why a new dataset was necessary

During my master’s research, we found a clear gap in the available data: while there are many datasets for environmental sound recognition (ESR), none were designed for real-world autonomous vehicle applications.

Most existing datasets were:

  • Too general (not vehicle-specific);
  • Not representative of urban noise conditions within the context of smart cities;
  • Not structured for edge deployment on devices with limited computing power.

Subjective evaluation of the classes within the datasets ESC-10, BDLib2, and US8K related to autonomous vehicles. Proposal for the classes within the US8K_AV dataset based on the US8K classes

At the same time, sound recognition had proven effective in areas like smart homes, healthcare, and wildlife monitoring. It was time to bring hearing to mobility.


Building the US8K_AV dataset

We started by adapting the well-known UrbanSound8K (US8K) dataset — a standard benchmark in ESR — to better fit the needs of autonomous driving.

Step 1: Filtering irrelevant classes

Classes like 'air_conditioner' and 'gun_shot' were considered irrelevant in the context of urban mobility and removed from the dataset.

Step 2: Creating the ‘background’ class

Some classes (e.g., 'drilling', 'engine_idling', 'jackhammer', 'street_music') were merged into a new class called 'background', representing general urban noise.
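As a minimal sketch, Steps 1 and 2 can be expressed as a remapping of the standard UrbanSound8K metadata file (UrbanSound8K.csv, with its 'class' column); the actual preprocessing used for US8K_AV is in the public repository and may differ in detail:

```python
import pandas as pd

# Classes judged irrelevant for urban mobility (Step 1)
DROPPED = {"air_conditioner", "gun_shot"}

# Classes merged into a single 'background' class (Step 2)
BACKGROUND = {"drilling", "engine_idling", "jackhammer", "street_music"}

meta = pd.read_csv("UrbanSound8K.csv")            # original US8K metadata

meta = meta[~meta["class"].isin(DROPPED)].copy()  # Step 1: drop irrelevant classes
meta.loc[meta["class"].isin(BACKGROUND), "class"] = "background"  # Step 2: merge

print(meta["class"].value_counts())               # remaining class distribution
meta.to_csv("US8K_AV_metadata.csv", index=False)  # hypothetical output filename
```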

Step 3: Adding a new class — ‘silence’

Instead of treating silence as just low-volume audio, we sourced real-world silence samples from diverse locations, annotated them, and included them as a new class. This allows models to actively recognize and react to quiet environments — useful for event segmentation, baseline calibration, and power-saving mechanisms.

Step 4: Preserving structure and preventing data leakage

The dataset was carefully split into 10 folds, ensuring that all slices from a single audio source were placed in the same fold. This prevents overly optimistic results caused by slices of the same recording appearing in both the training and test sets.
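To illustrate the idea, the sketch below assumes the standard UrbanSound8K metadata columns 'fsID' (the source recording) and 'fold'. It first checks that no source recording spans more than one fold, then shows how a group-aware splitter such as scikit-learn's GroupKFold enforces the same property if folds are rebuilt from scratch:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

meta = pd.read_csv("US8K_AV_metadata.csv")  # hypothetical file from the previous sketch

# Each source recording (fsID) must live in exactly one fold; otherwise slices
# of the same recording could end up in both training and test sets.
folds_per_source = meta.groupby("fsID")["fold"].nunique()
leaky = folds_per_source[folds_per_source > 1]
assert leaky.empty, f"{len(leaky)} source recordings span multiple folds"

# When building folds from scratch, a group-aware splitter keeps all slices
# of one source recording together:
gkf = GroupKFold(n_splits=10)
for train_idx, test_idx in gkf.split(meta, meta["class"], groups=meta["fsID"]):
    train_sources = set(meta.iloc[train_idx]["fsID"])
    test_sources = set(meta.iloc[test_idx]["fsID"])
    assert train_sources.isdisjoint(test_sources)
```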

The result is the dataset US8K_AV:  

  • 4,908 annotated WAV files;
  • 4.94 hours of audio;
  • 6 meaningful sound classes;
  • Designed for embedded systems and real-world use.

Class distribution among the folds of the US8K_AV dataset

Model results and real-time testing

We benchmarked several classifiers — including traditional machine learning algorithms (SVM, Logistic Regression, Random Forest) and deep learning architectures (ANN, CNN 1D, CNN 2D).

We found that a 2D Convolutional Neural Network (CNN 2D) trained on log-mel spectrograms (with their derivatives) yielded the best trade-off between accuracy, memory usage and speed, even when deployed on a Raspberry Pi 4.
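As a rough sketch of that feature pipeline (with placeholder parameters, not the exact values from the paper), log-mel spectrograms and their first derivative can be computed with librosa and stacked as channels of a 2D input:

```python
import numpy as np
import librosa

def logmel_with_deltas(path, sr=22050, n_mels=64, duration=4.0):
    """Load a clip and stack log-mel energies with their first derivative.

    Sample rate, number of mel bands, and clip length here are illustrative;
    the values actually used for US8K_AV are described in the paper.
    """
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))    # pad/trim to fixed length

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)              # log-mel spectrogram
    delta = librosa.feature.delta(logmel)                      # first-order derivative

    # Stack as channels -> an "image" of shape (n_mels, frames, 2) for a 2D CNN
    return np.stack([logmel, delta], axis=-1)
```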

 Representation of the CNN 2D architecture
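For readers unfamiliar with this kind of model, a generic Keras sketch of a small 2D CNN over such inputs could look like the following; layer counts and sizes are placeholders, not the published architecture shown above:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn2d(input_shape=(64, 173, 2), n_classes=6):
    """Small 2D CNN over (mel bands x frames x channels) inputs.

    Filter counts and layer depth are placeholders chosen for readability;
    see the paper for the architecture that was actually benchmarked.
    """
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])

model = build_cnn2d()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```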

Key highlights:

  • CNN 2D achieved >80% accuracy on real-world data;
  • Response time <50 ms on Raspberry Pi;
  • Significant F1-score improvements over the original US8K for relevant classes.

👉🏼 So far, performance has improved across all relevant categories, validating our methodology of merging and adding classes.
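To give an idea of what real-time testing on the Raspberry Pi can look like, here is a sketch assuming a trained Keras model saved as us8k_av_cnn2d.h5 (a hypothetical filename), microphone capture via the sounddevice library, and the same feature pipeline as above; the deployed system may differ:

```python
import numpy as np
import sounddevice as sd
import librosa
from tensorflow import keras

SR = 22050                      # illustrative sample rate
CLIP_SECONDS = 1.0              # illustrative capture window
CLASSES = ["background", "car_horn", "children_playing",
           "dog_bark", "silence", "siren"]            # assumed label order

model = keras.models.load_model("us8k_av_cnn2d.h5")  # hypothetical trained model

while True:
    # Record a short clip from the default microphone (blocking call)
    audio = sd.rec(int(SR * CLIP_SECONDS), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    y = audio.squeeze()

    # Same log-mel + delta features as at training time (placeholder parameters)
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=SR, n_mels=64), ref=np.max)
    feats = np.stack([logmel, librosa.feature.delta(logmel)], axis=-1)[np.newaxis, ...]

    probs = model.predict(feats, verbose=0)[0]
    print(CLASSES[int(np.argmax(probs))], round(float(probs.max()), 3))
```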


Real-world applications

This dataset was designed with practical use cases in mind, especially for urban autonomous vehicles in smart cities, like the innovation project Citybot.

Use cases include:

  • 🚸 Children playing behind a fence — a camera can’t see them, but a microphone can hear them;
  • 🐕 Dog barking or 🚗 honking — useful when approaching intersections or blind spots;
  • 🚨 Siren detection — allows earlier response to emergency vehicles than vision-based sensors alone.

The inclusion of a silence class also enables systems to identify periods of inactivity, helping with energy efficiency and segmentation.


Challenges and lessons learned

One unexpected challenge?

Finding real silence...

True silence in urban settings is rare, and collecting well-documented, noise-free recordings took substantial time and curation.

Another lesson: balancing scientific rigor with practical deployment. Our goal was not just to publish another dataset, but to create something usable and replicable — something that can run on a Raspberry Pi and still be meaningful in the real world.


An invitation to the community

We see US8K_AV not as a final product, but as a foundation for future work.

We invite researchers to:

  • Add new classes relevant to other vehicle types;
  • Expand the dataset with recordings from other regions;
  • Use it in different edge-computing environments;
  • Explore sensor fusion combining acoustic and visual data.

🔗 The source code, thesis, and dataset are all publicly available:


Final thoughts: Why hearing matters

Autonomous vehicles are becoming increasingly capable. But without the ability to hear, they are still missing an essential sense — one that humans use every day to stay safe, avoid accidents, and make informed decisions.

Our hope is that this dataset inspires others to think beyond vision and radar — and consider sound as a rich, underutilized source of environmental context.

Because sometimes...

the most important thing to know... is what you hear!
