Behind the Paper

Where Will All The Data Come From?

All Artificial Intelligence and Machine Learning need is more data, or at least so we hear. Some of that data will need to be collected from the physical world, and this paper with Ahalya Prabhakar discusses methods to automate that data collection, optimizing for both quality and efficiency.

Published in Electrical & Electronic Engineering

Jul 15, 2022

Todd Murphey

Professor, Northwestern University

Liked by Evelina Satkevic and 1 other

Explore the Research

We hear almost daily that soon we will have enough data for artificial intelligence (AI) to learn everything we need it to know. When researchers assert this, they of course mean that we will have as many images and sound bites as we could possibly want, that we will have completely mapped the surface of the earth, and that we will have specialized datasets (e.g., medical imagery) for highly prioritized applications. That is, we can expect data in instances where people already happen to care about that data (e.g., imagery of cats doing cute things) and in settings where heavy investment has an estimated high payoff. These datasets will prioritize systems in their common configuration, versus uncommon configurations, since we observe common states of the world (e.g., cars going down the street) much more than we observe uncommon ones (e.g., cars crashing into each other). Moreover, this near-tautology is confounded by the fact that data collected by people will inherit the biases of those people—certainly caring about cute cats more than insects, but also biases in data representation (e.g., cats being centered rather than being partially obscured). In this Nature Communications paper we ask how should autonomy reason about its learning goals and automatically collect data to support its learning.

How good is our data now, and how good can we expect it to be? Even learning for images has its limits when working with pre-existing datasets. One of the unacknowledged contributors to the success of machine learning in the context of vision is that we have a limitless supply of people who provide images and who potentially label both their own images and other images. If they don’t do this for free, we can pay them to do it, as many companies do, so labeling datasets does not need to happen at the time of collection. But vision datasets are then limited by the arbitrary preferences we have—images are taken with a still camera (robots will generally be in motion), subjects are centered or at least entirely contained in the image (rarely true for robots as they move), and images are taken in good illumination, often chosen specifically for the subject (rarely true for robots). These same considerations affect other datasets such as audio datasets, that by being anthropomorphic (modeled after human sensing), also are subject to human aesthetic choices. Other, non-anthropomorphic sensing (e.g., electromyography and electroencephalogram measuring human biological signals, electrosense capabilities in electric fish) all require significant investment to create datasets. Moreover, labeling these datasets has to happen at the time of collection, increasing overhead.

What does our paper accomplish? Our paper automates data collection given a learning model. That is, rather than solving a learning-for-control problem, where data-driven reasoning improves a feedback control process, we solve a control-for-learning problem, where control synthesis reasons about the state of the learning and optimizes its evolution, often called “active learning”. Our algorithmic/software pipeline is equally effective for vision (a far and mid field sensor) and electrosense (a near-field sensor), demonstrating the approach is independent of sensor physics. An opportunity that this work creates is the development of non-anthropomorphic sensors—any device that records signals and for which this pipeline can create a generative model is a sensor, whether it maps to our sensor modalities (sight,hearing,touch) or not.

Active Learning in Autonomy and Safety. Another opportunity in our software is understanding rare events, such as tripping in an exoskeleton or crashing a car. These are examples of rare data, but are also a warning—we cannot use active learning without thinking about safety. `Closing the loop’ on any dynamic process is simultaneously powerful and dangerous, as has been true since the early days of operational amplifiers, and this danger is no less true for systems that need to collect their own data during autonomous driving, minimally-invasive surgery, and other data-dependent robotic applications.

Future of Data Collection. If we want to obtain high-quality, unbiased data as machines learn about the world, we will need to automate that process. This paper makes progress on doing so, enabling machines to reason about what to look at and how to look at it.

Prabhakar, A., Murphey, T. Mechanical intelligence for learning embodied sensor-object relationships. Nat Commun 13, 4108 (2022). https://doi.org/10.1038/s41467-022-31795-2

Web: nature.com/articles/s4146

ePDF: rdcu.be/cRIOq

Todd Murphey

Professor, Northwestern University

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Follow the Topic

Electrical and Electronic Engineering

Technology and Engineering > Electrical and Electronic Engineering

Nature Communications

Nature Communications

An open access, multidisciplinary journal dedicated to publishing high-quality research in all areas of the biological, health, physical, chemical and Earth sciences.

More about the journal

Related Collections

With Collections, you can get published faster and increase your visibility.

Women's Health

A selection of recent articles that highlight issues relevant to the treatment of neurological and psychiatric disorders in women.

Publishing Model: Hybrid

Deadline: Ongoing

Explore this Collection

Tumor Microenvironment Crosstalk and Therapeutic Implications

With this cross-journal Collection, the editors at Nature Immunology, Nature Communications, Communications Medicine and Scientific Reports invite manuscripts that highlight cutting-edge research on TME crosstalk and its therapeutic implications. Topics of interest include immune modulation and checkpoint pathways, cancer-associated fibroblasts and stromal remodeling, angiogenesis and vascular normalization, metabolic reprogramming within the TME, and the role of microbiota in tumor-immune dynamics. We also welcome studies on novel therapeutic approaches that exploit TME vulnerabilities to advance cancer treatment.

Publishing Model: Hybrid

Deadline: Nov 02, 2026

Explore this Collection

Latest Content

News and Opinion

The Governance Layer: Turning Imaging, Pathology, and Free Text into One Computable Language

From the Editors

Discover's 5th Anniversary: Discover journals shine in the 2025 indexing metrics

Behind the Paper

What Hawaiian Seaweeds Can Teach Us About Speciation

Behind the Paper

Beyond drinking water: What else can water from air do for global health?

When metabolic inflexibility becomes a cancer's Achilles heel

Cookies

We use cookies to ensure the functionality of our website, to personalize content and advertising, to provide social media features, and to analyze our traffic. If you allow us to do so, we also inform our social media, advertising and analysis partners about your use of our website. You can decide for yourself which categories you want to deny or allow. Please note that based on your settings not all functionalities of the site are available.

Further information can be found in our privacy policy.

Where Will All The Data Come From?

Share this post

Share with...

...or copy the link