Behind the Paper

Citizen science and traditional research are allies

A neural network to recognise butterfly species was trained on over 500,000 photos collected by more than 25,000 volunteers, and the results were critically evaluated. The dataset, scripts and models used are now available to the interested public.

Published in Ecology & Evolution, Protocols & Methods, and Computational Sciences

Aug 06, 2025

Johannes Rüdisser

Ecologist, University of Innsbruck

Citizen science and traditional university research are often perceived and portrayed as opposites. In reality, however, they have much more in common than they have differences. Ultimately, both citizen science projects and traditional professional research are subject to the same quality criteria of good scientific practice, such as objectivity, correctness or accuracy, transparency, verifiability, integrity, and relevance.

In fact, citizen science projects and traditional research increasingly support and enrich each other. Current research results from the Viel-Falter Monitoring project illustrate this clearly. Using over 500,000 butterfly photos collected by more than 25,000 volunteers as part of a broad-based citizen science project, a neural network was trained to recognise butterfly species, and the results were critically evaluated. This was done using one of the most powerful computers in Europe. The dataset, scripts, and models used are now available to the interested public.

Automated identification of species using machine learning models (so-called artificial intelligence) is familiar to users of apps such as iNaturalist, Flora Incognita, and many others. However, few people consider what is required to develop such identification tools and then apply them successfully. Many are also unaware of how much training data and computing power effective identification models require.

Friederike Barkmann, a doctoral student with the Austrian Butterfly Monitoring Viel-Falter, addressed these questions in her second Master's thesis, in the field of data science. She also critically evaluated the accuracy of identification for each species. To achieve this, she used a very large dataset that volunteers have compiled over the last ten years as part of the Butterflies of Austria project run by the Billa Foundation Blühendes Österreich. Based on over 500,000 images, a neural network was trained to recognise butterfly species. Training such models requires not only a lot of data but also a great deal of computing power, so access to powerful computers was essential. The high-performance computer at the University of Innsbruck, LEO5, initially served this purpose well. However, as the model runs took several hours even on this supercomputer, the process was first optimized through parallelization with the help of supercomputing expert Andreas Lindner from EuroCC Austria. This means that several processors (GPUs) are connected to each other and work on one computing task together.
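
To make the idea of parallelization more concrete, here is a minimal sketch of one common scheme, data parallelism. This is an assumption for illustration only: the post does not say which parallelization strategy was used on LEO5 or LEONARDO. The toy model `y = w * x`, the function names and the numbers are all hypothetical. Each "worker" stands in for one GPU: it computes a gradient on its own share of the batch, and the averaged gradient is then used to update the shared model.

```python
def gradient(w, shard):
    """Gradient of the mean squared error of the toy model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split(batch, n_workers):
    """Split a batch into equally sized shards (batch length must divide evenly)."""
    size = len(batch) // n_workers
    return [batch[i * size:(i + 1) * size] for i in range(n_workers)]


def data_parallel_step(w, batch, n_workers, lr):
    """One training step: per-worker gradients, averaged, then one shared update."""
    shards = split(batch, n_workers)
    # On real hardware these gradients would be computed at the same time,
    # one per GPU; here they are computed one after another for simplicity.
    local_grads = [gradient(w, shard) for shard in shards]
    # Averaging plays the role of the "all-reduce" communication step.
    avg_grad = sum(local_grads) / n_workers
    return w - lr * avg_grad


# Hypothetical toy data that follows y = 2 * x exactly.
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, n_workers=2, lr=0.01)

print(f"learned weight: {w:.6f}")
```

With equally sized shards, the averaged gradient is identical to the gradient over the whole batch, so the result is the same as on a single processor while each worker only handles a fraction of the data.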

The EuroCC Austria project ultimately provided access to the LEONARDO supercomputer, one of the most powerful in Europe, and supported the implementation with expertise in the field of high-performance computing. This made it possible to train the first models that correctly identified 97% of all images. This high level of identification accuracy demonstrates that such models are well suited to providing app users with feedback on their observations. Accuracy can be increased further by removing images for which the model's identification is uncertain. These images could then be re-identified by experts, for example. This approach could save considerable time in the re-identification and quality control of citizen science data while still ensuring high data quality. The evaluation also documented that some species are easier to identify than others. Species groups that can be challenging even for experts, such as the family of skippers and the genus Erebia, are also more difficult for the computer model to identify.
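
The filtering idea described above can be sketched in a few lines. This is a hedged illustration, not the scripts published with the data paper: the record layout, the function names, the threshold of 0.8 and the five example predictions are all invented for this sketch and are not results from the study. Predictions below a confidence threshold are set aside for expert review, and accuracy is then computed overall, on the retained images, and per species.

```python
from collections import defaultdict


def filter_by_confidence(predictions, threshold):
    """Return (kept, flagged): predictions at or above the threshold, and the rest."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    flagged = [p for p in predictions if p["confidence"] < threshold]
    return kept, flagged


def accuracy(predictions):
    """Fraction of predictions whose predicted species matches the true species."""
    if not predictions:
        return float("nan")
    correct = sum(p["predicted"] == p["true"] for p in predictions)
    return correct / len(predictions)


def per_species_accuracy(predictions):
    """Accuracy computed separately for each true species."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for p in predictions:
        totals[p["true"]] += 1
        hits[p["true"]] += int(p["predicted"] == p["true"])
    return {species: hits[species] / totals[species] for species in totals}


# Invented toy predictions (NOT data from the study).
predictions = [
    {"true": "Aglais io", "predicted": "Aglais io", "confidence": 0.99},
    {"true": "Aglais io", "predicted": "Aglais io", "confidence": 0.95},
    {"true": "Erebia medusa", "predicted": "Erebia aethiops", "confidence": 0.55},
    {"true": "Erebia medusa", "predicted": "Erebia medusa", "confidence": 0.90},
    {"true": "Pyrgus malvae", "predicted": "Pyrgus alveus", "confidence": 0.60},
]

kept, flagged = filter_by_confidence(predictions, threshold=0.8)
print("accuracy, all images:     ", accuracy(predictions))
print("accuracy, retained images:", accuracy(kept))
print("images sent to experts:   ", len(flagged))
print("per-species accuracy:     ", per_species_accuracy(predictions))
```

The threshold is a trade-off: a higher value raises accuracy on the retained images but sends more images to the experts.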

The dataset, which includes butterfly photographs, computer scripts, and models, was published as part of a data paper. In the spirit of open (citizen) science, it has been made available to the general public. It is significantly larger than the datasets used in similar studies to date. It is a valuable resource for further research and can help to improve identification algorithms such as those used in iNaturalist and other apps.

This closes the circle: citizen science initiatives support and expand scientific research, which in turn develops methods and techniques that further expand the possibilities of citizen science. However, this cross-fertilization is only possible if the various stakeholders collaborate and solve problems together. Given the global biodiversity and climate crises, there are plenty of problems to solve.

Barkmann, F., Lindner, A., Würflinger, R., Höttinger, H., Rüdisser, J. (2025) Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels. Sci Data 12 (1), 1369. https://doi.org/10.1038/s41597-025-05708-z
