Citizen science and traditional research are allies

A neural network was trained to recognise butterfly species using over 500,000 photos collected by more than 25,000 volunteers, and the results were critically evaluated. The data set, scripts and models used are now available to the interested public.

Citizen science and traditional university research are often perceived and portrayed as opposites. In reality, however, they have much more in common than they have differences. Ultimately, both citizen science projects and traditional professional research are subject to the same quality criteria of good scientific practice, such as objectivity, correctness or accuracy, transparency, verifiability, integrity, and relevance.

In fact, citizen science projects and traditional research are increasingly supporting and enriching each other. Current research results from the Viel-Falter Monitoring project illustrate this clearly. Using over 500,000 butterfly photos collected by more than 25,000 volunteers as part of a broad-based citizen science project, a neural network was trained to recognise butterfly species, and the results were critically evaluated. This was done using one of the most powerful computers in Europe. The dataset, scripts, and models used are now available to the interested public.

Automated identification of species using machine learning models (so-called artificial intelligence) is familiar to users of apps such as iNaturalist, Flora Incognita, and many others. However, few people consider what is required to develop such identification tools successfully and then apply them. Many are also unaware of the amount of training data and computing power required to develop effective identification models.

Friederike Barkmann, a doctoral student with the Austrian Butterfly Monitoring Viel-Falter, addressed these questions in her second Master's thesis in the field of data science. She also critically evaluated the accuracy of identification for each species. To achieve this, she used a very large data set that volunteers have compiled over the last ten years as part of the Butterflies of Austria project run by the Billa Foundation Blühendes Österreich. Based on over 500,000 images, a neural network was trained to recognise butterfly species. Training such models requires not only a lot of data but also a great deal of computing power, so access to powerful computers was essential. The high-performance computer at the University of Innsbruck, LEO5, initially served this purpose well. However, as the model runs took several hours even on this supercomputer, the process was first optimized through parallelization with the help of supercomputing expert Andreas Lindner from EuroCC Austria. Parallelization means that several processors (GPUs) are connected to each other and work together on a single computing task.
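To give a rough idea of what such parallelization can look like, here is a minimal sketch of one common scheme, data parallelism: the training data is split into shards, each worker computes its share of the gradient, and the partial results are combined. This is an illustration only. The post does not say which scheme the project used, Python threads merely stand in for GPUs, and the function names, the one-parameter model and the data are all invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(shard, w):
    """Sum of the gradients of (w * x - y) ** 2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def parallel_gradient(data, w, n_workers):
    """Average gradient over all data, computed shard by shard in parallel.

    Hypothetical helper: each thread plays the role of one GPU.
    """
    # Give every worker an interleaved slice of the data.
    shards = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda s: shard_gradient(s, w), shards))
    # Combine the partial sums, then average over the full data set.
    return sum(partials) / len(data)


# Invented toy data following y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

serial = shard_gradient(data, w) / len(data)
parallel = parallel_gradient(data, w, n_workers=2)
print(serial, parallel)
```

Both variants give the same gradient. The parallel version only distributes the work, which is why adding processors can shorten training time without changing what the model learns in each step.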

Ultimately, the EuroCC Austria project provided access to the LEONARDO supercomputer - one of the most powerful in Europe - and supported the implementation with expertise in the field of high-performance computing. This made it possible to train the first models that could correctly identify 97% of all images. This high level of identification accuracy demonstrates that such models are well suited to providing app users with feedback on their observations. Accuracy can also be increased by removing images with uncertain identifications. These images could then be re-identified by experts, for example. This approach could save considerable time in the re-identification and quality control of citizen science data. At the same time, it ensures high data quality. The evaluation also documented that some species are easier to identify than others. Species groups that can be challenging even for experts, such as the family of skippers and the genus Erebia, are also more difficult for the computer model to identify.
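The idea of setting aside uncertain identifications and of checking accuracy species by species can be sketched in a few lines. The following is a minimal illustration, not the project's published scripts: it assumes that the model reports a confidence value between 0 and 1 for each prediction, and the threshold of 0.8, the function names and the five example records are invented.

```python
from collections import defaultdict


def split_by_confidence(predictions, threshold):
    """Separate confident predictions from those to be passed on to experts."""
    accepted = [p for p in predictions if p["confidence"] >= threshold]
    for_review = [p for p in predictions if p["confidence"] < threshold]
    return accepted, for_review


def accuracy(predictions):
    """Share of predictions whose predicted species matches the true species."""
    if not predictions:
        return None
    correct = sum(p["predicted"] == p["true"] for p in predictions)
    return correct / len(predictions)


def accuracy_per_species(predictions):
    """Accuracy computed separately for each true species."""
    groups = defaultdict(list)
    for p in predictions:
        groups[p["true"]].append(p)
    return {species: accuracy(group) for species, group in groups.items()}


# Invented example records; the confidence values are assumptions.
predictions = [
    {"true": "Vanessa atalanta", "predicted": "Vanessa atalanta", "confidence": 0.99},
    {"true": "Vanessa atalanta", "predicted": "Vanessa atalanta", "confidence": 0.95},
    {"true": "Erebia aethiops", "predicted": "Erebia aethiops", "confidence": 0.90},
    {"true": "Erebia aethiops", "predicted": "Erebia ligea", "confidence": 0.55},
    {"true": "Pieris rapae", "predicted": "Pieris rapae", "confidence": 0.97},
]

accepted, for_review = split_by_confidence(predictions, threshold=0.8)
print(accuracy(predictions), accuracy(accepted), len(for_review))
print(accuracy_per_species(predictions))
```

In this toy example, accuracy rises once the single low-confidence image is set aside for expert review, and the per-species breakdown shows the Erebia records dragging the overall figure down. The trade-off is that a stricter threshold sends more images to the experts.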

The dataset, which includes butterfly photographs, computer scripts, and models, was published as part of a data paper. In the spirit of open (citizen) science, it has been made available to the general public. It is significantly larger than the datasets used in similar studies to date. It is a valuable resource for further research and can help to improve identification algorithms such as those used in iNaturalist, among others.

This closes the circle: citizen science initiatives support and expand scientific research, which in turn develops methods and techniques that further expand the possibilities of citizen science. However, this cross-fertilization is only possible if the various stakeholders collaborate and solve problems together. Given the global biodiversity and climate crises, there are plenty of problems to solve.

Barkmann, F., Lindner, A., Würflinger, R., Höttinger, H., Rüdisser, J. (2025) Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels. Sci Data 12 (1), 1369. https://doi.org/10.1038/s41597-025-05708-z
