Behind the Paper

An fMRI Dataset for Concept Representation with Semantic Feature Annotations

A large-scale fMRI dataset of vocabulary comprehension for exploring the language mechanisms of the human brain.

Published in Research Data

Nov 25, 2022

Zhang Yunhao

Ph.D. candidate, Institute of Automation, Chinese Academy of Sciences


Language cognition is one of the most significant manifestations of human intelligence, yet we know little about the language mechanisms of the brain: how it represents lexical meanings in a multimodal environment, how it integrates lexical meanings into the meanings of larger language units, and how it carries out knowledge memory and reasoning. Recent cognitive neuroscience research, combining neuroimaging with computational modelling, has been able to infer, to a certain extent, the objects a person observes and the words a person thinks of from patterns of brain activity. These studies give researchers hope of revealing the neural mechanisms of semantic memory and decoding the mapping between thought and neural activity. Although current research has made some breakthroughs on the semantic representation of objects and nouns, other types of words (such as verbs, abstract words and function words) still need further exploration. Moreover, most existing studies focus on English, and few address neural coding in Chinese, so the similarities and differences in brain coding across languages remain unclear. One of the most important reasons is the lack of neuroimaging datasets for large-scale vocabulary comprehension.

Figure 1: Schematic overview of the study procedure

To address these problems, our team (the Natural Language Processing Group of the Institute of Automation, Chinese Academy of Sciences) constructed a neuroimaging (fMRI) dataset of the brain's comprehension of Chinese words, named an fMRI Dataset for Concept Representation with Semantic Feature Annotations (CRSF), which has been published in Scientific Data. We aim to provide a data basis for subsequent in-depth research on the language mechanisms of the brain. For CRSF, we collected about 58 hours of neuroimaging data, along with annotation data from 126 participants for 54 semantic features (Figure 1). The fMRI data were recorded while 11 participants thought of 672 individual concepts, including both concrete and abstract concepts. The concepts were probed using words paired with images, and the words were selected to cover a wide range of semantic categories. In addition, following componential theories of concept representation, we provide crowdsourced ratings of the 672 concepts on 54 semantic features spanning sensory, motor, spatial, temporal, affective, social, and cognitive experiences (Figure 2). To facilitate research on brain language mechanisms with computational models, we also provide several kinds of embeddings for the 672 concepts, including static word embeddings, contextual word embeddings and visual embeddings (Figure 3).

Figure 2: Domains and meaning components in the semantic feature annotation data

Figure 3: Different kinds of embeddings

Our team has also released a synchronized multimodal neuroimaging dataset for studying brain language processing (SMN4Lang), likewise published in Scientific Data. It includes functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) data from the same 12 healthy volunteers, who listened to 6 hours of naturalistic stories, as well as high-resolution structural (T1, T2), diffusion MRI and resting-state fMRI data for each participant.

CRSF and SMN4Lang provide a way to explore how the brain recruits different regions and how those regions work together when understanding words, phrases and sentences in realistic settings. Using CRSF and SMN4Lang, we can not only study the cognitive mechanisms by which the brain understands Chinese but also explore the relationship between computational language models and language processing in the human brain. In addition, we can study how to use neuroimaging data to improve existing computational language models and thereby build more effective ones.
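As a rough illustration of how such a relationship might be explored, the sketch below fits a simple encoding model that predicts voxel responses from a concept-by-feature matrix and scores it on held-out concepts. Everything here is an assumption made for illustration: the data are synthetic stand-ins with the dataset's dimensions (672 concepts, 54 features), the voxel count and noise level are arbitrary, the function names are hypothetical, and ridge regression is one common choice rather than the method used in the paper. No file layout of the released dataset is assumed.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

def columnwise_pearson(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom

def encoding_score(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train concepts, return per-voxel correlation on the rest."""
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = responses[:n_train], responses[n_train:]
    # Standardize features with training statistics only, to avoid test leakage.
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    X_train, X_test = (X_train - mu) / sd, (X_test - mu) / sd
    y_mu = Y_train.mean(axis=0)
    W = ridge_fit(X_train, Y_train - y_mu, alpha)
    predicted = X_test @ W + y_mu
    return columnwise_pearson(predicted, Y_test)

# Synthetic stand-in data (NOT the real CRSF data): 672 concepts, 54 features,
# and an arbitrary 200 "voxels" generated from a random linear map plus noise.
rng = np.random.default_rng(0)
n_concepts, n_features, n_voxels = 672, 54, 200
features = rng.normal(size=(n_concepts, n_features))
true_weights = rng.normal(size=(n_features, n_voxels))
noise = rng.normal(scale=5.0, size=(n_concepts, n_voxels))
responses = features @ true_weights + noise

scores = encoding_score(features, responses, n_train=540)
print("voxels scored:", scores.shape[0])
print("mean held-out correlation:", round(float(scores.mean()), 3))
```

With real data, the feature matrix could be the semantic feature ratings or any of the provided embeddings, and comparing held-out scores across feature sets is one way to ask which representation better matches a given brain region.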

Prof. Shaonan Wang is the first and corresponding author of this article. Yunhao Zhang, Xiaohan Zhang, Jingyuan Sun, Prof. Nan Lin, Prof. Jiajun Zhang and Prof. Chengqing Zong made important contributions to it.

Citation:

Wang, S., Zhang, Y., Zhang, X. et al. An fMRI Dataset for Concept Representation with Semantic Feature Annotations. Sci Data 9, 721 (2022). https://doi.org/10.1038/s41597-022-01840-2

Wang, S., Zhang, X., Zhang, J. et al. A synchronized multimodal neuroimaging dataset for studying brain language processing. Sci Data 9, 590 (2022). https://doi.org/10.1038/s41597-022-01708-5



