A synchronized multimodal neuroimaging dataset for studying brain language processing

We present a synchronized multimodal neuroimaging dataset for studying brain language processing (SMN4Lang) that contains fMRI and MEG data from the same 12 healthy volunteers, recorded while they listened to 6 hours of naturalistic stories.

When the brain processes language, neurons in multiple brain regions must work together in real time. Neuroimaging data with both high temporal and high spatial resolution are therefore crucial for studying the brain's language-processing mechanisms. Existing open-source datasets are mainly collected for English, include only a single neuroimaging modality, such as high-spatial-resolution functional MRI (fMRI) or high-temporal-resolution magnetoencephalography (MEG), and mostly use less than 1 hour of experimental material, so they cannot support more comprehensive brain research with computational models that require large amounts of data. To this end, we collected and processed what is, to our knowledge, the largest and most informative synchronized multimodal neuroimaging dataset for language research to date. The paper introducing this dataset has been published in Scientific Data, a Nature Portfolio journal (https://rdcu.be/cWDSx).

Figure 1 Schematic overview of the study procedure. a. The participants followed the instructions on the screen and listened to stories while their brain activity was recorded by fMRI and MEG. b. Participants lay in the MRI machine while structural and resting-state MRI data were recorded.

The dataset contains fMRI and MEG data collected while 12 subjects listened to stories for about 6 hours, together with T1- and T2-weighted structural images, diffusion MRI, and resting-state MRI for each subject. The collection process is shown in Figure 1. To facilitate the study of brain language-processing mechanisms with computational models, all story materials were manually annotated with syntactic structure trees, and the audio time points, word frequencies, and various character and word vectors corresponding to each word in the text were computed, as shown in Figure 2. All quality metrics match or exceed those of comparable existing datasets, providing sufficient quality assurance. This is by far the largest multimodal neuroimaging dataset for brain language-processing research, and the first large-scale Chinese multimodal neuroimaging dataset. Its public release provides important support for all-around research on scientific questions such as how the brain mobilizes different regions, and how those regions work together, when understanding words, phrases, and sentences in real-world scenarios. Importantly, the data cover nearly 10,000 Chinese words. This coverage is of great value for exploring the relationship between computational language models and the language-processing mechanisms of the human brain, and for exploring how neuroimaging data can improve existing computational language models so as to build a new generation of brain-inspired neural language models.

Figure 2 An example of annotation information for the stimuli. a. Speech-to-text alignment. b. Linguistic annotations of characters. c. Linguistic annotations of words. d. Part-of-speech tag annotations. e. Constituency tree annotations. f. Dependency tree annotations.
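
As a rough illustration of how word-level annotations of this kind might be combined with imaging data, the sketch below groups words by the imaging volume in which they begin and averages a word-frequency value per volume. It is a minimal sketch under stated assumptions, not the dataset's actual file format or the authors' pipeline: the `WordAnnotation` fields, the toy words and their values, and the 1-second volume spacing are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WordAnnotation:
    """Hypothetical record for one annotated word (field names are assumptions)."""
    word: str
    onset_s: float   # start time of the word in the audio, in seconds
    offset_s: float  # end time of the word in the audio, in seconds
    pos: str         # part-of-speech tag
    log_freq: float  # log word frequency


def words_per_volume(words, tr_s):
    """Group words by the index of the imaging volume in which they start."""
    if tr_s <= 0:
        raise ValueError("tr_s must be positive")
    bins = defaultdict(list)
    for w in words:
        bins[int(w.onset_s // tr_s)].append(w)
    return dict(bins)


def mean_log_freq_per_volume(words, tr_s, n_volumes):
    """Average log frequency of the words starting in each volume (0.0 if none)."""
    bins = words_per_volume(words, tr_s)
    means = []
    for i in range(n_volumes):
        in_bin = bins.get(i, [])
        if in_bin:
            means.append(sum(w.log_freq for w in in_bin) / len(in_bin))
        else:
            means.append(0.0)
    return means


# Toy example with made-up values; the 1.0 s volume spacing is an assumption.
toy_words = [
    WordAnnotation("我", 0.10, 0.30, "PN", 4.0),
    WordAnnotation("喜欢", 0.45, 0.90, "VV", 3.0),
    WordAnnotation("故事", 1.20, 1.70, "NN", 2.0),
]
feature = mean_log_freq_per_volume(toy_words, tr_s=1.0, n_volumes=3)
```

A per-volume feature like `feature` could then serve as one regressor in an encoding model. For MEG, the same annotations could instead be kept at their original time points, since the higher temporal resolution makes binning less necessary.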


Wang, S., Zhang, X., Zhang, J. et al. A synchronized multimodal neuroimaging dataset for studying brain language processing. Sci Data 9, 590 (2022). https://doi.org/10.1038/s41597-022-01708-5
