Revealing the picture behind the construction of China’s low-carbon policy intensity dataset

Using a phrase-oriented NLP algorithm and text-based prompt learning, our new article in Scientific Data presents China’s low-carbon policy intensity dataset. It covers the national level, 31 provinces, and 334 prefecture-level cities, with sub-intensities for three policy levels, four policy objectives, and three policy instruments.
Published in Research Data

Why did we build the dataset?

Low-carbon policies are essential for facilitating the low-carbon transformation of manufacturing industries and achieving carbon neutrality in China. However, recent studies usually rely on proxy variables to quantify low-carbon policies, while composite indices of policy intensity measured by objectives and instruments focus mainly on the national level. As a result, direct and comprehensive quantification of low-carbon policies is lacking. Moreover, applying traditional fully manual policy scoring or the pre-train and fine-tune paradigm to a dataset with few samples and high manual labelling costs can lead to substantial human bias and overfitting. A new paradigm is therefore needed for evaluating and quantifying low-carbon policies.

The contribution of this dataset

Using the phrase-oriented NLP algorithm and text-based prompt learning, our latest paper in Scientific Data builds a low-carbon policy inventory of 7,282 policies and constructs the low-carbon policy intensity for China’s manufacturing industries from 2007 to 2022. The contributions are threefold.

From the methodological perspective, we provide a new paradigm with transparent and repeatable policy evaluation processes. Fully manual scoring is commonly used for policy evaluation, and articles usually explain the scoring criteria. Even so, the path from policy text to policy score remains a “black box” for readers, because most of the evaluation takes place in the researchers’ own minds. Using Python and Stata programming, this study opens the “black box” with the phrase-oriented NLP algorithm and text-based prompt learning. Because the evaluation is broken into steps, each stage can be quantified and repeated.

From the data perspective, this paper constructs a comprehensive dataset that reflects the heterogeneity of policy intensity across three policy levels, four policy objectives, and three policy instruments. Previous studies concentrate on a single policy level or on specific regions, so they cannot fully reflect overall policy changes. This is why the article and dataset include 7,282 low-carbon policies promulgated at the national level, by 31 provinces, and by 334 prefecture-level cities.

From the theoretical perspective, the paper expands the meaning of low-carbon policy intensity by adding policy level to the “policy objective–policy instrument” pattern. Low-carbon policy intensity is therefore quantified by multiplying each policy’s scores for objective, instrument, and level.
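The multiplicative structure described above can be illustrated with a minimal sketch. Only the idea of multiplying the objective, instrument, and level scores comes from the text; the class name, the score scales, and the example values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyScores:
    """Scores for a single policy. The scales used here are hypothetical."""
    objective: float   # score for the policy objective
    instrument: float  # score for the policy instrument
    level: float       # score for the issuing policy level


def policy_intensity(scores: PolicyScores) -> float:
    """Combine the three scores multiplicatively into one intensity value."""
    return scores.objective * scores.instrument * scores.level


# Illustrative values only, not real scores from the dataset.
example = PolicyScores(objective=3, instrument=2, level=4)
intensity = policy_intensity(example)
print(intensity)
```

One consequence of a product form is that a policy scoring zero on any one dimension receives zero intensity overall, whereas an additive form would still give it credit for the other two.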

Stories behind the dataset construction

Preparing the low-carbon policy texts as model input took longer than training the model. Because this paper applies a supervised prompt learning model, the quality of the training dataset directly influenced the accuracy of the predicted dataset. More than 60% of the time was therefore spent on constructing the low-carbon policy inventory, structuring and classifying the policies, and defining the labelling criteria. At this stage, the most important task was to make explicit, through code, the preparation steps that would otherwise be completed in the human brain. For example, when classifying and scoring policy objectives, we had to decide which subsections of the structured policy texts to focus on (i.e., selecting the key subsection in policy texts) and which sentences contained keywords that can measure the intensity of low-carbon policy texts (i.e., identifying essential scoring and classifying phrases).
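The two preparation steps named above (selecting a key subsection, then identifying sentences that contain scoring phrases) can be sketched as follows. This is not the authors' phrase-oriented NLP algorithm: it uses plain substring matching, and the subsection name `main_tasks`, the phrase list, and the English example text are all hypothetical stand-ins for the real policy texts.

```python
import re
from typing import Dict, List, Tuple


def select_subsection(policy: Dict[str, str], key: str) -> str:
    """Return the text of one subsection of a structured policy, or '' if absent."""
    return policy.get(key, "")


def find_scoring_sentences(text: str, phrases: List[str]) -> List[Tuple[str, List[str]]]:
    """Return (sentence, matched phrases) for each sentence containing a phrase."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        matched = [p for p in phrases if p in lowered]
        if matched:
            hits.append((sentence, matched))
    return hits


# Hypothetical structured policy and phrase lexicon, for illustration only.
policy = {
    "title": "Notice on industrial upgrading",
    "main_tasks": (
        "Enterprises must reduce carbon emissions per unit of output. "
        "Local offices will publish annual reports."
    ),
}
phrases = ["must", "reduce carbon emissions"]

hits = find_scoring_sentences(select_subsection(policy, "main_tasks"), phrases)
for sentence, matched in hits:
    print(matched, "->", sentence)
```

Writing the selection and matching rules as functions is what makes the step repeatable: the same subsection and phrase list always yield the same candidate sentences for scoring.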

Furthermore, small- and medium-scale “pilot work” was conducted before model training. Because one application of China’s low-carbon policy intensity is in quantitative studies, we needed to verify closely whether the constructed index is reliable and can be used as a policy variable. Therefore, in addition to validating the whole dataset, we carried out data validation and correlation analysis with carbon performance after finishing the manual labelling for 4, 8, and 16 provinces, respectively. These pilot verifications further safeguarded the quality of the policy index construction.

The data record and access to this dataset

Our dataset is valuable for researchers concerned with low-carbon policies and is provided in two formats (.dta and .xlsx). In addition to the inventory and the intensity of each policy, policy intensity is aggregated to the national, provincial, and prefecture levels, with sub-intensities for four policy objectives (i.e., carbon reduction, energy conservation, capacity utilization, and technology) and three policy instruments (i.e., command-and-control, market-based, and composite instruments). All data and code are openly available in the Figshare repository. Data users can therefore merge the dataset with macro- and micro-data for extended analysis and examine the impact of low-carbon policies from different perspectives.
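As a rough illustration of the merging step mentioned above, the sketch below sums policy-level intensity by province and year and joins the result to a macro-level table. The field names, the placeholder values, and the choice of a simple sum are assumptions made for this example; the published files already contain aggregated series, and their actual column names should be checked in the Figshare repository.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, int]  # (province, year)


def aggregate_by_province_year(rows: List[dict]) -> Dict[Key, float]:
    """Sum policy-level intensity within each province-year (assumed rule)."""
    totals: Dict[Key, float] = defaultdict(float)
    for row in rows:
        totals[(row["province"], row["year"])] += row["intensity"]
    return dict(totals)


def merge_with_macro(totals: Dict[Key, float], macro: Dict[Key, dict]) -> List[dict]:
    """Inner-join intensity totals with macro data on (province, year)."""
    merged = []
    for key in sorted(totals):
        if key in macro:
            province, year = key
            merged.append(
                {"province": province, "year": year,
                 "policy_intensity": totals[key], **macro[key]}
            )
    return merged


# Placeholder rows, not real data from the dataset.
policy_rows = [
    {"province": "Province A", "year": 2020, "intensity": 12.0},
    {"province": "Province A", "year": 2020, "intensity": 8.0},
    {"province": "Province B", "year": 2020, "intensity": 5.0},
]
macro_rows = {
    ("Province A", 2020): {"emissions": 100.0},
    ("Province C", 2020): {"emissions": 50.0},
}

panel = merge_with_macro(aggregate_by_province_year(policy_rows), macro_rows)
print(panel)
```

The inner join keeps only province-years present in both tables, so Province B and Province C drop out here; a user who wants to keep unmatched rows would need a different join rule.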
