The Portuguese Fisheries dataset - Protecting Data Privacy for research

Accessing fisheries datasets, especially inspection records, is difficult because of privacy concerns. Even so, careful work has produced a comprehensive dataset of 10,745 fishery inspection records spanning 2015 to 2023 in Portuguese waters.

Obtaining fisheries datasets can be quite challenging, and fisheries inspection datasets are harder still. These datasets are collected by governmental institutions that must protect the identities of the inspected vessels, as well as of the vessels conducting the inspection. This confidentiality is necessary to protect sensitive information, but it makes access to such data for research particularly difficult.

In an effort that involved overseeing two master's theses at the Portuguese Naval Academy, it was possible not only to aggregate, pre-process, and cross-reference data to create a comprehensive database from 2015 to 2023, but also to make the necessary modifications for its public release.

The data for the "Fisheries Inspection in Portuguese Waters from 2015 to 2023" dataset¹ were collected by the Portuguese Navy through standard Fiscalization Reports (FISCREP). These reports include the identification and type of the vessel, the fishing gear in use at the time, and compliance with fishing regulations, among other variables. Pre-processing involved extensive data validation and integration with existing datasets from the Directorate-General for Natural Resources, Safety, and Maritime Services, the United Nations Code for Trade and Transport Locations, and the European Union Fleet Register. This cross-referencing aimed to create the most comprehensive database possible, so that analysis would not require external databases.

The data protection strategies detailed in the paper included anonymizing the dataset to protect the privacy of those involved in fisheries inspections, which is crucial for meeting legal standards and safeguarding sensitive data. Techniques such as rounding values and adding random noise helped anonymize data points. Confidentiality was verified using the Sample Frequency Count and the Population Frequency Count, looking specifically for unique records (both counts equal to one), which pose the highest disclosure risk.
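
To make the two ideas above concrete, here is a minimal Python sketch. It is not the authors' implementation: the field names (`vessel_type`, `engine_kw`), the noise level, the rounding step, and the toy records are all hypothetical, chosen only to illustrate rounding with added noise and the check for records whose sample and population frequency counts both equal one.

```python
import random
from collections import Counter


def perturb(value, noise_sd, round_to, rng):
    """Add zero-mean Gaussian noise, then round to the nearest multiple of round_to."""
    noisy = value + rng.gauss(0.0, noise_sd)
    return round(noisy / round_to) * round_to


def key_of(record, keys):
    """Tuple of the record's values for the chosen quasi-identifiers."""
    return tuple(record[k] for k in keys)


def unique_in_both(sample, population, keys):
    """Return sample records whose key combination occurs exactly once in
    the sample AND exactly once in the population (highest disclosure risk)."""
    sample_counts = Counter(key_of(r, keys) for r in sample)
    population_counts = Counter(key_of(r, keys) for r in population)
    return [
        r for r in sample
        if sample_counts[key_of(r, keys)] == 1
        and population_counts[key_of(r, keys)] == 1
    ]


# Hypothetical toy data; the real dataset has many more variables.
population = [
    {"vessel_type": "trawler", "engine_kw": 150},
    {"vessel_type": "trawler", "engine_kw": 150},
    {"vessel_type": "purse seiner", "engine_kw": 220},
    {"vessel_type": "longliner", "engine_kw": 90},
]
sample = population[1:]  # the subset that would be released
keys = ("vessel_type", "engine_kw")

# The trawler is unique in the sample but not in the population,
# so only the purse seiner and the longliner are flagged.
risky = unique_in_both(sample, population, keys)

# Mask the engine power: noise first, then rounding to the nearest 10 kW.
rng = random.Random(0)  # fixed seed only so the sketch is reproducible
masked_kw = [perturb(r["engine_kw"], noise_sd=5.0, round_to=10, rng=rng)
             for r in sample]
```

In a workflow of this kind, records flagged as unique in both counts would typically be generalized or perturbed further and the check repeated until none remain.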

Figure 1: Geographical displacement was performed to protect real location

While protecting data privacy, the authors also assessed the data quality. This involved evaluating the dataset's integrity by examining variable distributions before and after anonymization to ensure they retained similar statistical characteristics. Correlation metrics were also used to evaluate how transformations affected variable relationships. 
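
The quality checks described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the variable names and values are hypothetical, summary statistics stand in for whatever distribution comparison the authors used, and Pearson correlation is only one possible correlation metric.

```python
import math
import statistics


def summary(values):
    """Basic descriptive statistics used to compare a variable before and after masking."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_shift(orig_x, orig_y, anon_x, anon_y):
    """Absolute change in the correlation between two variables after masking."""
    return abs(pearson(orig_x, orig_y) - pearson(anon_x, anon_y))


# Hypothetical values: engine power (kW) and vessel length (m).
engine_before = [100, 150, 200, 250, 300]
engine_after = [100, 150, 210, 250, 290]  # after rounding and noise
length = [10, 14, 19, 25, 31]             # left unchanged in this toy case

print(summary(engine_before))
print(summary(engine_after))
print(correlation_shift(engine_before, length, engine_after, length))
```

Summary statistics that stay close and a correlation shift near zero suggest the masking preserved the variable's usefulness; large differences would point to noise or rounding that is too aggressive.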

Figure 2: Example of approximately preserved distribution of the Power of main engine variable.

In summary, the process of collecting and protecting fisheries inspection data, particularly in Portuguese waters from 2015 to 2023, posed significant challenges due to the need for confidentiality and data accuracy. However, through careful aggregation, pre-processing, and cross-referencing efforts overseen by the authors, a comprehensive database was successfully compiled. This involved rigorous validation and anonymization techniques to safeguard privacy while ensuring the dataset's integrity for analysis. The resulting dataset not only facilitates robust research but also underscores the importance of balancing data protection with scientific inquiry in fisheries management and related fields.

Note: The dataset contributes to the Mar-IA² project, dedicated to maritime Artificial Intelligence. This platform emphasizes data governance and value extraction through data science and AI techniques. Its goal is to establish a national data governance model and maximize value with the help of stakeholders' collective intelligence in the maritime sector.


1 Moura, R., Pessanha Santos, N., Vala, A. et al. Fisheries Inspection in Portuguese Waters from 2015 to 2023. Sci Data 11, 362 (2024). https://doi.org/10.1038/s41597-024-03088-4

2 For more information and to access the dataset, you can visit the Mar-IA project website: https://mar-ia.pt/ 