What inspired this research?
Cultural heritage researchers often have to sift through a mountain of data related to the cultural items they study, including reports, museum records, news, and databases. These sources contain a significant amount of unstructured and semi-structured data, including ownership histories ('provenance'), object descriptions, and timelines, which presents an opportunity to leverage automated systems. Recognising the scale and importance of the issue, researchers at the Italian Institute of Technology's Centre for Cultural Heritage Technology have fine-tuned three natural language processing (NLP) models to distil key information from these unstructured texts. This work was performed within the scope of the EU-funded RITHMS project, which has built a digital platform for law enforcement to trace illicit cultural goods using social network analysis (SNA). The research team aimed to fill a critical gap: how do we transform complex textual records into clean, structured, analysable data?
What’s the key innovation?
The paper introduces a streamlined pipeline to create custom, domain-specific datasets from textual heritage records, then trains and fine-tunes NLP models (derived from spaCy) to perform named entity recognition (NER) on challenging inputs like provenance records, museum registries, and records of stolen and missing art and artefacts. It also evaluates zero-shot models such as GLiNER, and employs Meta’s Llama3 (8B) to bootstrap high-quality annotations, minimising the need for manual labelling. The result? Fine-tuned transformer models (especially on provenance data) significantly outperformed out-of-the-box models, highlighting the power of small, curated training sets in a specialised domain.
How was it done?
Three datasets were assembled from public and institutional databases:
- Provenance-rich records from the AAMD Object Registry (part of the Association of Art Museum Directors' initiatives) which records under-provenanced objects in North American museums,
- Object descriptions from the Romanian Police’s “Obiecte Furate” database of stolen art, and
- WWII-looted items catalogued by the Polish Division of Looted Art.
Llama3 generated initial entity labels over ~400 entries per dataset; domain experts then refined these to form the gold standards. These annotations fuelled the fine-tuning of two spaCy models, and performance was rigorously evaluated across training, validation, and test splits.
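The bootstrap-then-correct workflow above yields annotated records that can be partitioned for training. A minimal sketch of that data shape and split, with an invented provenance line and hypothetical character-span annotations (the record format and split ratios are illustrative assumptions, not the paper's exact configuration):

```python
import random

# Hypothetical shape of one LLM-bootstrapped, expert-corrected record:
# the raw text plus character-span entity annotations (start, end, label).
# The provenance line below is invented, not taken from the AAMD registry.
records = [
    {
        "text": "Purchased by the Cleveland Museum of Art from J. J. Klejman, New York, 1967.",
        "entities": [(17, 40, "ORG"), (46, 59, "PERSON"), (61, 69, "GPE"), (71, 75, "DATE")],
    },
    # ... ~400 such entries per dataset in the paper ...
]

def split_records(records, train=0.8, val=0.1, seed=42):
    """Shuffle and split annotated records into train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * train), int(n * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

train_set, val_set, test_set = split_records(records)
```

Character-span triples like these map directly onto spaCy's training format, which is one reason span-based annotation is a convenient interchange shape between an LLM annotator, human correctors, and the fine-tuning step.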
What did they find?
- Provenance data yielded the strongest model performance. F1-scores were especially high when applying transformer models fine-tuned on the AAMD dataset, thanks to its consistent, semi-structured provenance format.
- Zero-shot models and generic spaCy models fell short, especially on descriptive or loosely formatted texts, highlighting the importance of domain adaptation.
- Certain tags such as LOC (Locations) or WORK_OF_ART (Artifacts) remained challenging, likely due to sparse examples. Models also occasionally misclassified entities when punctuation or titles were stripped, emphasising the value of preserving structure in data pre-processing steps.
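The per-tag scores behind findings like these come from exact-span matching: a prediction counts only if start, end, and label all agree with the gold annotation. A minimal sketch of that evaluation (the example spans are invented; this is the standard strict NER metric, not the authors' exact evaluation code):

```python
from collections import defaultdict

def per_label_f1(gold, pred):
    """Strict span-level precision/recall/F1 per entity label.
    gold, pred: lists of (start, end, label) tuples for one document."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_set, pred_set = set(gold), set(pred)
    for span in pred_set:
        (tp if span in gold_set else fp)[span[2]] += 1
    for span in gold_set - pred_set:
        fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = (p, r, f1)
    return scores

# Invented example: the LOC span is mislabelled as GPE, the artwork is missed.
gold = [(0, 10, "PERSON"), (15, 23, "LOC"), (30, 42, "WORK_OF_ART")]
pred = [(0, 10, "PERSON"), (15, 23, "GPE")]
scores = per_label_f1(gold, pred)
```

Under this strict metric a single confusion (LOC predicted as GPE) costs both a false positive and a false negative, which is partly why rare tags with few training examples score so poorly.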
Why does it matter?
This methodology empowers cultural heritage experts and investigators to transform messy textual records into structured entities, enhancing the integration of data in knowledge graphs (KGs) that can help uncover hidden trafficking networks or institutional relations. Indeed, the RITHMS team applied their models to build a KG with over 72,000 entities and 110,000 relationships, which has helped detect influential actors as well as previously overlooked figures connected to illicit trade. In short, the article details a scalable approach that bridges AI and cultural heritage protection.
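To make the NER-to-SNA connection concrete, here is a toy sketch of how extracted entities can become graph nodes and how a simple centrality measure flags well-connected actors. The triples and names are invented, and the RITHMS platform's actual graph tooling and metrics are not described here; degree centrality is just the simplest SNA signal:

```python
from collections import defaultdict

# Illustrative (entity, relation, entity) triples such as an NER + relation
# extraction step might emit; these names are invented, not from the RITHMS KG.
triples = [
    ("Dealer A", "sold_to", "Museum X"),
    ("Dealer A", "sold_to", "Collector B"),
    ("Collector B", "donated_to", "Museum X"),
    ("Dealer A", "acquired_from", "Excavation Site Y"),
]

# Build an undirected adjacency structure over the extracted entities.
adjacency = defaultdict(set)
for head, _relation, tail in triples:
    adjacency[head].add(tail)
    adjacency[tail].add(head)

# Degree centrality: neighbours / (n - 1), a first-pass signal for
# influential actors in a trafficking network.
n = len(adjacency)
centrality = {node: len(neigh) / (n - 1) for node, neigh in adjacency.items()}
most_connected = max(centrality, key=centrality.get)
```

At the scale of the RITHMS graph one would use a dedicated graph library and richer measures (betweenness, community detection) rather than raw degree, but the pipeline shape is the same: entities from NER become nodes, relations become edges, and network metrics surface the actors worth investigating.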
What’s next?
The authors propose refinements like tag consolidation (e.g., combining LOC with GPE), smarter preprocessing (e.g., isolating the most informative text segments), and building relation extraction modules to enrich graphs beyond entity identification. The three fine-tuned models have been made available on GitHub for other researchers and law enforcement who are interested in testing the performance on their own specialised datasets.
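The proposed tag consolidation amounts to a label remapping applied before (re)training or scoring. A minimal sketch, assuming span-based annotations; the merge map below shows only the LOC-to-GPE example the authors give, and any further merges would be their call:

```python
# Assumed consolidation map: LOC -> GPE is the merge proposed in the article;
# additional merges are hypothetical.
TAG_MERGES = {"LOC": "GPE"}

def consolidate(entities, merges=TAG_MERGES):
    """Rewrite (start, end, label) spans according to a consolidation map."""
    return [(start, end, merges.get(label, label)) for start, end, label in entities]

spans = [(0, 5, "LOC"), (10, 18, "PERSON")]
consolidated = consolidate(spans)
```

Merging sparse tags into better-populated ones trades label granularity for more training examples per class, which is often a net win for the rare tags flagged as problematic above.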