Behind the Paper

RE "SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset"

Our article presents a new digital image dataset of historical handwritten birth records from Swedish parish archives. This dataset, called SHIBR (Swedish Historical Birth Records), is a valuable resource for researchers in various fields, including document analysis, genealogy, and history.

Published in Computational Sciences

Dec 07, 2024

Abbas Cheddad


The paper "SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset" presents a significant contribution to the field of historical document analysis and genealogical research. This comprehensive dataset, known as SHIBR (Swedish Historical Birth Records), is the first and largest of its kind for Swedish historical documents, offering open access to 15,000 high-resolution colour images of birth records from 1800 to 1840.

Dataset Overview

SHIBR consists of digitized handwritten birth records from various Swedish parishes, accompanied by detailed metadata. The dataset contains 191,301 indexed rows corresponding to the 15,000 images, divided into three subsets:

Training set: 133,941 indexed rows and 10,500 images
Evaluation set: 28,303 indexed rows and 2,250 images
Test set: 29,057 indexed rows and 2,250 images
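The subset counts above can be checked against the reported totals with a few lines of Python. This is only an arithmetic sketch: the dictionary layout and the names `splits` and `image_share` are illustrative and are not part of the dataset release.

```python
# Row and image counts as reported for each SHIBR subset.
splits = {
    "training": {"rows": 133_941, "images": 10_500},
    "evaluation": {"rows": 28_303, "images": 2_250},
    "test": {"rows": 29_057, "images": 2_250},
}

total_rows = sum(s["rows"] for s in splits.values())
total_images = sum(s["images"] for s in splits.values())


def image_share(name):
    """Fraction of all images that belong to the named subset."""
    return splits[name]["images"] / total_images


for name, s in splits.items():
    rows_per_image = s["rows"] / s["images"]
    print(f"{name:<10} {s['images']:>6} images ({image_share(name):.0%}), "
          f"{rows_per_image:.1f} indexed rows per image on average")
```

The subsets add up to the reported 191,301 indexed rows and 15,000 images, and the image counts correspond to a 70/15/15 split.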

Each entry in the dataset is annotated with 17 columns of information, providing a rich source of data for various research purposes. These columns include:

Child's first name
Birth date
Baptism date
Father's first and last name
Mother's first and last name
Parents' occupations
Birthplace
County and parish information
Image identifiers and file paths
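For readers who want to handle the index programmatically, one indexed row could be modelled roughly as below. This is a sketch under stated assumptions: every field name is hypothetical (the actual SHIBR column headers may differ), only the fields named in the list above are covered rather than all 17 columns, and dates are kept as strings on the assumption that some historical entries are partial or missing.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BirthRecord:
    """One indexed row. All field names are hypothetical, not SHIBR's own."""
    image_id: str
    image_path: str
    county: str
    parish: str
    # Annotation fields default to None, assuming a row may be only partly indexed.
    child_first_name: Optional[str] = None
    birth_date: Optional[str] = None
    baptism_date: Optional[str] = None
    father_first_name: Optional[str] = None
    father_last_name: Optional[str] = None
    mother_first_name: Optional[str] = None
    mother_last_name: Optional[str] = None
    father_occupation: Optional[str] = None
    mother_occupation: Optional[str] = None
    birthplace: Optional[str] = None


def missing_fields(record):
    """Names of the annotation fields that are empty for this record."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values for illustration only, not taken from the dataset.
example = BirthRecord(
    image_id="example-0001",
    image_path="images/example-0001.jpg",
    county="example county",
    parish="example parish",
    child_first_name="Anna",
)
print(missing_fields(example))
```

A helper such as `missing_fields` is one simple way to measure how completely a given row has been indexed.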

Significance and Applications

The SHIBR dataset is significant for several reasons:

Historical Value: It provides researchers and genealogists with access to valuable historical records.
Algorithm Development: The dataset can be used to develop and improve algorithms for analysing handwritten documents.
Challenging Data: The unique handwriting styles in SHIBR make it a challenging dataset for existing deep learning models, which can drive innovation in the field.
Complementary Resource: SHIBR complements the previously published ARDIS dataset, which focuses on numerical handwritten data.

Potential Applications

The SHIBR dataset has numerous potential applications:

Genealogical Research: Helping individuals trace their family history and ancestry.
Historical Studies: Providing insights into Swedish society and demographics in the early 19th century.
Machine Learning: Training and testing algorithms for handwriting recognition and document analysis.
Digital Humanities: Supporting interdisciplinary research combining computer science and historical studies.

Data Mining Insights

We conducted data mining on the dataset to uncover interesting statistics and facts:

Name Popularity: The paper presents the top 10 most common names for newborns, fathers, and mothers during the period covered by the records.
Maternal Age Distribution: A comprehensive analysis of mothers' ages across different Swedish counties is provided, offering insights into historical demographic patterns.
Occupational Trends: The dataset reveals common job titles for both men and women during the early 19th century, with farming being the predominant occupation for men.
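A top-10 name ranking of the kind described above can be sketched with Python's standard `collections.Counter`. This is not the procedure used in the paper: the normalisation step (trimming whitespace and title-casing) is an assumption, `top_names` is an illustrative helper, and the sample values are made up rather than taken from SHIBR.

```python
from collections import Counter


def top_names(names, n=10):
    """Return the n most frequent names after light normalisation."""
    # Skip empty entries, then trim whitespace and unify capitalisation.
    cleaned = (name.strip().title() for name in names if name and name.strip())
    return Counter(cleaned).most_common(n)


# Made-up sample values for illustration, not taken from the dataset.
sample = ["Anna", "anna ", "Carl", "Maria", "Anna", "", "Carl"]
print(top_names(sample, n=2))
```

The same helper could be applied to the child, father, and mother name columns in turn to produce one ranking per group.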

Comparative Analysis

The paper includes a comprehensive survey of contemporary datasets in the field, highlighting SHIBR's unique position:

Size and Scope: With 15,000 high-resolution images, SHIBR is significantly larger than many existing historical document datasets, such as the Esposalles database (173 images) or the Saint Gall database (60 images).
Language and Character Set: SHIBR is the first semi-annotated historical document image dataset featuring Swedish characters, filling a gap in the available resources for Nordic language document analysis.
Temporal Coverage: The dataset spans four decades (1800-1840), providing a substantial timeframe for studying evolving handwriting styles and societal changes.

Technical Challenges and Opportunities

We highlight several challenges presented by the SHIBR dataset:

Handwriting Variability: The unique and diverse handwriting styles in the Swedish records pose difficulties for existing deep learning models trained on other datasets.
Historical Context: The archaic language, obsolete occupations, and historical place names require specialized knowledge for accurate interpretation and annotation.
Image Quality: Although the images are high-resolution, the age of the documents introduces issues such as ink bleeding, paper degradation, and varying levels of preservation.

These challenges present opportunities for researchers to develop more robust and adaptable algorithms for historical document analysis.

Conclusion and Future Work

The SHIBR dataset represents an advancement in the availability of large-scale, semi-annotated historical document datasets. Its potential applications span multiple disciplines, including computer science, history, genealogy, and linguistics. We suggest that the dataset could be used for competitions focused on various document analysis problems, encouraging further innovation in the field.

By providing this extensive and detailed dataset as open access, we aim to stimulate research in historical document analysis, particularly for Swedish and Nordic language documents. The combination of high-quality images, detailed metadata, and the inherent challenges of historical handwriting makes SHIBR an invaluable resource for developing and testing advanced machine learning and computer vision techniques.

As research in this area progresses, we expect SHIBR to help bridge the gap between historical archives and modern digital accessibility, and potentially to change how we interact with and learn from historical records.



Hendry Izaac Elim

over 1 year ago

A very systematic research with excellent outcome and practical impacts


