RE "SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset"

Our article presents a new digital image dataset of historical handwritten birth records from Swedish parish archives. This dataset, called SHIBR (Swedish Historical Birth Records), is a valuable resource for researchers in various fields, including document analysis, genealogy, and history.
Published in Computational Sciences

SHIBR—The Swedish Historical Birth Records: a semi-annotated dataset - Neural Computing and Applications

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images of the era between 1800 and 1840). We also perform some data mining of the dataset to uncover some statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date of baptism, father’s first/last name, mother’s first/last name, death records, town, job title of the father/mother, etc.). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.

The paper contributes to historical document analysis and genealogical research. To our knowledge, SHIBR (Swedish Historical Birth Records) is the first and largest open-access dataset of its kind for Swedish historical documents, offering 15,000 high-resolution colour images of birth records from 1800 to 1840.

Dataset Overview

SHIBR consists of digitized handwritten birth records from various Swedish parishes, accompanied by detailed metadata. The dataset contains 191,301 indexed rows corresponding to the 15,000 images, divided into three subsets:

  • Training set: 133,941 indexed rows and 10,500 images
  • Evaluation set: 28,303 indexed rows and 2,250 images
  • Test set: 29,057 indexed rows and 2,250 images
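As a quick consistency check, the subset counts above can be summed and turned into proportions. The sketch below uses only the figures quoted in this post; the dictionary layout and the helper name image_share are illustrative assumptions, not part of the dataset's tooling.

```python
# Row and image counts per subset, as quoted in this post.
splits = {
    "training": {"rows": 133_941, "images": 10_500},
    "evaluation": {"rows": 28_303, "images": 2_250},
    "test": {"rows": 29_057, "images": 2_250},
}

total_rows = sum(s["rows"] for s in splits.values())
total_images = sum(s["images"] for s in splits.values())


def image_share(name: str) -> float:
    """Fraction of all images that fall in the named subset."""
    return splits[name]["images"] / total_images


for name, counts in splits.items():
    rows_per_image = counts["rows"] / counts["images"]
    print(f"{name}: {image_share(name):.0%} of images, "
          f"{rows_per_image:.1f} indexed rows per image")

print(f"total: {total_rows} rows, {total_images} images")
```

The subsets add up to the stated 191,301 rows and 15,000 images, and the images follow a 70/15/15 split.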

Each indexed row in the dataset carries 17 columns of information. Grouped by theme, these columns include:

  1. Child's first name
  2. Birth date
  3. Baptism date
  4. Father's first and last name
  5. Mother's first and last name
  6. Parents' occupations
  7. Birthplace
  8. County and parish information
  9. Image identifiers and file paths
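To illustrate how such a transcription file might be consumed, here is a minimal sketch that parses rows and groups them by image. The header names, the comma delimiter, and the sample row are all assumptions made up for illustration; the real file has 17 columns whose exact names are not given in this post.

```python
import csv
import io

# Hypothetical header and one invented row (not taken from SHIBR).
# The real transcription file has 17 columns; names and delimiter may differ.
SAMPLE = (
    "image_id,child_first_name,birth_date,baptism_date,"
    "father_first_name,father_last_name,mother_first_name,mother_last_name,"
    "father_occupation,birthplace,parish,county\n"
    "img_0001,Anna,1812-03-04,1812-03-08,"
    "Lars,Olsson,Brita,Persdotter,"
    "Farmer,Example village,Example parish,Example county\n"
)


def load_rows(text):
    """Parse transcription text into one dict per indexed row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_by_image(rows):
    """Group indexed rows under the image they were transcribed from."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["image_id"], []).append(row)
    return grouped


rows = load_rows(SAMPLE)
grouped = rows_by_image(rows)
print(sorted(grouped), rows[0]["child_first_name"])
```

Grouping by image identifier matters because one page image typically corresponds to several indexed rows, so an algorithm evaluated on an image needs all of its rows at once.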

Significance and Applications

The SHIBR dataset is significant for several reasons:

  1. Historical Value: It provides researchers and genealogists with access to valuable historical records.
  2. Algorithm Development: The dataset can be used to develop and improve algorithms for analysing handwritten documents.
  3. Challenging Data: The unique handwriting styles in SHIBR make it a challenging dataset for existing deep learning models, which can drive innovation in the field.
  4. Complementary Resource: SHIBR complements the previously published ARDIS dataset, which focuses on numerical handwritten data.

Potential Applications

The SHIBR dataset has numerous potential applications:

  1. Genealogical Research: Helping individuals trace their family history and ancestry.
  2. Historical Studies: Providing insights into Swedish society and demographics in the early 19th century.
  3. Machine Learning: Training and testing algorithms for handwriting recognition and document analysis.
  4. Digital Humanities: Supporting interdisciplinary research combining computer science and historical studies.

Data Mining Insights

We mined the dataset for statistics and facts that may be of interest to genealogists:

  1. Name Popularity: The paper presents the top 10 most common names for newborns, fathers, and mothers during the period covered by the records.
  2. Maternal Age Distribution: A comprehensive analysis of mothers' ages across different Swedish counties is provided, offering insights into historical demographic patterns.
  3. Occupational Trends: The dataset reveals common job titles for both men and women during the early 19th century, with farming being the predominant occupation for men.
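Statistics of this kind can be computed with a few lines of standard-library Python. The sketch below is not the analysis code used in the paper; the records are invented toy data, and the field names (child_first_name, county, mother_age) are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median

# Invented toy records, not drawn from SHIBR; field names are hypothetical.
records = [
    {"child_first_name": "Anna", "county": "A", "mother_age": 24},
    {"child_first_name": "Anna", "county": "A", "mother_age": 31},
    {"child_first_name": "Carl", "county": "B", "mother_age": 28},
    {"child_first_name": "Anna", "county": "B", "mother_age": 35},
    {"child_first_name": "Carl", "county": "B", "mother_age": 22},
]


def top_names(rows, field, n=10):
    """Return the n most frequent non-empty values of a name field."""
    names = Counter(row[field].strip() for row in rows if row[field].strip())
    return names.most_common(n)


def median_age_by_county(rows):
    """Return the median maternal age for each county."""
    ages = defaultdict(list)
    for row in rows:
        ages[row["county"]].append(row["mother_age"])
    return {county: median(values) for county, values in ages.items()}


print(top_names(records, "child_first_name", n=2))
print(median_age_by_county(records))
```

On real transcriptions, spelling variants of the same name would likely need normalising before counting, which this sketch does not attempt.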

Comparative Analysis

The paper includes a comprehensive survey of contemporary datasets in the field, highlighting SHIBR's unique position:

  1. Size and Scope: With 15,000 high-resolution images, SHIBR is significantly larger than many existing historical document datasets, such as the Esposalles database (173 images) or the Saint Gall database (60 images).
  2. Language and Character Set: SHIBR is the first semi-annotated historical document image dataset featuring Swedish characters, filling a gap in the available resources for Nordic language document analysis.
  3. Temporal Coverage: The dataset spans four decades (1800-1840), providing a substantial timeframe for studying evolving handwriting styles and societal changes.

Technical Challenges and Opportunities

We highlight several challenges presented by the SHIBR dataset:

  1. Handwriting Variability: The unique and diverse handwriting styles in the Swedish records pose difficulties for existing deep learning models trained on other datasets.
  2. Historical Context: The archaic language, obsolete occupations, and historical place names require specialized knowledge for accurate interpretation and annotation.
  3. Image Quality: Although the images are high-resolution, the age of the documents introduces issues such as ink bleeding, paper degradation, and varying levels of preservation.

These challenges present opportunities for researchers to develop more robust and adaptable algorithms for historical document analysis.
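One common way to score a word spotting system is mean average precision (mAP) over ranked retrieval lists. This post does not state which metric the paper reports, so the sketch below is only an assumption about how such an evaluation could look; the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance[i] is True when the i-th retrieved word image matches
    the query. Assumes the ranking covers every candidate, so all relevant
    items appear somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two toy queries: one perfect ranking, one with the match at rank 2.
print(mean_average_precision([[True], [False, True]]))
```

A model pre-trained on other handwriting tends to push true matches further down the ranking on unfamiliar scripts, which lowers this score.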

Conclusion and Future Work

The SHIBR dataset advances the availability of large-scale, semi-annotated historical document datasets. Its potential applications span multiple disciplines, including computer science, history, genealogy, and linguistics. We suggest that the dataset could be used for competitions on a range of document analysis problems, including word spotting, to encourage further innovation in the field.

By providing this extensive and detailed dataset as open access, we aim to stimulate research in historical document analysis, particularly for Swedish and Nordic language documents. The combination of high-quality images, detailed metadata, and the inherent challenges of historical handwriting makes SHIBR a valuable resource for developing and testing machine learning and computer vision techniques.

As research in this area progresses, we expect SHIBR to help bridge the gap between historical archives and modern digital access, and potentially to change how we interact with and learn from historical records.

Comment from Hendry Izaac Elim (about 1 month ago):

A very systematic research with excellent outcome and practical impacts 
