Ladyna Wittscher

Researcher, Friedrich-Schiller-University Jena
  • Germany

Channels contributed to:

Behind the Paper

Recent Comments

May 31, 2026

Thank you so much for your interest in the topic!

Based on the survey, I would say that the importance of the pretraining data is still often underestimated. There is a natural tendency to focus on the latest SSL algorithms or architectures, but many studies suggest that the size, quality, and especially the domain alignment of the pretraining data can affect performance just as much as the choice of algorithm or architecture. Since the downstream dataset is usually fixed when training a self-supervised model from scratch, and the number of suitable pretraining datasets is often limited, it becomes particularly important to align the remaining design choices with the available data.

Still, it is important to add that there is no universally dominant design choice and that much more research is needed.