Behind the Paper

The human choices that impact AI – A survey on design choices for self-supervised learning in computer vision

Behind every self-supervised vision model lies a chain of human design choices that shape its performance, robustness, and transferability. Choices regarding pretext data, pretext tasks, model architecture, and transfer strategies matter. Successful self-supervision depends on their alignment.

The hidden design choices behind self-supervised vision models

Anyone who has ever trained a neural network from scratch knows how much the final result can depend on decisions made before training even begins. Hyperparameters shape the training conditions, influence the model architecture, and can strongly affect performance. Yet, compared with the attention given to new architectures and algorithms, there is still relatively little research on how sensitive models are to these decisions. And hyperparameters are not the only important choices developers have to make when setting up their own model: Which datasets should be used? Which training algorithm should be applied? Which choices are most relevant and therefore should be considered most carefully? Which ones interact and should be considered jointly rather than in isolation?

These questions are important not only to developers but also to the broader public, because these human design choices can influence the performance, robustness, and behaviour of artificial intelligence. I have often asked myself these questions when training self-supervised models from scratch, and I am very interested in the human influence on AI. This was one of the main motivations behind my paper. I wanted to know which choices were most relevant and should be considered most carefully. Additionally, I have been asking myself how much computer vision research suffers from over-adaptation to benchmark datasets like ImageNet and how easily we can generalize to other datasets and domains.

Why self-supervised learning makes design more complex

Self-supervised learning is a very popular approach to training neural networks because it reduces the need for annotated data. It does so by placing a pretext task, in which the model predicts self-generated pseudolabels, before the actual downstream task. For example, parts of an image can be masked, and the model has to predict the missing content from the visible regions. This first pretraining step should allow the model to learn semantic knowledge that is beneficial for the downstream task. The model can be pretrained on the same dataset that will later be used for the downstream task, which is known as self-pretraining. Alternatively, it can be pretrained on another dataset from the same domain, known as in-domain pretraining, or on a dataset from a different domain, known as out-of-domain pretraining. After the pretext task, the learned representation has to be transferred to the downstream task, usually either by linear probing with a frozen feature extractor or by fine-tuning the full network. Consequently, introducing a pretext task adds complex design choices that developers do not face in standard supervised learning. Many studies propose new and improved self-supervised methods with specific pretext tasks, but it is often difficult to determine whether their success is truly due to the method itself or whether other design choices are more decisive.
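To make the masking example concrete, here is a minimal sketch of how a masked-image pretext task can generate its own pseudolabels. It is an illustration under simplifying assumptions, not the procedure of any particular method: the function name make_masked_example, the tiny 4x4 "image", the patch size, and the mask ratio are all hypothetical choices made for this example.

```python
import random


def make_masked_example(image, patch=2, mask_ratio=0.5, seed=0):
    """Hide a random subset of patches and return (masked_image, targets).

    `image` is a 2D list of pixel values. `targets` maps the top-left
    corner (row, col) of every hidden patch to its original pixels.
    These original pixels are the self-generated pseudolabels: no human
    annotation is involved. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])

    # Top-left corners of all non-overlapping patches.
    corners = [(r, c)
               for r in range(0, height, patch)
               for c in range(0, width, patch)]
    n_hidden = max(1, round(mask_ratio * len(corners)))
    hidden = rng.sample(corners, n_hidden)

    masked = [row[:] for row in image]  # copy, keep the input untouched
    targets = {}
    for r, c in hidden:
        # Store the original content as the prediction target ...
        targets[(r, c)] = [row[c:c + patch] for row in image[r:r + patch]]
        # ... and remove it from the model input.
        for rr in range(r, min(r + patch, height)):
            for cc in range(c, min(c + patch, width)):
                masked[rr][cc] = None
    return masked, targets


# Toy 4x4 "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
masked, targets = make_masked_example(image)

for row in masked:
    print(row)
print("patches to reconstruct:", sorted(targets))
```

A model trained on this pretext task would receive the masked image as input and be penalized for the difference between its prediction and the stored targets. Even in this toy version, the patch size and mask ratio are design choices that change how hard the task is.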

Design choices for self-supervised learning are significantly more complex than traditional hyperparameters. They consist of groups of interdependent and intertwined decisions that may also include traditional hyperparameters. They have a major impact on the performance of the final model. Individual aspects have already been examined in previous publications, but a comprehensive overview has been lacking. Even small changes to the pretext task, data, network size, augmentation, knowledge transfer, or evaluation can significantly alter the results. Consequently, a single design choice is not “good” or “bad” in and of itself, but rather appropriate or inappropriate for a specific dataset, pretext task, and downstream goal.

The challenges of surveying a rapidly evolving field

Due to the complexity of the subject, it took more than two years of intensive work to write “A survey on design choices for self-supervised learning in computer vision.” One of the paper's key features is that it organizes its findings by design choice, which makes it practical and easy to navigate. The main challenge was that not all publications report the relevant hyperparameters and design choices in sufficient detail. We urgently need a better reporting culture in this area to achieve greater transparency in research and make results more accessible to other researchers. Furthermore, self-supervised learning is developing rapidly, and new papers appear almost daily. This required a clear structure and a systematic overview of the different design choices and pretext task types. At the same time, this was also one of the most rewarding aspects of the project, as I learned something new almost every day and the topic never became boring. Another difficulty was terminology. Very similar, and sometimes almost identical, concepts are often described with different names, which pushes traditional keyword search to its limits when it comes to identifying interesting contributions. For that reason, AI-based semantic search and cross-references from relevant papers became important complements to database searches. Narrowing the scope was also essential. Initially, I planned to include all data types in the survey. However, given the amount of available literature, this quickly became unrealistic. I therefore decided to focus on image data. For anyone planning a similar review article, this would be one of my strongest recommendations: define a manageable scope early.

Which design choices matter most for self-supervision

The results of this comprehensive review emphasize the benefits of in-domain pretraining, especially when the available out-of-domain dataset is smaller than, or not substantially larger than, the available in-domain dataset. Consequently, self-pretraining can be particularly useful if no suitable larger in-domain dataset is available or if the alternative would be a small or poorly aligned out-of-domain dataset. The usefulness of a pretraining dataset depends on its size, quality, domain similarity, and compatibility with the pretext task. Increasing the network capacity usually improves performance and is therefore advisable, especially in combination with challenging pretext tasks and large pretext datasets. Fine-tuning is generally more beneficial than linear probing, especially with self-pretraining and generative self-supervision.
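The difference between the two transfer strategies compared above comes down to which parameters are updated on the downstream task. The following sketch illustrates this with a toy model description. It uses no deep learning framework, and the names ParamGroup, configure_transfer, and n_trainable, as well as the parameter counts, are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    size: int               # number of scalar parameters (made-up values)
    trainable: bool = True


def configure_transfer(groups, strategy):
    """Return copies of `groups` with `trainable` set for the strategy.

    "linear_probe": the pretrained feature extractor is frozen and only
                    the newly added head is trained.
    "fine_tune":    all parameters, including the pretrained ones, are
                    updated on the downstream task.
    """
    if strategy not in ("linear_probe", "fine_tune"):
        raise ValueError(f"unknown transfer strategy: {strategy!r}")
    return [
        ParamGroup(g.name, g.size,
                   trainable=(strategy == "fine_tune" or g.name == "head"))
        for g in groups
    ]


def n_trainable(groups):
    """Total number of parameters that would receive gradient updates."""
    return sum(g.size for g in groups if g.trainable)


# Toy model: a pretrained encoder plus a small downstream head.
model = [
    ParamGroup("encoder.block1", 1000),
    ParamGroup("encoder.block2", 2000),
    ParamGroup("head", 10),
]

for strategy in ("linear_probe", "fine_tune"):
    configured = configure_transfer(model, strategy)
    print(strategy, "->", n_trainable(configured), "trainable parameters")
```

Linear probing is cheaper and measures how useful the frozen representation already is, while fine-tuning lets the pretrained features adapt to the downstream data.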

I hope that this article will help researchers and practitioners make more informed design choices when applying self-supervised learning. I also hope it encourages more systematic research into the effects and interactions of these choices. Self-supervised learning promises to learn from data without labels. But to use it well, we also need to better understand how to design self-supervision itself.