Why FPS games?
First-person shooter (FPS) games are hugely popular on live-streaming platforms such as Twitch, attracting hundreds of millions of viewers every month. This makes the genre an appealing choice for content creators, streamers, and game designers alike. The FPS genre has also been evolving for over 30 years, so FPS games are diverse in visual style, content, and game modes; this diversity poses a significant challenge for general AI research. These factors make FPS games an attractive platform for our research, motivating us to collect gameplay videos from 30 popular games published between 1991 and 2023. In total, the GameVibe dataset contains 120 videos, including in-game audio, of approximately 1 minute each.
What type of data do we provide?
We provide the original unprocessed video clips in the repository, including the in-game audio. We also include latent representations extracted from pre-trained foundation models: Video Masked Autoencoders and Masked Video Distillation for the visuals, and BEATS and MFCC representations for the audio. The engagement labels were collected using the PAGAN platform and take the form of unbounded, time-continuous signals. We provide the raw signals in the data repository, along with the outputs of several signal-processing stages, including aggregation, normalization, and outlier filtering.
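To make this layout concrete, below is a minimal Python sketch of how one might load a clip's visual latents, recompute MFCCs from the audio, and resample a raw engagement trace onto the same time grid. The file names, array shapes, and the 16 kHz sampling rate are illustrative assumptions rather than the repository's actual conventions; the scripts and README in the repository define the real layout.

```python
# A minimal sketch of loading and aligning the modalities for one clip.
# File names and array shapes here are hypothetical.
import numpy as np
import librosa

# Visual latents (e.g. from VideoMAE / MVD), one feature vector per time window.
video_latents = np.load("clip_001_videomae.npy")   # assumed shape: (n_windows, latent_dim)

# Audio: either load the precomputed BEATS/MFCC latents, or recompute MFCCs locally.
audio, sr = librosa.load("clip_001.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# Engagement traces are unbounded, time-continuous signals; a common first step
# is resampling each trace onto the same time grid as the visual windows.
engagement = np.loadtxt("clip_001_engagement_raw.csv", delimiter=",")  # assumed (time, value) rows
times, values = engagement[:, 0], engagement[:, 1]                     # assumes timestamps are sorted
grid = np.linspace(times.min(), times.max(), num=len(video_latents))
engagement_resampled = np.interp(grid, times, values)
```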
How do we reliably collect this data?
Reliably collecting data on human emotion is one of the fundamental challenges in AI and human feedback research. Emotions are subjective, which makes it difficult to assess the quality of the data we collect. Are disagreements between annotators exposed to the same stimulus down to actual differences in their emotional responses, to noise in the annotation tool, or to a misunderstanding of the annotation process? With this challenge in mind, we developed a quality-assurance pipeline that assesses the reliability of individual participants and, in turn, of the collected data. To validate this approach, we conducted a study comparing the reliability of crowdsourced labels against data collected in the lab. This helped us ensure that our labels are as representative of the participants’ internal state as possible.
To ensure there is enough data for each video to approximate the ground truth accurately, every clip in the dataset is annotated by five human participants. As part of our repository, we provide scripts to process these labels and extract the ground truth signal for each clip, including methods for removing outlier annotators based on their degree of disagreement with the other participants. This flexibility allows the dataset to cater to research projects that want to model the subjectivity of the labels, as well as those that solely seek to model the consensus of participants.
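As an illustration of this kind of processing, here is a minimal sketch of one way to build a consensus signal from the five annotator traces of a clip, dropping annotators whose trace disagrees strongly with the rest. The correlation-based criterion and the 0.2 threshold are illustrative assumptions, not the exact method implemented in the repository's scripts.

```python
# A sketch of consensus extraction with disagreement-based outlier removal.
import numpy as np

def consensus_signal(traces: np.ndarray, min_corr: float = 0.2) -> np.ndarray:
    """traces: shape (n_annotators, n_timesteps), already resampled to a common
    time grid and normalized (e.g. per-annotator z-scores)."""
    keep = []
    for i, trace in enumerate(traces):
        others = np.delete(traces, i, axis=0).mean(axis=0)
        corr = np.corrcoef(trace, others)[0, 1]
        if corr >= min_corr:              # keep annotators broadly in agreement with the rest
            keep.append(i)
    kept = traces[keep] if keep else traces   # fall back to all traces if none pass the filter
    return kept.mean(axis=0)                  # consensus = mean of the retained traces

# Example: 5 annotators, 60 one-second timesteps (synthetic data)
rng = np.random.default_rng(0)
ground_truth = consensus_signal(rng.standard_normal((5, 60)))
```

The repository's scripts offer different aggregation, normalization, and filtering options, so this is only one possible pipeline.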
Who is this for?
Using the latent representations provided in the repository, our initial studies show that affect models can reliably predict viewer engagement for unseen clips and unseen annotators when trained on clips of the same game. This can help game developers and content creators assess the quality of new content. One key strength of the dataset is that it comprises 30 different games, enabling studies on how well multimodal affect models generalize to unseen games. We believe this can help progress towards the ultimate goal of general AI models of human affect.
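A study of this kind can be set up with a simple leave-one-game-out protocol, sketched below on synthetic data: train an engagement model on clips from 29 games and test it on the held-out game. The fused latent features, the binarized engagement labels, and the random-forest classifier are all illustrative assumptions, not the exact modelling setup from our experiments.

```python
# A sketch of a leave-one-game-out generalization study (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 64))          # one fused latent vector per clip window (assumed)
y = rng.integers(0, 2, size=300)            # engagement binarized, e.g. above/below the clip mean (assumed)
groups = rng.integers(0, 30, size=300)      # which of the 30 games each window comes from

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(f"mean leave-one-game-out accuracy: {np.mean(scores):.2f}")
```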
Where to find it?
The GameVibe corpus, including the audiovisual stimuli and the human-annotated labels, can be found at https://doi.org/10.17605/OSF.IO/P4NGX. This repository also contains processing scripts and examples of latent representations extracted using well-established foundation models such as VideoMAEv2 (for video) and BEATS (for audio), as described above.