Spatio-temporal match data for Soccer Analytics

The largest collection of spatio-temporal soccer match data ever released is now public on Scientific Data: a crucial resource for the development of sports analytics.
Published in Research Data

In the history of soccer, we remember the exploits of great champions such as Maradona, Baggio, and Ronaldo. Nonetheless, we can learn essential lessons from the careers of lesser-known players, too.

Carlos Henrique Raposo, also known as Kaiser, was a Brazilian professional footballer for more than 20 years. He played as a forward for ten different clubs in Brazil, Argentina, Mexico, the USA, and France. His career, however, has a unique peculiarity: Carlos Kaiser played just two official matches. By befriending famous footballers and asking them to recommend him to the managers of their new clubs, Kaiser managed to change teams almost every year. Once hired by a club, he faked injuries throughout the season, thus hiding his mediocre footballing talent. This intricate network of lies and social relationships kept him in the football world for around 20 years.

Although the case of Kaiser is one in a million, the history of soccer is full of sensational purchases that turned out to be resounding failures. Not even talented and legendary managers were immune to these situations (who remembers Luther Blissett?). The reason behind these blunders resides solely in one place: the lack of data describing the performance of players throughout their careers. Had such data been available, it would have been possible to track the evolution of Kaiser’s performance in matches and training sessions, probably exposing his inadequacy to play soccer at a high level.

Nowadays, we have tools to discover effective players, and hence to avoid failed purchases and unusual situations like Kaiser's. Indeed, specialized companies collect massive amounts of data about the performance of players, thanks to sensing technologies that provide high-fidelity data streams extracted from every match. In particular, the so-called soccer-logs (also known as match event data) describe the events that occur during a match and are collected through proprietary tagging software. Each match event contains information about its type (pass, shot, foul, tackle, etc.), a timestamp, the player(s) involved, the position on the field, and additional information (e.g., pass accuracy). The volume and complexity of these data provide an unprecedented opportunity to observe the performance of players and teams during a match and to track their evolution during a season. Here is an example of a visualization, based on soccer-logs, of the evolution of the performance of players over an entire season: https://playerank.d4science.org/.


Example of soccer-logs
(Left) Example of events observed for Lionel Messi (FC Barcelona) during a match in the Spanish first division, season 2015/2016. Each event is shown on the field at the position where it occurred, with a marker indicating the type of the event. (Right) Example of an event in the dataset: an accurate pass by player 3344 (Rafinha) of team 3161 (Internazionale), made at second 2.41 of match 2576335 (Lazio - Internazionale) and starting at position (49, 50) on the field.
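
To make the structure of a match event concrete, here is a minimal Python sketch of how such a record could be represented and queried. The field names (`match_id`, `team_id`, `player_id`, `event_type`, `second`, `position`, `accurate`) and the two helper functions are illustrative assumptions, not the schema of the released dataset. The first event reproduces the Rafinha pass from the caption above, while the other two are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MatchEvent:
    # Hypothetical field names, chosen for illustration only.
    match_id: int
    team_id: int
    player_id: int
    event_type: str                  # e.g. "pass", "shot", "foul", "tackle"
    second: float                    # time of the event within the match
    position: Tuple[int, int]        # position on the field
    accurate: Optional[bool] = None  # additional information, when available


def events_of_type(events, event_type):
    """Return the events of a given type, preserving their order."""
    return [e for e in events if e.event_type == event_type]


def pass_accuracy(events, player_id):
    """Fraction of a player's passes tagged as accurate, or None if no passes."""
    passes = [e for e in events_of_type(events, "pass") if e.player_id == player_id]
    if not passes:
        return None
    return sum(1 for e in passes if e.accurate) / len(passes)


events = [
    # The accurate pass described in the caption above.
    MatchEvent(2576335, 3161, 3344, "pass", 2.41, (49, 50), accurate=True),
    # Two invented events, added only to make the example non-trivial.
    MatchEvent(2576335, 3161, 3344, "pass", 30.0, (60, 40), accurate=False),
    MatchEvent(2576335, 3161, 3344, "shot", 95.5, (88, 45)),
]

print(pass_accuracy(events, 3344))
```

Grouping such records by player and match is the kind of aggregation that lets an analyst follow a player's performance over a season.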


Unfortunately, soccer-logs are private data owned by the companies that collect them. Acquiring these data for research purposes is difficult and costly, for companies and especially for researchers in the field of sports analytics. The lack of public soccer-logs therefore severely limits the development of sports analytics.

This is why, in collaboration with the company Wyscout/Hudl, we have made publicly available an extensive collection of soccer-logs that covers seven prominent male soccer competitions. The collection was used recently during the Soccer Data Challenge initiative organized by the European project SoBigData and, to the best of our knowledge, it is the largest collection of soccer-logs ever released to the public. These data are hugely beneficial to the scientific community because they can help foster sports analytics research in several directions, such as the ones we sketch below.

Performance and tactical analysis. The evaluation of performance is crucial for many actors in the sports industry, from managers who want to monitor the quality of their players to scouts who aim to improve the discovery of talent. In this regard, we recently developed PlayeRank, an open algorithm based on soccer-logs that automatically evaluates the quality of players' performance during a season. The automatic discovery of tactics is also crucial in soccer: while tactical analyses are currently performed by reviewing match videos, soccer-logs can be used to discover tactics automatically, simplifying the complex process of match analysis. Our collection can serve as a common ground to compare and validate different solutions to these problems.

Evaluating soccer performance with PlayeRank
Performance quality calculated as the PlayeRank score of L. Messi (red line), C. Ronaldo (blue line), and M. Salah (black line).


Complex Systems analysis. Two soccer teams in a match represent a complex system whose global behavior depends in subtle ways on the dynamics of the interactions among the players. Soccer-logs enable the representation of a team as a network, in which nodes represent players and edges represent interactions between them, usually passes. Because soccer-logs encode several event types, they allow the definition of different types of interactions between both teammates and opponents. Such richness of information, combined with the dual nature of soccer matches (where collaboration and competition coexist), provides an unprecedented opportunity to investigate novel aspects of the dynamics of complex networks.

Passing networks of soccer teams
Representation of the player passing networks of Napoli (left) and Juventus (right) during the match Napoli-Juventus. Nodes represent players, while edges represent passes between players. The size of a node reflects the number of incoming and outgoing passes (i.e., the node's degree), while the size of an edge is proportional to the number of passes between the two players.
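
A passing network like the ones in the figure can be sketched in a few lines of Python. The example assumes that passes have already been reduced to (passer, receiver) pairs for a single team. That preprocessing step, the function name `build_passing_network`, and the toy player labels are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter


def build_passing_network(passes):
    """Build an undirected, weighted passing network for one team.

    `passes` is an iterable of (passer, receiver) pairs (hypothetical input).
    Returns (edge_weights, node_degree):
      - edge_weights counts the passes exchanged between each pair of players,
      - node_degree counts the incoming plus outgoing passes of each player.
    """
    edge_weights = Counter()
    node_degree = Counter()
    for passer, receiver in passes:
        # Sort the pair so that A->B and B->A fall on the same edge.
        edge = tuple(sorted((passer, receiver)))
        edge_weights[edge] += 1
        node_degree[passer] += 1    # outgoing pass
        node_degree[receiver] += 1  # incoming pass
    return edge_weights, node_degree


# Toy data: player labels "A", "B", "C" are invented for the example.
toy_passes = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "B"), ("A", "B")]
edge_weights, node_degree = build_passing_network(toy_passes)

for (p1, p2), weight in sorted(edge_weights.items()):
    print(p1, p2, weight)
```

In a drawing of this network, `node_degree` would drive the size of each node and `edge_weights` the size of each edge, as in the caption above.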

Science of Success. The possibility of tracking player and team performance creates the opportunity to explore the relationship between performance and success, where a team's success can be understood as its outcome in a competition and a player's success as their popularity or market value. While this relationship has been investigated for individual sports, there is not much work on soccer apart from a few attempts, partly due to the absence of public performance datasets. Our dataset gives an unprecedented opportunity to answer fascinating questions such as: What are the tactical patterns of successful teams? What are the factors influencing a player's popularity and market value? To what extent is success predictable from observable performance?

We hope our open data collection can stimulate the creativity of scientists all around the world and foster the development of new ideas, methods, and analyses that help strengthen the emerging field of sports analytics. Nowadays, by tracking the performance of a player over time, we may avoid one-in-a-million situations like Kaiser's. Unless the new Kaisers are good data hackers.

Luca Pappalardo and Paolo Cintia

Reference: L. Pappalardo et al., A public data set of spatio-temporal match events in soccer competitions (2019) Scientific Data, DOI: 10.1038/s41597-019-0247-7, https://www.nature.com/articles/s41597-019-0247-7

Author information

Luca Pappalardo is a researcher at the Institute of Information Science and Technologies (ISTI) at the Italian National Research Council (CNR) in Pisa, Italy. He holds a PhD in Computer Science from the University of Pisa. His research focuses on the analysis of Big Data in various contexts, from human mobility to social networks and sports performance.

Paolo Cintia is a researcher at the Department of Computer Science of the University of Pisa, Italy, and CEO of PlayeRank, a startup that develops analytical tools for soccer fans and clubs. He holds a PhD in Computer Science from the University of Pisa and has published several articles on performance evaluation and tactical analysis in several sports.

