Have you ever wished you could analyze data collected in a distant laboratory? Or share your painstakingly collected data with the world? If so, you may have encountered some of the challenges neuroscientists face when managing data.
First, datasets are large. This stems from the need for ever more precise, simultaneously acquired information to best explain scientific findings. In neuroscience, rapid advances in techniques such as electrophysiology and neural imaging mean that scientists can now capture the activity of hundreds or thousands of neurons at once. It has also become increasingly important to record the context in which neural data is acquired; experiments therefore typically involve tracking many other variables, for example body movements filmed from several camera angles, or ambient sound, pressure and temperature throughout the session. Gathering all this information has a cost, and that cost is size. The dataset for a single session can reach hundreds of gigabytes and comprise many files in various formats. How should one process and store such information?
Second, you are not alone. This data needs to be shared: internally in the case of a collaboration, and eventually with the world, as most publishers now require. How do you share such large data with others across the globe, and organize and label it so that they can navigate through all the different experimental conditions?
The International Brain Laboratory (IBL) is a perfect example of all these needs. In total, 22 laboratories, distributed over several time zones, were interested in analyzing the data collected by 10 of them. These 10 partner laboratories performed either the same experiments as part of a collaboration-wide project, or experiments deviating from them as part of lab-driven endeavors. Large volumes of data came in daily from all partners. The largest experiments contained behavioral, electrophysiology and video data, all of which needed to be processed, stored and later swiftly accessed for both manual curation and analysis. How would you have designed the data architecture? We have one word: modular.
Let’s first tackle data storage. You will need compression algorithms to save disk space (we developed one specifically for Neuropixels data), a place with enough storage capacity to hold all the data collected by the partners, and a robust copy mechanism to transfer the files from each laboratory. In the case of the IBL, we were sponsored¹⁻² and able to host all our data (191 TB) at the San Diego Supercomputer Center (SDSC). We used Globus to transfer the files from the local machines to the SDSC, which let us work around the firewalls of the different institutions.
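To give a feel for why lossless compression pays off on slowly varying recordings, here is a minimal sketch in Python. It is not the algorithm the IBL developed for Neuropixels data; it simply stores the difference between consecutive 16-bit samples and deflates the result with zlib. The function names and the synthetic sine-wave signal are assumptions made for illustration.

```python
import math
import struct
import zlib


def _wrap_int16(value):
    """Wrap an integer into the signed 16-bit range [-32768, 32767]."""
    return (value + 32768) % 65536 - 32768


def compress_samples(samples):
    """Delta-encode 16-bit samples, then deflate them with zlib (lossless)."""
    deltas = []
    previous = 0
    for sample in samples:
        # Differences are wrapped so they always fit in 16 bits.
        deltas.append(_wrap_int16(sample - previous))
        previous = sample
    packed = struct.pack(f"<{len(deltas)}h", *deltas)
    return zlib.compress(packed, 9)


def decompress_samples(blob):
    """Invert compress_samples and return the original list of samples."""
    packed = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(packed) // 2}h", packed)
    samples = []
    previous = 0
    for delta in deltas:
        previous = _wrap_int16(previous + delta)
        samples.append(previous)
    return samples


# Synthetic stand-in for one recording channel (hypothetical data).
signal = [int(1000 * math.sin(i / 50)) for i in range(10_000)]
raw_bytes = struct.pack(f"<{len(signal)}h", *signal)
blob = compress_samples(signal)

print(f"raw: {len(raw_bytes)} bytes, compressed: {len(blob)} bytes")
```

Because neighboring samples are similar, the differences are small, repetitive numbers that a general-purpose compressor handles well, and the round trip through decompress_samples returns the signal unchanged.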
Next comes the question of how to handle the heavy processing required to turn the raw data into meaningful variables for analysis. To save time and money, we ran this processing locally, and thereby in parallel across laboratories, before sending the data to the SDSC. Notably, spike sorting and video segmentation were done on a local lab server that gathered the data from the many computers used in a given experiment. Each lab had its own server and so acted as its own small processing farm. If an experiment needed to be re-processed after registration at the SDSC, one could download the data and re-run the processing on any local server of the network (at no extra cost, as the machines are already owned), on a supercomputing cluster (e.g. one accessible through the local institution), or on a commercial cloud.
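The idea of letting each lab process its own data in parallel can be sketched as follows. This is a toy model, not the IBL pipeline: threads stand in for independent lab servers, process_session is a placeholder for heavy steps such as spike sorting, and all names and values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_session(raw_samples, threshold=50):
    """Placeholder for heavy processing: reduce raw samples to a small summary."""
    events = [i for i, value in enumerate(raw_samples) if value > threshold]
    return {"n_samples": len(raw_samples), "n_events": len(events)}


def process_lab(lab_name, sessions):
    """Process every session of one lab, as that lab's own server would."""
    return lab_name, {sid: process_session(raw) for sid, raw in sessions.items()}


# Hypothetical raw data, keyed by lab and then by session.
raw_by_lab = {
    "lab_a": {"s1": [10, 60, 70, 20], "s2": [5, 5, 90]},
    "lab_b": {"s3": [100, 100, 0, 0, 0]},
}

# One worker per lab: the labs do not wait for one another.
with ThreadPoolExecutor(max_workers=len(raw_by_lab)) as pool:
    central_store = dict(
        pool.map(lambda item: process_lab(*item), raw_by_lab.items())
    )

print(central_store)
```

The central store only collects results that each lab has already computed, so adding a lab adds processing capacity along with data.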
Then comes the question of sharing the data with others. Users need a way to search through the file system and retrieve only the files that are of interest to them. For example, one might want to analyze only experiments done on a particular mouse strain, or only those passing certain quality standards. To store this metadata, we used Alyx, a database system that is primarily used for mouse colony management, but that also holds information on experimental design, quality control measures and file records (i.e. which files were produced as part of the experiment). To search through this metadata, we developed ONE, an API that can query the Alyx database and load the specific datasets once found. Notably, ONE can load single datasets (i.e. single files), and in some cases even a portion of the data contained in a file, rather than the whole session folder, which greatly speeds up data access during live analysis.
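The pattern described here, querying the metadata first and then fetching only the matching files, can be illustrated with a small in-memory catalog. This sketch does not use the actual ONE or Alyx interfaces; the record fields, QC labels, dataset names and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRecord:
    """Metadata for one session (hypothetical fields, not the Alyx schema)."""
    session_id: str
    subject: str
    strain: str
    qc: str  # hypothetical labels: "PASS" or "FAIL"
    datasets: dict = field(default_factory=dict)  # dataset name -> file path


def search_sessions(catalog, strain=None, qc=None):
    """Return the sessions matching every filter that is not None."""
    return [
        record
        for record in catalog
        if (strain is None or record.strain == strain)
        and (qc is None or record.qc == qc)
    ]


def dataset_path(record, name):
    """Resolve a single dataset, so only that file needs to be fetched."""
    return record.datasets[name]


catalog = [
    SessionRecord("s1", "mouse_1", "strain_x", "PASS",
                  {"trials": "lab_a/mouse_1/s1/trials.csv",
                   "video": "lab_a/mouse_1/s1/video.mp4"}),
    SessionRecord("s2", "mouse_2", "strain_y", "PASS",
                  {"trials": "lab_b/mouse_2/s2/trials.csv"}),
    SessionRecord("s3", "mouse_3", "strain_x", "FAIL",
                  {"trials": "lab_a/mouse_3/s3/trials.csv"}),
]

matches = search_sessions(catalog, strain="strain_x", qc="PASS")
paths = [dataset_path(record, "trials") for record in matches]
print(paths)
```

Only the small trials file of the one matching session is resolved; the much larger video file of that session is never touched.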
How to share the data with others is only one part of the question; the other part is for what purpose. Internal collaborators of the IBL needed to access the data right after it had been acquired, to visually inspect and curate it. Daily plots and easy-to-input notes were essential to swiftly assess the quality of the data and prevent experimental issues from recurring. For notes and single-session plots, we used Alyx as a laboratory ledger; however, we also needed a way to show long-term results, such as mouse performance computed over weeks.
How best to store data analysis results is not a trivial question. We first turned to the company DataJoint, which created a SQL database specifically to save analysis results computed over several sessions and to display overarching plots. However, we quickly realized that we needed to create tables of our own. For example, it was impractical to download dozens of files to get the trials data for one mouse over several days; given that this trial data was small, we could instead store it in a table and provide a single file to the end user. The same reasoning applied to analyzing the neurons recorded in a particular brain region. Serving tables reduces download time, as well as the processing errors that may arise when users try to concatenate the data themselves. This is one of our steps towards large-scale data computation, reaching beyond single-session analysis.
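As an illustration of why a pre-built table helps, the sketch below concatenates the trials of one mouse across sessions into a single table and computes performance per day from it. The data layout and field names are hypothetical and much simpler than the tables the IBL serves.

```python
def build_subject_table(sessions, subject):
    """Concatenate the trials of one subject across sessions, ordered by date."""
    rows = []
    for session in sorted(sessions, key=lambda s: s["date"]):
        if session["subject"] != subject:
            continue
        for index, trial in enumerate(session["trials"]):
            rows.append({
                "subject": subject,
                "date": session["date"],
                "trial": index,
                "choice": trial["choice"],
                "correct": trial["correct"],
            })
    return rows


def fraction_correct_by_date(table):
    """Compute the fraction of correct trials for each date in the table."""
    totals = {}
    for row in table:
        n_trials, n_correct = totals.get(row["date"], (0, 0))
        totals[row["date"]] = (n_trials + 1, n_correct + int(row["correct"]))
    return {date: k / n for date, (n, k) in totals.items()}


# Hypothetical per-session trial files, already loaded into memory.
sessions = [
    {"subject": "mouse_1", "date": "2020-01-02",
     "trials": [{"choice": "left", "correct": True},
                {"choice": "right", "correct": False}]},
    {"subject": "mouse_1", "date": "2020-01-01",
     "trials": [{"choice": "left", "correct": True}]},
    {"subject": "mouse_2", "date": "2020-01-01",
     "trials": [{"choice": "right", "correct": True}]},
]

table = build_subject_table(sessions, "mouse_1")
performance = fraction_correct_by_date(table)
print(performance)
```

Building the table once on the server side means every user works from the same concatenation instead of repeating it, and possibly getting it wrong, on their own machine.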
Once you have found a system to share data internally across IBL members, sharing with the whole world may appear trivial. However, you need to think about download bandwidth and cost, about the entry barrier for users, and therefore about file format. The format we used internally is a “staging format”, useful to rapidly view, analyze and if necessary re-process the data. By the time the data is shared with the world, however, it has been processed and quality-vetted at length, and can thus be stored in a different format, a so-called “archiving format”, that may be better known to the public. We have therefore partnered with Neurodata Without Borders to convert our files into this common format and store them on the DANDI archive. In the meantime, and in order to offer as much bandwidth as possible, we are hosting our data for the public on sponsored⁴ Amazon storage space.
As you can see, each of the solutions found by the IBL responded to specific requirements. We encourage you to adopt parts of this architecture, based on your own needs and funding capacity.
Sponsors and partners
- Simons Foundation
- Wellcome Trust
- San Diego Supercomputer Center
- Amazon (Open Data Sponsorship program)
- Neurodata Without Borders
- DANDI archive