Microbial specialised metabolism, largely encoded in and driven by metabolic gene clusters, is instrumental to ecological specialisation. Beyond its role in ecological interactions, this immense source of chemical diversity can be leveraged to improve quality of life through applications in medicine and agriculture. Over the last decade, building on rapid advances in DNA sequencing technologies, computational tools were developed that enabled automatic detection of biosynthetic gene clusters (BGCs; e.g., antiSMASH) and their scalable and systematic comparison (e.g., BiG-SCAPE and BiG-SLiCE). In parallel, the MIBiG data standard and reference database of experimentally validated BGCs provided an important analysis resource.
BiG-SCAPE became widely adopted by the research community, aiding researchers in analyses ranging from mining genomes for specialised metabolism to describing the biosynthetic diversity of entire ecosystems. Having spent the better part of my PhD computationally categorising the biosynthetic diversity of an extremely diverse system, the marine sponge holobiont, I had not only a deep appreciation for the value of the tool, but also extensive experience with its functionalities (and shortcomings). Driven by a desire to develop my skills in research software engineering, I joined the BiG-SCAPE team at the end of 2022, with the goal of kick-starting the process of designing and building the next iteration of BiG-SCAPE.
At the time, one of BiG-SCAPE's original developers, Dr. Jorge Navarro-Muñoz, was also employed at the Wageningen University (WUR) Bioinformatics Group (within which the Medema lab resides) as an eLearning Developer. This meant that for the first months of the project I was able to engage in extensive discussions with Jorge, gain a more comprehensive understanding of the design choices behind BiG-SCAPE, and formulate concrete goals for BiG-SCAPE 2.0. Soon after, Arjan Draisma joined the team, having just finished his MSc at WUR. Arjan was already familiar with BiG-SCAPE, as his MSc thesis focused on addressing scalability bottlenecks in BiG-SCAPE 1.0. Additionally, Arjan brought extensive software engineering experience from his years employed at Philips. We were able to get right to work on what we had identified as the major goals for BiG-SCAPE 2.0: improving scalability without compromising on BiG-SCAPE's trademark sensitivity, improving the accuracy of the comparisons by addressing shortcomings in the alignment and clustering algorithms, adopting novel concepts in biosynthesis introduced by the newer versions of antiSMASH, and a complete, software-sustainability-focused overhaul of the code base.
Simultaneously, we were faced with the challenges that come with short, temporary academic contracts and a general lack of funding calls targeted at improving and maintaining existing research software. At this time, both Arjan and I were employed on very short-duration contracts. This meant that we were not able to do any long-term planning for our BiG-SCAPE 2.0 development goals, and had to adjust these goals to the time that was available. In parallel, together with Marnix, I put together a couple of grant proposals, in the hope of extending our contracts and consequently BiG-SCAPE 2.0's deadlines. We were thrilled when one of the applications was granted, albeit a small one, and another scored extremely high and would eventually be funded upon resubmission the following year. At the same time, the Bioinformatics Group, recognising the value of permanent support staff, opened a Research Software Engineer position, for which Arjan successfully applied. This meant we could bring in a third team member with the funding that had become available, and Nico Louwen, a recent bioinformatics graduate with excellent skills, joined our team. At this point, we had managed to grow our little BiG-SCAPE team and secure additional funding. We were excited about the opportunity that this extended timeframe gave us to tackle larger goals for BiG-SCAPE 2.0.
Early on we decided that, instead of investing time in tracking down and solving specific scalability bottlenecks in the BiG-SCAPE 1.0 code, we would start from zero and make use of design principles centred on modularity and interoperability from the outset. We also made updates and introduced new concepts to the biological algorithms of BiG-SCAPE 2.0: we addressed known issues, including improving compatibility with antiSMASH features; we leveraged the antiSMASH region concept for more relevant delineation of biosynthetic regions; and we introduced new alignment modes designed to improve the accuracy of the similarity calculations by better handling the uncertain, predicted nature of biosynthetic region borders. A change in paradigm, which is central to BiG-SCAPE 2.0, is the extensive flexibility in run parameters, which gives the user fine-grained control over each run. We are excited to see how this customisation potential can be leveraged, and encourage users to go beyond default settings and deeply explore their data.
BiG-SCAPE 2.0 facilitates accurate clustering and interactive analysis of gene cluster data.
Even though all new biological updates to BiG-SCAPE 2.0 were based on simple, intuitive principles that would (in theory, at least) only improve BiG-SCAPE's accuracy, we wanted to test the new behaviour on benchmarking data. However, benchmarking gene cluster family (GCF) generation is a somewhat paradoxical task, as there is no standardised definition of what a GCF is. We study metabolic gene clusters (often, although not always) because we are interested in a molecule that is either produced or consumed by the proteins encoded in these clusters. As such, one could argue that a GCF should contain only gene clusters that are responsible for the production of molecules with not-too-dissimilar (how much is up for debate) structures and/or functions. But gene cluster similarity is measured, as far as BiG-SCAPE is concerned, by the sequence and architectural similarity between the clusters, and not by the similarity of their products. In this framework, a GCF should contain gene clusters that themselves lie along a spectrum of not-too-dissimilar architectures. Again, the acceptable degree of dissimilarity is up for debate. As these concepts remain under discussion within the research community, we wanted to compile a benchmarking dataset that was as diverse as possible. That meant compiling manually curated GCFs that were generated based on sequence and architectural similarity or on structural similarity of the produced molecules; that were derived from closely related as well as taxonomically distant genomes; that include cases of convergent and divergent evolution of gene clusters; and that cover both complete and fragmented genomic input data. In addition to the characteristics of the datasets themselves, it was important to have manual curations provided by different experts. To achieve this, we put out a call to the research community, and are deeply thankful to M. Adamek, A. Gavriilidou, N. Ziemert and M.M. Zdouc for contributing their datasets to our collection. The benchmarking results were extremely positive as a whole, and beautifully showcase how behaviour can change from dataset to dataset and from research goal to research goal.
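To make the benchmarking idea a little more concrete, here is a minimal Python sketch of one common way to score a computed clustering against a manually curated one: homogeneity (does each computed GCF contain BGCs from only one curated GCF?), completeness (do all BGCs of a curated GCF end up in the same computed GCF?) and their harmonic mean, the V-measure. This is an illustration under stated assumptions, not necessarily the exact procedure used in the paper; the function names, BGC identifiers and family labels are all hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equally long label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(curated, computed):
    """Return (homogeneity, completeness, V-measure).

    `curated` and `computed` are dicts mapping a BGC identifier to a
    family label; only BGCs present in both are scored.
    """
    ids = sorted(curated.keys() & computed.keys())
    truth = [curated[i] for i in ids]
    pred = [computed[i] for i in ids]
    h_truth, h_pred = entropy(truth), entropy(pred)
    homogeneity = (
        1.0 if h_truth == 0 else 1 - conditional_entropy(truth, pred) / h_truth
    )
    completeness = (
        1.0 if h_pred == 0 else 1 - conditional_entropy(pred, truth) / h_pred
    )
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical toy data: curated family B is split in two by the clustering.
curated = {"bgc1": "A", "bgc2": "A", "bgc3": "B", "bgc4": "B"}
computed = {"bgc1": 1, "bgc2": 1, "bgc3": 2, "bgc4": 3}
h, c, v = v_measure(curated, computed)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")
```

In this toy case every computed family is pure, so homogeneity is 1.0, while splitting curated family B lowers completeness to about 0.667. The same clustering would therefore look perfect or flawed depending on which curation philosophy one treats as the reference, which is why a diverse set of curated datasets matters.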
All in all, this paper and BiG-SCAPE 2.0 were made possible by an exceptional team and a successful collaboration model that brings together academic (PhD students, postdocs, PIs, etc.) and academic-adjacent (research assistants, software engineers, technicians, etc.) staff. I am convinced that these kinds of strategic collaboration choices can have an immense impact on driving scientific progress, as well as on the collective skill development of the research community. As such, I would like to take this opportunity to call on the community to consider allocating budget to such academic-adjacent positions, specifically in bioinformatics research groups, and on funding agencies to support this more systematically through, e.g., infrastructure and engineering grants.
The entire team and I are excited to see these tools become available, and hope that they will be useful to the natural product research community. Finally, we very much welcome interactions and contributions from the community, so that we can continue improving the tools in alignment with the needs of natural product research.
To find out more about what can be achieved with BiG-SCAPE 2.0, check out our paper, "BiG-SCAPE 2.0 and BiG-SLiCE 2.0: scalable, accurate and interactive sequence clustering of metabolic gene clusters", published in Nature Communications.