Computational pipelines in biomedicine don't maintain themselves. For my story When computational pipelines go 'clank', about building, maintaining and troubleshooting data analysis pipelines, I spoke with a number of scientists.
Previously, for a different story, I spoke with researchers on a related topic: benchmarking computational tools. That piece is called Benchpressing with genomics benchmarkers.
Throughout this piece, you will see some videos related to that benchmarking story; the two subjects go well together. Here's a video about what spaghetti and code have in common, with thoughts from Kasper Lage at Massachusetts General Hospital and the Broad Institute.
A computational pipeline involves a quite specific combination of software and scripts. There is a strictly defined set of input data and a precise and reproducible delivery of results, says Björn Nystedt. “In my experience, a real problem in the discussion about bioinformatics pipelines is the confusion about what one can actually expect a pipeline to deliver, and to what extent they can really aid research projects.” Nystedt directs the National Bioinformatics Infrastructure Sweden for SciLifeLab, a research initiative that started as a joint effort between Karolinska Institute, KTH Royal Institute of Technology, Stockholm University and Uppsala University and now supports research activities in the life sciences and medicine all across Sweden.
Standardization matters with pipelines; there’s a need to reliably and efficiently perform a well-defined analysis over and over again, says Nystedt. But standards are not as widespread in research as they are, for example, in routine clinical applications.
In bioinformatics, a number of pipelines are standardized, but not all of them. Because of this lack of standards, different pipelines can lead to different results from the same dataset. A Nature News & Views and its accompanying paper in the area of neuroimaging offer one example of this difficult situation.
Pipelines are often discussed in the context of improving research quality, but this improvement will only happen if the pipelines are used for what they are actually designed to do, says Nystedt. Even “minor violations to the very detailed assumptions of a computational pipeline can lead to sub-optimal analyses and even to wrong conclusions,” he says. That’s why developers need to find a “sweet spot”: a pipeline that addresses a task common to many labs yet remains useful in specific, complex cases, too.
Rapid change makes this task ever more challenging. Over the last decade, the life sciences have seen extremely rapid development of data types, data volumes and analysis software. Computational pipelines quickly become outdated, and keeping them valid takes resources for updates and maintenance, says Nystedt.
Here is a little tale about what can happen when developers benchmark their own tools. It's a potential self-assessment trap, says USC researcher Serghei Mangul.
It’s not free
Given how deeply methods development can affect biomedical research, says Jinghui Zhang, who chairs computational biology at St Jude Children’s Research Hospital, she wishes funders would more readily recognize the time-consuming effort behind developing, distributing and maintaining high-quality analysis tools.
A well-designed pipeline with adequate built-in quality metrics can “improve research quality, save an enormous amount of global post-doc time, and massively reduce computational resource needs,” says Nystedt. “Unfortunately, universities and funding bodies have so far responded only weakly to this opportunity.” They struggle to find the right processes, evaluation methods and job positions to support this type of work and to maximize the value of research funding.
The ‘best’ pipeline, as a specific combination of software tools and scripts with defined input data and reproducible output, may well differ across species, lab protocol versions and study designs.
Several good initiatives related to pipelines have been launched with open-source code and contributions from a community of developers, says Nystedt, among them bcbio-nextgen and the nf-core framework.
Bcbio is a community portal to pipelines for automated high-throughput sequence analysis, for looking at somatic or structural variants, for single-cell RNA-seq, ChIP-seq, ATAC-seq and many other applications.
Nf-core was developed in Sweden’s National Genomic Infrastructure at SciLifeLab and holds over 40 pipelines. Its community-based curated pipelines were built with the workflow manager Nextflow. Contributors are welcome but only pipelines with high-quality documentation, a description of generalized results files and test datasets will be included.
Carole Goble of the University of Manchester and her team are finalizing a workflow registry that is in “pre-pre-beta,” she says, and is linked to the EU COVID-19 data portal. “The killer is dependencies of course, and the use of externally hosted resources,” she says of pipelines.
There is a repository of ‘awesome pipelines’ in various stages of development, compiled by a community of Nextflow users.
There are ‘composers’ for pipelines such as FlowCraft.
Pipeliner is a framework for pipelines that handle sequence data processing and is based on Nextflow.
Tools to pipelines
The online registry bio.tools includes, as of this writing, 17,265 entries from over 1,000 contributors, including tools in genomics, proteomics and metabolomics, among other areas. It was compiled as part of the European Infrastructure for Biological Information (ELIXIR), and by the Danish ELIXIR node in particular. There is also a section on COVID-19-related tools.
Pipeline builders and users benefit from a wealth of software tools. “The fact that there are plenty fish in the sea, doesn’t mean you are going to eat today,” says David Ochoa, platform coordinator for Open Targets at the European Bioinformatics Institute. Open Targets is a public-private project of the EBI and the Wellcome Sanger Institute with pharmaceutical companies such as Bristol Myers Squibb, GSK, Sanofi and Takeda. These days, pipeline development is thriving. “However, developing, maintaining and documenting long-standing tools is still the exception,” he says. “Turning a prototype pipeline into a robust reliable infrastructure requires a usually underestimated amount of time and resources.”
When labs build a pipeline, they should ask themselves some serious questions, such as why they are doing so, says UCLA cancer researcher Paul Boutros. “What prevents them from using the work of others and leveraging off of it?” he says. “Might they collaborate with a larger team to optimize or leverage their pipeline in a way that makes everybody better off? Many labs want to create their infrastructure, leading to a lot of poorly maintained pipeline code,” he says.
With data pipelines, as well as with many other products, the key question is who is the consumer and how are they intending to use it, says Ochoa. “Developers regularly obviate important aspects of development such as error-handling, scalability or documentation, mostly because they develop for their own benefit.” Distributing the code and building a community of developers and users around a pipeline is frequently underestimated.
Bringing in the right technologies on the algorithm-implementation side can help smaller labs build pipelines, says Miguel Carmona, a backend software developer at Open Targets. They can consider Apache Spark for large-scale data processing. “It enforces a declarative way of specifying the implementation of simple and more complex algorithms,” he says. It’s a way to define complex, interrelated or iterative algorithms that process a variable amount of data, be it big or small, without having to focus on how to scale the infrastructure to accommodate the resource requirements or having to deal with input/output transformations, which makes any pipeline more resilient.
The availability of skills required to build and run the pipelines is another factor to consider, says the EBI’s proteomics team leader Juan Antonio Vizcaino. “This is often not available for smaller research groups,” he says. For instance, bioinformaticians and scientific programmers need to know about domain-specific issues and many specific aspects of pipeline construction and deployment, such as how to use software containerization or workflow management systems.
With commercial software, groups need to buy the right licensing for the software to run in high-performance computing infrastructures. Additionally, if Windows is used as the operating system, its licensing costs need to be considered as well, he says.
In the platform Galaxy, users can find a ‘tool shed’ with thousands of tools. They can fire up a virtual machine and build a pipeline from tools of their choice. Users will want to thoroughly check out the tools they might choose, says Irene Papatheodorou, EMBL-EBI team leader on gene expression.
“We definitely do not serve old tools but we do keep them for reproducibility,” says Anton Nekrutenko, a Penn State University researcher and one of Galaxy’s co-founders. That means users might find multiple versions of tools such as TopHat, TopHat2, HISAT and HISAT2. TopHat is a splice junction mapper for RNA-seq reads; HISAT is another such RNA-seq aligner.
Version control is the way to resolve such issues, says St. Jude’s Zhang. “However, when the new software is fundamentally different from the old software, a name change can lead to loss of the connection,” she says. Including a set of benchmark data can help, for example, and developers can update the performance matrix. “Ideally this should be a semi-automated dynamic process,” she says. In reality, it usually requires someone to write a review paper that compares the performance of multiple methods to provide guidance for the user.
TopHat2 is quite accurate, says Steven Salzberg, a Johns Hopkins University computational biologist and TopHat co-developer, “so it's fine if people use that instead of HISAT2” developed by his former trainee Daehwan Kim. In their comparison study in Nature Methods, TopHat2 was 56 times slower than HISAT, which means that users will “pay a cost in CPU usage,” he says. That translates to higher costs when building a pipeline in the cloud.
One way to avoid issues with software versions and pipeline-building is to have active and constant contributions from a sizeable community as is the case with Galaxy, says the EBI’s Vizcaino. “It is not realistic to expect that the Galaxy developers can deal with all issues related to every different piece of software,” he says. What is needed is a connection between developers of scientific software and the Galaxy developers.
In computational proteomics, although things are slowly evolving toward Linux, there are still some very popular tools that can run only on Windows, which also limits their deployment in high-performance computing infrastructures, says Vizcaino.
Here's another video mini-tale about benchmarking, again from Kasper Lage. Here, it's about expectations, about baseball and benchmarking.
As a way to address issues with software dependencies, which are additional pieces of software a pipeline needs, it can be helpful to virtualize a pipeline. Using containerization, pipeline developers can configure an environment with all the required packages and system configuration and effectively ‘ship’ it to users around the world, says Clay McLeod, who directs engineering and bioinformatics at St Jude. End-users can then run that Docker image anywhere, which also lets them run the tool in an environment identical to the one the pipeline developer used. “We feel this is the future for solving the dependency problem,” he says.
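In practice, the shipped environment starts from a container recipe that pins every dependency. Here is a minimal, hypothetical Dockerfile sketch; the base image, package versions, paths and entrypoint are illustrative assumptions, not any lab's actual setup:

```dockerfile
# Hypothetical recipe: package a pipeline with pinned dependencies.
FROM python:3.11-slim

# Pin exact versions so every user runs the environment the developer tested.
RUN pip install --no-cache-dir snakemake==8.20.0 pysam==0.22.1

# Copy the pipeline code into the image.
COPY pipeline/ /opt/pipeline/
WORKDIR /opt/pipeline

# Running the container executes the pipeline the same way everywhere.
ENTRYPOINT ["snakemake", "--cores", "4"]
```

A user anywhere can then build and run this image (`docker build`, `docker run`) and get the same packages and system configuration the developer used.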
Virtualization can be challenging for less experienced bioinformaticians, says Manfred Grabherr, a computational biologist at Uppsala University. Pipeline builders might find it hard to test on all platforms and configurations, “so what works on the developer’s compute cluster is not guaranteed to work anywhere else,” he says.
Teams can consider trying out new technologies such as ClickHouse, says Carmona, which “could potentially change the way we see some typical algorithms currently implemented.” ClickHouse is a column-oriented database management system that generates analytical reports in real time. For instance, at Open Targets they want to prioritize targets in real time based on user input, linking the expertise of scientists with the power of systematic computation at scale.
Also, says Carmona, there are new data formats better designed to handle the complexity and nested nature of biological data. The format Apache Parquet, he says, is well prepared for querying large and complex datasets because it partitions the data and stores metadata about each partition. Knowing whether a value falls within a partition's range, and where it is positioned, allows data to be processed quickly at the scale required today.
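The core idea, per-partition min/max metadata that lets a query skip most of the file, can be sketched in a few lines of plain Python. This is an illustration of the pruning concept only, not the actual Parquet format:

```python
# Illustrative sketch of metadata-based partition pruning, the idea behind
# Parquet's per-partition statistics. Not the real file format.

def make_partitions(values, size):
    """Split sorted values into fixed-size partitions with min/max metadata."""
    parts = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        parts.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return parts

def query(partitions, target):
    """Scan only partitions whose [min, max] range could contain the target."""
    scanned = 0
    hits = []
    for p in partitions:
        if p["min"] <= target <= p["max"]:
            scanned += 1
            hits.extend(r for r in p["rows"] if r == target)
    return hits, scanned

positions = list(range(0, 10_000))          # e.g. sorted genomic coordinates
parts = make_partitions(positions, 1_000)   # 10 partitions, each with metadata
hits, scanned = query(parts, 4_242)
print(hits, scanned)                        # prints: [4242] 1
```

Only one of the ten partitions is ever read; the other nine are skipped from metadata alone, which is where the speed comes from.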
Some labs, consortia or companies want to scale up their pipeline. Here is a podcast of a conversation I had on the ins and outs of pipeline maintenance and what it takes to scale up a pipeline. We talked about workflow managers and some general trends related to pipeline building and maintenance. The conversation is with John Ellithorpe, DNAnexus executive vice president and chief product officer, and George Asimenos, chief technology officer at DNAnexus.
Computational biologist C. Titus Brown of the University of California, Davis and his team have been using the workflow manager Snakemake to develop a computational pipeline for decontaminating genome assemblies in metagenomics. He sees a new breed of biologist/bioinformatician coming along. “These workflow-enabled biologists will become increasingly valuable as data set size and complexity increases, along with the associated tool chain,” he says. Few labs are training them, he says, but he is.
There are many workflow managers: around 100 tools exist for linking workflow steps and connecting software tools such that the output of one is the next’s input. @AlbertVilella, a bioinformatician and consultant, ran a poll a while back and posted a Google Sheet with the resulting preferences and links to the many tools. “It’s relatively up to date,” says Vilella. The preferred workflow managers remain the same, he says: Nextflow, Snakemake, ones from the Broad Institute and others.
Workflow managers increasingly matter, especially for production-grade pipeline development and operation, says St Jude's McLeod. Some are more sophisticated than others, and pipeline developers lie all along that spectrum, based on their role and environment. Snakemake and Nextflow seem to be popular in medium-sized or larger individual labs where multiple people work on projects; they are relatively easy to pick up and integrate well with typical local computing setups. For larger operations, such as at St. Jude, open workflow standards such as CWL and WDL appear to be the leading workflow languages, he says; community projects adopt them and cloud providers support them. “There are some exceptions to these examples, but this is the trend from my perspective,” he says.
To craft their pipeline, the Brown lab uses Snakemake, which he counts among the main workflow managers in bioinformatics.
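A Snakemake workflow is a set of rules whose inputs and outputs chain together; the manager infers the job order from which rule's output matches which rule's input. A minimal, hypothetical Snakefile fragment (sample names, paths and tools are illustrative, not the Brown lab's pipeline):

```
# Hypothetical Snakefile: Snakemake chains rules by matching file names.
rule all:
    input:
        "results/sample1.sorted.bam"

rule align:
    input:
        "reads/sample1.fastq"
    output:
        "results/sample1.bam"
    shell:
        "bowtie2 -x ref/genome -U {input} | samtools view -bS - > {output}"

rule sort:
    input:
        "results/sample1.bam"
    output:
        "results/sample1.sorted.bam"
    shell:
        "samtools sort {input} -o {output}"
```

Asking for the final sorted BAM file makes Snakemake run `align` and then `sort` in order, and skip steps whose outputs already exist.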
“Bioinformatics people in Sweden like Snakemake and Nextflow,” says Grabherr, chief technology officer at Methority, a company that develops algorithms and applications in machine learning and artificial intelligence. He prefers Grapevine for its more straightforward syntax, but it hasn't yet attracted a wider community. It was developed at National Bioinformatics Infrastructure Sweden, where Grabherr had an appointment.
Grapevine can be used to set up and run bioinformatics pipelines with a more loosely defined, natural-language-based syntax. There’s a script, a table that describes the data, and some grammars, he says.
For example, says Grabherr, instead of needing to type
bowtie2 -p 16 -x mm10/mm10 -q -1 20170828BK_MC/HA_H33TAG_ctrl_R1.dedup.fastq -2 20170828BK_MC/HA_H33TAG_ctrl_R2.dedup.fastq --fast -S 20170828BK_MC/HA_H33TAG_ctrl.sam
samtools view -bS -F 4 20170828BK_MC/HA_H33TAG_ctrl.sam > 20170828BK_MC/HA_H33TAG_ctrl.bam
A user might want something closer to
map my reads to the mm10 reference genome in the mouse/ folder and save the alignments as a bam file, you can find the fastq files in a table under row fastq_file
In Grapevine it reads like this:
map file @table.fastq_file to reference @ref > @bam
The request ‘map reads to a reference’ is abstracted away from the user. And if a user wants to swap the aligner tool Bowtie2 for a different aligner, the script does not need to change.
The user supplies the variables, such as @ref, and the data table; the same command then applies to different data sets.
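That kind of abstraction can be sketched in plain Python: a wrapper turns the abstract request into a concrete command line from a per-aligner template, so changing aligners changes a template, not the user's request. This is an illustrative sketch, not Grapevine's implementation; the function and template names are hypothetical:

```python
# Hypothetical sketch of the 'map reads to a reference' abstraction:
# the user states the intent; a per-aligner template builds the command.

ALIGNERS = {
    "bowtie2": ("bowtie2 -p {threads} -x {ref} -q -1 {r1} -2 {r2} "
                "--fast -S {out}.sam"),
    "hisat2":  "hisat2 -p {threads} -x {ref} -1 {r1} -2 {r2} -S {out}.sam",
}

def map_reads(ref, r1, r2, out, aligner="bowtie2", threads=16):
    """Turn the abstract request into a concrete command line."""
    return ALIGNERS[aligner].format(threads=threads, ref=ref,
                                    r1=r1, r2=r2, out=out)

# The user's request stays the same regardless of the aligner behind it:
cmd = map_reads("mm10/mm10", "ctrl_R1.fastq", "ctrl_R2.fastq", "ctrl")
print(cmd)
# prints: bowtie2 -p 16 -x mm10/mm10 -q -1 ctrl_R1.fastq -2 ctrl_R2.fastq --fast -S ctrl.sam
```

Passing `aligner="hisat2"` swaps the tool without touching the calling script, which is the point Grabherr makes about Grapevine.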
Evaluating, validating pipelines
Brown and his team are mulling over how to evaluate their new pipeline-in-the-making. “For me, evaluating workflows is really hard, especially as you tune parameters and update workflows over time,” he says. It takes manual effort and it can get tricky to first choose a liberal set of parameters for an initial highly sensitive set of predictions, and then choose a conservative set of parameters to get highly specific predictions. After curating those, one could start to dig into ‘contigs,’ or contiguous genomic regions, that remain unclassified after this pipeline runs to then explore the uncertainty between those parameter sets.
The pipeline is built on sourmash, a platform Brown and colleagues previously developed, “but how those parameters play out when you add higher level logic is complicated.” There are known knowns and known unknowns to evaluate, even though pipeline builders usually have hunches about what to look for. And then, there are the “unknown unknowns.”
With pipelines, documentation is crucial for further development and to help users. “It’s funny, I think I spent about as much mental energy on documentation and user experience over the last three weeks as I did on core programming,” says Brown.
Approaches to validate pipelines can take many forms and will vary with the pipeline. For clearly defined tasks, says Ochoa, a pipeline developer can probably ensure the output is the desired outcome by using continuous integration and continuous deployment. “As the pipeline becomes more analytic, very often you would require to play with the parameters and explore what the different potential outcomes are,” he says. And one will often need an external benchmark to fully understand the implications of the implemented algorithm.
Snakemake helps with small projects, says the EBI’s Carmona, but when the number of files you need to process gets orders of magnitude bigger than the average usage, DAG resolution takes forever, he says. DAG, or directed acyclic graph, is a way to set up the ‘jobs’ to be run in a pipeline.
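The DAG idea can be sketched with Python's standard-library graphlib: declare which job needs which, and the sorter returns a valid execution order. The job names here are illustrative; real workflow managers additionally parallelize independent jobs and resolve file patterns, which is where the cost Carmona describes grows:

```python
# Sketch of DAG-based job ordering, using only the standard library.
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output it needs.
jobs = {
    "fetch":  set(),               # raw reads, no prerequisites
    "trim":   {"fetch"},
    "qc":     {"fetch"},
    "align":  {"trim"},
    "sort":   {"align"},
    "report": {"sort", "qc"},
}

# static_order() lists every job with its prerequisites always first.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

A workflow manager does this resolution over every concrete file, so a dataset with millions of files means a graph with millions of nodes to sort before any job runs.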
Just as with validating software, pipelines need to be compared to the current state of the art, says Vizcaino, which is usually done by performing benchmarking studies. The data needed for this benchmarking can be generated for new use cases, but very often datasets in the public domain are used. To assess whether the pipeline can run in a given infrastructure, the ideal approach, once a first version is running, is a continuous integration and continuous deployment system, so that the software is tested automatically at regular intervals or whenever the code changes.
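Such a setup can be sketched, hypothetically, as a GitHub Actions configuration; the repository layout, script names and test data below are assumptions for illustration:

```yaml
# Hypothetical CI configuration: run the pipeline on a small bundled
# test dataset on every code change and on a weekly schedule.
name: pipeline-tests
on:
  push:
  schedule:
    - cron: "0 6 * * 1"   # weekly re-test, to catch drifting dependencies
jobs:
  smoke-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run pipeline on bundled test data
        run: bash run_pipeline.sh --input test_data/ --output out/
      - name: Check outputs against expected results
        run: diff out/summary.tsv test_data/expected_summary.tsv
```

The scheduled run matters as much as the push-triggered one: it catches pipelines that break not because the code changed, but because an external dependency did.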
Workflow managers help with sharing workflows. “It’s a positive sign that people try to ensure the reproducibility of their pipelines,” says Ochoa. “However, it’s important not to forget that while these tools are very good for prototyping, there is likely to be a more optimal way to engineer and scale pipelines,” he says. “There is a time and place for every technology.”
(Credit: Brand X)