Wrangling a de novo sequencing benchmark

In any machine learning study, high-quality data for training and validating the model is critical. This paper describes the result of an iterative process of data wrangling and quality control, which ultimately produced a benchmark dataset for de novo peptide sequencing from mass spectrometry data.

The behind-the-paper story here is about just how iterative this process can be. In a sense, the first iteration of the benchmark is the version described in the initial DeepNovo paper, though for all I know that version itself represents multiple prior iterations. We used that benchmark in the first paper describing our Casanovo de novo sequencing model. When we began working on the second paper about Casanovo, we were motivated to create a new version of the benchmark because we weren’t sure how the database searching and FDR control were done.


The process, as outlined in my lab notebook, ended up producing ten versions of the benchmark.

  1. I started by manually downloading the nine reference proteome FASTA files from UniProt. To produce the first version of the benchmark, I wrote scripts to automatically download all the mass spectrometry data using ppx, convert it to MGF format using ThermoRawFileParser, and search the data using the Tide search engine followed by Percolator. The peptide sequences for peptide-spectrum matches (PSMs) accepted at a 1% PSM-level false discovery rate (FDR) threshold were then inserted into the MGF files, discarding any spectra that failed to be identified. (A sketch of this pipeline appears after the list.)
  2. Initially, I used the same modifications that had been used in the DeepNovo analysis. However, we realized that we should make this benchmark consistent with the set of modifications in MassIVE-KnowledgeBase, because those were the modifications used to train Casanovo.
  3. For each of the nine species, I had initially used estimates of precursor m/z tolerance and fragment bin size generated by our tool, Param-Medic. However, it seemed more defensible to use the parameters listed in the publications describing these datasets, so I switched to using those.
  4. I discovered that some of the raw files had failed to download in the initial pass, so I fixed that problem and re-generated the benchmark.
  5. I had initially created a version of the benchmark with one big MGF file per species, but this was problematic because scan numbers ended up being repeated. So I switched to creating one MGF file per raw file.
  6. We found that some of the annotations in the original DeepNovo benchmark did not properly account for isotope errors (see Figure 5 in this paper), so I added handling of isotope errors to my search parameters. Overall, this change did not make a big difference in the number of accepted PSMs. For one species (Vigna mungo) the number dropped slightly; for all the others it increased by a few thousand PSMs.
  7. We realized that some peptides in the benchmark were shared between species, so I added a post-processing step to eliminate these shared peptides (a minimal sketch of this step also appears after the list).
  8. A user pointed out that our benchmark did not actually contain any N-terminal modifications. It turns out that this was a known bug in the Tide search engine, which had been recently fixed. I therefore re-ran the entire search procedure to generate a new version of the benchmark.
  9. One of the reviewers of the second Casanovo paper asked us to ensure that different modified forms of the same peptide sequence are all associated with a single species. This seemed like a good idea, so I made this change.
  10. Unfortunately, Tide and Casanovo do not agree on how to represent a peptide containing modifications: Tide puts them in square brackets, whereas Casanovo leaves off the brackets but precedes the mass with a “+”. I therefore added a cleaning step to convert all the Tide peptides to Casanovo format (also sketched after the list).
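
To make step 1 concrete, here is a minimal sketch of what that per-species pipeline looks like in Python, with calls out to the command-line tools. The accession number, file names, output paths, and search parameters below are placeholders rather than values from the benchmark, and the exact flags for ThermoRawFileParser and Crux should be checked against the installed versions.

```python
import subprocess
import ppx  # fetches ProteomeXchange/MassIVE datasets

# 1. Download the raw files for one dataset (placeholder accession).
proj = ppx.find_project("PXD000000")
raw_files = [f for f in proj.remote_files() if f.lower().endswith(".raw")]
proj.download(raw_files)

# 2. Convert each raw file to its own MGF file with ThermoRawFileParser
#    ("-f 0" selects MGF output).
for raw in raw_files:
    subprocess.run(
        ["ThermoRawFileParser", "-i", raw, "-o", "mgf/", "-f", "0"],
        check=True,
    )

# 3. Build a Tide index from the species' reference proteome, search the
#    spectra, and post-process the PSMs with Percolator (all via Crux).
subprocess.run(["crux", "tide-index", "proteome.fasta", "tide-index"], check=True)
subprocess.run(
    ["crux", "tide-search",
     "--precursor-window", "10", "--precursor-window-type", "ppm",
     "mgf/run01.mgf", "tide-index"],
    check=True,
)
subprocess.run(["crux", "percolator", "crux-output/tide-search.target.txt"], check=True)

# 4. PSMs with Percolator q-value <= 0.01 (1% PSM-level FDR) would then be
#    written back into the MGF files as ground-truth peptide labels, and
#    unidentified spectra discarded.
```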
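
Step 7 can likewise be sketched in a few lines of Python. The data structure here is an assumption for illustration: a dictionary mapping each species to its set of accepted peptide sequences, however those sets are actually loaded from the annotated MGF files.

```python
from collections import Counter

def remove_shared_peptides(peptides_by_species):
    """Drop any peptide sequence that appears in more than one species."""
    counts = Counter()
    for peptides in peptides_by_species.values():
        counts.update(set(peptides))
    shared = {pep for pep, n in counts.items() if n > 1}
    return {
        species: {pep for pep in peptides if pep not in shared}
        for species, peptides in peptides_by_species.items()
    }

# "PEPTIDEK" occurs in both species, so it is removed from each.
filtered = remove_shared_peptides({
    "yeast": {"PEPTIDEK", "ACDEFGHIK"},
    "human": {"PEPTIDEK", "LMNPQRSTK"},
})
assert filtered == {"yeast": {"ACDEFGHIK"}, "human": {"LMNPQRSTK"}}
```

In practice one might also want to strip modifications before comparing sequences, so that the filtering is consistent with the base-peptide bookkeeping described in step 9.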
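
Finally, the peptide-string cleanup in step 10 amounts to a small string rewrite. The example below is illustrative rather than taken from the benchmark scripts: the exact modification masses and notation depend on how Tide and Casanovo are configured, and a negative mass offset would need its sign handled explicitly.

```python
import re

# Rewrite a bracketed modification mass as a "+"-prefixed mass,
# e.g. "DM[15.9949]SPR" -> "DM+15.9949SPR".
_BRACKET_MOD = re.compile(r"\[([0-9.]+)\]")

def tide_to_casanovo(peptide: str) -> str:
    return _BRACKET_MOD.sub(r"+\1", peptide)

assert tide_to_casanovo("DM[15.9949]SPR") == "DM+15.9949SPR"
```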

As this list makes clear, the process of creating the revised nine-species benchmark and ensuring its quality has been iterative. I fully expect to need to make additional updates to the benchmark as others begin using it, so I will continue to update our GitHub repository of scripts and the benchmark itself as needed.
