Wrangling a de novo sequencing benchmark

In any machine learning study, high-quality data for training and validating the model is critical. This paper describes the result of an iterative process of data wrangling and quality control, which ultimately produced a benchmark dataset for de novo peptide sequencing from mass spectrometry data.

The behind-the-paper story here is about just how iterative this process can be. In a sense, the first iteration of the benchmark is the version described in the initial DeepNovo paper, though for all I know that version itself represents multiple prior iterations. We used that benchmark in the first paper describing our Casanovo de novo sequencing model. When we began working on the second paper about Casanovo, we were motivated to create a new version of the benchmark because we weren’t sure how the database searching and FDR control were done.


The process, as outlined in my lab notebook, ended up producing ten versions of the benchmark.

  1. I started by manually downloading the nine reference proteome FASTA files from UniProt. To produce the first version of the benchmark, I wrote scripts to automatically download all the mass spectrometry data using ppx, convert it to MGF format using ThermoRawFileParser, and search the data using the Tide search engine followed by Percolator. The peptide sequences for peptide-spectrum matches (PSMs) accepted at a 1% PSM-level false discovery rate (FDR) threshold were then inserted into the MGF files, discarding any spectra that failed to be identified. (A sketch of this pipeline appears after the list.)
  2. Initially, I used the same modifications that had been used in the DeepNovo analysis. However, we realized that we should make this benchmark consistent with the set of modifications in MassIVE-KnowledgeBase, because those were the modifications used to train Casanovo.
  3. For each of the nine species, I had initially used estimates of precursor m/z tolerance and fragment bin size generated by our tool, Param-Medic. However, it seemed more defensible to use the parameters listed in the publications describing these datasets, so I switched to using those.
  4. I discovered that some of the raw files had failed to download in the initial pass, so I fixed that problem and re-generated the benchmark.
  5. I had initially created a version of the benchmark with one big MGF file per species, but this was problematic because scan numbers ended up being repeated. So I switched to creating one MGF file per raw file.
  6. We found that some of the annotations in the original DeepNovo benchmark did not properly account for isotope errors (see Figure 5 in this paper), so I added handling of isotope errors to my search parameters. Overall, this change did not make a big difference in the number of accepted PSMs. For one species (Vigna mungo) the number dropped slightly; for all the others it increased by a few thousand PSMs.
  7. We realized that some peptides in the benchmark were shared between species, so I added a post-processing step to eliminate these shared peptides (a minimal sketch of this step also appears after the list).
  8. A user pointed out that our benchmark did not actually contain any N-terminal modifications. It turns out that this was a known bug in the Tide search engine, which had been recently fixed. I therefore re-ran the entire search procedure to generate a new version of the benchmark.
  9. One of the reviewers of the second Casanovo paper asked us to ensure that different modified forms of the same peptide sequence are all associated with a single species. This seemed like a good idea, so I made this change.
  10. Unfortunately, Tide and Casanovo do not agree on how to represent a peptide containing modifications: Tide puts them in square brackets, whereas Casanovo leaves off the brackets but precedes the mass with a “+”. I therefore added a cleaning step to convert all the Tide peptides to Casanovo format (also sketched after the list).
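
To make step 1 concrete, here is a minimal sketch of what that per-species pipeline looks like in Python, with calls out to the command-line tools. The accession number, file names, output paths, and search parameters below are placeholders rather than values from the benchmark, and the exact flags for ThermoRawFileParser and Crux should be checked against the installed versions.

```python
import subprocess
import ppx  # fetches ProteomeXchange/MassIVE datasets

# 1. Download the raw files for one dataset (placeholder accession).
proj = ppx.find_project("PXD000000")
raw_files = [f for f in proj.remote_files() if f.lower().endswith(".raw")]
proj.download(raw_files)

# 2. Convert each raw file to its own MGF file with ThermoRawFileParser
#    ("-f 0" selects MGF output).
for raw in raw_files:
    subprocess.run(
        ["ThermoRawFileParser", "-i", raw, "-o", "mgf/", "-f", "0"],
        check=True,
    )

# 3. Build a Tide index from the species' reference proteome, search the
#    spectra, and post-process the PSMs with Percolator (all via Crux).
subprocess.run(["crux", "tide-index", "proteome.fasta", "tide-index"], check=True)
subprocess.run(
    ["crux", "tide-search",
     "--precursor-window", "10", "--precursor-window-type", "ppm",
     "mgf/run01.mgf", "tide-index"],
    check=True,
)
subprocess.run(["crux", "percolator", "crux-output/tide-search.target.txt"], check=True)

# 4. PSMs with Percolator q-value <= 0.01 (1% PSM-level FDR) would then be
#    written back into the MGF files as ground-truth peptide labels, and
#    unidentified spectra discarded.
```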
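
Step 7 can likewise be sketched in a few lines of Python. The data structure here is an assumption for illustration: a dictionary mapping each species to its set of accepted peptide sequences, however those sets are actually loaded from the annotated MGF files.

```python
from collections import Counter

def remove_shared_peptides(peptides_by_species):
    """Drop any peptide sequence that appears in more than one species."""
    counts = Counter()
    for peptides in peptides_by_species.values():
        counts.update(set(peptides))
    shared = {pep for pep, n in counts.items() if n > 1}
    return {
        species: {pep for pep in peptides if pep not in shared}
        for species, peptides in peptides_by_species.items()
    }

# "PEPTIDEK" occurs in both species, so it is removed from each.
filtered = remove_shared_peptides({
    "yeast": {"PEPTIDEK", "ACDEFGHIK"},
    "human": {"PEPTIDEK", "LMNPQRSTK"},
})
assert filtered == {"yeast": {"ACDEFGHIK"}, "human": {"LMNPQRSTK"}}
```

In practice one might also want to strip modifications before comparing sequences, so that the filtering is consistent with the base-peptide bookkeeping described in step 9.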
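
Finally, the peptide-string cleanup in step 10 amounts to a small string rewrite. The example below is illustrative rather than taken from the benchmark scripts: the exact modification masses and notation depend on how Tide and Casanovo are configured, and a negative mass offset would need its sign handled explicitly.

```python
import re

# Rewrite a bracketed modification mass as a "+"-prefixed mass,
# e.g. "DM[15.9949]SPR" -> "DM+15.9949SPR".
_BRACKET_MOD = re.compile(r"\[([0-9.]+)\]")

def tide_to_casanovo(peptide: str) -> str:
    return _BRACKET_MOD.sub(r"+\1", peptide)

assert tide_to_casanovo("DM[15.9949]SPR") == "DM+15.9949SPR"
```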

As this list makes clear, the process of creating the revised nine-species benchmark and ensuring its quality has been iterative. I fully expect to need to make additional updates to the benchmark as others begin using it, so I will continue to update our GitHub repository of scripts and the benchmark itself as needed.
