In Fall 2020, I worked on a project to quantify protein translocation in macrophages responding to lipopolysaccharides using liquid-chromatography mass-spectrometry (LC/MS). To acquire the proteomic data, I intended to use a method called data-independent acquisition (DIA) to quantify thousands of proteins from each sample. The problem was that I had dozens of samples, and this method would analyze only one at a time, at a rate of about 2 hours per sample. I would therefore occupy, for days, an instrument that was already in high demand among other lab members, delaying both their work and my own. My PhD advisor, Nikolai Slavov, suggested that we multiplex the samples by chemically labeling them with isotopologous (non-isobaric) mass tags so that we could run them in parallel and increase throughput, a method we would eventually develop into plexDIA. While increasing sample throughput was the main intended benefit of plexDIA, we would come to discover a less-intended benefit: increased data completeness.
Multiplicatively increased throughput
Our goal with plexDIA was to develop an experimental and computational framework that would maintain the advantages of label-free (LF) DIA while scaling the throughput multiplicatively. We collaborated with Vadim Demichev, the author of DIA-NN (a widely used DIA data analysis software), and began exploring this possibility with standards composed of bacterial, yeast, and human proteomes spiked in at different ratios. These standards let us benchmark the quantitative accuracy and proteomic coverage of plexDIA against LF-DIA, as shown in Fig. 1a. After months of experimental and analytical fine-tuning, plexDIA achieved only about 84% of the proteomic depth of LF-DIA. It was nearly multiplicative scaling, but not quite; naturally, I was disappointed.
However, I also investigated data completeness: the proportion of proteins that are quantified in common across samples. As it turns out, data completeness is an unexpectedly major advantage of plexDIA. plexDIA quantified approximately 6,300 proteins across each of the three samples, while LF-DIA quantified only 5,850 on average; this is shown as Venn diagrams of proteins quantified across samples for triplicates in Fig. 1b.
Fig. 1: Benchmarking plexDIA proteomic coverage
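To make the notion of data completeness concrete, here is a minimal Python sketch. It assumes one particular definition for a pair of samples (proteins quantified in both, divided by proteins quantified in either), which may differ from the exact calculation behind the figures. The function names and the toy protein IDs are hypothetical and are not real measurements.

```python
from itertools import combinations


def pairwise_completeness(quantified):
    """Return completeness for every pair of samples.

    `quantified` maps a sample name to the set of protein IDs quantified in it.
    Completeness is taken here (an assumption) as shared / union for each pair.
    """
    result = {}
    for a, b in combinations(sorted(quantified), 2):
        union = quantified[a] | quantified[b]
        shared = quantified[a] & quantified[b]
        result[(a, b)] = len(shared) / len(union) if union else float("nan")
    return result


def shared_across_all(quantified):
    """Proteins quantified in every sample (the center of a Venn diagram)."""
    return set.intersection(*quantified.values()) if quantified else set()


# Toy, made-up protein sets for illustration only.
toy = {
    "sample_A": {"P1", "P2", "P3", "P4"},
    "sample_B": {"P1", "P2", "P3"},
    "sample_C": {"P2", "P3", "P5"},
}

completeness = pairwise_completeness(toy)
core = shared_across_all(toy)

for pair, value in completeness.items():
    print(pair, f"completeness={value:.2f}", f"missingness={1 - value:.2f}")
print("quantified in all samples:", sorted(core))
```

Missingness is simply one minus completeness under this definition, so the same pairwise values can be arranged into a sample-by-sample matrix for a heatmap.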
How is data completeness increased with plexDIA?
Samples run in parallel will have peptides that co-elute. Because the mass offset of each sample’s peptides is known, confident peptide sequence identifications from one sample can be extended to the other co-eluting samples, which improves data completeness relative to LF-DIA. Multiplexed methods generally have this benefit, and we suspected the same might be true for plexDIA. However, given that LF-DIA is well known for achieving high data completeness, it was unclear how much plexDIA could improve upon it. Consistent with its renown, LF-DIA data completeness between replicates of the same sample was high; yet data completeness was surprisingly low between pairs of samples that had dissimilar protein compositions. In other words, the more a pair of samples’ proteomes differed, the more data were missing between them with LF-DIA. This is problematic if one wishes to quantify differentially abundant proteins. plexDIA, however, maintained low data missingness across dissimilar samples, achieving as low as 2% missingness within a set (i.e., a run), as shown in Fig. 2a.
Fig. 2: High data completeness with plexDIA
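The idea of extending an identification to co-eluting samples can be sketched in a few lines of Python. This is an illustration only, not how DIA-NN implements it: the channel names, the per-label offsets of 0, 4, and 8 Da, and the `Precursor` type are hypothetical placeholders, and the number of labels per peptide is taken as a given input.

```python
from dataclasses import dataclass

# Hypothetical per-label mass offsets (Da) for three channels; placeholder values.
CHANNEL_OFFSETS_DA = {"ch0": 0.0, "ch4": 4.0, "ch8": 8.0}


@dataclass(frozen=True)
class Precursor:
    channel: str
    mz: float
    charge: int
    n_labels: int  # number of mass tags attached to the peptide (assumed known)
    retention_time_min: float


def expected_in_channel(identified, target_channel, offsets=CHANNEL_OFFSETS_DA):
    """Predict where the same peptide should appear in another channel.

    The labeled versions co-elute, so the retention time is carried over;
    only m/z shifts, by (labels x offset difference) / charge.
    """
    delta = offsets[target_channel] - offsets[identified.channel]
    shifted_mz = identified.mz + identified.n_labels * delta / identified.charge
    return Precursor(
        target_channel,
        shifted_mz,
        identified.charge,
        identified.n_labels,
        identified.retention_time_min,
    )


# A confident identification in one channel (made-up numbers).
hit = Precursor(channel="ch8", mz=654.0, charge=2, n_labels=2, retention_time_min=35.2)

# Coordinates at which to look for the same peptide in the other channels.
targets = {
    ch: expected_in_channel(hit, ch) for ch in CHANNEL_OFFSETS_DA if ch != hit.channel
}
for ch, precursor in targets.items():
    print(ch, precursor.mz, precursor.retention_time_min)
```

Once the expected m/z and retention time are known, signal can be extracted at those coordinates in the other samples even when their own evidence would be too weak for a confident identification.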
An example of what data completeness can enable
A major benefit of increased data completeness is that it enables quantifying proteins with very different abundances across samples. Indeed, data completeness was an advantage of plexDIA at all input sizes, from samples composed of thousands of cells down to single cells. One compelling example was the quantification of KRT7 in a single monocyte (U-937), as shown in Fig. 2b. Because of its low abundance, there was insufficient evidence in this single U-937 cell to assign a confident sequence identity to a KRT7 peptide. However, because a cell type (PDAC) with ~30-fold greater abundance of KRT7 was run in parallel, and the exact retention time and m/z coordinates were therefore known, it was possible to quantify this peptide in the U-937 cell, something not easily possible with LF-DIA.
A more complete dataset
Not only was data completeness improved for samples run in parallel, it was improved for the entire dataset with plexDIA. We found an interesting pattern when data completeness was plotted as a heatmap between samples and replicates, as shown in Fig. 2c: different samples analyzed across plexDIA sets have the same data completeness as replicates of LF-DIA samples. Perhaps this is because LF-DIA replicates and plexDIA sets are then subject to the same source of variability: liquid chromatography. Therefore, plexDIA improves data completeness not only for samples run in parallel but also across sets, by buffering the variability of proteomic compositions. This is particularly useful for projects analyzing many samples that are proteomically dissimilar, as it will reduce missingness across the entire dataset. For what began as a project to increase proteomic throughput, data completeness was an unexpectedly major advantage.
I would like to thank everyone at the Slavov Lab, especially Prof. Slavov for his mentorship and guidance throughout the project, Andrew Leduc for preparing single cells for proteomic analysis using his nPOP method, and Prof. Demichev for creating and fine-tuning the plexDIA module in DIA-NN!