Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by John Cottrell (July 15, 2015)

Search only for peptides we care about?

The July issue of Nature Methods contains a commentary from William Noble with the slightly provocative title "Mass spectrometrists should search only for peptides they care about". At first sight, this seems to contradict our general advice to include contaminant sequences when searching a single organism database or when using a taxonomy filter.

One of Noble’s examples is a study related to malaria, where the sample contained both human and plasmodium proteins. He observes that, if only the plasmodium proteins are of interest, you can get better sensitivity at a given FDR by searching a plasmodium-only database, because the search space is one third the size of a combined plasmodium and human database.

Whether this is a good idea is partly a matter of where the experiment stands in the range from discovery to targeted. Indeed, Noble points out that "… focusing an experiment on a small, targeted collection of peptides necessarily eliminates the possibility of serendipitous, unexpected discoveries." But, if we assume that the sample is already well characterised and that no aspect of the human proteins is of any interest, is it best to search a plasmodium-only database?

Noble mentions the issue of shared peptides. This is not a serious problem for his example because human and plasmodium are distant proteomes, with few peptides in common. Shared peptides leading to incorrect protein inference would be a more serious concern if the two proteomes were more closely related. Obvious examples are reporting a BSA contaminant as HSA in a study of human proteins or reporting human keratin as mouse keratin in a study of mouse proteins. Noble suggests eliminating shared peptides from the database, but this is not a trivial procedure, and doesn’t seem like a practical proposition for most researchers.

Alternatively, Noble proposes "… first searching the spectra against a database containing only the irrelevant peptides and eliminating from the data set all spectra that match this ‘garbage’ database with high confidence. The remaining spectra could then be searched against the database of interesting peptides. For stringent FDR thresholds, I believe that this approach is likely to yield slightly better power than the simple strategy proposed here, at the expense of being somewhat more complicated to implement." In fact, this is not at all complicated to do in Mascot. For a batch of files, you can use a Mascot Daemon follow-up search. For interactive searching, just choose Re-search with Non-significant selected.
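The logic of this two-pass strategy can be sketched in a few lines of Python. This is only an illustration of the idea, not Mascot's or Mascot Daemon's interface: the function `two_pass`, the `expect_cutoff` parameter and the toy search functions are all hypothetical, and a "search" is assumed to be anything that returns the best expect value per spectrum.

```python
def two_pass(spectra, search_garbage, search_interesting, expect_cutoff=0.05):
    """Search the irrelevant database first, then re-search what is left.

    `search_garbage` and `search_interesting` are hypothetical callables that
    take a list of spectrum IDs and return {spectrum_id: best_expect_value}.
    """
    garbage_hits = search_garbage(spectra)
    # Keep only spectra with no confident match to the irrelevant sequences.
    remaining = [
        s for s in spectra
        if garbage_hits.get(s, float("inf")) > expect_cutoff
    ]
    return remaining, search_interesting(remaining)


# Toy stand-ins for real searches (made-up expect values).
garbage_expect = {"s1": 0.001, "s2": 0.8, "s3": 0.2}
interesting_expect = {"s2": 0.002, "s3": 0.5, "s4": 0.01}

spectra = ["s1", "s2", "s3", "s4"]
remaining, second_pass = two_pass(
    spectra,
    lambda ids: {s: garbage_expect[s] for s in ids if s in garbage_expect},
    lambda ids: {s: interesting_expect[s] for s in ids if s in interesting_expect},
)
print(remaining)
print(second_pass)
```

In this toy run, spectrum `s1` is removed by the first pass because it matches the irrelevant database with high confidence, and only the other three spectra go forward to the second search.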

A more serious issue, which Noble doesn’t discuss, is the composition of the dataset. If the spectra are 90% plasmodium and 10% human, searching a plasmodium-only database will almost certainly give better sensitivity. But, if the spectra are 10% plasmodium and 90% human, the sensitivity may appear to get worse. For example, here are the counts of significant PSMs at 1% FDR for searching an Orbitrap dataset that is approximately 90% vaccinia and 10% bovine against SwissProt (Vaccinia virion data courtesy Paul Gershon, Department of Molecular Biology & Biochemistry, UC-Irvine):

taxonomy              | database entries | target PSMs | decoy PSMs | FDR
mammalia plus viruses |           82,973 |      11,525 |        115 | 1.00%
viruses               |           16,602 |      14,266 |        142 | 1.00%
mammalia              |           66,371 |         453 |          4 | 0.88%
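As a quick check, the FDR column is consistent with the ratio of decoy PSMs to target PSMs. That formula is an assumption here (the post does not state how the FDR was computed), but it reproduces all three rows:

```python
# Counts of significant PSMs taken from the table above: (target, decoy).
rows = {
    "mammalia plus viruses": (11525, 115),
    "viruses": (14266, 142),
    "mammalia": (453, 4),
}


def fdr_percent(target_psms, decoy_psms):
    """Assumed estimate: decoy matches as a percentage of target matches."""
    return 100.0 * decoy_psms / target_psms


for taxonomy, (target, decoy) in rows.items():
    print(f"{taxonomy}: {fdr_percent(target, decoy):.2f}%")
```

With only 4 decoy matches in the mammalia search, the estimate moves in coarse steps, which would explain why that row sits at 0.88% rather than at 1.00%.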

If we are interested in the vaccinia proteins, there is clearly a substantial gain in sensitivity by searching with a taxonomy of viruses. If we are interested in the bovine proteins, it isn’t immediately obvious whether the search with a taxonomy of mammalia has done better or worse than the wider search. In fact, the sensitivity is substantially worse for the narrower search, as can be seen from the counts of significant target PSMs for individual proteins:

accession   | mammalia plus viruses | mammalia-only
HSP7C_BOVIN |                   107 |            74
RS27A_BOVIN |                    49 |            23
TBB5_BOVIN  |                    17 |            13
RAB7A_BOVIN |                    14 |             9
ALBU_BOVIN  |                     9 |             6
EF1A1_BOVIN |                     5 |             3
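To put a number on the loss, the fraction of PSMs retained by the narrower search can be worked out directly from the counts in the table above. A small Python sketch (the variable names are just for illustration):

```python
# Significant target PSMs per protein: (mammalia plus viruses, mammalia-only).
counts = {
    "HSP7C_BOVIN": (107, 74),
    "RS27A_BOVIN": (49, 23),
    "TBB5_BOVIN": (17, 13),
    "RAB7A_BOVIN": (14, 9),
    "ALBU_BOVIN": (9, 6),
    "EF1A1_BOVIN": (5, 3),
}

for accession, (wide, narrow) in counts.items():
    print(f"{accession}: {narrow}/{wide} retained ({100 * narrow / wide:.0f}%)")

total_wide = sum(wide for wide, _ in counts.values())
total_narrow = sum(narrow for _, narrow in counts.values())
print(f"all six: {total_narrow}/{total_wide} retained "
      f"({100 * total_narrow / total_wide:.0f}%)")
```

Taken together, these six proteins keep 128 of their 201 PSMs in the mammalia-only search, or roughly 64%.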

In general terms, if you can narrow the search space and retain most of the true matches, the sensitivity at a given FDR is likely to improve. This could be by narrowing the database, as illustrated by Noble’s examples, or it could equally well be by narrowing the mass tolerance or dropping variable modifications. On the other hand, if you narrow the search space and discard most of the true matches, the sensitivity at a given FDR is likely to get worse, as shown by the vaccinia / bovine numbers.

Removing the vaccinia target sequences removes all of the true vaccinia PSMs from the target count. The score distribution for the decoy PSMs is shifted slightly lower, because the database is smaller, but this is a marginal effect, so the number of decoy matches at any given score is almost unchanged. To achieve 1% FDR against the much smaller count of true bovine PSMs, the threshold has to be raised very high so that a larger proportion of the decoy matches is rejected, and the apparent sensitivity becomes worse.
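The effect can be reproduced with a toy model. Everything in the sketch below is invented for illustration and is not taken from the vaccinia dataset: the score levels, the counts, the halving tail of random matches, and the assumption that FDR is estimated as decoy matches divided by target matches above the threshold. The function `lowest_threshold` is likewise hypothetical, not a Mascot routine.

```python
def lowest_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cut-off whose decoy/target ratio is <= max_fdr.

    Returns (threshold, n_target, n_decoy), or None if no cut-off qualifies.
    """
    best = None
    # Walk from the highest candidate cut-off down, remembering the last
    # (i.e. lowest) one that still satisfies the FDR limit.
    for thr in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= thr for s in target_scores)
        n_decoy = sum(s >= thr for s in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            best = (thr, n_target, n_decoy)
    return best


levels = range(25, 75, 5)  # toy score levels 25, 30, ..., 70
bovine_true = [s for s in levels for _ in range(10)]     # 100 true PSMs
vaccinia_true = [s for s in levels for _ in range(200)]  # 2000 true PSMs
# High-scoring tail of random matches: 64 at score 20, 32 at 25, ... 1 at 50.
random_tail = [20 + 5 * k for k in range(7) for _ in range(2 ** (6 - k))]

# Wide search: every spectrum finds its true match among the targets.
wide = lowest_threshold(vaccinia_true + bovine_true, random_tail)
# Narrow search: vaccinia spectra can only give random target matches.
narrow = lowest_threshold(bovine_true + random_tail, random_tail)

bovine_wide = sum(s >= wide[0] for s in bovine_true)
bovine_narrow = sum(s >= narrow[0] for s in bovine_true)
print("wide search:  ", wide, "bovine PSMs accepted:", bovine_wide)
print("narrow search:", narrow, "bovine PSMs accepted:", bovine_narrow)
```

In this toy model the wide search reaches the FDR limit at a score of 35 and accepts 80 of the 100 true bovine PSMs, while the narrow search has to go up to 55 and accepts only 40. The scores of the true bovine matches are identical in both cases; only the threshold has moved.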

The Mascot scores of the true bovine PSMs are the same in the mammalia-only search as in the mammalia plus viruses search, and the expect values (or p values or PEPs) for individual target PSMs are slightly better, because the search space is smaller. In this sense, the matches are no less reliable. The apparent loss of sensitivity is a consequence of FDR being a very simple yardstick. Unfortunately, it is the generally accepted measure for quality control, and you might have a hard job getting results with a very high FDR published.


2 comments on “Search only for peptides we care about?”

  1. susan said:

    After reading this post, I want to say that we can only focus on one thing when we are doing an ordinary task. However, that does not carry over to experiments. We have to consider many factors when we are doing an experiment, or we will miss many of them. Many inventions came about by accident, because scientists consider many aspects of their work.

  2. Steven Harrison said:

    Is the reason for restricting a search to specific peptides a hypothetical time constraint? Is it that a person may need to search for specific peptides and, being short of time or resources, has to prioritize? I assume that with enough time and resources, you could do both a generalized search and a specialized one. But I don’t know what it takes to do one of these tests.