High mass accuracy: precursors
High-performance instruments can deliver low-ppm accuracy for both precursors and fragments. How does this impact database searching? In this post, the first of two, we’ll look at the implications of very accurate precursor m/z.
High mass accuracy and high resolution have transformed mass spectrometry over the last few years. Apart from larger file sizes, and perhaps longer peak picking times, the consequences have been entirely positive. Yet some care is needed when it comes to database searching, because high accuracy can mean that the database contains very few candidate sequences for any given spectrum.
This problem is most acute with a single-organism database or taxonomy filter and tryptic specificity. To illustrate, here are some counts of candidate peptides. The database is SwissProt 2012_11, and the counts are averages over the four spectra in this search.
| Taxonomy | # sequences | # candidates (1 ppm) | # candidates (10 ppm) | # candidates (100 ppm) |
| --- | --- | --- | --- | --- |
| Arabidopsis | 11674 | 7 | 66 | 373 |
| green plants | 33299 | 150 | 286 | 989 |
| all entries | 538585 | 487 | 2545 | 13971 |
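To make the arithmetic concrete, here is a minimal sketch of how such counts arise, assuming a fully tryptic digest and monoisotopic peptide masses. The protein sequence, helper names, and query mass are all hypothetical stand-ins; a real search engine indexes the digest far more efficiently.

```python
import bisect

# Monoisotopic residue masses (Da); peptide mass = residues + one water.
WATER = 18.010565
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}

def tryptic_peptides(protein):
    """Fully tryptic digest: cleave after K/R, but not before P."""
    peptide = []
    for i, aa in enumerate(protein):
        peptide.append(aa)
        if aa in 'KR' and protein[i + 1:i + 2] != 'P':
            yield ''.join(peptide)
            peptide = []
    if peptide:
        yield ''.join(peptide)

def peptide_mass(seq):
    return WATER + sum(RESIDUE[aa] for aa in seq)

def count_candidates(sorted_masses, precursor_mass, ppm):
    """Database peptides within +/- ppm of the precursor mass."""
    delta = precursor_mass * ppm * 1e-6
    lo = bisect.bisect_left(sorted_masses, precursor_mass - delta)
    hi = bisect.bisect_right(sorted_masses, precursor_mass + delta)
    return hi - lo

# Usage: index the digest once, then query per spectrum.
# The protein and query mass are toy values, not taken from the post.
proteins = ['MKWVTFISLLFLFSSAYSRGVFRR']
masses = sorted(peptide_mass(pep)
                for prot in proteins
                for pep in tryptic_peptides(prot))
query = peptide_mass('GVFR')
for tol in (1, 10, 100):
    print(f'{tol} ppm: {count_candidates(masses, query, tol)} candidate(s)')
```

The point of the sorted index is that each tolerance window becomes two binary searches, so tightening the tolerance by a factor of 10 shrinks the candidate list roughly in proportion, as in the table above.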
If we knew all the correct matches were in the database, there wouldn’t be a problem; we could reasonably assume that the best match was the correct match. Unfortunately, with real samples, this is never the case. The protein sequences in SwissProt are consensus sequences, and there will be variants in the sample that are not represented in the database. More importantly, many of the analyte peptides will be modified in ways that are not considered in the search, so a correct match is not possible. We must therefore compare the score of the best match against some probability-based threshold to decide whether it is significant.
Seven candidates is obviously not much of a choice, and score thresholds, which are based on the number of candidates, will be low. In theory, the numbers still work. But, with the best will in the world, testing a spectrum against seven peptides is never going to be as reliable as testing against several thousand.
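As a rough illustration of how the threshold scales with the candidate count, here is a sketch of a Mascot-style identity threshold, −10·log10(p/N) at p = 0.05. Treat the formula as an assumption about the scoring scheme, not a definitive statement of any engine’s implementation.

```python
import math

def identity_threshold(n_candidates, p=0.05):
    """Score above which a random match among n candidates has
    probability below p, on a -10*log10 scale."""
    return -10 * math.log10(p / n_candidates)

for n in (7, 487, 2545):
    print(f'{n} candidates -> threshold {identity_threshold(n):.1f}')
# 7 candidates    -> threshold 21.5
# 2545 candidates -> threshold 47.1
```

By this formula, 7 candidates give a threshold of about 21.5, against roughly 47 for 2545 candidates: little separation between random and genuine matches at the low end.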
Provided your data set is of reasonable size, the best insurance against the unexpected is a target-decoy search. The critical requirement is to have sufficient significant matches in the target to make a reasonably precise estimate of the FDR. If you are aiming at 1% FDR, you should have at least 1000 significant matches in the target (and hence around 10 in the decoy). Then, if you are lucky enough to have data with low-ppm accuracy, you can take a purely empirical approach: the correct search conditions are those that give the best sensitivity at the target FDR. Who cares whether this is 2 ppm or 20 ppm?
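Here is one way to put that empirical approach into code: a sketch, assuming you have plain lists of target and decoy PSM scores from the same search, of counting significant target matches at a chosen FDR. The variable names in the usage comment are hypothetical.

```python
def matches_at_fdr(target_scores, decoy_scores, fdr=0.01):
    """Walk down the ranked target list and return the largest number
    of target matches for which (decoys above threshold / targets
    above threshold) stays at or below the requested FDR."""
    targets = sorted(target_scores, reverse=True)
    decoys = sorted(decoy_scores, reverse=True)
    best, d = 0, 0
    for i, score in enumerate(targets, start=1):
        while d < len(decoys) and decoys[d] >= score:
            d += 1
        if d / i <= fdr:
            best = i
    return best

# Usage: run the same data at, say, 2 ppm and 20 ppm, and prefer
# whichever tolerance yields more matches at 1% FDR:
# significant = matches_at_fdr(target_scores_2ppm, decoy_scores_2ppm)
```

The estimator decoys/targets is the simplest form; some pipelines prefer the more conservative (decoys + 1)/targets, or 2·decoys/(targets + decoys) for a concatenated database, but the comparison across search conditions works the same way.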
If your data set is small, you are dependent on the accuracy of the scoring algorithm. Safest to ensure you have a reasonable number of candidate sequences for each spectrum, even if this means opening out the tolerance by a factor of 10 or searching with a wider taxonomy than strictly necessary.
Keywords: FDR, scoring, statistics