High FDRs for methylated peptides
"Large Scale Mass Spectrometry-based Identifications of Enzyme-mediated Protein Methylation Are Subject to High False Discovery Rates" in the March edition of Molecular & Cellular Proteomics represents a very substantial and systematic piece of work from the University of New South Wales. Three types of sample preparation (coomassie gel, unstained gel, HILIC) were combined with three different ionisation methods (CID, ETD, HCD) to give nine data sets. The headline finding is that FDRs for methylated peptides are terrible – mostly between 70% and 90% – even though the global FDR estimated by target/decoy is approximately 1%.
Scary stuff! Why are the numbers so high and does this indicate some fundamental deficiency in target/decoy?
There are several aspects to be considered – too many to deal with in a single article. Here, we will discuss whether we can expect a subgroup of matches to share the global FDR. Future articles will explore the limitations of site analysis by database search and issues associated with target/decoy statistics for combinations of search results.
The UNSW study focuses on methylated peptides: whether the modification is post-translational and whether it has been localised to the correct residue. Multiple searches were performed using discrete sets of variable modifications and the results consolidated. All searches included Carbamidomethyl (C) and Oxidation (M) as varmods, combined with one of the following sets:
- Methyl (K), Dimethyl (K), Trimethyl (K)
- Methyl (R), Dimethyl (R)
- Methyl (DE)
- Ethyl (DE)
- Propyl (DE)
- Propionamide (C)
The question is, if the global PSM FDR is set to 1% using target/decoy, can we assume that the FDR for methylated peptides is also 1%? One way to gain an insight is to compare the matches from the target search with those from the decoy search.
The ‘raw’ search results were not included in the files available from PRIDE PXD002857, so we took one of the sets of raw files (nostainbands_orbi_1.raw through 28) and searched the merged peak lists using the parameters given in the method section of the paper plus the first set of variable modifications. The results will not be identical to those in the paper (different peak picking, later release of SwissProt, Carbamidomethyl (C) searched as fixed) but they will be very similar. Matches were filtered for 1% PSM FDR by target/decoy using the Mascot expect value as the cut-off. Again, this is not identical to the procedure in the publication, which used a q-value cut-off in Proteome Discoverer.
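For anyone who wants to reproduce this kind of filtering outside Mascot or Proteome Discoverer, the logic is simple enough to sketch in a few lines of Python. This is only an illustration, not the code used here: the PSM fields (`expect`, `is_decoy`) are assumed names for the Mascot expect value and a decoy flag, not an actual export format.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Keep the largest set of PSMs whose estimated FDR is <= max_fdr.

    Each PSM is a dict with an 'expect' value (lower = more confident)
    and an 'is_decoy' flag; decoy matches stand in for false target matches.
    """
    ranked = sorted(psms, key=lambda p: p["expect"])
    targets = decoys = 0
    best_cut = 0
    for i, psm in enumerate(ranked, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        # FDR estimate if the cut-off were set at this expect value
        if targets and decoys / targets <= max_fdr:
            best_cut = i
    # Target matches at the loosest cut-off that still satisfies max_fdr
    return [p for p in ranked[:best_cut] if not p["is_decoy"]]
```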
This table lists some characteristics of the significant matches from the target and decoy searches. The 615 decoy matches are assumed to be false while the 61,531 target matches are believed to be 99% true.
| | target | decoy |
|---|---|---|
| # matches | 61,531 | 615 |
| score > 50 | 50% | 1% |
| score < 30 | 12% | 51% |
| average length | 15 | 9 |
| length < 10 | 15% | 67% |
| 0 varmods | 97% | 59% |
| 1 varmod | 3% | 31% |
| 2 or more varmods | 0% | 10% |
| L at N-term | 11% | 14% |
| Delta < 1 ppm | 74% | 65% |
Hopefully, it is generally understood that when we say 1% FDR, this applies to the set of search results, not individual matches. A low scoring match is more likely to be false than a high scoring match. If we were to take a subgroup of the results with scores below 30, the FDR would be higher than 1%. If we took a subgroup with scores above 50, the FDR would be lower than 1%. The figures in the table quantify this effect for this particular search. For example, 99% of the false matches have Mascot scores below 50.
If we take a subgroup based on some other property that correlates with score, we will see a similar effect. The table shows that short peptides are over-represented in the false matches. The average length for correct matches is 15 residues, while for false matches it is 9, and 67% of false matches are shorter than 10 residues compared with 15% of true matches. This is because score depends on length: the longer the peptide, the more fragment peaks available to be matched. So, the FDR for the subgroup of 9-mers would be substantially higher than the global FDR.
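Given a filtered result set in which the decoy matches have been retained, these subgroup FDRs are straightforward to estimate. A minimal sketch, again using assumed field names (`score`, `sequence`, `num_varmods`, `is_decoy`) rather than a real export format:

```python
def subgroup_fdr(psms, predicate):
    """Estimate the FDR within a subgroup of already-accepted PSMs,
    using the decoy matches in that subgroup as a proxy for false targets."""
    targets = sum(1 for p in psms if predicate(p) and not p["is_decoy"])
    decoys = sum(1 for p in psms if predicate(p) and p["is_decoy"])
    return decoys / targets if targets else float("nan")

# Example subgroups from the discussion above
subgroups = {
    "score < 30":  lambda p: p["score"] < 30,
    "length < 10": lambda p: len(p["sequence"]) < 10,
    "any varmods": lambda p: p["num_varmods"] > 0,
}
# for name, pred in subgroups.items():
#     print(name, subgroup_fdr(accepted_psms, pred))
```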
The numbers for variable modifications are the important ones. Only 3% of true matches contain any variable modifications compared with 41% of false matches. This is not because the number of variable modifications in a peptide correlates with score; it is because these are chance matches and most of the candidate peptides are modified. (The abundances of K and R in SwissProt are similar, so we can expect half of the tryptic limit peptides to have K at the C-term. There are 3 varmods with specificity K, so for each unmodified peptide containing a single K there are 3 modified peptides. The enzyme specificity in these searches allows for 2 missed cleavages, so for every limit peptide we also have peptides with 1 or 2 internal K or R. For a peptide with 2 K there are 15 modified possibilities for each unmodified peptide, and 63 for a peptide with 3 K.)
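The arithmetic in the parenthesis is easy to verify: with m variable modifications sharing a specificity, each candidate site can be in one of m + 1 states, so a peptide with k such sites has (m + 1)**k − 1 modified forms for every unmodified one. A quick check in Python:

```python
def modified_forms(num_sites, num_mods_per_site=3):
    """Modified candidates per unmodified peptide when each site can carry
    one of num_mods_per_site mods or none (here Methyl/Dimethyl/Trimethyl on K)."""
    return (num_mods_per_site + 1) ** num_sites - 1

for k in (1, 2, 3):
    print(f"{k} K: {modified_forms(k)} modified candidates per unmodified peptide")
# prints 3, 15 and 63, matching the figures quoted above
```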
This effect isn’t specific to methylation; it applies to all variable modifications to a greater or lesser extent. So, the answer to the earlier question is no – we cannot assume that the FDR for modified peptides will be the same as the global FDR.
As a sanity check, the final two rows of the table show a couple of characteristics that occur with similar frequency in the target and decoy matches. It’s hard to imagine how having leucine at the amino terminus would have any effect on the score, so this would be a safe (though useless) basis for slicing up the results. The numbers for mass accuracy are slightly surprising. One might have expected false matches to have a more uniform distribution of mass errors across the +/- 5 ppm tolerance window than true matches, but this seems not to be the case for this particular search.
Can anything be done about the FDR for modified peptides? Yes: if this is the most important aspect of the experiment, and we are willing to sacrifice general sensitivity, we can simply tighten the significance threshold. For this particular search, counting all varmods including oxidation, we reach 1% FDR for peptides with varmods (1,247 target, 12 decoy) at a significance threshold of 0.0008, with a global FDR of 0.12% (46,050 target, 56 decoy).
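To make the idea concrete, here is a hedged sketch of how such a cut-off could be found by scanning expect-value thresholds and checking the FDR of the modified subgroup at each one. The field names are the same illustrative assumptions as before, and the 0.0008 threshold quoted above comes from the actual search results, not from this code.

```python
def loosest_cutoff_for_modified(psms, max_fdr=0.01):
    """Find the loosest expect-value cut-off at which the subgroup of PSMs
    carrying variable modifications still has decoy/target <= max_fdr."""
    for cut in sorted({p["expect"] for p in psms}, reverse=True):
        kept = [p for p in psms if p["expect"] <= cut]
        modified = [p for p in kept if p["num_varmods"] > 0]
        decoys = sum(1 for p in modified if p["is_decoy"])
        targets = len(modified) - decoys
        if targets and decoys / targets <= max_fdr:
            return cut
    return None  # no cut-off satisfies the requested subgroup FDR
```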
In the MCP paper, the authors reported that acceptable FDRs for methylated peptides could not be achieved by simply raising the score threshold, but this was using a more rigorous definition of what constitutes a false positive. Discriminating between matches to the correct sequence with different patterns of modifications is a much tougher problem than filtering out matches to unrelated sequences, and will be the subject of the next article.
Keywords: FDR, methylation, modification
Two important factors determine the relationship between the subgroup FDR of modified peptides and the global FDR of all identified peptides: one is the abundance of modified peptides in the protein sample (or in the MS data), and the other is the abundance of modified peptides in the database (or in the search space). If and only if these two abundances (or probabilities, in other words) are equal does the subgroup FDR equal the global FDR at the same score threshold. This hardly ever happens in reality, because the two abundances are not related at all: one belongs to the real world of the sample and the other to the virtual world of the search space. So a proper way to control the FDR of modified peptides is to separate them from the whole result set and estimate their FDR separately using the target/decoy strategy. I have carried out some formal analyses of this problem that may be of interest (Molecular & Cellular Proteomics, 13(5):1359-1368; Statistics and Its Interface, 5:47-59). I think the conclusions apply to other identification objectives beyond modifications, e.g. novel peptides in proteogenomics (Bioinformatics, 31(20):3249-3253).
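A minimal sketch of the separate estimation described in this comment, reusing the same target/decoy logic as the first sketch but applied only within the subgroup of interest (field names are again illustrative assumptions):

```python
def filter_subgroup_at_fdr(psms, in_subgroup, max_fdr=0.01):
    """Apply a target/decoy threshold within a subgroup only, e.g.
    in_subgroup = lambda p: p['num_varmods'] > 0 for modified peptides."""
    sub = sorted((p for p in psms if in_subgroup(p)), key=lambda p: p["expect"])
    targets = decoys = 0
    best_cut = 0
    for i, psm in enumerate(sub, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        # Subgroup FDR estimate at this expect-value cut-off
        if targets and decoys / targets <= max_fdr:
            best_cut = i
    return [p for p in sub[:best_cut] if not p["is_decoy"]]
```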