
Site analysis and localisation confidence

If a peptide has two serines and a single phosphate on one of them, there may or may not be evidence in the MS/MS spectrum to favour one site over the other. It depends on the separation of the two sites, whether there are sequence ions in the region between the potential sites, and the signal-to-noise ratio of the assignable fragment ion peaks.

Mascot Delta Score

When there is site ambiguity, Mascot reports localisation confidence using the Mascot Delta Score, or MD-score. Localisation confidence is based on the score difference between the matches to alternative arrangements of the modifications on the same peptide sequence.

The concept was quantified by Savitski, M. M., et al. (2011). "Confident Phosphorylation Site Localization Using the Mascot Delta Score." MCP 10: M110.003830. Very briefly, a collection of 180 synthetic analogs of natural phosphopeptides was analysed to quantify the accuracy of using the score difference between the top two matches. This made it possible to determine the false localisation rate for a given score difference. As might be expected, the numbers were observed to have some dependency on instrument characteristics and ionisation method.

The default setting in Mascot is slightly more conservative than the false localisation rate (FLR) data reported by Savitski et al., such that two matches with an MD-score of 10 will be reported as ‘probabilities’ of 91% and 9%. This is based on the Mascot score being -10log10(P), where P is the probability of a random match. Hence, a difference of 10 in the score corresponds to a factor of 10 in the probability of the peptide sequence match.
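The arithmetic behind the 91% and 9% figures can be sketched as follows. This is a minimal illustration, not a description of Mascot's internals: it assumes each arrangement is weighted by a factor of 10 per 10 score units below the best arrangement, and that the weights are then normalised to sum to one. The function name site_probabilities is hypothetical.

```python
def site_probabilities(scores, md10_prob=0.1):
    """Turn the scores of alternative arrangements into normalised 'probabilities'.

    Assumption (inferred from the 91% / 9% example, not taken from Mascot itself):
    an arrangement scoring delta below the best one gets a relative weight of
    md10_prob ** (delta / 10), and the weights are normalised to sum to 1.
    """
    best = max(scores)
    weights = [md10_prob ** ((best - s) / 10.0) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Two arrangements separated by an MD-score of 10
for p in site_probabilities([50.0, 40.0]):
    print(f"{p:.0%}")
# prints 91% then 9%
```

Only the score differences matter under this assumption, so any pair of scores 10 apart gives the same split.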

The sensitivity can be adjusted using a global parameter setting in the options section of mascot.dat. The default corresponds to SiteAnalysisMD10Prob 0.1. Decrease this value (e.g. to 0.05) to make the numbers more conservative. If you are tempted to increase the setting (e.g. to 0.2) to make the effect for a given score difference more dramatic, we recommend testing the accuracy of the results by analysing some known standards, as in Savitski et al.
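To get a feel for the setting before changing it, the following sketch shows how the reported split for an MD-score of 10 would move with different values. It assumes the setting is the relative weight given to a runner-up that scores 10 below the top match, with only two arrangements in play. The function name is hypothetical.

```python
def top_site_probability(md_score, md10_prob):
    """'Probability' of the top arrangement when there is a single alternative.

    Assumed model: the alternative is weighted md10_prob ** (md_score / 10)
    relative to the top arrangement, and the two weights are normalised.
    """
    ratio = md10_prob ** (md_score / 10.0)
    return 1.0 / (1.0 + ratio)


for setting in (0.05, 0.1, 0.2):
    pct = 100 * top_site_probability(10, setting)
    print(f"SiteAnalysisMD10Prob {setting}: {pct:.1f}%")
# prints 95.2%, 90.9% and 83.3% for the top arrangement
```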

Requirements

Site analysis is performed whenever the top rank match is significant and contains one or more variable modifications for which alternative arrangements are possible. Although the concept is simple, some aspects are far from self-evident. The requirements are:

1. The top peptide match to a spectrum carries one or more variable modifications for which alternative arrangements are possible. If a search includes Phospho (ST) and Phospho (Y) as variable modifications, and the top-ranking peptide match has two serines and no threonines or tyrosines, you will only see site analysis reported if one of the serines carries a phosphate. If the peptide mass corresponds to zero or two phosphates, then no alternative arrangements are possible and there will be no site analysis.

2. The score for the top match is significant. Site analysis seems pointless unless you are confident of the sequence.

Site analysis is only ever performed for the highest scoring peptide sequence match. If there are significant matches to more than one sequence, this could be because the spectrum is chimeric, and contains fragments from co-eluting, isobaric peptides. Another possibility is that there are few or no peaks in the spectrum to distinguish between the peptides. Or, maybe one or more of the matches are false positives. In such cases, site analysis for the top match is on slightly shaky ground, never mind site analysis for anything other than the top match.

3. There is at least one further match to an alternative arrangement of modifications (which need not be significant). Mascot saves a maximum of 10 matches per spectrum, and at least one of the alternatives must be matched and scored. If the score is unknown, site analysis is not possible.

Imagine that the top match is significant, with a score of 42 and the tenth match has a score of 28. If there is no match for an alternative arrangement, all we can say is that the score for the best alternative arrangement must be between 0 and 28, which is quite a wide range and could have a marked effect on the calculation. On the other hand, if the difference in scores between the top and bottom matches is more than 30, it makes little difference to the site analysis whether the score for the best alternative match is just below that of the tenth match or zero, so this condition is likely to be modified or dropped in a future release.
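To put rough numbers on this example, here is a minimal sketch. It assumes the default weighting of a factor of 10 per 10 score units and treats the unmatched arrangement as the only alternative. The function names are hypothetical.

```python
def top_probability(top_score, alt_score, md10_prob=0.1):
    """'Probability' of the top arrangement against a single alternative (assumed model)."""
    return 1.0 / (1.0 + md10_prob ** ((top_score - alt_score) / 10.0))


def top_probability_bounds(top_score, lowest_saved_score, md10_prob=0.1):
    """Range for the top arrangement when no alternative arrangement was saved.

    The unseen alternative must score somewhere between 0 and the score of the
    lowest saved match, so the two extremes bracket the true value.
    """
    lower = top_probability(top_score, lowest_saved_score, md10_prob)
    upper = top_probability(top_score, 0.0, md10_prob)
    return lower, upper


low, high = top_probability_bounds(42, 28)
print(f"top arrangement: {low:.2%} to {high:.2%}")
print(f"unseen alternative: {1 - high:.4%} to {1 - low:.2%}")
```

Under these assumptions, the residual probability for the unseen alternative runs from well under 0.01% up to just under 4%. Once the gap between the top and lowest saved scores exceeds 30, the lower bound for the top arrangement is already above 99.9%.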

Other considerations

The higher the mass accuracy, the less likely you are to see significant matches to more than one sequence. One exception is when deamidation is included in the search, because it is common to get peptide sequences in the database that differ only in N<->D and Q<->E. Beware of searching with both deamidation and a non-zero setting for #13C, which is meant to allow for the wrong peak of the isotopic distribution being used for the precursor mass. Unless you have extremely high mass accuracy, this can lead to secondary matches to sequences that are deamidated when they should just have a 1 or 2 Da error on the precursor mass or vice versa. The correct match will usually have the higher score because of better fragment ion matching, but this can still be a source of confusion.
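The reason mass accuracy matters here can be seen from the mass arithmetic. Deamidation shifts the monoisotopic mass by about 0.984 Da, while picking the next isotope peak shifts it by the 13C-12C difference of about 1.003 Da. The sketch below expresses the gap between the two in ppm for a few precursor masses. The function name and the choice of example masses are for illustration only.

```python
DEAMIDATION_DELTA = 0.984016  # Da, monoisotopic shift for deamidation (N->D, Q->E)
C13_C12_DELTA = 1.003355      # Da, mass difference between 13C and 12C


def ppm_separation(peptide_mass):
    """Gap, in ppm of the peptide mass, between a deamidation and a one-isotope error."""
    return (C13_C12_DELTA - DEAMIDATION_DELTA) / peptide_mass * 1e6


for mass in (1846.7179, 3541.7900, 4178.0808):
    print(f"{mass:.4f} Da: {ppm_separation(mass):.1f} ppm")
```

The gap is roughly 10 ppm for a peptide of about 1850 Da and roughly 5 ppm at about 3500 Da, so a precursor tolerance wider than this cannot separate the two explanations on mass alone.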

Site analysis doesn’t attempt to distinguish between site uncertainty and site occupancy. That is, if a peptide contains two phosphorylation sites and one phosphate, and we obtain matches with similar scores for both arrangements, this could be because of a lack of information (no peaks to distinguish the two possibilities) or it could be because the sample is a mixture of the two forms (peaks for both possibilities are present). We recognise it would be useful to be more specific about this, and it’s something we’ll look at for a future release.

Finally, remember that site analysis only considers the modifications selected for the search. Imagine a search where we specify Phospho (ST) and get a match to a peptide with sequence xSxxxxxSYxxxxxxxxTx. Site analysis reports the phosphate is 99% localised on S8. But, if the search was repeated with Phospho (Y) included, the site analysis could easily change to 50% for S8 and 50% for Y9. Site analysis is only meaningful if all the sites for the target modification are included in the search. This is a particular challenge for something like methylation, which is listed in Unimod as being observed on 11 different residues.
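The dependence on the search settings is easy to see by listing the candidate sites that each choice of modifications makes available. This is a small sketch in which the helper name candidate_sites is hypothetical and 'x' stands for any residue that cannot carry the modification.

```python
def candidate_sites(sequence, residues):
    """Return labels such as 'S8' for every residue that could carry the modification.

    Positions are 1-based, matching the way sites are reported in the text.
    """
    return [f"{aa}{pos}" for pos, aa in enumerate(sequence, start=1) if aa in residues]


peptide = "xSxxxxxSYxxxxxxxxTx"

# Search with Phospho (ST) only: Y9 is never considered as an alternative
print(candidate_sites(peptide, "ST"))

# Search with Phospho (ST) and Phospho (Y): Y9 now competes with S8
print(candidate_sites(peptide, "STY"))
```

An arrangement that is never generated is never scored, so it cannot take any share of the reported confidence.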

Examples

The results are displayed in the Peptide View report. For example, using the default setting produces the following results:

Score  Mr(calc)   Delta    Sequence          Site Analysis
83.4   1846.7179   0.1889  DIGSESTEDQAMEDIK  Phospho S4 84.56%
75.8   1846.7179   0.1889  DIGSESTEDQAMEDIK  Phospho S6 14.73%
62.7   1846.7179   0.1889  DIGSESTEDQAMEDIK  Phospho T7 0.72%
26.9   1846.7808   0.1261  KLNSNPENYCESELK
22.8   1846.7729   0.1339  KMEDSVGCLETAEEVK
15.5   1846.9230  -0.0161  GAYTIEQHPVLGLEIK
14.2   1846.7729   0.1339  KMEDSVGCLETAEEVK
13.9   1846.8754   0.0315  YVKGIYENLPSIDEK
13.8   1846.8866   0.0202  QLIEAPDPVPSFEVAR
13.3   1846.9052   0.0016  KIDFSNIAMLFGGVQK

A large score difference will strongly favour one arrangement.

Score  Mr(calc)   Delta    Sequence                             Site Analysis
84.5   3541.7900   0.0191  KRYGASAGNVGDEGGVAPNIQTAEEALDLIVDAIK  Deamidated N9 99.79%
57.2   3541.7900   0.0191  KRYGASAGNVGDEGGVAPNIQTAEEALDLIVDAIK  Deamidated N19 0.19%
47.9   3541.7900   0.0191  KRYGASAGNVGDEGGVAPNIQTAEEALDLIVDAIK  Deamidated Q21 0.02%
14.3   3541.7735   0.0355  INKRLNYIKRQPHQSDDEPAQIMGYKNK
14.3   3541.7735   0.0355  INKRLNYIKRQPHQSDDEPAQIMGYKNK
13.5   3541.7470   0.0620  ENEVPERKNYEDEMQVTKLPVNQNILKN
13.0   3541.8013   0.0078  RNVISQINDGQVQVTTQKLPHPVSQIGDGQIQ
12.9   3541.7472   0.0618  ALLVMSDKVYENYTNNINFYMSKNLIKK
12.8   3541.8641  -0.0551  IRSTFKYSPINNPNLILDVKNGSGNEQRPTI
12.6   3541.7472   0.0618  ALLVMSDKVYENYTNNINFYMSKNLIKK

When there is little to choose between two arrangements, this could indicate a lack of evidence or it could indicate a mixture of the two forms. There is nothing in the algorithm to distinguish between these possibilities.

Score  Mr(calc)   Delta    Sequence                                Site Analysis
73.1   4178.0808   0.0369  KIATYQERDPANLPWGSSNVDIAIDSTGVFKELDTAQK  Deamidated N19 42.20%
72.5   4178.0808   0.0369  KIATYQERDPANLPWGSSNVDIAIDSTGVFKELDTAQK  Deamidated N12 37.01%
70.0   4178.0808   0.0369  KIATYQERDPANLPWGSSNVDIAIDSTGVFKELDTAQK  Deamidated Q6 20.72%
45.4   4178.0808   0.0369  KIATYQERDPANLPWGSSNVDIAIDSTGVFKELDTAQK  Deamidated Q37 0.07%
21.9   4178.0463   0.0713  ISMADNLLSTINKSEINKGFDRNLGELLLQQQQELR
15.3   4178.0987   0.0189  TVGDYVITPDICLERKSISDLIGSLQNNRLANQCKK
15.0   4178.0987   0.0189  TVGDYVITPDICLERKSISDLIGSLQNNRLANQCKK
15.0   4178.0987   0.0189  TVGDYVITPDICLERKSISDLIGSLQNNRLANQCKK
15.0   4178.0987   0.0189  TVGDYVITPDICLERKSISDLIGSLQNNRLANQCKK
15.0   4178.0987   0.0189  TVGDYVITPDICLERKSISDLIGSLQNNRLANQCKK