Common myths about protein scores
Mascot Server is used in many different application areas by both mass spectrometry experts and non-experts. Over the years, we’ve spotted a few recurring misconceptions about how protein scores are interpreted and used. All the examples come from peer-reviewed papers.
Protein scores in PMF searches
The first thing to check is what type of experiment is being reported. If it’s peptide mass fingerprinting (PMF), Mascot calculates a statistical score for each identified protein. The score reflects the probability that the match between the observed molecular masses and the digested database entry is a random event. Mascot also reports an expect value, which is a p-value corrected for multiple testing. A protein hit is statistically significant if its expect value is below the significance threshold (by default 0.05). The example PMF search illustrates these points.
A paper might say the authors accepted “all proteins with score > 51” or “protein scores greater than 67 (p<0.05)”. You could even see the phrase “protein score at significance level (p<0.05)”. These are all valid ways of accepting protein hits in a PMF search. Another, perhaps simpler way is to look at the expect value. If the protein hit’s expect value (say 2.2e-14) is below the significance level (say 0.05), the hit is statistically significant.
Protein scores in MS/MS searches
If the experiment uses MS/MS data to identify peptides, the situation is more complicated and there is room for misunderstanding.
When Mascot compares an MS/MS peak list to the in silico fragmented peptide, it gives the match a statistical score. The score reflects the probability that the match between the observed and calculated fragment masses is a random event. The match is given an expect value, which is a function of the score and the threshold. If the expect value is below the significance level, the match is statistically significant. Have a look at the example MS/MS search for a sample of peptide scores and expect values.
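The text above says only that the expect value is a function of the score and the threshold. Here is a minimal sketch of that relationship, assuming the commonly quoted form E = significance × 10^((threshold − score) / 10). The function names are illustrative and not part of any Mascot API.

```python
def expect_value(score, threshold, significance=0.05):
    """Expect value for a match.

    ASSUMPTION: E = significance * 10 ** ((threshold - score) / 10).
    A score exactly at the threshold gives E equal to the significance level.
    """
    return significance * 10 ** ((threshold - score) / 10)


def is_significant(score, threshold, significance=0.05):
    """A match is significant when its expect value is below the significance level."""
    return expect_value(score, threshold, significance) < significance


if __name__ == "__main__":
    # Hypothetical peptide matches as (score, threshold) pairs.
    for score, threshold in [(40, 30), (30, 30), (20, 30)]:
        e = expect_value(score, threshold)
        print(f"score={score} threshold={threshold} expect={e:.3g} "
              f"significant={is_significant(score, threshold)}")
```

Under this assumption, every 10 points of score above the threshold lowers the expect value by a factor of ten.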
Mascot also reports a protein score in an MS/MS search. However, this is not a statistical score. It is simply the sum, over the peptide matches assigned to the protein hit, of each match’s score in excess of its threshold, plus the average threshold of these matches.
(The obsolete summary reports allow switching the protein score type to an older one, called Standard, which was the sum of scores of non-duplicate peptide matches, minus a small correction.)
Protein scores in MS/MS searches are only used for ranking protein hits. The goal is to put proteins with lots of strong peptide evidence at the top of the list.
Unfortunately, you sometimes see MS/MS papers with phrases like “all proteins identified with a Mascot score higher than 60 [...] were considered reliable” or “peptides and proteins with a Mascot score higher than 35 and 50, respectively, were automatically accepted.” Occasionally, you even see a mention of protein expect values, which Mascot does not calculate in MS/MS searches.
As you can see from the definition of the protein score, thresholding by score has no clear meaning. A protein score of 60 could mean the protein has one significant peptide match with score 60, or the protein could have 47 peptide matches each with score 14 and threshold 13.
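The arithmetic behind these two cases can be sketched in a few lines. This is an illustration of the definition given earlier, not Mascot’s actual implementation. The function name is made up, the threshold of 25 for the single strong match is a hypothetical value, and treating matches below threshold as contributing zero is an assumption.

```python
def msms_protein_score(matches):
    """Protein score from a list of (peptide_score, threshold) pairs.

    Sum of each match's score excess over its threshold, plus the
    average threshold of the matches.
    ASSUMPTION: matches scoring below threshold contribute zero excess.
    """
    if not matches:
        raise ValueError("a protein hit needs at least one peptide match")
    excess = sum(max(0.0, score - threshold) for score, threshold in matches)
    average_threshold = sum(threshold for _, threshold in matches) / len(matches)
    return excess + average_threshold


# One significant match with score 60 (threshold 25 is hypothetical).
one_strong_match = [(60, 25)]
# 47 matches, each with score 14 and threshold 13.
many_weak_matches = [(14, 13)] * 47

if __name__ == "__main__":
    print(msms_protein_score(one_strong_match))   # (60 - 25) + 25
    print(msms_protein_score(many_weak_matches))  # 47 * 1 + 13
```

Both hits come out at a protein score of 60, although the peptide evidence behind them is very different.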
Unintentionally accepting one-hit wonders
This brings us to another error sometimes seen in the scientific literature. Here’s a sample of methods from papers using LC-MS/MS:
“Proteins were accepted if they had at least one ‘rank 1’ peptide with a peptide ion score of more than 50.0 (p < 0.05).”
“The presence of at least one peptide with a significant ion score was required for positive protein identification.”
“Proteins that met our criteria for ‘identified proteins’ exhibited ≥ 1 peptide with an individual Mascot score of p < 0.05.”
“Proteins with a score of at least 30 for single high-confidence peptides were considered positive identifications.”
The intent is, without doubt, to filter out potential false positive proteins. Very few papers say which results report or protein score type was used, but assuming the above methods are accurate descriptions, they all let through one-hit wonders.
To see this, let’s look at the Protein Family Summary. Protein clustering ensures that every family member has at least one significant peptide match, which is almost always the rank 1 match. The absolute value of the peptide match score doesn’t matter as long as the match is significant. The second and third methods therefore provide no filtering at all in this case. The first and fourth methods are inadequate: they will accept some one-hit wonders (those whose single peptide score is high enough) and reject some protein hits that are identified by more than one peptide.
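A small sketch makes the failure of a score cutoff concrete. It applies the first quoted criterion (at least one peptide with score above 50) to two protein hits. All sequences, scores and expect values here are invented for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeptideMatch:
    sequence: str
    score: float
    expect: float


def accepted_by_score_cutoff(matches, cutoff=50.0):
    """First quoted method: at least one peptide with score above the cutoff."""
    return any(m.score > cutoff for m in matches)


def significant_sequences(matches, significance=0.05):
    """Distinct peptide sequences with a significant match."""
    return {m.sequence for m in matches if m.expect < significance}


# Invented data: a one-hit wonder with a single high-scoring peptide ...
one_hit_wonder = [PeptideMatch("PEPTIDEK", 55.0, 1e-4)]
# ... and a protein with three significant peptides, none scoring above 50.
well_supported = [
    PeptideMatch("AAAK", 40.0, 3e-3),
    PeptideMatch("CCCR", 42.0, 2e-3),
    PeptideMatch("DDDK", 45.0, 1e-3),
]

if __name__ == "__main__":
    for name, hit in [("one_hit_wonder", one_hit_wonder),
                      ("well_supported", well_supported)]:
        print(name,
              "accepted by cutoff:", accepted_by_score_cutoff(hit),
              "| significant sequences:", len(significant_sequences(hit)))
```

The cutoff accepts the one-hit wonder and rejects the protein supported by three significant peptides, which is the opposite of what the authors intended.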
There is a straightforward procedure to control for false positive protein hits. First of all, do a target-decoy search so that you get a reliable estimate of the peptide false discovery rate. Choose a target FDR appropriate for your experiment. Now filter out protein hits that were identified by a single peptide sequence. This removes one-hit wonders and also controls for false positive peptides. Have a look at Creating a list of confidently identified proteins for an example of how to do this in Report Builder.
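Outside Report Builder, the same two steps could be sketched as below. This is a simplified illustration, not Mascot’s implementation. It assumes the common estimate of FDR as significant decoy matches divided by significant target matches. The accessions, sequences and function names are hypothetical.

```python
def estimate_fdr(n_target_matches, n_decoy_matches):
    """Peptide FDR estimate from a target-decoy search.

    ASSUMPTION: FDR = significant decoy matches / significant target matches.
    """
    if n_target_matches == 0:
        return 0.0
    return n_decoy_matches / n_target_matches


def confident_proteins(protein_hits, min_sequences=2):
    """Keep protein hits identified by at least `min_sequences` distinct
    significant peptide sequences, dropping one-hit wonders.

    protein_hits maps an accession to the sequences of its significant
    peptide matches at the chosen FDR.
    """
    return sorted(
        accession
        for accession, sequences in protein_hits.items()
        if len(set(sequences)) >= min_sequences
    )


if __name__ == "__main__":
    # Hypothetical counts of significant matches at the chosen threshold.
    print(f"estimated peptide FDR: {estimate_fdr(1000, 10):.1%}")

    # Hypothetical protein hits and their significant peptide sequences.
    hits = {
        "P1": ["AAAK"],          # one-hit wonder
        "P2": ["AAAK", "CCCR"],  # two distinct sequences
        "P3": ["DDDK", "DDDK"],  # same sequence matched twice
    }
    print(confident_proteins(hits))
```

Note that P3 is dropped: two matches to the same peptide sequence still count as a single sequence.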