Mascot Server ships with Percolator, which is an algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications.
Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble and Michael J MacCoss at the University of Washington, Department of Genome Sciences.
Percolator scores can be shown in Mascot Server reports instead of Mascot ions scores.
Percolated results are available for both old and new searches. However, they are only available when a decoy search has been performed. Percolator does not change the Mascot results file; instead, running ms-createpip.exe and percolator.exe creates additional files for each results file.
ms-createpip.exe calculates core features from the peptide match data. Mascot Server 3.0 and later can also fetch predicted features from other machine learning tools like MS2Rescore using an adapter interface. There are two sets of pip (Percolator input) and pop (Percolator output) files, whose names depend on Mascot options and ML adapter parameters. These are enumerated below.
Name | Meaning | Dependencies |
---|---|---|
Core pip file | Percolator input (pip) file created by ms-createpip.exe , containing features calculated from target and decoy peptide matches. | The core pip file name and contents depend on mascot.dat options:
|
Final pip file | Pip file created by insert_predicted_data.pl from the core pip file, containing features predicted by machine learning tools (e.g. MS2Rescore). | The final pip file name and contents depend on the core pip file, the above mascot.dat options, plus any ML adapter parameters (e.g. with MS2Rescore, the MS2PIP model name). |
Pop file | Percolator output (pop) file created by percolator.exe from a pip file. Percolator creates one pop file for target matches and one for decoy matches. | The target and decoy pop file names depend on the above mascot.dat options plus any ML adapter parameters. The contents depend only on the pip file given as argument. |
When you use Parser in a client application, you need the results file and the target/decoy pop files. Normally these are downloaded from Mascot Server, and the easiest way is using ms_http_client_search.
Parser reads pop files from a fixed location:
The pop file name is generated from Percolator settings in the mascot.dat Options section, by calling ms_mascotresfilebase::setPercolatorFeatures().
Parser will only find the downloaded pop file if you use exactly the same Percolator options as were used on server side.
The detailed workflow is:
- Submit the search with the PERCOLATE=1 search parameter.
- Check that the DECOY option is on.
- Download the results file.
- Pass the MSPEPSUM_PERCOLATOR flag in the flags2 parameter when creating an ms_peptidesummary object.

If the downloaded pop file name doesn't match the one generated by setPercolatorFeatures(), the Mascot options may differ between client and server, or you may be using a version of Parser that differs from the Mascot Server version. An easy workaround is to rename the downloaded pop files to match the file names returned by ms_mascotresfilebase::getPercolatorFileNames().
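
The renaming workaround can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper `rename_pop_files` and the file names in the demo are hypothetical, and the expected names are assumed to have been obtained beforehand from getPercolatorFileNames() (the sketch does not call Parser itself).

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def rename_pop_files(downloaded_to_expected: Dict[str, str]) -> List[Path]:
    """Rename each downloaded pop file to the name Parser expects.

    Hypothetical helper: keys are the paths of the downloaded files,
    values are the paths Parser will look for.
    """
    renamed = []
    for downloaded, expected in downloaded_to_expected.items():
        src, dst = Path(downloaded), Path(expected)
        if src != dst:
            if dst.exists():
                # Refuse to overwrite a pop file that is already in place.
                raise FileExistsError(f"{dst} already exists")
            dst.parent.mkdir(parents=True, exist_ok=True)
            src.rename(dst)
        renamed.append(dst)
    return renamed


# Demo in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "downloaded.target.pop"
    downloaded.write_text("placeholder")
    expected = Path(tmp) / "expected.target.pop"
    result = rename_pop_files({str(downloaded): str(expected)})
    demo_ok = expected.exists() and not downloaded.exists()
```

The same mapping would normally contain two entries, one for the target pop file and one for the decoy pop file.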
For completeness, the workflow in Mascot Server is as follows. The workflow is implemented in the server-side script mascot/bin/refine_results_with_ml.pl. The detailed steps are:
- Check that the DECOY option is on.

refine_results_with_ml.pl performs the steps:

- Run ms-createpip.exe, specifying the output filename as the core pip file.
- Run insert_predicted_data.pl with the core pip file as –pipfile_in and the final pip file as –pipfile_out, giving the ML adapter parameters as arguments.
- Run percolator.exe with command-line parameters from getPercolatorExeFlags(), giving the final pip and pop file names as arguments.

Finally:

- Pass the MSPEPSUM_PERCOLATOR flag in the flags2 parameter when creating an ms_peptidesummary object.

Steps 5 to 10 can be performed automatically after a search by specifying the appropriate options in getExecAfterSearch().
There is also a static function, staticGetPercolatorFileNames(), that can be called to get filenames without creating an ms_mascotresfile_msr or ms_mascotresfile_dat object.
A 'Percolator score' is calculated from the posterior error probability (PEP) by
percolatorScore = -10 * log10(PEP)
This is analogous to the Mascot ions score, which is -10*log10(p-value). The posterior error probability is similar to, but not the same as, a p-value.
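
As a quick illustration of the formula, the following Python sketch converts a few PEP values to Percolator scores. The function name `percolator_score` is chosen for this example and is not part of Mascot Parser.

```python
import math


def percolator_score(pep: float) -> float:
    """Convert a posterior error probability (PEP) to a Percolator score."""
    if not 0.0 < pep <= 1.0:
        raise ValueError("PEP must be in the interval (0, 1]")
    return -10.0 * math.log10(pep)


# Smaller PEP values give larger scores.
for pep in (0.5, 0.05, 0.001):
    print(f"PEP={pep:g} -> score={percolator_score(pep):.2f}")
```

A PEP of 0.05 corresponds to a score of about 13, and each tenfold decrease in PEP adds 10 to the score.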
Percolator processes the rank 1 matches found by Mascot, plus any other ranks as defined by ms_mascotoptions::getPercolatorTargetRankScoreThreshold and ms_mascotoptions::getPercolatorTargetRankRelativeThreshold.
Peptide matches that were not processed by Percolator get a score based on the rank 1 match, scaled by the Mascot ions score:
rank2PercolatorScore = (rank2MascotIonsScore/rank1MascotIonsScore) * rank1Percolatorscore
If the PEP value for rank 1 is exactly 1, it is reset to 0.9999. This is to ensure that there is a tiny amount of spread in scores for lower ranking peptides.
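
The scaling rule and the reset of a PEP of exactly 1 can be sketched as follows. This is an illustrative Python function, not Parser code; the name `scaled_percolator_score` and the check for a non-positive rank 1 ions score are assumptions made for the example.

```python
import math


def scaled_percolator_score(rank1_pep: float,
                            rank1_ions_score: float,
                            lower_rank_ions_score: float) -> float:
    """Estimate a Percolator score for a match Percolator did not process."""
    if rank1_ions_score <= 0.0:
        # Assumption of this sketch: the ratio is undefined without a
        # positive rank 1 ions score.
        raise ValueError("rank 1 ions score must be positive")
    if rank1_pep == 1.0:
        # A PEP of exactly 1 would give a rank 1 score of 0, and every
        # lower rank would then also score 0; reset it as described above.
        rank1_pep = 0.9999
    rank1_percolator_score = -10.0 * math.log10(rank1_pep)
    return (lower_rank_ions_score / rank1_ions_score) * rank1_percolator_score


# Rank 1: PEP 0.01 (Percolator score 20), ions score 40; rank 2: ions score 30.
print(scaled_percolator_score(0.01, 40.0, 30.0))
# Rank 1 PEP of exactly 1: the result is tiny but still non-zero.
print(scaled_percolator_score(1.0, 40.0, 30.0))
```

With the reset in place, two lower-ranking peptides with different ions scores still receive different (very small) Percolator scores rather than both scoring 0.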
The function ms_peptide::getPercolatorScores() returns the posterior error probability which is used by Mascot Parser. A different score (calculated by Percolator itself) and the qValue can also be obtained, but these are unused by Mascot Parser.
When the MSPEPSUM_PERCOLATOR flag is specified, all Mascot scores are replaced with Percolator-derived scores. The original Mascot ions score for a peptide is still available by calling getPercolatorScores(). The following table describes how each of the existing Mascot functions has been changed to return Percolator values.
Mascot Parser Function | How value is calculated |
---|---|
ms_peptide::getIonsScore() | -10 * log10( posterior error probability ) |
ms_protein::getPeptideIonsScore() | Same value as getIonsScore() above, except a minor correction for large proteins is applied in the same way as for the Mascot Score. |
ms_protein::getScore() | Calculated using the percolator scores rather than Mascot scores. Same rules for MudPIT and standard scoring apply. |
ms_protein::getNonMudpitScore() | Calculated using the percolator scores rather than Mascot scores. Same rules for MudPIT scoring apply. |
ms_mascotresults::getPeptideIdentityThreshold() | Calculated by taking -10 * log10(sigthreshold), so a value of sigthreshold=0.05 gives a threshold score of ~13. |
ms_mascotresults::getAvePeptideIdentityThreshold() | With Percolator, the threshold is the same for every query, so this is exactly the same as getPeptideIdentityThreshold() above. |
ms_mascotresults::getMaxPeptideIdentityThreshold() | With Percolator, the threshold is the same for every query, so this is exactly the same as getPeptideIdentityThreshold() above. |
ms_mascotresults::getHomologyThreshold() | With Percolator, there is no homology threshold, so this always returns 0. |
ms_mascotresults::getHomologyThresholdForHistogram() | With Percolator, there is no homology threshold, so this always returns 0. |
ms_mascotresults::getPeptideExpectationValue() | Returns the posterior error probability by calculating it back from the score using 10 ^ (-score / 10). The same value can also be obtained by calling ms_peptide::getPercolatorScores() and retrieving the posterior error probability value. |
ms_mascotresults::getProbFromScore() | Calls getPeptideExpectationValue() as above. |
ms_mascotresults::getIonsScoreHistogram() | Returns a vector of Percolator scores rather than Mascot scores. |
ms_mascotresults::getProteinScoreForHistogram() | Returns the Percolator protein score rather than the Mascot protein score. |
ms_mascotresults::getNumHitsAboveIdentity() | Any peptide with a posterior error probability less than the significance value specified will be counted. |
ms_mascotresults::getNumDecoyHitsAboveIdentity() | Any peptide with a posterior error probability less than the significance value specified will be counted. |
ms_mascotresults::getNumHitsAboveHomology() | No homology thresholds, so always returns the same number as getNumHitsAboveIdentity(). |
ms_mascotresults::getNumDecoyHitsAboveHomology() | No homology thresholds, so always returns the same number as getNumDecoyHitsAboveIdentity(). |
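
The arithmetic behind the threshold, the back-calculated probability and the hit counts in the table can be checked with a short Python sketch. The function names here are illustrative only and do not correspond to Parser calls; the PEP values are made up.

```python
import math


def identity_threshold(sig_threshold: float) -> float:
    """Threshold score for a given significance threshold."""
    return -10.0 * math.log10(sig_threshold)


def pep_from_score(score: float) -> float:
    """Recover the posterior error probability from a Percolator score."""
    return 10.0 ** (-score / 10.0)


def count_hits_above_identity(peps, sig_threshold: float) -> int:
    """Count matches whose PEP is strictly less than the threshold."""
    return sum(1 for pep in peps if pep < sig_threshold)


threshold = identity_threshold(0.05)
print(f"threshold score: {threshold:.2f}")
print(f"PEP at threshold: {pep_from_score(threshold):.4f}")

example_peps = [0.001, 0.04, 0.05, 0.2]  # made-up values
hit_count = count_hits_above_identity(example_peps, 0.05)
print(f"hits above identity: {hit_count}")
```

Because the same threshold applies to every query, comparing a PEP against the significance threshold is equivalent to comparing the Percolator score against the single threshold score.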
You currently cannot specify MSPEPSUM_PERCOLATOR with an Integrated error tolerant search.
The following configuration functions are relevant: