Requirements and data flow for rescoring
Mascot Server ships with Percolator, which is a well-known algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications. This is often termed rescoring or refining with machine learning. For background information, see "How does rescoring with machine learning work?"
Percolator will usually give a worthwhile improvement in sensitivity. The requirements and data flow are described below. By default, refining results with machine learning uses core features calculated by Mascot. You can optionally select a DeepLC model for predicted retention times and an MS2PIP model for spectral similarity.
Requirements
Mascot automatically runs a target-decoy search for all supported search types. The decoy matches provide the negative examples for the classifier, and a subset of the high-scoring matches from the target database provides the positive examples.
When enabled, Percolator trains a machine learning algorithm called a support vector machine (SVM) to discriminate between the positive and negative matches by assigning weights to a number of features. Examples of features include Mascot score, precursor mass error, fragment mass error and number of variable modifications. The vector of features with their optimal weights is then used to re-rank matches from all queries.
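As a rough illustration of the final re-ranking step only, the sketch below scores each match as a weighted sum of its features and sorts by that score. The feature names, weights and match records are hypothetical and are not taken from Mascot or Percolator. The iterative training that Percolator uses to find the weights is not shown.

```python
def linear_score(features, weights, bias=0.0):
    """Weighted sum of feature values; the weights would come from the trained SVM."""
    return bias + sum(weights[name] * features[name] for name in weights)


def rerank(matches, weights):
    """Return matches sorted from best to worst by their linear score."""
    return sorted(
        matches,
        key=lambda m: linear_score(m["features"], weights),
        reverse=True,
    )


# Hypothetical weights: reward a high Mascot score, penalise precursor mass error.
example_weights = {"mascot_score": 1.0, "abs_precursor_error_ppm": -2.0}

example_matches = [
    {"id": "a", "features": {"mascot_score": 40.0, "abs_precursor_error_ppm": 8.0}},
    {"id": "b", "features": {"mascot_score": 35.0, "abs_precursor_error_ppm": 1.0}},
]

ranked = rerank(example_matches, example_weights)
```

With these made-up numbers, match "b" overtakes match "a" despite its lower Mascot score, because its precursor mass error is much smaller.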
There are occasions when refining results with machine learning can fail. For example, if there are very few good matches in the search results, Percolator may not have enough positive examples to work with. Conversely, a sufficient number of decoy matches is needed to provide enough negative examples.
The full requirements are:
- Must be an MS/MS search (not PMF).
- The search must include the results from an automatic decoy database search.
- The search must contain at least 750 queries.
- At least 100 database entries must be searched.
- The search must not be an error tolerant search.
Two options in mascot.dat can be edited to change the lower limits, although we recommend keeping the limits at their defaults:
- PercolatorMinQueries 750: Minimum number of queries to allow rescoring.
- PercolatorMinSequences 100: Minimum number of database entries to allow rescoring.
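The requirements above can be read as a simple eligibility check. The following is a minimal sketch, assuming a hypothetical SearchSummary record that holds the relevant facts about a search. It is not a Mascot data structure or API. The two default limits are the mascot.dat defaults listed above.

```python
from dataclasses import dataclass

# Defaults from the text; the constant names mirror the mascot.dat options.
PERCOLATOR_MIN_QUERIES = 750
PERCOLATOR_MIN_SEQUENCES = 100


@dataclass
class SearchSummary:
    """Hypothetical summary of a search, for illustration only."""
    is_msms: bool
    has_decoy: bool
    is_error_tolerant: bool
    num_queries: int
    num_sequences: int


def rescoring_problems(search,
                       min_queries=PERCOLATOR_MIN_QUERIES,
                       min_sequences=PERCOLATOR_MIN_SEQUENCES):
    """Return the unmet requirements; an empty list means the search qualifies."""
    problems = []
    if not search.is_msms:
        problems.append("not an MS/MS search")
    if not search.has_decoy:
        problems.append("no automatic decoy search results")
    if search.num_queries < min_queries:
        problems.append(f"fewer than {min_queries} queries")
    if search.num_sequences < min_sequences:
        problems.append(f"fewer than {min_sequences} database entries searched")
    if search.is_error_tolerant:
        problems.append("error tolerant search")
    return problems
```

Returning a list of reasons rather than a single true/false value makes it easier to report why a particular search cannot be rescored.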
Data flow
You can enable refining with machine learning either in the search form when submitting the search, or in format controls in the summary report.
All the steps are implemented within nph-cache_families.pl, which is a utility script that post-processes results at the end of the database search.
At the completion of a qualifying search, Mascot runs ms-createpip.exe to create a Percolator input file (*.pip) in the result file’s cache directory. This file contains the core features calculated by Mascot.
If parameters for any ML adapters have been selected, Mascot runs insert_predicted_data.pl. Mascot ships with an adapter for MS2Rescore, which is enabled by selecting a DeepLC model or an MS2PIP model.
- insert_predicted_data.pl prepares input for each selected ML adapter.
- The script runs the adapter (e.g. MS2RescoreAdapter.exe), which predicts the relevant features and writes the values in a temporary TSV file.
- insert_predicted_data.pl merges the predicted features with the core features to produce a new Percolator input (pip) file.
nph-cache_families.pl runs percolator.exe. The input is the above pip file. The output is a pair of files (*.target.pop, *.decoy.pop) in the result file’s cache directory.
nph-cache_families.pl runs a utility program (MS2RescoreReport.exe), which creates the machine learning quality report.
Finally, nph-cache_families.pl combines the rescoring data from the *.pop files with the Mascot search results. It creates a cache file for the interactive reports, where Mascot ions score is replaced with posterior error probability (see below), and runs protein inference.
Rank 1 target matches and rank 1 decoy matches are always written in the pip file. Two options in mascot.dat control whether matches other than rank 1 are used in training:
- PercolatorTargetRankScoreThreshold: Matches below rank 1 are not used if their score is less than this value (default 20).
- PercolatorTargetRankRelativeThreshold: Matches below rank 1 are not used if the score difference divided by the rank 1 score is greater than this value (default 0.2).
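The two thresholds can be sketched as a single filter. This is an illustration under stated assumptions, not Mascot's code: the function name is hypothetical, "score difference" is taken to mean the rank 1 score minus the score of the lower-ranked match, and a non-positive rank 1 score is treated as a rejection to avoid dividing by zero.

```python
def use_in_training(rank, score, rank1_score,
                    score_threshold=20.0, relative_threshold=0.2):
    """Decide whether a match is used in training (hypothetical helper)."""
    if rank == 1:
        # Rank 1 matches are always written to the pip file.
        return True
    if score < score_threshold:
        # PercolatorTargetRankScoreThreshold: absolute score is too low.
        return False
    if rank1_score <= 0:
        # Assumption: cannot compute a relative difference, so reject.
        return False
    # PercolatorTargetRankRelativeThreshold: too far below the rank 1 score.
    relative_difference = (rank1_score - score) / rank1_score
    return relative_difference <= relative_threshold
```

For example, with the default settings and a rank 1 score of 60, a rank 2 match scoring 50 is kept (relative difference about 0.17), while one scoring 40 is dropped (relative difference about 0.33).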
Expect value and significance threshold
When refining is enabled, the original Mascot scores will be replaced as follows:
- Score: -10log10(PEP)
- Expect value: PEP (posterior error probability)
- Identity threshold score for p<0.05: 13
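The fixed identity threshold of 13 follows directly from the score definition, taking the logarithm to be base 10 as for other Mascot scores: a PEP of 0.05 maps to a score of about 13. A minimal sketch of the conversion is shown below. The function name is hypothetical, and how Mascot treats a PEP of exactly zero is not covered here, so the sketch simply rejects it.

```python
import math


def score_from_pep(pep):
    """Convert a posterior error probability to a score of -10*log10(PEP)."""
    if not 0.0 < pep <= 1.0:
        raise ValueError("PEP must be greater than 0 and at most 1")
    return -10.0 * math.log10(pep)


# A PEP of 0.05 corresponds to the identity threshold of roughly 13.
threshold_score = score_from_pep(0.05)
```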
Percolator returns p-values, q-values and posterior error probabilities (PEPs) for each match. The q-value of a match is the minimum false discovery rate at which that match is accepted, so if we accept all matches with q-values of 0.01 or less, the false discovery rate will be 1%. The PEP is the probability that an individual match is a chance event, and can be thought of as the local false discovery rate.
When you set a target FDR, for example 1%, Mascot automatically finds the optimal PEP threshold that yields the target false discovery rate. Thresholding on PEP is more robust than thresholding on q-value, and they are in fact two sides of the same coin.
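One simple way to picture the search for a PEP threshold is target-decoy counting: for each candidate cut-off, estimate the FDR as the number of decoy matches divided by the number of target matches at or below that PEP, and keep the most permissive cut-off that stays within the target. The sketch below does exactly that. It is an assumption made for illustration, and Mascot's actual procedure may differ. The function name and input lists are hypothetical.

```python
def pep_threshold_for_fdr(target_peps, decoy_peps, target_fdr=0.01):
    """Return the largest PEP cut-off whose estimated FDR is within target_fdr.

    Returns None if there are no target matches or no cut-off qualifies.
    """
    best = None
    for cutoff in sorted(set(target_peps)):
        # Every cutoff comes from target_peps, so n_target is at least 1.
        n_target = sum(1 for p in target_peps if p <= cutoff)
        n_decoy = sum(1 for p in decoy_peps if p <= cutoff)
        if n_decoy / n_target <= target_fdr:
            best = cutoff
    return best


# Made-up PEP values, purely to exercise the function.
example_targets = [0.001, 0.002, 0.01, 0.2, 0.5]
example_decoys = [0.15, 0.6]
strict = pep_threshold_for_fdr(example_targets, example_decoys, target_fdr=0.1)
```

The loop recounts matches for every cut-off, which is fine for a small example but would be replaced by a single sorted pass for real result sets.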
Relevant publications
Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, & Michael J MacCoss at the University of Washington, Department of Genome Sciences. The software is released under an Apache 2.0 licence and included with Mascot by permission.
We would also like to acknowledge the work of Markus Brosch and colleagues at the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results and developed a wrapper application called Mascot Percolator.
There are a number of relevant publications:
- Käll, L., et al., Semi-supervised learning for peptide identification from shotgun proteomics datasets, Nature Methods 4 923-925 (2007)
- Käll, L., et al., Posterior error probabilities and false discovery rates: Two sides of the same coin, Journal of Proteome Research 7 40-44 (2008)
- Käll, L., et al., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases, Journal of Proteome Research 7 29-34 (2008)
- Käll, L., et al., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry, Bioinformatics 24 i42-i48 (2008)
- Brosch, M., et al., Accurate and Sensitive Peptide Identification with Mascot Percolator, Journal of Proteome Research 8 3176-3181 (2009)
- Spivak, M., et al., Improvements to the Percolator Algorithm for Peptide Identification from Shotgun Proteomics Data Sets, Journal of Proteome Research 8 3737-3745 (2009)