Predicting retention time and spectral similarity with MS2Rescore
One of the requirements of refining results with machine learning is a set of features (or metrics) for each peptide-spectrum match. When enabled, Percolator uses the features in finding an optimal separation between correct and incorrect matches.
Two types of features are available: core features calculated by Mascot, and features predicted from the physico-chemical properties of peptides. For the latter, Mascot Server ships with MS2Rescore.
Predicted features can be a powerful discriminator between correct and incorrect matches. However, some care is needed when selecting a suitable model.
MS2Rescore
MS2Rescore is a “Modular and user-friendly platform for AI-assisted rescoring of peptide identifications”, developed at the University of Ghent. MS2Rescore provides a common Python interface to DeepLC for predicting retention times and MS2PIP for spectral similarity.
MS2Rescore is included in Mascot Server with permission from its developers. MS2Rescore and its dependencies have various open source licences, which are detailed in the Mascot Installation & Setup manual.
Key publications:
- Buur et al.: MS2Rescore 3.0 is a modular, flexible, and user-friendly platform to boost peptide identifications, as showcased with MS Amanda 3.0. J Prot Res (2024)
- Declercq et al.: MS2Rescore: Data-driven rescoring dramatically boosts immunopeptide identification rates. Molecular & Cellular Proteomics (2021)
- Silva et al.: Accurate peptide fragmentation predictions allow data driven approaches to replace and improve upon proteomics search engine scoring functions. Bioinformatics (2019)
Mascot integration
Mascot Server ships with a specially packaged installation of MS2Rescore, compiled and built to work seamlessly on both Windows and Linux systems. When you install Mascot Server, MS2Rescore is automatically unpacked into the Mascot installation directory.
A packaged Python environment, including all MS2Rescore Python code and libraries, is stored in mascot/bin/ML_adapters/matrix_science. The directory contains native code libraries, precompiled Python modules and data files, such as HTML templates, necessary for the execution of MS2Rescore functionality.
An executable, MS2RescoreAdapter.exe, implements a Mascot adapter interface for machine learning tools. This executable takes as input a Mascot results file and a specification of PSM identifiers (query, rank) for which predicted features are requested. The adapter also takes as arguments the DeepLC model name and MS2PIP model name. When either one is specified, the adapter calls suitable MS2Rescore Python functions to predict the requested features, and saves these in a temporary TSV file.
A Mascot utility script, insert_predicted_data.pl, runs MS2RescoreAdapter.exe as necessary and combines the predicted features with core features calculated by Mascot.
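The combining step can be pictured as a join on the PSM identifier (query, rank). The following is a minimal sketch of that idea only, not the code in insert_predicted_data.pl. The function name and the feature names `score` and `rt_error` are hypothetical and chosen for illustration.

```python
from typing import Dict, Tuple

# A PSM is identified by (query, rank), as described above.
PsmKey = Tuple[int, int]
Features = Dict[str, float]


def combine_features(
    core: Dict[PsmKey, Features],
    predicted: Dict[PsmKey, Features],
) -> Dict[PsmKey, Features]:
    """Add predicted features to the core features of each PSM.

    PSMs without predicted features keep their core features only.
    """
    combined: Dict[PsmKey, Features] = {}
    for key, core_row in core.items():
        row = dict(core_row)                 # copy, so the input is not modified
        row.update(predicted.get(key, {}))   # add predicted features, if any
        combined[key] = row
    return combined


# Hypothetical feature names and values, for illustration only.
core = {(1, 1): {"score": 45.2}, (2, 1): {"score": 12.9}}
predicted = {(1, 1): {"rt_error": 0.8}}
combined = combine_features(core, predicted)
```

In this sketch, the match for query 1, rank 1 ends up with both features, while query 2, rank 1 keeps only its core feature.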
The data files containing model weights for DeepLC and MS2PIP are stored in mascot/ML_models.
GPU (not required)
All code runs on the CPU. A GPU is not required.
Internet access (not required)
All MS2Rescore components and model files are supplied with Mascot; nothing is downloaded from the Internet.
We have additionally disabled all functionality in MS2Rescore, MS2PIP, DeepLC and other Python modules that would have tried to download arbitrary files from the Internet.
DeepLC models for RT prediction
DeepLC is a “retention time predictor for (modified) peptides that employs Deep Learning”, developed at the University of Ghent. DeepLC takes a peptide sequence and variable modifications, computes the elemental composition and predicts the retention time. An important strength of DeepLC is that it can make accurate predictions even for variable modifications not seen during the training step.
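To illustrate what "elemental composition" means here, the sketch below sums the atoms of a peptide from standard amino acid residue formulas, plus one water for the termini and any modification deltas. It is not DeepLC's actual feature encoding, which is more detailed. The function name `peptide_composition` is an assumption made for this example; the Carbamidomethyl delta (C2 H3 N O) is the standard one.

```python
from collections import Counter
from typing import Dict, Iterable

# Elemental composition of each amino acid residue (amino acid minus water).
RESIDUE_FORMULAS: Dict[str, Dict[str, int]] = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "S": {"C": 3, "H": 5, "N": 1, "O": 2},
    "P": {"C": 5, "H": 7, "N": 1, "O": 1},
    "V": {"C": 5, "H": 9, "N": 1, "O": 1},
    "T": {"C": 4, "H": 7, "N": 1, "O": 2},
    "C": {"C": 3, "H": 5, "N": 1, "O": 1, "S": 1},
    "L": {"C": 6, "H": 11, "N": 1, "O": 1},
    "I": {"C": 6, "H": 11, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "D": {"C": 4, "H": 5, "N": 1, "O": 3},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
    "K": {"C": 6, "H": 12, "N": 2, "O": 1},
    "E": {"C": 5, "H": 7, "N": 1, "O": 3},
    "M": {"C": 5, "H": 9, "N": 1, "O": 1, "S": 1},
    "H": {"C": 6, "H": 7, "N": 3, "O": 1},
    "F": {"C": 9, "H": 9, "N": 1, "O": 1},
    "R": {"C": 6, "H": 12, "N": 4, "O": 1},
    "Y": {"C": 9, "H": 9, "N": 1, "O": 2},
    "W": {"C": 11, "H": 10, "N": 2, "O": 1},
}
WATER = {"H": 2, "O": 1}
CARBAMIDOMETHYL = {"C": 2, "H": 3, "N": 1, "O": 1}


def peptide_composition(
    sequence: str,
    modifications: Iterable[Dict[str, int]] = (),
) -> Dict[str, int]:
    """Sum the atoms of all residues, one water, and any modification deltas."""
    total = Counter(WATER)
    for residue in sequence:
        total.update(RESIDUE_FORMULAS[residue])
    for delta in modifications:
        total.update(delta)
    return dict(total)


dipeptide = peptide_composition("GA")
modified_cys = peptide_composition("C", [CARBAMIDOMETHYL])
```

Because a modification is represented only by its atoms, a composition can be computed for any modification with a known formula, whether or not it appeared in training data.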
Key publication: Bouwmeester et al.: DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nature Methods 18, 1363–1369 (2021).
Requirements
The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.
Retention time must be included in the peak lists, so that it is available in the Mascot result file. If the input is MGF, it must specify the RTINSECONDS parameter for each query. It is not sufficient to have the information embedded in the scan title string. If the input is mzML, retention times must be included as correctly tagged metadata (CV terms), not as scan titles.
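Before submitting an MGF file, it can be useful to confirm that every query carries an RTINSECONDS line. The following is a minimal sketch of such a check, assuming the usual BEGIN IONS / END IONS layout; the function name `queries_missing_rt` is hypothetical and is not part of Mascot.

```python
from typing import Iterable, List, Tuple


def queries_missing_rt(mgf_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (query number, title) for each query without RTINSECONDS.

    Query numbers count BEGIN IONS blocks, starting at 1.
    """
    missing: List[Tuple[int, str]] = []
    in_query = False
    has_rt = False
    title = ""
    index = 0
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_query, has_rt, title = True, False, ""
            index += 1
        elif line == "END IONS":
            if in_query and not has_rt:
                missing.append((index, title))
            in_query = False
        elif in_query and line.upper().startswith("RTINSECONDS="):
            # An empty value does not count as a retention time.
            has_rt = line.split("=", 1)[1].strip() != ""
        elif in_query and line.upper().startswith("TITLE="):
            title = line.split("=", 1)[1]
    return missing


example = """BEGIN IONS
TITLE=scan 1 rt=12.5
PEPMASS=500.25
CHARGE=2+
100.1 200
END IONS
BEGIN IONS
TITLE=scan 2
RTINSECONDS=754.2
PEPMASS=612.3
200.2 150
END IONS
""".splitlines()

missing = queries_missing_rt(example)
```

In the example, the first query is reported as missing even though its title mentions a retention time, because only the RTINSECONDS parameter counts.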
Retention time predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.
Models shipped with Mascot Server
Mascot Server ships with the DeepLC models listed below. The contents of this table are derived from DeepLC documentation and Supplementary Table 2 of Bouwmeester et al., Nature Methods 18, 1363–1369 (2021), as well as the inverse name mapping in the Zenodo repository for the publication.
The model full_hc_PXD005573_mcp is a starting point recommended by the DeepLC developers. It is a generalisable model that seems to work well in many cases.
RP means reverse phase; HILIC means hydrophilic interaction liquid chromatography; SCX means strong cation exchange.
Model | Name in pub. | Column type | Gradient length | Peptide properties | Train/test data (unique peptides) |
---|---|---|---|---|---|
full_hc_PXD005573_mcp (recommended) | DIA HF | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Reiter et al. 2017, PXD005573 (43,002) |
full_hc_ATLANTIS_SILICA_fixed_mods | ATLANTIS SILICA | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (30,848) |
full_hc_LUNA_HILIC_fixed_mods | LUNA HILIC | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (23,512) |
full_hc_LUNA_SILICA_fixed_mods | LUNA SILICA | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (26,051) |
full_hc_PXD008783_median_calibrate | | | | Semi-tryptic, metabolic (14N/15N), open modification | Chi et al. 2018, PXD008783 |
full_hc_SCX_fixed_mods | SCX | SCX | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (21,638) |
full_hc_Xbridge_fixed_mods | Xbridge | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (31,483) |
full_hc_arabidopsis_psms_aligned | Arabidopsis | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Mucha et al. 2019, PXD008812 (10,132) |
full_hc_dia_fixed_mods | SWATH library | RP | 135min | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Rosenberger et al. 2014, PXD000954 (96,798) |
full_hc_hela_hf_psms_aligned | HeLa hf | RP | 1h | Tryptic, TMT, phosphopeptide enriched | Kelstrup et al. 2017, PXD006932 (137,821) |
full_hc_hela_lumos_1h_psms_aligned | HeLa Lumos 1h | RP | 1h | Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (13,310) |
full_hc_hela_lumos_2h_psms_aligned | HeLa Lumos 2h | RP | 2h | Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (34,231) |
full_hc_mod_fixed_mods | HeLa DeepRT | RP | 4h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term), Phospho (STY) | Sharma et al. 2014 (2,917) |
full_hc_pancreas_psms_aligned | Pancreas | RP | 110min | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Wang et al. 2019, PXD010154 (33,421) |
full_hc_plasma_lumos_1h_psms_aligned | Plasma lumos 1h | RP | 1h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (49,495) |
full_hc_plasma_lumos_2h_psms_aligned | Plasma lumos 2h | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (2,997) |
full_hc_prosit_ptm_2020 | ProteomeTools PTM | RP | 50min | Tryptic, 21 PTMs | Zolg et al. 2018, PXD009449 (3,659) |
full_hc_unmod_fixed_mods | Yeast DeepRT | RP | 4h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Nagaraj et al. 2012 (4,867) |
full_hc_yeast_120min_psms_aligned | Yeast 2h | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Jarnuczak et al. 2016, PXD003472 (15,822) |
full_hc_yeast_60min_psms_aligned | Yeast 1h | RP | 1h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Jarnuczak et al. 2016, PXD003472 (12,197) |
Limitations
DeepLC calibrates the predicted retention times to the time range of the observed peptide matches. This works well when the gradient of the selected model (the RT range of peptides used in training) and the gradient of the experiment have similar lengths. However, calibration can fail if you are trying to predict retention times for a short slice of a full elution profile, or for a very short gradient. Typically, using a 1h model works fine for 30-90 minute gradients, but the 1h model isn’t very good for a 2-minute or 5-minute gradient. The main symptom is a narrow band of predicted retention times in the machine learning quality report.
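The "narrow band" symptom can be expressed as a number by comparing the spread of predicted retention times with the spread of observed ones. The following is a minimal sketch of that comparison and does not reproduce how the quality report is computed. The function name and the 0.5 threshold are arbitrary assumptions for illustration, not values used by Mascot or DeepLC.

```python
from typing import Sequence


def rt_spread_ratio(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Ratio of the predicted RT range to the observed RT range.

    A value close to 1 means the predictions span the whole gradient;
    a value close to 0 means they are squeezed into a narrow band.
    """
    observed_range = max(observed) - min(observed)
    predicted_range = max(predicted) - min(predicted)
    if observed_range == 0:
        raise ValueError("observed retention times have no spread")
    return predicted_range / observed_range


# Illustrative values in seconds: a 5-minute gradient where the
# predictions cluster around the middle of the run.
observed = [60.0, 120.0, 180.0, 240.0, 300.0]
predicted = [170.0, 175.0, 180.0, 185.0, 190.0]

ratio = rt_spread_ratio(observed, predicted)
NARROW_BAND_THRESHOLD = 0.5  # hypothetical cut-off, chosen for this example
is_narrow = ratio < NARROW_BAND_THRESHOLD
```

A low ratio suggests trying a model trained on a gradient closer in length to the experiment.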
DeepLC prediction accuracy decreases if the peptide is very different from those in the training set. For example, DeepLC cannot make a good prediction for a peptide containing selenium, because none of the training sets contain selenium. Another example is a variable modification containing zinc, which is absent from most training sets.
The maximum sequence length for DeepLC is 60 residues. The predicted retention time for longer peptides is zero.
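Since a predicted retention time of zero for a long peptide reflects the length limit rather than the chromatography, it can help to identify such peptides up front. A minimal sketch follows; the function name is hypothetical.

```python
from typing import Iterable, List

DEEPLC_MAX_LENGTH = 60  # maximum sequence length stated above


def too_long_for_deeplc(peptides: Iterable[str]) -> List[str]:
    """Return the peptide sequences that exceed the DeepLC length limit."""
    return [p for p in peptides if len(p) > DEEPLC_MAX_LENGTH]


peptides = ["PEPTIDEK", "A" * 60, "A" * 61]
flagged = too_long_for_deeplc(peptides)
```

Only the 61-residue sequence is flagged; a peptide of exactly 60 residues is within the limit.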
The current version of DeepLC doesn’t fully support isotopes except 13C and 15N. If two peptides differ only by a deuterium, for example, DeepLC predicts the same retention time even though the real RT may differ.
You can easily assess the model performance by viewing the machine learning quality report.
MS2PIP models for spectral similarity
MS2PIP is a “Fast and accurate peptide fragmentation spectrum prediction for multiple fragmentation methods, instruments and labeling techniques”, developed at the University of Ghent. MS2PIP takes a peptide sequence, variable modifications and charge state as input, and predicts the MS/MS fragmentation spectrum, including peak intensities.
MS2PIP supports tryptic and non-tryptic peptides (based on the selected model), and it is able to make spectral predictions for a range of variable modifications.
The accuracy of the predictions depends on correct model choice, explained further below.
Key publications:
- Declercq et al.: Updated MS2PIP web server supports cutting-edge proteomics applications. Nucleic Acids Research (2023)
- Gabriels et al.: Updated MS2PIP web server delivers fast and accurate MS2 peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques. Nucleic Acids Research (2019)
- Degroeve et al.: MS2PIP prediction server: compute and visualize MS2 peak intensity predictions for CID and HCD fragmentation. Nucleic Acids Research, 43(W1), W326–W330. (2015)
- Degroeve, S., & Martens, L.: MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29(24), 3199–203. (2013)
Requirements
The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.
Spectral predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.
Models shipped with Mascot Server
Mascot Server ships with the MS2PIP models listed below. The contents of this table are derived from MS2PIP v4.0 documentation.
Model | Version | Fragmentation | MS2 analyzer | Peptide properties | Train/test data (unique peptides) |
---|---|---|---|---|---|
CID | v20190107 | CID | Linear ion trap | Tryptic | NIST CID Human (340,356) |
CID-TMT | v20190107 | CID | Linear ion trap | Tryptic, TMT-labeled | PXD041002 (72,138) |
CIDch2 | v20190107 | CID | Linear ion trap | Tryptic, 1+ and 2+ fragments | NIST CID Human (340,356) |
HCD2019 | v20190107 | HCD | Orbitrap | Tryptic | MassIVE-KB (1,623,712) |
HCD2021 | v20210416 | HCD | Orbitrap | Tryptic and chymotryptic | Combined dataset (520,579) |
HCDch2 | v20190107 | HCD | Orbitrap | Tryptic, 1+ and 2+ fragments | MassIVE-KB (1,623,712) |
Immuno-HCD | v20210316 | HCD | Orbitrap | Immunopeptides | Combined dataset (460,191) |
TMT | v20190107 | HCD | Orbitrap | Tryptic, TMT-labeled | Peng Lab TMT Spectral Library (1,185,547) |
TTOF5600 | v20190107 | CID | Quadrupole time-of-flight | Tryptic | PXD000954 (215,713) |
iTRAQ | v20190107 | HCD | Orbitrap | Tryptic digest, iTRAQ-labeled | NIST iTRAQ (704,041) |
iTRAQphospho | v20190107 | HCD | Orbitrap | Tryptic, iTRAQ-labeled, enriched for phosphorylation | NIST iTRAQ phospho (183,383) |
timsTOF2023 | v20230912 | CID | Ion mobility quadrupole time-of-flight | Tryptic and elastase, immuno class 1 | Combined dataset (234,973) |
timsTOF2024 | v20240105 | CID | Ion mobility quadrupole time-of-flight | Tryptic and elastase, immuno class 1 & 2 | Combined dataset (480,024) |
Limitations
MS2PIP predictions are most accurate when you select a model whose training data came from a similar instrument, and your experimental peptides have variable modifications that were all part of the training set. Prediction accuracy decreases the more your variable modifications differ from the training set.
Predicting fragmentation spectra for non-tryptic peptides using a tryptic model is possible (MS2PIP gives a prediction) but might not be accurate. On the other hand, a model trained on semi-tryptic or endogenous peptides typically has good performance when predicting spectra for tryptic peptides.
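As a first step in choosing a model, the table above can be narrowed down by fragmentation method and MS2 analyzer. The sketch below encodes those two columns of the table and filters on them. The dictionary and the function name `candidate_models` are assumptions for this example; peptide properties and modifications still need to be checked by hand against the table.

```python
from typing import Dict, List, Tuple

# (fragmentation, MS2 analyzer) for each model, copied from the table above.
MS2PIP_MODELS: Dict[str, Tuple[str, str]] = {
    "CID": ("CID", "Linear ion trap"),
    "CID-TMT": ("CID", "Linear ion trap"),
    "CIDch2": ("CID", "Linear ion trap"),
    "HCD2019": ("HCD", "Orbitrap"),
    "HCD2021": ("HCD", "Orbitrap"),
    "HCDch2": ("HCD", "Orbitrap"),
    "Immuno-HCD": ("HCD", "Orbitrap"),
    "TMT": ("HCD", "Orbitrap"),
    "TTOF5600": ("CID", "Quadrupole time-of-flight"),
    "iTRAQ": ("HCD", "Orbitrap"),
    "iTRAQphospho": ("HCD", "Orbitrap"),
    "timsTOF2023": ("CID", "Ion mobility quadrupole time-of-flight"),
    "timsTOF2024": ("CID", "Ion mobility quadrupole time-of-flight"),
}


def candidate_models(fragmentation: str, analyzer: str) -> List[str]:
    """Return model names matching the fragmentation method and analyzer."""
    return sorted(
        name
        for name, (frag, ms2) in MS2PIP_MODELS.items()
        if frag == fragmentation and ms2 == analyzer
    )


qtof_models = candidate_models("CID", "Quadrupole time-of-flight")
ion_trap_models = candidate_models("CID", "Linear ion trap")
```

For a quadrupole time-of-flight instrument with CID, this leaves a single candidate, while a linear ion trap with CID leaves three that differ in peptide properties.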
You can easily assess the model performance by viewing the machine learning quality report.