Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Predicting retention time and spectral similarity with MS2Rescore

One of the requirements for refining results with machine learning is a set of features (or metrics) for each peptide-spectrum match. When enabled, Percolator uses the features to find an optimal separation between correct and incorrect matches.

Two types of features are available: core features calculated by Mascot, and features predicted from physico-chemical properties of peptides. For the latter, Mascot Server ships with MS2Rescore.

Predicted features can be a powerful discriminator between correct and incorrect matches. However, some care is needed when selecting a suitable model.

MS2Rescore

MS2Rescore is a “Modular and user-friendly platform for AI-assisted rescoring of peptide identifications”, developed at the University of Ghent. MS2Rescore provides a common Python interface to DeepLC for predicting retention times and MS2PIP for spectral similarity.

MS2Rescore is included in Mascot Server with permission from its developers. MS2Rescore and its dependencies have various open source licences, which are detailed in the Mascot Installation & Setup manual.

Key publications:

Mascot integration

Mascot Server ships with a specially packaged installation of MS2Rescore, compiled and built to work seamlessly on both Windows and Linux systems. When you install Mascot Server, MS2Rescore is automatically unpacked into the Mascot installation directory.

A packaged Python environment, including all MS2Rescore Python code and libraries, is stored in mascot/bin/ML_adapters/matrix_science. The directory contains native code libraries, precompiled Python modules and data files, such as HTML templates, necessary for the execution of MS2Rescore functionality.

An executable, MS2RescoreAdapter.exe, implements a Mascot adapter interface for machine learning tools. This executable takes as input a Mascot results file and a specification of PSM identifiers (query, rank) for which predicted features are requested. The adapter also takes as arguments the DeepLC model name and MS2PIP model name. When either one is specified, the adapter calls suitable MS2Rescore Python functions to predict the requested features, and saves these in a temporary TSV file.

A Mascot utility script, insert_predicted_data.pl, runs MS2RescoreAdapter.exe as necessary and combines the predicted features with core features calculated by Mascot.
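The combining step can be pictured as a join on the PSM identifier (query, rank). The following is a minimal sketch of that idea only, not the code in insert_predicted_data.pl. The TSV layout, the column names (query, rank, predicted_rt) and the function name merge_features are all assumptions made for illustration.

```python
import csv
import io


def merge_features(core_rows, predicted_tsv_text):
    """Join predicted features onto core features by (query, rank).

    The TSV column names used here are hypothetical, not the adapter's
    actual output format.
    """
    reader = csv.DictReader(io.StringIO(predicted_tsv_text), delimiter="\t")
    predicted = {}
    for row in reader:
        key = (int(row.pop("query")), int(row.pop("rank")))
        # Every remaining column is treated as a numeric predicted feature.
        predicted[key] = {name: float(value) for name, value in row.items()}

    merged = []
    for core in core_rows:
        key = (core["query"], core["rank"])
        # PSMs without a prediction keep only their core features.
        merged.append({**core, **predicted.get(key, {})})
    return merged


# Hypothetical example data.
EXAMPLE_TSV = "query\trank\tpredicted_rt\n1\t1\t1234.5\n2\t1\t987.0\n"
EXAMPLE_CORE = [
    {"query": 1, "rank": 1, "score": 45.2},
    {"query": 3, "rank": 1, "score": 12.0},
]

if __name__ == "__main__":
    for row in merge_features(EXAMPLE_CORE, EXAMPLE_TSV):
        print(row)
```

In this sketch, a PSM that has no row in the predicted file is passed through unchanged rather than dropped.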

The data files containing model weights for DeepLC and MS2PIP are stored in mascot/ML_models.

GPU (not required)

All code runs on the CPU. A GPU is not required.

Internet access (not required)

All MS2Rescore components and model files are supplied with Mascot; nothing is downloaded from the Internet.

We have additionally disabled all functionality in MS2Rescore, MS2PIP, DeepLC and other Python modules that would have tried to download arbitrary files from the Internet.

DeepLC models for RT prediction

DeepLC is a “retention time predictor for (modified) peptides that employs Deep Learning”, developed at the University of Ghent. DeepLC takes a peptide sequence and variable modifications, computes the elemental composition and predicts the retention time. An important strength of DeepLC is that it can make accurate predictions even for variable modifications not seen during the training step.

Key publication: Bouwmeester et al.: DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nature Methods 18, 1363–1369 (2021).

Requirements

The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.

Retention time must be included in the peak lists, so that it is available in the Mascot result file. If the input is MGF, it must specify the RTINSECONDS parameter for each query. It is not sufficient to have the information embedded in the scan title string. If the input is mzML, retention times must be included as correctly tagged metadata (CV terms), not as scan titles.
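Before submitting an MGF file, it can be useful to confirm that every query carries its own RTINSECONDS line. The following is a minimal sketch of such a check, assuming a plain MGF file with BEGIN IONS and END IONS delimiters. The function name queries_missing_rt is made up for this example and is not part of Mascot.

```python
def queries_missing_rt(mgf_text):
    """Return 1-based indices of MGF queries that lack an RTINSECONDS line."""
    missing = []
    index = 0
    in_query = False
    has_rt = False
    for raw_line in mgf_text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN IONS":
            in_query = True
            has_rt = False
            index += 1
        elif line == "END IONS":
            if in_query and not has_rt:
                missing.append(index)
            in_query = False
        elif in_query and line.upper().startswith("RTINSECONDS="):
            has_rt = True
    return missing


# The first query has the retention time only in its title, which is
# not sufficient; the second specifies RTINSECONDS.
EXAMPLE_MGF = """BEGIN IONS
TITLE=scan 1 rt=12.3
PEPMASS=500.25
CHARGE=2+
100.1 200
END IONS
BEGIN IONS
TITLE=scan 2
RTINSECONDS=745.2
PEPMASS=612.31
CHARGE=2+
150.2 300
END IONS
"""

if __name__ == "__main__":
    print(queries_missing_rt(EXAMPLE_MGF))
```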

Retention time predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.

Models shipped with Mascot Server

Mascot Server ships with the DeepLC models listed below. The contents of this table are derived from DeepLC documentation and Supplementary Table 2 of Bouwmeester et al., Nature Methods 18, 1363–1369 (2021), as well as the inverse name mapping in the Zenodo repository for the publication.

The model full_hc_PXD005573_mcp is a starting point recommended by the DeepLC developers. It is a generalisable model that seems to work well in many cases.

RP means reverse phase; HILIC means hydrophilic interaction liquid chromatography; SCX means strong cation exchange.

| Model | Name in pub. | Column type | Gradient length | Peptide properties | Train/test data (unique peptides) |
|---|---|---|---|---|---|
| full_hc_PXD005573_mcp (recommended) | DIA HF | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Reiter et al. 2017, PXD005573 (43,002) |
| full_hc_ATLANTIS_SILICA_fixed_mods | ATLANTIS SILICA | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (30,848) |
| full_hc_LUNA_HILIC_fixed_mods | LUNA HILIC | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (23,512) |
| full_hc_LUNA_SILICA_fixed_mods | LUNA SILICA | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (26,051) |
| full_hc_PXD008783_median_calibrate | | | | Semi-tryptic, metabolic (14N/15N), open modification | Chi et al. 2018, PXD008783 |
| full_hc_SCX_fixed_mods | SCX | SCX | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (21,638) |
| full_hc_Xbridge_fixed_mods | Xbridge | HILIC | 90min | Tryptic, Carbamidomethyl (C) fixed | Gussakovsky et al. 2017 (31,483) |
| full_hc_arabidopsis_psms_aligned | Arabidopsis | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Mucha et al. 2019, PXD008812 (10,132) |
| full_hc_dia_fixed_mods | SWATH library | RP | 135min | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Rosenberger et al. 2014, PXD000954 (96,798) |
| full_hc_hela_hf_psms_aligned | HeLa hf | RP | 1h | Tryptic, TMT, phosphopeptide enriched | Kelstrup et al. 2017, PXD006932 (137,821) |
| full_hc_hela_lumos_1h_psms_aligned | HeLa Lumos 1h | RP | 1h | Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (13,310) |
| full_hc_hela_lumos_2h_psms_aligned | HeLa Lumos 2h | RP | 2h | Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (34,231) |
| full_hc_mod_fixed_mods | HeLa DeepRT | RP | 4h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term), Phospho (STY) | Sharma et al. 2014 (2,917) |
| full_hc_pancreas_psms_aligned | Pancreas | RP | 110min | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Wang et al. 2019, PXD010154 (33,421) |
| full_hc_plasma_lumos_1h_psms_aligned | Plasma lumos 1h | RP | 1h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (49,495) |
| full_hc_plasma_lumos_2h_psms_aligned | Plasma lumos 2h | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M) | Li et al. 2019, PXD013477 (2,997) |
| full_hc_prosit_ptm_2020 | ProteomeTools PTM | RP | 50min | Tryptic, 21 PTMs | Zolg et al. 2018, PXD009449 (3,659) |
| full_hc_unmod_fixed_mods | Yeast DeepRT | RP | 4h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Nagaraj et al. 2012 (4,867) |
| full_hc_yeast_120min_psms_aligned | Yeast 2h | RP | 2h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Jarnuczak et al. 2016, PXD003472 (15,822) |
| full_hc_yeast_60min_psms_aligned | Yeast 1h | RP | 1h | Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term) | Jarnuczak et al. 2016, PXD003472 (12,197) |

Limitations

DeepLC calibrates the predicted retention times to the time range of the observed peptide matches. This works well when the gradient of the selected model (the RT range of peptides used in training) and the gradient of the experiment have similar lengths. However, calibration can fail if you are trying to predict retention times for a short slice of a full elution profile, or for a very short gradient. Typically, using a 1h model works fine for 30-90 minute gradients, but the 1h model isn’t very good for a 2-minute or 5-minute gradient. The main symptom is a narrow band of predicted retention times in the machine learning quality report.
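The narrow-band symptom can be expressed as a number by comparing the spread of predicted retention times with the spread of observed ones. The following is a rough sketch of that comparison and is not how DeepLC or the quality report computes anything. The function name rt_spread_ratio and the example values are made up for illustration.

```python
def rt_spread_ratio(observed_rt, predicted_rt):
    """Ratio of the predicted RT range to the observed RT range.

    A value far below 1 means the predictions occupy a narrow band
    compared with the observed elution times.
    """
    observed_range = max(observed_rt) - min(observed_rt)
    predicted_range = max(predicted_rt) - min(predicted_rt)
    if observed_range == 0:
        raise ValueError("observed retention times have zero range")
    return predicted_range / observed_range


# Hypothetical values in seconds: observed peptides elute over
# 4 minutes, but the predictions span only 20 seconds.
EXAMPLE_OBSERVED = [60.0, 120.0, 180.0, 240.0, 300.0]
EXAMPLE_PREDICTED = [170.0, 175.0, 180.0, 185.0, 190.0]

if __name__ == "__main__":
    print(rt_spread_ratio(EXAMPLE_OBSERVED, EXAMPLE_PREDICTED))
```

What counts as too narrow depends on the experiment, so the sketch deliberately sets no threshold.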

DeepLC prediction accuracy decreases when the peptide is very different from the training set. For example, DeepLC cannot make a good prediction for a peptide containing selenium, because none of the models were trained on peptides containing selenium. Another example is a variable modification containing zinc, which is absent from most training sets.

The maximum sequence length for DeepLC is 60 residues. The predicted retention time for longer peptides is zero.
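Because a predicted retention time of zero for a long peptide means that no prediction was made, it can help to know in advance which peptides exceed the limit. The following is a small sketch that separates them. The constant and function names are made up for this example.

```python
DEEPLC_MAX_LENGTH = 60  # maximum sequence length stated above


def split_by_length(peptides, max_length=DEEPLC_MAX_LENGTH):
    """Split peptide sequences into those within and beyond the limit."""
    within = [p for p in peptides if len(p) <= max_length]
    too_long = [p for p in peptides if len(p) > max_length]
    return within, too_long


if __name__ == "__main__":
    # A 60-residue peptide is within the limit; 61 residues is not.
    within, too_long = split_by_length(["PEPTIDEK", "A" * 60, "A" * 61])
    print(len(within), len(too_long))
```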

The current version of DeepLC doesn’t fully support isotopes except 13C and 15N. If two peptides differ only by a deuterium, for example, DeepLC predicts the same retention time even though the real RT may differ.

You can easily assess the model performance by viewing the machine learning quality report.

MS2PIP models for spectral similarity

MS2PIP is a “Fast and accurate peptide fragmentation spectrum prediction for multiple fragmentation methods, instruments and labeling techniques”, developed at the University of Ghent. MS2PIP takes a peptide sequence, variable modifications and charge state as input, and predicts the MS/MS fragmentation spectrum, including peak intensities.
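Once a spectrum has been predicted, spectral similarity comes down to comparing predicted and observed peak intensities for the same fragment ions. One common measure is cosine similarity, sketched below. This is a generic illustration and is not claimed to be the measure that MS2Rescore or Mascot computes. It assumes the two intensity lists are already aligned, fragment by fragment.

```python
import math


def cosine_similarity(observed, predicted):
    """Cosine similarity between two aligned intensity vectors."""
    if len(observed) != len(predicted):
        raise ValueError("intensity vectors must have the same length")
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm_observed = math.sqrt(sum(o * o for o in observed))
    norm_predicted = math.sqrt(sum(p * p for p in predicted))
    if norm_observed == 0 or norm_predicted == 0:
        # No signal in one of the spectra: treat as no similarity.
        return 0.0
    return dot / (norm_observed * norm_predicted)


if __name__ == "__main__":
    # Identical relative intensities give a similarity close to 1,
    # regardless of absolute scale.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```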

MS2PIP supports tryptic and non-tryptic peptides (based on the selected model), and it is able to make spectral predictions for a range of variable modifications.

The accuracy of the predictions depends on correct model choice, explained further below.

Key publications:

Requirements

The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.

Spectral predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.

Models shipped with Mascot Server

Mascot Server ships with the MS2PIP models listed below. The contents of this table are derived from MS2PIP v4.0 documentation.

| Model | Version | Fragmentation | MS2 analyzer | Peptide properties | Train/test data (unique peptides) |
|---|---|---|---|---|---|
| CID | v20190107 | CID | Linear ion trap | Tryptic | NIST CID Human (340,356) |
| CID-TMT | v20190107 | CID | Linear ion trap | Tryptic, TMT-labeled | PXD041002 (72,138) |
| CIDch2 | v20190107 | CID | Linear ion trap | Tryptic, 1+ and 2+ fragments | NIST CID Human (340,356) |
| HCD2019 | v20190107 | HCD | Orbitrap | Tryptic | MassIVE-KB (1,623,712) |
| HCD2021 | v20210416 | HCD | Orbitrap | Tryptic and chymotryptic | Combined dataset (520,579) |
| HCDch2 | v20190107 | HCD | Orbitrap | Tryptic, 1+ and 2+ fragments | MassIVE-KB (1,623,712) |
| Immuno-HCD | v20210316 | HCD | Orbitrap | Immunopeptides | Combined dataset (460,191) |
| TMT | v20190107 | HCD | Orbitrap | Tryptic, TMT-labeled | Peng Lab TMT Spectral Library (1,185,547) |
| TTOF5600 | v20190107 | CID | Quadrupole time-of-flight | Tryptic | PXD000954 (215,713) |
| iTRAQ | v20190107 | HCD | Orbitrap | Tryptic digest, iTRAQ-labeled | NIST iTRAQ (704,041) |
| iTRAQphospho | v20190107 | HCD | Orbitrap | Tryptic, iTRAQ-labeled, enriched for phosphorylation | NIST iTRAQ phospho (183,383) |
| timsTOF2023 | v20230912 | CID | Ion mobility quadrupole time-of-flight | Tryptic and elastase, immuno class 1 | Combined dataset (234,973) |
| timsTOF2024 | v20240105 | CID | Ion mobility quadrupole time-of-flight | Tryptic and elastase, immuno class 1 & 2 | Combined dataset (480,024) |

Limitations

MS2PIP predictions are most accurate when you select a model whose training data came from a similar instrument, and your experimental peptides have variable modifications that were all part of the training set. Prediction accuracy decreases the more your variable modifications differ from the training set.
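A first pass at model selection is to shortlist the models whose fragmentation method and MS2 analyzer match the instrument. The following sketch encodes those two columns from the model table and filters on them. The dictionary and the function name candidate_models are made up for this example. The shortlist is only a starting point, since peptide properties and variable modifications also matter.

```python
# Fragmentation method and MS2 analyzer for each shipped MS2PIP model,
# transcribed from the model table in this document.
MS2PIP_MODELS = {
    "CID": ("CID", "Linear ion trap"),
    "CID-TMT": ("CID", "Linear ion trap"),
    "CIDch2": ("CID", "Linear ion trap"),
    "HCD2019": ("HCD", "Orbitrap"),
    "HCD2021": ("HCD", "Orbitrap"),
    "HCDch2": ("HCD", "Orbitrap"),
    "Immuno-HCD": ("HCD", "Orbitrap"),
    "TMT": ("HCD", "Orbitrap"),
    "TTOF5600": ("CID", "Quadrupole time-of-flight"),
    "iTRAQ": ("HCD", "Orbitrap"),
    "iTRAQphospho": ("HCD", "Orbitrap"),
    "timsTOF2023": ("CID", "Ion mobility quadrupole time-of-flight"),
    "timsTOF2024": ("CID", "Ion mobility quadrupole time-of-flight"),
}


def candidate_models(fragmentation, analyzer):
    """Return model names matching the given fragmentation and analyzer."""
    return sorted(
        name
        for name, (frag, ms2) in MS2PIP_MODELS.items()
        if frag == fragmentation and ms2 == analyzer
    )


if __name__ == "__main__":
    print(candidate_models("CID", "Quadrupole time-of-flight"))
```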

Predicting fragmentation spectra for non-tryptic peptides using a tryptic model is possible (MS2PIP gives a prediction) but might not be accurate. On the other hand, a model trained on semi-tryptic or endogenous peptides typically has good performance when predicting spectra for tryptic peptides.

You can easily assess the model performance by viewing the machine learning quality report.