Predicting retention time and spectral similarity with MS²Rescore

Refining results with machine learning requires a set of features (or metrics) for each peptide-spectrum match (PSM). When enabled, Percolator uses these features to find an optimal separation between correct and incorrect matches.

Two types of features are available: core features calculated by Mascot, and features predicted from the physico-chemical properties of peptides. For the latter, Mascot Server ships with MS²Rescore.

Predicted features can be a powerful discriminator between correct and incorrect matches. However, some care is needed when selecting a suitable model.

MS²Rescore

MS²Rescore is a “Modular and user-friendly platform for AI-assisted rescoring of peptide identifications”, developed at the University of Ghent. MS²Rescore provides a common Python interface to DeepLC for predicting retention times and MS²PIP for spectral similarity.

MS²Rescore is included in Mascot Server with permission from its developers. MS²Rescore and its dependencies have various open source licences, which are detailed in the Mascot Installation & Setup manual.

Key publications:

Mascot integration

Mascot Server ships with a specially packaged installation of MS²Rescore, compiled and built to work seamlessly on both Windows and Linux systems. When you install Mascot Server, MS²Rescore is automatically unpacked into the Mascot installation directory.

A packaged Python environment, including all MS²Rescore Python code and libraries, is stored in mascot/bin/ML_adapters/matrix_science. The directory contains the native code libraries, precompiled Python modules and data files, such as HTML templates, that MS²Rescore needs in order to run.

An executable, MS2RescoreAdapter.exe, implements a Mascot adapter interface for machine learning tools. This executable takes as input a Mascot results file and a specification of the PSM identifiers (query, rank) for which predicted features are requested. The adapter also takes the DeepLC model name and the MS²PIP model name as arguments. When either one is specified, the adapter calls the appropriate MS²Rescore Python functions to predict the requested features, and saves them in a temporary TSV file.

A Mascot utility script, insert_predicted_data.pl, runs MS2RescoreAdapter.exe as necessary and combines the predicted features with core features calculated by Mascot.

The data files containing model weights for DeepLC and MS²PIP are stored in mascot/ML_models.

GPU (not required)

All code runs on the CPU. A GPU is not required.

Internet access (not required)

All MS²Rescore components and model files are supplied with Mascot; nothing is downloaded from the Internet.

In addition, we have disabled all functionality in MS²Rescore, MS²PIP, DeepLC and other Python modules that would otherwise try to download files from the Internet.

DeepLC models for RT prediction

DeepLC is a “retention time predictor for (modified) peptides that employs Deep Learning”, developed at the University of Ghent. DeepLC takes a peptide sequence and variable modifications, computes the elemental composition and predicts the retention time. An important strength of DeepLC is that it can make accurate predictions even for variable modifications not seen during the training step.

Key publication: Bouwmeester et al.: DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nature Methods 18, 1363–1369 (2021).

Requirements

The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.

Retention time must be included in the peak lists, so that it is available in the Mascot result file. If the input is MGF, it must specify the RTINSECONDS parameter for each query. It is not sufficient to have the information embedded in the scan title string. If the input is mzML, retention times must be included as correctly tagged metadata (CV terms), not as scan titles.
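The MGF requirement above can be checked before submitting a search. The following is a minimal sketch, not part of Mascot: the function name queries_missing_rt is hypothetical, and it assumes a plain MGF file in which each query is delimited by BEGIN IONS and END IONS lines. It reports the queries that have no RTINSECONDS parameter, including queries where the retention time appears only in the TITLE string.

```python
import io


def queries_missing_rt(mgf_lines):
    """Return 1-based indices of MGF queries that lack an RTINSECONDS value.

    mgf_lines is any iterable of text lines, for example an open file.
    """
    missing = []
    query_index = 0
    in_query = False
    has_rt = False
    for raw in mgf_lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_query = True
            has_rt = False
            query_index += 1
        elif line == "END IONS":
            if in_query and not has_rt:
                missing.append(query_index)
            in_query = False
        elif in_query and line.upper().startswith("RTINSECONDS="):
            # An empty value does not count as a retention time.
            has_rt = line.split("=", 1)[1].strip() != ""
    return missing


# Query 1 has the retention time only in the title, so it is reported.
example = """BEGIN IONS
TITLE=scan 1, rt=12.5 min
PEPMASS=500.25
CHARGE=2+
100.1 200
END IONS
BEGIN IONS
TITLE=scan 2
RTINSECONDS=755.2
PEPMASS=600.30
CHARGE=2+
150.2 300
END IONS
"""
print(queries_missing_rt(io.StringIO(example)))  # prints [1]
```

To check a real peak list, pass an open file object instead of the in-memory example.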

Retention time predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.

Models shipped with Mascot Server

Mascot Server ships with the DeepLC models listed below. The contents of this table are derived from the DeepLC documentation and Supplementary Table 2 of Bouwmeester et al., Nature Methods 18, 1363–1369 (2021), as well as the inverse name mapping in the Zenodo repository for the publication.

The model full_hc_PXD005573_mcp is a starting point recommended by the DeepLC developers. It is a generalisable model that seems to work well in many cases.

RP means reverse phase; HILIC means hydrophilic interaction liquid chromatography; SCX means strong cation exchange.

Model	Name in pub.	Column type	Gradient length	Peptide properties	Train/test data (unique peptides)
full_hc_PXD005573_mcp (recommended)	DIA HF	RP	2h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term)	Reiter et al. 2017, PXD005573 (43,002)
full_hc_ATLANTIS_SILICA_fixed_mods	ATLANTIS SILICA	HILIC	90min	Tryptic, Carbamidomethyl (C) fixed	Gussakovsky et al. 2017 (30,848)
full_hc_LUNA_HILIC_fixed_mods	LUNA HILIC	HILIC	90min	Tryptic, Carbamidomethyl (C) fixed	Gussakovsky et al. 2017 (23,512)
full_hc_LUNA_SILICA_fixed_mods	LUNA SILICA	HILIC	90min	Tryptic, Carbamidomethyl (C) fixed	Gussakovsky et al. 2017 (26,051)
full_hc_PXD008783_median_calibrate				Semi-tryptic, metabolic (14N/15N), open modification	Chi et al. 2018, PXD008783
full_hc_SCX_fixed_mods	SCX	SCX	90min	Tryptic, Carbamidomethyl (C) fixed	Gussakovsky et al. 2017 (21,638)
full_hc_Xbridge_fixed_mods	Xbridge	HILIC	90min	Tryptic, Carbamidomethyl (C) fixed	Gussakovsky et al. 2017 (31,483)
full_hc_arabidopsis_psms_aligned	Arabidopsis	RP	2h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term)	Mucha et al. 2019, PXD008812 (10,132)
full_hc_dia_fixed_mods	SWATH library	RP	135min	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M)	Rosenberger et al. 2014, PXD000954 (96,798)
full_hc_hela_hf_psms_aligned	HeLa hf	RP	1h	Tryptic, TMT, phosphopeptide enriched	Kelstrup et al. 2017, PXD006932 (137,821)
full_hc_hela_lumos_1h_psms_aligned	HeLa Lumos 1h	RP	1h	Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M)	Li et al. 2019, PXD013477 (13,310)
full_hc_hela_lumos_2h_psms_aligned	HeLa Lumos 2h	RP	2h	Tryptic, SILAC, Carbamidomethyl (C) fixed, Oxidation (M)	Li et al. 2019, PXD013477 (34,231)
full_hc_mod_fixed_mods	HeLa DeepRT	RP	4h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term), Phospho (STY)	Sharma et al. 2014 (2,917)
full_hc_pancreas_psms_aligned	Pancreas	RP	110min	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M)	Wang et al. 2019, PXD010154 (33,421)
full_hc_plasma_lumos_1h_psms_aligned	Plasma lumos 1h	RP	1h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M)	Li et al. 2019, PXD013477 (49,495)
full_hc_plasma_lumos_2h_psms_aligned	Plasma lumos 2h	RP	2h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M)	Li et al. 2019, PXD013477 (2,997)
full_hc_prosit_ptm_2020	ProteomeTools PTM	RP	50min	Tryptic, 21 PTMs	Zolg et al. 2018, PXD009449 (3,659)
full_hc_unmod_fixed_mods	Yeast DeepRT	RP	4h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term)	Nagaraj et al. 2012 (4,867)
full_hc_yeast_120min_psms_aligned	Yeast 2h	RP	2h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term)	Jarnuczak et al. 2016, PXD003472 (15,822)
full_hc_yeast_60min_psms_aligned	Yeast 1h	RP	1h	Tryptic, Carbamidomethyl (C) fixed, Oxidation (M), Acetyl (Protein N-term)	Jarnuczak et al. 2016, PXD003472 (12,197)

Limitations

DeepLC calibrates the predicted retention times to the time range of the observed peptide matches. This works well when the gradient of the selected model (the RT range of peptides used in training) and the gradient of the experiment have similar lengths. However, calibration can fail if you are trying to predict retention times for a short slice of a full elution profile, or for a very short gradient. Typically, a 1h model works fine for 30 to 90 minute gradients, but performs poorly on a 2-minute or 5-minute gradient. The main symptom is a narrow band of predicted retention times in the machine learning quality report.
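The "narrow band" symptom can also be expressed as a number. The sketch below is an illustration only and is not how Mascot or DeepLC assesses calibration: the function rt_span_ratio and the 0.5 threshold are assumptions chosen for the example. It compares the spread of the predicted retention times with the spread of the observed ones; a ratio well below 1 suggests that the predictions have collapsed into a narrow band.

```python
def rt_span_ratio(observed, predicted):
    """Return (predicted RT span) / (observed RT span) for matched PSMs."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    observed_span = max(observed) - min(observed)
    predicted_span = max(predicted) - min(predicted)
    if observed_span == 0:
        raise ValueError("observed retention times are all identical")
    return predicted_span / observed_span


# Hypothetical values in seconds: observed RTs cover 240 s,
# predicted RTs cover only 20 s.
observed = [60.0, 120.0, 180.0, 240.0, 300.0]
predicted = [170.0, 175.0, 180.0, 185.0, 190.0]

ratio = rt_span_ratio(observed, predicted)
NARROW_BAND_THRESHOLD = 0.5  # assumed cut-off, for illustration only
if ratio < NARROW_BAND_THRESHOLD:
    print("Predicted RTs form a narrow band; consider a different model.")
```

The machine learning quality report remains the primary way to judge the fit; a check like this is only a quick screen.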

DeepLC prediction accuracy decreases if the peptide is very different from the peptides in the training set. For example, DeepLC cannot make a good prediction for a peptide containing selenium, because none of the training sets contain selenium. Another example is a variable modification containing zinc, which is absent from most training sets.

The maximum sequence length for DeepLC is 60 residues. The predicted retention time for longer peptides is zero.
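Because peptides over the length limit receive a predicted retention time of zero, it can help to identify them in advance. This is a minimal sketch under the assumption that peptide sequences are available as plain strings of one-letter residue codes; the names MAX_DEEPLC_LENGTH and split_by_length are hypothetical.

```python
MAX_DEEPLC_LENGTH = 60  # maximum sequence length stated above


def split_by_length(peptides, max_length=MAX_DEEPLC_LENGTH):
    """Split sequences into (supported, too_long) lists by residue count."""
    supported = [p for p in peptides if len(p) <= max_length]
    too_long = [p for p in peptides if len(p) > max_length]
    return supported, too_long


# A sequence of exactly 60 residues is still supported; 61 is not.
peptides = ["PEPTIDEK", "A" * 61, "G" * 60]
supported, too_long = split_by_length(peptides)
print(len(supported), len(too_long))  # prints 2 1
```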

The current version of DeepLC doesn’t fully support isotopes other than 13C and 15N. If two peptides differ only by a deuterium, for example, DeepLC predicts the same retention time even though the real RT may differ.

You can easily assess the model performance by viewing the machine learning quality report.

MS²PIP models for spectral similarity

MS²PIP is a “Fast and accurate peptide fragmentation spectrum prediction for multiple fragmentation methods, instruments and labeling techniques”, developed at the University of Ghent. MS²PIP takes a peptide sequence, variable modifications and charge state as input, and predicts the MS/MS fragmentation spectrum, including peak intensities.

MS²PIP supports tryptic and non-tryptic peptides (based on the selected model), and it is able to make spectral predictions for a range of variable modifications.

The accuracy of the predictions depends on correct model choice, explained further below.
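To make the idea of spectral similarity concrete, the sketch below computes the Pearson correlation between predicted and observed fragment intensities. This is one common similarity measure and is shown as an illustration only; it is an assumption that the intensities have already been matched peak by peak, and the sketch does not claim to reproduce the exact features that MS²Rescore calculates. The function name pearson_correlation is hypothetical.

```python
import math


def pearson_correlation(predicted, observed):
    """Pearson correlation between two equal-length intensity lists."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    covariance = sum((p - mean_p) * (o - mean_o)
                     for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant intensities")
    return covariance / math.sqrt(var_p * var_o)


# Hypothetical intensities for the same five fragment ions.
predicted = [0.10, 0.40, 0.80, 0.30, 0.05]
observed = [0.12, 0.35, 0.90, 0.25, 0.02]
print(round(pearson_correlation(predicted, observed), 3))
```

A value close to 1 means the predicted and observed intensity patterns agree, which is the behaviour expected of a correct match scored with a suitable model.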

Key publications:

Requirements

The database search must be an MS/MS search, and it must be run as an automatic target-decoy search.

Spectral predictions for crosslinking, spectral library searches and error tolerant searches are not currently supported.

Models shipped with Mascot Server

Mascot Server ships with the MS²PIP models listed below. The contents of this table are derived from the MS²PIP v4.0 documentation.

Model	Version	Fragmentation	MS2 analyzer	Peptide properties	Train/test data (unique peptides)
CID	v20190107	CID	Linear ion trap	Tryptic	NIST CID Human (340,356)
CID-TMT	v20190107	CID	Linear ion trap	Tryptic, TMT-labeled	PXD041002 (72,138)
CIDch2	v20190107	CID	Linear ion trap	Tryptic, 1+ and 2+ fragments	NIST CID Human (340,356)
HCD2019	v20190107	HCD	Orbitrap	Tryptic	MassIVE-KB (1,623,712)
HCD2021	v20210416	HCD	Orbitrap	Tryptic and chymotryptic	Combined dataset (520,579)
HCDch2	v20190107	HCD	Orbitrap	Tryptic, 1+ and 2+ fragments	MassIVE-KB (1,623,712)
Immuno-HCD	v20210316	HCD	Orbitrap	Immunopeptides	Combined dataset (460,191)
TMT	v20190107	HCD	Orbitrap	Tryptic, TMT-labeled	Peng Lab TMT Spectral Library (1,185,547)
TTOF5600	v20190107	CID	Quadrupole time-of-flight	Tryptic	PXD000954 (215,713)
iTRAQ	v20190107	HCD	Orbitrap	Tryptic digest, iTRAQ-labeled	NIST iTRAQ (704,041)
iTRAQphospho	v20190107	HCD	Orbitrap	Tryptic, iTRAQ-labeled, enriched for phosphorylation	NIST iTRAQ phospho (183,383)
timsTOF2023	v20230912	CID	Ion mobility quadrupole time-of-flight	Tryptic and elastase, immuno class 1	Combined dataset (234,973)
timsTOF2024	v20240105	CID	Ion mobility quadrupole time-of-flight	Tryptic and elastase, immuno class 1 & 2	Combined dataset (480,024)

Limitations

MS²PIP predictions are most accurate when you select a model whose training data came from a similar instrument, and your experimental peptides have variable modifications that were all part of the training set. Prediction accuracy decreases the more your variable modifications differ from the training set.

Predicting fragmentation spectra for non-tryptic peptides using a tryptic model is possible (MS²PIP gives a prediction) but might not be accurate. On the other hand, a model trained on semi-tryptic or endogenous peptides typically has good performance when predicting spectra for tryptic peptides.

You can easily assess the model performance by viewing the machine learning quality report.
