Best MS2PIP model for Thermo Orbitrap
Mascot Server 3.0 greatly improves protein and peptide identification rates with Thermo Orbitrap instruments. The new version ships with MS2PIP, which provides fragment intensity predictions. When the database search results are correlated with predicted spectra, it boosts the number of statistically significant matches even with straightforward tryptic digests.
CID and HCD models
For qualitative work and label-free quantitation, Mascot Server 3.0 ships with three MS2PIP models for Thermo Orbitrap instruments: CID, HCD2019/HCD2021 and Immuno-HCD. The ‘best’ one depends on how you have configured the instrument. Basically, if you use HCD, then choose HCD2021 or Immuno-HCD. Otherwise choose CID. The exact instrument model matters less than the fragmentation mechanism. For example, HCD fragmentation with Thermo Q Exactive is more similar to HCD fragmentation with Thermo Exploris, than CID fragmentation with Thermo Q Exactive.
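The decision rule above is simple enough to express as a lookup. A minimal sketch (the function name and the immunopeptide flag are our own illustration, not part of Mascot's configuration):

```python
# Hypothetical helper illustrating the rule described above: choose by
# fragmentation mechanism, not by instrument model. Not a Mascot API.
def choose_ms2pip_model(fragmentation: str, immunopeptides: bool = False) -> str:
    """Map the configured fragmentation mechanism to an MS2PIP model name."""
    if fragmentation.upper() == "HCD":
        return "Immuno-HCD" if immunopeptides else "HCD2021"
    # Linear ion trap and in-source CID both fall back to the CID model
    return "CID"

print(choose_ms2pip_model("HCD"))                       # HCD2021
print(choose_ms2pip_model("CID"))                       # CID
print(choose_ms2pip_model("HCD", immunopeptides=True))  # Immuno-HCD
```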
The MS2PIP CID model, as the name implies, is best with linear ion traps and in-source CID. The model was trained by the MS2PIP developers on spectra from the high-quality NIST Human CID spectral library. The training data comprise 304k unique peptide matches from a wide range of Thermo and non-Thermo instruments, and the model seems to capture general CID fragmentation patterns well.
The MS2PIP HCD2021 model is best with HCD. The 2021 version is described by Declercq et al. (2023) and is an upgrade over the HCD2019 model. HCD2021 was trained on 520k unique tryptic and chymotryptic peptides, giving it wide applicability to most enzymatic digests and even endogenous peptides. The HCD2019 model was trained only on tryptic peptides. Both perform well with tryptic digests, but HCD2021 is generally preferred.
For immunopeptides, Mascot also ships with the MS2PIP model Immuno-HCD. As the name implies, this model is intended for HLA-I and HLA-II peptides, although it has good performance with tryptic digests too.
Selecting the best MS2PIP model discusses model selection and evaluation in more detail.
Example: DDA LFQ run (PXD028735)
The theory is all well and good, but concrete numbers are more satisfying! Van Puyvelde et al. (2022) is a multi-group effort that provides a comprehensive, high-quality LFQ benchmark data set. They used six instruments, one of which is a Thermo Q Exactive HF-X. Among the many mixtures is a QC sample, a mixture of human (65%), yeast (22.5%) and E. coli (12.5%) proteins, which was run in 9 technical replicates on every instrument.
For this demonstration, we arbitrarily selected a Thermo raw file from the QC mixture, technical replicate 3 (LFQ_Orbitrap_DDA_QC_03.raw, 3.5GB), from the paper’s PRIDE project PXD028735. There’s nothing special about this file, and a similar improvement in identification rates can be seen with all the raw files in this data set. The MS1 scans are profile and the MS2 scans are centroid. Because the raw data are high quality to begin with, and to keep things simple, we processed the file with Mascot Distiller 2.8 using default.ThermoXcalibur.opt settings.
(See Peak picking Thermo .RAW data with Mascot Distiller for an explanation of the difference between centroid and profile data.)
Following the steps in Optimizing your search parameters, we used the following search parameters:
- Database: UP5640_H_sapiens (predefined definition)
- Database: UP2311_S_cerevisiae (predefined definition)
- Database: UP625_E_coli_K12 (predefined definition)
- Precursor tolerance: 10 ppm
- Fragment tolerance: 20 ppm
- Fixed modifications: Carbamidomethyl (C)
- Variable modifications: Oxidation (M)
- Enzyme: Trypsin/P, 2 missed cleavages
These are very typical search conditions, which makes this an excellent data set for benchmarking.
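For reference, most of these settings can also be embedded in the MGF header as Mascot search parameters. A sketch, assuming the standard Mascot MGF parameter names (check your server's documentation for the exact syntax; additional databases are selected in the same way):

```
# Illustrative Mascot search parameters in an MGF header
COM=PXD028735 QC replicate 3
DB=UP5640_H_sapiens
TOL=10
TOLU=ppm
ITOL=20
ITOLU=ppm
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
CLE=Trypsin/P
PFA=2
```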
The data were acquired using HCD, so the best MS2PIP model is HCD2021. The results for Mascot Server 2.7, 2.8 and 3.0 are tabulated below, at 1% sequence FDR.
Mascot | Refine? | MS2PIP? | DeepLC? | Proteins | Protein FDR | Sig. unique seq. | Seq. FDR | Sig. PSMs | PSM FDR |
---|---|---|---|---|---|---|---|---|---|
2.7 | yes | (n/a) | (n/a) | 4,643 | 6.89% | 20,911 | 0.87% | 40,702 | 0.57% |
2.8 | yes | (n/a) | (n/a) | 4,667 | 4.54% | 22,367 | 1.0% | 44,517 | 0.63% |
3.0 | yes | HCD2021 | (none) | 4,951 | 4.75% | 24,863 | 1.0% | 50,333 | 0.64% |
3.0 | yes | HCD2021 | yes* | 5,007 | 4.73% | 25,104 | 1.0% | 50,969 | 0.64% |
* DeepLC model full_hc_PXD005573_mcp
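The FDR columns are standard target-decoy estimates. A minimal sketch of the calculation, with an illustrative decoy count (not taken from the actual search):

```python
# Sketch of a target-decoy FDR like the "Seq. FDR" column: the ratio of
# decoy to target matches above the score threshold. Counts are illustrative.
def target_decoy_fdr(target_hits: int, decoy_hits: int) -> float:
    """Estimate FDR as decoy matches divided by target matches."""
    return decoy_hits / target_hits

# e.g. 24,863 significant target sequences with ~249 decoy sequences
# above the same threshold gives roughly 1% sequence FDR
print(f"{target_decoy_fdr(24863, 249):.2%}")  # 1.00%
```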
To make a fair comparison, we enabled refining with machine learning in each case. Version 2.7 cannot reach 1% sequence FDR because of a couple of bugs in score thresholding when Percolator is enabled. Version 2.8 does better, but the new machine learning features in Mascot Server 3.0 are a big improvement.
Enabling the HCD2021 model for fragment intensities boosts the number of identified peptide sequences by 11% compared with version 2.8 (19% compared with 2.7). The median correlation between predicted and observed spectra is 0.81. Mascot Server 3.0 also ships with a machine learning quality report, which explains how and why machine learning improved the results.
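The correlation behind the 0.81 median figure is the Pearson correlation between predicted and observed fragment intensities for the matched ions. A toy sketch with made-up intensities (the real pipeline first matches peaks within the fragment tolerance):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

predicted = [0.10, 0.45, 0.80, 0.30, 0.05]  # toy predicted intensities
observed  = [0.12, 0.40, 0.75, 0.35, 0.08]  # toy observed intensities
print(round(pearson(predicted, observed), 3))
```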
Enabling retention time predictions with DeepLC gives a further, although smaller, boost, reaching 25,104 peptide sequences – 12% better than 2.8 and 20% better than 2.7 – and breaking 5,000 protein hits. This database search has 109k queries (MS/MS spectra), so Mascot Server 3.0 finds a statistically significant peptide match for almost half the queries.
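The quoted percentages follow directly from the significant unique sequence counts in the table:

```python
# Verify the quoted improvements from the table's sequence counts
seq_27, seq_28 = 20911, 22367            # Mascot 2.7 and 2.8
seq_ms2pip, seq_deeplc = 24863, 25104    # 3.0 with MS2PIP, plus DeepLC

def pct_gain(new: int, old: int) -> float:
    """Percentage increase of new over old."""
    return 100 * (new - old) / old

print(f"{pct_gain(seq_ms2pip, seq_28):.0f}%")  # 11%
print(f"{pct_gain(seq_ms2pip, seq_27):.0f}%")  # 19%
print(f"{pct_gain(seq_deeplc, seq_28):.0f}%")  # 12%
print(f"{pct_gain(seq_deeplc, seq_27):.0f}%")  # 20%
```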
Keywords: benchmark, fragmentation, machine learning, MS2PIP, Percolator