Mascot workflows in Proteome Discoverer
For many users of Thermo instruments, Proteome Discoverer (PD) is their primary user interface for database searching, and Mascot is represented by a node in the workflow. This article collects together a few tips and observations concerning Proteome Discoverer 2.3 and Mascot Server 2.6.
Proteome Discoverer Configuration
Under Administration; Mascot Server, the setting Max. MGF File Size [MB] has a default value of 500. Some people increase this, thinking it is an upper limit on the size of the data set. In fact, PD splits larger peak lists into separate searches and transparently merges the results back into a single set of results. In most cases, this is a good approach, and it avoids the possibility of running out of memory on systems with limited RAM. For typical Q-Exactive data, a 500 MB MGF corresponds to roughly 150,000 spectra.
If the search results are only browsed and reported within PD, there is no downside to this splitting. If you intend to view the search results using Mascot reports, and maybe perform repeat searches, there is the minor inconvenience of having to create a result collection to combine the results, as described in our May 2017 newsletter. If you want to avoid splitting searches and have plenty of RAM, you could increase this setting, but don’t forget that Microsoft’s IIS web server has a hard limit of 4 GB on the size of an upload, and your server may well be configured with a lower limit.
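As a rough illustration of the arithmetic above, the following Python sketch estimates how many separate searches a peak list might be split into, and whether a proposed Max. MGF File Size would stay under the upload limit. It relies only on the figures quoted in this article (500 MB for about 150,000 spectra, a 4 GB IIS ceiling). The function names are hypothetical, and the assumption that PD creates one search per full or partial chunk is ours, not a documented behaviour.

```python
import math

# Figures quoted in the article; real data sets will vary.
DEFAULT_MAX_MGF_MB = 500        # PD default for Max. MGF File Size [MB]
SPECTRA_PER_500_MB = 150_000    # rule of thumb for typical Q-Exactive data
IIS_HARD_LIMIT_MB = 4 * 1024    # 4 GB upload ceiling, expressed in MB


def estimate_split(n_spectra, max_mgf_mb=DEFAULT_MAX_MGF_MB):
    """Return (estimated MGF size in MB, estimated number of searches).

    Assumption: one search per full or partial chunk of max_mgf_mb.
    """
    size_mb = n_spectra * 500 / SPECTRA_PER_500_MB
    n_searches = max(1, math.ceil(size_mb / max_mgf_mb))
    return size_mb, n_searches


def fits_upload_limit(max_mgf_mb, server_limit_mb=IIS_HARD_LIMIT_MB):
    """True if a chunk of max_mgf_mb stays strictly below the server limit."""
    return max_mgf_mb < server_limit_mb


if __name__ == "__main__":
    size_mb, n_searches = estimate_split(400_000)
    print(f"~{size_mb:.0f} MB, about {n_searches} searches at the default setting")
    print("2000 MB chunk fits under IIS hard limit:", fits_upload_limit(2000))
```

If your web server is configured with a lower upload limit than 4 GB, pass that value as server_limit_mb instead.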
We sometimes get questions about the authentication settings. If you are not sure whether these are required, open a web browser on the PD PC, connect to your Mascot Server home page, then submit a small search. If the web browser immediately challenges you for a user name and password in a pop-up dialog box, you need to enter the same credentials into the web server authentication fields in PD. If you can connect to the home page but are asked for a user name and password when you try to load the search form, you need to enter the same credentials into the Mascot server authentication fields in PD. Otherwise, leave the authentication fields in PD blank; entering credentials when they are not required is likely to cause problems.
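The decision procedure above can be sketched in Python. This is a minimal illustration, not a replacement for the browser test: web_server_challenges checks whether the web server answers a plain request with HTTP 401 (the response that normally triggers a browser pop-up), and pd_auth_fields simply encodes the rules from the paragraph. Both function names are hypothetical, the URL you pass is your own Mascot Server home page, and whether the search form asks for a Mascot login is something you still observe by hand. Returning both field names when both conditions hold is our assumption.

```python
import urllib.error
import urllib.request


def web_server_challenges(url, timeout=10):
    """Return True if the web server itself replies with HTTP 401."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return False
    except urllib.error.HTTPError as err:
        # 401 Unauthorized is what produces the browser pop-up dialog.
        return err.code == 401


def pd_auth_fields(web_server_challenge, mascot_login_on_search_form):
    """Map the two observations onto the PD fields that should be filled in."""
    fields = []
    if web_server_challenge:
        fields.append("web server authentication")
    if mascot_login_on_search_form:
        fields.append("Mascot server authentication")
    # An empty list means: leave all authentication fields in PD blank.
    return fields
```

For example, pd_auth_fields(False, True) returns only the Mascot server authentication entry, which matches the case where the home page loads but the search form asks for a login.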
Missing functionality
There are a few features in Mascot 2.6 that are not supported by the PD Mascot node:
- Multiple precursor masses for chimeric spectra
- ppm tolerance for fragments
- Error tolerant searches
- Searching combinations of Fasta files and spectral libraries
If you want to access these features for data processed through PD, locate the search in the Mascot Server search log, click on the link to open the result report, and choose to repeat the search.
Target Decoy PSM Validator
The Target Decoy PSM Validator node appears to report an FDR based on the excess of the Mascot score over the identity threshold. This is a calculated significance threshold, derived from the number of candidate peptides that were tested to find the best match. For large search spaces, such as no-enzyme searches, searches with many variable modifications, or very large databases, this can be wildly over-conservative. For all but the smallest data sets, Mascot reports use the homology threshold, which is an estimate of whether the score is an outlier from the distribution of scores for chance matches.
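To make the idea of an FDR based on "score minus threshold" concrete, here is a minimal Python sketch of a generic target-decoy count. It is not the algorithm used by PD or by Mascot, whose internals are not described here; it only assumes that each PSM carries a score, a per-PSM threshold (identity or homology), and a target/decoy flag, and that FDR is estimated as decoys divided by targets. The function name and tuple layout are hypothetical.

```python
def count_targets_at_fdr(psms, max_fdr=0.01):
    """Count target PSMs accepted at the given FDR.

    psms: iterable of (score, threshold, is_decoy) tuples.
    PSMs are ranked by score - threshold, best first; the result is the
    largest number of targets for which decoys / targets <= max_fdr.
    """
    ranked = sorted(psms, key=lambda p: p[0] - p[1], reverse=True)
    targets = decoys = best = 0
    for _score, _threshold, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


if __name__ == "__main__":
    # Toy data: the scores are invented purely to exercise the function.
    toy = (
        [(60.0, 50.0, False)] * 200   # strong target matches
        + [(55.0, 50.0, True)]        # one decoy
        + [(51.0, 50.0, False)] * 50  # marginal target matches
        + [(50.0, 50.0, True)] * 5    # decoys at the threshold
    )
    print(count_targets_at_fdr(toy, max_fdr=0.01))
```

Because the ranking depends on the threshold subtracted from each score, swapping a large identity threshold for a smaller homology threshold changes which PSMs sit near the top of the list, and this is the effect the examples below measure on real data.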
To illustrate the difference, we used Proteome Discoverer 2.3.0.523 and Mascot Server 2.6.2 to search a set of 6 fractions from PRIDE PXD000612 (20120227_EXQ5_KiSh_SA_LabelFree_HeLa_Phospho_EGF_rep3_Fr*.raw). The databases were SwissProt 2019_02 with taxonomy Homo sapiens (20,418 sequences) plus contaminants (247 sequences). Other settings were typical for a phosphoproteomics study:
------------------------------------------------------------------
Processing node 5: Mascot
------------------------------------------------------------------
1. Input Data:
- Instrument: ESI-TRAP
- Protein Database: contaminants; SwissProt
- Enzyme Name: Trypsin/P
- Maximum Missed Cleavage Sites: 2
- Taxonomy: . . . . . . . . . . . . . . . . Homo sapiens (human)

2. Tolerances:
- Fragment Mass Tolerance: 0.02 Da
- Precursor Mass Tolerance: 10 ppm
- Use Average Precursor Mass: False

4. Dynamic Modifications:
- Show All Modifications: False
- 1. Dynamic Modification: Oxidation (M)
- 2. Dynamic Modification: Acetyl (Protein N-term)
- 3. Dynamic Modification: Phospho (ST)
- 4. Dynamic Modification: Phospho (Y)

5. Static Modifications:
- 1. Static Modification: Carbamidomethyl (C)
------------------------------------------------------------------
At an FDR of 1%, PD reported 36,036 target PSMs. The Mascot summary report, when using the identity threshold, gave essentially the same count of 35,978. Switching to the homology threshold, which is the default, the count was 42,043. For this data set, then, there is only a modest penalty for using the Target Decoy PSM Validator.
The second example was a search of a set of 12 fractions from PRIDE PXD004863 (150407_08_RnA_FracA*.raw). The databases were the same but this was a no-enzyme search, because the sample was endogenous peptides from CSF:
------------------------------------------------------------------
Processing node 5: Mascot
------------------------------------------------------------------
1. Input Data:
- Instrument: ESI-TRAP
- Protein Database: contaminants; SwissProt
- Enzyme Name: None
- Maximum Missed Cleavage Sites: 2
- Taxonomy: . . . . . . . . . . . . . . . . Homo sapiens (human)

2. Tolerances:
- Fragment Mass Tolerance: 0.05 Da
- Precursor Mass Tolerance: 15 ppm
- Use Average Precursor Mass: False

4. Dynamic Modifications:
- Show All Modifications: False
- 1. Dynamic Modification: Oxidation (M)

5. Static Modifications:
- 1. Static Modification: Carbamidomethyl (C)
------------------------------------------------------------------
For an FDR of 1%, PD reported 2,762 target PSMs. The Mascot summary report, when using the identity threshold, gave essentially the same count of 2,682. Using the homology threshold, the count increased to 7,720, almost a factor of 3 greater, demonstrating the importance of using the homology threshold for such a data set.
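The size of the effect in the two examples can be checked with a few lines of Python. The counts are the ones quoted in this article for the Mascot summary report; the dictionary keys and the function name are just labels chosen for this sketch.

```python
# Target PSM counts at 1% FDR from the Mascot summary report, as quoted above.
counts = {
    "PXD000612 (tryptic, phospho)": {"identity": 35_978, "homology": 42_043},
    "PXD004863 (no enzyme, CSF)": {"identity": 2_682, "homology": 7_720},
}


def homology_gain(identity, homology):
    """Ratio of PSMs passing the homology threshold to those passing identity."""
    return homology / identity


if __name__ == "__main__":
    for name, c in counts.items():
        gain = homology_gain(c["identity"], c["homology"])
        print(f"{name}: {gain:.2f}x")
```

The tryptic search gains well under 20%, while the no-enzyme search gains nearly a factor of 3, which is why the choice of threshold matters so much more for large search spaces.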
For any large search, the easiest fix is to use the Percolator node in place of the Target Decoy PSM Validator node. Running a workflow that was identical apart from this change, PD reported 25,299 target PSMs – a truly dramatic improvement in sensitivity.
Keywords: Percolator, Proteome Discoverer, scoring, security, statistics