Posted by John Cottrell (April 30, 2014)

Peak list arcana*

This article addresses some aspects of how the information in a peak list is used by Mascot, what is not used, and how the peak list is processed prior to a search. These things are all in the manual, but can be difficult to find, short of reading it from cover to cover.

Fragment charge is not used in the search. The possible charge states for fragments are specified in the instrument definition. For most of the default instruments, this is 1+ only if the precursor is 1+, and both 1+ and 2+ if the precursor is 2+ or more. The complete set of experimental m/z values is then matched against the complete set of calculated values, ignoring charge. Most modern instruments have the resolution to determine fragment charge, so this is slightly inefficient and likely to change in a future release. In fact, the third value on the line for each fragment in the MGF format was reserved for fragment charge some time ago. Meanwhile, if fragment charge can be determined reliably, simply de-charge fragments to 1+ before writing them to the peak list and specify a 1+ only instrument. This is particularly important for high charge state precursors, as encountered in top-down, where the most abundant fragments may also be highly charged.
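For anyone writing their own peak list exporter, the de-charging step amounts to simple arithmetic. The following is a minimal sketch in Python, not code taken from Mascot or Mascot Distiller. The function names and the (m/z, intensity, charge) tuple layout are assumptions made for illustration, and it presumes the fragment charges have already been determined reliably by the peak picking software.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def decharge_to_1plus(mz, charge):
    """Return the m/z the same fragment would have if it carried a single charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = charge * (mz - PROTON_MASS)  # strip all the protons
    return neutral_mass + PROTON_MASS           # add one proton back


def decharge_peaks(peaks):
    """peaks: iterable of (mz, intensity, charge) tuples (a hypothetical layout).

    Returns a list of (mz_1plus, intensity) pairs sorted by m/z.
    """
    return sorted((decharge_to_1plus(mz, z), inten) for mz, inten, z in peaks)


# A 2+ fragment observed at m/z 500.5 is written to the peak list as a 1+ value
singly_charged = decharge_to_1plus(500.5, 2)
```

A fragment that is already 1+ passes through unchanged, so the same function can be applied to every peak before the list is written out.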

Similar considerations apply to de-isotoping. This needs to be done as part of peak picking, using all the information contained in the raw profile data and with a knowledge of the resolution and peak shape. Trying to de-isotope a peak list is always going to be second best. If de-isotoping and fragment ion de-charging are done well, scores will be substantially higher. If done badly, they will be lower, because peaks with the correct m/z that could have been matched have been removed or shifted. If your peak picking software doesn’t do reliable de-isotoping or fragment ion de-charging, may we recommend Mascot Distiller?

If the peak picking software fails to find the precursor isotope distribution in the survey scan, it cannot determine precursor charge, and the usual behaviour is to output one or more default charge states to the peak list. In the early days of Mascot, the peak list was searched independently for each of these charge states, which meant a report sometimes contained matches to the same spectrum for more than one precursor mass, which is not very likely. This was changed in Mascot 2.1, and multiple charge states for a single spectrum are now treated collectively; the charge state for the highest scoring match is taken as correct and matches to other charge states are discarded. Hard to imagine a reason for wanting to revert to the old behaviour, but if there is one, just set AutoSelectCharge to 0 in mascot.dat.
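The logic of treating the charge states collectively can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not Mascot's implementation. The dictionary keys, the peptide strings and the scores are all made up for the example.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def neutral_mass(precursor_mz, charge):
    """Neutral mass implied by a precursor m/z at an assumed charge state."""
    return charge * (precursor_mz - PROTON_MASS)


def select_charge_state(matches):
    """Keep only the highest scoring match among the candidate charge states.

    matches: list of dicts with hypothetical keys 'charge', 'peptide', 'score'.
    Returns None if there are no matches at all.
    """
    if not matches:
        return None
    return max(matches, key=lambda m: m["score"])


# One spectrum, written to the peak list with default charges 2+ and 3+
candidates = [
    {"charge": 2, "peptide": "SAMPLER", "score": 18.0},
    {"charge": 3, "peptide": "ELVISLIVESK", "score": 47.5},
]
best = select_charge_state(candidates)  # the 3+ match is kept, the 2+ match dropped
```

Each candidate charge implies a different neutral mass for the same precursor m/z, which is why reporting matches for more than one of them makes little sense.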

In passing, remember that the charge state parameter in the search form is a default that is almost never used because the charge state is specified in the peak list, even when it is uncertain.

There are two settings in mascot.dat that control the handling of very short peptides. Any peptide sequence from the database shorter than MinPepLenInSearch is simply ignored. Any peptide sequence match shorter than MinPepLenInPepSummary is never treated as sufficient evidence, by itself, for the presence of a protein. That is, if you had MinPepLenInSearch set to 7 and MinPepLenInPepSummary set to 10, you would see matches to 7, 8, and 9-mers in the reports but you would never see a protein split out as a new family or new family member on the basis of these matches alone. Each family or family member would have to contain a significant match to a peptide that was at least a 10-mer and wasn’t found in any other family or family member. (Don’t do this in 2.3.02, where a bug causes a crash during cache file creation unless both settings are given the same value.) In recent versions of Mascot, both cut-offs are set to 7 on installation. Please resist the temptation to reduce either to a low value. It is difficult or impossible to get a significant match to a very short peptide, as explained in the scoring and statistics module of the training course. Setting a low value for MinPepLenInSearch will have an impact on search speed and the size of the result file. Setting a low value for MinPepLenInPepSummary can cause catastrophic over-clustering in the protein family report (because very short sequences occur by chance in unrelated proteins).
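The interplay between the two cut-offs in the 7 and 10 example can be sketched as follows. This is a simplified illustration in Python, not how Mascot groups proteins. The function names are hypothetical, and the second function assumes its input has already been reduced to significant matches that are not found in any other family or family member.

```python
MIN_PEP_LEN_IN_SEARCH = 7        # mirrors MinPepLenInSearch in the example above
MIN_PEP_LEN_IN_PEP_SUMMARY = 10  # mirrors MinPepLenInPepSummary in the example above


def searchable_peptides(peptides, min_len=MIN_PEP_LEN_IN_SEARCH):
    """Database peptides shorter than the search cut-off are simply ignored."""
    return [p for p in peptides if len(p) >= min_len]


def supports_new_family_member(unique_significant_matches,
                               min_len=MIN_PEP_LEN_IN_PEP_SUMMARY):
    """True if at least one unique, significant match is long enough to count
    as evidence for a separate family or family member."""
    return any(len(p) >= min_len for p in unique_significant_matches)


digest = ["ACDK", "SAMPLER", "ACDEFGHIK", "ELVISLIVESK"]
searched = searchable_peptides(digest)  # the 4-mer is never searched

# 7-mer and 9-mer matches appear in the report but are not enough on their own
short_only = supports_new_family_member(["SAMPLER", "ACDEFGHIK"])

# adding an 11-mer match provides the evidence required
with_long = supports_new_family_member(["SAMPLER", "ELVISLIVESK"])
```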

Two types of processing can be applied to the peak list as part of a search. The precursor isotope distribution, which can be very intense, can be removed from each MS/MS spectrum. This is controlled by the PrecursorCutOut setting in mascot.dat. With the default arguments of -1,-1, a smart filter is created, which removes peaks within the fragment ion tolerance window about each of the precursor isotope peaks. The number of isotope peaks depends on the precursor mass, and full details can be found in the manual (search on PrecursorCutOut). If the arguments are anything other than -1,-1, a single cut-out notch is applied extending from the precursor mass plus the first value to the precursor mass plus the second value. Again, full details are in the manual.
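The single notch case is easy to picture in code. The sketch below is a Python illustration, not Mascot's own filter. It assumes that "precursor mass" here means the precursor m/z as it appears in the MS/MS spectrum and that the notch boundaries are inclusive; check the manual for the exact definition. The smart filter used with the default arguments is not reproduced.

```python
def apply_cutout_notch(peaks, precursor_mz, first_offset, second_offset):
    """Remove peaks falling inside a single notch around the precursor.

    peaks: list of (mz, intensity) pairs.
    The notch runs from precursor_mz + first_offset to
    precursor_mz + second_offset, boundaries included (an assumption).
    """
    lower = precursor_mz + first_offset
    upper = precursor_mz + second_offset
    return [(mz, inten) for mz, inten in peaks if not (lower <= mz <= upper)]


spectrum = [
    (300.1, 50.0),
    (499.8, 900.0),
    (500.3, 1000.0),  # residual precursor
    (500.8, 400.0),   # precursor isotope peak
    (650.4, 80.0),
]
# Hypothetical arguments of -1,4: notch from 499.3 to 504.3
filtered = apply_cutout_notch(spectrum, 500.3, -1.0, 4.0)
```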

Mascot requires a peak list and you won’t get very far trying to upload a binary raw file. However, some files can represent profile data or peak lists, such as mzML, and just because a utility creates a file in MGF format doesn’t guarantee that it has been properly peak picked. If the incoming peak list appears to be profile data, Mascot will apply a simple centroiding routine to try and recover something useful. This is controlled by two settings in mascot.dat: CentroidWidth (default 0.25) is the width in Daltons of the sliding window used for recentroiding. Re-centroiding is only performed on scans where the number of peaks exceeds CentroidWidthCount (default 1000). If you work with high resolution instruments and you can imagine that a properly picked peak list (of pickled peppers) might contain 1000 peaks, then you might want to increase this setting and/or decrease CentroidWidth. Re-centroiding is unlikely to cause much damage as far as identification is concerned, but it can play havoc with iTRAQ or TMT quantitation. If you suppress re-centroiding and submit profile data, you’ll be rewarded with the infamous M00031 error (Max number of ions is 10000. Ignoring ms-ms set starting at line number).
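To give a feel for what a simple re-centroiding step does, here is a rough sketch in Python. It is not Mascot's routine, whose details are not given here. It uses greedy window grouping with an intensity-weighted centroid, and the parameter names merely echo CentroidWidth and CentroidWidthCount.

```python
def _centroid(group):
    """Intensity-weighted mean m/z and summed intensity for a group of points."""
    total = sum(inten for _, inten in group)
    mz = sum(m * inten for m, inten in group) / total
    return (mz, total)


def recentroid(peaks, width=0.25, count_threshold=1000):
    """Merge points lying within `width` Da of the start of each group.

    peaks: list of (mz, intensity) pairs.
    Scans with no more than `count_threshold` points are returned unchanged.
    """
    if len(peaks) <= count_threshold:
        return list(peaks)
    # Zero intensity points carry no information and would break the weighting
    ordered = sorted(p for p in peaks if p[1] > 0)
    merged, group = [], []
    for mz, inten in ordered:
        if group and mz - group[0][0] > width:
            merged.append(_centroid(group))
            group = []
        group.append((mz, inten))
    if group:
        merged.append(_centroid(group))
    return merged


profile_like = [(100.00, 10.0), (100.10, 30.0), (100.20, 10.0), (200.00, 5.0)]
# A tiny threshold is used here only so that the example triggers re-centroiding
picked = recentroid(profile_like, width=0.25, count_threshold=3)
```

Because every merged peak takes the summed intensity of its group, it is easy to see how a step like this could distort closely spaced reporter ion peaks.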

(* secrets or mysteries)

2 comments on “Peak list arcana*”

  1. Frank said:

    I’d just like to ask: if my pipeline restricts my profile MS/MS data to a maximum of 200 fragments per MS/MS spectrum, how does Mascot deal with this? I’m asking because I still have the di- and tri-isotopes present, due to the high intensity of the fragments selected.

    • John Cottrell said:

      Your pipeline performs peak picking but limits the peak list to 200 peaks per spectrum? If this is simply the 200 most intense peaks, you might be discarding some useful sequence ion peaks in some cases. Could be worth experimenting with higher limits to see whether you get better scores for some peptide matches.