How long should a search take? – Data and search parameters
An earlier article discussed how hardware choices affect search speed. This article looks at the influence of the peak list and the search parameters. It’s important to distinguish between the time spent on the search itself and the time spent loading the result report; different factors apply to these two steps. It is not unknown for loading the result report to take longer than the search.
Starting with factors that influence search speed: the effect of database size is approximately linear. For a large, comprehensive database, using a taxonomy filter reduces the effective size of the database to that of the selected entries. Conversely, choosing auto-decoy doubles the number of entries to be searched and so doubles the search time.
Database size is just one factor in the size of the search space – the number of peptide sequences that must be compared with a spectrum to see which gives the best match. The smaller the search space, the faster the search and the easier it is to get a statistically significant match. Much of what follows could be summarised as "keep the search space as small as possible".
Remember that comprehensive databases grow larger with every update. If you take an old search of SwissProt or NCBInr and repeat it, chances are it will take longer for this very reason. You can confirm this by looking at the count of residues, which is displayed in the result report. All other things being equal, the search time will be proportional to this count.
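As a rough illustration of this proportionality, the short Python sketch below scales a measured search time by the ratio of residue counts. The function name and the figures in the usage line are hypothetical, and the auto-decoy doubling only applies if the original search was run without a decoy.

```python
def estimate_search_time(old_seconds, old_residues, new_residues, add_decoy=False):
    """Scale a measured search time, assuming time is proportional to residue count.

    All other search parameters are assumed to be unchanged.
    """
    if old_residues <= 0:
        raise ValueError("old_residues must be positive")
    estimate = old_seconds * (new_residues / old_residues)
    if add_decoy:
        # Auto-decoy searches a decoy entry for every target entry.
        estimate *= 2
    return estimate


# Hypothetical figures: a 60 s search against a database that has grown
# from 200 million to 250 million residues.
print(estimate_search_time(60.0, 200e6, 250e6))
print(estimate_search_time(60.0, 200e6, 250e6, add_decoy=True))
```

This is only a first approximation; fixed overheads at the start and end of a search are ignored.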
Search time increases with the peptide mass tolerance, although the relationship is far from linear. However, it is a bad idea to be over-optimistic about mass accuracy. It is better to set the tolerance too wide, and have the search take longer, than to set it too narrow and miss some matches.
If peak detection is unreliable and sometimes picks the 13C peak instead of the monoisotopic peak, use the #13C search parameter. This adds a second acceptance window around the mass of the 13C peak and optionally a third around the 13C2 peak, which is much more efficient than setting the mass tolerance to values in excess of 1 or 2 Da.
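To see why several narrow windows are cheaper than one wide one, here is a minimal Python sketch of the idea. It is not the search engine's own code: the function names are made up for illustration, the tolerance is assumed to be in Da, and the windows are assumed to be narrow enough not to overlap.

```python
C13_DELTA = 1.003355  # mass difference between 13C and 12C, in Da


def within_tolerance(observed, calculated, tol_da, max_13c=0):
    """Return True if the observed mass falls in any acceptance window.

    Windows are centred on the calculated mass and on the calculated mass
    plus 1..max_13c isotope spacings.
    """
    for n in range(max_13c + 1):
        if abs(observed - (calculated + n * C13_DELTA)) <= tol_da:
            return True
    return False


def total_window_width(tol_da, max_13c=0):
    """Total width in Da of all acceptance windows (assumes no overlap)."""
    return 2 * tol_da * (max_13c + 1)


# A precursor picked one isotope peak too high is rejected by a single
# narrow window but accepted when one 13C window is added.
print(within_tolerance(1001.0034, 1000.0, 0.01))
print(within_tolerance(1001.0034, 1000.0, 0.01, max_13c=1))

# Three windows of +/- 0.01 Da versus a single window of +/- 2.1 Da.
print(total_window_width(0.01, max_13c=2), total_window_width(2.1))
```

The total width of the windows is a reasonable proxy for the number of candidate peptides that have to be scored, which is why the narrow multi-window approach is so much cheaper.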
Fragment ion tolerance has little or no effect on search time.
A no-enzyme search may take 100 times as long as a search with tryptic specificity, the exact factor depending on the distribution of peptide molecular masses. If the sample was digested with a known enzyme, the majority of peptides will be specific cleavage products, and using an error tolerant search is a much faster and more sensitive way of matching semi-specific and non-specific peptides. In an extreme case, where 20% of cleavage was non-specific, you would still only expect 4% of the peptides to be non-specific at both ends, so the semi-specific version of the enzyme may well give the best sensitivity. Only perform a no-enzyme search when you have no choice, such as with endogenous peptides.
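The 4% figure follows from treating the two ends of a peptide as independent. A small Python sketch of that arithmetic, under the stated independence assumption and with a hypothetical function name, is:

```python
def cleavage_classes(p_nonspecific):
    """Split peptides into classes by how many termini are non-specific.

    Assumes each terminus is non-specific independently with the same
    probability p_nonspecific.
    """
    p = p_nonspecific
    return {
        "fully_specific": (1 - p) ** 2,
        "semi_specific": 2 * p * (1 - p),
        "non_specific_both_ends": p ** 2,
    }


for name, fraction in cleavage_classes(0.2).items():
    print(f"{name}: {fraction:.0%}")
```

With 20% non-specific cleavage, this gives 64% fully specific, 32% semi-specific and 4% non-specific at both ends, so a semi-specific search would already cover 96% of the peptides.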
Don’t set the allowed number of missed cleavages to an unnecessarily high value. Changing the setting from zero to one missed cleavage approximately doubles the search space.
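The doubling can be checked with a simplified in-silico digest. The Python sketch below uses the common rule of thumb for trypsin (cleave after K or R, but not before P) and an invented test sequence; it ignores any peptide mass or length limits that a real search would apply.

```python
def cleavage_fragments(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def count_peptides(sequence, max_missed):
    """Count peptides with 0..max_missed missed cleavages.

    A peptide with m missed cleavages spans m + 1 adjacent fragments, so
    n fragments give n - m such peptides.
    """
    n = len(cleavage_fragments(sequence))
    return sum(n - m for m in range(max_missed + 1) if n - m > 0)


# Invented sequence with ten cleavage fragments.
protein = "AAKCCRDDKEERFFKGGRHHKIIRLLKMM"
for missed in range(3):
    print(missed, count_peptides(protein, missed))
```

For a protein with n fragments, allowing one missed cleavage takes the count from n to 2n - 1, and each further missed cleavage adds almost as many peptides again.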
Fixed modifications have no effect on search times, but variable modifications cause a geometric increase in search times. The precise factor is difficult to estimate because it depends on the relative abundance of the modified residue(s). However, you can be sure that a search with many variable modifications will take much, much longer than a search with no variable modifications. As in the case of non-specific peptides, an error tolerant search is a more efficient way of matching peptides with relatively rare post-translational modifications. Only modifications that are very abundant should be specified as variable modifications.
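To get a feel for the geometric increase, the following Python sketch counts the ways variable modifications could be arranged on a single peptide. It is a deliberately simplified model, not the search engine's algorithm: each residue is assumed to carry at most one modification, terminal modifications are ignored, and no cap is placed on the number of arrangements. The example peptide is invented.

```python
def modification_arrangements(peptide, variable_mods):
    """Count arrangements of variable modifications on a peptide.

    variable_mods maps a modification name to the residues it can modify.
    Each residue is either unmodified or carries one applicable modification.
    """
    total = 1
    for residue in peptide:
        applicable = sum(1 for targets in variable_mods.values() if residue in targets)
        total *= 1 + applicable
    return total


peptide = "MSTMK"  # invented example
print(modification_arrangements(peptide, {}))
print(modification_arrangements(peptide, {"Oxidation (M)": "M"}))
print(modification_arrangements(peptide, {"Oxidation (M)": "M", "Phospho (ST)": "ST"}))
```

With no variable modifications there is one arrangement to test; oxidation of the two methionines gives four, and adding phosphorylation of serine and threonine gives sixteen, for just one short peptide.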
If the data include isotopic labels used for quantitation, it is important to specify these as part of a quantitation method, which allows them to be treated as exclusive modifications (a choice of fixed modifications). In the case of reporter ion quantitation, such as iTRAQ and TMT, the label should be specified as fixed.
The relationship between the number of spectra in the peak list and search time is complex. This is partly because of fixed overheads at the beginning and end of each search, and partly because large peptides take longer to search than small ones, since more arrangements of variable modifications are possible. All we can safely say is that adding more spectra will never make a search faster.
Turning to result reports, the default report for searches of more than 300 spectra is the Protein Family Summary, which is a paged report designed for large searches. It is possible to specify the preferred report format as a search parameter, so make sure that older, third-party client software isn’t specifying the older Peptide Summary or Select Summary reports. These will be extremely slow to load for very large results.
The other golden rule for reports is always to set the number of hits to AUTO. Never choose a large, fixed number, just in case.
Quantitation protocols that are handled by the search engine, such as iTRAQ and TMT, add to the time taken to display the report. This is accentuated if the quantitation method specifies global normalisation, because every peptide must be quantified before the normalisation factors can be calculated and the report displayed. It is best to set normalisation to none in the method. Once the report is loaded, you can check whether everything looks OK and the various format options are correct before finally setting normalisation as required and going for a well-earned cup of coffee.
Keywords: benchmark, search parameters, tutorial