Searching uninterpreted MS/MS data
If you have no time to read this tutorial, these are the most important do’s and don’ts:
- You cannot search raw data; it must be converted into a peak list.
- Search parameters are critical and should be determined by running a standard, such as a BSA digest.
- If you are not sure which database to search, start with Swiss-Prot.
- If you use a taxonomy filter, or search a single organism database, include a contaminants database in the search.
- Only select very abundant modifications as variable.
- If the protein was digested with an enzyme, choose this enzyme.
- Use an error tolerant search to find post-translational modifications, SNPs, and non-specific cleavage products.
- Set the peptide FDR to 1% to filter out false positives.
- For important work, filter the proteins in Report Manager by requiring significant matches to at least 2 distinct sequences.
Peak list
The first requirement for database searching is a peak list; you cannot upload a raw data file. Raw data is converted into a peak list by a process called peak picking or peak detection. Often, the instrument data system takes care of this, and you can submit a Mascot search directly from the data system or save a peak list to a disk file for submission using the web browser search form. If not, or if you have a raw data file and no access to the data system, you’ll need to find a utility to convert it into a peak list.
Peak lists are text files and come in a variety of formats. If you have a choice, MGF is recommended. Be careful with mzML, because it may contain either raw data or a peak list. We recommend Mascot Distiller, which has been designed to work with raw data from any instrument.
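For reference, MGF is a simple text format: each spectrum sits between BEGIN IONS and END IONS lines, with the precursor m/z and charge followed by one fragment m/z and intensity pair per line. The entry below is a minimal, invented example, not data from a real spectrum:

```
BEGIN IONS
TITLE=Example spectrum from scan 1234
PEPMASS=744.8245 12500.0
CHARGE=2+
RTINSECONDS=1260.5
204.1342 320.5
276.1560 150.2
589.3250 76.4
841.5042 890.1
END IONS
```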
Search parameters
A peak list, by itself, is not sufficient. There are also a number of search parameters that must be set appropriately. The Mascot search form links to help topics for each required parameter. Note that you can set your own defaults for the web browser search form by following the link at the bottom of the Access Mascot Server page.
The form looks much the same whether you have your own, in-house Mascot Server or are connected to the free, public one. If you are using the free, public Mascot Server, there are some restrictions, one of which is that you have to provide a name and email address so that we can email a link to your search results if the connection is broken. A more important restriction is that searches are limited in size. Whether you enter a search title is your choice. It is displayed at the top of the result report, and can be a useful way of identifying the search at a later date.
Using a standard sample
If at all possible, run a standard sample and use this to set all the search parameters. By standard sample, we mean something like a BSA digest, which will give strong matches and where you know what the answer is supposed to be. Trying to set search parameters on an unknown is much more difficult, especially if the sample was lost somewhere during the work-up or if the instrument has developed a fault.
Choosing a sequence database
The first choice you have to make, and one of the more difficult, is which database to search. The free public web site has just a few of the more popular public databases, but an in-house server may have a hundred or more. Some databases contain sequences from a single organism. Others contain entries from multiple organisms, but usually include the taxonomy for each entry, so that entries for a specific organism can be selected during a search using a taxonomy filter.
If you’re not sure what is in the sample, Swiss-Prot is a good starting point. The entries are all high quality and well annotated. Because Swiss-Prot is non-redundant, it is relatively small. The size of the database is one factor in the size of the search space – the number of peptide sequences that are compared with a spectrum to see which gives the best match. The smaller the search space, the easier it is to get a statistically significant match.
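You can see the effect of search space on significance with a quick calculation. Below is a minimal sketch in Python, assuming the relationship commonly quoted for Mascot's identity threshold – a score of -10*log10(p/N) is needed for significance at level p when N candidate peptides were tested – with invented candidate counts:

```python
import math

def identity_threshold(n_candidates: int, p: float = 0.05) -> float:
    """Score needed for significance at level p when n_candidates
    peptide sequences were compared with the spectrum."""
    return -10 * math.log10(p / n_candidates)

# Illustrative counts for a small versus a comprehensive database
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} candidates -> threshold {identity_threshold(n):.0f}")
```

Every 100-fold increase in the number of candidates raises the required score by 20, which is why a match that is significant against Swiss-Prot may fall below the threshold against a comprehensive database.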
Contaminants
If you think you know what is in the sample, you may want to search an organism specific database. But, you can never rule out contaminants. This can be a severe problem if you only have a handful of spectra. You might be interested in a human protein, so you search a human database, but your spectrum is for a peptide from a contaminant, so you get no match or a misleading match.
When searching entries for a single organism, always include a database of common contaminants. This is important, even if you have a large dataset and no interest in proteins from anything other than your target organism. Otherwise, you may end up reporting your sample is full of serum albumin when it is really BSA, or keratin when it is really sheep keratin from clothing. If your search uses a taxonomy filter, this does not exclude the contaminants: taxonomy is not configured for the contaminants databases, so all of their entries are always searched, whatever the filter.
Bacteria and plants
If your target organism is well characterised, such as human, mouse, yeast, or Arabidopsis, there may be no need to look beyond Swiss-Prot. You can get a sense of how well your organism is represented in Swiss-Prot by looking at the release notes, which list the 250 best represented species.
If you are interested in a bacterium or a plant, you may find that it is poorly represented in Swiss-Prot, and it would be better to try one of the comprehensive protein databases, which aim to include all known protein sequences. The two best known are NCBIprot and UniRef100. If the genome of your organism hasn't been sequenced, you may still be out of luck, and your best hope is to search a collection of ESTs (Expressed Sequence Tags: relatively short nucleic acid sequences).
As an example, follow this link to see the entry in the NCBI taxonomy browser for avocado pear (Persea americana). This has just 286 entries in Swiss-Prot but 43,502 in NCBIprot. (All counts as of July 2024.)
Never choose a narrow taxonomy without looking at the counts of entries and understanding the classification. In the current Swiss-Prot, for example, there are 27,987 entries for Rodentia, but 17,212 are mouse and 8,199 are rat – only 2,576 are for other rodents. So, even if your target organism is hamster, it isn't a good idea to choose 'other rodentia'. Better to search all of Rodentia and hope to get matches to homologous proteins from mouse and rat.
Swiss-Prot is a non-redundant database, where sequences that are very similar have been collapsed into a single entry. This means that the database entry will often differ slightly from the protein you analysed. Standard database searching requires the exact peptide sequence, so you may miss some matches due to SNPs and other variants. This would be another reason to search a large, comprehensive database. But, remember that NCBIprot is hundreds of times the size of Swiss-Prot, so searches take proportionally longer and the search space is proportionally larger, meaning that you need higher quality data to get a significant match.
Enzyme
If your protein was digested using an enzyme, always choose this enzyme. Choosing a semi-specific enzyme or ‘None’, for non-specific cleavage, greatly increases the search time and the search space, which will almost certainly cause a net reduction in the number of matches. The error tolerant search, discussed below, is a better way of finding non-specific peptides.
If you are studying endogenous peptides, such as MHC peptides, you have no choice, and enzyme ‘None’ will look for matches in all sub-sequences of all proteins. You should refine the results using machine learning to get the best sensitivity (see further below).
If you are doing top-down, to analyse intact proteins, choose NoCleave. Note that NoCleave is not the same as None; it is the exact opposite: None generates every possible sub-sequence, while NoCleave generates no cleavage products at all, so only the intact sequence is matched.
When designing your experiment, be aware that an enzyme of low specificity, which digests proteins to a mixture of very short peptides, is not a good choice, because very short sequences are found in many database entries, so carry little identifying information. The longer the peptide, the easier it is to get a significant match and the more likely it is that the match will point to one particular protein. In most cases, it is best to use an enzyme of specificity equal to or greater than trypsin, and focus on peptides with masses between 1200 and 4000 Da.
The number of allowed missed cleavages should be set empirically, by running a standard with this set to a high value and looking at the significant matches to judge the extent of incomplete cleavage. Setting this value higher than necessary simply increases the size of the search space, which you will now recognise as being a ‘bad thing’.
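To make both points concrete, here is a rough in-silico digest in Python. It cleaves after K or R (not before P), allows a configurable number of missed cleavages, and keeps peptides in the useful 1200 to 4000 Da window. This is a simplified sketch of what a search engine does internally, not Mascot's actual implementation; the test sequence is the N-terminal stretch of the BSA precursor, but any sequence will do:

```python
# Monoisotopic residue masses (Da); peptide mass = residue sum + water
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056

def tryptic_peptides(protein, max_missed=1):
    """Naive trypsin digest: cleave C-terminal to K/R, but not before P."""
    cuts = [0]
    for i in range(len(protein) - 1):
        if protein[i] in 'KR' and protein[i + 1] != 'P':
            cuts.append(i + 1)
    cuts.append(len(protein))
    # A peptide spanning fragments i..j-1 has j - i - 1 missed cleavages
    for i in range(len(cuts) - 1):
        for j in range(i + 1, min(i + 2 + max_missed, len(cuts))):
            yield protein[cuts[i]:cuts[j]]

def mass(peptide):
    return sum(RESIDUE[aa] for aa in peptide) + WATER

seq = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVK"
for pep in tryptic_peptides(seq, max_missed=1):
    m = mass(pep)
    if 1200 <= m <= 4000:
        print(f"{pep:30s} {m:9.3f} Da")
```

Re-running with a higher max_missed shows how quickly the candidate list grows without adding many useful peptides.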
Fixed and variable modifications
Modifications in database searching are handled in two ways. First, there are the fixed or quantitative modifications. An example would be the efficient alkylation of cysteine. Since all cysteines are modified, this is effectively just a change in the mass of cysteine. It carries no penalty in terms of search speed or specificity.
In contrast, most post-translational modifications do not apply to all instances of a residue. For example, phosphorylation might affect just one serine in a protein containing many serines and threonines. These variable or non-quantitative modifications are expensive in the sense that they increase the search space. This is because the software has to enumerate all the possible arrangements of modified and unmodified residues that fit the peptide molecular mass. As more and more modifications are considered, the number of combinations and permutations increases geometrically, and we get a so-called combinatorial explosion.
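The scale of the problem is easy to demonstrate with a quick sketch. The counts below are purely illustrative, for a single variable modification (say, phosphorylation of S/T/Y) on peptides with increasing numbers of modifiable residues; each additional variable modification multiplies these counts:

```python
from math import comb

# For n modifiable residues, each site can be modified or not: 2**n states.
# Even at a fixed precursor mass (m sites modified), there are C(n, m)
# positional arrangements for the search engine to test.
for n in (3, 6, 10):
    worst_case = max(comb(n, m) for m in range(n + 1))
    print(f"n={n:2d}: {2**n:5d} states in total, "
          f"up to {worst_case} arrangements at a single mass")
```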
This makes it very important to be sparing with variable modifications. If the aim of the search is to identify as many proteins as possible, the best advice is to use a minimum of variable modifications, or none at all. Most post-translational modifications, such as phosphorylation, are rare and it is much more efficient to use an error tolerant search to find them.
You cannot select two fixed modifications with the same specificity. If you select variable modifications with the same specificity as a fixed modification, this excludes the possibility of an unmodified site. For example, if you choose Carbamidomethyl (C) as fixed and Propionamide (C) as variable, you can get matches to either of these but never to a peptide with free cysteine. Also, you will not get matches to a peptide modified with both carbamidomethyl and propionamide.
Mass tolerances
Making an estimate of the mass accuracy doesn’t have to be a guessing game. The Mascot result reports include graphs of mass errors. Just run a standard and look at the error graphs for the strong matches. Ignore outliers, which are likely to be chance matches, and you’ll normally see some kind of trend. Add on a safety margin and this is your error estimate. The graph for precursor mass error is in the Protein View report and the graph for MS/MS fragment mass error is in the Peptide View report. You can also use these graphs to decide whether Da or ppm is the best choice for the tolerance unit.
Sometimes, peak picking chooses the 13C peak rather than the 12C, so the mass is out by 1 Da. In extreme cases, it may pick the 13C2 peak. The #13C control allows for this, enabling you to use a tight mass tolerance and still get a match. In general, it's not advisable to combine #13C with deamidation because, if you have a high level of 13C precursors, it will be difficult to detect deamidation reliably. This is another setting that should be determined empirically, by running a standard.
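The arithmetic behind ppm tolerances and the #13C setting is straightforward. Here is a small sketch with an invented peptide mass, using the ~1.00336 Da spacing between the 13C and 12C isotope peaks; note that this offset is uncomfortably close to the +0.98 Da mass shift of deamidation, which is why combining the two is risky:

```python
C13_OFFSET = 1.00336  # spacing between isotope peaks (13C minus 12C), Da

def ppm_error(observed: float, calculated: float) -> float:
    """Relative mass error in parts per million."""
    return (observed - calculated) / calculated * 1e6

calc = 1479.7954           # hypothetical calculated peptide mass (Da)
obs = calc + C13_OFFSET    # peak picking chose the 13C isotope peak
print(f"raw error:     {ppm_error(obs, calc):+7.1f} ppm")  # ~ +678 ppm
print(f"with #13C = 1: {ppm_error(obs - C13_OFFSET, calc):+7.1f} ppm")
```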
Instrument type
The instrument setting determines which fragment ion series will be considered in the search. Choose the description that best matches the type of instrument. If you follow the control label link, you'll see that many of the instruments are very similar. The main pitfall is choosing CID for ETD data, or vice versa.
Decoy search and false discovery rate
Mascot automatically runs a target-decoy search to estimate the peptide false discovery rate (FDR), as recommended by most journals. The decoy search is done using identical search parameters, against a database in which the sequences have been reversed. You do not expect to get any real matches from the decoy database, so the number of matches observed is an excellent estimate of the number of false positives in the results from the target database.
Both the search form and the results report allow you to choose a target FDR. This means the significance threshold can be adjusted to give a peptide FDR of 5% or 1%, or whatever you believe is appropriate for your work. Note that this is peptide FDR, not protein FDR.
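The counting behind the FDR estimate is simple enough to sketch. Below is a minimal example in Python, assuming you have score lists for target and decoy matches (the scores are invented) and want the lowest score cut-off that achieves a target FDR:

```python
def fdr_at(threshold, target_scores, decoy_scores):
    """Estimated peptide FDR: decoy matches stand in for false positives."""
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return d / t if t else 0.0

def threshold_for_fdr(goal, target_scores, decoy_scores):
    """Lowest score cut-off whose estimated FDR meets the goal."""
    for thr in sorted(set(target_scores)):
        if fdr_at(thr, target_scores, decoy_scores) <= goal:
            return thr
    return None

targets = [72, 65, 58, 44, 41, 37, 33, 28, 25, 21]
decoys = [30, 24, 19, 15]
print(threshold_for_fdr(0.05, targets, decoys))  # accepts matches scoring >= 33
```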
Refining results with machine learning
Database search results can optionally be refined with machine learning. This is a powerful technique, especially with ‘difficult’ data sets such as endogenous peptides, very large databases, and metaproteomics. The option “Refine results with machine learning” only takes effect if the search has more than 750 queries and the database more than 100 sequences.
When “Refine results with machine learning” is selected, Mascot runs additional steps at the end of the database search. Several metrics (machine learning features) are calculated for each peptide match, such as precursor mass error, charge state, missed cleavages, amount of fragment intensity matched, and average MS/MS fragment mass error. These are passed to Percolator, which trains a semi-supervised machine learning model. The model finds an optimal separation between target and decoy matches using all the available features.
Percolator often gives a significant improvement in peptide identifications. This is because it leverages extra context that is unavailable to the database search engine. For example, the precursor mass errors of incorrect matches are typically randomly distributed, while those of correct matches cluster around zero. Percolator uses this clustering to separate correct and incorrect matches in a multidimensional space.
You can optionally select machine learning models for predicted features. Mascot includes DeepLC for retention time prediction and MS2PIP for spectral similarity. The included models, and how to choose between them, are documented in the MS2Rescore help.
Error tolerant search
As mentioned several times already, an error tolerant search is the most efficient way to discover most post-translational modifications, as well as non-specific peptides and sequence variants. This is a two pass search, the first pass being a simple search of the entire database with minimal modifications. The protein hits found in the first pass search are then selected for an exhaustive second pass search, during which we look for all possible modifications, sequence variants, and non-specific cleavage products. Because only a small number of entries are being searched, search time is not an issue.
The matches from the first pass search, in the limited search space, are the evidence for the presence of the proteins, while the matches from the second pass search give increased coverage. If you see a very abundant modification, it is best to add it as a variable modification and search again, because the error tolerant search only catches peptides with a single unsuspected modification. Error tolerant searching is not so useful for very heavily modified proteins, such as histones, or where there is only one peptide per protein, such as endogenous peptides.
Failing to get a match
Finally, if you are analysing proteins, you should search a peak list containing data for as many peptides as possible, because there are a host of reasons why any one spectrum may fail to give a match:
- The exact peptide sequence isn’t in the database
- The peptide is modified in an unexpected way
- Non-specific enzyme cleavage
- The precursor m/z or charge is wrong
- The spectrum is very weak or noisy
If you don’t get any matches at all, you can only resort to changing the search parameters by trial and error, which is time consuming and carries the risk of ending up with a false positive. If you search many spectra, you have a much better chance that some of them match, and the search parameters can be modified systematically, or automatically, in an error tolerant search.