Common mistakes

If you are new to sequence database searching, you may find the tutorials Peptide Mass Fingerprint search and Searching uninterpreted MS/MS data helpful. Below are some of the most common mistakes.

Choice of database and taxonomy

A common and very understandable mistake is to search more databases than necessary. For example, on our public web site, we often see searches against both NCBIprot and SwissProt. Since all of the sequences in SwissProt are also in NCBIprot, searching both is a waste of time.

Searching multiple overlapping databases can also result in a failure to get a significant match if the spectrum is not of sufficient quality. The Mascot score for a specific match remains constant regardless of the size of the database(s). However, the significance threshold (i.e. the score required for the match to be significant rather than just a random match) depends on the number of entries in the database: the more entries searched, the higher the threshold.

For example, this PMF search against SwissProt gets a score of 82 for PML_HUMAN, and the significance threshold is 70 (for p<0.05). However, if you repeat the search against SwissProt and NCBIprot, then the significance threshold goes up to 94 and there is no longer a significant match. Since the search space is larger, there are now many more random matches, some with scores above 80. Indeed, one match, with the taxonomy “synthetic construct”, has a score of 84, which is higher than the score of the ‘correct’ match but still not significant. Without the significance threshold, it’s possible to be convinced that this “synthetic construct” protein is the correct match.

If your sample is from a species that is well represented in SwissProt, then it generally makes sense to search SwissProt using the appropriate taxonomy and a contaminants database. In the above example, choose to search SwissProt with human taxonomy and also the contaminants database. (To select two databases in a PMF search, use the control key when selecting the second database.) The repeat search finds PML_HUMAN with protein score 82, as before, but the significance threshold is now 56.
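The way the threshold moves with database size can be sketched in a few lines. This is a minimal sketch, assuming the score is -10 x log10(P) and that a match is significant when the chance of a random match across all entries is below p; the function name and the entry counts are illustrative assumptions, not values reported by Mascot.

```python
import math


def significance_threshold(num_entries: int, alpha: float = 0.05) -> float:
    """Score a match must reach to be significant at level alpha.

    Assumes score = -10 * log10(P) and one random trial per database entry.
    """
    if num_entries < 1:
        raise ValueError("num_entries must be at least 1")
    return -10.0 * math.log10(alpha / num_entries)


# Entry counts below are illustrative round numbers (assumptions).
for label, entries in [("whole database", 500_000), ("single taxonomy", 20_000)]:
    threshold = significance_threshold(entries)
    print(f"{label}: {entries} entries -> threshold {threshold:.0f}")
```

Under these assumptions, cutting the number of entries by a factor of 25 lowers the threshold by about 14, while the score of the match itself does not change.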

If you restrict the search to a specific taxonomy, then it’s important to include the contaminants database because the spectra may turn out to be from a contaminant, such as trypsin, rather than actually from your sample. It’s normally very useful to see this rather than just get no match.

If your sample is from a species that is not well represented in the databases, then you may need to search NCBIprot. You should also check how many protein sequences there are for the species in NCBI using the NCBI taxonomy browser. For example, if your sample is from Hystricidae (Old World porcupines), you will see that there are only 181 protein sequences. If you have your own Mascot Server, you could add Hystricidae to the taxonomy list and just search those sequences. Alternatively, you could select “Other Rodentia”, but this also only has a few sequences. A better option would be to search Rodentia in the hope that the porcupines have some homologous sequences with their less prickly cousins.

Insufficient mass values

A very common error is to submit a PMF search with a single mass value. A peptide mass fingerprint works from digesting a single protein with an enzyme to produce a number of peptides of different masses that are effectively a ‘fingerprint’ for that protein. If the instrument has only produced a single mass value, something fundamental has gone wrong. It isn’t possible to get a significant protein match from a single mass value because it can occur any number of times in the database by pure chance.

There’s a similar issue with MS-MS searches where an MS-MS spectrum has just one or two fragment peaks. Ideally, there should be at least one peak for each residue, so if there are fewer fragment ions than residues, you cannot expect a high score. In practice, there will often be peaks for both b series and y series ions as well as possibly some neutral losses and other fragments. With a few noise peaks as well, 100 peaks per spectrum will not be unusual.
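A quick pre-submission check for this rule of thumb might look like the following. It is only a sketch: the peptide length is estimated from the precursor mass using an average residue mass of about 110 Da, which is a rough assumption, and both function names are hypothetical.

```python
AVERAGE_RESIDUE_MASS = 110.0  # Da; rough rule-of-thumb value (assumption)


def estimated_residues(precursor_mass: float) -> int:
    """Rough estimate of peptide length from the precursor mass."""
    return max(1, round(precursor_mass / AVERAGE_RESIDUE_MASS))


def too_few_fragments(num_peaks: int, precursor_mass: float) -> bool:
    """True if the spectrum has fewer fragment peaks than estimated residues."""
    return num_peaks < estimated_residues(precursor_mass)


# A 1500 Da precursor with only two fragment peaks is flagged.
print(too_few_fragments(2, 1500.0))
print(too_few_fragments(40, 1500.0))
```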

Impossible or low scoring mass values

Sometimes a peptide mass fingerprint has been submitted with m/z values all greater than 5000 Daltons. Assuming the enzyme was trypsin, there are very few tryptic peptides with such a high mass, so such a search will result in no matches to any protein. In this case, it’s most likely that the sample hasn’t been digested properly. It may be possible in such a case to get a match by specifying a very large number of missed cleavages, but it’s much better to repeat the analysis.

At the other end of the scale, mass values for very short peptides contribute little to the score. It is the long peptides, which are unlikely to occur in multiple proteins, that provide the greatest specificity, so aim to get as many peptide masses as possible in the range 1000 to 3500 Da.
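The two checks described in this section can be combined into a small summary of a PMF peak list before it is submitted. This is a sketch under stated assumptions: the function name and the returned keys are hypothetical, and the limits are simply the figures quoted above.

```python
def pmf_mass_summary(masses, low=1000.0, high=3500.0, ceiling=5000.0):
    """Summarise how many PMF mass values fall in the most useful range."""
    in_range = [m for m in masses if low <= m <= high]
    return {
        "total": len(masses),
        "in_preferred_range": len(in_range),
        # True only for a non-empty list where every value exceeds the ceiling,
        # which suggests an incomplete digest.
        "all_above_ceiling": bool(masses) and all(m > ceiling for m in masses),
    }


print(pmf_mass_summary([842.5, 1045.6, 2211.1, 3600.0]))
print(pmf_mass_summary([5200.1, 6100.4, 7003.9]))
```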

For MS-MS searches, fragment masses (m/z * charge) under 50 Daltons or above the precursor mass will not contribute to the score and indicate a problem with the spectrum or peak detection.
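The fragment mass rule can be expressed as a simple filter. The sketch below uses the same definition of fragment mass as the text (m/z multiplied by charge); the function name and the representation of peaks as (m/z, charge) pairs are assumptions made for illustration.

```python
def unusable_fragments(fragments, precursor_mass, floor=50.0):
    """Return the (mz, charge) pairs that cannot contribute to the score.

    A fragment is flagged if its mass is under `floor` Da or above the
    precursor mass.
    """
    flagged = []
    for mz, charge in fragments:
        mass = mz * charge  # fragment mass as defined in the text
        if mass < floor or mass > precursor_mass:
            flagged.append((mz, charge))
    return flagged


peaks = [(45.0, 1), (175.119, 1), (800.4, 2), (1200.0, 1)]
print(unusable_fragments(peaks, precursor_mass=1500.0))
```

If more than a handful of peaks are flagged, it is worth checking the charge state assignment and the peak detection settings before searching.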

Too many mass values

Another common mistake is having too many peaks in a spectrum. It is rare for an MS/MS spectrum of a peptide to have more than a couple of hundred peaks, even with noise included. It’s certainly impossible for a spectrum to have more than 1000 real peaks. If you submit a peak list to Mascot that has more than 10,000 peaks, the search will terminate with an error.
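These limits are easy to test for before a search is submitted. The following is a minimal sketch; the function name and the returned labels are hypothetical, and the two constants are the figures quoted above.

```python
MAX_PLAUSIBLE_PEAKS = 1_000  # more real peaks than this is not credible
HARD_LIMIT = 10_000          # above this the search terminates with an error


def classify_peak_count(num_peaks: int) -> str:
    """Classify a spectrum by the number of peaks in its peak list."""
    if num_peaks > HARD_LIMIT:
        return "rejected"
    if num_peaks > MAX_PLAUSIBLE_PEAKS:
        return "implausible"
    return "ok"


for count in (150, 4_000, 25_000):
    print(count, classify_peak_count(count))
```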

The most common cause of too many peaks is submitting raw data or profile data, where there is a data point at every mass. Raw data must be converted into a peak list by a process called peak picking or peak detection. Often, the instrument data system takes care of this, and you can submit a Mascot search directly from the data system, or save a peak list to a disk file for submission using the web browser search form.

The second explanation is incorrect or poor peak picking. Double check your peak picking settings to ensure you are producing a centroided, de-noised peak list. We recommend Mascot Distiller, which has been designed to work with raw data from any instrument.
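To illustrate what a centroided, de-noised peak list means, here is a deliberately naive sketch that collapses each contiguous run of profile points above a noise floor into a single peak at the intensity-weighted mean m/z. It is not the algorithm used by Mascot Distiller or any instrument data system; the function name and the fixed noise floor are assumptions for illustration only.

```python
def centroid_peaks(mz, intensity, noise_floor):
    """Collapse each contiguous run of points above noise_floor into one peak.

    Returns a list of (centroid_mz, summed_intensity) tuples.
    """
    peaks = []
    weighted_mz = 0.0
    total = 0.0
    for m, i in zip(mz, intensity):
        if i > noise_floor:
            weighted_mz += m * i
            total += i
        elif total > 0.0:
            # The run has ended, so emit its centroid and start afresh.
            peaks.append((weighted_mz / total, total))
            weighted_mz = 0.0
            total = 0.0
    if total > 0.0:  # close a run that reaches the end of the data
        peaks.append((weighted_mz / total, total))
    return peaks


profile_mz = [100.0, 100.1, 100.2, 100.3, 100.4, 100.5, 100.6, 100.7]
profile_int = [0.5, 5.0, 10.0, 5.0, 0.5, 0.4, 8.0, 0.3]
print(centroid_peaks(profile_mz, profile_int, noise_floor=1.0))
```

In this example, eight profile points reduce to two peaks, which is the kind of reduction a peak list should show relative to raw data.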

Matrix Science
