How to create a spectral library for contaminants
The tutorial "Identifying most common trypsin modifications" showed how modified and non-specific peptides from contaminants can be matched using a spectral library without increasing the search space for the target proteins. This is particularly useful for sequencing grade trypsin, which is modified by methylation or acetylation of the lysines, creating a large number of modified non-specific peptides that are missed by typical search strategies.
It would be very convenient if a spectral library that contained all the peptides expected from common contaminants was available for download. Unfortunately, it’s not that easy! Some contaminants are ubiquitous, such as trypsin, BSA, and human keratins. Others are specific to a laboratory, sample, or protocol. For any given contaminant, the set of peptides that are seen depends on the alkylating agent, the protease, the enzyme to substrate ratio, the digestion temperature and duration, etc. The relative abundances of artefactual modifications, such as oxidation, deamidation, carbamylation, and N-term cyclisation, will also depend on the experimental protocol. In some cases, there will be additional derivatisation steps, such as isotopic labelling.
With all these experimental variables, the most practical course is to make your own library. Mascot Server includes all the necessary tools.
This example uses three raw files from PRIDE project PXD002726. Files were processed into a single merged peak list using Mascot Distiller. The sample preparation used Promega trypsin.
1. Determine suitable search parameters for the target proteins
In the example data, the target proteins are from yeast, which is very well characterised, so SwissProt with a taxonomy of yeast will be fine. Search parameters can be set by experience or trial and error. In the general case, you may need to include fixed modifications, such as alkylation or isotopic labels, but don’t include any variable modifications unless you know they are extremely abundant. The correct settings for mass tolerances and the enzyme are those that give the best sensitivity at 1% FDR.
Example search results using reasonable settings for the target proteins.
2. Identify major contaminant proteins
Next, search the whole of SwissProt using mostly the same parameters, but with any fixed mods made variable (because contaminants are not guaranteed to be labelled or alkylated). In the result report, set an FDR of 1% for PSMs.
In the report builder tab, set a filter of "Num. of significant unique sequences" > 2, or a higher threshold, because we are not interested in minor contaminants. For this example, we can use an accession string filter to exclude target proteins: NOT(Accession CONTAINS "YEAST"). This leaves a list of 23 potential contaminants.
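The logic of the two filters can be sketched in Python for readers who export the protein list and want to apply the same selection offline. This is only an illustration: the `ProteinHit` class, the `select_contaminants` function, and the sequence counts are hypothetical and are not part of Mascot.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    accession: str
    significant_unique_sequences: int


def select_contaminants(hits, min_sequences=2, target_tag="YEAST"):
    """Keep hits with more than min_sequences significant unique sequences
    whose accession does not contain the target organism tag."""
    return [
        h for h in hits
        if h.significant_unique_sequences > min_sequences
        and target_tag not in h.accession
    ]


# Hypothetical counts, chosen only to exercise both filters.
hits = [
    ProteinHit("TRYP_PIG", 14),
    ProteinHit("K2C1_HUMAN", 9),
    ProteinHit("ENO1_YEAST", 25),   # target protein, excluded by accession
    ProteinHit("ALBU_BOVIN", 2),    # too few sequences, excluded as minor
]
contaminants = select_contaminants(hits)
print([h.accession for h in contaminants])
```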
Use your own judgement regarding the selection of contaminant proteins. You might want to take all of them or just focus on one or two of the most abundant.
3. Create a Fasta for the contaminants of interest
This is not essential, but it makes subsequent searches faster and simplifies importing the contaminant PSMs into a library.
- In the report builder tab, click on the accession column link for the first protein of interest
- In the Protein View report, click on the Unformatted sequence string link
- Copy the title line and paste into a new document in a text editor (or add to the end of your existing contaminants Fasta file)
- Copy the sequence string except for the leading asterisk and paste on the line after the title
- Repeat for each protein of interest
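If you prefer to script the copy and paste steps above, a minimal sketch might look like the following. It assumes you have already collected each title line and unformatted sequence string as plain text; the function name `fasta_entry`, the 60-character line width, and the example title and sequence are all made up for illustration.

```python
def fasta_entry(title, raw_sequence, width=60):
    """Format one Fasta entry from a title line and a raw sequence string.

    Whitespace is removed and a leading asterisk, if present, is dropped.
    """
    sequence = "".join(raw_sequence.split()).lstrip("*")
    header = title if title.startswith(">") else ">" + title
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(lines) + "\n"


# Made-up title and sequence, not a real database entry.
entry = fasta_entry("P1 example protein", "*ACDEF GHIK", width=4)
print(entry, end="")
```

Appending the returned strings to a text file gives a Fasta that can be added to an existing contaminants database or uploaded as a new one.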
The Fasta file for the 23 contaminants identified in the preceding step can be downloaded from our public website. It only took a few minutes to create. If you are updating an existing contaminants database, just copy the Fasta file to the relevant folder on the Mascot Server. If this is a new database, configure it in Database Manager: Fasta; Create new; Use predefined definition template; Simple_AA_template; Upload file using web browser; Activate.
4. Perform error tolerant searches
For these searches, if you created a contaminants Fasta, change the databases to SwissProt with a taxonomy filter of yeast plus the new contaminants database. Otherwise, search the whole of SwissProt. Remember to select “Automatic second pass search of selected modification classes”.
In the modification statistics of the search results, inspect the most abundant mods; for this example, those with counts over 100. These are the ones that may need to be specified as fixed or variable modifications.
Modification | Site | Above thr. | ET | Total matches |
---|---|---|---|---|
Non-specific cleavage | - | 0 | 797 | 797 |
Carbamidomethyl | C | 515 | 0 | 515 |
Oxidation | M | 0 | 469 | 469 |
Carbamidomethyl | N-term | 0 | 291 | 291 |
Deamidated | N | 0 | 168 | 168 |
Search again, this time with Carbamidomethyl (C) as fixed, Deamidated (NQ), Carbamidomethyl (N-term), and Oxidation (M) as variable.
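The count cut-off can be applied mechanically if the modification statistics are copied out as text. The sketch below parses the rows of the table above and keeps those whose total exceeds a threshold. The function name `abundant_mods` and the pipe-separated input format are assumptions for illustration; this is not a Mascot export format.

```python
STATS = """\
Non-specific cleavage | - | 0 | 797 | 797
Carbamidomethyl | C | 515 | 0 | 515
Oxidation | M | 0 | 469 | 469
Carbamidomethyl | N-term | 0 | 291 | 291
Deamidated | N | 0 | 168 | 168
"""


def abundant_mods(table, min_total=100):
    """Return (modification, site, total) for rows with total > min_total,
    most abundant first."""
    rows = []
    for line in table.strip().splitlines():
        name, site, _above, _et, total = [c.strip() for c in line.split("|")]
        if int(total) > min_total:
            rows.append((name, site, int(total)))
    return sorted(rows, key=lambda r: r[2], reverse=True)


for name, site, total in abundant_mods(STATS):
    print(f"{name} ({site}): {total}")
```

The threshold is a judgement call. Deciding whether each modification becomes fixed or variable still requires inspecting the matches.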
When an error tolerant search includes variable mods, you have to be sceptical of certain matches. Some modifications have a delta that is the exact negative of another. For example, Asn->Gly and Gln->Ala are exact inversions of Carbamidomethyl. Amidated, Asp->Asn, and Glu->Gln are exact inversions of Deamidated. Deoxy, Ser->Ala, and Tyr->Phe are exact inversions of Oxidation. You will see error tolerant matches that carry complementary pairs of modifications, allowing the fragment masses to be shuffled around to get a better score while leaving the parent mass unchanged. These are almost always false, with the true match, without either mod, getting a slightly lower score.
On the plus side, this provides a way to test whether Carbamidomethyl (C) can be fixed. If there are significant numbers of error tolerant matches with deltas of -57.0215 from such a search, this would suggest the Cys were unmodified. For these data, having Carbamidomethyl (C) fixed seems very safe.
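The cancelling pairs listed above can be checked with simple arithmetic. The sketch below derives substitution deltas from monoisotopic residue masses and reports which error tolerant deltas sum to zero with a chosen variable modification. The function names and the 0.001 Da tolerance are illustrative choices, not anything defined by Mascot.

```python
# Monoisotopic residue masses (Da) for the residues involved.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "E": 129.04259,
    "F": 147.06841, "Y": 163.06333,
}

# Monoisotopic deltas (Da) of the variable modifications in the search.
VARIABLE_MODS = {
    "Carbamidomethyl": 57.021464,
    "Deamidated": 0.984016,
    "Oxidation": 15.994915,
}


def substitution_delta(old, new):
    """Mass change when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


# Deltas that an error tolerant search may try.
ET_DELTAS = {
    "Asn->Gly": substitution_delta("N", "G"),
    "Gln->Ala": substitution_delta("Q", "A"),
    "Asp->Asn": substitution_delta("D", "N"),
    "Glu->Gln": substitution_delta("E", "Q"),
    "Ser->Ala": substitution_delta("S", "A"),
    "Tyr->Phe": substitution_delta("Y", "F"),
    "Amidated": -0.984016,
    "Deoxy": -15.994915,
}


def cancelling_pairs(variable_mods, et_deltas, tol=0.001):
    """Pairs whose deltas sum to zero within tol, leaving the parent mass unchanged."""
    return [
        (v, e)
        for v, dv in variable_mods.items()
        for e, de in et_deltas.items()
        if abs(dv + de) <= tol
    ]


PAIRS = cancelling_pairs(VARIABLE_MODS, ET_DELTAS)
for variable_mod, et_mod in PAIRS:
    print(f"{variable_mod} + {et_mod}: net delta is zero")
```

Any match that carries both members of one of these pairs deserves a second look before it is trusted.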
5. Perform standard searches for abundant modifications
The utility to create a spectral library from search results does not import error tolerant matches directly. Such matches often have multiple possible assignments. The error tolerant results are used to choose sets of modifications for standard searches.
Modifications that are extremely abundant in the error tolerant results should be selected for all searches because they may be found in combination on individual peptides. Less abundant modifications can be selected for separate searches. For example, we might expect to see Carbamidomethyl and Deamidated together on some peptides, but we are unlikely to see both Methyl and Acetyl on the same peptide.
The population of modifications on a contaminant protein will depend on its source, so we may need to run different searches for different contaminants. Ideally, we would look at porcine trypsin, human contaminants, and E. coli contaminants separately. In the interests of brevity, we will only consider porcine trypsin in this tutorial. Trypsin shows a relatively low level of Oxidation (M) compared with the target proteins, but there are large numbers of non-specific peptides. Based on inspection of the matches in the error tolerant search, it would seem reasonable to run the following searches, all with semiTrypsin as the enzyme and Carbamidomethyl (C) as a fixed modification. Each search uses one of these sets of variable modifications:
- Deamidated (NQ), Carbamidomethyl (N-term), Methyl (K), Methyl (N-term)
- Deamidated (NQ), Carbamidomethyl (N-term), Dimethyl (K), Dimethyl (N-term)
- Deamidated (NQ), Carbamidomethyl (N-term), Acetyl (K), Acetyl (N-term)
- Deamidated (NQ), Carbamidomethyl (N-term), Formyl (K), Formyl (N-term)
- Deamidated (NQ), Carbamidomethyl (N-term), Cation:Na (DE), Cation:Na (C-term)
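The pattern behind this list is that the abundant modifications appear in every search, while each less abundant group gets a search of its own. A minimal sketch of that bookkeeping follows. The dictionary keys (`enzyme`, `fixed`, `variable`) are hypothetical labels for illustration and are not Mascot search form parameter names.

```python
# Abundant modifications, included in every search.
COMMON_VARIABLE = ["Deamidated (NQ)", "Carbamidomethyl (N-term)"]

# Less abundant modifications, one group per search.
GROUPS = [
    ["Methyl (K)", "Methyl (N-term)"],
    ["Dimethyl (K)", "Dimethyl (N-term)"],
    ["Acetyl (K)", "Acetyl (N-term)"],
    ["Formyl (K)", "Formyl (N-term)"],
    ["Cation:Na (DE)", "Cation:Na (C-term)"],
]


def build_searches(common, groups):
    """One search definition per group, each sharing the common settings."""
    return [
        {
            "enzyme": "semiTrypsin",
            "fixed": ["Carbamidomethyl (C)"],
            "variable": common + group,
        }
        for group in groups
    ]


SEARCHES = build_searches(COMMON_VARIABLE, GROUPS)
for search in SEARCHES:
    print(", ".join(search["variable"]))
```

Keeping the groups separate limits the number of variable modifications per search to four, which keeps the search space manageable.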
6. Create the library
We now create a new library from this set of search results. At any later time, we can run additional searches customised for other contaminants and add these to the existing library.
To create a new library in Database Manager, choose Library; Create new; Create from search results. In most cases, the defaults will be acceptable.
Choose Edit filters. We only want reliable matches in the library. Based on inspection of the search results, an expect value below 0.01 and a score above 50 should ensure a very low level of false matches. We only want to import matches to contaminant proteins. This is where the contaminants Fasta comes in useful: you can simply set a filter on the database name. Otherwise, if you searched the whole of SwissProt, you have to use a taxonomy filter and import all matches that are not yeast.
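To make the combined effect of these filters concrete, here is a small Python sketch of the same selection applied to a list of matches. The `PSM` class, the database name `contaminants`, and the example values are hypothetical. In practice the filters are set in Database Manager, not in code.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    sequence: str
    score: float
    expect: float
    database: str


def library_candidates(psms, contaminant_db="contaminants",
                       max_expect=0.01, min_score=50.0):
    """Keep matches that are reliable and come from the contaminants database."""
    return [
        p for p in psms
        if p.expect < max_expect
        and p.score > min_score
        and p.database == contaminant_db
    ]


# Made-up matches, chosen so that each filter rejects one of them.
psms = [
    PSM("PEPTIDEA", 72.0, 0.0004, "contaminants"),  # kept
    PSM("PEPTIDEB", 41.0, 0.0060, "contaminants"),  # score too low
    PSM("PEPTIDEC", 55.0, 0.0300, "contaminants"),  # expect value too high
    PSM("PEPTIDED", 90.0, 0.0001, "SwissProt"),     # target database
]
kept = library_candidates(psms)
print([p.sequence for p in kept])
```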
Choose Import search results and set the filters to ensure that only the result files created in the previous step are scanned. In some cases, it will be quickest to copy the result files to a new folder so that they can all be selected in one go with a simple wildcard path.
7. Use the library
The final search results use narrow settings for the Fasta, giving good speed and sensitivity for the target yeast proteins, and they use the new spectral library. By including the library, we also get very large numbers of matches to contaminant peptides: 1592 significant PSMs to TRYP_PIG, 544 to K2C1_HUMAN, 385 to DNAK_ECOHS, and many more.