The most analysed protein is …
Trypsin, of course. The Journal of Proteome Research has a paper from the Medical University of Graz concerning the importance of correctly identifying spectra from contaminant proteins, in particular trypsin autolysis peptides.
The authors point out that sequencing grade trypsin is modified by methylation or acetylation of the lysines, to inhibit autolysis. Unless these variable modifications are selected in a search, simply including a contaminants database will not be sufficient to catch all trypsin autolysis peptides. As part of their study, data were acquired using an LTQ-Orbitrap Velos from a yeast cell lysate digested with Promega trypsin (Data Set 1 in the paper). We downloaded these three raw files from PRIDE, processed them into a single merged peak list using Mascot Distiller, and tried a variety of searches using Mascot Server 2.6.
Based on the results of an error tolerant search, we chose Carbamidomethyl (C) as a fixed mod and Methyl (N-term), Methyl (K), Dimethyl (K), Dimethyl (N-term), Dehydro (C), Deamidated (NQ) and Carbamidomethyl (N-term) as varmods. We also decided to use semiTrypsin as the enzyme, because it was clear that there were large numbers of non-specific peptides. The results are here; identifications and counts are for 1% FDR on PSM counts.
The most abundant trypsin autolysis peptide is usually R.LGEHNIDVLEGNEQFINAAK.I, which the authors point out is identified some 21,722 times in the PRIDE Cluster resource. This peptide is abundant in the Graz data in both modified and unmodified forms. It is tempting to treat the number of determinations as pseudo-quantitative, and there are 64 spectra for the unmodified peptide versus a total of 76 for various modified forms, but this could be misleading because methylation is not favourable towards CID fragmentation. Even so, it seems clear that methylation is far from complete, probably because of the steric issues identified in the paper. The authors also observe that cleavage occurs readily after a methylated or dimethylated lysine.
The searches described in the Graz paper are all for strict tryptic specificity. When searched with semiTrypsin, R.LGEHNIDVLEGNEQFINAAK.I exhibits a near complete family of C-terminal "ragged ends", going down to R.LGEHNIDVLEGN.E. The most abundant appears to be R.LGEHNIDVLEGNEQFIN.A, which is represented by 99 spectra. Whether this occurs in solution or in the ion source is hard to say.
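To make the "ragged end" family concrete, here is a minimal Python sketch that enumerates the C-terminal truncations between the full peptide and the shortest form mentioned above. The function name is our own invention for illustration; it is not part of Mascot.

```python
def c_terminal_ragged_ends(peptide, shortest):
    """List C-terminally truncated forms, from full length down to `shortest`."""
    if not peptide.startswith(shortest):
        raise ValueError("shortest form must be an N-terminal prefix of the peptide")
    # Step the C-terminus back one residue at a time; the N-terminus stays fixed.
    return [peptide[:n] for n in range(len(peptide), len(shortest) - 1, -1)]


FULL = "LGEHNIDVLEGNEQFINAAK"
ladder = c_terminal_ragged_ends(FULL, "LGEHNIDVLEGN")

for seq in ladder:
    print(len(seq), seq)
```

Only the full-length peptide has a tryptic C-terminus, so a search with strict trypsin specificity cannot match any of the truncated members of this ladder.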
The reason for including the N-term varmods was that these gave strong matches in the error tolerant search. These are not Protein terminus modifications, so must be post-digest artefacts. Carbamidomethyl (N-term) is very common, and could be due to residual iodoacetamide, but why do we see Methyl (N-term) and Dimethyl (N-term)? The most likely explanation is autolysis prior to or during methylation.
Clearly, there are many peptides that would be missed in a vanilla search. At 1% FDR, the counts of PSMs and distinct sequences for the semi-tryptic search with multiple varmods are 545 and 49 compared with 281 and 10 for a search with strict trypsin and Carbamidomethyl (N-term) as the only varmod.
The Graz paper advocates editing the sequence of trypsin in the FASTA file, replacing K with J, and defining J as the mass of dimethylated lysine. Unmodified lysine or mono-methylated lysine can then be matched using J-specific mods, which keeps the overall search space small because only the trypsin sequence contains any J. This is fine as far as it goes, but it doesn’t catch the N-term modifications or the non-specific cleavage. The authors mention three other possible solutions, two of which are not possible using the search engines under consideration. The third is: "Another feasible approach would be to combine the in silico generated search space with measured spectral libraries from contaminants."
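The K-to-J edit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure from the paper: the two-entry FASTA text, its headers and the `edit_fasta` helper are invented for the example, and the masses are standard monoisotopic values (lysine residue 128.094963 Da, one methyl group adding 14.015650 Da).

```python
LYS_RESIDUE = 128.094963   # monoisotopic residue mass of lysine (Da)
METHYL_DELTA = 14.015650   # mass added by one methyl group, i.e. CH2 (Da)

# J is defined as dimethylated lysine.
J_RESIDUE = LYS_RESIDUE + 2 * METHYL_DELTA

# J-specific variable mods that bring J back to the other observed forms.
J_MODS = {
    "J -> unmodified lysine": -2 * METHYL_DELTA,
    "J -> monomethyl lysine": -METHYL_DELTA,
}


def edit_fasta(fasta_text, header_tag):
    """Replace K with J only in entries whose header line contains header_tag."""
    out = []
    editing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            editing = header_tag in line
            out.append(line)
        else:
            out.append(line.replace("K", "J") if editing else line)
    return "\n".join(out)


# Toy input: hypothetical headers and sequence fragments, for illustration only.
toy_fasta = ">trypsin_example\nLGEHNIDVLEGNEQFINAAK\n>other_example\nMKTAYIAK"
edited = edit_fasta(toy_fasta, "trypsin_example")
print(edited)
print(round(J_RESIDUE, 6), J_MODS)
```

Because only the edited entry contains J, the two negative-delta mods cannot apply to any other protein in the database, which is why the search space stays small.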
This is a far more powerful option, since it allows any number of modified and non-specific peptides from any number of contaminants to be intercepted with no increase in the search space. It is very easy to create a library from search results with Mascot Server 2.6. We used Database Manager to create a library from the semi-tryptic search with thresholds of expect < 0.01 and score > 50 and a taxonomy of Sus scrofa. (This requires Mascot Server 2.6.1.)
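The two thresholds act as a joint filter: a match must pass both to enter the library. The Python sketch below shows only that logic. It is not how Database Manager is implemented, and the `Psm` record and the score and expect values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Psm:
    """Hypothetical record for one peptide-spectrum match."""
    peptide: str
    score: float
    expect: float


def passes_library_thresholds(psm, max_expect=0.01, min_score=50.0):
    """True only if the match satisfies both expect < 0.01 and score > 50."""
    return psm.expect < max_expect and psm.score > min_score


# Illustrative values only; these are not results from the Graz data.
candidates = [
    Psm("LGEHNIDVLEGNEQFINAAK", score=92.0, expect=1e-6),  # passes both
    Psm("LGEHNIDVLEGNEQFIN", score=55.0, expect=0.02),     # fails expect
    Psm("LGEHNIDVLEGN", score=48.0, expect=0.005),         # fails score
]

library_entries = [p for p in candidates if passes_library_thresholds(p)]
print([p.peptide for p in library_entries])
```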
By including this library, we can search the Graz data against SwissProt with tight search parameters – strict trypsin, yeast taxonomy filter, a single variable mod – yet still obtain matches to all the modified and non-specific trypsin autolysis peptides, removing them as a possible source of false positives.
Job done? We just need to distribute this library so that everyone can benefit from comprehensive identification of trypsin autolysis peptides?
Not unless your sample processing protocol is identical to that used by the Graz group. Promega (porcine) trypsin or Sigma (bovine) trypsin? Iodoacetamide or some other alkylating agent? Enzyme to substrate ratio? Digestion temperature and duration? Other types of derivatisation, such as isotopic labels? There are so many experimental variables that the most practical course is to make your own library. Fortunately, Mascot 2.6 includes all the necessary tools. A future blog article will describe the procedure step-by-step.
Keywords: artefact, autolysis, error tolerant, spectral library, trypsin