Trypsin autolysis

The most analysed protein

The Journal of Proteome Research has a paper from the Medical University of Graz concerning the importance of correctly identifying spectra from contaminant proteins, in particular trypsin autolysis peptides.

The authors point out that sequencing grade trypsin is modified by methylation or acetylation of the lysines to inhibit autolysis. Unless these variable modifications are selected in a search, simply including a contaminants database will not be sufficient to catch all trypsin autolysis peptides. As part of their study, data were acquired using an LTQ-Orbitrap Velos from a yeast cell lysate digested with Promega trypsin (Data Set 1 in the paper). The raw data are available on PRIDE.

Example search

This example uses three raw files from PRIDE project PXD002726. Files were processed into a single merged peak list using Mascot Distiller.

Based on the results of an error tolerant search, we chose Carbamidomethyl (C) as a fixed mod and Methyl (N-term), Methyl (K), Dimethyl (K), Dimethyl (N-term), Dehydro (C), Deamidated (NQ) and Carbamidomethyl (N-term) as variable mods. We also decided to use semiTrypsin as the enzyme, because it was clear that there were large numbers of non-specific peptides. Target FDR was set to 1%.

Most abundant autolysis products

The results report shows that the most abundant trypsin autolysis peptide is usually R.LGEHNIDVLEGNEQFINAAK.I, which the authors point out is identified some 21,722 times in the PRIDE Cluster resource. This peptide is abundant in the Graz data in both modified and unmodified forms.

It is tempting to treat the number of determinations as pseudo-quantitative, and there are 125 spectra for the unmodified peptide versus a total of 250 for various modified forms, but this could be misleading because methylation is not favourable towards CID fragmentation. Even so, it seems clear that methylation is far from complete, probably because of the steric issues identified in the paper. The authors also observe that cleavage occurs readily after a methylated or dimethylated lysine.

The searches described in the Graz paper all use strict tryptic specificity. When searched with semiTrypsin, R.LGEHNIDVLEGNEQFINAAK.I exhibits a near complete family of C-terminal "ragged ends", going down to R.LGEHNIDVLEGN.E. The most abundant appears to be R.LGEHNIDVLEGNEQFINAA.K, which is represented by 375 spectra. Whether this truncation occurs in solution or in the ion source is hard to say.
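The family of C-terminal ragged ends can be enumerated mechanically. The following is a minimal sketch, not part of the Graz paper or of Mascot; the function name and its arguments are illustrative assumptions. It writes each truncated form in the same flanking-residue notation used above, taking the following residue from the parent peptide itself.

```python
def c_terminal_ragged_ends(peptide, following, shortest, preceding="R"):
    """Return C-terminally truncated forms of `peptide`, longest first,
    written as preceding.SEQUENCE.following (hypothetical helper)."""
    forms = []
    for end in range(len(peptide), len(shortest) - 1, -1):
        sequence = peptide[:end]
        # The residue after a truncated form is the next residue of the parent;
        # only the full-length peptide needs the externally supplied residue.
        next_residue = peptide[end] if end < len(peptide) else following
        forms.append(f"{preceding}.{sequence}.{next_residue}")
    return forms


PARENT = "LGEHNIDVLEGNEQFINAAK"
ladder = c_terminal_ragged_ends(PARENT, following="I", shortest="LGEHNIDVLEGN")
for form in ladder:
    print(form)
```

The list starts with the fully tryptic peptide and ends with R.LGEHNIDVLEGN.E, nine forms in all if every member of the family is present.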

The reason for including the N-term variable mods was that these gave strong matches in the error tolerant search. These are not protein terminus modifications, so they must be post-digest artefacts. Carbamidomethyl (N-term) is very common, and could be due to residual iodoacetamide, but why do we see Methyl (N-term) and Dimethyl (N-term)? The most likely explanation is autolysis prior to or during methylation.

Clearly, there are many peptides that would be missed in a vanilla search. At 1% FDR (and refining with machine learning enabled), the counts of PSMs and distinct sequences for the semi-tryptic search with multiple varmods are 3482 and 827 compared with 2267 and 565 for a search with strict trypsin and Carbamidomethyl (N-term) as the only variable mod.
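To put the difference in proportion, the relative gain can be worked out from the four counts quoted above. This is a simple arithmetic sketch; the dictionary keys and the helper name are illustrative only.

```python
# Counts at 1% FDR, as quoted in the text.
semi_tryptic = {"psms": 3482, "sequences": 827}   # semiTrypsin, multiple varmods
strict = {"psms": 2267, "sequences": 565}         # strict trypsin, one varmod


def percent_gain(new, old):
    """Percentage increase of `new` relative to `old`."""
    return 100.0 * (new - old) / old


psm_gain = percent_gain(semi_tryptic["psms"], strict["psms"])
sequence_gain = percent_gain(semi_tryptic["sequences"], strict["sequences"])
print(f"PSMs: +{psm_gain:.1f}%, distinct sequences: +{sequence_gain:.1f}%")
```

On these figures, the wider search yields roughly half as many matches again as the vanilla search.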

Including autolysis products in routine searches

The Graz paper advocates editing the sequence of trypsin in the FASTA file, replacing K with J, and defining J as the mass of dimethylated lysine. Unmodified lysine or mono-methylated lysine can then be matched using J-specific mods, which keeps the overall search space small because only the trypsin sequence contains any J. This is fine as far as it goes, but it doesn’t catch the N-term modifications or the non-specific cleavage. The authors mention another solution: "combine the in silico generated search space with measured spectral libraries from contaminants."
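The K to J substitution is easy to picture with a short script. This is a sketch under stated assumptions, not the authors' own tooling: the function name, the header tag used to pick out the trypsin entry, and the second FASTA entry are hypothetical, and the trypsin sequence shown is only the fragment covering residues 58 to 97 of the porcine table on this page. The mass deltas are simple arithmetic on standard monoisotopic values (lysine residue 128.09496, one methyl group adding CH2 = 14.01565).

```python
LYS_RESIDUE = 128.09496    # monoisotopic residue mass of lysine
METHYL_DELTA = 14.01565    # mass added per methyl group (CH2)

# J is defined as dimethylated lysine.
J_RESIDUE = LYS_RESIDUE + 2 * METHYL_DELTA

# J-specific mods then carry negative deltas to recover the lighter forms.
J_SPECIFIC_DELTAS = {
    "lysine (unmodified)": -2 * METHYL_DELTA,
    "methyl-lysine": -METHYL_DELTA,
}


def replace_k_with_j(fasta_text, header_tag):
    """Replace K with J only in entries whose header contains `header_tag`."""
    edited = []
    in_target = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            in_target = header_tag in line
            edited.append(line)
        elif in_target:
            edited.append(line.replace("K", "J"))
        else:
            edited.append(line)
    return "\n".join(edited)


EXAMPLE_FASTA = """>TRYP_PIG fragment, residues 58-97 (illustration only)
LGEHNIDVLEGNEQFINAAKIITHPNFNGNTLDNDIMLIK
>HYPOTHETICAL_PROTEIN
MKTAYIAKQR"""

edited = replace_k_with_j(EXAMPLE_FASTA, "TRYP_PIG")
print(edited)
print(f"J residue mass: {J_RESIDUE:.5f}")
```

Only the trypsin entry is altered, which is why the search space stays small: no other protein can match a J-specific modification.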

This is a far more powerful option, since it allows any number of modified and non-specific peptides from any number of contaminants to be intercepted with no increase in the search space. It is easy to create a library from search results with Mascot Server.

Autolysis and Peptide Mass Fingerprint searches

Low-level digests can be dominated by autolysis peaks. The peptide masses (neutral, Mr values) for limit digests of bovine and porcine trypsin are listed below. It is worth screening experimental data for both species, since the labelling of commercial material is not always reliable (recognised as long ago as Vestling, 1990).

The peaks from porcine trypsin at 841.50 and 2210.10 are often used in MALDI for internal mass calibration. Other peaks which have been observed by MALDI include 514.32, 1044.56, 2282.17, and 2298.17 (2282.17 with oxidised Met) (Parker, 1998).
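Screening a peak list for these masses is straightforward. The sketch below assumes singly protonated MALDI ions, so each neutral Mr value is converted to [M+H]+ by adding the mass of a proton; the function name, the 0.05 Da tolerance and the observed peak list are hypothetical choices made for illustration.

```python
PROTON = 1.007276  # mass of a proton, added to Mr to give [M+H]+

# Porcine autolysis peaks quoted above (neutral Mr values).
PORCINE_MR = {
    514.32: "IQVR",
    841.50: "VATVSLPR",
    1044.56: "LSSPATLNSR",
    2210.10: "LGEHNIDVLEGNEQFINAAK",
    2282.17: "IITHPNFNGNTLDNDIMLIK",
    2298.17: "IITHPNFNGNTLDNDIMLIK (oxidised Met)",
}


def flag_autolysis_peaks(observed_mz, reference_mr, tol_da=0.05):
    """Return (observed m/z, reference Mr, label) for each peak that matches
    a reference [M+H]+ value within `tol_da`."""
    hits = []
    for mz in observed_mz:
        for mr, label in reference_mr.items():
            if abs(mz - (mr + PROTON)) <= tol_da:
                hits.append((mz, mr, label))
    return hits


# Hypothetical singly charged peak list, for illustration only.
observed = [842.51, 1045.57, 1234.60, 2211.10, 2500.00]
hits = flag_autolysis_peaks(observed, PORCINE_MR)
for mz, mr, label in hits:
    print(f"{mz:.2f} matches Mr {mr:.2f} ({label})")
```

A suitable tolerance depends on the instrument and calibration, so the default here should be treated as a placeholder. The same function can be run against the bovine values to check the species.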

Porcine trypsin (residues numbered after TRYP_PIG)
The entry marked with an asterisk (*) is for the variant protein I20 -> V:
From  To   Mono.    Avg.     Sequence
52    53   261.14   261.28   SR
54    57   514.32   514.63   IQVR
108   115  841.50   842.01   VATVSLPR
209   216  905.50   906.05   NKPGVYTK
148   157  1005.48  1006.15  APVLSDSSCK
98    107  1044.56  1045.16  LSSPATLNSR
134   147  1468.72  1469.68  SSGSSYPSLLQCLK
217   231  1735.84  1736.97  VCNYVNWIQQTIAAN
116   133  1767.79  1768.99  SCAAAGTECLISGWGNTK
158   178  2157.02  2158.48  SSYPGQITGNMICVGFLEGGK
58    77   2210.10  2211.42  LGEHNIDVLEGNEQFINAAK
78    97   2282.17  2283.63  IITHPNFNGNTLDNDIMLIK
179   208  3012.32  3014.33  DSCQGDSGG…SWGYGCAQK
9     51   4474.09  4477.04  IVGGYTCAA…VVSAAHCYK *
9     51   4488.11  4491.07  IVGGYTCAA…VVSAAHCYK

Bovine trypsin (residues numbered after TRY1_BOVIN)
From  To   Mono.    Avg.     Sequence
110   111  259.19   259.35   LK
157   159  362.20   362.49   CLK
238   243  632.31   632.67   QTIASN
64    69   658.38   658.76   SGIQVR
112   119  804.41   804.86   SAASLNSR
221   228  905.50   906.05   NKPGVYTK
160   169  1019.50  1020.17  APILSDSSCK
229   237  1110.55  1111.33  VCNYVSWIK
146   156  1152.57  1153.25  SSGTSYPDVLK
207   220  1432.71  1433.65  LQGIVSWGSGCAQK
191   206  1494.61  1495.61  DSCQGDSGGPVVCSGK
70    89   2162.05  2163.33  LGEDNINVVEGNEQFISASK
170   190  2192.99  2194.47  SAYPGQITSNMFCAGYLEGGK
90    109  2272.15  2273.60  SIVHPSYNSNTLNNDIMLIK
120   145  2551.24  2552.91  VASISLPTS…LISGWGNTK
21    63   4550.12  4553.14  IVGGYTCGA…VVSAAHCYK
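The monoisotopic Mr values in these tables can be reproduced from the sequences by summing standard residue masses and adding one water. The sketch below is illustrative only: the function name is an assumption, cysteine is treated as unmodified (which is what the tabulated values for the cysteine-containing peptides correspond to), and peptides shown with an elided sequence (…) cannot be checked this way.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def monoisotopic_mr(sequence):
    """Neutral monoisotopic mass (Mr) of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


for peptide in ("VATVSLPR", "LGEHNIDVLEGNEQFINAAK", "APVLSDSSCK", "SGIQVR"):
    print(f"{peptide}: {monoisotopic_mr(peptide):.2f}")

# An I -> V substitution lowers the mass by one CH2.
print(f"I minus V: {RESIDUE_MASS['I'] - RESIDUE_MASS['V']:.2f}")
```

The last line accounts for the difference of about 14.02 between the two porcine entries for residues 9 to 51.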