Validating intact crosslinked peptide matches

Intact crosslinked search results are more complex than conventional (non-crosslinked) search results, because there are many more degrees of freedom. The precursor mass could be within tolerance of a looplinked sequence, a linear sequence with a monolink, and several different alpha-beta candidates. Each possibility is multiplied if you also consider variable modifications such as oxidation of methionine.

Mascot uses the same scoring system for linear, looplinked and crosslinked matches, so the scores are directly comparable within a query. This makes it straightforward to compare and rank the candidate matches. The strategy for maximising sensitivity while controlling for false positives is:

Make sure the search space is complete
Only include the necessary protein sequences in crosslinking, no more
Always look at the fragmentation and the alternatives other than the rank 1 match

Illustration

Beveridge et al. (Nature Communications 11, 742 (2020)) provide a synthetic peptide library for benchmarking crosslinking-mass spectrometry search engines for proteins and protein complexes. The authors divided Cas9 tryptic peptides into 12 groups, crosslinked each group with DSS and combined the groups before MS/MS analysis. An intact crosslink between peptides from different groups is known to be incorrect, which gives a ground truth for false positives. This allows calculating the “true” false discovery rate (FDR) – the proportion of significant matches that are incorrect. Crosslink and monolink formation was also prevented at the peptide N-terminus and at C-terminal lysine.

Peak picking: After downloading DSS replicate 1 from the PRIDE project PXD014337, the raw file was peak picked in Mascot Distiller. It’s always best to decharge fragment masses when analysing intact crosslinks, because the precursor can be 3+ or higher. Distiller decharging only works with profile data. In this file, the MS/MS spectra are centroided, so start from default.ThermoXcalibur.opt and change the settings to:

In MS/MS Processing, set Transform to profile data
Peak half width: 0.1
Data points per Da: 100
Preferred type: Profile
Use precursor charge as maximum
Default charge range: 2 to 4
Re-determine precursor m/z value(s) where possible
In Preferences, choose to output fragment masses as MH+

Database and crosslinking method: The Cas9 protein sequence is available in Cas9_plus10.fasta, which you can find in the Supplementary Data 2 zip file (41467_2020_14608_MOESM4_ESM.zip). This is easy to set up using simple_AA_template in Database Manager, and the database name was set to PXD014337. The crosslinking method is quickest to create by copying the HSA Xlink:DSS method supplied with Mascot and editing accordingly:

Change the name to “PXD014337 Xlink:DSS, Cas9”
Change the protein filter to <mxm:accession DatabaseName="PXD014337">Cas9</mxm:accession>
Remove the Protein N-term specificity from mxm:linkers
Include the A and W monolinks in mxm:linkers
Ensure IntraLink and InterLink are enabled (InterLink allows alpha and beta sequence to be the same)
Set MinLen to 5 (shortest synthetic peptide)

The paper only considered the W (water quenched) monolink, but looking at the sample preparation, there’s no reason why A (ammonia quenched) monolinks couldn’t also form. Both should be included as search parameters.

It’s currently not possible to disallow monolinks and crosslinks at C-term K in the crosslinking method. Such matches must be filtered out after the search.
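The post-search filter mentioned above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes the link position is reported as a 1-based residue index counted from the peptide N-terminus, which you should check against your export format.

```python
from typing import Optional


def link_on_cterm_lys(peptide: str, link_pos: int) -> bool:
    """True if the link sits on a lysine that is the last residue of the peptide.

    Assumption: link_pos is 1-based, counted from the peptide N-terminus.
    """
    return link_pos == len(peptide) and peptide.endswith("K")


def keep_match(alpha: str, alpha_pos: int,
               beta: Optional[str] = None, beta_pos: Optional[int] = None) -> bool:
    """Keep a monolink (beta is None) or crosslink only if no link is on a C-term K."""
    if link_on_cterm_lys(alpha, alpha_pos):
        return False
    if beta is not None and beta_pos is not None:
        return not link_on_cterm_lys(beta, beta_pos)
    return True


# A peptide ending in two lysines: position 11 is acceptable, position 12 is not.
print(keep_match("SDNVPSEEVVKK", 11))
print(keep_match("SDNVPSEEVVKK", 12))
```

A peptide that ends in two lysines shows why the position has to be checked rather than just the sequence: only the final K is blocked in this sample.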

Database search: The MGF file was searched against PXD014337 (Cas9_plus10.fasta) and the contaminants database. Since this is a synthetic sample, there is no background proteome. If there had been, it would be important to include it in the database selection. The rest of the relevant parameters are: Carbamidomethyl (C) as fixed mod, Oxidation (M) as variable mod, enzyme Trypsin, max missed cleavages 1, precursor tolerance 10 ppm, fragment tolerance 20 ppm.

The search results show a good number of monolink matches and intact crosslinks in Cas9, including many strong matches to Xlink:DSS[A]. There are some very nice intact crosslinks, like query 5921.

Initial counts

Calculating the true FDR for significant crosslinked matches can be done by exporting the results in xiVIEW-CSV format and using a simple script to print the rows where the alpha and beta peptides are not in the same group. The group listings are in Supplementary Table 1 in Supplementary Information, and the counts are in the table below.

Search	Significance threshold	Num sig. PSMs	Num sig. CSMs	False positive CSMs	True FDR
Crosslinked Cas9	0.05	946	653	103	15.8%
Crosslinked Cas9	0.01	821	569	77	13.5%
Crosslinked Cas9	0.001	622	436	38	8.7%

Note that the calculated FDR only considers crosslinked matches. About half the significant matches are linear peptides with monolinks. There’s no ground truth in this data set for identifying false positives among them, so it’s not possible to get a full picture.
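The group check described above might look something like the following sketch. Several things here are assumptions rather than facts about the export: the column names PepSeq1 and PepSeq2, the idea that any modification annotation uses non-uppercase characters, and the tiny PEPTIDE_GROUP table, which holds only the four group assignments quoted in this article rather than the full Supplementary Table 1. The example rows are made up for illustration.

```python
import csv
import io
import re

# Partial, illustrative peptide-to-group table. Only the assignments quoted in
# this article are listed; the real table is Supplementary Table 1 of the paper.
PEPTIDE_GROUP = {
    "MDGTEELLVKLNR": 5,
    "QLLNAKLITQR": 5,
    "GKSDNVPSEEVVK": 10,
    "SDNVPSEEVVKK": 11,
}


def bare_sequence(seq):
    """Strip everything except uppercase residue letters.

    Assumption: modification tags in the export never use uppercase letters.
    """
    return re.sub(r"[^A-Z]", "", seq)


def classify(alpha, beta, groups=PEPTIDE_GROUP):
    """Label a crosslinked match by the group membership of its two peptides."""
    ga = groups.get(bare_sequence(alpha))
    gb = groups.get(bare_sequence(beta))
    if ga is None or gb is None:
        return "not in any group"
    return "same group" if ga == gb else "different groups"


def false_positive_rows(csv_text, seq1_col="PepSeq1", seq2_col="PepSeq2"):
    """Return the rows whose alpha and beta peptides are not in the same group."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader
            if classify(row[seq1_col], row[seq2_col]) != "same group"]


# Made-up rows, for illustration only.
EXAMPLE = """PepSeq1,PepSeq2
MDGTEELLVKLNR,QLLNAKLITQR
MDGTEELLVKLNR,TEVQTGGFSK
QLLNAKLITQR,SDNVPSEEVVKK
"""

for row in false_positive_rows(EXAMPLE):
    print(row["PepSeq1"], row["PepSeq2"],
          classify(row["PepSeq1"], row["PepSeq2"]))
```

Separating "not in any group" from "different groups" is useful because the two kinds of false positive are discussed separately below.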

A large proportion of the false positives are cases where the alpha or beta peptide is not in any of the 12 groups. Few of these have any fragmentation from the weaker peptide. The first suspect was the Biotin-tagged YGGGGR “linker” peptide, which was covalently linked to the synthetic peptide N-term to prevent N-term crosslinks. It was subsequently cleaved with trypsin and removed with streptavidin. Perhaps the false matches are actually Biotin-YGGGGR-alpha sequence? However, inspecting the beta masses in a few of the false positives shows the required delta doesn’t match Biotin + YGGGGR.

Counts after ¹³C correction

Looking at the spectra in detail, there is often a strong precursor peak at +1 Da. The same is true of some of the true positives. A 1 Da mass shift can have many causes, but the paper does mention an issue with the instrument sometimes selecting the ¹³C precursor. Distiller should get the ¹²C peak most of the time, but it’s not infallible. Repeating the crosslinked search with Oxidation (M) and #13C=1 increases the number of significant crosslinked matches by about 50% at the 0.05 threshold (653 to 1008), while reducing the FDR at every threshold. This is clearly the right choice and the search space is now complete.
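To see why the precursor tolerance alone cannot absorb a wrongly picked isotope peak, it helps to express the shift in ppm. The sketch below is a back-of-the-envelope calculation, not part of Mascot or Distiller: the function names are made up, and the 2500 Da precursor mass is an arbitrary example value, not taken from the data set.

```python
C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C


def isotope_error_ppm(neutral_mass, n_13c=1):
    """Mass error, in ppm, caused by picking the peak with n_13c extra 13C atoms."""
    return n_13c * C13_C12_DELTA / neutral_mass * 1e6


def within_tolerance(neutral_mass, tol_ppm=10.0, n_13c=1):
    """Would the isotope error still fit inside the precursor tolerance?"""
    return isotope_error_ppm(neutral_mass, n_13c) <= tol_ppm


# Arbitrary example: a crosslinked pair with a neutral mass of 2500 Da.
example_mass = 2500.0
print(f"{isotope_error_ppm(example_mass):.0f} ppm")
print(within_tolerance(example_mass))
```

For a precursor of a few thousand Da the error is hundreds of ppm, far outside the 10 ppm tolerance used here, so the match is only recoverable if the search explicitly allows for the isotope error.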

Search	Significance threshold	Num sig. PSMs	Num sig. CSMs	False positive CSMs	True FDR
Crosslinked Cas9 + 13C	0.05	1315	1008	125	12.4%
Crosslinked Cas9 + 13C	0.01	1070	816	76	9.3%
Crosslinked Cas9 + 13C	0.001	790	599	35	5.8%
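The True FDR column in both tables is simply the number of false positive CSMs divided by the number of significant CSMs. The short sketch below recomputes it from the counts in the two tables and also reports the gain in significant CSMs at each threshold; the variable and function names are illustrative.

```python
# (significant CSMs, false positive CSMs) by significance threshold,
# copied from the two tables in this article.
INITIAL = {0.05: (653, 103), 0.01: (569, 77), 0.001: (436, 38)}
WITH_13C = {0.05: (1008, 125), 0.01: (816, 76), 0.001: (599, 35)}


def true_fdr(num_csms, num_false):
    """Proportion of significant CSMs known to be incorrect."""
    return num_false / num_csms


for threshold in sorted(INITIAL, reverse=True):
    csms_a, fp_a = INITIAL[threshold]
    csms_b, fp_b = WITH_13C[threshold]
    gain = csms_b / csms_a - 1
    print(f"threshold {threshold}: "
          f"FDR {true_fdr(csms_a, fp_a):.1%} -> {true_fdr(csms_b, fp_b):.1%}, "
          f"CSMs +{gain:.0%}")
```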

There are still false positives where alpha or beta is not in one of the 12 groups, but there doesn’t seem to be any clear pattern. Query 5228 alpha (SDNVPSEEVVK) and beta (YKEIFFDQSK) are not in any group, although group 10 has GKSDNVPSEEVVK and group 11 has SDNVPSEEVVKK. The b and y ions say the alpha must be SDNVPSEEVVKK (1329.67765 Da) from group 11, so the beta mass should be 1331.628537 Da. Maybe the true peptide is SDNVPSEEVVKK-SDNVPSEEVVKK, where the beta has a +2 Da mass shift? Putting deamidation on the alpha N3 won’t work, because then you lose all the fragment matches.
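The peptide mass quoted above is easy to check by summing standard monoisotopic residue masses and adding one water. The sketch below does that for SDNVPSEEVVKK and then compares the result with the beta mass given in the text; the function name is illustrative, and the calculation ignores any modifications or linker mass.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565  # Da, added once for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic neutral mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


group11_mass = peptide_mass("SDNVPSEEVVKK")
required_beta_mass = 1331.628537  # Da, the value quoted in the text
shift = required_beta_mass - group11_mass

print(f"SDNVPSEEVVKK: {group11_mass:.5f} Da")
print(f"shift needed on the beta: {shift:.3f} Da")
```

The shift comes out at a little under 2 Da, which is why a single deamidation (about +0.98 Da) would not be enough on its own to explain it.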

Query 4981 (score 97!) has alpha (MDGTEELLVKLNR) from group 5, but beta is from no group. Query 4210 alpha (QLLNAKLITQR) is from group 5 and cannot be a fluke, but beta is from no group. The beta peptide is the same in both cases (TEVQTGGFSK), and there are many duplicate matches. If you have any suggestions as to what is causing these, please leave a comment.

The second kind of false positive, far fewer in number, is where Mascot has chosen alpha and beta from different groups. Getting a few of these is unavoidable with a probabilistic scoring scheme, and it’s of course the reason why the results report provides a control for the significance threshold.

Conversely, there are crosslinked matches where alpha and beta are in the same group, but in rank 1 the intact link is on C-term K. An example is query 5983. Very nice fragmentation from the alpha and some from the beta, but the beta peptide happens to end with two lysines. The rank 1 and 2 matches have the same score and different permutations of the link position. Rank 2 must be the correct match.

The lesson is, always look at the fragmentation and alternative matches to the spectrum even when the match score is high.

Matrix Science
