Identify more HLA peptides
Endogenous peptides are challenging to identify by database searching. A Mascot no-enzyme search matches every subsequence of a protein to the observed spectrum, which creates a very large search space even when the precursor tolerance is tight. As a result, Mascot score thresholds tend to be conservative and sensitivity is reduced.
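To get a feel for the size difference, here is a rough sketch that counts candidate peptides for one protein under a no-enzyme rule and under a simplified tryptic rule. The length window of 8 to 15 residues, the function names and the toy sequence are all assumptions for illustration; this is not how Mascot enumerates candidates internally.

```python
import re

def no_enzyme_candidates(sequence, min_len=8, max_len=15):
    """Count contiguous subsequences whose length is within the window.

    The 8-15 window is an illustrative assumption, not a Mascot setting.
    """
    n = len(sequence)
    return sum(max(n - k + 1, 0) for k in range(min_len, max_len + 1))

def tryptic_candidates(sequence, min_len=8, max_len=15):
    """Count peptides from a simplified tryptic digest.

    Simplification: cleave after every K or R, no missed cleavages,
    no proline exception.
    """
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)
    return sum(1 for p in pieces if min_len <= len(p) <= max_len)

if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    toy_protein = "MSTNPKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERSQPRGRRQPIPKAR"
    print("no enzyme:", no_enzyme_candidates(toy_protein))
    print("tryptic:  ", tryptic_candidates(toy_protein))
```

The no-enzyme count grows with every extra residue and every extra allowed length, whereas the tryptic count is bounded by the number of cleavage sites.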
Mascot ships with Percolator, which often improves discrimination between true and false matches in difficult data sets. Mascot Server 2.8 has a new set of Percolator features. The default feature set was ‘curated’ to work well with many kinds of data, but there is a particularly big improvement with endogenous peptides. Other improvements: Percolator has been updated to the latest version; you can specify a target FDR when Percolator is enabled; and new options control whether matches of rank 2, 3 or below are considered when training Percolator.
Library of synthetic HLA peptides
A recent paper in Molecular & Cellular Proteomics, The choice of search engine affects sequencing depth and HLA class I allele-specific peptide repertoires, investigated identification rates for HLA peptides. The authors compared four search engines using several immunopeptidomic data sets from PRIDE. Additionally, they synthesized a library of 2,000 peptides covering four common HLA alleles. The library is very useful for benchmarking purposes and the raw files can be found in the PRIDE project PXD025655. The sample was run on an Orbitrap with various scan settings.
We downloaded the nine raw files and created a merged peak list in Mascot Distiller. The data were searched against the human proteome with decoy searching enabled, enzyme None, precursor tolerance 5 ppm and fragment tolerance 10 ppm. The sample was alkylated with iodoacetamide, so Carbamidomethyl (C) was chosen as a fixed modification and Oxidation (M) as a variable modification.
The table below summarises the count of target PSMs and the number of unique sequences, as reported by Protein Family Summary. The search results are the same between Mascot 2.7 and 2.8; the difference is the Percolator feature set and Percolator version.
Configuration | Target PSMs | PSM FDR | Sequences | Sequence FDR |
---|---|---|---|---|
Mascot 2.7 | 8669 | 0.99% | 1105 | 2.71% |
Mascot 2.7 + Percolator | 22403 | 1.02% | 1929 | 4.67% |
Mascot 2.7 + Percolator, RT enabled | 22602 | 0.92% | 1928 | 4.25% |
Mascot 2.8 | 8669 | 0.99% | 1105 | 2.71% |
Mascot 2.8 + Percolator | 31744 | 1.00% | 2349 | 4.21% |
Mascot 2.8 + Percolator, RT enabled | 36338 | 1.00% | 2496 | 4.53% |
Using Percolator in Mascot 2.7 increases the target PSM count about 2.6x (from 8669 to 22403) at 1% PSM FDR and also increases the number of identified peptide sequences by 75%. Not bad, but compare to Mascot 2.8: enabling Percolator now gives about a 3.7x increase in target PSMs and over 2x increase in identified peptides.
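The multipliers quoted here follow directly from the table. As a quick check, this short sketch divides each percolated count by the corresponding count without Percolator; the variable and function names are ours, and the numbers are copied from the table above.

```python
# Counts copied from the results table: (target PSMs, unique sequences).
BASELINE = (8669, 1105)  # same for Mascot 2.7 and 2.8 without Percolator
PERCOLATED = {
    "Mascot 2.7 + Percolator": (22403, 1929),
    "Mascot 2.7 + Percolator, RT enabled": (22602, 1928),
    "Mascot 2.8 + Percolator": (31744, 2349),
    "Mascot 2.8 + Percolator, RT enabled": (36338, 2496),
}

def fold_changes(counts, baseline=BASELINE):
    """Return (PSM fold change, sequence fold change) relative to baseline."""
    return tuple(c / b for c, b in zip(counts, baseline))

for name, counts in PERCOLATED.items():
    psm_fold, seq_fold = fold_changes(counts)
    print(f"{name}: {psm_fold:.2f}x PSMs, {seq_fold:.2f}x sequences")
```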
Enabling the Percolator retention time feature boosts the counts further. In Mascot 2.7, the RT feature made little difference in this data set. Percolator RT is disabled by default, because it can cause Percolator to run very slowly and we’ve found it doesn’t provide much improvement in most data sets. Endogenous peptides are the exception. You can enable the feature per dataset by adding percolate_rt=1 to the report URL.
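If you open many reports, you may prefer to add the parameter programmatically. The following is a minimal sketch using the Python standard library; the helper name and the example URL are hypothetical placeholders, so substitute the report URL from your own Mascot Server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_percolator_rt(report_url):
    """Return report_url with percolate_rt=1 set exactly once in the query."""
    parts = urlsplit(report_url)
    # Keep existing parameters, dropping any earlier percolate_rt value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "percolate_rt"
    ]
    query.append(("percolate_rt", "1"))
    # safe="/" leaves path-like parameter values readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe="/")))

if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(with_percolator_rt("https://mascot.example.org/report?file=F000001.dat"))
```

Removing any existing percolate_rt value before appending makes the helper safe to apply to a URL that already has the parameter.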
Validating Percolator + RT results
The peptide library contained 2,000 peptides, so how can Mascot identify more sequences? The authors puzzled over the same question with other search engines, and it turns out that “a high proportion of these sequences were subsequences of the target peptide sequences and that they were overall shorter and lower in abundance”. Figure 4 in the paper summarises the situation. Histogram 4A counts the number and proportion of matches to library sequences, subsequences and “other” (peptides not present in the library).
We exported the percolated Mascot search results as CSV and made a list of unique sequences with a significant rank 1 match. Comparing the list to the spreadsheet mmc2.csv in the paper’s supplementary information, the numbers corresponding to histogram 4A are: 2485 sequences identified, 61% are Library, 32% Sub-sequence and 7% Other. This compares favourably with the other search engines. Interestingly, the sequence FDR estimate from Mascot is similar to the proportion of Other sequences.
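The comparison can be sketched in a few lines of Python. This is an illustration rather than the exact script we used: it assumes both lists have already been loaded as plain sequence strings, it treats “Sub-sequence” as a contiguous substring of any library peptide, and the function names and toy peptides are made up.

```python
from collections import Counter

CATEGORIES = ("Library", "Sub-sequence", "Other")

def classify(sequence, library):
    """Assign an identified sequence to one of the three categories."""
    if sequence in library:
        return "Library"
    # Assumption: a sub-sequence is a contiguous substring of a library peptide.
    if any(sequence in lib_peptide for lib_peptide in library):
        return "Sub-sequence"
    return "Other"

def summarise(identified, library):
    """Return {category: (count, proportion)} over unique identified sequences."""
    library = set(library)
    counts = Counter(classify(seq, library) for seq in set(identified))
    total = sum(counts.values())
    return {
        cat: (counts[cat], counts[cat] / total if total else 0.0)
        for cat in CATEGORIES
    }

if __name__ == "__main__":
    # Toy data, for illustration only.
    toy_library = ["ACDEFGHIK", "LMNPQRSTV"]
    toy_identified = ["ACDEFGHIK", "CDEFGHIK", "WWWWWWWW", "ACDEFGHIK"]
    for cat, (n, frac) in summarise(toy_identified, toy_library).items():
        print(f"{cat}: {n} ({frac:.0%})")
```

The substring test is a linear scan over the library for each sequence, which is fine for a 2,000 peptide library and a few thousand identifications.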
The count of unique sequences in the CSV file, 2485, differs slightly from the count in Protein Family Summary, 2496. The 11 additional sequences are actually in the unassigned list, with Percolated score 13.0 or slightly above 13 and score threshold exactly 13. They can be included in the count by ticking Unassigned queries in the export form and sorting the unassigned table by pep_rank (ascending) and pep_score (descending). In Protein Family Summary, the matches should either not be counted towards the number of unique sequences, or they should not be put in the unassigned list. The discrepancy will be fixed in a future version of Mascot.
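If you post-process the export in a script instead, the same ordering and filter might look like the sketch below. It assumes the unassigned table has already been parsed into dictionaries keyed by column name; pep_rank and pep_score are the columns mentioned above, while pep_seq, the function name and the toy rows are assumptions for illustration.

```python
def significant_rank1_unassigned(rows, threshold=13.0):
    """Sort by pep_rank ascending, pep_score descending, then keep
    rank 1 matches whose score is at or above the threshold."""
    ordered = sorted(
        rows,
        key=lambda row: (int(row["pep_rank"]), -float(row["pep_score"])),
    )
    return [
        row
        for row in ordered
        if int(row["pep_rank"]) == 1 and float(row["pep_score"]) >= threshold
    ]

if __name__ == "__main__":
    # Toy rows, for illustration only; values are strings as in a CSV file.
    toy_rows = [
        {"pep_rank": "1", "pep_score": "12.4", "pep_seq": "AAAAAAAAA"},
        {"pep_rank": "2", "pep_score": "15.0", "pep_seq": "CCCCCCCCC"},
        {"pep_rank": "1", "pep_score": "13.0", "pep_seq": "DDDDDDDDD"},
        {"pep_rank": "1", "pep_score": "13.2", "pep_seq": "EEEEEEEEE"},
    ]
    extra = {row["pep_seq"] for row in significant_rank1_unassigned(toy_rows)}
    print(len(extra), "additional unique sequences")
```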
For Histogram 4B, we counted the number of library peptides with a significant match. Mascot finds 78% of Observed peptides and 74% of Predicted. Observed refers to the 1,000 peptide sequences selected from IEDB that have been previously identified in mass spectrometry experiments. The 1,000 Predicted sequences originate from the same source proteins and were predicted by NetMHCpan to bind to the same HLA alleles.
For Histogram 4D, the fractions of identified library peptides by allele are:
Allele | Fraction of library peptides identified | Expected |
---|---|---|
HLA-A*02:01 | 0.199 | 0.25 |
HLA-A*03:01 | 0.276 | 0.25 |
HLA-B*07:02 | 0.242 | 0.25 |
HLA-B*44:02 | 0.281 | 0.25 |
The fractions are fairly evenly balanced. Mascot + Percolator identify proportionally fewer hydrophobic A*02:01 peptides, but this was common with other search engines too. Overall the new Percolator features in Mascot 2.8 are a big improvement to HLA peptide identification without introducing bias to the results.
Keywords: benchmark, endogenous, FDR, hla, Percolator