Human Proteome Project data interpretation guidelines

The Human Proteome Project (HPP) data interpretation guidelines 3.0 are good practice and common sense in any proteomics study where reliable protein identification is critical, not just when studying the human proteome. The guidelines are easy to meet using Mascot Server.

Core guidelines

The full list consists of 9 guidelines. The first one is: complete the checklist! Guidelines 2-4 apply to every study.

2a. Deposit all MS proteomics data to a ProteomeXchange repository as a complete submission.

A complete submission means raw data, peak lists and search results in a standard format. There’s not much to say about raw data files. For the peak lists, best practice is to store a copy of the original peak lists alongside the raw files. If you reanalyse the data, this makes it easy to confirm your peak picking is working the same as before, and it ensures the peptide matches in exported search results map to the correct peak lists.
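One way to confirm that peak picking behaves the same after a reanalysis is to compare the stored peak list with the new one, spectrum by spectrum. The following is a minimal sketch, assuming both files are plain MGF with BEGIN IONS / END IONS blocks and a TITLE line per spectrum. The function names are illustrative and not part of any Mascot tool.

```python
def parse_mgf(text):
    """Return {title: tuple of (m/z, intensity) pairs} for each spectrum in MGF text."""
    spectra = {}
    inside, title, peaks, index = False, None, [], 0
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            index += 1
            # Fall back to a positional name if the spectrum has no TITLE line.
            inside, title, peaks = True, f"#{index}", []
        elif line == "END IONS":
            if inside:
                spectra[title] = tuple(peaks)
            inside = False
        elif inside and line.startswith("TITLE="):
            title = line[len("TITLE="):]
        elif inside and line and line[0].isdigit():
            fields = line.split()
            if len(fields) >= 2:
                peaks.append((float(fields[0]), float(fields[1])))
    return spectra


def changed_spectra(old_text, new_text):
    """Titles of spectra that are missing from one file or whose peaks differ."""
    old, new = parse_mgf(old_text), parse_mgf(new_text)
    return sorted(t for t in old.keys() | new.keys() if old.get(t) != new.get(t))
```

An empty list from changed_spectra suggests the two peak lists contain the same peaks. Header lines outside the spectrum blocks, such as the printed peak picking parameters, are ignored by this sketch.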

You can save the MGF file in Mascot Distiller at any time from File → Save peak list as. For convenience, the peak picking parameters are printed at the top of the MGF file. If you’re using Distiller (or another tool) as a data import filter in Mascot Daemon, the MGF file for each task is saved in C:\Program Data\Matrix Science\Mascot Daemon\MGF.

Some search submission tools and peak picking programs make it hard or impossible to save a copy of the peak list file. In this case, export the results as MGF using the Mascot export form.

Search results can be exported from Mascot in several formats, including the PSI standards mzIdentML and mzTab.

2b. Include analysis reference files (search database, spectral library, transition list, etc.) in submission.

The Mascot results file (F001234.dat) records all relevant parameters and metadata, including the version and filepath of FASTA databases or spectral libraries. When the results are exported, as much of this metadata is included as the target file format supports. This makes it trivial to see exactly which database was used.

The results file doesn’t store protein sequences. If you need to include the FASTA file or spectral library in the repository submission, these must be copied from the Mascot server. You’ll need access to the Mascot Server desktop or filesystem, for example through a Windows network share or scp/sftp on Linux. The path to the current file for each database is shown in both Database Status and Database Manager. If the database was recently updated, the previous version is in the old directory.

The HPP guideline is a bit unclear, though. It’s definitely good practice to include custom or purpose-built databases or spectral libraries in the repository submission. But does it apply to publicly available databases like SwissProt, NCBI nr or neXtProt? What would be the purpose of duplicating the gigantic nr FASTA file in each submission?

3. Use the most recent version of the neXtProt reference proteome for all informatics analyses.

Database Manager makes it trivial to get the latest neXtProt. It’s supplied as a predefined definition, so you can simply enable the database through Database Manager. Updating the database is as simple as clicking a button.

4a. Describe in detail the calculation of FDRs at the PSM, peptide, and protein levels.

When you submit the search as an automatic decoy search, Mascot estimates FDR at PSM, peptide (sequence) and protein level.

If you refine results with machine learning, you should state that Percolator was used for rescoring. The Percolator features are recorded in the metadata section of the exported file.

4b. Report the PSM-, peptide-, and protein-level FDR values along with the total number of expected false positives at each level, using precision appropriate to the uncertainty in computed FDR.

The counts and FDRs are displayed in the Sensitivity and FDR section of Protein Family Summary. The values are also exported in CSV, XML, mzIdentML and mzTab formats. The export form has a dropdown choice for the peptide-level FDR (PSM or sequence FDR), but we recommend using sequence FDR.
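The arithmetic behind guideline 4b can be sketched as follows. This assumes the simple target-decoy estimate (FDR = decoy hits / target hits) and a rough Poisson error on the decoy count. Both are illustrative assumptions, the counts in the usage note are invented, and the values Mascot reports should be taken from the report itself.

```python
import math


def fdr_summary(target_hits, decoy_hits):
    """Simple target-decoy estimate at one level (PSM, peptide or protein)."""
    if target_hits <= 0:
        raise ValueError("need at least one target hit")
    fdr = decoy_hits / target_hits
    # Rough Poisson standard error on the decoy count, propagated to the FDR.
    std_err = math.sqrt(decoy_hits) / target_hits
    return {
        "fdr": fdr,
        # Under this estimate, the decoy count is the expected number of
        # false positives among the target hits.
        "expected_false_positives": decoy_hits,
        "std_err": std_err,
    }


def format_fdr_percent(fdr, std_err):
    """Format FDR as a percentage, keeping decimals down to the leading digit of the error."""
    err_pct = std_err * 100.0
    if err_pct <= 0:
        return f"{fdr * 100.0:.1f}%"
    decimals = max(0, -math.floor(math.log10(err_pct)))
    return f"{fdr * 100.0:.{decimals}f}%"
```

For example, with invented counts of 12000 target and 108 decoy matches, fdr_summary(12000, 108) gives an FDR of 0.009 with 108 expected false positives, and format_fdr_percent limits the reported value to two decimal places because the uncertainty is a little under 0.1 percentage points.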

4c. Present large-scale results thresholded at equal to or lower than 1% protein-level global FDR.

Protein FDR can be controlled by adjusting the PSM significance threshold and the minimum number of significant unique sequences required for a protein hit. Getting to a specific protein FDR can be tricky, because you need to be very clear what it is you’re thresholding and why. See Creating a list of confidently identified proteins.

New PE1 protein detections

Guidelines 5-9 are the criteria for evidence of new protein detection in the human proteome. Specifically, this means promoting an existing neXtProt entry to PE1 status or proposing a new protein for inclusion in neXtProt. Guidelines relevant to DDA data are highlighted below.

5a. If using DDA mass spectrometry for such claims, present high mass-accuracy, high signal-to-noise ratio (SNR), and clearly annotated spectra.

A high mass accuracy instrument coupled with good peak picking (like Mascot Distiller) will get you far. Once you have the search results report, click on the query number and the match details will open in Peptide View. The MS/MS spectrum is displayed in the interactive Spectrum Viewer along with annotations of matched peaks.

The annotated graphic can be exported in SVG format and loaded in an image editor, like Inkscape or LibreOffice Draw, for further editing to produce a publication quality image. The SVG export button is in the lower left corner.

8. Even when very high confidence peptide identifications are demonstrated, consider alternate mappings of the peptide to proteins other than the claimed one. Consider isobaric sequence/mass modification variants, all known SAAVs, and unreported SAAVs.

The Protein Family Summary report clusters protein hits based on shared peptide matches. The algorithm helps you consider alternate mappings in three ways: proteins with similar sequences are likely to cluster in the same family; proteins with identical peptide matches are grouped together as sameset proteins; and proteins that could be in the sample, but for which there is no direct evidence, are reported as subset proteins. For example, if a protein hit has a number of sameset proteins, any one of them could be in the sample, not just the “anchor” protein, and there’s no peptide evidence to differentiate between them.

There are cases where a spectrum matches two (or more) different sequences with the same score, for example if they only differ by I vs L, or GG vs N, or Q vs K. In this case, clustering will follow both sequences, so two family members could be separated only by an I/L difference. Protein Family Summary highlights sequence ambiguity with an asterisk in the rank column to make these cases easier to spot.
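The reason these substitutions are hard to tell apart is their mass. A small worked example, using standard monoisotopic residue masses rounded to five decimals (the helper names and the tolerance values are illustrative assumptions):

```python
# Monoisotopic residue masses in Da, rounded to 5 decimal places.
RESIDUE_MASS = {
    "G": 57.02146,
    "N": 114.04293,
    "I": 113.08406,
    "L": 113.08406,
    "Q": 128.05858,
    "K": 128.09496,
}


def residue_mass(seq):
    """Sum of residue masses for a short sequence fragment."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def mass_difference(a, b):
    """Absolute mass difference in Da between two sequence fragments."""
    return abs(residue_mass(a) - residue_mass(b))


def indistinguishable(a, b, tol_da):
    """True if the two fragments differ by no more than the given tolerance."""
    return mass_difference(a, b) <= tol_da
```

With these values, I and L are identical in mass, GG and N agree to within rounding, and Q and K differ by about 0.036 Da. Q and K are therefore separable only when the mass tolerance is tighter than that difference: indistinguishable("Q", "K", 0.5) is true, while indistinguishable("Q", "K", 0.01) is false.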

Whether a sequence is shared or unique, it’s always in the context of the databases chosen for searching. You should always include a contaminants database and any background proteome in the search space to ensure everything is accounted for. Searching a UniProt proteome with isoforms will expose cases where identified peptides map to two different splice variants.

Some unexpected modifications and single amino acid variants (SAAVs) can be discovered by running an automatic error tolerant search. In addition to trying all known variable modifications, the error tolerant pass also tests for the set of all possible amino acid substitutions. There is a limitation: error tolerant matches are not used in protein inference, so if you get a match with a SAAV, it will be reported under the protein containing the original peptide sequence. Some detective work is required to determine which protein could have originated the modified sequence.

9. Support such claims by two or more distinct uniquely-mapping, non-nested peptide sequences of length ≥9 amino acids with the above evidence in the same paper.

The last guideline is the good old rule of thumb concerning one-hit wonders. Distinct, uniquely mapping peptides are marked with a U in Protein Family Summary, so it’s little work to confirm whether a protein has two unique non-nested peptide sequence matches.
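If you export the uniquely mapping sequences for a protein, the check can also be scripted. This is a minimal sketch, assuming the input list contains only peptides already flagged as unique, and treating "nested" as one sequence being a substring of the other. The function name is hypothetical.

```python
from itertools import combinations


def meets_two_peptide_rule(unique_peptides, min_len=9):
    """True if at least two distinct, non-nested sequences of length >= min_len exist.

    unique_peptides: sequences already known to map uniquely to the protein.
    """
    # Drop short sequences and collapse duplicates (case-insensitive).
    candidates = sorted({p.upper() for p in unique_peptides if len(p) >= min_len})
    # A pair qualifies when neither sequence is contained in the other.
    return any(a not in b and b not in a for a, b in combinations(candidates, 2))
```

For example, LVNELTEFAK and LVNELTEFAKTCVADESHAGCEK do not satisfy the rule on their own, because the first is nested in the second, whereas LVNELTEFAK and YICDNQDTISSK do.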

By default, protein family clustering considers peptide matches whose sequence has at least 7 residues. It may be interesting to change the clustering to require a longer sequence length. Either change MinPepLenInPepSummary to 9 in mascot.dat, or paste the string &_minpeplen=9 at the end of the report URL. Now, two family members are only separated if each one has a unique peptide match with sequence at least 9 residues long, which may help in seeing the wood for the trees.
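If you build report links in a script, the URL parameter can be added programmatically instead of pasted by hand. A minimal sketch using the Python standard library follows. The helper name and the example URL are placeholders, and only the _minpeplen parameter comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_min_pep_len(report_url, min_len=9):
    """Return report_url with _minpeplen set, replacing any existing value."""
    parts = urlsplit(report_url)
    # Keep every existing parameter except a previous _minpeplen.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_minpeplen"
    ]
    query.append(("_minpeplen", str(min_len)))
    # Leave "/" unescaped so file paths in the query string stay readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe="/")))
```

Calling with_min_pep_len("http://example.org/report?file=F001234.dat") appends &_minpeplen=9 to the placeholder URL. Calling it again with a different length replaces the value rather than adding a second copy.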