Protein Family Summary report (MS/MS)

At the completion of an MS/MS search, a summary report is displayed that provides an overview of the results. All reports contain links to more detailed views of the experimental and calculated data.

The default summary report is the Protein Family Summary. This groups the proteins into families based on a novel hierarchical clustering algorithm and presents these results one page at a time, initially with 10 families per page. This report is ideally suited to large and complex MS/MS searches, where it is not practical to display all the results on a single HTML page.

Structure of the report

Use the example TMT search to open an example report in a new browser window or tab. The structure of the report is illustrated in the screenshot below.

The body of the report consists of three tabs, one for protein families, one for Report Builder, and one for unassigned matches.

Earlier versions of Mascot had legacy summary reports originally developed for small searches: the Select Summary and Peptide Summary reports. These reports are still available, but they will be removed in a future Mascot release.

If there are no peptide sequence matches whatsoever from a search of MS/MS data, only molecular weight matches, then a Protein Summary report will be displayed. This indicates that the search has failed. Possibly the spectra are nothing but noise or possibly the search parameters are incorrect in some way.

Most of the information in a Family Summary is structured into sections that can be displayed or hidden. Links that toggle between the two states are easily recognised by the adjacent triangle. The triangle points to the right when the section is collapsed and points down when the section is expanded.

Header

At the top of the report are a few lines to identify the search uniquely: search title, date, user name, etc. The database version is identified with either a release number or an ISO datestamp.

Repeating a search

Below the search header is a button to repeat the search using the same peak lists. This is useful for investigating the effect of changes in search parameters. The choices for repeating a search are set by radio buttons for All queries, Non-significant (below identity threshold), and Unassigned.

Exporting the results

Below the search header is a drop-down list containing available export formats. Select the required format and choose Export. This will take you to the export form.

Search Parameters

The search parameters can be displayed or hidden by clicking on the sub-heading. Descriptions of individual search parameters can be found in search parameter reference.

Score Distribution

Histograms illustrating the peptide and protein score distributions can be displayed or hidden by clicking on the sub-heading.

The left hand histogram shows the peptide score distribution, divided into 16 bins. The heights of the bars show the number of matches in each bin.

Similarly, the right hand histogram shows the 50 highest protein scores. For a search of MS/MS data, protein scores are derived from ions scores as a non-probabilistic basis for ranking protein hits. The protein score histogram has little meaning for MS/MS results, but is retained for historical reasons.
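As a rough illustration of how a 16-bin peptide score histogram can be built, the sketch below counts ions scores in equal-width bins. The equal-width binning between the minimum and maximum score, and the function name bin_scores, are assumptions made for illustration; the report itself only states that the distribution is divided into 16 bins.

```python
def bin_scores(scores, n_bins=16):
    """Count scores in n_bins equal-width bins spanning min(scores)..max(scores)."""
    counts = [0] * n_bins
    if not scores:
        return counts
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins
    for s in scores:
        if width == 0:
            idx = 0  # all scores identical: everything lands in the first bin
        else:
            # The maximum score would fall just past the last bin, so clamp it.
            idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Made-up ions scores, purely for illustration.
scores = [0.0, 4.0, 8.0, 15.9, 16.0, 31.9, 32.0, 64.0]
counts = bin_scores(scores)
for i, n in enumerate(counts):
    print(f"bin {i:2d}: {'#' * n}")
```

The bar heights in the printed output correspond to the number of matches in each bin.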

Modification statistics

The modification statistics table lists the total number of instances of each modification found in significant matches. The columns are as follows:

  1. Modification – name of the modification
  2. Delta – delta mass
  3. Type – fixed, variable (fixed or variable modification), ET (error tolerant modification or semi-specific cleavage), SL (spectral library modification), crosslink (intact crosslink), looplink (intact looplink), monolink
  4. Site – description of the modification site being counted
  5. Total matches – count of instances

For an error tolerant search, the ET counts can be useful in judging whether any unsuspected modifications are so abundant that they should be selected as variable modifications for the first pass search.

Legend

A legend explaining how font style and colour are used to convey information about individual peptide matches can be displayed or hidden by clicking on the sub-heading. Red indicates the top-ranking peptide match for the query. Bold indicates a significant match (score greater than the homology threshold). Italic font is used for a duplicate match (more than one query matches the same peptide sequence with the same modifications and charge state).
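The definition of a duplicate match can be sketched in a few lines of Python. This is only an illustration of the grouping rule (same sequence, same modifications, same charge), not Mascot's implementation; the dictionary keys used here (query, sequence, mods, charge, score) are hypothetical names.

```python
from collections import defaultdict


def group_duplicates(matches):
    """Group matches sharing sequence, modifications and charge.

    Returns one record per group: the highest scoring match as 'primary'
    and the remaining matches, in descending score order, as 'dupes'.
    """
    groups = defaultdict(list)
    for m in matches:
        key = (m["sequence"], m["mods"], m["charge"])
        groups[key].append(m)
    result = []
    for members in groups.values():
        ordered = sorted(members, key=lambda m: m["score"], reverse=True)
        result.append({"primary": ordered[0], "dupes": ordered[1:]})
    return result


# Made-up matches for illustration only.
matches = [
    {"query": 11, "sequence": "FASFIDK", "mods": "", "charge": 2, "score": 41.0},
    {"query": 12, "sequence": "FASFIDK", "mods": "", "charge": 2, "score": 55.0},
    {"query": 13, "sequence": "FASFIDK", "mods": "", "charge": 3, "score": 30.0},
    {"query": 14, "sequence": "FASFIDK", "mods": "", "charge": 2, "score": 20.0},
]
grouped = group_duplicates(matches)
for g in grouped:
    print(g["primary"]["query"], [d["query"] for d in g["dupes"]])
```

In this example queries 11 and 14 are duplicates of query 12, while query 13 stands alone because its charge state differs.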

Format Controls

The format controls enable the report format to be modified. After making changes, press the "Format As" button to reload the report using the new settings.

  • Significance threshold – The default significance threshold is p < 0.05. You can change this to any value in the range 0.99 to 1E-18. However, it is better to set a target FDR.
  • Maximum number of families – The default is AUTO (0), which will display all of the families that have at least one peptide match with a significant score. Enter a positive integer if you wish to re-specify the number of protein families to report. Of course, the total number of families actually found by the search may be less.
  • Target FDR – Target false discovery rate. Mascot automatically adjusts the significance threshold to reach the target FDR. This control is enabled for automatic target-decoy searches. If your search doesn’t contain decoy results, please repeat the search.
  • FDR type – Whether target FDR is for PSM FDR or sequence FDR.
  • Display non-significant matches – When unchecked, only significant matches are displayed. This makes the report shorter and more readable. Check this box to display all matches, even those with low scores that may be false. (This control can be replaced by an edit box that allows you to set an explicit expect value or score cut-off. For more information, search the Installation & Setup manual for DisplayNonSignificantMatches.)
  • Min. number of sig. unique sequences – The default setting is 1, which means a family member could be a ‘one-hit wonder’. Setting a higher value will usually have a dramatic effect on the number of protein hits and the protein false discovery rate.
  • Dendrograms cut at – Causes all dendrograms to be cut at the specified score.
  • Preferred taxonomy – Select a preferred taxonomy when searching a database with a taxonomy index.

Some controls are only displayed for certain types of search.
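To make the idea of a target FDR concrete, the sketch below searches for the most permissive score cut-off at which the estimated FDR does not exceed the target. It is a simplified illustration, not Mascot's algorithm: it assumes the FDR is estimated as the number of decoy matches divided by the number of target matches at or above the cut-off, and it works on scores rather than on the significance threshold. The function names are hypothetical.

```python
def fdr_at(cutoff, target_scores, decoy_scores):
    """Estimated FDR at a score cut-off: decoy matches / target matches."""
    n_target = sum(s >= cutoff for s in target_scores)
    n_decoy = sum(s >= cutoff for s in decoy_scores)
    return n_decoy / n_target if n_target else 0.0


def lowest_cutoff_for_fdr(target_scores, decoy_scores, target_fdr):
    """Return the lowest observed score whose estimated FDR is <= target_fdr."""
    candidates = sorted(set(target_scores) | set(decoy_scores))
    for cutoff in candidates:  # ascending, so the first hit is the most permissive
        if fdr_at(cutoff, target_scores, decoy_scores) <= target_fdr:
            return cutoff
    return None


# Made-up scores from a target search and its decoy search.
targets = [60, 55, 50, 45, 40, 35, 30, 25, 20, 15]
decoys = [32, 22, 18, 12]

cutoff_20pct = lowest_cutoff_for_fdr(targets, decoys, 0.20)
cutoff_10pct = lowest_cutoff_for_fdr(targets, decoys, 0.10)
print(cutoff_20pct, cutoff_10pct)
```

A stricter target FDR pushes the cut-off higher, which is the same trade-off the Target FDR control makes when it adjusts the significance threshold.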

Preferred taxonomy

This control allows you to specify a preferred taxonomy for the anchor protein in cases where there is a choice of indistinguishable proteins.

Imagine we are studying dormice, which are not well represented in any protein database. We choose the broader taxonomy of Rodentia so that we can get matches to homologous proteins from other rodents. But, if a hit contains same-set matches to proteins from rat, mouse and dormouse, we can ensure the dormouse entry will be selected as the anchor protein by specifying Gliridae as the Preferred Taxonomy.

Another situation where Preferred Taxonomy can come in useful is for a database like NCBI nr, where each entry represents multiple proteins. By default, it is always the first protein in the title line that is selected as anchor protein. You might search with a taxonomy filter of dog and pull out an entry for a protein that was found in both cat and dog and happened to have cat listed first. Setting a Preferred Taxonomy of dog will ensure the dog accession and description are selected for display in such cases.

The control is always available except if the result file comes from Mascot 2.3 or earlier. These old results files require that the databases that were used in the search are online. Otherwise the control will be hidden, because there would be no way to retrieve the required taxonomy information.

Note that the default taxonomy list shipped with Mascot is limited to a small number of well characterised organisms, and this doesn’t include either cat or dog. So, for the second example, you would need to edit the file called taxonomy in the Mascot config directory to add the required entries. For example, the categories under mammals in the default file might look like this:

Title:. . . . . . . . . . . . Mammalia (mammals)
Include: 40674
Exclude:
*
Title:. . . . . . . . . . . . . . Primates
Include: 9443
Exclude:
*
Title:. . . . . . . . . . . . . . . . Homo sapiens (human)
Include: 9606
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other primates
Include: 9443
Exclude: 9606
*
Title:. . . . . . . . . . . . . . Rodentia (Rodents)
Include: 9989
Exclude:
*
Title:. . . . . . . . . . . . . . . . Mus.
Include: 10088
Exclude:
*
Title:. . . . . . . . . . . . . . . . . . Mus musculus (house mouse)
Include: 10090
Exclude:
*
Title:. . . . . . . . . . . . . . . . Rattus
Include: 10114
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989
*

To add dog to the list of choices, insert a Canis familiaris entry and add its TaxID (9615) to the Exclude line of Other mammalia:

Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Canis familiaris
Include: 9615
Exclude:
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989, 9615
*
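When editing the taxonomy file by hand it can help to check the result programmatically. The sketch below parses entries in the layout shown above (Title, Include, Exclude, terminated by an asterisk) and tests which entries a lineage falls into. The parsing rules are inferred only from the excerpt above, and the matching rule (at least one Include TaxID in the lineage, no Exclude TaxID in the lineage) is an assumption about how the entries are meant to be read; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TaxonomyEntry:
    title: str
    include: List[int] = field(default_factory=list)
    exclude: List[int] = field(default_factory=list)


def _ids(text: str) -> List[int]:
    """Parse a comma separated list of TaxIDs; an empty string gives []."""
    return [int(tok) for tok in text.split(",") if tok.strip()]


def parse_taxonomy(text: str) -> List[TaxonomyEntry]:
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Title:"):
            # Leading dots and spaces only encode the indentation level.
            current = TaxonomyEntry(line[len("Title:"):].lstrip(". ").strip())
        elif line.startswith("Include:") and current is not None:
            current.include = _ids(line[len("Include:"):])
        elif line.startswith("Exclude:") and current is not None:
            current.exclude = _ids(line[len("Exclude:"):])
        elif line == "*" and current is not None:
            entries.append(current)
            current = None
    return entries


def entry_matches(entry: TaxonomyEntry, lineage: Set[int]) -> bool:
    """Assumed rule: some Include ID in the lineage and no Exclude ID in it."""
    return (any(t in lineage for t in entry.include)
            and not any(t in lineage for t in entry.exclude))


sample = """\
Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Canis familiaris
Include: 9615
Exclude:
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989, 9615
*
"""
entries = parse_taxonomy(sample)
dog_lineage = {9615, 40674}  # partial, illustrative lineage: dog plus Mammalia
hits = [e.title for e in entries if entry_matches(e, dog_lineage)]
print(hits)
```

With 9615 added to the Exclude line of Other mammalia, a dog entry falls only into the new Canis familiaris category under the assumed rule.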

The NCBI Taxonomy Browser is invaluable for looking up TaxID codes and finding where a particular organism fits into the tree of life. It also lists the number of entries in GenBank for each taxonomy, which is a useful way to discover whether a particular taxonomy might be too narrow. Never choose a taxonomy that has fewer than two thousand proteins; move to a higher level so as to search a reasonable number of entries.

UniGene index

Only displayed for a search of a nucleic acid database when UniGene index files have been configured.

Choose the UniGene index to be used to cluster the proteins into gene based families.

Error tolerant matches

Only displayed for an automatic error tolerant search.

A drop-down list can be used to switch between displaying reliable matches (the default), displaying just the standard matches, and displaying all matches regardless of quality.

Refine results with machine learning

Database search results can be optionally refined with machine learning. This is a powerful technique especially with ‘difficult’ data sets like endogenous peptides, very large databases and metaproteomics. The option “Refine results with machine learning” is only shown if the search has more than 750 queries and the database has more than 100 sequences.

When refine results with machine learning is selected, Mascot runs additional steps at the end of the database search. Several metrics (machine learning features) are calculated for each peptide match, such as precursor mass error, charge state, missed cleavages, amount of fragment intensity matched and average MS/MS fragment mass error. These are passed to Percolator, which trains a semi-supervised machine learning model. The model finds an optimal separation between target and decoy matches using all the available features.

Percolator often gives a significant improvement to peptide identifications. This is because it leverages extra context that is unavailable to the database search engine. For example, precursor mass error of incorrect matches is typically randomly distributed, while correct matches cluster around zero. Percolator uses this clustering for separating correct and incorrect matches in a multidimensional space.

Mascot also ships with MS2Rescore, a modular and user-friendly platform for AI-assisted rescoring of peptide identifications. MS2Rescore includes two prediction systems, DeepLC and MS2PIP.

Select a DeepLC model for retention times based on your LC and enzyme. Mascot uses DeepLC to predict the retention times of target and decoy peptide matches based on sequence, fixed/variable modifications and charge state. The difference between observed and predicted retention time is used as a Percolator feature. Predicted retention time provides information about peptides that is orthogonal to the other metrics.

Select an MS2PIP model for spectral similarity based on your instrument type and enzyme. Mascot uses MS2PIP to predict the MS/MS fragmentation spectra of target and decoy peptide matches based on sequence, fixed/variable modifications and charge state. Several correlation metrics are calculated between the observed and predicted spectra and these are used as Percolator features. Spectral correlation typically enhances sensitivity, because it makes greater use of peak intensity than the Mascot ions score.

Quantitation controls

If the search used a quantitation method specifying Multiplex or Reporter protocol, there will be an additional block of quantitation related controls, as described here. Checkboxes can be used to toggle the display of protein and peptide ratios.

If the search included the auto-decoy option, false discovery rate information is displayed in the final section of the header.

Protein Families

Proteins are grouped into families using a novel hierarchical clustering algorithm. If the family contains a single member, the accession string, protein score and description are listed. If the family contains multiple members, the accessions, scores and descriptions are aligned with a dendrogram, which illustrates the degree of similarity between members.

Expanding the family

To see complete information about the proteins in the family, and the peptide matches assigned to each family member, click on the family number link.

Clicking on the accession string link will load a Protein View report. For each of the proteins, beside the protein score and mass, there are counts of the number of matches and the number of distinct sequences. In each column, the first number is the total count while the number in parentheses is the count for matches above the significance threshold. If Display non-sig. matches is deactivated, which is the default, the two numbers will always be the same.

Peptide match table

The peptide match table contains the following columns:

  1. Query number, hyperlinked to Peptide View.
  2. Dupes is a count of the number of additional matches to the same peptide sequence with the same modifications and charge. Click on the triangle link to expand these duplicate matches, which will have the same or lower score than the one first listed.
  3. Experimental m/z value
  4. Experimental m/z transformed to a relative molecular mass
  5. Relative molecular mass calculated from the matched peptide sequence
  6. Difference (error) between the experimental and calculated masses
  7. Number of missed cleavage sites
  8. Ions score – If there are duplicate matches to the same peptide, then the lower scoring matches are shown in brackets.
  9. Expectation value for the peptide match. (The number of times we would expect to obtain an equal or higher score, purely by chance. The lower this value, the more significant the result).
  10. Rank of the peptide match, (1 to 10, where 1 is the best match). If there is a triangle to the left of the rank, the row can be expanded to show the alternative matches for this query.
  11. A letter U if the peptide sequence is unique to one protein family member.
  12. A column is displayed for each family member selected by its checkbox immediately above the table. A marker is shown if the peptide match is present in the protein. Where a column represents more than one protein, and the peptide is found in some of these proteins, but not all, the marker is grey. Otherwise, it is black.
  13. Sequence of the peptide in 1-letter code. The residues that bracket the peptide sequence in the protein are also shown, delimited by periods. If the peptide forms the protein terminus, then a dash is shown instead.
  14. Any variable modifications used to obtain the match
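The relationship between the mass columns can be written out as a short calculation. This sketch assumes positive-ion mode with protonated precursors, so the experimental Mr is the m/z minus one proton mass, multiplied by the charge, and it assumes the error is reported as experimental minus calculated. The function names are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def mz_to_mr(mz, charge):
    """Convert an observed m/z to a relative molecular mass (assumes [M+zH]z+)."""
    return (mz - PROTON_MASS) * charge


def mass_error_da(mr_expt, mr_calc):
    """Mass error in Da, assumed here to be experimental minus calculated."""
    return mr_expt - mr_calc


def mass_error_ppm(mr_expt, mr_calc):
    """The same error expressed in parts per million of the calculated mass."""
    return (mr_expt - mr_calc) / mr_calc * 1e6


# Made-up values for illustration.
mr_expt = mz_to_mr(500.0, 2)
mr_calc = 997.9800
print(round(mr_expt, 6), round(mass_error_da(mr_expt, mr_calc), 6))
```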

If the peptide sequence is modified, each affected residue is underlined. Details of the modification will be displayed if the mouse cursor is rested over the residue. If multiple matches to a query have identical scores, i.e. the sequences are different, but are identical in mass spectrometry terms, they are collapsed into a single consensus peptide sequence for display. The residues that differ between the matches are displayed in lower case. For example, if deamidation was a variable modification and one protein contained the sequence FASFIDK and another protein in the same family contained FASFINK (deamidation at N) then the table would display FASFIdK. Click on the rank to expand the alternative matches for the query and the actual peptide sequences will be displayed.
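The consensus display can be illustrated with a small function that lower-cases every position at which the alternative sequences disagree. This is a sketch of the display rule only: it assumes all sequences have equal length and that the displayed letter is taken from the first sequence, which reproduces the FASFIdK example above but is not confirmed as Mascot's exact rule.

```python
def consensus_sequence(sequences):
    """Collapse equal-length sequences, lower-casing positions that differ."""
    if not sequences:
        return ""
    first = sequences[0]
    if any(len(s) != len(first) for s in sequences):
        raise ValueError("all sequences must have the same length")
    out = []
    for i, residue in enumerate(first):
        if all(s[i] == residue for s in sequences):
            out.append(residue)          # identical in every match
        else:
            out.append(residue.lower())  # differs between matches
    return "".join(out)


display = consensus_sequence(["FASFIDK", "FASFINK"])
print(display)
```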

Tab controls

Controls at the top and bottom of the tab can be used to move between pages and change the number of protein families per page. There are also buttons to expand and collapse all of the information on the page.

At the top of the tab, there are text search controls. You can search the report for proteins and peptides containing numerical or text values.

Search for proteins by:

  • Accession
  • Description
  • Family number
  • Page number

Search for peptides by:

  • Query number
  • Observed m/z
  • Mr(expt)
  • Mr(calc)
  • Sequence
  • Fixed modification
  • Variable modification

If the target is found, the first occurrence will be highlighted and, if necessary, the page will scroll and expand. If the target is found in multiple locations in a single family, all matches will be highlighted. Choose Next or Previous to jump forward or backward to other instances of the target in other families. If the target is not found, a button provides the option to try the search in the Unassigned tab.

Report Builder

The Report Builder tab is part of the Protein Family Summary report. It allows you to build a customised table of protein hits, which is particularly useful if you need a minimal list of proteins for a publication. You can choose which columns to include and their order, filter out proteins that are of no interest, such as one-hit wonders, and export the table in CSV format directly to Excel.

Table of protein hits

The table has one row for each of the top-level protein hits, sometimes called anchor proteins. Since a family can contain several family members, grouped because they have shared peptides, there will be as many rows in the table as there are family members in the report. If the search results contain same-set proteins, and a preferred taxonomy has been selected, this will also apply to the selection of anchor proteins for the table. You can sort the table by clicking on a column header. The currently active sort order is shown by an arrow in the relevant column; up means ascending, down means descending. The table can be exported as CSV with one click, with sort order preserved.

Column selection

The columns are mostly self-explanatory, and tooltip help is displayed when the mouse pointer rests over a column header. Clicking on a family number hyperlink will jump to the family displayed in the proteins tab. Clicking on an accession hyperlink will load a Protein View report. If the search included quantitation using Reporter or Multiplex protocols, protein level quantitation information is also available. To select and re-order columns, expand the Columns section.

Use the Arrangement drop down list to load a saved arrangement or choose <custom> to configure the report by moving columns between the two lists. The columns are categorised into groups. The basic set of columns that are always available are under “Protein hits”. Quantitation results create a set of columns for each reported ratio. In the enabled list, you can select individual columns, CTRL+click multiple columns, or SHIFT+click a range of columns and use the up and down arrows to change their relative position. The changes take effect when you choose Apply.

If Mascot security is enabled and there is an arrangement you might want to use again, you can save the arrangement in your security session. After choosing Apply, give the arrangement a name and choose Save. Saved arrangements can be loaded using the drop down list mentioned earlier. If you want the arrangement to be available to all users, choose ‘Show column string’, copy the text, and use it to create an additional ReportBuilderColumnArrangement_N entry in mascot.dat.

Filtering protein hits

The Filters section enables protein rows to be dropped according to multiple criteria. Configure the first term and apply it by choosing Filter. The table will be reloaded and additional controls displayed to allow another term to be added, if required. Often, a single term requiring each protein to have significant matches to at least two distinct sequences will be all that is needed. On the other hand, for a quantitation report, it might be very useful to create a table limited to proteins that are significantly up-regulated in the 117 channel and significantly down-regulated in the 116 channel.
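The same kind of filtering can be applied to an exported table outside the report. The sketch below combines filter terms with a logical AND, in the way successive terms narrow the table in Report Builder. The column names (accession, sig_sequences) and the function name are hypothetical and do not necessarily match the headers in a real CSV export.

```python
import operator

# Comparison operators that a filter term may use.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def apply_filters(rows, terms):
    """Keep only the rows that satisfy every (column, op, value) term."""
    kept = rows
    for column, op, value in terms:
        kept = [row for row in kept if OPS[op](row[column], value)]
    return kept


# Made-up protein rows for illustration.
rows = [
    {"accession": "P1", "score": 250, "sig_sequences": 5},
    {"accession": "P2", "score": 40, "sig_sequences": 1},  # a one-hit wonder
    {"accession": "P3", "score": 90, "sig_sequences": 2},
]
kept = apply_filters(rows, [("sig_sequences", ">=", 2)])
print([row["accession"] for row in kept])
```

Adding a second term, for example ("score", ">", 100), narrows the list further, just as adding another term does in the Filters section.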

Unassigned Peptide Matches

The unassigned list contains peptide matches that are not assigned to protein families. In some cases, there may be no match at all, and only the observed m/z value and the experimental Mr will be listed against the query number. In other cases, details of the top scoring peptide match will be listed.

The list is split into pages. Controls at the top and bottom of the tab can be used to move between pages and change the number of matches per page. The Sort unassigned control allows the sort order to be changed. Descending score makes it easy to see whether there are any good matches. If so, you will want to increase the number of protein families or set it to AUTO so as to pull these matches into the main body of the report. Ascending query number is the same as ascending precursor Mr. Descending intensity allows you to find spectra with intense peaks that have failed to get a match. These could be candidates for de novo sequencing.
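The three sort orders can be expressed as sort keys. In this sketch the labels scoredown, queryup and intdown are the values of the _sortunassigned URL switch; the dictionary keys of each match (query, score, intensity) are hypothetical names used for illustration.

```python
# One sort key per value of the _sortunassigned switch.
SORT_KEYS = {
    "scoredown": lambda m: -m["score"],      # descending score (default)
    "queryup": lambda m: m["query"],         # ascending query number
    "intdown": lambda m: -m["intensity"],    # descending intensity
}


def sort_unassigned(matches, order="scoredown"):
    """Return the unassigned matches in the requested order."""
    return sorted(matches, key=SORT_KEYS[order])


# Made-up unassigned matches.
unassigned = [
    {"query": 3, "score": 12.0, "intensity": 9.0e5},
    {"query": 1, "score": 31.0, "intensity": 2.0e4},
    {"query": 2, "score": 0.0, "intensity": 4.0e6},
]
by_score = [m["query"] for m in sort_unassigned(unassigned)]
by_intensity = [m["query"] for m in sort_unassigned(unassigned, "intdown")]
print(by_score, by_intensity)
```

Sorting by descending intensity brings query 2 to the top: an intense spectrum with no match, and so a possible candidate for de novo sequencing.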

The unassigned table contains the following columns:

  1. Query number, hyperlinked to Peptide View.
  2. Experimental m/z value
  3. Experimental m/z transformed to a relative molecular mass
  4. Relative molecular mass calculated from the matched peptide sequence
  5. Difference (error) between the experimental and calculated masses
  6. Number of missed cleavage sites
  7. Ions score
  8. Expectation value for the peptide match. (The number of times we would expect to obtain an equal or higher score, purely by chance. The lower this value, the more significant the result).
  9. Rank of the peptide match, (1 to 10, where 1 is the best match). This will always be 1 for matches in the unassigned list. If there is a triangle to the left of the rank, the row can be expanded to show the alternative matches for this query.
  10. Sequence of the peptide in 1-letter code with modified residues underlined.
  11. Any variable modifications used to obtain the match

When you load the peptide view for an unassigned query, it takes the first protein containing the matched peptide. This may or may not be the protein that would be selected as the anchor protein if the formatting was changed in a way that caused the match to be pulled into a family in the Proteins tab.

At the top of the tab, there are text search controls. You can search the unassigned list by query number, mass, m/z value, and peptide sequence. Select the category, enter text in the edit field and choose Filter. If the text is found, the complete unassigned list is filtered to display matching queries. Choose Clear to return to the standard display. If the text is not found, a button provides the option to try the search in the Proteins tab.

URL Switches

There are a number of switches to modify the format of the result reports. Many of these have a global default, set by a parameter in the Options section of mascot.dat. These defaults can be changed in an individual report using the format controls, or by appending the relevant switch to the report URL. Switches take the form label=value and the delimiter between switches is an ampersand (&). For example, if the report URL was:

http://local-server/mascot/cgi/master_results.pl?file=../data/20040121/F001847.dat

The type of report could be changed by appending "REPTYPE=protein":

http://local-server/mascot/cgi/master_results.pl?file=../data/20040121/F001847.dat&REPTYPE=protein

Labels and values are not case sensitive. Note that many labels begin with an underscore character. Values that are not literal strings are shown in italics.
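Appending switches is easy to script. The sketch below joins label=value pairs with ampersands and adds them to a report URL. The helper name add_switches is hypothetical, and the function performs no percent-encoding, so it assumes the values are already URL-safe.

```python
def add_switches(url, switches):
    """Append label=value switches to a report URL, delimited by ampersands."""
    pairs = "&".join(f"{label}={value}" for label, value in switches.items())
    if not pairs:
        return url
    # Use '?' only when the URL has no query string yet.
    separator = "&" if "?" in url else "?"
    return url + separator + pairs


base = ("http://local-server/mascot/cgi/master_results.pl"
        "?file=../data/20040121/F001847.dat")
protein_report = add_switches(base, {"REPTYPE": "protein"})
tuned_report = add_switches(base, {"_sigthreshold": 0.01,
                                   "_sortunassigned": "queryup"})
print(protein_report)
print(tuned_report)
```

The first call reproduces the example URL above; the second combines two switches in a single request.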

URL arguments relating to quantitation are described here

master_results_2.pl

| URL | mascot.dat | master_results_2.pl | Value | Description |
|---|---|---|---|---|
| report | | Yes | auto | Report all significant hits |
| | | | N | Report N hits |
| _showallfromerrortolerant | ShowAllFromErrorTolerant | Yes | 1 | Set value to 1 to report all matches from an error tolerant search, including the garbage, (default 0) |
| _onlyerrortolerant | | Yes | 1 | Set value to 1 to report only error tolerant matches from an automatic error tolerant search, (default 0) |
| _noerrortolerant | | Yes | 1 | Set value to 1 to suppress error tolerant matches from an automatic error tolerant search, (default 0) |
| _show_decoy_report | | Yes | 1 | Set value to 1 to display the report for an automatic decoy database search, (default 0) |
| _sigthreshold | SigThreshold | Yes | N | Probability to use for the significance threshold. Range is 0.99 to 1E-18, (default 0.05). |
| _sortunassigned | SortUnassigned | Yes | scoredown | Sort unassigned matches by descending score, (default) |
| | | | queryup | Sort unassigned matches by ascending query number |
| | | | intdown | Sort unassigned matches by descending intensity |
| _ignoreionsscorebelow | IgnoreIonsScoreBelow | Yes | N | Values greater than 0 and less than 1 act as an expect value threshold, and the scores for any peptide matches with higher expect values are set to 0, so that they disappear from the report. Values of 1 or more act as a score threshold, and any peptide matches with lower scores are suppressed. A value of -1 means set the threshold to the value of _sigthreshold. Floating point number, (default 0.0). |
| _alwaysgettitle | | Yes | 1 | Set to 1 to force reports to fetch Fasta titles from database when they are not included in the result file, (default 1). |
| percolate | Percolator | Yes | 1 | Set value to 1 to re-rank results using Percolator, (default 0). |
| percolate_rt | PercolatorUseRT | Yes | 1 | Obsolete. Set value to 1 to include retention time feature when using Percolator, (default 0). |
| _proteinfamilyswitch | ProteinFamilySwitch | Yes | 0 | The number of MS-MS spectra required for displaying the Protein Family Summary report. Set to 0 to force results to be always displayed as Protein Family Summary, (default 1). |
| _prefertaxonomy | | Yes | N | 1-based integer index into the list of taxonomies in the Mascot taxonomy file. 0 means no preference. |
| group_family | | Yes | 0 | Set to 0 to disable family grouping. |
| _minpeplen | MinPepLenInPepSummary | Yes | N | Peptides shorter than this are ignored for protein inference purposes. Positive, non-zero integer. |
| min_num_sig_unique_seqs | | Yes | N | Proteins will only be reported if they contain significant matches to at least this number of distinct peptide sequences. Positive, non-zero integer. |
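The three ways of reading the _ignoreionsscorebelow value can be summarised in code. This is a sketch of the rules as described for that switch, not Mascot's implementation. It assumes that the default of 0.0 applies no cut-off and, as a further assumption, treats any other value of zero or below the same way; the function name keep_match is hypothetical.

```python
def keep_match(score, expect, ignore_below=0.0, sig_threshold=0.05):
    """Decide whether a peptide match survives the _ignoreionsscorebelow cut-off."""
    if ignore_below == -1:
        # -1 means: use the significance threshold as the cut-off.
        ignore_below = sig_threshold
    if ignore_below <= 0:
        return True  # assumed: no cut-off applied
    if ignore_below < 1:
        # Between 0 and 1: expect value threshold; higher expect values are dropped.
        return expect <= ignore_below
    # 1 or more: score threshold; lower scores are dropped.
    return score >= ignore_below


# Made-up score and expect value for a single match.
print(keep_match(score=25.0, expect=0.03, ignore_below=0.05))  # expect cut-off
print(keep_match(score=25.0, expect=0.03, ignore_below=30))    # score cut-off
print(keep_match(score=25.0, expect=0.03, ignore_below=-1,
                 sig_threshold=0.01))                          # uses _sigthreshold
```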