It is generally not useful to have a results page that lists many (possibly hundreds of) highly homologous proteins. In most cases, it is far better to group these proteins together. The goal of grouping is to answer the following question: "Which minimal set of proteins completely accounts for the peptides found in the experimental data?"
There are two separate algorithms for grouping proteins in Mascot Parser. The first option MSRES_GROUP_PROTEINS uses Occam's Razor and was introduced in the first version of Mascot Parser. The second option MSRES_CLUSTER_PROTEINS is more inclusive and is the algorithm used for the MS-MS search results in Mascot Server 2.3.
The flag MSRES_GROUP_PROTEINS can be used with both ms_peptidesummary and ms_proteinsummary objects.
In this grouping mode, Occam's Razor is ruthlessly applied to the results, so that proteins which span the same set or a subset of peptides can be collapsed into a single entry in the hit list.
There is a subtle difference between the test used to group proteins for a peptide mass fingerprint search and for other searches. If there are any MS-MS data, proteins are grouped together when the peptide sequences for each match are the same. If there are no MS-MS data, proteins are grouped together when the input masses of the peptide matches are the same.
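The sameset/subset test described above can be sketched in plain Python. This is an illustrative model only, not Mascot Parser's implementation: the function group_proteins, the input dictionary and the peptide sequences are all hypothetical, and each protein is simply attached to the first lead protein whose peptide set contains its own. For a peptide mass fingerprint search, the same comparison would be made on input masses instead of sequences.

```python
from typing import Dict, FrozenSet, List


def group_proteins(matches: Dict[str, FrozenSet[str]]) -> List[dict]:
    """Collapse proteins whose peptide set equals, or is contained in, a lead protein's set."""
    # Visit the largest peptide sets first; break ties by accession so the result is deterministic.
    ordered = sorted(matches, key=lambda acc: (-len(matches[acc]), acc))
    groups: List[dict] = []
    for acc in ordered:
        peptides = matches[acc]
        for group in groups:
            lead_peptides = matches[group["lead"]]
            if peptides == lead_peptides:
                group["sameset"].append(acc)  # same set of peptides as the lead
                break
            if peptides < lead_peptides:
                group["subset"].append(acc)  # strict subset of the lead's peptides
                break
        else:
            # No existing lead accounts for this protein, so it starts a new hit.
            groups.append({"lead": acc, "sameset": [], "subset": []})
    return groups


# Hypothetical accessions and peptide sequences, for illustration only.
matches = {
    "A": frozenset({"PEPTIDEK", "LSSPATLNSR", "VATVSIPR"}),
    "B": frozenset({"PEPTIDEK", "LSSPATLNSR", "VATVSIPR"}),
    "C": frozenset({"PEPTIDEK"}),
    "D": frozenset({"PEPTIDEK", "NEWPEPTIDER"}),
}
groups = group_proteins(matches)
```

In this example, B collapses into A as a sameset protein and C as a subset protein, while D becomes a separate hit because it has one peptide that A does not contain.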
The flag MSRES_CLUSTER_PROTEINS is only supported with ms_peptidesummary.
One of the disadvantages of the MSRES_GROUP_PROTEINS algorithm is that a protein may be excluded from being a subset or sameset protein because it has just one 'new' peptide that prevents it from being grouped. The MSRES_CLUSTER_PROTEINS mode groups proteins by considering shared peptide matches above homology threshold. Each protein hit consists of a 'family' of proteins, where each protein in the family shares at least one peptide match above homology threshold with another family member.
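The family rule (each member shares at least one significant peptide match with another member) amounts to finding connected components. The following is a minimal sketch under stated assumptions, not the algorithm Mascot Parser actually runs: cluster_families is a hypothetical helper, and its input is assumed to contain only peptide matches that are already above homology threshold.

```python
from collections import defaultdict
from typing import Dict, List, Set


def cluster_families(significant: Dict[str, Set[str]]) -> List[List[str]]:
    """Group proteins into families linked by shared significant peptide matches."""
    parent = {acc: acc for acc in significant}

    def find(acc: str) -> str:
        # Follow parent links to the family representative, shortening the path as we go.
        while parent[acc] != acc:
            parent[acc] = parent[parent[acc]]
            acc = parent[acc]
        return acc

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller accession as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_owner: Dict[str, str] = {}
    for acc in sorted(significant):
        for peptide in significant[acc]:
            if peptide in first_owner:
                union(acc, first_owner[peptide])  # shared peptide links the two proteins
            else:
                first_owner[peptide] = acc

    families = defaultdict(list)
    for acc in sorted(significant):
        families[find(acc)].append(acc)
    return sorted(families.values())


# Hypothetical data: A and E share p3, E and I share p4, X shares nothing.
significant = {
    "A": {"p1", "p2", "p3"},
    "E": {"p3", "p4"},
    "I": {"p4", "p5"},
    "X": {"p9"},
}
families = cluster_families(significant)
```

Here A and I end up in the same family even though they share no peptide directly, because both are linked through E.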
Proteins in the hit can be accessed in the following way:
| Hit | Protein | Description | getGrouping returns | getSimilarProteins returns |
| --- | --- | --- | --- | --- |
| 1.1 | A | This is the 'lead' or 'master' protein for hit 1 [returned by getHit(1)] | GROUP_NO | nothing |
| | B | A protein with the same peptide matches as Protein A [returned by getNextSimilarProteinOf("A", 1, 1)] | GROUP_COMPLETE | Protein A |
| | C | A protein that contains a subset of the peptides in Protein A and Protein I [returned by getNextSubsetProteinOf("A", 1, 1)] | GROUP_SUBSET | Protein A, Protein I |
| | D | A protein that contains the same set of the peptides in Protein C [returned by getNextSimilarProteinOf("C", 1, 1)] | GROUP_COMPLETE | Protein C |
| 1.2 | E | This is another family member of the first hit [returned by getNextFamilyProtein(1,1)] | GROUP_FAMILY | nothing |
| | F | A protein with the same peptide matches as Protein E [returned by getNextSimilarProteinOf("E", 1, 1)] | GROUP_COMPLETE | Protein E |
| | G | A protein that contains a subset of the peptides in Protein E [returned by getNextSubsetProteinOf("E", 1, 1)] | GROUP_SUBSET | Protein E |
| | H | A protein that contains the same set of the peptides in Protein G [returned by getNextSimilarProteinOf("G", 1, 1)] | GROUP_COMPLETE | Protein G |
| 1.3 | I | This is another family member of the first hit [returned by getNextFamilyProtein(1,2)] | GROUP_FAMILY | nothing |
| | J | A protein with the same peptide matches as Protein I [returned by getNextSimilarProteinOf("I", 1, 1)] | GROUP_COMPLETE | Protein I |
| | K | A protein that contains a subset of the peptides in Protein I [returned by getNextSubsetProteinOf("I", 1, 1)] | GROUP_SUBSET | Protein I |
| | C | A protein that contains a subset of the peptides in Protein A and Protein I [returned by getNextSubsetProteinOf("I", 1, 2)] | GROUP_SUBSET | Protein A, Protein I |
| | L | A protein that contains the same set of the peptides in Protein K [returned by getNextSimilarProteinOf("K", 1, 1)] | GROUP_COMPLETE | Protein I |
| 2.1 | X | This is the 'lead' or 'master' protein for hit 2 [returned by getHit(2)] | GROUP_NO | nothing |
Note that Protein C is a subset of Protein A and Protein I. An alternative way of showing subsets is to have a single list at the end of the family. For this case, it is probably easier to call getNextSubsetProtein(hit, id, searchWholeFamily) with searchWholeFamily == true, which returns each subset protein in the family, one at a time. See also getSimilarProteins(), which returns the list of proteins that a subset protein belongs to (in some sense the 'superset' proteins of the subset protein).
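The idea of a single subset list per family, with each subset protein paired with its 'superset' proteins, can be modelled without the library. The sketch below does not call Mascot Parser: subsets_with_supersets is a hypothetical helper, and the peptide labels are invented so that Protein C is a subset of both A and I, as in the table.

```python
from typing import Dict, List, Set


def subsets_with_supersets(
    family: Dict[str, Set[str]], candidates: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """List each subset protein once, with every family member whose peptides contain its own."""
    listing: Dict[str, List[str]] = {}
    for acc in sorted(candidates):
        supersets = [
            member
            for member in sorted(family)
            if candidates[acc] < family[member]  # strict subset of this member's peptides
        ]
        if supersets:
            listing[acc] = supersets
    return listing


# Hypothetical peptide labels for family members A, E, I and subset proteins C, G, K.
family = {
    "A": {"p1", "p2", "p3"},
    "E": {"p3", "p4", "p7"},
    "I": {"p1", "p5", "p6"},
}
candidates = {"C": {"p1"}, "G": {"p4"}, "K": {"p5"}}
listing = subsets_with_supersets(family, candidates)
```

Listing subsets this way shows Protein C once, with both of its superset proteins, instead of repeating it under A and again under I.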
For family grouping, it is important to call getNextSimilarProteinOf() and getNextSubsetProteinOf() rather than getNextSimilarProtein() and getNextSubsetProtein(). Otherwise the list of similar proteins will always be for the 'lead' protein.
In peptide summary only (using either method of grouping), any peptides shorter than minPepLenInPepSummary will be ignored when grouping proteins together. You need to specify this value when creating an ms_peptidesummary object. The default value may be obtained from ms_mascotoptions::getMinPepLenInPepSummary().
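The effect of the minimum peptide length on grouping can be illustrated with a small filter applied before peptide sets are compared. This is a sketch only: drop_short_peptides is a hypothetical helper, the sequences are invented, and the length of 5 is an arbitrary example value, not the library default.

```python
from typing import Dict, FrozenSet


def drop_short_peptides(
    matches: Dict[str, FrozenSet[str]], min_pep_len: int
) -> Dict[str, FrozenSet[str]]:
    """Ignore peptides shorter than min_pep_len before proteins are compared for grouping."""
    return {
        acc: frozenset(p for p in peptides if len(p) >= min_pep_len)
        for acc, peptides in matches.items()
    }


# Hypothetical matches: A differs from B only by the very short peptide "GK".
matches = {
    "A": frozenset({"PEPTIDEK", "GK"}),
    "B": frozenset({"PEPTIDEK"}),
}
filtered = drop_short_peptides(matches, min_pep_len=5)
```

After filtering, A and B have identical peptide sets, so a short peptide no longer prevents them from being grouped together.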
Assume that
We say that protein 'B' matches a subset of peptides. Mascot Parser allows you to treat these subset matches in three separate ways:
Display the proteins that matched a subset of the peptides 'under' the main protein. For the standard Mascot Server reports, this is implemented by specifying "ShowSubSets 1" in the Options section of mascot.dat. In Mascot Parser, specify MSRES_SHOW_SUBSETS when creating the ms_peptidesummary or ms_proteinsummary object.
Discard the proteins that matched a subset of the peptides 'under' the main protein. For the standard Mascot Server reports, this is implemented by specifying "ShowSubSets 0" in the Options section of mascot.dat. In Mascot Parser, do not specify MSRES_SHOW_SUBSETS when creating the ms_peptidesummary or ms_proteinsummary object.
ms_peptidesummary or ms_proteinsummary object. If you have a primary hit with (say) 100 peptide matches, you may be very interested in subset proteins with 99 matches, but not in ones that have 1 or 2 matches; these just clutter up the report if you use MSRES_SHOW_SUBSETS. On the other hand, if you have a primary hit with (say) 2 peptide matches, you are more likely to be interested in subset proteins with just 1 match. The function ms_mascotresults::setSubsetsThreshold() was introduced in Mascot 2.2 to facilitate including or removing these less interesting subset proteins.
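The reasoning in this example suggests a relative cut-off, not an absolute one. The sketch below illustrates that idea only: keep_subset and min_fraction are hypothetical, and the criterion that setSubsetsThreshold() actually applies is not described here, so a fraction of the lead protein's match count is an assumption.

```python
def keep_subset(lead_matches: int, subset_matches: int, min_fraction: float) -> bool:
    """Keep a subset protein only if it covers a large enough share of the lead's matches.

    Assumption: the cut-off is a fraction of the lead protein's peptide match count.
    """
    if lead_matches <= 0:
        raise ValueError("lead_matches must be positive")
    return subset_matches / lead_matches >= min_fraction


# The three cases from the text, with an arbitrary example fraction of 0.5.
near_complete = keep_subset(100, 99, 0.5)  # 99 of 100 matches
clutter = keep_subset(100, 2, 0.5)         # 2 of 100 matches
small_hit = keep_subset(2, 1, 0.5)         # 1 of 2 matches
```

With a relative rule, the subset with 99 of 100 matches and the subset with 1 of 2 matches are both kept, while the subset with 2 of 100 matches is dropped.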