Why group proteins together?

A results page that lists many (possibly hundreds of) highly homologous proteins is generally not useful. In most cases, it is far better to group these proteins together. The goal of grouping is to answer the following question: "Which minimal set of proteins completely accounts for the peptides found in the experimental data?"

There are two separate algorithms for grouping proteins in Mascot Parser. The first option, MSRES_GROUP_PROTEINS, uses Occam's Razor and was introduced in the first version of Mascot Parser. The second option, MSRES_CLUSTER_PROTEINS, is more inclusive and is the algorithm used for the MS-MS search results in Mascot Server 2.3.

Using MSRES_GROUP_PROTEINS

The flag MSRES_GROUP_PROTEINS can be used with both ms_peptidesummary and ms_proteinsummary objects.

In this grouping mode, Occam's Razor is ruthlessly applied to the results: a protein that matches the same set of peptides as another protein, or a subset of them, is collapsed into that protein's entry in the hit list.

There is a subtle difference between the test used to group proteins for a peptide mass fingerprint search and the test used for other searches. If there is any MS-MS data, proteins are grouped together when the peptide sequences for each match are the same. If there is no MS-MS data, proteins are grouped together when the input masses of the peptide matches are the same.
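The grouping rule described above can be illustrated with a small, self-contained sketch. It is not Mascot Parser code: `match_key`, `group_proteins`, the dictionary layout and the accessions `P1` to `P4` are all hypothetical names chosen for illustration, and the real algorithm may order and break ties differently.

```python
from typing import Dict, FrozenSet, List


def match_key(match: dict, has_msms: bool):
    """Pick the value used to decide whether two matches are 'the same'."""
    # Assumption: with MS-MS data the peptide sequence is compared; without
    # it (peptide mass fingerprint) the input mass is compared instead.
    return match["sequence"] if has_msms else match["input_mass"]


def group_proteins(matches: Dict[str, FrozenSet]) -> List[dict]:
    """Collapse same-set and subset proteins under a lead protein."""
    # Visit proteins with the most matches first; ties broken by accession.
    order = sorted(matches, key=lambda acc: (-len(matches[acc]), acc))
    hits: List[dict] = []
    for acc in order:
        keys = matches[acc]
        for hit in hits:
            lead_keys = matches[hit["lead"]]
            if keys == lead_keys:      # same set of matches as the lead
                hit["sameset"].append(acc)
                break
            if keys < lead_keys:       # proper subset of the lead's matches
                hit["subset"].append(acc)
                break
        else:
            # Not explained by any existing hit, so it becomes a new hit.
            hits.append({"lead": acc, "sameset": [], "subset": []})
    return hits


example = {
    "P1": frozenset({"AAK", "CCR", "DDK"}),
    "P2": frozenset({"AAK", "CCR", "DDK"}),
    "P3": frozenset({"AAK", "DDK"}),
    "P4": frozenset({"AAK", "EEK"}),  # one 'new' peptide keeps it separate
}
hits = group_proteins(example)
```

In this example `P2` and `P3` collapse under `P1`, while `P4` becomes its own hit because a single peptide (`EEK`) is not found in `P1`.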

Using MSRES_CLUSTER_PROTEINS

The flag MSRES_CLUSTER_PROTEINS is only supported with ms_peptidesummary.

One disadvantage of the MSRES_GROUP_PROTEINS algorithm is that a single 'new' peptide is enough to prevent a protein from being grouped as a subset or same-set protein. The MSRES_CLUSTER_PROTEINS mode instead groups proteins by considering shared peptide matches above the homology threshold. Each protein hit consists of a 'family' of proteins, where each protein in the family shares at least one peptide match above the homology threshold with another family member.
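The family definition above (each member shares at least one significant peptide match with another member) amounts to finding connected groups of proteins. The following sketch shows only that connectivity idea, using a union-find structure. It is an assumption-laden illustration, not the Mascot implementation: `cluster_families` is a hypothetical name, and a single global `homology_threshold` is used here for simplicity.

```python
from typing import Dict, List


def cluster_families(matches: Dict[str, Dict[str, float]],
                     homology_threshold: float) -> List[List[str]]:
    """Group proteins that are linked by shared significant peptides."""
    # Keep only the peptide matches that score above the threshold.
    significant = {
        acc: {pep for pep, score in peps.items() if score > homology_threshold}
        for acc, peps in matches.items()
    }
    parent = {acc: acc for acc in matches}

    def find(acc: str) -> str:
        # Follow parent links to the family root, halving the path as we go.
        while parent[acc] != acc:
            parent[acc] = parent[parent[acc]]
            acc = parent[acc]
        return acc

    first_owner: Dict[str, str] = {}
    for acc in sorted(significant):
        for pep in sorted(significant[acc]):
            if pep in first_owner:
                # Shared peptide: merge this protein's family with the owner's.
                parent[find(acc)] = find(first_owner[pep])
            else:
                first_owner[pep] = acc

    families: Dict[str, List[str]] = {}
    for acc in sorted(matches):
        families.setdefault(find(acc), []).append(acc)
    return sorted(families.values())


scores = {
    "A": {"pep1": 60.0, "pep2": 55.0},
    "E": {"pep2": 50.0, "pep3": 45.0},
    "I": {"pep3": 48.0, "pep4": 52.0},
    "X": {"pep5": 70.0, "pep1": 10.0},  # pep1 is below threshold for X
}
families = cluster_families(scores, homology_threshold=30.0)
```

Here `A` and `I` share no peptide directly, but both share one with `E`, so all three end up in one family. `X` stays alone because its only shared peptide falls below the threshold.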

Proteins in the hit can be accessed in the following way:

Hit	Protein	Description	getGrouping returns	getSimilarProteins returns
1.1	`A`	This is the 'lead' or 'master' protein for hit 1 [returned by getHit(1)]	GROUP_NO	nothing
	`B`	A protein with the same peptide matches as Protein A [returned by getNextSimilarProteinOf("A", 1, 1)]	GROUP_COMPLETE	Protein A
	`C`	A protein that contains a subset of the peptides in Protein A and Protein I [returned by getNextSubsetProteinOf("A", 1, 1)]	GROUP_SUBSET	Protein A Protein I
	`D`	A protein that contains the same set of the peptides in Protein C [returned by getNextSimilarProteinOf("C", 1, 1)]	GROUP_COMPLETE	Protein C
1.2	`E`	This is another family member of the first hit [returned by getNextFamilyProtein(1,1)]	GROUP_FAMILY	nothing
	`F`	A protein with the same peptide matches as Protein E [returned by getNextSimilarProteinOf("E", 1, 1)]	GROUP_COMPLETE	Protein E
	`G`	A protein that contains a subset of the peptides in Protein E [returned by getNextSubsetProteinOf("E", 1, 1)]	GROUP_SUBSET	Protein E
	`H`	A protein that contains the same set of the peptides in Protein G [returned by getNextSimilarProteinOf("G", 1, 1)]	GROUP_COMPLETE	Protein G
1.3	`I`	This is another family member of the first hit [returned by getNextFamilyProtein(1,2)]	GROUP_FAMILY	nothing
	`J`	A protein with the same peptide matches as Protein I [returned by getNextSimilarProteinOf("I", 1, 1)]	GROUP_COMPLETE	Protein I
	`K`	A protein that contains a subset of the peptides in Protein I [returned by getNextSubsetProteinOf("I", 1, 1)]	GROUP_SUBSET	Protein I
	`C`	A protein that contains a subset of the peptides in Protein A and Protein I [returned by getNextSubsetProteinOf("I", 1, 2)]	GROUP_SUBSET	Protein A Protein I
	`L`	A protein that contains the same set of the peptides in Protein K [returned by getNextSimilarProteinOf("K", 1, 1)]	GROUP_COMPLETE	Protein K
2.1	`X`	This is the 'lead' or 'master' protein for hit 2 [returned by getHit(2)]	GROUP_NO	nothing

Note that Protein C is a subset of both Protein A and Protein I, so it appears twice in the table. An alternative way of showing subsets is to have a single list at the end of the family. For this case, it is probably easier to call getNextSubsetProtein(hit, id, searchWholeFamily) with searchWholeFamily == true, which returns each subset protein in the family, one at a time. See also getSimilarProteins(), which returns the list of proteins that a subset protein belongs to (in some sense the 'superset' proteins of the subset protein).
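The "single list at the end of the family" layout can be pictured with a small data model of hit 1 from the table. This sketch makes no calls to Mascot Parser; the `family` dictionary and the function `subsets_for_whole_family` are hypothetical and only mirror the relationships shown above.

```python
from typing import Dict, List

# Family members of hit 1 and the subset proteins listed under each of them,
# copied from the table (same-set proteins are left out for brevity).
family: Dict[str, List[str]] = {
    "A": ["C"],
    "E": ["G"],
    "I": ["K", "C"],
}


def subsets_for_whole_family(members: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return each subset protein once, with the members it is a subset of."""
    supersets: Dict[str, List[str]] = {}
    for member, subsets in members.items():
        for subset in subsets:
            supersets.setdefault(subset, []).append(member)
    return supersets


flat = subsets_for_whole_family(family)
```

Protein C now appears once, paired with both A and I, which corresponds to the 'superset' list in the last column of the table.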

For family grouping, it is important to call getNextSimilarProteinOf() and getNextSubsetProteinOf() rather than getNextSimilarProtein() and getNextSubsetProtein(). Otherwise the list of similar proteins will always be for the 'lead' protein.

minPepLenInPepSummary

In a peptide summary only (using either method of grouping), any peptides shorter than minPepLenInPepSummary are ignored when grouping proteins together. You need to specify this value when creating an ms_peptidesummary object. The default value may be obtained from ms_mascotoptions::getMinPepLenInPepSummary().
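The effect of the minimum peptide length on grouping can be sketched as a filter applied before peptide sets are compared. The helper `peptides_for_grouping`, the peptide sequences and the cut-off of 7 are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Set


def peptides_for_grouping(peptides: Iterable[str], min_pep_len: int) -> Set[str]:
    """Drop peptides shorter than min_pep_len before comparing proteins."""
    return {pep for pep in peptides if len(pep) >= min_pep_len}


protein_a = {"LVNELTEFAK", "AGK"}  # one long and one very short peptide
protein_b = {"LVNELTEFAK"}

# Without the filter the two proteins differ; with it they look identical.
differ_before = protein_a != protein_b
same_after = (peptides_for_grouping(protein_a, 7)
              == peptides_for_grouping(protein_b, 7))
```

With the filter applied, the short peptide no longer prevents the two proteins from being treated as having the same set of matches.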

Handling 'subset' proteins

Assume that

protein 'A' has three peptide matches (to queries 1, 2 and 3); and
protein 'B' has only two peptide matches (to queries 1 and 3).

We say that protein 'B' matches a subset of peptides. Mascot Parser allows you to treat these subset matches in three separate ways:

1. Display the proteins that matched a subset of the peptides 'under' the main protein. For the standard Mascot Server reports, this is implemented by specifying "ShowSubSets 1" in the Options section of mascot.dat. In Mascot Parser, specify MSRES_SHOW_SUBSETS when creating the ms_peptidesummary or ms_proteinsummary object.
2. Discard the proteins that matched a subset of the peptides, so that they are not shown 'under' the main protein. For the standard Mascot Server reports, this is implemented by specifying "ShowSubSets 0" in the Options section of mascot.dat. In Mascot Parser, do not specify MSRES_SHOW_SUBSETS when creating the ms_peptidesummary or ms_proteinsummary object.
3. Treat proteins that matched a subset of the peptides as separate, unique proteins with no relation to the main protein. In Mascot Parser, specify MSRES_SUBSETS_DIFF_PROT when creating the ms_peptidesummary or ms_proteinsummary object.
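The three treatments can be compared side by side using the protein 'A' and protein 'B' example. The sketch below is a plain data model, not Mascot Parser code: the mode names `SHOW`, `DISCARD` and `SEPARATE` and the function `apply_subset_mode` are hypothetical stand-ins for specifying MSRES_SHOW_SUBSETS, omitting it, and specifying MSRES_SUBSETS_DIFF_PROT respectively.

```python
from typing import Dict, List, Set

SHOW, DISCARD, SEPARATE = "show", "discard", "separate"  # hypothetical names


def apply_subset_mode(queries: Dict[str, Set[int]], lead: str, mode: str) -> List[dict]:
    """Build a hit list for one lead protein under the chosen subset mode."""
    # A subset protein matches a proper subset of the lead's queries.
    subsets = sorted(acc for acc, q in queries.items()
                     if acc != lead and q < queries[lead])
    if mode == SHOW:
        return [{"lead": lead, "subsets": subsets}]
    if mode == DISCARD:
        return [{"lead": lead, "subsets": []}]
    if mode == SEPARATE:
        # Each subset protein becomes its own, unrelated hit.
        return [{"lead": lead, "subsets": []}] + [
            {"lead": acc, "subsets": []} for acc in subsets
        ]
    raise ValueError(f"unknown mode: {mode}")


queries = {"A": {1, 2, 3}, "B": {1, 3}}
shown = apply_subset_mode(queries, "A", SHOW)
discarded = apply_subset_mode(queries, "A", DISCARD)
separate = apply_subset_mode(queries, "A", SEPARATE)
```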

If you have a primary hit with (say) 100 peptide matches, you may be very interested in subset proteins with 99 matches, but not in ones that have 1 or 2 matches; these just clutter up the report if you use MSRES_SHOW_SUBSETS. On the other hand, if you have a primary hit with (say) 2 peptide matches, you are more likely to be interested in subset proteins with just 1 match. The function ms_mascotresults::setSubsetsThreshold() was introduced in Mascot 2.2 to make it easier to include or remove these less interesting subset proteins.
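The reasoning above suggests a cut-off that scales with the primary hit rather than a fixed number. The sketch below illustrates that idea only. It does not describe how setSubsetsThreshold() is defined: the helper `filter_subsets`, the use of match counts, and the `fraction` parameter are all assumptions made for this example.

```python
from typing import Dict


def filter_subsets(lead_match_count: int,
                   subset_match_counts: Dict[str, int],
                   fraction: float) -> Dict[str, int]:
    """Keep subset proteins with at least `fraction` of the lead's matches."""
    cutoff = lead_match_count * fraction
    return {acc: n for acc, n in subset_match_counts.items() if n >= cutoff}


# Primary hit with 100 matches: the 99-match subset stays, the 2-match one goes.
large = filter_subsets(100, {"S1": 99, "S2": 2}, fraction=0.5)

# Primary hit with 2 matches: a 1-match subset is still worth keeping.
small = filter_subsets(2, {"S3": 1}, fraction=0.5)
```

A relative cut-off like this treats a 1-match subset differently depending on the size of the primary hit, which is the behaviour the paragraph argues for.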