Creating a list of confidently identified proteins
Listing confidently identified proteins can be done very easily using Report Builder:
- Run the search as a target-decoy search, which is the default (in earlier versions of Mascot, select the Decoy checkbox when submitting the search)
- Open the result report
- In format controls, set the peptide FDR to 1%
- Optionally, choose to refine results with machine learning
- After the report reloads, switch to the Report Builder tab
- Either choose a suitable Min. number of sig. unique sequences in format controls, or set the filter ‘Num of significant unique sequences’ in Report Builder
- Optionally, expand the columns section and choose which columns you require and their order
- Print the table as PDF or export it as CSV (click on the Export as CSV button at the bottom of Report Builder)
Peptide FDR
The default significance threshold for a Mascot search is usually 0.05 and this will often give a peptide FDR in the region of 5%. If the actual FDR is excessively high or low, or if some other value of FDR is required, the target FDR can be adjusted to achieve the required value.
When setting the peptide FDR, you may discover that you have too few matches to get an accurate peptide FDR. Since the main purpose of the exercise is to avoid reporting proteins based on false peptide matches, as long as you have very few false peptide matches, this won’t be a problem.
Protein FDR
The basis of protein FDR is the number of proteins inferred in the target database compared with the number in the decoy database. Conceptually, this is similar to peptide FDR, but counting proteins calls for a number of additional definitions and assumptions. These are described on the protein FDR help page.
Often, database redundancy causes protein inference ambiguity, meaning that we could account for the peptide evidence using different sets of proteins. A protein FDR of 1% only tells us that an estimated 1% of the proteins listed are wholly false. This doesn’t mean the other 99% are "correct". In particular, a family member may represent a number of same-set and near same-set proteins. Unless the PSMs provide near complete coverage, two same-set proteins could have major differences in the regions for which we have no evidence. In such cases, it is important to remember that a protein accession in the summary report doesn’t mean "this is the correct protein"; it means "the correct protein is likely to be very similar to one of the set of proteins represented by this family member".
Protein FDR is controlled by two settings: peptide FDR and Min. number of sig. unique sequences.
Should you filter out ‘one-hit wonders’?
The default for Min. number of sig. unique sequences is 1, which means Mascot is reporting ‘one-hit wonders’. For large searches, conventional wisdom is that it is safer to exclude these, but it is not always necessary.
Consider a search with 10,000 significant peptide matches. While we might be happy to report these results with a 1% peptide FDR, meaning 100 false peptides, few of us would be comfortable reporting 100 false proteins. Filtering out ‘one-hit wonders’ by requiring significant matches to more than one peptide sequence only works well if the number of false matches is small compared with the number of database entries.
At a rough approximation, you don’t have to worry about this if the number of target matches at 1% FDR is less than the number of target database entries. You can find the count of entries in the report header. If you used a taxonomy filter, it is the count after the filter that matters.
If the numbers are similar, or if the number of matches is greater than the number of entries, you need to use the Poisson distribution to decide where to draw a line. There are many online calculators, or you can use this spreadsheet. If you plug in the number of entries in SwissProt mouse (16,727), 18,000 true and 180 false matches, you’ll see that 178 entries get 1 false match and only 1 entry gets 2, by chance. In such a case, setting ‘Num of significant unique sequences’ > 1 is a safe choice. If you increase the number of false matches by a factor of 10, you’ll see that 87 entries have 2 matches and 3 entries have 3. If these were the numbers for your search, you might want to set ‘Num of significant unique sequences’ > 2.
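As a rough alternative to an online calculator or spreadsheet, the Poisson estimate can be sketched in a few lines of Python. This is a minimal sketch, assuming false matches fall on database entries independently and uniformly; the function name `expected_entries_with_k_false` is illustrative and not part of Mascot.

```python
import math


def expected_entries_with_k_false(n_entries: int, n_false: int, k: int) -> float:
    """Expected number of database entries that receive exactly k false matches.

    Assumption: false matches are spread over entries independently and
    uniformly, so the count per entry follows a Poisson distribution with
    mean n_false / n_entries.
    """
    lam = n_false / n_entries  # mean number of false matches per entry
    return n_entries * math.exp(-lam) * lam ** k / math.factorial(k)


n_entries = 16_727  # SwissProt mouse entry count quoted in the text

for n_false in (180, 1800):
    # Expected number of entries with exactly 1, 2 and 3 false matches
    counts = {
        k: expected_entries_with_k_false(n_entries, n_false, k)
        for k in (1, 2, 3)
    }
    summary = ", ".join(f"k={k}: {v:.1f}" for k, v in counts.items())
    print(f"{n_false} false matches -> {summary}")
```

With 180 false matches this gives roughly 178 entries with one false match and about 1 entry with two; with 1,800 false matches it gives roughly 87 entries with two and about 3 with three, in line with the figures quoted above.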
Example
The table below shows the precise counts of peptide sequences and proteins for 1% protein FDR for an example search:
Min. sig. seq. | Sig. thresh. | Target seq. | Decoy seq. | Peptide seq. FDR | Target prot. | Decoy prot. | Protein FDR |
---|---|---|---|---|---|---|---|
1 | 0.00215 | 7763 | 41 | 0.53% | 3832 | 39 | 1.02% |
2 | 0.16 | 10065 | 618 | 6.14% | 2281 | 23 | 1.01% |
On the first row are the results where one-hit wonders are retained and a stringent peptide FDR is set to control false positives.
On the second row, one-hit wonders are filtered out. The significance threshold can be set quite high, resulting in high peptide FDR, while still reaching 1% protein FDR.
In fact, we can report a lot more proteins at 1% FDR by retaining the one-hit wonders. This is partly because the numbers of peptides and proteins being reported are both small compared with the size of the database. It is also a function of the peptide match score distribution. If we were to search a much smaller database, keeping everything else the same, we might find the situation would reverse, and we could report more proteins for a given FDR by setting Min. number of sig. unique sequences to a higher value.
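The percentages in the example table are consistent with a simple ratio of decoy to target counts. The sketch below recomputes them on that assumption; the helper `fdr_percent` and the dictionary keys are illustrative names, and the ratio is inferred from the table values rather than taken from Mascot's documentation.

```python
def fdr_percent(decoy: int, target: int) -> float:
    """Estimated FDR as a percentage, assuming FDR = decoy count / target count."""
    return 100.0 * decoy / target


# Counts copied from the example table
rows = [
    {"min_sig_seq": 1, "target_seq": 7763, "decoy_seq": 41,
     "target_prot": 3832, "decoy_prot": 39},
    {"min_sig_seq": 2, "target_seq": 10065, "decoy_seq": 618,
     "target_prot": 2281, "decoy_prot": 23},
]

for row in rows:
    peptide_fdr = fdr_percent(row["decoy_seq"], row["target_seq"])
    protein_fdr = fdr_percent(row["decoy_prot"], row["target_prot"])
    print(
        f"Min. sig. seq. {row['min_sig_seq']}: "
        f"peptide seq. FDR {peptide_fdr:.2f}%, protein FDR {protein_fdr:.2f}%"
    )
```

Running the same calculation on your own target and decoy counts is a quick way to check how a change in Min. number of sig. unique sequences trades peptide FDR against protein FDR.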
Note that Min. number of sig. unique sequences only affects the count of proteins. It doesn’t change the count of significant peptides. Significant peptide matches assigned to proteins that are dropped are moved to the unassigned list.
Other filters
Many other useful filters are available. For example, you can filter by database so as to remove contaminants from the final table. For important work, it is worth loading the report for the decoy matches (the link is in the decoy section), applying the same filters, and seeing how many false proteins you would report, if any.