Quantify then identify or identify then quantify?
Quantitation in Mascot Distiller is identification-driven: all of the MS/MS data is searched, then all of the identified peptides that meet the requirements of the method are quantified.
Conventional wisdom is that a feature-driven approach is more efficient: first look for features that indicate up- or down-regulation between samples, then target these for quantitation. The reasoning is perfectly clear if we think of using 2D gels to separate and visualise complex mixtures of proteins. Comparing the gel images is an extremely powerful way of spotting (excuse the pun) the few proteins that have been up- or down-regulated. The natural workflow is to visualise first, then use mass spectrometry to identify the proteins in the spots of interest.
Yet, what proportion of work today is done using 2D gels? The trend towards shotgun experiments has been very marked in recent years. Is the feature-driven approach relevant to shotgun proteomics?
Quantitation in bottom-up, discovery proteomics rests on the assumption that peptide abundance is a good surrogate for protein abundance. This is a safe bet if you can identify and quantify a large number of peptides for the protein of interest. We take the average or median peptide ratio and don’t worry too much about the outliers. On the other hand, it is a very risky proposition when you measure only a handful of peptides. The precise sources of error depend on whether the quantitation experiment is label free, label before digestion, or label after digestion. Whatever your choice, you might be able to eliminate some factors, but not all, and there is always the problem of unrecognised interferences in the spectra that skew peak area measurement. All peptide-based quantitation data sets show substantial scatter in the measured values, and we can only have confidence in a protein ratio when there are enough peptide ratio measurements to characterise the shape of the distribution.
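To make this concrete, here is a minimal sketch in plain Python (not the Mascot Distiller implementation) of rolling peptide ratios up to a protein ratio: take the median of the log ratios, report a robust measure of scatter, and refuse to report anything when there are too few peptides to characterise the distribution. The three-peptide cut-off and the use of the median absolute deviation are illustrative assumptions, not recommendations.

    import math
    from statistics import median

    def protein_ratio(peptide_ratios, min_peptides=3):
        # Summarise the peptide-level ratios assigned to one protein.
        # Returns (median ratio, scatter in log2 units), or None when there are
        # too few measurements to characterise the shape of the distribution.
        if len(peptide_ratios) < min_peptides:
            return None
        logs = [math.log2(r) for r in peptide_ratios]
        centre = median(logs)
        # median absolute deviation: a measure of scatter that is insensitive to outliers
        mad = median(abs(x - centre) for x in logs)
        return 2 ** centre, mad

    # five peptide ratios with one obvious outlier: the median shrugs it off
    print(protein_ratio([1.1, 0.9, 1.0, 1.2, 3.5]))
    # a single peptide: no protein ratio is reported at all
    print(protein_ratio([2.8]))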
So, while we might start with a feature-driven approach, this gives no improvement in efficiency, because we still have to identify and quantify all of the other peptides to discover whether an apparently up- or down-regulated peptide truly reflects the abundance of the protein, or whether it is the result of some variability in digestion, labelling, or modification. If we skip this step, every outlier peptide will be taken to represent a regulated protein, and the majority of these calls will be wrong.
A recent article from Nonlinear Dynamics suggests that the quantify-first approach "promotes identification of low abundance peptides [...] which have a low probability of being selected for MS/MS scans in the data driven acquisition approach". Sounds good in theory, but what is the reality? The weaker the signal, the more likely it is to show variation due to noise or counting statistics, so the initial list of features may be quite a long one. We generate an inclusion list and analyse the sample a second time, aiming to get MS/MS for each of the interesting features. If we could accept protein quantitation based on a single peptide, we could stop there, but this is not the case. To obtain meaningful results, we need to measure as many additional peptides as possible for each protein. This means that, for each identified peptide, we have to make a list of all possible proteins and, for each of these proteins, calculate masses for all of the peptides expected from digestion. Potentially, a very long list. We then analyse the sample a third time, with this new inclusion list, trying to measure enough peptides for each protein to be confident in the protein abundance. A heroic experiment, which I suspect few people attempt.
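To give a feel for the bookkeeping involved in that third pass, here is a rough sketch of the in-silico digestion step for a single candidate protein: generate the expected tryptic peptides and the precursor m/z values that would be added to the inclusion list. The cleavage rule (after K or R, but not before P), the fixed charge state, the absence of missed cleavages and modifications, and the example sequence are all simplifying assumptions; real software has to cover far more ground.

    import re

    # monoisotopic residue masses (Da)
    RESIDUE = {
        'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
        'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
        'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
        'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
        'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
    }
    WATER, PROTON = 18.010565, 1.007276

    def tryptic_peptides(sequence):
        # split after K or R, except when followed by P; no missed cleavages
        return re.split(r'(?<=[KR])(?!P)', sequence)

    def inclusion_list(sequence, charge=2, min_length=6):
        # expected precursor m/z values for the tryptic peptides of one protein
        rows = []
        for pep in tryptic_peptides(sequence):
            if len(pep) < min_length:
                continue
            mass = sum(RESIDUE[aa] for aa in pep) + WATER
            mz = (mass + charge * PROTON) / charge
            rows.append((pep, round(mz, 4)))
        return rows

    # hypothetical protein fragment, purely for illustration
    for pep, mz in inclusion_list("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFK"):
        print(pep, mz)

Repeat this for every protein that could explain every identified peptide, and the list grows very quickly, which is exactly why the third pass becomes heroic.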
Keywords: quantitation
I agree that identify then quantify (ItQ) usually brings an interesting list. If not, quantify then identify (QtI) is a good option. Nowadays people want to look at the bottom of the basket – well, I should have replaced “nowadays” with “always”. I think that QtI is valuable. Regarding your post, there is a missing point: the borderline proteins. There are, or should be, proteins that would have been ItQ’d with only 1 peptide, and QtI could bring in one or a few more peptides for such proteins. As you pointed out, the inclusion list is run once, sometimes twice, rarely more, unless you have a strong belief in the study. I think there is room for both approaches.
The most important point, in my opinion, is protein and quantitation inference. Protein inference seems to be based only on identification, and quantitation is used only for the ratio, later on. Protein inference from peptides alone was fine when people were comparing lists of proteins. Now it’s the quantitation age! Quantitation should be incorporated into protein inference, and shared peptides should be taken into account whenever possible. Hopefully, a few recent articles are starting to address this.
Thanks for the article. I have been working on refining the “quantify then identify” (QI) approach as I do believe that, done properly, it is the most efficient way to find regulated proteins. The issues of outliers (and protein inference) do make things tricky, but here is my thought: are these challenges really dealt with much better in the standard identify-as-much-as-possible-then-quantify (IQ) approach? In both approaches one may ask “How many peptides belonging to a protein need to be identified and quantified before you can accurately reveal outliers? 2, 6, 20 peptides?” Yes, with the QI approach you may be led down the wrong path by focussing only on the changing peptides, which may be outliers (due to technical variation). But if you keep clear of proteins quantified with only a single peptide and use peptides that are changing across multiple biological replicates (with low p values), is the QI approach really going to be more misleading than the standard IQ approach?
If you observe 20 peptides of different sequence from the same protein and all are significantly regulated then I agree this is strong evidence that the protein is regulated. And, with 20 peptides, the protein inference is likely to be secure. If it was 6 peptides, maybe the confidence level would be a little lower, especially on the protein inference. If 2, then very risky on both counts. Guess it comes down to whether it is acceptable to obtain results from only the most abundant proteins in a discovery experiment.