Machine learning core features
Refining results with machine learning requires a set of features (or metrics) for each peptide-spectrum match. When enabled, Percolator uses these features to find an optimal separation between correct and incorrect matches.
Two types of features are available: core features calculated by Mascot, and features predicted from physico-chemical properties of peptides. Features calculated by Mascot are described below. For predicted features, please refer to Predicting retention time and spectral similarity with MS2Rescore.
Core features calculated by Mascot
Core features calculated by Mascot are always available and always enabled. The complete set of features that can be made available to Percolator is defined in code. You can choose a subset of these features using a setting in the Options section of the Mascot configuration file, mascot.dat. The default setting, as shipped, is:
PercolatorFeatures dM, mScore, MIT, MHT, peptideLength, z1, z2, z4, z7, isoSysDM, isoSysDMppm, isoSysDMz, 12C, mc0, mc1, mc2, varmods, varmodsCount, totInt, intMatchedTot, relIntMatchedTot, RMS, RMSppm, meanAbsFragDa, meanAbsFragPPM, rawScore
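If you need to inspect or validate this setting from a script, the line can be split into individual feature names. The following is a minimal sketch in Python, assuming the setting appears on a single line as the keyword, a space, and a comma-separated list, as shown above. The function name parse_percolator_features is hypothetical and is not part of Mascot.

```python
def parse_percolator_features(line: str) -> list[str]:
    """Split a 'PercolatorFeatures a, b, c' line into a list of feature names.

    Hypothetical helper for illustration only; Mascot reads mascot.dat itself.
    """
    keyword, _, value = line.strip().partition(" ")
    if keyword != "PercolatorFeatures":
        raise ValueError(f"not a PercolatorFeatures line: {line!r}")
    # Feature names are separated by commas, with optional surrounding spaces.
    return [name.strip() for name in value.split(",") if name.strip()]


# The default setting, copied from the line above.
default_line = (
    "PercolatorFeatures dM, mScore, MIT, MHT, peptideLength, z1, z2, z4, z7, "
    "isoSysDM, isoSysDMppm, isoSysDMz, 12C, mc0, mc1, mc2, varmods, varmodsCount, "
    "totInt, intMatchedTot, relIntMatchedTot, RMS, RMSppm, meanAbsFragDa, "
    "meanAbsFragPPM, rawScore"
)
features = parse_percolator_features(default_line)
print(len(features))  # 26
```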
Features are calculated by a Mascot executable called ms-createpip.exe. The full list of supported features is:
Feature name | Description |
---|---|
retentionTime | Retention time in seconds if available |
dM | Calculated minus observed peptide mass in Da |
mScore | Mascot score (always on) |
lgDScore | Mascot score minus Mascot score of next best non-isobaric peptide hit |
mrCalc | Calculated Mr |
charge | Charge |
dMppm | Calculated minus observed peptide mass in ppm |
absDM | Absolute value of calculated minus observed peptide mass in Da |
absDMppm | Absolute value of calculated minus observed peptide mass in ppm |
isoDM | Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da |
isoDMppm | Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm |
isoDmz | Absolute value of calculated minus observed peptide m/z |
isoSysDM | Same as isoDM but corrected for systematic offset across all peptide matches. |
isoSysDMppm | Same as isoDMppm but corrected for systematic offset across all peptide matches. |
isoSysDmz | Same as isoDmz but corrected for systematic offset across all peptide matches. |
mc | Number of missed cleavages (always 0 if no enzyme) |
varmods | Number of modified sites divided by number of modifiable sites (set to 0 if number of modifiable sites is 0) |
varcount | Number of distinct varmods present |
varmodsCount | The number of variable mods used in the peptide. That is, if there are 10 Met and 5 of these are oxidised, this counts as varmodsCount=1. A peptide with Met-OX, phosphoS, deamidation, and acetylation, would count as varmodsCount=4. |
modifiable | Total number of modifiable sites |
modified | Total number of modified residues and termini |
totInt | Log total ion intensity. The 20 most intense peaks in each 100 Da bin are used for all features, and totInt reports this value |
intMatchedTot | Log total matched ion intensity |
relIntMatchedTot | Total matched ion intensity divided by total ion intensity as a percentage (no logs involved) |
fragDeltaMed | Median value of all matched fragment errors in Da |
fragDeltaIqr | Interquartile range value of all matched fragment errors in Da |
fragDeltaMedPPM | Median value of all matched fragment errors in ppm |
fragDeltaIqrPPM | Interquartile range value of all matched fragment errors in ppm |
fragDeltaPolyFit | 2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100 |
longest | Longest sequence of matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched |
fracIonsMatched | Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv) |
matchedIntensity | Matched ion intensity, reported separately for each ion series, as with fracIonsMatched |
qmatch | The number of peptide matches for which an ms-ms match was attempted |
MIT | Mascot identity threshold |
MHT | Mascot homology threshold |
peptideLength | Peptide length |
z1 | 1 if charge = 1 |
z2 | 1 if charge = 2 or 3 |
z4 | 1 if charge = 4, 5, or 6 |
z7 | 1 if charge = 7 or more |
12C | 1 if peptide mass is 12C value (no isotope error) |
mc0 | 1 if missed cleavages = 0 or if no enzyme |
mc1 | 1 if missed cleavages = 0 or 1 |
mc2 | 1 if missed cleavages = 2 or more |
RMS | RMS m/z error for matched fragments |
RMSppm | RMS ppm error for matched fragments |
meanAbsFragDa | Mean absolute m/z error for matched fragments |
meanAbsFragPPM | Mean absolute PPM error for matched fragments |
rawscore | Simple binomial score using matches to main series sequence ions and p = 2*ITOL*n/100 where n is the number of peaks selected in each 100 Da bin |
peptide | The peptide string that was matched, interspersed with numbers to represent modifications, e.g. X.DAKAAM1AGRLM1IR.X |
proteins | A tab separated list of accessions of proteins that contain this peptide. Must be last feature in list |
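
To make a few of the simpler definitions in the table concrete, here is a minimal sketch in Python that follows the descriptions literally. It is not the code used by ms-createpip.exe. All function and parameter names are hypothetical, and the inputs (calculated and observed mass, charge, site counts, intensities) are assumed to be already known for the match.

```python
def mass_features(calc_mr: float, obs_mr: float) -> dict[str, float]:
    """dM and absDM: calculated minus observed peptide mass in Da."""
    dm = calc_mr - obs_mr
    return {"dM": dm, "absDM": abs(dm)}


def charge_flags(charge: int) -> dict[str, int]:
    """Indicator features z1, z2, z4 and z7, binned as in the table."""
    return {
        "z1": int(charge == 1),
        "z2": int(charge in (2, 3)),
        "z4": int(charge in (4, 5, 6)),
        "z7": int(charge >= 7),
    }


def varmods_ratio(modified_sites: int, modifiable_sites: int) -> float:
    """varmods: modified sites divided by modifiable sites, 0 if none are modifiable."""
    if modifiable_sites == 0:
        return 0.0
    return modified_sites / modifiable_sites


def rel_int_matched_tot(matched_intensity: float, total_intensity: float) -> float:
    """relIntMatchedTot: matched intensity as a percentage of total intensity (no logs)."""
    return 100.0 * matched_intensity / total_intensity


print(charge_flags(3))        # {'z1': 0, 'z2': 1, 'z4': 0, 'z7': 0}
print(varmods_ratio(5, 10))   # 0.5
```

The charge flags are mutually exclusive, so exactly one of them is 1 for any charge of 1 or more.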