Data file format
A Mascot data file is a plain text (ASCII) file containing peak list information and, optionally, search parameters.
For a Peptide Mass Fingerprint, the file should contain a list of peptide mass values, one per line, optionally followed by white space and a peak area or intensity value. The Mascot generic format (MGF) is recommended for PMF searches.
For an MS/MS Ions Search, the data file must contain one or more MS/MS peak lists. The recommended format is the Mascot generic format (MGF). In MGF, each MS/MS dataset is a list of pairs of mass and intensity values, delimited by BEGIN IONS and END IONS statements. Mascot also supports mzML (.mzML).
Earlier versions of Mascot Server supported a range of proprietary file formats. These obsolete data file formats are still available in Mascot Server 3.0, but they are hidden by default. Support for the obsolete formats may be removed in a future release.
Mascot Generic Format
The following paragraphs illustrate the data file formats by means of examples. The rules which Mascot follows when parsing a data file provide an alternative description of what is and is not acceptable.
The Mascot generic format for a data file submitted to Mascot is:
[Embedded Parameter(s)]
Query 1
[Query 2]
.
.
.
[Query N]
Blank lines can be used anywhere to improve readability. Square brackets indicate optional elements; they should not be included in an actual data file.
Comment lines beginning with one of the symbols #;!/ can be included, but only outside of the BEGIN IONS and END IONS statements that delimit an MS/MS dataset.
A data file may include embedded search parameters. Most embedded parameters can only appear once, at the head of the data file. Certain parameters can appear within an MS/MS dataset.
If there is a conflict between the values of the embedded parameters and values entered into search form fields, the embedded parameters always take precedence. The search form fields are essentially defaults for values missing from the data file.
Peptide Mass Fingerprint
In the case of a Peptide Mass Fingerprint, each query is just a single peptide m/z value, with an optional second value for peak area or intensity. For example:
764.2
1231.0
1284
1944.8
2020.2
2100.35
Or
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566
If your MS data system outputs additional values on each line, these will be ignored.
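The relaxed PMF line syntax described above can be sketched as a small parsing function. This is only an illustration, not Mascot's own parser: the function name `parse_pmf_line` is hypothetical, and the sketch does not handle a comma used as a thousands separator.

```python
def parse_pmf_line(line):
    """Return (mz, intensity) for a PMF peak line, or None if it is not a peak.

    intensity is None when the line carries only an m/z value.
    Anything after the second value is ignored.
    """
    fields = line.split()
    if not fields:
        return None
    try:
        mz = float(fields[0])
    except ValueError:
        return None  # not a numeric line, e.g. a parameter or a comment
    intensity = None
    if len(fields) > 1:
        try:
            intensity = float(fields[1])
        except ValueError:
            intensity = None  # second field is text rather than a number
    return mz, intensity
```

For example, `parse_pmf_line("2100.35 566 extra 12")` keeps the first two values and drops the rest.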
There are two ways to change default search parameters. One way is using the search form fields. The other is to place embedded parameters at the beginning of the data file. For example:
COM=Digest #A6345
CLE=Lys-C
CHARGE=1+
PFA=1
764.2 2010
1231.0 2345
1284 456
1944.8 1012
2020.2 23
2100.35 566
The embedded parameters (COM, CLE, CHARGE, PFA) override the entries in the corresponding form fields, if any. All of the other search parameters default to the search form settings.
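The precedence rule (embedded parameters win, form fields act as defaults) can be sketched as a dictionary overlay. The function name `effective_parameters` and the dictionary representation of the form fields are assumptions made for this illustration; they are not part of Mascot.

```python
COMMENT_CHARS = "#;!/"


def effective_parameters(form_fields, data_file_lines):
    """Overlay embedded header parameters on the search form values."""
    # Parameter labels are not case sensitive, so normalise them.
    params = {label.upper(): value for label, value in form_fields.items()}
    for raw in data_file_lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line[0] in COMMENT_CHARS:
            continue  # blank lines and comments carry no parameters
        if "=" not in line or line.lstrip()[0].isdigit():
            break  # first query line reached: the header is over
        label, value = line.split("=", 1)
        params[label.upper()] = value  # embedded value wins over the form field
    return params
```

With form fields `{"CLE": "Trypsin", "TOL": "1.0"}` and the data file above, the result would have `CLE` set to `Lys-C` while `TOL` keeps the form value.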
A peptide mass fingerprint data file can only contain peptide mass fingerprint queries. Sequence queries or MS/MS datasets are not permitted.
MS/MS Ions Search
For an MS/MS Ions Search, each query represents a complete MS/MS spectrum, and is delimited by a pair of statements: BEGIN IONS and END IONS.
The search form defaults can be overridden by including embedded parameters at the beginning of the data file. Parameters specified in the search form or the data file header apply to the entire search. Within each MS/MS query, the mass of the precursor peptide(s) must be specified using one or more PEPMASS parameters. Precursor intensity and charge can be specified by including additional values on the PEPMASS line, delimited by white space. Specifying multiple PEPMASS lines for a query is useful with chimeric spectra.
Certain additional parameters can be specified at query level, between BEGIN IONS and END IONS, as shown in the table below. Parameters within an MS/MS query only apply locally, to the one spectrum. In the case of the CHARGE parameter, this means that you can have a global CHARGE setting, either from the search form or from a parameter at the head of the data file, as well as a local setting in one or more of the MS/MS queries.
This can be useful if the mass spectrometer data system cannot always determine precursor charge state correctly. For example, the global setting could be 2+ and 3+. When an unambiguous charge state can be determined, the correct charge is written to the local CHARGE parameter. Parameters within an MS/MS query must always be at the beginning, immediately following the BEGIN IONS tag. They cannot appear within or following the fragment ion list. For example:
COM=10 pmol digest of Sample X15
ITOL=1
ITOLU=Da
MODS=Carbamidomethyl (C)
IT_MODS=Oxidation (M)
MASS=Monoisotopic
USERNAME=Lou Scene
USEREMAIL=leu@altered-state.edu
CHARGE=2+ and 3+
BEGIN IONS
TITLE=Spectrum 1
PEPMASS=983.6
846.60 73
846.80 44
847.60 67
.
.
.
1640.10 291
1640.60 54
1895.50 49
END IONS
BEGIN IONS
TITLE=Spectrum 2
PEPMASS=1084.9
SCANS=3
RTINSECONDS=25
345.10 237
370.20 128
460.20 108
.
.
.
1673.30 1007
1674.00 974
1675.30 79
END IONS
BEGIN IONS
TITLE=Spectrum 3
PEPMASS=1244.7
SCANS=29-34
RTINSECONDS=95-97
.
.
.
In the fragment ion list, the first value is the fragment m/z, the second is the intensity, and the optional third is the fragment charge.
Fragment ion intensity information is very important. Mascot will iteratively select sub-sets of the most intense peaks, looking for the group which most clearly discriminates the score of the top matched protein. There is an upper limit of 10,000 peaks per individual MS/MS spectrum. If you see an error message reporting that this limit has been exceeded, it almost certainly means that your data are profile data, and not peak lists. It is very unlikely that a single MS/MS spectrum could ever contain more than 1000 genuine peaks, never mind 10,000.
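As an illustration of the MS/MS query layout, the following sketch reads BEGIN IONS / END IONS blocks into simple dictionaries and applies the 10,000 peak limit. It is a simplified reader written for this page, not Mascot's parser: the name `parse_mgf_queries` is hypothetical, header parameters are skipped, the delimiter lines are matched exactly as written, and the fragment charge is kept as raw text because its notation is not shown in the examples.

```python
MAX_PEAKS = 10_000  # per-spectrum limit quoted in the text


def parse_mgf_queries(lines):
    """Yield one dict per BEGIN IONS / END IONS block.

    'params' maps each label to a list of values (PEPMASS may repeat);
    'peaks' is a list of (mz, intensity, charge_text_or_None) tuples.
    """
    query = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            query = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if query is not None:
                if len(query["peaks"]) > MAX_PEAKS:
                    raise ValueError("more than 10,000 peaks: profile data?")
                yield query
            query = None
        elif query is None:
            continue  # header parameters and comments are skipped in this sketch
        elif line[0].isdigit():
            fields = line.split()
            mz = float(fields[0])
            intensity = float(fields[1]) if len(fields) > 1 else None
            charge = fields[2] if len(fields) > 2 else None
            query["peaks"].append((mz, intensity, charge))
        elif "=" in line:
            label, value = line.split("=", 1)
            query["params"].setdefault(label.upper(), []).append(value)
```

Storing parameter values in lists is a design choice that keeps multiple PEPMASS lines from a chimeric spectrum together.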
It is possible for an MS/MS ions search data file in the Mascot generic format to include sequence queries and peptide mass fingerprint queries.
Here is a rather baroque example:
# following lines define parameters.
# NB no spaces allowed on either side of the = symbol
COM=My favourite protein has been eaten by an enzyme
CLE=Trypsin
CHARGE=2+
# following line will be treated as a peptide mass
1024.6
# following line is a sequence query, which must
# conform precisely to sequence query syntax rules
2321 seq(n-ACTL) comp(2[C])
# so is this
1896 ions(345.6:24.7,347.8:45.4, ... ,1024.7:18.7)
# An MS/MS ions query is delimited by the tags
# BEGIN IONS and END IONS. Space(s)
# are used to separate mass and intensity values
BEGIN IONS
TITLE=The first peptide - dodgy peak detection, so extra wide tolerance
PEPMASS=896.05 25674.3
CHARGE=3+
TOL=3
TOLU=Da
SEQ=n-AC[DHK]
COMP=2[H]0[M]3[DE]*[K]
240.1 3
242.1 12
245.2 32
.
.
.
1623.7 55
1624.7 23
END IONS
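To make the structure of such a mixed file easier to follow, the sketch below labels each line by its role. It is a rough illustration only: the function name `classify_lines` and the category names are invented here, and the test for a sequence query (a line that starts with a number and contains an opening parenthesis) is a simplification of the real sequence query syntax.

```python
COMMENT_CHARS = "#;!/"


def classify_lines(lines):
    """Yield (kind, text) for each non-blank line of a Mascot generic file."""
    in_query = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            in_query = True
            yield "begin", line
        elif line == "END IONS":
            in_query = False
            yield "end", line
        elif not in_query and line[0] in COMMENT_CHARS:
            yield "comment", line  # comments are only valid outside a query block
        elif line[0].isdigit():
            if in_query:
                yield "fragment", line
            elif "(" in line:
                yield "sequence_query", line
            else:
                yield "peptide_mass", line
        elif "=" in line:
            yield ("local_param" if in_query else "header_param"), line
        else:
            yield "other", line
```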
Embedded Search Parameters
Search parameters can be embedded into the data file or entered in the search form query window using the following parameter labels. In the absence of an embedded parameter, the default value is the setting of the corresponding search form field.
The FORMAT parameter is used to identify obsolete MS/MS dataset formats. It can appear once only, at the start of the file. If there is no FORMAT parameter, the default is Mascot generic format (MGF).
If the peak list format is not MGF, then parameters can only appear once, in the data file header, before the peak list begins.
For an MGF peak list, parameters with a tick in the Header column of the table below can appear in the header and those with a tick in the Local column can appear in the local scope of a single MS/MS query (spectrum). That is, after the BEGIN IONS line and before the fragment mass and intensity values.
Name | Description | Header | Local | Choices/Range | Notes |
---|---|---|---|---|---|
ACCESSION | Database entries to be searched | | | List of double quoted, comma separated values | |
CHARGE | Peptide charge | | | 1- | M-H- on PMF form |
| | | | | Mr | |
| | | | | 1+ | MH+ on PMF form |
| | | | | N- to N+ where N is an integer and combinations | Not PMF |
CLE | Enzyme | | | Trypsin etc., as defined in enzymes file | No default, so must be specified |
COM | Search title | | | | Applies to the whole search |
CUTOUT | Precursor removal | | | Pair of comma separated integers | MIS only |
COMP | Amino acid composition | | | | |
CROSSLINKING | Crosslinking method | | | as defined in crosslinking.xml | MIS only |
DB | Database | | | As defined in mascot.dat | |
DECOY | Perform decoy search | | | 0 (false) | Default |
| | | | | 1 (true) | |
ERRORTOLERANT | Automatic second pass search of selected modification classes | | | 0 (false) | Default |
| | | | | 1 (true) | Not PMF |
ET_CLASSIFICATIONS | Restrict error tolerant search space | | | Zero or more classifications as defined in unimod_2.xsd | |
ETAG | Error tolerant sequence tag | | | | A single query can have multiple ETAGs |
FORMAT | MS/MS data file | | | Mascot generic | Default |
| | | | | mzML (.mzML) | |
| | | | | Sequest (.DTA) | Obsolete |
| | | | | Finnigan (.ASC) | Obsolete |
| | | | | Micromass (.PKL) | Obsolete |
| | | | | PerSeptive (.PKS) | Obsolete |
| | | | | Sciex API III | Obsolete |
| | | | | Bruker (.XML) | Obsolete |
| | | | | mzData (.XML) | Obsolete |
FRAMES | NA translation | | | Comma separated list of frames | Default is 1,2,3,4,5,6 |
INSTRUMENT | MS/MS ion series | | | Default | Default |
| | | | | ESI-QUAD-TOF etc., as defined in fragmentation_rules | |
ION_MOBILITY | Drift time | | | floating point number | |
IT_MODS | Variable Mods | | | As defined in unimod.xml | |
ITOL | Fragment ion tol. | | | Unit dependent | |
ITOLU | Units for ITOL | | | ppm | |
| | | | | Da | |
| | | | | mmu | |
LIBRARY_SEARCH | Allow search to include spectral libraries | | | 0 (false) | Default |
| | | | | 1 (true) | |
LOCUS | Hierarchical scan range identifier | | | string | MIS only |
MASS | Mono. or average | | | Monoisotopic | |
| | | | | Average | |
ML_ADAPTER_PARAM | Parameters to pass to machine learning adapters | | | | May appear zero or more times. Allowed values are defined in ML_adapters.toml config file. |
MODS | Fixed Mods | | | As defined in unimod.xml | |
MULTI_SITE_MODS | Allow two modifications at a single site | | | 0 (false) or 1 (true) | default 0 |
PEP_ISOTOPE_ERROR | Misassigned 13C | | | 0 to 2 | MIS only |
PEPMASS | Peptide mass | | | >100 | optionally followed by intensity and charge; multiple lines allowed if chimeric spectrum |
PERCOLATE | Refine results with machine learning | | | 0 (false) | Default |
| | | | | 1 (true) | |
PFA | Partials | | | integer, 0 to 9 | default 1 |
PRECURSOR | Precursor m/z | | | >100 | |
QUANTITATION | Quantitation method | | | as defined in quantitation.xml | MIS only |
RAWFILE | Raw file identifier | | | string | MIS only |
RAWSCANS | Native scan range identifiers | | | a[:b] | MIS only |
REPORT | | | | | Obsolete |
REPTYPE | | | | | Obsolete |
RTINSECONDS | Retention time or range (in seconds) | | | a[-b] | MIS only |
SCANS | Scan number or range | | | v[-w] | MIS only |
SEARCH | Type of search | | | PMF | |
| | | | | SQ | = MIS |
| | | | | MIS | = SQ |
SEG | Protein mass (kDa) | | | Empty or >0 | |
SEQ | Amino acid sequence | | | | A single query can have multiple SEQs |
TAG | Sequence tag | | | | A single query can have multiple TAGs |
TARGET_FDR_PERCENT | Target FDR | | | Floating point number between 0 and 100 | default 0 |
TAXONOMY | Taxonomy | | | As defined in taxonomy file | |
TITLE | Query title | | | | Applies to a single spectrum |
TOL | Peptide mass tol. | | | Unit dependent | |
TOLU | Units for TOL | | | % | |
| | | | | ppm | |
| | | | | mmu | |
| | | | | Da | |
USER00 to USER12 | Uncommitted parameters | | | | |
USEREMAIL | User email | | | | |
USERNAME | User name | | | | |
Search parameters that override global defaults set in the Options section of mascot.dat are prefixed OPTION_. These parameters can only appear in the peak list header.
OPTION_DechargeFragmentPeaks overrides DechargeFragmentPeaks. If the MGF peak list contains charge information for fragments, this positive integer is the maximum absolute charge state to be decharged, default 10. A value of 0 means ignore the fragment charge state. Peaks will be decharged to MH+ or MH- values when three conditions are satisfied: (i) fragment charge information is present in the peak list, (ii) the MH+ or MH- value will be less than 16384, (iii) the MH+ or MH- value will be less than that of the precursor.
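The three decharging conditions can be illustrated with a short function. This is a sketch under stated assumptions, not Mascot's implementation: the conversion uses the usual relation between m/z, charge and the proton mass, the proton mass constant is a standard value rather than one taken from this page, and the function name `decharge` and its signed-integer charge argument are invented for the example.

```python
PROTON_MASS = 1.007276  # Da; standard value, an assumption of this sketch
MAX_DECHARGED_VALUE = 16384


def decharge(mz, charge, precursor_singly_charged, max_charge=10):
    """Return the MH+ (or MH-) equivalent of a fragment peak, or mz unchanged.

    charge is a signed integer, e.g. 2 for 2+ and -2 for 2-; 0 means unknown.
    max_charge plays the role of DechargeFragmentPeaks (0 = ignore charges).
    """
    z = abs(charge)
    if z <= 1 or max_charge == 0 or z > max_charge:
        return mz  # condition (i) not met, or decharging not requested
    sign = 1 if charge > 0 else -1
    # Positive mode: MH+ = z*(m/z) - (z-1)*proton; negative mode mirrors it.
    singly = z * mz - sign * (z - 1) * PROTON_MASS
    if singly < MAX_DECHARGED_VALUE and singly < precursor_singly_charged:
        return singly  # conditions (ii) and (iii) are both satisfied
    return mz
```

For instance, a 2+ fragment at m/z 500.0 would become roughly 998.9927 provided the precursor MH+ is larger than that.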
OPTION_MaxPepNumVarMods overrides MaxPepNumVarMods. The maximum number of different variable mods allowed in a single peptide match.
OPTION_MaxPepNumModifiedSites overrides MaxPepNumModifiedSites. The maximum number of sites carrying variable mods allowed in a single peptide match.
OPTION_MaxPepModArrangements overrides MaxPepModArrangements. The maximum number of arrangements of variable mods tested to obtain a single peptide match.
Specifying a scan or time range
Although scan and retention time information is not used directly in the Mascot search, it can be very useful for applications that import the Mascot search results. Two obvious cases are quantitation and refining results using machine learning. If a peak list contains data from multiple raw files, annotating scan and retention time information in a structured and non-verbose manner can become complicated. The MGF format includes a choice of parameters for this purpose:
RTINSECONDS Anything from a single retention time to a complex list of retention time ranges. This parameter is for passing machine readable information, not for display, so there is no RTINMINUTES, etc. When there are multiple raw files, there can be multiple RTINSECONDS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RTINSECONDS[0].
SCANS Anything from a single scan number to a complex list of ranges, e.g. SCANS=1278,1280-1284,1290-1294,1298. When there are multiple raw files, there can be multiple SCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. SCANS[3].
RAWSCANS Identifiers corresponding to the data structure in the raw file. A two letter abbreviation followed by a number for each level of the hierarchy and a colon is used to delimit the start and end of a range. When there are multiple raw files, there can be multiple RAWSCANS entries in a single query, each with a zero-based index that relates to a specific raw file, e.g. RAWSCANS[1].
For example, AB Sciex Analyst scans are characterised by a triplet of period, cycle, and experiment, which is represented as pd1cy2ex3.
- Analyst: pd1cy2ex3
- Masslynx: fn2ix1
- LCMS Solution: sg1ev4sn53
- Kratos Axima: wlJ5
- Generic (scan number), for Xcalibur, mzXML, Bruker .yep/.baf, Agilent QTOF: sn492
RAWFILE An identifier to relate a query back to one or more raw files. Can be a file name or file path or anything else that is meaningful to the downstream application.
LOCUS A hierarchical identifier used mainly by AB Sciex software. A unique combination of file, sample, period, cycle, and experiment might be represented as 2.1.1.24.1. Mascot treats this as a string and simply passes it through to the result file, so the content can be anything meaningful to the downstream application.
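A downstream application that reads these values back might parse them along the following lines. Both function names are hypothetical, and the RAWSCANS pattern (two letters followed by a number, repeated) is taken from the description above; an identifier such as the Kratos Axima example does not fit that pattern and is rejected by this sketch.

```python
import re


def parse_scans(value):
    """Turn 'a,b-c,...' into a list of (first, last) integer ranges."""
    ranges = []
    for part in value.split(","):
        first, _, last = part.strip().partition("-")
        ranges.append((int(first), int(last or first)))
    return ranges


def parse_rawscan_id(identifier):
    """Split e.g. 'pd1cy2ex3' into [('pd', 1), ('cy', 2), ('ex', 3)]."""
    if not re.fullmatch(r"(?:[A-Za-z]{2}[0-9]+)+", identifier):
        raise ValueError(f"unexpected RAWSCANS identifier: {identifier!r}")
    pairs = re.findall(r"([A-Za-z]{2})([0-9]+)", identifier)
    return [(tag, int(number)) for tag, number in pairs]
```

A RAWSCANS range would first be split on the colon, with each side passed to `parse_rawscan_id`.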
When there are multiple raw files, adding an index is the most concise way of connecting queries and raw file(s). For example, an MGF peak list from Distiller for a multi-file project might look like this:
_DISTILLER_RAWFILE[0]={1}C:\data\replicate\Orbi_0319_01.RAW
_DISTILLER_RAWFILE[1]={1}C:\data\replicate\Orbi_0319_02.RAW
_DISTILLER_RAWFILE[2]={1}C:\data\replicate\Orbi_0319_08.RAW
.
.
.
BEGIN IONS
TITLE=22927: Scan rt=4669.74 from file [2]
PEPMASS=797.36086 89994.258
CHARGE=2+
SCANS[2]=48055
RAWSCANS[2]=sn8964
RTINSECONDS[2]=4669.736
227.05463 199.54773
242.21568 120.42233
.
.
.
This is fine if a single application creates the merged peak list. It cannot be used so easily when one application creates a peak list from each file and a second application independently merges these peak lists into a single search. In such cases, the RAWFILE or LOCUS parameter can be used to embed an identifier into each query as the peak list is created. This identifier then travels with the query as the peak lists are merged and is written to the search result file by Mascot.
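Reading the indexed form of a parameter label is straightforward. The sketch below, with a hypothetical function name and a regular expression written for this page, splits a parameter line into label, optional raw file index, and value.

```python
import re

# LABEL=value or LABEL[n]=value, where n is the zero-based raw file index.
PARAMETER = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\[([0-9]+)\])?=(.*)$")


def split_parameter(line):
    """Return (label, index_or_None, value) for a parameter line."""
    match = PARAMETER.match(line)
    if match is None:
        raise ValueError(f"not a parameter line: {line!r}")
    label, index, value = match.groups()
    # Labels are not case sensitive; values are kept exactly as written.
    return label.upper(), (int(index) if index is not None else None), value
```

Applied to the example above, `SCANS[2]=48055` yields the label `SCANS`, index 2 and value `48055`, so the query can be tied back to the third raw file.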
Ion mobility
The ION_MOBILITY parameter attaches a drift time to a precursor. The first part of its argument must be an exact copy of an existing PEPMASS line, while the second part is the drift time as a floating point number.
This is OK:
PEPMASS=498.34 25674.3 2+
ION_MOBILITY=498.34 25674.3 2+ 1.5
This is also OK:
PEPMASS=498.34 25674.3
ION_MOBILITY=498.34 25674.3 1.5
This is an error because PEPMASS includes charge but ION_MOBILITY does not:
PEPMASS=498.34 25674.3 2+
ION_MOBILITY=498.34 25674.3 1.5
Mascot doesn’t use the value in matching, but just passes it through to the query section in the results file.
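The matching rule between ION_MOBILITY and PEPMASS can be checked before a file is submitted. The following sketch uses a hypothetical function name and compares white-space-separated fields, which is slightly more lenient than a character-for-character copy.

```python
def parse_ion_mobility(value, pepmass_values):
    """Return (matching_pepmass, drift_time) or raise ValueError.

    value is the text after 'ION_MOBILITY='; pepmass_values holds the text
    after each 'PEPMASS=' in the same query.
    """
    fields = value.split()
    if len(fields) < 2:
        raise ValueError("ION_MOBILITY needs a PEPMASS copy and a drift time")
    prefix, drift_time = fields[:-1], float(fields[-1])
    for pepmass in pepmass_values:
        if pepmass.split() == prefix:
            return pepmass, drift_time
    raise ValueError("ION_MOBILITY does not match any PEPMASS line")
```

The third example above fails in this sketch because the PEPMASS fields include `2+` while the ION_MOBILITY prefix does not.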
Intensity values
The MGF format allows intensity information to be associated with peptide and fragment m/z values. It doesn’t specify what these values represent, which is determined by the peak picking software. They could be peak height or peak area and they could be for the 12C peak or for the complete isotope distribution. Units are generally arbitrary and absolute values have no meaning.
During a Mascot search, subsets of the most intense peaks are selected and scored iteratively, looking for the best score, which presumably corresponds to an optimum separation of signal peaks from noise peaks. In the result report for an MS/MS search, the spectra in the unassigned list can be sorted by precursor intensity (in case it is of interest to see which are the strongest spectra that failed to get a significant match). For these purposes, as long as the intensity values are derived in a consistent manner, it doesn’t greatly matter what they represent.
If the peak list is being used for quantitation, then the origin of the intensity values will be of greater interest. If Mascot Distiller is being used for peak picking, a setting in preferences can be used to choose between S/N, which behaves like height, or area under the complete isotope distribution. However, Distiller can also be configured to pass through centroid values direct from the raw data, in which case the intensity will be whatever value was assigned by the instrument data system. This is only relevant for MS2 quantitation (iTRAQ / TMT). Distiller MS1 quantitation is always based on integrating survey scan intensity across the elution profile of the precursor, and this information is not present in the peak list used for the search.
The Rules
- Filename extensions are not significant.
- Numeric values must be non-localised US ASCII. That is, the decimal separator must be a period and the thousands separator, if any, must be a comma. Leading white space is acceptable on lines that start with a number.
- Parameter labels are not case sensitive. Parameter values may be case sensitive. Case is preserved for parameter values which are free text strings. There must be no leading space before a parameter label and no space either side of the = symbol.
- Parameters at the head of the data file apply to the entire search and override the default settings provided by the search form fields.
- In the absence of a FORMAT parameter, the default format is Mascot generic.
- Mascot generic format permits an MS/MS search to include peptide mass fingerprint queries and sequence queries.
- In Mascot generic format, each MS/MS spectrum is delimited by BEGIN IONS and END IONS statements. There is a line for each fragment ion peak, containing an m/z and intensity value, separated by white space. Fragment ion m/z values must be positive, non-zero values. Intensities must be positive values. The third value is fragment charge, which is optional. Any additional values or text are ignored.
- Parameters between the BEGIN IONS and END IONS statements only apply to the local MS/MS query. At least one PEPMASS parameter is required, all others are optional. Parameters within an MS/MS query must appear before the fragment ion data. If an MS/MS query has no fragment ions, it is treated as a PMF query.
- Most parameters can only appear at the head of the file, prior to any query data. The exceptions are PEPMASS, TITLE, SCANS, RTINSECONDS, RAWFILE, LOCUS, and RAWSCANS which can only appear within an MS/MS query block, and CHARGE, INSTRUMENT, IT_MODS, TOL, and TOLU, which can appear in either place. SEQ, COMP, TAG and ETAG can appear within an MS/MS query block or as qualifiers to a mass value using the Sequence Query syntax. When IT_MODS are specified within an MS/MS query block, they are appended to any IT_MODS specified at the head of the file or in the search form.
- Blank lines can be used anywhere to improve readability.
- Lines that start with one of the symbols # ; ! / are comment lines and are ignored. Comments cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
- A SEARCH type must be defined (PMF, SQ or MIS). The default is determined by the search form used to upload the file. Like any other parameter, this can be overridden by including a SEARCH parameter in the file header.
- A peptide mass fingerprint (PMF) search can only contain PMF queries. This allows for a relaxed syntax in which any line starting with a number is assumed to be a query. The first number is parsed as a peptide m/z value and the second number, if any, is parsed as a peak area or intensity. The rest of the line is ignored. Peptide m/z values must be equivalent to 100 <= Mr <= 16000.
- MS/MS searches can contain MS/MS data in proprietary formats only if this is declared with a FORMAT parameter. Mixing proprietary formats, or including non-MS/MS queries in a proprietary format file, is not allowed.
- User parameters are any parameters named USER\d\d (where \d is a digit) or any name beginning with an underscore except for the following, which are reserved:
_INSIGHT_*
_INTEGRA_*
_DAEMON_*
_DISTILLER_*
_SERVER_*
User parameters cannot be used between the BEGIN IONS and END IONS statements delimiting an MS/MS query block.
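The naming rule for user parameters can be expressed as a short check. The function name `is_user_parameter` is invented for this illustration; the reserved prefixes are the ones listed above.

```python
import re

RESERVED_PREFIXES = ("_INSIGHT_", "_INTEGRA_", "_DAEMON_", "_DISTILLER_", "_SERVER_")


def is_user_parameter(label):
    """Return True if label is USERnn or a non-reserved underscore name."""
    name = label.upper()  # parameter labels are not case sensitive
    if re.fullmatch(r"USER[0-9][0-9]", name):
        return True
    return name.startswith("_") and not name.startswith(RESERVED_PREFIXES)
```

Under this check, a label such as `_MYLAB_BATCH` counts as a user parameter, whereas `_DISTILLER_RAWFILE` does not.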
mzML (.mzML)
Mascot supports mzML version 1.1.0. Follow the link for a schema document and further information.
mzML format can contain centroided spectra or profile data. Mascot only supports centroided spectra. If you submit profile data, you will get very poor results, and if any spectrum contains more than 10,000 peaks, the search may terminate with an error. Check your peak picking settings carefully. If in doubt, try processing the file with Mascot Distiller.
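One way to check a file before submission is to look for the PSI-MS controlled vocabulary terms that mark a spectrum as centroided or profile. The sketch below is an independent illustration, not a Mascot tool: the function name is hypothetical, it relies on the accessions MS:1000127 (centroid spectrum) and MS:1000128 (profile spectrum), and it only inspects cvParam elements written directly inside each spectrum, so terms supplied through a referenceable parameter group are not resolved.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"
CENTROID = "MS:1000127"  # PSI-MS term "centroid spectrum"
PROFILE = "MS:1000128"   # PSI-MS term "profile spectrum"


def profile_spectrum_ids(source):
    """Return the ids of spectra flagged as profile data in an mzML stream."""
    flagged = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "spectrum":
            accessions = {cv.get("accession") for cv in elem.iter(NS + "cvParam")}
            if PROFILE in accessions and CENTROID not in accessions:
                flagged.append(elem.get("id"))
            elem.clear()  # release memory for large files
    return flagged


# Tiny, incomplete mzML fragment used only to exercise the function.
example = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run><spectrumList>
<spectrum id="scan=1"><cvParam accession="MS:1000127" name="centroid spectrum"/></spectrum>
<spectrum id="scan=2"><cvParam accession="MS:1000128" name="profile spectrum"/></spectrum>
</spectrumList></run></mzML>"""
flagged_ids = profile_spectrum_ids(io.BytesIO(example))
```

Any ids returned would point to spectra that need peak picking before the file is searched.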