Export search results
This utility enables Mascot search results to be exported in a variety of "machine readable" formats. When used interactively, the file format is chosen and customised using a web browser form, displayed by choosing Export Search Results in the format controls of a results report and pressing Format As. In addition, the utility can be executed by scripts, with the options specified on the command line.
Previous versions of Mascot also supported pepXML and DTASelect formats, which are described in obsolete export formats.
Mascot XML and Mascot CSV
The Mascot XML and Mascot CSV formats contain identical information. XML is ideal for importing into a relational database. CSV can be opened in spreadsheets such as Microsoft Excel.
For a Peptide Mass Fingerprint, the result information is structured in a very similar way to a Concise Protein Summary report.
For search results that include MS/MS data, you can choose whether to structure the protein list and associated peptide matches in a similar way to a Protein Family report or the (obsolete) Select Summary report. To create an export that contains information equivalent to a particular Mascot HTML report, the settings of the format controls must match, plus:
Type of search | HTML Report | Threshold type | Protein Scoring | Same-sets | Sub-sets | Group proteins |
---|---|---|---|---|---|---|
PMF | Concise Protein Summary | N/A | N/A | checked | 1 | N/A |
MS/MS | Peptide Summary | Identity | As format controls | checked | As format controls | not checked |
MS/MS | Protein Family Report | Homology | MudPIT | checked | 1 | checked |
XML
Precise details for individual data items in the XML export, such as the data type and whether it is optional, can be found in the XML schema mascot_search_results_2.xsd and its documentation, which was auto-generated using xs3p. For general XML Schema considerations, see the XML Schema section further down this page.
Protein quantitation information is only available if peptide quantitation has been selected.
CSV
The CSV file contains identical data, organised for display as a spreadsheet. The column headers are tabulated on this page. If you need to change the delimiter to something other than a comma, edit export_dat_2.pl and change the value of $delimiter, near the top of the script.
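If you do change the delimiter, any program that reads the export has to be told about it. The following is a minimal Python sketch of reading an export with a configurable delimiter. The function name and the two-line sample are hypothetical and are not real Mascot output.

```python
import csv
import io


def read_export_rows(text, delimiter=","):
    """Return the export as a list of rows, each row a list of cell strings."""
    # csv.reader copes with quoted cells that contain the delimiter itself.
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical excerpt using a semicolon delimiter (illustration only).
sample = 'prot_acc;prot_desc\nP12345;"Example protein; fragment"\n'
rows = read_export_rows(sample, delimiter=";")
```

Using a CSV parser rather than splitting on the delimiter keeps quoted description text intact.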
There are no column headers for quantitation information exported in CSV format. Each row contains peptide quantitation data consisting of pairs of ratio names and ratio values followed by pairs of component names and component intensities. Values are post-normalisation and post-isotope correction. Protein level information follows, in-row, after the last peptide match of each protein. Details can be found on this page.
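Because these cells have no headers, a reader has to rely on their order. The sketch below shows one way to split the name/value pairs in Python. It assumes you have already isolated the quantitation cells of one row and that you know how many ratios the quantitation method reports. Both are assumptions, and the function name is hypothetical.

```python
def split_quant_cells(cells, n_ratios):
    """Split a flat list of name/value cells into ratio and component pairs.

    cells    -- quantitation cells of one row, in exported order (assumed
                to be already separated from the other columns)
    n_ratios -- number of ratio pairs at the start (assumed known from
                the quantitation method)
    """
    # Pair up consecutive cells: (name, value), (name, value), ...
    pairs = list(zip(cells[0::2], cells[1::2]))
    # Values are kept as strings, since a cell may not hold a number.
    ratios = dict(pairs[:n_ratios])
    components = dict(pairs[n_ratios:])
    return ratios, components


# Hypothetical cells: one ratio followed by two components.
ratios, components = split_quant_cells(
    ["H/L", "1.25", "light", "1000", "heavy", "1250"], n_ratios=1
)
```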
Usage
For interactive use, the controls are divided into blocks, with the first block corresponding to the format controls of a results report.
The Optional Search Information block controls which ancillary information is exported. Most of the options are self-explanatory.
The data items in the Header section are:
- Search title
- Timestamp (W3C Date and Time format, e.g. 2005-03-12T08:29:11Z)
- User
- Report URI (URL or relative path, if executed at command line)
- MS data path or URL
- Search type
- Mascot version
- Database
- Fasta file
- Total sequences
- Total residues
- Sequences after taxonomy filter
- Number of entries searched in error tolerant mode (if applicable)
- Number of queries
- Warning messages from the search (as required)
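The timestamp in the example above can be parsed with the Python standard library. This is a minimal sketch that assumes the value always has the form shown, with whole seconds and a trailing Z. The function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_header_timestamp(value):
    """Parse a timestamp such as 2005-03-12T08:29:11Z into an aware datetime."""
    # The trailing "Z" means UTC, so attach the UTC time zone explicitly.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


stamp = parse_header_timestamp("2005-03-12T08:29:11Z")
```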
According to the type of search, the Header section may be followed by:
- Decoy search summary statistics
- Fixed modifications
- Variable modifications
- Search parameters
- Format parameters
- Quantitation normalisation factors
- A list of cross-links, with specificities and monolink numbering
For speed and efficiency, leave the checkboxes marked with asterisks under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes).
mzIdentML
mzIdentML is the data exchange standard for database search results developed by the PSI Proteomics Informatics Standards Group. Originally, it was to be called analysisXML.
Precise details for individual data items, such as the data type and whether it is optional, can be found in the XML schemas of mzIdentML 1.1.0 and mzIdentML 1.2.0. For general XML Schema considerations, see the XML Schema section further down this page.
A semantic validator for mzIdentML documents was developed by Andreas Bertsch as part of the OpenMS project. It is currently unavailable, but check the PSI-MS website for alternatives.
Usage
For speed and efficiency, leave all the checkboxes under Optional Protein Hit Information unchecked. (See Optional Protein Hit Information for further information on the use of these checkboxes).
Under Query Level Information, check Matched Fragment Ions to output tables of matching experimental and calculated m/z values for each peptide match. This is obviously time consuming and causes a substantial increase in the size of the output file. Check Export data for all Queries to output details for every MS/MS spectrum, including those that got no match to an exported protein and those that got no match at all. Again, this is time consuming and causes a substantial increase in the size of the output file.
mzTab
mzTab is a simple tab-delimited format for database search results developed by the PSI Proteomics Informatics Standards Group. Precise details for individual data items, such as the data type and whether it is optional, can be found in the specification document on the PSI-MS website.
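Each line of an mzTab file starts with a short prefix that identifies its section, for example MTD for metadata, PSH for the PSM header and PSM for a PSM row. A minimal Python sketch of grouping lines by prefix follows. The function name and the sample text are hypothetical, and the sample is far from a complete mzTab file.

```python
from collections import defaultdict


def group_mztab_lines(text):
    """Group mzTab lines by their leading section prefix (MTD, PSH, PSM, ...)."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        fields = line.split("\t")
        # The first field is the prefix; the rest are the cells of that line.
        sections[fields[0]].append(fields[1:])
    return dict(sections)


# Hypothetical fragment (illustration only).
sample = "MTD\tmzTab-version\t1.0.0\n\nPSH\tsequence\tPSM_ID\nPSM\tPEPTIDEK\t1\n"
sections = group_mztab_lines(sample)
```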
Usage
For speed and efficiency, leave starred checkboxes under Protein Hit Information unchecked.
Mascot Search Results (MSR) file
Mascot Search Results (MSR) is a relational database saved in SQLite format, introduced in Mascot Server 3.0. If the search results are saved in MSR format, the file can be exported verbatim using this option.
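Because the file is an ordinary SQLite database, any SQLite client can open an exported copy. The Python sketch below lists the tables without assuming anything about the MSR schema, which is not described on this page. The function name is hypothetical, and the file is opened read-only so that the export cannot be altered by accident.

```python
import sqlite3
from pathlib import Path


def list_tables(path):
    """Return the table names in an SQLite file, opened read-only."""
    # mode=ro makes SQLite refuse writes and fail if the file is missing.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```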
For security reasons, downloading is restricted to result files in the daily directories under the Mascot data directory.
If you have refined results using machine learning, note that the data exported as the MSR file does not contain the Percolator pip or pop files. You are better off exporting the results as mzIdentML or Mascot XML, which contain the refined results.
Only MSR files can be exported in MSR format. If the search results are saved in the old .dat format, Mascot does not automatically convert them to MSR.
Mascot DAT File
Mascot DAT file is the older results file format used by Mascot Server 2.8 and earlier. If your results are saved in this format, select this option to export the data verbatim.
If the results are saved in Mascot Search Results (MSR) format, then when you select this option, Mascot automatically converts the results to the old .dat format. This option is intended for backwards compatibility with third party software.
For security reasons, downloading is restricted to result files in the daily directories under the Mascot data directory.
If you have refined results using machine learning, note that the data exported as the .dat file does not contain the Percolator pip or pop files. You are better off exporting the results as mzIdentML or Mascot XML, which contain the refined results.
MGF Peak List
This option is a convenient way to extract the peak list from a search result file. It is useful when you export an mzIdentML file, because the mzIdentML schema does not support inclusion of the peak list. The peak list is required for submitting search results to PRIDE.
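An MGF file is plain text in which each spectrum is enclosed between BEGIN IONS and END IONS lines. As a quick check on an exported peak list, the Python sketch below collects the TITLE of each spectrum in file order. The function name and the sample text are hypothetical.

```python
def read_mgf_titles(text):
    """Return the TITLE of each spectrum in an MGF peak list, in file order."""
    titles = []
    in_spectrum = False
    title = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line == "BEGIN IONS":
            in_spectrum, title = True, None
        elif line == "END IONS":
            # A spectrum without a TITLE line is recorded as None.
            titles.append(title)
            in_spectrum = False
        elif in_spectrum and line.startswith("TITLE="):
            title = line[len("TITLE="):]
    return titles


# Hypothetical two-spectrum peak list (illustration only).
sample = (
    "BEGIN IONS\nTITLE=first\nPEPMASS=500.25\n100.1 20\nEND IONS\n"
    "BEGIN IONS\nTITLE=second\nPEPMASS=612.30\n150.2 35\nEND IONS\n"
)
titles = read_mgf_titles(sample)
```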
xiVIEW
Search results in CSV format that include intact crosslinked peptides can be exported to xiVIEW for visualisation. Note that xiVIEW also supports mzIdentML, which contains more complete search data.
xiVIEW CSV
The basic steps using xiVIEW CSV format are:
- Export the search as MGF in Mascot query order.
- Export the search as FASTA.
- Export the search as xiVIEW CSV.
- Log in to your xiview.org account.
- Upload the three files.
xiVIEW CSV supports:
- Intact crosslinks
- Monolinks
- Protein ambiguity (sameset proteins)
- Any protein accession as long as the correct FASTA file is uploaded too
At the time of writing, xiVIEW CSV does not support:
- Looplinks
- Large data sets (if the CSV file is more than about 2 MB, the JavaScript code on xiview.org grinds to a halt)
Note also:
- Looplinks are exported as monolinks, with only one end marked in the peptide.
- In the CSV format, linear matches with looplinks are omitted from the export. Only crosslinked pairs of peptides are included.
- In the CSV format, Mascot exports (query – 1) in the ScanNumber column. This is necessary for alignment with the MGF file (in query order) in xiVIEW.
- If any crosslinked match has monolinks, looplinks or variable mods, after uploading the CSV file and the MGF file, xiVIEW will prompt you to enter their masses. Mascot exports modifications as deltas, so these are easy to copy and paste. This step is not necessary when using mzIdentML.
- xiVIEW doesn’t recognise terminal modifications. If there is a modified N-term, it’s reported as an N-terminal residue mod. Same for C-term. If a peptide has both N-term and N-terminal residue mod, the mod IDs are concatenated in the xiVIEW interface.
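The (query – 1) convention noted above reads like a zero-based index into the MGF file exported in query order. A minimal Python sketch of that reading follows. It is an interpretation of the note, not a description of xiVIEW internals, and the names and sample titles are hypothetical.

```python
def scan_number_for_query(query_number):
    """Return the ScanNumber value for a 1-based Mascot query number."""
    if query_number < 1:
        raise ValueError("Mascot query numbers start at 1")
    return query_number - 1


# Hypothetical spectrum titles, listed in Mascot query order as they
# would appear in the exported MGF file.
mgf_titles = ["scan A", "scan B", "scan C"]

# Query 2 gives ScanNumber 1, which is the second spectrum in the file.
title_for_query_2 = mgf_titles[scan_number_for_query(2)]
```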
xiVIEW mzIdentML
We recommend exporting data as mzIdentML when uploading to xiVIEW. The basic steps using mzIdentML are:
- Export the search as MGF in original input order.
- Export the search as mzIdentML. Under Protein Information, check Protein Sequence. No need to check Matched Fragment Ions.
- Log in to your xiview.org account.
- Upload the two files.
Note also:
- Looplinks are exported as monolinks, with only one end marked in the peptide.
- xiVIEW doesn’t recognise terminal modifications. If there is a modified N-term, it’s reported as an N-terminal residue mod. Same for C-term. If a peptide has both N-term and N-terminal residue mod, the mod IDs are concatenated in the xiVIEW interface.
Optional Protein Hit Information
Only a limited amount of information about a protein hit is saved to a Mascot result file. For example, the protein sequence is not saved because this would make the result files unacceptably large. When missing information is required for a Mascot report, it has to be retrieved from the compressed database files.
Even though a single call for missing information may take only a fraction of a second, and is not noticeable when loading a Mascot report, this can become a problem if creating an export file requires thousands of calls. It is important to be aware of this, and not waste time retrieving information that is not actually required. This is a particular issue for an export format that represents "raw" result information, like pepXML. A list of all the proteins that contain all the peptides that had any matches to any of the spectra can be an extremely long list.
Description
The Fasta description line is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries to guess which protein hits will appear in the reports and saves their Fasta description lines to the result file. However, the actual hit list depends on many factors, and some hits may be missed, requiring the descriptions to be retrieved from the compressed database files.
Protein Mass
The protein mass is saved for all peptide mass fingerprint protein hits. For an MS/MS search, Mascot tries to guess which protein hits will appear in the reports and saves their masses to the result file. However, the actual hit list depends on many factors, and some hits may be missed, requiring the masses to be retrieved from the compressed database files.
On the Matrix Science public web site, the description and mass of a protein can only be exported if this information was saved to the result file. The following protein hit information options are not available on the public web site, and attempting to use them will have no effect.
Percent coverage
Percent coverage is never saved to the result file. It is calculated on the fly from the length and the set of peptides assigned to the protein.
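As an illustration of this kind of calculation, the Python sketch below counts each residue once, however many peptides overlap it. It uses the usual definition of sequence coverage and is not taken from the Mascot source. It assumes peptide positions are 1-based and inclusive, and the function name is hypothetical.

```python
def percent_coverage(length, peptides):
    """Percentage of residues covered by at least one peptide.

    length   -- protein length in residues
    peptides -- iterable of (start, end) positions, 1-based and inclusive
    """
    if length <= 0:
        raise ValueError("length must be positive")
    covered = set()
    for start, end in peptides:
        # Clip to the protein so stray positions cannot inflate the result.
        covered.update(range(max(start, 1), min(end, length) + 1))
    return 100.0 * len(covered) / length


# Two overlapping peptides covering residues 1 to 20 of a 100-residue protein.
coverage = percent_coverage(100, [(1, 10), (6, 20)])
```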
Length in residues
Length in residues is never saved to the result file. It must be retrieved from the compressed database files.
pI
pI is never saved to the result file. The protein sequence must be retrieved from the compressed database files and the pI value calculated.
Taxonomy
Taxonomy is never saved to the result file. It must be retrieved from the compressed database files.
Taxonomy ID
Taxonomy ID is never saved to the result file. It must be retrieved from the compressed database files.
Protein sequence
The entire protein sequence is never saved to the result file. It must be retrieved from the compressed database files.
Command Line Execution
Result file conversion can be automated by using the export script as a command line utility. It must be executed on the Mascot server using the Mascot copy of the Perl interpreter and with cgi as the current directory. The command line arguments are URL-style name=value pairs, e.g. for Linux:
../perl64/bin/perl export_dat_2.pl file=../data/20240223/F004651.dat do_export=1 export_format=CSV … other arguments … pep_scan_title=1 > ../data/20240223/F004651.csv
For Windows, replace forward slashes with back slashes.
To obtain the command-line arguments for a given output, use the form based interface to adjust the settings then choose "Show command line arguments". The command line can then be copied and pasted as required. To direct the output to a file, add a > symbol followed by the path to the output file, as in the example above.
Note that, if Mascot security is enabled, the arguments will include a security session ID, e.g. sessionid=billy_299425468615895. You don’t need to provide a session ID for command line operations, so it is best to drop this argument in case it becomes invalid before you try to use it.
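If you copy arguments from the form into a script, the session ID can be filtered out automatically. A minimal Python sketch follows, assuming the arguments are held as a list of name=value strings. The function name is hypothetical.

```python
def drop_session_id(arguments):
    """Return the name=value arguments without any sessionid entry."""
    return [arg for arg in arguments if not arg.startswith("sessionid=")]


copied = [
    "file=../data/20240223/F004651.dat",
    "do_export=1",
    "export_format=CSV",
    "sessionid=billy_299425468615895",
]
cleaned = drop_session_id(copied)
```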
The script should never output the HTML for the interactive download button when called from the command line, but this can happen if it is called from a CGI script. If this is a problem, add the argument generate_file=1.
Required Arguments
- do_export
- must be 1 to export results
- export_format
- XML or CSV or pepXML or DTASelect or MascotDAT or mzIdentML or mzTab or MGF
- file
- relative or absolute path to result file
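For batch conversion, the command line can be assembled and run from another script. The Python sketch below uses only the arguments listed on this page. The helper names are hypothetical, the Perl path is the Linux one from the earlier example, and the script still has to run on the Mascot server with cgi as the working directory.

```python
import subprocess


def build_export_command(result_file, export_format, extra_args=(),
                         perl="../perl64/bin/perl"):
    """Assemble the argument list for export_dat_2.pl."""
    return [
        perl,
        "export_dat_2.pl",
        f"file={result_file}",
        "do_export=1",
        f"export_format={export_format}",
        *extra_args,
    ]


def run_export(command, output_path, cgi_dir):
    """Run the export with cgi_dir as current directory, saving stdout to a file."""
    # output_path is resolved by this Python process, not relative to cgi_dir.
    with open(output_path, "wb") as out:
        subprocess.run(command, cwd=cgi_dir, stdout=out, check=True)


command = build_export_command(
    "../data/20240223/F004651.dat", "CSV", ["pep_scan_title=1"]
)
```

Passing the arguments as a list avoids shell quoting problems, and check=True raises an error if the script exits with a non-zero status.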
XML Schema
Versioning
The Mascot Search Results XML schema uses versioning to avoid applications breaking when the schema is updated. The schema definition is identified by a major version number and a minor version number.
When a change is made to the schema, and any instance document that was valid against the previous schema could become invalid, the major version number will be incremented. An example of such a change would be that a new type or element is added to the schema that is not optional. If a change is made to a schema that cannot break the validity of any existing document, such as adding a new type or element that is optional, then the minor version will be incremented.
There will be a separate schema file and name space for each major version and the file name contains the major version number. The schema also includes the major and minor version numbers as attributes of the top level element. An application that parses an instance document should compare the major and minor version attribute values against those which it was coded to support. It should not rely on an XML parser to verify the version numbers against the schema encoded restrictions, since the schema definition file used by the parser may be newer than when the application was written.
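The version check described above might look like the following Python sketch. The attribute names majorVersion and minorVersion, the element name in the sample and the supported version numbers are all assumptions made for illustration. Consult the schema file for the actual names.

```python
import xml.etree.ElementTree as ET

# Versions this hypothetical application was written against (assumed values).
SUPPORTED_MAJOR = 2
SUPPORTED_MINOR = 0


def check_version(xml_text, major_attr="majorVersion", minor_attr="minorVersion"):
    """Compare the version attributes of the top level element with ours."""
    root = ET.fromstring(xml_text)
    major = int(root.attrib[major_attr])
    minor = int(root.attrib[minor_attr])
    if major != SUPPORTED_MAJOR:
        # A different major version may not be valid against our schema.
        return "unsupported"
    if minor > SUPPORTED_MINOR:
        # Still valid for us, but may hold optional items we do not know.
        return "newer-minor"
    return "ok"


# Hypothetical top level element (illustration only).
status = check_version('<results majorVersion="2" minorVersion="1"/>')
```

Performing the comparison in application code, rather than leaving it to the parser, follows the advice above: the schema file available at run time may be newer than the one the application was written against.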
Validation
The instance documents created by this export utility have been validated against the corresponding schema definitions using XMLSpy. The following web tools can also be used:
No complex software is ever completely free of bugs. If you find an XML file created by the Export Search Results utility that fails to validate against the corresponding schema definition, please email full details to support@matrixscience.com and we will try to fix the problem as rapidly as possible.
On the other hand, if the XML file validates, but an error is reported by the application reading the file, then this is a bug in the application. In the first instance, please report this to the authors of the application.
Useful Resources
Standards and design:
- W3C schema documents:
- HP: XML Schema best practices
- IBM Developerworks – Tip: Namespaces and versioning
- Ronald Bourret: XML Namespaces FAQ
- XML Schema: Understanding Namespaces
- Unicode standard
Programming:
- Schema support in Xerces-C++ – includes SAX2 and DOM examples for overriding xsi:schemaLocation. Xerces 2.6 adds support for grammar caching which could be used to do the same thing by preloading the known schema files.
- Properties supported by Xerces2-J – it should be possible to use the external-schemaLocation property in much the same way as the C++ version to override the schema locations. Alternatively, you could use grammar caching instead.
- MSXML provides an XMLSchemaCache object for preloading schemas. There are DOM examples and it should be fairly similar in SAX2 except you would use the schemas property of ISAXXMLReader.