Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by Ville Koskinen (May 20, 2022)

mzIdentML 1.2

Mascot Server supports several formats for exporting database search results. One of them is the Proteomics Standards Initiative’s mzIdentML. The Mascot 2.8.1 patch release upgrades the file format version to mzIdentML 1.2. You can now export crosslinked search results in XML, CSV, xiVIEW CSV and mzIdentML format. When you export standard database searches, error tolerant searches and spectral library searches, you can choose between mzIdentML 1.1 and 1.2 depending on your requirements.

What’s new in mzIdentML

mzIdentML has been around for several years. Version 1.0 of the specification was published in 2009, but the more widespread version is mzIdentML 1.1, published in 2011 and supported since Mascot Server 2.4.1. The format is XML-based, where additional data structure and constraints are specified in a schema document and enforced by the official validator program. We ran a short series on PSI file formats on this blog back in 2015, which is still an accurate description of the 1.1 format.

The format changes in mzIdentML 1.2 have to do with the interpretation of controlled vocabulary (CV) terms in specific use cases. The XML structure stays the same, so the new version is otherwise backwards compatible and mostly forwards compatible. Software that reads mzIdentML 1.2 can read 1.1 files without difficulty, and software that reads the 1.1 format can read most data from 1.2 files. For this reason, there wasn’t an immediate impetus for jumping to the next version.
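
Because the XML structure is shared between versions, a client program can check which version it has been given before deciding how to interpret the CV terms. The following is a minimal sketch in Python, assuming the file declares its version in the version attribute of the root MzIdentML element. The function name mzid_version and the sample document are made up for illustration and are not part of Mascot or any PSI tool.

```python
import io
import xml.etree.ElementTree as ET


def mzid_version(source):
    """Return the 'version' attribute of the root MzIdentML element.

    'source' is a file name or a binary file-like object.
    """
    # The first 'start' event is always the root element, so the loop
    # returns before the rest of a large file has been parsed.
    for _event, elem in ET.iterparse(source, events=("start",)):
        # The tag looks like '{namespace}MzIdentML'; keep the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "MzIdentML":
            raise ValueError(f"unexpected root element: {local_name}")
        return elem.get("version")
    raise ValueError("empty document")


# Hypothetical one-element document standing in for a real export.
sample = (
    b'<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.2" '
    b'version="1.2.0" id="example"/>'
)
version = mzid_version(io.BytesIO(sample))
print(version, "is 1.2" if version.startswith("1.2") else "is not 1.2")
```

A reader written for 1.1 could use a check like this to warn that some CV terms in a 1.2 file may not be understood, rather than rejecting the file outright.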

Crosslinked peptides

The main reason for upgrading now is crosslinking support. mzIdentML 1.2 specifies a number of new CV terms and specific ways of encoding crosslinked peptide matches. All the peptide-level metadata for intact crosslinks, detailed in section 5.2.9 of the specification, have been implemented in Mascot 2.8.1. A few minor points are worth noting.

The specification allows but does not require encoding protein-level interaction evidence. We decided to omit it, as Mascot currently stops short of inferring protein-level interactions from crosslinked peptide matches.

The specification allows encoding crosslinking modifications using both XLMOD and Unimod. Crosslinking support in Unimod was added some time after the specification was published, and it’s what Mascot uses and reports. This isn’t a problem: the mzIdentML validator accepts Unimod crosslinking modifications and monolinks, and they are also accepted by xiVIEW.

If you’ve defined a custom linker, this is exported as an unknown modification but with the right monoisotopic mass. There is currently no mechanism in mzIdentML to specify the linker composition or other attributes when it’s not defined in a public database, but this is no different to custom variable modifications.

Finally, looplinks are encoded as variable modifications. The other end of the link is omitted. We initially tried encoding it as a userParam, but the schema does not allow userParams under the Modification element. Unimod monolinks have a similar issue: the correct Unimod record ID and delta are exported, but the monolink code is not, so a client program reading the file needs to infer the monolink code from the delta.
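
To illustrate what inferring the monolink from the delta might look like on the client side, here is a minimal Python sketch. It assumes the client keeps its own table of monolink mass deltas. The table labels, the function name infer_monolink and the tolerance are hypothetical. The two masses are computed from the compositions C8H12O3 and C8H13NO2 (hydrolysed and amidated DSS), and should be checked against Unimod before being relied on.

```python
# Hypothetical lookup table: label -> monoisotopic mass delta in Da.
# The labels are illustrative and are not Mascot or Unimod identifiers.
KNOWN_MONOLINKS = {
    "DSS monolink, hydrolysed": 156.078644,  # C8H12O3
    "DSS monolink, amidated": 155.094629,    # C8H13NO2
}


def infer_monolink(delta, table=KNOWN_MONOLINKS, tol=0.001):
    """Return the label whose mass is closest to 'delta', or None.

    A match is accepted only if it lies within 'tol' Da of the delta.
    """
    best = min(table.items(), key=lambda kv: abs(kv[1] - delta), default=None)
    if best is None or abs(best[1] - delta) > tol:
        return None
    return best[0]


print(infer_monolink(156.0786))
print(infer_monolink(100.0))
```

The tolerance matters: two monolink variants of the same linker can differ by less than 1 Da, so matching on nominal mass alone would not be safe.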

The PSI group is working on a small specification update, mzIdentML 1.2.1, which will address the issues with monolink and looplink encoding. Another welcome addition proposed for 1.2.1 is the ability to encode linker directionality, which is currently specified in the Mascot crosslinking method.

Protein relationships

mzIdentML 1.1 introduced several CV terms for encoding relationships between proteins and protein groups. The encoding was recommended and Mascot outputs the recommended terms, but it was not a hard requirement.

mzIdentML 1.2 formalises the required terms. Only a small change was needed in Mascot for four new CV terms: cluster identifier, leading protein, group representative and non-leading protein. In short, the protein family number is the cluster identifier, and family members are marked as leading proteins. Same-set, sub-set and intersection proteins are non-leading proteins. The first family member, which is usually the protein hit with most peptide evidence, is marked as the group representative. The encoding is otherwise the same as in mzIdentML 1.1. A useful summary is provided in the Protein Reporting Rosetta Stone spreadsheet available on the PSI website.
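
The mapping just described can be sketched in a few lines of Python. This is only an illustration of the rules in the paragraph above, not Mascot's export code: the FamilyProtein class, its field names and the example accessions are made up, CV accession numbers are left out, and the sketch does not show which XML element each term is attached to.

```python
from dataclasses import dataclass


@dataclass
class FamilyProtein:
    accession: str
    family: int             # Mascot protein family number
    is_member: bool         # False for same-set, sub-set and intersection proteins
    is_first: bool = False  # True for the first family member


def relationship_terms(protein):
    """Return (CV term name, value) pairs for one protein."""
    # The family number doubles as the cluster identifier.
    terms = [("cluster identifier", str(protein.family))]
    if protein.is_member:
        terms.append(("leading protein", None))
        # Only the first family member represents the group.
        if protein.is_first:
            terms.append(("group representative", None))
    else:
        terms.append(("non-leading protein", None))
    return terms


# Hypothetical family 1 with two members and one sub-set protein.
family = [
    FamilyProtein("PROT_A", 1, is_member=True, is_first=True),
    FamilyProtein("PROT_B", 1, is_member=True),
    FamilyProtein("PROT_C", 1, is_member=False),
]
for p in family:
    print(p.accession, [name for name, _value in relationship_terms(p)])
```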

Spectral library matches

mzIdentML 1.2 has a short updated section concerning spectral library matches. Mascot already encodes spectral library matches analogously to FASTA matches, so nothing had to be changed. More information can be found in Exporting spectral library search results.

Site analysis

mzIdentML 1.2 includes new syntax for encoding site analysis results and modification localisation scores. Unfortunately, the syntax is not applicable to Mascot’s site analysis. If a peptide match has multiple modifications, Mascot gives a confidence score to the set of modified sites. The mzIdentML 1.2 syntax only allows reporting a score for single sites. We’ve reported this as issue 112 on the project’s GitHub repository. For now, if you need site analysis results, please export the search as mzIdentML 1.1 or as CSV/XML.

xiVIEW and PRIDE support

As with any ‘new’ file format, it takes a while before it is widely supported. xiVIEW recommends using mzIdentML 1.2, and we’ve worked with Juri Rappsilber’s group to ensure Mascot is fully compatible with xiVIEW. Thanks in particular to Colin Combe!

The main consumer of mzIdentML files is the PRIDE repository. Initial support for mzIdentML 1.2 was added in the PRIDE Submission Tool in version 2.4.17 (July 2020), and the latest version (2.5.4) gives no validation errors with files exported from Mascot. PRIDE Inspector 2.5.4 can read mzIdentML 1.2 files but doesn’t display crosslinked matches correctly. Backend improvements are apparently being planned for PRIDE that will enable indexing and visualising crosslinked matches. The advantage of formal specifications is that once the PRIDE implementation is done, Mascot already exports the data in the format expected by PRIDE.
