Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by Ville Koskinen (June 17, 2015)

PSI file formats, part 2: validation

The first part listed a number of ways for generating mzIdentML files and named a few pieces of software capable of reading and processing them. This part of the series discusses a rather technical issue with mzIdentML files, namely validity, and how it can affect you as a user.

Validation is somewhat tangled with submitting mzIdentML files to proteomics repositories, such as PRIDE or ProteoRed, so in the next part we will briefly look at repository submission requirements and how they are related to mzIdentML validation.

What does valid mean?

mzIdentML is an open format, in the sense that its specification is public and it is developed by a committee through a peer review process. No single authority curates a reference implementation of the file format. This is both a blessing and a curse. The advantage is that there is no vendor lock-in, as anyone can read the specification document and write software capable of processing the files. The drawback is that it is humanly impossible to define precisely every aspect of the format, so each implementor has to make their own interpretations in grey areas.

This is where the concept of a valid file steps in. Recall from part 1 that the mzIdentML format is like an onion: at the core is the basic tree-like structure (XML), then the relationships between nodes (XSD) and finally the layer of metadata (CV). Validity is defined in terms of these layers. An mzIdentML file is valid if 1) it is an XML file and 2) its nodes follow the XSD and 3) its metadata follows the CV. This definition sounds a bit facile, but it is the technical meaning of validity. Step 1 (XML validation) is called well-formedness, and step 2 (XSD validation) is sometimes referred to as semantic validation or schema validation.

Validation theory

Validation becomes increasingly complicated as you move “outwards” in the layers. Checking for well-formedness is fast, and there is no room for grey areas; either the file is or is not XML. A single character in the wrong place could render it invalid XML. These kinds of errors usually happen because of filesystem corruption or a similar condition, or in the course of developing software that writes mzIdentML.
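To make the first layer concrete, here is a minimal sketch of a well-formedness check using only the Python standard library. It is not how any particular mzIdentML reader does it; the function name check_well_formed is made up for this illustration. The parser either gets to the end of the file or stops at the first offending character.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return (True, None) if the file parses as XML, else (False, message).

    Hypothetical helper for illustration; it checks the XML layer only and
    says nothing about the XSD or the controlled vocabulary.
    """
    try:
        with open(path, "rb") as handle:
            # iterparse reads the file incrementally; clearing each element
            # after it has been read keeps memory use down on large files
            for _event, elem in ET.iterparse(handle, events=("end",)):
                elem.clear()
    except ET.ParseError as err:
        # The message from the parser includes the line and column
        return False, str(err)
    return True, None
```

Calling check_well_formed("F981133.mzid") would return (True, None) for a well-formed file, and False with the parser's message otherwise.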

XSD validation is straightforward, but is computationally more intensive. The validator has to check many kinds of hierarchical and relational assertions about nodes, so it may have to scan the document multiple times. It is possible for the XSD encoding to have ambiguities, although I’m not aware of any. If the file fails to validate against the XSD, it is most likely because of an error in the software that produced the file.

The controlled vocabulary part is where the grey area begins. The CV is actually composed of two files: the so-called OBO file that defines the vocabulary (the terms and meanings of the metadata), and a mapping file that enumerates the requirements and restrictions on vocabulary use by node type and position in the XML document. (Incidentally, the mapping file itself is an XML file with its own XSD.) In order to validate the mzIdentML file’s metadata, the validator must read the OBO file and the mapping file, and then check each rule in the mapping file against the mzIdentML file contents. The use of the CV terms is somewhat open to interpretation, because both the vocabulary and the mapping are still being developed, and you can use different versions of the CV with different mzIdentML files (the CV version used is stored in the mzIdentML header). Additionally, there can be different mapping files for different uses, such as MIAPE compliance; we’ll come to that later.
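Since the CV version is recorded in the file header, a quick way to see which vocabularies a file claims to use is to read the cvList element. The sketch below assumes an mzIdentML 1.1 file whose cv elements carry id and version attributes; the function name list_cv_versions is hypothetical, and the namespace must be changed for other schema versions. It only reports what the file declares and does not validate any CV terms.

```python
import xml.etree.ElementTree as ET

# Namespace of mzIdentML 1.1 documents (assumption: adjust for other versions)
MZID_NS = "http://psidev.info/psi/pi/mzIdentML/1.1"


def list_cv_versions(path, ns=MZID_NS):
    """Return {cv id: version or None} for each cv declared in cvList.

    Hypothetical helper for illustration; the id and version attribute
    names are assumed from the mzIdentML 1.1 schema.
    """
    versions = {}
    with open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == f"{{{ns}}}cv":
                versions[elem.get("id")] = elem.get("version")
            elif elem.tag == f"{{{ns}}}cvList":
                break  # the header has been read; skip the rest of the file
    return versions
```

If the writer and the reader disagree about a term, comparing the versions reported here with the CV version the reader was built against is a reasonable first step.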

(For the interested reader, a recent article by Mayer et al. (Biochim Biophys Acta 1844(1):98-107, 2014) contains an in-depth description of controlled vocabularies in proteomics.)

Validation in practice

When do you need to worry about validation? Basically, when something breaks. Software processing mzIdentML will always do a basic well-formedness check as part of reading the file, and often (but not always) also validate against the XSD. If the file is not valid, you may get a long and detailed error message about a particular element on line N of the file, or you may get a simple “cannot read the file” error. As for the grey area around the CV, if the writer and reader use different versions of the CV or interpret terms differently, it is usually possible to load the file without warnings, but some portions or pieces of data may be unavailable.

It is an instructive exercise to validate an mzIdentML file by hand – and by hand I mean using standard validation tools, as opposed to letting your application software do it for you. I was not able to find tools for CV validation, so let us look only at XML and XSD validation.

A suitable command-line tool is xmlstarlet, which has both Windows and Linux versions. The Linux version is often available in the distribution repository. To check that a file is an XML file, issue the following command:

xmlstarlet val -e -w F981133.mzid 

(On some Linux distributions, the command name can be aliased to xml.) The -e flag increases error message verbosity and the -w flag checks for well-formedness. If the file is valid, the message is succinct:

F981133.mzid - valid

Otherwise, you could see something like F981133.mzid:10.136: Opening and ending tag mismatch, followed by lengthy details about which character on which line in particular is wrong.

To check the file against the XSD, first download the XSD file, then give both files as input to xmlstarlet:

xmlstarlet val -e -s mzIdentML1.1.0.xsd F981133.mzid

The -s switch to xmlstarlet actually validates both the XML and XSD layers, but it’s often easier to separate the two by checking for XML validity first. If the file is valid (in the XSD sense), you’ll get the same success message as above. If the file isn’t valid, the error could be something like

F981133.mzid:10.129: Element '{http://psidev.info/psi/pi/mzIdentML/1.1}cvList':
Character content other than whitespace is not allowed because the content type
is 'element-only'.

The number of different kinds of error is rather large, and interpreting them requires quite a bit of knowledge about XML, so it’s best to contact the vendor of the software that generated the file and cite the error message.
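If you do want to look at the offending spot yourself, the location prefix in messages such as the ones quoted above can be picked apart with a few lines of Python. This is only a sketch: it assumes the prefix has the shape file:line.column: as in those examples, the names parse_location and excerpt are made up, and the column is treated as an approximate character offset. mzIdentML files often have very long lines, so showing a short window around the position tends to be more useful than printing the whole line.

```python
import re

# Matches messages shaped like "F981133.mzid:10.129: some message"
LOCATION = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+)\.(?P<column>\d+):\s*(?P<message>.*)$"
)


def parse_location(error_line):
    """Split an error line into (file, line, column, message), or None."""
    match = LOCATION.match(error_line)
    if match is None:
        return None
    return (
        match["file"],
        int(match["line"]),
        int(match["column"]),
        match["message"],
    )


def excerpt(path, line, column, width=30):
    """Return the text around a reported position (1-based line and column)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, text in enumerate(handle, start=1):
            if number == line:
                start = max(column - 1 - width, 0)
                return text[start:column - 1 + width].rstrip("\n")
    return None  # the file has fewer lines than the reported line number
```

An excerpt like this is also handy to include when you report the error to the software vendor.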

XML validation tools, like xmlstarlet, are useful in cases where your reader software does not give a useful error message, or if the file is extremely large. In most other cases, you are better off with the mzIdentMLValidator program, which validates not only the XML and XSD layers but also the CV layer. The validator has a fairly straightforward user interface, and the error messages have less jargon than xmlstarlet. I’ve found it necessary to disable “Use remote (OLS) ontologies” in version 1.3.3, as otherwise validation sometimes stalls for no apparent reason. There is an online version of the validator, but it makes sense to use the standalone version for all but the smallest of files.

The rule of thumb is: If the file validates, the problem is in the reader software. If the file does not validate, the problem is in the writer software. It’s not a hard and fast rule, though. For example, if you export mzIdentML from Mascot 2.3 and try to load it in software that only supports mzIdentML 1.1, you may get a cryptic error message about an incorrect XSD namespace. Although the Mascot-exported file is valid XML and validates against the mzIdentML 1.0 XSD, it is not valid against the 1.1 XSD. In this case, neither Mascot nor the reader software is incorrect, because they make different assumptions about file format version. And as mentioned, different CV versions can create grey areas where it is difficult to say whose interpretation is correct.
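A version mismatch like this is easy to spot before any validation, because the schema version is encoded in the namespace of the root element. The following sketch reads only the start of the file and reports that namespace along with the root element's version attribute. The function names are hypothetical, the 1.1 namespace is the one that appears in the xmlstarlet error message quoted earlier, and the presence of a version attribute on the root element is an assumption based on the 1.1 schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzIdentML 1.1 documents
MZID_11_NS = "http://psidev.info/psi/pi/mzIdentML/1.1"


def root_namespace(path):
    """Return (namespace, version attribute) of the document's root element.

    Hypothetical helper for illustration; either value may be None.
    """
    with open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("start",)):
            # The first "start" event belongs to the root element
            tag = elem.tag
            namespace = tag[1:].split("}")[0] if tag.startswith("{") else None
            return namespace, elem.get("version")
    return None, None


def declares_mzid_11(path):
    """True if the root element is in the mzIdentML 1.1 namespace."""
    namespace, _version = root_namespace(path)
    return namespace == MZID_11_NS
```

If declares_mzid_11 returns False for a file that otherwise looks fine, the file was most likely written for a different schema version than the one the reader expects, and validating it against the 1.1 XSD will fail regardless of its content.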

If you feel fired up about XML validation, I can only recommend the free resources at the World Wide Web Consortium (W3C) website, such as the XML Schema Primer. You may also be interested in the OBO format specification, which is the file format used for the controlled vocabulary. The format is based on another W3C recommendation, the OWL Web Ontology Language. (And OBO, by the way, stands for Open Biomedical Ontologies.) In the next part, we’ll discuss mzIdentML requirements in proteomics repositories.
