Posted by Ville Koskinen (December 17, 2018)

Don’t wait: use spectral libraries now

The Human Proteome Project 2018 special issue in the Journal of Proteome Research contains a report from the 2017 Dagstuhl Seminar on Computational Proteomics. The paper by Deutsch et al. is titled Expanding the Use of Spectral Libraries in Proteomics, and the authors identify several challenges that slow down spectral library adoption. I’d like to address their main points.

Adoption in existing workflows

The authors note that adopting spectral library searching in existing workflows can be difficult. But it’s not difficult at all with Mascot Server. Spectral library searching is integrated seamlessly from the search form all the way to the results report and data export. Additionally, applications that use Mascot Parser to read Mascot results files can access integrated library search results with little effort. One example of augmenting an existing workflow in this way is creating and using a spectral library of contaminants. There’s really no reason why you couldn’t start using spectral libraries today.

“Big data” community library?

The far greater challenge identified in the paper is sharing and making use of other people’s spectral libraries. As Deutsch et al. point out, the choice of spectra really depends on who the library is for. Should the library contain unidentified spectra? Is it for DIA? Choose the best replicate or create consensus spectra? What about experimental variability? Endogenous peptides? Quantitation? Crosslinking?

The authors present a vision for a big data community library – which I would call a spectral repository – that contains ‘everything’. Importing and curation of results is fully automated, the pedigree of each spectrum can be traced precisely and each spectrum represents the ultimate synthesis of how the peptide fragments. They identify many of the massive integration headaches between spectral libraries collected in unconnected experiments on diverse instruments in various labs by different people at disparate times, and go on to warn that “even on a single instrument, natural drift in calibration can lead to some differences in the spectra collected at different times”.

Assuming there is value in such a heterogeneous collection of spectra, the prescribed technical solution is a new file format that captures all the required metadata in a machine-readable form. While it’s a necessary precondition, the problem with such a spectral repository is not technical. It’s organisational. Think of all the processes that must be in place for it to work (quality assurance, governance, establishing requirements and standards, measuring and evaluating goals, …). I doubt these could be codified and automated to everyone’s satisfaction, or indeed that it should all be left to computers to manage.

The authors also note that “a curated list of commonly observed spectra that are unidentified but known to be often misidentified, leading to erroneous conclusions, would be an especially valuable addition”. Indeed! Mascot has no difficulty matching to unannotated or unidentified library spectra, and libraries like this that solve specific problems would be a very welcome addition in any researcher’s toolbox.

New, standardised file format?

Is a new file format needed? At the moment, there are several mutually incompatible spectral library file formats. Mascot integrates NIST’s well-established MS PepSearch, which supports the MSP file format. Each spectrum is a standalone entry and contains, at minimum, a peak list and a few key-value fields describing the precursor mass and peptide sequence. One of the fields can contain custom subfields, which gives some extensibility, and most subfields are documented in the MSP format specification.
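To make the structure concrete, here is a minimal Python sketch of reading MSP-style entries. It is not how Mascot or MS PepSearch read the format. It assumes that each header line is "Key: value", that a "Num peaks" line announces how many peak lines follow, that each peak line starts with m/z and intensity, and that a blank line separates entries. The sample entries and all their values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MspEntry:
    fields: dict = field(default_factory=dict)  # header fields, e.g. Name, MW, Comment
    peaks: list = field(default_factory=list)   # (m/z, intensity) pairs


def parse_msp(text):
    """Split MSP-style text into entries. Assumes a blank line ends an entry."""
    entries, current, peaks_left = [], None, 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            if current is not None:
                entries.append(current)
            current, peaks_left = None, 0
            continue
        if current is None:
            current = MspEntry()
        if peaks_left > 0:
            # Peak line: m/z, intensity, then an optional annotation we ignore.
            parts = line.split()
            current.peaks.append((float(parts[0]), float(parts[1])))
            peaks_left -= 1
        else:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            current.fields[key] = value
            if key.lower() == "num peaks":
                peaks_left = int(value)
    if current is not None:
        entries.append(current)
    return entries


# Made-up entries, for illustration only.
SAMPLE = """\
Name: PEPTIDEK/2
MW: 927.49
Comment: Spec=Consensus Parent=464.75
Num peaks: 3
175.1\t1200\t"y1"
292.2\t800\t"b3"
403.2\t450\t"?"

Name: SAMPLER/2
MW: 802.43
Comment: Spec=Single
Num peaks: 2
175.1\t900\t"y1"
262.2\t300\t"y2"
"""

entries = parse_msp(SAMPLE)
for entry in entries:
    print(entry.fields["Name"], len(entry.peaks), "peaks")
```

Everything about an entry lives inside the entry itself, so there is nowhere in this structure to record facts about the library as a whole.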

MSP files have no metadata describing the spectra as a collection. Some of this is on the NIST library download website (like the Yeast Ion Trap library) but much is stored in Excel spreadsheets or journal articles or technical reports. Individual MSP entries contain many parameters that describe how the spectrum was made (e.g. Spec=Consensus, clustering parameters like Dotfull=0.885/0.022, mean search engine score) and its provenance (Sample=1/umich_yeast_protein_kinase_flag_none,1,1), which refer back to the metadata stored outside the library.
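As a rough illustration, the subfields quoted above can be pulled apart with a few lines of Python. This sketch assumes the subfields are whitespace-separated key=value tokens and that values containing spaces are wrapped in double quotes; the function name parse_comment is my own, and the values are kept as raw strings because their interpretation depends on the library’s documentation.

```python
import shlex


def parse_comment(comment):
    """Split a comment string into subfields.

    Assumes whitespace-separated key=value tokens, with double quotes
    around values that contain spaces. A token without '=' is kept as a
    flag with the value True. An unbalanced quote or apostrophe raises
    ValueError, so real data may need more forgiving handling.
    """
    subfields = {}
    for token in shlex.split(comment):
        key, sep, value = token.partition("=")
        subfields[key] = value if sep else True
    return subfields


# The three subfields quoted in the text, joined into one comment string.
comment = (
    "Spec=Consensus Dotfull=0.885/0.022 "
    "Sample=1/umich_yeast_protein_kinase_flag_none,1,1"
)
info = parse_comment(comment)
for key, value in info.items():
    print(f"{key}: {value}")
```

The parsed Sample value is only a pointer: what the named sample actually was still has to be looked up outside the file.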

We’ve followed the same approach in Mascot. Libraries created from search results embed metadata at a similar level of detail. The MSP format is flexible enough, and although it has some shortcomings, there was no need to invent something new. Additionally, Database Manager logs contain the full audit trail for the library, including which match filters were used, which results files were processed and which peptide matches were considered for each spectrum, all with links to the search reports.

The danger with a new standard is that old formats might never really die out. An alternative to a complete redesign is to standardise the collection-level or library-level metadata in machine-readable, vendor-neutral tokens and allow it to be saved in a file separate from the spectral library. For example, Database Manager could export such a file easily, which you could zip up with the MSP file. Similarly, NIST could encode existing metadata in the new format and distribute it as a separate download. Improvements in the spectral library format itself could be made independently, following each vendor’s usual practices for incremental development. Wouldn’t this be quicker and more likely to get everyone on board than starting from scratch?
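To show how little machinery the separate-file idea needs, here is a Python sketch that zips an MSP file together with a JSON metadata file. No such standard exists yet: the choice of JSON, the file names, every metadata key and all the values below are hypothetical, and this is not an existing Database Manager export.

```python
import io
import json
import zipfile


def bundle_library(msp_text, metadata,
                   msp_name="library.msp", meta_name="library.meta.json"):
    """Return the bytes of a zip archive holding the library and its sidecar."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(msp_name, msp_text)
        archive.writestr(meta_name, json.dumps(metadata, indent=2, sort_keys=True))
    return buffer.getvalue()


# Hypothetical library-level metadata; none of these keys come from a real standard.
metadata = {
    "format_version": "0.1-hypothetical",
    "library_name": "example_contaminants",
    "spectrum_type": "consensus",
    "source_results_files": ["example_results_1.dat"],
    "match_filters": {"min_score": 30},
}

bundle = bundle_library("Name: PEPTIDEK/2\nMW: 927.49\nNum peaks: 0\n", metadata)

# Reading it back: the MSP file is untouched and the metadata round-trips.
with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    names = archive.namelist()
    restored = json.loads(archive.read("library.meta.json"))
print(names, restored["library_name"])
```

Because the MSP file is stored unchanged, any tool that reads MSP today would keep working, and tools that understand the metadata file could use it as well.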

Deutsch et al. concede that “the best strategy may be to develop a standard archival format”, which would ensure data portability between systems. We’re watching developments in this area with interest.

Keywords: database manager, PSI, spectral library


Matrix Science
