Posted by Ville Koskinen (November 16, 2022)

Paleoproteomics

Paleoproteomics is a growing application area for mass spectrometry. Its cross-disciplinary remit includes analysis of ancient proteins (bone, skin, silk), ancient proteomes (enamel, egg shells, plant seeds) and, most ambitiously, ancient metaproteomes (dental calculus, food remains). The recent review by Warinner et al. in Chemical Reviews has excellent coverage not just of the varied applications but also of the sample processing and data analysis challenges. The article is well worth a read. We’ll highlight a few of the key challenges regarding data analysis.

Illustration: A small ceramic bowl containing sun-dried fish and a strip of textile, found in Upper Egypt and on display at the British Museum. (Image courtesy of the British Museum, asset number 544687001.)

Sample collection and protein detection

The number one issue in paleoproteomics is sample collection and preparation, specifically being able to detect any protein at all. Anything that has been surrounded by unfavourable matter like soil, microbes or water for hundreds, or even millions, of years may have very little protein left due to diagenesis. The proteins or peptides may also have undergone all kinds of chemical modification or non-specific cleavage.

In cases where enough protein or peptide is available, mass spectrometers are now sensitive enough to routinely detect them. Data-dependent acquisition (DDA) is commonly used, but neither DDA nor data-independent acquisition (DIA) is optimal for paleoproteomics, and there is rarely enough sample to waste on multiple/parallel reaction monitoring. Warinner et al. highlight the need for improvements in LC, ion mobility and other separation techniques to enhance MS/MS reproducibility.

Sequence databases

Next in difficulty is sequence databases, specifically the lack of species coverage. The issue is the same as with ordinary metaproteomics. Some branches of the tree of life are hugely over-represented in NCBI nr and UniProt, while there are few or no sequences for many historically or geographically important species, like local vegetables, mammals, birds or fish. Metagenomics (or paleogenomics) can be used for creating a sample-specific sequence database, or you can try de novo sequencing. However, while it’s exciting to identify a peptide that doesn’t map to any known protein sequence, it doesn’t exactly help in characterising the original protein or determining the animal or plant species.

The very best thing to do is to make more comprehensive sequence databases from as wide a variety of species as feasible. Even if the target species is not in the database, it helps to have something from the same genus or family. Warinner et al. highlight the Earth BioGenome Project, the Vertebrate Genomes Project and the Darwin Tree of Life project. All are excellent solutions. A diverse sampling of genomes across the animal and plant kingdoms would be invaluable not just for paleoproteomics but also for any kind of metaproteomic analysis, or for any work where the species of interest has no sequenced genome and a related species has to stand in. The size of the resulting database is not an issue for Mascot, which can handle sequence databases of any size, even billions of proteins, although proper taxonomy filtering is essential in such databases.

Assuming a comprehensive enough database exists, a hard problem in paleo-metaproteomics is taxonomic classification. If the sample was digested with trypsin, Mascot’s protein inference combined with UniPept can be beneficial. UniPept allows easy lookup to determine whether a tryptic peptide is unique to a species. If not unique, Warinner et al. note cases where external context can be used for positive inference. For example, a peptide shared by both wild and domestic sheep can be attributed to domestic sheep in a sample from a site in North America, which lacks wild sheep populations. A lot of it is currently down to detective work.
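To make the idea of a species-unique tryptic peptide concrete, here is a minimal local sketch in Python. It is not how UniPept or Mascot work internally. It simply digests each protein in silico and keeps the peptides that occur in only one species. The function names, the `equate_il` option and the toy sequences are all made up for illustration, and the digest assumes full tryptic specificity with no missed cleavages.

```python
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Fully tryptic peptides: cleave after K or R unless followed by P.

    No missed cleavages; peptides shorter than min_length are dropped.
    """
    peptides, start = set(), 0
    last = len(sequence) - 1
    for i, residue in enumerate(sequence):
        if i == last or (residue in "KR" and sequence[i + 1] != "P"):
            if i + 1 - start >= min_length:
                peptides.add(sequence[start:i + 1])
            start = i + 1
    return peptides


def species_unique_peptides(proteins_by_species, equate_il=True):
    """Map each species to the tryptic peptides found in no other species."""
    owners = defaultdict(set)  # peptide -> species that contain it
    for species, sequences in proteins_by_species.items():
        for sequence in sequences:
            if equate_il:
                # I and L have the same mass, so treat them as one residue
                sequence = sequence.replace("I", "L")
            for peptide in tryptic_digest(sequence):
                owners[peptide].add(species)

    unique = defaultdict(set)
    for peptide, species_set in owners.items():
        if len(species_set) == 1:
            (only_species,) = species_set
            unique[only_species].add(peptide)
    return dict(unique)


# Made-up toy sequences, not real sheep or goat proteins
toy_database = {
    "sheep": ["AAAAAAKCCCCCCRDDDDDDK"],
    "goat": ["AAAAAAKCCCCCCREEEEEEK"],
}
print(species_unique_peptides(toy_database))
```

In the toy example only the last peptide of each sequence is reported, because the first two are shared by both species. A shared peptide is exactly the case where the external context described above has to do the work.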

Protein degradation

Many diagenetic modifications like deamidation are so common that they can be used to authenticate ancient proteins. If an identified protein does not have a certain degree of expected damage, it could be a lab contaminant. The review points out the high proportion of unmatched MS/MS spectra in paleoproteomics samples, which may be due to modifications not searched for. The Mascot error tolerant search is well suited for picking up unknown modifications, amino acid substitutions and non-specific cleavage. We developed a new statistical model for error tolerant searching in Mascot Server 2.8, which further increases confidence in modification assignment and amino acid substitutions.
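As a rough illustration of the authentication idea, the sketch below computes the fraction of asparagine (N) and glutamine (Q) residues reported as deamidated among the peptides matched to one protein. The input format, a list of (sequence, set of zero-based deamidated positions) pairs, is a hypothetical structure chosen for the example and not a Mascot export format. No cut-off is suggested, since what counts as the expected degree of damage depends on the sample.

```python
def deamidation_rates(peptide_matches):
    """Fraction of N and of Q residues that are deamidated.

    peptide_matches: iterable of (sequence, deamidated_positions), where
    deamidated_positions is a set of zero-based residue indices.
    Returns a dict such as {"N": 0.5, "Q": 0.25}; a residue type that
    never occurs maps to None.
    """
    counts = {"N": [0, 0], "Q": [0, 0]}  # residue -> [deamidated, total]
    for sequence, deamidated_positions in peptide_matches:
        for position, residue in enumerate(sequence):
            if residue in counts:
                counts[residue][1] += 1
                if position in deamidated_positions:
                    counts[residue][0] += 1
    return {
        residue: (deamidated / total if total else None)
        for residue, (deamidated, total) in counts.items()
    }


# Made-up peptide matches for a single protein
matches = [
    ("GNQK", {1}),        # N deamidated, Q intact
    ("ANNQR", {1, 3}),    # first N and the Q deamidated
    ("PEPTLDEK", set()),  # no N or Q at all
]
print(deamidation_rates(matches))
```

Running this per protein gives a simple way to compare damage levels side by side. A protein with plenty of N and Q residues but rates near zero would be a candidate for a closer look as a possible modern contaminant.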

Additionally, we improved the default Percolator feature set shipped with Mascot, which greatly increases identification rates of peptides with non-specific cleavage. This is illustrated in the earlier blog article Identify more HLA peptides. The “wall clock” time required for a semi-specific or no-enzyme search is still substantial, but you can speed it up by adding more CPUs to your Mascot licence.

Another problem with protein degradation is unexpected crosslinking or condensing into novel chimeric structures. Might Mascot’s intact crosslink identification help?

Spectral libraries

Warinner et al. don’t discuss the use of spectral libraries, but there are two obvious use cases.

The first is creating a spectral library for contaminants. Paleoproteomics samples have contaminants common with ordinary lab work, but they also suffer from all kinds of ambient contamination. Some of these could be unique to an archaeological site or the object’s handling history, but if there is any commonality or repeated sampling from similar conditions, might it help to create a spectral library for them? Mascot supports integrated database and library searching, where contaminant spectra could match in the library and the rest in the sequence database.

The second use case could be developing a spectral library or libraries for species identification. Paleoproteomics is often limited to studying collagen, which tends to survive best. Warinner et al. discuss a few examples of collagen peptides acting as biomarkers for specific species, such as differentiating sheep from goat. Assuming reproducibility is good enough, could consensus spectra for the biomarkers be collected in a spectral library? This could then be searched much more quickly and with more confidence than a general database search.
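For readers unfamiliar with library searching, the core operation is comparing a query spectrum with a stored consensus spectrum. The sketch below uses a generic binned cosine similarity, which is a common textbook measure and is not a description of the scoring used by Mascot's library search. The bin width and the peak lists are arbitrary illustrative values.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.5):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=0.5):
    """Cosine of the angle between two binned spectra (0 = no overlap)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up (m/z, intensity) peak lists
library_spectrum = [(175.1, 40.0), (288.2, 100.0), (401.3, 65.0)]
query_spectrum = [(175.1, 35.0), (288.2, 90.0), (401.3, 70.0), (550.4, 5.0)]
unrelated_spectrum = [(120.0, 80.0), (333.3, 100.0)]

print(cosine_similarity(library_spectrum, query_spectrum))
print(cosine_similarity(library_spectrum, unrelated_spectrum))
```

A score near 1 means the fragment patterns agree closely, which is why good reproducibility between acquisitions matters so much for a biomarker library.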

Peptide mass fingerprinting

Much of paleoproteomics nowadays uses LC-MS/MS, but there are still a number of use cases for MALDI-TOF and peptide mass fingerprinting (PMF), which in this field goes by the name ZooMS (Zooarchaeology by Mass Spectrometry). An example is determining the animal species used in old manuscripts written on parchment, where sampling must be non-invasive and only a smattering of molecules off the parchment surface can be collected. That is not enough sample for LC-MS/MS, but maybe enough to get a peptide mass fingerprint.
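The principle behind a peptide mass fingerprint can be sketched in a few lines of Python: calculate the singly protonated monoisotopic mass of each candidate peptide and look for observed peaks within a mass tolerance. This is only the mass-matching step under simple assumptions (unmodified peptides, a fixed tolerance in Da). It is not a scoring algorithm, and the peptides and peak list are made up.

```python
# Monoisotopic residue masses in Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def mh_plus(peptide):
    """Monoisotopic [M+H]+ m/z of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER + PROTON


def match_fingerprint(candidate_peptides, observed_mz, tolerance_da=0.2):
    """Return (peptide, theoretical m/z, observed m/z) for every match."""
    matches = []
    for peptide in candidate_peptides:
        theoretical = mh_plus(peptide)
        for mz in observed_mz:
            if abs(mz - theoretical) <= tolerance_da:
                matches.append((peptide, round(theoretical, 4), mz))
    return matches


# Made-up candidates and peak list; real tryptic peptides would be longer
candidates = ["GAK", "GGR"]
peak_list = [275.2, 600.0]
print(match_fingerprint(candidates, peak_list))
```

A PMF search engine does far more than count matches, for example weighing them against the number of peptides that could have matched by chance, and variable modifications change the theoretical masses. The sketch only shows why a handful of accurate peptide masses can be enough to point to a species.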

Mascot has supported PMF since the very beginning, and many of the searches using the free Mascot on our public website are PMF. So, it’s a bit shocking to read in Warinner et al.’s review that PMF analysis in paleoproteomics is done by manual peak annotation! Why not use a PMF search engine? Mascot PMF search supports all the usual “paleo” variable modifications, and Mascot Distiller can easily be used for peak picking and visualising the MS scan(s). Answers in comments or by e-mail, please.

Keywords: error tolerant, metaproteomics, paleoproteomics, PMF, spectral library

One comment on “Paleoproteomics”

Ville Koskinen on March 7, 2023 at 10:29 said:

The Guardian newspaper reports on a project by the Francis Crick Institute and the Natural History Museum, another sign of growing interest in paleoproteomics:

https://www.theguardian.com/science/2023/mar/05/new-analysis-of-ancient-human-protein-could-unlock-secrets-of-evolution

Matrix Science