
Posted by John Cottrell (April 9, 2013)

Non-standard amino acid residues

Mascot supports only the 26 letters of the Latin alphabet as one-letter codes in sequence database entries. It is also case-insensitive, so you cannot use (say) R and r for different residues. This is quite a limitation if you want to create a custom database that encodes non-standard or modified residues.

This isn’t a concern if you search only public databases. Databases from NCBI/GenBank, UniProt and EBI/EMBL use the IUPAC one-letter code, which covers the 20 standard amino acids plus U for selenocysteine and O for pyrrolysine. The remaining four letters are reserved for ambiguity: B = D or N, J = I or L, X = unknown, Z = E or Q. Other residues or modifications may be described in the annotations, but they cannot be represented in the actual sequence. (For an interesting discussion of more exotic alphabets, see this paper.)
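
To make the split of the 26 letters concrete, here is a minimal Python sketch that sorts the characters of a sequence into the groups described above. The function name `classify` and the report layout are illustrative assumptions, not part of Mascot or of any database toolkit. Input is upper-cased first, mirroring the case-insensitivity mentioned earlier.

```python
# Illustrative only: groups follow the IUPAC one-letter code as described in the text.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")          # 20 standard amino acids
NON_STANDARD = {"U": "selenocysteine", "O": "pyrrolysine"}
AMBIGUITY = {"B": "DN", "J": "IL", "Z": "EQ", "X": None}  # X = unknown


def classify(sequence):
    """Count how many characters of `sequence` fall into each group."""
    report = {"standard": 0, "non_standard": 0, "ambiguity": 0, "invalid": 0}
    for ch in sequence.upper():  # case-insensitive, so 'r' and 'R' are the same code
        if ch in STANDARD:
            report["standard"] += 1
        elif ch in NON_STANDARD:
            report["non_standard"] += 1
        elif ch in AMBIGUITY:
            report["ambiguity"] += 1
        else:
            report["invalid"] += 1  # anything outside A-Z, e.g. '*' or '-'
    return report


if __name__ == "__main__":
    print(classify("MKUbx*"))
```

Together the three groups use up all 26 letters, which is why a custom database has no spare code to assign without redefining an existing one.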

What are your options if you are creating a custom database and want to hijack some of these codes for other purposes? Within Mascot, the compositions of the amino acid residues are taken from a file called unimod.xml in the Mascot config directory, which contains a dump of the information in the public Unimod database. Because Unimod is primarily a database of modifications, it only defines the 20 standard amino acids, and the non-standard residues and the ambiguity codes are undefined in unimod.xml. It may be that O and U will be added to Unimod at some stage, but not the ambiguity codes.

The interpretation of B, U, X, and Z is hard-coded in Mascot; defining these codes in unimod.xml will have no effect. In a default Mascot installation, O and J are undefined, effectively having zero mass. This is because O and J were little used when Mascot was first released. Official approval of the use of O for pyrrolysine dates from 2009, and the use of J for I or L is still not part of the IUPAC standard. However, we need to tidy this up. Only B, X, and Z should be hard-coded, while J, O and U should default to (iso)leucine, pyrrolysine and selenocysteine, but with some mechanism to allow them to be re-defined.
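
As a rough illustration of what those defaults would mean in terms of mass, the sketch below computes monoisotopic residue masses from elemental compositions and treats any undefined code as zero mass. This is not Mascot code: the names `residue_mass`, `mass_of` and `PROPOSED_DEFAULTS` are hypothetical, and the compositions are the usual residue formulas (free amino acid minus water), with selenium taken as the 80Se isotope.

```python
# Monoisotopic element masses in Da (80Se assumed for selenium).
MONO = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "Se": 79.9165213,
}

# Hypothetical table of the defaults suggested in the text.
PROPOSED_DEFAULTS = {
    "J": {"C": 6, "H": 11, "N": 1, "O": 1},            # (iso)leucine residue
    "O": {"C": 12, "H": 19, "N": 3, "O": 2},           # pyrrolysine residue
    "U": {"C": 3, "H": 5, "N": 1, "O": 1, "Se": 1},    # selenocysteine residue
}


def residue_mass(composition):
    """Sum element masses for a composition given as {symbol: count}."""
    return sum(MONO[element] * count for element, count in composition.items())


def mass_of(code, definitions):
    """Return the residue mass for a one-letter code, or 0.0 if it is undefined."""
    composition = definitions.get(code.upper())
    return residue_mass(composition) if composition else 0.0


if __name__ == "__main__":
    for code in "JOU":
        print(code, round(mass_of(code, PROPOSED_DEFAULTS), 4))
    print("J with no definition:", mass_of("J", {}))
```

With an empty definitions table, J and O come out at zero, which is the "effectively having zero mass" situation described above.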

As things are, you can easily re-define the compositions of the 20 standard amino acids by editing unimod.xml, although this will be inadvisable in most cases. You can also add definitions to unimod.xml for J and O. In SwissProt 2013_04, there are just 29 O’s and no J’s, so you won’t miss much, but note that the current NCBInr has 15309 J’s, which is a surprisingly large number.
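
If you want to check how many J's and O's your own database contains before deciding whether to define them, a short script is enough. The following is a minimal sketch that assumes a plain FASTA layout (header lines starting with `>`); the function name `count_residues` and the demo entry are made up for illustration.

```python
from collections import Counter


def count_residues(fasta_lines, codes="JO"):
    """Count occurrences of the given one-letter codes in sequence lines only."""
    counts = Counter({code: 0 for code in codes})
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        upper = line.upper()  # sequence case is not significant
        for code in codes:
            counts[code] += upper.count(code)
    return counts


if __name__ == "__main__":
    # Made-up entry; letters in the header must not be counted.
    demo = ">demo|0001 JOJO header text\nMKOJ\nLLJo\n".splitlines()
    print(dict(count_residues(demo)))
```

For a real file, pass an open file handle instead of the demo list, since the function only needs an iterable of lines.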

A good example of re-defining a code was the use of J as an unconditional cleavage site in MSIPI. This database was created by appending tryptic peptide sequences spanning known cSNPs, N-terminal peptides, and other variants to the protein sequences from IPI. The additional peptides were delimited with the letter J and ‘released’ by defining a custom enzyme that cleaved on both the C-terminal and N-terminal sides of J, in addition to tryptic cleavage. The mass of J was not important, but would ideally be set to 113 Da so that J could still function as the I or L ambiguity code in other databases. Unfortunately, MSIPI became obsolete when IPI was discontinued.
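
The effect of such an enzyme can be sketched in a few lines of Python. This is only an in-silico illustration of the cleavage rule described above, not the MSIPI enzyme definition or Mascot's digestion code. It assumes the common tryptic rule (cleave C-terminal to K or R, but not before P), allows no missed cleavages, and discards the lone J delimiters. The function name `digest` and the example sequence are hypothetical.

```python
def digest(sequence, delimiter="J"):
    """Fully cleave at tryptic sites and on both sides of the delimiter residue."""
    seq = sequence.upper()
    if not seq:
        return []
    cut_points = set()
    for i, residue in enumerate(seq):
        # Tryptic rule: cut after K or R unless the next residue is P.
        if residue in "KR" and i + 1 < len(seq) and seq[i + 1] != "P":
            cut_points.add(i + 1)
        # Unconditional cleavage on both sides of the delimiter.
        if residue == delimiter:
            cut_points.add(i)
            cut_points.add(i + 1)
    inner = sorted(p for p in cut_points if 0 < p < len(seq))
    bounds = [0] + inner + [len(seq)]
    fragments = [seq[start:end] for start, end in zip(bounds, bounds[1:])]
    # Drop the single-residue delimiter fragments and any empty pieces.
    return [f for f in fragments if f and f != delimiter]


if __name__ == "__main__":
    # Made-up protein sequence followed by two J-delimited appended peptides.
    print(digest("MAGKPLRSTJAVNKJGGHR"))
```

Because J is cut on both sides, an appended peptide is released even when the preceding residue is not K or R, which is what makes the delimiter "unconditional".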

If you know of other examples of public databases that use non-standard codes, or have an application where it would be useful to extend the one-letter code to more than 26 letters, please leave a comment.


3 comments on “Non-standard amino acid residues”

  1. Emanuele Alpi said:

    Hello,
    regarding O- and U-containing sequences, the statement “you won’t miss much” may not fit sequence providers that are willing to use every bit of peptide level evidence coming from peptide and protein MS for annotation purposes.
    Best Regards

    Emanuele Alpi

    • Bethany Ahlers said:

      Hi, I am trying to add a modification of Dehydroalanine-Selenocysteine to my Mascot search engine. With this, I am trying to define the specificity site as U; however, I do not see it as an option. How do I add this amino acid to the list?

      Thanks in advance!
      Bethany

      • John Cottrell said:

        There is a later blog article that should answer this question: Selenocysteine (February 16, 2016)