The minutiae of database management
The two major inputs into a database search are sequence databases and mass spectrometry data. Database management in Mascot Server has improved steadily with each new version, but a few issues still come up regularly, and this article covers them.
Common error messages
When you activate a new database, Mascot Server runs a test search to confirm that everything is OK. For very large databases like NCBIprot or UniRef100, this test search can time out:
Error [M00103 - Job -11 - X02168:monitor] - Thu Aug 04 02:43:34 2020 -
Monitor test search on database NCBIprot timed out
The fix is to increase MonitorTestTimeout in Configuration Editor->Configuration Options. Double the value, then click the “Retry” link on the database status page or “Recompress” in Database Manager.
Mascot Server has a soft limit of 256 sequence databases by default. The count includes both active and inactive definitions, for sequence databases and spectral libraries alike. If you exceed this number, you will see the following error:
Error [M00043 - Job -11 - X00241:ms_controlfile] - Wed Dec 18 13:31:22 2019 -
Maximum number of active databases has been exceeded.
The database 'New_Database' will not be available.
The easy fix is to increase MaxDatabases in Configuration Editor->Configuration Options. Note that a higher value makes Mascot use more RAM, so you don’t want to set it unnecessarily high. You’ll also need to restart the Mascot monitor service after changing the value.
Over time, search results and cache files accumulate and can eventually fill up the disk. We have a tidy_data.pl script that compresses older results and empties the cache, minimising disk usage. If you are cleaning up the mascot\data directory manually, be careful not to delete the “data\test” directory. The “test” directory is very important because it contains the test search results for each database, which Mascot checks when starting up. It also contains the template file for the test searches, do_not_delete.asc. If this file is missing, you will see the following error when adding a new database or restarting Mascot Server:
Cannot start new database. Missing source test .asc file:
'../data/test/do_not_delete.asc' [M00532]
To repair the server, go to the mascot\data directory and check that the “test” directory exists. Then make sure that the “test” directory contains the “do_not_delete.asc” file. You can download a replacement here.
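The two checks above are easy to automate before and after a manual clean-up. The following is a minimal Python sketch, not part of Mascot Server: the function name check_test_directory is hypothetical, and you need to pass in the path to the data directory of your own installation.

```python
from pathlib import Path


def check_test_directory(data_dir):
    """Report whether data_dir/test and its template file are present."""
    test_dir = Path(data_dir) / "test"
    template = test_dir / "do_not_delete.asc"
    if not test_dir.is_dir():
        return f"missing directory: {test_dir}"
    if not template.is_file():
        return f"missing template file: {template}"
    return "ok"
```

Calling check_test_directory with the path to the mascot data directory returns "ok" only when both the directory and the template file are in place.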
General tips
Every FASTA database entry must have a unique identifier. Mascot selects an identifier from the FASTA title line using an accession parse rule, which is chosen when the database is configured. After activating the database to bring it online, you may see an error reporting that an accession is longer than 50 characters:
Error [M00421 - Job -16 - X00293:compress] - Wed Jan 6 16:52:41 2020 - Warning
- accession [scaffold1-1000001_1005000-Diaphorina_103[3419-3460]] is longer than 50 characters
The fix is to go back to the configuration details and change the parse rule. On rare occasions you may need to create a new parse rule, or it may not be possible to find a parse rule that yields unique accessions of 50 characters or fewer. This can happen with custom databases generated from genomic sequence, or with databases that combine entries from sources that format their description lines differently. In this case we can provide a script that adds a unique identifier to each entry.
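To give an idea of what such a clean-up involves, here is a minimal Python sketch. It is not the script mentioned above; the function name add_unique_ids and the “SEQ” prefix are hypothetical choices for illustration. It puts a short sequential identifier at the front of every title line and keeps the original title as the description.

```python
import io


def add_unique_ids(fasta_in, fasta_out, prefix="SEQ"):
    """Prefix every FASTA title line with a short sequential identifier.

    fasta_in and fasta_out are text file objects. Returns the entry count.
    """
    count = 0
    for line in fasta_in:
        if line.startswith(">"):
            count += 1
            # Keep the original title as the description after the new ID.
            fasta_out.write(f">{prefix}{count:08d} {line[1:].strip()}\n")
        else:
            fasta_out.write(line)
    return count


# Example with in-memory files; real use would open the FASTA files instead.
source = io.StringIO(">scaffold1-1000001_1005000-Diaphorina_103[3419-3460]\nMKV\n")
target = io.StringIO()
add_unique_ids(source, target)
```

With identifiers in this form, the new identifier is always the first space-delimited token on the title line and is well under 50 characters.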
There is a soft limit on the length of individual sequences. The default is 80,000 residues in Mascot 2.7 and 50,000 in earlier versions. If a sequence exceeds the limit, you will see the following error:
Error [M00238 - Job -16 - X00290:fasta] - Wed Nov 11 12:56:20 2019
- Sequence with more than 80000 residues ignored. (Accession number: NCLIV_chrIa-0R)
In this case you can raise the MaxSequenceLen option in Configuration Editor->Configuration Options. For the same reason as with MaxDatabases, we recommend not setting it unnecessarily high, although values of 1 million or more are quite workable on today’s computers with many gigabytes of RAM. Sometimes the database is a complete genome stored as a single entry with many millions of nucleotides. This is not ideal for searching, because every match will be to that single entry. We provide a script that splits such a database into multiple overlapping entries, which makes the search results much easier to interpret.
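The idea behind splitting a long entry can be sketched in a few lines of Python. This is an illustration only, not the script we provide: the function name split_sequence, the default chunk length and overlap, and the accession_start_end naming scheme are all assumptions.

```python
def split_sequence(accession, sequence, chunk_len=50000, overlap=1000):
    """Split one long sequence into overlapping (accession, sequence) entries.

    Each new accession records the 1-based start and end of its chunk.
    """
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    if not sequence:
        return []
    step = chunk_len - overlap
    entries = []
    start = 0
    while True:
        piece = sequence[start:start + chunk_len]
        entries.append((f"{accession}_{start + 1}_{start + len(piece)}", piece))
        if start + chunk_len >= len(sequence):
            break  # this chunk reached the end of the sequence
        start += step
    return entries
```

The overlap means that a peptide spanning a chunk boundary still lies wholly within at least one entry, provided the overlap is longer than the peptide's coding region. The generated accessions also need to stay within the 50 character limit.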
Custom databases
If you are working with non-lab species like agricultural plants or barnyard animals, we recommend setting up a custom proteome database using the UniProt proteome template. Alternatively, if you want to use NCBIprot accession numbers, download the sequences from https://www.ncbi.nlm.nih.gov/protein. In the latter case, type in the species name or taxonomy ID, filter as desired, click “Send to”, choose File, and select the FASTA format.
If you are going to create a custom database, use a simple header line for each entry with a space between the accession string and the description, like “>ACCESSION Description”. This will make it very easy to add to Mascot Server using the ‘simple_AA_template’ predefined template. A mixture of very complex accession number formats and inconsistent formatting between entries leads to some of the problems mentioned above.
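Before adding a custom file to Mascot Server, it can be worth checking the title lines yourself. The Python sketch below is a hypothetical helper, not a Mascot tool. It assumes the simple “>ACCESSION Description” layout, takes everything up to the first space as the accession, and reports accessions that are missing, duplicated or longer than 50 characters.

```python
def check_accessions(lines, max_len=50):
    """Return (line_number, problem) pairs for FASTA title lines."""
    seen = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        # Assumption: the accession is the text between '>' and the first space.
        accession = line[1:].split(" ", 1)[0].strip()
        if not accession:
            problems.append((number, "missing accession"))
        elif len(accession) > max_len:
            problems.append((number, f"accession longer than {max_len} characters"))
        elif accession in seen:
            problems.append((number, f"duplicate accession {accession}"))
        seen.add(accession)
    return problems
```

Passing an open FASTA file to check_accessions returns an empty list when every title line has a unique accession of acceptable length.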
Non-standard amino acids are supported in databases by using the letters J, O and U which can be redefined as needed, although U is typically used for selenocysteine. For example, the non-standard residues can be used for modeling glycosidic cleavage ions from N-linked and O-linked glycans.
Mascot Server also supports searching nucleic acid databases and performs a 6-frame translation on the fly. When a stop codon is encountered, it leaves a gap and immediately restarts translation. Note that some third-party software, like Proteome Discoverer, does not support searching nucleic acid databases, so you will need to search the data directly on the Mascot Server.
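For readers unfamiliar with the idea, the following Python sketch shows what a 6-frame translation with gaps at stop codons looks like in principle. It is not Mascot's implementation. It assumes the standard genetic code, translates codons containing ambiguous bases as X, and returns each frame as a list of the stretches between stop codons. All names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return {frame: [segments]}, splitting each frame at stop codons."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            protein = translate(seq[offset:])
            # A stop codon ends one segment; translation resumes after it.
            frames[f"{strand}{offset + 1}"] = [s for s in protein.split("*") if s]
    return frames
```

For the sequence ATGGCCTAAATGAAA, frame +1 gives the two segments MA and MK, separated by the TAA stop codon.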
Extended support for old modification names in spectral libraries
A new feature was added to Mascot Server 2.7 to support non-standard modification names in spectral libraries. Old modification names or acronyms that do not meet the PSI modification nomenclature guidelines live on in old data sets, some of which have been used to create spectral libraries that are publicly available from NIST. Mascot Server 2.7 introduced an aliases file that maps the old names onto the newer PSI names. Details are in the February 2020 newsletter tip.
Keywords: configuration files, database manager, Fasta, sysadmin