Mascot Parser supports two types of cache file, which speed up access to very large results files: the resfile cache, which applies to the raw results file (ms_mascotresfile_dat), i.e. the .dat file, and the pepsum cache, which applies to the processed results (ms_mascotresults/ms_peptidesummary).
Mascot Server 3.0 introduces a new file format, Mascot Search Results (MSR). The MSR file is an SQLite database, which provides fast random access to the 'raw' search results. Unlike dat28 files (described below), no resfile cache is required for MSR.
Parser supports saving protein inference results from an MSR file in a pepsum cache.
Mascot Server 2.8 and earlier saved results in a MIME-format file with .dat extension, now called the dat28 format. For new applications, we recommend using MSR files, as they are much more efficient for random access.
For dat28 files, Mascot Parser 2.3 and later support the use of cache files to speed up access. The resfile cache is an index to different sections and raw lines of the results file itself, and only needs to be rebuilt if the file changes.
Parser supports saving protein inference results from a dat28 file in a pepsum cache.
The pepsum cache contains processed and grouped protein hits and peptide data. It is created and used by ms_peptidesummary. Without a pepsum cache, protein grouping needs to be done anew every time an ms_peptidesummary object is created. With caching, the grouped data can be stored in the cache file, and subsequent file access can skip protein grouping entirely.
For MSR files, the pepsum cache has file extension .sdb, which is an SQLite database.
For dat28 files, the pepsum cache has file extension .cdb, which is a read-only key-value database.
In a standard Mascot Server setup, cache files are located in a directory specified by the CacheDirectory value in the options section of mascot.dat. If CacheDirectory specifies a relative directory, then it is relative to the current working directory of the calling program. The function ms_mascotoptions::getCacheDirectory() can be used to retrieve the cache directory from mascot.dat. In all Mascot Server scripts, the return value from this function is passed to ms_mascotresfilebase::createResfile().
Cache files for a specific results file are put into their own directory under the directory specified by CacheDirectory. The name of the directory is constructed by calling getMD5Sum() and passing a string comprising the filename (or filenames if combining multiple results files; see Combining multiple results files), file size and last modified date. The function ms_mascotresfile::getCacheDirectory() can be used to retrieve the directory used for cache files for a specific results file.
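To illustrate how such a directory name depends on the identity of the results file, here is a minimal Python sketch. It is not Parser's implementation: the helper name cache_subdir_name, the '|' separator used to join the fields, and the hexadecimal digest encoding are all assumptions made for illustration. In real code, call ms_mascotresfile::getCacheDirectory() rather than computing the name yourself.

```python
import hashlib

def cache_subdir_name(filenames, file_size, last_modified):
    """Hypothetical helper: derive a directory name from file identity.

    The field order follows the text (filenames, size, last modified date).
    The '|' separator and hex encoding are assumptions, not Parser's format.
    """
    key = "|".join(list(filenames) + [str(file_size), str(last_modified)])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Same file identity gives the same directory name.
a = cache_subdir_name(["F01234.dat"], 1048576, 1580688000)
b = cache_subdir_name(["F01234.dat"], 1048576, 1580688000)
# A changed file size gives a different directory name.
c = cache_subdir_name(["F01234.dat"], 2097152, 1580688000)
print(a == b, a == c)
```

The point of the sketch is only that any change to the filename, size or modification date leads to a different cache directory, so stale cache files are never picked up for a modified results file.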
A number of special 'tokens' can be used to split the potentially large number of files/directories into more convenient subdirectories. The 'tokens' are expanded by the strftime function, using the last modified date of the results file. Whilst the strftime function has a large number of options, the ones most likely to be useful include %Y (four-digit year), %m (two-digit month, 01-12) and %d (two-digit day of the month, 01-31).
Other tokens accepted by strftime are documented here: http://www.cplusplus.com/reference/clibrary/ctime/strftime/
The default value is "../data/cache/%Y/%m", which specifies a new directory for each month. For example, for a file dated 3 February 2020, the files would be in the directory ../data/cache/2020/02.
Mascot Parser will try to create a cache directory if it doesn't exist. If necessary, it will create the whole directory tree.
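The token expansion and directory creation described above can be sketched in Python, whose strftime accepts the same %Y and %m tokens as the C function. The helper name expand_cache_directory is hypothetical, and the temporary root directory is used only so that the example does not write outside a scratch area.

```python
import os
import tempfile
from datetime import datetime

def expand_cache_directory(template, last_modified):
    """Expand strftime tokens in a CacheDirectory template."""
    return last_modified.strftime(template)

# The default template, applied to a file last modified on 3 February 2020.
expanded = expand_cache_directory("../data/cache/%Y/%m", datetime(2020, 2, 3))
print(expanded)  # ../data/cache/2020/02

# Create the whole directory tree if it is missing (under a temporary root).
root = tempfile.mkdtemp()
target = os.path.join(root, expand_cache_directory("cache/%Y/%m", datetime(2020, 2, 3)))
os.makedirs(target, exist_ok=True)
```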
The base name of the pepsum cache file for the dat28 format is:
F01234.dat.[a-z0-9]*.cdb
The base name of the pepsum cache file for the MSR format is:
F01234.msr.[a-z0-9]*.sdb
The segment [a-z0-9]* is replaced by exactly 26 digits or lower-case letters, created by calling getMD5Sum() on a string built from all parameters passed to the ms_peptidesummary constructor, including the UniGene index file path.
If, for example, two reports are created for the same results file, one with a probability threshold of 0.05 and another with a threshold of 0.001, then separate cache files with unique filenames will be created, making it fast to switch between the two reports. Separate cache files are necessary, because changing any of the constructor parameters may change protein and peptide scoring, grouping, cut-off thresholds, etc.
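As a small illustration of the naming scheme, the following Python sketch builds a regular expression that recognises pepsum cache files belonging to one results file. The helper name pepsum_cache_pattern and the two 26-character segments are made up for the example; real names should be obtained from ms_peptidesummary::getCacheFileName().

```python
import re

def pepsum_cache_pattern(results_file_name):
    """Return a compiled regex for pepsum cache names of one results file."""
    # MSR results use an .sdb cache; dat28 results use a .cdb cache.
    ext = "sdb" if results_file_name.lower().endswith(".msr") else "cdb"
    return re.compile(re.escape(results_file_name) + r"\.[a-z0-9]{26}\." + ext)

pattern = pepsum_cache_pattern("F01234.dat")

# Two placeholder names, standing in for reports with different thresholds.
report_a = "F01234.dat." + "a" * 26 + ".cdb"
report_b = "F01234.dat." + "b7" * 13 + ".cdb"
print(bool(pattern.fullmatch(report_a)), bool(pattern.fullmatch(report_b)))
```

Both names match the same results file but differ in the 26-character segment, which is how separate reports keep separate cache files.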
The following flags do not affect the contents of the cache file and therefore are not included in the MD5Sum for the name:
The function ms_peptidesummary::getCacheFileName() can be called to retrieve the full or relative path to the cache file.
By default, the ms_mascotresfile_dat constructor uses the flag RESFILE_NOFLAG, which means that no cache is used when a results file is opened. Specify RESFILE_USE_CACHE to use a cache. This flag is a no-op for MSR files.
Most errors relating to the cache files are 'soft'. For example, if two applications both try to create a cache file at the same time, then the first will succeed in creating the cache file, and the second will carry on without using a cache. If an application crashes in the middle of creating a file, then the next application that tries to use the cache will re-create it. If an application fails to create the cache for any reason, it will carry on without the cache.
A resfile cache may be recreated under some conditions, even when it already exists:
The resfile cache has a maximum size of 4GB. In general, the ms_mascotresfile_dat cache file is about 10% of the size of the results file.
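As a rough worked example of the 10% rule of thumb, the following Python sketch estimates the cache size for a given results file and compares it with the 4GB limit. The helper name is hypothetical, "4GB" is read here as 4 GiB (an assumption), and the 10% figure is only the approximate ratio quoted above.

```python
# "4GB" interpreted as 4 GiB; this interpretation is an assumption.
MAX_RESFILE_CACHE_BYTES = 4 * 1024**3

def estimate_resfile_cache_size(results_file_bytes):
    """Rough estimate: about 10% of the results file size."""
    return results_file_bytes // 10

size = estimate_resfile_cache_size(10 * 1024**3)  # a 10 GiB results file
print(size, size <= MAX_RESFILE_CACHE_BYTES)
```

By this estimate, a 10 GiB results file needs roughly 1 GiB of cache, well within the limit.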
The static function ms_mascotresfile_dat::willCreateCache() can be called before creating an ms_mascotresfile_dat object to determine if a cache file will be created. This is useful in checking whether creating the object will take a long time.
The function ms_mascotresfile::getCacheFileName() can be called to retrieve the full or relative path to the cache file.
Specify MSPEPSUM_USE_CACHE when creating the ms_peptidesummary object to enable caching.
If the cache file doesn't exist, then it will be created. If it exists, then the ms_peptidesummary constructor will return without extra delay. Calling getHit() and getPeptide() will load data from the cache file on demand. This means that after the cache file has been created once, you will need much less memory to access data in the file, as the whole file does not need to be read into memory.
Different versions of Parser cache more or less data in the pepsum cache. The following list is an example of what is stored:
minProbability
value:
Assuming MSRES_DECOY is not specified, no actual decoy matches will be saved. The same is true in reverse: when using MSRES_DECOY, protein hits from the standard search are not cached.
If a UniGene index is in use and the UniGene file changes, the cache will be rebuilt.
If you use the cache for an ms_mascotresfile_dat object, and then create a new ms_peptidesummary object, creating the pepsum cache will typically take 1.5 to 3 times longer than without an ms_mascotresfile_dat cache.
The static function matrix_science::ms_peptidesummary::willCreateCache() can be called before creating an ms_peptidesummary object to determine if there will be a delay while a cache file is created.
Once a cache file has been created, it should take less than a second to create a new ms_peptidesummary object. Calling matrix_science::ms_mascotresults::getHit() should also be very fast as this just loads data from the cache file. However, only basic information for each matrix_science::ms_protein object is loaded at this time. The first call for any particular ms_protein object to a function that takes a pepNumber argument (for example: matrix_science::ms_protein::getPeptideIonsScore() ) will be slow because this will cause a reload of data from multiple parts of the results file. Subsequent calls to any function for that ms_protein object will be fast, until matrix_science::ms_mascotresults::freeHit() is called.
Some applications just need a list of query/rank values for each protein, so these values are cached separately for each top level protein and family member protein. Therefore, the first call to matrix_science::ms_protein::getPeptideQuery() or matrix_science::ms_protein::getPeptideP() will be reasonably fast compared with calls to the other functions that take a pepNumber argument.
Calls to any function that takes a OneInXprobRnd argument will be slow if 1) caching is in use and 2) the argument is not the same as when the cache file was created. Typically, this argument is 1 / results.getProbabilityThreshold(). The value is compared to what is saved in the cache file (within a certain precision), and if the argument differs, this may trigger an expensive loop over all queries.
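The relationship between the probability threshold and the OneInXprobRnd argument can be sketched in Python as follows. Both helper names are hypothetical, and the relative tolerance of 1e-6 is an assumption: the text above says only that the comparison is made within a certain precision.

```python
import math

def one_in_x_prob_rnd(probability_threshold):
    """Convert a probability threshold to the OneInXprobRnd form."""
    return 1.0 / probability_threshold

def matches_cached_value(argument, cached, rel_tol=1e-6):
    """Compare within a tolerance; the 1e-6 value is an assumption."""
    return math.isclose(argument, cached, rel_tol=rel_tol)

# Value stored when the cache was built with a threshold of 0.05.
cached = one_in_x_prob_rnd(0.05)

same = matches_cached_value(one_in_x_prob_rnd(0.05), cached)    # same threshold
other = matches_cached_value(one_in_x_prob_rnd(0.001), cached)  # different threshold
print(same, other)
```

A threshold of 0.05 corresponds to a value of about 20 and a threshold of 0.001 to about 1000, so passing the second to a cache built with the first falls outside any reasonable tolerance and takes the slow path.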
The affected functions are:
Parser is backwards compatible with cache files from a number of previous versions. New functionality often necessitates adding more data to new cache files to improve performance. If this data is not present in the current cache file (e.g. the cache file is from one or two versions before the current version), it is read directly from the results file or calculated on the fly.
The enum ms_peptidesummary::BUGFIX_NUM maintains a list of new functionality and performance improvements where this may be the case. If you use cache files created by a previous version of Parser, it is good practice to call ms_peptidesummary::isDataCached() with the relevant bug number to discover whether the data needed by the method you use is indeed present in the cache file. If it isn't, you can avoid the function call, prepare a progress feedback screen or recreate the cache files.
There are two lines in the options section of mascot.dat which specify what each script should do:
ResfileCache master_results.pl,master_results_2.pl,peptide_view.pl...
ResultsCache master_results.pl,master_results_2.pl,peptide_view.pl...
Each script should check whether it is listed in ResfileCache, and if so, it should specify RESFILE_USE_CACHE when creating the ms_mascotresfilebase object. The value can be retrieved from the options section of mascot.dat by calling ms_mascotoptions::getResfileCache(). This procedure is correct for both MSR and dat28 files; ms_mascotresfile_msr simply ignores RESFILE_USE_CACHE if it is set.
Each script should check whether it is listed in ResultsCache, and if so, it should specify MSPEPSUM_USE_CACHE when creating the ms_mascotresults object. The value can be retrieved from the options section of mascot.dat by calling ms_mascotoptions::getResultsCache().
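The lookup a script performs against these two options can be sketched in Python as below. The helper name script_uses_cache is hypothetical, the option values are shortened examples rather than a real configuration, and the exact matching rules (for instance, case sensitivity) are an assumption. In a real script, the strings would come from ms_mascotoptions::getResfileCache() and ms_mascotoptions::getResultsCache().

```python
def script_uses_cache(option_value, script_name):
    """Return True if script_name appears in a comma-separated option value."""
    listed = [name.strip() for name in option_value.split(",") if name.strip()]
    return script_name in listed

# Shortened example values; a real mascot.dat may list more scripts.
resfile_cache = "master_results.pl,master_results_2.pl,peptide_view.pl"
results_cache = "master_results.pl,master_results_2.pl,peptide_view.pl"

# peptide_view.pl is listed, so it would specify RESFILE_USE_CACHE.
use_resfile_cache = script_uses_cache(resfile_cache, "peptide_view.pl")
# protein_view.pl is not in this example list, so no MSPEPSUM_USE_CACHE.
use_pepsum_cache = script_uses_cache(results_cache, "protein_view.pl")
print(use_resfile_cache, use_pepsum_cache)
```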
To quickly load details for a single protein (e.g. for protein_view.pl), applications prior to Mascot Parser 2.3 would generally use the singleHit parameter. However, it is faster to use a cache file if it exists.
For ms_proteinsummary, no peptide summary cache is available, so continue to follow the instructions in Getting a single hit from the protein summary.
For ms_peptidesummary, the fastest method of loading details for a single protein is to use the cache if it already exists. For protein_view.pl, it is likely that a cache has already been created by master_results.pl or master_results_2.pl. So, as long as the exact same flags and parameters for the ms_peptidesummary constructor are used, access will be fast.
Protocol:
1. Determine whether protein_view.pl (or whatever application/script) is to use a cache file as described above.
2. If so, create the ms_peptidesummary object without the singleHit parameter and using the same flags and parameters as in master_results.pl or master_results_2.pl to make sure an existing cache file is used. Make sure that MSPEPSUM_USE_CACHE is specified.