Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Posted by Ville Koskinen (August 16, 2021)

Addressing disk bottlenecks

One of the improvements in Mascot Server 2.8 is a reduction in overall search time. We’ve benchmarked the new version with a variety of data sets of different sizes on a range of PCs and servers, and the typical reduction in run time is 20-35%. The improvement is achieved by addressing disk access bottlenecks at the beginning, during, and at the end of the search.

Benchmark search

The timings in the table below are for running a search with 683,905 queries against the human proteome. The PC is a typical 4-core Intel Core i7 system. The search time is the average “wall clock” time from when the search form is submitted to when the results file is complete. Timings were taken with Mascot installed on a traditional consumer-grade hard disk (HDD) and on a mid-grade solid state drive (SSD).

Mascot version | Disk type | Search time (seconds) | Relative to 2.7
2.7            | HDD       | 1435                  | 100%
2.8            | HDD       | 1050                  | 73%
2.7            | SSD       | 1137                  | 100%
2.8            | SSD       | 877                   | 77%

As you can see, the new version runs the search substantially faster, both on HDD and SSD. Although the above benchmark is from a 1-CPU Mascot licence, the improvement also applies to multi-CPU licences and cluster installations.

How it works

When Mascot receives a search input file, the peak lists are parsed, formatted, sorted and split into chunks. The chunks are searched one at a time against the protein sequence database. If Mascot is running in cluster mode, the chunks are distributed among the nodes, and each node searches its set of chunks separately. At the end, the results are merged into a single results (.dat) file.
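The chunked flow described above can be sketched in a few lines of Python. This is purely illustrative: the function names, chunk size, and the stand-in “search” are hypothetical, not Mascot’s actual internals.

```python
def split_into_chunks(queries, chunk_size):
    """Split the sorted peak-list queries into fixed-size chunks."""
    return [queries[i:i + chunk_size] for i in range(0, len(queries), chunk_size)]

def search_chunk(chunk, database):
    """Stand-in for searching one chunk against the sequence database."""
    return [(q, q in database) for q in chunk]

def run_search(queries, database, chunk_size=1000):
    """Parse/sort, split into chunks, search each chunk, merge the results."""
    chunks = split_into_chunks(sorted(queries), chunk_size)
    # In cluster mode, each chunk could be handed to a different node.
    partial_results = [search_chunk(c, database) for c in chunks]
    # Merge per-chunk results into a single result set (the .dat file).
    return [r for part in partial_results for r in part]
```

The merge step is why the end of the search is disk-bound: every node’s partial results funnel back into one output file.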

In previous versions of Mascot, the preprocessing of the search input file was a sequential, singly threaded procedure. So was the merging step at the end of each chunk and at the end of the search. The middle portion of the search – digesting proteins, iterating variable mods, fragmenting peptides, matching, scoring – is embarrassingly parallel. Disk speed is not an important factor during this CPU-intensive stage, where all available CPU cores are used for computing. Whether the singly threaded parts are a bottleneck depends on many factors, such as the relative time spent between the singly and multithreaded parts and the relative speed of the disk compared to the CPU. In general, as the size of data sets keeps increasing, disk access becomes more of a problem, despite advances in solid state disk performance.

A common solution to bottlenecks is to divide the computations among multiple threads and ensure the CPU isn’t sitting idle. The difficulty with multithreading disk access is that adding more threads tends to degrade performance. Different threads will naturally read from different parts of a file, and such random access is much slower than sequential access, even with the latest solid state disks.

The obvious alternative is to have one dedicated thread reading from disk, up to 4 threads formatting the data items, and one thread writing to disk. This ensures data is read and written sequentially at full disk throughput, and it’s the design we chose for Mascot 2.8. The number of processing threads is optimal for most systems, but can be fine-tuned with a mascot.dat option. The embarrassingly parallel portion of the database search continues to use all available CPU cores. The net effect is that overall search time is reduced by 20-35% for most searches.
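The single-reader, multi-formatter, single-writer design can be sketched with standard threads and queues. This is a simplified illustration of the pattern, not Mascot’s implementation; the “formatting” step here is a trivial stand-in, and queue sizes and thread counts are arbitrary.

```python
import queue
import threading

def pipeline(lines, n_format_threads=4):
    """One reader thread -> up to 4 formatter threads -> one writer thread."""
    raw_q = queue.Queue(maxsize=64)
    fmt_q = queue.Queue(maxsize=64)
    out = []

    def reader():
        # A single thread reads the input strictly sequentially.
        for i, line in enumerate(lines):
            raw_q.put((i, line))
        for _ in range(n_format_threads):
            raw_q.put(None)  # one shutdown marker per formatter

    def formatter():
        while True:
            item = raw_q.get()
            if item is None:
                fmt_q.put(None)
                break
            i, line = item
            fmt_q.put((i, line.strip().upper()))  # stand-in formatting work

    def writer():
        done, buf = 0, []
        while done < n_format_threads:
            item = fmt_q.get()
            if item is None:
                done += 1
            else:
                buf.append(item)
        # Restore input order, then "write" sequentially in one place.
        out.extend(line for _, line in sorted(buf))

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=formatter) for _ in range(n_format_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

Because only the middle stage is parallel, the disk sees exactly one sequential reader and one sequential writer, which is the point of the design.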

Should I switch to SSDs?

You may also have noticed the difference in the tabulated search timings between HDD and SSD. If your PC still has a traditional hard disk, it’s worth installing Mascot on an SSD to see if it makes a difference. The key is to keep the Mascot program files and the ‘data’ directory on the SSD, as this is where the disk-intensive files are stored. Sequence databases can be stored on the larger, cheaper HDD with no performance impact. The databases are memory mapped, so as long as there is enough RAM, all the frequently accessed database files are cached in memory. Whether you have an HDD or an SSD, updating to the new version should reduce search times.
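Memory mapping is the mechanism that lets the sequence databases live on a slow disk without hurting search speed: the operating system’s page cache keeps frequently touched pages in RAM. A minimal sketch using Python’s standard mmap module (the file name and FASTA contents are made up for illustration):

```python
import mmap
import os
import tempfile

def read_mapped(path):
    """Read a file through a read-only memory map instead of read() calls."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing the map pulls pages through the OS page cache; on a
            # repeat access, hot pages are served from RAM, not the disk.
            return bytes(mm[:])

# Demonstrate with a throwaway FASTA-style file.
fd, path = tempfile.mkstemp()
os.write(fd, b">sp|EXAMPLE\nMKTAYIAKQR\n")
os.close(fd)
contents = read_mapped(path)
os.remove(path)
```

The same principle applies regardless of which disk holds the file, which is why the databases can stay on the HDD as long as RAM is plentiful.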
