Big databases and large peak lists
When it comes to sequence database searches, how large is too large to handle? This is really two questions in one: a search can be large because the database contains millions of protein sequences, or because the peak list file contains millions of MS/MS spectra. Mascot Server's design imposes no limit on either, which makes it ideal for large searches of both types. You may, however, encounter limits in the web server, which are also discussed below.
Millions of protein sequences
When you’re studying a well-characterised organism like human or yeast, it’s very common to search just the target proteome plus a contaminants database. Any database search engine can handle a proteome or two – from a few thousand to a few tens of thousands of proteins. Things get more interesting when you scale up.
Why and when do you need a sequence database with millions of entries? A typical use case is studying uncommon or unusual organisms that have not yet been sequenced. Bacteria are the obvious case, but so are, for example, marine mammals like the California sea lion. Even today (March 2025), there are just 17 reviewed sea lion sequences in SwissProt, so you have to search either a database of unreviewed sequences or a database of all mammalian proteins.
Here’s a sample of the sizes of publicly available sequence databases whose configuration is shipped with Mascot:
- SwissProt: 0.57M sequences (2025_01)
- Trembl (UniProt TrEMBL): 253M sequences (2024_06)
- NCBIprot (NCBI nr): 707M sequences (Jan 2025)
In terms of what Mascot can do, SwissProt is a small database. A SwissProt search of a tryptic sample from a 1h Orbitrap run will finish in minutes even on modest hardware. At the other end is the largest publicly available protein sequence database, NCBI nr, which is available in Mascot as the NCBIprot predefined definition. Mascot Server 3.x and 2.8 can handle sequence databases up to 4 billion entries, so 707M sequences is no problem.
The limit of 4 billion is very high but not infinite. It’s caused by using a 32-bit integer as an index value in one of the compressed files (2^32 is about 4.3 billion) and may be lifted in a future version if a sequence database ever grows that large. (Versions 2.7 and earlier are limited to about 300-400 million sequences per FASTA file.)
Most of the time, you won’t need to search all of Trembl or all of NCBIprot. Instead, select a suitable taxonomy filter, like mammals or proteobacteria. This can still be millions (mammals) or tens of millions of sequences (proteobacteria).
How Mascot works
Mascot is designed to stream through a protein sequence database in linear order. It achieves this by ‘compressing’ the sequences (the .s00 file) and the metadata (the .a00, .b00 and .t00 files), and memory mapping the compressed files. As the program streams through the files, the operating system pages the data into and out of RAM, which makes the I/O extremely efficient. If the sequence database is larger than the available RAM, the operating system automatically drops unused pages to avoid running out of memory.
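To make the idea concrete, here is a minimal Python sketch of memory-mapped streaming. It is illustrative only – the file name and the fixed-size windows are assumptions for the example, not Mascot’s actual file format or code:

```python
import mmap

def stream_database(path: str, window: int = 1 << 20):
    """Yield successive windows of a file via a read-only memory map."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in range(0, len(mm), window):
            # Only the pages in this window need to be resident; the OS
            # evicts already-processed pages under memory pressure.
            yield mm[offset:offset + window]

# Usage: stream through an arbitrarily large file without holding it in RAM.
# total = sum(len(w) for w in stream_database("SwissProt.s00"))
```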
Mascot digests protein sequences in parallel during the database search, not when the database is compressed. This may seem wasteful when the majority of searches use the same enzyme (trypsin) and the same number of missed cleavages. However, digestion is only a small part of the total search duration, and what you gain is flexibility: you can choose any enzyme at search submission time without having to recompress the sequence database.
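For illustration, here is a sketch of on-the-fly tryptic digestion with missed cleavages, using the standard rule (cleave C-terminal to K or R, except before P). This is a generic implementation with an arbitrary example sequence, not Mascot’s internal code:

```python
import re

def trypsin_digest(protein: str, missed_cleavages: int = 1) -> list[str]:
    """Cleave C-terminal to K or R, except before P (generic sketch)."""
    # Split at every cleavage site; drop the empty string produced when
    # the sequence ends in K or R.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides

print(trypsin_digest("MKWVTFISLLLLFSSAYSR"))
# ['MK', 'MKWVTFISLLLLFSSAYSR', 'WVTFISLLLLFSSAYSR']
```

Because the digest is computed per search, swapping trypsin for another enzyme is just a different splitting rule, with no change to the compressed database files.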
If you go back 25 years, streaming was the only feasible architecture within the hardware constraints of the time, namely 32-bit systems and at most 3GB of RAM. But the design has survived, because it means there is effectively no limit on the size of the sequence database. As long as you have enough disk space, Mascot can handle it – you don’t need a huge amount of RAM.
How long will the search take? Duration scales approximately linearly with the size of the search space: double the number of candidate sequences and you roughly double the search time, although the exact behaviour depends on the processor hardware. Note that it is the search space that matters, not the raw database size: when a taxonomy filter is selected, the search scales with the number of protein sequences actually searched.
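A quick back-of-envelope illustration of the linear scaling – the 5-minute SwissProt baseline here is an assumption for the example, not a benchmark:

```python
def scaled_duration(base_minutes: float, base_seqs: int, new_seqs: int) -> float:
    """Assume search time grows linearly with the number of sequences searched."""
    return base_minutes * new_seqs / base_seqs

# If a SwissProt search (0.57M entries) takes 5 minutes on your hardware,
# a 10M-entry taxonomy slice of Trembl would take very roughly:
print(f"{scaled_duration(5, 570_000, 10_000_000):.0f} min")  # ~88 min
```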
Limits imposed by computer hardware
Although Mascot Server itself imposes no limits, there are a couple of hardware considerations to be aware of.
Make sure you have enough RAM. Ideally, you have more RAM than the size of the compressed sequence database, so the whole database stays memory mapped and search speed is at its highest. This is tricky with NCBIprot but perfectly feasible with “smaller” databases like Trembl (253M sequences, approx. 118GB to fit in RAM). With less RAM, Mascot still works fine, but there is some operating system overhead from disk I/O during the search.
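A rough sizing rule can be derived from the figures above: 118GB for 253M Trembl entries works out to roughly 470 bytes per compressed entry. Treat the per-entry figure as an extrapolation from a single data point, not a specification:

```python
# ~118GB for 253M Trembl entries -> roughly 470 bytes per compressed entry.
BYTES_PER_ENTRY = 118e9 / 253e6  # ~466 bytes; extrapolation, not a spec

def ram_to_cache_gb(n_sequences: int) -> float:
    """RAM needed to keep the whole compressed database resident."""
    return n_sequences * BYTES_PER_ENTRY / 1e9

print(f"{ram_to_cache_gb(707_000_000):.0f} GB")  # NCBIprot: ~330 GB
```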
Make sure you also have a fast processor. Ideally, choose high clock speed over a high core count (16 cores at most), as this ensures high single-thread speed. We have had good experience with AMD Ryzen 9 and Threadripper processors. If the processor isn’t fast enough, bringing a database online may fail with a test search timeout. One solution is to increase the MonitorTestTimeout setting in Mascot options.
Millions of MS/MS scans
The second aspect of ‘large’ is the size of the peak lists. This is the input file uploaded to Mascot Server by the client program (Mascot Daemon, Mascot Distiller, Thermo Proteome Discoverer, etc.).
The primary metric is the number of MS/MS scans. Here again, Mascot Server is designed to handle a file of arbitrary size. At the beginning of the search, Mascot splits the input into chunks containing a fixed number of queries (the SplitNumberOfQueries setting). Each chunk is searched independently and the results are combined at the end. This means the size of the input file doesn’t matter, and the search duration scales linearly with the number of chunks.
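In the same spirit, here is a minimal sketch of splitting an MGF peak list file into fixed-size chunks of queries. It illustrates the idea only; the function and the chunk size are hypothetical, not Mascot’s implementation:

```python
def split_mgf(path: str, queries_per_chunk: int = 10_000):
    """Yield chunks of an MGF file, each holding a fixed number of queries."""
    chunk, queries = [], 0
    with open(path) as f:
        for line in f:
            chunk.append(line)
            if line.strip() == "END IONS":   # one MS/MS query ends here
                queries += 1
                if queries == queries_per_chunk:
                    yield "".join(chunk)
                    chunk, queries = [], 0
    if chunk:
        yield "".join(chunk)  # final, possibly smaller, chunk
```

Each chunk is an independent search problem, which is why the total duration grows linearly with the number of chunks rather than with the file size.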
A ballpark estimate of the typical number of scans in an LC-MS/MS run can be obtained from the instrument’s scan speed and the gradient length. For example, a Thermo Orbitrap Exploris scans at 22Hz, so a 120min gradient yields around 150k scans. In contrast, a Thermo Orbitrap Astral can scan at 200Hz, so with the same gradient length you could acquire around 1.4M scans in a single run.
The second metric is the average size, in bytes, of a single peak list. As long as your peak picking is working and correctly removes noise peaks, an MS/MS spectrum of a real peptide won’t have more than 100-200 peaks. When these are encoded as numbers in a text file, the average size is around 2kB. So, a file of 1 million peak lists shouldn’t be larger than 2GB.
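Putting the two estimates together (scan rate times gradient length, then ~2kB per peak list, both figures from the paragraphs above):

```python
def estimate_run(scan_rate_hz: float, gradient_min: float,
                 kb_per_scan: float = 2.0) -> tuple[float, float]:
    """Return (number of MS/MS scans, peak list file size in GB)."""
    scans = scan_rate_hz * gradient_min * 60
    return scans, scans * kb_per_scan / 1e6

print(estimate_run(22, 120))   # Orbitrap Exploris: (158400, ~0.32 GB)
print(estimate_run(200, 120))  # Orbitrap Astral:  (1440000, ~2.88 GB)
```

Note that a 120min Astral run at these rates already comes out near 2.9GB, which is directly relevant to the upload limits discussed next.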
Limits imposed by the web server
A few practical limits come from the web server rather than from Mascot Server itself.
The most common limit with the web server is the peak list upload size. On Windows, Microsoft IIS is limited to 2GB uploads, so this is the maximum size of the MGF file. It corresponds to about 1 million MS/MS scans. If you’re using Mascot Daemon to submit the search, one solution is to install Daemon on the same PC as Mascot Server, and it will bypass the web server. Another solution is to switch to the Apache web server.
On Linux, the most common web server is Apache httpd. Some Linux distributions apply an upload limit in the Apache configuration. Check the value of LimitRequestBody: if it’s less than 2GB (2147483648 bytes), double or triple it, or set it to 0 (unlimited). Mascot saves the uploaded peak lists in the ‘daily’ directory under mascot/data, so there is no risk of running out of space on the /tmp partition.
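For example, to remove the limit entirely, add or change the directive in the main server configuration or in the virtual host serving Mascot (the exact file location varies by distribution):

```apache
# httpd.conf or the relevant vhost/conf.d file; 0 removes the body size limit
LimitRequestBody 0
```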
Some client programs, like Thermo Proteome Discoverer, are able to split the MGF file and search it in chunks. If you’re not using the machine learning integration, this is a satisfactory solution. However, if you have Mascot Server 3.1 or later and you have configured Mascot to refine the results using machine learning, you should disable any input chunking in Proteome Discoverer, as it will negatively impact the machine learning results.
Keywords: FASTA, PC hardware, sysadmin