Mascot: The trusted reference standard for protein identification by mass spectrometry for 25 years

Reporting quantitation datasets with Mascot Distiller

Mascot Distiller quantitation reports are written in Python, a widely used programming language, and use Mascot Parser for simple access to both the search and quantitation results.

Distiller reports include ANOVA, box plot, various clustering reports and volcano plots, and an option to export protein and peptide quantitation data in a format which can be easily imported into commonly used tools such as R and Perseus. For the full Distiller quantitation help and tutorials, start Distiller and press F1, or open the menu About, Mascot Distiller Help. Please contact us for a 30-day trial.

Running Reports

Reports are run using a wizard interface, where you specify the parameters used to run the report. There are a number of common options on most reports. For example, you will often be given a choice of exporting any graphs in one of the following formats:

  • Scalable Vector Graphics (.svg)
  • Portable Network Graphics (.png)
  • Interactive JavaScript

The .svg and .png outputs are static images, but the interactive JavaScript option uses the Plotly library to enable interactive features such as zooming and tooltips. You can optionally export the graph into the Plotly web application for further editing and annotation, which greatly simplifies the process of creating publication-ready figures.

Several of the statistical reports, such as ANOVA, PCA and hierarchical clustering, require that the proteins used have no missing values. These reports offer the following options for handling proteins with missing data:

  • Delete proteins with missing data
  • Enter a fixed value
  • Impute missing values using K-Nearest Neighbours

If you choose to enter a fixed value or impute values using K-Nearest Neighbours, you’ll be asked to set the maximum number of missing values to replace or impute – any proteins with more missing values than the specified limit will be removed from the report.

These types of reports also allow you to specify a contaminants database. Any protein matches identified from the selected database will be excluded from the report.
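The missing-value options described above can be illustrated with a minimal sketch. This is not Distiller's implementation: the function name `handle_missing`, the accessions and the ratio values are all hypothetical, and the K-Nearest Neighbours step is a simplified version that averages the `k` closest proteins that have no gaps.

```python
import math


def handle_missing(ratios, max_missing, fixed_value=None, k=2):
    """ratios maps a protein accession to a list of ratios (None = missing).

    Hypothetical helper, not part of Mascot Distiller or Mascot Parser.
    """
    # Remove proteins with more missing values than the specified limit.
    kept = {acc: vals for acc, vals in ratios.items()
            if sum(v is None for v in vals) <= max_missing}
    # Proteins without gaps are the candidate neighbours.
    complete = [vals for vals in kept.values() if None not in vals]
    result = {}
    for acc, vals in kept.items():
        if None not in vals:
            result[acc] = list(vals)
        elif fixed_value is not None:
            # Option: enter a fixed value.
            result[acc] = [fixed_value if v is None else v for v in vals]
        else:
            # Option: impute from the k nearest complete proteins.
            observed = [i for i, v in enumerate(vals) if v is not None]

            def distance(other):
                # Euclidean distance over the samples this protein has values for.
                return math.sqrt(sum((vals[i] - other[i]) ** 2 for i in observed))

            neighbours = sorted(complete, key=distance)[:k]
            if not neighbours:
                continue  # nothing to impute from, so the protein is left out
            result[acc] = [
                v if v is not None
                else sum(n[i] for n in neighbours) / len(neighbours)
                for i, v in enumerate(vals)
            ]
    return result


# Made-up example data: three samples per protein.
example = {
    "P1": [1.0, 1.1, 0.9],
    "P2": [2.0, 2.1, 1.9],
    "P3": [1.0, None, 1.0],
    "P4": [None, None, 3.0],
}
imputed = handle_missing(example, max_missing=1, k=1)
filled = handle_missing(example, max_missing=1, fixed_value=0.0)
```

With a limit of one missing value, P4 is removed in both cases, while the single gap in P3 is either taken from its nearest neighbour or set to the fixed value.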

Example LFQ dataset PXD026930

This example is a label-free quantitation dataset, PXD026930, from the PRIDE repository. The dataset is from a publication that looks at the role of alanyl-tRNA synthetases in S. cerevisiae. Aminoacyl-tRNA synthetases are essential enzymes linked with neurological disorders in humans. The publication shows that the mutations have more general effects in yeast on the amino acid control pathway and heat shock response.

The experiment looked at three yeast strains: wild-type and two mutants (C719A and G906D) which have mutations in alanyl-tRNA synthetases.

Cultures were grown at 30°C and sampled, then the temperature was increased to 37°C for 2 hours before sampling again. The wild-type and G906D at 30°C were sampled three times, but C719A and G906D at 37°C were sampled only twice, for a total of 15 analyses.

We reprocessed the raw data using Mascot Distiller, searched the generated peaklists with Mascot Server using the search settings described in the PRIDE project, then carried out quantitation in Mascot Distiller. Ratios were taken for each mutant strain at each temperature against the equivalent wild-type sample, and the median protein ratio calculated. Because the same amount of protein was loaded for each sample, we’d expect the average protein ratio for each sample to be unchanged – that is a value of 1 when the ratio to wild type is calculated – so we enabled normalization using the median ratio. The effect of this can be seen by running the ‘Box plot’ report. Figure 1 below shows the box plots for protein ratios with and without normalization enabled.

Figure 1: Protein ratios plotted as a box plot with protein ratio normalization A) disabled and B) enabled.
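The idea behind median normalization can be shown with a short sketch. It assumes the simplest form, dividing every protein ratio in a sample by that sample's median ratio so that the median becomes 1. The function name and the values are hypothetical, and Distiller's own calculation may differ in detail.

```python
from statistics import median


def normalise_to_median(ratios):
    """Divide each protein ratio by the median ratio; missing values stay None."""
    observed = [r for r in ratios.values() if r is not None]
    centre = median(observed)
    return {acc: (None if r is None else r / centre)
            for acc, r in ratios.items()}


# Made-up mutant/wild-type ratios for one sample, shifted away from 1.
sample_ratios = {"P1": 0.6, "P2": 1.2, "P3": 2.4, "P4": None}
normalised = normalise_to_median(sample_ratios)
normalised_median = median(r for r in normalised.values() if r is not None)
```

After normalization the median ratio of the sample is 1, which is the behaviour you would expect to see in panel B of the box plot.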

Figure 2 below shows the component plot from the PCA report. Contaminant database matches were excluded and K-Nearest Neighbours imputation for up to two missing values was specified when running the report. We can see from the component plot that component 1 clearly separates the C719A samples from the G906D samples. Component 2 separates most of the 30°C samples from the 37°C samples, with the exception of the G906D 30°C replicate 2 sample, which has a positive value for component 2 while all the other 30°C samples have negative values.

Figure 2: Component plot from the PCA analysis. The image has been edited and the different groups and outlier G906D sample have been highlighted after the report was run.

To see a broader picture of proteins which define the different mutants and temperatures, we can try a report like the hierarchical clustering report. Unfortunately, with a dataset of this size, there are so many proteins essentially unchanged between the wild-type and mutant strains and different growth temperatures that you don’t see any meaningful grouping of proteins or samples in the generated dendrograms and heatmap.

Another approach is to carry out an ANOVA test, which allows us to assign the samples to separate groups (in this case, four groups comprising the two different mutants at the two different growth temperatures). This allows us to identify proteins which are significantly different between and within the different groups. The output of the ANOVA can then be used to run a hierarchical clustering report if we select the “ANOVA plus clustering” report. The dendrograms and heatmap produced from this report are shown in Figure 3 below:

Figure 3: Output of ANOVA plus clustering report. Four groups were defined (G906D 30°C, G906D 37°C, C719A 30°C and C719A 37°C) and a significance threshold of 5% selected. Calculated p-values were corrected for multiple testing using the Benjamini-Hochberg procedure. Up to two missing values were imputed using K-Nearest Neighbours.
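The Benjamini-Hochberg correction mentioned in the caption can be sketched in a few lines of plain Python. This is a generic illustration of the procedure with made-up p-values, not the code used by the report; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, smallest first.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease
    # as the raw p-value increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up ANOVA p-values for four proteins.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
```

In this example three of the raw p-values are below 0.05, but only the first protein remains significant at the 5% threshold after correction.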

As you can see from Figure 3, the two mutants are strongly differentiated by several groups of proteins. For example, there are a number of differences in the responses of proteins from various metabolic pathways, such as alcohol dehydrogenase 3, which is strongly upregulated in G906D but strongly downregulated in C719A at both 30 and 37°C compared to wild-type.

Differences between the different growth temperatures for a single mutant type are also present. For example, there are clear differences between C719A grown at 30 and 37°C for proteins such as LHP1_YEAST, IF4F2_YEAST, PTH2_YEAST and CISY1_YEAST.

Distiller ships with another commonly used plot, the volcano plot, which can be used to show significantly up- or downregulated proteins. The paper uses a volcano plot of the results from the C719A mutant grown at 37°C compared to wild-type to highlight 5 proteins from the same pathway which were downregulated in the mutant. Figure 4 shows the equivalent plot generated by Mascot Distiller using the interactive JavaScript option, which adds a tooltip with the protein accession to each data point in the plot. We then selected the option to open the plot in the third-party plot.ly web application to annotate the graph further, showing how you can easily create publication-ready figures using these tools:

Figure 4: Output of the Volcano plot report for one of the C719A mutants compared to wild type grown at 37°C. The plot was then uploaded to the plot.ly web application for further annotation. Five proteins highlighted in the publication have been marked using the tools provided by plot.ly.
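For readers unfamiliar with volcano plots, the sketch below shows how the two axes are usually derived: log2 of the protein ratio on the x axis and -log10 of the p-value on the y axis. The function name, thresholds, accessions and values are all hypothetical and are not taken from the Distiller report.

```python
import math


def volcano_points(proteins, p_threshold=0.05, fold_threshold=2.0):
    """proteins is a list of (accession, ratio, p_value) tuples.

    Returns (accession, log2_ratio, minus_log10_p, status) for each protein.
    """
    min_log2_change = math.log2(fold_threshold)
    points = []
    for accession, ratio, p_value in proteins:
        x = math.log2(ratio)        # x axis: fold change on a log2 scale
        y = -math.log10(p_value)    # y axis: significance
        if p_value <= p_threshold and abs(x) >= min_log2_change:
            status = "up" if x > 0 else "down"
        else:
            status = "unchanged"
        points.append((accession, x, y, status))
    return points


# Made-up mutant/wild-type ratios and p-values.
points = volcano_points([
    ("A", 4.0, 0.001),   # large, significant increase
    ("B", 0.25, 0.01),   # large, significant decrease
    ("C", 1.1, 0.5),     # essentially unchanged
    ("D", 8.0, 0.2),     # large change but not significant
])
statuses = [status for _, _, _, status in points]
```

Proteins classed as "up" or "down" land in the upper right and upper left of the plot respectively, which is where the proteins of interest are usually highlighted.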

Custom reports

You can add custom reports to the system by writing your own Python reports. A report consists of two files: an XML file which defines the wizard displayed in Mascot Distiller, and the Python script itself. Search and quantitation results are accessed using Mascot Parser, which is installed with the embedded copy of Python used by Distiller.

To get started, you’ll need a good working knowledge of the Python programming language, and of our Mascot Parser library. An Integrated Development Environment (IDE) is recommended, such as Visual Studio Code.

Mascot Distiller ships with an embedded version of Python and includes a number of useful additional libraries which can be used to manipulate and format data, create graphs, export in any custom format or even upload data to another system via HTTP. The additional libraries included are:

  • Mascot Parser – Our library for accessing Mascot search and quantitation results
  • statsmodels – A library for conducting statistical tests
  • Plotly – A graphing library which allows for user interaction using JavaScript
  • SciPy – A library of algorithms for scientific computing
  • NumPy – A library for array creation and manipulation
  • pandas – A library for data analysis and manipulation
  • matplotlib – A graphing and visualisation library
  • scikit-learn – A machine learning library
  • seaborn – A data visualisation library based on matplotlib

These libraries should be sufficient for almost any kind of reporting and visualisation you can think of.

To simplify developing a Distiller Python report, we also supply several helper scripts, which you’ll find in the “reports” directory of your Mascot Distiller installation.

  • LoadQuantitation.py – includes methods to load search and quantitation results into msparser and return them, reducing the number of steps required to access your results
  • CreateQuantDataFrames.py – includes methods to load protein and peptide quantitation data into lists and arrays
  • WriteReports.py – includes methods to output logging and progress information from the script back to Mascot Distiller

You can use the Python supplied with Mascot Distiller as your development environment, which will ensure your script has access to the runtime libraries available in the Distiller GUI. To do this, select the Python executable in the “python-3.6.5-embed-win-amd64” directory of your Mascot Distiller installation as the Python environment in your IDE (your IDE should have instructions on how to do this). If you’re developing on a different PC, download and run the Mascot Distiller installer. This will install Mascot Distiller in viewer mode, which includes the embedded Python and will allow you to develop and test your custom reports.

A Distiller report consists of two files: an XML file, which defines the report inputs, and the actual Python script. The XML file can optionally define controls for a wizard interface in the GUI, such as dropdown menus and checkboxes. You can find the schema file which defines the report XML at “C:\ProgramData\Matrix Science\Mascot Distiller\schema\distiller_report_definition_1.xsd” on your Mascot Distiller workstation. The schema file is fully documented, but if you have any questions, please let us know at support@matrixscience.com.

Below is a link to a tutorial which takes you through the steps required to create a Python report and enable it on your local Mascot Distiller installation. The example script described by the tutorial calculates the Average (top-3) protein intensity for all sample components and generates a CSV export file. This report will be included in the next release of Mascot Distiller. In the meantime, you can download the tutorial PDF and report script files:
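Separately from the tutorial files, the general idea of a top-3 calculation can be sketched in plain Python. This is only an illustration under stated assumptions: it takes the mean of the three most intense peptides per protein and sample component, uses fewer when a protein has fewer than three peptides, and reads its input from a hypothetical list of tuples rather than from Mascot Parser. The tutorial script may handle these details differently.

```python
import csv
import io
from collections import defaultdict


def top3_protein_intensity(peptides):
    """peptides is an iterable of (protein, component, peptide_intensity).

    Returns {(protein, component): mean of the up to three highest intensities}.
    """
    grouped = defaultdict(list)
    for protein, component, intensity in peptides:
        grouped[(protein, component)].append(intensity)
    result = {}
    for key, intensities in grouped.items():
        top = sorted(intensities, reverse=True)[:3]
        result[key] = sum(top) / len(top)
    return result


def write_csv(result, handle):
    """Write one row per protein and component to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["protein", "component", "top3_intensity"])
    for (protein, component), value in sorted(result.items()):
        writer.writerow([protein, component, value])


# Made-up peptide intensities for one protein in two sample components.
peptides = [
    ("P1", "WT", 10.0), ("P1", "WT", 20.0), ("P1", "WT", 30.0), ("P1", "WT", 40.0),
    ("P1", "mutant", 6.0), ("P1", "mutant", 3.0),
]
top3 = top3_protein_intensity(peptides)

buffer = io.StringIO()  # stands in for a real CSV file opened with newline=""
write_csv(top3, buffer)
csv_lines = buffer.getvalue().splitlines()
```

In a real report the input tuples would come from the quantitation results loaded through Mascot Parser, and the output would be written to the export file chosen in the wizard.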