TopPIC Suite

1 Overview

To run TopPIC Suite, a computer with at least 4 GB of memory and a 64-bit Linux or Windows operating system is required. TopFD, TopIndex, TopPIC, TopMG, and TopDiff provide a command line interface for both Linux and Windows users as well as a graphical user interface (GUI) for Windows users. Please see our tutorials for more details of the GUI.

2 TopFD

TopFD (Top-down mass spectral Feature Detection) is a software tool for top-down spectral deconvolution, which groups top-down mass spectral peaks into isotopic envelopes and converts isotopic envelopes to monoisotopic neutral masses. In addition, it extracts proteoform features from MS1 spectra.

2.1 Input

The input of TopFD is mass spectrometry data files in the mzML format. Raw mass spectral data generated from various mass spectrometers can be converted to mzML files using msconvert.

2.2 Output

TopFD outputs two LC-MS feature text files with a file extension "feature", one LC-MS feature file with a file extension "xml", and two deconvoluted mass spectral data files in the msalign format with a file extension "msalign", which is similar to the MGF file format. In addition, TopFD creates a folder containing JavaScript files for spectral visualization.

For example, when the input file name is spectra.mzML, the output includes:

  • spectra_ms1.feature: a feature file containing LC-MS features.
  • spectra_ms2.feature: a feature file containing MS/MS scan IDs and their corresponding LC-MS feature IDs.
  • spectra_feature.xml: a feature file containing LC-MS features in the xml format.
  • spectra_ms1.msalign: a list of deconvoluted MS1 spectra.
  • spectra_ms2.msalign: a list of deconvoluted MS/MS spectra.
  • spectra_html: a folder containing JavaScript files for MS1 and MS/MS spectral visualization.
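The naming pattern above can be captured in a few lines of Python, which is convenient when a script needs to check that a TopFD run has produced everything it should. This is only a sketch: the helper name topfd_outputs is hypothetical and not part of TopPIC Suite, and it assumes the input name ends with a single extension such as .mzML.

```python
from pathlib import Path


def topfd_outputs(mzml_name):
    """List the output names described above for one TopFD input file."""
    # Hypothetical helper: strip the extension, "spectra.mzML" -> "spectra".
    base = Path(mzml_name).stem
    return [
        f"{base}_ms1.feature",   # LC-MS features
        f"{base}_ms2.feature",   # MS/MS scan IDs and their LC-MS feature IDs
        f"{base}_feature.xml",   # LC-MS features in the xml format
        f"{base}_ms1.msalign",   # deconvoluted MS1 spectra
        f"{base}_ms2.msalign",   # deconvoluted MS/MS spectra
        f"{base}_html",          # folder for spectral visualization
    ]


for name in topfd_outputs("spectra.mzML"):
    print(name)
```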

2.3 Command line usage

To run TopFD, open a terminal window and run the following command.

topfd [options] spectrum-file-names

Options

-h [ --help ]

Print the help message.

-a [ --activation ] <CID|ETD|HCD|MPD|UVPD|FILE>

Set the fragmentation method(s) of MS/MS spectra. When "FILE" is selected, the fragmentation methods of spectra are given in the input spectrum data file. Default value: FILE.

-c [ --max-charge ] <a positive integer>

Set the maximum charge state of precursor and fragment ions. The default value is 30.

-m [ --max-mass ] <a positive number>

Set the maximum monoisotopic mass of precursor and fragment ions. The default value is 70,000 Dalton.

-e [ --mz-error ] <a positive number>

Set the error tolerance of m/z values of spectral peaks. The default value is 0.02 m/z.

-r [ --ms-one-sn-ratio ] <a positive number>

Set the signal/noise ratio for MS1 spectra. The default value is 3.

-s [ --ms-two-sn-ratio ] <a positive number>

Set the signal/noise ratio for MS/MS spectra. The default value is 1.

-o [ --missing-level-one ]

Specify that the input file does not contain MS1 spectra.

-n [ --msdeconv ]

Use the MS-Deconv score (see paper) to rank isotopic envelopes. If -n is not selected, the default EnvCNN score (see paper) is used to rank isotopic envelopes.

-w [ --precursor-window ] <a positive number>

Set the precursor isolation window size. The default value is 3.0 m/z. When the input file contains the information of precursor windows, the parameter will be ignored.

-t [ --ecscore-cutoff ] <a positive number in [0, 1]>

Set the ECScore cutoff value for proteoform features. Default value is 0.5.

-b [ --min-scan-number ] <1|2|3>

The minimum number of MS1 scans in which a proteoform feature is detected. The default value is 3.

-i [ --single-scan-noise ]

Use the peak intensity noise levels in single MS1 scans to filter out low intensity peaks in proteoform feature detection. The default method is to use the peak intensity noise level of the whole LC-MS map to filter out low intensity peaks.

-f [ --additional-feature-search ]

Perform an additional feature search for MS/MS scans that do not have detected proteoform features in their precursor isolation windows. In the additional search, the signal/noise ratio is set to 0, the minimum scan number is set to 1, and the ECScore cutoff is set to 0.

-d [ --disable-final-filtering ]

Skip the final filtering of isotopic envelopes in MS/MS spectra.

-u [ --thread-number ] <a positive integer>

Number of CPU threads used in spectral deconvolution. Default value: 1.

-g [ --skip-html-folder ]

Skip the generation of HTML files for visualization.

Examples

Deconvolute a centroid data file spectra.mzML and output five files: spectra_ms1.feature, spectra_ms2.feature, spectra_feature.xml, spectra_ms1.msalign, and spectra_ms2.msalign.

topfd spectra.mzML

Deconvolute two centroid data files spectra1.mzML and spectra2.mzML and output five files for each input data file.

topfd spectra1.mzML spectra2.mzML

Deconvolute all centroid data files in the current folder.

topfd *.mzML

Deconvolute a centroid data file spectra.mzML, skipping both the final filtering and the generation of the HTML folder for visualization.

topfd -d -g spectra.mzML

Deconvolute a centroid data file spectra.mzML using 4 CPU threads and the MS-Deconv score.

topfd -u 4 -n spectra.mzML

Deconvolute a centroid data file spectra.mzML that does not contain MS1 spectra.

topfd -o spectra.mzML

Deconvolute a centroid data file spectra.mzML. In proteoform feature identification, each proteoform feature is required to be detected in at least one MS1 scan and the ECScore cutoff is set to 0.2. These settings will increase the number of reported proteoform features.

topfd -t 0.2 -b 1 spectra.mzML
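To see why lowering both thresholds reports more features, the toy filter below applies an ECScore cutoff and a minimum MS1 scan count to a few made-up features. It is an illustration only: the function keep_feature, the sample values, and the use of "greater than or equal to" for both comparisons are assumptions, not TopFD's actual implementation.

```python
def keep_feature(ecscore, scan_count, ecscore_cutoff=0.5, min_scan_number=3):
    """Toy filter: keep a feature that meets both thresholds (assumed >=)."""
    return ecscore >= ecscore_cutoff and scan_count >= min_scan_number


# Made-up (ECScore, number of MS1 scans) pairs, for illustration only.
features = [(0.9, 5), (0.6, 2), (0.3, 4), (0.25, 1)]

# Defaults from the option list: -t 0.5 and -b 3.
default_kept = [f for f in features if keep_feature(*f)]

# Relaxed settings from the example above: -t 0.2 and -b 1.
relaxed_kept = [
    f for f in features
    if keep_feature(*f, ecscore_cutoff=0.2, min_scan_number=1)
]

print("default settings keep", len(default_kept), "features")
print("relaxed settings keep", len(relaxed_kept), "features")
```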

Deconvolute a centroid data file spectra.mzML with a signal/noise ratio 2 for MS1 spectra.

topfd -r 2 spectra.mzML

Deconvolute a centroid data file spectra.mzML with the following settings: the maximum charge state: 50, the maximum mass: 30,000 Dalton, and the signal/noise ratio for MS/MS spectra: 2.

topfd -c 50 -m 30000 -s 2 spectra.mzML

3 TopIndex

3.1 Input

The input is a protein sequence database file in the FASTA format.

3.2 Output

The output is a folder containing protein sequence index files. For example, when the input file name is proteins.fasta, the output folder is proteins.fasta_idx.

3.3 Command line usage

To run TopIndex, open a terminal window and run the following command.

topindex [options] database-file-name

Options

-h [ --help ]

Print the help message.

-f [ --fixed-mod ] <C57|C58|a fixed modification file>

Set fixed modifications. Three available options: C57, C58, or the name of a text file specifying fixed modifications (see an example file). When C57 is selected, carbamidomethylation on cysteine is the only fixed modification. When C58 is selected, carboxymethylation on cysteine is the only fixed modification.

-n [ --n-terminal-form ] <a list of allowed N-terminal forms>

Set N-terminal forms of proteins. Four N-terminal forms can be selected: NONE, NME, NME_ACETYLATION, and M_ACETYLATION. NONE stands for no modifications, NME for N-terminal methionine excision, NME_ACETYLATION for N-terminal acetylation after the initiator methionine is removed, and M_ACETYLATION for N-terminal methionine acetylation. When multiple forms are allowed, they are separated by commas. Default value: NONE,M_ACETYLATION,NME,NME_ACETYLATION.

-d [ --decoy ]

Use a shuffled decoy protein database to estimate spectrum and proteoform-level FDRs. When -d is chosen, a shuffled decoy database is automatically generated and appended to the target database. Index files for the concatenated database are generated.

-e [ --mass-error-tolerance ] <a positive integer>

Set the error tolerance for precursor and fragment masses in ppm. Default value: 10.

-u [ --thread-number ] <a positive integer>

Set the number of threads used in the computation. Default value: 1. About 0.5 GB memory is required for each CPU thread.

Examples

Generate index files for a protein database file proteins.fasta using default parameters.

topindex proteins.fasta

Generate index files for a protein database file proteins.fasta using carbamidomethylation as the fixed modification, and N-terminal methionine excision and N-terminal methionine acetylation as the N-terminal forms.

topindex -f C57 -n NME,M_ACETYLATION proteins.fasta
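Because the -n value is a comma-separated list drawn from four fixed names, a wrapper script can catch a misspelled form before launching a long run. The following is a minimal sketch; parse_n_terminal_forms is a hypothetical helper, and rejecting an empty list is an assumption rather than documented behavior.

```python
# The four form names listed under the -n option.
ALLOWED_FORMS = ("NONE", "NME", "NME_ACETYLATION", "M_ACETYLATION")


def parse_n_terminal_forms(value):
    """Split a comma-separated -n value and reject unknown form names."""
    forms = [item.strip() for item in value.split(",") if item.strip()]
    if not forms:
        # Assumption: at least one form must be given.
        raise ValueError("at least one N-terminal form is required")
    unknown = [form for form in forms if form not in ALLOWED_FORMS]
    if unknown:
        raise ValueError("unknown N-terminal form(s): " + ", ".join(unknown))
    return forms


print(parse_n_terminal_forms("NME,M_ACETYLATION"))
```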

Generate index files for a protein database file proteins.fasta using a target-decoy concatenated database, a mass error tolerance of 5 ppm, and 4 CPU threads.

topindex -d -e 5 -u 4 proteins.fasta

4 TopPIC

4.1 Input

  • A protein database file in the FASTA format
  • A mass spectrum data file in the msalign format
  • A text file containing LC-MS feature information (optional)
  • A text file of fixed PTMs (optional)
  • A text file of variable PTMs (optional)
  • A text file of PTMs for the characterization of unexpected mass shifts (optional)

4.2 Output

TopPIC outputs four tab separated value (TSV) files, two XML files, and a collection of HTML files for identified proteoforms. For example, when the input data file is spectra_ms2.msalign, the output includes:

  • spectra_ms2_toppic_prsm.tsv: a TSV file containing identified proteoform spectrum-matches (PrSMs) with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • spectra_ms2_toppic_prsm_single.tsv: a TSV file containing identified proteoform spectrum-matches (PrSMs) with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • spectra_ms2_toppic_proteoform.tsv: a TSV file containing identified proteoforms with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • spectra_ms2_toppic_proteoform_single.tsv: a TSV file containing identified proteoforms with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • spectra_ms2_toppic_proteoform.xml: an XML file containing identified proteoforms with the E-value or proteoform-level FDR cutoff.
  • spectra_ms2_toppic_prsm.xml: an XML file containing all identified PrSMs without clustering and filtering.
  • spectra_html/toppic_prsm_cutoff: a folder containing JavaScript files of identified PrSMs using the E-value or spectrum-level FDR cutoff.
  • spectra_html/toppic_proteoform_cutoff: a folder containing JavaScript files of identified PrSMs using the E-value or proteoform-level FDR cutoff.
  • spectra_html/topmsv: a folder containing HTML files for the visualization of identified PrSMs.

To browse identified proteins, proteoforms, and PrSMs, open the file spectra_html/topmsv/index.html in a web browser. Google Chrome is recommended (Firefox and Edge are not recommended).

When the input contains two or more data files, TopPIC outputs four TSV files, two XML files, and a collection of HTML files for each input file. When a file name is specified for combined identifications, it combines spectra and proteoforms identified from all the input files, removes redundant proteoform identifications, and reports four TSV files, two XML files, and a collection of HTML files for the combined results. For example, when the input is spectra1_ms2.msalign and spectra2_ms2.msalign and the combined output file name is "combined," the output files are:

  • combined_ms2_toppic_prsm.tsv: a TSV file containing PrSMs identified from all the input files with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • combined_ms2_toppic_prsm_single.tsv: a TSV file containing PrSMs identified from all the input files with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • combined_ms2_toppic_proteoform.tsv: a TSV file containing proteoforms identified from all the input files with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • combined_ms2_toppic_proteoform_single.tsv: a TSV file containing proteoforms identified from all the input files with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • combined_ms2_toppic_proteoform.xml: an XML file containing proteoforms identified from all the input files with the E-value or proteoform-level FDR cutoff.
  • combined_ms2_toppic_prsm.xml: an XML file containing all identified PrSMs without clustering and filtering.
  • combined_html/toppic_prsm_cutoff: a folder containing JavaScript files of PrSMs identified from all the input files using the E-value or spectrum-level FDR cutoff.
  • combined_html/toppic_proteoform_cutoff: a folder containing JavaScript files of PrSMs identified from all the input files using the E-value or proteoform-level FDR cutoff.
  • combined_html/topmsv: a folder containing HTML files for the visualization of identified PrSMs.
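The two lists above follow one naming rule, which the Python sketch below reproduces for both a single spectrum file and a combined run. The helper names toppic_outputs and toppic_combined_outputs are hypothetical; the only behavior assumed is what the lists show, namely that the HTML folder name drops the "_ms2" suffix.

```python
from pathlib import Path


def toppic_outputs(msalign_name):
    """Return the TopPIC output names listed above for one spectrum file."""
    base = Path(msalign_name).stem  # "spectra_ms2.msalign" -> "spectra_ms2"
    # The HTML folder uses the name without "_ms2": "spectra_html".
    html_base = base[:-len("_ms2")] if base.endswith("_ms2") else base
    files = [
        f"{base}_toppic_prsm.tsv",
        f"{base}_toppic_prsm_single.tsv",
        f"{base}_toppic_proteoform.tsv",
        f"{base}_toppic_proteoform_single.tsv",
        f"{base}_toppic_proteoform.xml",
        f"{base}_toppic_prsm.xml",
    ]
    folders = [
        f"{html_base}_html/toppic_prsm_cutoff",
        f"{html_base}_html/toppic_proteoform_cutoff",
        f"{html_base}_html/topmsv",
    ]
    return files + folders


def toppic_combined_outputs(combined_name):
    """Combined results follow the same pattern, e.g. combined_ms2_..."""
    return toppic_outputs(f"{combined_name}_ms2.msalign")


for name in toppic_combined_outputs("combined"):
    print(name)
```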

4.3 Command line usage

To run TopPIC, open a terminal window and run the following command.

toppic [options] database-file-name spectrum-file-names

Options

-h [ --help ]

Print the help message.

-a [ --activation ] <CID|HCD|ETD|UVPD|FILE>

Set the fragmentation method(s) of MS/MS spectra. When "FILE" is selected, the fragmentation methods of spectra are given in the input spectrum data file. Default value: FILE.

-f [ --fixed-mod ] <C57|C58|a fixed modification file>

Set fixed modifications. Three available options: C57, C58, or the name of a text file containing the information of fixed modifications (see an example file). When C57 is selected, carbamidomethylation on cysteine is the only fixed modification. When C58 is selected, carboxymethylation on cysteine is the only fixed modification.

-n [ --n-terminal-form ] <a list of allowed N-terminal forms>

Set N-terminal forms of proteins. Four N-terminal forms can be selected: NONE, NME, NME_ACETYLATION, and M_ACETYLATION. NONE stands for no modifications, NME for N-terminal methionine excision, NME_ACETYLATION for N-terminal acetylation after the initiator methionine is removed, and M_ACETYLATION for N-terminal methionine acetylation. When multiple forms are allowed, they are separated by commas. Default value: NONE,M_ACETYLATION,NME,NME_ACETYLATION.

-s [ --num-shift ] <0|1|2>

The maximum number of unexpected mass shifts in a PrSM. Default value: 1.

-m [ --min-shift ] <a number>

The minimum value for unexpected mass shifts (in Dalton). Default value: -500 Dalton.

-M [ --max-shift ] <a number>

The maximum value for unexpected mass shifts (in Dalton). Default value: 500 Dalton.

-S [ --variable-ptm-num ] <a number>

The maximum number of variable PTM sites in a PrSM. Default value: 3.

-b [ --variable-ptm-file-name ] <a variable PTM file>

Specify a text file containing the information of variable PTMs (see an example variable PTM file).

-d [ --decoy ]

Use a shuffled decoy protein database to estimate spectrum and proteoform-level FDRs. When -d is chosen, a shuffled decoy database is automatically generated and appended to the target database before database search, and FDRs are estimated using the target-decoy approach.

-e [ --mass-error-tolerance ] <a positive integer>

Set the error tolerance for precursor and fragment masses in parts-per-million (ppm). Default value: 10.

-p [ --proteoform-error-tolerance ] <a positive number>

Set the error tolerance for identifying PrSM clusters (in Dalton). Default value: 1.2 Dalton.

-t [ --spectrum-cutoff-type ] <EVALUE|FDR>

Set the spectrum-level cutoff type for filtering PrSMs. Default value: EVALUE.

-v [ --spectrum-cutoff-value ] <a positive number>

Set the spectrum-level cutoff value for filtering PrSMs. Default value: 0.01.

-T [ --proteoform-cutoff-type ] <EVALUE|FDR>

Set the proteoform-level cutoff type for filtering proteoforms and PrSMs. Default value: EVALUE.

-V [ --proteoform-cutoff-value ] <a positive number>

Set the proteoform-level cutoff value for filtering proteoforms and PrSMs. Default value: 0.01.

-A [ --approximate-spectra ]

Use approximate spectra to increase the sensitivity in protein filtering (see this paper for details).

-l [ --lookup-table ]

Use a lookup table method for computing E-values. It is faster than the default generating function approach, but it may reduce the number of identifications.

-B [ --local-ptm-file-name ] <a common modification file>

Specify a text file containing a list of common PTMs for proteoform characterization. The PTMs are used to identify and localize PTMs in reported PrSMs with unknown mass shifts. See an example file.

-H [ --miscore-threshold ] <a number between 0 and 1>

Set the MIScore threshold (see paper) for filtering results of PTM characterization. Default value: 0.15.

-u [ --thread-number ] <a positive number>

Set the number of threads used in the computation. Default value: 1.

-r [ --num-combined-spectra ] <a positive integer>

Set the number of combined spectra. The parameter is set to 2 (or 3) for combining spectral pairs (or triplets) generated by the alternating fragmentation mode. Default value: 1.

-c [ --combined-file-name ] <a filename>

Specify an output file name for combined identifications when the input consists of multiple data files.

-x [ --no-topfd-feature ]

Specify that there are no TopFD feature files for proteoform identification.

-k [ --keep-temp-files ]

Keep intermediate files.

-K [ --keep-decoy-ids ]

Keep decoy identifications.

-g [ --skip-html-folder ]

Skip the generation of HTML files for visualization.

Examples

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file spectra_ms2.feature (reported by TopFD). The user does not need to specify the feature file name. TopPIC will automatically obtain the feature file name from the spectrum file name spectra_ms2.msalign.

toppic proteins.fasta spectra_ms2.msalign

Search two deconvoluted MS/MS spectrum files spectra1_ms2.msalign and spectra2_ms2.msalign against a protein database file proteins.fasta with feature files. In addition, all identifications are combined and reported using a file name "combined."

toppic -c combined proteins.fasta spectra1_ms2.msalign spectra2_ms2.msalign

Search all deconvoluted MS/MS spectrum files in the current folder against a protein database file proteins.fasta with feature files.

toppic proteins.fasta *_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta without feature files.

toppic -x proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file and a fixed modification: carbamidomethylation on cysteine.

toppic -f C57 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. In an identified proteoform, at most 2 mass shifts are allowed and the maximum allowed mass shift value is 10,000 Dalton.

toppic -s 2 -M 10000 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. Two variable PTMs: oxidation on M and methylation on K are used. The modification file two_var_mods.txt can be found here.

toppic -b two_var_mods.txt proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. The error tolerance for precursor and fragment masses is 5 ppm.

toppic -e 5 proteins.fasta spectra_ms2.msalign
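A ppm tolerance is relative to the mass, so the absolute window in Dalton grows with the mass: for a 10,000 Dalton proteoform, 10 ppm corresponds to about 0.1 Dalton and 5 ppm to about 0.05 Dalton. The sketch below shows the arithmetic. The function names are hypothetical, and measuring the tolerance against the theoretical mass is an assumption made for illustration, not a statement about how TopPIC computes it.

```python
def ppm_to_dalton(mass_da, ppm):
    """Convert a relative tolerance in ppm to an absolute window in Dalton."""
    return mass_da * ppm / 1_000_000


def within_tolerance(observed_da, theoretical_da, ppm):
    """True if the observed mass is within +/- ppm of the theoretical mass.

    Assumption: the window is computed from the theoretical mass.
    """
    window = ppm_to_dalton(theoretical_da, ppm)
    return abs(observed_da - theoretical_da) <= window


for ppm in (5, 10):
    print(ppm, "ppm at 10000 Dalton =", ppm_to_dalton(10000.0, ppm), "Dalton")
```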

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. Use the target-decoy approach to compute spectrum-level and proteoform-level FDRs, filter identified proteoform spectrum-matches by a 5% spectrum-level FDR, and filter identified proteoforms by a 5% proteoform-level FDR.

toppic -d -t FDR -v 0.05 -T FDR -V 0.05 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign with alternating CID, HCD, and ETD spectra against a protein database file proteins.fasta with a feature file. Combine alternating CID, HCD, and ETD spectra to increase proteoform coverage.

toppic -r 3 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. After proteoforms with unexpected mass shifts are identified, TopPIC matches the mass shifts to four common PTMs: acetylation, phosphorylation, oxidation and methylation, and uses an MIScore cutoff 0.1 to filter reported PTM sites. The modification file common_mods.txt can be found here.

toppic -B common_mods.txt -H 0.1 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. Use 6 CPU threads to speed up the computation.

toppic -u 6 proteins.fasta spectra_ms2.msalign
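The TopFD and TopPIC examples chain naturally: TopFD turns spectra.mzML into spectra_ms2.msalign, which TopPIC then searches. The following Python sketch builds the two command lines using only options documented above (-u). The helper pipeline_commands is hypothetical, and the sketch assumes the mzML file sits in the current folder so that the msalign file can be referred to by name alone. It prints the commands rather than running them.

```python
import shlex
from pathlib import Path


def pipeline_commands(mzml_name, fasta_name, threads=1):
    """Build the topfd and toppic command lines for one mzML file."""
    base = Path(mzml_name).stem             # "spectra.mzML" -> "spectra"
    msalign_name = f"{base}_ms2.msalign"    # MS/MS output written by TopFD
    return [
        ["topfd", "-u", str(threads), mzml_name],
        ["toppic", "-u", str(threads), fasta_name, msalign_name],
    ]


for command in pipeline_commands("spectra.mzML", "proteins.fasta", threads=4):
    # Printed only. Where the tools are installed and on the PATH, each list
    # could instead be passed to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting problems with file names that contain spaces.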

5 TopMG

5.1 Input

  • A protein database file in the FASTA format
  • A mass spectrum data file in the msalign format
  • A text file of variable PTMs
  • A text file of fixed PTMs (optional)
  • A text file containing LC-MS feature information (optional)

5.2 Output

TopMG outputs four TSV files, two XML files, and a collection of HTML files for identified proteoforms. For example, when the input mass spectrum data file is spectra_ms2.msalign, the output includes:

  • spectra_ms2_topmg_prsm.tsv: a TSV file containing identified PrSMs with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • spectra_ms2_topmg_prsm_single.tsv: a TSV file containing identified PrSMs with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • spectra_ms2_topmg_proteoform.tsv: a TSV file containing identified proteoforms with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • spectra_ms2_topmg_proteoform_single.tsv: a TSV file containing identified proteoforms with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • spectra_ms2_topmg_proteoform.xml: an XML file containing identified proteoforms with the E-value or proteoform-level FDR cutoff.
  • spectra_ms2_topmg_prsm.xml: an XML file containing all identified PrSMs without clustering and filtering.
  • spectra_html/topmg_prsm_cutoff: a folder containing JavaScript files of identified PrSMs using the E-value or spectrum-level FDR cutoff.
  • spectra_html/topmg_proteoform_cutoff: a folder containing JavaScript files of identified PrSMs using the E-value or proteoform-level FDR cutoff.
  • spectra_html/topmsv: a folder containing HTML files for the visualization of identified PrSMs.

To browse identified proteins, proteoforms, and PrSMs, open the file spectra_html/topmsv/index.html in a web browser. Google Chrome is recommended (Firefox and Edge are not recommended).

When the input contains two or more spectrum files, TopMG outputs four TSV files, two XML files, and a collection of HTML files for each input file. When a file name is specified for combined identifications, it combines spectra and proteoforms identified from all the input files, removes redundant proteoform identifications, and reports four TSV files, two XML files, and a collection of HTML files for the combined results. For example, when the input is spectra1_ms2.msalign and spectra2_ms2.msalign and the combined output file name is "combined," the output files are:

  • combined_ms2_topmg_prsm.tsv: a TSV file containing PrSMs identified from all the input files with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • combined_ms2_topmg_prsm_single.tsv: a TSV file containing PrSMs identified from all the input files with an E-value or spectrum-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • combined_ms2_topmg_proteoform.tsv: a TSV file containing proteoforms identified from all the input files with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, all the proteins are reported.
  • combined_ms2_topmg_proteoform_single.tsv: a TSV file containing proteoforms identified from all the input files with an E-value or proteoform-level FDR cutoff. When an identified proteoform is shared by multiple proteins, only one protein is reported.
  • combined_ms2_topmg_proteoform.xml: an XML file containing proteoforms identified from all the input files with the E-value or proteoform-level FDR cutoff.
  • combined_ms2_topmg_prsm.xml: an XML file containing all identified PrSMs without clustering and filtering.
  • combined_html/topmg_prsm_cutoff: a folder containing JavaScript files of PrSMs identified from all the input files using the E-value or spectrum-level FDR cutoff.
  • combined_html/topmg_proteoform_cutoff: a folder containing JavaScript files of PrSMs identified from all the input files using the E-value or proteoform-level FDR cutoff.
  • combined_html/topmsv: a folder containing HTML files for the visualization of identified PrSMs.

5.3 Command line usage

To run TopMG, open a terminal window and run the following command.

topmg [options] database-file-name spectrum-file-names

Options

-h [ --help ]

Print the help message.

-a [ --activation ] <CID|HCD|ETD|UVPD|FILE>

Fragmentation method of MS/MS spectra. When FILE is used, fragmentation methods of spectra are given in the input spectral data file. Default value: FILE.

-f [ --fixed-mod ] <C57|C58|a fixed modification file>

Set fixed modifications. Three available options: C57, C58, or the name of a text file specifying fixed modifications (see an example file). When C57 is selected, carbamidomethylation on cysteine is the only fixed modification. When C58 is selected, carboxymethylation on cysteine is the only fixed modification.

-n [ --n-terminal-form ] <a list of allowed N-terminal forms>

Set N-terminal forms of proteins. Four N-terminal forms can be selected: NONE, NME, NME_ACETYLATION, and M_ACETYLATION. NONE stands for no modifications, NME for N-terminal methionine excision, NME_ACETYLATION for N-terminal acetylation after the initiator methionine is removed, and M_ACETYLATION for N-terminal methionine acetylation. When multiple forms are allowed, they are separated by commas. Default value: NONE,M_ACETYLATION,NME,NME_ACETYLATION.

-d [ --decoy ]

Use a shuffled decoy protein database to estimate spectrum and proteoform-level FDRs. When -d is chosen, a shuffled decoy database is automatically generated and appended to the target database before database search, and FDRs are estimated using the target-decoy approach.

-e [ --mass-error-tolerance ] <a positive integer>

Set the error tolerance for precursor and fragment masses in ppm. Default value: 10 ppm.

-p [ --proteoform-error-tolerance ] <a positive number>

Set the error tolerance for identifying PrSM clusters (in Dalton). Default value: 1.2 Dalton.

-M [ --max-shift ] <a number>

Set the maximum absolute value for unexpected mass shifts (in Dalton). Default value: 500 Dalton.

-t [ --spectrum-cutoff-type ] <EVALUE|FDR>

Set the spectrum level cutoff type for filtering PrSMs. Default value: EVALUE.

-v [ --spectrum-cutoff-value ] <a positive number>

Set the spectrum level cutoff value for filtering PrSMs. Default value: 0.01.

-T [ --proteoform-cutoff-type ] <EVALUE|FDR>

Set the proteoform level cutoff type for filtering proteoforms and PrSMs. Default value: EVALUE.

-V [ --proteoform-cutoff-value ] <a positive number>

Set the proteoform level cutoff value for filtering proteoforms and PrSMs. Default value: 0.01.

-i [ --mod-file-name ] <a modification file>

Specify a text file of variable PTMs. See an example file.

-u [ --thread-number ] <a positive number>

Set the number of threads used in the computation. Default value: 1.

-x [ --no-topfd-feature ]

Specify that there are no TopFD feature files for proteoform identification.

-D [ --use-asf-diagonal ]

Use the ASF-DIAGONAL method for protein sequence filtering. The default filtering method is ASF-RESTRICT. When -D is selected, both ASF-RESTRICT and ASF-DIAGONAL will be used. The combined approach may identify more PrSMs, but it is much slower than using ASF-RESTRICT only. See this paper for more details.

-P [ --var-ptm ] <a positive number>

Set the maximum number of variable PTM sites in a proteoform. Default value: 5.

-s [ --num-shift ] <0|1|2>

Set the maximum number of unexpected mass shifts in a proteoform. Default value: 0.

-w [ --whole-protein-only ]

Report only proteoforms of whole protein sequences.

-c [ --combined-file-name ] <a filename>

Specify an output file name for combined identifications when the input consists of multiple spectrum files.

-k [ --keep ]

Keep intermediate files.

-K [ --keep-decoy-ids ]

Keep decoy identifications.

-g [ --skip-html-folder ]

Skip the generation of HTML files for visualization.

Advanced options

-j [ --proteo-graph-dis ] <a positive number>

Set the length of the largest gap in constructing proteoform graphs. Default value: 40. See this paper for more details.

-G [ --var-ptm-in-gap ] <a positive number>

Set the maximum number of variable PTM sites in a gap in a proteoform graph. Default value: 5. See this paper for more details.

Examples

To use the following examples, the current folder needs to contain a variable modification file variable_mods.txt. (See an example.)

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file spectra_ms2.feature. The user does not need to specify the feature file name. TopMG will automatically obtain the feature file name from the spectrum file name spectra_ms2.msalign.

topmg -i variable_mods.txt proteins.fasta spectra_ms2.msalign

Search two deconvoluted MS/MS spectrum files spectra1_ms2.msalign and spectra2_ms2.msalign against a protein database file proteins.fasta with feature files. In addition, all identifications are combined and reported using a file name "combined."

topmg -i variable_mods.txt -c combined proteins.fasta spectra1_ms2.msalign spectra2_ms2.msalign

Search all deconvoluted MS/MS spectrum files in the current folder against a protein database file proteins.fasta with feature files.

topmg -i variable_mods.txt proteins.fasta *_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta without feature files.

topmg -i variable_mods.txt -x proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file and a fixed modification: carbamidomethylation on cysteine.

topmg -i variable_mods.txt -f C57 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. In an identified proteoform, at most 1 unexpected mass shift and 4 variable PTMs are allowed and the maximum value for unexpected mass shifts is 10,000 Dalton.

topmg -i variable_mods.txt -P 4 -s 1 -M 10000 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. The error tolerance for precursor and fragment masses is 5 ppm.

topmg -i variable_mods.txt -e 5 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. Use the target-decoy approach to compute spectrum-level and proteoform-level FDRs, filter identified proteoform spectrum-matches by a 5% spectrum-level FDR, and filter identified proteoforms by a 5% proteoform-level FDR.

topmg -i variable_mods.txt -d -t FDR -v 0.05 -T FDR -V 0.05 proteins.fasta spectra_ms2.msalign

Search a deconvoluted MS/MS spectrum file spectra_ms2.msalign against a protein database file proteins.fasta with a feature file. Use 6 CPU threads to speed up the computation.

topmg -i variable_mods.txt -u 6 proteins.fasta spectra_ms2.msalign

6 TopDiff

6.1 Input

  • Proteoform identification files in the XML format, e.g., spectra_ms2_toppic_proteoform.xml

6.2 Output

TopDiff outputs a TSV file containing proteoform identifications and their abundances in the input mass spectrum data. The default output file name is sample_diff.tsv.

6.3 Command line usage

To run TopDiff, open a terminal window and run the following command.

topdiff [options] spectrum-file-names

Options

-h [ --help ]

Print the help message.

-e [ --error-tolerance ] <a positive number>

Set the error tolerance for mapping identified proteoforms across multiple samples (in Dalton). Default value: 1.2 Dalton.

-t [ --tool-name ] <toppic|topmg>

Specify the name of the database search tool: toppic or topmg. Default: toppic.

-o [ --output ] <a file name>

Specify the output file name. Default: sample_diff.tsv.

Examples

Compare proteoform abundances using TopPIC identifications of two spectrum files spectra1_ms2.msalign and spectra2_ms2.msalign.

topdiff spectra1_ms2.msalign spectra2_ms2.msalign

Compare proteoform abundances using TopMG identifications of two spectrum files spectra1_ms2.msalign and spectra2_ms2.msalign.

topdiff -t topmg spectra1_ms2.msalign spectra2_ms2.msalign