TopDIA Supplemental Material

TopDIA

TopDIA is the first software tool for top-down proteoform identification using TD-DIA-MS data. TopDIA generates pseudo non-multiplexed MS/MS spectra from TD-DIA-MS data by integrating algorithms for detecting and matching proteoform and fragment features.

  • Code Availability: TopDIA is available as part of the TopPIC suite and can be downloaded from https://github.com/toppic-suite/toppic-suite/releases/tag/v1.7_DIA.

  • Executables: Zipped executable files are available for Windows and for Linux/Mac.

  • Evaluation Scripts: The scripts for training and evaluating the logistic regression model are available in a GitHub repository at https://github.com/ARBasharat/TopDIA_Evaluation_Scripts.

  • Data: The E. coli data sets used in the study are available as RAW and mzML files.

  • Identification Results: Proteoform identification results obtained for TD-DDA-MS and TD-DIA-MS E. coli data sets are available at: link.

  • Model Training Data: The data used to train the logistic regression model are available at link.

  • Testing Data: The E. coli data used to compare the performance of TD-DDA-MS and TD-DIA-MS are available at link.


  • TopDIA Manual

    1 Input

    The input of TopDIA is mass spectrometry data files in the mzML format. Raw mass spectral data generated from various mass spectrometers can be converted to mzML files using msconvert.
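
    As a minimal sketch of the conversion step, the helper below wraps the msconvert tool from ProteoWizard. The function name convert_to_mzml and the MSCONVERT override variable are hypothetical, and the peak-picking filter is an assumption, added only because the examples in this manual use centroid data.

```shell
# Convert one vendor raw file to mzML with ProteoWizard msconvert.
# MSCONVERT is a hypothetical override for the executable path.
convert_to_mzml() {
  # $1: vendor raw file, $2: output directory
  # --mzML selects the output format; the peakPicking filter centroids
  # all MS levels (an assumption, not a documented TopDIA requirement).
  "${MSCONVERT:-msconvert}" "$1" --mzML --filter "peakPicking true 1-" -o "$2"
}
```

    For example, convert_to_mzml sample.raw converted would be expected to write the mzML file into the converted folder.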

    2 Output

    TopDIA outputs LC-MS feature files for MS1 and MS/MS data with the file extension "csv" and two deconvoluted mass spectral data files with the file extension "msalign". The msalign format is similar to the MGF file format. In addition, TopDIA outputs pseudo-MS/MS spectral data in the msalign format with the file name suffix "pseudo_ms2.msalign".

    For example, when the input file name is spectra.mzML, the output includes:

    • spectra_frac_ms1.csv: a feature file containing LC-MS features.
    • spectra_frac_ms1.mzrt.csv: a feature file containing time and m/z coordinates of LC-MS features.
    • spectra_isolationWindow_frac_ms2.csv: a feature file containing LC-MS/MS features from an isolation window.
    • spectra_isolationWindow_frac_ms2.mzrt.csv: a feature file containing time and m/z coordinates of LC-MS/MS features from an isolation window.
    • spectra_ms1.msalign: a list of deconvoluted MS1 spectra.
    • spectra_ms2.msalign: a list of deconvoluted MS/MS spectra.
    • spectra_pseudo_ms2.msalign: a list of deconvoluted non-multiplexed pseudo-MS/MS spectra.
    • spectra_html: a folder containing JavaScript files for MS1 and MS/MS spectral visualization.
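
    The naming scheme in the list above can be turned into a quick check that a run produced its outputs. The sketch below is illustrative only: the function names are hypothetical, and the feature (.csv) files are left out because their names contain a window-specific part that depends on the data.

```shell
# Print the fixed-name outputs documented for one TopDIA input file.
# Feature files are omitted: their names include a window-specific part.
topdia_expected_outputs() {
  # ${1%.mzML} strips the extension: spectra.mzML -> spectra
  printf '%s\n' \
    "${1%.mzML}_ms1.msalign" \
    "${1%.mzML}_ms2.msalign" \
    "${1%.mzML}_pseudo_ms2.msalign" \
    "${1%.mzML}_html"
}

# Print every expected output that does not exist on disk.
topdia_missing_outputs() {
  topdia_expected_outputs "$1" | while IFS= read -r out_path; do
    [ -e "$out_path" ] || printf '%s\n' "$out_path"
  done
}
```

    Running topdia_missing_outputs spectra.mzML after a run prints nothing when all four items exist. If the -g option was used, the spectra_html folder is presumably absent and would be reported.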

    3 Command line usage

    To run TopDIA, open a terminal window and run the following command.

    topdia [options] spectrum-file-names

    Options

    -h [ --help ]

    Print the help message.

    -a [ --activation ] <CID|ETD|HCD|MPD|UVPD|FILE>

    Set the fragmentation method(s) of MS/MS spectra. When "FILE" is selected, the fragmentation methods of spectra are given in the input spectrum data file. Default value: FILE.

    -c [ --max-charge ] <a positive integer>

    Set the maximum charge state of precursor and fragment ions. The default value is 30.

    -m [ --max-mass ] <a positive number>

    Set the maximum monoisotopic mass of precursor and fragment ions. The default value is 70,000 Dalton.

    -e [ --mz-error ] <a positive number>

    Set the error tolerance of m/z values of spectral peaks. The default value is 0.02 m/z.

    -r [ --ms-one-sn-ratio ] <a positive number>

    Set the signal/noise ratio for MS1 spectra. The default value is 3.

    -s [ --ms-two-sn-ratio ] <a positive number>

    Set the signal/noise ratio for MS/MS spectra. The default value is 1.

    -n [ --msdeconv ]

    Use the MS-Deconv score (see paper) to rank isotopic envelopes. If -n is not selected, the default EnvCNN score (see paper) is used to rank isotopic envelopes.

    -w [ --precursor-window ] <a positive number>

    Set the precursor isolation window size. The default value is 4.0 m/z. When the input file contains the information of precursor windows, the parameter will be ignored.

    -t [ --ms1-ecscore-cutoff ] <a positive number in [0, 1]>

    Set the ECScore cutoff value for proteoform features. Default value is 0.

    -T [ --ms2-ecscore-cutoff ] <a positive number in [0, 1]>

    Set the ECScore cutoff value for fragment features. Default value is 0.

    -b [ --ms1-min-scan-number ] <1|2|3>

    The minimum number of MS1 scans in which a proteoform feature is detected. The default value is 2.

    -B [ --ms2-min-scan-number ] <1|2|3>

    The minimum number of MS2 scans in which a fragment feature is detected. The default value is 1.

    -i [ --single-scan-noise ]

    Use the peak intensity noise levels in single MS1 scans to filter out low intensity peaks in proteoform feature detection. The default method is to use the peak intensity noise level of the whole LC-MS map to filter out low intensity peaks.

    -p [ --ms1-intensity-correlation-cutoff ] <a positive number in [0, 1]>

    Set the MS1 seed envelope intensity correlation cutoff value for extracting proteoform features. The default value is 0.5.

    -P [ --ms2-intensity-correlation-cutoff ] <a positive number in [0, 1]>

    Set the MS2 seed envelope intensity correlation cutoff value for extracting fragment features. The default value is 0.

    -v [ --pseudo-cutoff ] <a positive number in [0, 1]>

    Set the Pseudo Score cutoff value for generating pseudo-MS/MS spectra. The default value is 0.55.

    -V [ --pseudo-peak-number ]

    Set the minimum number of peaks in a pseudo-MS/MS spectrum. The default value is 25.

    -d [ --disable-final-filtering ]

    Skip the final filtering of isotopic envelopes in MS/MS spectra.

    -u [ --thread-number ] <a positive integer>

    Number of CPU threads used in spectral deconvolution. Default value: 1.

    -g [ --skip-html-folder ]

    Skip the generation of HTML files for visualization.
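
    Several options above (-t, -T, -p, -P and -v) take a cutoff in [0, 1]. When topdia is called from a script, a small guard can reject out-of-range values before a long run starts. This is a sketch, not part of TopDIA; the function name is_unit_interval is hypothetical.

```shell
# Succeed only if $1 is a plain decimal number between 0 and 1 inclusive.
is_unit_interval() {
  awk -v x="$1" 'BEGIN {
    # Accept forms such as 0, 0.2, 1.0 and .5; reject signs and text.
    ok = (x ~ /^([0-9]+\.?[0-9]*|\.[0-9]+)$/) && (x + 0 >= 0) && (x + 0 <= 1)
    exit !ok
  }'
}
```

    A script could then guard a value with: is_unit_interval "$cutoff" || echo "cutoff must be in [0, 1]" >&2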

    Examples

    Deconvolute a centroid data file spectra.mzML and output feature (.csv) files, spectra_ms1.msalign, spectra_ms2.msalign, and spectra_pseudo_ms2.msalign.

    topdia spectra.mzML

    Deconvolute a centroid data file spectra.mzML. In pseudo-MS/MS spectrum generation, each proteoform feature is required to be detected in at least one MS1 scan and the ECScore cutoff for proteoform features is set to 0.2.

    topdia -t 0.2 -b 1 spectra.mzML
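
    Because the usage line accepts several spectrum file names, a folder of mzML files can be passed in one call. The wrapper below is a sketch under that assumption; run_topdia_batch and the TOPDIA override variable are hypothetical names, and only the documented -u option is used.

```shell
# Run TopDIA on every mzML file in a directory with a chosen thread count.
# TOPDIA is a hypothetical override for the executable path.
run_topdia_batch() {
  dir="$1"
  threads="${2:-1}"               # -u defaults to 1 thread
  set -- "$dir"/*.mzML            # expand the glob into "$@"
  if [ ! -e "$1" ]; then
    echo "no mzML files found in $dir" >&2
    return 1
  fi
  "${TOPDIA:-topdia}" -u "$threads" "$@"
}
```

    For example, run_topdia_batch data 4 would process all mzML files in the data folder using four threads.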



    Use TopDIA to analyze E. coli data

    For example, to analyze the DIA-MS data file covering the m/z range [720, 800] for DIA-Test-Replicate-1, process the mzML file with the maximum charge state set to 60 and with single-scan noise levels used during proteoform and fragment feature extraction:

    topdia -c 60 -i 20231117_DIA_720_800_rep2.mzML


    This generates the file 20231117_DIA_720_800_rep2_pseudo_ms2.msalign, which TopPIC then uses for proteoform identification with the following settings: no TopFD feature files, a shuffled decoy protein database to estimate spectrum-level and proteoform-level FDRs, a text file containing the information of variable PTMs, and an E. coli protein database.

    toppic -x -d -t FDR -T FDR -b var_mods.txt Ecoli.fasta 20231117_DIA_720_800_rep2_pseudo_ms2.msalign
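
    The two commands above can be chained so that TopPIC runs only if TopDIA succeeds. The sketch below reuses exactly the options shown in this section; the function name and the TOPDIA and TOPPIC override variables are hypothetical.

```shell
# Run TopDIA and then TopPIC on one DIA mzML file with the settings above.
# TOPDIA and TOPPIC are hypothetical overrides for the executable paths.
run_dia_identification() {
  # $1: mzML file, $2: protein database (FASTA), $3: variable PTM file
  pseudo="${1%.mzML}_pseudo_ms2.msalign"   # pseudo-MS/MS output of TopDIA
  "${TOPDIA:-topdia}" -c 60 -i "$1" || return 1
  "${TOPPIC:-toppic}" -x -d -t FDR -T FDR -b "$3" "$2" "$pseudo"
}
```

    For the data file above, the call would be: run_dia_identification 20231117_DIA_720_800_rep2.mzML Ecoli.fasta var_mods.txt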