A Nextflow workflow for filtering a large FASTA file based on the quality PSMs returned by a proteomics search (currently comet + percolator). The quality thresholds that determine which proteins are included are user-tunable parameters.
The workflow currently runs:
- msconvert (if raw files are used as input)
- comet
- filterPin (removes non-rank one inputs to percolator)
- percolator
- filterFasta (keeps only proteins with at least one quality peptide from comet/percolator)
- fastaFixer (removes duplicate entries and entries with invalid residues)
- decoyFastaGenerator.pl from the TPP
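As a rough illustration of what the fastaFixer step in the list above describes (removing duplicate entries and entries with invalid residues), here is a minimal Python sketch. It is not the workflow's actual code. Two details are assumptions made for the example: duplicates are detected by the first word of the header line, and "valid" means the 20 standard amino acid letters. The function names are hypothetical.

```python
# Assumption: only the 20 standard amino acids count as valid residues.
VALID_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fix_entries(entries, valid=VALID_RESIDUES):
    """Drop repeated entry names and entries with residues outside `valid`."""
    seen = set()
    kept = []
    for header, seq in entries:
        words = header.split()
        # Assumption: the entry name is the first word of the header.
        name = words[0] if words else header
        if name in seen:
            continue  # duplicate entry
        if not seq or set(seq.upper()) - valid:
            continue  # empty sequence or invalid residue present
        seen.add(name)
        kept.append((header, seq))
    return kept


example = ">P1 first\nACDE\nFGHI\n>P2\nACXZ\n>P1 dup\nKLMN\n>P3\nKLMN\n"
fixed = fix_entries(parse_fasta(example))
```

In this example, P2 is dropped because of the X and Z residues and the second P1 is dropped as a duplicate, leaving the first P1 and P3.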
This workflow accepts the following parameters:
- comet_params - required. Path to the comet params file to use for the search. See https://raw.githubusercontent.com/mriffle/nf-filter-fasta/main/example_files/comet.params for an example comet.params.
- fasta - required. Path to the original, unfiltered FASTA file.
- spectra_dir - required. Path to a directory containing either raw or mzML files. If mzML files are found, raw files will be ignored.
- email - To whom a completion email should be sent. Exclude this parameter to send no email. Default is to send no email.
- psm_qvalue_filter - PSMs with a q-value greater than this will be excluded when finding quality peptides. Default: 0.01
- peptide_qvalue_filter - Peptides with a q-value greater than this will be excluded when finding quality peptides. Default: 0.01
- distinct_peptide_count - Proteins with fewer than this many peptides will be excluded from the final FASTA. Default: 3
- decoy_prefix - Generated decoys will have this as a prefix in their name. Default: DEBRUIJN
- final_fasta_base_name - Use this name as the base name of the generated FASTA. If left out, the base name of the input FASTA will be used.
- mzml_cache_directory - The cache directory to use when converting raw files to mzML. Default: /data/mass_spec/nextflow/nf-filter-fasta/mzml_cache
- panorama_cache_directory - The cache directory to use when downloading raw files from PanoramaWeb. Default: /data/mass_spec/nextflow/panorama/raw_cache
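To make the interaction of psm_qvalue_filter, peptide_qvalue_filter, and distinct_peptide_count concrete, the following Python sketch applies the three thresholds as they are worded above. It is an illustration only, not the workflow's implementation. The record layout (a dict with peptide, proteins, psm_qvalue, and peptide_qvalue keys) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def select_proteins(psms, psm_qvalue_filter=0.01, peptide_qvalue_filter=0.01,
                    distinct_peptide_count=3):
    """Return the set of proteins supported by enough quality peptides.

    `psms` is an iterable of dicts (hypothetical layout) with the keys
    'peptide', 'proteins', 'psm_qvalue', and 'peptide_qvalue'.
    """
    peptides_by_protein = defaultdict(set)
    for psm in psms:
        # Values strictly greater than a threshold are excluded.
        if psm["psm_qvalue"] > psm_qvalue_filter:
            continue
        if psm["peptide_qvalue"] > peptide_qvalue_filter:
            continue
        for protein in psm["proteins"]:
            peptides_by_protein[protein].add(psm["peptide"])
    # Proteins with fewer than distinct_peptide_count peptides are excluded.
    return {
        protein
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) >= distinct_peptide_count
    }


psms = [
    {"peptide": "AAAK", "proteins": ["P1"], "psm_qvalue": 0.001, "peptide_qvalue": 0.002},
    {"peptide": "CCCK", "proteins": ["P1", "P2"], "psm_qvalue": 0.005, "peptide_qvalue": 0.005},
    {"peptide": "DDDR", "proteins": ["P1"], "psm_qvalue": 0.01, "peptide_qvalue": 0.01},
    {"peptide": "EEEK", "proteins": ["P2"], "psm_qvalue": 0.2, "peptide_qvalue": 0.001},
    {"peptide": "AAAK", "proteins": ["P1"], "psm_qvalue": 0.003, "peptide_qvalue": 0.002},
]
kept = select_proteins(psms)
```

With the defaults, P1 is kept because it has three distinct passing peptides (the repeated AAAK counts once), while P2 has only one passing peptide and is excluded.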
Use the following command(s) to run the workflow:
- To ensure the latest version of the workflow is installed:
    nextflow pull -r main mriffle/nf-filter-fasta
- To run the workflow specifying parameters on the command line:
    nextflow run -r main mriffle/nf-filter-fasta --comet_params /path/to/comet.params --spectra_dir /path/to/mzml_files --fasta /path/to/file.fasta
- To run the workflow using a configuration file:
  Create a configuration file, called pipeline.config in this example (it can be called anything). You can put any of the parameters above in it as:
    params {
        comet_params = '/path/to/comet.params'
        fasta = '/path/to/file.fasta'
        spectra_dir = '/path/to/mzml_files'
        psm_qvalue_filter = 0.05
    }
  Then run the workflow using:
    nextflow run -r main mriffle/nf-filter-fasta -c pipeline.config
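If the workflow is launched from a script, the command-line form shown above can be assembled from a dictionary of parameters. The Python sketch below only builds the argument list and does not run anything. The helper name build_run_command is hypothetical; the revision and workflow name are taken from the commands above.

```python
import shlex


def build_run_command(params, revision="main", workflow="mriffle/nf-filter-fasta"):
    """Build a `nextflow run` argument list with one --name value pair per parameter."""
    command = ["nextflow", "run", "-r", revision, workflow]
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


command = build_run_command({
    "comet_params": "/path/to/comet.params",
    "spectra_dir": "/path/to/mzml_files",
    "fasta": "/path/to/file.fasta",
})
# shlex.join quotes arguments where needed, so the string is safe to paste into a shell.
command_line = shlex.join(command)
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Nextflow is installed.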
The output of the pipeline will be placed in the results/nf-filter-fasta directory (relative to where the workflow was run). Assuming your FASTA file was named myname.fasta the output files include:
- fasta/myname.filtered.fasta - The FASTA file that has been filtered using comet/percolator results.
- fasta/myname.filtered.fixed.fasta - The above file after it has been "fixed" (any duplicate entries removed and sequences containing invalid residues removed).
- fasta/myname.filtered.fixed.plusdecoys.fasta - The above file that has had decoys added.
- comet/*.pin - The percolator input files generated by the comet search.
- comet/*.pep.xml - The comet results files.
- percolator/combined_filtered.pout.xml - The percolator results.
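For scripts that collect the results afterwards, the fixed-name output paths can be derived from the input FASTA name. The following Python sketch follows the naming pattern listed above. It assumes that final_fasta_base_name, when given, simply replaces the myname part; that behavior is inferred from the parameter description rather than confirmed. The helper name is hypothetical.

```python
from pathlib import PurePosixPath


def expected_outputs(fasta_path, final_fasta_base_name=None,
                     results_dir="results/nf-filter-fasta"):
    """List the fixed-name output files for a given input FASTA path."""
    # Assumption: the base name is the input file name without its extension,
    # unless final_fasta_base_name overrides it.
    base = final_fasta_base_name or PurePosixPath(fasta_path).stem
    root = PurePosixPath(results_dir)
    return [
        str(root / "fasta" / f"{base}.filtered.fasta"),
        str(root / "fasta" / f"{base}.filtered.fixed.fasta"),
        str(root / "fasta" / f"{base}.filtered.fixed.plusdecoys.fasta"),
        str(root / "percolator" / "combined_filtered.pout.xml"),
    ]


outputs = expected_outputs("/path/to/myname.fasta")
```

The comet/*.pin and comet/*.pep.xml files are omitted because their names depend on the spectra files that were searched.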
- You must first set up your PanoramaWeb credentials. After finding your API key in PanoramaWeb, save it to Nextflow by typing:
    nextflow secrets set PANORAMA_API_KEY "api key from PanoramaWeb"
- All file locations that begin with https:// are assumed to be PanoramaWeb WebDAV URLs. To specify PanoramaWeb locations for all input files, the following pipeline.config file could be used:
    params {
        comet_params = 'https://panoramaweb.org/_webdav/FOLDER_PATH/@files/comet.params'
        fasta = 'https://panoramaweb.org/_webdav/FOLDER_PATH/@files/myname.fasta'
        spectra_dir = 'https://panoramaweb.org/_webdav/FOLDER_PATH/@files/FOLDER_NAME/'
        psm_qvalue_filter = 0.05
    }
  Note: it is not required that all files be in PanoramaWeb; mixing local and PanoramaWeb files will work.
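The https:// prefix rule described above can be checked before launching a run, for example to confirm that the PanoramaWeb API key has been set only when it is needed. The Python sketch below restates that rule and makes no claim about how the workflow itself performs the check. Both function names are hypothetical.

```python
def is_panorama_location(location):
    """Apply the stated rule: locations beginning with https:// are PanoramaWeb WebDAV URLs."""
    return location.startswith("https://")


def split_locations(locations):
    """Split a name -> location mapping into (panorama, local) dicts."""
    panorama = {k: v for k, v in locations.items() if is_panorama_location(v)}
    local = {k: v for k, v in locations.items() if not is_panorama_location(v)}
    return panorama, local


# Mixing local and PanoramaWeb inputs, which the note above says is allowed.
panorama, local = split_locations({
    "comet_params": "/path/to/comet.params",
    "fasta": "https://panoramaweb.org/_webdav/FOLDER_PATH/@files/myname.fasta",
    "spectra_dir": "https://panoramaweb.org/_webdav/FOLDER_PATH/@files/FOLDER_NAME/",
})
```

If the first dict is non-empty, PANORAMA_API_KEY needs to be stored with nextflow secrets before the run.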