ORFanage: Ultra-efficient and sensitive method to search for ORFs in spliced genomes guided by reference annotation to maximize protein similarity within genes.
ORFanage finds the best-matching ORF for each transcript in a GTF file, based on evidence from one or more reference annotations. The method is designed to identify cases where a known ORF fits the query transcript, either unchanged or with modifications introduced by additional exons, alternative start and end sites, etc. ORFanage also quantifies any changes to the reference annotation that are introduced by the splice variation.
Varabyou, A., Erdogdu, B., Salzberg, S. L., & Pertea, M. (2023). Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. bioRxiv, 2023-03.
More comprehensive documentation for ORFanage is available on ReadTheDocs. Please check it out for example workflows, some interesting results, and more.
By far the easiest way to install ORFanage is by using BioConda.
$ conda install -c conda-forge -c bioconda orfanage
If you want to build it from source, we recommend cloning the git repository as shown below.
$ git clone https://github.com/alevar/ORFanage.git --recursive
$ cd ORFanage
$ cmake -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles" .
$ make -j4
For a fully static build, add -DORFANAGE_STATIC_BUILD=1 to the list of arguments in the cmake command.
By default, make install will likely require administrative privileges. To specify a custom installation path, add -DCMAKE_INSTALL_PREFIX=<custom/installation/path> to the list of arguments in the cmake command.
If you are using a very old version of Git (< 1.6.5), the --recursive flag does not exist. In this case you need to clone the submodule separately (git submodule update --init --recursive).
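The build options above can be combined in a single configure step. The following is a minimal sketch, assuming a static build installed into a user-writable directory; the `PREFIX` value and the `build_orfanage` helper are hypothetical names chosen for illustration and are not part of ORFanage.

```shell
# Hypothetical install location; any directory you can write to will do.
PREFIX="${PREFIX:-$HOME/.local}"

build_orfanage() {
    # Configure a static release build with a custom install prefix,
    # then compile and install. Run from the root of an ORFanage checkout.
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DORFANAGE_STATIC_BUILD=1 \
          -DCMAKE_INSTALL_PREFIX="$PREFIX" \
          -G "Unix Makefiles" . &&
    make -j4 &&
    make install
}

# Only build when the current directory looks like a source checkout.
if [ -f CMakeLists.txt ]; then
    build_orfanage
else
    echo "CMakeLists.txt not found; run this from the ORFanage source directory" >&2
fi
```

With a user-writable prefix, `make install` should not need administrative privileges.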
| Requirement | Supported |
| --- | --- |
| Operating System | GNU/Linux |
| Architecture | Intel/AMD platforms that support POPCNT |
| Compiler | GCC ≥ 4.9, Clang ≥ 3.8 |
| Build system | CMake ≥ 3.2 |
| Language support | C++14 |
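To check the POPCNT requirement listed above before building, one option on GNU/Linux is to look for the `popcnt` flag in `/proc/cpuinfo`. This is a sketch rather than an official check; the `has_popcnt` helper is a hypothetical name.

```shell
has_popcnt() {
    # $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo).
    # Succeeds when "popcnt" appears as a whole word, as in the CPU flags line.
    grep -qw popcnt "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_popcnt; then
    echo "POPCNT: supported"
else
    echo "POPCNT: not reported (or /proc/cpuinfo is unavailable)"
fi
```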
Usage: orfanage [OPTIONS] <templates>...
Arguments:
  <templates>  One or more GFF/GTF files with coding exons to be used as templates.
Options:
  --query STRING  Path to a GTF query file with transcripts to which CDSs are to be ported.
  --output STRING  Basename for all output files generated by this software.
  --reference STRING  Path to the reference genome file in FASTA format. This parameter is required when the following parameters are used: 1. cleanq; 2. cleant; 3. pi.
  --cleanq  If enabled, will ensure all transcripts in the output file have valid start and end codons. This option requires the use of the --reference parameter.
  --cleant  If enabled, will ensure all ORFs in the reference annotations start with a valid start codon and end with the first available stop codon. This option requires the use of the --reference parameter.
  --rescue  If enabled, will attempt to rescue broken ORFs in the reference annotations. This option requires the use of the --reference parameter.
  --lpi INT  Percent identity by length between the original and reference transcripts. If set to -1 (default), the check will not be performed.
  --ilpi INT  Percent identity by length of bases in frame of the reference transcript. If set to -1 (default), the check will not be performed.
  --mlpi INT  Percent identity by length of bases that are in both query and reference. If set to -1 (default), the check will not be performed.
  --minlen INT  Minimum length of an open reading frame to consider for the analysis.
  --mode STRING  Strategy to select the CDS for transcripts: ALL, LONGEST, LONGEST_MATCH, FIRST, BEST, START_MATCH. A cascading array of modes can be provided as a comma-separated list to resolve ties or issues in CDS selection (e.g. two candidate ORFs have the longest length and mode selection falls back from LONGEST to BEST, or START_MATCH fails to match and falls back to FIRST or BEST). Default: LONGEST_MATCH,BEST,START_MATCH,LONGEST,FIRST,ALL.
  --stats STRING  Output a separate file with stats for each query/template pair.
  --threads INT  Number of threads to run in parallel.
  --use_id  If enabled, only transcripts with the same gene ID from the query file will be used to form a bundle. In this mode the same template transcript may be used in several bundles if it overlaps transcripts with different gene_ids.
  --non_aug  If enabled, non-AUG start codons in reference transcripts will not be discarded and will be considered in overlapping query transcripts on equal grounds with the AUG start codon.
  --keep_all_cds  Mutually exclusive with '--keep_cds_if_not_found'. If enabled, any CDS already present in the query will be kept unmodified.
  --keep_cds_if_not_found  Mutually exclusive with '--keep_all_cds'. If enabled, will still search for a new ORF in each query transcript. If the query transcript has a CDS annotated and no ORF can be identified by the method, the original will be kept. The original CDS will be replaced if a valid ORF can be found. Use '--keep_all_cds' to retain all CDS in the query unmodified.
  --overhang INT  If enabled, will also evaluate the nucleotide sequence upstream and downstream, up to the number of bases given as the argument.
  --spliced_overhang  Only in effect when combined with the '--overhang' parameter. If enabled, this option will extend the sequence up to the '--overhang' number of bases, but terminate prematurely if either a splice donor (if extending towards 3') or splice acceptor (if extending towards 5') is detected.
Help options:
--help Prints this help message.
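Two constraints in the option list are easy to miss: `--cleanq`, `--cleant` and `--rescue` each require `--reference`, and `--keep_all_cds` cannot be combined with `--keep_cds_if_not_found`. The sketch below checks an argument list against those two rules before anything is run. `check_orfanage_flags` is a hypothetical wrapper written for illustration; it is not part of ORFanage, and the tool may perform its own validation.

```shell
check_orfanage_flags() {
    # Scan the argument list once and record which relevant flags are present.
    has_ref=0; needs_ref=0; keep_all=0; keep_nf=0
    for arg in "$@"; do
        case "$arg" in
            --reference) has_ref=1 ;;
            --cleanq|--cleant|--rescue) needs_ref=1 ;;
            --keep_all_cds) keep_all=1 ;;
            --keep_cds_if_not_found) keep_nf=1 ;;
        esac
    done
    # Rule 1: the two keep_* options are mutually exclusive.
    if [ "$keep_all" -eq 1 ] && [ "$keep_nf" -eq 1 ]; then
        echo "error: --keep_all_cds and --keep_cds_if_not_found are mutually exclusive" >&2
        return 1
    fi
    # Rule 2: sequence-dependent options need the reference genome.
    if [ "$needs_ref" -eq 1 ] && [ "$has_ref" -eq 0 ]; then
        echo "error: --cleanq, --cleant and --rescue require --reference" >&2
        return 1
    fi
    return 0
}
```

For example, `check_orfanage_flags --cleanq --query query.gtf template.gtf` fails because no `--reference` is given.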
Sample datasets are provided in the "example" directory for testing and getting familiar with ORFanage. The included examples can be run with the following base command:
- orfanage --reference <path/to/grch38.fa> --output example/output.gtf --query example/query.gtf <--additional arguments> --stats example/stats.tsv example/template.gtf
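As an illustration of filling in the additional arguments, the sketch below runs the base command with `--cleanq`, `--cleant`, a shortened `--mode` cascade and four threads. These flag choices are only one possible combination, and the `GENOME` path is a placeholder that must point to your own copy of the reference FASTA; `run_example` is a hypothetical helper name.

```shell
# Placeholder path to the reference genome; replace with your own FASTA file.
GENOME="${GENOME:-path/to/grch38.fa}"

run_example() {
    # Base example command plus a few optional flags from the help text.
    orfanage --reference "$GENOME" \
             --output example/output.gtf \
             --query example/query.gtf \
             --cleanq --cleant \
             --mode LONGEST_MATCH,BEST \
             --threads 4 \
             --stats example/stats.tsv \
             example/template.gtf
}

# Run only when both the binary and the reference genome are available.
if command -v orfanage >/dev/null 2>&1 && [ -f "$GENOME" ]; then
    run_example
else
    echo "orfanage or the reference FASTA is missing; nothing was run" >&2
fi
```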