ORFanage: Ultra-efficient and sensitive method to search for ORFs in spliced genomes guided by reference annotation to maximize protein similarity within genes.

Introduction
Publications
Documentation
Installation
- BioConda
- Building from source
Getting started
Data

Introduction

ORFanage finds the best-matching ORF for each transcript in a GTF file based on evidence from one or more reference annotations. The method is designed to identify cases where a known ORF fits the query transcript, either unchanged or with modifications introduced by additional exons, alternative start and end sites, etc. ORFanage also quantifies any changes to the reference annotation that are introduced by the splice variation.

Publications

Varabyou, A., Erdogdu, B., Salzberg, S. L., & Pertea, M. (2023). Investigating Open Reading Frames in Known and Novel Transcripts using ORFanage. bioRxiv, 2023-03.

Documentation

Much more comprehensive documentation for ORFanage is provided on ReadTheDocs. Please check it out to see example workflows, some interesting results, and more.

Installation

BioConda

By far the easiest way to install ORFanage is by using BioConda.

$ conda install -c conda-forge -c bioconda orfanage

Building from source

If you want to build it from source, we recommend cloning the git repository as shown below.

$ git clone https://github.com/alevar/ORFanage.git --recursive
$ cd ORFanage
$ cmake -DCMAKE_BUILD_TYPE=Release -G "Unix Makefiles" .
$ make -j4

For a fully static build, add -DORFANAGE_STATIC_BUILD=1 to the list of arguments in the cmake command.

By default, make install will likely require administrative privileges. To specify a custom installation path, add -DCMAKE_INSTALL_PREFIX=<custom/installation/path> to the list of arguments in the cmake command.

If you are using a very old version of Git (< 1.6.5), the --recursive flag does not exist. In this case, fetch the submodules separately after cloning (git submodule update --init --recursive).

Requirements

Operating System	GNU/Linux
Architecture	Intel/AMD platforms that support POPCNT
Compiler	GCC ≥ 4.9, Clang ≥ 3.8
Build system	CMake ≥ 3.2
Language support	C++14
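
The POPCNT requirement can be checked before building. The following is a minimal sketch, assuming a Linux system where CPU features are listed in /proc/cpuinfo. The helper name has_popcnt is made up for this example and is not part of ORFanage.

```shell
# Hypothetical helper (not part of ORFanage): succeeds if the given
# cpuinfo-style file (default: /proc/cpuinfo) lists the popcnt CPU flag.
has_popcnt() {
    grep -qw popcnt "${1:-/proc/cpuinfo}" 2>/dev/null
}

if has_popcnt; then
    echo "POPCNT is reported by this CPU"
else
    echo "POPCNT not found in /proc/cpuinfo"
fi
```

The optional argument only exists so that the check can be pointed at a saved copy of cpuinfo from another machine.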

Getting started

Usage: orfanage [OPTIONS] <templates>...

Arguments:

<templates> One or more GFF/GTF files with coding exons to be used as templates.

Options:

--query STRING Path to a GTF query file with transcripts to which CDSs are to be ported

--output STRING Basename for all output files generated by this software

--reference STRING Path to the reference genome file in FASTA format. This parameter is required when the following parameters are used: 1. cleanq; 2. cleant; 3. pi.

--cleanq If enabled - will ensure all transcripts in the output file have valid start and end codons. This option requires the use of --reference parameter

--cleant If enabled - will ensure all ORFs in the reference annotations start with a valid start codon and end with the first available stop codon. This option requires the use of --reference parameter

--rescue If enabled - will attempt rescuing the broken ORFs in the reference annotations. This option requires the use of --reference parameter

--lpi INT Percent identity by length between the original and reference transcripts. If -1 (default) is set - the check will not be performed.

--ilpi INT Percent identity by length of bases in frame of the reference transcript. If -1 (default) is set - the check will not be performed.

--mlpi INT Percent identity by length of bases that are in both query and reference. If -1 (default) is set - the check will not be performed.

--minlen INT Minimum length of an open reading frame to consider for the analysis

--mode STRING Strategy to select the CDS for transcripts: ALL, LONGEST, LONGEST_MATCH, FIRST, BEST, START_MATCH. A cascading array of modes can be provided as a comma-separated list to resolve ties or issues in CDS selection (eg. two candidate ORFs have the longest length, mode selection falls back from LONGEST to BEST. Or START_MATCH fails to match and falls back to FIRST or BEST). Default: LONGEST_MATCH,BEST,START_MATCH,LONGEST,FIRST,ALL.

--stats STRING Output a separate file with stats for each query/template pair

--threads INT Number of threads to run in parallel

--use_id If enabled, only transcripts with the same gene ID from the query file will be used to form a bundle. In this mode the same template transcript may be used in several bundles if it overlaps transcripts with different gene_ids.

--non_aug If enabled, non-AUG start codons in reference transcripts will not be discarded and will be considered in overlapping query transcripts on equal grounds with the AUG start codon.

--keep_all_cds Mutually exclusive with '--keep_cds_if_not_found'. If enabled, any CDS already present in the query will be kept unmodified.

--keep_cds_if_not_found Mutually exclusive with '--keep_all_cds'. If enabled, will still search for new ORF in each query transcript. If query transcript has CDS annotated, and no ORF can be identified by the method, the original will be kept. Original CDS will be replaced if a valid ORF can be found. Use '--keep_all_cds' to retain all unmodified CDS in query.

--overhang INT If enabled, will also evaluate nucleotide sequence up and downstream up to N bases as set for the argument.

--spliced_overhang Only in effect when combined with the '--overhang' parameter. If enabled, this option will extend the sequence up to the '--overhang' number of bases, but terminate prematurely if either a splice donor (if extending towards 3') or splice acceptor (if extending towards 5') is detected.

Help options:

--help Prints this help message.
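
A strategy name is easy to mistype when several are chained in --mode. The sketch below checks a comma-separated list against the six names given in the help text before the tool is launched. The function name valid_mode_list is hypothetical, and matching the names exactly as spelled above is an assumption; ORFanage itself may be more lenient or report its own error.

```shell
# Hypothetical helper (not part of ORFanage): succeeds only if every
# comma-separated entry is one of the strategy names from the help text.
valid_mode_list() {
    rest=$1
    [ -n "$rest" ] || return 1            # an empty list is rejected
    while [ -n "$rest" ]; do
        m=${rest%%,*}                     # first entry of the remaining list
        case $m in
            ALL|LONGEST|LONGEST_MATCH|FIRST|BEST|START_MATCH) ;;
            *) return 1 ;;
        esac
        if [ "$rest" = "$m" ]; then       # last entry consumed
            rest=
        else
            rest=${rest#*,}               # drop the entry and its comma
        fi
    done
    return 0
}

MODES="START_MATCH,LONGEST,BEST"
if valid_mode_list "$MODES"; then
    echo "mode list ok: $MODES"
else
    echo "unknown strategy in: $MODES" >&2
fi
```

The validated string can then be passed unchanged as the value of --mode.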

Data

Sample datasets are provided in the "example" directory to test and get familiar with ORFanage. The included examples can be run with the following base command:

orfanage --reference <path/to/grch38.fa> --output example/output.gtf --query example/query.gtf <--additional arguments> --stats example/stats.tsv example/template.gtf
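
As a concrete variant of the base command, the sketch below runs the example without any of the options that require --reference, then counts how many transcripts in the result carry a CDS. It assumes the output is written to example/output.gtf as in the command above and is a standard tab-separated GTF with CDS records and a transcript_id attribute. The helper count_cds_transcripts and the thread count are illustrative choices, not part of ORFanage.

```shell
# Hypothetical helper: count distinct transcripts that have at least one
# CDS record (column 3) in a tab-separated GTF file.
count_cds_transcripts() {
    awk -F '\t' '
        $3 == "CDS" && match($9, /transcript_id "[^"]+"/) {
            seen[substr($9, RSTART, RLENGTH)] = 1
        }
        END { n = 0; for (t in seen) n++; print n }
    ' "$1"
}

# Run the example only when the binary and the sample data are available.
if command -v orfanage >/dev/null 2>&1 && [ -f example/query.gtf ]; then
    orfanage --query example/query.gtf \
             --output example/output.gtf \
             --stats example/stats.tsv \
             --threads 4 \
             example/template.gtf
fi

# Summarize the result if an output file was produced.
if [ -f example/output.gtf ]; then
    echo "transcripts with CDS: $(count_cds_transcripts example/output.gtf)"
fi
```

Comparing this count between runs with different --mode or --minlen settings gives a quick view of how the options change the number of annotated transcripts.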
