Fgenesh++ is a pipeline for automatic prediction of genes in eukaryotic genomes based on Softberry gene finding software.

This software can NOT be copied or distributed without a Softberry license.
Copyright Softberry, Inc. 1999-2026.

This manual includes:

FGENESH++: copyrighted and public software, data required, references

Fgenesh++ package includes the following copyrighted software:

  • Fgenesh - highly accurate and fast HMM-based gene prediction program;
  • Fgenesh+ - gene prediction program that uses similar protein information;
  • gene finding parameters for Fgenesh and Fgenesh+;
  • mrna_map - program for mapping full-length mRNAs to genome (genome alignment with splice sites identification) and mapping partial mRNAs (not full-length mRNAs, ESTs) to improve gene prediction accuracy;
  • prot_map - mapping protein database to genomic sequences;
  • Perl scripts.

Fgenesh++ package uses the following public software:

  • NCBI BLAST+ blastp or BLAST blastall, bl2seq programs

Fgenesh++ can be used with the following public data:

  • NCBI RefSeq database;
  • NCBI NR database (non-redundant protein database);
  • OrthoDB protein database.

Fgenesh++ requires sequences for analysis:

    genomic sequences (optionally with repeats soft-masked by lowercase letters)

References to Fgenesh++ software

Solovyev V, Kosarev P, Seledsov I, Vorobyev D. (2006) Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7 Suppl 1:S10.1-12.

Salamov A., Solovyev V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10(4), 516-522.

Solovyev V.V. (2002) Finding genes by computer: probabilistic and discriminative approaches. In Current Topics in Computational Biology (eds. T. Jiang, T. Smith, Y. Xu, M. Zhang), The MIT Press, p. 365-401.


Sources of evidence

Fgenesh++ can accept the following sources of evidence for prediction of genes.


1) full-length mRNAs for a specified organism

full-length mRNAs are aligned to genomic sequences and "mRNA-supported" genes are predicted based on these alignments

(full-length mRNAs can be obtained from transcripts assembled by Trinity or similar programs or from RefSeq)


2) protein DB (or its part) with known proteins

proteins from DB (or some set of proteins, for example, from closely related species) are aligned to genomic sequences and "protein-supported" genes are predicted based on these alignments


3) partial mRNAs (or ESTs) for a specified organism

partial mRNAs (or ESTs) are aligned to genomic sequences to get a list of potential splice sites that are utilised by Fgenesh to improve gene models during ab initio gene prediction


4) RNASeq data (reads) for a specified organism

reads are mapped to genomic sequences (by ReadsMap pipeline) to get a list of potential splice sites that are utilised by Fgenesh to improve gene models during ab initio gene prediction

Notes:

- if RNASeq reads are available for a specified organism it is recommended to assemble reads into transcripts by Trinity or some other program and then split assembled transcripts into full-length mRNAs and partial (not full-length) mRNAs and use them as sources of evidence 1) and 3);

- the ReadsMap pipeline is currently being reworked to run in a more automatic manner; the revised version will be available in future releases of Fgenesh++.


5) gene prediction parameters trained for a specified (or closely related) organism

ab initio genes are predicted by Fgenesh that runs with parameters trained for a specified (or closely related) organism

Ab initio predictions can use hints from 3), 4) if such data is available.

See the section "Types of evidence" for more details.


Installation


Copy the FGENESHPIPE/ directory from the distribution kit to an appropriate directory.

Create a configuration file (see "Configuration file") or copy one from the examples.


Running additional scripts

A set of additional scripts (to prepare sequences, data, etc.) is provided with Fgenesh++.

If some scripts do not work because they cannot find the required Perl modules, the following two files can solve the issue:

softberry_PERL5LIB.sh
softberry_wrapper.sh

You can either set PERL5LIB in your current environment or use a wrapper:

1) set PERL5LIB environment variable with the path to FGENESHPIPE/Modules/ directory

Open softberry_PERL5LIB.sh in some editor and put the right path to FGENESHPIPE/Modules/

Then run one of the following commands.

.      softberry_PERL5LIB.sh
source softberry_PERL5LIB.sh

Finally, check that PERL5LIB environment variable is set.

echo $PERL5LIB
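
The echo command above only shows the current value of the variable. A minimal sketch of a stricter check follows; the function name perl5lib_has and the example path are hypothetical and are not part of the Fgenesh++ distribution.

```shell
# Hypothetical helper (not shipped with Fgenesh++): succeed if directory $1
# is one of the colon-separated entries of PERL5LIB.
perl5lib_has() {
    dir=${1%/}                               # ignore one trailing slash
    case ":${PERL5LIB:-}:" in
        *":$dir:"* | *":$dir/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call; the path is a placeholder for your own installation.
if perl5lib_has /path_to/FGENESHPIPE/Modules; then
    echo "PERL5LIB contains the Modules directory"
else
    echo "PERL5LIB does not contain the Modules directory" >&2
fi
```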

2) use softberry_wrapper.sh to run a script

Open softberry_wrapper.sh in some editor and put the right path to softberry_PERL5LIB.sh shell script.

Then run a script with this wrapper:

<path_to>/softberry_wrapper.sh  <path_to>/script.pl [args and options]

Running the pipeline


Three examples of running the pipeline are provided with this distribution.

  • EXAMPLE_nGASP/ is an example for three C. elegans sequences (nGASP training sequences 1.seq, 2.seq, 3.seq)

  • EXAMPLE_chr22/ is an example for human chr22

  • EXAMPLE_chr2x/ is an example for human chr20, chr21, chr22

There you can find example data, configuration and shell files with commands to run the pipeline.


1. Preparation of sequences

Genomic sequences can be either complete chromosomes or scaffolds, contigs, etc.


1.1. Repeats masking

Using repeat-masked sequences is recommended (especially for large genomes such as human). Repeats in genomic sequences should be masked by lowercase letters (not by Ns).

If you run the pipeline with the -masked option, it is assumed that lowercase letters in genomic sequences are soft-masked repeats.

If you run the pipeline without the -masked option, it is assumed that repeats are not masked, even if genomic sequences contain lowercase letters.
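
Before deciding whether to use -masked, it can help to see how much of a sequence file is in lowercase. The following is a minimal sketch, not part of Fgenesh++; the function name masked_fraction is hypothetical, and it counts any lowercase letter as masked.

```shell
# Hypothetical helper: print the fraction of lowercase (soft-masked)
# residues in a FASTA file, ignoring deflines that start with '>'.
masked_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                        # skip deflines
        {
            total += length($0)              # all residues on this line
            lower += gsub(/[a-z]/, "")       # number of lowercase residues
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", lower / total
        }
    ' "$1"
}

# Small demonstration on a temporary file (8 of 20 residues are lowercase).
demo=$(mktemp)
printf '>seq1\nACGTacgtAC\nnnnnACGTAC\n' > "$demo"
masked_fraction "$demo"                      # prints 0.40
rm -f "$demo"
```

A value close to zero suggests that the lowercase letters, if any, are probably not repeat masking.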


1.2. Multi-FASTA file vs a list of single-FASTA files

Sequences for annotation can be provided to the pipeline as a multi-FASTA file or a list of single-FASTA files.


a) multi-FASTA file

put all sequences into multi-FASTA file and provide it to the pipeline with an option -fasta <fasta_file> or -f <fasta_file> (for example, -f seq.fa)


b) list of single-FASTA files

put each sequence in a separate single-FASTA file, make a list with absolute or relative paths to single-FASTA files with sequences and provide it to the pipeline with an option -list <seq_list> or -l <seq_list> (for example, -l seq.list)

Note: if you provide relative paths to sequences they should be given relative to the directory from which you start the pipeline (relative to the current working directory).


EXAMPLE

file 'seq.list'

# (with absolute paths)

/data/Genome/SEQ/1.seq
/data/Genome/SEQ/2.seq
/data/Genome/SEQ/3.seq
...

or

# (with relative paths)

../human/SEQ/chr20.fa
../human/SEQ/chr21.fa
../human/SEQ/chr22.fa
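
A list with absolute paths, like the first example above, can be generated rather than typed. The sketch below is an illustration only; the function name make_seq_list is hypothetical, and it assumes that every regular file in the given directory is a single-FASTA sequence file.

```shell
# Hypothetical helper: write the absolute paths of all regular files in
# directory $1 to list file $2, one path per line.
make_seq_list() {
    abs_dir=$(cd "$1" && pwd) || return 1    # resolve to an absolute path
    : > "$2"                                 # create or empty the list file
    for f in "$abs_dir"/*; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f" >> "$2"
        fi
    done
    return 0
}
```

For example, make_seq_list SEQ seq.list writes the list, which can then be passed to the pipeline with -l seq.list. Keep the list file outside the sequence directory so that it does not list itself.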

1.3. Naming conventions

Files with sequences can be named in any way, with or without extension(s).

(!) But please do NOT use the extensions .N and .M - they are reserved for the pipeline.

Examples of file names for files with sequences:

1.seq
chr22.fa

ENm003.seq
ENm003.fa
ENm003.hg17.fa
ENm003
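
The rules above (paths are resolved relative to the current working directory; the extensions .N and .M are reserved) can be checked before a long run. This is a minimal sketch under the assumption that every non-empty line of the list is a path; the function name check_seq_list is hypothetical.

```shell
# Hypothetical helper: check every path in list file $1.
# Returns non-zero if a file is missing or uses a reserved extension.
check_seq_list() {
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue           # skip empty lines
        case "$path" in
            *.N | *.M)
                echo "reserved extension: $path" >&2
                status=1
                ;;
        esac
        if [ ! -f "$path" ]; then            # relative paths: checked from CWD
            echo "file not found: $path" >&2
            status=1
        fi
    done < "$1"
    return "$status"
}
```

Run it from the directory in which you intend to start the pipeline, for example check_seq_list seq.list, so that relative paths are tested against the same working directory.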

2. Gene prediction parameters file

Organism-specific gene prediction parameters are usually located in FGENESHPIPE/PARAM/matrices/ directory.

For example,

C_elegans_nGASP - gene prediction parameters for C.elegans used in the example EXAMPLE_nGASP/

Human - gene prediction parameters for Human used in the examples EXAMPLE_chr22/ and EXAMPLE_chr2x/


3. Pipeline general parameters file

Parameters file contains parameters that regulate gene prediction process.

For a number of parameters you might want to set up values other than those initially provided in the file, e.g., maximal intron length allowed in a gene or minimal length of an intron suitable for prediction of genes in introns of other genes. Some parameters may require some expertise to manipulate them properly.

We provide parameters files for mammals and for non-mammals:

mammals.par  - for mammals
non_mamm.par - for non-mammals

Put the path to parameters file in configuration file (see "Configuration file").


4. Configuration file


4.1. Typical config file

In the configuration file you can indicate which steps of the pipeline to run, provide absolute or relative paths to parameter files, data files or directories, and set other options.

Note: if you provide relative paths they should be given relative to the directory from which you start the pipeline (relative to the current working directory).

EXAMPLE

(similar to EXAMPLE_nGASP/run/ce.cfg)

#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters

GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP  # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par              # location of parameters files

# Predict genes with GC donor splice sites or not

PREDICT_GC = 1

#---------------------------------------------------------------------

#  Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)

NUM_THREADS     = 1  # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1

MRNA_THREADS    = 1  # number of threads to map full-length mRNAs
EST_THREADS     = 1  # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1  # number of threads to map proteins or run 'blast'

#---------------------------------------------------------------------

# Mapping full-length mRNAs

MAP_MRNAS = 1                                        # map full-length mRNAs to genomic sequences
MRNA_FILE = /data/EXAMPLE_nGASP/DATA/ngasp.mrna.fa   # full-length mRNAs

#---------------------------------------------------------------------

# Mapping ESTs (partial mRNAs)

MAP_ESTS = 1                                         # map ESTs (partial mRNAs) to sequences
EST_FILE = /data/EXAMPLE_nGASP/DATA/rna_matches.fa   # file with ESTs (partial mRNAs)

#---------------------------------------------------------------------

# Using reads

USE_READS = 0                                        # use reads info to improve gene models
DIR_SITES = /data/EXAMPLE_nGASP/reads_sites/         # directory with reads *.sites files

#---------------------------------------------------------------------

# Using known proteins from DB
# (predict genes based on homology to known proteins)

USE_PROTEINS     = 1                                    # 0 - no, 1 -yes

PROTEIN_DB       = /data/EXAMPLE_nGASP/NR_ce/nr_ce      # protein DB
PROTEIN_DB_INDEX = /data/EXAMPLE_nGASP/NR_ce/nr_ce.ind  # protein DB index file
PROTEIN_DB_TAG   = NR                                   # short name for protein DB

#---------------------------------------------------------------------

# Location of BLAST+ or BLAST programs

# BLAST+ programs

BLASTP  = /path_to/ncbi-blast-2.2.28+/bin/blastp       # blastp (protein vs protein DB)
BLAST2  = /path_to/ncbi-blast-2.2.28+/bin/blastp       # blast 2 proteins

# BLAST programs
#
# BLASTP  = /path_to/blast-2.2.26/bin/blastall         # blastp (protein vs protein DB)
# BLAST2  = /path_to/blast-2.2.26/bin/bl2seq           # blast 2 proteins

#---------------------------------------------------------------------

# Predicting ab initio genes 
# (including genes with evidence from partial mRNAs or reads)

PREDICT_AB_INITIO = 1   # predict ab initio genes


# Finding protein homologs (from protein database) for ab initio genes

BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)


# Predicting ab initio genes in long introns of other genes

INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes

4.2. Description of fields

In the GENE_PARAM line, provide the location of the file with organism-specific gene prediction parameters, for example:

GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP  # gene prediction parameters

In the PIPE_PARAM line, provide the location of the file with general pipeline parameters. Currently there are two such files, one for mammals and one for all other organisms (non-mammals):

PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/mammals.par              # for mammals
or
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par             # for non-mammals

The pipeline can predict genes with GC donor splice sites. You can switch this option ON (1) or OFF (0).

PREDICT_GC = 1

Several programs of the pipeline support parallelization - they can automatically split a set of query sequences (full-length or partial mRNAs, proteins from DB) into subsets and perform calculations with such subsets in parallel.

You can specify the number of threads (cores) for all such programs or set an individual value for each program.

For example:

# (if NUM_THREADS > 1 then other threads settings are not used)

NUM_THREADS     = 10  # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1

MRNA_THREADS    = 1  # number of threads to map full-length mRNAs
EST_THREADS     = 1  # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1  # number of threads to map proteins or run 'blast'

or (the same)

# (if NUM_THREADS > 1 then other threads settings are not used)

NUM_THREADS     = 1  # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1

MRNA_THREADS    = 10  # number of threads to map full-length mRNAs
EST_THREADS     = 10  # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 10  # number of threads to map proteins or run 'blast'

See also "Running the pipeline in parallel" for strategies to run calculations in parallel.
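
The precedence between NUM_THREADS and the per-step settings can be written as a one-line rule. The sketch below only restates the comments in the config file; the function name effective_threads is hypothetical and the pipeline's own implementation is not shown in this manual.

```shell
# Hypothetical illustration of the rule from the config comments:
# if NUM_THREADS > 1 it is used for every step, otherwise the
# step-specific value (MRNA_THREADS, EST_THREADS, PROTEIN_THREADS) applies.
effective_threads() {
    # $1 = NUM_THREADS, $2 = step-specific setting
    if [ "$1" -gt 1 ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

effective_threads 10 1    # first example above
effective_threads 1 10    # second example above
```

Both calls give the same number of threads, which is why the two configurations are described as equivalent.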

Then, you can switch ON (1) or OFF (0) each of the pipeline steps:

MAP_MRNAS         = 1   # map full-length mRNAs to genomic sequences
MAP_ESTS          = 1   # map ESTs (partial mRNAs) to genomic sequences
USE_READS         = 0   # use reads info to improve gene models
USE_PROTEINS      = 1   # 0 - no, 1 -yes
PREDICT_AB_INITIO = 1   # predict ab initio genes
BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)
INTRONIC_GENES    = 0   # predict ab initio genes in long introns of other genes

Note: in the current version of the pipeline, either the "using partial mRNAs" step (MAP_ESTS) or the "using reads" step (USE_READS) can be switched ON, but not both at the same time.


For each step (if it is switched ON) provide all the necessary data / options.

MAP_MRNAS = 1

if full-length mRNAs are available for a specified organism then make a FASTA file with full-length mRNAs (see Using full-length mRNAs) and provide its location in MRNA_FILE line

MAP_ESTS = 1

if partial mRNAs (or ESTs) are available for a specified organism then make a FASTA file with partial mRNAs (or ESTs) (see Using partial mRNAs) and provide its location in EST_FILE line

USE_READS = 1

if RNASeq reads are available for a specified organism they can be mapped to genomic sequences by ReadsMap pipeline and a directory with obtained *.sites files (with potential splice sites) can be provided in DIR_SITES line

(note, however, that in the current release this step is not fully automatic: it requires running the ReadsMap pipeline and preparing *.sites files; also, MAP_ESTS = 1 and USE_READS = 1 cannot be used simultaneously; this will be improved in future releases of Fgenesh++)

USE_PROTEINS = 1

"Prediction of genes based on homology to known proteins" (from some protein DB or its subset)

PROTEIN_DB       = /data/NR/nr_ce        # protein DB

Prepare protein database (NR, OrthoDB or its subset or custom protein database, see APPENDIX 3) and provide its location in PROTEIN_DB line.

PROTEIN_DB_INDEX = /data/NR/nr_ce.ind    # protein DB index file

Prepare index file for a protein database (see APPENDIX 3) and provide its location in PROTEIN_DB_INDEX line.

PROTEIN_DB_TAG    = NR                   # short name for protein DB

Short name or tag for the protein database. It is used in deflines of ab initio predicted genes / proteins to indicate the database in which a homolog (BLAST hit) of the predicted protein was found, for example:

>FGENESH:   10   5 exon (s) 721921  - 735306   240 aa, chain + ## NR: EAW59798.1 adaptor-related protein complex 1, beta 1 subunit, isoform CRA_a [Homo sapiens] ## query_len = 240  match_len = 229  query_cov = 0.36  match_cov = 0.38  query_x1 = 100  query_x2 = 185  match_x1 = 48  match_x2 = 133  score = 139  pid = 79.0  evalue = 5e-40
MADVRLLAESARHDGNTIGHKSTEVLALTWMLPLKLVVANAVAALSEIAESHPSNNLLHL
NPQFINKLLTALIECTEWGQIFILDCLTNYTPKDDREAQRGETSLFFHLKRAGFADMTGP
KEQVSLISSFSSSLLRILPGLSSGIRTNEEETVTALAANTHGQGSSIHLWEDPKFDTEDV
KPSPNWKAKARSSFLSTPLCLPSFTTKEDGSAAAQSLLVNHRPPYTPAPVHIPQGSSWDS

Provide paths to BLAST+ or BLAST executables for "protein vs protein DB" and "protein vs protein" BLAST comparisons.

For BLAST+, both programs are blastp:

# Location of BLAST+ or BLAST programs

# BLAST+ programs

BLASTP  = /path_to/ncbi-blast-2.2.28+/bin/blastp        # blastp (protein vs. protein DB)
BLAST2  = /path_to/ncbi-blast-2.2.28+/bin/blastp        # blast 2 proteins

or, if you can run BLAST+ without paths:

BLASTP  = blastp                                        # blastp (protein vs. protein DB)
BLAST2  = blastp                                        # blast 2 proteins

For BLAST, the programs are blastall and bl2seq:

# BLAST programs

BLASTP  = /path_to/blast-2.2.26/bin/blastall               # blastp (protein vs. protein DB)
BLAST2  = /path_to/blast-2.2.26/bin/bl2seq                 # blast 2 proteins

PREDICT_AB_INITIO = 1   # predict ab initio genes

Predict ab initio genes (including genes with evidence from partial mRNAs or reads) or not.
Ab initio genes are predicted in regions where no genes are predicted by mapping full-length mRNAs or known proteins from a protein DB.


BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)

Find homologs (BLAST hits) to ab initio predicted genes (proteins) or not.


INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes

Optional step, prediction of ab initio genes in long introns of other genes.


Note: in config files, lines starting with '#' are comment lines (they are not parsed). Everything after a '#' within a line is also treated as a comment.
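
The comment rule and the MAP_ESTS / USE_READS restriction described in this section can be checked with a small script before starting a run. This is a sketch under stated assumptions, not the pipeline's own parser: it assumes lines of the form KEY = value, that the last assignment of a key wins, and the names cfg_get and check_est_reads are hypothetical.

```shell
# Hypothetical helper: print the value of key $1 from config file $2,
# ignoring everything after '#' on each line.
cfg_get() {
    awk -v key="$1" '
        { sub(/#.*/, "") }                   # strip comments
        {
            eq = index($0, "=")
            if (eq == 0) next                # not a KEY = value line
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)   # trim whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v            # assumption: last one wins
        }
        END { print val }
    ' "$2"
}

# Hypothetical check: fail if both MAP_ESTS and USE_READS are switched ON.
check_est_reads() {
    if [ "$(cfg_get MAP_ESTS "$1")" = "1" ] &&
       [ "$(cfg_get USE_READS "$1")" = "1" ]; then
        echo "MAP_ESTS and USE_READS are both switched ON in $1" >&2
        return 1
    fi
    return 0
}
```

For example, check_est_reads ce.cfg returns a non-zero status only when both steps are set to 1.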


4.3. Examples of config files

EXAMPLE_nGASP/run/
ce.cfg             - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
ce_mrnas.cfg       - Mapping full-length mRNAs, partial mRNAs, Ab initio
ce_proteins.cfg    - Mapping proteins from DB, Ab initio
ce_reads.cfg       - Mapping full-length mRNAs, proteins from DB, Using reads, Ab initio

EXAMPLE_chr22/run/
human.cfg          - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
human_mrnas.cfg    - Mapping full-length mRNAs, partial mRNAs, Ab initio
human_proteins.cfg - Mapping proteins from DB, Ab initio

EXAMPLE_chr2x/run/
human.cfg          - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
human_mrnas.cfg    - Mapping full-length mRNAs, partial mRNAs, Ab initio
human_proteins.cfg - Mapping proteins from DB, Ab initio

4.4. Shorter config files

If some steps are switched OFF (0), you can leave the paths and options for these steps empty or remove such steps (with the corresponding paths and options) from the configuration file entirely.

For example, this is how a configuration file can look if only the "Mapping full-length mRNAs", "Mapping partial mRNAs" and "Predict ab initio" steps are used:

EXAMPLE_nGASP/run/ce_mrnas.cfg

Note:

in this example ngasp.mrna.fa contains only 253 mRNAs, which is not enough to annotate sequences with the "Mapping full-length mRNAs" step alone; in real projects use a protein DB as well;

but if you have a sufficient number of full-length mRNAs from assembled transcripts and/or RefSeq mRNAs, you might want to skip the "Using known proteins from DB" step to speed up calculations.

#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters

GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP  # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par              # location of parameters files

# Predict genes with GC donor splice sites or not

PREDICT_GC = 1


#  Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)

NUM_THREADS     = 10  # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1

MRNA_THREADS    = 1  # number of threads to map full-length mRNAs
EST_THREADS     = 1  # number of threads to map partial mRNAs / ESTs


# Mapping full-length mRNAs

MAP_MRNAS = 1                                         # map full-length mRNAs to genomic sequences
MRNA_FILE = /data/EXAMPLE_nGASP/DATA/ngasp.mrna.fa    # full-length mRNAs


# Mapping ESTs (partial mRNAs)

MAP_ESTS = 1                                         # map ESTs (partial mRNAs) to genomic sequences
EST_FILE = /data/EXAMPLE_nGASP/DATA/rna_matches.fa   # file with ESTs (partial mRNAs)


# Predicting ab initio genes 
# (including genes with evidence from partial mRNAs or reads)

PREDICT_AB_INITIO = 1   # predict ab initio genes

# Predicting ab initio genes in long introns of other genes

INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes

And this is how a configuration file can look if only the "Predict genes based on homology to proteins from DB" and "Predict ab initio" steps are used:

EXAMPLE_nGASP/run/ce_proteins.cfg

#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters

GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP  # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par              # location of parameters files

# Predict genes with GC donor splice sites or not

PREDICT_GC = 1


#  Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)

NUM_THREADS     = 10  # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1

PROTEIN_THREADS = 1  # number of threads to map proteins or run 'blast'


# Using known proteins from DB
# (predict genes based on homology to known proteins)

USE_PROTEINS      = 1                                     # 0 - no, 1 -yes

PROTEIN_DB        = /data/EXAMPLE_nGASP/NR_ce/nr_ce       # protein DB
PROTEIN_DB_INDEX  = /data/EXAMPLE_nGASP/NR_ce/nr_ce.ind   # protein DB index file
PROTEIN_DB_TAG    = NR                                    # short name for protein DB


# Location of BLAST+ or BLAST programs

# BLAST+ programs

BLASTP  = /path_to/ncbi-blast-2.2.28+/bin/blastp        # blastp (protein vs. protein DB)
BLAST2  = /path_to/ncbi-blast-2.2.28+/bin/blastp        # blast 2 proteins

# BLAST programs
#
# BLASTP  = /path_to/blast-2.2.26/bin/blastall          # blastp (protein vs. protein DB)
# BLAST2  = /path_to/blast-2.2.26/bin/bl2seq            # blast 2 proteins


# Predicting ab initio genes 
# (including genes with evidence from partial mRNAs or reads)

PREDICT_AB_INITIO = 1   # predict ab initio genes

# Finding protein homologs (from protein database) for ab initio genes

BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)

# Predicting ab initio genes in long introns of other genes

INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes

5. Commands to run the pipeline

5.1. Script run_pipe.pl (help and options)

Run run_pipe.pl without parameters for help.

Usage:   run_pipe.pl  <config_file>  -f <fasta_file> | -l <seq_list>  [-masked]  -d <results_dir>  [options]

where

    <config file>          - configuration file

  ------------------------------------------------------------------------------------------

  Source of sequences:

    -fasta <fasta_file>,
    -f     <fasta_file>    - file with sequences in multi-FASTA format

    -list <seq_list>,
    -l    <seq_list>       - list of single-FASTA files with sequences

    -masked                - sequences are soft-masked (repeats are masked by lowercase letters)

  ------------------------------------------------------------------------------------------

  Options to set a range of sequences for calculations:

    -from <seq_from_No>    - run for sequences from <seq_from_No>
    -to   <seq_to_No>      - run for sequences to   <seq_to_No>

    -seq <seq_No_1,...>    - run for sequences <seq_No_1>, etc.

                             - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                             Note:

                             options -from, -to, -seq cannot be used together with
                             options -jobs <N>, -chunk <n>

  ------------------------------------------------------------------------------------------

  Options to split sequences into subsets and run calculations in parallel:

    -jobs <N>,
    -j    <N>              - split sequences into N parts and 
                             print commands to run N processes in parallel

    -chunk <n>             - split sequences into chunks (<n> sequences each) and 
                             print commands to run <number_processes> in parallel

                             where

                             <number_processes> = ( number_of_sequences / chunk ) + 1

    The following options are used only with -jobs <N> or -chunk <n>:

    -make_sh               - print commands to .sh file 

                            'run_cmd.sh.XXXX' by default (where XXXX is unique substring) or
                             use option -sh <filename> to provide name for .sh file

    -sh_name <filename>,
    -sh      <filename>    - filename for .sh file 

                            (this option automatically switches ON -make_sh)

    -run                   - run commands (start calculations) in parallel

                             - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

                             Note (!):

                             with -run option all processes are started at once!
                             use    -jobs  <N> or
                             choose -chunk <n> with care for not to overload your system

  ------------------------------------------------------------------------------------------

  Options to make separate .res files for each sequence (only with -f <fasta_file>):

    -split_res             - if sequences are in multi-FASTA file,
                             make separate .res files for each sequence

                            (this option is used with -f <fasta_file>)

                            (separate .res files are named according to options 
                            -res_name_by_count, -res_name_by_seqid, see below)

    -res_name_by_count     - name .res files by assigning numbers like 1.fa.res, 2.fa.res, ...
                            (by default)

    -res_name_by_seqid     - name .res files by assigning seq IDs like chr20.fa.res, chrX.fa.res

                            (these options are used with -split_res)

                            ----------------------------------------------------------------

                            Note:

                            With no -split_res option,
                            naming of result files for multi-fasta input looks like this:

                            a) if no -from, -to or -seq options are provided 
                               then results are saved in a file

                               <fasta_file>.res

                               for example:

                              -f chr.fa -> chr.fa.res

                            b) if -from <from> and -to <to> options are provided 
                               then results are saved in a file 

                               <fasta_file>.res.<from>-<to>

                               for example:

                              -f chr.fa  -from 5 -to 10 -> chr.fa.res.5-10
                              -f chr.seq -from 5 -to 5  -> chr.seq.res.5     (here from = to)

                            c) if -seq <seq_No>  or 
                                  -seq <seq_No_1,seq_No_2,...> option is provided 
                                   then results are saved in a file

                               <fasta_file>.res.<seq_No>  or 
                               <fasta_file>.res.<seq_No_1>_<seq_No_2>_...

                               for example:

                              -f chr.fa  -seq 3     -> chr.fa.res.3
                              -f chr.seq -seq 3,5,7 -> chr.seq.res.3_5_7

  ------------------------------------------------------------------------------------------

    -d <results_dir>       - directory to store results

    -w <work_dir>          - working directory (if not provided, 
                             temporary directory is created automatically for each run)

    -clean                 - clean working directory *before* calculations 

                            (if it exists and contains any files)

    -debug                 - do not clean working directory *after* calculations

                            (so that developers can see files and spot problems)
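
The naming rules for result files listed in the help text above (multi-FASTA input without -split_res) can be restated as a short function, which may be useful when a script needs to know in advance which .res file to expect. The function name res_name and its mode argument are hypothetical; only the resulting names are taken from the help text.

```shell
# Hypothetical helper: print the expected result file name.
#   res_name <fasta_file> all
#   res_name <fasta_file> range <from> <to>
#   res_name <fasta_file> seq <seq_No_1,seq_No_2,...>
res_name() {
    case "$2" in
        all)
            printf '%s.res\n' "$1"
            ;;
        range)
            if [ "$3" = "$4" ]; then         # from = to: single number
                printf '%s.res.%s\n' "$1" "$3"
            else
                printf '%s.res.%s-%s\n' "$1" "$3" "$4"
            fi
            ;;
        seq)                                 # commas become underscores
            printf '%s.res.%s\n' "$1" "$(printf '%s' "$3" | tr ',' '_')"
            ;;
        *)
            return 1
            ;;
    esac
}

res_name chr.fa  all            # chr.fa.res
res_name chr.fa  range 5 10     # chr.fa.res.5-10
res_name chr.seq seq 3,5,7      # chr.seq.res.3_5_7
```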

5.2. Examples how to run

In the examples below, the following options mean:

  -f seq.fa   - run for a multi-fasta file
  -l seq.list  - run for a list of sequences (a list of single-fasta files)

1) run for a whole set of sequences (for small genomes)

  run_pipe.pl  celegans.cfg  -f seq.fa    -d results  &> run.log  &
  run_pipe.pl  celegans.cfg  -l seq.list  -d results  &> run.log  &

use -masked for repeats-masked sequences

  run_pipe.pl  human.cfg  -f seq.fa   -masked  -d results  &> run.log  &
  run_pipe.pl  human.cfg  -l seq.list -masked  -d results  &> run.log  &

2) run for a range of sequences (from a multi-fasta file or a list of sequences)

  run_pipe.pl  celegans.cfg  -f seq.fa    -from  1  -to  5  -d results  &> run_1.log  &
  run_pipe.pl  celegans.cfg  -f seq.fa    -from  6  -to 10  -d results  &> run_2.log  &
or
  run_pipe.pl  celegans.cfg  -l seq.list  -seq 4            -d results  &> run.log  &
  run_pipe.pl  celegans.cfg  -l seq.list  -seq 4,7,9        -d results  &> run.log  &

make separate .res files for each sequence

  run_pipe.pl  celegans.cfg  -f seq.fa    -from  1  -to  5  -d results  -split_res  &> run.log  &
  run_pipe.pl  celegans.cfg  -l seq.list  -seq 4,7,9        -d results  -split_res  &> run.log  &

3) run for a whole set of sequences (in parallel)

print commands (and see how a set of sequences is split)

  run_pipe.pl  celegans.cfg  -f seq.fa         -d results  -jobs  5
  run_pipe.pl  celegans.cfg  -l seq.list       -d results  -chunk 100

make .sh file with commands (and -run to run it)

  run_pipe.pl  human.cfg  -f seq.fa  -masked  -d results  -jobs 5  -sh run_cmd.sh
  run_pipe.pl  human.cfg  -f seq.fa  -masked  -d results  -jobs 5  -sh run_cmd.sh  -run

make separate .res files for each sequence

  run_pipe.pl  human.cfg  -f seq.fa  -masked  -d results  -jobs 5  -sh run_cmd.sh -split_res
  run_pipe.pl  human.cfg  -f seq.fa  -masked  -d results  -jobs 5  -sh run_cmd.sh -split_res -run
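
Because -run starts all processes at once, it is worth estimating the number of processes before choosing -chunk. The sketch below counts sequences in a multi-FASTA file and applies the formula from the help text, (number_of_sequences / chunk) + 1, read here with integer division; that reading, and the function names count_seqs and chunk_processes, are assumptions. Use the commands printed by the pipeline itself as the authoritative answer.

```shell
# Hypothetical helper: number of sequences (deflines) in a multi-FASTA file.
count_seqs() {
    grep -c '^>' "$1"
}

# Hypothetical helper: formula from the help text, with integer division.
chunk_processes() {
    # $1 = number of sequences, $2 = chunk size
    echo $(( $1 / $2 + 1 ))
}

# Small demonstration on a temporary file with three sequences.
demo=$(mktemp)
printf '>s1\nACGT\n>s2\nACGT\n>s3\nACGT\n' > "$demo"
n=$(count_seqs "$demo")
echo "sequences: $n, processes with -chunk 2: $(chunk_processes "$n" 2)"
rm -f "$demo"
```

With -jobs <N> no estimate is needed, since the number of processes is N by definition.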

4) run for subsets of sequences (manually created from a multi-fasta file or a list of sequences)

  run_pipe.pl  celegans.cfg  -f seq_1.fa            -d results  &> run_1.log  &
  run_pipe.pl  celegans.cfg  -f seq_2.fa            -d results  &> run_2.log  &
  ...

  run_pipe.pl  human.cfg     -l seq_1.list -masked  -d results  &> run_1.log  &
  run_pipe.pl  human.cfg     -l seq_2.list -masked  -d results  &> run_2.log  &
  ...

saving results in different directories and providing working directories explicitly

  run_pipe.pl  human.cfg     -f seq_1.fa    -masked  -d results_1/  -w work_1/  &> run_1.log  &
  run_pipe.pl  human.cfg     -f seq_2.fa    -masked  -d results_2/  -w work_2/  &> run_2.log  &
  ...

  run_pipe.pl  celegans.cfg  -l seq_1.list           -d results_1/  -w work_1/  &> run_1.log  &
  run_pipe.pl  celegans.cfg  -l seq_2.list           -d results_2/  -w work_2/  &> run_2.log  &
  ...

5.3. Description

Sequences for annotation can be provided to the pipeline in two ways:

a) as a multi-FASTA file via -f <fasta_file> option or
b) as a list of single-FASTA files via -l <seq_list> option.

If sequences are soft-masked (repeats are masked by lowercase letters) then also use -masked option.
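
Before adding -masked it can help to confirm that the input really contains lowercase (soft-masked) bases. The following is a minimal sketch and not part of the Fgenesh++ package; the function name lowercase_fraction is hypothetical. It prints the fraction of sequence letters in a FASTA file that are lowercase.

```shell
# lowercase_fraction <fasta_file>
# Prints the fraction (0..1) of sequence letters that are lowercase.
# Hypothetical helper, not a Fgenesh++ script.
lowercase_fraction() {
  awk '
    BEGIN {
      lc = "[abcdefghijklmnopqrstuvwxyz]"
      uc = "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
    }
    /^>/ { next }                      # skip definition lines
    {
      # gsub with "&" replaces each match by itself and returns the count
      n_lower += gsub(lc, "&")
      n_upper += gsub(uc, "&")
    }
    END {
      total = n_lower + n_upper
      if (total > 0) printf "%.4f\n", n_lower / total
      else           print "0.0000"
    }
  ' "$1"
}
```

A value of 0 suggests the sequences are not soft-masked. A value close to 1 suggests the whole file is simply written in lowercase, which is not the same as repeat masking.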

Directory with results will contain .res files with Fgenesh++ predictions
(predicted gene structures and corresponding proteins in Fgenesh++ format - see APPENDIX 4 for details).

In case of multiple .res files you can merge them into one multi-res file with the following script:

SCRIPTS/merge_res_files.pl

If a working directory is not provided explicitly (via -w <dir> option), a unique temporary working directory for temporary and intermediate files is created automatically and cleaned up after calculations are finished.


5.4. Pipeline log and errors output

The pipeline prints log output to stdout and stderr; error messages go to stderr.

This log output is of limited use to users but helps developers spot errors and bugs (if any). To save log output (and errors), run the pipeline as follows:

in bash, redirect stdout and stderr to one file or to two different files (the &> form is a bash extension; in plain sh only the second form works)

FGENESHPIPE/run_pipe.pl  ...  &>  <log_err_file>
or
FGENESHPIPE/run_pipe.pl  ...   >  <log_file>  2>  <log_err_file>

in tcsh, csh

FGENESHPIPE/run_pipe.pl  ...  >&  <log_err_file>

Note: tcsh and csh cannot directly redirect stdout and stderr to two different files, but >& redirects the combined output to one file.

EXAMPLES

in bash

FGENESHPIPE/run_pipe.pl  celegans.cfg  -f seq.fa  -d results  &> run.log_err
or
FGENESHPIPE/run_pipe.pl  celegans.cfg  -f seq.fa  -d results  >  run.log  2> run.log_err

in tcsh

FGENESHPIPE/run_pipe.pl  celegans.cfg  -f seq.fa  -d results  >& run.log_err

6. Running the pipeline in parallel

There are several strategies to run the pipeline in parallel to annotate genomic sequences.

Basically, the pipeline can be parallelized either by splitting

1) a set of target sequences (genomic sequences to annotate) or
2) a set of query sequences (full-length or partial mRNAs, proteins from DB)

into subsets and running the pipeline for these subsets simultaneously.

Some combination of 1) and 2) is also possible.


Let us consider each approach.

1) split a set of genomic sequences into subsets and run the pipeline in parallel for these subsets

You can split genomic sequences with options -jobs <N> or -chunk <n>.

First, run pipeline commands with options -jobs <N> or -chunk <n> and -sh <file.sh> to see how it splits sequences into subsets.

For example,

run_pipe.pl  ...  -jobs 10  -sh run_cmd.sh

Look at run_cmd.sh for generated commands with -from <n1> and -to <n2> options that relate to subsets of genomic sequences.

Then, if the split looks OK to you, run .sh file or add -run to the command line to start calculations in parallel.

For example,

./run_cmd.sh
or
run_pipe.pl  ...  -jobs 10  -sh run_cmd.sh  -run

Notes:

a) if you use this approach then set all THREADS values in a config file to 1

NUM_THREADS     = 1  # number of threads (general)

MRNA_THREADS    = 1  # number of threads to map full-length mRNAs
EST_THREADS     = 1  # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1  # number of threads to map proteins or run 'blast'

b) the directory for result files (provided via -d <dir> option) is not cleaned when the pipeline starts, so multiple pipeline runs can write files with gene predictions into the same results directory;

however, if the pipeline creates files with the same names as files already stored in the results directory, the new files overwrite the old ones


c) since multiple pipeline runs are started and run at once (for subsets of sequences), multiple .res files (with gene predictions) will appear in the results directory: one for each subset of sequences (or one per sequence if -split_res option is applied)

After all calculations are complete, multiple .res files can be merged into one multi-res file with the following script:

SCRIPTS/merge_res_files.pl


d) you can split genomic sequences yourself using other methods or scripts, write commands to some .sh file and start calculations (run .sh file)


e) if you have hundreds or thousands of genomic sequences (contigs), it is probably better not to start all the calculations at once (as N parallel processes with big subsets of sequences), so that the run stays controllable;

instead, split genomic sequences into multiple subsets (for example, with -chunk <n>) and get a .sh file with commands. Run the commands from this file in batches (X commands at a time). Wait until some jobs are finished (check by inspecting running jobs or by counting .res files in the results directory), then run the next batch of commands, and so on.
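
The batching described in note e) can be sketched as a small shell function. This is an illustration only, not part of the Fgenesh++ package: the function name run_in_batches is hypothetical, and it assumes that the generated .sh file holds one complete command per line (the actual layout of a file written by -sh may differ, so inspect it first). A trailing '&' on a line is dropped so that the function itself decides how many commands run at once.

```shell
# run_in_batches <commands_file> <max_parallel>
# Runs the commands from <commands_file> in batches of <max_parallel>.
# Assumption: one complete command per line. Hypothetical helper.
run_in_batches() {
  rib_running=0
  while IFS= read -r rib_line || [ -n "$rib_line" ]; do
    # trim the line and drop a trailing "&" (backgrounding is done below)
    rib_cmd=$(printf '%s\n' "$rib_line" |
      sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/[[:space:]]*&$//')
    # skip blank lines and comment lines (including a "#!" first line)
    case "$rib_cmd" in
      ''|'#'*) continue ;;
    esac
    bash -c "$rib_cmd" < /dev/null &
    rib_running=$((rib_running + 1))
    if [ "$rib_running" -ge "$2" ]; then
      wait                      # wait until the whole batch has finished
      rib_running=0
    fi
  done < "$1"
  wait                          # wait for the last (possibly incomplete) batch
}
```

For example, run_in_batches run_cmd.sh 4 would start at most 4 commands at a time. Note that wait also waits for any other background jobs of the calling shell, and that a batch ends only when its slowest command has finished.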


2) run the pipeline for the whole set of genomic sequences at once

but specify in a config file the number of threads for programs that support parallelization (these programs can automatically split a set of query sequences (full-length or partial mRNAs, proteins from DB) into subsets and perform calculations with such subsets in parallel).

You can specify the number of threads (cores) for all such programs or set an individual value for each program.

For example,

NUM_THREADS     = 10  # number of threads (general)

or (the same)

# the following threads settings are used if NUM_THREADS = 1

MRNA_THREADS    = 10  # number of threads to map full-length mRNAs
EST_THREADS     = 10  # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 10  # number of threads to map proteins or run 'blast'

3) combination of 1) and 2) is also possible

If you want to annotate a small number of sequences with maximal speed (and computer resources allow it), use -jobs <N> and also set the number of threads in the configuration file.

For example,

run_pipe.pl ... -jobs 3    # command
NUM_THREADS = 12           # config file

Overall: 3*12 = 36 cores/threads
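
The arithmetic above (jobs multiplied by threads per job) can be checked against the number of available cores before starting a run. A minimal sketch, assuming that each job really uses up to NUM_THREADS threads; the function name check_core_budget is hypothetical and the core count is taken from getconf unless given explicitly.

```shell
# check_core_budget <jobs> <threads_per_job> [available_cores]
# Prints the requested and available core counts; returns 0 if the
# request fits and 1 otherwise. Hypothetical helper.
check_core_budget() {
  ccb_cores=${3:-$(getconf _NPROCESSORS_ONLN)}
  ccb_needed=$(( $1 * $2 ))
  echo "needed: $ccb_needed  available: $ccb_cores"
  [ "$ccb_needed" -le "$ccb_cores" ]
}
```

For the example above, check_core_budget 3 12 succeeds only on a machine with at least 36 cores.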


Conclusion:

Generally, approach 1) is faster than 2) since not all steps of the pipeline support parallelization by query sequences (at some steps the pipeline runs as 1 process).

Obviously, approach 3) is the fastest if computer resources allow you to annotate a certain number of sequences in this way.

Approach 2) is useful for a small number of sequences because it creates only one .res file for a multi-FASTA input file
(if -split_res option is not applied).


7. Types of evidence: mode details

(See also Sources of evidence from above.)


Fgenesh++ pipeline can deal with several types of evidence.


1) gene models predicted using full-length mRNA sequences

This corresponds to MAP_MRNAS block in configuration file.

See Using full-length mRNAs for details.


2) gene models predicted using proteins from some protein DB
 (NR, OrthoDB or custom protein DB, or their subsets)

This corresponds to USE_PROTEINS block in configuration file.

First, prot_map maps known proteins from a protein database (for example, NR or its subset) to genomic sequences and good mappings are selected.

Then fgenesh+ program predicts gene models in regions with good mappings using mapped proteins.

After that, predicted gene models are additionally checked / filtered by a script that analyses BLAST2 (bl2seq) alignments between predicted and homologous proteins. Only gene models that have good coverage of predicted and homologous proteins by BLAST alignment are selected. Some other criteria are also checked.

These protein-supported predictions are made in regions not occupied by mRNA-supported predictions.

See APPENDIX 3 for how to prepare protein DB for calculations.

To speed up calculations make a subset of proteins that includes proteins from closely related species or some taxon.


3) using partial mRNAs to improve gene predictions

This corresponds to MAP_ESTS block in configuration file.

Partial mRNAs (including ESTs) can be used to make more accurate predictions by fgenesh.

Partial mRNAs are aligned to genomic sequences by mrna_map and good alignments are selected. A list of alignment blocks is compiled, noting whether each block is flanked by splice sites at one end, at both ends, or at neither. This information about regions (potential exons) supported by partial mRNAs and about potential splice sites is used by fgenesh when predicting genes ab initio (see below). Strictly speaking, predictions that use such information are no longer purely ab initio.

With this approach, boundaries of CDS exons and gene structures are predicted more correctly.

Technically, using partial mRNAs is not an additional step of improvement after making initial predictions but rather incorporating additional information (potential splice sites and their weights) in the process of prediction by fgenesh. By improvement we mean higher accuracy of gene models.

Each ab initio predicted gene is represented by only one gene model (the one with the maximum weight) even if partial mRNAs support several alternative splicing gene variants.

See also Using partial mRNAs.


4) using reads to improve gene predictions

This corresponds to USE_READS block in configuration file.

Reads are aligned to genomic sequences by ReadsMap pipeline and a list of potential splice sites is compiled for each sequence (*.sites files).

These *.sites files should be placed in some directory and path to this directory should be provided in configuration file.

Information about potential splice sites is used by fgenesh when predicting genes ab initio (see below). With this approach, boundaries of CDS exons and gene structures are predicted more correctly.

Technically, as with partial mRNAs (ESTs), this additional information is used in the process of prediction by fgenesh, not as a separate step after making initial predictions.

Currently, evidence from partial mRNAs and evidence from reads cannot be used simultaneously in Fgenesh++. Apart from that, there are no restrictions on combining types of evidence.


5) ab initio predictions

Ab initio predictions (without full-length mRNA or protein type of evidence) are made in regions not occupied by mRNA-supported and protein-supported predictions.

To make ab initio predictions, we run fgenesh with gene prediction parameters trained for a specified (or closely related) organism.

Some additional evidence from partial mRNAs or reads can be used by fgenesh in the process of gene prediction (see 3, 4 from above).

To make ab initio predictions, set

PREDICT_AB_INITIO = 1

In addition, best BLAST hits to ab initio predicted genes (proteins) can be found by BLAST of predicted proteins vs protein database. To do this, set

BLAST_AI_PROTEINS = 1

6) ab initio gene predictions in long introns of other genes

Optionally, pipeline can make gene predictions in long introns of other genes.

To do this, set

INTRONIC_GENES = 1   # predict genes in long introns of other genes

8. Using full-length mRNAs

The most powerful stage of Fgenesh++ pipeline is mapping full-length mRNAs to genomic sequences and predicting genes based on mRNA spliced alignments.

If full-length mRNAs are of good quality (for example, known mRNAs from RefSeq or assembled by Trinity or similar programs with no or little errors in CDS part) then Fgenesh++ is able to predict genes based on these mRNA alignments with very high accuracy.

If a set of full-length mRNAs includes isoforms then gene structures with alternative splicing are predicted.


8.1. Sources of full-length mRNAs

There can be several sources for full-length mRNAs:

1) transcripts assembled by Trinity or similar programs

Assembled transcripts usually include a mix of full-length and partial mRNAs (and also chimeras).

To get full-length mRNAs from assembled transcripts we need to predict ORFs, choose the best ORF in each transcript and decide whether this transcript can be considered as full-length mRNA (by CDS part, with ATG and STOP codon present) or not.

To predict ORFs and get full-length mRNAs (with definition lines in required format) with BestOrf program and mrnas_annot.pl script, see APPENDIX 1

If you use TransDecoder (or other program) to predict ORFs then it is your task to choose full-length mRNAs and make definition lines in the format required for Fgenesh++ (see below).


2) RefSeq mRNAs for a specified organism

If full-length mRNAs for a specified organism are present in RefSeq, you can use them to predict genes in genomic sequences (see APPENDIX 2).


1) and 2) can be combined (merged) into one file.

In this case make sure that sequences in the merged file have unique IDs.

For example,

./seq_renumber.pl -i human.mrna.fa -check

To make sequences IDs unique (if they are not) use the command

./seq_renumber.pl -i human.mrna.fa -o human.mrna.unique_ids.fa

Note:

If some full-length mRNAs are exactly the same (and they are mapped to genomic sequences with good quality), this can lead to prediction of identical gene structures. In that case a script is needed to filter out duplicate gene predictions from the final set of predicted genes.
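
One way to see in advance whether this situation can arise is to look for mRNAs with identical sequences in the merged FASTA file. The following is a minimal sketch, not a Fgenesh++ script, and the function name list_identical_mrnas is hypothetical. It only reports identical sequences (ignoring letter case and line wrapping) and does not remove anything, since identical mRNAs may also come from distinct loci.

```shell
# list_identical_mrnas <mrna_fasta>
# For every mRNA whose sequence was already seen, prints
#   <id> TAB <id of the first mRNA with the same sequence>
# Hypothetical helper; keeps all sequences in memory.
list_identical_mrnas() {
  awk '
    function emit_record(   key) {
      if (id == "") return
      key = toupper(seq)
      if (key in first_id) print id "\t" first_id[key]
      else                 first_id[key] = id
    }
    /^>/ { emit_record(); id = substr($1, 2); seq = ""; next }
         { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END  { emit_record() }
  ' "$1"
}
```

For example, list_identical_mrnas human.mrna.fa prints nothing if all sequences in the file are different.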


8.2. Definition lines

Fgenesh++ expects definition lines for full-length mRNAs to have the following fields at the end after the '#' character.

# len = 845  atg = 283  stop = 711  target = chr22

where

len    - length of mRNA
atg    - first coordinate of ATG
stop   - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)

Note: target field is optional, it is assumed target = 'na' if it is absent


EXAMPLES of deflines:

>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = chr22
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = na
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711
or
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = chr2
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = na
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820

8.3. Check

To check that full-length mRNAs have no annotation errors and their definition lines are in correct format run

./check_mrnas.pl  

Example:

./check_mrnas.pl  human.mrna.fa
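
For a quick look at definition lines outside the package, the format described in 8.2 can be approximated with a few lines of awk. This is a rough sketch only and not a replacement for check_mrnas.pl, whose actual checks are not described here. The function name check_deflines is hypothetical, and the rules it applies are assumptions derived from the field descriptions: len, atg and stop must be present after the last '#', 1 <= atg < stop <= len, and the span from atg to stop must be a multiple of 3.

```shell
# check_deflines <mrna_fasta>
# Reports deflines that do not fit the assumed rules; returns 1 if any
# problem is found. Hypothetical helper, not check_mrnas.pl.
check_deflines() {
  awk '
    function field(s, name,   m) {
      if (!match(s, "(^|[ \t])" name "[ \t]*=[ \t]*[0-9]+")) return -1
      m = substr(s, RSTART, RLENGTH)
      sub(/^.*=[ \t]*/, "", m)
      return m + 0
    }
    /^>/ {
      total++
      # position of the last "#" in the defline
      k = 0
      for (i = length($0); i > 0; i--)
        if (substr($0, i, 1) == "#") { k = i; break }
      if (k == 0) { print "line " NR ": no # fields"; bad++; next }
      tail = substr($0, k + 1)
      len = field(tail, "len"); atg = field(tail, "atg"); stop = field(tail, "stop")
      if (len < 0 || atg < 0 || stop < 0) {
        print "line " NR ": missing len, atg or stop"; bad++; next
      }
      if (!(atg >= 1 && atg < stop && stop <= len)) {
        print "line " NR ": coordinates out of range"; bad++; next
      }
      if ((stop - atg + 1) % 3 != 0) {
        print "line " NR ": atg..stop span is not a multiple of 3"; bad++
      }
    }
    END { printf "%d deflines, %d with problems\n", total, bad; exit (bad > 0) }
  ' "$1"
}
```

Both example deflines from 8.2 pass these rules: 711 - 283 + 1 = 429 and 13820 - 129 + 1 = 13692 are multiples of 3.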


8.4. Config file

For Fgenesh++ provide location of a FASTA file with full-length mRNAs in a config file, for example:

MRNA_FILE = /data/softberry/Human_data/human.mrna.fa    # full-length mRNAs

9. Using partial mRNAs

Partial mRNAs can come from assembled transcripts that are not full-length mRNAs or from ESTs.

Put them all into one FASTA file and provide its location in config file, for example:

EST_FILE = /data/softberry/Human_data/human.rna.fa      # partial mRNAs

Sequences in this file are not required to have unique IDs, and no special format of definition lines is required.


APPENDIX 1. Full-length mRNAs from assembled transcripts.


How to get full-length mRNAs from assembled transcripts

Suppose we have transcripts assembled from RNA-Seq reads by Trinity or other programs.

The best strategy to use transcripts in Fgenesh++ is to split them into full-length mRNAs and partial mRNAs.

To find full-length mRNAs we have to predict ORFs in transcripts: in each transcript choose the best ORF and decide whether the transcript can be considered a full-length mRNA (by CDS, with ATG and STOP codon present) or not.

ORFs in transcripts can be predicted by BestOrf, TransDecoder or other programs.

In Fgenesh++ package we provide BestOrf and MaxOrf that predict ORFs in transcripts. BestOrf uses the same gene prediction matrices as Fgenesh and calculates coding potential when choosing the best ORF for a transcript while MaxOrf reports the longest ORF (by CDS).

BestOrf also has higher accuracy of ORF prediction than MaxOrf in our tests on annotated RefSeq mRNAs.

Script mrnas_annot.pl runs BestOrf (or MaxOrf) on transcripts and splits them into full-length mRNAs and other transcripts (partial mRNAs).


Format of definition lines

A FASTA file with full-length mRNAs has a special format for definition lines with the following fields:

# len = 845  atg = 283  stop = 711  target = chr22

where

len    - length of mRNA
atg    - first coordinate of ATG
stop   - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)

Note: target field is optional, it is assumed target = 'na' if it is absent


EXAMPLES of deflines:

>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = chr22
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = na
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711

Once you get a file with full-length mRNAs (with deflines in special format) provide its location in configuration file, for example:

MRNA_FILE = /data/softberry/Human_data/human.trinity.mrna.fa    # full-length mRNAs

File with partial mRNAs can be used as ESTs, for example:

EST_FILE = /data/softberry/Human_data/human.trinity.rna.fa      # partial mRNAs

Programs and commands

Here we provide commands to split assembled transcripts into full-length mRNAs and other transcripts (partial mRNAs).

1) make sure that assembled transcripts have unique sequence IDs

check if sequence IDs are unique or not

./seq_renumber.pl -i human.trinity.fa -check

If sequence IDs are not unique then make them unique.

For example, rename file

mv  human.trinity.fa  human.trinity.fa.initial

Then, make new file with unique sequence IDs

./seq_renumber.pl -i human.trinity.fa.initial -o human.trinity.fa

(run ./seq_renumber.pl for help and more options)


2) predict ORFs in transcripts and split them into full-length mRNAs and other transcripts (partial mRNAs)

using bestorf to predict ORFs (by default)

./mrnas_annot.pl -i human.trinity.fa -par Human.dat -min_cds:100

using maxorf to predict ORFs

./mrnas_annot.pl -i human.trinity.fa -p maxorf -min_cds:100

EXAMPLE:

./mrnas_annot.pl -i human.trinity.fa -par Human.dat -min_cds:100

gives the following output

Warning: no -chr <chr_name> or -chr_code <chr_code>, assuming 'na' (source of transcripts is unknown)

Predicting ORFs with 'bestorf' and 'Human.dat' parameters

Creating files:

human.trinity.mrna.fa      - full-length mRNAs (based on predicted ORFs)
human.trinity.mrna.bestorf - bestorf output (ORF predictions) for mRNAs
human.trinity.protein.fa   - proteins (CDS translation) for full-length mRNAs
human.trinity.rna.fa       - partial mRNAs (with no ORF prediction)
human.trinity.log          - log file

Overall number of sequences   : 73232
Sequences with ORF predicted  : 57014

Annotated as full-length mRNAs: 57014 ( 77.9%)
  in direct  strand           : 26447 ( 46.4%)
  in reverse strand           : 30567 ( 53.6%)
Saved as partial mRNAs        : 16218 ( 22.1%)

Coverage of transcripts by ORF (CDS) (from 0 to 1)
  min     : 0.03
  max     : 1.00
  average : 0.51

Execution time: ~46 min


3) you can additionally check that full-length mRNAs have definition lines in required format and mRNAs have no annotation errors

./check_mrnas.pl  human.trinity.mrna.fa

4) provide locations of files with full-length (and partial) mRNAs in configuration file

MRNA_FILE = /data/softberry/Human_data/human.trinity.mrna.fa    # full-length mRNAs

and

EST_FILE  = /data/softberry/Human_data/human.trinity.rna.fa     # partial mRNAs

Script mrnas_annot.pl (help and options)

Run ./mrnas_annot.pl without parameters for help.

Predict ORFs and annotate transcripts, v1.07 (Softberry)

Usage: ./mrnas_annot.pl -i <seq_file_in>  [options]

where

    -i <seq_file_in>      - file with mRNAs (transcripts)


Options:

    -program <program>
    -p       <program>    - program to predict ORFs:

                            bestorf (by default) or 
                            maxorf  (maxorf.pl)

    -par <param_file>     - file with parameters for 'bestorf'

    -cov <thr>            - minimal coverage of mRNA by ORF (CDS) length (from 0 to 1)
                            (0 by default)

    -basename <name>
    -b <name>             - filename without extension as basename for output files
                            (basename from input file by default)

    -chr      <chr_name>,
    -chr_name <chr_name>  - chromosome name (chr22, chrX, etc.)  or 'na'

    -code     <chr_code>, 
    -chr_code <chr_code>  - chromosome code (number, X, Y, etc.) or 'na'

                            if options -chr <chr_name> and -chr_code <chr_code> are not provided 
                            then mRNAs are not assigned to any particular chromosomes or sequences 
                            (and in Fgenesh++ they are mapped to all genomic sequences and 
                            good mappings are selected)

    -check 0|1            - post-annotational check for mRNAs (1 by default)

    -info
    -progress             - print progress


Options for an ORF prediction program ('bestorf' or 'maxorf') are passed directly to a program.

The following options can be used with 'bestorf' or 'maxorf':

    -strand:0|1|2      - predict ORFs in + strand (0), - strand (1) or both (2)  (2 by default)
         -s:0|1|2

    -min_cds:<LEN>     - minimal CDS length (in aa)  (100 by default)
                        (do not make it too low otherwise the rate of false positives increases)

    -stop_in_cds:0|1 - include STOP codon in CDS / ORF coordinates (1) or not (0) (1 by default)


The following options can be used with 'maxorf':

    -polya:0|1         - use polyA tail as a hint to define mRNA strand (direct or reverse) 
                        (1 by default)
                        (option is used only when predicting ORFs in both strands)

    -polya_hint:0|1    - same as -polya:0|1

    -polya_thr:<N>     - number of A's at the end of mRNA to consider it as polyA tail
                        (number of T's at the start in reverse strand)
                        (5 by default)

Output:

if input file is 'transcripts.fna' then the following files will be created:

(basename for these files can be changed with -b <name> or -basename <name>)


# files for full-length mRNAs

transcripts.mrna.fa      - annotated full-length mRNAs in coding strand
transcripts.protein.fa   - proteins for full-length mRNAs
transcripts.mrna.bestorf or
transcripts.mrna.maxorf  - CDS / ORF predictions for full-length mRNAs (by 'bestorf' or 'maxorf')


# files for other transcripts

transcripts.rna.fa       - transcripts with no predicted ORFs 


# log file

transcripts.log          - log file


Examples:

./mrnas_annot.pl -i human.trinity.all.fa -par Human.dat  -min_cds:100
./mrnas_annot.pl -i chr22.trinity.all.fa -par Human.dat  -min_cds:100    -chr chr22

./mrnas_annot.pl -i transcripts.fna   -p bestorf -par Human.dat
./mrnas_annot.pl -i transcripts.fna   -p maxorf

./mrnas_annot.pl -i rf_hs_chr22.cdna  -p bestorf -par Human.dat -strand:0 -min_cds:55 -chr chr22
./mrnas_annot.pl -i rf_hs_chr22.cdna  -p maxorf                 -strand:0 -min_cds:55 -chr_code 22

./mrnas_annot.pl -i QSOX1_cDNAs_13.txt -p bestorf -par Human.dat  -strand:0  -cov 0.5
./mrnas_annot.pl -i QSOX1_cDNAs_13.txt -p maxorf                  -strand:0  -cov 0.5

APPENDIX 2. Full-length mRNAs from RefSeq.


RefSeq mRNAs for mRNA mapping

RefSeq can be used as a source of known full-length mRNAs for an organism.

Here we describe how to download RefSeq data and prepare full-length mRNAs for a specified organism to use them in Fgenesh++.


RefSeq data

FTP to RefSeq to browse its directories and see what files are present

ftp ftp://ftp.ncbi.nlm.nih.gov/refseq/
or
ftp anonymous@ftp.ncbi.nlm.nih.gov
cd refseq/

1) download RefSeq *.rna.gbff.gz (and optionally *.protein.faa.gz) files for a specified organism

Example for Human:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.protein.faa.gz  (optional)

Example for Arabidopsis thaliana (plants):

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.protein.faa.gz  (optional)

Example for Saccharomyces cerevisiae S288c (fungi):

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.protein.faa.gz  (optional)

2) test downloaded .gz files

gunzip -t *.gz

if some files are corrupted then re-download them
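
When many volumes are downloaded, it is convenient to list only the files that fail the test. A minimal sketch using gzip -t (equivalent to gunzip -t); the function name list_corrupted_gz is hypothetical.

```shell
# list_corrupted_gz <dir>
# Prints the names of .gz files in <dir> that fail the gzip integrity
# test (corrupted or truncated downloads). Hypothetical helper.
list_corrupted_gz() {
  for lcg_file in "$1"/*.gz; do
    [ -e "$lcg_file" ] || continue      # directory has no .gz files
    gzip -t "$lcg_file" 2>/dev/null || echo "$lcg_file"
  done
}
```

For example, list_corrupted_gz . prints nothing if all downloaded files in the current directory are intact.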

3) gunzip and concatenate files

for Human

gunzip -c human.*.rna.gbff.gz    > human.rna.gbff
gunzip -c human.*.protein.faa.gz > human.protein.faa  (optionally)

for Arabidopsis thaliana

gunzip -c plant.*.rna.gbff.gz    > plant.rna.gbff
gunzip -c plant.*.protein.faa.gz > plant.protein.faa  (optionally)

for Fungi

gunzip -c fungi.*.rna.gbff.gz    > fungi.rna.gbff
gunzip -c fungi.*.protein.faa.gz > fungi.protein.faa  (optionally)

The order in which files are concatenated is not important for refseq_mrnas.pl script.

Or if you want to concatenate files in an ordered manner then make a shell script like this (example for human.rna.gbff):

gunzip_refseq.sh

#!/bin/bash

# set the number of volumes for .gz files
# human.<volume>.rna.gbff.gz
#
# for example, if we have files
#
# from  human.1.rna.gbff.gz
# to    human.14.rna.gbff.gz
#
# then the number of volumes is 14

MAX_VOL=14


# delete old file if it is present (-f: no error if it is absent)

rm -f human.rna.gbff


# unzip files to human.rna.gbff in order of volume number

for (( i=1; i<=MAX_VOL; i++ ))
do

  # print the command, then run it
  echo "gunzip -c human.$i.rna.gbff.gz >> human.rna.gbff"
  gunzip -c "human.$i.rna.gbff.gz" >> human.rna.gbff

done

Concatenated file <filename>.rna.gbff (and optionally <filename>.protein.faa) is then used as an input to refseq_mrnas.pl script to get full-length mRNAs <filename>.mrna.fa (and optionally RefSeq proteins <filename>.protein.fa) for a specified organism.

4) you can check how many entries (if any) are present for a specified organism in RefSeq data

for example,

grep "^  ORGANISM  Saccharomyces cerevisiae S288c" fungi.rna.gbff | wc -l

Note that these include not only mRNAs but also other RNAs.
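
If the exact species name is not known in advance, all ORGANISM values in the file can be tabulated at once. A minimal sketch, assuming the usual GenBank flat-file layout in which the line starts with two spaces, the word ORGANISM and two more spaces (as in the grep command above); the function name count_organisms is hypothetical.

```shell
# count_organisms <rna_gbff_file>
# Prints "<number of entries> TAB <organism name>", most frequent first.
# Hypothetical helper; counts all RNA entries, not only mRNAs.
count_organisms() {
  awk '
    /^  ORGANISM  / { sub(/^  ORGANISM  /, ""); count[$0]++ }
    END { for (org in count) printf "%d\t%s\n", count[org], org }
  ' "$1" | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

The names printed in the second column are the values to pass to refseq_mrnas.pl via -org:<species name>.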


Script to get full-length mRNAs from Refseq

Script refseq_mrnas.pl takes as an input RefSeq <filename>.rna.gbff file and extracts full-length mRNAs for a specified organism to <filename>.mrna.fa file.

(Optionally it also takes RefSeq <filename>.protein.faa and extracts proteins for full-length mRNAs for a specified organism to <filename>.protein.fa file.)

By default, it selects full-length mRNAs with IDs starting with NM_ prefix and with REVIEWED / VALIDATED / PROVISIONAL status codes.

There are options in the script that also allow selecting mRNAs with both prefixes (NM_ and XM_) and choosing entries with specified status codes.

Note:
mRNAs with NM_ prefix are protein-coding transcripts (usually curated) while
mRNAs with XM_ prefix are predicted model protein-coding transcripts (computed).

Here are some links to RefSeq documentation.

RefSeq accession prefixes:

What is the difference between XM_ and NM_ accessions?
RefSeq accession numbers and molecule types.

RefSeq status codes:

How can I tell if a RefSeq record has been curated?
What sequence is used to define a RefSeq?
RefSeq status codes.


Script refseq_mrnas.pl processes <filename>.rna.gbff and creates files:

<filename>.mrna.fa   - mRNAs in FASTA format with deflines in special format
<filename>.mrna.err.fa - mRNAs with errors in RefSeq annotation
<filename>.protein.fa   - RefSeq proteins for extracted mRNAs (optional)


Deflines of mRNAs in FASTA file have special format with the following fields:

# len = 14121  atg = 129  stop = 13820  target = chr2

where

len    - length of mRNA
atg    - first coordinate of ATG
stop   - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)

Note: target field is optional, it is assumed target = 'na' if it is absent


EXAMPLES of deflines:

>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = chr2
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = na
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820

Once you created <filename>.mrna.fa provide its location in configuration file, for example:

MRNA_FILE = /data/softberry/Human_data/rf_human.mrna.fa      # full-length mRNAs

Sequences as Chromosomes vs Contigs/Scaffolds

RefSeq <filename>.rna.gbff usually indicates the chromosome to which each mRNA belongs.

For example,

     source          1..3159
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="10"

(this mRNA belongs to chromosome 10)

Fgenesh++ can work with full-length mRNAs in two ways:

1) if you annotate chromosome sequences (and they are properly named like 'chr22.fa') then Fgenesh++ can map mRNAs only to the chromosomes to which they belong,

2) if you annotate contigs / scaffolds (so that you do not know to which contigs RefSeq mRNAs belong) then Fgenesh++ tries to map mRNAs to all contigs and selects good mappings.

Let us discuss these two cases in detail.


1) if you annotate chromosome sequences (and they are properly named)

Suppose you have chromosome sequences properly named like (for example, for human) chr1.fa, chr2.fa, .., chr22.fa, chrX.fa, chrY.fa (or with some other extension like chr1.seq, etc.).

Human chromosomes are named from 1 to 22 and 'X', 'Y' in human.rna.gbff file, for example:

     source          1..3159
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="10"

To convert 10 (as in the example above) to chr10 in the 'target' field, so that Fgenesh++ knows to map this mRNA to chr10.fa (or chr10.seq, or chr10 with any other extension), run refseq_mrnas.pl with the option -chr_prefix:chr (the prefix 'chr' is added to '10', giving 'chr10'),

for example:

./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'  -chr_prefix:chr

Now for this mRNA we have "target = chr10" in defline in FASTA file with mRNAs.

Or, for 'Saccharomyces cerevisiae S288c' chromosomes are named from I to XVI in fungi.rna.gbff, for example:

     source          1..1848
                     /organism="Saccharomyces cerevisiae S288c"
                     /mol_type="mRNA"
                     /strain="S288c"
                     /db_xref="taxon:559292"
                     /chromosome="III"

So, if your files with S. cerevisiae chromosomes are I.seq, II.seq, .., XVI.seq then run refseq_mrnas.pl with the option -chr_prefix: (empty prefix) or -chr_mode (the empty prefix is added to 'III', giving 'III'),

for example:

./refseq_mrnas.pl  fungi.rna.gbff  -org:'Saccharomyces cerevisiae S288c'  -chr_prefix: 
./refseq_mrnas.pl  fungi.rna.gbff  -org:'Saccharomyces cerevisiae S288c'  -chr_mode    

Now for this mRNA we have "target = III" in defline in FASTA file with mRNAs.
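
After running refseq_mrnas.pl in chromosome mode it is worth confirming that every 'target' value in the mRNA file has a matching chromosome file. A minimal sketch under the assumption, taken from the description above, that a target such as chr10 corresponds to a file named chr10 with any extension. How Fgenesh++ matches names internally is not shown here, and the function name missing_targets is hypothetical.

```shell
# missing_targets <mrna_fasta> <chr_dir>
# Prints every "target" value (other than "na") for which <chr_dir>
# has no file named <target>.<any extension>. Hypothetical helper.
missing_targets() {
  awk '
    /^>/ && match($0, /target[ \t]*=[ \t]*[^ \t]+[ \t]*$/) {
      t = substr($0, RSTART, RLENGTH)
      sub(/^target[ \t]*=[ \t]*/, "", t)
      sub(/[ \t]+$/, "", t)
      if (t != "na") print t
    }
  ' "$1" | sort -u |
  while IFS= read -r mt_target; do
    mt_found=0
    for mt_file in "$2/$mt_target".*; do
      if [ -e "$mt_file" ]; then mt_found=1; break; fi
    done
    [ "$mt_found" -eq 1 ] || echo "$mt_target"
  done
}
```

An empty output means that each target used in the deflines has a corresponding chromosome file in the given directory.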


2) if you annotate contigs / scaffolds (so that you do not know to which contigs RefSeq mRNAs belong)

run refseq_mrnas.pl with no -chr_prefix: or -chr_mode options,

for example:

./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'

Now for all mRNAs we have "target = na" in deflines in FASTA file.

Fgenesh++ maps such mRNAs to all contigs and selects good mappings.


Help (options and examples)

Run ./refseq_mrnas.pl without parameters for help.

Extraction of full-length mRNAs from RefSeq for a specified organism, v2.04 (Softberry).

Usage: ./refseq_mrnas.pl  <refseq .rna.gbff>  -org:<species name>

where

  <refseq .rna.gbff>    - RefSeq .rna.gbff file name

  -org:<species name>   - organism for which to select mRNAs

                          use species name from the 'ORGANISM' field of RefSeq entries,
                          for example:

                          ORGANISM  Homo sapiens
                          ORGANISM  Arabidopsis thaliana
                          ORGANISM  Saccharomyces cerevisiae S288c

Options:

  -chr_mode             - assumes that target sequences (where mRNAs will be mapped to) are
                          chromosomes and they are properly named (like 'chr22.fa')

                          in this case for each mRNA  field 'target = <chr>' is added to
                          mRNA defline (and Fgenesh++ will map that mRNA only to chromosome <chr>)

                          if target sequences are NOT chromosomes or if we want to map each mRNA
                          to all target sequences, do not use -chr_mode or -chr_prefix:

                          (in this case field 'target = na' is added to mRNA defline 
                           and Fgenesh++ will map that mRNA to all target sequences)


  -chr_prefix:<chr>     - prefix for chromosomes (for example, 'chr', 'Chr' or even '')

                          for example, if chromosome codes in RefSeq are 1, 2, .., 22, X, Y and you
                          have chromosome sequences in files chr1.fa, chr2.fa, ..., chrX.fa, chrY.fa 
                          then use -chr_prefix:chr

                          (automatically applies -chr_mode)


  -prefix:<prefix>      - prefix for generated files 
                         <.mrna.fa>, <.mrna.err.fa> and <.protein.fa> (optionally)

                         by default, prefix is 'refseq_Nm' where 'Nm' are the first letters 
                         of species name (for example, 'refseq_Hs' for 'Homo sapiens')


  -ids:<NM|XM|all>      - select mRNAs with ID prefixes NM_, XM_ or both (NM by default)

                          Note: -ids:all or -ids:XM automatically sets -status:all

  -status:<[RVP]|all>   - select mRNAs with this status (RVP by default)
                          (R - REVIEWED, V - VALIDATED, P - PROVISIONAL), or

                          all - status is NOT checked which allows to select entries with 
                                any status (including also PREDICTED, INFERRED, MODEL)

  -get_prot_from  <.protein.faa>

                        - for extracted mRNAs also make <.protein.fa> file with proteins 
                          from Refseq <.protein.faa> file

  -check_mrna:<0|1>     - check mRNA annotation (1 by default) and skip mRNA if errors are found
                          checks for:

                          right ATG and STOP codon coordinates
                          ORF length is dividable by 3
                          protein translated from mRNA does not contain internal '*'

  -no_check             - same as -check_mrna:0

  -h, -help             - print this help


Examples:

./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'                                                          > Hs.log        2> Hs.err
./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'  -chr_prefix:chr               -prefix:refseq_human      > human.log     2> human.err
./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'  -chr_prefix:chr  -ids:all     -prefix:refseq_human_all  > human_all.log 2> human_all.err
./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'  -chr_prefix:chr  -status:all                            > Hs.log        2> Hs.err
./refseq_mrnas.pl  human.rna.gbff  -org:'Homo sapiens'  -chr_prefix:chr  -get_prot_from  human.protein.faa      > Hs.log        2> Hs.err

./refseq_mrnas.pl  cow.rna.gbff    -org:'Bos taurus'    -chr_prefix:chr                                         > Bt.log        2> Bt.err
./refseq_mrnas.pl  cow.rna.gbff    -org:'Bos taurus'    -chr_prefix:chr  -ids:all     -prefix:refseq_Bt_all     > Bt_all.log    2> Bt_all.err

./refseq_mrnas.pl  plant.rna.gbff  -org:'Arabidopsis thaliana'  -chr_prefix:ath                                 > At.log        2> At.err
./refseq_mrnas.pl  plant.rna.gbff  -org:'Arabidopsis thaliana'     -ids:all  -no_check  -prefix:refseq_At_test  > At_test.log   2> At_test.err

./refseq_mrnas.pl  fungi.rna.gbff  -org:'Saccharomyces cerevisiae S288c'  -chr_mode                             > Sc.log        2> Sc.err
./refseq_mrnas.pl  fungi.rna.gbff  -org:'Saccharomyces cerevisiae S288c'  -chr_prefix:                          > Sc.log        2> Sc.err
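
The three checks listed under -check_mrna can be illustrated with a minimal Python sketch. This is not the code of refseq_mrnas.pl. It assumes that CDS coordinates are 1-based and inclusive and that the annotated CDS includes the STOP codon, and it uses the standard genetic code. The function name check_mrna is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def check_mrna(seq, cds_start, cds_end):
    """Return a list of problems found in the CDS annotation (empty if none)."""
    cds = seq[cds_start - 1:cds_end].upper().replace("U", "T")
    errors = []
    if not cds.startswith("ATG"):
        errors.append("CDS does not start with ATG")
    if cds[-3:] not in ("TAA", "TAG", "TGA"):
        errors.append("CDS does not end with a STOP codon")
    if len(cds) % 3 != 0:
        errors.append("ORF length is not divisible by 3")
    # Translate complete codons only; unknown codons (e.g. with N) become 'X'.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )
    if "*" in protein[:-1]:
        errors.append("internal '*' in translated protein")
    return errors
```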

Output files

Script creates the following files:

<filename>.mrna.fa   - mRNAs in FASTA format with deflines in special format
<filename>.mrna.err.fa - mRNAs with errors in RefSeq annotation
<filename>.protein.fa   - RefSeq proteins for extracted mRNAs (optional)

Exact names of output files depend on the -prefix: option.

For example, filenames for human (-org:'Homo sapiens') are

by default (no -prefix: option)

refseq_Hs.mrna.fa
refseq_Hs.mrna.err.fa
refseq_Hs.protein.fa

with -prefix:rf_human

rf_human.mrna.fa
rf_human.mrna.err.fa
rf_human.protein.fa

Once you have created <filename>.mrna.fa, provide its location in the configuration file, for example:

MRNA_FILE = /data/softberry/Human_data/rf_human.mrna.fa      # full-length mRNAs

Other scripts

1) check_mrnas.pl to check that full-length mRNAs are annotated correctly

Example:

./check_mrnas.pl  rf_human.mrna.fa

2) set_mrnas_target.pl to change 'target' in mRNA deflines

Example:

make 'target = na'

./set_mrnas_target.pl  refseq_Hs.mrna.fa  refseq_Hs.mrna.fa.target_na   -set_target na

APPENDIX 3. Protein database.


Preparing NR (non-redundant protein database) or other protein database


BLAST+ / BLAST executables

BLAST+ can be downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html

The legacy BLAST executables can be downloaded from NCBI:

https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/

Use BLAST+ or legacy BLAST release 2.2.13 or later.


NR

NR (non-redundant protein sequence database) can be downloaded from

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

NR or a custom protein database must be present in the directory given in the configuration file in both formats: as a FASTA file and as a database formatted for BLAST by the makeblastdb or formatdb program.

Format protein database for BLAST with one of the following commands:

makeblastdb -in <DB> -dbtype prot  # BLAST+
formatdb    -i  <DB>               # BLAST

For example,

makeblastdb -in nr -dbtype prot  # BLAST+
formatdb    -i  nr               # BLAST

Note: do NOT use -parse_seqids (BLAST+) or -o T (BLAST) option since the pipeline requires a protein database to be formatted without parsing Seq-ids!

After formatting, protein DB will be present in both formats:

nr        (FASTA formatted)

nr.*.phr  (BLAST formatted)
nr.*.pin  (BLAST formatted)
nr.*.psq  (BLAST formatted)

Put the name of FASTA file ('nr' in example above) in configuration file, for example:

PROTEIN_DB       = /data/NR/nr                          # protein DB
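
Before running the pipeline it can be useful to confirm that the database really is present in both formats. The Python sketch below is only an illustration. The function names are hypothetical, and it assumes BLAST files are named either <DB>.<ext> (single volume) or <DB>.<volume>.<ext> (as in the nr.*.phr listing above).

```python
import re


def missing_blast_files(db_name, file_names):
    """Return the BLAST extensions (phr, pin, psq) that have no file for db_name."""
    missing = []
    for ext in ("phr", "pin", "psq"):
        # Matches e.g. 'nr.phr' and 'nr.00.phr', but not 'nr_animals.phr'.
        pattern = re.compile(re.escape(db_name) + r"(\.[^.]+)?\." + ext + "$")
        if not any(pattern.match(name) for name in file_names):
            missing.append(ext)
    return missing


def has_both_formats(db_name, file_names):
    """True if the FASTA file and all three BLAST file types are present."""
    return db_name in file_names and not missing_blast_files(db_name, file_names)
```

A typical call would pass the directory listing, for example has_both_formats("nr", os.listdir("/data/NR")) after importing os.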

Prepare index file for 'nr' (or its subset)

Also, prepare index file with make_nr_indexed.pl script.

For example,

FGENESHPIPE/scripts/make_nr_indexed.pl  -f /data/NR/nr  -i /data/NR/nr.ind

Put the name of index file in configuration file, for example:

PROTEIN_DB_INDEX = /data/NR/nr.ind                      # protein DB index file

NR subsets

Predicting genes using proteins from a protein DB is still the most time-consuming step, since the size of protein databases grows exponentially.

We highly recommend using relevant NR subsets for calculations because the full NR is very large.

Use, for example, the following subsets:

nr_animals - animals (Metazoa) proteins (to annotate animals)
nr_plants  - plant proteins             (to annotate plants)
nr_fungi   - fungi proteins             (to annotate fungi)

Or, even better, prepare more specific subsets that include proteins for species from some taxon level to which a specified organism (that you are going to annotate) belongs.

Also, we recommend excluding from NR proteins that are not relevant for your annotation, for example partial proteins, "unnamed protein product" proteins, hypothetical proteins, and predicted proteins (with 'XP_' or 'PREDICTED' in deflines).
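
A defline-based filter of this kind could look like the Python sketch below. It is a hypothetical helper, not a script from the package, and the list of excluded terms is an assumption derived from the examples above that should be adjusted to your data. Note that an NR record whose defline merges several entries is dropped if any of them matches.

```python
# Assumption: terms taken from the examples in the text; matched case-insensitively.
EXCLUDE_TERMS = ("partial", "unnamed protein product", "hypothetical", "predicted")


def is_relevant(defline):
    """Return False if the defline matches one of the exclusion criteria."""
    if "XP_" in defline:
        return False
    lowered = defline.lower()
    return not any(term in lowered for term in EXCLUDE_TERMS)


def filter_fasta(lines):
    """Yield the lines of FASTA records whose defline passes is_relevant()."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = is_relevant(line)
        if keep:
            yield line
```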


OrthoDB

OrthoDB (or its subsets) can be a good substitute for NR.

If you use OrthoDB make sure that proteins are at least 6 aa long and do not contain STOP '*' codons.

Script skip_short_seq.pl filters out short protein sequences.

Script skip_STOP_seq.pl filters out protein sequences with internal STOP codons.


Non-standard characters in protein sequences

Legacy BLAST does not work with non-standard characters (like J, O) in protein sequences.

Examples of warning messages because of non-standard characters in proteins:

a)
formatdb WARNING: [000.000] Sequence number 4446542 (lcl|22997_nr.01), 1 illegal character was removed: 1 J
>gi|123965153|gb|ABM74351.1| solute carrier family 8 member 3 [Uraeotyphlus cf. malabaricus KR-2007]
SIEVITSQEKEITIRKINGETTTTTIRVWNETVSNLTLMALGSSAPEILLSLVEVCGHGFAAGELGPSTIVGSAAFNMFI
IIAICVYVIPDGETRRIKHLRVFFVTAAWSIFAYIWLYMILAVFSPGIVQVWEGLLTLFFFPICVFLAWVADRRLLFYKY
MHKKYRTDKHRGIMIETEGEHPKGIEMDGKMMNSHFLDGSLVNMEGKEVDESRREMIRILKDJKQKHPEKDLDQLVEMAN
YYALSHQQKSRAFYRIQATRMMTGAGNILKKHAAEQAKKSSSMNEVQIDEPEEFMSKIYFDPCSYQCLENCGAVLLTVVR
RGGDISKTLYVDYKTEDGSANAGADYEFTEGTVVFKSGETQKEFSVGIIDDDIF

b)
formatdb Selenocysteine (U) at position 48 replaced by X

Script skip_non_standard_seq.pl filters out protein sequences with non-standard characters.
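
The three sequence-level filters mentioned in this appendix (minimum length, internal STOP, non-standard characters) can be prototyped in Python as follows. This sketch does not reproduce the skip_*.pl scripts. It assumes that only J, O and U count as non-standard (the letters named above) and that a trailing '*' is a terminator rather than an internal STOP.

```python
NON_STANDARD = set("JOU")  # assumption: only the letters named in the text
MIN_LENGTH = 6             # proteins must be at least 6 aa long


def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line, []
        elif line:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def rejection_reason(seq):
    """Return why a protein should be skipped, or None if it passes."""
    body = seq.rstrip("*")  # assumption: trailing '*' is a terminator
    if len(body) < MIN_LENGTH:
        return "short"
    if "*" in body:
        return "stop"
    if NON_STANDARD & set(body.upper()):
        return "non-standard"
    return None
```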


Commands for 'nr_animals'

As an example, for 'nr_animals' subset commands look like:

format protein database for BLAST / BLAST+

makeblastdb -in nr_animals -dbtype prot  # BLAST+
formatdb    -i  nr_animals               # BLAST

make index file

FGENESHPIPE/scripts/make_nr_indexed.pl  -f /data/NR/nr_animals  -i /data/NR/nr_animals.ind

put lines in config file

PROTEIN_DB       = /data/NR/nr_animals                  # protein DB
PROTEIN_DB_INDEX = /data/NR/nr_animals.ind              # protein DB index file

Commands for 'nr_plants'

As another example, for 'nr_plants' commands look similar:

format protein database for BLAST / BLAST+

makeblastdb -in nr_plants -dbtype prot  # BLAST+
formatdb    -i  nr_plants               # BLAST

make index file

FGENESHPIPE/scripts/make_nr_indexed.pl  -f /data/NR/nr_plants  -i /data/NR/nr_plants.ind

put lines in config file

PROTEIN_DB       = /data/NR/nr_plants                   # protein DB
PROTEIN_DB_INDEX = /data/NR/nr_plants.ind               # protein DB index file

APPENDIX 4. Fgenesh++ output format.


Fgenesh++ pipeline output format

Header of Fgenesh++ output for a sequence contains information about that sequence name ('Seq name:' field), sequence length and numbers of predicted genes and exons, for example:

FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
 Time    :   Fri Nov 29 21:47:46 2024
 Seq name: ENm011
 Length of sequence: 7878624, Sequence: 1, File: ENm011.seq.N
 Number of predicted genes 20: in +chain 10, in -chain 10.
 Number of predicted exons 114: in +chain 66, in -chain 48.
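
For downstream scripts, the header fields can be extracted with regular expressions. The Python sketch below is an illustration only. The function name parse_header is hypothetical, and the patterns are based solely on the header example shown above.

```python
import re

COUNT_RE = re.compile(
    r"Number of predicted (genes|exons) (\d+): in \+chain (\d+), in -chain (\d+)"
)
NAME_RE = re.compile(r"Seq name:\s*(\S+)")
LENGTH_RE = re.compile(r"Length of sequence:\s*(\d+)")


def parse_header(header_text):
    """Return sequence name, length and gene/exon counts from a header."""
    info = {}
    name = NAME_RE.search(header_text)
    length = LENGTH_RE.search(header_text)
    if name:
        info["seq_name"] = name.group(1)
    if length:
        info["length"] = int(length.group(1))
    # Each count is stored as (total, plus strand, minus strand).
    for kind, total, plus, minus in COUNT_RE.findall(header_text):
        info[kind] = (int(total), int(plus), int(minus))
    return info
```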

Gene models are described by the following fields:

G Str   Feature   Start        End    Score           ORF           Len

G       - predicted gene number, starting from sequence start
Str     - DNA strand (+ for direct or - for complementary)
Feature - type of coding sequence:

   CDSf - first    coding segment (starting with start codon),
   CDSi - internal coding segment (internal exon),
   CDSl - last     coding segment (ending with stop codon),
   CDSo - one/only coding segment for genes with only one CDS
         (starting with start codon and ending with stop codon);
   UTR5 - untranslated part of an exon (from 5'-end)
   UTR3 - untranslated part of an exon (from 3'-end)
   EXON - untranslated exon

   Number before the Feature is the exon number.

   Note that
   UTR5 and CDSf belong to the same exon (here exon 1), and
   CDSl and UTR3 belong to the same exon (here exon 5):

   G Str   Feature   Start        End   

  20 +    1 UTR5    580523 -    580682  ...
  20 +    1 CDSf    580683 -    580773  ...
  20 +    2 CDSi    581922 -    582019  ...
  20 +    3 CDSi    586551 -    586625  ...
  20 +    4 CDSi    591469 -    591523  ...
  20 +    5 CDSl    592300 -    592355  ...
  20 +    5 UTR3    592356 -    592386  ...
  20 +    6 EXON    594044 -    594127  ...
  20 +    7 EXON    594391 -    594482  ...
  20 +    8 EXON    595161 -    595340  ...
  20 +    9 EXON    595678 -    596015  ...


TSS     - TATA-box (or transcription start site for genes with mRNA support)
PolA    - PolyA signal (or end of transcript for genes with mRNA support)
Start   - start position of the feature
End     - end   position of the feature
Score   - Log likelihood*10 score for the feature
         (or special value for genes with mRNA support)
ORF     - start/end positions where the first complete codon starts 
          and the last codon ends
Len     - length of the Feature

Note that in current Fgenesh++ format features 'TSS' and 'PolA' have several meanings.

For mRNA supported genes:

TSS  - transcription start site
PolA - PolyA site (end of transcript)

For protein supported and ab initio predicted genes:

TSS  - TATA-box (always predicted, even if bad)
PolA - PolyA signal

Predicted gene models (examples)

Fgenesh++ can predict complete as well as incomplete gene models. Incomplete gene models are those in which the initial exon (CDSf, first), the terminal exon (CDSl, last), or both are absent. Note that gene models with only one CDS (CDSo) that has both an ATG and a stop codon are considered complete gene models.
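
The completeness rule above can be written as a small Python check. This is a sketch with a hypothetical function name. It takes the Feature values of one gene as they appear in the output table.

```python
def is_complete(features):
    """Return True if a gene model has CDSo, or both CDSf and CDSl.

    features: Feature names of one gene, e.g. ['CDSf', 'CDSi', 'CDSl'].
    Non-CDS features (TSS, PolA, UTR5, UTR3, EXON) are ignored.
    """
    cds = {name for name in features if name.startswith("CDS")}
    if "CDSo" in cds:
        return True
    return "CDSf" in cds and "CDSl" in cds
```

A gene listed with only CDSi and CDSf features, for example, lacks its last coding segment and is reported as incomplete.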

In Fgenesh++ output, you can distinguish between mRNA supported, protein supported and ab initio gene predictions:

An example of ab initio gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len

   5 +    1 CDSf    114111 -    114191   10.58    114111 -    114191     81
   5 +    2 CDSi    114249 -    114407    9.31    114249 -    114407    159
   5 +    3 CDSi    114571 -    114676    8.12    114571 -    114675    106
   5 +    4 CDSi    114772 -    114905   20.57    114774 -    114905    134
   5 +    5 CDSl    115007 -    115246   18.58    115007 -    115246    240
   5 +      PolA    115316                1.13

Note: the relative strength of an ab initio gene can be found by summing the scores of all its CDS exons.
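
As a worked example, the Python sketch below sums the CDS scores of gene 5 from the table above. It assumes whitespace-separated columns with the Score in the eighth field of CDS lines. The function name gene_strength is hypothetical.

```python
EXAMPLE = """\
   5 +    1 CDSf    114111 -    114191   10.58    114111 -    114191     81
   5 +    2 CDSi    114249 -    114407    9.31    114249 -    114407    159
   5 +    3 CDSi    114571 -    114676    8.12    114571 -    114675    106
   5 +    4 CDSi    114772 -    114905   20.57    114774 -    114905    134
   5 +    5 CDSl    115007 -    115246   18.58    115007 -    115246    240
   5 +      PolA    115316                1.13
"""


def gene_strength(lines):
    """Sum the Score column over CDS lines; TSS and PolA lines are skipped."""
    total = 0.0
    for line in lines:
        fields = line.split()
        if len(fields) > 7 and fields[3].startswith("CDS"):
            total += float(fields[7])
    return total


print(round(gene_strength(EXAMPLE.splitlines()), 2))  # 67.16
```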

An example of mRNA supported gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len   (Hom_x1  Hom_x2  Hom_%  Hom_id)

  14 -      PolA    437594             4728.00                                                    NP_000198.1
  14 -    1 UTR3    437594 -    437666 4728.00                           73    378  -    450   99 NP_000198.1
  14 -    1 CDSl    437667 -    437812 4728.00    437667 -    437810    146    232  -    377   99 NP_000198.1
  14 -    2 CDSf    438600 -    438786 4728.00    438601 -    438786    187     45  -    231  100 NP_000198.1
  14 -    2 UTR5    438787 -    438803 4728.00                           17     28  -     44  100 NP_000198.1
  14 -    3 EXON    438983 -    439009 4728.00                           27      1  -     27   96 NP_000198.1
  14 -      TSS     439009             4728.00                                                    NP_000198.1

Four more fields are added in comparison with ab initio predictions:

Hom_x1,
Hom_x2 - coordinates of that part of mapped mRNA (in bp) that corresponds to predicted exon
Hom_% - homology in % between predicted exon and corresponding part of mapped mRNA
Hom_id - ID of mapped mRNA (or ID of its protein)

An example of protein supported gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len    (Hom_x1  Hom_x2  Hom_%)

   6 +      TSS     116762               -7.48
   6 +    1 CDSf    118349 -    118471  251.85    118349 -    118471    123      1  -     41  100
   6 +    2 CDSi    118634 -    118723  192.68    118634 -    118723     90     42  -     71  100
   6 +    3 CDSi    118846 -    119022  373.98    118846 -    119022    177     72  -    130  100
   6 +    4 CDSl    119271 -    119366  210.07    119271 -    119366     96    131  -    162  100
   6 +      PolA    119473                1.13

Three more fields are added in comparison with ab initio predictions:

Hom_x1,
Hom_x2 - coordinates of that part of homologous protein that corresponds to predicted CDS
Hom_% - homology in % between predicted CDS and corresponding part of homologous protein
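
The three table layouts (ab initio, mRNA supported, protein supported) can be read with one parser. The Python sketch below is a hypothetical helper, not part of the package. It assumes that columns are separated by whitespace, that only CDS features carry the ORF columns, and that Hom_id is present only for mRNA supported genes, as in the examples above.

```python
def parse_feature_line(line):
    """Parse one gene-model line of Fgenesh++ output into a dict."""
    t = line.split()
    rec = {"gene": int(t[0]), "strand": t[1]}
    if t[2] in ("TSS", "PolA"):
        # Signal lines: single position, score, optional Hom_id.
        rec.update(feature=t[2], start=int(t[3]), end=int(t[3]), score=float(t[4]))
        if len(t) > 5:
            rec["hom_id"] = t[5]
        return rec
    rec.update(exon=int(t[2]), feature=t[3], start=int(t[4]), end=int(t[6]),
               score=float(t[7]))
    rest = t[8:]
    if rec["feature"].startswith("CDS"):
        # ORF start - ORF end
        rec["orf"] = (int(rest[0]), int(rest[2]))
        rest = rest[3:]
    rec["len"] = int(rest[0])
    rest = rest[1:]
    if rest:
        # Hom_x1 - Hom_x2, Hom_%, and Hom_id (mRNA supported genes only)
        rec.update(hom_x1=int(rest[0]), hom_x2=int(rest[2]), hom_pct=int(rest[3]))
        if len(rest) > 4:
            rec["hom_id"] = rest[4]
    return rec
```

A gene can then be classified by its records: 'hom_id' present suggests mRNA support, 'hom_x1' without 'hom_id' suggests protein support, and neither suggests an ab initio prediction.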


Predicted and homologous proteins

Predicted proteins are listed at the end of *.res files.

For mRNA supported and protein supported predictions information about a homolog to a predicted protein is given in defline of a predicted protein (after first '##'), for example:

for protein supported prediction (note also 'BY PROTMAP' tag)


>FGENESH:   6   4 exon (s)   118349  -  119366   161 aa, chain + ## BY PROTMAP: gi|119622872|gb|EAX02467.1| troponin I type 2 (skeletal, fast), isoform CRA_a [Homo sapiens] ## query_len = 161  match_len = 161  query_cov = 1.00  match_cov = 1.00  score = 322  pid = 100.0  evalue = 1e-120

for mRNA supported prediction (note also 'BY MRNA_MAP' tag)


>FGENESH:     14   2 exon (s)    437667  -    438786    110 aa, chain -  ## BY MRNA_MAP: NP_000198.1

For ab initio predictions information about homology to a predicted protein is also given in defline of a predicted protein (after first '##'), for example:


>FGENESH:   5   5 exon (s) 114111  - 115246   239 aa, chain + ## NR: gi|94536846|ref|NP_612634.3| synaptotagmin VIII [Homo sapiens] ## query_len = 239  match_len = 401  query_cov = 0.82  match_cov = 0.49  query_x1 = 28  query_x2 = 223  match_x1 = 190  match_x2 = 385  score = 362  pid = 92.0  evalue = 1e-125

If no homologs were found to a predicted protein then defline contains 'no hits' tag, for example:


>FGENESH:   11   7 exon (s) 241806  - 368165   245 aa, chain - ## NR: no hits
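
The deflines of predicted proteins can be split into their parts with a regular expression. The following Python sketch is based only on the four defline examples above. The function name and the 'support' labels are hypothetical, and other defline variants may need a different pattern.

```python
import re

DEFLINE_RE = re.compile(
    r">FGENESH:\s*(\d+)\s+(\d+) exon \(s\)\s+(\d+)\s+-\s+(\d+)\s+(\d+) aa, chain ([+-])"
    r"\s*(?:##\s*(.*))?$"
)


def parse_protein_defline(defline):
    """Split an FGENESH protein defline into gene data, evidence tag and stats."""
    m = DEFLINE_RE.match(defline.strip())
    if m is None:
        raise ValueError("not an FGENESH protein defline")
    gene, exons, start, end, aa, strand, tail = m.groups()
    rec = {"gene": int(gene), "exons": int(exons), "start": int(start),
           "end": int(end), "aa": int(aa), "strand": strand}
    # Text after the first '##' is '<tag>: <hit>'; after the second, 'key = value' pairs.
    parts = [p.strip() for p in (tail or "").split("##")]
    tag, _, hit = parts[0].partition(":")
    rec["tag"], rec["hit"] = tag.strip(), hit.strip()
    if rec["tag"] == "BY MRNA_MAP":
        rec["support"] = "mRNA"
    elif rec["tag"] == "BY PROTMAP":
        rec["support"] = "protein"
    else:
        rec["support"] = "ab initio"
    rec["stats"] = dict(re.findall(r"(\w+) = (\S+)", parts[1])) if len(parts) > 1 else {}
    return rec
```

Values in 'stats' are kept as strings (for example '1e-120') so that the caller can decide how to convert them.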


Fgenesh++ output (example for the whole sequence)

All Fgenesh++ output for sequence ENm011 (file "ENm011.seq.res") is shown below.


FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
 Time    :   Fri Nov 29 21:47:46 2024
 Seq name: ENm011
 Length of sequence: 7878624, Sequence: 1, File: ENm011.seq.N
 Number of predicted genes 20: in +chain 10, in -chain 10.
 Number of predicted exons 114: in +chain 66, in -chain 48.
 Positions of predicted genes and exons: Variant   1 from   1, Score:5.023016 
   G Str   Feature   Start        End    Score           ORF           Len

   1 -      PolA     12632               -6.07
   1 -    1 CDSl     13095 -     13244   24.11     13095 -     13244    150
   1 -    2 CDSi     25482 -     25934   61.84     25482 -     25934    453
   1 -    3 CDSi     26363 -     27134    3.80     26363 -     27133    772
   1 -    4 CDSf     30308 -     30408    1.64     30310 -     30408    101

   2 -      PolA     30570            23320.00                                                    NP_001900.1
   2 -    1 UTR3     30570 -     31317 23320.00                          748   1373  -   2120  100 NP_001900.1
   2 -    1 CDSl     31318 -     31485 23320.00     31318 -     31485    168   1205  -   1372  100 NP_001900.1
   2 -    2 CDSi     31618 -     31716 23320.00     31618 -     31716     99   1106  -   1204  100 NP_001900.1
   2 -    3 CDSi     31809 -     31953 23320.00     31809 -     31952    145    961  -   1105  100 NP_001900.1
   2 -    4 CDSi     32721 -     32843 23320.00     32723 -     32842    123    838  -    960  100 NP_001900.1
   2 -    5 CDSi     35139 -     35371 23320.00     35141 -     35371    233    605  -    837  100 NP_001900.1
   2 -    6 CDSi     36784 -     36902 23320.00     36784 -     36900    119    486  -    604  100 NP_001900.1
   2 -    7 CDSi     37331 -     37454 23320.00     37332 -     37454    124    362  -    485  100 NP_001900.1
   2 -    8 CDSi     39124 -     39283 23320.00     39124 -     39282    160    202  -    361  100 NP_001900.1
   2 -    9 CDSf     41607 -     41674 23320.00     41609 -     41674     68    134  -    201  100 NP_001900.1
   2 -    9 UTR5     41675 -     41807 23320.00                          133      1  -    133  100 NP_001900.1
   2 -      TSS      41807            23320.00                                                    NP_001900.1

   3 -      PolA     80566               -1.07
   3 -    1 CDSl     80704 -     81029   30.41     80704 -     81027    326
   3 -    2 CDSi     81123 -     81249    6.09     81124 -     81249    127
   3 -    3 CDSf     81617 -     81910   28.98     81617 -     81910    294
   3 -      TSS      84211               -8.28

   4 +      TSS     110820               -3.68
   4 +    1 CDSo    112921 -    113397  804.88    112921 -    113397    477      1  -    159  100

   5 +    1 CDSf    114111 -    114191   10.58    114111 -    114191     81
   5 +    2 CDSi    114249 -    114407    9.31    114249 -    114407    159
   5 +    3 CDSi    114571 -    114676    8.12    114571 -    114675    106
   5 +    4 CDSi    114772 -    114905   20.57    114774 -    114905    134
   5 +    5 CDSl    115007 -    115246   18.58    115007 -    115246    240
   5 +      PolA    115316                1.13

   6 +      TSS     116762               -7.48
   6 +    1 CDSf    118349 -    118471  251.85    118349 -    118471    123      1  -     41  100
   6 +    2 CDSi    118634 -    118723  192.68    118634 -    118723     90     42  -     71  100
   6 +    3 CDSi    118846 -    119022  373.98    118846 -    119022    177     72  -    130  100
   6 +    4 CDSl    119271 -    119366  210.07    119271 -    119366     96    131  -    162  100
   6 +      PolA    119473                1.13

   7 +      TSS     130773               -8.18
   7 +    1 CDSf    130960 -    131012  107.56    130960 -    131010     53      1  -     17  100
   7 +    2 CDSi    157902 -    158039  246.88    157903 -    158037    138     19  -     63  100
   7 +    3 CDSi    159247 -    159411  324.48    159248 -    159409    165     65  -    118  100
   7 +    4 CDSi    161234 -    161375  264.90    161235 -    161375    142    120  -    166  100
   7 +    5 CDSi    161747 -    161839  172.04    161747 -    161839     93    167  -    197  100
   7 +    6 CDSi    162095 -    162138   82.14    162095 -    162136     44    198  -    211  100
   7 +    7 CDSi    162315 -    162396  147.97    162316 -    162396     82    213  -    239  100
   7 +    8 CDSi    164547 -    164681  257.58    164547 -    164681    135    240  -    284  100
   7 +    9 CDSi    165052 -    165129  155.82    165052 -    165129     78    285  -    310  100
   7 +   10 CDSl    165289 -    165378  169.21    165289 -    165378     90    311  -    340  100

   8 -      PolA    165912                1.13
   8 -    1 CDSo    166960 -    168378   70.26    166960 -    168378   1419
   8 -      TSS     170023               -5.88

   9 +      TSS     186853               -4.08
   9 +    1 CDSf    187635 -    187717    1.09    187635 -    187715     83
   9 +    2 CDSi    197244 -    197388    3.93    197245 -    197388    145
   9 +    3 CDSi    200672 -    200706    6.28    200672 -    200704     35
   9 +    4 CDSi    210284 -    210329    5.65    210285 -    210329     46
   9 +    5 CDSi    211536 -    211652   19.77    211536 -    211652    117
   9 +    6 CDSi    211746 -    211823    0.75    211746 -    211823     78
   9 +    7 CDSi    212147 -    212260   13.22    212147 -    212260    114
   9 +    8 CDSi    212361 -    212470   12.61    212361 -    212468    110
   9 +    9 CDSi    212644 -    212734   17.26    212645 -    212734     91
   9 +   10 CDSi    214778 -    214818   -0.14    214778 -    214816     41
   9 +   11 CDSl    216253 -    216307   -3.11    216254 -    216307     55
   9 +      PolA    216498                1.13

  10 +      TSS     223577               -2.48
  10 +    1 CDSf    225178 -    225194   49.30    225178 -    225192     17      1  -      5  100
  10 +    2 CDSi    228714 -    228836  243.90    228715 -    228834    123      7  -     46  100
  10 +    3 CDSi    229942 -    230024  166.14    229943 -    230023     83     48  -     74  100
  10 +    4 CDSi    230597 -    230670  153.93    230599 -    230670     74     76  -     99  100
  10 +    5 CDSl    234071 -    234235  317.41    234071 -    234235    165    100  -    154  100
  10 +      PolA    234391                1.13

  11 -    1 CDSi    241806 -    241896    6.92    241807 -    241896     91
  11 -    2 CDSi    261282 -    261405   -1.03    261282 -    261404    124
  11 -    3 CDSi    273691 -    273813    0.06    273693 -    273812    123
  11 -    4 CDSi    273894 -    274006    3.51    273896 -    274006    113
  11 -    5 CDSi    274102 -    274236    7.12    274102 -    274236    135
  11 -    6 CDSi    310521 -    310600    1.42    310521 -    310598     80
  11 -    7 CDSf    368096 -    368165   10.57    368097 -    368165     70
  11 -      TSS     368880               -5.68

  12 -      PolA    410320                1.13
  12 -    1 CDSl    410802 -    411038  430.65    410802 -    411038    237    181  -    103  100
  12 -    2 CDSi    411332 -    411480  266.30    411332 -    411478    149    102  -     54  100
  12 -    3 CDSf    413182 -    413338  310.28    413183 -    413338    157     52  -      1  100

  13 +    1 CDSf    418459 -    418762  539.77    418459 -    418761    304      1  -    101  100
  13 +    2 CDSl    424060 -    424262  327.95    424062 -    424262    203    103  -    169  100
  13 +      PolA    424883               -5.47

  14 -      PolA    437594             4728.00                                                    NP_000198.1
  14 -    1 UTR3    437594 -    437666 4728.00                           73    378  -    450   99 NP_000198.1
  14 -    1 CDSl    437667 -    437812 4728.00    437667 -    437810    146    232  -    377   99 NP_000198.1
  14 -    2 CDSf    438600 -    438786 4728.00    438601 -    438786    187     45  -    231  100 NP_000198.1
  14 -    2 UTR5    438787 -    438803 4728.00                           17     28  -     44  100 NP_000198.1
  14 -    3 EXON    438983 -    439009 4728.00                           27      1  -     27   96 NP_000198.1
  14 -      TSS     439009             4728.00                                                    NP_000198.1

  15 -      PolA    441761                1.13
  15 -    1 CDSl    442048 -    442207  324.54    442048 -    442206    160    529  -    477  100
  15 -    2 CDSi    443047 -    443180  261.27    443049 -    443180    134    475  -    432  100
  15 -    3 CDSi    443483 -    443578  207.38    443483 -    443578     96    431  -    400  100
  15 -    4 CDSi    443817 -    443873  120.93    443817 -    443873     57    399  -    381  100
  15 -    5 CDSi    444295 -    444364  160.17    444295 -    444363     70    380  -    358  100
  15 -    6 CDSi    444448 -    444583  257.95    444450 -    444581    136    356  -    313  100
  15 -    7 CDSi    444702 -    444847  292.21    444703 -    444846    146    311  -    264  100
  15 -    8 CDSi    445250 -    445300   98.90    445252 -    445299     51    262  -    247  100
  15 -    9 CDSi    445681 -    445748  141.80    445683 -    445748     68    245  -    224  100
  15 -   10 CDSi    445906 -    445994  159.40    445906 -    445992     89    223  -    195  100
  15 -   11 CDSi    446306 -    446480  334.93    446307 -    446480    175    193  -    136  100
  15 -   12 CDSi    447465 -    447686  432.35    447465 -    447686    222    135  -     62  100
  15 -   13 CDSi    448505 -    448585  145.64    448505 -    448585     81     61  -     35  100
  15 -   14 CDSf    449500 -    449601  218.47    449500 -    449601    102     34  -      1  100

  16 -      PolA    546531                1.13
  16 -    1 CDSo    547566 -    548147 1202.02    547566 -    548147    582    194  -      1  100

  17 -    1 CDSl    577271 -    577559  351.28    577271 -    577558    289    133  -     50  100
  17 -    2 CDSi    578372 -    578428   63.99    578374 -    578427     57     36  -     19  100
  17 -    3 CDSf    579571 -    579623   92.87    579573 -    579623     53     17  -      1  100

  18 +      TSS     579828            15136.00                                                    NP_620591.1
  18 +    1 UTR5    579828 -    579964 15136.00                          137      1  -    137  100 NP_620591.1
  18 +    1 CDSf    579965 -    580030 15136.00    579965 -    580030     66    138  -    203  100 NP_620591.1
  18 +    2 CDSi    580659 -    580773 15136.00    580659 -    580772    115    204  -    318  100 NP_620591.1
  18 +    3 CDSi    581922 -    582019 15136.00    581924 -    582019     98    319  -    416  100 NP_620591.1
  18 +    4 CDSi    586551 -    586625 15136.00    586551 -    586625     75    417  -    491  100 NP_620591.1
  18 +    5 CDSi    591469 -    591570 15136.00    591469 -    591570    102    492  -    593  100 NP_620591.1
  18 +    6 CDSi    592300 -    592386 15136.00    592300 -    592386     87    594  -    680  100 NP_620591.1
  18 +    7 CDSi    594044 -    594127 15136.00    594044 -    594127     84    681  -    764  100 NP_620591.1
  18 +    8 CDSi    594391 -    594482 15136.00    594391 -    594480     92    765  -    856  100 NP_620591.1
  18 +    9 CDSi    595159 -    595340 15136.00    595160 -    595339    182    857  -   1038  100 NP_620591.1
  18 +   10 CDSl    595678 -    595739 15136.00    595680 -    595739     62   1039  -   1100  100 NP_620591.1
  18 +   10 UTR3    595740 -    596015 15136.00                          276   1101  -   1376  100 NP_620591.1
  18 +      PolA    596015            15136.00                                                    NP_620591.1

  19 +      TSS     580523            14286.00                                                    NP_005696.1
  19 +    1 UTR5    580523 -    580682 14286.00                          160      1  -    160   99 NP_005696.1
  19 +    1 CDSf    580683 -    580773 14286.00    580683 -    580772     91    161  -    251   99 NP_005696.1
  19 +    2 CDSi    581922 -    582019 14286.00    581924 -    582019     98    252  -    349  100 NP_005696.1
  19 +    3 CDSi    586551 -    586625 14286.00    586551 -    586625     75    350  -    424   99 NP_005696.1
  19 +    4 CDSi    591469 -    591570 14286.00    591469 -    591570    102    425  -    526  100 NP_005696.1
  19 +    5 CDSi    592300 -    592386 14286.00    592300 -    592386     87    527  -    613  100 NP_005696.1
  19 +    6 CDSi    594044 -    594127 14286.00    594044 -    594127     84    614  -    697  100 NP_005696.1
  19 +    7 CDSi    594391 -    594482 14286.00    594391 -    594480     92    698  -    789  100 NP_005696.1
  19 +    8 CDSi    595159 -    595340 14286.00    595160 -    595339    182    790  -    971  100 NP_005696.1
  19 +    9 CDSl    595678 -    595739 14286.00    595680 -    595739     62    972  -   1033  100 NP_005696.1
  19 +    9 UTR3    595740 -    596015 14286.00                          276   1034  -   1309  100 NP_005696.1
  19 +      PolA    596015            14286.00                                                    NP_005696.1

  20 +      TSS     580523            13749.00                                                    NP_620593.1
  20 +    1 UTR5    580523 -    580682 13749.00                          160      1  -    160   99 NP_620593.1
  20 +    1 CDSf    580683 -    580773 13749.00    580683 -    580772     91    161  -    251   99 NP_620593.1
  20 +    2 CDSi    581922 -    582019 13749.00    581924 -    582019     98    252  -    349  100 NP_620593.1
  20 +    3 CDSi    586551 -    586625 13749.00    586551 -    586625     75    350  -    424  100 NP_620593.1
  20 +    4 CDSi    591469 -    591523 13749.00    591469 -    591522     55    425  -    479  100 NP_620593.1
  20 +    5 CDSl    592300 -    592355 13749.00    592302 -    592355     56    480  -    535  100 NP_620593.1
  20 +    5 UTR3    592356 -    592386 13749.00                           31    536  -    566  100 NP_620593.1
  20 +    6 EXON    594044 -    594127 13749.00                           84    567  -    650  100 NP_620593.1
  20 +    7 EXON    594391 -    594482 13749.00                           92    651  -    742  100 NP_620593.1
  20 +    8 EXON    595161 -    595340 13749.00                          180    743  -    922  100 NP_620593.1
  20 +    9 EXON    595678 -    596015 13749.00                          338    923  -   1260  100 NP_620593.1
  20 +      PolA    596015            13749.00                                                    NP_620593.1

Predicted protein(s):
>FGENESH:   1   4 exon (s)  13095  -  30408   491 aa, chain - ## NR: gi|40254450|ref|NP_003632.2| interferon induced transmembrane protein 1 (9-27) [Homo sapiens] ## query_len = 491  match_len = 125  query_cov = 0.18  match_cov = 0.64  query_x1 = 379  query_x2 = 467  match_x1 = 8  match_x2 = 87  score = 81  pid = 44.0  evalue = 6e-18
MGSSLVGQRGLTCSQVWSSSRQDVALAGVSGCLVVAASEAPTLLAGGGGGGEVAVPAAAV
SQALGCSRAAGLALRRERRRRGRQRRRRRAPRAPHPLKVGPLGGGSVDRQRPAGRRKDAR
TEAQGVRPRAEESLVGEEAPSGGLRQTRMGLGAAEPRPQTPRRAPHGSWERGVAQSASDL
VPRWRSVHFTACYCASMYGPGVRGSGQVCALGVPRLEVGGGAHELETPHLCRFLVLVAEP
ASPCVGSCLLVNGSERARGCAPGSGEPAVQPAVEPVAVPVAVSGVRGRRGQAQGPGQCPA
PLGDPASTTDGAQEARVPLDGAFWIPRPPAGSPKGCFACVSKPPALQAPAAPAPEPSASP
PMAPTLFPMESKSSKTDSVRAAGAPPACKHLAEKKTMTNPTTVIEVYPDTTEVNDYYLWS
IFNFVYLNFCCLGFIALAYSLKVRDKKLLNDLNGAVEDAKTARLFNITSSALAASCIILV
FIFLRYPLTDY
>FGENESH:     2   9 exon (s)     31318  -     41674    412 aa, chain -  ## BY MRNA_MAP: NP_001900.1
MQPSSLLPLALCLLAAPASALVRIPLHKFTSIRRTMSEVGGSVEDLIAKGPVSKYSQAVP
AVTEGPIPEVLKNYMDAQYYGEIGIGTPPQCFTVVFDTGSSNLWVPSIHCKLLDIACWIH
HKYNSDKSSTYVKNGTSFDIHYGSGSLSGYLSQDTVSVPCQSASSASALGGVKVERQVFG
EATKQPGITFIAAKFDGILGMAYPRISVNNVLPVFDNLMQQKLVDQNIFSFYLSRDPDAQ
PGGELMLGGTDSKYYKGSLSYLNVTRKAYWQVHLDQVEVASGLTLCKEGCEAIVDTGTSL
MVGPVDEVRELQKAIGAVPLIQGEYMIPCEKVSTLPAITLKLGGKGYKLSPEDYTLKVSQ
AGKTLCLSGFMGMDIPPPSGPLWILGDVFIGRYYTVFDRDNNRVGFAEAARL
>FGENESH:   3   3 exon (s)  80704  -  81910   248 aa, chain - ## NR: gi|56204817|emb|CAI19051.1| actin, alpha 1, skeletal muscle [Homo sapiens] ## query_len = 248  match_len = 287  query_cov = 1.00  match_cov = 0.99  query_x1 = 2  query_x2 = 248  match_x1 = 4  match_x2 = 287  score = 348  pid = 62.0  evalue = 1e-121
MDDNITLLVIDNTSGMCKASFTDDDAPWAVFPSIVGCPRHQGMTVGMGQKDFYVGGEAQS
ERGILTLKYPIEHGIISSWDDMEKIWHHTFYNELREHPEVATATTSSSLEKSYELPDSQV
IPTGNERFRCPEEPFPPSFRGMKSCGIHETIFNSIMKCDMDILEDQYTNTVLSGGTSMHP
SITHRVQKEITALAPGTMKINVIAPPERKYSAWISGSIPASPSTFQQMWISKQAYDEPGP
SILHRKCF
>FGENESH:   4   1 exon (s)  112921  -  113397   158 aa, chain + ## BY PROTMAP: gi|119622870|gb|EAX02465.1| synaptotagmin VIII, isoform CRA_a [Homo sapiens] ## query_len = 158  match_len = 158  query_cov = 1.00  match_cov = 1.00  score = 330  pid = 100.0  evalue = 1e-123
MGHPPVSPSAPAPAGTTAIPGLIPDLVAGTPCELWDSQEGCGDNPAKWGLQLSTDALSLA
STPGPRWALIAGALAAGVLLVSCLLCAACCCCRRHRKKPRDKESVGLGSARGTTTTHLVR
SGSLLTQSREGLKSRLQSPGQRGEFSPRDGLTPTEAGR
>FGENESH:   5   5 exon (s) 114111  - 115246   239 aa, chain + ## NR: gi|94536846|ref|NP_612634.3| synaptotagmin VIII [Homo sapiens] ## query_len = 239  match_len = 401  query_cov = 0.82  match_cov = 0.49  query_x1 = 28  query_x2 = 223  match_x1 = 190  match_x2 = 385  score = 362  pid = 92.0  evalue = 1e-125
MVGWVGLDGWMGLGWVGLGSWVGLGSWAELPGATLQVQLFNFKRFSGHEPLGELRLPLGT
VDLQHVLEHWYLLGPPAATQPEQVGELCFSLRYVPSSGRLTVVVLEARGLRPGLAEPYVK
VQLMLNQRKWKKRKTATKKGTAAPYFNEAFTFLVPFSQVQNVDLVLAVWDRSLPLRTEPV
GKVHLGARASGQPLQHWADMLAHARRPIAQRHPLRPAREVDRMLALQPRLRLRLPLPHS
>FGENESH:   6   4 exon (s)   118349  -  119366   161 aa, chain + ## BY PROTMAP: gi|119622872|gb|EAX02467.1| troponin I type 2 (skeletal, fast), isoform CRA_a [Homo sapiens] ## query_len = 161  match_len = 161  query_cov = 1.00  match_cov = 1.00  score = 322  pid = 100.0  evalue = 1e-120
MLQIAATELEKEESRREAEKQNYLAEHCPPLHIPGSMSEVQELCKQLHAKIDAAEEEKYD
MEVRVQKTSKELEDMNQKLFDLRGKFKRPPLRRVRMSADAMLKALLGSKHKVCMDLRANL
KQVKKEDTEKERDLRDVGDWRKNIEEKSGMEGRKKMFESES
>FGENESH:   7  10 exon (s)  130960  -  165378   339 aa, chain + ## BY PROTMAP: gi|10880979|ref|NP_002330.1| lymphocyte-specific protein 1 isoform 1 [Homo sapiens] ## query_len = 339  match_len = 339  query_cov = 1.00  match_cov = 1.00  score = 687  pid = 100.0  evalue = 0
MAEASSDPGAEEREELLGPTAQWSVEDEEEAVHEQCQHERDRQLQAQDEEGGGHVPERPK
QEMLLSLKPSEAPELDEDEGFGDWSQRPEQRQQHEGAQGALDSGEPPQCRSPEGEQEDRP
GLHAYEKEDSDEVHLEELSLSKEGPGPEDTVQDNLGAAGAEEEQEEHQKCQQPRTPSPLV
LEGTIEQSSPPLSPTTKLIDRTESLNRSIEKSNSVKKSQPDLPISKIDQWLEQYTQAIET
AGRTPKLARQASIELPSMAVASTKSRWETGEVQAQSAAKTPSCKDIVAGDMSKKSLWEQK
GGSKTSSTIKSTPSGKRYKFVATGHGKYEKVLVEGGPAP
>FGENESH:   8   1 exon (s) 166960  - 168378   472 aa, chain - ## NR: no hits
MAPEVCGPSLQGTPGPPPPLLPKPGKDNLRLQKLLRKAARKKMMGGTHLAPPRAFRTSLS
PVSEASHDQEVTAPHAAEGPHPAEAPRLPEAPRPAEAPRMVAALPRSPHTPIIHHVASPL
QKSTFSIGLTQRRILAAQFRAMGPQVVASAPEPTRPPSGFVPVSGGGGTHVTQVHIQLAP
SPHNGTPEPPRTAPEVGSNSQDGDATPSPPRAQPLVPVAHIRPLPTTVQAASPLPEEPPV
PRPPPGFQASVPREASARVVVPIAPTCRSLESSPHSLVPMGPGREHLEEPPMAGPAAEAE
RVSSPAWASSPTPPSGPHPCPVPKVAPKPRLSGWTWLKKQLLEEAPEPPCPEPRQSLEPE
VPTPTEQEVPAPTEQEVPALTAPRAPASRTSRMWDAVLYRMSVAEAQGRLAGPSGGEHTP
ASLTRLPFLYRPRFNARKLQEATRPPPTVRSILELSPQPKNFNRTATGWRLQ
>FGENESH:   9  11 exon (s) 187635  - 216307   304 aa, chain + ## NR: gi|119570954|gb|EAW50569.1| troponin T type 3 (skeletal, fast) [Homo sapiens] ## query_len = 304  match_len = 257  query_cov = 0.71  match_cov = 0.84  query_x1 = 89  query_x2 = 304  match_x1 = 42  match_x2 = 257  score = 302  pid = 73.0  evalue = 1e-103
MASSELRALCSSFVVESSALLAGTFHHNPSRMYPLLGVKIAADTSWRRREDLASFSKHVL
CILASFSLLLWPFLYLKPPTFTMSDEEVLTAPKIPEGEKVDFDDIQKKRQNKDLMELQAL
IDSHFEARKKEEEELVALKERIEKRRAERAEQQRIRAEKERERQNRLAEEKARREEEDAK
RRAEDDLKKKKALSSMGANYSSYLAKADQKRGKKQTAREMKKKILAERRKPLNIDHLGED
KLRDKAKELWETLHQLEIDKFEFGEKLKRQKYDITTLRSRIDQAQKHSKKAGTPAKGKVG
GRWK
>FGENESH:   10   5 exon (s)   225178  -  234235   153 aa, chain + ## BY PROTMAP: gi|68566020|sp|Q16540|RM23_HUMAN Mitochondrial 39S ribosomal protein L23 (L23mt) (MRP-L23) (L23 mitochondrial-related protein) (Ribosomal protein L23-like) ## query_len = 153  match_len = 153  query_cov = 1.00  match_cov = 1.00  score = 315  pid = 100.0  evalue = 1e-117
MARNVVYPLYRLGGPQLRVFRTNFFIQLVRPGVAQPEDTVQFRIPMEMTRVDLRNYLEGI
YNVPVAAVRTRVQHGSNKRRDHRNVRIKKPDYKVAYVQLAHGQTFTFPDLFPEKDESPEG
SAADDLYSMLEEERQQRQSSDPRRGGVPSWFGL
>FGENESH:   11   7 exon (s) 241806  - 368165   245 aa, chain - ## NR: no hits
MPALPMAFMRTGAPFKLRQLLATAGLLQPSGPNQSAGSSKGVQEEGSPAEPGLQLGWTCP
PAAEGQDARSGGRDKQDMTWSGVTARTEEARPAFLNTLGWWGCGKKRVCFFTSSTESAHY
GCPLGSQNPQHERNGATQLKPGPLNPDTKPSSLEMNMLHFTTTALPDSGIGSGRYPRQPL
PPQRDPLGLTLPWAQGRARDGPWEKDGFMGRWPGQWAQAKAGSADVPNVTTAGGGTAMWV
ARGTP
>FGENESH:   12   3 exon (s)   410802  -   413338   180 aa, chain - ## BY PROTMAP: gi|4504609|ref|NP_000603.1| insulin-like growth factor 2 [Homo sapiens] ## query_len = 180  match_len = 180  query_cov = 1.00  match_cov = 1.00  score = 373  pid = 100.0  evalue = 1e-139
MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVS
RRSRGIVEECCFRSCDLALLETYCATPAKSERDVSTPPTVLPDNFPRYPVGKFFQYDTWK
QSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLIALPTQDPAHGGAPPEMASNRK
>FGENESH:   13   2 exon (s)   418459  -  424262   168 aa, chain + ## BY PROTMAP: gi|98986327|ref|NP_057496.2| insulin-like growth factor 2 antisense [Homo sapiens] ## query_len = 168  match_len = 168  query_cov = 1.00  match_cov = 1.00  score = 352  pid = 100.0  evalue = 1e-131
MSKRKWRGFRGAQQERAQPPAASPQPCPAPHAGLPGGSRRRAPAPAGQQQMRAESRSGAQ
RRRGSARRGAHREAGGCVRGRTRSSGSERSNALWQAVDAAEALALSSPLRRPWDQAQHFT
NPAPFSKGPQSAPPSPPAGRRRRGADLALTPLAGEGHTRWRQPGRPGK
>FGENESH:     14   2 exon (s)    437667  -    438786    110 aa, chain -  ## BY MRNA_MAP: NP_000198.1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
>FGENESH:   15  14 exon (s)   442048  -   449601   528 aa, chain - ## BY PROTMAP: gi|88900501|ref|NP_954986.2| tyrosine hydroxylase isoform a [Homo sapiens] ## query_len = 528  match_len = 528  query_cov = 1.00  match_cov = 1.00  score = 1059  pid = 100.0  evalue = 0
MPTPDATTPQAKGFRRAVSELDAKQAEAIMVRGQGAPGPSLTGSPWPGTAAPAASYTPTP
RSPRFIGRRQSLIEDARKEREAAVAAAAAAVPSEPGDPLEAVAFEEKEGKAVLNLLFSPR
ATKPSALSRAVKVFETFEAKIHHLETRPAQRPRAGGPHLEYFVRLEVRRGDLAALLSGVR
QVSEDVRSPAGPKVPWFPRKVSELDKCHHLVTKFDPDLDLDHPGFSDQVYRQRRKLIAEI
AFQYRHGDPIPRVEYTAEEIATWKEVYTTLKGLYATHACGEHLEAFALLERFSGYREDNI
PQLEDVSRFLKERTGFQLRPVAGLLSARDFLASLAFRVFQCTQYIRHASSPMHSPEPDCC
HELLGHVPMLADRTFAQFSQDIGLASLGASDEEIEKLSTLYWFTVEFGLCKQNGEVKAYG
AGLLSSYGELLHCLSEEPEIRAFDPEAAAVQPYQDQTYQSVYFVSESFSDAKDKLRSYAS
RIQRPFSVKFDPYTLAIDVLDSPQAVRRSLEGVQDELDTLAHALSAIG
>FGENESH:   16   1 exon (s)   547566  -   548147   193 aa, chain - ## BY PROTMAP: gi|4885665|ref|NP_005161.1| achaete-scute complex homolog-like 2 [Homo sapiens] ## query_len = 193  match_len = 193  query_cov = 1.00  match_cov = 1.00  score = 387  pid = 100.0  evalue = 1e-144
MDGGTLPRSAPPAPPVPVGCAARRRPASPELLRCSRRRRPATAETGGGAAAVARRNERER
NRVKLVNLGFQALRQHVPHGGASKKLSKVETLRSAVEYIRALQRLLAEHDAVRNALAGGL
RPQAVRPSAPRGPPGTTPVAASPSRASSSPGRGGSSEPGSPRSAYSSDDSGCEGALSPAE
RELLDFSSWLGGY
>FGENESH:   17   3 exon (s)   577271  -   579623   132 aa, chain - ## BY PROTMAP: gi|20454856|sp|Q9P2W6|CK021_HUMAN Uncharacterized protein C11orf21 ## query_len = 132  match_len = 132  query_cov = 1.00  match_cov = 1.00  score = 263  pid = 91.0  evalue = 2e-97
MGRTWCGMWRRRRPGRRSAVPRWPHLSSQSGVEPPDRCARSLYNVLTHQEAPGSMMPPAA
AQPSAHGALVPPATAHEPVDHPALHWLACCCCLSLPGQLPLAIRLGWDLDLEAGPSSGKL
CPRARRWQPLPS
>FGENESH:     18  10 exon (s)    579965  -    595739    320 aa, chain +  ## BY MRNA_MAP: NP_620591.1
MGPWSRVRVAKCQMLVTCFFILLLGLSVATMVTLTYFGAHFAVIRRASLEKNPYQAVHQW
AFSAGLSLVGLLTLGAVLSAAATVREAQGLMAGGFLCFSLAFCAQVQVVFWRLHSPTQVE
DAMLDTYDLVYEQAMKGTSHVRRQELAAIQDVFLCCGKKSPFSRLGSTEADLCQGEEAAR
EDCLQGIRSFLRTHQQVASSLTSIGLALTVSALLFSSFLWFAIRCGCSLDRKGKYTLTPR
ACGRQPQEPSLLRCSQGGPTHCLHSEAVAIGPRGCSGSLRWLQESDAAPLPLSCHLAAHR
ALQGRSRGGLSGCPERGLSD
>FGENESH:     19   9 exon (s)    580683  -    595739    290 aa, chain +  ## BY MRNA_MAP: NP_005696.1
MVTLTYFGAHFAVIRRASLEKNPYQAVHQWAFSAGLSLVGLLTLGAVLSAAATVREAQGL
MAGGFLCFSLAFCAQVQVVFWRLHSPTQVEDAMLDTYDLVYEQAMKGTSHVRRQELAAIQ
DVFLCCGKKSPFSRLGSTEADLCQGEEAAREDCLQGIRSFLRTHQQVASSLTSIGLALTV
SALLFSSFLWFAIRCGCSLDRKGKYTLTPRACGRQPQEPSLLRCSQGGPTHCLHSEAVAI
GPRGCSGSLRWLQESDAAPLPLSCHLAAHRALQGRSRGGLSGCPERGLSD
>FGENESH:     20   5 exon (s)    580683  -    592355    124 aa, chain +  ## BY MRNA_MAP: NP_620593.1
MVTLTYFGAHFAVIRRASLEKNPYQAVHQWAFSAGLSLVGLLTLGAVLSAAATVREAQGL
MAGGFLCFSLAFCAQVQVVFWRLHSPTQVEDAMLDTYDLVYEQAMKVSVLWEEVSFQPSG
EHRG
//
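
The "Predicted protein(s)" block above is FASTA-like, so it can be read with a few lines of script. The following Python sketch is only an illustration and is not part of the Fgenesh++ package: the function name parse_proteins and the dictionary keys are hypothetical, and the header layout is assumed from the sample lines shown above (gene number, exon count, coordinates, protein length, chain, then an optional "##" annotation).

```python
import re

# Header layout assumed from the sample output, e.g.
# ">FGENESH:  14  2 exon (s)  437667 - 438786  110 aa, chain -  ## BY MRNA_MAP: NP_000198.1"
HEADER_RE = re.compile(
    r"^>FGENESH:\s+(\d+)\s+(\d+)\s+exon \(s\)\s+(\d+)\s+-\s+(\d+)"
    r"\s+(\d+)\s+aa,\s+chain\s+([+-])\s*(?:##\s*(.*))?$"
)

def parse_proteins(text):
    """Return one dict per predicted protein (hypothetical helper)."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">FGENESH:"):
            m = HEADER_RE.match(line)
            if m is None:
                raise ValueError("unrecognized header: " + line)
            gene, exons, start, end, length, strand, note = m.groups()
            # First "##" field: evidence source and hit; second: key = value statistics.
            parts = (note or "").split("##")
            source, _, hit = parts[0].partition(":")
            stats = {}
            if len(parts) > 1:
                stats = dict(re.findall(r"(\w+)\s*=\s*(\S+)", parts[1]))
            current = {
                "gene": int(gene), "exons": int(exons),
                "start": int(start), "end": int(end),
                "length": int(length), "strand": strand,
                "source": source.strip(), "hit": hit.strip(),
                "stats": stats, "seq": "",
            }
            records.append(current)
        elif line == "//":          # end of the record for this sequence
            break
        elif line and current is not None:
            current["seq"] += line  # sequence lines are concatenated
    return records

sample = """\
>FGENESH:     14   2 exon (s)    437667  -    438786    110 aa, chain -  ## BY MRNA_MAP: NP_000198.1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
//
"""
records = parse_proteins(sample)
for r in records:
    print(r["gene"], r["strand"], r["source"], r["hit"], len(r["seq"]) == r["length"])
```

Comparing the length of the collected sequence with the "aa" value in the header is a cheap check that no sequence line was lost during copying.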


Fgenesh++ output if no genes are predicted

If Fgenesh++ does not find any reliable genes (for example, in short sequences), the output looks like this:

FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
 Seq name: small_seq
 Length of sequence: 60
 no reliable predictions
//
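
When results are processed in batch, such records are easy to recognize. The Python sketch below is an illustration only and is not part of the package: the function name summarize_record is hypothetical, and it assumes the "Seq name:", "Length of sequence:" and "no reliable predictions" lines appear exactly as in the example above.

```python
def summarize_record(text):
    """Extract sequence name, length and a 'has predictions' flag (hypothetical helper)."""
    info = {"name": None, "length": None, "has_predictions": True}
    for line in text.splitlines():
        s = line.strip()
        if s.startswith("Seq name:"):
            info["name"] = s.split(":", 1)[1].strip()
        elif s.startswith("Length of sequence:"):
            info["length"] = int(s.split(":", 1)[1])
        elif s == "no reliable predictions":
            info["has_predictions"] = False
    return info

example = """\
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
 Seq name: small_seq
 Length of sequence: 60
 no reliable predictions
//
"""
info = summarize_record(example)
print(info)
```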


Conversion to GFF3 and GenBank format

Fgenesh++ / Fgenesh output can be converted to GFF3 or GenBank format with the following scripts:

fgenesh_2_gff3.pl

fgenesh_2_genbank.pl
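
To show the kind of mapping such a conversion involves, here is a minimal Python sketch that turns one row of the feature table into a GFF3-style line. It does not reproduce the logic, options or exact output of fgenesh_2_gff3.pl, which are not described here. The column order (gene number, chain, exon number, feature, start, end, score) is assumed from the sample rows in this manual; the sequence name "seq1", the "Parent=gene_N" attribute scheme and the function names are hypothetical. The phase column is left as "." although GFF3 requires a phase for CDS features, so a real converter has to compute it.

```python
# Mapping from Fgenesh feature labels to Sequence Ontology terms (an assumption).
SO_TYPES = {
    "UTR5": "five_prime_UTR",
    "UTR3": "three_prime_UTR",
    "EXON": "exon",
    "TSS": "TSS",
    "PolA": "polyA_site",
}

def so_type(feature):
    # CDSf, CDSi, CDSl (first, internal, last coding exon) all become "CDS".
    if feature.startswith("CDS"):
        return "CDS"
    return SO_TYPES.get(feature, feature)

def row_to_gff3(row, seqid, source="FGENESH"):
    """Convert one feature-table row to a tab-separated GFF3-style line."""
    t = row.split()
    if len(t) >= 8 and t[2].isdigit():
        # Exon-like row: gene, chain, exon number, feature, start, "-", end, score
        gene, strand, _exon, feature, start, _dash, end, score = t[:8]
    elif len(t) >= 5:
        # Point feature (TSS, PolA): gene, chain, feature, position, score
        gene, strand, feature, start, score = t[:5]
        end = start
    else:
        raise ValueError("unrecognized row: " + row)
    lo, hi = sorted((int(start), int(end)))  # GFF3 requires start <= end
    fields = [seqid, source, so_type(feature), str(lo), str(hi),
              score, strand, ".", "Parent=gene_" + gene]
    return "\t".join(fields)

cds_row = "  19 +    1 CDSf    580683 -    580773 14286.00    580683 -    580772     91    161  -    251   99 NP_005696.1"
pola_row = "  19 +      PolA    596015            14286.00                                                    NP_005696.1"
cds_line = row_to_gff3(cds_row, "seq1")
pola_line = row_to_gff3(pola_row, "seq1")
print(cds_line)
print(pola_line)
```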



Last updated: March 13, 2026