SoftBerry - fgenesh plus plus

Fgenesh++ is a pipeline for automatic prediction of genes in eukaryotic genomes based on Softberry gene finding software.

This software can NOT be copied or distributed without Softberry license.
Copyright Softberry, Inc. 1999-2015.

This manual includes:

FGENESH++: copyrighted and public software, data required, references
INSTALLATION
RUNNING THE PIPELINE
RUNNING THE PIPELINE (in short)

FGENESH++: copyrighted and public software, data required, references

Fgenesh++ package includes the following copyrighted software:

Fgenesh - most accurate and fastest HMM-based gene prediction program;
Fgenesh+ - gene prediction program that uses similar protein information;
gene finding parameters for Fgenesh and Fgenesh+;
est_map - program for mapping known mRNAs to genome (genome alignment with splice sites identification) and mapping a set of available ESTs to improve the gene prediction accuracy and add 5' and 3'- noncoding sequences;
prot_map - mapping protein database to genomic sequences;
perl scripts.

Note:
in the current release and manual of the pipeline, the programs Fgenesh and Fgenesh+ are referred to by their internal pipeline names, 'ppd' and 'ppdn+', respectively.

Fgenesh++ package uses the following public software and data:

NCBI BLAST+ blastp or BLAST blastall, bl2seq programs;
NCBI NR database (non-redundant protein database);
NCBI RefSeq database.

Fgenesh++ package requires the following analyzed sequences:

genomic sequences;
genomic sequences with repeats masked by N (optionally).

References to Fgenesh++ software:

Solovyev V., Kosarev P., Seledsov I., Vorobyev D. (2006) Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7 (Suppl 1), S10.1-12.

Solovyev V.V. (2002) Finding genes by computer: probabilistic and discriminative approaches. In Current Topics in Computational Biology (eds. T. Jiang, T. Smith, Y. Xu, M. Zhang), The MIT Press, p. 365-401.

Salamov A., Solovyev V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10(4), 516-522.

INSTALLATION

Copy the FGENESHPIPE/ directory from the distribution kit to an appropriate directory.
Make a configuration file (see "Configuration file").

RUNNING THE PIPELINE

An example of running the pipeline is provided for three sequences (nGASP train sequences 1.seq, 2.seq, 3.seq) in EXAMPLE_nGASP/ directory.
There you can find example data, configuration and command files used to run the pipeline.

1. Preparation of sequences

The pipeline works with:

- genomic sequences (not masked);
- genomic sequences with repeats masked by N.

Prepare:

- fasta files with non-masked sequences;
- fasta files with repeats-masked sequences (if available).

When preparing files, put each sequence in a separate fasta file.

Then make lists of:

- files with non-masked sequences;
- files with repeats-masked sequences.

When making such lists, provide full paths to files with sequences.

EXAMPLE

list of files with non-masked sequences (e.g., file 'seq.list'):

/home/EXAMPLE_nGASP/SEQ_train/1.seq
/home/EXAMPLE_nGASP/SEQ_train/2.seq
/home/EXAMPLE_nGASP/SEQ_train/3.seq
...

list of files with repeats-masked sequences (e.g., 'seq_N.list'):

/home/EXAMPLE_nGASP/SEQ_train.N/1.seq.N
/home/EXAMPLE_nGASP/SEQ_train.N/2.seq.N
/home/EXAMPLE_nGASP/SEQ_train.N/3.seq.N
...

Naming convention for files with non-masked and repeats-masked sequences used in the pipeline:

files with non-masked sequences can be named in any way, with or without (one or several) extensions, e.g., filename for non-masked sequence ENm003 could look like 'ENm003.seq', 'ENm003.fa', 'ENm003.hg17.fa', 'ENm003', etc.

To derive filename for repeats-masked sequence, add to the filename of corresponding non-masked sequence some additional extension, e.g., '.N' or '.masked', etc.

Here are some examples of filenames for files with sequences:

non-masked            repeats-masked

1.seq            ->   1.seq.N
ENm003.seq       ->   ENm003.seq.N
ENm003.fa        ->   ENm003.fa.masked
ENm003.hg17.fa   ->   ENm003.hg17.fa.n
ENm003           ->   ENm003.m

Notes:

- genomic sequences can be complete chromosomes, scaffolds, contigs, etc.;
- repeats-masked sequences are recommended (for large genomes such as human) but are not required to run the pipeline;
- when masking repeats with a repeat-masking program (for example, RepeatMasker), we recommend not masking low-complexity regions and simple repeats, since they can be parts of coding sequences.
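
The two list files can be generated rather than typed by hand. The following is a minimal sketch, assuming that non-masked and repeats-masked files are kept in two separate directories and that each masked file is named as the non-masked file plus one extra extension. The helper name 'make_seq_lists' and its arguments are hypothetical and not part of the Fgenesh++ distribution.

```bash
#!/usr/bin/env bash
# Sketch: write two list files with full paths, one for non-masked and one
# for repeats-masked sequences. make_seq_lists is a hypothetical helper.
# Usage: make_seq_lists <dir_nonmasked> <dir_masked> <masked_ext> <out_list> <out_masked_list>
make_seq_lists() {
    local dir_seq dir_masked ext="$3" out="$4" out_masked="$5" f base
    dir_seq=$(cd "$1" && pwd) || return 1        # resolve to a full path
    dir_masked=$(cd "$2" && pwd) || return 1
    : > "$out"                                   # start with empty lists
    : > "$out_masked"
    for f in "$dir_seq"/*; do
        [ -f "$f" ] || continue
        base=$(basename "$f")
        if [ -f "$dir_masked/$base.$ext" ]; then
            # masked name = non-masked name + extra extension
            printf '%s\n' "$f" >> "$out"
            printf '%s\n' "$dir_masked/$base.$ext" >> "$out_masked"
        else
            echo "warning: no masked file for $base, skipped" >&2
        fi
    done
}

# Example (not run here):
# make_seq_lists EXAMPLE_nGASP/SEQ_train EXAMPLE_nGASP/SEQ_train.N N seq.list seq_N.list
```

This sketch skips sequences that have no masked counterpart so that both lists cover the same sequences. That is a choice made for the example, not a requirement stated in this manual.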

2. File with locations of BLAST+ or BLAST programs

Make a file with locations of BLAST+ or BLAST programs.

For example, you can edit
FGENESHPIPE/blast_paths.list

If you use BLAST, provide paths to 'blastall' and 'bl2seq' programs, for example:

# location of blast programs
/home/blast-2.2.26/bin/blastall         ## for blastp
/home/blast-2.2.26/bin/bl2seq           ## for blasting 2 sequences

If you use BLAST+, provide paths to 'blastp' program (on both lines), for example:

# location of blast programs
/home/ncbi-blast-2.2.28+/bin/blastp     ## for blastp
/home/ncbi-blast-2.2.28+/bin/blastp     ## for blasting 2 sequences

or, if you can run it without a path, then:

# location of blast programs
blastp                                  ## for blastp
blastp                                  ## for blasting 2 sequences

You should put location of 'blast_paths.list' in configuration file (see "Configuration file").
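
For BLAST+, the two-line file shown above can also be written by a small script. This is a sketch only: 'write_blast_paths' is a hypothetical helper, and the layout (path followed by a '##' comment) simply copies the examples above. How strictly the pipeline parses this file is not described in this manual.

```bash
#!/usr/bin/env bash
# Sketch: write a blast_paths.list for BLAST+ (same blastp path on both lines).
# write_blast_paths is a hypothetical helper, not part of Fgenesh++.
# Usage: write_blast_paths <path_to_blastp> <output_file>
write_blast_paths() {
    local blastp="$1" out="$2"
    {
        echo "# location of blast programs"
        printf '%-39s ## for blastp\n' "$blastp"
        printf '%-39s ## for blasting 2 sequences\n' "$blastp"
    } > "$out"
}

# Example (not run here): use the blastp found on PATH, if any.
# blastp_path=$(command -v blastp) && write_blast_paths "$blastp_path" blast_paths.list
```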

3. Gene prediction parameters file

Organism-specific gene prediction parameters are usually located in
FGENESHPIPE/EXE_CFG/ directory. For example,

C_elegans_nGASP - gene prediction parameters for C.elegans, used in the example of running the pipeline (EXAMPLE_nGASP/).

4. Pipeline general parameters file

Parameters file contains parameters that regulate gene prediction process.

For a number of parameters you might want to set up values other than those initially provided in the file, e.g., maximal intron length allowed in a gene or minimal length of intron suitable for prediction of genes in introns of other genes. Some parameters may require expertise to manipulate them properly.

We provide parameters files for mammals and for non-mammals:

mammals.par - for mammals
non_mamm.par - for non-mammals

Put path to parameters file in configuration file (see "Configuration file").

5. Configuration file

In configuration file you can indicate which steps of the pipeline to run, provide paths to parameters files, data files or directories and set other options.

Example:
EXAMPLE_nGASP/run/ce.cfg

-----------------------------------------------------------------------

#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters

GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP  # gene prediction parameters
PIPE_PARAM = /home/FGENESHPIPE/non_mamm.par             # location of parameters files


# Mapping known mRNAs

MAP_mRNAs = 1                                           # map known mRNA sequences to genome sequences
CDNA_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.cdna       # *.cdna file for known mRNAs
PROT_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.pro        # *.pro  file for known mRNAs
DAT_FILE  = /home/EXAMPLE_nGASP/DATA/ngasp_1.dat        # *.dat  file for known mRNAs


# Mapping ESTs

MAP_ESTS = 0                                            # map ESTs to genomic sequences
EST_FILE = /home/EXAMPLE_nGASP/DATA/rna_matches.fa      # file with ESTs


# Using reads

USE_READS = 0                                           # use reads info to improve gene models
DIR_SITES = /home/EXAMPLE_nGASP/reads_sites/            # directory with reads *.sites files


# Using known proteins for prediction 
# (predict genes based on homology to known proteins)

USE_PROTEINS = 1                                        # 0 - no, 1 -yes
PROG_PROT    = 1                                        # 1 - use prot_map, 2 - use blast
NUM_THREADS  = 5                                        # number of processors for 'prot_map' or 'blast'
PROTEIN_DB   = /home/NR/nr_ce                           # protein DB
BLAST_PATHS  = /home/FGENESHPIPE/blast_paths.list       # location of "blast_paths.list"


# Predicting genes in long introns of other genes

INTRONIC_GENES = 0                                      # predict genes in long introns of other genes

-----------------------------------------------------------------------

In the line 'GENE_PARAM' provide location of a file with organism-specific gene prediction parameters, for example:

GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP  # gene prediction parameters

In the line 'PIPE_PARAM' provide location of a file with pipeline general parameters. For example,

for mammals - 

PIPE_PARAM = /home/FGENESHPIPE/mammals.par              # location of parameters files

for non-mammals - 

PIPE_PARAM = /home/FGENESHPIPE/non_mamm.par

Then, you can switch ON (1) or OFF (0) each of the pipeline steps:

MAP_mRNAs      = 1                                      # map known mRNA sequences to genome sequences
MAP_ESTS       = 0                                      # map ESTs to genomic sequences
USE_READS      = 0                                      # use reads info to improve gene models
USE_PROTEINS   = 1                                      # 0 - no, 1 -yes
INTRONIC_GENES = 0                                      # predict genes in long introns of other genes

Please note that in the current version of the pipeline either "using ESTs" (MAP_ESTS) or "using reads" (USE_READS) step can be used (switched ON) but not both at the same time.
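
The restriction above can be checked before a run. The sketch below assumes the 'KEY = value  # comment' layout seen in the example configuration file and treats a missing key as switched OFF. 'cfg_value' and 'check_est_reads' are hypothetical helper names, and the pipeline's own parser may behave differently.

```bash
#!/usr/bin/env bash
# Sketch: read switches from a configuration file and refuse MAP_ESTS=1 together
# with USE_READS=1. Both functions are hypothetical helpers.

# cfg_value <config_file> <KEY>: print the value of KEY, or nothing if absent.
cfg_value() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            line = $0
            sub(/#.*/, "", line)                 # drop trailing comment
            n = index(line, "=")
            if (n == 0) next
            k = substr(line, 1, n - 1)
            v = substr(line, n + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim value
            if (k == key) { print v; exit }
        }' "$1"
}

# check_est_reads <config_file>: return non-zero if both steps are ON.
check_est_reads() {
    if [ "$(cfg_value "$1" MAP_ESTS)" = "1" ] && [ "$(cfg_value "$1" USE_READS)" = "1" ]; then
        echo "error: MAP_ESTS and USE_READS are both switched ON in $1" >&2
        return 1
    fi
}

# Example (not run here):
# check_est_reads EXAMPLE_nGASP/run/ce.cfg && echo "configuration looks consistent"
```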

For each step (if it is switched ON) provide necessary data / options:

- MAP_mRNAs

if "mapping known mRNAs" step is used, prepare three files (*.cdna, *.pro, *.dat) from RefSeq for the query organism (see APPENDIX 2) and put their locations on the corresponding lines

- MAP_ESTS

if "using ESTs to improve gene models" step is used, a file with ESTs must be provided

- USE_READS

if "using reads to improve gene models" step is used, a directory with *.sites files (potential splice sites) obtained during mapping of reads to sequences must be provided

- USE_PROTEINS

# PROG_PROT    = 1                                        # 1 - use prot_map, 2 - use blast

"Prediction of genes based on homology to known proteins" step can be run with either 'prot_map' (1) or 'BLAST' (2) method:

"prot_map" (default) method runs prot_map to map proteins from protein database to genomic sequences and selects good quality mappings, then fgenesh+ predicts more refined gene models in regions with good mapped proteins;

"BLAST" (alternative) method first predicts genes ab initio (by fgenesh), then finds homologs to predicted proteins in a database (by BLAST) and then tries to refine gene models (by fgenesh+) using protein homologs found.

We recommend the "prot_map" method: it is more straightforward and gave higher accuracy in our tests, whereas the "BLAST" method relies on heuristics to merge genes that were split during ab initio prediction.

# NUM_THREADS  = 5                                        # number of processors for 'prot_map' or 'blast'

The prot_map program can be run in parallel; put the number of processors / threads for 'prot_map' in the 'NUM_THREADS' line.

# PROTEIN_DB   = /home/NR/nr_ce                           # protein DB

Prepare protein database (NR or its subset or custom protein database) (see APPENDIX 1) and provide its location in 'PROTEIN_DB' line.

# BLAST_PATHS  = /home/FGENESHPIPE/blast_paths.list       # location of "blast_paths.list"

Provide location of 'blast_paths.list' in 'BLAST_PATHS' line.

- AB INITIO

The pipeline always runs the "ab initio gene prediction" step in regions where no genes were predicted by other methods (using known mRNAs or known proteins), so there is nothing to set up for it in the configuration file.

- INTRONIC_GENES

The final optional step, prediction of genes in long introns of other genes, can be switched ON and OFF by setting the corresponding value to 1 or 0.

Lines starting with '#' are comment lines.

Shorter config files

If some steps are switched OFF, you can put "na" or any other string (or leave the value empty) in place of their paths or options.

Also, for convenience, if some steps are not used / switched OFF, you can remove them from configuration file.

For example, this is how configuration file can look if only "predict genes based on homology to known proteins" step is used (and ab initio that is always ON) -

EXAMPLE_nGASP/run/ce_protmap.cfg

#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters

GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP  # gene prediction parameters
PIPE_PARAM = /home/FGENESHPIPE/non_mamm.par             # location of parameters files


# Using known proteins for prediction 
# (predict genes based on homology to known proteins)

USE_PROTEINS = 1                                       # 0 - no, 1 -yes
PROG_PROT    = 1                                       # 1 - use prot_map, 2 - use blast
NUM_THREADS  = 5                                       # number of processors for 'prot_map' or 'blast'
PROTEIN_DB   = /home/NR/nr_ce                          # protein DB
BLAST_PATHS  = /home/FGENESHPIPE/blast_paths.list      # location of "blast_paths.list"

Examples of other config files:

EXAMPLE_nGASP/run/

ce_est.cfg       - 'Mapping ESTs' is switched ON
ce_reads.cfg     - 'Using reads'  is switched ON
ce_protmap.cfg   - 'Using known proteins' only (and 'ab initio' that always runs)

6. Command to run the pipeline

Run 'run_pipe.pl' without parameters for help.

Usage: FGENESHPIPE/run_pipe.pl  <project config file>  -l <seq_list>  [-m <seq_masked_list>]  -d <dir_results>  [-w <dir_tmp>]

where

    <project config file>  - project configuration file
    -l <seq_list>          - list of sequences
    -m <seq_masked_list>   - list of masked sequences (if available)
    -d <dir_results>       - directory to store results
    -w <dir_tmp>           - tmp directory must be unique for each run / process
                            (if not provided, tmp directory is created automatically)

Examples:

FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_1.list  -m ../seq_lists/seq_1N.list  -d results_1/  -w tmp_work_1/
FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_2.list  -m ../seq_lists/seq_2N.list  -d results_2/  -w tmp_work_2/
...

FGENESHPIPE/run_pipe.pl  celegans_prj.cfg  -l seq.list  -d results/

List of non-masked sequences (or any sequences if only one set is available - non-masked or masked) is provided via the -l option. If masked sequences are available, they are provided via the -m option.
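
Because the lists must contain full paths, it can be worth checking them before starting a long run. The following sketch reports entries that are not full paths or that do not exist. 'check_seq_list' is a hypothetical helper and not a pipeline command.

```bash
#!/usr/bin/env bash
# Sketch: verify that every entry of a list file is a full path to an existing file.
# check_seq_list is a hypothetical helper.
# Usage: check_seq_list <list_file>   (returns non-zero if any entry fails)
check_seq_list() {
    local list="$1" path status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue               # ignore empty lines
        case "$path" in
            /*) ;;                               # full path, fine
            *)  echo "not a full path: $path" >&2; status=1; continue ;;
        esac
        [ -f "$path" ] || { echo "missing file: $path" >&2; status=1; }
    done < "$list"
    return "$status"
}

# Example (not run here):
# check_seq_list seq.list && check_seq_list seq_N.list
```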

Directory with results will contain *.resn3 files with Fgenesh++ predictions (predicted gene structures and corresponding proteins in Fgenesh++ format).

If not provided explicitly, unique temporary directory for temporary and intermediate files will be created automatically and cleaned after calculations are finished.

Run for lists of sequences

If there are many sequences, it is better to split them into several lists and run in parallel, for example:

FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_1.list  -m ../seq_lists/seq_1N.list  -d results_1
FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_2.list  -m ../seq_lists/seq_2N.list  -d results_2
FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_3.list  -m ../seq_lists/seq_3N.list  -d results_3
...
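
One way to produce such lists is to deal the lines of a single list out round-robin. The sketch below is an illustration only; 'split_list' is a hypothetical helper. Applying it with the same number of chunks to the non-masked and the masked list keeps corresponding sequences in the same chunk, provided both lists hold the same sequences in the same order.

```bash
#!/usr/bin/env bash
# Sketch: split a list file into N chunk lists, line by line in round-robin order.
# split_list is a hypothetical helper.
# Usage: split_list <list_file> <n_chunks> <out_prefix> <out_suffix>
# Output files are named <out_prefix><chunk number><out_suffix>.
split_list() {
    awk -v n="$2" -v prefix="$3" -v suffix="$4" '
        NF {
            idx = (count++ % n) + 1              # chunk number, 1..n
            out = prefix idx suffix
            print > out
        }' "$1"
}

# Example (not run here): three chunks named as in the commands above.
# split_list seq.list   3 ../seq_lists/seq_ .list
# split_list seq_N.list 3 ../seq_lists/seq_ N.list
```

Round-robin splitting balances the number of sequences per list, not their total length, so chunks can still take different times to finish.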

Directory with results is not created anew or cleaned if it already exists. Therefore, it is possible to write results into the same directory:

FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_1.list  -m ../seq_lists/seq_1N.list  -d results
FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_2.list  -m ../seq_lists/seq_2N.list  -d results
FGENESHPIPE/run_pipe.pl  human_prj.cfg  -l ../seq_lists/seq_3.list  -m ../seq_lists/seq_3N.list  -d results
...

But if the pipeline creates files with same names as ones stored in results directory, new files will overwrite old ones.
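
A possible precaution before sharing one results directory is to look for sequence files with the same file name in different lists. This sketch rests on an assumption that is not stated in this manual, namely that result file names are derived from sequence file names. 'find_duplicate_names' is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: print file names (last path component) that occur more than once
# across the given list files. find_duplicate_names is a hypothetical helper.
# ASSUMPTION: result files are named after sequence files (not confirmed here).
# Usage: find_duplicate_names <list_file>...
find_duplicate_names() {
    cat "$@" | awk 'NF { n = split($0, parts, "/"); print parts[n] }' | sort | uniq -d
}

# Example (not run here):
# find_duplicate_names ../seq_lists/seq_1.list ../seq_lists/seq_2.list
```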

Commands can be written to and run as *.sh files, for example, 'run_1.sh', 'run_2.sh', etc.
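
Such *.sh files can be generated in a loop. The sketch below writes 'run_1.sh', 'run_2.sh', etc. using the list and directory names from the examples above; 'write_run_scripts' is a hypothetical helper, and the paths should be adjusted to your own layout.

```bash
#!/usr/bin/env bash
# Sketch: write run_<i>.sh files, one pipeline command per file.
# write_run_scripts is a hypothetical helper; paths follow the examples above.
# Usage: write_run_scripts <n_chunks> <config_file>
write_run_scripts() {
    local n="$1" cfg="$2" i=1
    while [ "$i" -le "$n" ]; do
        cat > "run_$i.sh" <<EOF
#!/bin/sh
FGENESHPIPE/run_pipe.pl  $cfg  -l ../seq_lists/seq_$i.list  -m ../seq_lists/seq_${i}N.list  -d results_$i  2> run_$i.log
EOF
        chmod +x "run_$i.sh"
        i=$((i + 1))
    done
}

# Example (not run here):
# write_run_scripts 3 human_prj.cfg
```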

Pipeline log output

When running, the pipeline makes some log output on the screen.

This log output is mostly of interest to developers, who can use it to spot errors and bugs, if any. If you want to save log output to a file, run the pipeline with the following command:

(bash, sh)
FGENESHPIPE/run_pipe.pl  ...  2> <log_file>

(tcsh, csh)
FGENESHPIPE/run_pipe.pl  ...  >& <log_file>


Examples:

(bash)
FGENESHPIPE/run_pipe.pl  celegans_prj.cfg  -l seq.list  -d results/  2> run.log

(tcsh)
FGENESHPIPE/run_pipe.pl  celegans_prj.cfg  -l seq.list  -d results/  >& run.log

7. Types of evidence

Fgenesh++ pipeline can deal with several types of evidence:

1) gene models predicted using full-length mRNA sequences

This corresponds to "MAP_mRNAs" block in configuration file.

Sometimes full-length mRNA sequences can be obtained from RefSeq for a specified organism.

See "APPENDIX 2. Preparing known mRNA data for mRNA mapping" for how to get mRNAs from RefSeq and prepare them for Fgenesh++.

If you want to use full-length mRNAs, switch ON 'MAP_mRNAs' block in configuration file.

For example,

MAP_mRNAs = 1                                           # map known mRNA sequences to genome sequences
CDNA_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.cdna       # *.cdna file for known mRNAs
PROT_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.pro        # *.pro  file for known mRNAs
DAT_FILE  = /home/EXAMPLE_nGASP/DATA/ngasp_1.dat        # *.dat  file for known mRNAs

mRNAs are mapped to genomic sequences and good mappings (gene models) are selected. No other type of evidence is used for these mRNA-supported (as we call them) gene models.

2) gene models predicted using known proteins from NR (or its subset, e.g., eukaryotic part of NR)

This corresponds to "USE_PROTEINS" block in configuration file.

"prot_map" (default) method predicts gene models using combination of prot_map and Fgenesh+, with additional selection of reliable models through blast2 alignments between predicted and homologous proteins.

First, 'prot_map' maps known proteins from a protein database (for example, NR or its subset) to genomic sequences and good mappings are selected.

Then 'fgenesh+' program predicts gene models in regions of good mappings using mapped proteins.

After that predicted gene models are additionally checked / filtered by script that analyses blast2 (bl2seq) alignments between predicted and homologous proteins. Only gene models that have good coverage of predicted and homologous proteins by blast alignment are selected. Some other criteria are also checked.

"BLAST" (alternative) method first predicts genes ab initio (by fgenesh), then finds homologs to predicted proteins in a database (by BLAST) and then tries to refine gene models (by fgenesh+) using protein homologs found.

We recommend the "prot_map" method: it is more straightforward and gave higher accuracy in our tests, whereas the "BLAST" method relies on heuristics to merge genes that were split during ab initio prediction.

If you want to use protein-supported (as we call them) predictions, switch ON "USE_PROTEINS" block in configuration file.

For example,

USE_PROTEINS = 1                                        # 0 - no, 1 -yes
PROG_PROT    = 1                                        # 1 - use prot_map, 2 - use blast
NUM_THREADS  = 5                                        # number of processors for 'prot_map' or 'blast'
PROTEIN_DB   = /home/NR/nr_ce                           # protein DB
BLAST_PATHS  = /home/FGENESHPIPE/blast_paths.list       # location of "blast_paths.list"

Protein-supported predictions are made in regions not occupied by mRNA-supported predictions.

3) Ab initio predictions (without mRNA or protein type of evidence) are made in regions not occupied by mRNA-supported and protein-supported predictions.

To make ab initio predictions, we use 'fgenesh' and gene prediction parameters trained for specified (or close) organism.

The pipeline always runs ab initio predictions (in regions with no genes predicted by other methods), so there is nothing to set up for them in the configuration file.

4) using reads to improve gene predictions

This corresponds to "USE_READS" block in configuration file.

Reads are aligned to genomic sequences by ReadsMap pipeline and a list of potential splice sites is compiled for each sequence (*.sites files).

These *.sites files should be placed in some directory and path to this directory should be provided in configuration file.

Information about potential splice sites is used by 'fgenesh+' (see item 2 above) and 'fgenesh' (see item 3 above) when predicting genes. As a result, boundaries of CDS exons and gene structures are predicted more accurately.

Technically, using reads is not an additional step of improvement after making initial predictions but rather incorporating additional information (potential splice sites and their weights) in the process of prediction by 'fgenesh+' and 'fgenesh'. By improvement we mean higher accuracy of gene models.

'Fgenesh+' is less sensitive to this additional information because it has stronger evidence such as protein homology but 'fgenesh' is more sensitive.

If you want to use evidence from reads, switch ON "USE_READS" block in configuration file.

For example,

USE_READS = 1                                           # use reads info to improve gene models
DIR_SITES = /home/EXAMPLE_nGASP/reads_sites/            # directory with reads *.sites files
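
Before switching USE_READS ON, it may help to confirm that the directory given in DIR_SITES actually holds *.sites files. This is a minimal sketch; 'count_sites_files' is a hypothetical helper, and it does not check how the *.sites files are named relative to the sequences, since that is not described here.

```bash
#!/usr/bin/env bash
# Sketch: count *.sites files in a directory.
# count_sites_files is a hypothetical helper.
# Usage: count_sites_files <dir_sites>
count_sites_files() {
    local dir="$1" f n=0
    for f in "$dir"/*.sites; do
        [ -f "$f" ] && n=$((n + 1))              # unmatched glob is not a file
    done
    echo "$n"
}

# Example (not run here):
# [ "$(count_sites_files /home/EXAMPLE_nGASP/reads_sites)" -gt 0 ] || echo "no *.sites files found" >&2
```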

5) using ESTs to improve gene predictions

This corresponds to "MAP_ESTS" block in configuration file.

Similar to reads, ESTs can be used to make more accurate predictions.

ESTs are aligned to genomic sequences by 'est_map' and good alignments are selected. A list of alignment blocks is compiled, recording whether each block is flanked by splice sites at one end, at both ends, or at neither. This information about regions (potential exons) supported by ESTs and about potential splice sites is used by 'fgenesh+' (see item 2 above) and 'fgenesh' (see item 3 above).

As a result, boundaries of CDS exons and gene structures are predicted more accurately. Also, 5'- and 3'- untranslated parts of first and last CDS exons can be predicted.

Technically, as with reads, this additional information is used in the process of prediction, not as a step after making initial predictions.

If you want to use evidence from ESTs, switch ON "MAP_ESTS" block in configuration file.

For example,

MAP_ESTS = 1                                            # map ESTs to genomic sequences
EST_FILE = /home/EXAMPLE_nGASP/DATA/rna_matches.fa      # file with ESTs

Currently, evidence from reads and ESTs cannot be used simultaneously in Fgenesh++. Apart from that, there is no limitation on combining types of evidence.

Also, each predicted gene is represented by only one gene model (the one with the maximum weight) even if reads or ESTs support several alternative splicing gene variants. In next versions, we will make it possible to predict alternative gene variants based on reads / EST data.

6) ab initio gene predictions in long introns of other genes

Optionally, pipeline can make gene predictions in long introns of other genes. If you want to use it, switch this option ON. For example,

INTRONIC_GENES = 1                                      # predict genes in long introns of other genes

If you switch all options in configuration file to 0, Fgenesh++ runs as Fgenesh, i.e. it makes ab initio predictions in genomic sequences.

RUNNING THE PIPELINE (in short)

To run the pipeline, do the following steps:

- install Fgenesh++ pipeline (see "INSTALLATION");

- prepare genomic sequences (see "Preparation of sequences");

- if "mapping known mRNAs" step is used, prepare three special files *.cdna, *.pro, *.dat with known mRNAs data for a query organism (see APPENDIX 2);

- if "prediction of genes based on homology to known proteins" step is used, prepare NR or its subset or other / custom protein database (see APPENDIX 1);

- if "using reads to improve gene models" step is used, run ReadsMap pipeline to map reads to sequences and obtain *.sites files with potential splice sites;

- if "using ESTs to improve gene models" step is used, prepare file with EST sequences;

(currently either "using reads" or "using ESTs" step can be used but not both at the same time)

- make configuration file (see "Configuration file");

- run the pipeline (see "Command to run the pipeline")

*.resn3 files in results directory will contain predicted gene structures and corresponding proteins in Fgenesh++ format - see APPENDIX 3 for details.

APPENDIX 1. Preparing NR (non-redundant protein database) or other protein database

If "prediction of genes based on homology to known proteins" step is used then NR (or other protein database) is required as well as BLAST+ or BLAST software.

BLAST+ can be downloaded from NCBI ftp or web page:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download

The legacy BLAST executables can be downloaded from NCBI ftp:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/

Use BLAST+ or BLAST release 2.2.13 or some higher version.

NR (non-redundant protein sequence database) can be downloaded from

ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

NR or custom protein database must be present in directory provided in configuration file in both formats - as FASTA file and as formatted for BLAST by 'makeblastdb' or 'formatdb' program.

Format protein database for BLAST with "makeblastdb -in <db_file> -dbtype prot" (BLAST+) or "formatdb -i <db_file>" (BLAST) command, e.g.:

makeblastdb -in nr -dbtype prot (BLAST+)
formatdb    -i  nr              (BLAST)

Do NOT use '-parse_seqids' (BLAST+) or '-o T' (BLAST) option since the pipeline requires a protein database to be formatted without parsing Seq-ids!

After formatting, protein DB will be present in both formats:

nr        (FASTA formatted)

nr.*.phr  (BLAST formatted)
nr.*.pin  (BLAST formatted)
nr.*.psq  (BLAST formatted)

Put the name of FASTA file ('nr' in example above) in configuration file.
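
Whether a database is present in both formats can be checked with a few lines of shell. The sketch below accepts either single-volume names (nr.phr) or multi-volume names (nr.*.phr, as listed above). 'check_protein_db' is a hypothetical helper and not part of the pipeline.

```bash
#!/usr/bin/env bash
# Sketch: check that a protein database exists as a FASTA file and as BLAST files
# (.phr, .pin, .psq), single-volume or multi-volume.
# check_protein_db is a hypothetical helper.
# Usage: check_protein_db <path_to_fasta>   (returns non-zero if something is missing)
check_protein_db() {
    local db="$1" ext f found status=0
    [ -f "$db" ] || { echo "missing FASTA file: $db" >&2; status=1; }
    for ext in phr pin psq; do
        found=0
        for f in "$db.$ext" "$db".*."$ext"; do
            [ -f "$f" ] && found=1
        done
        if [ "$found" -ne 1 ]; then
            echo "missing BLAST file with extension .$ext for $db" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (not run here):
# check_protein_db /home/NR/nr_ce
```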

NR subsets

We highly recommend using relevant NR subsets. Because the full NR is very large, this can greatly speed up calculations.

For example, use only eukaryotic proteins from NR (nr_euk) or, better, an even more specific subset:

nr_plants  - plant proteins             (to annotate plants)
nr_animals - animals (Metazoa) proteins (to annotate animals)
nr_fungi   - fungi proteins             (to annotate fungi)

We also recommend excluding from NR proteins that are not relevant for your annotation, for example partial proteins, "unnamed protein product" proteins, hypothetical proteins, and predicted proteins (with 'XP_' or 'PREDICTED' in deflines).
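
Such exclusions can be approximated with a defline filter. The following is a rough sketch, not the procedure used by Softberry: 'filter_proteins' is a hypothetical helper, and the keyword matching is deliberately simple. NR deflines often merge several entries into one line, so this filter drops a record if any of its merged entries matches.

```bash
#!/usr/bin/env bash
# Sketch: copy a protein FASTA file, dropping records whose defline mentions
# partial, unnamed protein product, hypothetical, predicted, or an XP_ accession.
# filter_proteins is a hypothetical helper.
# Usage: filter_proteins <input_fasta> <output_fasta>
filter_proteins() {
    awk '
        /^>/ {
            keep = 1
            h = tolower($0)                      # case-insensitive keyword match
            if (h ~ /partial/ || h ~ /unnamed protein product/ ||
                h ~ /hypothetical/ || h ~ /predicted/ || $0 ~ /XP_/) keep = 0
        }
        keep                                     # print lines of kept records
    ' "$1" > "$2"
}

# Example (not run here):
# filter_proteins nr_animals nr_animals_filtered
```

The filtered FASTA file must then be formatted for BLAST again, as described above.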

APPENDIX 2. Preparing known mRNA data for mRNA mapping

To run mRNA mapping you should prepare three files:

rf.cdna - with mRNA sequences in fasta format
rf.pro - with corresponding protein sequences in fasta format
rf.dat - file with coordinates of ATG, STOP codon and corresponding chromosome names (if chromosome name is not available, put "na")

You can create such files using RefSeq database from NCBI and our scripts (or prepare them using your own data and scripts).

RefSeq

1) download RefSeq files *.rna.gbff.gz and *.protein.faa.gz for considered organism(s)

ftp://ftp.ncbi.nlm.nih.gov/refseq/

Example for Human:

wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.rna.gbff.gz"
wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.protein.faa.gz"

Example for Arabidopsis thaliana (plants):

wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.rna.gbff.gz"
wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.protein.faa.gz"

Example for fungi:

wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.rna.gbff.gz"
wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.protein.faa.gz"

2) gunzip files

gunzip *.gz

3) concatenate files

(The order in which files are concatenated is not important here.)

# Human

cat human.*.rna.gbff    > human.rna.gbff
cat human.*.protein.faa > human.protein.faa

# Arabidopsis thaliana

cat plant.*.rna.gbff    > plant.rna.gbff
cat plant.*.protein.faa > plant.protein.faa

# Fungi

cat fungi.*.rna.gbff    > fungi.rna.gbff
cat fungi.*.protein.faa > fungi.protein.faa

Concatenated files are used to prepare *.cdna, *.pro, *.dat files.

Script

There is a script in REFSEQ_SCRIPT/ directory to prepare *.cdna, *.pro, *.dat files for considered organism.

refseq_NM.pl - selects from RefSeq full-length mRNAs with IDs starting with NM_ and with status keys REVIEWED / PROVISIONAL / VALIDATED

(see Refseq documentation for more information on IDs and status of entries)

Note about XM_ mRNAs from RefSeq:

We use mRNAs with NM_ prefix only (with status 'PROVISIONAL', 'REVIEWED', 'VALIDATED'). Usually they are better than XM_ and have some evidence. More information about prefixes can be found on NCBI site.

If you also want to use XM_ mRNAs for your organism, you can comment out the following line in 'refseq_NM.pl':

    if( $gb_id !~ /^NM_/ ) { $yes = 0; }  #! only mRNAs with NM_ IDs

(To comment a line, put # at the beginning of the line.)

Sequences as Chromosomes vs. Contigs/Scaffolds/etc.

It is usually indicated in RefSeq to which chromosomes mRNAs belong. Fgenesh++ works with RefSeq mRNAs in two ways:

1) if you annotate chromosome sequences (properly named), Fgenesh++ maps mRNAs from RefSeq only to chromosomes where they belong;

2) if you annotate contigs / scaffolds (so that you do not know to what contigs RefSeq mRNAs belong), Fgenesh++ tries to map mRNAs to all contigs and selects good mappings.

In case 1), to link mRNAs to the files with chromosome sequences, run the 'refseq_NM.pl' script with the options -chr_prefix:<prefix> and -chr_ext:<ext>.

Suppose you have chromosome sequences properly named, like for example for human chr1.fa, chr2.fa, .., chr22.fa, chrX.fa, chrY.fa (or with some other extension - chr1.seq, etc.).

Human chromosomes are named from 1 to 22 and 'X', 'Y' in human *.rna.gbff files, for example:

     source          1..3159
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="10"

To adjust for RefSeq chromosomes notation / your sequences files names, run 'refseq_NM.pl' with options -chr_prefix:chr -chr_ext:fa

(So that mRNA from chromosome '10' from RefSeq in the example above will be mapped to your 'chr10.fa' file with chromosome sequence.)

Example:

./refseq_NM.pl  rf_hs  human.rna.gbff  human.protein.faa  "Homo sapiens" -chr_prefix:chr -chr_ext:fa

Or, for 'Saccharomyces cerevisiae' chromosomes are named from I to XVI in fungi *.rna.gbff, for example:

     source          1..1848
                     /organism="Saccharomyces cerevisiae S288c"
                     /mol_type="mRNA"
                     /strain="S288c"
                     /db_xref="taxon:559292"
                     /chromosome="III"

So, if your files with S.c. chromosomes are "I.seq", "II.seq", .., "XVI.seq", run 'refseq_NM.pl' with option -chr_ext:seq (this adds the extension .seq to I, II, ...). Do not use -chr_prefix: because these file names have no prefix before I, II, etc.

Example:

./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c" -chr_ext:seq

In case 2) (contigs, scaffolds, i.e. sequences not assembled into chromosomes) run 'refseq_NM.pl' without -chr_prefix: -chr_ext: options.

Example:

./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c"

In this case you will see lines with "na" instead of chromosome names in *.dat file.
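
The naming rule described for case 1) amounts to: file name = prefix + RefSeq chromosome name + '.' + extension. The sketch below builds the expected file names and reports those that are absent from a sequence list, which can catch a wrong prefix or extension early. 'chr_file_name' and 'missing_chr_files' are hypothetical helpers written for illustration and are not part of 'refseq_NM.pl'.

```bash
#!/usr/bin/env bash
# Sketch: derive chromosome file names from RefSeq chromosome names and check
# that they occur in a sequence list. Both functions are hypothetical helpers.

# chr_file_name <refseq_chromosome> <prefix> <ext>
# e.g. chr_file_name 10 chr fa prints chr10.fa; chr_file_name III "" seq prints III.seq
chr_file_name() {
    local name="$2$1"
    [ -n "$3" ] && name="$name.$3"
    printf '%s\n' "$name"
}

# missing_chr_files <seq_list> <prefix> <ext> <chromosome>...
# Prints every expected file name that is not the last path component of any list entry.
missing_chr_files() {
    local list="$1" prefix="$2" ext="$3" chr file
    shift 3
    for chr in "$@"; do
        file=$(chr_file_name "$chr" "$prefix" "$ext")
        awk -F/ -v f="$file" '$NF == f { found = 1 } END { exit !found }' "$list" \
            || echo "$file"
    done
}

# Example (not run here):
# missing_chr_files seq.list chr fa 1 2 3 X Y
```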

Command

./refseq_NM.pl  <basename> <.rna.gbff file> <.protein.faa file> <organism>  \
               [-status:<[PRV]>] [-check_atg:<0|1>] [-check_stop:<0|1>]    \
               [-chr_prefix:<prefix> -chr_ext:<ext>]

where

  <basename>          - base name for generated .cdna, .pro, .dat files
  <.rna.gbff file>    - .rna.gbff    file name
  <.protein.faa file> - .protein.faa file name
  <organism>          - organism
  -status:<[PRV]>     - select mRNAs with this status (PRV by default)
                        (P - PROVISIONAL, R - REVIEWED, V - VALIDATED)
  -check_atg:<0|1>    - skip mRNAs with translation start codon != ATG (1 by default)
  -check_stop:<0|1>   - skip mRNAs with wrong stop codon annotated     (1 by default)

  -chr_prefix:<prefix> - prefix    for chromosome files (for example, 'chr' or 'Chr')
  -chr_ext:<ext>       - extension for chromosome files (for example, 'fa' or 'seq')

                       (prefix and extension are provided if sequences are chromosomes 
                        and sequence files are properly named)

Examples:

./refseq_NM.pl  rf_hs  human.rna.gbff  human.protein.faa  "Homo sapiens"          -chr_prefix:chr  -chr_ext:fa

./refseq_NM.pl  rf_BT  cow.rna.gbff    cow.protein.faa    "Bos taurus"            -chr_prefix:chr  -chr_ext:fa
./refseq_NM.pl  rf_BT  cow.rna.gbff    cow.protein.faa    "Bos taurus"            -chr_prefix:chr  -chr_ext:fa  -status:RV

./refseq_NM.pl  rf_at  plant.rna.gbff  plant.protein.faa  "Arabidopsis thaliana"  -chr_prefix:ath  -chr_ext:seq
./refseq_NM.pl  rf_at  plant.rna.gbff  plant.protein.faa  "Arabidopsis thaliana"  -chr_prefix:ath  -chr_ext:seq  -check_atg:0  -check_stop:1

./refseq_NM.pl  rf_Sc  fungi.rna.gbff  fungi.protein.faa  "Saccharomyces cerevisiae S288c"  -chr_ext:seq
./refseq_NM.pl  rf_Sc  fungi.rna.gbff  fungi.protein.faa  "Saccharomyces cerevisiae S288c"

As the organism name, use the words from the 'ORGANISM' field of RefSeq entries, e.g., 'Homo sapiens' for human:

  ORGANISM  Homo sapiens
  ORGANISM  Arabidopsis thaliana
  ORGANISM  Saccharomyces cerevisiae S288c

Basenames for files (from examples above):

(rf_hs - 'rf' for refseq, 'hs' for Homo sapiens
 rf_at - 'rf' for refseq, 'at' for Arabidopsis thaliana
 rf_Sc - 'rf' for refseq, 'Sc' for Saccharomyces cerevisiae)

3 files will be generated (example for Human):

  rf_hs.cdna    - set of RefSeq cDNA sequences in fasta format
  rf_hs.pro     - set of RefSeq protein sequences in fasta format 
                 (slightly modified from the original RefSeq *.faa file)
  rf_hs.dat     - file with ATG, STOP coordinates and chromosome names

rf_hs.dat file has the following format:

36 2327 chr21.seq
221 3175 chr3.seq
28 3024 chr3.seq
86 3049 na

where 36 and 2327 are the coordinates of the first nucleotide of ATG and the last nucleotide of the stop codon in an mRNA sequence from RefSeq annotated as belonging to chromosome 21, and chr21.seq is the name of the sequence to which this mRNA should be mapped.

"na" means that it is not clear which chromosome the mRNA belongs to; in this case Fgenesh++ tries to map the mRNA to all genomic sequences and selects good mappings.

Data in *.cdna, *.pro and *.dat are organised in the same order, so that the first line in *.dat corresponds to the first sequences in *.cdna and *.pro, and so on.
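As a quick sanity check before editing the configuration file, the *.dat records can be read with a short Python sketch like the one below. It assumes one whitespace-separated record per line (ATG coordinate, stop coordinate, sequence name or "na"); the names DatRecord, parse_dat and cds_length are hypothetical and not part of the Fgenesh++ package.

```python
from collections import namedtuple

# One record of the *.dat file: ATG start, last base of stop codon, target file.
DatRecord = namedtuple("DatRecord", "atg stop target")


def parse_dat(lines):
    """Parse *.dat lines; a target of None stands for the "na" marker."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        atg, stop, target = int(fields[0]), int(fields[1]), fields[2]
        records.append(DatRecord(atg, stop, None if target == "na" else target))
    return records


def cds_length(record):
    """Length in bp from the first base of ATG to the last base of the stop codon."""
    return record.stop - record.atg + 1


example = """36 2327 chr21.seq
221 3175 chr3.seq
28 3024 chr3.seq
86 3049 na"""

for rec in parse_dat(example.splitlines()):
    # A coding region that runs from ATG through the stop codon should be a
    # multiple of 3 bp long.
    print(rec.target, cds_length(rec), cds_length(rec) % 3 == 0)
```

Because the three files are organised in the same order, the i-th record returned by parse_dat describes the i-th sequence in *.cdna and *.pro.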

Once you have generated the *.cdna, *.pro and *.dat files, put their locations in the configuration file.

APPENDIX 3. Fgenesh++ pipeline output format:

G - predicted gene number, starting from start of sequence;
Str - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence:
   CDSf - first (starting with start codon),
   CDSi - internal (internal exon),
   CDSl - last coding segment (ending with stop codon),
   CDSo - for genes with only one CDS (starting with start codon 
          and ending with stop codon);
EXO  - partially translated or untranslated exon 
      (coding sequence, if any, from such an exon is annotated 
       as CDSf, CDSl or CDSo);
TSS  - TATA-box (or transcription start site for genes with mRNA support);
PolA - PolyA signal (or end of transcript for genes with mRNA support);
Start and End - position of the Feature;
Score - Log likelihood*10 score for the feature;
ORF - start/end positions where the first complete codon starts 
      and the last codon ends;
Len - length of predicted ORF.

In the current Fgenesh++ format, the features 'TSS' and 'PolA' have different meanings depending on the type of gene support.

For mRNA supported genes:

TSS - transcription start site,
PolA - PolyA site.

For protein supported and ab initio predicted genes:

TSS - TATA-box (always predicted, even if bad),
PolA - PolyA signal.

Fgenesh++ can predict complete as well as incomplete gene models. Incomplete gene models are those in which the initial (CDSf) and/or terminal (CDSl) exon is absent. Note that gene models with only one CDS (CDSo), which has both an ATG and a stop codon, are considered complete gene models.
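The completeness rule can be expressed as a small check over the CDS feature types of one gene. The following is a minimal sketch based only on the rule stated above; the function name is hypothetical.

```python
def is_complete_gene(cds_features):
    """Return True if a gene model is complete.

    cds_features is the list of CDS feature types of one gene, for example
    ["CDSf", "CDSi", "CDSl"]. A model is complete if it is a single CDSo,
    or if it has both a first (CDSf) and a last (CDSl) coding segment.
    """
    features = set(cds_features)
    if "CDSo" in features:
        return True
    return "CDSf" in features and "CDSl" in features


print(is_complete_gene(["CDSf", "CDSi", "CDSi", "CDSl"]))  # True
print(is_complete_gene(["CDSi", "CDSl"]))                  # False
print(is_complete_gene(["CDSo"]))                          # True
```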

In Fgenesh++ output, you can distinguish between mRNA supported, protein supported and ab initio gene predictions:

An example of ab initio gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len

   2 +      TSS    118455                 -6.59
   2 +   1 CDSf    118568 -    118585      5.16    118568 -    118585     18
   2 +   2 CDSi    132799 -    132867      3.27    132799 -    132867     69
   2 +   3 CDSi    133530 -    133666      5.47    133530 -    133664    135
   2 +   4 CDSl    139414 -    139570      5.57    139415 -    139570    156
   2 +     PolA    141461                  1.12

An example of mRNA supported gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len          (1)        (2)    (3)

   8 -     PolA    249364                 51.60 NP_000030.1
   8 -   0  EXO    249364 -    250022     51.60    249364 -    250022    659    239  -    897  100 NP_000030.1
   8 -   1 CDSl    249419 -    250022     51.60    249419 -    250021    604    239  -    897  100 NP_000030.1
   8 -   2 CDSi    250612 -    250768     51.60    250614 -    250766    157     82  -    238  100 NP_000030.1
   8 -   3 CDSf    250956 -    250998     51.60    250957 -    250998     43     19  -     81  100 NP_000030.1
   8 -   0  EXO    250956 -    251018     51.60    250956 -    251018     63     19  -     81  100 NP_000030.1
   8 -   0  EXO    251216 -    251233     51.60    251216 -    251233     18      1  -     18  100 NP_000030.1
   8 -      TSS    251233                 51.60 NP_000030.1

Three more fields are added in comparison with ab initio predictions:

(1) - coordinates of that part of known mRNA (in bp) that corresponds to predicted exon;
(2) - homology in % between predicted exon and corresponding part of known mRNA;
(3) - ID of protein corresponding to known mRNA.

An example of protein supported gene prediction.

   G Str   Feature   Start        End    Score           ORF           Len          (1)        (2)

   3 -   1 CDSl    162093 -    162186    157.68    162093 -    162185     93    618  -    590  100
   3 -   2 CDSi    170757 -    170838    145.17    170759 -    170836     78    588  -    563  100
   3 -   3 CDSi    171377 -    171561    339.29    171378 -    171560    183    561  -    501  100
   3 -   4 CDSi    171880 -    172018    247.86    171882 -    172016    135    499  -    455  100
   3 -   5 CDSi    172685 -    172790    191.48    172686 -    172790    105    453  -    419  100
   3 -   6 CDSi    174346 -    174563    382.22    174346 -    174561    216    418  -    347  100
   3 -   7 CDSi    176164 -    176877   1246.74    176165 -    176875    711    345  -    109  100
   3 -   8 CDSi    178973 -    179057    159.79    178974 -    179057     84    107  -     80  100
   3 -   9 CDSi    183740 -    183833    184.33    183740 -    183832     93     79  -     49  100
   3 -  10 CDSf    186433 -    186575    277.99    186435 -    186575    141     47  -      1  100

Two more fields are added in comparison with ab initio predictions:

(1) - coordinates of that part of homologous protein that corresponds to predicted CDS;
(2) - homology in % between predicted CDS and corresponding part of homologous protein.
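The three row layouts shown above can be told apart by the number of whitespace-separated fields. The Python sketch below parses one feature row and classifies it; it is inferred from the example rows only, so the field positions and the names parse_feature_row and support_type are assumptions rather than a documented interface.

```python
def parse_feature_row(line):
    """Parse one row of the 'Positions of predicted genes and exons' table."""
    t = line.split()
    row = {"gene": int(t[0]), "strand": t[1]}
    if t[2] in ("TSS", "PolA"):
        # Single-position features: G, Str, Feature, position, Score [, ID]
        row.update(feature=t[2], start=int(t[3]), end=int(t[3]),
                   score=float(t[4]))
        extra = t[5:]
    else:
        # CDS/EXO rows: G, Str, number, Feature, Start - End, Score,
        # ORF start - ORF end, Len
        row.update(number=int(t[2]), feature=t[3],
                   start=int(t[4]), end=int(t[6]), score=float(t[7]),
                   orf_start=int(t[8]), orf_end=int(t[10]), length=int(t[11]))
        extra = t[12:]
        if len(extra) >= 4:
            # Field (1): coordinates in the mRNA or protein; field (2): homology in %
            row.update(hit_start=int(extra[0]), hit_end=int(extra[2]),
                       identity=int(extra[3]))
            extra = extra[4:]
    # Field (3), present only for mRNA supported genes
    row["protein_id"] = extra[0] if extra else None
    return row


def support_type(row):
    """Classify a row as 'mRNA', 'protein' or 'ab initio' supported."""
    if row["protein_id"]:
        return "mRNA"
    if "hit_start" in row:
        return "protein"
    return "ab initio"


rows = [
    "   2 +   1 CDSf    118568 -    118585      5.16    118568 -    118585     18",
    "   8 -   1 CDSl    249419 -    250022     51.60    249419 -    250021    604    239  -    897  100 NP_000030.1",
    "   3 -   1 CDSl    162093 -    162186    157.68    162093 -    162185     93    618  -    590  100",
]
for r in rows:
    parsed = parse_feature_row(r)
    print(parsed["gene"], parsed["feature"], support_type(parsed))
```

Since all rows of one gene share the same support type, classifying any CDS row of a gene is enough to classify the gene.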

Predicted proteins are listed at the end of *.resn3 files.

For mRNA supported and protein supported predictions, information about a homolog of the predicted protein is given in the ID line of the predicted protein (after the first ##), for example:

# protein supported prediction
>FGENESH: 3  10 exon (s)    162093 -    186575 619 aa, chain - ## BY PROTMAP: gi|14249338|ref|NP_116114.1| hypothetical protein LOC84811 [Homo sapiens]gi|13623491|gb|AAH06350.1| Hypothetical protein MGC13125 [Homo sapiens] ## 619 ## 100.0 1.00 1.00 1276.0

# mRNA supported prediction
>NP_443200.1_#_18_#_1109 5  3 exon (s) 203739 - 205462 363 aa, chain - ## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens]  ## 363 ## orf_perfect
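The ID lines above can be split into their parts with a regular expression. The sketch below is inferred from the example ID lines in this manual; the pattern, the function name parse_protein_header and the returned keys are assumptions, and the fields after the second ## are kept as raw strings because their meaning is not described here.

```python
import re

# >NAME  gene_no  N exon (s)  start - end  LEN aa, chain +/-  [## info ## ...]
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+(?P<gene>\d+)\s+(?P<exons>\d+) exon \(s\)\s+"
    r"(?P<start>\d+)\s+-\s+(?P<end>\d+)\s+(?P<aa>\d+) aa, chain (?P<strand>[+-])"
)


def parse_protein_header(header):
    """Parse the ID line of a predicted protein; return None if it does not match."""
    m = HEADER_RE.match(header)
    if m is None:
        return None
    info = {k: (int(v) if v.isdigit() else v) for k, v in m.groupdict().items()}
    # Everything after the matched part is a list of fields separated by "##".
    fields = [f.strip() for f in header[m.end():].split("##")][1:]
    if not fields:
        info["support"], info["homolog"] = "ab initio", None
    elif fields[0].startswith("BY PROTMAP:"):
        info["support"] = "protein"
        info["homolog"] = fields[0][len("BY PROTMAP:"):].strip()
    else:
        info["support"], info["homolog"] = "mRNA", fields[0]
    info["extra"] = fields[1:]
    return info


h = (">NP_443200.1_#_18_#_1109 5  3 exon (s) 203739 - 205462 363 aa, chain - "
     "## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens]  "
     "## 363 ## orf_perfect")
parsed = parse_protein_header(h)
print(parsed["gene"], parsed["support"], parsed["aa"], parsed["extra"])
```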

All Fgenesh++ output for sequence ENm003 (file "ENm003.seq.N.resn3") is shown below.

FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA
 Seq name: ENm003
 Length of sequence: 500000
 Number of predicted genes 11 in +chain 2 in -chain 9
 Number of predicted exons 86 in +chain 10 in -chain 76
 Positions of predicted genes and exons:
   G Str   Feature   Start        End    Score           ORF           Len

   1 -     PolA      1928                 -1.68
   1 -   1 CDSl      3118 -      3219      0.29      3118 -      3219    102
   1 -   2 CDSi      5228 -      5384     11.00      5228 -      5383    156
   1 -   3 CDSi      6666 -      6724     -0.82      6668 -      6724     57
   1 -   4 CDSi      9401 -      9438     -0.43      9401 -      9436     36
   1 -   5 CDSi     10303 -     10499      9.57     10304 -     10498    195
   1 -   6 CDSi     10636 -     10762     -0.88     10638 -     10760    123
   1 -   7 CDSi     12156 -     12321      3.94     12157 -     12321    165
   1 -   8 CDSi     28607 -     28648      3.50     28607 -     28648     42
   1 -   9 CDSf     69822 -     69932      6.96     69822 -     69932    111
   1 -      TSS     71289                 -5.09

   2 +      TSS    118455                 -6.59
   2 +   1 CDSf    118568 -    118585      5.16    118568 -    118585     18
   2 +   2 CDSi    132799 -    132867      3.27    132799 -    132867     69
   2 +   3 CDSi    133530 -    133666      5.47    133530 -    133664    135
   2 +   4 CDSl    139414 -    139570      5.57    139415 -    139570    156
   2 +     PolA    141461                  1.12

   3 -   1 CDSl    162093 -    162186    157.68    162093 -    162185     93    618  -    590  100
   3 -   2 CDSi    170757 -    170838    145.17    170759 -    170836     78    588  -    563  100
   3 -   3 CDSi    171377 -    171561    339.29    171378 -    171560    183    561  -    501  100
   3 -   4 CDSi    171880 -    172018    247.86    171882 -    172016    135    499  -    455  100
   3 -   5 CDSi    172685 -    172790    191.48    172686 -    172790    105    453  -    419  100
   3 -   6 CDSi    174346 -    174563    382.22    174346 -    174561    216    418  -    347  100
   3 -   7 CDSi    176164 -    176877   1246.74    176165 -    176875    711    345  -    109  100
   3 -   8 CDSi    178973 -    179057    159.79    178974 -    179057     84    107  -     80  100
   3 -   9 CDSi    183740 -    183833    184.33    183740 -    183832     93     79  -     49  100
   3 -  10 CDSf    186433 -    186575    277.99    186435 -    186575    141     47  -      1  100

   4 -   1 CDSl    192536 -    192670    242.45    192536 -    192670    135    458  -    416  100
   4 -   2 CDSi    193383 -    193448    121.76    193383 -    193448     66    415  -    394  100
   4 -   3 CDSi    195769 -    195855    164.50    195769 -    195855     87    393  -    365  100
   4 -   4 CDSi    196530 -    196640    218.51    196530 -    196640    111    364  -    328  100
   4 -   5 CDSi    197141 -    197230    164.54    197141 -    197230     90    327  -    298  100
   4 -   6 CDSi    197989 -    198059    124.84    197989 -    198057     69    297  -    275  100
   4 -   7 CDSi    198466 -    198532    103.87    198467 -    198532     66    273  -    252  100
   4 -   8 CDSi    198710 -    198757     84.47    198710 -    198757     48    251  -    236  100
   4 -   9 CDSi    199125 -    199247    236.78    199125 -    199247    123    235  -    195  100
   4 -  10 CDSi    199407 -    199493    161.23    199407 -    199493     87    194  -    166  100
   4 -  11 CDSi    200122 -    200139     16.83    200122 -    200139     18    165  -    160  100
   4 -  12 CDSi    200563 -    200670    118.34    200563 -    200670    108    141  -    112  100
   4 -  13 CDSi    201093 -    201254    281.00    201093 -    201254    162    111  -     58  100
   4 -  14 CDSf    201431 -    201601    315.93    201431 -    201601    171     57  -      1  100

   5 -     PolA    202982                 74.60 NP_443200.1
   5 -   0  EXO    202982 -    204678     74.60    202982 -    204678   1697    170  -   1866   99 NP_443200.1
   5 -   1 CDSl    203739 -    204678     74.60    203739 -    204677    940    170  -   1866   99 NP_443200.1
   5 -   2 CDSi    205197 -    205308     74.60    205199 -    205306    112     58  -    169  100 NP_443200.1
   5 -   3 CDSf    205423 -    205462     74.60    205424 -    205462     40      1  -     57  100 NP_443200.1
   5 -   0  EXO    205423 -    205479     74.60    205423 -    205479     57      1  -     57  100 NP_443200.1
   5 -      TSS    205479                 74.60 NP_443200.1

   6 -     PolA    234313                 65.70 NP_000473.1
   6 -   0  EXO    234313 -    235492     65.70    234313 -    235492   1180    291  -   1466   99 NP_000473.1
   6 -   1 CDSl    234478 -    235492     65.70    234478 -    235491   1015    291  -   1466   99 NP_000473.1
   6 -   2 CDSi    236270 -    236396     65.70    236272 -    236394    127    164  -    290   99 NP_000473.1
   6 -   3 CDSf    236754 -    236802     65.70    236755 -    236802     49      1  -    163  100 NP_000473.1
   6 -   0  EXO    236754 -    236915     65.70    236754 -    236915    162      1  -    163  100 NP_000473.1
   6 -      TSS    236915                 65.70 NP_000473.1

   7 +      TSS    243519                 39.60 NP_000031.1
   7 +   0  EXO    243519 -    243551     39.60    243519 -    243551     33      1  -     33  100 NP_000031.1
   7 +   0  EXO    244181 -    244248     39.60    244181 -    244248     68     34  -    101  100 NP_000031.1
   7 +   1 CDSf    244194 -    244248     39.60    244194 -    244247     55     34  -    101  100 NP_000031.1
   7 +   2 CDSi    244384 -    244507     39.60    244386 -    244505    124    102  -    225  100 NP_000031.1
   7 +   3 CDSl    246375 -    246495     39.60    246376 -    246495    121    226  -    533  100 NP_000031.1
   7 +   0  EXO    246375 -    246682     39.60    246375 -    246682    308    226  -    533  100 NP_000031.1
   7 +     PolA    246682                 39.60 NP_000031.1

   8 -     PolA    249364                 51.60 NP_000030.1
   8 -   0  EXO    249364 -    250022     51.60    249364 -    250022    659    239  -    897  100 NP_000030.1
   8 -   1 CDSl    249419 -    250022     51.60    249419 -    250021    604    239  -    897  100 NP_000030.1
   8 -   2 CDSi    250612 -    250768     51.60    250614 -    250766    157     82  -    238  100 NP_000030.1
   8 -   3 CDSf    250956 -    250998     51.60    250957 -    250998     43     19  -     81  100 NP_000030.1
   8 -   0  EXO    250956 -    251018     51.60    250956 -    251018     63     19  -     81  100 NP_000030.1
   8 -   0  EXO    251216 -    251233     51.60    251216 -    251233     18      1  -     18  100 NP_000030.1
   8 -      TSS    251233                 51.60 NP_000030.1

   9 -   1 CDSl    260007 -    260164    216.16    260007 -    260162    156   1051  -   1003   84
   9 -   2 CDSi    261087 -    261219    217.15    261088 -    261219    132   1001  -    958   93
   9 -   3 CDSi    262731 -    262894    224.97    262731 -    262892    162    957  -    904   79
   9 -   4 CDSi    271421 -    272310   1397.33    271422 -    272309    888    902  -    606   90
   9 -   5 CDSi    272876 -    273215    495.64    272878 -    273213    336    604  -    493   80
   9 -   6 CDSi    274885 -    274994    169.71    274886 -    274993    108    491  -    456   88
   9 -   7 CDSi    275452 -    275537    135.57    275454 -    275537     84    454  -    427   96
   9 -   8 CDSi    275813 -    275938    217.61    275813 -    275938    126    426  -    385   97
   9 -   9 CDSi    277279 -    277429    275.71    277279 -    277428    150    384  -    335   98
   9 -  10 CDSi    281557 -    281700    252.23    281559 -    281699    141    333  -    287   95
   9 -  11 CDSi    283942 -    284012    139.45    283944 -    284012     69    285  -    263  100
   9 -  12 CDSi    287092 -    287247    277.74    287092 -    287247    156    262  -    211  100
   9 -  13 CDSi    287514 -    287667    246.50    287514 -    287666    153    210  -    160   92
   9 -  14 CDSi    288769 -    288878    186.07    288771 -    288878    108    158  -    123   94
   9 -  15 CDSi    289021 -    289062     54.07    289021 -    289062     42    110  -     97   88
   9 -  16 CDSi    289477 -    289620    222.91    289477 -    289620    144     96  -     49   87
   9 -  17 CDSi    289864 -    289974    192.60    289864 -    289974    111     48  -     12   91
   9 -  18 CDSf    290535 -    290570     42.89    290535 -    290570     36     11  -      1  100

  10 -     PolA    302475                  1.12
  10 -   1 CDSl    302565 -    302641     -0.55    302565 -    302639     75
  10 -   2 CDSi    309864 -    309987      5.78    309865 -    309987    123
  10 -   3 CDSi    310804 -    310928     11.92    310804 -    310926    123
  10 -   4 CDSi    340830 -    340991     13.82    340831 -    340989    159
  10 -   5 CDSi    367655 -    367718      6.56    367656 -    367718     63
  10 -   6 CDSi    370559 -    370675      9.39    370559 -    370675    117
  10 -   7 CDSf    372345 -    372401      5.39    372345 -    372401     57
  10 -      TSS    372529                 -9.59

  11 -     PolA    449354                 31.00 NP_001021.1
  11 -   0  EXO    449354 -    449697     31.00    449354 -    449697    344      1  -    344   97 NP_001021.1
  11 -   1 CDSo    449408 -    449662     31.00    449408 -    449662    255      1  -    344   97 NP_001021.1
  11 -      TSS    449697                 31.00 NP_001021.1

Predicted protein(s):
>FGENESH: 1  9 exon (s) 3118  -  69932  332 aa, chain -
MSVPPWGGQALVGGPGSAGGLVEEVESKDGQDGAQKKFGQGHGGSLSDAEEEEGWAGAGR
NSRWPKDHNSGSMGAGAAEGSTKLCDPQSLAPGEELSVFLLTAAAKGPNLLRPRPLAQHP
QAWHGAAMVPKGCLAVVCSRGMMGVLAGRKDVEHVTVQIPDEPKSQLRLLTESQCLVGSP
KPECPLQGPDGLFCGDPHKEITSFERLGLAEDKKALSKFARCEQEPMGGDLSLSWGSGLY
PAEEHLWLEEVRAEMQELIPGFCPAPSSEAGPPGVLAADEELDIEGQVSSGITGTNCQLG
GSSPSSLPVAAAADTTYGSESAQENWWGHELV
>FGENESH: 2  4 exon (s) 118568  -  139570  126 aa, chain +
MSMPHEGRHSRKLEAGGKRKEKQLQELAELVRSQAPANSAPIKRAGSREPSDQGLETQRP
FLITGLEESGCQLVRVAPSPHLLSQRCSESPGTSKVTHAREGSQWESSSECGKGAERRMY
NPGPGV
>FGENESH: 3  10 exon (s)    162093 -    186575 619 aa, chain - ## BY PROTMAP: gi|14249338|ref|NP_116114.1| hypothetical protein LOC84811 [Homo sapiens]gi|13623491|gb|AAH06350.1| Hypothetical protein MGC13125 [Homo sapiens] ## 619 ## 100.0 1.00 1.00 1276.0
MAAAPPLSKAEYLKRYLSGADAGVDRGSESGRKRRKKRPKPGGAGGKGMRIVDDDVSWTA
ISTTKLEKEEEEDDGDLPVVAEFVDERPEEVKQMEAFRSSAKWKLLGGHNEDLPSNRHFR
HDTPDSSPRRVRHGTPDPSPRKDRHDTPDPSPRRARHDTPDPSPLRGARHDSDTSPPRRI
RHDSSDTSPPRRARHDSPDPSPPRRPQHNSSGASPRRVRHDSPDPSPPRRARHGSSDISS
PRRVHNNSPDTSRRTLGSSDTQQLRRARHDSPDLAPNVTYSLPRTKSGKAPERASSKTSP
HWKESGASHLSFPKNSKYEYDPDISPPRKKQAKSHFGDKKQLDSKGDCQKATDSDLSSPR
HKQSPGHQDSDSDLSPPRNRPRHRSSDSDLSPPRRRQRTKSSDSDLSPPRRSQPPGKKAA
HMYSGAKTGLVLTDIQREQQELKEQDQETMAFEAEFQYAETVFRDKSGRKRNLKLERLEQ
RRKAEKDSERDELYAQWGKGLAQSRQQQQNVEDAMKEMQKPLARYIDDEDLDRMLREQER
EGDPMANFIKKNKAKENKNKKVRPRYSGPAPPPNRFNIWPGYRWDGVDRSNGFEQKRFAR
LASKKAVEELAYKWSVEDM
>FGENESH: 4  14 exon (s)    192536 -    201601 447 aa, chain - ## BY PROTMAP: gi|30582123|gb|AAP35288.1| zinc finger protein 259 [Homo sapiens]gi|61361942|gb|AAX42130.1| zinc finger protein 259 [synthetic construct]gi|61361934|gb|AAX42129.1| zinc finger protein 259 [synthetic construct]gi|16924221|gb|AAH17380.1| Zinc finger protein 259 [Homo sapiens]gi|16878306|gb|AAH17349.1| Zinc finger protein 259 [Homo sapiens]gi|15082497|gb|AAH12162.1| Zinc finger protein 259 [Homo sapiens]gi|13279038|gb|AAH04256.1| Zinc finger protein 259 [Homo sapiens]gi|3510462|gb|AAC33514.1| zinc finger protein [Homo sapiens]gi|6137318|sp|O75312|ZPR1_HUMAN Zinc-finger protein ZPR1 (Zinc finger protein 259)gi|4508021|ref|NP_003895.1| zinc finger protein 259 [Homo sapiens] ## 459 ## 96.0 1.00 1.00 884.0
MAASGAVEPGPPGAAVAPSPAPAPPPAPDHLFRPISAEDEEQQPTEIESLCMNCYCNGMT
RLLLTKIPFFREIIVSSFSCEHCGWNNTEIQSAGRIQDQGVRYTLSVRALEDMNREVVKT
DSAATRIPELDFEIPAFSQKGGKFKIRDQPARRANKDATAERIDEFIVKLKELKQVASPF
TLIIDDPSGNSFVENPHAPQKDDALVITHYNRTRQQEEMLGLQEEAPAEKPEEEDLRNEV
LQFSTNCPECNAPAQTNMKLVQIPHFKEVIIMATNCENCGHRTNEVKSGGAVEPLGTRIT
LHITDASDMTRDLLKSETCSVEIPELEFELGMAVLGGKFTTLEGLLKDIRELVTKNPFTL
GDSSNPGQTERLQEFSQKMDQIIEGNMKAHFIMDDPAGNSYLQNVYAPEDDPEMKVERYK
RTFDQNEELGLNDMKTEGYEAGLAPQR
>NP_443200.1_#_18_#_1109 5  3 exon (s) 203739 - 205462 363 aa, chain - ## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens]  ## 363 ## orf_perfect
MAAVLTWALALLSAFSATQARKGFWDYFSQTSGDKGRVEQIHQQKMAREPATLKDSLEQD
LNNMNKFLEKLRPLSGSEAPRLPQDPVGMRRQLQEELEEVKARLQPYMAEAHELVGWNLE
GLRQQLKPYTMDLMEQVALRVQELQEQLRVVGEDTKAQLLGGVDEAWALLQGLQSRVVHH
TGRFKELFHPYAESLVSGIGRHVQELHRSVAPHAPASPARLSRCVQVLSRKLTLKAKALH
ARIQQNLDQLREELSRAFAGTGTEEGAGPDPQMLSEEVRQRLQAFRQDTYLQIAAFTRAI
DQETEEVQQQLAPPPPGHSAFAPEFQQTDSGKVLSKLQARLDDLWEDITHSLHDQGHSHL
GDP
>NP_000473.1_#_115_#_1305 6  3 exon (s) 234478 - 236802 396 aa, chain - ## gi|4502151|ref|NP_000473.1| apolipoprotein A-IV precursor [Homo sapiens]  ## 396 ## orf_perfect
MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLRGNTEGLQKSLAELGGHLDQQV
EEFRRRVEPYGENFNKALVQQMEQLRTKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
KESQDKTLSLPELEQQQEQHQEQQQEQVQMLAPLES
>NP_000031.1_#_47_#_346 7  3 exon (s) 244194 - 246495 99 aa, chain + ## gi|4557323|ref|NP_000031.1| apolipoprotein C-III precursor [Homo sapiens]  ## 99 ## orf_perfect
MQPRVLLVVALLALLASARASEAEDASLLSFMQGYMKHATKTAKDALSSVQESQVAQQAR
GWVTDGFSSLKDYWSTVKDKFSEFWDLDPEVRPTSAVAA
>NP_000030.1_#_39_#_842 8  3 exon (s) 249419 - 250998 267 aa, chain - ## gi|4557321|ref|NP_000030.1| apolipoprotein A-I precursor [Homo sapiens]  ## 267 ## orf_perfect
MKAAVLTLAVLFLTGSQARHFWQQDEPPQSPWDRVKDLATVYVDVLKDSGRDYVSQFEGS
ALGKQLNLKLLDNWDSVTSTFSKLREQLGPVTQEFWDNLEKETEGLRQEMSKDLEEVKAK
VQPYLDDFQKKWQEEMELYRQKVEPLRAELQEGARQKLHELQEKLSPLGEEMRDRARAHV
DALRTHLAPYSDELRQRLAARLEALKENGGARLAEYHAKATEHLSTLSEKAKPALEDLRQ
GLLPVLESFKVSFLSALEEYTKKLNTQ
>FGENESH: 9  18 exon (s)    260007 -    290570 1041 aa, chain - ## BY PROTMAP: gi|37360178|dbj|BAC98067.1| mKIAA0999 protein [Mus musculus] ## 1052 ## 89.0 1.00 1.00 1903.0
MKLGDADPNFDRLIAECQQLKEERQVDPLNEDVLLAMEDMGLDKEQTLQSLRSDAYDHYS
AIYSLLCDRHKRHKTLRLGALPSMPRALAFQAPVNIQAEQAGTAMNISVPQPDGTLNLDS
DEGEEPSPEALVRYLSMRRHTVGVADPRTEVMEDLQKLLPGFPGVNPQAPFLQVAPNVNF
MHNLLPMQNLQPTGQLEYKEQSLLQPPTLQLLNGMGPLGRRASDGGANIQLHAQQLLKRP
RGPSPLVTMTPAVPAVTPVDEESSDGEPDQEAVQRYLANRSKRHTLAMTNPTAEIPPDLQ
RQLGQQPFRSRVWPPHLVPDQHRSTYKDSNTLHLPTERFSPVRRFSDGAASIQAFKAHLE
KMGNNSSIKQLQQECEQLQKMYGGQIDERTLEKTQQQHMLYQQEQHHQILQQQIQDSICP
PQPSPPLQAACENQPALLTHQLQRLRIQPSSPPPNHPNNHLFRQPSNSPPPMSSAMIQPH
GAASSSQFQGLPSRSAIFQQQPENCSSPPNVALTCLGMQQPAQSQQVTIQVQEPVDMLSN
MPGTAAGSSGRGISISPSAGQMQMQHRTNLMATLSYGHRPLSKQLSADSAEAHSLNVNRF
SPANYDQAHLHPHLFSDQSRGSPSSYSPSTGVGFSPTQALKVPPLDQFPTFPPSAHQQPP
HYTTSALQQALLSPTPPDYTRHQQVPHILQGLLSPRHSLTGHSDIRLPPTEFAQLIKRQQ
QQRQQQQQQQQQQEYQELFRHMNQGDAGSLAPSLGGQSMTERQALSYQNADSYHHHTSPQ
HLLQIRAQECVSQASSPTPPHGYAHQPALMHSESMEEDCSCEGAKDGFQDSKSSSTLTKG
CHDSPLLLSTGGPGDPESLLGTVSHAQELGIHPYGHQPTAAFSKNKVPSREPVIGNCMDR
SSPGQAVELPDHNGLGYPARPSVHEHHRPRALQRHHTIQNSDDAYVQLDNLPGMSLVAGK
ALSSARMSDAVLSQSSLMGSQQFQDGENEECGASLGGHEHPDLSDGSQHLNSSCYPSTCI
TDILLSYKHPEVSFSMEQAGV
>FGENESH: 10  7 exon (s) 302565  -  372401  241 aa, chain -
MQVVKGYIIQDFLQETKQLVAIKIIDKTQLDEENLKKIFREVQIMKMLCHPHIIRLYQVM
ETERMIYLVTEYASGGEIFDHLVAHGRMAEKEARRKFKQIVTAVYFCHCRNIVHRDLKAE
NLLLDANLNIKIADFGFSNLFTPGQLLKTWCGSPPYAAPELFEGKEYDGPKVDIWSLGVV
LYVLVCGALPFDGSTLQNLRARVLSGKFRIPFFMSTVFPVSNLDLFYPPPPSPHTMARDS
Y
>NP_001021.1_#_36_#_290 11  1 exon (s) 449408 - 449662 84 aa, chain - ## gi|4506711|ref|NP_001021.1| ribosomal protein S27 [Homo sapiens]  ## 84 ## orf_perfect
MPLAKDLLHPSPEEEKRKHKKKRLVQSPNSYFMDVKCPGCYKITTVFSHAQTVVLCVGCS
TVLCQPTGGKARLTEGCSFRRKQH

If Fgenesh++ does not find any reliable genes, the output looks like this:

FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA
 Seq name: small_seq
 Length of sequence: 60
 no reliable predictions
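When many *.resn3 files are processed, it can be convenient to read the summary lines and detect the "no reliable predictions" case automatically. The following Python sketch does this for the text of one output file; it relies only on the header lines shown in the examples of this appendix, and the function name parse_summary is hypothetical.

```python
import re


def parse_summary(text):
    """Extract sequence name, length and gene counts from Fgenesh++ output text."""
    summary = {"reliable": "no reliable predictions" not in text}
    m = re.search(r"Seq name:\s*(\S+)", text)
    summary["seq_name"] = m.group(1) if m else None
    m = re.search(r"Length of sequence:\s*(\d+)", text)
    summary["length"] = int(m.group(1)) if m else None
    m = re.search(
        r"Number of predicted genes\s+(\d+)\s+in \+chain\s+(\d+)\s+in -chain\s+(\d+)",
        text,
    )
    # Gene counts as (total, plus strand, minus strand); zeros if the line is absent.
    summary["genes"] = tuple(int(g) for g in m.groups()) if m else (0, 0, 0)
    return summary


empty = """FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA
 Seq name: small_seq
 Length of sequence: 60
 no reliable predictions
"""
print(parse_summary(empty))
```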
