This software can NOT be copied or distributed without Softberry license.
Copyright Softberry, Inc. 1999-2026.
Fgenesh++ package includes the following copyrighted software:
Fgenesh - highly accurate and fast HMM-based gene prediction program;
Fgenesh+ - gene prediction program that uses similar protein information;
mrna_map - program for mapping full-length mRNAs to genome
(genome alignment with splice sites identification)
and mapping partial mRNAs (not full-length mRNAs, ESTs)
to improve gene prediction accuracy;
prot_map - mapping protein database to genomic sequences;
Perl scripts.
Fgenesh++ package uses the following public software:
BLAST+ blastp or BLAST blastall, bl2seq programs
Fgenesh++ can be used with the following public data:
RefSeq database;
NR database (non-redundant protein database);
OrthoDB protein database.
Fgenesh++ requires sequences for analysis:
Fgenesh++ can accept the following sources of evidence for prediction of genes.
1) full-length mRNAs for a specified organism
full-length mRNAs are aligned to genomic sequences and "mRNA-supported" genes are predicted based on these alignments
(full-length mRNAs can be obtained from transcripts assembled by Trinity or similar programs or from RefSeq)
2) protein DB (or its part) with known proteins
proteins from DB (or some set of proteins, for example, from closely related species) are aligned to genomic sequences and "protein-supported" genes are predicted based on these alignments
3) partial mRNAs (or ESTs) for a specified organism
partial mRNAs (or ESTs) are aligned to genomic sequences to get a list of potential splice sites that are utilised by Fgenesh to improve gene models during ab initio gene prediction
4) RNASeq data (reads) for a specified organism
reads are mapped to genomic sequences (by ReadsMap pipeline) to get a list of potential splice sites that are utilised by Fgenesh to improve gene models during ab initio gene prediction
Notes:
- if RNASeq reads are available for a specified organism, it is recommended to assemble the reads into transcripts with Trinity or a similar program, split the assembled transcripts into full-length mRNAs and partial (not full-length) mRNAs, and use them as sources of evidence 1) and 3);
- the ReadsMap pipeline is currently being reworked to run in a more automatic manner; the revised version will be available in a future release of Fgenesh++
5) gene prediction parameters trained for a specified (or closely related) organism
ab initio genes are predicted by Fgenesh that runs with parameters trained for a specified (or closely related) organism
Ab initio predictions can use hints from 3), 4) if such data is available.
See the section "Types of evidence" for more details.
Copy FGENESHPIPE/ directory from the distribution kit to appropriate directory.
Make configuration file (see "Configuration file") or copy from some example.
A set of additional scripts (to prepare sequences, data, etc.) is provided with Fgenesh++.
If some scripts do not work because they cannot find the required Perl modules, two files are provided to solve this issue:
softberry_PERL5LIB.sh
softberry_wrapper.sh
You can either set PERL5LIB in your current environment or use a wrapper:
1) set PERL5LIB environment variable with the path to FGENESHPIPE/Modules/
directory
Open softberry_PERL5LIB.sh in some editor and put the right path to
FGENESHPIPE/Modules/
Then run one of the following commands:
. softberry_PERL5LIB.sh
source softberry_PERL5LIB.sh
Finally, check that PERL5LIB environment variable is set.
echo $PERL5LIB
2) use softberry_wrapper.sh to run a script
Open softberry_wrapper.sh in some editor and put the right path to
softberry_PERL5LIB.sh shell script.
Then run a script with this wrapper:
<path_to>/softberry_wrapper.sh <path_to>/script.pl [args and options]
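If the scripts are started from another program rather than from an interactive shell, the same effect can be had by setting PERL5LIB only for the child process. The following is a minimal Python sketch of that idea, not part of the distribution: the function names perl_env and run_perl_script are hypothetical, and it assumes that perl is available on PATH.

```python
import os
import subprocess

def perl_env(modules_dir, base_env=None):
    """Return a copy of the environment with modules_dir prepended to PERL5LIB."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PERL5LIB", "")
    # PERL5LIB is a list of directories separated by os.pathsep (':' on Unix)
    env["PERL5LIB"] = modules_dir + (os.pathsep + current if current else "")
    return env

def run_perl_script(script, args, modules_dir):
    """Run a Perl script with PERL5LIB pointing at FGENESHPIPE/Modules/ (assumes 'perl' on PATH)."""
    return subprocess.run(["perl", script, *args],
                          env=perl_env(modules_dir), check=True)
```

The environment of the calling shell is left unchanged, which is the same design choice the wrapper script makes.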
There are three examples to run the pipeline that are provided with this distribution.
EXAMPLE_nGASP/ is an example for three C. elegans sequences
(nGASP training sequences 1.seq, 2.seq, 3.seq)
EXAMPLE_chr22/ is an example for human chr22
EXAMPLE_chr2x/ is an example for human chr20, chr21, chr22
Genomic sequences can be either complete chromosomes or scaffolds, contigs, etc.
Repeat-masked sequences are recommended (especially for large genomes such as human). Repeats in genomic sequences should be masked with lowercase letters (not with Ns).
If you run the pipeline with -masked option then it is assumed that lowercase
letters in genomic sequences are soft-masked repeats.
If you run the pipeline without -masked option then it is assumed that repeats
are not masked even if genomic sequences contain lowercase letters.
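Before deciding whether to pass -masked, it can help to check how much of a sequence file is actually lowercase. The sketch below is a hypothetical helper (lowercase_fraction is not part of Fgenesh++) that reports the fraction of lowercase letters in FASTA text. Whether lowercase really means "masked repeat" depends on how the file was produced, so treat the number only as a hint.

```python
def lowercase_fraction(fasta_text):
    """Fraction of sequence letters that are lowercase (possibly soft-masked)."""
    total = 0
    lower = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            continue  # skip deflines
        for ch in line.strip():
            if ch.isalpha():
                total += 1
                lower += ch.islower()
    return lower / total if total else 0.0
```

A value near zero suggests the sequences are not soft-masked and -masked would have no effect.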
Sequences for annotation can be provided to the pipeline as a multi-FASTA file or a list of single-FASTA files.
a) put all sequences into a multi-FASTA file and provide it to the pipeline with
an option -fasta <fasta_file> or -f <fasta_file>
(for example, -f seq.fa)
b) put each sequence in a separate single-FASTA file, make a list with absolute
or relative paths to single-FASTA files with sequences and provide it
to the pipeline with an option -list <seq_list> or -l <seq_list>
(for example, -l seq.list)
Note: if you provide relative paths to sequences they should be given relative to the directory from which you start the pipeline (relative to the current working directory).
EXAMPLE
file 'seq.list'
# (with absolute paths)
/data/Genome/SEQ/1.seq
/data/Genome/SEQ/2.seq
/data/Genome/SEQ/3.seq
...
or
# (with relative paths)
../human/SEQ/chr20.fa
../human/SEQ/chr21.fa
../human/SEQ/chr22.fa
Files with sequences can be named in any way, with or without extension(s).
(!) But please do NOT use the extensions .N and .M - they are reserved for the pipeline.
Examples of file names for files with sequences:
1.seq chr22.fa ENm003.seq ENm003.fa ENm003.hg17.fa ENm003
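If sequences arrive as one multi-FASTA file but a list of single-FASTA files is preferred, the conversion can be scripted. The sketch below is an illustration only: write_single_fastas is a hypothetical helper, it assumes the list file holds one path per line, and it names files 1.fa, 2.fa, ... so that the reserved extensions .N and .M are never used.

```python
import os

def write_single_fastas(fasta_text, out_dir, list_path):
    """Split multi-FASTA text into <n>.fa files and write a list of their absolute paths."""
    os.makedirs(out_dir, exist_ok=True)
    records = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])             # start a new record at each defline
        elif records and line.strip():
            records[-1].append(line.strip())   # sequence line of the current record
    paths = []
    for number, record in enumerate(records, start=1):
        path = os.path.abspath(os.path.join(out_dir, f"{number}.fa"))
        with open(path, "w") as handle:
            handle.write("\n".join(record) + "\n")
        paths.append(path)
    with open(list_path, "w") as handle:
        handle.writelines(path + "\n" for path in paths)
    return paths
```

Absolute paths are written so that the list works regardless of the directory from which the pipeline is started.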
Organism-specific gene prediction parameters are usually located in
FGENESHPIPE/PARAM/matrices/ directory.
For example,
C_elegans_nGASP - gene prediction parameters for C.elegans used in
the example EXAMPLE_nGASP/
Human - gene prediction parameters for Human used in
the examples EXAMPLE_chr22/ and EXAMPLE_chr2x/
The parameters file contains parameters that regulate the gene prediction process.
For some parameters you might want to set values other than those
initially provided in the file, e.g., the maximal intron length allowed in a gene
or the minimal length of an intron suitable for prediction of genes in introns of
other genes. Some parameters require expertise to adjust
properly.
We provide parameters files for mammals and for non-mammals:
mammals.par - for mammals
non_mamm.par - for non-mammals
Put the path to parameters file in configuration file (see "Configuration file").
In configuration file you can indicate which steps of the pipeline to run, provide absolute or relative paths to parameters files, data files or directories and set other options.
Note: if you provide relative paths they should be given relative to the directory from which you start the pipeline (relative to the current working directory).
EXAMPLE
(similar to EXAMPLE_nGASP/run/ce.cfg)
#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters
GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP   # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par               # location of parameters files

# Predict genes with GC donor splice sites or not
PREDICT_GC = 1

#---------------------------------------------------------------------
# Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)
NUM_THREADS = 1       # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1
MRNA_THREADS = 1      # number of threads to map full-length mRNAs
EST_THREADS = 1       # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1   # number of threads to map proteins or run 'blast'

#---------------------------------------------------------------------
# Mapping full-length mRNAs
MAP_MRNAS = 1                                        # map full-length mRNAs to genomic sequences
MRNA_FILE = /data/EXAMPLE_nGASP/DATA/ngasp.mrna.fa   # full-length mRNAs

#---------------------------------------------------------------------
# Mapping ESTs (partial mRNAs)
MAP_ESTS = 1                                         # map ESTs (partial mRNAs) to sequences
EST_FILE = /data/EXAMPLE_nGASP/DATA/rna_matches.fa   # file with ESTs (partial mRNAs)

#---------------------------------------------------------------------
# Using reads
USE_READS = 0                                  # use reads info to improve gene models
DIR_SITES = /data/EXAMPLE_nGASP/reads_sites/   # directory with reads *.sites files

#---------------------------------------------------------------------
# Using known proteins from DB
# (predict genes based on homology to known proteins)
USE_PROTEINS = 1                                          # 0 - no, 1 - yes
PROTEIN_DB = /data/EXAMPLE_nGASP/NR_ce/nr_ce              # protein DB
PROTEIN_DB_INDEX = /data/EXAMPLE_nGASP/NR_ce/nr_ce.ind    # protein DB index file
PROTEIN_DB_TAG = NR                                       # short name for protein DB

#---------------------------------------------------------------------
# Location of BLAST+ or BLAST programs

# BLAST+ programs
BLASTP = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blastp (protein vs protein DB)
BLAST2 = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blast 2 proteins

# BLAST programs
#
# BLASTP = /path_to/blast-2.2.26/bin/blastall   # blastp (protein vs protein DB)
# BLAST2 = /path_to/blast-2.2.26/bin/bl2seq     # blast 2 proteins

#---------------------------------------------------------------------
# Predicting ab initio genes
# (including genes with evidence from partial mRNAs or reads)
PREDICT_AB_INITIO = 1   # predict ab initio genes

# Finding protein homologs (from protein database) for ab initio genes
BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)

# Predicting ab initio genes in long introns of other genes
INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes
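For checking a configuration file from a script, the KEY = value layout shown above is easy to read into a dictionary. The sketch below is not the pipeline's own parser: parse_config is a hypothetical helper, and it assumes that '#' always starts a comment and never occurs inside a value.

```python
def parse_config(text):
    """Parse 'KEY = value   # comment' lines into a dict of strings."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop the comment part
        if not line or "=" not in line:
            continue                              # blank or comment-only line
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings
```

Commented-out settings, such as the alternative BLAST paths, are ignored because the whole line is treated as a comment.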
In the line GENE_PARAM provide location of a file with organism-specific
gene prediction parameters, for example:
GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP # gene prediction parameters
In the line PIPE_PARAM provide location of a file with pipeline general
parameters. Currently, we have two such files, one for mammals and one for
all other organisms (non-mammals):
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/mammals.par    # for mammals
or
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par   # for non-mammals
The pipeline can predict genes with GC donor splice sites. You can switch this option ON (1) or OFF (0).
PREDICT_GC = 1
Several programs of the pipeline support parallelization - they can automatically split a set of query sequences (full-length or partial mRNAs, proteins from DB) into subsets and perform calculations with such subsets in parallel.
You can specify the number of threads (cores) for all such programs or set an individual value for each program.
For example:
# (if NUM_THREADS > 1 then other threads settings are not used)
NUM_THREADS = 10      # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1
MRNA_THREADS = 1      # number of threads to map full-length mRNAs
EST_THREADS = 1       # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1   # number of threads to map proteins or run 'blast'
or (the same)
# (if NUM_THREADS > 1 then other threads settings are not used)
NUM_THREADS = 1       # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1
MRNA_THREADS = 10     # number of threads to map full-length mRNAs
EST_THREADS = 10      # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 10  # number of threads to map proteins or run 'blast'
See also "Running the pipeline in parallel" for strategies to run calculations in parallel.
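The precedence rule between the general and the per-program thread settings can be written down in a few lines. This is a sketch of the rule as described in the config comments, not code taken from the pipeline; effective_threads is a hypothetical name, and treating a missing key as 1 is an assumption.

```python
THREAD_KEYS = ("MRNA_THREADS", "EST_THREADS", "PROTEIN_THREADS")

def effective_threads(cfg):
    """Threads each program would use, following the rule in the config comments."""
    general = int(cfg.get("NUM_THREADS", 1))
    if general > 1:
        # the general setting overrides the individual ones
        return {key: general for key in THREAD_KEYS}
    return {key: int(cfg.get(key, 1)) for key in THREAD_KEYS}
```

Applied to the two example settings above, the function returns the same result, 10 threads for every program.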
Then, you can switch ON (1) or OFF (0) each of the pipeline steps:
MAP_MRNAS = 1           # map full-length mRNAs to genomic sequences
MAP_ESTS = 1            # map ESTs (partial mRNAs) to genomic sequences
USE_READS = 0           # use reads info to improve gene models
USE_PROTEINS = 1        # 0 - no, 1 - yes
PREDICT_AB_INITIO = 1   # predict ab initio genes
BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)
INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes
Note:
in the current version of the pipeline either the "using partial mRNAs" (MAP_ESTS) step or the "using reads" (USE_READS) step can be switched ON, but not both at the same time.
For each step (if it is switched ON) provide all the necessary data / options.
MAP_MRNAS = 1
if full-length mRNAs are available for a specified organism then make
a FASTA file with full-length mRNAs (see Using full-length mRNAs)
and provide its location in MRNA_FILE line
MAP_ESTS = 1
if partial mRNAs (or ESTs) are available for a specified organism then
make a FASTA file with partial mRNAs (or ESTs) (see Using partial mRNAs)
and provide its location in EST_FILE line
USE_READS = 1
if RNASeq reads are available for a specified organism they can be mapped
to genomic sequences by ReadsMap pipeline and a directory with obtained
*.sites files (with potential splice sites) can be provided in DIR_SITES
line
(note, however, that in the current release this step is not fully automatic:
it requires running the ReadsMap pipeline and preparing *.sites files;
also, MAP_ESTS = 1 and USE_READS = 1 cannot be used simultaneously;
this will be improved in future releases of Fgenesh++)
USE_PROTEINS = 1
"Prediction of genes based on homology to known proteins" (from some protein DB or its subset)
PROTEIN_DB = /data/NR/nr_ce # protein DB
Prepare protein database (NR, OrthoDB or its subset or custom protein database,
see APPENDIX 3) and provide its location in PROTEIN_DB line.
PROTEIN_DB_INDEX = /data/NR/nr_ce.ind # protein DB index file
Prepare index file for a protein database (see APPENDIX 3) and provide its
location in PROTEIN_DB_INDEX line.
PROTEIN_DB_TAG = NR # short name for protein DB
Short name or tag for protein database. It is used in deflines of ab initio predicted genes / proteins to indicate a database where homolog / BLAST hit is found to a predicted protein, for example:
>FGENESH: 10 5 exon (s) 721921 - 735306 240 aa, chain + ## NR: EAW59798.1 adaptor-related protein complex 1, beta 1 subunit, isoform CRA_a [Homo sapiens] ## query_len = 240 match_len = 229 query_cov = 0.36 match_cov = 0.38 query_x1 = 100 query_x2 = 185 match_x1 = 48 match_x2 = 133 score = 139 pid = 79.0 evalue = 5e-40
MADVRLLAESARHDGNTIGHKSTEVLALTWMLPLKLVVANAVAALSEIAESHPSNNLLHL
NPQFINKLLTALIECTEWGQIFILDCLTNYTPKDDREAQRGETSLFFHLKRAGFADMTGP
KEQVSLISSFSSSLLRILPGLSSGIRTNEEETVTALAANTHGQGSSIHLWEDPKFDTEDV
KPSPNWKAKARSSFLSTPLCLPSFTTKEDGSAAAQSLLVNHRPPYTPAPVHIPQGSSWDS
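For downstream filtering it can be convenient to pull the database tag, the hit ID and the alignment statistics out of such a defline. The sketch below is based only on the single example shown above and is not an official format parser: parse_hit_defline is a hypothetical helper, and it assumes that the three parts are separated by '##' and that the statistics are written as 'name = number'.

```python
import re

def parse_hit_defline(defline):
    """Extract DB tag, hit ID, description and numeric statistics from a defline."""
    parts = [part.strip() for part in defline.split("##")]
    if len(parts) < 3:
        return None  # no homolog information on this defline
    tag, _, hit = parts[1].partition(":")
    hit_id, _, description = hit.strip().partition(" ")
    # statistics look like 'query_len = 240 ... evalue = 5e-40'
    stats = {name: float(value)
             for name, value in re.findall(r"(\w+) = (\S+)", parts[2])}
    return {"db_tag": tag.strip(), "hit_id": hit_id,
            "description": description, "stats": stats}
```

Deflines of predictions without a BLAST hit are expected to lack the '##' parts, in which case the function returns None.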
Provide paths to BLAST+ or BLAST executables for
"protein vs protein DB" and
"protein vs protein" BLAST comparisons.
For BLAST+, the program is
blastp
# Location of BLAST+ or BLAST programs
# BLAST+ programs
BLASTP = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blastp (protein vs. protein DB)
BLAST2 = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blast 2 proteins
or, if you can run BLAST+ without paths
BLASTP = blastp   # blastp (protein vs. protein DB)
BLAST2 = blastp   # blast 2 proteins
For BLAST, programs are
blastall and bl2seq
# BLAST programs
BLASTP = /path_to/blast-2.2.26/bin/blastall   # blastp (protein vs. protein DB)
BLAST2 = /path_to/blast-2.2.26/bin/bl2seq     # blast 2 proteins
PREDICT_AB_INITIO = 1 # predict ab initio genes
Predict ab initio genes (including genes with evidence from partial mRNAs
or reads) or not.
Ab initio genes are predicted in regions where no genes are predicted by
mapping full-length mRNAs or known proteins from a protein DB.
BLAST_AI_PROTEINS = 1 # find homologs for ab initio predicted genes (0 - no, 1 - yes)
Find homologs (BLAST hits) to ab initio predicted genes (proteins) or not.
INTRONIC_GENES = 0 # predict ab initio genes in long introns of other genes
Optional step, prediction of ab initio genes in long introns of other genes.
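The rules above (required data for each switched-ON step, and the MAP_ESTS / USE_READS restriction) can be checked before a long run is started. The sketch below is an illustration under stated assumptions: check_steps and the REQUIRED table are hypothetical, the table lists only the settings named in this section, and whether the pipeline itself performs the same checks is not stated here.

```python
# Settings that this section asks for when a step is switched ON (1).
REQUIRED = {
    "MAP_MRNAS": ["MRNA_FILE"],
    "MAP_ESTS": ["EST_FILE"],
    "USE_READS": ["DIR_SITES"],
    "USE_PROTEINS": ["PROTEIN_DB", "PROTEIN_DB_INDEX"],
}

def check_steps(cfg):
    """Return a list of human-readable problems found in a config dict."""
    def is_on(key):
        return str(cfg.get(key, "0")).strip() == "1"

    problems = []
    if is_on("MAP_ESTS") and is_on("USE_READS"):
        problems.append("MAP_ESTS and USE_READS cannot both be 1")
    for step, keys in REQUIRED.items():
        if not is_on(step):
            continue  # steps that are OFF may leave their paths empty
        for key in keys:
            if not str(cfg.get(key, "")).strip():
                problems.append(f"{step} = 1 but {key} is not set")
    return problems
```

An empty list means that no problem of these two kinds was found; it does not verify that the files exist.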
EXAMPLE_nGASP/run/
ce.cfg          - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
ce_mrnas.cfg    - Mapping full-length mRNAs, partial mRNAs, Ab initio
ce_proteins.cfg - Mapping proteins from DB, Ab initio
ce_reads.cfg    - Mapping full-length mRNAs, proteins from DB, Using reads, Ab initio
EXAMPLE_chr22/run/
human.cfg          - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
human_mrnas.cfg    - Mapping full-length mRNAs, partial mRNAs, Ab initio
human_proteins.cfg - Mapping proteins from DB, Ab initio
EXAMPLE_chr2x/run/
human.cfg          - Mapping full-length mRNAs, partial mRNAs, proteins from DB, Ab initio
human_mrnas.cfg    - Mapping full-length mRNAs, partial mRNAs, Ab initio
human_proteins.cfg - Mapping proteins from DB, Ab initio
If some steps are switched OFF (0), you can leave the paths and options for these steps empty or remove such steps (with the corresponding paths and options) from the configuration file altogether.
For example, this is how a configuration file can look if only "Mapping full-length mRNAs", "Mapping partial mRNAs" and "Predict ab initio" steps are used -
EXAMPLE_nGASP/run/ce_mrnas.cfg
Note:
in this example ngasp.mrna.fa contains only 253 mRNAs, which is not enough
to annotate sequences with the "Mapping full-length mRNAs" step alone; in real
use, add some protein DB as well.
However, if you have a large set of full-length mRNAs from assembled transcripts and/or RefSeq mRNAs, you might want to skip the "Using known proteins from DB" step to speed up calculations.
#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters
GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP   # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par               # location of parameters files

# Predict genes with GC donor splice sites or not
PREDICT_GC = 1

# Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)
NUM_THREADS = 10      # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1
MRNA_THREADS = 1      # number of threads to map full-length mRNAs
EST_THREADS = 1       # number of threads to map partial mRNAs / ESTs

# Mapping full-length mRNAs
MAP_MRNAS = 1                                        # map full-length mRNAs to genomic sequences
MRNA_FILE = /data/EXAMPLE_nGASP/DATA/ngasp.mrna.fa   # full-length mRNAs

# Mapping ESTs (partial mRNAs)
MAP_ESTS = 1                                         # map ESTs (partial mRNAs) to genomic sequences
EST_FILE = /data/EXAMPLE_nGASP/DATA/rna_matches.fa   # file with ESTs (partial mRNAs)

# Predicting ab initio genes
# (including genes with evidence from partial mRNAs or reads)
PREDICT_AB_INITIO = 1   # predict ab initio genes

# Predicting ab initio genes in long introns of other genes
INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes
And this is how a configuration file can look if only "Predict genes based on homology to proteins from DB" and "Predict ab initio" steps are used
EXAMPLE_nGASP/run/ce_proteins.cfg
#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters
GENE_PARAM = /path_to/FGENESHPIPE/PARAM/matrices/C_elegans_nGASP   # gene prediction parameters
PIPE_PARAM = /path_to/FGENESHPIPE/PARAM/non_mamm.par               # location of parameters files

# Predict genes with GC donor splice sites or not
PREDICT_GC = 1

# Number of threads
# (if NUM_THREADS > 1 then other threads settings are not used)
NUM_THREADS = 10      # number of threads (general)

# the following threads settings are used if NUM_THREADS = 1
PROTEIN_THREADS = 1   # number of threads to map proteins or run 'blast'

# Using known proteins from DB
# (predict genes based on homology to known proteins)
USE_PROTEINS = 1                                          # 0 - no, 1 - yes
PROTEIN_DB = /data/EXAMPLE_nGASP/NR_ce/nr_ce              # protein DB
PROTEIN_DB_INDEX = /data/EXAMPLE_nGASP/NR_ce/nr_ce.ind    # protein DB index file
PROTEIN_DB_TAG = NR                                       # short name for protein DB

# Location of BLAST+ or BLAST programs

# BLAST+ programs
BLASTP = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blastp (protein vs. protein DB)
BLAST2 = /path_to/ncbi-blast-2.2.28+/bin/blastp   # blast 2 proteins

# BLAST programs
#
# BLASTP = /path_to/blast-2.2.26/bin/blastall   # blastp (protein vs. protein DB)
# BLAST2 = /path_to/blast-2.2.26/bin/bl2seq     # blast 2 proteins

# Predicting ab initio genes
# (including genes with evidence from partial mRNAs or reads)
PREDICT_AB_INITIO = 1   # predict ab initio genes

# Finding protein homologs (from protein database) for ab initio genes
BLAST_AI_PROTEINS = 1   # find homologs for ab initio predicted genes (0 - no, 1 - yes)

# Predicting ab initio genes in long introns of other genes
INTRONIC_GENES = 0      # predict ab initio genes in long introns of other genes
run_pipe.pl (help and options)
Run run_pipe.pl without parameters for help.
Usage: run_pipe.pl <config_file> -f <fasta_file> | -l <seq_list> [-masked] -d <results_dir> [options]
where
<config file> - configuration file
------------------------------------------------------------------------------------------
Source of sequences:
-fasta <fasta_file>,
-f <fasta_file> - file with sequences in multi-FASTA format
-list <seq_list>,
-l <seq_list> - list of single-FASTA files with sequences
-masked - sequences are soft-masked (repeats are masked by lowercase letters)
------------------------------------------------------------------------------------------
Options to set a range of sequences for calculations:
-from <seq_from_No> - run for sequences from <seq_from_No>
-to <seq_to_No> - run for sequences to <seq_to_No>
-seq <seq_No_1,...> - run for sequences <seq_No_1>, etc.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Note:
options -from, -to, -seq cannot be used together with
options -jobs <N>, -chunk <n>
------------------------------------------------------------------------------------------
Options to split sequences into subsets and run calculations in parallel:
-jobs <N>,
-j <N> - split sequences into N parts and
print commands to run N processes in parallel
-chunk <n> - split sequences into chunks (<n> sequences each) and
print commands to run <number_processes> in parallel
where
<number_processes> = ( number_of_sequences / chunk ) + 1
The following options are used only with -jobs <N> or -chunk <n>:
-make_sh - print commands to .sh file
'run_cmd.sh.XXXX' by default (where XXXX is unique substring) or
use option -sh <filename> to provide name for .sh file
-sh_name <filename>,
-sh <filename> - filename for .sh file
(this option automatically switches ON -make_sh)
-run - run commands (start calculations) in parallel
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Note (!):
with -run option all processes are started at once!
use -jobs <N> or
choose -chunk <n> with care for not to overload your system
------------------------------------------------------------------------------------------
Options to make separate .res files for each sequence (only with -f <fasta_file>):
-split_res - if sequences are in multi-FASTA file,
make separate .res files for each sequence
(this option is used with -f <fasta_file>)
(separate .res files are named according to options
-res_name_by_count, -res_name_by_seqid, see below)
-res_name_by_count - name .res files by assigning numbers like 1.fa.res, 2.fa.res, ...
(by default)
-res_name_by_seqid - name .res files by assigning seq IDs like chr20.fa.res, chrX.fa.res
(these options are used with -split_res)
----------------------------------------------------------------
Note:
With no -split_res option,
naming of result files for multi-fasta input looks like this:
a) if no -from, -to or -seq options are provided
then results are saved in a file
<fasta_file>.res
for example:
-f chr.fa -> chr.fa.res
b) if -from <from> and -to <to> options are provided
then results are saved in a file
<fasta_file>.res.<from>-<to>
for example:
-f chr.fa -from 5 -to 10 -> chr.fa.res.5-10
-f chr.seq -from 5 -to 5 -> chr.seq.res.5 (here from = to)
c) if -seq <seq_No> or
-seq <seq_No_1,seq_No_2,...> option is provided
then results are saved in a file
<fasta_file>.res.<seq_No> or
<fasta_file>.res.<seq_No_1>_<seq_No_2>_...
for example:
-f chr.fa -seq 3 -> chr.fa.res.3
-f chr.seq -seq 3,5,7 -> chr.seq.res.3_5_7
------------------------------------------------------------------------------------------
-d <results_dir> - directory to store results
-w <work_dir> - working directory (if not provided,
temporary directory is created automatically for each run)
-clean - clean working directory *before* calculations
(if it exists and contains any files)
-debug - do not clean working directory *after* calculations
(so that developers can see files and spot problems)
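The naming rules a), b) and c) for result files (without -split_res) can be expressed as a small function, which is handy when a script has to locate the .res file of a particular run. The sketch below only restates the rules from the help text; res_file_name is a hypothetical helper, and the case where only one of -from / -to is given is not described in the help, so the sketch refuses it.

```python
def res_file_name(fasta_file, seq_from=None, seq_to=None, seq_nos=None):
    """Expected result file name for multi-FASTA input without -split_res."""
    if seq_nos:
        # rule c): -seq 3 -> .res.3 ; -seq 3,5,7 -> .res.3_5_7
        return fasta_file + ".res." + "_".join(str(n) for n in seq_nos)
    if seq_from is None and seq_to is None:
        # rule a): no range options
        return fasta_file + ".res"
    if seq_from is None or seq_to is None:
        raise ValueError("naming with only one of -from / -to is not described in the help text")
    if seq_from == seq_to:
        # rule b), special case from = to
        return f"{fasta_file}.res.{seq_from}"
    return f"{fasta_file}.res.{seq_from}-{seq_to}"
```

For example, res_file_name("chr.fa", seq_from=5, seq_to=10) gives chr.fa.res.5-10, matching the example in the help text.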
In examples below the following options mean
-f seq.fa - run for a multi-fasta file
-l seq.list - run for a list of sequences (a list of single-fasta files)
1) run for a whole set of sequences (for small genomes)
run_pipe.pl celegans.cfg -f seq.fa -d results &> run.log &
run_pipe.pl celegans.cfg -l seq.list -d results &> run.log &
use -masked for repeats-masked sequences
run_pipe.pl human.cfg -f seq.fa -masked -d results &> run.log &
run_pipe.pl human.cfg -l seq.list -masked -d results &> run.log &
2) run for a range of sequences (from a multi-fasta file or a list of sequences)
run_pipe.pl celegans.cfg -f seq.fa -from 1 -to 5 -d results &> run_1.log &
run_pipe.pl celegans.cfg -f seq.fa -from 6 -to 10 -d results &> run_2.log &
or
run_pipe.pl celegans.cfg -l seq.list -seq 4 -d results &> run.log &
run_pipe.pl celegans.cfg -l seq.list -seq 4,7,9 -d results &> run.log &
make separate .res files for each sequence
run_pipe.pl celegans.cfg -f seq.fa -from 1 -to 5 -d results -split_res &> run.log &
run_pipe.pl celegans.cfg -l seq.list -seq 4,7,9 -d results -split_res &> run.log &
3) run for a whole set of sequences (in parallel)
print commands (and see how a set of sequences is split)
run_pipe.pl celegans.cfg -f seq.fa -d results -jobs 5
run_pipe.pl celegans.cfg -l seq.list -d results -chunk 100
make .sh file with commands (and -run to run it)
run_pipe.pl human.cfg -f seq.fa -masked -d results -jobs 5 -sh run_cmd.sh
run_pipe.pl human.cfg -f seq.fa -masked -d results -jobs 5 -sh run_cmd.sh -run
make separate .res files for each sequence
run_pipe.pl human.cfg -f seq.fa -masked -d results -jobs 5 -sh run_cmd.sh -split_res
run_pipe.pl human.cfg -f seq.fa -masked -d results -jobs 5 -sh run_cmd.sh -split_res -run
4) run for subsets of sequences (manually created from a multi-fasta file or a list of sequences)
run_pipe.pl celegans.cfg -f seq_1.fa -d results &> run_1.log &
run_pipe.pl celegans.cfg -f seq_2.fa -d results &> run_2.log &
...
run_pipe.pl human.cfg -l seq_1.list -masked -d results &> run_1.log &
run_pipe.pl human.cfg -l seq_2.list -masked -d results &> run_2.log &
...
saving results in different directories and providing working directories explicitly
run_pipe.pl human.cfg -f seq_1.fa -masked -d results_1/ -w work_1/ &> run_1.log &
run_pipe.pl human.cfg -f seq_2.fa -masked -d results_2/ -w work_2/ &> run_2.log &
...
run_pipe.pl celegans.cfg -l seq_1.list -d results_1/ -w work_1/ &> run_1.log &
run_pipe.pl celegans.cfg -l seq_2.list -d results_2/ -w work_2/ &> run_2.log &
...
Sequences for annotation can be provided to the pipeline in two ways:
a) as a multi-FASTA file via -f <fasta_file> option or
b) as a list of single-FASTA files via -l <seq_list> option.
If sequences are soft-masked (repeats are masked by lowercase letters)
then also use -masked option.
Directory with results will contain .res files with Fgenesh++ predictions
(predicted gene structures and corresponding proteins in Fgenesh++ format -
see APPENDIX 4 for details).
In case of multiple .res files you can merge them into one multi-res file
with the following script:
SCRIPTS/merge_res_files.pl
If a working directory is not provided explicitly (with -w <work_dir>), a unique temporary working directory for temporary and intermediate files is created automatically and cleaned after calculations are finished.
The pipeline prints log output to stdout and stderr (errors go to stderr).
This output is of little use to most users but can help developers spot errors and bugs. To save the log output (and errors), run the pipeline with one of the following commands:
in bash, redirect stdout and stderr to one file or to two different files
(in plain sh, &> is not available; use > <log_err_file> 2>&1 for one combined file)
FGENESHPIPE/run_pipe.pl ... &> <log_err_file>
or
FGENESHPIPE/run_pipe.pl ... > <log_file> 2> <log_err_file>
in tcsh, csh
FGENESHPIPE/run_pipe.pl ... >& <log_err_file>
(Note: tcsh and csh cannot directly redirect stdout and stderr to two different files,
but >& will redirect the combined output to one file)
EXAMPLES
in bash
FGENESHPIPE/run_pipe.pl celegans.cfg -f seq.fa -d results &> run.log_err
FGENESHPIPE/run_pipe.pl celegans.cfg -f seq.fa -d results > run.log 2> run.log_err
in tcsh
FGENESHPIPE/run_pipe.pl celegans.cfg -f seq.fa -d results >& run.log_err
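When the pipeline is launched from a script instead of a shell, the same redirections can be done without shell syntax, which avoids the bash / tcsh differences. The sketch below is a generic Python illustration; run_with_logs is a hypothetical helper and the command passed to it is whatever run_pipe.pl command line you would otherwise type.

```python
import subprocess

def run_with_logs(cmd, log_path, err_path=None):
    """Run cmd (a list of arguments) and save its output.

    stdout goes to log_path; stderr goes to err_path, or is merged
    into log_path when err_path is None. Returns the exit code.
    """
    with open(log_path, "w") as out:
        if err_path is None:
            return subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT).returncode
        with open(err_path, "w") as err:
            return subprocess.run(cmd, stdout=out, stderr=err).returncode
```

A call could look like run_with_logs(["FGENESHPIPE/run_pipe.pl", "celegans.cfg", "-f", "seq.fa", "-d", "results"], "run.log", "run.log_err").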
There are several strategies to run the pipeline in parallel to annotate genomic sequences.
Basically, the pipeline can be parallelized either by splitting
1) a set of target sequences (genomic sequences to annotate) or
2) a set of query sequences (full-length or partial mRNAs, proteins from DB)
into subsets and running the pipeline for these subsets simultaneously.
Some combination of 1) and 2) is also possible.
Let us consider each approach.
1) split a set of genomic sequences into subsets and run the pipeline in parallel for these subsets
You can split genomic sequences with options -jobs <N> or -chunk <n>.
First, run pipeline commands with options -jobs <N> or -chunk <n> and
-sh <file.sh> to see how it splits sequences into subsets.
For example,
run_pipe.pl ... -jobs 10 -sh run_cmd.sh
Look at run_cmd.sh for generated commands with -from <n1> and -to <n2> options
that relate to subsets of genomic sequences.
Then, if the split looks OK to you, run .sh file or add -run to the command
line to start calculations in parallel.
For example,
./run_cmd.sh
or
run_pipe.pl ... -jobs 10 -sh run_cmd.sh -run
Notes:
a)
if you use this approach then set all THREADS values in a config file to 1
NUM_THREADS = 1       # number of threads (general)
MRNA_THREADS = 1      # number of threads to map full-length mRNAs
EST_THREADS = 1       # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 1   # number of threads to map proteins or run 'blast'
b)
the directory for result files (provided via -d <dir> option) is not
cleaned when the pipeline starts, therefore multiple pipeline runs can write
files with gene predictions into the same results directory;
but if the pipeline creates files with the same names as ones already stored in the results directory, the new files will overwrite the old ones
c)
since multiple pipeline runs are started and running at once (for subsets
of sequences) then multiple .res files (with gene predictions) will appear
in results directory - one for each subset of sequences (or one per each
sequence if -split_res option is applied)
After all calculations are complete, multiple .res files can be merged
into one multi-res file with the following script:
SCRIPTS/merge_res_files.pl
d)
you can split genomic sequences yourself using other methods or scripts,
write commands to some .sh file and start calculations (run .sh file)
e) if you have hundreds or thousands of genomic sequences (contigs), it is probably better not to start all the calculations at once (as N parallel processes with big subsets of sequences);
instead, split genomic sequences into multiple subsets (for example, with
-chunk <n>) and get a .sh file with commands. Run the commands from this file
in batches (X commands at a time). Wait until some jobs are finished (check
by inspecting running jobs or by counting .res files in the results
directory) and then run another series of commands, and so on.
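The "X commands at a time" strategy from note e) can be automated. The sketch below is a hedged illustration, not a tool shipped with Fgenesh++: load_commands and run_limited are hypothetical helpers, the layout of the generated .sh file (one command per line, possibly ending with '&') is an assumption, and commands that use bash-only syntax such as &> need shell_path set to a bash executable. Instead of waiting for a whole batch, it starts a new command as soon as one of the running ones finishes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def load_commands(sh_text):
    """Return the command lines of a .sh file, skipping blanks and comments."""
    commands = []
    for line in sh_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # also skips a '#!' line
        if line.endswith("&") and not line.endswith("&&"):
            line = line[:-1].rstrip()  # drop trailing '&' so the command blocks
        commands.append(line)
    return commands

def run_limited(commands, max_parallel, shell_path=None):
    """Run shell commands with at most max_parallel at once; return exit codes in order."""
    def run_one(command):
        return subprocess.run(command, shell=True, executable=shell_path).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))
```

A typical use would be run_limited(load_commands(open("run_cmd.sh").read()), 4), after which the list of exit codes shows which subsets need to be rerun.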
2) run the pipeline for the whole set of genomic sequences at once
but specify in a config file the number of threads for programs that support parallelization (these programs can automatically split a set of query sequences (full-length or partial mRNAs, proteins from DB) into subsets and perform calculations with such subsets in parallel).
You can specify the number of threads (cores) for all such programs or set an individual value for each program.
For example,
NUM_THREADS = 10 # number of threads (general)
or (the same)
# the following threads settings are used if NUM_THREADS = 1
MRNA_THREADS = 10 # number of threads to map full-length mRNAs
EST_THREADS = 10 # number of threads to map partial mRNAs / ESTs
PROTEIN_THREADS = 10 # number of threads to map proteins or run 'blast'
3) combination of 1) and 2) is also possible
if you want to annotate a small number of sequences with maximal speed
(and computer resources allow it) use both -jobs <N> and set the number
of threads in configuration file
For example,
run_pipe.pl ... -jobs 3 # command
NUM_THREADS = 12 # config file
Overall: 3*12 = 36 cores/threads
Conclusion:
Generally, approach 1) is faster than 2) since not all steps of the pipeline support parallelization by query sequences (at some steps the pipeline runs as 1 process).
Obviously, approach 3) is the fastest if computer resources allow you to annotate a certain number of sequences in this way.
Approach 2) is useful for a small number of sequences because it creates only
one .res file for a multi-FASTA input file
(if -split_res option is not applied).
(See also Sources of evidence from above.)
Fgenesh++ pipeline can deal with several types of evidence.
1) gene models predicted using full-length mRNA sequences
This corresponds to MAP_MRNAS block in configuration file.
See Using full-length mRNAs for details.
2) gene models predicted using proteins from some protein DB
(NR, OrthoDB or custom protein DB, or their subsets)
This corresponds to USE_PROTEINS block in configuration file.
First, prot_map maps known proteins from a protein database (for example,
NR or its subset) to genomic sequences and good mappings are selected.
Then fgenesh+ program predicts gene models in regions with good mappings
using mapped proteins.
After that predicted gene models are additionally checked / filtered by
script that analyses BLAST2 (bl2seq) alignments between predicted and
homologous proteins. Only gene models that have good coverage of predicted
and homologous proteins by BLAST alignment are selected. Some other criteria
are also checked.
These protein-supported predictions are made in regions not occupied by mRNA-supported predictions.
See APPENDIX 3 for how to prepare protein DB for calculations.
To speed up calculations make a subset of proteins that includes proteins from closely related species or some taxon.
3) using partial mRNAs to improve gene predictions
This corresponds to MAP_ESTS block in configuration file.
Partial mRNAs (including ESTs) can be used to make more accurate predictions
by fgenesh.
Partial mRNAs are aligned to genomic sequences by mrna_map and good
alignments are selected. A list of alignment blocks is compiled with
information whether such blocks are flanked at one or both ends with
splice sites or not. This information about regions (potential exons)
supported by partial mRNAs and potential splice sites is used by fgenesh
when predicting genes ab initio (see below), although predictions made with
this information are, strictly speaking, not purely ab initio.
With this approach, boundaries of CDS exons and gene structures are predicted more correctly.
Technically, using partial mRNAs is not an additional step of improvement
after making initial predictions but rather incorporating additional
information (potential splice sites and their weights) in the process of
prediction by fgenesh. By improvement we mean higher accuracy of gene
models.
Each ab initio predicted gene is represented by only one gene model (the one with the maximum weight) even if partial mRNAs support several alternative splicing gene variants.
See also Using partial mRNAs.
4) using reads to improve gene predictions
This corresponds to USE_READS block in configuration file.
Reads are aligned to genomic sequences by ReadsMap pipeline and a list of
potential splice sites is compiled for each sequence (*.sites files).
These *.sites files should be placed in some directory and path to this
directory should be provided in configuration file.
Information about potential splice sites is used by fgenesh when predicting
genes ab initio (see below). With this approach, boundaries of CDS exons and gene structures are
predicted more correctly.
Technically, as with partial mRNAs (ESTs), this additional information
is used in the process of prediction by fgenesh, not as a separate step
after making initial predictions.
Currently, evidence from partial mRNAs and evidence from reads cannot be used simultaneously in Fgenesh++. Apart from that, there is no limitation on combining sources of evidence.
5) ab initio predictions
Ab initio predictions (without full-length mRNA or protein type of evidence) are made in regions not occupied by mRNA-supported and protein-supported predictions.
To make ab initio predictions, we run fgenesh with gene prediction parameters
trained for a specified (or closely related) organism.
Some additional evidence from partial mRNAs or reads can be used by fgenesh
in the process of gene prediction (see 3, 4 from above).
To make ab initio predictions, set
PREDICT_AB_INITIO = 1
In addition, best BLAST hits to ab initio predicted genes (proteins) can be found by BLAST of predicted proteins vs protein database. To do this, set
BLAST_AI_PROTEINS = 1
6) ab initio gene predictions in long introns of other genes
Optionally, pipeline can make gene predictions in long introns of other genes.
To do this, set
INTRONIC_GENES = 1 # predict genes in long introns of other genes
If full-length mRNAs are of good quality (for example, known mRNAs from
RefSeq or assembled by Trinity or similar programs with no or little errors
in CDS part) then Fgenesh++ is able to predict genes based on these mRNA
alignments with very high accuracy.
If a set of full-length mRNAs includes isoforms then gene structures with alternative splicing are predicted.
There can be several sources for full-length mRNAs:
1) transcripts assembled by Trinity or similar programs
Assembled transcripts usually include a mix of full-length and partial mRNAs (and also chimeras).
To get full-length mRNAs from assembled transcripts we need to predict ORFs, choose the best ORF in each transcript and decide whether this transcript can be considered as full-length mRNA (by CDS part, with ATG and STOP codon present) or not.
To predict ORFs and get full-length mRNAs (with definition lines in required
format) with BestOrf program and mrnas_annot.pl script, see APPENDIX 1
If you use TransDecoder (or other program) to predict ORFs then it is your
task to choose full-length mRNAs and make definition lines in the format required
for Fgenesh++ (see below).
2) RefSeq mRNAs for a specified organism
if full-length mRNAs for a specified organism are present in RefSeq you can
use them to predict genes in genomic sequences, see APPENDIX 2
1) and 2) can be combined (merged) into one file.
In this case make sure that sequences in the merged file have unique IDs.
For example,
./seq_renumber.pl -i human.mrna.fa -check
To make sequences IDs unique (if they are not) use the command
./seq_renumber.pl -i human.mrna.fa -o human.mrna.unique_ids.fa
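For readers who want to see what a uniqueness check involves, here is a minimal Python sketch. It is not the implementation of seq_renumber.pl; it assumes that the sequence ID is the first whitespace-delimited word after '>', and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicate_ids(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence ID: count} for IDs that occur more than once.

    Assumption: the ID is the first whitespace-delimited word after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}
```

Passing an open file handle (for example, for human.mrna.fa) returns an empty dictionary when all IDs are unique.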
Note:
if some full-length mRNAs are exactly the same (and they are mapped to genomic sequences with good quality) then this can lead to prediction of the same gene structures. We need a script to filter out duplicate gene predictions from the final set of predicted genes.
Fgenesh++ expects definition lines for full-length mRNAs to have the following fields at the end after the '#' character.
# len = 845 atg = 283 stop = 711 target = chr22
where
len - length of mRNA
atg - first coordinate of ATG
stop - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)
Note: target field is optional, it is assumed target = 'na' if it is absent
EXAMPLES of deflines:
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = chr22
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = na
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711
or
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = chr2
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = na
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820
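The defline fields described above can be read with a few lines of Python. The sketch below is an illustration of the format only (the function name is hypothetical and it is not code from the package); it takes the text after the last '#' and applies the documented default target = 'na'.

```python
import re
from typing import Dict, Union

# Matches "len = 845", "atg = 283", "stop = 711", "target = chr22"
FIELD_RE = re.compile(r"(len|atg|stop|target)\s*=\s*(\S+)")


def parse_defline(defline: str) -> Dict[str, Union[int, str]]:
    """Parse the fields after the last '#' of a full-length mRNA defline."""
    if "#" not in defline:
        raise ValueError("no '#' field section in defline")
    tail = defline.rsplit("#", 1)[1]
    fields: Dict[str, Union[int, str]] = {"target": "na"}  # default if absent
    for key, value in FIELD_RE.findall(tail):
        fields[key] = value if key == "target" else int(value)
    missing = [key for key in ("len", "atg", "stop") if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return fields
```

Because only the part after the last '#' is parsed, a Trinity-style "len=845" earlier in the defline does not interfere with the "len = 845" field.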
To check that full-length mRNAs have no annotation errors and their definition lines are in correct format run
./check_mrnas.pl
Example:
./check_mrnas.pl human.mrna.fa
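As an illustration of what an annotation check can look for, the Python sketch below tests the conditions that the refseq_mrnas.pl help lists for its -check_mrna option: correct ATG and STOP codon coordinates, ORF length divisible by 3, and no internal stop codon. Whether check_mrnas.pl performs exactly these checks is an assumption; the function name and messages are hypothetical.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_mrna(seq: str, atg: int, stop: int) -> List[str]:
    """Return a list of problems found in the CDS annotation (empty if none).

    atg  - 1-based coordinate of the first base of the start codon
    stop - 1-based coordinate of the last base of the stop codon
    """
    seq = seq.upper()
    if not (1 <= atg < stop <= len(seq)):
        return ["coordinates outside the sequence"]
    cds = seq[atg - 1:stop]
    problems: List[str] = []
    in_frame = len(cds) % 3 == 0
    if not in_frame:
        problems.append("CDS length is not a multiple of 3")
    if cds[:3] != "ATG":
        problems.append("CDS does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")
    if in_frame:
        # All codons except the last one must be non-stop codons.
        internal = (cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
        if any(codon in STOP_CODONS for codon in internal):
            problems.append("internal stop codon")
    return problems
```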
For Fgenesh++ provide location of a FASTA file with full-length mRNAs in a config file, for example:
MRNA_FILE = /data/softberry/Human_data/human.mrna.fa # full-length mRNAs
Partial mRNAs can come from assembled transcripts that are not full-length mRNAs or from ESTs.
Put them all into one FASTA file and provide its location in config file, for example:
EST_FILE = /data/softberry/Human_data/human.rna.fa # partial mRNAs
It is not required for sequences in this file to have unique IDs, nor any special format of definition lines is required.
mrnas_annot.pl (help and options)
Suppose we have transcripts assembled from RNA-Seq reads by Trinity
or other programs.
The best strategy to use transcripts in Fgenesh++ is to split them into full-length mRNAs and partial mRNAs.
To find full-length mRNAs we have to predict ORFs in transcripts: in each transcript choose the best ORF and decide whether the transcript can be considered a full-length mRNA (by CDS, with ATG and STOP codon present) or not.
ORFs in transcripts can be predicted by BestOrf, TransDecoder or other
programs.
In Fgenesh++ package we provide BestOrf and MaxOrf that predict ORFs
in transcripts. BestOrf uses the same gene prediction matrices as Fgenesh
and calculates coding potential when choosing the best ORF for a transcript
while MaxOrf reports the longest ORF (by CDS).
BestOrf also has higher accuracy of ORF prediction than MaxOrf in our tests on
annotated RefSeq mRNAs.
Script mrnas_annot.pl runs BestOrf (or MaxOrf) on transcripts and splits
them into full-length mRNAs and other transcripts (partial mRNAs).
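To make the "longest ORF" idea concrete, here is a simplified Python sketch. It is not the MaxOrf or BestOrf algorithm (BestOrf additionally scores coding potential): it scans only the given strand, uses the first ATG after each stop codon, and counts the minimal CDS length in codons without the stop codon, which is an assumption about how -min_cds is defined.

```python
from typing import Optional, Tuple

STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str, min_cds_aa: int = 100) -> Optional[Tuple[int, int]]:
    """Return (atg, stop) of the longest complete ORF on the given strand.

    Coordinates are 1-based: atg is the first base of ATG, stop is the last
    base of the stop codon. Returns None if no ORF has at least min_cds_aa
    codons (stop codon not counted).
    """
    seq = seq.upper()
    best = None
    best_len = 0
    for frame in range(3):
        start = None  # index of the first ATG seen since the previous stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                aa_len = (i - start) // 3
                if aa_len >= min_cds_aa and aa_len > best_len:
                    best, best_len = (start + 1, i + 3), aa_len
                start = None
    return best
```

A transcript for which the function returns coordinates has both an ATG and a stop codon and could be treated as a full-length mRNA candidate; the returned pair corresponds to the atg and stop defline fields.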
The FASTA file with full-length mRNAs has a special format for definition lines with the following fields:
# len = 845 atg = 283 stop = 711 target = chr22
where
len - length of mRNA
atg - first coordinate of ATG
stop - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)
Note: target field is optional, it is assumed target = 'na' if it is absent
EXAMPLES of deflines:
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = chr22
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711 target = na
>1_TRINITY_DN64_c0_g1_i1 len=845 path=[0:0-531 ..] # len = 845 atg = 283 stop = 711
Once you get a file with full-length mRNAs (with deflines in special format) provide its location in configuration file, for example:
MRNA_FILE = /data/softberry/Human_data/human.trinity.mrna.fa # full-length mRNAs
File with partial mRNAs can be used as ESTs, for example:
EST_FILE = /data/softberry/Human_data/human.trinity.rna.fa # partial mRNAs
The following commands split assembled transcripts into full-length mRNAs and other transcripts (partial mRNAs).
1) make sure that assembled transcripts have unique sequence IDs
check if sequence IDs are unique or not
./seq_renumber.pl -i human.trinity.fa -check
If sequence IDs are not unique then make them unique.
For example, rename file
mv human.trinity.fa human.trinity.fa.initial
Then, make new file with unique sequence IDs
./seq_renumber.pl -i human.trinity.fa.initial -o human.trinity.fa
(run ./seq_renumber.pl for help and more options)
2) predict ORFs in transcripts and split them into full-length mRNAs and other transcripts (partial mRNAs)
using bestorf to predict ORFs (by default)
./mrnas_annot.pl -i human.trinity.fa -par Human.dat -min_cds:100
using maxorf to predict ORFs
./mrnas_annot.pl -i human.trinity.fa -p maxorf -min_cds:100
EXAMPLE:
./mrnas_annot.pl -i human.trinity.fa -par Human.dat -min_cds:100
gives the following output
Warning: no -chr <chr_name> or -chr_code <chr_code>, assuming 'na' (source of transcripts is unknown)
Predicting ORFs with 'bestorf' and 'Human.dat' parameters
Creating files:
human.trinity.mrna.fa - full-length mRNAs (based on predicted ORFs)
human.trinity.mrna.bestorf - bestorf output (ORF predictions) for mRNAs
human.trinity.protein.fa - proteins (CDS translation) for full-length mRNAs
human.trinity.rna.fa - partial mRNAs (with no ORF prediction)
human.trinity.log - log file
Overall number of sequences : 73232
Sequences with ORF predicted : 57014
Annotated as full-length mRNAs: 57014 ( 77.9%)
in direct strand : 26447 ( 46.4%)
in reverse strand : 30567 ( 53.6%)
Saved as partial mRNAs : 16218 ( 22.1%)
Coverage of transcripts by ORF (CDS) (from 0 to 1)
min : 0.03
max : 1.00
average : 0.51
Execution time: ~46 min
3) you can additionally check that full-length mRNAs have definition lines in required format and mRNAs have no annotation errors
./check_mrnas.pl human.trinity.mrna.fa
4) provide locations of files with full-length (and partial) mRNAs in configuration file
MRNA_FILE = /data/softberry/Human_data/human.trinity.mrna.fa # full-length mRNAs
and
EST_FILE = /data/softberry/Human_data/human.trinity.rna.fa # partial mRNAs
mrnas_annot.pl (help and options)
Run ./mrnas_annot.pl without parameters for help.
Predict ORFs and annotate transcripts, v1.07 (Softberry)
Usage: ./mrnas_annot.pl -i <seq_file_in> [options]
where
-i <seq_file_in> - file with mRNAs (transcripts)
Options:
-program <program>
-p <program> - program to predict ORFs:
bestorf (by default) or
maxorf (maxorf.pl)
-par <param_file> - file with parameters for 'bestorf'
-cov <thr> - minimal coverage of mRNA by ORF (CDS) length (from 0 to 1)
(0 by default)
-basename <name>
-b <name> - filename without extension as basename for output files
(basename from input file by default)
-chr <chr_name>,
-chr_name <chr_name> - chromosome name (chr22, chrX, etc.) or 'na'
-code <chr_code>,
-chr_code <chr_code> - chromosome code (number, X, Y, etc.) or 'na'
if options -chr <chr_name> and -chr_code <chr_code> are not provided
then mRNAs are not assigned to any particular chromosomes or sequences
(and in Fgenesh++ they are mapped to all genomic sequences and
good mappings are selected)
-check 0|1 - post-annotational check for mRNAs (1 by default)
-info
-progress - print progress
Options for an ORF prediction program ('bestorf' or 'maxorf') are passed directly to a program.
The following options can be used with 'bestorf' or 'maxorf':
-strand:0|1|2 - predict ORFs in + strand (0), - strand (1) or both (2) (2 by default)
-s:0|1|2
-min_cds:<LEN> - minimal CDS length (in aa) (100 by default)
(do not make it too low otherwise the rate of false positives increases)
-stop_in_cds:0|1 - include STOP codon in CDS / ORF coordinates (1) or not (0) (1 by default)
The following options can be used with 'maxorf':
-polya:0|1 - use polyA tail as a hint to define mRNA strand (direct or reverse)
(1 by default)
(option is used only when predicting ORFs in both strands)
-polya_hint:0|1 - same as -polya:0|1
-polya_thr:<N> - number of A's at the end of mRNA to consider it as polyA tail
(number of T's at the start in reverse strand)
(5 by default)
Output:
if input file is 'transcripts.fna' then the following files will be created:
(basename for these files can be changed with -b <name> or -basename <name>)
# files for full-length mRNAs
transcripts.mrna.fa - annotated full-length mRNAs in coding strand
transcripts.protein.fa - proteins for full-length mRNAs
transcripts.mrna.bestorf or
transcripts.mrna.maxorf - CDS / ORF predictions for full-length mRNAs (by 'bestorf' or 'maxorf')
# files for other transcripts
transcripts.rna.fa - transcripts with no predicted ORFs
# log file
transcripts.log - log file
Examples:
./mrnas_annot.pl -i human.trinity.all.fa -par Human.dat -min_cds:100
./mrnas_annot.pl -i chr22.trinity.all.fa -par Human.dat -min_cds:100 -chr chr22
./mrnas_annot.pl -i transcripts.fna -p bestorf -par Human.dat
./mrnas_annot.pl -i transcripts.fna -p maxorf
./mrnas_annot.pl -i rf_hs_chr22.cdna -p bestorf -par Human.dat -strand:0 -min_cds:55 -chr chr22
./mrnas_annot.pl -i rf_hs_chr22.cdna -p maxorf -strand:0 -min_cds:55 -chr_code 22
./mrnas_annot.pl -i QSOX1_cDNAs_13.txt -p bestorf -par Human.dat -strand 0 -cov 0.5
./mrnas_annot.pl -i QSOX1_cDNAs_13.txt -p maxorf -strand:0 -cov 0.5
RefSeq can be used as a source of known full-length mRNAs for an organism.
Here we describe how to download RefSeq data and prepare full-length mRNAs for a specified organism to use them in Fgenesh++.
FTP to RefSeq to browse its directories and see what files are present
ftp ftp://ftp.ncbi.nlm.nih.gov/refseq/
or
ftp anonymous@ftp.ncbi.nlm.nih.gov
cd refseq/
1) download RefSeq *.rna.gbff.gz (and optionally *.protein.faa.gz) files for a specified organism
Example for Human:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.protein.faa.gz # optional
Example for Arabidopsis thaliana (plants):
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.protein.faa.gz # optional
Example for Saccharomyces cerevisiae S288c (fungi):
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.protein.faa.gz # optional
2) test downloaded .gz files
gunzip -t *.gz
if some files are corrupted then re-download them
3) gunzip and concatenate files
for Human
gunzip -c human.*.rna.gbff.gz > human.rna.gbff
gunzip -c human.*.protein.faa.gz > human.protein.faa # optional
for Arabidopsis thaliana
gunzip -c plant.*.rna.gbff.gz > plant.rna.gbff
gunzip -c plant.*.protein.faa.gz > plant.protein.faa # optional
for Fungi
gunzip -c fungi.*.rna.gbff.gz > fungi.rna.gbff
gunzip -c fungi.*.protein.faa.gz > fungi.protein.faa # optional
The order in which files are concatenated is not important for
refseq_mrnas.pl script.
Or if you want to concatenate files in an ordered manner then make a shell script like this (example for human.rna.gbff):
gunzip_refseq.sh
#!/bin/bash
# set the number of volumes for .gz files
# human.<volume>.rna.gbff.gz
#
# for example, if we have files
#
# from human.1.rna.gbff.gz
# to human.14.rna.gbff.gz
#
# then the number of volumes is 14
MAX_VOL=14
# delete old file if it is present (-f: no error if it is absent)
rm -f human.rna.gbff
# unzip files to human.rna.gbff in volume order
for (( i=1; i<=MAX_VOL; i++ ))
do
    cmd="gunzip -c human.$i.rna.gbff.gz >> human.rna.gbff"
    echo "$cmd"
    eval "$cmd"
done
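The same ordered concatenation can be sketched in Python when the number of volumes is not known in advance. This is an illustrative alternative, not a script from the package; the function names are hypothetical, and it assumes file names of the form <name>.<volume>.rna.gbff.gz.

```python
import glob
import gzip
import re
import shutil
from typing import List


def volume_number(path: str) -> int:
    """Extract <volume> from a name like human.<volume>.rna.gbff.gz."""
    match = re.search(r"\.(\d+)\.rna\.gbff\.gz$", path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


def concat_volumes(pattern: str, out_path: str) -> List[str]:
    """Decompress all matching volumes into out_path in numeric order."""
    volumes = sorted(glob.glob(pattern), key=volume_number)
    with open(out_path, "wb") as out:  # "wb" truncates an old output file
        for path in volumes:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return volumes
```

Sorting by the numeric volume avoids the lexical order (1, 10, 11, ..., 2) that a plain shell glob would produce. For example, concat_volumes("human.*.rna.gbff.gz", "human.rna.gbff") writes the volumes in the order 1, 2, ..., 14.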
Concatenated file <filename>.rna.gbff (and optionally <filename>.protein.faa) is then used
as an input to refseq_mrnas.pl script to get full-length mRNAs <filename>.mrna.fa
(and optionally RefSeq proteins <filename>.protein.fa) for a specified organism.
4) you can check how many entries (if any) are present for a specified organism in RefSeq data
for example,
grep "^ *ORGANISM *Saccharomyces cerevisiae S288c" fungi.rna.gbff | wc -l
Note that these include not only mRNAs but also other RNAs.
Script refseq_mrnas.pl takes as an input RefSeq <filename>.rna.gbff file and
extracts full-length mRNAs for a specified organism to <filename>.mrna.fa file.
(Optionally it also takes RefSeq <filename>.protein.faa and extracts proteins for full-length mRNAs for a specified organism to <filename>.protein.fa file.)
By default, it selects full-length mRNAs with IDs starting with NM_ prefix
and with REVIEWED / VALIDATED / PROVISIONAL status codes.
There are options in the script that also allow to get mRNAs with both
prefixes (NM_ and XM_) and choose entries with specified status codes.
Note:
mRNAs with NM_ prefix are protein-coding transcripts (usually curated) while
mRNAs with XM_ prefix are predicted model protein-coding transcripts (computed).
Here are some links to RefSeq documentation.
RefSeq accession prefixes:
What is the difference between XM_ and NM_ accessions?
RefSeq status codes:
How can I tell if a RefSeq record has been curated?
Script refseq_mrnas.pl processes <filename>.rna.gbff and creates files:
<filename>.mrna.fa - mRNAs in FASTA format with deflines in special format
<filename>.mrna.err.fa - mRNAs with errors in RefSeq annotation
<filename>.protein.fa - RefSeq proteins for extracted mRNAs (optional)
Deflines of mRNAs in FASTA file have special format with the following fields:
# len = 14121 atg = 129 stop = 13820 target = chr2
where
len - length of mRNA
atg - first coordinate of ATG
stop - last coordinate of STOP codon
target - chromosome to which this mRNA belongs or 'na' (optional)
Note: target field is optional, it is assumed target = 'na' if it is absent
EXAMPLES of deflines:
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = chr2
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820 target = na
>NP_000375.1 NM_000384 apolipoprotein B, mRNA. # len = 14121 atg = 129 stop = 13820
Once you created <filename>.mrna.fa provide its location in configuration file, for example:
MRNA_FILE = /data/softberry/Human_data/rf_human.mrna.fa # full-length mRNAs
It is usually indicated in RefSeq <filename>.rna.gbff to which chromosome mRNA belongs.
For example,
source 1..3159
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="10"
(this mRNA belongs to chromosome 10)
Fgenesh++ can work with full-length mRNAs in two ways:
1) if you annotate chromosome sequences (and they are properly named like 'chr22.fa') then Fgenesh++ can map mRNAs only to the chromosomes to which they belong,
2) if you annotate contigs / scaffolds (so that you do not know to which contigs RefSeq mRNAs belong) then Fgenesh++ tries to map mRNAs to all contigs and selects good mappings.
Let us discuss these two cases in detail.
Suppose you have chromosome sequences properly named like (for example, for human) chr1.fa, chr2.fa, .., chr22.fa, chrX.fa, chrY.fa (or with some other extension like chr1.seq, etc.).
Human chromosomes are named from 1 to 22 and 'X', 'Y' in human.rna.gbff file, for example:
source 1..3159
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="10"
To convert 10 (as in the example above) to chr10 in the 'target' field, so that
Fgenesh++ knows to map this mRNA to chr10.fa (or chr10.seq, or chr10
with any extension), run refseq_mrnas.pl with the option -chr_prefix:chr
(so that the prefix 'chr' is added to '10' and we have 'chr10' at the end),
for example:
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' -chr_prefix:chr
Now for this mRNA we have "target = chr10" in defline in FASTA file with mRNAs.
Or, for 'Saccharomyces cerevisiae S288c' chromosomes are named from I to XVI in fungi.rna.gbff, for example:
source 1..1848
/organism="Saccharomyces cerevisiae S288c"
/mol_type="mRNA"
/strain="S288c"
/db_xref="taxon:559292"
/chromosome="III"
So, if your files with S.c. chromosomes are I.seq, II.seq, .., XVI.seq
then run refseq_mrnas.pl with an option -chr_prefix: (empty prefix) or
-chr_mode
(so that the empty prefix is added to 'III' and we have 'III' at the end),
for example:
./refseq_mrnas.pl fungi.rna.gbff -org:'Saccharomyces cerevisiae S288c' -chr_prefix:
./refseq_mrnas.pl fungi.rna.gbff -org:'Saccharomyces cerevisiae S288c' -chr_mode
Now for this mRNA we have "target = III" in defline in FASTA file with mRNAs.
If you annotate contigs / scaffolds, run refseq_mrnas.pl with no -chr_prefix: or -chr_mode options,
for example:
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens'
Now for all mRNAs we have "target = na" in deflines in FASTA file.
Fgenesh++ maps such mRNAs to all contigs and selects good mappings.
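The two cases can be summarised by a small Python sketch that derives the 'target' value from the /chromosome qualifier of a RefSeq entry. It only mirrors the behaviour described above and is not the code of refseq_mrnas.pl; the function names are hypothetical, and returning 'na' for an entry without a /chromosome qualifier is an assumption.

```python
import re
from typing import Optional

CHROM_RE = re.compile(r'/chromosome="([^"]+)"')


def chromosome_of(entry_text: str) -> Optional[str]:
    """Return the value of the /chromosome qualifier, or None if absent."""
    match = CHROM_RE.search(entry_text)
    return match.group(1) if match else None


def target_field(chromosome: Optional[str], chr_mode: bool = False,
                 chr_prefix: str = "") -> str:
    """Return the value for 'target = ...' in an mRNA defline."""
    if not chr_mode or not chromosome:
        return "na"  # the mRNA is mapped to all target sequences
    return f"{chr_prefix}{chromosome}"
```

For the examples above, target_field("10", chr_mode=True, chr_prefix="chr") gives 'chr10', target_field("III", chr_mode=True) gives 'III', and target_field("10") gives 'na'.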
Run ./refseq_mrnas.pl without parameters for help.
Extraction of full-length mRNAs from RefSeq for a specified organism, v2.04 (Softberry).
Usage: ./refseq_mrnas.pl <refseq .rna.gbff> -org:<species name>
where
<refseq .rna.gbff> - RefSeq .rna.gbff file name
-org:<species name> - organism for which to select mRNAs
use species name from the 'ORGANISM' field of RefSeq entries,
for example:
ORGANISM Homo sapiens
ORGANISM Arabidopsis thaliana
ORGANISM Saccharomyces cerevisiae S288c
Options:
-chr_mode - assumes that target sequences (where mRNAs will be mapped to) are
chromosomes and they are properly named (like 'chr22.fa')
in this case for each mRNA field 'target = <chr>' is added to
mRNA defline (and Fgenesh++ will map that mRNA only to chromosome <chr>)
if target sequences are NOT chromosomes or if we want to map each mRNA
to all target sequences, do not use -chr_mode or -chr_prefix:
(in this case field 'target = na' is added to mRNA defline
and Fgenesh++ will map that mRNA to all target sequences)
-chr_prefix:<chr> - prefix for chromosomes (for example, 'chr', 'Chr' or even '')
for example, if chromosome codes in RefSeq are 1, 2, .., 22, X, Y and you
have chromosome sequences in files chr1.fa, chr2.fa, ..., chrX.fa, chrY.fa
then use -chr_prefix:chr
(automatically applies -chr_mode)
-prefix:<prefix> - prefix for generated files
<.mrna.fa>, <.mrna.err.fa> and <.protein.fa> (optionally)
by default, prefix is 'refseq_Nm' where 'Nm' are the first letters
of species name (for example, 'refseq_Hs' for 'Homo sapiens')
-ids:<NM|XM|all> - select mRNAs with ID prefixes NM_, XM_ or both (NM by default)
Note: -ids:all or -ids:XM automatically sets -status:all
-status:<[RVP]|all> - select mRNAs with this status (RVP by default)
(R - REVIEWED, V - VALIDATED, P - PROVISIONAL), or
all - status is NOT checked which allows to select entries with
any status (including also PREDICTED, INFERRED, MODEL)
-get_prot_from <.protein.faa>
- for extracted mRNAs also make <.protein.fa> file with proteins
from Refseq <.protein.faa> file
-check_mrna:<0|1> - check mRNA annotation (1 by default) and skip mRNA if errors are found
checks for:
right ATG and STOP codon coordinates
ORF length is dividable by 3
protein translated from mRNA does not contain internal '*'
-no_check - same as -check_mrna:0
-h, -help - print this help
Examples:
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' > Hs.log 2> Hs.err
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' -chr_prefix:chr -prefix:refseq_human > human.log 2> human.err
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' -chr_prefix:chr -ids:all -prefix:refseq_human_all > human_all.log 2> human_all.err
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' -chr_prefix:chr -status:all > Hs.log 2> Hs.err
./refseq_mrnas.pl human.rna.gbff -org:'Homo sapiens' -chr_prefix:chr -get_prot_from human.protein.faa > Hs.log 2> Hs.err
./refseq_mrnas.pl cow.rna.gbff -org:'Bos taurus' -chr_prefix:chr > Bt.log 2> Bt.err
./refseq_mrnas.pl cow.rna.gbff -org:'Bos taurus' -chr_prefix:chr -ids:all -prefix:refseq_Bt_all > Bt_all.log 2> Bt_all.err
./refseq_mrnas.pl plant.rna.gbff -org:'Arabidopsis thaliana' -chr_prefix:ath > At.log 2> At.err
./refseq_mrnas.pl plant.rna.gbff -org:'Arabidopsis thaliana' -ids:all -no_check -prefix:refseq_At_test > At_test.log 2> At_test.err
./refseq_mrnas.pl fungi.rna.gbff -org:'Saccharomyces cerevisiae S288c' -chr_mode > Sc.log 2> Sc.err
./refseq_mrnas.pl fungi.rna.gbff -org:'Saccharomyces cerevisiae S288c' -chr_prefix: > Sc.log 2> Sc.err
Script creates the following files:
<filename>.mrna.fa - mRNAs in FASTA format with deflines in special format
Exact names of output files depend on the -prefix: option.
For example, filenames for human (-org:'Homo sapiens') are
by default (no -prefix: option)
refseq_Hs.mrna.fa
refseq_Hs.mrna.err.fa
refseq_Hs.protein.fa
with -prefix:rf_human
rf_human.mrna.fa
rf_human.mrna.err.fa
rf_human.protein.fa
Once you created <filename>.mrna.fa provide its location in configuration file, for example:
MRNA_FILE = /data/softberry/Human_data/rf_human.mrna.fa # full-length mRNAs
1) check_mrnas.pl to check that full-length mRNAs are annotated correctly
Example:
./check_mrnas.pl rf_human.mrna.fa
2) set_mrnas_target.pl to change 'target' in mRNA deflines
Example:
make 'target = na'
./set_mrnas_target.pl refseq_Hs.mrna.fa refseq_Hs.mrna.fa.target_na -set_target na
BLAST+ can be downloaded from NCBI:
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
The legacy BLAST executables can be downloaded from NCBI:
https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/
Use BLAST+ or BLAST release 2.2.13 or some higher version.
NR (non-redundant protein sequence database) can be downloaded from
ftp ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
NR or custom protein database must be present in directory provided in
configuration file in both formats - as FASTA file and as formatted for
BLAST by makeblastdb or formatdb program.
Format protein database for BLAST with one of the following commands:
makeblastdb -in <DB> -dbtype prot # BLAST+
formatdb -i <DB> # BLAST
For example,
makeblastdb -in nr -dbtype prot # BLAST+
formatdb -i nr # BLAST
Note: do NOT use -parse_seqids (BLAST+) or -o T (BLAST) option since the
pipeline requires a protein database to be formatted without parsing Seq-ids!
After formatting, protein DB will be present in both formats:
nr (FASTA formatted)
nr.*.phr (BLAST formatted)
nr.*.pin (BLAST formatted)
nr.*.psq (BLAST formatted)
Put the name of FASTA file ('nr' in example above) in configuration file, for example:
PROTEIN_DB = /data/NR/nr # protein DB
Also, prepare index file with make_nr_indexed.pl script.
For example,
FGENESHPIPE/scripts/make_nr_indexed.pl -f /data/NR/nr -i /data/NR/nr.ind
Put the name of index file in configuration file, for example:
PROTEIN_DB_INDEX = /data/NR/nr.ind # protein DB index file
Predicting genes using proteins from a protein DB is still the most time-consuming step since the size of protein databases grows exponentially.
We highly recommend using relevant NR subsets for calculations because the full NR is very big.
Use, for example, the following subsets:
nr_animals - animals (Metazoa) proteins (to annotate animals)
nr_plants - plant proteins (to annotate plants)
nr_fungi - fungi proteins (to annotate fungi)
Or, even better, prepare more specific subsets that include proteins for species from some taxon level to which a specified organism (that you are going to annotate) belongs.
Also, we recommend excluding from NR proteins that are not relevant for your annotation, for example, partial proteins, "unnamed protein product" proteins, hypothetical proteins, and predicted proteins (with 'XP_' or 'PREDICTED' in deflines).
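One way to apply this recommendation is a simple defline filter. The Python sketch below is an illustration only (function names are hypothetical): it uses case-insensitive substring matching on the terms listed above, so it also drops records in which only one of several merged NR deflines contains a term.

```python
from typing import Iterable, Iterator, Tuple

# Terms taken from the recommendation above; matching is case-insensitive.
EXCLUDE_TERMS = ("partial", "unnamed protein product", "hypothetical",
                 "predicted", "xp_")

Record = Tuple[str, str]  # (defline, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (defline, sequence) records from FASTA lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line, []
        elif defline is not None:
            chunks.append(line.strip())
    if defline is not None:
        yield defline, "".join(chunks)


def keep_relevant(records: Iterable[Record]) -> Iterator[Record]:
    """Yield records whose defline contains none of the excluded terms."""
    for defline, seq in records:
        lowered = defline.lower()
        if not any(term in lowered for term in EXCLUDE_TERMS):
            yield defline, seq
```

The filtered records can then be written to a new FASTA file, which is formatted and indexed as described above.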
OrthoDB (or its subsets) can be a good substitute for NR.
If you use OrthoDB make sure that proteins are at least 6 aa long and do not contain STOP '*' codons.
Script skip_short_seq.pl filters out short protein sequences.
Script skip_STOP_seq.pl filters out protein sequences with internal
STOP codons.
Legacy BLAST does not work with non-standard characters (like J, O) in protein sequences.
Examples of warning messages because of non-standard characters in proteins:
a)
formatdb
WARNING: [000.000] Sequence number 4446542 (lcl|22997_nr.01), 1 illegal character was removed:
1 J
>gi|123965153|gb|ABM74351.1| solute carrier family 8 member 3 [Uraeotyphlus cf. malabaricus KR-2007]
SIEVITSQEKEITIRKINGETTTTTIRVWNETVSNLTLMALGSSAPEILLSLVEVCGHGFAAGELGPSTIVGSAAFNMFI
IIAICVYVIPDGETRRIKHLRVFFVTAAWSIFAYIWLYMILAVFSPGIVQVWEGLLTLFFFPICVFLAWVADRRLLFYKY
MHKKYRTDKHRGIMIETEGEHPKGIEMDGKMMNSHFLDGSLVNMEGKEVDESRREMIRILKDJKQKHPEKDLDQLVEMAN
YYALSHQQKSRAFYRIQATRMMTGAGNILKKHAAEQAKKSSSMNEVQIDEPEEFMSKIYFDPCSYQCLENCGAVLLTVVR
RGGDISKTLYVDYKTEDGSANAGADYEFTEGTVVFKSGETQKEFSVGIIDDDIF
b)
formatdb
Selenocysteine (U) at position 48 replaced by X
Script skip_non_standard_seq.pl filters out protein sequences
with non-standard characters.
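The three sequence-level criteria mentioned above (minimal length, no STOP '*' characters, no non-standard characters) can be combined into one predicate. The Python sketch below is not the code of the skip_*.pl scripts; the function name is hypothetical, the set of non-standard letters (J, O, U) is an assumption based on the examples and warnings quoted above, and any '*' is rejected, including a terminal one.

```python
# Letters named in the text and warnings above; this set is an assumption.
NON_STANDARD = set("JOU")


def protein_ok(seq: str, min_len: int = 6) -> bool:
    """Return True if a protein sequence passes all three filters.

    - at least min_len residues long
    - contains no '*' (STOP) character
    - contains no non-standard amino acid letters
    """
    seq = seq.upper()
    return (len(seq) >= min_len
            and "*" not in seq
            and not (NON_STANDARD & set(seq)))
```

If your database marks the end of each protein with a terminal '*', strip it before calling the function so that only internal stops are rejected.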
As an example, for 'nr_animals' subset commands look like:
format protein database for BLAST / BLAST+
makeblastdb -in nr_animals -dbtype prot   # BLAST+
formatdb -i nr_animals                    # BLAST
make index file
FGENESHPIPE/scripts/make_nr_indexed.pl -f /data/NR/nr_animals -i /data/NR/nr_animals.ind
put lines in config file
PROTEIN_DB = /data/NR/nr_animals            # protein DB
PROTEIN_DB_INDEX = /data/NR/nr_animals.ind  # protein DB index file
As another example, for 'nr_plants' commands look similar:
format protein database for BLAST / BLAST+
makeblastdb -in nr_plants -dbtype prot   # BLAST+
formatdb -i nr_plants                    # BLAST
make index file
FGENESHPIPE/scripts/make_nr_indexed.pl -f /data/NR/nr_plants -i /data/NR/nr_plants.ind
put lines in config file
PROTEIN_DB = /data/NR/nr_plants            # protein DB
PROTEIN_DB_INDEX = /data/NR/nr_plants.ind  # protein DB index file
The header of the Fgenesh++ output for a sequence contains the sequence name ('Seq name:' field), the sequence length and the numbers of predicted genes and exons, for example:
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
Time : Fri Nov 29 21:47:46 2024
Seq name: ENm011
Length of sequence: 7878624, Sequence: 1, File: ENm011.seq.N
Number of predicted genes 20: in +chain 10, in -chain 10.
Number of predicted exons 114: in +chain 66, in -chain 48.
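The header fields can be extracted with a few regular expressions. The Python sketch below is an illustration only: parse_header() and the dictionary keys are hypothetical names, and the patterns assume the header wording shown in the example above.

```python
import re

SAMPLE_HEADER = """\
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
Time : Fri Nov 29 21:47:46 2024
Seq name: ENm011
Length of sequence: 7878624, Sequence: 1, File: ENm011.seq.N
Number of predicted genes 20: in +chain 10, in -chain 10.
Number of predicted exons 114: in +chain 66, in -chain 48.
"""

# Template for the two 'Number of predicted ...' lines.
COUNT_RE = r"Number of predicted {kind}\s+(\d+):\s+in \+chain\s+(\d+),\s+in -chain\s+(\d+)"


def _counts(kind, text):
    """Return (total, plus_strand, minus_strand) for 'genes' or 'exons', or None."""
    m = re.search(COUNT_RE.format(kind=kind), text)
    return tuple(int(x) for x in m.groups()) if m else None


def parse_header(text):
    """Extract sequence name, length and gene/exon counts from a header."""
    name = re.search(r"Seq name:\s*(\S+)", text)
    length = re.search(r"Length of sequence:\s*(\d+)", text)
    return {
        "seq_name": name.group(1) if name else None,
        "length": int(length.group(1)) if length else None,
        "genes": _counts("genes", text),
        "exons": _counts("exons", text),
    }


print(parse_header(SAMPLE_HEADER))
```

Fields that are absent (for example in output without reliable predictions) are returned as None rather than raising an error.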
Gene models are described by the following fields:
G Str Feature Start End Score ORF Len
G - predicted gene number, starting from sequence start
Str - DNA strand (+ for direct or - for complementary)
Feature - type of coding sequence:
CDSf - first coding segment (starting with start codon),
CDSi - internal coding segment (internal exon),
CDSl - last coding segment (ending with stop codon),
CDSo - one/only coding segment for genes with only one CDS
(starting with start codon and ending with stop codon);
UTR5 - untranslated part of an exon (from 5'-end)
UTR3 - untranslated part of an exon (from 3'-end)
EXON - untranslated exon
Number before the Feature is the exon number.
Note that
UTR5 and CDSf belong to the same exon (here exon 1), and
CDSl and UTR3 belong to the same exon (here exon 5):
G Str Feature Start End
20 + 1 UTR5 580523 - 580682 ...
20 + 1 CDSf 580683 - 580773 ...
20 + 2 CDSi 581922 - 582019 ...
20 + 3 CDSi 586551 - 586625 ...
20 + 4 CDSi 591469 - 591523 ...
20 + 5 CDSl 592300 - 592355 ...
20 + 5 UTR3 592356 - 592386 ...
20 + 6 EXON 594044 - 594127 ...
20 + 7 EXON 594391 - 594482 ...
20 + 8 EXON 595161 - 595340 ...
20 + 9 EXON 595678 - 596015 ...
TSS - TATA-box (or transcription start site for genes with mRNA support)
PolA - PolyA signal (or end of transcript for genes with mRNA support)
Start - start position of the feature
End - end position of the feature
Score - Log likelihood*10 score for the feature
(or special value for genes with mRNA support)
ORF - start/end positions where the first complete codon starts
and the last codon ends
Len - length of the Feature
Note that in the current Fgenesh++ format the features 'TSS' and 'PolA' have different meanings depending on the type of gene support.
For mRNA supported genes:
TSS - transcription start site
PolA - PolyA site (end of transcript)
For protein supported and ab initio predicted genes:
TSS - TATA-box (always predicted, even if bad)
PolA - PolyA signal
Fgenesh++ can predict complete as well as incomplete gene models. Incomplete gene models are those that lack the initial coding exon (CDSf), the terminal coding exon (CDSl), or both. Note that gene models with only one CDS (CDSo) that has both an ATG and a stop codon are considered complete gene models.
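The completeness rule above can be expressed as a small check on the feature types of one gene. This Python sketch is an illustration only; is_complete() is a hypothetical helper and takes the list of Feature values (CDSf, CDSi, CDSl, CDSo, ...) of a single gene.

```python
def is_complete(feature_types):
    """True if a gene has CDSo, or both a first (CDSf) and a last (CDSl) coding segment."""
    types = set(feature_types)
    return "CDSo" in types or ("CDSf" in types and "CDSl" in types)


# A gene with first, internal and last coding segments is complete.
print(is_complete(["CDSf", "CDSi", "CDSi", "CDSl"]))
# A gene with a CDSf but no CDSl is incomplete.
print(is_complete(["CDSi", "CDSi", "CDSf"]))
```

Non-coding features such as UTR5, UTR3, EXON, TSS and PolA do not affect the result.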
In Fgenesh++ output, you can distinguish between mRNA supported, protein supported and ab initio gene predictions:
An example of ab initio gene prediction.
G Str Feature Start End Score ORF Len
5 + 1 CDSf 114111 - 114191 10.58 114111 - 114191 81
5 + 2 CDSi 114249 - 114407 9.31 114249 - 114407 159
5 + 3 CDSi 114571 - 114676 8.12 114571 - 114675 106
5 + 4 CDSi 114772 - 114905 20.57 114774 - 114905 134
5 + 5 CDSl 115007 - 115246 18.58 115007 - 115246 240
5 + PolA 115316 1.13
Note: relative strength of ab initio genes can be found by summing over all scores of their CDS exons.
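As an illustration of the note on relative strength, the Python sketch below sums the CDS scores of ab initio gene 5 from the example. It assumes whitespace-separated rows in which CDS rows have the layout G, Str, exon number, Feature, Start, '-', End, Score; cds_score_sum() is a hypothetical helper, not part of the package.

```python
GENE_5_ROWS = """\
5 + 1 CDSf 114111 - 114191 10.58 114111 - 114191 81
5 + 2 CDSi 114249 - 114407 9.31 114249 - 114407 159
5 + 3 CDSi 114571 - 114676 8.12 114571 - 114675 106
5 + 4 CDSi 114772 - 114905 20.57 114774 - 114905 134
5 + 5 CDSl 115007 - 115246 18.58 115007 - 115246 240
5 + PolA 115316 1.13
"""


def cds_score_sum(rows):
    """Sum the Score column over all CDS rows (CDSf, CDSi, CDSl, CDSo)."""
    total = 0.0
    for row in rows.splitlines():
        fields = row.split()
        # CDS rows: G, Str, exon number, Feature, Start, '-', End, Score, ...
        # TSS and PolA rows are shorter and have no exon number, so they are skipped.
        if len(fields) >= 8 and fields[3].startswith("CDS"):
            total += float(fields[7])
    return total


print(round(cds_score_sum(GENE_5_ROWS), 2))  # 10.58 + 9.31 + 8.12 + 20.57 + 18.58 = 67.16
```

This sum is meaningful only for ab initio genes; for mRNA supported genes the Score column holds a special value, as described above.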
An example of mRNA supported gene prediction.
G Str Feature Start End Score ORF Len (Hom_x1 Hom_x2 Hom_% Hom_id)
14 - PolA 437594 4728.00 NP_000198.1
14 - 1 UTR3 437594 - 437666 4728.00 73 378 - 450 99 NP_000198.1
14 - 1 CDSl 437667 - 437812 4728.00 437667 - 437810 146 232 - 377 99 NP_000198.1
14 - 2 CDSf 438600 - 438786 4728.00 438601 - 438786 187 45 - 231 100 NP_000198.1
14 - 2 UTR5 438787 - 438803 4728.00 17 28 - 44 100 NP_000198.1
14 - 3 EXON 438983 - 439009 4728.00 27 1 - 27 96 NP_000198.1
14 - TSS 439009 4728.00 NP_000198.1
Four more fields are added in comparison with ab initio predictions:
Hom_x1,
Hom_x2 - coordinates of that part of mapped mRNA (in bp) that corresponds to predicted exon
Hom_% - homology in % between predicted exon and corresponding part of mapped mRNA
Hom_id - ID of mapped mRNA (or ID of its protein)
An example of protein supported gene prediction.
G Str Feature Start End Score ORF Len (Hom_x1 Hom_x2 Hom_%)
6 + TSS 116762 -7.48
6 + 1 CDSf 118349 - 118471 251.85 118349 - 118471 123 1 - 41 100
6 + 2 CDSi 118634 - 118723 192.68 118634 - 118723 90 42 - 71 100
6 + 3 CDSi 118846 - 119022 373.98 118846 - 119022 177 72 - 130 100
6 + 4 CDSl 119271 - 119366 210.07 119271 - 119366 96 131 - 162 100
6 + PolA 119473 1.13
Three more fields are added in comparison with ab initio predictions:
Hom_x1,
Hom_x2 - coordinates of that part of homologous protein that corresponds to predicted CDS
Hom_% - homology in % between predicted CDS and corresponding part of homologous protein
Predicted proteins are listed at the end of *.res files.
For mRNA supported and protein supported predictions, information about the homolog of a predicted protein is given in the defline of that protein (after the first '##'), for example:
for protein supported prediction (note also 'BY PROTMAP' tag)
>FGENESH: 6 4 exon (s) 118349 - 119366 161 aa, chain + ## BY PROTMAP: gi|119622872|gb|EAX02467.1| troponin I type 2 (skeletal, fast), isoform CRA_a [Homo sapiens] ## query_len = 161 match_len = 161 query_cov = 1.00 match_cov = 1.00 score = 322 pid = 100.0 evalue = 1e-120
for mRNA supported prediction (note also 'BY MRNA_MAP' tag)
>FGENESH: 14 2 exon (s) 437667 - 438786 110 aa, chain - ## BY MRNA_MAP: NP_000198.1
For ab initio predictions information about homology to a predicted protein is also given in defline of a predicted protein (after first '##'), for example:
>FGENESH: 5 5 exon (s) 114111 - 115246 239 aa, chain + ## NR: gi|94536846|ref|NP_612634.3| synaptotagmin VIII [Homo sapiens] ## query_len = 239 match_len = 401 query_cov = 0.82 match_cov = 0.49 query_x1 = 28 query_x2 = 223 match_x1 = 190 match_x2 = 385 score = 362 pid = 92.0 evalue = 1e-125
If no homologs were found for a predicted protein, the defline contains the 'no hits' tag, for example:
>FGENESH: 11 7 exon (s) 241806 - 368165 245 aa, chain - ## NR: no hits
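The tags after the first '##' make it possible to classify predicted proteins by type of support. The Python sketch below is an illustration only: parse_defline() and its dictionary keys are hypothetical, and the regular expression assumes the defline layout shown in the examples above.

```python
import re

DEFLINE_RE = re.compile(
    r"^>FGENESH:\s*(\d+)\s+(\d+)\s+exon\s*\(s\)\s+(\d+)\s+-\s+(\d+)"
    r"\s+(\d+)\s+aa,\s+chain\s+([+-])"
)


def parse_defline(defline):
    """Parse a predicted-protein defline into a dictionary (hypothetical layout)."""
    m = DEFLINE_RE.match(defline)
    if m is None:
        raise ValueError("not a FGENESH protein defline: %r" % defline)
    parts = [p.strip() for p in defline.split("##")]
    tag = parts[1] if len(parts) > 1 else ""
    if tag.startswith("BY PROTMAP"):
        support = "protein"
    elif tag.startswith("BY MRNA_MAP"):
        support = "mRNA"
    else:
        support = "ab initio"
    # Text after the first ':' in the tag is the homolog, unless it is 'no hits'.
    homolog = tag.split(":", 1)[1].strip() if ":" in tag else None
    if homolog == "no hits":
        homolog = None
    # Alignment statistics (query_len, pid, evalue, ...) are kept as strings.
    stats = dict(re.findall(r"(\w+) = (\S+)", parts[2])) if len(parts) > 2 else {}
    return {
        "gene": int(m.group(1)),
        "exons": int(m.group(2)),
        "start": int(m.group(3)),
        "end": int(m.group(4)),
        "aa": int(m.group(5)),
        "strand": m.group(6),
        "support": support,
        "homolog": homolog,
        "stats": stats,
    }


info = parse_defline(">FGENESH: 14 2 exon (s) 437667 - 438786 110 aa, chain - ## BY MRNA_MAP: NP_000198.1")
print(info["support"], info["homolog"])
```

For ab initio predictions the "homolog" entry holds the best NR hit, or None when the defline carries the 'no hits' tag.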
All Fgenesh++ output for sequence ENm011 (file "ENm011.seq.res") is shown below.
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
Time : Fri Nov 29 21:47:46 2024
Seq name: ENm011
Length of sequence: 7878624, Sequence: 1, File: ENm011.seq.N
Number of predicted genes 20: in +chain 10, in -chain 10.
Number of predicted exons 114: in +chain 66, in -chain 48.
Positions of predicted genes and exons: Variant 1 from 1, Score:5.023016
G Str Feature Start End Score ORF Len
1 - PolA 12632 -6.07
1 - 1 CDSl 13095 - 13244 24.11 13095 - 13244 150
1 - 2 CDSi 25482 - 25934 61.84 25482 - 25934 453
1 - 3 CDSi 26363 - 27134 3.80 26363 - 27133 772
1 - 4 CDSf 30308 - 30408 1.64 30310 - 30408 101
2 - PolA 30570 23320.00 NP_001900.1
2 - 1 UTR3 30570 - 31317 23320.00 748 1373 - 2120 100 NP_001900.1
2 - 1 CDSl 31318 - 31485 23320.00 31318 - 31485 168 1205 - 1372 100 NP_001900.1
2 - 2 CDSi 31618 - 31716 23320.00 31618 - 31716 99 1106 - 1204 100 NP_001900.1
2 - 3 CDSi 31809 - 31953 23320.00 31809 - 31952 145 961 - 1105 100 NP_001900.1
2 - 4 CDSi 32721 - 32843 23320.00 32723 - 32842 123 838 - 960 100 NP_001900.1
2 - 5 CDSi 35139 - 35371 23320.00 35141 - 35371 233 605 - 837 100 NP_001900.1
2 - 6 CDSi 36784 - 36902 23320.00 36784 - 36900 119 486 - 604 100 NP_001900.1
2 - 7 CDSi 37331 - 37454 23320.00 37332 - 37454 124 362 - 485 100 NP_001900.1
2 - 8 CDSi 39124 - 39283 23320.00 39124 - 39282 160 202 - 361 100 NP_001900.1
2 - 9 CDSf 41607 - 41674 23320.00 41609 - 41674 68 134 - 201 100 NP_001900.1
2 - 9 UTR5 41675 - 41807 23320.00 133 1 - 133 100 NP_001900.1
2 - TSS 41807 23320.00 NP_001900.1
3 - PolA 80566 -1.07
3 - 1 CDSl 80704 - 81029 30.41 80704 - 81027 326
3 - 2 CDSi 81123 - 81249 6.09 81124 - 81249 127
3 - 3 CDSf 81617 - 81910 28.98 81617 - 81910 294
3 - TSS 84211 -8.28
4 + TSS 110820 -3.68
4 + 1 CDSo 112921 - 113397 804.88 112921 - 113397 477 1 - 159 100
5 + 1 CDSf 114111 - 114191 10.58 114111 - 114191 81
5 + 2 CDSi 114249 - 114407 9.31 114249 - 114407 159
5 + 3 CDSi 114571 - 114676 8.12 114571 - 114675 106
5 + 4 CDSi 114772 - 114905 20.57 114774 - 114905 134
5 + 5 CDSl 115007 - 115246 18.58 115007 - 115246 240
5 + PolA 115316 1.13
6 + TSS 116762 -7.48
6 + 1 CDSf 118349 - 118471 251.85 118349 - 118471 123 1 - 41 100
6 + 2 CDSi 118634 - 118723 192.68 118634 - 118723 90 42 - 71 100
6 + 3 CDSi 118846 - 119022 373.98 118846 - 119022 177 72 - 130 100
6 + 4 CDSl 119271 - 119366 210.07 119271 - 119366 96 131 - 162 100
6 + PolA 119473 1.13
7 + TSS 130773 -8.18
7 + 1 CDSf 130960 - 131012 107.56 130960 - 131010 53 1 - 17 100
7 + 2 CDSi 157902 - 158039 246.88 157903 - 158037 138 19 - 63 100
7 + 3 CDSi 159247 - 159411 324.48 159248 - 159409 165 65 - 118 100
7 + 4 CDSi 161234 - 161375 264.90 161235 - 161375 142 120 - 166 100
7 + 5 CDSi 161747 - 161839 172.04 161747 - 161839 93 167 - 197 100
7 + 6 CDSi 162095 - 162138 82.14 162095 - 162136 44 198 - 211 100
7 + 7 CDSi 162315 - 162396 147.97 162316 - 162396 82 213 - 239 100
7 + 8 CDSi 164547 - 164681 257.58 164547 - 164681 135 240 - 284 100
7 + 9 CDSi 165052 - 165129 155.82 165052 - 165129 78 285 - 310 100
7 + 10 CDSl 165289 - 165378 169.21 165289 - 165378 90 311 - 340 100
8 - PolA 165912 1.13
8 - 1 CDSo 166960 - 168378 70.26 166960 - 168378 1419
8 - TSS 170023 -5.88
9 + TSS 186853 -4.08
9 + 1 CDSf 187635 - 187717 1.09 187635 - 187715 83
9 + 2 CDSi 197244 - 197388 3.93 197245 - 197388 145
9 + 3 CDSi 200672 - 200706 6.28 200672 - 200704 35
9 + 4 CDSi 210284 - 210329 5.65 210285 - 210329 46
9 + 5 CDSi 211536 - 211652 19.77 211536 - 211652 117
9 + 6 CDSi 211746 - 211823 0.75 211746 - 211823 78
9 + 7 CDSi 212147 - 212260 13.22 212147 - 212260 114
9 + 8 CDSi 212361 - 212470 12.61 212361 - 212468 110
9 + 9 CDSi 212644 - 212734 17.26 212645 - 212734 91
9 + 10 CDSi 214778 - 214818 -0.14 214778 - 214816 41
9 + 11 CDSl 216253 - 216307 -3.11 216254 - 216307 55
9 + PolA 216498 1.13
10 + TSS 223577 -2.48
10 + 1 CDSf 225178 - 225194 49.30 225178 - 225192 17 1 - 5 100
10 + 2 CDSi 228714 - 228836 243.90 228715 - 228834 123 7 - 46 100
10 + 3 CDSi 229942 - 230024 166.14 229943 - 230023 83 48 - 74 100
10 + 4 CDSi 230597 - 230670 153.93 230599 - 230670 74 76 - 99 100
10 + 5 CDSl 234071 - 234235 317.41 234071 - 234235 165 100 - 154 100
10 + PolA 234391 1.13
11 - 1 CDSi 241806 - 241896 6.92 241807 - 241896 91
11 - 2 CDSi 261282 - 261405 -1.03 261282 - 261404 124
11 - 3 CDSi 273691 - 273813 0.06 273693 - 273812 123
11 - 4 CDSi 273894 - 274006 3.51 273896 - 274006 113
11 - 5 CDSi 274102 - 274236 7.12 274102 - 274236 135
11 - 6 CDSi 310521 - 310600 1.42 310521 - 310598 80
11 - 7 CDSf 368096 - 368165 10.57 368097 - 368165 70
11 - TSS 368880 -5.68
12 - PolA 410320 1.13
12 - 1 CDSl 410802 - 411038 430.65 410802 - 411038 237 181 - 103 100
12 - 2 CDSi 411332 - 411480 266.30 411332 - 411478 149 102 - 54 100
12 - 3 CDSf 413182 - 413338 310.28 413183 - 413338 157 52 - 1 100
13 + 1 CDSf 418459 - 418762 539.77 418459 - 418761 304 1 - 101 100
13 + 2 CDSl 424060 - 424262 327.95 424062 - 424262 203 103 - 169 100
13 + PolA 424883 -5.47
14 - PolA 437594 4728.00 NP_000198.1
14 - 1 UTR3 437594 - 437666 4728.00 73 378 - 450 99 NP_000198.1
14 - 1 CDSl 437667 - 437812 4728.00 437667 - 437810 146 232 - 377 99 NP_000198.1
14 - 2 CDSf 438600 - 438786 4728.00 438601 - 438786 187 45 - 231 100 NP_000198.1
14 - 2 UTR5 438787 - 438803 4728.00 17 28 - 44 100 NP_000198.1
14 - 3 EXON 438983 - 439009 4728.00 27 1 - 27 96 NP_000198.1
14 - TSS 439009 4728.00 NP_000198.1
15 - PolA 441761 1.13
15 - 1 CDSl 442048 - 442207 324.54 442048 - 442206 160 529 - 477 100
15 - 2 CDSi 443047 - 443180 261.27 443049 - 443180 134 475 - 432 100
15 - 3 CDSi 443483 - 443578 207.38 443483 - 443578 96 431 - 400 100
15 - 4 CDSi 443817 - 443873 120.93 443817 - 443873 57 399 - 381 100
15 - 5 CDSi 444295 - 444364 160.17 444295 - 444363 70 380 - 358 100
15 - 6 CDSi 444448 - 444583 257.95 444450 - 444581 136 356 - 313 100
15 - 7 CDSi 444702 - 444847 292.21 444703 - 444846 146 311 - 264 100
15 - 8 CDSi 445250 - 445300 98.90 445252 - 445299 51 262 - 247 100
15 - 9 CDSi 445681 - 445748 141.80 445683 - 445748 68 245 - 224 100
15 - 10 CDSi 445906 - 445994 159.40 445906 - 445992 89 223 - 195 100
15 - 11 CDSi 446306 - 446480 334.93 446307 - 446480 175 193 - 136 100
15 - 12 CDSi 447465 - 447686 432.35 447465 - 447686 222 135 - 62 100
15 - 13 CDSi 448505 - 448585 145.64 448505 - 448585 81 61 - 35 100
15 - 14 CDSf 449500 - 449601 218.47 449500 - 449601 102 34 - 1 100
16 - PolA 546531 1.13
16 - 1 CDSo 547566 - 548147 1202.02 547566 - 548147 582 194 - 1 100
17 - 1 CDSl 577271 - 577559 351.28 577271 - 577558 289 133 - 50 100
17 - 2 CDSi 578372 - 578428 63.99 578374 - 578427 57 36 - 19 100
17 - 3 CDSf 579571 - 579623 92.87 579573 - 579623 53 17 - 1 100
18 + TSS 579828 15136.00 NP_620591.1
18 + 1 UTR5 579828 - 579964 15136.00 137 1 - 137 100 NP_620591.1
18 + 1 CDSf 579965 - 580030 15136.00 579965 - 580030 66 138 - 203 100 NP_620591.1
18 + 2 CDSi 580659 - 580773 15136.00 580659 - 580772 115 204 - 318 100 NP_620591.1
18 + 3 CDSi 581922 - 582019 15136.00 581924 - 582019 98 319 - 416 100 NP_620591.1
18 + 4 CDSi 586551 - 586625 15136.00 586551 - 586625 75 417 - 491 100 NP_620591.1
18 + 5 CDSi 591469 - 591570 15136.00 591469 - 591570 102 492 - 593 100 NP_620591.1
18 + 6 CDSi 592300 - 592386 15136.00 592300 - 592386 87 594 - 680 100 NP_620591.1
18 + 7 CDSi 594044 - 594127 15136.00 594044 - 594127 84 681 - 764 100 NP_620591.1
18 + 8 CDSi 594391 - 594482 15136.00 594391 - 594480 92 765 - 856 100 NP_620591.1
18 + 9 CDSi 595159 - 595340 15136.00 595160 - 595339 182 857 - 1038 100 NP_620591.1
18 + 10 CDSl 595678 - 595739 15136.00 595680 - 595739 62 1039 - 1100 100 NP_620591.1
18 + 10 UTR3 595740 - 596015 15136.00 276 1101 - 1376 100 NP_620591.1
18 + PolA 596015 15136.00 NP_620591.1
19 + TSS 580523 14286.00 NP_005696.1
19 + 1 UTR5 580523 - 580682 14286.00 160 1 - 160 99 NP_005696.1
19 + 1 CDSf 580683 - 580773 14286.00 580683 - 580772 91 161 - 251 99 NP_005696.1
19 + 2 CDSi 581922 - 582019 14286.00 581924 - 582019 98 252 - 349 100 NP_005696.1
19 + 3 CDSi 586551 - 586625 14286.00 586551 - 586625 75 350 - 424 99 NP_005696.1
19 + 4 CDSi 591469 - 591570 14286.00 591469 - 591570 102 425 - 526 100 NP_005696.1
19 + 5 CDSi 592300 - 592386 14286.00 592300 - 592386 87 527 - 613 100 NP_005696.1
19 + 6 CDSi 594044 - 594127 14286.00 594044 - 594127 84 614 - 697 100 NP_005696.1
19 + 7 CDSi 594391 - 594482 14286.00 594391 - 594480 92 698 - 789 100 NP_005696.1
19 + 8 CDSi 595159 - 595340 14286.00 595160 - 595339 182 790 - 971 100 NP_005696.1
19 + 9 CDSl 595678 - 595739 14286.00 595680 - 595739 62 972 - 1033 100 NP_005696.1
19 + 9 UTR3 595740 - 596015 14286.00 276 1034 - 1309 100 NP_005696.1
19 + PolA 596015 14286.00 NP_005696.1
20 + TSS 580523 13749.00 NP_620593.1
20 + 1 UTR5 580523 - 580682 13749.00 160 1 - 160 99 NP_620593.1
20 + 1 CDSf 580683 - 580773 13749.00 580683 - 580772 91 161 - 251 99 NP_620593.1
20 + 2 CDSi 581922 - 582019 13749.00 581924 - 582019 98 252 - 349 100 NP_620593.1
20 + 3 CDSi 586551 - 586625 13749.00 586551 - 586625 75 350 - 424 100 NP_620593.1
20 + 4 CDSi 591469 - 591523 13749.00 591469 - 591522 55 425 - 479 100 NP_620593.1
20 + 5 CDSl 592300 - 592355 13749.00 592302 - 592355 56 480 - 535 100 NP_620593.1
20 + 5 UTR3 592356 - 592386 13749.00 31 536 - 566 100 NP_620593.1
20 + 6 EXON 594044 - 594127 13749.00 84 567 - 650 100 NP_620593.1
20 + 7 EXON 594391 - 594482 13749.00 92 651 - 742 100 NP_620593.1
20 + 8 EXON 595161 - 595340 13749.00 180 743 - 922 100 NP_620593.1
20 + 9 EXON 595678 - 596015 13749.00 338 923 - 1260 100 NP_620593.1
20 + PolA 596015 13749.00 NP_620593.1
Predicted protein(s):
>FGENESH: 1 4 exon (s) 13095 - 30408 491 aa, chain - ## NR: gi|40254450|ref|NP_003632.2| interferon induced transmembrane protein 1 (9-27) [Homo sapiens] ## query_len = 491 match_len = 125 query_cov = 0.18 match_cov = 0.64 query_x1 = 379 query_x2 = 467 match_x1 = 8 match_x2 = 87 score = 81 pid = 44.0 evalue = 6e-18
MGSSLVGQRGLTCSQVWSSSRQDVALAGVSGCLVVAASEAPTLLAGGGGGGEVAVPAAAV
SQALGCSRAAGLALRRERRRRGRQRRRRRAPRAPHPLKVGPLGGGSVDRQRPAGRRKDAR
TEAQGVRPRAEESLVGEEAPSGGLRQTRMGLGAAEPRPQTPRRAPHGSWERGVAQSASDL
VPRWRSVHFTACYCASMYGPGVRGSGQVCALGVPRLEVGGGAHELETPHLCRFLVLVAEP
ASPCVGSCLLVNGSERARGCAPGSGEPAVQPAVEPVAVPVAVSGVRGRRGQAQGPGQCPA
PLGDPASTTDGAQEARVPLDGAFWIPRPPAGSPKGCFACVSKPPALQAPAAPAPEPSASP
PMAPTLFPMESKSSKTDSVRAAGAPPACKHLAEKKTMTNPTTVIEVYPDTTEVNDYYLWS
IFNFVYLNFCCLGFIALAYSLKVRDKKLLNDLNGAVEDAKTARLFNITSSALAASCIILV
FIFLRYPLTDY
>FGENESH: 2 9 exon (s) 31318 - 41674 412 aa, chain - ## BY MRNA_MAP: NP_001900.1
MQPSSLLPLALCLLAAPASALVRIPLHKFTSIRRTMSEVGGSVEDLIAKGPVSKYSQAVP
AVTEGPIPEVLKNYMDAQYYGEIGIGTPPQCFTVVFDTGSSNLWVPSIHCKLLDIACWIH
HKYNSDKSSTYVKNGTSFDIHYGSGSLSGYLSQDTVSVPCQSASSASALGGVKVERQVFG
EATKQPGITFIAAKFDGILGMAYPRISVNNVLPVFDNLMQQKLVDQNIFSFYLSRDPDAQ
PGGELMLGGTDSKYYKGSLSYLNVTRKAYWQVHLDQVEVASGLTLCKEGCEAIVDTGTSL
MVGPVDEVRELQKAIGAVPLIQGEYMIPCEKVSTLPAITLKLGGKGYKLSPEDYTLKVSQ
AGKTLCLSGFMGMDIPPPSGPLWILGDVFIGRYYTVFDRDNNRVGFAEAARL
>FGENESH: 3 3 exon (s) 80704 - 81910 248 aa, chain - ## NR: gi|56204817|emb|CAI19051.1| actin, alpha 1, skeletal muscle [Homo sapiens] ## query_len = 248 match_len = 287 query_cov = 1.00 match_cov = 0.99 query_x1 = 2 query_x2 = 248 match_x1 = 4 match_x2 = 287 score = 348 pid = 62.0 evalue = 1e-121
MDDNITLLVIDNTSGMCKASFTDDDAPWAVFPSIVGCPRHQGMTVGMGQKDFYVGGEAQS
ERGILTLKYPIEHGIISSWDDMEKIWHHTFYNELREHPEVATATTSSSLEKSYELPDSQV
IPTGNERFRCPEEPFPPSFRGMKSCGIHETIFNSIMKCDMDILEDQYTNTVLSGGTSMHP
SITHRVQKEITALAPGTMKINVIAPPERKYSAWISGSIPASPSTFQQMWISKQAYDEPGP
SILHRKCF
>FGENESH: 4 1 exon (s) 112921 - 113397 158 aa, chain + ## BY PROTMAP: gi|119622870|gb|EAX02465.1| synaptotagmin VIII, isoform CRA_a [Homo sapiens] ## query_len = 158 match_len = 158 query_cov = 1.00 match_cov = 1.00 score = 330 pid = 100.0 evalue = 1e-123
MGHPPVSPSAPAPAGTTAIPGLIPDLVAGTPCELWDSQEGCGDNPAKWGLQLSTDALSLA
STPGPRWALIAGALAAGVLLVSCLLCAACCCCRRHRKKPRDKESVGLGSARGTTTTHLVR
SGSLLTQSREGLKSRLQSPGQRGEFSPRDGLTPTEAGR
>FGENESH: 5 5 exon (s) 114111 - 115246 239 aa, chain + ## NR: gi|94536846|ref|NP_612634.3| synaptotagmin VIII [Homo sapiens] ## query_len = 239 match_len = 401 query_cov = 0.82 match_cov = 0.49 query_x1 = 28 query_x2 = 223 match_x1 = 190 match_x2 = 385 score = 362 pid = 92.0 evalue = 1e-125
MVGWVGLDGWMGLGWVGLGSWVGLGSWAELPGATLQVQLFNFKRFSGHEPLGELRLPLGT
VDLQHVLEHWYLLGPPAATQPEQVGELCFSLRYVPSSGRLTVVVLEARGLRPGLAEPYVK
VQLMLNQRKWKKRKTATKKGTAAPYFNEAFTFLVPFSQVQNVDLVLAVWDRSLPLRTEPV
GKVHLGARASGQPLQHWADMLAHARRPIAQRHPLRPAREVDRMLALQPRLRLRLPLPHS
>FGENESH: 6 4 exon (s) 118349 - 119366 161 aa, chain + ## BY PROTMAP: gi|119622872|gb|EAX02467.1| troponin I type 2 (skeletal, fast), isoform CRA_a [Homo sapiens] ## query_len = 161 match_len = 161 query_cov = 1.00 match_cov = 1.00 score = 322 pid = 100.0 evalue = 1e-120
MLQIAATELEKEESRREAEKQNYLAEHCPPLHIPGSMSEVQELCKQLHAKIDAAEEEKYD
MEVRVQKTSKELEDMNQKLFDLRGKFKRPPLRRVRMSADAMLKALLGSKHKVCMDLRANL
KQVKKEDTEKERDLRDVGDWRKNIEEKSGMEGRKKMFESES
>FGENESH: 7 10 exon (s) 130960 - 165378 339 aa, chain + ## BY PROTMAP: gi|10880979|ref|NP_002330.1| lymphocyte-specific protein 1 isoform 1 [Homo sapiens] ## query_len = 339 match_len = 339 query_cov = 1.00 match_cov = 1.00 score = 687 pid = 100.0 evalue = 0
MAEASSDPGAEEREELLGPTAQWSVEDEEEAVHEQCQHERDRQLQAQDEEGGGHVPERPK
QEMLLSLKPSEAPELDEDEGFGDWSQRPEQRQQHEGAQGALDSGEPPQCRSPEGEQEDRP
GLHAYEKEDSDEVHLEELSLSKEGPGPEDTVQDNLGAAGAEEEQEEHQKCQQPRTPSPLV
LEGTIEQSSPPLSPTTKLIDRTESLNRSIEKSNSVKKSQPDLPISKIDQWLEQYTQAIET
AGRTPKLARQASIELPSMAVASTKSRWETGEVQAQSAAKTPSCKDIVAGDMSKKSLWEQK
GGSKTSSTIKSTPSGKRYKFVATGHGKYEKVLVEGGPAP
>FGENESH: 8 1 exon (s) 166960 - 168378 472 aa, chain - ## NR: no hits
MAPEVCGPSLQGTPGPPPPLLPKPGKDNLRLQKLLRKAARKKMMGGTHLAPPRAFRTSLS
PVSEASHDQEVTAPHAAEGPHPAEAPRLPEAPRPAEAPRMVAALPRSPHTPIIHHVASPL
QKSTFSIGLTQRRILAAQFRAMGPQVVASAPEPTRPPSGFVPVSGGGGTHVTQVHIQLAP
SPHNGTPEPPRTAPEVGSNSQDGDATPSPPRAQPLVPVAHIRPLPTTVQAASPLPEEPPV
PRPPPGFQASVPREASARVVVPIAPTCRSLESSPHSLVPMGPGREHLEEPPMAGPAAEAE
RVSSPAWASSPTPPSGPHPCPVPKVAPKPRLSGWTWLKKQLLEEAPEPPCPEPRQSLEPE
VPTPTEQEVPAPTEQEVPALTAPRAPASRTSRMWDAVLYRMSVAEAQGRLAGPSGGEHTP
ASLTRLPFLYRPRFNARKLQEATRPPPTVRSILELSPQPKNFNRTATGWRLQ
>FGENESH: 9 11 exon (s) 187635 - 216307 304 aa, chain + ## NR: gi|119570954|gb|EAW50569.1| troponin T type 3 (skeletal, fast) [Homo sapiens] ## query_len = 304 match_len = 257 query_cov = 0.71 match_cov = 0.84 query_x1 = 89 query_x2 = 304 match_x1 = 42 match_x2 = 257 score = 302 pid = 73.0 evalue = 1e-103
MASSELRALCSSFVVESSALLAGTFHHNPSRMYPLLGVKIAADTSWRRREDLASFSKHVL
CILASFSLLLWPFLYLKPPTFTMSDEEVLTAPKIPEGEKVDFDDIQKKRQNKDLMELQAL
IDSHFEARKKEEEELVALKERIEKRRAERAEQQRIRAEKERERQNRLAEEKARREEEDAK
RRAEDDLKKKKALSSMGANYSSYLAKADQKRGKKQTAREMKKKILAERRKPLNIDHLGED
KLRDKAKELWETLHQLEIDKFEFGEKLKRQKYDITTLRSRIDQAQKHSKKAGTPAKGKVG
GRWK
>FGENESH: 10 5 exon (s) 225178 - 234235 153 aa, chain + ## BY PROTMAP: gi|68566020|sp|Q16540|RM23_HUMAN Mitochondrial 39S ribosomal protein L23 (L23mt) (MRP-L23) (L23 mitochondrial-related protein) (Ribosomal protein L23-like) ## query_len = 153 match_len = 153 query_cov = 1.00 match_cov = 1.00 score = 315 pid = 100.0 evalue = 1e-117
MARNVVYPLYRLGGPQLRVFRTNFFIQLVRPGVAQPEDTVQFRIPMEMTRVDLRNYLEGI
YNVPVAAVRTRVQHGSNKRRDHRNVRIKKPDYKVAYVQLAHGQTFTFPDLFPEKDESPEG
SAADDLYSMLEEERQQRQSSDPRRGGVPSWFGL
>FGENESH: 11 7 exon (s) 241806 - 368165 245 aa, chain - ## NR: no hits
MPALPMAFMRTGAPFKLRQLLATAGLLQPSGPNQSAGSSKGVQEEGSPAEPGLQLGWTCP
PAAEGQDARSGGRDKQDMTWSGVTARTEEARPAFLNTLGWWGCGKKRVCFFTSSTESAHY
GCPLGSQNPQHERNGATQLKPGPLNPDTKPSSLEMNMLHFTTTALPDSGIGSGRYPRQPL
PPQRDPLGLTLPWAQGRARDGPWEKDGFMGRWPGQWAQAKAGSADVPNVTTAGGGTAMWV
ARGTP
>FGENESH: 12 3 exon (s) 410802 - 413338 180 aa, chain - ## BY PROTMAP: gi|4504609|ref|NP_000603.1| insulin-like growth factor 2 [Homo sapiens] ## query_len = 180 match_len = 180 query_cov = 1.00 match_cov = 1.00 score = 373 pid = 100.0 evalue = 1e-139
MGIPMGKSMLVLLTFLAFASCCIAAYRPSETLCGGELVDTLQFVCGDRGFYFSRPASRVS
RRSRGIVEECCFRSCDLALLETYCATPAKSERDVSTPPTVLPDNFPRYPVGKFFQYDTWK
QSTQRLRRGLPALLRARRGHVLAKELEAFREAKRHRPLIALPTQDPAHGGAPPEMASNRK
>FGENESH: 13 2 exon (s) 418459 - 424262 168 aa, chain + ## BY PROTMAP: gi|98986327|ref|NP_057496.2| insulin-like growth factor 2 antisense [Homo sapiens] ## query_len = 168 match_len = 168 query_cov = 1.00 match_cov = 1.00 score = 352 pid = 100.0 evalue = 1e-131
MSKRKWRGFRGAQQERAQPPAASPQPCPAPHAGLPGGSRRRAPAPAGQQQMRAESRSGAQ
RRRGSARRGAHREAGGCVRGRTRSSGSERSNALWQAVDAAEALALSSPLRRPWDQAQHFT
NPAPFSKGPQSAPPSPPAGRRRRGADLALTPLAGEGHTRWRQPGRPGK
>FGENESH: 14 2 exon (s) 437667 - 438786 110 aa, chain - ## BY MRNA_MAP: NP_000198.1
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
>FGENESH: 15 14 exon (s) 442048 - 449601 528 aa, chain - ## BY PROTMAP: gi|88900501|ref|NP_954986.2| tyrosine hydroxylase isoform a [Homo sapiens] ## query_len = 528 match_len = 528 query_cov = 1.00 match_cov = 1.00 score = 1059 pid = 100.0 evalue = 0
MPTPDATTPQAKGFRRAVSELDAKQAEAIMVRGQGAPGPSLTGSPWPGTAAPAASYTPTP
RSPRFIGRRQSLIEDARKEREAAVAAAAAAVPSEPGDPLEAVAFEEKEGKAVLNLLFSPR
ATKPSALSRAVKVFETFEAKIHHLETRPAQRPRAGGPHLEYFVRLEVRRGDLAALLSGVR
QVSEDVRSPAGPKVPWFPRKVSELDKCHHLVTKFDPDLDLDHPGFSDQVYRQRRKLIAEI
AFQYRHGDPIPRVEYTAEEIATWKEVYTTLKGLYATHACGEHLEAFALLERFSGYREDNI
PQLEDVSRFLKERTGFQLRPVAGLLSARDFLASLAFRVFQCTQYIRHASSPMHSPEPDCC
HELLGHVPMLADRTFAQFSQDIGLASLGASDEEIEKLSTLYWFTVEFGLCKQNGEVKAYG
AGLLSSYGELLHCLSEEPEIRAFDPEAAAVQPYQDQTYQSVYFVSESFSDAKDKLRSYAS
RIQRPFSVKFDPYTLAIDVLDSPQAVRRSLEGVQDELDTLAHALSAIG
>FGENESH: 16 1 exon (s) 547566 - 548147 193 aa, chain - ## BY PROTMAP: gi|4885665|ref|NP_005161.1| achaete-scute complex homolog-like 2 [Homo sapiens] ## query_len = 193 match_len = 193 query_cov = 1.00 match_cov = 1.00 score = 387 pid = 100.0 evalue = 1e-144
MDGGTLPRSAPPAPPVPVGCAARRRPASPELLRCSRRRRPATAETGGGAAAVARRNERER
NRVKLVNLGFQALRQHVPHGGASKKLSKVETLRSAVEYIRALQRLLAEHDAVRNALAGGL
RPQAVRPSAPRGPPGTTPVAASPSRASSSPGRGGSSEPGSPRSAYSSDDSGCEGALSPAE
RELLDFSSWLGGY
>FGENESH: 17 3 exon (s) 577271 - 579623 132 aa, chain - ## BY PROTMAP: gi|20454856|sp|Q9P2W6|CK021_HUMAN Uncharacterized protein C11orf21 ## query_len = 132 match_len = 132 query_cov = 1.00 match_cov = 1.00 score = 263 pid = 91.0 evalue = 2e-97
MGRTWCGMWRRRRPGRRSAVPRWPHLSSQSGVEPPDRCARSLYNVLTHQEAPGSMMPPAA
AQPSAHGALVPPATAHEPVDHPALHWLACCCCLSLPGQLPLAIRLGWDLDLEAGPSSGKL
CPRARRWQPLPS
>FGENESH: 18 10 exon (s) 579965 - 595739 320 aa, chain + ## BY MRNA_MAP: NP_620591.1
MGPWSRVRVAKCQMLVTCFFILLLGLSVATMVTLTYFGAHFAVIRRASLEKNPYQAVHQW
AFSAGLSLVGLLTLGAVLSAAATVREAQGLMAGGFLCFSLAFCAQVQVVFWRLHSPTQVE
DAMLDTYDLVYEQAMKGTSHVRRQELAAIQDVFLCCGKKSPFSRLGSTEADLCQGEEAAR
EDCLQGIRSFLRTHQQVASSLTSIGLALTVSALLFSSFLWFAIRCGCSLDRKGKYTLTPR
ACGRQPQEPSLLRCSQGGPTHCLHSEAVAIGPRGCSGSLRWLQESDAAPLPLSCHLAAHR
ALQGRSRGGLSGCPERGLSD
>FGENESH: 19 9 exon (s) 580683 - 595739 290 aa, chain + ## BY MRNA_MAP: NP_005696.1
MVTLTYFGAHFAVIRRASLEKNPYQAVHQWAFSAGLSLVGLLTLGAVLSAAATVREAQGL
MAGGFLCFSLAFCAQVQVVFWRLHSPTQVEDAMLDTYDLVYEQAMKGTSHVRRQELAAIQ
DVFLCCGKKSPFSRLGSTEADLCQGEEAAREDCLQGIRSFLRTHQQVASSLTSIGLALTV
SALLFSSFLWFAIRCGCSLDRKGKYTLTPRACGRQPQEPSLLRCSQGGPTHCLHSEAVAI
GPRGCSGSLRWLQESDAAPLPLSCHLAAHRALQGRSRGGLSGCPERGLSD
>FGENESH: 20 5 exon (s) 580683 - 592355 124 aa, chain + ## BY MRNA_MAP: NP_620593.1
MVTLTYFGAHFAVIRRASLEKNPYQAVHQWAFSAGLSLVGLLTLGAVLSAAATVREAQGL
MAGGFLCFSLAFCAQVQVVFWRLHSPTQVEDAMLDTYDLVYEQAMKVSVLWEEVSFQPSG
EHRG
//
If Fgenesh++ does not find any reliable genes (for example, in short sequences), the output looks like this:
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
Seq name: small_seq
Length of sequence: 60
no reliable predictions
//
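When processing many *.res files it can be useful to list the sequences for which nothing was predicted. The Python sketch below is an illustration only. It assumes that each record ends with a line containing only '//' (as in the examples above) and that several records may be concatenated in one text, which is an assumption about your workflow, not a statement about the package. The function names are hypothetical.

```python
import re


def split_records(text):
    """Split output text into per-sequence records on lines that contain only '//'."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if any(part.strip() for part in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record even if its '//' terminator is missing.
    if any(part.strip() for part in current):
        records.append("\n".join(current))
    return records


def sequences_without_predictions(text):
    """Return the 'Seq name:' values of records tagged 'no reliable predictions'."""
    names = []
    for record in split_records(text):
        if "no reliable predictions" in record:
            m = re.search(r"Seq name:\s*(\S+)", record)
            names.append(m.group(1) if m else None)
    return names


EXAMPLE = """\
FGENESH++ 7.2.2 Mapped known genes and predicted genes in genomic DNA
Seq name: small_seq
Length of sequence: 60
no reliable predictions
//
"""
print(sequences_without_predictions(EXAMPLE))
```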
Fgenesh++ / Fgenesh output can be converted to GFF3 or GenBank format by scripts
fgenesh_2_gff3.pl
fgenesh_2_genbank.pl
Last updated: March 13, 2026