This software can NOT be copied or distributed without Softberry license.
Copyright Softberry, Inc. 1999-2015.
Fgenesh++ package includes the following copyrighted software:
Fgenesh - most accurate and fastest HMM-based gene prediction program;
Fgenesh+ - gene prediction program that uses similar protein information;
gene finding parameters for Fgenesh and Fgenesh+;
est_map - program for mapping known mRNAs to genome
(genome alignment with splice sites identification) and mapping
a set of available ESTs to improve the gene prediction accuracy
and add 5' and 3'- noncoding sequences;
prot_map - mapping protein database to genomic sequences;
perl scripts.
Note:
in the current release/manual of the pipeline programs Fgenesh and Fgenesh+
are called by their pipeline internal names - 'ppd' and 'ppdn+', respectively.
Fgenesh++ package uses the following public software and data:
NCBI BLAST+ blastp or BLAST blastall, bl2seq programs;
NCBI NR database (non-redundant protein database);
NCBI RefSeq database.
Fgenesh++ package requires the following analyzed sequences:
genomic sequences;
genomic sequences with repeats masked by N (optionally).
References to Fgenesh++ software:
Solovyev V, Kosarev P, Seledsov I, Vorobyev D. (2006)
Automatic annotation of eukaryotic genes, pseudogenes and promoters.
Genome Biol. 2006;7 Suppl 1:S10.1-12.
Solovyev V.V. (2002) Finding genes by computer: probabilistic and
discriminative approaches. In Current Topics in Computational Biology
(eds. T. Jiang, T. Smith, Y. Xu, M. Zhang), The MIT Press, p. 365-401.
Salamov A., Solovyev V. (2000) Ab initio gene finding in Drosophila
genomic DNA. Genome Res. 10(4), 516-522.
Copy FGENESHPIPE/ directory from the distribution kit to appropriate directory.
Make configuration file (see "Configuration file").
An example of running the pipeline is provided for three sequences
(nGASP train sequences 1.seq, 2.seq, 3.seq) in EXAMPLE_nGASP/ directory.
There you can find example data, configuration and command files
used to run the pipeline.
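When preparing files, put each sequence in a separate fasta file.
The pipeline takes lists of such files as input (a list of files with
non-masked sequences and, optionally, a list of files with repeats-masked
sequences).
When making such lists, provide full paths to files with sequences.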
EXAMPLE
list of files with non-masked sequences (e.g., file 'seq.list'):
/home/EXAMPLE_nGASP/SEQ_train/1.seq
/home/EXAMPLE_nGASP/SEQ_train/2.seq
/home/EXAMPLE_nGASP/SEQ_train/3.seq
...
list of files with repeats-masked sequences (e.g., 'seq_N.list'):
/home/EXAMPLE_nGASP/SEQ_train.N/1.seq.N
/home/EXAMPLE_nGASP/SEQ_train.N/2.seq.N
/home/EXAMPLE_nGASP/SEQ_train.N/3.seq.N
...
Naming convention for files with non-masked and repeats-masked sequences
used in the pipeline:
files with non-masked sequences can be named in any way, with or without
(one or several) extensions, e.g., filename for non-masked sequence ENm003
could look like 'ENm003.seq', 'ENm003.fa', 'ENm003.hg17.fa', 'ENm003', etc.
To derive filename for repeats-masked sequence, add to the filename of
corresponding non-masked sequence some additional extension, e.g., '.N'
or '.masked', etc.
Here are some examples of filenames for files with sequences:
non-masked           repeats-masked
1.seq            ->  1.seq.N
ENm003.seq       ->  ENm003.seq.N
ENm003.fa        ->  ENm003.fa.masked
ENm003.hg17.fa   ->  ENm003.hg17.fa.n
ENm003           ->  ENm003.m
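The list files can also be generated by a small script. The Python sketch below is not part of the Fgenesh++ distribution; the function names, the '.N' default extension and the one-path-per-line layout of the list files are assumptions made for illustration. It derives the name of each repeats-masked file by adding an extension to the non-masked filename, as in the naming convention above.

```python
import os


def masked_name(path, ext=".N", masked_dir=None):
    """Derive a repeats-masked filename by adding an extension.

    If masked_dir is given, the masked file is expected in that directory
    (as in SEQ_train/ vs SEQ_train.N/); otherwise next to the original.
    """
    if masked_dir is None:
        return path + ext
    return os.path.join(masked_dir, os.path.basename(path) + ext)


def write_seq_lists(seq_files, list_path, masked_list_path,
                    ext=".N", masked_dir=None):
    """Write two list files with full paths, one path per line."""
    full = [os.path.abspath(p) for p in seq_files]
    masked = [masked_name(p, ext, masked_dir) for p in full]
    with open(list_path, "w") as out:
        out.write("\n".join(full) + "\n")
    with open(masked_list_path, "w") as out:
        out.write("\n".join(masked) + "\n")
    return full, masked
```

Both lists are written in the same order, so that line N of the masked list refers to the masked version of the sequence on line N of the non-masked list.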
Make a file with locations of BLAST+ or BLAST programs.
For example, you can edit
FGENESHPIPE/blast_paths.list
If you use BLAST, provide paths to 'blastall' and 'bl2seq' programs, for example:
# location of blast programs
/home/blast-2.2.26/bin/blastall    ## for blastp
/home/blast-2.2.26/bin/bl2seq      ## for blasting 2 sequences
If you use BLAST+, provide paths to 'blastp' program (on both lines), for example:
# location of blast programs
/home/ncbi-blast-2.2.28+/bin/blastp    ## for blastp
/home/ncbi-blast-2.2.28+/bin/blastp    ## for blasting 2 sequences
or, if you can run it without a path, then:
# location of blast programs
blastp    ## for blastp
blastp    ## for blasting 2 sequences
You should put location of 'blast_paths.list' in configuration file (see "Configuration file").
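Before running the pipeline it can be useful to check that the programs listed in 'blast_paths.list' can actually be found. The following Python sketch is only an illustration and is not part of the package; it assumes that everything after '#' on a line is a comment and that the remaining entries are program locations, as in the examples above. The function names are hypothetical.

```python
import shutil


def read_blast_paths(text):
    """Return program locations from the content of blast_paths.list.

    Assumption: '#' starts a comment; non-empty remainders are locations.
    """
    paths = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            paths.append(entry)
    return paths


def missing_programs(paths):
    """Return entries that are not found as executables.

    shutil.which() accepts both full paths and bare program names
    (the latter are searched for in PATH).
    """
    return [p for p in paths if shutil.which(p) is None]
```

An empty list returned by missing_programs() means that every entry points to an executable file or to a program available in PATH.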
Organism-specific gene prediction parameters are usually located in
FGENESHPIPE/EXE_CFG/ directory. For example,
C_elegans_nGASP - gene prediction parameters for C.elegans used in the example how to run the pipeline (EXAMPLE_nGASP/).
Parameters file contains parameters that regulate gene prediction process.
For a number of parameters you might want to set up values other than those initially provided in the file, e.g., maximal intron length allowed in a gene or minimal length of intron suitable for prediction of genes in introns of other genes. Some parameters may require expertise to manipulate them properly.
We provide parameters files for mammals and for non-mammals:
mammals.par - for mammals
non_mamm.par - for non-mammals
Put path to parameters file in configuration file (see "Configuration file").
In configuration file you can indicate which steps of the pipeline to run, provide paths to parameters files, data files or directories and set other options.
Example:
EXAMPLE_nGASP/run/ce.cfg
-----------------------------------------------------------------------
#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters
GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP   # gene prediction parameters
PIPE_PARAM = /home/FGENESHPIPE/non_mamm.par              # location of parameters files

# Mapping known mRNAs
MAP_mRNAs = 1                                            # map known mRNA sequences to genome sequences
CDNA_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.cdna        # *.cdna file for known mRNAs
PROT_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.pro         # *.pro file for known mRNAs
DAT_FILE  = /home/EXAMPLE_nGASP/DATA/ngasp_1.dat         # *.dat file for known mRNAs

# Mapping ESTs
MAP_ESTS = 0                                             # map ESTs to genomic sequences
EST_FILE = /home/EXAMPLE_nGASP/DATA/rna_matches.fa       # file with ESTs

# Using reads
USE_READS = 0                                            # use reads info to improve gene models
DIR_SITES = /home/EXAMPLE_nGASP/reads_sites/             # directory with reads *.sites files

# Using known proteins for prediction
# (predict genes based on homology to known proteins)
USE_PROTEINS = 1                                         # 0 - no, 1 - yes
PROG_PROT = 1                                            # 1 - use prot_map, 2 - use blast
NUM_THREADS = 5                                          # number of processors for 'prot_map' or 'blast'
PROTEIN_DB = /home/NR/nr_ce                              # protein DB
BLAST_PATHS = /home/FGENESHPIPE/blast_paths.list         # location of "blast_paths.list"

# Predicting genes in long introns of other genes
INTRONIC_GENES = 0                                       # predict genes in long introns of other genes
-----------------------------------------------------------------------
In the line 'GENE_PARAM' provide location of a file with organism-specific gene prediction parameters, for example:
GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP # gene prediction parameters
In the line 'PIPE_PARAM' provide location of a file with pipeline general parameters. For example,
for mammals     - PIPE_PARAM = /home/FGENESHPIPE/mammals.par     # location of parameters files
for non-mammals - PIPE_PARAM = /home/FGENESHPIPE/non_mamm.par
Then, you can switch ON (1) or OFF (0) each of the pipeline steps:
MAP_mRNAs = 1        # map known mRNA sequences to genome sequences
MAP_ESTS = 0         # map ESTs to genomic sequences
USE_READS = 0        # use reads info to improve gene models
USE_PROTEINS = 1     # 0 - no, 1 - yes
INTRONIC_GENES = 0   # predict genes in long introns of other genes
Please note that in the current version of the pipeline either "using ESTs" (MAP_ESTS) or "using reads" (USE_READS) step can be used (switched ON) but not both at the same time.
For each step (if it is switched ON) provide necessary data / options:
- MAP_mRNAs
if "mapping known mRNAs" step is used, prepare for a query organism three
files (*.cdna, *.pro, *.dat) from RefSeq (see APPENDIX 2) and put their
locations on corresponding lines
- MAP_ESTS
if "using ESTs to improve gene models" step is used, a file with ESTs must
be provided
- USE_READS
if "using reads to improve gene models" step is used, a directory with
*.sites files (potential splice sites) obtained during mapping of reads
to sequences must be provided
- USE_PROTEINS
# PROG_PROT = 1 # 1 - use prot_map, 2 - use blast
"Prediction of genes based on homology to known proteins" step can be run
with either 'prot_map' (1) or 'BLAST' (2) method:
"prot_map" (default) method runs prot_map to map proteins from protein
database to genomic sequences and selects good quality mappings, then
fgenesh+ predicts more refined gene models in regions with good mapped
proteins;
"BLAST" (alternative) method first predicts genes ab initio (by fgenesh),
then finds homologs to predicted proteins in a database (by BLAST) and
then tries to refine gene models (by fgenesh+) using protein homologs found.
We recommend using the "prot_map" method because it is more straightforward
and gives higher accuracy in our tests, while the "BLAST" method relies on
heuristics to merge genes that were split when predicted ab initio.
# NUM_THREADS = 5 # number of processors for 'prot_map' or 'blast'
The prot_map program can be run in parallel. Put the number of processors / threads for 'prot_map' in 'NUM_THREADS' line.
# PROTEIN_DB = /home/NR/nr_ce # protein DB
Prepare protein database (NR or its subset or custom protein database) (see APPENDIX 1) and provide its location in 'PROTEIN_DB' line.
# BLAST_PATHS = /home/FGENESHPIPE/blast_paths.list # location of "blast_paths.list"
Provide location of 'blast_paths.list' in 'BLAST_PATHS' line.
- AB INITIO
The pipeline always runs "ab initio gene prediction" step in regions with no genes predicted by other methods (using known mRNAs or known proteins), therefore there is nothing to set up for it in configuration file.
- INTRONIC_GENES
The final optional step, prediction of genes in long introns of other genes, can be switched ON and OFF by setting the corresponding value to 1 or 0.
Lines starting with '#' are comment lines.
You can put "na" or any string (or leave it empty) instead of paths or options if the corresponding steps are switched OFF.
Also, for convenience, if some steps are not used / switched OFF, you can remove them from configuration file.
For example, this is how configuration file can look if only "predict genes based on homology to known proteins" step is used (and ab initio that is always ON) -
EXAMPLE_nGASP/run/ce_protmap.cfg
#
# Location of data and options for eukaryotic genome annotation
#

# Organism-specific and pipeline parameters
GENE_PARAM = /home/FGENESHPIPE/EXE_CFG/C_elegans_nGASP   # gene prediction parameters
PIPE_PARAM = /home/FGENESHPIPE/non_mamm_30_06.par        # location of parameters files

# Using known proteins for prediction
# (predict genes based on homology to known proteins)
USE_PROTEINS = 1                                         # 0 - no, 1 - yes
PROG_PROT = 1                                            # 1 - use prot_map, 2 - use blast
NUM_THREADS = 5                                          # number of processors for 'prot_map' or 'blast'
PROTEIN_DB = /home/NR/nr_ce                              # protein DB
BLAST_PATHS = /home/FGENESHPIPE/blast_paths.list         # location of "blast_paths.list"
Examples of other config files:
EXAMPLE_nGASP/run/
ce_est.cfg     - 'Mapping ESTs' is switched ON
ce_reads.cfg   - 'Using reads' is switched ON
ce_protmap.cfg - 'Using known proteins' only (and 'ab initio' that always runs)
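A configuration file can be checked before a long run. The Python sketch below is an illustration only: it assumes the 'KEY = value # comment' layout shown in the examples above and that '#' does not occur inside paths; how the pipeline itself parses the file is not described here. The function names are hypothetical. The checks follow the rules stated in this section: MAP_ESTS and USE_READS cannot both be ON, and each step that is ON needs its data.

```python
def parse_config(text):
    """Parse 'KEY = value  # comment' lines into a dict of strings."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def check_config(options):
    """Return a list of problems found in parsed options (empty if none)."""
    problems = []
    if options.get("MAP_ESTS") == "1" and options.get("USE_READS") == "1":
        problems.append("MAP_ESTS and USE_READS cannot both be switched ON")
    # Data required by each step, as listed in this section.
    required = {
        "MAP_mRNAs": ["CDNA_FILE", "PROT_FILE", "DAT_FILE"],
        "MAP_ESTS": ["EST_FILE"],
        "USE_READS": ["DIR_SITES"],
        "USE_PROTEINS": ["PROTEIN_DB", "BLAST_PATHS"],
    }
    for switch, keys in required.items():
        if options.get(switch) == "1":
            for key in keys:
                if options.get(key, "") in ("", "na"):
                    problems.append(f"{switch} is ON but {key} is not set")
    return problems
```

Steps that are absent from the file are treated as switched OFF, which matches the shortened configuration files listed above.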
Run 'run_pipe.pl' without parameters for help.
Usage: FGENESHPIPE/run_pipe.pl <project config file> -l <seq_list> [-m <seq_masked_list>] -d <dir_results> [-w <dir_tmp>]
where
<project config file>  - project configuration file
-l <seq_list>          - list of sequences
-m <seq_masked_list>   - list of masked sequences (if available)
-d <dir_results>       - directory to store results
-w <dir_tmp>           - tmp directory
                         must be unique for each run / process
                         (if not provided, tmp directory is created automatically)
Examples:
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_1.list -m ../seq_lists/seq_1N.list -d results_1/ -w tmp_work_1/
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_2.list -m ../seq_lists/seq_2N.list -d results_2/ -w tmp_work_2/
...
FGENESHPIPE/run_pipe.pl celegans_prj.cfg -l seq.list -d results/
List of non-masked sequences (or any sequences if only one set is available -
non-masked or masked) is provided via the -l <seq_list> option.
Directory with results will contain *.resn3 files with Fgenesh++ predictions
(predicted gene structures and corresponding proteins in Fgenesh++ format).
If not provided explicitly, unique temporary directory for temporary and
intermediate files will be created automatically and cleaned after
calculations are finished.
If there are many sequences, it is better to split them into several lists and run in parallel, for example:
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_1.list -m ../seq_lists/seq_1N.list -d results_1
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_2.list -m ../seq_lists/seq_2N.list -d results_2
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_3.list -m ../seq_lists/seq_3N.list -d results_3
...
Directory with results is not created anew or cleaned if it already exists. Therefore, it is possible to write results into the same directory:
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_1.list -m ../seq_lists/seq_1N.list -d results
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_2.list -m ../seq_lists/seq_2N.list -d results
FGENESHPIPE/run_pipe.pl human_prj.cfg -l ../seq_lists/seq_3.list -m ../seq_lists/seq_3N.list -d results
...
But if the pipeline creates files with same names as ones stored in results directory, new files will overwrite old ones.
Commands can be written to and run as *.sh files, for example, 'run_1.sh', 'run_2.sh', etc.
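Splitting a long list and writing the corresponding commands can be scripted. The Python sketch below is one possible way to do it and is not part of the pipeline; the function names and the 'seq_<i>.list' / 'seq_<i>N.list' file naming are assumptions chosen to match the examples above. It only builds command lines as text (for example, to save them into 'run_1.sh', 'run_2.sh', etc.) and does not run anything.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def build_commands(cfg, n_parts, list_prefix="seq_", masked=True):
    """Build one run_pipe.pl command line per chunk.

    Each command gets its own results and tmp directory, since the tmp
    directory must be unique for each run / process.
    """
    commands = []
    for i in range(1, n_parts + 1):
        cmd = f"FGENESHPIPE/run_pipe.pl {cfg} -l {list_prefix}{i}.list"
        if masked:
            cmd += f" -m {list_prefix}{i}N.list"
        cmd += f" -d results_{i}/ -w tmp_work_{i}/"
        commands.append(cmd)
    return commands
```

If masked sequences are used, split the non-masked and masked lists with the same function and the same number of parts, so that list i of masked sequences contains the masked versions of the sequences in list i.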
When running, the pipeline makes some log output on the screen.
This log output does not make much sense for users but can help to spot errors and bugs, if any, for developers. If you want to save log output to a file, run the pipeline with the following command:
(bash, sh)  FGENESHPIPE/run_pipe.pl ... 2> <log file>
(tcsh, csh) FGENESHPIPE/run_pipe.pl ... >& <log file>
Examples:
(bash) FGENESHPIPE/run_pipe.pl celegans_prj.cfg -l seq.list -d results/ 2> run.log
(tcsh) FGENESHPIPE/run_pipe.pl celegans_prj.cfg -l seq.list -d results/ >& run.log
Fgenesh++ pipeline can deal with several types of evidence:
1) gene models predicted using full-length mRNA sequences
This corresponds to "MAP_mRNAs" block in configuration file.
Sometimes full-length mRNA sequences can be obtained from RefSeq
for a specified organism.
See "APPENDIX 2. Preparing known mRNA data for mRNA mapping" for how
to get mRNAs from RefSeq and prepare them for Fgenesh++.
If you want to use full-length mRNAs, switch ON 'MAP_mRNAs' block in
configuration file.
For example,
MAP_mRNAs = 1                                        # map known mRNA sequences to genome sequences
CDNA_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.cdna    # *.cdna file for known mRNAs
PROT_FILE = /home/EXAMPLE_nGASP/DATA/ngasp_1.pro     # *.pro file for known mRNAs
DAT_FILE  = /home/EXAMPLE_nGASP/DATA/ngasp_1.dat     # *.dat file for known mRNAs
mRNAs are mapped to genomic sequences and good mappings (gene models) are selected. No other type of evidence is used for these mRNA-supported (as we call them) gene models.
2) gene models predicted using known proteins from NR (or its subset, e.g.,
eukaryotic part of NR)
This corresponds to "USE_PROTEINS" block in configuration file.
"prot_map" (default) method predicts gene models using combination of prot_map
and Fgenesh+, with additional selection of reliable models through blast2
alignments between predicted and homologous proteins.
First, 'prot_map' maps known proteins from a protein database (for example,
NR or its subset) to genomic sequences and good mappings are selected.
Then 'fgenesh+' program predicts gene models in regions of good mappings
using mapped proteins.
After that predicted gene models are additionally checked / filtered by
script that analyses blast2 (bl2seq) alignments between predicted and
homologous proteins. Only gene models that have good coverage of predicted
and homologous proteins by blast alignment are selected. Some other criteria
are also checked.
"BLAST" (alternative) method first predicts genes ab initio (by fgenesh),
then finds homologs to predicted proteins in a database (by BLAST) and
then tries to refine gene models (by fgenesh+) using protein homologs found.
We recommend using the "prot_map" method because it is more straightforward
and gives higher accuracy in our tests, while the "BLAST" method relies on
heuristics to merge genes that were split when predicted ab initio.
If you want to use protein-supported (as we call them) predictions, switch ON
"USE_PROTEINS" block in configuration file.
For example,
USE_PROTEINS = 1                                   # 0 - no, 1 - yes
PROG_PROT = 1                                      # 1 - use prot_map, 2 - use blast
NUM_THREADS = 5                                    # number of processors for 'prot_map' or 'blast'
PROTEIN_DB = /home/NR/nr_ce                        # protein DB
BLAST_PATHS = /home/FGENESHPIPE/blast_paths.list   # location of "blast_paths.list"
Protein-supported predictions are made in regions not occupied by mRNA-supported predictions.
3) Ab initio predictions (without mRNA or protein type of evidence) are made
in regions not occupied by mRNA-supported and protein-supported predictions.
To make ab initio predictions, we use 'fgenesh' and gene prediction parameters
trained for specified (or close) organism.
The pipeline always runs ab initio predictions (in regions with no genes
predicted by other methods), therefore there is nothing to set up for it
in configuration file.
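4) using reads to improve gene predictions
This corresponds to "USE_READS" block in configuration file.
If you want to use evidence from reads, switch ON "USE_READS" block in
configuration file and provide a directory with *.sites files.
For example,
USE_READS = 1                                  # use reads info to improve gene models
DIR_SITES = /home/EXAMPLE_nGASP/reads_sites/   # directory with reads *.sites files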
5) using ESTs to improve gene predictions
This corresponds to "MAP_ESTS" block in configuration file.
Similar to reads, ESTs can be used to make more accurate predictions.
ESTs are aligned to genomic sequences by 'est_map' and good alignments are
selected. A list of alignment blocks is compiled with information whether
such blocks are flanked at one or both ends with splice sites or not. This
information about regions (potential exons) supported by ESTs and potential
splice sites is used by 'fgenesh+' (see 2) from above) and 'fgenesh' (see 3)
from above).
Boundaries of CDS exons and gene structures are predicted more correctly.
Also, 5'- and 3'- untranslated parts of first and last CDS exons can be
predicted.
Technically, as with reads, this additional information is used in the
process of prediction, not as a step after making initial predictions.
If you want to use evidence from ESTs, switch ON "MAP_ESTS" block in
configuration file.
For example,
MAP_ESTS = 1                                         # map ESTs to genomic sequences
EST_FILE = /home/EXAMPLE_nGASP/DATA/rna_matches.fa   # file with ESTs
Currently, evidence from reads and ESTs can not be used simultaneously in
Fgenesh++. Apart from that, there is no limitation on combination of evidence
data.
Also, each predicted gene is represented by only one gene model (the one with
the maximum weight) even if reads or ESTs support several alternative
splicing gene variants. In next versions, we will make it possible to predict
alternative gene variants based on reads / EST data.
6) ab initio gene predictions in long introns of other genes
Optionally, pipeline can make gene predictions in long introns of other genes.
If you want to use it, switch this option ON. For example,
INTRONIC_GENES = 1 # predict genes in long introns of other genes
If you switch all options in configuration file to 0, Fgenesh++ runs as Fgenesh, i.e. it makes ab initio predictions in genomic sequences.
To run the pipeline, do the following steps:
- install Fgenesh++ pipeline (see "INSTALLATION");
- prepare genomic sequences (see "Preparation of sequences");
- if "mapping known mRNAs" step is used, prepare three special files
*.cdna, *.pro, *.dat with known mRNAs data for a query organism
(see APPENDIX 2);
- if "prediction of genes based on homology to known proteins" step
is used, prepare NR or its subset or other / custom protein database
(see APPENDIX 1);
- if "using reads to improve gene models" step is used, run ReadsMap
pipeline to map reads to sequences and obtain *.sites files with
potential splice sites;
- if "using ESTs to improve gene models" step is used, prepare file
with EST sequences;
(currently either "using reads" or "using ESTs" step can be used but
not both at the same time)
- make configuration file (see "Configuration file");
- run the pipeline (see "Command to run the pipeline")
*.resn3 files in results directory will contain predicted gene structures
and corresponding proteins in Fgenesh++ format - see APPENDIX 3 for details.
If "prediction of genes based on homology to known proteins" step is
used then NR (or other protein database) is required as well as BLAST+
or BLAST software.
BLAST+ can be downloaded from NCBI ftp or web page:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
The legacy BLAST executables can be downloaded from NCBI ftp:
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/
Use BLAST+ or BLAST release 2.2.13 or some higher version.
NR (non-redundant protein sequence database) can be downloaded from
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
NR or custom protein database must be present in directory provided in
configuration file in both formats - as FASTA file and as formatted for
BLAST by 'makeblastdb' or 'formatdb' program.
Format protein database for BLAST with 'makeblastdb' (BLAST+) or 'formatdb' (BLAST):
makeblastdb -in nr -dbtype prot    (BLAST+)
formatdb -i nr                     (BLAST)
Do NOT use '-parse_seqids' (BLAST+) or '-o T' (BLAST) option since the
pipeline requires a protein database to be formatted without parsing Seq-ids!
After formatting, protein DB will be present in both formats:
nr          (FASTA formatted)
nr.*.phr    (BLAST formatted)
nr.*.pin    (BLAST formatted)
nr.*.psq    (BLAST formatted)
Put the name of FASTA file ('nr' in example above) in configuration file.
We highly recommend using relevant NR subsets for calculations. It can
greatly speed up calculations because the full NR is very big.
For example, you can use only eukaryotic proteins from NR (nr_euk) or, better,
even more specific subsets:
nr_plants  - plant proteins (to annotate plants)
nr_animals - animal (Metazoa) proteins (to annotate animals)
nr_fungi   - fungal proteins (to annotate fungi)
Also, we recommend excluding from NR proteins that are not relevant for your annotation. For example, partial proteins, "unnamed protein product" proteins, hypothetical proteins, predicted proteins (with 'XP_' or 'PREDICTED' in deflines).
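Such filtering of deflines can be done with a short script. The Python sketch below is a rough illustration, not a tool from the package: the function names are hypothetical, and the matching is a plain substring search on the whole defline, so a record is dropped if any of the listed words occurs anywhere in it (including in merged NR deflines).

```python
# Words from the recommendation above, matched case-insensitively;
# 'predicted' also covers 'PREDICTED'.
EXCLUDE_WORDS = ("partial", "unnamed protein product", "hypothetical", "predicted")


def keep_defline(defline):
    """Return False if the defline matches one of the exclusion criteria."""
    if "XP_" in defline:
        return False
    lowered = defline.lower()
    return not any(word in lowered for word in EXCLUDE_WORDS)


def filter_fasta(lines):
    """Yield lines of FASTA records whose defline passes keep_defline()."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = keep_defline(line)
        if keep:
            yield line
```

The function works on any iterable of lines, e.g. an open file, so a large database does not have to be loaded into memory. The resulting FASTA file still has to be formatted for BLAST as described above.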
To run mRNA mapping you should prepare three files:
rf.cdna - with mRNA sequences in fasta format
rf.pro - with corresponding protein sequences in fasta format
rf.dat - file with coordinates of ATG, STOP codon and corresponding
chromosome names (if chromosome name is not available, put "na")
You can create such files using RefSeq database from NCBI and our scripts
(or prepare them using your own data and scripts).
1) download RefSeq files *.rna.gbff.gz and *.protein.faa.gz for considered
organism(s)
ftp://ftp.ncbi.nlm.nih.gov/refseq/
Example for Human:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/human.*.protein.faa.gz
Example for Arabidopsis thaliana (plants):
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/plant/plant.*.protein.faa.gz
Example for fungi:
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.rna.gbff.gz
wget ftp://ftp.ncbi.nlm.nih.gov/refseq/release/fungi/fungi.*.protein.faa.gz
2) gunzip files
gunzip *.gz
3) concatenate files
(The order in which files are concatenated is not important here.)
# Human
cat human.*.rna.gbff > human.rna.gbff
cat human.*.protein.faa > human.protein.faa
# Arabidopsis thaliana
cat plant.*.rna.gbff > plant.rna.gbff
cat plant.*.protein.faa > plant.protein.faa
# Fungi
cat fungi.*.rna.gbff > fungi.rna.gbff
cat fungi.*.protein.faa > fungi.protein.faa
Concatenated files are used to prepare *.cdna, *.pro, *.dat files.
There is a script in REFSEQ_SCRIPT/ directory to prepare *.cdna, *.pro, *.dat
files for considered organism.
refseq_NM.pl - selects from RefSeq full-length mRNAs with IDs starting with
NM_ and with status keys REVIEWED / PROVISIONAL / VALIDATED
(see Refseq documentation for more information on IDs and status of entries)
Note about XM_ mRNAs from RefSeq:
We use mRNAs with NM_ prefix only (with status 'PROVISIONAL', 'REVIEWED',
'VALIDATED'). Usually they are better than XM_ and have some evidence.
More information about prefixes can be found on NCBI site.
But if you want to use XM_ mRNAs for your organism as well, you can comment
out the following line in 'refseq_NM.pl':
if( $gb_id !~ /^NM_/ ) { $yes = 0; } #! only mRNAs with NM_ IDs
(To comment a line, put # at the beginning of the line.)
It is usually indicated in RefSeq to which chromosomes mRNAs belong.
Fgenesh++ works with RefSeq mRNAs in two ways:
1) if you annotate chromosome sequences (properly named), Fgenesh++ maps
mRNAs from RefSeq only to chromosomes where they belong;
2) if you annotate contigs / scaffolds (so that you do not know to what
contigs RefSeq mRNAs belong), Fgenesh++ tries to map mRNAs to all contigs
and selects good mappings.
In case 1) to make correspondence between mRNAs and files with chromosome
sequences, run 'refseq_NM.pl' script with options -chr_prefix: and -chr_ext:
as follows.
Suppose you have chromosome sequences properly named, like for example for
human chr1.fa, chr2.fa, .., chr22.fa, chrX.fa, chrY.fa (or with some other
extension - chr1.seq, etc.).
Human chromosomes are named from 1 to 22 and 'X', 'Y' in human *.rna.gbff
files, for example:
source          1..3159
                /organism="Homo sapiens"
                /mol_type="mRNA"
                /db_xref="taxon:9606"
                /chromosome="10"
To adjust for RefSeq chromosomes notation / your sequences files names,
run 'refseq_NM.pl' with options -chr_prefix:chr -chr_ext:fa
(So that mRNA from chromosome '10' from RefSeq in the example above will
be mapped to your 'chr10.fa' file with chromosome sequence.)
Example:
./refseq_NM.pl rf_hs human.rna.gbff human.protein.faa "Homo sapiens" -chr_prefix:chr -chr_ext:fa
Or, for 'Saccharomyces cerevisiae' chromosomes are named from I to XVI in fungi *.rna.gbff, for example:
source          1..1848
                /organism="Saccharomyces cerevisiae S288c"
                /mol_type="mRNA"
                /strain="S288c"
                /db_xref="taxon:559292"
                /chromosome="III"
So, if your files with S.c. chromosomes are "I.seq", "II.seq", .., "XVI.seq",
run 'refseq_NM.pl' with option -chr_ext:seq (adds extension .seq to I, II, ...)
Do not use -chr_prefix: because there is no prefix in I.seq before I, etc.
Example:
./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c" -chr_ext:seq
In case 2) (contigs, scaffolds, i.e. sequences not assembled into chromosomes)
run 'refseq_NM.pl' without -chr_prefix:
Example:
./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c"
In this case you will see lines with "na" instead of chromosome names in *.dat file.
./refseq_NM.pl <base name> <.rna.gbff file> <.protein.faa file> <organism> \
    [-status:<[PRV]>] [-check_atg:<0|1>] [-check_stop:<0|1>] \
    [-chr_prefix:<prefix> -chr_ext:<extension>]
where
<base name>             - base name for generated .cdna, .pro, .dat files
<.rna.gbff file>        - .rna.gbff file name
<.protein.faa file>     - .protein.faa file name
<organism>              - organism
-status:<[PRV]>         - select mRNAs with this status (PRV by default)
                          (P - PROVISIONAL, R - REVIEWED, V - VALIDATED)
-check_atg:<0|1>        - skip mRNAs with translation start codon != ATG (1 by default)
-check_stop:<0|1>       - skip mRNAs with wrong stop codon annotated (1 by default)
-chr_prefix:<prefix>    - prefix for chromosome files (for example, 'chr' or 'Chr')
-chr_ext:<extension>    - extension for chromosome files (for example, 'fa' or 'seq')
                          (prefix and extension are provided if sequences are
                          chromosomes and sequence files are properly named)
Examples:
./refseq_NM.pl rf_hs human.rna.gbff human.protein.faa "Homo sapiens" -chr_prefix:chr -chr_ext:fa
./refseq_NM.pl rf_BT cow.rna.gbff cow.protein.faa "Bos taurus" -chr_prefix:chr -chr_ext:fa
./refseq_NM.pl rf_BT cow.rna.gbff cow.protein.faa "Bos taurus" -chr_prefix:chr -chr_ext:fa -status:RV
./refseq_NM.pl rf_at plant.rna.gbff plant.protein.faa "Arabidopsis thaliana" -chr_prefix:ath -chr_ext:seq
./refseq_NM.pl rf_at plant.rna.gbff plant.protein.faa "Arabidopsis thaliana" -chr_prefix:ath -chr_ext:seq -check_atg:0 -check_stop:1
./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c" -chr_ext:seq
./refseq_NM.pl rf_Sc fungi.rna.gbff fungi.protein.faa "Saccharomyces cerevisiae S288c"
As an organism name, use words from 'ORGANISM' field of RefSeq entries, e.g., 'Homo sapiens' for human, etc.:
ORGANISM  Homo sapiens
ORGANISM  Arabidopsis thaliana
ORGANISM  Saccharomyces cerevisiae S288c
Basenames for files (from examples above):
rf_hs - 'rf' for refseq, 'hs' for Homo sapiens
rf_at - 'rf' for refseq, 'at' for Arabidopsis thaliana
rf_Sc - 'rf' for refseq, 'Sc' for Saccharomyces cerevisiae
3 files will be generated (example for Human):
rf_hs.cdna - set of Refseq cdna sequences in fasta format
rf_hs.pro  - set of Refseq protein sequences in fasta format
             (slightly modified from Refseq original *.faa file)
rf_hs.dat  - file with ATG, STOP coordinates and chromosome names
rf_hs.dat file has the following format:
36 2327 chr21.seq
221 3175 chr3.seq
28 3024 chr3.seq
86 3049 na
where 36 and 2327 are coordinates of the first nucleotide of ATG and
the last nucleotide of stop codon in some mRNA sequence from RefSeq
annotated to belong to chromosome 21,
chr21.seq - sequence name where to map this mRNA,
"na" means it is not clear to which chromosome mRNA belongs, in this
case Fgenesh++ tries to map such an mRNA to all genomic sequences and
selects good mappings.
Data in *.cdna, *.pro and *.dat are organised in the same order, so that
the first line in *.dat corresponds to the first sequences in *.cdna and
*.pro, and so on.
Once you generated *.cdna, *.pro, *.dat files, put their locations in
configuration file.
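Since the three files must stay in the same order, a quick consistency check can catch mistakes when they are prepared with your own scripts. The Python sketch below is an illustration under stated assumptions: coordinates in *.dat are taken to be 1-based and inclusive (the example lines above are consistent with this, since each implied CDS length is a multiple of 3), and the standard stop codons TAA, TAG and TGA are expected. The function names are hypothetical.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) tuples in file order."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_dat_line(line):
    """Split a *.dat line into (atg_start, stop_end, sequence_name)."""
    atg, stop, name = line.split()
    return int(atg), int(stop), name


def check_entry(cdna, atg, stop):
    """Check the CDS implied by the coordinates (assumed 1-based, inclusive).

    True if it starts with ATG, ends with a standard stop codon and its
    length is a multiple of 3.
    """
    cds = cdna[atg - 1:stop].upper()
    return (cds.startswith("ATG")
            and len(cds) % 3 == 0
            and cds[-3:] in ("TAA", "TAG", "TGA"))
```

To check a whole data set, read *.cdna with read_fasta(), make sure it has as many records as there are lines in *.dat, and call check_entry() for the N-th sequence with the coordinates from the N-th line. Entries generated with -check_atg:0 may legitimately fail the ATG test.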
G       - predicted gene number, starting from start of sequence;
Str     - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence:
          CDSf - first (starting with start codon),
          CDSi - internal (internal exon),
          CDSl - last coding segment (ending with stop codon),
          CDSo - for genes with only one CDS (starting with start codon
                 and ending with stop codon);
          EXO  - partially translated or untranslated exon (coding sequence,
                 if any, from such an exon is annotated as CDSf, CDSl or CDSo);
          TSS  - TATA-box (or transcription start site for genes with mRNA support);
          PolA - PolyA signal (or end of transcript for genes with mRNA support);
Start and End - position of the Feature;
Score   - Log likelihood*10 score for the feature;
ORF     - start/end positions where the first complete codon starts and
          the last codon ends;
Len     - length of predicted ORF.
In current Fgenesh++ format features 'TSS' and 'PolA' have several meanings.
For mRNA supported genes:
TSS - transcription start site,
PolA - PolyA site.
For protein supported and ab initio predicted genes:
TSS - TATA-box (always predicted, even if bad),
PolA - PolyA signal.
Fgenesh++ can predict complete as well as incomplete gene models.
Incomplete gene models are those with the initial (CDSf, first) and/or
terminal (CDSl, last) exon absent. (Note that gene models with
only one CDS (CDSo) having both ATG and stop codon are considered
complete gene models.)
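The rule above can be written as a short check on the feature types of one gene. This Python sketch is an illustration only; the function name is hypothetical, and the input is assumed to be the list of Feature values (CDSf, CDSi, CDSl, CDSo, TSS, ...) collected for a single predicted gene.

```python
def is_complete(features):
    """Return True if the feature types describe a complete gene model.

    A model is complete if it has a single-CDS feature (CDSo) or both
    the first (CDSf) and the last (CDSl) coding segments.
    """
    kinds = set(features)
    return "CDSo" in kinds or {"CDSf", "CDSl"} <= kinds
```

For example, a gene with features TSS, CDSf, CDSi, CDSl, PolA is complete, while a gene with only CDSi and CDSl is incomplete because the first coding segment is absent.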
In Fgenesh++ output, you can distinguish between mRNA supported,
protein supported and ab initio gene predictions:
An example of ab initio gene prediction.
 G Str   Feature      Start         End    Score           ORF            Len
 2 +       TSS       118455               -6.59
 2 +     1 CDSf      118568 -    118585    5.16    118568 -    118585      18
 2 +     2 CDSi      132799 -    132867    3.27    132799 -    132867      69
 2 +     3 CDSi      133530 -    133666    5.47    133530 -    133664     135
 2 +     4 CDSl      139414 -    139570    5.57    139415 -    139570     156
 2 +       PolA      141461                1.12
An example of mRNA supported gene prediction.
 G Str   Feature      Start         End    Score           ORF            Len      (1)      (2)  (3)
 8 -       PolA      249364               51.60                                                  NP_000030.1
 8 -     0 EXO       249364 -    250022   51.60    249364 -    250022     659   239 -  897  100  NP_000030.1
 8 -     1 CDSl      249419 -    250022   51.60    249419 -    250021     604   239 -  897  100  NP_000030.1
 8 -     2 CDSi      250612 -    250768   51.60    250614 -    250766     157    82 -  238  100  NP_000030.1
 8 -     3 CDSf      250956 -    250998   51.60    250957 -    250998      43    19 -   81  100  NP_000030.1
 8 -     0 EXO       250956 -    251018   51.60    250956 -    251018      63    19 -   81  100  NP_000030.1
 8 -     0 EXO       251216 -    251233   51.60    251216 -    251233      18     1 -   18  100  NP_000030.1
 8 -       TSS       251233               51.60                                                  NP_000030.1
Three more fields are added in comparison with ab initio predictions:
(1) - coordinates of that part of known mRNA (in bp) that corresponds to predicted exon;
(2) - homology in % between predicted exon and corresponding part of known mRNA;
(3) - ID of protein corresponding to known mRNA.
An example of protein supported gene prediction.
 G Str Feature   Start      End   Score       ORF        Len     (1)     (2)
 3 -    1 CDSl  162093 - 162186  157.68 162093 - 162185   93  618 -  590 100
 3 -    2 CDSi  170757 - 170838  145.17 170759 - 170836   78  588 -  563 100
 3 -    3 CDSi  171377 - 171561  339.29 171378 - 171560  183  561 -  501 100
 3 -    4 CDSi  171880 - 172018  247.86 171882 - 172016  135  499 -  455 100
 3 -    5 CDSi  172685 - 172790  191.48 172686 - 172790  105  453 -  419 100
 3 -    6 CDSi  174346 - 174563  382.22 174346 - 174561  216  418 -  347 100
 3 -    7 CDSi  176164 - 176877 1246.74 176165 - 176875  711  345 -  109 100
 3 -    8 CDSi  178973 - 179057  159.79 178974 - 179057   84  107 -   80 100
 3 -    9 CDSi  183740 - 183833  184.33 183740 - 183832   93   79 -   49 100
 3 -   10 CDSf  186433 - 186575  277.99 186435 - 186575  141   47 -    1 100
Two more fields are added in comparison with ab initio predictions:
(1) - coordinates of that part of homologous protein that corresponds to predicted CDS;
(2) - homology in % between predicted CDS and corresponding part of homologous protein.
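Since the three kinds of prediction differ in the number of fields that follow Len, the kind of support can be guessed from an exon (CDS or EXO) row. The Python sketch below assumes whitespace-separated fields as in the examples of this manual (12 basic fields, then none, two or three added fields); the function name support_type is hypothetical and is not part of Fgenesh++.

```python
def support_type(exon_row):
    """Guess the kind of support from one CDS/EXO row of the table.

    Assumes the 12 whitespace-separated basic fields shown in the manual:
    G Str N Feature Start - End Score ORFstart - ORFend Len
    (Hypothetical helper, not part of Fgenesh++.)
    """
    extra = exon_row.split()[12:]
    if not extra:
        return "ab initio"
    if len(extra) == 4:      # "from - to" in protein, homology %
        return "protein"
    if len(extra) == 5:      # "from - to" in mRNA, homology %, protein ID
        return "mRNA"
    raise ValueError("unexpected number of fields: %r" % exon_row)


print(support_type("2 + 1 CDSf 118568 - 118585 5.16 118568 - 118585 18"))
print(support_type("3 - 1 CDSl 162093 - 162186 157.68 162093 - 162185 93 618 - 590 100"))
print(support_type("8 - 1 CDSl 249419 - 250022 51.60 249419 - 250021 604 239 - 897 100 NP_000030.1"))
```

TSS and PolA rows have a different layout and should not be passed to this function.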
Predicted proteins are listed at the end of *.resn3 files.
For mRNA supported and protein supported predictions, information about the homolog
of a predicted protein is given in the ID line of that protein (after the first ##),
for example:
# protein supported prediction
>FGENESH: 3 10 exon (s) 162093 - 186575 619 aa, chain - ## BY PROTMAP: gi|14249338|ref|NP_116114.1| hypothetical protein LOC84811 [Homo sapiens]gi|13623491|gb|AAH06350.1| Hypothetical protein MGC13125 [Homo sapiens] ## 619 ## 100.0 1.00 1.00 1276.0
# mRNA supported prediction
>NP_443200.1_#_18_#_1109 5 3 exon (s) 203739 - 205462 363 aa, chain - ## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens] ## 363 ## orf_perfect
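The homolog description can be extracted from such an ID line by splitting it at "##". The Python sketch below is only an illustration; the function name parse_protein_id_line and the dictionary keys are hypothetical, and the fields after the second "##" are returned as plain strings without interpretation.

```python
def parse_protein_id_line(line):
    """Split a predicted-protein ID line at '##'.

    Returns the prediction summary, the homolog description (None for
    ab initio predictions) and any remaining fields as plain strings.
    (Hypothetical helper, not part of Fgenesh++.)
    """
    parts = [p.strip() for p in line.lstrip(">").split("##")]
    return {
        "prediction": parts[0],
        "homolog": parts[1] if len(parts) > 1 else None,
        "rest": parts[2:],
    }


info = parse_protein_id_line(
    ">NP_443200.1_#_18_#_1109 5 3 exon (s) 203739 - 205462 363 aa, chain - "
    "## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens] "
    "## 363 ## orf_perfect"
)
print(info["homolog"])
```

Note that the "_#_" separators inside the mRNA-based ID contain single "#" characters only, so they are not affected by the split.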
All Fgenesh++ output for sequence ENm003 (file "ENm003.seq.N.resn3") is shown below.
FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA
Seq name: ENm003
Length of sequence: 500000
Number of predicted genes 11 in +chain 2 in -chain 9
Number of predicted exons 86 in +chain 10 in -chain 76
Positions of predicted genes and exons:
 G Str Feature   Start      End   Score       ORF        Len
 1 -      PolA    1928             -1.68
 1 -    1 CDSl    3118 -   3219    0.29   3118 -   3219  102
 1 -    2 CDSi    5228 -   5384   11.00   5228 -   5383  156
 1 -    3 CDSi    6666 -   6724   -0.82   6668 -   6724   57
 1 -    4 CDSi    9401 -   9438   -0.43   9401 -   9436   36
 1 -    5 CDSi   10303 -  10499    9.57  10304 -  10498  195
 1 -    6 CDSi   10636 -  10762   -0.88  10638 -  10760  123
 1 -    7 CDSi   12156 -  12321    3.94  12157 -  12321  165
 1 -    8 CDSi   28607 -  28648    3.50  28607 -  28648   42
 1 -    9 CDSf   69822 -  69932    6.96  69822 -  69932  111
 1 -      TSS    71289             -5.09
 2 +      TSS   118455             -6.59
 2 +    1 CDSf  118568 - 118585    5.16 118568 - 118585   18
 2 +    2 CDSi  132799 - 132867    3.27 132799 - 132867   69
 2 +    3 CDSi  133530 - 133666    5.47 133530 - 133664  135
 2 +    4 CDSl  139414 - 139570    5.57 139415 - 139570  156
 2 +      PolA  141461              1.12
 3 -    1 CDSl  162093 - 162186  157.68 162093 - 162185   93  618 -  590 100
 3 -    2 CDSi  170757 - 170838  145.17 170759 - 170836   78  588 -  563 100
 3 -    3 CDSi  171377 - 171561  339.29 171378 - 171560  183  561 -  501 100
 3 -    4 CDSi  171880 - 172018  247.86 171882 - 172016  135  499 -  455 100
 3 -    5 CDSi  172685 - 172790  191.48 172686 - 172790  105  453 -  419 100
 3 -    6 CDSi  174346 - 174563  382.22 174346 - 174561  216  418 -  347 100
 3 -    7 CDSi  176164 - 176877 1246.74 176165 - 176875  711  345 -  109 100
 3 -    8 CDSi  178973 - 179057  159.79 178974 - 179057   84  107 -   80 100
 3 -    9 CDSi  183740 - 183833  184.33 183740 - 183832   93   79 -   49 100
 3 -   10 CDSf  186433 - 186575  277.99 186435 - 186575  141   47 -    1 100
 4 -    1 CDSl  192536 - 192670  242.45 192536 - 192670  135  458 -  416 100
 4 -    2 CDSi  193383 - 193448  121.76 193383 - 193448   66  415 -  394 100
 4 -    3 CDSi  195769 - 195855  164.50 195769 - 195855   87  393 -  365 100
 4 -    4 CDSi  196530 - 196640  218.51 196530 - 196640  111  364 -  328 100
 4 -    5 CDSi  197141 - 197230  164.54 197141 - 197230   90  327 -  298 100
 4 -    6 CDSi  197989 - 198059  124.84 197989 - 198057   69  297 -  275 100
 4 -    7 CDSi  198466 - 198532  103.87 198467 - 198532   66  273 -  252 100
 4 -    8 CDSi  198710 - 198757   84.47 198710 - 198757   48  251 -  236 100
 4 -    9 CDSi  199125 - 199247  236.78 199125 - 199247  123  235 -  195 100
 4 -   10 CDSi  199407 - 199493  161.23 199407 - 199493   87  194 -  166 100
 4 -   11 CDSi  200122 - 200139   16.83 200122 - 200139   18  165 -  160 100
 4 -   12 CDSi  200563 - 200670  118.34 200563 - 200670  108  141 -  112 100
 4 -   13 CDSi  201093 - 201254  281.00 201093 - 201254  162  111 -   58 100
 4 -   14 CDSf  201431 - 201601  315.93 201431 - 201601  171   57 -    1 100
 5 -      PolA  202982             74.60                                      NP_443200.1
 5 -    0 EXO   202982 - 204678   74.60 202982 - 204678 1697  170 - 1866  99 NP_443200.1
 5 -    1 CDSl  203739 - 204678   74.60 203739 - 204677  940  170 - 1866  99 NP_443200.1
 5 -    2 CDSi  205197 - 205308   74.60 205199 - 205306  112   58 -  169 100 NP_443200.1
 5 -    3 CDSf  205423 - 205462   74.60 205424 - 205462   40    1 -   57 100 NP_443200.1
 5 -    0 EXO   205423 - 205479   74.60 205423 - 205479   57    1 -   57 100 NP_443200.1
 5 -      TSS   205479             74.60                                      NP_443200.1
 6 -      PolA  234313             65.70                                      NP_000473.1
 6 -    0 EXO   234313 - 235492   65.70 234313 - 235492 1180  291 - 1466  99 NP_000473.1
 6 -    1 CDSl  234478 - 235492   65.70 234478 - 235491 1015  291 - 1466  99 NP_000473.1
 6 -    2 CDSi  236270 - 236396   65.70 236272 - 236394  127  164 -  290  99 NP_000473.1
 6 -    3 CDSf  236754 - 236802   65.70 236755 - 236802   49    1 -  163 100 NP_000473.1
 6 -    0 EXO   236754 - 236915   65.70 236754 - 236915  162    1 -  163 100 NP_000473.1
 6 -      TSS   236915             65.70                                      NP_000473.1
 7 +      TSS   243519             39.60                                      NP_000031.1
 7 +    0 EXO   243519 - 243551   39.60 243519 - 243551   33    1 -   33 100 NP_000031.1
 7 +    0 EXO   244181 - 244248   39.60 244181 - 244248   68   34 -  101 100 NP_000031.1
 7 +    1 CDSf  244194 - 244248   39.60 244194 - 244247   55   34 -  101 100 NP_000031.1
 7 +    2 CDSi  244384 - 244507   39.60 244386 - 244505  124  102 -  225 100 NP_000031.1
 7 +    3 CDSl  246375 - 246495   39.60 246376 - 246495  121  226 -  533 100 NP_000031.1
 7 +    0 EXO   246375 - 246682   39.60 246375 - 246682  308  226 -  533 100 NP_000031.1
 7 +      PolA  246682             39.60                                      NP_000031.1
 8 -      PolA  249364             51.60                                      NP_000030.1
 8 -    0 EXO   249364 - 250022   51.60 249364 - 250022  659  239 -  897 100 NP_000030.1
 8 -    1 CDSl  249419 - 250022   51.60 249419 - 250021  604  239 -  897 100 NP_000030.1
 8 -    2 CDSi  250612 - 250768   51.60 250614 - 250766  157   82 -  238 100 NP_000030.1
 8 -    3 CDSf  250956 - 250998   51.60 250957 - 250998   43   19 -   81 100 NP_000030.1
 8 -    0 EXO   250956 - 251018   51.60 250956 - 251018   63   19 -   81 100 NP_000030.1
 8 -    0 EXO   251216 - 251233   51.60 251216 - 251233   18    1 -   18 100 NP_000030.1
 8 -      TSS   251233             51.60                                      NP_000030.1
 9 -    1 CDSl  260007 - 260164  216.16 260007 - 260162  156 1051 - 1003  84
 9 -    2 CDSi  261087 - 261219  217.15 261088 - 261219  132 1001 -  958  93
 9 -    3 CDSi  262731 - 262894  224.97 262731 - 262892  162  957 -  904  79
 9 -    4 CDSi  271421 - 272310 1397.33 271422 - 272309  888  902 -  606  90
 9 -    5 CDSi  272876 - 273215  495.64 272878 - 273213  336  604 -  493  80
 9 -    6 CDSi  274885 - 274994  169.71 274886 - 274993  108  491 -  456  88
 9 -    7 CDSi  275452 - 275537  135.57 275454 - 275537   84  454 -  427  96
 9 -    8 CDSi  275813 - 275938  217.61 275813 - 275938  126  426 -  385  97
 9 -    9 CDSi  277279 - 277429  275.71 277279 - 277428  150  384 -  335  98
 9 -   10 CDSi  281557 - 281700  252.23 281559 - 281699  141  333 -  287  95
 9 -   11 CDSi  283942 - 284012  139.45 283944 - 284012   69  285 -  263 100
 9 -   12 CDSi  287092 - 287247  277.74 287092 - 287247  156  262 -  211 100
 9 -   13 CDSi  287514 - 287667  246.50 287514 - 287666  153  210 -  160  92
 9 -   14 CDSi  288769 - 288878  186.07 288771 - 288878  108  158 -  123  94
 9 -   15 CDSi  289021 - 289062   54.07 289021 - 289062   42  110 -   97  88
 9 -   16 CDSi  289477 - 289620  222.91 289477 - 289620  144   96 -   49  87
 9 -   17 CDSi  289864 - 289974  192.60 289864 - 289974  111   48 -   12  91
 9 -   18 CDSf  290535 - 290570   42.89 290535 - 290570   36   11 -    1 100
10 -      PolA  302475              1.12
10 -    1 CDSl  302565 - 302641   -0.55 302565 - 302639   75
10 -    2 CDSi  309864 - 309987    5.78 309865 - 309987  123
10 -    3 CDSi  310804 - 310928   11.92 310804 - 310926  123
10 -    4 CDSi  340830 - 340991   13.82 340831 - 340989  159
10 -    5 CDSi  367655 - 367718    6.56 367656 - 367718   63
10 -    6 CDSi  370559 - 370675    9.39 370559 - 370675  117
10 -    7 CDSf  372345 - 372401    5.39 372345 - 372401   57
10 -      TSS   372529             -9.59
11 -      PolA  449354             31.00                                      NP_001021.1
11 -    0 EXO   449354 - 449697   31.00 449354 - 449697  344    1 -  344  97 NP_001021.1
11 -    1 CDSo  449408 - 449662   31.00 449408 - 449662  255    1 -  344  97 NP_001021.1
11 -      TSS   449697             31.00                                      NP_001021.1

Predicted protein(s):
>FGENESH: 1 9 exon (s) 3118 - 69932 332 aa, chain -
MSVPPWGGQALVGGPGSAGGLVEEVESKDGQDGAQKKFGQGHGGSLSDAEEEEGWAGAGR
NSRWPKDHNSGSMGAGAAEGSTKLCDPQSLAPGEELSVFLLTAAAKGPNLLRPRPLAQHP
QAWHGAAMVPKGCLAVVCSRGMMGVLAGRKDVEHVTVQIPDEPKSQLRLLTESQCLVGSP
KPECPLQGPDGLFCGDPHKEITSFERLGLAEDKKALSKFARCEQEPMGGDLSLSWGSGLY
PAEEHLWLEEVRAEMQELIPGFCPAPSSEAGPPGVLAADEELDIEGQVSSGITGTNCQLG
GSSPSSLPVAAAADTTYGSESAQENWWGHELV
>FGENESH: 2 4 exon (s) 118568 - 139570 126 aa, chain +
MSMPHEGRHSRKLEAGGKRKEKQLQELAELVRSQAPANSAPIKRAGSREPSDQGLETQRP
FLITGLEESGCQLVRVAPSPHLLSQRCSESPGTSKVTHAREGSQWESSSECGKGAERRMY
NPGPGV
>FGENESH: 3 10 exon (s) 162093 - 186575 619 aa, chain - ## BY PROTMAP: gi|14249338|ref|NP_116114.1| hypothetical protein LOC84811 [Homo sapiens]gi|13623491|gb|AAH06350.1| Hypothetical protein MGC13125 [Homo sapiens] ## 619 ## 100.0 1.00 1.00 1276.0
MAAAPPLSKAEYLKRYLSGADAGVDRGSESGRKRRKKRPKPGGAGGKGMRIVDDDVSWTA
ISTTKLEKEEEEDDGDLPVVAEFVDERPEEVKQMEAFRSSAKWKLLGGHNEDLPSNRHFR
HDTPDSSPRRVRHGTPDPSPRKDRHDTPDPSPRRARHDTPDPSPLRGARHDSDTSPPRRI
RHDSSDTSPPRRARHDSPDPSPPRRPQHNSSGASPRRVRHDSPDPSPPRRARHGSSDISS
PRRVHNNSPDTSRRTLGSSDTQQLRRARHDSPDLAPNVTYSLPRTKSGKAPERASSKTSP
HWKESGASHLSFPKNSKYEYDPDISPPRKKQAKSHFGDKKQLDSKGDCQKATDSDLSSPR
HKQSPGHQDSDSDLSPPRNRPRHRSSDSDLSPPRRRQRTKSSDSDLSPPRRSQPPGKKAA
HMYSGAKTGLVLTDIQREQQELKEQDQETMAFEAEFQYAETVFRDKSGRKRNLKLERLEQ
RRKAEKDSERDELYAQWGKGLAQSRQQQQNVEDAMKEMQKPLARYIDDEDLDRMLREQER
EGDPMANFIKKNKAKENKNKKVRPRYSGPAPPPNRFNIWPGYRWDGVDRSNGFEQKRFAR
LASKKAVEELAYKWSVEDM
>FGENESH: 4 14 exon (s) 192536 - 201601 447 aa, chain - ## BY PROTMAP: gi|30582123|gb|AAP35288.1| zinc finger protein 259 [Homo sapiens]gi|61361942|gb|AAX42130.1| zinc finger protein 259 [synthetic construct]gi|61361934|gb|AAX42129.1| zinc finger protein 259 [synthetic construct]gi|16924221|gb|AAH17380.1| Zinc finger protein 259 [Homo sapiens]gi|16878306|gb|AAH17349.1| Zinc finger protein 259 [Homo sapiens]gi|15082497|gb|AAH12162.1| Zinc finger protein 259 [Homo sapiens]gi|13279038|gb|AAH04256.1| Zinc finger protein 259 [Homo sapiens]gi|3510462|gb|AAC33514.1| zinc finger protein [Homo sapiens]gi|6137318|sp|O75312|ZPR1_HUMAN Zinc-finger protein ZPR1 (Zinc finger protein 259)gi|4508021|ref|NP_003895.1| zinc finger protein 259 [Homo sapiens] ## 459 ## 96.0 1.00 1.00 884.0
MAASGAVEPGPPGAAVAPSPAPAPPPAPDHLFRPISAEDEEQQPTEIESLCMNCYCNGMT
RLLLTKIPFFREIIVSSFSCEHCGWNNTEIQSAGRIQDQGVRYTLSVRALEDMNREVVKT
DSAATRIPELDFEIPAFSQKGGKFKIRDQPARRANKDATAERIDEFIVKLKELKQVASPF
TLIIDDPSGNSFVENPHAPQKDDALVITHYNRTRQQEEMLGLQEEAPAEKPEEEDLRNEV
LQFSTNCPECNAPAQTNMKLVQIPHFKEVIIMATNCENCGHRTNEVKSGGAVEPLGTRIT
LHITDASDMTRDLLKSETCSVEIPELEFELGMAVLGGKFTTLEGLLKDIRELVTKNPFTL
GDSSNPGQTERLQEFSQKMDQIIEGNMKAHFIMDDPAGNSYLQNVYAPEDDPEMKVERYK
RTFDQNEELGLNDMKTEGYEAGLAPQR
>NP_443200.1_#_18_#_1109 5 3 exon (s) 203739 - 205462 363 aa, chain - ## gi|16445025|ref|NP_443200.1| apolipoprotein AV [Homo sapiens] ## 363 ## orf_perfect
MAAVLTWALALLSAFSATQARKGFWDYFSQTSGDKGRVEQIHQQKMAREPATLKDSLEQD
LNNMNKFLEKLRPLSGSEAPRLPQDPVGMRRQLQEELEEVKARLQPYMAEAHELVGWNLE
GLRQQLKPYTMDLMEQVALRVQELQEQLRVVGEDTKAQLLGGVDEAWALLQGLQSRVVHH
TGRFKELFHPYAESLVSGIGRHVQELHRSVAPHAPASPARLSRCVQVLSRKLTLKAKALH
ARIQQNLDQLREELSRAFAGTGTEEGAGPDPQMLSEEVRQRLQAFRQDTYLQIAAFTRAI
DQETEEVQQQLAPPPPGHSAFAPEFQQTDSGKVLSKLQARLDDLWEDITHSLHDQGHSHL
GDP
>NP_000473.1_#_115_#_1305 6 3 exon (s) 234478 - 236802 396 aa, chain - ## gi|4502151|ref|NP_000473.1| apolipoprotein A-IV precursor [Homo sapiens] ## 396 ## orf_perfect
MFLKAVVLTLALVAVAGARAEVSADQVATVMWDYFSQLSNNAKEAVEHLQKSELTQQLNA
LFQDKLGEVNTYAGDLQKKLVPFATELHERLAKDSEKLKEEIGKELEELRARLLPHANEV
SQKIGDNLRELQQRLEPYADQLRTQVNTQAEQLRRQLTPYAQRMERVLRENADSLQASLR
PHADELKAKIDQNVEELKGRLTPYADEFKVKIDQTVEELRRSLAPYAQDTQEKLNHQLEG
LTFQMKKNAEELKARISASAEELRQRLAPLAEDVRGNLRGNTEGLQKSLAELGGHLDQQV
EEFRRRVEPYGENFNKALVQQMEQLRTKLGPHAGDVEGHLSFLEKDLRDKVNSFFSTFKE
KESQDKTLSLPELEQQQEQHQEQQQEQVQMLAPLES
>NP_000031.1_#_47_#_346 7 3 exon (s) 244194 - 246495 99 aa, chain + ## gi|4557323|ref|NP_000031.1| apolipoprotein C-III precursor [Homo sapiens] ## 99 ## orf_perfect
MQPRVLLVVALLALLASARASEAEDASLLSFMQGYMKHATKTAKDALSSVQESQVAQQAR
GWVTDGFSSLKDYWSTVKDKFSEFWDLDPEVRPTSAVAA
>NP_000030.1_#_39_#_842 8 3 exon (s) 249419 - 250998 267 aa, chain - ## gi|4557321|ref|NP_000030.1| apolipoprotein A-I precursor [Homo sapiens] ## 267 ## orf_perfect
MKAAVLTLAVLFLTGSQARHFWQQDEPPQSPWDRVKDLATVYVDVLKDSGRDYVSQFEGS
ALGKQLNLKLLDNWDSVTSTFSKLREQLGPVTQEFWDNLEKETEGLRQEMSKDLEEVKAK
VQPYLDDFQKKWQEEMELYRQKVEPLRAELQEGARQKLHELQEKLSPLGEEMRDRARAHV
DALRTHLAPYSDELRQRLAARLEALKENGGARLAEYHAKATEHLSTLSEKAKPALEDLRQ
GLLPVLESFKVSFLSALEEYTKKLNTQ
>FGENESH: 9 18 exon (s) 260007 - 290570 1041 aa, chain - ## BY PROTMAP: gi|37360178|dbj|BAC98067.1| mKIAA0999 protein [Mus musculus] ## 1052 ## 89.0 1.00 1.00 1903.0
MKLGDADPNFDRLIAECQQLKEERQVDPLNEDVLLAMEDMGLDKEQTLQSLRSDAYDHYS
AIYSLLCDRHKRHKTLRLGALPSMPRALAFQAPVNIQAEQAGTAMNISVPQPDGTLNLDS
DEGEEPSPEALVRYLSMRRHTVGVADPRTEVMEDLQKLLPGFPGVNPQAPFLQVAPNVNF
MHNLLPMQNLQPTGQLEYKEQSLLQPPTLQLLNGMGPLGRRASDGGANIQLHAQQLLKRP
RGPSPLVTMTPAVPAVTPVDEESSDGEPDQEAVQRYLANRSKRHTLAMTNPTAEIPPDLQ
RQLGQQPFRSRVWPPHLVPDQHRSTYKDSNTLHLPTERFSPVRRFSDGAASIQAFKAHLE
KMGNNSSIKQLQQECEQLQKMYGGQIDERTLEKTQQQHMLYQQEQHHQILQQQIQDSICP
PQPSPPLQAACENQPALLTHQLQRLRIQPSSPPPNHPNNHLFRQPSNSPPPMSSAMIQPH
GAASSSQFQGLPSRSAIFQQQPENCSSPPNVALTCLGMQQPAQSQQVTIQVQEPVDMLSN
MPGTAAGSSGRGISISPSAGQMQMQHRTNLMATLSYGHRPLSKQLSADSAEAHSLNVNRF
SPANYDQAHLHPHLFSDQSRGSPSSYSPSTGVGFSPTQALKVPPLDQFPTFPPSAHQQPP
HYTTSALQQALLSPTPPDYTRHQQVPHILQGLLSPRHSLTGHSDIRLPPTEFAQLIKRQQ
QQRQQQQQQQQQQEYQELFRHMNQGDAGSLAPSLGGQSMTERQALSYQNADSYHHHTSPQ
HLLQIRAQECVSQASSPTPPHGYAHQPALMHSESMEEDCSCEGAKDGFQDSKSSSTLTKG
CHDSPLLLSTGGPGDPESLLGTVSHAQELGIHPYGHQPTAAFSKNKVPSREPVIGNCMDR
SSPGQAVELPDHNGLGYPARPSVHEHHRPRALQRHHTIQNSDDAYVQLDNLPGMSLVAGK
ALSSARMSDAVLSQSSLMGSQQFQDGENEECGASLGGHEHPDLSDGSQHLNSSCYPSTCI
TDILLSYKHPEVSFSMEQAGV
>FGENESH: 10 7 exon (s) 302565 - 372401 241 aa, chain -
MQVVKGYIIQDFLQETKQLVAIKIIDKTQLDEENLKKIFREVQIMKMLCHPHIIRLYQVM
ETERMIYLVTEYASGGEIFDHLVAHGRMAEKEARRKFKQIVTAVYFCHCRNIVHRDLKAE
NLLLDANLNIKIADFGFSNLFTPGQLLKTWCGSPPYAAPELFEGKEYDGPKVDIWSLGVV
LYVLVCGALPFDGSTLQNLRARVLSGKFRIPFFMSTVFPVSNLDLFYPPPPSPHTMARDS
Y
>NP_001021.1_#_36_#_290 11 1 exon (s) 449408 - 449662 84 aa, chain - ## gi|4506711|ref|NP_001021.1| ribosomal protein S27 [Homo sapiens] ## 84 ## orf_perfect
MPLAKDLLHPSPEEEKRKHKKKRLVQSPNSYFMDVKCPGCYKITTVFSHAQTVVLCVGCS
TVLCQPTGGKARLTEGCSFRRKQH
If Fgenesh++ does not find any reliable genes, the output looks like this:
FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA
Seq name: small_seq
Length of sequence: 60
no reliable predictions
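When output files are processed by a script, this case can be detected before any attempt to read the gene table. The Python sketch below assumes only that the phrase "no reliable predictions" appears in the output exactly as shown above; the function name has_reliable_predictions is hypothetical and is not part of Fgenesh++.

```python
def has_reliable_predictions(resn3_text):
    """Return False if the output text reports no reliable genes.

    Relies only on the phrase shown in the manual's example.
    (Hypothetical helper, not part of Fgenesh++.)
    """
    return "no reliable predictions" not in resn3_text


example = (
    "FGENESH++ 3.1.1 Mapped known genes and predicted genes in genomic DNA\n"
    "Seq name: small_seq\n"
    "Length of sequence: 60\n"
    "no reliable predictions\n"
)
print(has_reliable_predictions(example))
```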