SoftBerry - FGENESB HELP

FGENESB Suite of Bacterial Operon and Gene Finding Programs

FGENESB is a package for automatic annotation of bacterial genomes that includes the following features:

Automatic training of gene-finding parameters for new bacterial genomes using only genomic DNA as input (optionally, pre-learned parameters from a related organism can be used)
Mapping of tRNA and rRNA genes
Highly accurate Markov chain-based gene prediction
Prediction of promoters and terminators
Operon prediction based on distances between ORFs and frequencies of different genes neighboring each other in known bacterial genomes, as well as on promoter and terminator predictions
Automatic annotation of predicted genes by homology with COG and NR databases.

The FGENESB gene prediction algorithm is based on Markov chain models of coding regions and of translation start and termination sites. In the B. subtilis benchmark shown in Table 1 at the bottom of this page, it was more accurate than two other popular gene prediction programs, GeneMarkS and Glimmer. The package includes options to work with sets of sequences, such as scaffolds of bacterial genomes or short sequencing reads extracted from bacterial communities. For community sequence annotation, we include the ABsplit program, which separates archaebacterial and eubacterial sequences. FGENESB was used in the first published bacterial community annotation project: see Tyson et al. (2004) Nature 428(6978), 37-43.
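To make the idea of an in-frame Markov chain model of coding regions more concrete, here is a minimal Python sketch. It is not FGENESB's implementation: the function names, the pseudocount smoothing, and the toy training sequence are assumptions made for illustration. Only the chain order (5) and the in-frame, codon-position-specific structure are taken from Step 3 of the description on this page.

```python
import math
from collections import defaultdict

BASES = "ACGT"
ORDER = 5  # 5th-order chain, as named in Step 3 of this page


def train_inframe_markov(orfs, order=ORDER, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    `orfs` are coding sequences that start in frame, so i % 3 is the
    codon position of base i. Returns a function prob(frame, context, base).
    The pseudocount smoothing is an assumption of this sketch.
    """
    # counts[frame][context][base] -> number of observations
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for orf in orfs:
        for i in range(order, len(orf)):
            context, base = orf[i - order:i], orf[i]
            if base in BASES and all(c in BASES for c in context):
                counts[i % 3][context][base] += 1.0

    def prob(frame, context, base):
        row = counts[frame].get(context, {})
        total = sum(row.values()) + pseudocount * len(BASES)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=ORDER):
    """Sum of log transition probabilities, reading `seq` in frame 0."""
    return sum(
        math.log(prob(i % 3, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )


# Toy training set (hypothetical, not real genomic data).
prob = train_inframe_markov(["ATG" + "GCT" * 20 + "TAA"])
in_frame = log_likelihood("GCT" * 10, prob)  # same codon phase as training
shifted = log_likelihood("CTG" * 10, prob)   # same bases, shifted by one
print(in_frame > shifted)
```

Because the transition tables are kept separately for each codon position, the same nucleotides score lower when read out of phase, which is what lets such a model distinguish the coding frame from the other frames.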

The final annotation can be presented in GenBank format so that it can be read by visualization software such as Artemis or the Softberry Bacterial Genome Explorer.

Step-by-Step Description of FGENESB annotation.

Please note that the web version of FGENESB includes only the gene prediction and greatly simplified operon prediction portions of the program, represented by steps 3 and 4 below. The complete FGENESB package is available only for local installation.

STEP 1. Finds all potential ribosomal RNA genes using BLAST against bacterial and/or archaeal rRNA databases, and masks detected rRNA genes.

STEP 2. Predicts tRNA genes using the tRNAscan-SE program (Washington University) and masks detected tRNA genes.

STEP 3. Makes initial predictions of long ORFs, which are used as a starting point for calculating gene prediction parameters, and iterates until the parameters stabilize. Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for the regions around the start codon, the upstream RBS site and the stop codon, and probability distributions of ORF lengths.

STEP 4. Predicts operons based only on distances between predicted genes.

STEP 5. Runs BLASTP for predicted proteins against the COG database (cog.pro).

STEP 6. Uses information about conservation of neighboring gene pairs in known genomes to improve operon prediction.

STEP 7. Runs BLASTP against NR for proteins that have no COG hits.

STEP 8. Predicts potential promoters (BPROM program) and terminators (BTERM program) in the upstream and downstream regions, respectively, of predicted genes. BTERM predicts bacterial rho-independent terminators, with energy scoring based on a discriminant function of hairpin elements.

STEP 9. Refines operon predictions using predicted promoters and terminators as additional evidence.
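The distance-only grouping of Step 4 can be pictured with the following Python sketch. It is an illustration under stated assumptions, not FGENESB's algorithm: the `Gene` type, the `group_by_distance` function, and the fixed 100 bp cutoff are hypothetical, and this page does not say how FGENESB actually scores intergenic distances. The coordinates are the CDS coordinates from the example output on this page.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Group adjacent same-strand genes whose intergenic gap is <= max_gap.

    max_gap is a hypothetical cutoff chosen for this sketch only.
    Overlapping neighbors have a negative gap and are always joined.
    """
    groups: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = groups[-1][-1] if groups else None
        gap_ok = prev is not None and gene.start - prev.end - 1 <= max_gap
        if gap_ok and prev.strand == gene.strand:
            groups[-1].append(gene)
        else:
            groups.append([gene])
    return groups


# CDS coordinates copied from the example output on this page.
genes = [
    Gene(407, 1747, "+"), Gene(1926, 3065, "+"), Gene(3193, 3405, "+"),
    Gene(3418, 4545, "+"), Gene(4578, 6506, "+"), Gene(6595, 9066, "+"),
    Gene(14175, 14363, "-"), Gene(14353, 15249, "-"), Gene(15170, 15352, "-"),
]
groups = group_by_distance(genes)
for number, group in enumerate(groups, start=1):
    print(number, [(g.start, g.end) for g in group])
```

A single fixed cutoff is not expected to reproduce the three operons shown in the example output, which also reflects the neighbor-conservation, promoter, and terminator evidence added in steps 6 to 9.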

Example of FGENESB output:

     1     1 Op  1  21/0.000   +    CDS          407 -    1747    1311  ## COG0593 ATPase involved in DNA
                               +    Term        1786 -    1823     3.2
                               +    Prom        1847 -    1906    10.5
     2     1 Op  2   3/0.019   +    CDS         1926 -    3065    1237  ## COG0592 DNA polymerase
                               +    Term        3074 -    3122     9.1
                               +    Prom        3105 -    3164     4.0
     3     2 Op  1   4/0.002   +    CDS         3193 -    3405     278  ## COG2501 Uncharacterized ACR
     4     2 Op  2   4/0.002   +    CDS         3418 -    4545     899  ## COG1195 Recombinational DNA
     5     2 Op  3  16/0.000   +    CDS         4578 -    6506    2148  ## COG0187 DNA gyrase (topoisomerase II) B subunit
                               +    Term        6516 -    6551     4.7
                               +    Prom        6512 -    6571     2.3
     6     2 Op  4     .       +    CDS         6595 -    9066    2957  ## COG0188 DNA gyrase (topoisomerase II) A subunit
                               +    Term        9067 -    9098     3.4
                               +    SSU_RRNA    9308 -   10861   100.0  # AY138279 [D:1..1554] # 16S ribosomal RNA # Bacillus cereus
                               +    TRNA       10992 -   11068   101.2  # Ile GAT 0 0
                               +    TRNA       11077 -   11152    93.9  # Ala TGC 0 0
                               +    LSU_RRNA   11233 -   14154    99.0  # AF267882 [D:1..2922] # 23S ribosomal RNA # Bacillus
     7     3 Op  1     .       -    CDS        14175 -   14363     158
                               +    5S_RRNA    14205 -   14315    97.0  # AE017026 [D:165635..165750] # 5S ribosomal RNA # Bacillus
     8     3 Op  2     .       -    CDS        14353 -   15249     351  ## Similar_to_GB
     9     3 Op  3     .       -    CDS        15170 -   15352      99
                               -    Prom       15373 -   15432     6.9

Example of FGENESB output in GenBank format:

TAAGVIIRMPVDQISQMGRNTQGVRLIRLEDEQEVATVAKAQKDDEEETSEEVSSEE"
                     /transl_table=11
     terminator      9067..9098
                     /gene="GyrA"
     gene            9308..10861
                     /gene="AY138279 [D:1..1554]"
     rRNA            9308..10861
                     /gene="AY138279 [D:1..1554]"
                     /product="16S ribosomal RNA"
                     /note="AY138279 [D:1..1554]"
     gene            10992..11068
                     /gene="Ile GAT"
     tRNA            10992..11068
                     /gene="Ile GAT"
                     /product="tRNA-Val"
                     /note="Ile GAT 0 0"
     gene            11077..11152
                     /gene="Ala TGC"
     tRNA            11077..11152
                     /gene="Ala TGC"
                     /product="tRNA-Val"
                     /note="Ala TGC 0 0"
     gene            11233..14154
                     /gene="AF267882 [D:1..2922]"
     rRNA            11233..14154
                     /gene="AF267882 [D:1..2922]"
                     /product="23S ribosomal RNA"
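For readers who want to post-process this kind of output themselves, the sketch below shows one way to pull feature keys, coordinates, and single-line qualifiers out of a feature-table fragment like the one above. It is a simplified illustration, not a complete GenBank parser: the `parse_features` function and both regular expressions are assumptions of this sketch, only plain `start..end` and `complement(start..end)` locations are handled, and multi-line qualifier values such as translations are ignored.

```python
import re

# Feature line, e.g. "     rRNA            9308..10861"
FEATURE_RE = re.compile(r"^\s*(\w+)\s+(complement\()?(\d+)\.\.(\d+)\)?\s*$")
# Qualifier line, e.g. '                     /product="16S ribosomal RNA"'
QUALIFIER_RE = re.compile(r"^\s*/(\w+)=(.*)$")


def parse_features(text):
    """Return a list of feature dicts from a feature-table fragment."""
    features = []
    for line in text.splitlines():
        qualifier = QUALIFIER_RE.match(line)
        if qualifier:
            if features:  # attach to the most recent feature
                value = qualifier.group(2).strip().strip('"')
                features[-1]["qualifiers"][qualifier.group(1)] = value
            continue
        feature = FEATURE_RE.match(line)
        if feature:
            features.append({
                "key": feature.group(1),
                "start": int(feature.group(3)),
                "end": int(feature.group(4)),
                "strand": "-" if feature.group(2) else "+",
                "qualifiers": {},
            })
        # Any other line (e.g. a wrapped translation) is skipped.
    return features


# Fragment copied from the example on this page.
SAMPLE = '''
     terminator      9067..9098
                     /gene="GyrA"
     gene            9308..10861
                     /gene="AY138279 [D:1..1554]"
     rRNA            9308..10861
                     /gene="AY138279 [D:1..1554]"
                     /product="16S ribosomal RNA"
'''

features = parse_features(SAMPLE)
for f in features:
    print(f["key"], f["start"], f["end"], f["strand"], f["qualifiers"])
```

For complete records, an established GenBank parser is a safer choice than hand-written regular expressions.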

Table 1. Accuracy of prediction estimated on the B. subtilis sequence:
Frequency of genes that start at a start codon other than the first one: 19.1%
Borodovsky et al. (see the GeneMark web pages) have calculated accuracy for all genes and have constructed three sets of difficult short genes (L <= 300 bp) that have protein similarity support. These genes were used to demonstrate that short genes can also be predicted reasonably well. The first set (51set) has 51 genes with at least 10 strong similarities to known proteins. The second (72set) has 72 genes with at least two strong similarities, and the third (123set) has 123 genes with at least one protein homolog.

Here are the prediction results on these three sets for GeneMarkS and Glimmer (calculated in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618) and for FgenesB (calculated by Softberry using three iterations of the fgenesB-train script):

                 Sn, % (exact       Sn, % (exact + overlapping
                 predictions)       predictions)

 123set:
 Glimmer         57.0               91.1
 GeneMarkS       82.9               91.9
 FgenesB         89.3               98.4

 72set:
 Glimmer         57.0               91.7
 GeneMarkS       88.9               94.4
 FgenesB         91.5               98.6

 51set:
 Glimmer         51.0               88.2
 GeneMarkS       90.2               94.1
 FgenesB         92.0               98.0

 All genes of the B. subtilis genome (GenBank annotation):
 Glimmer         62.4               98.1
 GeneMarkS       83.2               96.7
 FgenesB         83.8               98.7

Please note that many genes in GenBank were annotated using the GeneMark program, which should result in overestimation of its accuracy.
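As a reading aid for the two Sn columns in Table 1, the Python sketch below computes sensitivity as the percentage of annotated genes that are recovered by a prediction set. It rests on assumptions: this page does not define "overlapping", so the sketch counts any same-strand prediction that shares at least one base with the annotated gene, and the function names and toy coordinates are hypothetical rather than taken from the benchmark.

```python
def overlaps(a, b):
    """True if two (start, end, strand) genes share a strand and a base."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def sensitivity(annotated, predicted):
    """Return (exact %, exact + overlapping %) over the annotated genes.

    "Exact" means identical start, end, and strand. The overlap rule is
    an assumption of this sketch; an exact match also counts as overlapping.
    """
    predicted_set = set(predicted)
    exact = sum(1 for gene in annotated if gene in predicted_set)
    exact_or_overlap = sum(
        1 for gene in annotated if any(overlaps(gene, p) for p in predicted)
    )
    total = len(annotated)
    return 100.0 * exact / total, 100.0 * exact_or_overlap / total


# Toy data (hypothetical coordinates, not from the B. subtilis benchmark).
annotated = [(100, 400, "+"), (500, 800, "+"), (900, 1200, "-"), (1500, 1700, "+")]
predicted = [(100, 400, "+"), (530, 800, "+"), (900, 1200, "+")]

result = sensitivity(annotated, predicted)
print(result)  # (25.0, 50.0)
```

In the toy data, the first gene is predicted exactly, the second is found with a different start, the third is predicted on the wrong strand, and the fourth is missed, which gives 25% exact and 50% exact plus overlapping.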
