

FGENESB Suite of Bacterial Operon and Gene Finding Programs

FGENESB is a package for automatic annotation of bacterial genomes that includes the following features:

Automatic training of gene finding parameters for new bacterial genomes using only genomic DNA as input (optionally, pre-learned parameters from a related organism can be used)
Mapping of tRNA and rRNA genes
Highly accurate Markov chain-based gene prediction
Prediction of promoters and terminators
Operon prediction based on distances between ORFs and frequencies of different genes neighboring each other in known bacterial genomes, as well as on promoter and terminator predictions
Automatic annotation of predicted genes by homology with the COG and NR databases

The FGENESB gene prediction algorithm is based on Markov chain models of coding regions and of translation start and termination sites. In the comparison shown in Table 1 at the bottom of this page, it outperforms two other popular prokaryotic gene prediction programs. The package includes options to work with sets of sequences, such as scaffolds of bacterial genomes or short sequencing reads extracted from bacterial communities. For community sequence annotation, we include the ABsplit program, which separates archaebacterial and eubacterial sequences. FGENESB was used in the first published bacterial community annotation project: see Tyson et al. (2004) Nature 428(6978), 37-43.
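
To make the idea of a Markov chain model of coding regions concrete, here is a minimal sketch of an in-frame (three-periodic) Markov chain of the kind named in Step 3 below. It is not the FGENESB implementation: the function names, the add-one pseudocount and the uniform 0.25 background are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_inframe_markov(orfs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    `orfs` are DNA strings that start in frame 0 (e.g. at the start codon).
    Returns a function prob(frame, context, base).
    """
    # One count table per codon position: counts[frame][context][base]
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in orfs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in ALPHABET and all(c in ALPHABET for c in context):
                counts[i % 3][context][base] += 1.0

    def prob(frame, context, base):
        row = counts[frame].get(context, {})
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row.get(base, 0.0) + pseudocount) / total

    return prob

def coding_log_odds(seq, prob, order=5, frame=0):
    """Log-odds of `seq` under the coding model vs. a uniform background.

    `frame` is the codon position assumed for seq[0]. The uniform
    background (0.25 per base) is an illustrative assumption.
    """
    score = 0.0
    for i in range(order, len(seq)):
        p = prob((i + frame) % 3, seq[i - order:i], seq[i])
        score += math.log(p / 0.25)
    return score
```

A fragment scored in the frame it was trained in should receive a higher log-odds score than the same fragment scored in a shifted frame, which is what lets such a model tell the reading frame of a coding region.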

The final annotation can be presented in GenBank format, which can be read by visualization software such as Artemis or Softberry Bacterial Genome Explorer.

 

Step-by-Step Description of FGENESB annotation.

Please note that the web version of FGENESB includes only the gene prediction and a greatly simplified operon prediction portion of the program, represented by steps 3 and 4 below. The complete FGENESB package is available only for local installation.

STEP 1. Finds all potential ribosomal RNA genes using BLAST against bacterial and/or archaeal rRNA databases, and masks detected rRNA genes.

STEP 2. Predicts tRNA genes using the tRNAscan-SE program (Washington University) and masks detected tRNA genes.

STEP 3. Makes initial predictions of long ORFs, which are used as a starting point for calculating gene prediction parameters, and iterates until the result stabilizes. Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for the region around the start codon and upstream RBS site and for the stop codon, and probability distributions of ORF lengths.

STEP 4. Predicts operons based only on distances between predicted genes.

STEP 5. Runs BLASTP for predicted proteins against the COG database (cog.pro).

STEP 6. Uses information about conservation of neighboring gene pairs in known genomes to improve operon prediction.

STEP 7. Runs BLASTP against NR for proteins having no COG hits.

STEP 8. Predicts potential promoters (BPROM program) and terminators (BTERM program) in the upstream and downstream regions, respectively, of predicted genes. BTERM predicts bacterial rho-independent terminators, with energy scoring based on a discriminant function of hairpin elements.

STEP 9. Refines operon predictions using predicted promoters and terminators as additional evidence.
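
The distance-only rule of Step 4 can be illustrated with a short sketch. This is a simplification under stated assumptions, not the FGENESB algorithm: the `Gene` type, the function name and the fixed `max_gap` threshold are hypothetical, and the real program may well score distances probabilistically rather than apply a hard cutoff.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate, inclusive
    strand: str  # "+" or "-"

def group_operons(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Group consecutive same-strand genes into candidate operons.

    Two neighboring genes are placed in the same operon when they lie on
    the same strand and the intergenic distance is at most `max_gap` bp
    (an illustrative threshold, not a value taken from FGENESB).
    """
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        prev = operons[-1][-1] if operons else None
        if (prev is not None
                and gene.strand == prev.strand
                and gene.start - prev.end - 1 <= max_gap):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return operons
```

Steps 6 and 9 would then adjust such groups using evidence that a distance rule cannot see: conserved gene neighborhoods, and predicted promoters and terminators.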

 

Example of FGENESB output:


     1     1 Op  1  21/0.000   +    CDS        407 -      1747   1311  ## COG0593 ATPase involved in DNA 
                               +    Term      1786 -      1823    3.2 
                               +    Prom      1847 -      1906   10.5 
     2     1 Op  2   3/0.019   +    CDS       1926 -      3065   1237  ## COG0592 DNA polymerase 
                               +    Term      3074 -      3122    9.1 
                               +    Prom      3105 -      3164    4.0 
     3     2 Op  1   4/0.002   +    CDS       3193 -      3405    278  ## COG2501 Uncharacterized ACR
     4     2 Op  2   4/0.002   +    CDS       3418 -      4545    899  ## COG1195 Recombinational DNA
     5     2 Op  3  16/0.000   +    CDS       4578 -      6506   2148  ## COG0187 DNA gyrase (topoisomerase II) B subunit
                               +    Term      6516 -      6551    4.7 
                               +    Prom      6512 -      6571    2.3 
     6     2 Op  4     .       +    CDS       6595 -      9066   2957  ## COG0188 DNA gyrase (topoisomerase II) A subunit
                               +    Term      9067 -      9098    3.4 
                              + SSU_RRNA      9308 -     10861  100.0  # AY138279 [D:1..1554] # 16S ribosomal RNA # Bacillus cereus
                               +   TRNA      10992 -     11068  101.2  # Ile GAT 0 0
                               +   TRNA      11077 -     11152   93.9  # Ala TGC 0 0
                              + LSU_RRNA     11233 -     14154   99.0  # AF267882 [D:1..2922] # 23S ribosomal RNA # Bacillus
     7     3 Op  1     .       -    CDS      14175 -     14363    158 
                              + 5S_RRNA     14205 -     14315   97.0  # AE017026 [D:165635..165750] # 5S ribosomal RNA # Bacillus
     8     3 Op  2     .       -    CDS      14353 -     15249    351  ## Similar_to_GB
     9     3 Op  3     .       -    CDS      15170 -     15352     99 
                               -    Prom     15373 -     15432    6.9 
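
For downstream processing, a table like the one above can be read with a small parser. The sketch below is inferred only from the example shown here, not from a format specification: the column meanings (gene number, operon number, position in operon, strand, feature type, coordinates, score, annotation) and all names in the code are assumptions.

```python
import re

# Optional gene number, optional "<operon> Op <position> <value>" group,
# then strand, feature type, "start - end", score, optional "#"-prefixed note.
LINE_RE = re.compile(
    r"^\s*(?:(?P<gene>\d+)\s+)?"
    r"(?:(?P<operon>\d+)\s+Op\s+(?P<pos>\d+)\s+\S+\s+)?"
    r"(?P<strand>[+-])\s+(?P<type>\S+)\s+"
    r"(?P<start>\d+)\s+-\s+(?P<end>\d+)\s+(?P<score>-?[\d.]+)"
    r"(?:\s+#+\s*(?P<note>.*))?\s*$"
)

def parse_fgenesb_table(text):
    """Return one dict per feature line; lines that do not match are skipped."""
    features = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        d = m.groupdict()
        features.append({
            "gene": int(d["gene"]) if d["gene"] else None,
            "operon": int(d["operon"]) if d["operon"] else None,
            "operon_pos": int(d["pos"]) if d["pos"] else None,
            "strand": d["strand"],
            "type": d["type"],
            "start": int(d["start"]),
            "end": int(d["end"]),
            "score": float(d["score"]),
            "note": d["note"].strip() if d["note"] else None,
        })
    return features
```

Rows without gene and operon columns (Term, Prom, TRNA and rRNA features) come back with `gene` and `operon` set to `None`, so CDS rows can be selected with a simple filter on `type`.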

Example of FGENESB output in GenBank format:

TAAGVIIRMPVDQISQMGRNTQGVRLIRLEDEQEVATVAKAQKDDEEETSEEVSSEE"
                     /transl_table=11
     terminator      9067..9098
                     /gene="GyrA"
     gene            9308..10861
                     /gene="AY138279 [D:1..1554]"
     rRNA            9308..10861
                     /gene="AY138279 [D:1..1554]"
                     /product="16S ribosomal RNA"
                     /note="AY138279 [D:1..1554]"
     gene            10992..11068
                     /gene="Ile GAT"
     tRNA            10992..11068
                     /gene="Ile GAT"
                     /product="tRNA-Val"
                     /note="Ile GAT 0 0"
     gene            11077..11152
                     /gene="Ala TGC"
     tRNA            11077..11152
                     /gene="Ala TGC"
                     /product="tRNA-Val"
                     /note="Ala TGC 0 0"
     gene            11233..14154
                     /gene="AF267882 [D:1..2922]"
     rRNA            11233..14154
                     /gene="AF267882 [D:1..2922]"
                     /product="23S ribosomal RNA"


Table 1. Accuracy of prediction estimated on the B. subtilis sequence:
Frequency of genes starting from a start codon other than the first: 19.1%
Borodovsky et al. (see the GeneMark web pages) calculated accuracy for all genes and constructed three sets of difficult short genes (L <= 300 bp) that have protein similarity support. These genes were used to demonstrate that short genes can also be predicted reasonably well. The first set (51set) has 51 genes with at least 10 strong similarities to known proteins, 72set has 72 genes with at least two strong similarities, and 123set has 123 genes with at least one protein homolog.

Here are the prediction results on these three sets for GeneMarkS and Glimmer (as reported in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618) and for FgenesB (calculated by Softberry using three iterations of the fgenesB-train script):

                 Sn, % (exact      Sn, % (exact + overlapping
                 predictions)      predictions)

 123set:
 Glimmer         57.0              91.1
 GeneMarkS       82.9              91.9
 FgenesB         89.3              98.4

 72set:
 Glimmer         57.0              91.7
 GeneMarkS       88.9              94.4
 FgenesB         91.5              98.6

 51set:
 Glimmer         51.0              88.2
 GeneMarkS       90.2              94.1
 FgenesB         92.0              98.0

 All genes of the B. subtilis genome (GenBank annotation):
 Glimmer         62.4              98.1
 GeneMarkS       83.2              96.7
 FgenesB         83.8              98.7

Please note that many genes in GenBank were annotated using the GeneMark program, which is likely to result in overestimation of its accuracy.
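
For readers who want to run a similar comparison on their own annotation, the two sensitivity (Sn) columns can be computed roughly as follows. This is a sketch under assumptions: the page does not define "overlapping", so here it is taken to mean a prediction on the same strand that shares at least one base with the annotated gene. A stricter criterion (for example, requiring the same stop codon position) could be substituted. The function name and the tuple layout are hypothetical.

```python
def sensitivity(annotated, predicted):
    """Return (Sn_exact, Sn_exact_plus_overlapping) in percent.

    Genes are (start, end, strand) tuples with inclusive coordinates.
    "Exact" means identical start, end and strand. "Overlapping" is
    assumed to mean any same-strand overlap of at least one base.
    """
    if not annotated:
        raise ValueError("annotated gene list is empty")
    predicted_set = set(predicted)
    exact = overlapping = 0
    for gene in annotated:
        start, end, strand = gene
        if gene in predicted_set:
            exact += 1
        elif any(s == strand and ps <= end and pe >= start
                 for ps, pe, s in predicted):
            overlapping += 1
    n = len(annotated)
    return 100.0 * exact / n, 100.0 * (exact + overlapping) / n
```

With four annotated genes, one predicted exactly and one predicted with a different start, this gives 25.0 for exact predictions and 50.0 for exact plus overlapping predictions.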