FGENESB is a package for automatic annotation of bacterial genomes that includes the following features:
Automatic training of gene finding parameters for new bacterial genomes using only genomic DNA as input (optionally, pre-learned parameters from a related organism can be used)
Mapping of tRNA and rRNA genes
Highly accurate Markov chain-based gene prediction
Prediction of promoters and terminators
Operon prediction based on distances between ORFs and frequencies of different genes neighboring each other in known bacterial genomes, as well as on promoter and terminator predictions
Automatic annotation of predicted genes by homology with COG and NR databases.
The FGENESB gene prediction algorithm is based on Markov chain models of coding regions and of translation initiation and termination sites. In our tests it was the most accurate of the prokaryotic gene prediction engines compared: see Table 1 at the bottom of this page for a comparison with two other popular gene prediction programs. The package includes options to work with sets of sequences, such as scaffolds of bacterial genomes or short sequencing reads extracted from bacterial communities. For community sequence annotation, we include the ABsplit program, which separates archaebacterial and eubacterial sequences. FGENESB was used in the first published bacterial community annotation project: see Tyson et al. (2004) Nature 428(6978), 37-43.
The final annotation can be presented in GenBank format so that it can be read by visualization software such as Artemis or Softberry Bacterial Genome Explorer.
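To illustrate what an in-frame Markov chain model of coding regions looks like, here is a minimal sketch in Python. It is not the FGENESB implementation: the function names, the pseudocount, the uniform background and the fallback for unseen contexts are all assumptions made for the example. It estimates the probability of each base given the preceding bases and the codon position, then scores a sequence by its log-odds against the background.

```python
import math
from collections import defaultdict

def train_inframe_markov(orfs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position) from in-frame ORFs.

    Hypothetical helper for illustration; each ORF is assumed to start in frame 0.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in orfs:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            counts[(i % 3, context)][seq[i]] += 1.0
    model = {}
    for key, nexts in counts.items():
        total = sum(nexts.values()) + 4 * pseudocount
        model[key] = {b: (nexts.get(b, 0.0) + pseudocount) / total for b in "ACGT"}
    return model

def coding_score(seq, model, order=5):
    """Log-odds of `seq` (read in frame 0) versus a uniform 0.25 background."""
    score = 0.0
    for i in range(order, len(seq)):
        probs = model.get((i % 3, seq[i - order:i]))
        # Unseen contexts or non-ACGT symbols fall back to the background (assumption).
        p = probs.get(seq[i], 0.25) if probs else 0.25
        score += math.log(p / 0.25)
    return score
```

A sequence read in the same frame as the training ORFs should score higher than the same sequence shifted by one base, which is the property a gene finder exploits to pick the coding frame.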
Step-by-Step Description of FGENESB annotation.
Please note that the web version of FGENESB includes only the gene prediction and a greatly simplified operon prediction portion of the program, represented by steps 3 and 4 below. The complete FGENESB package is available only for local installation.
STEP 1. Finds all potential ribosomal RNA genes using BLAST against bacterial and/or archaeal rRNA databases, and masks detected rRNA genes.
STEP 2. Predicts tRNA genes using the tRNAscan-SE program (Washington University) and masks detected tRNA genes.
STEP 3. Makes initial predictions of long ORFs, which are used as a starting point for calculating gene prediction parameters, and iterates until the parameters stabilize. Generates parameters such as 5th-order in-frame Markov chains for coding regions, 2nd-order Markov models for the regions around the start codon, the upstream RBS site and the stop codon, and probability distributions of ORF lengths.
STEP 4. Predicts operons based only on distances between predicted genes.
STEP 5. Runs BLASTP for predicted proteins against the COG database (cog.pro).
STEP 6. Uses information about conservation of neighboring gene pairs in known genomes to improve operon prediction.
STEP 7. Runs BLASTP against NR for proteins having no COG hits.
STEP 8. Predicts potential promoters (BPROM program) and terminators (BTERM program) in the upstream and downstream regions, respectively, of predicted genes. BTERM predicts bacterial rho-independent terminators, with energy scoring based on a discriminant function of hairpin elements.
STEP 9. Refines operon predictions using predicted promoters and terminators as additional evidence.
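The distance-only operon prediction of step 4 can be illustrated with a short Python sketch. This is a simplification under stated assumptions, not the FGENESB algorithm: the function name, the fixed `max_gap` threshold of 100 bp and the same-strand requirement are all hypothetical choices made for the example.

```python
from typing import List, Tuple

Gene = Tuple[int, int, str]  # (start, end, strand), 1-based inclusive coordinates

def group_into_operons(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Chain adjacent same-strand genes whose intergenic gap is at most `max_gap` bp.

    The threshold is an illustrative assumption, not a value used by FGENESB.
    """
    operons: List[List[Gene]] = []
    for gene in sorted(genes):
        if operons:
            prev = operons[-1][-1]
            gap = gene[0] - prev[1] - 1  # negative when the two genes overlap
            if gene[2] == prev[2] and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons
```

A single fixed threshold is the crudest possible rule; the later steps described above exist because distance alone misses operons with longer intergenic gaps and cannot tell where a closely packed run of genes is interrupted by a promoter or terminator.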
Example of FGENESB output:
    1   1 Op  1 21/0.000  +  CDS         407 -   1747   1311  ## COG0593 ATPase involved in DNA
                          +  Term       1786 -   1823    3.2
                          +  Prom       1847 -   1906   10.5
    2   1 Op  2  3/0.019  +  CDS        1926 -   3065   1237  ## COG0592 DNA polymerase
                          +  Term       3074 -   3122    9.1
                          +  Prom       3105 -   3164    4.0
    3   2 Op  1  4/0.002  +  CDS        3193 -   3405    278  ## COG2501 Uncharacterized ACR
    4   2 Op  2  4/0.002  +  CDS        3418 -   4545    899  ## COG1195 Recombinational DNA
        2 Op  3 16/0.000  +  CDS        4578 -   6506   2148  ## COG0187 DNA gyrase (topoisomerase II) B subunit
                          +  Term       6516 -   6551    4.7
                          +  Prom       6512 -   6571    2.3
    6   2 Op  4        .  +  CDS        6595 -   9066   2957  ## COG0188 DNA gyrase (topoisomerase II) A subunit
                          +  Term       9067 -   9098    3.4
                          +  SSU_RRNA   9308 -  10861  100.0  # AY138279 [D:1..1554] # 16S ribosomal RNA # Bacillus cereus
                          +  TRNA      10992 -  11068  101.2  # Ile GAT 0 0
                          +  TRNA      11077 -  11152   93.9  # Ala TGC 0 0
                          +  LSU_RRNA  11233 -  14154   99.0  # AF267882 [D:1..2922] # 23S ribosomal RNA # Bacillus
    7   3 Op  1        .  -  CDS       14175 -  14363    158
                          +  5S_RRNA   14205 -  14315   97.0  # AE017026 [D:165635..165750] # 5S ribosomal RNA # Bacillus
    8   3 Op  2        .  -  CDS       14353 -  15249    351  ## Similar_to_GB
    9   3 Op  3        .  -  CDS       15170 -  15352     99
                          -  Prom      15373 -  15432    6.9
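For downstream processing, CDS records in this output could be parsed with a small Python helper like the one below. This is a sketch only: it assumes one feature per line, and the column meanings (gene number, operon number, position in operon, strand, coordinates, score, annotation) are inferred from the example above rather than taken from a format specification. The fifth column (for example `21/0.000` or `.`) is not described on this page, so it is kept as an unparsed string. The names `CDS_LINE` and `parse_cds_line` are hypothetical.

```python
import re

# Assumed layout of a CDS line (inferred from the example, not from a format spec):
# gene no., operon no., "Op", position in operon, an unparsed field such as
# "21/0.000" or ".", strand, "CDS", start, "-", end, score, optional "## annotation".
CDS_LINE = re.compile(
    r"^(\d+)\s+(\d+)\s+Op\s+(\d+)\s+(\S+)\s+([+-])\s+CDS\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)(?:\s+##\s*(.*))?$"
)

def parse_cds_line(line):
    """Return a dict for a CDS line, or None for any other line."""
    match = CDS_LINE.match(line.strip())
    if match is None:
        return None
    gene, operon, pos, raw, strand, start, end, score, note = match.groups()
    return {
        "gene": int(gene),
        "operon": int(operon),
        "position_in_operon": int(pos),
        "raw_field": raw,  # meaning not documented here, kept verbatim
        "strand": strand,
        "start": int(start),
        "end": int(end),
        "score": int(score),
        "annotation": note or "",
    }
```

Lines describing terminators, promoters, tRNA and rRNA features do not match the pattern and are returned as `None`, so a caller can filter them out or handle them separately.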
Example of FGENESB output in GenBank format:
                     TAAGVIIRMPVDQISQMGRNTQGVRLIRLEDEQEVATVAKAQKDDEEETSEEVSSEE"
                     /transl_table=11
     terminator      9067..9098
                     /gene="GyrA"
     gene            9308..10861
                     /gene="AY138279 [D:1..1554]"
     rRNA            9308..10861
                     /gene="AY138279 [D:1..1554]"
                     /product="16S ribosomal RNA"
                     /note="AY138279 [D:1..1554]"
     gene            10992..11068
                     /gene="Ile GAT"
     tRNA            10992..11068
                     /gene="Ile GAT"
                     /product="tRNA-Val"
                     /note="Ile GAT 0 0"
     gene            11077..11152
                     /gene="Ala TGC"
     tRNA            11077..11152
                     /gene="Ala TGC"
                     /product="tRNA-Val"
                     /note="Ala TGC 0 0"
     gene            11233..14154
                     /gene="AF267882 [D:1..2922]"
     rRNA            11233..14154
                     /gene="AF267882 [D:1..2922]"
                     /product="23S ribosomal RNA"
Table 1. Accuracy of prediction estimated on the B. subtilis sequence:
Frequency of genes that start from a start codon other than the first: 19.1%
Borodovsky et al. (see the GeneMark web pages) have calculated accuracy for all genes and have constructed three sets of difficult short genes (L <= 300 bp) that have protein similarity support. These genes were used to demonstrate that short genes can also be predicted reasonably well. The first set (51set) has 51 genes with at least 10 strong similarities to known proteins, 72set has 72 genes with at least two strong similarities, and 123set has 123 genes with at least one protein homolog.
Here are the prediction results on these three sets for GeneMarkS and Glimmer (as reported in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618) and for FGENESB (calculated by Softberry using three iterations of the fgenesB-train script):
Set / Program                Sn, % (exact predictions)   Sn, % (exact+overlapping predictions)
123set:
  Glimmer                    57.0                        91.1
  GeneMarkS                  82.9                        91.9
  FGENESB                    89.3                        98.4
72set:
  Glimmer                    57.0                        91.7
  GeneMarkS                  88.9                        94.4
  FGENESB                    91.5                        98.6
51set:
  Glimmer                    51.0                        88.2
  GeneMarkS                  90.2                        94.1
  FGENESB                    92.0                        98.0
All genes of the B. subtilis genome (GenBank annotation):
  Glimmer                    62.4                        98.1
  GeneMarkS                  83.2                        96.7
  FGENESB                    83.8                        98.7
Please note that many genes in GenBank were annotated using the GeneMark program, which should result in an overestimate of its accuracy.
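For readers who want to reproduce this kind of comparison on their own data, the two sensitivity figures could be computed along the following lines. This is a sketch under explicit assumptions: this page does not define the matching criteria, so here an "exact" prediction is taken to mean identical start, end and strand, and an "overlapping" prediction to mean any same-strand prediction sharing at least one base with the annotated gene. The function name and the gene tuple layout are hypothetical.

```python
def sensitivity(annotated, predicted):
    """Return (exact Sn, exact+overlapping Sn) as percentages of annotated genes.

    Genes are (start, end, strand) tuples with inclusive coordinates.
    The matching criteria are assumptions made for this illustration.
    """
    if not annotated:
        raise ValueError("annotated gene list is empty")
    predicted_set = set(predicted)
    exact = overlapping = 0
    for start, end, strand in annotated:
        if (start, end, strand) in predicted_set:
            exact += 1
        elif any(s == strand and ps <= end and pe >= start
                 for ps, pe, s in predicted):
            overlapping += 1
    total = len(annotated)
    return 100.0 * exact / total, 100.0 * (exact + overlapping) / total
```

With four annotated genes, one predicted exactly and one predicted with a shifted start, the function returns 25.0 and 50.0. A stricter overlap criterion (for example, requiring the same stop codon) would be a small change to the `any(...)` condition.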