Annotation of Bacterial Genomes and Community Sequences: Genes, Operons, Promoters, Terminators, Protein Sub-Cellular Localization.
Softberry software for analysis of bacterial genomes:
- Automatic annotation pipeline FgenesB
(see flowchart; 445 citations for use in metagenomic projects according to Google Scholar),
which includes self-training of gene-finding parameters, prediction of CDS
and identification of the closest protein homolog in the COG, KEGG and NR
databases, mapping of rRNA and tRNA genes, and prediction of operons, promoters
and terminators. Suitable for both eubacterial and archaebacterial genomes, and
for environmental samples of bacterial DNA sequences.
- Togenbank: A set of scripts to convert FgenesB output to GenBank
and Sequin formats, for visualization in popular viewers like Artemis
and for submitting annotated sequences to GenBank.
- Absplit:
A program that sorts bacterial sequences into eubacterial and archaebacterial
subsets.
- BPROM:
Bacterial sigma70 promoter prediction program (831 citations according to Google Scholar).
- FindTerm:
Finder of rho-independent terminators.
- ProtCompB:
Program for prediction of localization of proteins in bacterial cellular
compartments.
- Bacterial GenomeExplorer:
Interactive genome viewer with search capabilities, which lets users
visualize known and predicted genes, mRNAs, ESTs, promoters, terminators,
and other features.
- Several programs for alignment and comparison of bacterial genomes.
- Visualization of FgenesB annotation using CGView
- Human Microbiome gene prediction
- Softberry Microbiome Annotation Database
The set of bacterial genome annotation tools listed above
was successfully used in the first published annotation of bacterial communities
(Community structure and metabolism through reconstruction of microbial
genomes from the environment, Nature 2004, 428, 37-43), and is now
widely used in dozens of academic and industrial labs for sequence analysis
in bacteria.
Evaluations:
Comparative accuracy estimated in comprehensive tests (Mavromatis et al.,
Nature Methods, Vol. 4, No. 6, June 2007):
The first pipeline, fgenesb, was used to predict coding sequences both on the
assembled contigs and on the unassembled reads. The second pipeline used a
combination of CRITICA and GLIMMER (called here the CG pipeline). "Fgenesb
correctly identified 10-30% more reference genes on the contigs than the
CG pipeline in every data set (Fig. 2a). Both pipelines called 7-15% of
the genes inaccurately (Fig. 2a)"
"We also evaluated the accuracy of the gene calls on unassembled reads
(where low-abundance species are usually represented). Fgenesb correctly
identified ~70% and missed ~20% of reference genes on unassembled reads in all
data sets (Fig. 2b). The remaining 10% of reference genes were inaccurately
called and another ~8% were newly predicted. Notably, the contribution of
assembly to accurate gene prediction was not more than 20%, whereas its effect
on the missed and the inaccurately predicted genes was only slightly higher.
The CG pipeline exhibited poor results (7% accurately predicted, 85% missed
and 8% inaccurately predicted genes), and we did not use these data for the
following steps of the analysis".
Prediction results on three sets of difficult-to-predict
bacterial genes for GeneMarkS and Glimmer (calculated in Nucleic Acids
Research, 2001, Vol. 29, No. 12, 2607-2618) and for FgenesB (calculated by
Softberry):
| Program   | 123 set     | 72 set      | 51 set      | B.subtilis genome |
|           | Sn (Sn_ov)  | Sn (Sn_ov)  | Sn (Sn_ov)  | Sn (Sn_ov)        |
| Glimmer   | 57.0 (91.1) | 57.0 (88.2) | 51.0 (88.2) | 62.4 (98.1)       |
| GeneMarkS | 82.9 (91.9) | 88.9 (94.1) | 90.2 (94.1) | 83.2 (96.7)       |
| FgenesB   | 89.3 (98.4) | 91.5 (98.6) | 92.0 (98.0) | 83.8 (98.7)       |
Sn (sensitivity): percentage of existing CDS predicted exactly
right. Sn_ov: percentage of existing CDS predicted either exactly or with
overlap. For B.subtilis, we compared the genes predicted by the three programs
with all annotated genes in the GenBank entry for that genome.
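The two measures defined above can be illustrated with a minimal sketch. It is not the evaluation code used for the table. It assumes that each CDS is given as a hypothetical (start, end, strand) tuple, that "exactly right" means identical start, end and strand, and that "overlap" means sharing at least one base on the same strand; the source does not state the overlap criterion, so that rule is an assumption, as are the function names and the toy coordinates.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end, strand), 1-based inclusive coordinates.
CDS = Tuple[int, int, str]


def overlaps(a: CDS, b: CDS) -> bool:
    """True if both CDS lie on the same strand and share at least one base (assumed criterion)."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def sensitivity(reference: List[CDS], predicted: List[CDS]) -> Tuple[float, float]:
    """Return (Sn, Sn_ov) as percentages of the reference CDS."""
    if not reference:
        raise ValueError("reference set is empty")
    predicted_set = set(predicted)
    # Sn: reference CDS reproduced with identical start, end and strand.
    exact = sum(1 for ref in reference if ref in predicted_set)
    # Sn_ov: reference CDS hit exactly or overlapped (an exact match also overlaps).
    hit = sum(1 for ref in reference if any(overlaps(ref, p) for p in predicted))
    n = len(reference)
    return 100.0 * exact / n, 100.0 * hit / n


# Toy data, invented for illustration only.
reference = [(100, 400, "+"), (500, 900, "-"), (1000, 1300, "+"), (1500, 1800, "+")]
predicted = [(100, 400, "+"), (530, 900, "-"), (2000, 2300, "+")]

sn, sn_ov = sensitivity(reference, predicted)
print(f"Sn = {sn:.1f}%, Sn_ov = {sn_ov:.1f}%")  # Sn = 25.0%, Sn_ov = 50.0%
```

In this toy case one of four reference CDS is reproduced exactly and a second is predicted with a shifted start, so Sn_ov is twice Sn. The extra prediction at 2000-2300 affects neither value, since both measures are computed relative to the reference set only.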
Application of Softberry programs for bacterial genomic and metagenomic sequence analysis in selected publications:
FgenesB:
Journal of Bacteriology (2008) 190, 4859-4864;
Journal of Bacteriology (2008) 190, 3646-3657;
Journal of Bacteriology (2008) 190, 3362-3373;
Journal of Bacteriology (2008) 190, 2777-2789;
Microbiology (2008) 154, 1422-1435;
Nature Methods (2007) 4, 495-500; PNAS (2007) 104, 29, 11889-11894;
Nature Biotechnology (2007) 25, 447-453; PNAS (2007) 104, 13, 5590-5595;
Nature Biotechnology (2006) 24, 1263-1269; PNAS (2006) 103, 34, 12897-12902;
Journal of Bacteriology (2006) 188, 14, 5228-5239; Microbiology (2006) 152, 1951-1968;
Molecular Microbiology (2006) 61, 4, 960-977; Nature (2006) 439, 847-850;
PLoS Biol (2006) 4, 4, e95; Science (2005) 308, 554-557; Nature (2004) 428, 37-43.
BPROM:
Journal of Bacteriology (2008) 190, 4559-4567;
Journal of Bacteriology (2008) 190, 3877-3885;
Journal of Bacteriology (2008) 190, 2096-2105;
Microbiology (2008) 154, 1719-1728;
Microbiology (2008) 154, 1372-1383;
Infection and Immunity (2007) 75, 9, 4506-4513; Journal of Bacteriology (2007) 189, 2, 351-362;
Cellular Microbiology (2007) 9, 4, 1039-1049; Journal of Molecular Biology (2007) 368, 4, 966-981;
PNAS (2006) 103, 34, 12897-12902; Journal of Bacteriology (2006) 188, 14, 5089-5100;
Applied and Environmental Microbiology (2006) 72, 4, 2539-2546; Molecular Microbiology (2006) 59, 2, 541-550;
Applied and Environmental Microbiology (2006) 72, 1, 368-377; Journal of Bacteriology (2006) 188, 2, 789-793;
FEBS Journal (2005) 272, 24, 6324-6335; Infection and Immunity (2005) 73, 5, 2899-2909.
FindTerm:
Microbiology (2008) 154, 1422-1435;
Journal of Bacteriology (2008) 190, 3646-3657;
Microbiology (2007) 153, 2148-2158; Infection and Immunity (2007) 75, 9, 4482-4489;
Enzyme and Microbial Technology (2007) 4, 5, 747-753; Journal of Bacteriology (2006) 188, 1, 160-168;
Molecular Microbiology (2006) 59, 2, 541-550; Journal of Bacteriology (2006) 188, 1, 202-210.