TSSG is the most accurate mammalian promoter prediction program. The following table shows the results of promoter searches on genes with known mRNAs by different promoter-finding programs, adapted from Liu and States (2002) Genome Research 12:462-469. It shows that TSSG makes by far the fewest false positive predictions.
| Program | Set1 (133 promoters): true predictions | Set1 (133 promoters): false predictions | Set2 (120 promoters): true predictions | Set2 (120 promoters): false predictions |
| PROSCAN1.7 | 32 (24%) | 18 (36%) | 30 (25%) | 22 (42%) |
| NNPP2.0 | 56 (42%) | 41 (42%) | 26 (22%) | 50 (66%) |
| PromFD1.0 | 88 (66%) | 43 (33%) | 69 (58%) | 57 (45%) |
| Promoter2.0 | 8 (6%) | 100 (93%) | 14 (12%) | 92 (88%) |
| TSSG | 75 (56%) | 10 (12%) | 62 (52%) | 18 (23%) |
| TSSW | 57 (43%) | 29 (34%) | 58 (48%) | 20 (26%) |
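The page does not state how the percentages in the table above were derived. The TSSG and PROSCAN1.7 entries for Set1 are consistent with the following reading, which is an inference from the numbers and not a definition taken from the source: the true-prediction percentage is the share of promoters in the set that were found, and the false-prediction percentage is the share of all predictions that were wrong. A minimal sketch under that assumption (the function name `rates` is hypothetical):

```python
def rates(true_pred, false_pred, n_promoters):
    """Return (true %, false %) under the assumed definitions.

    true %  = true predictions / promoters in the set
    false % = false predictions / all predictions made
    Both definitions are assumptions inferred from the table.
    """
    true_pct = 100 * true_pred / n_promoters
    false_pct = 100 * false_pred / (true_pred + false_pred)
    return round(true_pct), round(false_pct)


# TSSG on Set1: 75 true and 10 false predictions, 133 promoters
print(rates(75, 10, 133))
# PROSCAN1.7 on Set1: 32 true and 18 false predictions, 133 promoters
print(rates(32, 18, 133))
```

Under this reading, the false-prediction column measures how much of a program's output is wrong, so a program that makes few predictions can still score a high percentage there.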
The algorithm predicts potential transcription start positions using a linear discriminant function that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSG uses the promoter.dat file of selected factor binding sites (TFD, Ghosh, 1993), developed by Dan Prestridge, to calculate the density of functional sites as in J. Mol. Biol., 1995, 249, 923-932.
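To illustrate the general idea of a linear discriminant function (LDF), the sketch below scores candidate positions as a weighted sum of features and keeps those that reach a threshold. It is not TSSG's implementation: the feature names (`tata_score`, `motif_density`, `oligo_score`), the weights, and the use of `>=` for the comparison are all hypothetical. Only the threshold value 4.0 is taken from the sample output on this page.

```python
from typing import NamedTuple


class Window(NamedTuple):
    """Features of one candidate TSS position (all names hypothetical)."""
    position: int
    tata_score: float
    motif_density: float
    oligo_score: float


# Hypothetical weights, chosen for illustration only; not TSSG's values.
WEIGHTS = (2.0, 3.0, 1.0)
# The sample output reports "Threshold for LDF- 4.00".
THRESHOLD = 4.0


def ldf(w: Window) -> float:
    """Linear discriminant: weighted sum of the window's features."""
    return (WEIGHTS[0] * w.tata_score
            + WEIGHTS[1] * w.motif_density
            + WEIGHTS[2] * w.oligo_score)


def predict_tss(windows, threshold=THRESHOLD):
    """Return (position, score) for windows scoring at or above threshold."""
    return [(w.position, ldf(w)) for w in windows if ldf(w) >= threshold]


candidates = [
    Window(position=100, tata_score=1.0, motif_density=2.0, oligo_score=0.5),
    Window(position=200, tata_score=0.5, motif_density=0.5, oligo_score=0.5),
]
print(predict_tss(candidates))
```

In a real discriminant analysis the weights would be estimated from training sets of promoter and non-promoter sequences; here they are fixed constants so the example stays self-contained.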
At a true promoter recognition level of approximately 50-55%, TSSG gives about one false positive prediction per 5000 bp. This accuracy is similar to that obtained by Prestridge's method on the test sequences. We estimated the accuracy of locating the TSS position on ten test genes where both our algorithm and Prestridge's found the promoter region. The results are as follows (entries give the number of genes in each range of distance between the actual and predicted TSS):
| Method/distance | <5 bp | 5-50 bp | 50-150 bp | Mean of observed distance |
| Prestridge's | 0 | 3 | 7 | 81.2 bp |
| TSSG | 7 | 3 | 0 | 7.3 bp |
Another Softberry promoter recognition program, TSSW, is based on a similar approach but uses data from an older release of Biobase's Transfac® database (E. Wingender, J. Biotech., 1994, 35, 273-280).
References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D.,
Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.
2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.
3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
1 promoter(s) were predicted
Pos.: 1820 LDF- 16.65 TATA box predicted at 1804
Transcription factor binding sites:
for promoter at position - 1820
1764 (-) S00098 AACCAAT
1608 (-) S01152 AAGTGA
1741 (+) S01153 AARKGA
1608 (-) S01153 AARKGA
1657 (+) S01090 AATGA
1617 (-) S01027 ACGCCC
1577 (+) S00534 ACGTCA
1580 (-) S00534 ACGTCA
1580 (-) S01257 ACGTCAT
..............................
First line - name of your sequence;
second and third lines - the length of the presented sequence and the LDF threshold;
fourth line - number of predicted promoter regions;
next lines - positions of predicted sites, their 'weights' (LDF scores) and the TATA box position (if found).
The position is the first nucleotide of the transcript (TSS position).
After that, the functional motifs are listed for each predicted region;
(+) or (-) indicates the direct or complementary strand; S... is the identifier of a particular motif
in the Ghosh database.
Lowercase letters denote non-conserved nucleotides in the site consensus.
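The layout described above can be read with a few regular expressions. The following is a minimal sketch that assumes the output looks exactly like the sample on this page; real TSSG output may differ in spacing or contain additional lines. The function name `parse_tssg` and the dictionary keys are hypothetical.

```python
import re

# A shortened copy of the sample output shown on this page.
SAMPLE = """\
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
1 promoter(s) were predicted
Pos.: 1820 LDF- 16.65 TATA box predicted at 1804
Transcription factor binding sites:
for promoter at position - 1820
1764 (-) S00098 AACCAAT
1608 (-) S01152 AAGTGA
1741 (+) S01153 AARKGA
"""

# Promoter line: TSS position, LDF score, optional TATA box position.
PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*([\d.]+)(?:\s+TATA box predicted at\s+(\d+))?"
)
# Header that introduces the motif list of one promoter.
SITES_HEADER_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
# Motif line: position, strand, motif identifier, consensus.
MOTIF_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+(\S+)\s*$")


def parse_tssg(text):
    """Parse TSSG-style output into a list of promoter dictionaries."""
    promoters = {}
    current = None  # promoter whose motif list is being read
    for line in text.splitlines():
        m = PROMOTER_RE.search(line)
        if m:
            pos = int(m.group(1))
            promoters[pos] = {
                "tss": pos,
                "ldf": float(m.group(2)),
                "tata": int(m.group(3)) if m.group(3) else None,
                "motifs": [],
            }
            continue
        m = SITES_HEADER_RE.search(line)
        if m:
            current = promoters.get(int(m.group(1)))
            continue
        m = MOTIF_RE.match(line)
        if m and current is not None:
            current["motifs"].append(
                (int(m.group(1)), m.group(2), m.group(3), m.group(4))
            )
    return list(promoters.values())


result = parse_tssg(SAMPLE)
print(result[0]["tss"], result[0]["ldf"], result[0]["tata"])
```

Keeping the motifs attached to the promoter they were reported for makes it straightforward to handle outputs with more than one predicted promoter.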
Letters other than A, T, G and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.
See TABLE at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1
| IUPAC Code | Meaning | Origin of Description |
| G | G | Guanine |
| A | A | Adenine |
| T | T | Thymine |
| C | C | Cytosine |
| R | G or A | puRine |
| Y | T or C | pYrimidine |
| M | A or C | aMino |
| K | G or T | Ketone |
| S | G or C | Strong interaction |
| W | A or T | Weak interaction |
| H | A or C or T | not-G, H follows G in the alphabet |
| B | G or T or C | not-A, B follows A in the alphabet |
| V | G or C or A | not-T (not-U), V follows U in the alphabet |
| D | G or A or T | not-C, D follows C in the alphabet |
| N | G or A or T or C | aNy |
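The IUPAC table above can be turned into a small motif matcher. The sketch below converts a consensus such as AARKGA into a regular expression and searches both strands. Several points are assumptions and not taken from TSSG: the function names are hypothetical, lowercase (non-conserved) letters are treated like uppercase ones, and a minus-strand hit is reported by the 1-based position of its leftmost nucleotide on the given strand, which may not match the coordinate convention TSSG uses.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "H": "[ACT]", "B": "[CGT]",
    "V": "[ACG]", "D": "[AGT]", "N": "[ACGT]",
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(seq, motif):
    """Return sorted (position, strand) hits of an IUPAC motif in seq.

    Positions are 1-based and refer to the leftmost nucleotide of the
    hit on the given (forward) sequence; this convention is an assumption.
    """
    seq = seq.upper()
    core = "".join(IUPAC[ch] for ch in motif.upper())
    # A lookahead lets finditer report overlapping occurrences too.
    pattern = re.compile(f"(?=({core}))")
    n = len(seq)
    hits = [(m.start() + 1, "+") for m in pattern.finditer(seq)]
    rc = reverse_complement(seq)
    for m in pattern.finditer(rc):
        # Map the start on the reverse complement back to forward coordinates.
        hits.append((n - m.start() - len(motif) + 1, "-"))
    return sorted(hits)


print(find_motif("TTAAGGGATT", "AARKGA"))   # AAGGGA on the direct strand
print(find_motif("GGTCACTTGG", "AARKGA"))   # AAGTGA on the complementary strand
```

Translating the codes to character classes keeps the matcher short, at the cost of ignoring any weighting of conserved versus non-conserved positions.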