The algorithm predicts potential transcription start positions using a linear discriminant function that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSW uses a file of selected factor binding sites from the currently supported functional site database (E. Wingender, J. of Biotechnology, 1994, 35, 273-280).
At a true promoter recognition level of approximately 50-55%, TSSW gives about one false positive prediction per 4000 bp.
Another promoter recognition program, TSSG, uses the promoter.dat file with selected factor binding sites (TFD, Ghosh, 1993).
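The false positive rate quoted above can be turned into a rough per-sequence estimate. The sketch below is only an illustration: the function name is hypothetical, and it assumes false positives are spread uniformly along the sequence, which the source does not state.

```python
def expected_false_positives(sequence_length_bp, bp_per_false_positive=4000):
    """Rough expected number of false positive promoter calls in one sequence.

    Assumption (not from TSSW itself): the quoted rate of about one false
    positive per 4000 bp applies uniformly over the whole sequence.
    """
    return sequence_length_bp / bp_per_false_positive


if __name__ == "__main__":
    # Length of the sample sequence HSCALCAC shown in the output example.
    print(round(expected_false_positives(7637), 2))
```

An estimate like this only indicates how many spurious calls to expect on average at the 50-55% recognition level; it says nothing about which individual predictions are false.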
References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D.,
Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.
2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.
3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
2 promoter(s) were predicted
Pos.: 1834 LDF- 11.08 TATA box predicted at 1804
Pos.: 7031 LDF- 4.64 TATA box predicted at 7001
Transcription factor binding sites:
for promoter at position - 1834
1752 (+) CHICK$ACRA CCGCCC
1762 (-) HS$BAC_03 CCAAT
1764 (-) RAT$ALBU_2 AACCAAT
1757 (-) HS$APOE_08 GGGCGG
1575 (+) HS$ACHGON_ TGACGTCA
1582 (-) HS$ACHGON_ TGACGTCA
1758 (+) MOUSE$A21C ATTGG
1745 (+) MOUSE$A21C gcccagccctcccATTGGtggagacg
1609 (+) Y$CYC1_09 ctcatttggcgagcGTTGGt
1724 (+) AD$E2L_04 TGACgcA
1577 (+) AD$E4_16 ACGTCA
1580 (-) AD$E4_16 ACGTCA
1580 (-) AD$E4_18 ACGTCAT
1655 (+) HS$EGFR_15 TCAAT
..............................
First line - name of your sequence.
Second and third lines - the length of the presented sequence and the LDF threshold.
Fourth line - the number of predicted promoter regions.
Next lines - positions of the predicted sites, their 'weights' (the LDF values) and the TATA box position (if found).
The position is the first nucleotide of the transcript (TSS position).
After that, the functional motifs are listed for each predicted region.
(+) or (-) indicates the direct or complementary chain; S... denotes a particular motif identifier from the Wingender database.
Lowercase letters denote non-conserved nucleotides in the site consensus.
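The layout described above can be read programmatically. The following is a minimal sketch, not part of TSSW: the function name `parse_tssw` and the regular expressions are assumptions based only on the sample output shown here, and real output may differ in spacing or wording.

```python
import re

# Shortened copy of the sample output, one record per line (layout assumed).
SAMPLE = """\
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
2 promoter(s) were predicted
Pos.: 1834 LDF- 11.08 TATA box predicted at 1804
Pos.: 7031 LDF- 4.64 TATA box predicted at 7001
Transcription factor binding sites:
for promoter at position - 1834
1752 (+) CHICK$ACRA CCGCCC
1762 (-) HS$BAC_03 CCAAT
1745 (+) MOUSE$A21C gcccagccctcccATTGGtggagacg
"""

PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*([\d.]+)(?:\s+TATA box predicted at\s+(\d+))?"
)
OWNER_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
SITE_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+([A-Za-z]+)\s*$")


def parse_tssw(text):
    """Return {tss_position: {"ldf", "tata", "sites"}} from TSSW-like text."""
    promoters = {}
    current = None  # TSS position whose binding sites are being read
    for line in text.splitlines():
        m = PROMOTER_RE.search(line)
        if m:
            tss = int(m.group(1))
            promoters[tss] = {
                "ldf": float(m.group(2)),
                # The TATA box is reported only if one was found.
                "tata": int(m.group(3)) if m.group(3) else None,
                "sites": [],
            }
            continue
        m = OWNER_RE.search(line)
        if m:
            current = int(m.group(1))
            continue
        m = SITE_RE.match(line)
        if m and current in promoters:
            # (position, strand, motif identifier, site sequence)
            promoters[current]["sites"].append(
                (int(m.group(1)), m.group(2), m.group(3), m.group(4))
            )
    return promoters


if __name__ == "__main__":
    for tss, info in parse_tssw(SAMPLE).items():
        print(tss, info["ldf"], info["tata"], len(info["sites"]))
```

The site sequence is kept with its original letter case, so the lowercase (non-conserved) positions remain distinguishable after parsing.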
Letters other than A, T, G and C describe ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.
See the table at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1
IUPAC Code | Meaning | Origin of Description |
G | G | Guanine |
A | A | Adenine |
T | T | Thymine |
C | C | Cytosine |
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Ketone |
S | G or C | Strong interaction |
W | A or T | Weak interaction |
H | A or C or T | not-G, H follows G in the alphabet |
B | G or T or C | not-A, B follows A in the alphabet |
V | G or C or A | not-T (not-U), V follows U in the alphabet |
D | G or A or T | not-C, D follows C in the alphabet |
N | G or A or T or C | aNy |
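
The table above maps directly onto regular-expression character classes. The sketch below is an illustration only: the helper names `iupac_to_regex` and `find_motif` are hypothetical and not part of TSSW, and the search covers the direct chain only. Both motif and sequence are upper-cased first, so the lowercase (non-conserved) marking is ignored here.

```python
import re

# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif into a regular-expression pattern string."""
    parts = []
    for ch in motif.upper():
        bases = IUPAC[ch]  # raises KeyError for a non-IUPAC character
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(iupac_to_regex("TGACGYMA"))          # made-up motif for illustration
    print(find_motif("CCAAT", "GGCCAATCCAAT"))
```

To search the complementary chain as well, the sequence would first have to be reverse-complemented; that step is left out to keep the sketch short.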