TSSW - Recognition of human PolII promoter region and start of transcription

Method description:

The algorithm predicts potential transcription start positions using a linear discriminant function that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSW uses a file of selected factor binding sites from the currently supported functional site database (E. Wingender, J. of Biotechnology, 1994, 35, 273-280).
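As a rough illustration of how a linear discriminant function combines several site characteristics into one score, here is a minimal sketch. The feature names, weights and the ">= threshold" decision rule are hypothetical assumptions made for the example; they are not TSSW's actual parameters or code.

```python
# Illustrative sketch only: feature names and weights are hypothetical,
# not the values trained for TSSW.

def ldf_score(features, weights):
    """Return the weighted sum of feature values (a linear discriminant score)."""
    return sum(weights[name] * value for name, value in features.items())


def is_predicted_promoter(features, weights, threshold):
    """Assumed decision rule: report the position if the score reaches the threshold."""
    return ldf_score(features, weights) >= threshold


# Hypothetical characteristics of one candidate start position.
weights = {"tata_box_score": 2.0, "motif_density": 1.5, "hexamer_composition": 1.0}
features = {"tata_box_score": 1.2, "motif_density": 0.8, "hexamer_composition": 0.5}

score = ldf_score(features, weights)                       # approximately 4.1
reported = is_predicted_promoter(features, weights, 4.00)  # True for this example
```

The threshold of 4.00 used here is the value shown in the output example; everything else is invented for illustration.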

At a true promoter recognition level of approximately 50-55%, TSSW gives about one false positive prediction per 4000 bp.
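To get a feel for what this rate means for a given sequence, the following back-of-the-envelope sketch estimates the expected number of false positives. It assumes false positives are spread evenly along the sequence, which is a simplification; the function name is made up for this example.

```python
def expected_false_positives(sequence_length_bp, bp_per_false_positive=4000):
    """Estimate false positives, assuming about one per 4000 bp on average."""
    return sequence_length_bp / bp_per_false_positive


# For a 7637 bp sequence, roughly 1.9 false positives would be expected.
estimate = expected_false_positives(7637)
```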

Another promoter recognition program, TSSG, uses the promoter.dat file with selected factor binding sites (TFD, Ghosh, 1993).

1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D., Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.

2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.

3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.

TSSW output example:

HSCALCAC     7637 bp    DNA             PRI       14-MAR-1995    
     Length of sequence-      7637
 Threshold for LDF-  4.00
     2 promoter(s)  were predicted
 Pos.:   1834 LDF- 11.08 TATA box predicted at   1804
 Pos.:   7031 LDF-  4.64 TATA box predicted at   7001
 Transcription factor binding sites:
for promoter at position -    1834
  1752 (+) CHICK$ACRA   CCGCCC
  1762 (-) HS$BAC_03    CCAAT
  1764 (-) RAT$ALBU_2   AACCAAT
  1757 (-) HS$APOE_08   GGGCGG
  1575 (+) HS$ACHGON_   TGACGTCA
  1582 (-) HS$ACHGON_   TGACGTCA
  1758 (+) MOUSE$A21C   ATTGG
  1745 (+) MOUSE$A21C   gcccagccctcccATTGGtggagacg
  1609 (+) Y$CYC1_09    ctcatttggcgagcGTTGGt
  1724 (+) AD$E2L_04    TGACgcA
  1577 (+) AD$E4_16     ACGTCA
  1580 (-) AD$E4_16     ACGTCA
  1580 (-) AD$E4_18     ACGTCAT
  1655 (+) HS$EGFR_15   TCAAT


First line - the name of your sequence;
Second and third lines - the length of the presented sequence and the LDF threshold;
Fourth line - the number of predicted promoter regions;
Next lines - positions of the predicted sites, their 'weights' (LDF scores) and the TATA box position (if found);
The position is the first nucleotide of the transcript (TSS position).
After that, functional motifs are given for each predicted region; (+) or (-) indicates the direct or complementary strand; S... means a particular motif identifier from the Wingender database.
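For downstream processing, the output layout described above can be read with a small parser. The sketch below is not part of TSSW; the function name, regular expressions and dictionary keys are assumptions based only on the output example shown here, and other TSSW versions may format lines differently.

```python
import re

# Patterns inferred from the example output; they are assumptions, not a format specification.
PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*([-\d.]+)(?:\s+TATA box predicted at\s+(\d+))?"
)
HEADER_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
SITE_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+(\S+)\s*$")


def parse_tssw(text):
    """Return a list of predicted promoters, each with its binding sites."""
    promoters = {}
    current = None
    for line in text.splitlines():
        m = PROMOTER_RE.search(line)
        if m:
            pos = int(m.group(1))
            promoters[pos] = {
                "tss": pos,
                "ldf": float(m.group(2)),
                "tata": int(m.group(3)) if m.group(3) else None,
                "sites": [],
            }
            continue
        m = HEADER_RE.search(line)
        if m:
            # Following motif lines belong to this promoter.
            current = promoters.get(int(m.group(1)))
            continue
        m = SITE_RE.match(line)
        if m and current is not None:
            current["sites"].append(
                {"pos": int(m.group(1)), "strand": m.group(2),
                 "motif_id": m.group(3), "site": m.group(4)}
            )
    return list(promoters.values())


sample = """\
 Pos.:   1834 LDF- 11.08 TATA box predicted at   1804
 Pos.:   7031 LDF-  4.64 TATA box predicted at   7001
 Transcription factor binding sites:
for promoter at position -    1834
  1752 (+) CHICK$ACRA   CCGCCC
  1762 (-) HS$BAC_03    CCAAT
"""
result = parse_tssw(sample)
```

Here `result[0]` describes the promoter at position 1834 with two binding sites, and `result[1]` describes the promoter at 7031 with none listed in this shortened sample.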

Lowercase letters denote non-conserved nucleotides in the site consensus.
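If only the conserved part of a reported site is of interest, the uppercase stretches can be pulled out with a one-line helper. This is a small convenience sketch; the function name is hypothetical and not part of TSSW.

```python
import re


def conserved_runs(site):
    """Return the uppercase (conserved) stretches of a site string, in order."""
    return re.findall(r"[A-Z]+", site)


core = conserved_runs("gcccagccctcccATTGGtggagacg")  # ["ATTGG"]
split_core = conserved_runs("TGACgcA")               # ["TGAC", "A"]
```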

Letters other than A, T, G and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.

See the table below:

IUPAC Code   Meaning            Origin of Description
G            G                  Guanine
A            A                  Adenine
T            T                  Thymine
C            C                  Cytosine
R            G or A             puRine
Y            T or C             pYrimidine
M            A or C             aMino
K            G or T             Ketone
S            G or C             Strong interaction
W            A or T             Weak interaction
H            A or C or T        not-G, H follows G in the alphabet
B            G or T or C        not-A, B follows A in the alphabet
V            G or C or A        not-T (not-U), V follows U in the alphabet
D            G or A or T        not-C, D follows C in the alphabet
N            G or A or T or C   aNy
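The table can be turned into a simple motif matcher by expanding each IUPAC code into a regular-expression character class. The sketch below is an illustration, not TSSW's own matching procedure; the function names are hypothetical, and lowercase (non-conserved) letters are treated the same as uppercase ones, which is a simplification.

```python
import re

# Nucleotides represented by each IUPAC code, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Expand an IUPAC motif into a regular expression of character classes."""
    return "".join("[" + IUPAC[base] + "]" for base in motif.upper())


def find_motif(motif, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


hits = find_motif("CCAWT", "GGCCAATCCATT")  # W matches A or T, so hits == [3, 8]
```

Sites reported on the (-) strand would additionally require searching the reverse complement of the sequence, which this sketch does not do.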