Short description of the computation of nucleotide frequency matrices for
various promoter elements
To obtain a set of unrelated promoters, a pairwise comparison of the [-50:+1]
region of 586 plant promoters (including 305 entries from the first release of
the DB) was performed, and one promoter from each pair showing more than 90%
homology was excluded from the initial collection. As a result, 10 promoters
were removed, leaving 576 unrelated promoter sequences.
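
The filtering step above can be sketched as follows. This is only a minimal illustration, assuming the [-50:+1] regions are already extracted as equal-length strings and that "homology" is measured as ungapped percent identity. The source does not specify the comparison method, and the function names and toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def filter_redundant(regions: dict[str, str], threshold: float = 0.90) -> list[str]:
    """Greedy filter: keep a promoter only if it is not more than `threshold`
    identical to any promoter that has already been kept."""
    kept: list[str] = []
    for name, seq in regions.items():
        if all(percent_identity(seq, regions[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): p2 duplicates p1 and is therefore dropped.
regions = {
    "p1": "ACGTACGTAC",
    "p2": "ACGTACGTAC",
    "p3": "TTTTACGGGG",
}
print(filter_redundant(regions))
```

A greedy pass like this depends on input order: the first member of each similar pair is the one that is kept.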
In a simple implementation of the Expectation Maximization (EM) algorithm, we
considered a motif sequence $X = (x_1, x_2, \ldots, x_l)$, where $l$ is the
motif length. If $P_i(x_i)$ is the empirical frequency of nucleotide $x_i$ at
position $i$ (computed on the previous iteration), then the weight of this
motif is computed as

$$W(X) = \log \prod_{i=1}^{l} \frac{P_i(x_i)}{0.25}$$

where 0.25 corresponds to a uniform background frequency for the four
nucleotides.
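
As an illustration of the weight formula, the following sketch scores a motif against a position frequency matrix. It assumes the natural logarithm (the source does not state the base) and strictly positive frequencies. The name `motif_weight` and the two-position toy matrix are hypothetical.

```python
import math

BACKGROUND = 0.25  # uniform background frequency per nucleotide


def motif_weight(motif: str, freqs: list[dict[str, float]]) -> float:
    """Log of the product of P_i(x_i) / 0.25, computed as a sum of logs."""
    if len(motif) != len(freqs):
        raise ValueError("motif length must match the number of matrix positions")
    return sum(math.log(freqs[i][nt] / BACKGROUND) for i, nt in enumerate(motif))


# Toy matrix (hypothetical): position 1 favours T, position 2 favours A.
freqs = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
print(motif_weight("TA", freqs))  # positive: both nucleotides are enriched
print(motif_weight("CC", freqs))  # negative: both nucleotides are depleted
```

Summing logarithms rather than taking the log of a product avoids numerical underflow for longer motifs.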
Using the EM procedure for 10 iterations, the initial collection of 576
unrelated promoters was divided into two classes: 345 TATA and 231 TATA-less
promoters. In calculating the TATA matrices, the allowed distance between the
right boundary of the TATA-core box and the TSS was 18-40 bp, and only the
TATAWAWA core was used for calculating the weight. The TATA matrix computed
for 171 plant promoters from the first release of PlantProm DB
(http://mendel.cs.rhul.ac.uk/) was used as the initial TATA-box matrix.
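
One refinement iteration under the 18-40 bp distance constraint might look like the sketch below. It is a hard-assignment simplification (the single best window per promoter is used to re-estimate the matrix) and may differ from the procedure actually used. The distance is assumed to be the TSS index minus the index of the last core nucleotide, sequences are assumed to contain only A, C, G and T, and the pseudocount, function names and toy promoter are all hypothetical.

```python
import math

ALPHABET = "ACGT"
CORE_LEN = 8                   # length of the TATAWAWA core
MIN_DIST, MAX_DIST = 18, 40    # allowed core-to-TSS distance in bp


def score(window, matrix):
    """Log-odds weight of a window against a uniform 0.25 background."""
    return sum(math.log(matrix[i][nt] / 0.25) for i, nt in enumerate(window))


def candidate_windows(seq, tss):
    """Yield core-length windows whose last base lies 18-40 bp upstream of the TSS."""
    for dist in range(MIN_DIST, MAX_DIST + 1):
        end = tss - dist               # index of the last core nucleotide
        start = end - CORE_LEN + 1
        if start >= 0:
            yield seq[start:end + 1]


def em_iteration(promoters, matrix, pseudocount=1.0):
    """Pick the best-scoring window per promoter, then re-estimate frequencies."""
    counts = [{nt: pseudocount for nt in ALPHABET} for _ in range(CORE_LEN)]
    for seq, tss in promoters:
        windows = list(candidate_windows(seq, tss))
        if not windows:
            continue                   # promoter too short for any valid window
        best = max(windows, key=lambda w: score(w, matrix))
        for i, nt in enumerate(best):
            counts[i][nt] += 1
    return [{nt: c[nt] / sum(c.values()) for nt in ALPHABET} for c in counts]


# Toy initial matrix (hypothetical): 0.7 on the consensus base at each position.
consensus = "TATAAATA"
matrix = [{nt: (0.7 if nt == c else 0.1) for nt in ALPHABET} for c in consensus]

# Toy promoter (hypothetical): core ends at index 27, TSS at index 52 (distance 25).
seq = "G" * 20 + "TATAAATA" + "C" * 24 + "ACGT"
new_matrix = em_iteration([(seq, 52)], matrix)
print(new_matrix[0])
```

In practice the iteration would be repeated (10 times in the text above), feeding each updated matrix back in as the starting point for the next pass.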
For computation of the CCAAT-box matrix, we allowed the distance between the
right boundary of the CCAAT core and the TSS to range from 50 to 100 bp. The
CCAAT matrix computed for 131 plant promoters from the first release of
PlantProm DB (http://mendel.cs.rhul.ac.uk/) was used as the initial CCAAT-box
matrix; in accordance with the available literature data, CCAAT boxes were
identified on both DNA strands.
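
Searching both strands can be illustrated with a reverse-complement scan. The sketch below uses exact matching of the CCAAT core for brevity, whereas the text above describes matrix-based scoring. The function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_on_both_strands(seq: str, core: str = "CCAAT"):
    """Return (position, strand) for exact core matches on either strand.
    Positions are 0-based start indices on the forward strand."""
    rc_core = reverse_complement(core)  # "ATTGG" for the CCAAT core
    hits = []
    for i in range(len(seq) - len(core) + 1):
        window = seq[i:i + len(core)]
        if window == core:
            hits.append((i, "+"))
        if window == rc_core:
            hits.append((i, "-"))
    return hits


# Toy sequence (hypothetical) with one hit on each strand.
print(find_on_both_strands("GGCCAATGGATTGGCC"))
```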
A TSS-motif matrix 5 bp in length was computed, in which the 3rd nucleotide
was the annotated TSS (anTSS). No strong consensus was revealed. When the EM
approach was used to analyze all possible pentanucleotides with an assumed TSS
(asTSS) location in the range [anTSS-2:anTSS+2], the composition of asTSS
motifs was observed to differ between dicot and monocot plants, as well as
between TATA and TATA-less promoters.
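
The set of candidate pentanucleotides considered for one promoter can be enumerated as in the sketch below, which places the assumed TSS at the 3rd position of each 5 bp window. The 0-based indexing, the function name and the toy sequence are hypothetical.

```python
def candidate_tss_pentamers(seq: str, an_tss: int, shift: int = 2):
    """Return (offset, pentamer) pairs for each assumed TSS (asTSS) within
    [anTSS - shift, anTSS + shift]; the asTSS is the 3rd base of the pentamer."""
    out = []
    for as_tss in range(an_tss - shift, an_tss + shift + 1):
        start, end = as_tss - 2, as_tss + 3
        if start >= 0 and end <= len(seq):  # skip windows running off the sequence
            out.append((as_tss - an_tss, seq[start:end]))
    return out


# Toy sequence (hypothetical) with the annotated TSS at index 5.
for offset, pentamer in candidate_tss_pentamers("ACGTACGTACGT", an_tss=5):
    print(offset, pentamer)
```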
To search for statistically significant motifs of 1577 known plant regulatory
elements, the nsite program (http://linux1.softberry.com/berry.phtml) was
applied.