Short description of the computation of nucleotide frequency matrices for various promoter elements

 

To obtain a set of unrelated promoters, the region [-50:+1] of 586 plant promoters (including 305 entries from the first release of the DB) was compared pairwise, and one promoter from each pair showing more than 90% homology was excluded from the initial collection. As a result, 10 promoters were excluded from the initial set of collected promoter sequences, leaving 576 unrelated promoters.
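The redundancy filter described above can be sketched as follows. This is only an illustration under stated assumptions: "homology" is approximated here by ungapped percent identity over equal-length [-50:+1] regions, and the first promoter of each similar pair is the one kept. The function names `identity` and `filter_redundant` are hypothetical and are not taken from the original work.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_redundant(regions: dict[str, str], threshold: float = 0.90) -> list[str]:
    """Greedily keep promoters that are not more than `threshold` identical
    to any promoter already kept (assumption: input order decides which
    member of a similar pair survives)."""
    kept: list[str] = []
    for name, seq in regions.items():
        if all(identity(seq, regions[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy 10-bp regions (hypothetical data, much shorter than the real 51-bp region)
example = {
    "p1": "ACGTACGTAC",
    "p2": "ACGTACGTAC",  # identical to p1, so it is dropped
    "p3": "TTTTTTTTTT",
}
unrelated = filter_redundant(example)
```

With real data, each value would be the 51-nucleotide [-50:+1] region of one promoter.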

In a simple implementation of the Expectation Maximization (EM) algorithm, we considered a motif sequence X = (x_1, x_2, ..., x_l), where l is the motif length. If P_i(x_i) is the empirical frequency of nucleotide x_i at position i (computed in the previous iteration), and 0.25 is the frequency expected for equiprobable nucleotides, then the weight of this motif is computed as

$$W(X) = \sum_{i=1}^{l} \log \frac{P_i(x_i)}{0.25}$$
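A minimal sketch of this weight, assuming it is the sum over motif positions of the log-ratio between the positional frequency and the uniform background of 0.25. The natural logarithm is an assumption (the text does not state the base), as is the requirement that every frequency be strictly positive. The name `motif_weight` is hypothetical.

```python
import math

BACKGROUND = 0.25  # uniform nucleotide frequency used in the formula


def motif_weight(motif: str, freqs: list[dict[str, float]]) -> float:
    """Sum of log(P_i(x_i) / 0.25) over all positions i of the motif.

    `freqs[i][base]` is the frequency of `base` at position i.
    Assumption: all frequencies are > 0 (for example after adding pseudocounts).
    """
    if len(motif) != len(freqs):
        raise ValueError("motif and matrix must have the same length")
    return sum(math.log(freqs[i][base] / BACKGROUND) for i, base in enumerate(motif))


# Toy 2-position matrix (hypothetical numbers)
toy = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
w_consensus = motif_weight("TA", toy)
w_other = motif_weight("GG", toy)
```

A motif that matches the high-frequency nucleotides gets a positive weight, while one built from rare nucleotides gets a negative weight.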

 

Using the EM procedure for 10 iterations, the initial collection of 576 unrelated promoters was divided into two classes: 345 TATA and 231 TATA-less promoters. In calculating the TATA matrices, the distance between the right boundary of the TATA-core box and the TSS was allowed to vary within 18-40 bp, and only the TATAWAWA core was used for calculating the weight. The TATA matrix computed for 171 plant promoters from the first release of PlantProm DB (http://mendel.cs.rhul.ac.uk/) was used as the initial TATA-box matrix.
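The iterative refinement can be illustrated with the sketch below. It is a simplification and not the original implementation: each promoter contributes only its single best-scoring site (a hard-assignment variant of EM), the distance is taken as the TSS index minus the index of the last motif base, a pseudocount of 1 is added to avoid zero frequencies, and no rule for separating TATA from TATA-less promoters is shown because the text does not specify one. All function names are hypothetical.

```python
import math

BASES = "ACGT"


def frequency_matrix(motifs, pseudocount=1.0):
    """Positional nucleotide frequencies from equal-length motifs."""
    matrix = []
    for i in range(len(motifs[0])):
        counts = {b: pseudocount for b in BASES}
        for m in motifs:
            counts[m[i]] += 1
        total = sum(counts.values())
        matrix.append({b: c / total for b, c in counts.items()})
    return matrix


def weight(motif, matrix):
    """Sum of log(P_i(x_i) / 0.25) over motif positions."""
    return sum(math.log(matrix[i][b] / 0.25) for i, b in enumerate(motif))


def best_site(seq, tss, matrix, min_dist=18, max_dist=40):
    """Best-scoring window whose right boundary lies min_dist..max_dist bp
    upstream of the TSS; returns (weight, start, motif) or None."""
    length = len(matrix)
    best = None
    for dist in range(min_dist, max_dist + 1):
        end = tss - dist              # index of the last motif base (assumption)
        start = end - length + 1
        if start < 0:
            continue
        motif = seq[start:end + 1]
        if any(b not in BASES for b in motif):
            continue                  # skip windows with ambiguous bases
        w = weight(motif, matrix)
        if best is None or w > best[0]:
            best = (w, start, motif)
    return best


def refine(promoters, matrix, iterations=10):
    """Re-estimate the matrix from the best site of each (seq, tss) pair."""
    for _ in range(iterations):
        motifs = []
        for seq, tss in promoters:
            site = best_site(seq, tss, matrix)
            if site is not None:
                motifs.append(site[2])
        if not motifs:
            break
        matrix = frequency_matrix(motifs)
    return matrix


# Toy promoter (hypothetical): TATAAATA ends 26 bp upstream of the TSS at index 53
toy_seq = "G" * 20 + "TATAAATA" + "C" * 25 + "A" + "G" * 5
initial = frequency_matrix(["TATAAATA"])
refined = refine([(toy_seq, 53)] * 3, initial)
```

In practice the starting matrix would be the previously computed TATA matrix rather than one built from a single consensus string.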

 

To compute the CCAAT-box matrix, we allowed the distance between the right boundary of the CCAAT core and the TSS to vary within 50-100 bp. The CCAAT matrix computed for 131 plant promoters from the first release of PlantProm DB (http://mendel.cs.rhul.ac.uk/) was used as the initial CCAAT-box matrix; in accordance with the available literature data, CCAAT boxes were identified on both DNA strands.
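Searching both strands can be sketched as scoring each window and its reverse complement against the same matrix. The code below is an illustration only: it assumes that the distance is measured on forward-strand coordinates for both orientations, which the text does not state, and the names `reverse_complement` and `best_site_both_strands` are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def weight(motif, matrix):
    """Sum of log(P_i(x_i) / 0.25) over motif positions."""
    return sum(math.log(matrix[i][b] / 0.25) for i, b in enumerate(motif))


def best_site_both_strands(seq, tss, matrix, min_dist=50, max_dist=100):
    """Best window on either strand whose right boundary (forward
    coordinates, an assumption) lies min_dist..max_dist bp upstream of
    the TSS; returns (weight, start, strand) or None."""
    length = len(matrix)
    best = None
    for dist in range(min_dist, max_dist + 1):
        end = tss - dist
        start = end - length + 1
        if start < 0:
            continue
        window = seq[start:end + 1]
        if any(b not in "ACGT" for b in window):
            continue
        for strand, motif in (("+", window), ("-", reverse_complement(window))):
            w = weight(motif, matrix)
            if best is None or w > best[0]:
                best = (w, start, strand)
    return best


# Toy matrix favouring CCAAT (hypothetical numbers)
toy_matrix = [
    {b: (0.7 if b == c else 0.1) for b in "ACGT"} for c in "CCAAT"
]
# ATTGG (reverse complement of CCAAT) ends 60 bp upstream of the TSS at index 74
toy_seq = "T" * 10 + "ATTGG" + "T" * 59 + "A" + "T" * 5
hit = best_site_both_strands(toy_seq, 74, toy_matrix)
```

Here the best hit is reported on the "-" strand because the forward sequence contains ATTGG, the reverse complement of CCAAT.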

 

A TSS-motif matrix 5 bp in length was computed, with the annotated TSS (anTSS) as the 3rd nucleotide. No strong consensus was revealed. When the EM approach was used to analyze all possible pentanucleotides with an assumed TSS (asTSS) located in the range [anTSS-2:anTSS+2], the composition of asTSS motifs was found to differ between dicot and monocot plants, as well as between TATA and TATA-less promoters.
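The candidate pentanucleotides considered for each promoter can be enumerated as in the sketch below, which places the assumed TSS at the 3rd position of each 5-bp window and shifts it from anTSS-2 to anTSS+2. The function name `candidate_tss_motifs` and the 0-based indexing are assumptions made for illustration.

```python
def candidate_tss_motifs(seq: str, an_tss: int, shift: int = 2, flank: int = 2):
    """Return (offset, pentanucleotide) pairs for every assumed TSS in
    [an_tss - shift, an_tss + shift]; the assumed TSS is the middle
    (3rd) nucleotide of each 5-bp window. Indices are 0-based."""
    candidates = []
    for offset in range(-shift, shift + 1):
        as_tss = an_tss + offset
        start, end = as_tss - flank, as_tss + flank + 1
        if start < 0 or end > len(seq):
            continue  # window would run off the sequence
        candidates.append((offset, seq[start:end]))
    return candidates


# Toy sequence (hypothetical) with the annotated TSS at index 5
toy_candidates = candidate_tss_motifs("ACGTACGTACGT", 5)
```

Each candidate could then be scored against the current TSS-motif matrix, keeping the best-scoring one per promoter in each iteration.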

To search for statistically significant motifs among 1577 known plant regulatory elements, the nsite program (http://linux1.softberry.com/berry.phtml) was applied.