SoftBerry - NSITEM HELP

NsiteM Program Description
Recognition of Conserved Regulatory motifs (for RE Sets derived from ooTFD, RegsiteAN DB and RegsitePL DB)

AUTHORS: I.A.Shahmuradov & V.V.Solovyev

LAST UPDATE: 15 November 2013

VERSION: 5.2013

ACTION: Search for statistically nonrandom motifs of various (up to 30) known human/animal and plant transcription Regulatory Elements (REs) that occur simultaneously, in at least ONE copy each, in P% or more of the analyzed DNA sequences (the parameter "P" is set by the USER; default = 50%).

SEARCH CONDITIONS: (1) The Expected Mean Number of any regulatory motif found must be less than a given threshold (default: 0.05).
(2) The regulatory motif(s) occur simultaneously, in at least ONE copy each, in P% or more of the analyzed DNA sequences.
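
Condition (2) amounts to a per-RE count over the query sequences. The following is a minimal Python sketch, assuming a presence table is already available (here, the "+"/"." rows of the graphic view from the output example further down). The function names are illustrative and are not part of NsiteM.

```python
def parse_graphic_rows(rows):
    """Turn '+'/'.' rows (one per sequence) into lists of booleans."""
    return [[ch == "+" for ch in row] for row in rows]


def res_meeting_threshold(presence, p_percent=50.0):
    """Return 1-based numbers of REs found in at least p_percent of sequences.

    presence[s][m] is True when sequence s holds at least one motif of RE m.
    """
    n_seqs = len(presence)
    if n_seqs == 0:
        return []
    kept = []
    for m in range(len(presence[0])):
        hits = sum(1 for row in presence if row[m])
        if 100.0 * hits / n_seqs >= p_percent:
            kept.append(m + 1)
    return kept


rows = [".++..+.+", "++.+++++", "++.+++++", ".++..+.+", "++++++++",
        "++++++++", "++++++++", "+++++.++", "+++++.++", "++++++++"]
presence = parse_graphic_rows(rows)
print(res_meeting_threshold(presence, 80))  # all eight REs pass at 80%
print(res_meeting_threshold(presence, 90))  # [2, 8]
```

The sketch covers only condition (2); condition (1), the Expected Mean Number threshold, is not modelled here.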

Program description:

Like the Nsite and NsiteH programs, NsiteM is based on a previously proposed probabilistic approach that uses real sites and/or their IUPAC consensuses to search for putative REs in nucleotide sequences and to estimate the statistical significance of the motifs found (Shahmuradov et al., Genetika (Russ.), 1986, 22, 357-367; Solovyev,V.V. and Kolchanov,N.A. In: "Computer analysis of genetic macromolecules. Structure, Function and Evolution", World Scientific, 1993, 16-20).

The main features of the approach are as follows:
    (i) An RE may consist of a single box (a continuous DNA segment) or two boxes separated by a spacer, where only the length, not the nucleotide content, of the spacer is important for the functioning of such a composite site.
    (ii) A real RE or its IUPAC consensus contains both variable positions (where any of a certain group of nucleotides is permissible) and strictly conserved positions (where strict identity between the real site/consensus and the predicted motif is required).
    (iii) The nonequivalence of these positions should be taken into account, i.e., complete homology at conserved positions is required, while a violation of homology at variable positions is permissible.
    (iv) The homology between the RE and a motif in the query DNA sequence may be a chance event; therefore, estimating its statistical significance is of major importance. A conclusion on the functional significance of the revealed homology can be reached only if the homology is significantly nonrandom.
    (v) Characteristics such as nucleotide frequencies should not be used when describing the consensus because of its small size. Instead, one should use estimates based on the number of nucleotides of various types in the consensus.
    (vi) Although the available RE databases usually annotate a fixed distance between the two boxes of composite elements, some variability in spacer length seems to occur. Therefore, a search algorithm for composite REs should allow some limited flexibility in spacer length, relying on both the known experimental data and theoretical assumptions.
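
Features (ii) and (iii) can be illustrated with a short Python sketch. It assumes the convention stated in the user-set rules below (capital letters mark conserved positions, lower-case letters mark variable ones) and handles plain A/C/G/T only, not IUPAC codes. The function names are hypothetical; this is not the NsiteM implementation, and it omits the statistical estimate of feature (iv).

```python
def box_matches(site, window, max_mismatches):
    """Check one window of the query against one RE box.

    Upper-case letters in `site` are conserved and must match exactly;
    lower-case letters are variable and may differ, up to max_mismatches.
    """
    if len(site) != len(window):
        return False
    mismatches = 0
    for s, w in zip(site, window.upper()):
        if s.upper() == w:
            continue
        if s.isupper():
            return False  # mismatch at a conserved position
        mismatches += 1
    return mismatches <= max_mismatches


def find_box(site, query, max_mismatches):
    """Return 1-based start positions of all windows matching the box."""
    n = len(site)
    return [i + 1 for i in range(len(query) - n + 1)
            if box_matches(site, query[i:i + n], max_mismatches)]


# RE "gacACGTggc" against the motif reported with Mism.= 1 in the sample output
print(box_matches("gacACGTggc", "GACACGTGTC", 1))  # True
print(box_matches("gacACGTggc", "GACACGTGTC", 0))  # False
```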

One of the important components of this approach is the preparation of the RE data set. NsiteM uses the following 3 sets of human/animal and plant REs.

[1]: Set of human/animal REs prepared by merging and verifying (excluding REs longer than 50 bp; eliminating redundancies) RE/TF information from ooTFD (Ghosh,D. Nucleic Acids Res, 2000, 28, 308-310). It includes 8030 real/consensus REs.

[2]: Set of human/animal REs prepared by merging and verifying RE/TF information from RegsiteAN DB developed by Softberry Inc. It includes 1500 real/consensus REs.

[3]: Set of plant REs prepared by merging and verification RE/TF information from RegsitePL DB developed by Softberry Inc. The current version of this set includes 2359 real/consensus REs, but it is regularly updated.

Moreover, the user can search for motifs of REs from a custom dataset in the format described below.

Input query sequences of length 100000 nucleotides or less must be given in FASTA format.
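
A query file can be checked against this limit before submission. The sketch below is a plain-Python reader written for illustration. It assumes the 100000-nucleotide limit applies to each sequence separately, and `read_fasta` is a hypothetical helper name, not an NsiteM function.

```python
def read_fasta(text, max_len=100000):
    """Parse FASTA text into (name, sequence) pairs and enforce a length cap."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    for rec_name, seq in records:
        if len(seq) > max_len:
            raise ValueError(f"{rec_name}: {len(seq)} nt exceeds {max_len}")
    return records
```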

RULES for creating USER's set of REs:

1. USER's set must include only sequences of real REs and/or their consensuses.

2. Every real RE/consensus is described in 3 lines:
   LINE 1: Name/description of RE/consensus
   LINE 2: Sequence of RE/consensus
   LINE 3: <par1> <par2> <par3> <par4>

3. Sequence (LINE 2) may include both standard nucleotides (A/a, T/t, G/g, C/c) and any combinations of them, written according to the IUPAC abbreviations: R - A or G, Y - T or C, K - G or T, M - A or C, S - G or C, W - A or T, B - G or T or C, D - A or G or T, H - A or C or T, V - A or G or C, N - A or G or C or T.
In the case of composite REs, the 2 boxes are separated by "-".
The length of the RE/consensus sequence must not exceed 80 symbols, including "-" in the case of composite elements.
Capital letters indicate conserved nucleotides (positions) in which a mismatch is not allowed.

4. In LINE 3:
  <par1> - the maximal number of mismatches for the 1st box
  <par2> - the maximal number of mismatches for the 2nd box (for composite REs)
  If the RE contains a single box, then <par2> = 0.
  If no mismatch is allowed, then <par1> = <par2> = 0.

  <par3> - the minimal distance between the boxes of a composite RE
  <par4> - the maximal distance between the boxes of a composite RE (for single-box REs, <par3> = <par4> = 0)

All of <par1>, <par2>, <par3> and <par4> are given as INTEGERs in 4i5 format (four right-justified integer fields, each 5 characters wide).
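
The rules above can be checked mechanically before a set is submitted. The following Python sketch reads the three-line records and the fixed-width parameter line. It is an illustration only: the names `UserRE`, `parse_4i5`, `check_sequence` and `parse_user_set` are hypothetical, skipping blank lines is an assumption, and treating a blank 5-character field as 0 is an assumption modelled on common Fortran behaviour rather than documented NsiteM behaviour.

```python
from typing import NamedTuple

# Nucleotide symbols listed in rule 3 (either case is accepted).
ALLOWED = set("ACGTRYKMSWBDHVN")


class UserRE(NamedTuple):
    name: str
    sequence: str
    par1: int  # max mismatches, 1st box
    par2: int  # max mismatches, 2nd box
    par3: int  # min distance between boxes
    par4: int  # max distance between boxes


def parse_4i5(line):
    """Read four integers from fixed 5-character columns (blank -> 0)."""
    padded = line.rstrip("\r\n").ljust(20)
    fields = [padded[i:i + 5] for i in range(0, 20, 5)]
    return tuple(int(f) if f.strip() else 0 for f in fields)


def check_sequence(seq):
    """Raise ValueError if a LINE 2 sequence breaks rule 3."""
    if len(seq) > 80:
        raise ValueError("sequence longer than 80 symbols")
    if seq.count("-") > 1:
        raise ValueError("more than one '-' separator")
    bad = {ch for ch in seq if ch != "-" and ch.upper() not in ALLOWED}
    if bad:
        raise ValueError("unexpected symbols: " + "".join(sorted(bad)))


def parse_user_set(text):
    """Parse a user RE set into a list of UserRE records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("each RE needs exactly 3 lines")
    records = []
    for i in range(0, len(lines), 3):
        seq = lines[i + 1].strip()
        check_sequence(seq)
        records.append(UserRE(lines[i].strip(), seq, *parse_4i5(lines[i + 2])))
    return records
```

Slicing fixed columns, rather than splitting on whitespace, keeps the reader faithful to the 4i5 layout even when adjacent fields have no space between them.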

Example of USER's set of 3 REs:

RE 1  
agTGGcgAggcg
    2    0    0    0
RE2 
caggccTGc-CCAGctgg 
    1    1    8   10
RE 3  
RRTGTGGWWW 
    0    0    0    0
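
To make the meaning of the four parameters concrete, here is a hedged Python sketch of a composite-RE search applied to RE2 from the example. It assumes that "distance between boxes" means the number of nucleotides between the end of the 1st box and the start of the 2nd box, it handles plain A/C/G/T only, and it scans one strand. The helper names are hypothetical and the code is not taken from NsiteM.

```python
def box_hits(box, query, max_mism):
    """1-based starts where `box` matches; capital letters must match exactly."""
    hits = []
    n = len(box)
    for i in range(len(query) - n + 1):
        mism = 0
        ok = True
        for b, q in zip(box, query[i:i + n].upper()):
            if b.upper() != q:
                if b.isupper():
                    ok = False  # conserved position violated
                    break
                mism += 1
        if ok and mism <= max_mism:
            hits.append(i + 1)
    return hits


def find_composite(seq_line, query, par1, par2, par3, par4):
    """Return (start of box 1, start of box 2, spacer length) for each hit."""
    box1, box2 = seq_line.split("-")
    starts2 = set(box_hits(box2, query, par2))
    found = []
    for s1 in box_hits(box1, query, par1):
        end1 = s1 + len(box1) - 1
        for gap in range(par3, par4 + 1):
            s2 = end1 + gap + 1
            if s2 in starts2:
                found.append((s1, s2, gap))
    return found


# Synthetic query: both boxes of RE2 separated by a 9-nt spacer
query = "tt" + "caggccTGc" + "a" * 9 + "CCAGctgg" + "tt"
print(find_composite("caggccTGc-CCAGctgg", query, 1, 1, 8, 10))  # [(3, 21, 9)]
```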

NsiteM output:

Every OUTPUT file begins with a description of the Program's purpose, the Search Parameters, and the abbreviations used (when Data sets created by us are used). The next 2 lines give the name and length of the first query sequence. Finally, the names of the REs, the statistical estimates and the sequences of the motifs found are given.

For example:


  Program   NsiteM | Version 5.2013
 Search for motifs of Regulatory Elements (REs) available in a given (by USER) portion of   
       2356 QUERY Sequences | SET of REs: RegsitePDB: 2359 Plant Transcription REs [Last update: 05.02.2012; Softberry Inc
____________________________________________________________
 Search PARAMETRS:
     Expected  Mean  Number                      :  0.0500000
     Statistical Siginicance Level               :  0.9500000
     Level of homology between known RE and motif:   90%
     Variation of Distance between RE Blocks     :   20%
 NOTE: RE - Regulatory Element/Consensus   | AC - Accession No of RE in a given DB
       OS - Organism/Species   | BF - Binding Factor or One of them
       Mism. - Mismatches   | Mean. Exp. Number - Mean Expected Number   | Up.Conf.Int. - Upper Confidence Interval
============================================================


Graphic View of RE motifs found in  80%  or more of analyzed sequences


       MOTIFS
       12345678
_______________
SEQs
    1| .++..+.+
    2| ++.+++++
    3| ++.+++++
    4| .++..+.+
    5| ++++++++
    6| ++++++++
    7| ++++++++
    8| +++++.++
    9| +++++.++
   10| ++++++++


     Motifs of RE No     1  | NAME:   175. AC: RSP00175//OS: arabidopsis (Arabidopsis thaliana) /GENE: synthetic oli  | SEQ: cacgtggc
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    2    3    5    6    7    8    9   10

     Motifs of RE No     2  | NAME:   204. AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/RE: ABR  | SEQ: gacACGTggc
Found in the following   10 ( 100.0%) out of   10 analyzed sequences: 
    1    2    3    4    5    6    7    8    9   10

     Motifs of RE No     3  | NAME:   524. AC: RSP00524//OS: carrot (Daucus carota) /GENE: Dc3/RE: E4-core /BF: DPBF  | SEQ: ACACgtG
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    1    4    5    6    7    8    9   10

     Motifs of RE No     4  | NAME:   723. AC: RSP00723//OS: tobacco (Nicotiana plumbaginifolia) /GENE: rbcS 8B/RE:   | SEQ: cacgtggc
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    2    3    5    6    7    8    9   10

     Motifs of RE No     5  | NAME:  1041. AC: RSP01034//OS: arabidopsis (Arabidopsis thaliana) /GENE: RD29B/RE: ABR  | SEQ: ACGTGgC
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    2    3    5    6    7    8    9   10

     Motifs of RE No     6  | NAME:  1159. AC: RSP01151//OS: maize (Zea mays) /GENE: Em/RE: Em1a /BF: EmBP-1 (+VP1)  | SEQ: GacACGTggc
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    1    2    3    4    5    6    7   10

     Motifs of RE No     7  | NAME:  1602. AC: RSP01570//OS: rice (Oryza sativa, indica) /GENE: Synthetic oligonucle  | SEQ: cACGTGGC
Found in the following    8 (  80.0%) out of   10 analyzed sequences: 
    2    3    5    6    7    8    9   10

     Motifs of RE No     8  | NAME:  1850. AC: RSP01816//OS: tobacco (Nicotiana tabacum) /GENE: PNZIP/RE: G-box /BF:  | SEQ: gcCACGTGtc
Found in the following   10 ( 100.0%) out of   10 analyzed sequences: 
    1    2    3    4    5    6    7    8    9   10


============================================================
LIST of sequences presented in the graphic view of results

   1.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 1: GENE: F21B7.21 [At1g03600] |898875..899654 |SUPPORT |  1 exon(s) 
   		|PROD: photosystem II protein family |5"-UTR:   40 | -500:  +1 | +1 ... CDS start
   2.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 1: GENE: F4H5.23 [At1g06680] |2047878..2049417 |SUPPORT |  4 exon(s) 
   		|PROD: photosystem II oxygen-evolving complex 23 (OEC23) |5"-UTR:   61 | -500:  +1 |
   3.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 1: GENE: F9I5.11 [At1g52230] |19520367..19521199 |SUPPORT |  3 exon(s) 
   		|PROD: photosystem I subunit VI precursor |5"-UTR:   66 | -500:  +1 | +1 ... CDS st
   4.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 2: GENE: T6B20.8 [At2g30570] |complement(13025520..13027271) |SUPPORT |  2 exon(s) 
   		|PROD: photosystem II reaction center 6.1KD protein |5"-UTR:  141 | -50
   5.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 3: GENE: MSL1.18 [At3g16140] |complement(5468517..5469482) |SUPPORT |  3 exon(s) 
   		|PROD: photosystem I subunit VI precursor |5"-UTR:   61 | -500:  +1 | +1
   6.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 3: GENE: MSA6.6 [At3g21055] |complement(7376643..7377192) |SUPPORT |  1 exon(s) 
   		|PROD: photosystem II 5 kD protein precursor |5"-UTR:  114 | -500:  +1 | +
   7.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 3: GENE: F18B3.100 [At3g50820] |complement(18901857..18903407) |SUPPORT |  3 exon(s) 
   		|PROD: photosystem II oxygen-evolving complex 33 (OEC33) |5"-UTR:  11
   8.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 4: GENE: T5F17.110 [At4g28660] |14149922..14151103 |SUPPORT |  2 exon(s) 
   		|PROD: photosystem II protein W - like |5"-UTR:   92 | -500:  +1 | +1 ... CDS sta
   9.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 4: GENE: F16A16.140 [At4g28750] |complement(14202779..14203961) |SUPPORT |  3 exon(s) 
   		|PROD: photosystem I subunit PSI-E - like protein |5"-UTR:   67 | -5
  10.  EXAMPLE Group (10 genes) | Arabidopsis, Chr 5: GENE: K1F13.25 [At5g66570] |26585884..26587504 |SUPPORT |  3 exon(s) 
  		|PROD: photosystem II oxygen-evolving complex 33 (OEC33) |5"-UTR:   86 | -500:  +1
============================================================
     Motifs of REs found in  80%  or more of analyzed sequences
............................................................
   1. QUERY: > EXAMPLE Group (10 genes) | Arabidopsis, Chr 1: GENE: F21B7.21 [At1g03600] |898875..899654 |SUPPORT |  1 exon(s) 
   		|PROD: photosystem II protein family |5"-UTR:   40 | -500:  +1 | +1 ... CDS start
 Length of Query Sequence:        500 bp      | Nucleotide Frequencies:  A -  0.37   G -  0.15   T -  0.27   C -  0.21
............................................................
 RE:   204. AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/RE: ABRE/6.2 /BF: ABI5
           Found in   10 (100.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00232    Found  1
     393  GACACGTGtC      402 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00223    Found  1
     402  GACACGTGtC      393 (Mism.= 1)
............................................................
 RE:   524. AC: RSP00524//OS: carrot (Daucus carota) /GENE: Dc3/RE: E4-core /BF: DPBF-1; DPBF-2;
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.01810    Found  1
     394  ACACGTG      400 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.01313    Found  2
     401  ACACGTG      395 (Mism.= 0)
     155  ACACGTG      149 (Mism.= 0)
............................................................
 RE:  1159. AC: RSP01151//OS: maize (Zea mays) /GENE: Em/RE: Em1a /BF: EmBP-1 (+VP1)
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00185    Found  1
     393  GACACGTGtC      402 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00190    Found  1
     402  GACACGTGTC      393 (Mism.= 1)
............................................................
 RE:  1850. AC: RSP01816//OS: tobacco (Nicotiana tabacum) /GENE: PNZIP/RE: G-box /BF: 	NtbZIP
           Found in   10 (100.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00143    Found  1
     393  GaCACGTGTC      402 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00152    Found  1
     402  GaCACGTGTC      393 (Mism.= 1)
............................................................
 Totally       9 motifs of     4 different REs have been found
------------------------------------------------------------
   2. QUERY: > EXAMPLE Group (10 genes) | Arabidopsis, Chr 1: GENE: F4H5.23 [At1g06680] |2047878..2049417 |SUPPORT |  4 exon(s) 
   		|PROD: photosystem II oxygen-evolving complex 23 (OEC23) |5"-UTR:   61 | -500:  +1 |
 Length of Query Sequence:        500 bp      | Nucleotide Frequencies:  A -  0.37   G -  0.17   T -  0.26   C -  0.19
............................................................
 RE:   175. AC: RSP00175//OS: arabidopsis (Arabidopsis thaliana) /GENE: synthetic oligonucleotides/RE: ABRE /BF: ABFs
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00175    Found  1
     301  CACGTGGC      308 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.00175    Found  1
     306  CACGTGGC      299 (Mism.= 0)
............................................................
 RE:   204. AC: RSP00204//OS: arabidopsis (Arabidopsis thaliana) /GENE: AtEm6/RE: ABRE/6.2 /BF: ABI5
           Found in   10 (100.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00288    Found  1
     299  GcCACGTGGC      308 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00226    Found  1
     308  GcCACGTGGC      299 (Mism.= 1)
............................................................
 RE:   723. AC: RSP00723//OS: tobacco (Nicotiana plumbaginifolia) /GENE: rbcS 8B/RE: G-box /BF: HY5
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00175    Found  1
     301  CACGTGGC      308 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.00175    Found  1
     306  CACGTGGC      299 (Mism.= 0)
............................................................
 RE:  1041. AC: RSP01034//OS: arabidopsis (Arabidopsis thaliana) /GENE: RD29B/RE: ABRE 1/2 /BF: ABI3; ABI5; AREB1
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00921    Found  1
     302  ACGTGGC      308 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.01005    Found  1
     305  ACGTGGC      299 (Mism.= 0)
............................................................
 RE:  1159. AC: RSP01151//OS: maize (Zea mays) /GENE: Em/RE: Em1a /BF: EmBP-1 (+VP1)
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00234    Found  1
     299  GCCACGTGGC      308 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00189    Found  1
     308  GCCACGTGGC      299 (Mism.= 1)
............................................................
 RE:  1602. AC: RSP01570//OS: rice (Oryza sativa, indica) /GENE: Synthetic oligonucleotides/RE: Em1a /BF: OSBZ
           Found in    8 ( 80.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00175    Found  1
     301  CACGTGGC      308 (Mism.= 0)
 Motifs on "-" Strand: Mean Exp. Number   0.00175    Found  1
     306  CACGTGGC      299 (Mism.= 0)
............................................................
 RE:  1850. AC: RSP01816//OS: tobacco (Nicotiana tabacum) /GENE: PNZIP/RE: G-box /BF: 	NtbZIP
           Found in   10 (100.00 %) SEQs (out of    0;   0.00% SEQs with motifs of this Group)
 Motifs on "+" Strand: Mean Exp. Number   0.00148    Found  1
     299  GCCACGTGgC      308 (Mism.= 1)
 Motifs on "-" Strand: Mean Exp. Number   0.00186    Found  1
     308  GCCACGTGgC      299 (Mism.= 1)
............................................................
 Totally      14 motifs of     7 different REs have been found
------------------------------------------------------------
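
For downstream processing, the per-query motif lines can be extracted with regular expressions. The sketch below is written in Python against the layout of the example above only. The patterns are inferred from that sample rather than from a format specification, so other NsiteM versions or data sets may need adjustments, and `parse_hits` is a hypothetical helper name.

```python
import re

# Patterns inferred from the sample output shown above.
RE_LINE = re.compile(r'^\s*RE:\s*(\d+)\.\s*AC:\s*(\S+?)//')
STRAND_LINE = re.compile(
    r'Motifs on "([+-])" Strand: Mean Exp\. Number\s+([\d.]+)\s+Found\s+(\d+)')
HIT_LINE = re.compile(r'^\s*(\d+)\s+([A-Za-z]+)\s+(\d+)\s+\(Mism\.=\s*(\d+)\)')


def parse_hits(report_text):
    """Collect motif hits as dictionaries from an NsiteM-style report."""
    hits = []
    accession, strand = None, None
    for line in report_text.splitlines():
        m = RE_LINE.match(line)
        if m:
            accession, strand = m.group(2), None
            continue
        m = STRAND_LINE.search(line)
        if m:
            strand = m.group(1)
            continue
        m = HIT_LINE.match(line)
        if m and accession and strand:
            hits.append({
                "accession": accession,
                "strand": strand,
                "start": int(m.group(1)),
                "motif": m.group(2),
                "end": int(m.group(3)),
                "mismatches": int(m.group(4)),
            })
    return hits
```

In the sample, "-" strand hits are listed with the larger coordinate first, so the sketch keeps the two numbers in the order printed instead of sorting them.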
