Nsite-m description

Search for regulatory motifs conserved in several sequences.

Regulatory Elements (REs) can be taken from different databases or defined by the user (for local runs only). The program finds sites that occur in at least one copy in P% or more of the analyzed DNA sequences (in the web version, P is set to 50%). Input sequences should be in FASTA format.
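The selection rule above can be illustrated with a minimal Python sketch. It is not part of Nsite-m: the function name `conserved_res` and the input layout (a mapping from RE accession to the set of query sequence numbers that contain at least one motif) are assumptions made for illustration only. The sample data reuse accessions and sequence numbers from the example output further down this page, plus one hypothetical entry.

```python
def conserved_res(hits, n_sequences, min_percent=50.0):
    """Keep REs whose motifs occur in at least min_percent of the sequences.

    hits: dict mapping an RE identifier to the set of sequence numbers
          in which at least one copy of its motif was found (assumed layout).
    Returns a dict mapping each retained RE to its percentage of sequences.
    """
    kept = {}
    for re_id, seq_numbers in hits.items():
        percent = 100.0 * len(seq_numbers) / n_sequences
        if percent >= min_percent:
            kept[re_id] = percent
    return kept


hits = {
    "R00560": {3, 4, 5},
    "R01053": {1, 4, 5},
    "hypothetical_RE": {2},   # illustrative entry, found in one sequence only
}
print(conserved_res(hits, n_sequences=5))
# {'R00560': 60.0, 'R01053': 60.0}
```

With 5 query sequences and P = 50%, an RE must have a motif in at least 3 of them to be reported.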


Method description

Like Nsite, Nsite-m is based on a search for statistically significant matches to regulatory site consensus sequences; see the NSITE Help for a fuller description.
The main features of the approach are as follows:
(i) An RE may consist of a single box (a continuous DNA segment) or of two boxes separated by a spacer. For such a composite site, only the length of the spacer, not its nucleotide content, matters for function.
(ii) A real RE or its IUPAC consensus contains both variable positions, where any of a certain group of nucleotides is permissible, and strictly conserved positions, where the predicted motif must be identical to the real site/consensus. These positions are not equivalent and are treated differently: a complete match is required at conserved positions, while mismatches are tolerated at variable positions.
(iii) A match between an RE and a motif in the query DNA sequence may occur by chance, so estimating its statistical significance is essential. A match can be considered functionally meaningful only if it is significantly nonrandom.
(iv) Because a consensus is short, characteristics such as nucleotide frequencies should not be used to describe it. Instead, estimates should be based on the number of specific nucleotides in the consensus.
(v) Although RE databases usually annotate a fixed distance between the two boxes of a composite element, the spacer length usually varies somewhat. The search algorithm for composite REs should therefore allow limited flexibility in spacer length.
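
Points (i), (ii) and (v) can be illustrated with a short Python sketch. It is not the Nsite-m implementation: the function names `box_matches` and `find_composite` are hypothetical, positions are 0-based, and only the "+" strand is scanned. It assumes the convention from the user RE set rules on this page, where capital letters mark conserved positions and lower-case letters mark positions at which a mismatch may be tolerated.

```python
# IUPAC codes as listed in the rules for user RE sets.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def box_matches(box, window, max_mismatches):
    """Return True if window matches box.

    Upper-case box symbols are conserved: any disagreement rejects the window.
    Lower-case box symbols are variable: disagreements are counted and
    accepted up to max_mismatches.
    """
    if len(box) != len(window):
        return False
    mismatches = 0
    for symbol, base in zip(box, window.upper()):
        if base in IUPAC[symbol.upper()]:
            continue
        if symbol.isupper():
            return False
        mismatches += 1
    return mismatches <= max_mismatches


def find_composite(box1, box2, seq, mm1, mm2, min_gap, max_gap):
    """Return (start, gap) pairs where box1 and box2 match with an allowed spacer."""
    found = []
    for start in range(len(seq) - len(box1) + 1):
        if not box_matches(box1, seq[start:start + len(box1)], mm1):
            continue
        for gap in range(min_gap, max_gap + 1):
            start2 = start + len(box1) + gap
            if start2 + len(box2) > len(seq):
                break
            if box_matches(box2, seq[start2:start2 + len(box2)], mm2):
                found.append((start, gap))
    return found


print(box_matches("GGtCA", "GGACA", 1))   # True: one mismatch at a variable position
print(box_matches("GGtCA", "GCTCA", 1))   # False: mismatch at a conserved position
print(find_composite("ACGT", "TTGA", "CCACGTAAATTGACC", 0, 0, 2, 4))
# [(2, 3)]
```

The spacer content ("AAA" in the last call) is never inspected; only its length is checked against the allowed range.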

The expected occurrence of each regulatory motif found must be less than a given percentage (default: 5%).
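
This page does not give the formula behind the expected number; the NSITE Help is the place to look for it. Purely as an illustration of what such an estimate can look like, the sketch below computes the expected number of chance matches of a single box on one strand. It assumes independent positions and fixed nucleotide frequencies for the query, which is a simplification chosen here and not necessarily the model used by Nsite-m. The names `match_probability` and `expected_number` are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(box, freqs, max_mismatches):
    """Probability that a random window matches box (assumed independence model).

    Upper-case symbols must match; lower-case symbols may mismatch, with at
    most max_mismatches mismatches in total.
    """
    # dist[k]: probability that all conserved positions seen so far match
    # and exactly k variable positions mismatch.
    dist = [1.0] + [0.0] * max_mismatches
    for symbol in box:
        p = sum(freqs[base] for base in IUPAC[symbol.upper()])
        if symbol.isupper():
            dist = [d * p for d in dist]
        else:
            new = [0.0] * (max_mismatches + 1)
            for k, d in enumerate(dist):
                new[k] += d * p
                if k < max_mismatches:
                    new[k + 1] += d * (1.0 - p)
            dist = new
    return sum(dist)


def expected_number(box, seq_len, freqs, max_mismatches=0):
    """Expected count of chance matches over all windows of one strand."""
    windows = max(seq_len - len(box) + 1, 0)
    return windows * match_probability(box, freqs, max_mismatches)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_number("ACGT", 259, uniform))   # 1.0
```

In the last line, a fully conserved 4-letter box has a chance probability of 1/256 per window, and a 259-nt sequence offers 256 windows, hence an expected count of 1.0. A motif whose expected count exceeds the chosen threshold would be discarded as likely random.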
The program currently uses TRANSFAC human/animal and plant datasets (3587 and ~600 real sites/consensuses, respectively). Users can also search for motifs of REs from their own dataset, in the format described below.

Nsite-m output

The output file begins with a header giving the program name and the search parameters and, when our datasets are used, the abbreviations used. The next two lines give the name and length of the first query sequence. The statistical analysis of the search results follows. Finally, the names of the REs, their statistical estimates, and the sequences of the motifs found are listed.

Notice: The Nsite group of programs uses the provided databases of regulatory motifs.
You can use the attached databases:
Plant REGSITE database (the current RegSite release contains 2286 motifs and more descriptive fields than PLACE, including several fields on expression and some others)
or the external TFD database. TFD is not part of the Softberry software and should be referenced as:
Ghosh D. (2000) Object-oriented transcription factors database (ooTFD). Nucleic Acids Res. 28(1):308-310.

 Program   Nsite-m: Search for Motif Patterns (Softberry Inc.)
 File with QUERY Sequences: H-H.SEQ         
     Expected  Mean  Number                 :  0.0100000
     Print  Query  Sequence                 : No 
     Special numbering of Query Sequence    : No 
     Variation of Distance between RE Blocks: No 
     Create List of Numbered Query Sequences: No 

 NOTE: RE - Regulatory Element/Consensus
       AC - Accession No of RE in TRANSFAC
       OS - Organism/Species
       BF - Binding Factor or One of them
       Mism.             - Mismatches
       Mean. Exp. Number - Mean Expected Number
       of 3587 REs in    5 SEQUENCES
 Motif(s) of  2 REs in  50 %  or more of analyzed sequences

 RE:   429. AC: R00560  OS: human  BF: CACCC-binding 
 RE:  1272. AC: R01859  OS: human  BF: CP1  

 FOUND in every of the following    3 ( 60.00 % of all) sequences:
     3    4    5
 RE:   738. AC: R01053  OS: mouse  BF: RXR-beta
 RE:  2751. AC: R03786  OS: empty  BF: PUB1  

 FOUND in every of the following    3 ( 60.00 % of all) sequences:
     1    4    5
SUMMARY: In 2 case(s)  motif(s) of  2 REs found in  50 % or more of analyzed sequences

     Motifs of REs found in  50 %  or more of analyzed sequences

   1. QUERY: >GB/U01317.1|Human HBB (H-HBB) [60137-->2500 nt]: -2000...+500

 Length of Query Sequence:       2150
 Nucleotide Frequencies:  A -  0.32   G -  0.20   T -  0.30   C -  0.17

 RE:   738. AC: R01053  OS: mouse  BF: RXR-beta                                      
           (Found in    3 ( 60.00 %) SEQs)

 Motifs on "-" Strand: Mean Exp. Number   0.00459    Found  1

     783  TGAGGTCAGcG      773 (Mism.= 1)
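
For post-processing, a motif line such as the last one in the sample above can be read with a small Python sketch. The regular expression is inferred from that single sample line only, so the exact layout is an assumption; the function name `parse_hit` and the dictionary keys are hypothetical.

```python
import re

# Pattern inferred from one sample line:
# first coordinate, motif sequence, second coordinate, "(Mism.= n)".
HIT_LINE = re.compile(
    r"^\s*(\d+)\s+([A-Za-z]+)\s+(\d+)\s+\(Mism\.=\s*(\d+)\)\s*$"
)


def parse_hit(line):
    """Return the fields of a motif line as a dict, or None if it does not match."""
    m = HIT_LINE.match(line)
    if m is None:
        return None
    first, motif, last, mismatches = m.groups()
    return {
        "first": int(first),
        "motif": motif,
        "last": int(last),
        "mismatches": int(mismatches),
    }


hit = parse_hit("     783  TGAGGTCAGcG      773 (Mism.= 1)")
print(hit)
# {'first': 783, 'motif': 'TGAGGTCAGcG', 'last': 773, 'mismatches': 1}
```

In this sample the two coordinates span 11 positions, which equals the motif length.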

RULES for creating USER RE sets:

 1. User sets must include only sequences of actual REs and/or their consensus sequences. 
 2. Every actual RE/consensus is described in three lines:
 	LINE 1: Name/description of RE/consensus    
 	LINE 2: Sequence of RE/consensus 
 	LINE 3: <par1> <par2> <par3> <par4> 

 3. Sequence (LINE2) may include both standard nucleotides (A/a, T/t, G/g,C/c)  
 and their combinations according to IUPAC abbreviations: 
 R - A or G, Y - T or C, K - G or T, M - A or C, S - G or C, 
 W - A or T, B - G or T or C, D - A or G or T, H - A or C or T, 
 V - A or G or C, N - A or G or C or T.

   In the case of composite REs, the two boxes are separated by "-". 

 The length of the RE/consensus sequence must not exceed 80 symbols, including the "-" 
 in the case of composite elements.  

 Capital letters indicate conserved nucleotides (positions), at which a mismatch 
 is not allowed.

 4. In LINE 3: <par1> - maximal number of mismatches for the first box
               <par2> - maximal number of mismatches for the second box (for
                        composite REs).
                        If the RE contains a single box, then <par2> = 0;
                        if no mismatches are allowed, then <par1> = <par2> = 0.

               <par3> - minimal distance between boxes of a composite RE
               <par4> - maximal distance between boxes of a composite RE
                        (for single-box REs, <par3> = <par4> = 0)

 All of <par1>, <par2>, <par3> and <par4> are given as INTEGERS in 4i5 format (four integers, each right-justified in a field of 5 characters). 

 Example of USER's set of 3 REs: 

 RE 1  
     2    0    0    0
     1    1    8   10
 RE 3  
     0    0    0    0
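
 The rules above can be checked before a run with a short Python sketch. It is not supplied with Nsite-m: the names `UserRE`, `parse_user_set` and `format_params` are hypothetical, and the three RE sequences in the sample text are invented for illustration (only the parameter lines are taken from the example above). The parser reads LINE 3 leniently as four whitespace-separated integers, while `format_params` writes the strict 4i5 layout.

```python
from dataclasses import dataclass

IUPAC_SYMBOLS = set("ACGTRYKMSWBDHVN")


@dataclass
class UserRE:
    name: str
    box1: str
    box2: str        # empty string for a single-box RE
    max_mism1: int   # <par1>
    max_mism2: int   # <par2>
    min_dist: int    # <par3>
    max_dist: int    # <par4>


def parse_user_set(text):
    """Parse three-line RE records and check them against the rules above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("each RE must be described in exactly three lines")
    records = []
    for i in range(0, len(lines), 3):
        name = lines[i].strip()
        sequence = lines[i + 1].strip()
        if len(sequence) > 80:
            raise ValueError(f"{name}: sequence longer than 80 symbols")
        box1, dash, box2 = sequence.partition("-")
        if not box1 or (dash and not box2):
            raise ValueError(f"{name}: empty box")
        if not set((box1 + box2).upper()) <= IUPAC_SYMBOLS:
            raise ValueError(f"{name}: symbol outside the IUPAC alphabet")
        params = [int(field) for field in lines[i + 2].split()]
        if len(params) != 4:
            raise ValueError(f"{name}: LINE 3 needs four integers")
        mm1, mm2, dmin, dmax = params
        if not dash and (mm2, dmin, dmax) != (0, 0, 0):
            raise ValueError(f"{name}: single-box RE needs par2 = par3 = par4 = 0")
        if dmin > dmax:
            raise ValueError(f"{name}: minimal distance exceeds maximal distance")
        records.append(UserRE(name, box1, box2, mm1, mm2, dmin, dmax))
    return records


def format_params(record):
    """Write LINE 3 as four integers, each right-justified in 5 characters."""
    values = (record.max_mism1, record.max_mism2, record.min_dist, record.max_dist)
    return "".join(f"{value:5d}" for value in values)


# Sequences below are hypothetical; parameter lines follow the example above.
example = """\
RE 1
tgACGtca
    2    0    0    0
RE 2
GGtCA-tGACC
    1    1    8   10
RE 3
TATAAA
    0    0    0    0
"""

for record in parse_user_set(example):
    print(record.name, record.box1, record.box2 or "(single box)", format_params(record))
```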