Search for regulatory motifs conserved in several sequences.
Regulatory elements (REs) can be taken from several databases or defined by the user (local runs only). The program finds sites that occur in at least one copy in P% or more of the analyzed DNA sequences (in the web version, P is set to 50%). Input sequences should be in FASTA format, for example:
>test1
AAAAAAAAA
GGCCCCCCC
>test2
ACCCTTTTTC
CCCCCCCCCC
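As an illustration of the input format and the P% rule, here is a minimal Python sketch. It is not part of Nsite-m: the function names `read_fasta` and `shared_sites` and the `hits_by_site` mapping (site name to the set of sequence numbers where it was found) are assumptions made for this example.

```python
def read_fasta(text):
    """Parse FASTA text into a dict {name: sequence} (sequences upper-cased)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def shared_sites(hits_by_site, n_sequences, p_percent=50.0):
    """Keep only sites found in at least p_percent of the analyzed sequences.

    hits_by_site is a hypothetical mapping: site name -> set of sequence numbers.
    """
    return {
        site: seqs
        for site, seqs in hits_by_site.items()
        if len(seqs) * 100.0 >= p_percent * n_sequences
    }
```

With 5 query sequences and P = 50%, a site must therefore be present in at least 3 of them to be reported.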
Like Nsite, Nsite-m is based on a search for statistically significant regulatory site consensuses; see the NSITE Help for a fuller description.
The main features of the approach are as follows:
(i) An RE may consist of a single box (a continuous DNA segment) or of two boxes separated by a spacer. For such a composite site, only the length of the spacer, not its nucleotide content, matters for function.
(ii) A real RE or its IUPAC consensus contains both variable positions, where any of a certain group of nucleotides is permissible, and strictly conserved positions, where strict identity between the real site/consensus and the predicted motif is required. This nonequivalence must be taken into account: complete identity is required at conserved positions, whereas mismatches at variable positions are permitted.
(iii) The similarity between an RE and a motif in the query DNA sequence may arise by chance, so estimating its statistical significance is essential. A conclusion about the functional significance of a detected similarity can be drawn only if the similarity is significantly nonrandom.
(iv) Characteristics such as nucleotide frequencies should not be used to describe a consensus because of its small size. Instead, one should use estimates based on the number of specific nucleotides in the consensus.
(v) Although the available RE databases usually annotate a fixed distance between the two boxes of a composite element, the spacer length usually varies somewhat. The search algorithm for composite REs should therefore allow limited flexibility in spacer length.
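Features (i), (ii) and (v) can be illustrated with a short Python sketch. It is only a sketch of the idea, not the Nsite-m implementation: the names `box_mismatches` and `find_composite` are hypothetical, only the forward strand is scanned, and upper-case letters are taken to mark conserved positions, as in the user-set rules of this help page.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def box_mismatches(box, window):
    """Count mismatches at variable (lower-case) positions.

    Returns None if any conserved (upper-case) position does not match.
    """
    mismatches = 0
    for symbol, base in zip(box, window):
        if base in IUPAC[symbol.upper()]:
            continue
        if symbol.isupper():
            return None  # a conserved position must match exactly
        mismatches += 1
    return mismatches


def find_composite(seq, box1, box2, max_mm1, max_mm2, min_gap, max_gap):
    """Find two-box sites whose spacer length lies in [min_gap, max_gap].

    Returns a list of (start, gap, mismatches_box1, mismatches_box2), 0-based.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(box1) + 1):
        mm1 = box_mismatches(box1, seq[start:start + len(box1)])
        if mm1 is None or mm1 > max_mm1:
            continue
        for gap in range(min_gap, max_gap + 1):
            start2 = start + len(box1) + gap
            window = seq[start2:start2 + len(box2)]
            if len(window) < len(box2):
                break  # second box would run past the end of the sequence
            mm2 = box_mismatches(box2, window)
            if mm2 is not None and mm2 <= max_mm2:
                hits.append((start, gap, mm1, mm2))
    return hits
```

The spacer content is never inspected; only its length is varied, which mirrors the statement that the nucleotide content of the spacer is not important.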
The expected occurrence of each regulatory motif found must be less than a given percentage (default: 5%).
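To give a feel for what an expected number of chance matches means, the following Python sketch computes it under a deliberately simple model: independent positions drawn from fixed background nucleotide frequencies. This model, and the names `match_probability` and `expected_hits`, are assumptions for illustration only; the statistic actually used by Nsite-m is described in the NSITE Help and may differ.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(box, max_mismatches, freqs):
    """Probability that a random window matches a single box.

    Conserved (upper-case) positions must match; at most max_mismatches
    mismatches are tolerated at variable (lower-case) positions.
    Assumes independent positions with background frequencies freqs.
    """
    # dist[k] = probability of exactly k mismatches so far, no conserved failure
    dist = [1.0] + [0.0] * max_mismatches
    for symbol in box:
        p = sum(freqs[b] for b in IUPAC[symbol.upper()])
        new = [0.0] * (max_mismatches + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * p
            if symbol.islower() and k < max_mismatches:
                new[k + 1] += prob * (1.0 - p)
        dist = new
    return sum(dist)


def expected_hits(box, max_mismatches, seq_length, freqs, strands=1):
    """Expected number of chance matches in a sequence of length seq_length.

    strands=2 reuses the same probability for the reverse strand, which is
    only a rough approximation when base frequencies are not symmetric.
    """
    windows = max(seq_length - len(box) + 1, 0)
    return strands * windows * match_probability(box, max_mismatches, freqs)
```

A motif whose expected number is well below 1 for a given query length is unlikely to appear by chance alone under this model, which is the intuition behind reporting only motifs below a threshold.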
The program currently uses TRANSFAC human/animal and plant datasets (3587 and ~600 real sites/consensuses, respectively). Users can also search for motifs of REs from their own dataset, in the format described below.
The output file begins with a description of the program, the search parameters and, when our datasets are used, the abbreviations used. The next two lines give the name and length of the first query sequence. The statistical analysis of the search results is presented next. Finally, the names of the REs, their statistical estimates, and the sequences of the motifs found are given.
Note: the Nsite group of programs uses the supplied databases of regulatory motifs.
You can use the attached databases:
Plant REGSITE database (the current RegSite release contains 2286 motifs and more descriptive fields than PLACE, including several fields on expression, among others)
or the external TFD database. TFD is not part of the Softberry software and should be cited as:
Ghosh D. (2000) Object-oriented transcription factors database (ooTFD). Nucleic Acids Res. 28(1):308-310.
Program Nsite-m: Search for Motif Patterns (Softberry Inc.)
____________________________________________________________
File with QUERY Sequences: H-H.SEQ
Search PARAMETERS:
   Expected Mean Number                   : 0.0100000
   Print Query Sequence                   : No
   Special numbering of Query Sequence    : No
   Variation of Distance between RE Blocks: No
   Create List of Numbered Query Sequences: No
NOTE: RE - Regulatory Element/Consensus
      AC - Accession No of RE in TRANSFAC
      OS - Organism/Species
      BF - Binding Factor or One of them
      Mism. - Mismatches
      Mean. Exp. Number - Mean Expected Number
============================================================
STATISTICAL ANALYSIS of RESULTS of SEARCH of MOTIFS of 3587 REs in 5 SEQUENCES
============================================================
Motif(s) of 2 REs in 50 % or more of analyzed sequences
RE: 429. AC: R00560 OS: human BF: CACCC-binding
ctccacccatggg
RE: 1272. AC: R01859 OS: human BF: CP1
gccttgaccaat
FOUND in every of the following 3 ( 60.00 % of all) sequences:
3 4 5
............................................................
RE: 738. AC: R01053 OS: mouse BF: RXR-beta
tgaggtcaggg
RE: 2751. AC: R03786 OS: empty BF: PUB1
tttatttatgttttcttctgca
FOUND in every of the following 3 ( 60.00 % of all) sequences:
1 4 5
____________________________________________________________
SUMMARY: In 2 case(s) motif(s) of 2 REs found in 50 % or more of analyzed sequences
==================================================
Motifs of REs found in 50 % or more of analyzed sequences
............................................................
1. QUERY: >GB/U01317.1|Human HBB (H-HBB) [60137-->2500 nt]: -2000...+500
Length of Query Sequence: 2150
Nucleotide Frequencies: A - 0.32 G - 0.20 T - 0.30 C - 0.17
............................................................
RE: 738. AC: R01053 OS: mouse BF: RXR-beta (Found in 3 ( 60.00 %) SEQs)
Motifs on "-" Strand: Mean Exp. Number 0.00459 Found 1
783 TGAGGTCAGcG 773 (Mism.= 1)
==============================================================================
RULES for creating USER RE sets:
1. User sets must include only sequences of actual REs and/or their consensus sequences.
2. Every actual RE/consensus is described in three lines:
   LINE 1: Name/description of RE/consensus
   LINE 2: Sequence of RE/consensus
   LINE 3: <par1> <par2> <par3> <par4>
3. The sequence (LINE 2) may include both standard nucleotides (A/a, T/t, G/g, C/c) and their combinations according to IUPAC abbreviations:
   R - A or G, Y - T or C, K - G or T, M - A or C, S - G or C, W - A or T,
   B - G or T or C, D - A or G or T, H - A or C or T, V - A or G or C,
   N - A or G or C or T.
   In composite REs, the two boxes are separated by "-". The length of the RE/consensus sequence must not exceed 80 symbols, including the "-" in composite elements. Capital letters indicate conserved nucleotides (positions) in which a mismatch is not allowed.
4. In LINE 3:
   <par1> - maximal number of mismatches for the first box
   <par2> - maximal number of mismatches for the second box (for composite REs). If the RE contains a single box, then <par2> = 0. If no mismatches are allowed, then <par1> = <par2> = 0.
   <par3> - minimal distance between boxes of a composite RE
   <par4> - maximal distance between boxes of a composite RE (for single-box REs, <par3> = <par4> = 0)
   All of <par1>, <par2>, <par3> and <par4> are given as INTEGERS in 4i5 format.

Example of a USER set of 3 REs:
RE 1
agTGGcgAggcg
    2    0    0    0
RE2
caggccTGc-CCAGctgg
    1    1    8   10
RE 3
RRTGTGGWWW
    0    0    0    0
------------------------------------------------------------------------
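The three-line record layout can be checked before a run with a small script. The Python sketch below is a hypothetical helper, not a Softberry tool: `RegulatoryElement`, `parse_re_set` and `format_params` are names invented here, and the parser assumes the four integers on LINE 3 are separated by whitespace, which holds for 4i5 output as long as each value has fewer than five digits.

```python
from dataclasses import dataclass


@dataclass
class RegulatoryElement:
    name: str
    boxes: list            # one box, or two boxes for a composite RE
    max_mismatches: tuple  # (<par1>, <par2>)
    min_gap: int           # <par3>
    max_gap: int           # <par4>


def parse_re_set(text):
    """Parse a user RE set written as three lines per RE; blank lines are skipped."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("each RE must be described in exactly three lines")
    elements = []
    for i in range(0, len(lines), 3):
        name, sequence, params = lines[i], lines[i + 1], lines[i + 2]
        if len(sequence) > 80:
            raise ValueError(f"{name}: sequence longer than 80 symbols")
        boxes = sequence.split("-")
        if len(boxes) not in (1, 2):
            raise ValueError(f"{name}: expected one or two boxes")
        mm1, mm2, min_gap, max_gap = (int(v) for v in params.split())
        elements.append(RegulatoryElement(name, boxes, (mm1, mm2), min_gap, max_gap))
    return elements


def format_params(mm1, mm2, min_gap, max_gap):
    """Write LINE 3 in 4i5 format: four integers, each right-justified in 5 columns."""
    return "".join(f"{value:5d}" for value in (mm1, mm2, min_gap, max_gap))
```

Applied to the example set, the second record yields two boxes, `caggccTGc` and `CCAGctgg`, with one mismatch allowed in each and a spacer of 8 to 10 nucleotides.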