AUTHORS: I.A.Shahmuradov & V.V.Solovyev
LAST UPDATE: 15 November 2013
VERSION: 5.2013
ACCESS: http://softberry.com
ACTION: Search for statistically nonrandom motifs of known human/animal and plant transcription Regulatory Elements (REs) in DNA sequences.
SEARCH CONDITION: The Expected Mean Number of any regulatory motif found must be less than a given threshold (default: 0.05).
Like the NsiteH and NsiteM programs, Nsite is based on a previously proposed probabilistic approach that searches nucleotide sequences for putative REs using real sites and/or their IUPAC consensuses, and statistically evaluates the motifs found (Shahmuradov et al., Genetika (Russ.), 1986, 22, 357-367; Solovyev,V.V. and Kolchanov,N.A. In: "Computer analysis of genetic macromolecules. Structure, Function and Evolution", World Scientific, 1993, 16-20).
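The idea of statistically evaluating a motif can be illustrated with a small sketch. It is not Nsite's actual formula: it assumes independent positions, uses the nucleotide frequencies of the query sequence, treats capital letters as conserved positions and lowercase letters as variable positions, and counts a single strand only. The function names `match_probability` and `expected_count` are hypothetical.

```python
# Simplified independence model; an illustration only, not Nsite's own estimate.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(site, max_mismatches, freqs):
    """Probability that one random window matches `site`.

    Uppercase symbols are conserved (must match); lowercase symbols are
    variable and may mismatch, up to `max_mismatches` in total.
    """
    conserved = 1.0
    # dist[j] = P(exactly j mismatches among the variable positions seen so far)
    dist = [1.0]
    for symbol in site:
        q = sum(freqs[base] for base in IUPAC[symbol.upper()])
        if symbol.isupper():
            conserved *= q
        else:
            nxt = [0.0] * (len(dist) + 1)
            for j, p in enumerate(dist):
                nxt[j] += p * q              # this position matches
                nxt[j + 1] += p * (1.0 - q)  # this position mismatches
            dist = nxt
    return conserved * sum(dist[: max_mismatches + 1])


def expected_count(site, max_mismatches, freqs, seq_len):
    """Expected number of chance matches on one strand of a random sequence."""
    windows = max(seq_len - len(site) + 1, 0)
    return windows * match_probability(site, max_mismatches, freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
mean_number = expected_count("agTGGcgAggcg", 2, uniform, 520)
# A motif would be reported only if its expected number is below the threshold.
print(mean_number, mean_number < 0.05)
```

Under this model, a longer or more strictly conserved RE yields a smaller expected number, so it is more likely to pass the 0.05 threshold.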
The main features of the approach are as follows:
(i) An RE may consist of a single box (a continuous DNA segment) or of two boxes separated by a spacer; only the length of the spacer, not its nucleotide content, is important for the functioning of such a composite site.
(ii) A real RE or its IUPAC consensus contains both variable positions (where the presence of a certain group of nucleotides is permissible), and strictly conserved positions (where a strong identity between real site/consensus and predicted motif is required).
(iii) The nonequivalence of these positions should be taken into account, i.e., complete homology at conserved positions is needed, and a violation of homology in the variable positions should be permissible.
(iv) The homology between the RE and a motif in the query DNA sequence may be a chance event; therefore, estimating its statistical significance is of major importance. A conclusion on the functional significance of the revealed homology can be reached only if the homology is significantly nonrandom.
(v) Characteristics such as nucleotide frequencies should not be used when describing the consensus because of its small size. Instead, one should use estimates based on the number of nucleotides of various types in the consensus.
(vi) Although the available RE databases usually annotate a fixed distance between the two boxes of composite elements, some variability of the spacer length seems to occur. Therefore, a search algorithm for composite REs should allow limited flexibility in spacer length, relying on both known experimental data and theoretical assumptions.
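Features (i)-(iii) and (vi) can be sketched as a small matcher. This is a minimal illustration under stated assumptions, not the Nsite implementation: capital letters in a box are taken as conserved positions and lowercase letters as variable positions, the spacer range is passed in explicitly, only the given strand is scanned, and the names `box_matches` and `find_sites` are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def box_matches(box, window, max_mismatches):
    """Return the mismatch count if `window` matches `box`, else None.

    A mismatch at a conserved (uppercase) position always rejects the window;
    mismatches at variable (lowercase) positions are tolerated up to the limit.
    """
    mismatches = 0
    for symbol, base in zip(box, window.upper()):
        if base in IUPAC[symbol.upper()]:
            continue
        if symbol.isupper():
            return None
        mismatches += 1
        if mismatches > max_mismatches:
            return None
    return mismatches


def find_sites(seq, box1, mm1, box2=None, mm2=0, min_gap=0, max_gap=0):
    """Return 0-based half-open (start, end) spans of matches in `seq`."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box1) + 1):
        end1 = i + len(box1)
        if box_matches(box1, seq[i:end1], mm1) is None:
            continue
        if box2 is None:  # single-box RE
            hits.append((i, end1))
            continue
        # Composite RE: only the spacer length matters, not its content.
        for gap in range(min_gap, max_gap + 1):
            j = end1 + gap
            window = seq[j:j + len(box2)]
            if len(window) == len(box2) and box_matches(box2, window, mm2) is not None:
                hits.append((i, j + len(box2)))
    return hits


print(find_sites("TTAGTGGCGAGGCGTT", "agTGGcgAggcg", 2))
print(find_sites("ACGAATTTC", "ACG", 0, "TTT", 0, min_gap=2, max_gap=3))
```

Searching the complementary strand would amount to running the same scan on the reverse complement of the query.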
One of the important components of this approach is the preparation of the RE data set. Nsite uses the following 3 sets of human/animal and plant REs.
[1]: Set of human/animal REs prepared by merging and verification (exclusion of REs longer than 50 bp; elimination of redundancies) of RE/TF information from ooTFD (Ghosh,D. Nucleic Acids Res, 2000, 28, 308-310). It includes 8030 real/consensus REs.
[2]: Set of human/animal REs prepared by merging and verification of RE/TF information from RegsiteAN DB developed by Softberry Inc. It includes 1500 real/consensus REs.
[3]: Set of plant REs prepared by merging and verification of RE/TF information from RegsitePL DB developed by Softberry Inc. The current version of this set includes 2359 real/consensus REs, but it is regularly updated.
Moreover, the user can search for motifs of REs from their own dataset, given in the format described below.
Input query sequences of length 100000 nucleotides or less must be given in FASTA format.
1. USER's set must include only sequences of real REs and/or their consensuses.
2. Every real RE/consensus is described in 3 lines:
LINE 1: Name/description of RE/consensus
LINE 2: Sequence of RE/consensus
LINE 3: <par1> <par2> <par3> <par4>
3. Sequence (LINE 2) may include both standard nucleotides (A/a, T/t, G/g, C/c)
and any combinations of them according to IUPAC abbreviations: R - A or G,
Y - T or C, K - G or T, M - A or C, S - G or C, W - A or T, B - G or T or C,
D - A or G or T, H - A or C or T, V - A or G or C, N - A or G or C or T.
In the case of composite REs, the 2 boxes are separated by "-".
Length of RE/consensus sequence must not exceed 80 symbols, including "-" in the case of composite elements.
Capital letters indicate conserved nucleotides (positions), in which a mismatch is not allowed.
4. In LINE 3:
<par1> - a maximal number of mismatches for the 1st box
<par2> - a maximal number of mismatches for the 2nd box (for composite REs)
If the RE contains a single box, then <par2> = 0.
If no mismatch is allowed, then <par1> = <par2> = 0.
<par3> - minimal distance between boxes of composite RE
<par4> - maximal distance between boxes of composite RE (for single-box REs, <par3> = <par4> = 0)
All of <par1>, <par2>, <par3> and <par4> are given as INTEGERs in 4i5 format (four right-justified integer fields of 5 columns each).
RE 1
agTGGcgAggcg
    2    0    0    0
RE2
caggccTGc-CCAGctgg
    1    1    8   10
RE 3
RRTGTGGWWW
    0    0    0    0
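A user-defined set in this format could be read and checked before submission with something like the sketch below. It is only an illustration of the rules above, not part of Nsite: the names `ReRecord`, `parse_params` and `parse_re_set` are hypothetical, blank lines are assumed to be ignorable, and the parameter line is accepted either as four whitespace-separated integers or as strict 4i5 fixed-width fields.

```python
from dataclasses import dataclass

ALLOWED = set("ACGTRYKMSWBDHVN")  # standard nucleotides plus IUPAC codes


@dataclass
class ReRecord:
    name: str
    boxes: list            # one box, or two boxes for a composite RE
    max_mismatches: tuple  # (<par1>, <par2>)
    min_gap: int           # <par3>
    max_gap: int           # <par4>


def parse_params(line):
    """Read <par1>..<par4> from LINE 3."""
    tokens = line.split()
    if len(tokens) == 4:
        return [int(t) for t in tokens]
    # Fall back to strict 4i5: four fields of 5 columns; a blank field reads as 0.
    padded = line.ljust(20)
    return [int(padded[c:c + 5].strip() or "0") for c in range(0, 20, 5)]


def parse_re_set(text):
    """Parse 3-line RE records and validate them against the format rules."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("every RE must be described in exactly 3 lines")
    records = []
    for k in range(0, len(lines), 3):
        name = lines[k].strip()
        site = lines[k + 1].strip()
        par1, par2, par3, par4 = parse_params(lines[k + 2])
        boxes = site.split("-")
        if len(site) > 80:
            raise ValueError(f"{name}: sequence longer than 80 symbols")
        if len(boxes) > 2 or not all(boxes):
            raise ValueError(f"{name}: expected one box or two boxes joined by '-'")
        if not all(ch.upper() in ALLOWED for box in boxes for ch in box):
            raise ValueError(f"{name}: unknown nucleotide symbol")
        records.append(ReRecord(name, boxes, (par1, par2), par3, par4))
    return records


sample = (
    "RE 1\nagTGGcgAggcg\n    2    0    0    0\n"
    "RE2\ncaggccTGc-CCAGctgg\n    1    1    8   10\n"
)
for record in parse_re_set(sample):
    print(record)
```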
Every OUTPUT file begins with a description of the program,
the Search Parameters, and the abbreviations used (in the case of using
Data sets created by us).
The next 2 lines include the name and length of the first query sequence.
Finally, the names of the REs, the statistical estimates and the sequences
of the motifs found are given.
For example:

Program Nsite | Version 5.2013
Search for motifs of 2356 Regulatory Elements (REs)
SET of REs: RegsitePDB: 2359 Plant Transcription REs [Last update: 05.02.2012; Softberry Inc]
____________________________________________________________
Search PARAMETRS:
Expected Mean Number : 0.0500000
Statistical Siginicance Level : 0.9500000
Level of homology between known RE and motif: 90%
Variation of Distance between RE Blocks : 20%
NOTE: RE - Regulatory Element/Consensus | AC - Accession No of RE in a given DB
OS - Organism/Species | BF - Binding Factor or One of them
Mism. - Mismatches | Mean. Exp. Number - Mean Expected Number | Up.Conf.Int. - Upper Confidence Interval
============================================================
QUERY: >gi|3036947|dbj|AB012638.1| Nicotiana sylvestris Lhcb1*5 genes for light harvesting chlorophyll a/b-binding protein, complete cds
Right Boundary of Upstream Sequence: 520
Length of Query Sequence: 520 bp | Nucleotide Frequencies: A - 0.33 G - 0.18 T - 0.30 C - 0.18

    1 ATCCCGAACC CAAAGTTTGA AATCCTGGCT CCGCCTCTGA TCCGTGCCCC
   51 CTAAAGGCTT TTAGCATTAA GGGTGTCAAG TGATTTAAAT TAATCATTTT
  101 CAAGGTATAC ACATATACAT ATACAAGATT TCTGCTGAAG TTTACGGGTC
  151 CGCCACTGCA TACAAGTGGT CCTATTTTAC AGGTCAGCTA AATATACAGA
  201 AGTGTATGAA CAATGCATGG CCAGGAGTTA CTCTTATGCT CTGGCTAAGT
  251 TGATATTATA TTGCAATAAT CCACAATCAG ACGTGGCAAA TTTGGATTGG
  301 CTATAAGAAG GAAATCTTCA TTGGCTTAGA TTTTTTAAAC GTATAAAGTA
  351 TCTACAAAAA TCTAGTCATC TTTAACGGTG CAGAACTTTG CCAAATGGAA
  401 AAGAATGCAA AAGGTTACAA ATTGTCATCC ACCAATGGAA AAGCAGATAT
  451 AGATATTCAA GGATAAGGTA GTCTTTGGGC CTGTAAATTC ATTTATATAC
  501 ACTTAGTACA AAGCCCATAA
............................................................
RE: 35. AC: RSP00035//OS: barley (Hordeum vulgare) /GENE: Al21/RE: D1 /BF: DOF
Motifs on "+" Strand: Mean Exp. Number 0.04055 Up.Conf.Int. 1 Found 1
  -113 CAAAAGG -107 (Mism.= 0)
............................................................
RE: 248. AC: RSP00248//OS: rice (Oryza sativa), Oryza sativa /GENE: alpha-globulin/RE: REB2 /BF: REB
Motifs on "-" Strand: Mean Exp. Number 0.00424 Up.Conf.Int. 1 Found 1
  -234 GCCACGTCtG -243 (Mism.= 1)
............................................................
RE: 260. AC: RSP00260//OS: rice (Oryza sativa) /GENE: synthetic oligonucleotides/RE: RTBP1 BS /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.03589 Up.Conf.Int. 1 Found 1
  -466 TTTAGGG -472 (Mism.= 0)
............................................................
RE: 385. AC: RSP00385//OS: carnation (Dianthus caryophillus) /GENE: GST1/RE: ERE-core /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.02730 Up.Conf.Int. 1 Found 1
  -498 ATTTCAAA -505 (Mism.= 0)
............................................................
RE: 397. AC: RSP00397//OS: tobacco (Nicotiana tabacum) /GENE: RNP2/RE: CDE /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.00201 Up.Conf.Int. 1 Found 1
  -364 AGTGGCGG -371 (Mism.= 0)
............................................................
RE: 522. AC: RSP00522//OS: carrot (Daucus carota) /GENE: Dc3/RE: E2-core /BF: DPBF-1; DPBF-2;
Motifs on "-" Strand: Mean Exp. Number 0.01980 Up.Conf.Int. 1 Found 1
  -352 CCACTTG -358 (Mism.= 0)
............................................................
RE: 2222. AC: RSP02188//OS: soybean (Glycine max) /GENE: CHS8/RE: Fp31/VI /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.00344 Up.Conf.Int. 1 Found 1
  -440 ACTTGACACcC -450 (Mism.= 1)
............................................................
RE: 2322. AC: RSP02288//OS: pea (Pisum sativum) /GENE: TRX m1/RE: EE (TRX m1) /BF: CCA1
Motifs on "+" Strand: Mean Exp. Number 0.02730 Up.Conf.Int. 1 Found 1
  -71 TAGATATT -64 (Mism.= 0)
........................................................................................................
........................................................................................................
........................................................................................................
........................................................................................................
............................................................
Totally 37 motifs of 34 different REs have been found
------------------------------------------------------------