Search for PROSITE patterns with statistical estimation

The method is based on a statistical estimate of the expected number of occurrences of a PROSITE pattern in a given sequence. It uses the PROSITE database of functional motifs (author: Amos Bairoch, 1995). If a pattern is found in the sequence and its expected number of occurrences is significantly less than 1, the analyzed sequence can be supposed to possess the function associated with that pattern. The presented version 1 is the simplest one: it searches only for patterns that match a given PROSITE consensus without any deviation. Deviations will be allowed in a following version. For each pattern found, the PSite output shows the pattern, its position in the sequence, and its accession number, ID, and description in the PROSITE database, as well as the number of the PROSITE document in which the pattern's characteristics are outlined. Note that in this version, patterns defined to start at the beginning or end of a protein sequence are recognized along the whole sequence. This may be useful for the analysis of ORFs or 6-frame translations.
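The idea of an expected number of occurrences can be illustrated with a minimal sketch. PSite's own formula is not given here, so the model below is an assumption: residues are treated as independent, with frequencies taken from the analyzed sequence itself, and only fixed-length patterns are handled. The function names `parse_pattern`, `expected_number`, and `chance_match_probability` are hypothetical and are not part of PSite.

```python
import math
import re
from collections import Counter

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a fixed repeat count such as (3).
_TOKEN = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")

def parse_pattern(pattern):
    """Expand a fixed-length PROSITE pattern into one element per position.

    Each element is ("any", None), ("in", residues) or ("not", residues).
    Variable-length elements such as x(2,4) are not supported in this sketch.
    """
    positions = []
    for token in pattern.rstrip(".").split("-"):
        m = _TOKEN.fullmatch(token)
        if m is None:
            raise ValueError(f"unsupported pattern element: {token!r}")
        body, repeat = m.group(1), int(m.group(2) or 1)
        if body == "x":
            elem = ("any", None)
        elif body.startswith("["):
            elem = ("in", set(body[1:-1]))
        elif body.startswith("{"):
            elem = ("not", set(body[1:-1]))
        else:
            elem = ("in", {body})
        positions.extend([elem] * repeat)
    return positions

def expected_number(pattern, sequence):
    """Expected number of chance matches under the assumed independence model."""
    freqs = Counter(sequence)
    total = len(sequence)
    positions = parse_pattern(pattern)
    prob = 1.0  # probability that a random window matches the pattern
    for kind, residues in positions:
        if kind == "any":
            continue
        p_in = sum(freqs[r] for r in residues) / total
        prob *= p_in if kind == "in" else 1.0 - p_in
    windows = max(total - len(positions) + 1, 0)  # possible start positions
    return windows * prob

def chance_match_probability(expected):
    """Poisson approximation (an assumption): P(at least one chance match)."""
    return 1.0 - math.exp(-expected)
```

Under this model, an expected number well below 1 means a match is unlikely to arise by chance. For the value 0.0272, for example, `chance_match_probability` gives a probability of about 0.027.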

Acknowledgments: We thank Ilgam Shahmuradov and Igor Rogozin, who took part in developing some applications of this method for nucleotide consensus searching, and Asya Salihova, who took part in developing protein site searching on the IBM PC.

**Example of PSite output:**

    PSite V1 - search for Prosite patterns

            10        20        30        40        50        60
    RLLRAIMGAPGSGKGTVSSRITKHFELKHLSSGDLLRDNMLRGTEIGVLAKTFIDQGKLI
            70        80        90       100       110       120
    PDDVMTRLVLHELKN*TQYNWLLDGFPRTLPQAEALDRAYQIDTVINLNVPFEVIKQRLT
           130       140       150       160       170       180
    ARWIHPGSGRVYNIEFNPPKTMGIDDLTGEPLVQREDDRPETVVKRLKAYEAQTEPVLEY
           190       200       210       220       230       240
    YRKKGVLETFSYTETNKIWPHVYAFLQTKLPDANKDDALDQREWSAAAAWLAAAAALDLN
           250       260       270       280       290       300
    AGCPAAALAAAAAGSAACAAAAAFAAAAAACCAACAAAAAAACAAAADAACGAYAYACAP

    ID   GLYCOSAMINOGLYCAN; RULE.
    AC   PS00002;
    DE   Glycosaminoglycan attachment site.
    DO   PDOC00002;
    PA   S-G-x-G.
    Sites found: 1   Expected number: 0.0272   95% confidential interval: 0
      #   Start   End   Expected   Site sequence
      1      12    15     0.0272   SGKG

    ID   EF_HAND; PATTERN.
    AC   PS00018;
    DE   EF-hand calcium-binding domain.
    DO   PDOC00018;
    PA   D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-
    PA   [DE]-[LIVMFYW].
    Sites found: 1   Expected number: 0.0004   95% confidential interval: 0
      #   Start   End   Expected   Site sequence
      1     212   224     0.0004   DANKDDALDQREW

    ID   ADENYLATE_KINASE; PATTERN.
    AC   PS00113;
    DE   Adenylate kinase signature.
    DO   PDOC00104;
    PA   [LIVMFYW](3)-D-G-[FY]-P-R-x(3)-[NQ].
    Sites found: 1   Expected number: 0.0000   95% confidential interval: 0
      #   Start   End   Expected   Site sequence
      1      81    92     0.0000   WLLDGFPRTLPQ
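The Start, End, and Site sequence columns above can be reproduced with a short sketch that translates a PROSITE pattern into a regular expression and scans the sequence for exact matches. This is only an illustration of exact pattern matching, not PSite's implementation. The helper names `prosite_to_regex` and `find_sites` are hypothetical, and the N- and C-terminal anchors (`<` and `>`) are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate PROSITE syntax into Python regular-expression syntax."""
    regex = pattern.rstrip(".").replace("-", "")        # drop separators
    regex = regex.replace("{", "[^").replace("}", "]")  # {GP} -> [^GP]
    regex = regex.replace("(", "{").replace(")", "}")   # x(3) -> x{3}
    return regex.replace("x", ".")                      # x means any residue

def find_sites(pattern, sequence):
    """Return (start, end, site) for every match, using 1-based positions."""
    # The lookahead lets overlapping matches be reported as well.
    lookahead = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start(1) + 1, m.end(1), m.group(1))
            for m in lookahead.finditer(sequence)]

# First 120 residues of the example sequence shown above.
sequence = (
    "RLLRAIMGAPGSGKGTVSSRITKHFELKHLSSGDLLRDNMLRGTEIGVLAKTFIDQGKLI"
    "PDDVMTRLVLHELKN*TQYNWLLDGFPRTLPQAEALDRAYQIDTVINLNVPFEVIKQRLT"
)

print(find_sites("S-G-x-G.", sequence))
# [(12, 15, 'SGKG')]
print(find_sites("[LIVMFYW](3)-D-G-[FY]-P-R-x(3)-[NQ].", sequence))
# [(81, 92, 'WLLDGFPRTLPQ')]
```

Because PROSITE writes the wildcard as a lowercase `x` and residues as uppercase letters, a plain string replacement is enough for this translation.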

**Reference:**

Solovyev V.V., Kolchanov N.A. (1994) Search for functional sites using consensus. In: Computer Analysis of Genetic Macromolecules (eds. Kolchanov N.A., Lim H.A.), World Scientific, pp. 16-21.