Program for mapping the Tandem Repeats Regions in nucleotide sequences.
TandemRep mapping is performed by searching regions with uniform diplet composition. The searching is initiated for the regions flanked by short ideal repeated elements.Tandem searching algorithm consists of the following stages:
1) Find a pair of l-plets C1 and C2 with a distance between C1 and C2 not exceeding predefined N. The region between and including C1 è C2 will be denoted as R1 with the length L1. If C1 and C2 overlap then tandem unit size can be found trivially, jump to p.5.
2) Implying that C1 and C2 flanks do not contain insertions/deletions, extend synchronously C1 and C2 allowing 1 mismatch per several matches. Extended C1 and C2 we will denote as C3 and C4. After this operation the region will be denoted as R2 with the length L2 (>= L1). If extension performed without mismatches and C3 and C4 overlap then we have ideal tandem which unit size again can be found trivially, followed by jump to p.5. If extension performed with mismatches and C3 and C4 overlap then we have almost ideal tandem which unit size can be found according p.4. Proceed if C3 and C4 do not overlap.
3) Now region R2 looks as follows
C3 C4 ########-----------------------------######## | W1 | W2 | W3 | W4 | W...| Wn-1| Wn |
For the region R2 perform the following test.
Divide region into set of windows W1, …, Wn,
each of size U. Consequently compare mono- (or di-) plet composition of the windows W1
and Wi. If the difference in such composition between W1
and some window Wi exceeds predefined threshold then stop.
Test is not passed, jump to the p.1 to consider the next pair of l-plets. If the difference is low for all
windows W2, …, Wn then the test is passed and at
least fragment R2 could be declared tandem region.
Since we don't know the size of the window at which test described above could be passed, the test is performed for the window sizes U = 2, …, L2/2.
Remember the lowest U at which the test is passed. Denote it U1.
3a) Since uniform mono- (or di-) plet composition does not guarantee homology in windows W1 and Wi, at this step the identity calculated by cycled Smith-Waterman algorithm is used for the additional filtering. If such an identity does not exceed predefined threshold then calculation is stopped for the C1 and C2 pair.
4) Calculate more precisely unit size Uopt of the tandem using two small windows synchronously sliding at the distance U one from another, U changes from U1 to L2/2.
5) Using Uopt calculated at the previous step find precise margins of the tandem using again two small synchronously sliding windows.
Such a procedure is carried out for all pairs C1 and C2 possible in the sequence. The final map of the tandems is an interception of tandems found for all l-plet pairs.
>c20 Masked regions: p1: 96 p2: 127 l: 31 chain(+) [Tandem Repeat] p1: 240 p2: 262 l: 22 chain(+) [Tandem Repeat] p1: 277 p2: 322 l: 45 chain(+) [Tandem Repeat]....
>c20 CGGTGGCGGCAGCCGGCTCAAGCCCGGGCCGCAGCTGCCTGGCCGCGGGGGCCGCCGAGCAGCGGGAGGGCCTTTGGGGG GCGGGGCGGCGGCGCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTGCTGGAAGTGCGCCTGGTCGAGACCCCGGGG CGGGAGCTGTGGAGGATGGTCCCGGCGGGACGGGCCGCTCGGGGACAAGCGGAGCGCGCCCAAGGGCCGTCGGGCGAGGG NNNNNNNNNNNNNNNNNNNNNNTCCCCGACACCGTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NTGAGGAAGGTGAAGAGGAGACGGTGGCGTCGGGGGAGGAGTCGCTGGGCTTTCTGTCCAGGCTGCCCCCTGGCCCGGCC....
>c20 cggtggcggcagccggctcaagcccgggccgcagctgcctggccgcgggggccgccgagcagcgggagggcctttggggg gcggggcggcggcgccCGAGGACGACGATGAAGACGACGACGAGGAGctgctggaagtgcgcctggtcgagaccccgggg cgggagctgtggaggatggtcccggcgggacgggccgctcggggacaagcggagcgcgcccaagggccgtcgggcgaggg GGCGGCCGCCGCCGCCGCCGCCtccccgacaccgtcGGAGGACGAGGAGCCGGAGGAAGAGGAGGAGGAGGCGGCAGCGG Ctgaggaaggtgaagaggagacggtggcgtcgggggaggagtcgctgggctttctgtccaggctgccccctggcccggcc....
>seq:1 beg:96 len:31 CGA GGA CGA CGA TGA AGA CGA CGA CGA GGA G >seq:1 beg:240 len:22 GGC GGC CGC CGC CGC CGC CGC C >seq:1 beg:277 len:45 GGA GGA CGA GGA GCC GGA GGA AGA GGA GGA GGA GGC GGC AGC GGC....