TandemRep description

Program for mapping the Tandem Repeats Regions in nucleotide sequences.

Algorithm description

TandemRep mapping is performed by searching regions with uniform diplet composition. The searching is initiated for the regions flanked by short ideal repeated elements.

Tandem searching algorithm consists of the following stages:

1) Find a pair of l-plets C1 and C2 with a distance between C1 and C2 not exceeding predefined N. The region between and including C1 è C2 will be denoted as R1 with the length L1. If C1 and C2 overlap then tandem unit size can be found trivially, jump to p.5.

2) Implying that C1 and C2 flanks do not contain insertions/deletions, extend synchronously C1 and C2 allowing 1 mismatch per several matches. Extended C1 and C2 we will denote as C3 and C4. After this operation the region will be denoted as R2 with the length L2 (>= L1). If extension performed without mismatches and C3 and C4 overlap then we have ideal tandem which unit size again can be found trivially, followed by jump to p.5. If extension performed with mismatches and C3 and C4 overlap then we have almost ideal tandem which unit size can be found according p.4. Proceed if C3 and C4 do not overlap.

3) Now region R2 looks as follows


   C3                                   C4
########-----------------------------########
| W1  | W2  | W3  | W4  | W...| Wn-1| Wn  |

For the region R2 perform the following test. Divide region into set of windows W1, …, Wn, each of size U. Consequently compare mono- (or di-) plet composition of the windows W1 and Wi. If the difference in such composition between W1 and some window Wi exceeds predefined threshold then stop. Test is not passed, jump to the p.1 to consider the next pair of l-plets. If the difference is low for all windows W2, …, Wn then the test is passed and at least fragment R2 could be declared tandem region.
Since we don't know the size of the window at which test described above could be passed, the test is performed for the window sizes U = 2, …, L2/2.
Remember the lowest U at which the test is passed. Denote it U1.

3a) Since uniform mono- (or di-) plet composition does not guarantee homology in windows W1 and Wi, at this step the identity calculated by cycled Smith-Waterman algorithm is used for the additional filtering. If such an identity does not exceed predefined threshold then calculation is stopped for the C1 and C2 pair.

4) Calculate more precisely unit size Uopt of the tandem using two small windows synchronously sliding at the distance U one from another, U changes from U1 to L2/2.

5) Using Uopt calculated at the previous step find precise margins of the tandem using again two small synchronously sliding windows.

Such a procedure is carried out for all pairs C1 and C2 possible in the sequence. The final map of the tandems is an interception of tandems found for all l-plet pairs.

Output examples


>c20
Masked regions:
p1:  96       p2: 127      l: 31        chain(+) [Tandem Repeat]
p1: 240       p2: 262      l: 22        chain(+) [Tandem Repeat]
p1: 277       p2: 322      l: 45        chain(+) [Tandem Repeat]
....


>c20
CGGTGGCGGCAGCCGGCTCAAGCCCGGGCCGCAGCTGCCTGGCCGCGGGGGCCGCCGAGCAGCGGGAGGGCCTTTGGGGG
GCGGGGCGGCGGCGCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTGCTGGAAGTGCGCCTGGTCGAGACCCCGGGG
CGGGAGCTGTGGAGGATGGTCCCGGCGGGACGGGCCGCTCGGGGACAAGCGGAGCGCGCCCAAGGGCCGTCGGGCGAGGG
NNNNNNNNNNNNNNNNNNNNNNTCCCCGACACCGTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NTGAGGAAGGTGAAGAGGAGACGGTGGCGTCGGGGGAGGAGTCGCTGGGCTTTCTGTCCAGGCTGCCCCCTGGCCCGGCC
....


>c20
cggtggcggcagccggctcaagcccgggccgcagctgcctggccgcgggggccgccgagcagcgggagggcctttggggg
gcggggcggcggcgccCGAGGACGACGATGAAGACGACGACGAGGAGctgctggaagtgcgcctggtcgagaccccgggg
cgggagctgtggaggatggtcccggcgggacgggccgctcggggacaagcggagcgcgcccaagggccgtcgggcgaggg
GGCGGCCGCCGCCGCCGCCGCCtccccgacaccgtcGGAGGACGAGGAGCCGGAGGAAGAGGAGGAGGAGGCGGCAGCGG
Ctgaggaaggtgaagaggagacggtggcgtcgggggaggagtcgctgggctttctgtccaggctgccccctggcccggcc
....


>seq:1  beg:96  len:31
CGA
GGA
CGA
CGA
TGA
AGA
CGA
CGA
CGA
GGA
G
>seq:1  beg:240  len:22
GGC
GGC
CGC
CGC
CGC
CGC
CGC
C
>seq:1  beg:277  len:45
GGA
GGA
CGA
GGA
GCC
GGA
GGA
AGA
GGA
GGA
GGA
GGC
GGC
AGC
GGC
....