RScan program complex description

RScan description

Contents

Introduction
Running RScan
PAT-file format
PAT-file: mandatory tags
PAT-file: optional tags
CFG-file format
RScan options
Output format
Score calculation
Examples of use
Scanning speed

Introduction

RScan is intended for searching for occurrences of specified secondary structure patterns in long genomic sequences.

Running RScan

RScan is a console application. It can be run as follows:

$ rscan in.fa in.pat -o:rs.cfg [options]

Here

in.fa        - (where to search) is a FASTA-file with one or many DNA/RNA sequences
in.pat      - (what to search) is a PAT-file (a file with a description of the secondary structure pattern)
rs.cfg      - CFG-file (configuration file)

PAT-file format

Below is an example of a PAT-file for a cloverleaf structure:

RNA_TREE_BEGIN
F                ESL:0.3
 E    len:0..0
 S    len:5..10  LEN:45..60  msl:2  tmm_ex
  L   len:4..12                             cons_5:AG    cons_3:AAA  mm_3:1(w:-1)
  S   len:3..7               msl:3  tmm_in
   L  len:5..15                     len_opt:7(tdev:1,mul:-1.5)
  L   len:0..7      
  S   len:3..7               msl:2
   L  len:5..10 
  L   len:0..7   
  S   len:3..7                      L_cons_5:AWAG  L_dist_5:-1..2  L_mm_5:2(w:-1)
   L  len:5..10
  L   len:0..12
 E    len:0..0
RNA_TREE_END
PSEUDOKNOTS_BEGIN
PSEUDOKNOTS_END

The identifiers RNA_TREE_BEGIN and RNA_TREE_END mark the beginning and the end of the topology description section. The pseudoknots section is not processed in the current version. Lines beginning with ';' are comments.

An RNA pattern is a tree describing an RNA secondary structure of a certain configuration. The tree elements are nodes of different types ("F" - fictive, "E" - end, "L" - loop, "S" - stem) and the edges connecting them. Only nodes of types "F" and "S" can have descendant nodes; when the children of a node are listed from 5' to 3' (numbering from 1), "L" nodes occupy odd positions and "H"/"S" nodes occupy even positions. A pattern has exactly two "E" nodes: the first and the last children of node "F". The algorithm treats nodes of type "E" and type "L" in the same way, so mostly "L" nodes are mentioned below.

In the description, each node occupies one line. The fictive node line begins with "F", end node lines begin with "E", loop lines with "L" and stem lines with "S". The indent from the left edge equals the level of the corresponding node in the tree.
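The indentation rule above can be illustrated with a short reader for the RNA_TREE section. This is only a sketch and not the parser used by RScan: the names Node and parse_rna_tree are hypothetical, and it assumes one space of indent per tree level, as in the example PAT-file.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "F", "E", "L" or "S"
    tags: dict                     # e.g. {"len": "5..10", "tmm_ex": True}
    children: list = field(default_factory=list)

def parse_rna_tree(text):
    """Build a tree from the RNA_TREE section (assumption: 1 space = 1 level)."""
    stack = []                     # stack[d] is the latest node seen at depth d
    root = None
    inside = False
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.strip() == "RNA_TREE_BEGIN":
            inside = True
            continue
        if line.strip() == "RNA_TREE_END":
            break
        if not inside or not line.strip() or line.lstrip().startswith(";"):
            continue               # skip blank lines and ';' comments
        depth = len(line) - len(line.lstrip(" "))
        kind, *fields = line.split()
        tags = {}
        for item in fields:
            key, sep, value = item.partition(":")
            tags[key] = value if sep else True   # flags such as tmm_ex
        node = Node(kind, tags)
        if depth == 0:
            root = node
        else:
            stack[depth - 1].children.append(node)
        del stack[depth:]          # forget deeper nodes of the previous branch
        stack.append(node)
    return root

hairpin = """RNA_TREE_BEGIN
F
 E              len:0..0
 S              len:2..2
  L             len:3..10   cons_5:ATGCKYBH  mm_5:1(w:-1)
 E              len:0..0
RNA_TREE_END
"""
tree = parse_rna_tree(hairpin)
print([child.kind for child in tree.children])   # ['E', 'S', 'E']
```

Tag values are kept as raw strings here; interpreting intervals such as '3..10' is left out of the sketch.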

PAT-file: mandatory tags

F, E, S, L - Denote node types

len:5..7 - The interval of the allowed length of this element for the nodes E, S, L

PAT-file: optional tags

"F" node can have two additional tags:

ESL:1.5 - outputs only occurrences whose ratio of ES score to length L is above 1.5. Here, ES is the energy (in kcal/mol) taken with the opposite sign and multiplied by 10.

LEN:70..90 - limits occurrence length by an interval [70,90]
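The two "F" node filters can be sketched as follows. The function names are hypothetical, and whether the ESL boundary itself is accepted is an assumption (the sketch accepts it, following the "<=" wording used for the tRNA example later in this description).

```python
def energy_to_es(energy_kcal_mol):
    """ES score: energy in kcal/mol, opposite sign, multiplied by 10."""
    return -10.0 * energy_kcal_mol

def passes_f_node_filters(es, length, esl=None, len_range=None):
    """Apply the optional ESL and LEN tags of the "F" node to one occurrence."""
    if len_range is not None:
        low, high = len_range
        if not low <= length <= high:
            return False           # LEN:low..high violated
    if esl is not None and es / length < esl:
        return False               # ES / L below the ESL threshold (assumed inclusive)
    return True

# An occurrence with ES:59 and L:71 under 'ESL:-0.5' and 'LEN:70..90'
print(passes_f_node_filters(59, 71, esl=-0.5, len_range=(70, 90)))   # True
```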

E and L nodes can have the following additional tags:

len_opt:15(tdev:2.1,mul:-1.5) - the option defines the optimal length of the element. If the length of the element differs from the optimal value, its DS score is penalized according to a certain formula (see below); if the option is absent, the length of the element is not penalized (though the element still has a length-dependent energy)

cons_5:AWGUC - consensus sequence in a 15-letter alphabet (IUPAC), whose presence is required at the 5'-border of the element

mm_5:2(w:-1) - there are 2 mismatches allowed in the consensus, weight for mismatch is -1

dist_5:0..2 - allowed shift of the consensus from the 5'-edge of the element, here a distance of 0 to 2 nt. Small negative shift values are allowed (for example, 'dist_5:-1..2'). A negative shift value means that the consensus is "sticking out" of the element in the 5' direction (or in the 3' direction in the case of the 'dist_3:' option)

dist_5_opt:2(w:-1) - The option shows that shift of 2 nt is considered optimal. Deviation of the shift value from the optimum by every 1 nt is penalized in the DS score with a weight of -1. If there is no option, there is no penalty.

cons_3:AWGUC - similarly, from the 3'-boundary
mm_3:2(w:-1) - similarly, from the 3'-boundary
dist_3:0..2 - similarly, from the 3'-boundary
dist_3_opt:2(w:-1) - similarly, from the 3'-boundary
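The way 'cons_5', 'mm_5' and 'dist_5' work together can be sketched as below. This is an illustration only, not the RScan implementation: the function names are hypothetical, U is treated as T, and the sketch only checks that the consensus fits into the sequence, not any further limits RScan may apply.

```python
# IUPAC nucleotide codes, 15 letters in total (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(consensus, fragment):
    """Count positions where the fragment letter is not covered by the consensus letter."""
    consensus = consensus.upper().replace("U", "T")
    fragment = fragment.upper().replace("U", "T")
    return sum(base not in IUPAC[code] for code, base in zip(consensus, fragment))

def consensus_hits_5(seq, elem_start, consensus, max_mm=0, dist=(0, 0)):
    """Return (shift, mismatches) for every allowed shift from the 5'-border.

    seq        - the whole sequence (needed because negative shifts stick out)
    elem_start - 0-based index of the first nucleotide of the element
    """
    hits = []
    for shift in range(dist[0], dist[1] + 1):
        start = elem_start + shift
        if start < 0 or start + len(consensus) > len(seq):
            continue                       # the consensus does not fit
        window = seq[start:start + len(consensus)]
        mismatches = count_mismatches(consensus, window)
        if mismatches <= max_mm:
            hits.append((shift, mismatches))
    return hits

# Loop starting at the 3rd nucleotide, 'cons_5:ATGCKYBH  mm_5:1(w:-1)'
print(consensus_hits_5("ugauacgccucgcgc", 2, "ATGCKYBH", max_mm=1))   # [(0, 1)]
```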

S nodes may have the following tags:

LEN:70..90 - sets the length limits of the sub-fragment, closed by this stem (including the stem itself)

msl:3 - ("max stem loop") defines the maximum size of an interior loop in the stem, or its "looseness"; for example, at value 'msl:3' the stems

((((...))...)), ((.((...))..)), ((..((...)).)), ((...((...))))   are allowed, whereas the stems
((.((...))...)) or ((((...))....))  are not allowed

tmm_ex - checks pair of nucleotides adjacent to the stem from the "outside" for non-complementarity. The check is performed only if the loops adjacent to the stem have a non-zero length

tmm_in - checks pair of nucleotides adjacent to the stem from the "inside" for non-complementarity. The check is performed only if the loops adjacent to the stem have non-zero length or, if the stem is terminal, the check is performed when the hairpin loop exceeds 4 nucleotides in length

len_opt:15(tdev:2.1,mul:-1.5) - similar to the same option for the nodes E and L
L_cons_5:AWGUC - same as for nodes L; left arm, 5'-end of the stem
L_mm_5:2(w:-1) - same as for nodes L; left arm, 5'-end of the stem
L_dist_5:0..2 - same as for nodes L; left arm, 5'-end of the stem
L_dist_5_opt:2(w:-1) - same as for nodes L; left arm, 5'-end of the stem
R_cons_5:AWGUC - same as for nodes L; right arm, 5'-end of the stem
R_mm_5:2(w:-1) - same as for nodes L; right arm, 5'-end of the stem
R_dist_5:0..2 - same as for nodes L; right arm, 5'-end of the stem
R_dist_5_opt:2(w:-1) - same as for nodes L; right arm, 5'-end of the stem
L_cons_3:AWGUC - same as for nodes L; left arm, 3'-end of the stem
L_mm_3:2(w:-1) - same as for nodes L; left arm, 3'-end of the stem
L_dist_3:0..2 - same as for nodes L; left arm, 3'-end of the stem
L_dist_3_opt:2(w:-1) - same as for nodes L; left arm, 3'-end of the stem
R_cons_3:AWGUC - same as for nodes L; right arm, 3'-end of the stem
R_mm_3:2(w:-1) - same as for nodes L; right arm, 3'-end of the stem
R_dist_3:0..2 - same as for nodes L; right arm, 3'-end of the stem
R_dist_3_opt:2(w:-1) - same as for nodes L; right arm, 3'-end of the stem
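The 'msl' examples given for stems suggest that the size of an interior loop is counted as the total number of unpaired nucleotides on both sides between two consecutive base pairs. Under that assumption, the check can be sketched for a single stem written in dot-bracket notation (the function name is hypothetical):

```python
def max_interior_loop(structure):
    """Largest interior loop or bulge inside one stem given in dot-bracket notation.

    Assumption: the loop size is the sum of unpaired nucleotides on the left
    and right sides between two consecutive base pairs of the stem.
    """
    stack, pairs = [], []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            pairs.append((stack.pop(), pos))
    pairs.sort()                   # outermost pair first for a single hairpin
    largest = 0
    for (i, j), (i_next, j_next) in zip(pairs, pairs[1:]):
        largest = max(largest, (i_next - i - 1) + (j - j_next - 1))
    return largest

for stem in ("((((...))...))", "((..((...)).))", "((.((...))...))"):
    print(stem, max_interior_loop(stem) <= 3)   # True, True, False for msl:3
```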

CFG-file format

Basically, the CFG file is not intended for editing. However, lines beginning with "COMMAND_LINE:" can be edited. The options placed on these lines are the same as those allowed in the command line (see next section).

RScan options

-D:N (def.: 0) strand to search in. N = 0 - search in direct strand, N = 1 - in the reverse strand, N = 2 - in both strands
-P1:N1 -P2:N2 search only in interval [N1,N2]
-max_stem_loop:N (def.: 2) the maximum length of the interior loop in the stem (analogous to the stem option 'msl:' in the PAT-file). The value N is set for all stems for which the individual value of this parameter is not set in the PAT-file
-tmm_mode:N (def.: 0) how to check the non-complementarity of nucleotides flanking the stems;
N = 0 - check only for stems with 'tmm_in' or 'tmm_ex' set on in the PAT-file;
N = 1 - do not check; N = 2 - check for all stems where possible
-out_N outputs fragments containing the unknown nucleotide 'N'
-out_over outputs all overlapping fragments. Overrides option '-max_len_perc:'
-score_type:N (def.: 1) which score to optimize, ES (energy), DS (deviation score), or CS (combination score). N = 0 - optimize the DS score, which penalizes mismatches with consensus, deviation of the consensus distance from the optimum value to the element boundary, deviation of the element length from the optimal value. N = 1 - optimize the ES score. N = 2 - optimize the CS score (linear combination of DS and ES scores). More details about the scores can be found below.
-d_e_mul:x:y (def.: 1:1) weights of DS and ES scores when calculating CS score: CS = x * DS + y * ES . The option takes integer numbers, so for a fine balance higher values can be used, for example '-d_e_mul:14:15'
-cmm_mul:N (def.: 10) the penalty for a mismatch with consensus is multiplied by N. At the default value of the option, a mismatch with the letters A, U, G, C contributes 20 * w to the DS score, a mismatch with the letters W, R, M, K, Y, S contributes 10 * w, and a mismatch with the letters B, V, H, D contributes 4 * w, where w is the individual (negative) mismatch weight of each consensus, given by options such as 'mm_5:2(w:-1)' or 'L_mm_5:2(w:-1)' in the PAT-file
-score_thr:N (def.: -2147483648) score threshold. The score is the DS score (at '-score_type:0'), the ES score (at '-score_type:1') or the CS score (at '-score_type:2'). Fragments having a score value lower than N are not output
-node_ener_thr100:N (def.: -2147483648) the option is associated with the option '-net_loops_n'. During the calculation, discard substructures with ES < (N * L / 100), where L is the length of the fragment. Speeds up the calculation.
-net_loops_n:N (def.: 100) apply the option '-node_ener_thr100' only for substructures that contain at least N loops
-nothr disables all the stem energy thresholds and the general score threshold
-max_len_perc:N (def.: 100) the option is associated with the option '-mlp_loops_n'. Discard all substructures that have a length L > A + (B-A) * N / 100, where A is the minimum and B is the maximum possible length of the substructure.
-mlp_loops_n:N (def.: 100) apply the option '-max_len_perc' only for substructures that contain at least N loops
-max_iloop_len:N (def.: 40) maximum allowed length of the interior loop between stems ('max_stem_loop' - the same value, but inside the stems)
-max_mloop_len:N (def.: 50) maximum allowed length of a multiloop (the sum of the lengths of all its arms). Note: sometimes structures with a multiloop slightly longer than N still appear in the output. This happens when a shorter (<=N) multiloop is also possible but has a lower score
-max_xloop_len:N (def.: 50) the maximum allowed length of the "external" loop. An "external" loop is the run of 'x' in a structure like ...(((...)))xxx(((...)))xxx(((...)))...
-out_mode:N (def.: 2) outputs the result in the following formats: N = 0 - occurrences in Vienna and GCG, N = 1 - occurrences in FASTA format, N = 2 - occurrences in extended Vienna format, N = 3 - output in FASTA-format all sequences, if they contain occurrences
-stat outputs statistics on nucleotides and nucleotide pairs for each occurrence. Works only with '-out_mode:2' option
-del_olap_perc:N discards heavily overlapping occurrences. More specifically, among all the occurrences, a pair is sought whose members overlap by more than N% of the length of the shorter of the two. The member of the pair with the lower score is discarded. This is repeated until no such pairs remain. The type of score is defined by the option '-score_type'
-del_max_len_diff:N (def.: 100) used in conjunction with the '-del_olap_perc' option (see its description). The overlap of the two occurrences is detected if the relative difference in their length does not exceed N%
-out_best:N outputs only the N best entries. Since scanning of long sequences is done in chunks, N occurrences are output for each such piece (by default, 30000 nt)
-toL applies the '-out_best' or '-del_olap_perc:' option not to the value of score S, but to the ratio S / L, where L is the length of the occurrence (without taking into account the length of the flanks - nodes of type "E")
-strnum calculates the number of possible parsings of the [i, j] -fragment by the pattern. Slightly slows down the calculation
-progr outputs progress info
-min_stem_cm:N sets the minimum required number of matches with consensus, common to the consensus of all stems. Overrides the settings in the PAT-file
-stem_cmm_pen:N (def.: -1) sets the multiplier to the penalty for a mismatch with consensus, common to the consensus of all the stems
-min_loop_cm:N sets the minimum required number of matches with consensus, common to the consensus of all loops. Overrides mismatch options in the PAT-file
-loop_cmm_pen:N (def.: -1) sets the multiplier to the penalty for a mismatch with consensus, common to the consensus of all loops
-cmm_pen_freq_depend makes consensus mismatch score depend on the frequencies of nucleotides in the sample
-cons_stickout_max:N (def.: 4) the maximum possible "stick out" of the consensus beyond its element, allowed in the PAT-file
-max_stem_loop:N (def.: 2) defines the upper size of the interior loop inside all stems, except those for which the individual option 'msl' is set in PAT-file
-stem_ener_thr:N (def.: 3) the stem is considered if its ES >= N * (L1 + L2) / 2, where L1 and L2 are the lengths of the stem arms
-stem_ener_thr_1bp:N (def.: 8) a 1-bp stem is considered if its ES>=T. The threshold T = max (N, M), where M is specified by the '-stem_ener_thr:M' option
-stem_ener_thr_2bp:N (def.: 12) a 2-bp stem is considered if its ES>=T. The threshold T = max (N, 2M), where M is specified by the '-stem_ener_thr:M' option
-stem_ener_thr_3bp:N (def.: 16) a 3-bp stem is considered if its ES>=T. The threshold T = max (N, 3M), where M is specified by the '-stem_ener_thr:M' option
-stable_root if the stem is the root (or one of the root stems), this option requires that its ES is equal to or greater than the destabilizing contribution of the loop it forms
-nowse ("no weak stem ends") throw out from consideration stems with weak closing helices. For example,
uauggg...cccuuuuug
((.(((...)))....))

-mispair_score:N allows the formation of all non-canonical pairs, assigning them ES = N. Slows down the algorithm dramatically. Ensures that an occurrence can be found in any fragment of a suitable length
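The procedure described for the '-del_olap_perc:N' option can be sketched as a greedy loop. This is a simplified illustration, not the RScan code: the function name is hypothetical, occurrences are plain (start, length, score) tuples, the additional '-del_max_len_diff' condition is ignored, and the handling of equal scores is an assumption.

```python
def remove_overlaps(occurrences, olap_perc):
    """Greedy removal of heavily overlapping occurrences.

    occurrences - list of (start, length, score) tuples
    olap_perc   - N of '-del_olap_perc:N'
    """
    kept = list(occurrences)
    changed = True
    while changed:                          # repeat until no such pair remains
        changed = False
        for a in range(len(kept)):
            for b in range(a + 1, len(kept)):
                start_a, len_a, score_a = kept[a]
                start_b, len_b, score_b = kept[b]
                overlap = min(start_a + len_a, start_b + len_b) - max(start_a, start_b)
                # overlap of more than N% of the shorter occurrence
                if overlap > 0 and overlap * 100 > olap_perc * min(len_a, len_b):
                    # discard the lower-scoring member (ties: the later one, by assumption)
                    del kept[a if score_a < score_b else b]
                    changed = True
                    break
            if changed:
                break
    return kept

hits = [(0, 50, 10), (10, 50, 20), (200, 30, 5)]
print(remove_overlaps(hits, 50))   # [(10, 50, 20), (200, 30, 5)]
```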

Output format

The following is an example of the recommended type of the RScan output (produced with an option '-out_mode:2'):

>NM:[chr_rand].1 CH:+ X:2023007 L:60 ES:23 DS:-51(cons:-50) CS:-28 LN:2.40000e+002
uaccuuagaauuucauacacggguggcccugccggcaguguguucggcgcacacaaggua
((((((.....(((......)))..(((.....))).((((((.....))))))))))))
AAAAAA.....BBB......BBB..CCC.....CCC.DDDDDD.....DDDDDDAAAAAA
......agaaU.........................aGUg....................
......ag............................awag....................
........aaa................................................. 

The first line contains:

NM: the name of the sequence in which the occurrence was found, in square brackets. The brackets are followed by the number of the occurrence within the given sequence and chain
CH: chain ("+" or "-")
X: position from the 5'-end of the current chain
L: length
ES: energy score. Is equal to the energy in kcal / mol, taken with the opposite sign and multiplied by 10
DS: deviation score. Includes penalties for deviations from optimal lengths and for mismatches with consensus (penalty for mismatches is given separately in brackets "(cons:-50)")
CS: complex score; is a combination of ES and DS
LN: number of the pattern parsings at the given fragment

The second line contains sequence fragment.
The third line shows the secondary structure in the dot-bracket notation.
The fourth line shows the stems mark up. Nucleotides belonging to the same stem are marked with the same letter.
The fifth line shows the positions of the fragment corresponding to the consensus. If the position agrees with all overlapping consensus fragments, it is denoted by a lowercase letter, otherwise - the uppercase.
The sixth line shows the consensus fragments that are associated with the 5'-edges of the elements of the pattern.
The seventh line shows the consensus fragments that are associated with the 3'-edges of the elements of the pattern.
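For post-processing, the first line of an '-out_mode:2' record can be split into fields with a regular expression. This is a sketch based only on the header lines shown in this description; the function name and the returned keys are hypothetical, and integer ES, DS and CS values are assumed.

```python
import re

HEADER_RE = re.compile(
    r"^>NM:\[\s*(?P<name>[^\]]*?)\s*\]\.(?P<index>\d+)\s+"
    r"CH:(?P<chain>[+-])\s+X:(?P<x>\d+)\s+L:(?P<length>\d+)\s+"
    r"ES:(?P<es>-?\d+)\s+DS:(?P<ds>-?\d+)\(cons:(?P<cons>-?\d+)\)\s+"
    r"CS:(?P<cs>-?\d+)\s+LN:(?P<ln>\S+)"
)

def parse_header(line):
    """Split the first line of an occurrence record into typed fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not an RScan '-out_mode:2' header: " + line)
    record = match.groupdict()
    for key in ("index", "x", "length", "es", "ds", "cons", "cs"):
        record[key] = int(record[key])
    record["ln"] = float(record["ln"])     # number of pattern parsings
    return record

header = ">NM:[chr_rand].1 CH:+ X:2023007 L:60 ES:23 DS:-51(cons:-50) CS:-28 LN:2.40000e+002"
record = parse_header(header)
print(record["name"], record["x"], record["cs"])   # chr_rand 2023007 -28
```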

Score calculation

ES score:
By default, optimization is performed according to the ES score ('-score_type:1'). The ES score is the energy (in kcal/mol) multiplied by 10 and taken with the opposite sign. The ES score, like the DS and CS scores, is maximized (higher is better).

DS score:
If the option '-score_type:0' is given, the optimization is carried out according to the DS score (deviation score). The DS score is the sum of penalties for mismatches with consensus (where the consensus is defined in the PAT-file), penalties for shifts of the consensus from the optimal position, and penalties for deviations of element lengths from the optimal values (only for elements whose optimal length is set). The DS score cannot be positive.

DS score, consensus:
Let's say you have a PAT-file exa.hp.pat looking like this:

RNA_TREE_BEGIN
F
 E              len:0..0     
 S              len:2..2
  L             len:3..10   cons_5:ATGCKYBH  mm_5:1(w:-1)
 E              len:0..0
RNA_TREE_END
PSEUDOKNOTS_BEGIN
PSEUDOKNOTS_END

If the weight of the mismatch is set to -1 ('w:-1' in the PAT-file) and the option '-cmm_mul:100' is given, the DS score for a mismatch with the letters A, T, G, C will be -200, the DS score for a mismatch with the letters W(=AT), R(=AG), M(=AC), K(=TG), Y(=TC), S(=CG) will be -100, and the DS score for a mismatch with the letters B(=TGC), V(=AGC), H(=ATC), D(=ATG) will be -42:

$ rscan chr.fa exa.hp.pat -o:rscan.cfg -score_type:0 -cmm_mul:100

Output:

>NM:[ chr].4 CH:+ X:751 L:15 ES:-38 DS:-200(cons:-200) CS:-238 LN:1.00000e+000
ugauacgccucgcgc
((........))...
AA........AA...
..auAcgccu.....
..augckybh.....
...............

>NM:[ chr].7 CH:+ X:1166 L:14 ES:-46 DS:-42(cons:-42) CS:-88 LN:1.00000e+000
uaaugcucauaggc
((.......))...
AA.......AA...
..augcucAu....
..augckybh....
..............

>NM:[ chr].16 CH:+ X:4286 L:16 ES:-48 DS:-100(cons:-100) CS:-148 LN:1.00000e+000
ggaugcuggcuucgcg
((.........))...
AA.........AA...
..augcuGgc......
..augckybh......
................

The option 'dist_5_opt:L_opt(w:X)', referring to some consensus in the PAT-file, additionally requires calculating a DS score for the shift of the consensus position from the optimal one. In this case, the penalty is linear: DS = |L - L_opt| * X , where L is the actual shift of the consensus and L_opt is its optimal shift relative to the element boundary.
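The consensus part of the DS score can be sketched as follows. The documented penalties (-200, -100, -42 at '-cmm_mul:100'; 20, 10, 4 times w at the default value) are reproduced here by rounding log2(4 / k) * cmm_mul, where k is the number of nucleotides covered by the consensus letter. That formula is only a guess that happens to fit the documented numbers; it is not taken from the RScan sources, and the function names are hypothetical.

```python
import math

# Number of nucleotides covered by each IUPAC consensus letter.
LETTER_SIZE = {}
for letters, size in (("ACGTU", 1), ("WRMKYS", 2), ("BVHD", 3), ("N", 4)):
    for letter in letters:
        LETTER_SIZE[letter] = size

def mismatch_penalty(consensus_letter, w=-1, cmm_mul=10):
    """DS contribution of one mismatch (assumed formula, see the text above)."""
    size = LETTER_SIZE[consensus_letter.upper()]
    return round(math.log2(4 / size) * cmm_mul) * w

def shift_penalty(shift, shift_opt, w):
    """DS = |L - L_opt| * X for an option like 'dist_5_opt:L_opt(w:X)'."""
    return abs(shift - shift_opt) * w

print(mismatch_penalty("G", w=-1, cmm_mul=100))   # -200
print(mismatch_penalty("Y", w=-1, cmm_mul=100))   # -100
print(mismatch_penalty("B", w=-1, cmm_mul=100))   # -42
print(shift_penalty(0, 2, -1))                    # -2
```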

DS score, optimal elements length:
Penalties for deviations from optimal lengths of elements are calculated by the following formula: DS = Mul * ln( ((L - L_opt) / L_tdev) ^ 2 + 1) , where L is the actual length of the element (or the average length of its arms, in the case of a stem), L_opt is the optimal length value, L_tdev is a typical deviation, Mul is an arbitrary multiplier. The values L_opt, L_tdev, Mul are specified in the PAT-file in the following form: 'len_opt:L_opt(tdev:L_tdev,mul:Mul)', for example, 'len_opt:6.7(tdev:0.8,mul:-10.0)'.
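The length penalty formula translates directly into code. The function name below is hypothetical; the parameters follow the 'len_opt:L_opt(tdev:L_tdev,mul:Mul)' notation.

```python
import math

def length_penalty(length, l_opt, l_tdev, mul):
    """DS = Mul * ln(((L - L_opt) / L_tdev)^2 + 1)."""
    return mul * math.log(((length - l_opt) / l_tdev) ** 2 + 1)

# 'len_opt:7(tdev:1,mul:-1.5)': no penalty at the optimum, a growing one away from it
for length in (7, 8, 10):
    print(length, round(length_penalty(length, 7, 1, -1.5), 3))
```

Because of the logarithm, the penalty grows slowly for large deviations, so a single very long or very short element does not dominate the DS score.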

CS score:
If '-score_type:2' is given, the optimization is done by the CS score (complex score). CS score is calculated as a combination of ES score and DS score: CS = x*DS + y*ES . The coefficients x, y are specified by the option '-d_e_mul:x:y'.
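The CS formula can be checked against the header lines printed in this description. The function name is hypothetical; x and y are the two values of '-d_e_mul:x:y'.

```python
def combination_score(ds, es, x=1, y=1):
    """CS = x * DS + y * ES, with x and y taken from '-d_e_mul:x:y'."""
    return x * ds + y * es

# Default weights 1:1, header "ES:23 DS:-51(cons:-50) CS:-28"
print(combination_score(-51, 23))              # -28
# '-d_e_mul:5:3', header "ES:92 DS:-24(cons:0) CS:156"
print(combination_score(-24, 92, x=5, y=3))    # 156
```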

Examples of use

In the work directory, some examples of patterns and sets of sequences containing occurrences of these patterns are given. They will be used in the examples below.

Example 1:

$ rscan

Running the program without options prints the help.

Example 2:

$ rscan exa.regexp.fa exa.regexp.pat -o:rscan.cfg -nothr -score_type:0 -out_best:3

This example shows that RScan can be used to search a primary sequence for a kind of regular expression. The file exa.regexp.pat describes a pattern with zero-length stems. It is configured to search for the following context template (regular expression): ATC...HWAGCSS...ATB...AAA...TACGTG...SS...HYWWYSS, in which "..." means intervals of arbitrary length (within some limits); the allowed number of mismatches in each block is also set in the pattern. The option '-nothr' removes all energy thresholds (since the energy is of no interest in this case), and the '-score_type:0' option requests optimization of the DS score (in this case, minimizing the number of mismatches with consensus). Output:

>NM:[ RegExpExampleSeq].1 CH:+ X:8 L:20 ES:-83 DS:-204(cons:-204) CS:-287 LN:1.00000e+000
ggcuaagaaagcuuauuagc
....................
....................
GGc....aAAgCUUAuUAg.
auc....aaaguacgugss.
....................

>NM:[ RegExpExampleSeq].2 CH:+ X:17 L:47 ES:-130 DS:-94(cons:-94) CS:-224 LN:1.00000e+000
agcuuauuagcgauaauucuccuauaugccuucauauuaugcagccg
...............................................
...............................................
aGc...uuagcgAuAaU.....uaUAugcc.....auuaugc.....
auc...hwagcsauaaa.....uacgugss.....hywwyss.....
...............................................
...

Example 3:

$ rscan exa.ires.fa exa.ires.pat -o:rscan.cfg -out_best:1

This example shows the use of RScan for searching the internal ribosome entry site (IRES). A pattern in the file exa.ires.pat is described fairly strictly and finds 72 occurrences of 520 IRES entries (from RFAM 12.0), and zero occurrences per 10MB of a random sequence. Output:

>NM:[L02971.1/237-705].1 CH:+ X:158 L:79 ES:139 DS:-60(cons:-60) CS:79 LN:1.00000e+000
auccuagugccagcggaacaacaucugguaacagaugccucuggggccaaaagccaagguuugacagacccauuaggau
(((((((((((((.((......(((((....))))).)).))))(((.....)))..(((((...))))))))))))))
AAAAAAAAABBBBBBB......CCCCC....CCCCC.BBBBBBBDDD.....DDD..EEEEE...EEEEEAAAAAAAAA
.........ccagcggaacaAcAUcugguaa.............ggccaaaa...aAgGuu..................
.........syrbsggaahhccymyykgura.............ggccraaa...aygyby..................
...............................................................................
...

Example 4:

$ rscan exa.trna.fa exa.trna.1.pat -o:rscan.cfg

This example shows the use of RScan for searching tRNA. The exa.trna.1.pat pattern is defined in such a way that it finds 2,508 occurrences of 3,514 tRNAs (from RFAM 12.0) and 1 occurrence per 1,000 nt of a random sequence. The pattern uses 4 consensus blocks and the energy threshold ESL = -0.5 (that is, Energy / L <= 0.05 kcal/mol/nt is required). Output:

>NM:[DR1281].1 CH:+ X:1 L:71 ES:59 DS:-20(cons:-20) CS:39 LN:1.00000e+000
cauucauagcucaauuggauagagcggcggacuucgaauccgaagguugcagguucgacuccugcugagug
((((((..((((.........))))..((((.......))))......(((((.......)))))))))))
AAAAAA..BBBB.........BBBB..CCCC.......CCCC......DDDDD.......DDDDDAAAAAA
......uagc.....uggA............cu...aa...............uucgacucc.........
......urgc.....uggu............cu....................uucranucc.........
....................................ra.................................
...

Example 5:

$ rscan exa.trna.fa exa.trna.2.pat -o:rscan.cfg -score_type:2 -d_e_mul:5:3

This example also shows the use of RScan for searching tRNA. The exa.trna.2.pat pattern is defined in such a way that it finds 2,508 occurrences of 3,514 tRNAs (from RFAM 12.0), and at a given CS threshold ('-score_thr:0') detects 1 occurrence in approximately 360,000 nt of a random sequence. The pattern does not fix consensus nucleotides, but it sets optimal lengths of stems and loops and penalizes deviations from them. The optimization is performed according to the CS score ('-score_type:2'), and the ratio of DS and ES scores in the CS score is set to 5:3 ('-d_e_mul:5:3'). Output:

>NM:[DR1281].1 CH:+ X:0 L:73 ES:92 DS:-24(cons:0) CS:156 LN:1.00000e+000
gcauucauagcucaauuggauagagcggcggacuucgaauccgaagguugcagguucgacuccugcugagugc
(((((((..((((.........))))..((((.......))))......(((((.......))))))))))))
AAAAAAA..BBBB.........BBBB..CCCC.......CCCC......DDDDD.......DDDDDAAAAAAA
.........................................................................
.........................................................................
.........................................................................

>NM:[DH9330].1 CH:+ X:0 L:71 ES:121 DS:-4(cons:0) CS:343 LN:1.00000e+000
gccgugaucguauagggguuaguacucugcguuguggccgcagcaaccucgguucgaauccgagucacggc
(((((((..((((........)))).(((((.......)))))....(((((.......))))))))))))
AAAAAAA..BBBB........BBBB.CCCCC.......CCCCC....DDDDD.......DDDDDAAAAAAA
.......................................................................
.......................................................................
.......................................................................
...

Example 6:

$ rscan exa.trna.fa exa.trna.2.pat -o:rscan.cfg -mispair_score:-50 -out_best:1

It often happens that the pattern is not found in sequences in which it should be present. In such cases, it is recommended to use the '-mispair_score:N' option, which allows any non-canonical pairs. This often helps to identify the shortcomings of the pattern and correct it. Output:

>NM:[DM2440].1 CH:+ X:0 L:73 ES:-32 DS:-157(cons:0) CS:-189 LN:1.00000e+000
gccugcuuagcucaguugguuagagcguccguuucauaagcugauugucacuaguucaaaucuaguagcaggc
(((((((((((...)))).(((((((....)).)).)))..........(((((<(...)>))))))))))))
AAAAAAABBBB...BBBB.CCCCCCC....CCCCCCCCC..........DDDDDDD...DDDDDDDAAAAAAA
.........................................................................
.........................................................................
.........................................................................

>NM:[DN1140].1 CH:+ X:0 L:72 ES:-8 DS:-135(cons:0) CS:-143 LN:1.00000e+000
ggcuuuuuagcucagcagguagagcaaccggcuguuaaccgguuugucacagguucgagcccuguaaaagcc
(((((((..((((<(....)>))))((((((<(...)>))))))....(((((<(...)>))))))))))))
AAAAAAA..BBBBBB....BBBBBBCCCCCCCC...CCCCCCCC....DDDDDDD...DDDDDDDAAAAAAA
........................................................................
........................................................................
........................................................................

Here the non-canonical pairs are denoted by angular brackets, like in this case:
gggaaaaauugcc
((<((...))>))

Scanning speed

    The scanning speed is higher if:

  1. Non-canonical pairs are forbidden (gives a dramatic speedup): the '-mispair_score:N' option is absent
  2. The required consensus is longer (gives a dramatic speedup)
  3. The required consensus has fewer allowed mismatches (gives a dramatic speedup): in the option 'mm_5:X' (and other similar ones) the value of X is lower
  4. The required consensus has a smaller 'swing' limit relative to the structure: in the 'dist_5:X..Y' (and other similar) option, the interval [X, Y] is shorter
  5. The elements of the pattern vary less in size (gives a dramatic speedup): in the 'len:X..Y' option, the interval [X, Y] is shorter. This especially concerns the terminal (hairpin) loops and the stems closing them
  6. The stems are longer: in the 'len:X..Y' option of an "S" node, the X value is higher
  7. The stems have the options 'tmm_in', 'tmm_ex', or, even better, the option '-tmm_mode:2' is provided
  8. The PAT-file contains 'LEN:X..Y' options, and their intervals [X, Y] are shorter
  9. The stems have a smaller size of the allowed internal loops: in the option 'msl:X', the value of X is lower
  10. The loops are smaller: in the options '-max_iloop_len:X', '-max_mloop_len:Y', '-max_xloop_len:Z' the values of X, Y, Z are lower
  11. Only stronger stems are allowed: the '-nowse' option is set and in the '-stem_ener_thr:A', '-stem_ener_thr_1bp:B', '-stem_ener_thr_2bp:C', '-stem_ener_thr_3bp:D' options the parameters A, B, C, D have higher values


RInf description

Contents

Introduction
Running RInf
PAT-file format
CFG-file format
RInf options
Output format
Examples of use

Introduction

RInf estimates the frequency (or bit-score) of occurrence of a pattern with a specified secondary structure in a long random sequence. The sequence is assumed to be generated according to the Bernoulli scheme with a uniform nucleotide distribution. The program takes into account the shape of the structure and the energy threshold.

The current version does not take into account the contextual constraints imposed on the pattern. Only a rough estimation of the information content of contextual requirements is given in isolation from the structural one.

Frequency estimation is done by any combination of three algorithms:

(A) by scanning a random sequence,
(B) by estimating the properties of pseudorandom (generated) occurrences,
(C) by linear regression from some statistics of the pattern, including characteristics of its frequency-energy spectrum.

Algorithm (A) works only with relatively frequently occurring patterns (with a frequency from 10^-7 to 1).
Algorithm (B) can evaluate any pattern, but it needs to generate tens of thousands of pseudo-occurrences, which can sometimes take tens of minutes.
Algorithm (C) gives a less accurate estimate than (A) and (B), but runs for no longer than one minute.

Note: Frequency estimates can sometimes exceed 1. This is because several occurrences with a different length may start from or end in the same position, whereas as an estimate the sum of frequencies over all possible lengths of occurrences is considered.

See detailed method description here.

Running RInf

RInf is a console application. It can be run as follows:

$ rinf x in.pat -o:rinf.cfg [options]

Here
x             - the 1st argument is an empty (placeholder) option
in.pat     - is a PAT-file (a file with a description of the secondary structure pattern)
rinf.cfg  - CFG-file (configuration file)

PAT-file format

The description of the PAT-file can be found at the RScan help page.

CFG-file format

Basically, the CFG file is not intended for editing. However, lines beginning with "COMMAND_LINE:" can be edited. The parameters located on these lines are the same as the command line parameters.

RInf options

-sec_scan:N (def.: 30) Search for occurrences in a random sequence no longer than N seconds
-sec_imit:N (def.: 30) Simulate random occurrences no longer than N seconds
-vol_scan:N (def.: 1000) Stop searching for occurrences in a random sequence after finding N occurrences. Note: the search for random occurrences stops when one of the thresholds is reached: '-sec_scan:X' or '-vol_scan:Y'
-vol_imit:N (def.: 1000) The amount of pseudo-random occurrences to be simulated. Note: simulating pseudo-random occurrences is stopped when one of the thresholds is reached: '-sec_imit:X' or '-vol_imit:Y'
-vol_imit_min:N (def.: 50) Until this amount of statistics is reached, the '-sec_imit:X' and '-vol_imit:Y' options do not work
-noscan Skip scanning algorithm
-noimit Skip the algorithm generating pseudo-occurrences
-nospec Skip spectrum estimation algorithm
-max_iloop_len:N (def.: 40) Same as in the RScan program. Maximum allowed length of the interior loop between stems
-max_mloop_len:N (def.: 50) Same as in the RScan program. Maximum allowed length of the multiloop (the sum of the lengths of all its arms)
-max_xloop_len:N (def.: 50) Same as in the RScan program. Maximum allowed length of the "external" loop. An "external" loop is the run of 'x' in a structure not closed by a stem, like ...(((...)))xxxx(((...)))xx(((...)))...

Output format

Below is an example of the basic type of RInf output:

------------------------------------------------------------------------------------------------------------------------------------------------------
Summ of frequences (partition function), without energy threshold: 4.625e+001
Estimated frequence                    , without energy threshold: 3.191e+000
                                  Primary sequence consensus bits: 13.87
                                                       Shape bits: 0.00
------------------------------------------------------------------------------------------------------------------------------------------------------
Summ of frequences (partition function),   above energy threshold: 4.827e-001
Estimated frequence                    ,   above energy threshold: 9.768e-002
                                                Shape-energy bits: 3.36
------------------------------------------------------------------------------------------------------------------------------------------------------
ES/L     ES/L    ES/L   Scan.    Scan.      Scan.     Scan.  Imit.    Imit.    Linear. Imit.  Part.    Part.    Part.    Number   Most    Number  Esti
Thresh.  Expect. Stdev  Observed Estim.     Estim.    Estim. Var_#.   Mism_#.  Combin. Stat.  Func.    Func.-   Func.-   Of       Freq.   Of St.  Ave.
         ByPart. ByPart Freq.    Freq.      Freq.     Vol.   Estim.   Estim.   Var_#   Vol.            Estim.   Estim.   Possible Struct. In the  Occ.
         Func.   Func.           Norm.Distr Norm.rTail       Freq.    Freq.    Mism_#                  Freq.1   Freq.2   Forms    Freq.   Pattern Len.
------------------------------------------------------------------------------------------------------------------------------------------------------
-100.00  -0.81   0.80   8.90e-01 8.90e-001  8.90e-001   8626 1.62e+00 8.69e-01 9.71e-01 9310 4.62e+01 3.19e+00 3.19e+00 3.91e+07 1.20e-003  3    70.05
  -2.00  -0.81   0.80   8.82e-01 8.82e-001  8.82e-001   8549 1.60e+00 8.59e-01 9.59e-01 9272 4.36e+01 2.85e+00 6.05e+00 3.91e+07 1.20e-003  3    70.05
  -1.50  -0.81   0.80   8.51e-01 8.53e-001  8.51e-001   8247 1.54e+00 8.24e-01 9.20e-01 9023 3.84e+01 2.45e+00 6.19e+00 3.91e+07 1.20e-003  3    70.05
  -1.00  -0.81   0.80   7.70e-01 7.65e-001  7.67e-001   7460 1.34e+00 7.22e-01 8.04e-01 8091 2.91e+01 1.85e+00 5.31e+00 3.91e+07 1.20e-003  3    70.05
  -0.50  -0.81   0.80   5.99e-01 5.93e-001  5.96e-001   5802 9.78e-01 5.25e-01 5.84e-01 5860 1.76e+01 1.19e+00 3.52e+00 3.91e+07 1.20e-003  3    70.05
   0.00  -0.81   0.80   3.73e-01 3.68e-001  3.72e-001   3611 5.37e-01 2.89e-01 3.19e-01 3128 8.00e+00 6.22e-01 1.69e+00 3.91e+07 1.20e-003  3    70.05
   0.50  -0.81   0.80   1.67e-01 1.72e-001  1.74e-001   1618 2.07e-01 1.12e-01 1.22e-01 1111 2.69e+00 2.60e-01 5.75e-01 3.91e+07 1.20e-003  3    70.05
   1.00  -0.81   0.80   5.24e-02 5.80e-002  5.84e-002    508 5.51e-02 2.97e-02 3.21e-02  302 6.70e-01 8.55e-02 1.37e-01 3.91e+07 1.20e-003  3    70.05
   1.50  -0.81   0.80   1.62e-02 1.37e-002  1.37e-002    157 1.02e-02 5.49e-03 5.83e-03   63 1.23e-01 2.21e-02 2.32e-02 3.91e+07 1.20e-003  3    70.05
   2.00  -0.81   0.80   5.16e-03 2.22e-003  2.20e-003     50 1.35e-03 7.32e-04 7.62e-04    7 1.76e-02 4.53e-03 2.95e-03 3.91e+07 1.20e-003  3    70.05
   2.50  -0.81   0.80   5.16e-04 2.45e-004  2.39e-004      5 1.36e-04 7.39e-05 7.52e-05    1 1.94e-03 7.38e-04 2.81e-04 3.91e+07 1.20e-003  3    70.05
   3.00  -0.81   0.80   0.00e+00 1.82e-005  1.74e-005      0 1.08e-05 5.88e-06 5.86e-06    0 1.71e-04 9.64e-05 2.07e-05 3.91e+07 1.20e-003  3    70.05
   3.50  -0.81   0.80   0.00e+00 9.05e-007  8.47e-007      0 7.03e-07 3.79e-07 3.71e-07    0 1.25e-05 1.02e-05 1.24e-06 3.91e+07 1.20e-003  3    70.05
   4.00  -0.81   0.80   0.00e+00 3.01e-008  2.74e-008      0 3.70e-08 1.98e-08 1.90e-08    0 7.55e-07 8.76e-07 6.10e-08 3.91e+07 1.20e-003  3    70.05
   4.50  -0.81   0.80   0.00e+00 6.66e-010  5.87e-010      0 1.55e-09 8.21e-10 7.76e-10    0 3.87e-08 6.13e-08 2.47e-09 3.91e+07 1.20e-003  3    70.05
   5.00  -0.81   0.80   0.00e+00 9.79e-012  8.32e-012      0 5.05e-11 2.66e-11 2.46e-11    0 1.64e-09 3.50e-09 8.05e-11 3.91e+07 1.20e-003  3    70.05

The first 4 lines contain:

- The sum of the frequencies of different shapes of the pattern (partition function) without taking energy into account
- Estimate of the frequency of the pattern without taking energy into account
- Estimate of the information content of contextual constraints of the pattern
- Estimate of the information content of the structure (shape) of the pattern

The following 3 lines contain:

- The sum of the frequencies of different shapes of the pattern (partition function), taking into account the energy threshold
- Estimate of the frequency of the pattern, taking into account the energy threshold
- Estimate of the information content of the structure (shape) of the pattern, taking into account the energy threshold
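
In the sample output above, the shape-related bit values are consistent with the negative base-2 logarithm of the corresponding estimated frequency, clipped at zero. This relation is inferred from the sample numbers only and is not stated in this description; the Python sketch below, with a hypothetical function name, simply reproduces the two values.

```python
import math


def bits_from_frequency(freq):
    """Information content in bits, assuming bits = max(0, -log2(freq)).

    This formula is an assumption inferred from the sample output,
    not a documented definition.
    """
    if freq <= 0:
        raise ValueError("frequency must be positive")
    return max(0.0, -math.log2(freq))


# Estimated frequency without energy threshold (3.191e+000)
print(f"{bits_from_frequency(3.191):.2f}")
# Estimated frequency above energy threshold (9.768e-002)
print(f"{bits_from_frequency(9.768e-2):.2f}")
```

With the sample frequencies this prints 0.00 and 3.36, matching the "Shape bits" and "Shape-energy bits" lines. The "Primary sequence consensus bits" value is not covered by this sketch.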

Below follows the table:

Column 1 - ES/L threshold, where ES is the negated energy (in kcal/mol) multiplied by 10 and L is the length of the occurrence
Column 2 - value ES/L, which cuts out half of the occurrences, averaged over all shapes of the pattern
Column 3 - standard deviation of the ES/L value
Column 4 - Observed frequency by scanning algorithm
Column 5 - the same as column 4, but asymptotically approximated by the normal distribution
Column 6 - the same as column 4, but asymptotically approximated by an asymmetric distribution with a light tail
Column 7 - the volume of statistics, collected by the scanning algorithm
Column 8 - estimate of the pattern frequency using the algorithm generating pseudo-random occurrences. The estimate (still not taking energy into account) is a sum of the frequencies of all shapes of the pattern, divided by the average number of variants of the pattern placement on a fixed fragment of the sequence. The obtained value is multiplied then by the fraction of occurrences having ES/L score above the threshold from column 1
Column 9 - estimate of the pattern frequency using the algorithm generating pseudo-random occurrences. The estimate (still not taking energy into account) is the frequency of occurrence of fragments that do not require the replacement of non-canonical pairs to correspond perfectly to the pattern. The obtained value is then multiplied by the fraction of occurrences having an ES/L score above the threshold from column 1. This estimate is rougher than the one in column 8
Column 10 - a linear combination of the values from columns 8 and 9, better than either of them
Column 11 - the volume of statistics of pseudo-random occurrences
Column 12 - the sum of frequencies (partition function) of different shapes of the pattern with ES/L score exceeding the threshold from column 1
Column 13 - estimate of the frequency based on the assumed normality of the energy distribution of the occurrences. This estimate is better than in column 14 in the area of high frequencies
Column 14 - estimate of the frequency by linear regression, in which the most meaningful regressor is the value from column 12, the other regressors are other statistics of the pattern. This estimate is somewhat better than in column 13, especially in the area of lower frequencies
Column 15 - the exact number of possible pattern shapes
Column 16 - the frequency of the most frequent shape of the pattern (only the shape, without taking energy into account)
Column 17 - number of stems in the pattern
Column 18 - estimate of the average occurrence length
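
The table rows are whitespace-separated, so they are easy to read programmatically. Below is a minimal Python sketch that maps one row to the 18 columns described above. The short column names are hypothetical labels chosen for this example; they are not produced by RInf.

```python
# Hypothetical short names for columns 1..18, in order.
COLUMNS = [
    "threshold", "esl_expect", "esl_stdev",
    "scan_freq", "scan_freq_norm", "scan_freq_rtail", "scan_volume",
    "imit_var_freq", "imit_mism_freq", "imit_combined", "imit_volume",
    "part_func", "part_func_freq1", "part_func_freq2",
    "possible_forms", "top_shape_freq", "stems", "avg_occ_len",
]


def parse_row(line):
    """Split one table row into a dict of floats keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: float(value) for name, value in zip(COLUMNS, fields)}


row = parse_row(
    "-100.00  -0.81   0.80   8.90e-01 8.90e-001  8.90e-001   8626 "
    "1.62e+00 8.69e-01 9.71e-01 9310 4.62e+01 3.19e+00 3.19e+00 "
    "3.91e+07 1.20e-003  3    70.05"
)
print(row["threshold"], row["scan_volume"], row["avg_occ_len"])
```

Header and separator lines do not have 18 numeric fields, so a caller would skip any line for which parse_row raises ValueError.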

Examples of use

The work directory contains some example patterns. They are used in the examples below.

Example 1:
  $ rinf
Running the program without options prints the help message

Example 2:
  $ rinf x exa.secis.pat -o:rinf.cfg
This example shows the easiest way to launch RInf. All 3 evaluation algorithms are run: scanning, generation of pseudo-random occurrences, and regression by spectrum characteristics

Example 3:
  $ rinf x exa.secis.pat -o:rinf.cfg -noscan -noimit
Same as in the previous example, but only the regression algorithm is run

Example 4:
  $ rinf x exa.secis.pat -o:rinf.cfg -noscan -noimit -nospec
None of the algorithms that take energy into account is run. An estimate is given only for the shape of the pattern. The run completes within one second.

Example 5:
  $ rinf x exa.secis.pat -o:rinf.cfg -sec_scan:3600 -vol_scan:100000
Scan for at least 1 hour. Scanning is interrupted once a statistics volume of 100,000 occurrences is reached


b2t  description

Contents

Introduction
Running b2t
b2t options
Output format
Examples of use

Introduction

The b2t ("bracket to tree") program generates files with secondary structure patterns (PAT-files) that are accepted by the RScan program. The input of b2t is an RNA secondary structure in dot-bracket notation, which can be obtained from any RNA folding program.

Running b2t

b2t is a console application. It can be run as follows:

$ b2t -file:in.file [options]

Here

in.file is a file consisting of 2 or 3 lines. The first line contains the RNA sequence; the second line contains its dot-bracket structure; the third line can (optionally) contain the primary sequence constraints in the 15-letter (IUPAC) alphabet.

An example of the contents of the input file is shown below:

gcaugcaagccgcgggaacucccccuuggugacaaggacccgcggggccaaaagccacguucucugaaccuugcaugu
((((((((((((((((.......(((((....))))).)))))))(((.....)))..((((...)))))))))))))
.............SGSMA..........DDDD........................AC....................
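
The input requirements described above can be checked before running b2t. The Python sketch below is not part of b2t; the function name and the error messages are hypothetical. It assumes that all lines must have equal length, that the structure uses only '(', ')' and '.', and that the constraint line uses dots plus IUPAC nucleotide codes (T and U are both accepted for the same letter).

```python
# 15 IUPAC nucleotide codes; 'T' and 'U' denote the same letter.
IUPAC = set("ACGTURYSWKMBDHVN")


def check_b2t_input(text, read_cons_str=False):
    """Validate a 2- or 3-line b2t input; return (seq, struct, cons)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    needed = 3 if read_cons_str else 2
    if len(lines) < needed:
        raise ValueError(f"expected at least {needed} lines")
    seq, struct = lines[0], lines[1]
    if len(seq) != len(struct):
        raise ValueError("sequence and structure differ in length")
    depth = 0
    for ch in struct:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in structure")
        elif ch != '.':
            raise ValueError(f"unexpected structure character {ch!r}")
    if depth != 0:
        raise ValueError("unbalanced '(' in structure")
    cons = None
    if read_cons_str:
        cons = lines[2]
        if len(cons) != len(seq):
            raise ValueError("constraint line differs in length")
        bad = set(cons.upper()) - IUPAC - {'.'}
        if bad:
            raise ValueError(f"non-IUPAC characters: {sorted(bad)}")
    return seq, struct, cons


sample = "aagcgacccucgcaa\n..((((...))))..\nAASC..MC.......\n"
print(check_b2t_input(sample, read_cons_str=True))
```

The sample is a short three-line input of length 15; the check returns the three lines unchanged when they are consistent.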

b2t options

-read_cons_str Read the third line, which specifies the primary sequence requirements (in the 15-letter code). When the option is ON, the input file should look like this:

      aagcgacccucgcaa
      ..((((...))))..
      AASC..MC.......
    
otherwise like this:

      aagcgacccucgcaa
      ..((((...))))..
    
-max_stem_loop:N (def.: 2) Max allowed stem defect size (sum of left and right internal loop arms)
-min_stem_len:N (def.: 2) Lowest allowed minimum stem length (the value of 'A' in the 'len:A..B' section of a stem description). Allowed values are 0, 1, 2, 3.
-consider_lp Consider lonely pairs. By default, nucleotides of lonely pairs in input structures are dropped.

Penalties for deviation from the observed length are calculated by the formula DEV_PEN = round(A + B * L ^ C), where L is the length of the element or substructure

-st_A: (def.: -1.0) is A for stems
-st_B: (def.: 1.0) is B for stems
-st_C: (def.: 0.5) is C for stems

-hp_A: (def.: 1.0) is A for hairpin loops
-hp_B: (def.: 1.0) is B for hairpin loops
-hp_C: (def.: 0.5) is C for hairpin loops

-in_A: (def.: 1.0) is A for internal loops
-in_B: (def.: 1.0) is B for internal loops
-in_C: (def.: 0.5) is C for internal loops

-ex_A: (def.: 1.0) is A for external loops
-ex_B: (def.: 1.0) is B for external loops
-ex_C: (def.: 0.5) is C for external loops

-mu_A: (def.: 1.0) is A for multiple loops
-mu_B: (def.: 1.0) is B for multiple loops
-mu_C: (def.: 0.5) is C for multiple loops

-sp_A: (def.: 1.0) is A for a subpattern closed with a stem
-sp_B: (def.: 1.0) is B for a subpattern closed with a stem
-sp_C: (def.: 0.5) is C for a subpattern closed with a stem

-sp_randshift Select the interval of subpattern LEN randomly. A pattern produced with '-sp_randshift' may not match the original structure
-sp_sometimes Set subpattern LEN limits only for some (randomly chosen) nodes
-sp_never Do not set LEN limits for subpatterns

-tl_A: (def.: 1.0) is A for total pattern length; if 'tl_A' <= -100, the total pattern length is fixed (LEN:X..X)
-tl_B: (def.: 1.0) is B for total pattern length
-tl_C: (def.: 0.5) is C for total pattern length
-tl_randshift Select the interval of total LEN randomly. A pattern produced with '-tl_randshift' may not match the original structure

-al_A: (def.: 1.0) is A for all elements above (including total length) except those specified explicitly
-al_B: (def.: 1.0) is B for all elements above (including total length) except those specified explicitly
-al_C: (def.: 0.5) is C for all elements above (including total length) except those specified explicitly
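
The DEV_PEN formula with the default coefficients listed above can be evaluated with a few lines of Python. This is a sketch, not b2t code: the function name and the DEFAULTS table are hypothetical, and rounding half up is an assumption, since the description does not say which rounding rule 'round' uses.

```python
import math

# Default (A, B, C) per element type, copied from the option list above.
DEFAULTS = {
    "st": (-1.0, 1.0, 0.5),  # stems
    "hp": (1.0, 1.0, 0.5),   # hairpin loops
    "in": (1.0, 1.0, 0.5),   # internal loops
    "ex": (1.0, 1.0, 0.5),   # external loops
    "mu": (1.0, 1.0, 0.5),   # multiple loops
    "sp": (1.0, 1.0, 0.5),   # subpattern closed with a stem
    "tl": (1.0, 1.0, 0.5),   # total pattern length
}


def dev_pen(length, a, b, c):
    """DEV_PEN = round(A + B * L ^ C), rounding half up (an assumption)."""
    return math.floor(a + b * length ** c + 0.5)


print(dev_pen(9, *DEFAULTS["st"]))   # stem of length 9
print(dev_pen(4, *DEFAULTS["hp"]))   # hairpin loop of length 4
print(dev_pen(4, 2.0, 1.0, 0.5))     # same loop with A raised to 2
```

With the defaults, a stem of length 9 gives 2 and a hairpin loop of length 4 gives 3; raising A to 2 for the same loop gives 4.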

-relax_stem_loop:N (def.: 0) Increase 'max_stem_loop' for all stems in the output.
-mark:s Put mark string into output (in comments).

Output format

The description of the resulting PAT-file can be found on the RScan help page.

Examples of use

The work directory contains an example file with a dot-bracket secondary structure: exa.b2t.in. It is used in the following examples.

Example 1:
  $ b2t
Running the program without options prints the help message

Example 2:
  $ b2t -file:exa.b2t.in  >  exa.b2t.pat
This example shows the easiest way to run b2t. The output of the program is redirected to the file exa.b2t.pat

Example 3:
  $ b2t -file:exa.b2t.in -read_cons_str  >  exa.b2t.pat
Same as in the previous example, but the program will try to read the third line of the input file. The third line must have the same length as the sequence fragment in the first line and the dot-bracket structure in the second line. The line must consist only of dots ('.') and 15-letter alphabet characters (IUPAC). Letters will be converted into consensus requirements

Example 4:
  $ b2t -file:exa.b2t.in -max_stem_loop:0  >  exa.b2t.pat
A stricter limit on stem defects is set. For example, the '-max_stem_loop:0' option prevents two adjacent stems of three pairs each (in the input) from being merged into one stem (in the output) of six to seven pairs in the following structure: (((.(((...))))))
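
The effect of the defect limit can be illustrated with a Python sketch that groups base pairs into stems and merges neighbouring helices when the sum of the left and right unpaired arms between them does not exceed the limit. This is an illustration under that assumption, not the algorithm b2t actually uses; the function names are hypothetical, and lonely pairs and '-min_stem_len' are ignored.

```python
def pair_table(structure):
    """Return partner[i] for each position, or -1 if unpaired."""
    stack, partner = [], [-1] * len(structure)
    for pos, ch in enumerate(structure):
        if ch == '(':
            stack.append(pos)
        elif ch == ')':
            if not stack:
                raise ValueError("unbalanced ')'")
            opener = stack.pop()
            partner[opener], partner[pos] = pos, opener
    if stack:
        raise ValueError("unbalanced '('")
    return partner


def stem_sizes(structure, max_stem_loop=2):
    """Number of pairs in each stem, listed 5' to 3' by opening position."""
    partner = pair_table(structure)
    used, sizes = set(), []
    for i, j in enumerate(partner):
        if j <= i or i in used:      # unpaired, closing, or already counted
            continue
        count = 0
        while True:
            used.add(i)
            count += 1
            k = i + 1                # first paired position inside (i, j)
            while k < j and partner[k] == -1:
                k += 1
            m = j - 1                # last paired position inside (i, j)
            while m > i and partner[m] == -1:
                m -= 1
            defect = (k - i - 1) + (j - m - 1)
            if k < m and partner[k] == m and defect <= max_stem_loop:
                i, j = k, m          # extend the same stem inward
            else:
                break
        sizes.append(count)
    return sizes


print(stem_sizes("(((.(((...))))))", max_stem_loop=2))
print(stem_sizes("(((.(((...))))))", max_stem_loop=0))
```

With the default limit of 2 the structure yields a single stem of six pairs; with a limit of 0 it yields two stems of three pairs each.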

Example 5:
  $ b2t -file:exa.b2t.in -al_A:2  >  exa.b2t.pat
Allow a wider range of length than the default value for all element types.