QUICK START
DESCRIPTION
COMMANDS AND OPTIONS
CONFIGURATION PARAMETRS
PRELIMINARY DATA PROCESSING
ACCURACY
LICENSE AND CITATION
ReadsMap
is a program for high accuracy mapping of large sets of short sequencing reads.
The program can be used for genome and RNA-Seq reads alignments.
ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map "exonic" reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining "non-mapped" group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these "non-mapped" reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.
In the following example of a command line, single-end RNA-Seq reads (file reads.fa) are being mapped to chromosomes (chr1.fa,chr2.fa) of the same organism, allowing for deletions/insertions of up to 4 nucleotides. The results are being saved in subfolder ./da in sam format.
./bin/runReadsMap.pl --chr:chr1.fa,chr2.fa --reads:reads.fa --max_indel:4 --spliced --wrkpath:./da --sam
In this example of a command line, pair-ends genomic reads (file reads.fa) are being mapped to chromosomes (chr1.fa,chr2.fa) of the same organism, allowing for deletions/insertions of up to four nucleotides. Results are being saved in subfolder ./da in sam format. Average distance between external ends of a pair is 200, standard deviation of such distance 20.0.
./bin/runReadsMap.pl --chr:chr1.fa,chr2.fa --reads:reads.fa --paired --max_indel:4 --nosplice --wrkpath:./da -–sam --peAv:200 –peSd:20
Input parametrs:
--chr: - list of chromosomes (contigs), comma separated. File format – FASTA. --chr_list: - File with list of names of files with chromosomes (contigs) (alternative to previous variant). List of filenames is separated by line feed (by one filename in the line). --reads: - File with reads. File format - MULTI FASTA. --masked - invoke masking (ignore lowercase letters). --wrkpath:path - Path to keep aligning results --paired - use the information on pairness of reads. In our tests, taking pairness into account increased the specificity (proportion of correct alignments) but, in some way, decreased sensitivity (ratio of correct alignments to total number of reads). By default, this parameter is disabled. Paired reads must be in a sequential order (even/odd).
Parameters of paired reads:
--peAv:X Average distance between external ends of a pair for PE type pairs. ( default 200.0) --peSd:X Standard deviation of such distance. ( default 20.0) --sigma: Number of standard deviations to calculate maximum and minimum allowed distances. ( default 5.0)
Alignment parametrs:
--max_indel:X Set the maximal length of deletions/insertions. --covSkew:X Maximal deviation from the best coverage: alignments with quality not worse than 1-X of the best will be kept (will pass the filter).
Output options:
--cvt - Convert previously calculated *.da files into desired output format (alignment is not performed, only conversion). --sam - convert the result to SAM format. --sbl - convert the result to Sbl-like format (text format of alignment). --map - convert the result into file with mapping coordinates. --stat:fname - create file with some descriptive statistics. --sites - create *sites files for FGENESH. --j:x - number of threads to use during the alignments calculation.
In the run_align.conf file you can change the default settings for some parameters
NTHREADS 7 - number of threads to use during the alignments calculation. @MAXINDEL 4 - maximal length of isertions/deletions. @COVSKEW 0.01 - maximal deviation from best coverage. @USE_SAMTOOLS 1 - Use samtools to generate VCF file. @SAMTOOLS $BIN/external/samtools - set path to samtools ($BIN is ReadsMap folder) @BCFTOOLS $BIN/external/bcftools - set path to bcftools ($BIN is ReadsMap # This variant is valid for version 1.2 @SAMTOOLS_VCF_CMD mpileup -Bguf - command line for samtools @BCFTOOLS_VCF_CMD call -cv - - command line for bcftools
File(s) with chromosomal sequence in FASTA format and a file with read sequences in MULTIFASTA format are used as input.
Sequences of paired reads must be in order (even/odd). Preliminary data preparation can be performed using fastq2_2_fasta.pl script, as described in fastq2_2_fasta.txt file.
Table 1. Specification of sets of test reads.
MRNA-simulated reads
Potentially spliced high homology reads, no insertions/deletions, simulates errors of Illumna sequencer.
Length of reads | Number of reads | Spliced reads (crossing exon junctions) | Parameters of readss |
50bp | 2 979 624 | 492 743 (16.5%) | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 1 960 300 | 485 857 (24.8%) | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
100bp | 1 489 796 | 469 319 (33.3%) | insert size = 300 bp, standard deviation = 30 bp, coverage = 40 |
Table 2. Specification of sets of test reads.
Genome-simulated reads
Non-spliced reads with increasing frequency of mismatches (mutations) and limited number of insertions/deletions of uo tp 4 bp.
Length of reads | Number of reads | % of mutations | % InDel | Parameters of reads |
76bp | 18 363 068 | 0.5 | 0.002 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18 363 276 | 1 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18 368 502 | 2 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18361496 | 3 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18365644 | 4 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18361920 | 5 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18364062 | 6 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18369140 | 7 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18367384 | 8 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18373472 | 9 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
76bp | 18371406 | 10 | 0.02 | insert size = 200 bp, standard deviation = 20 bp, coverage = 40 |
Table 3. Combined table of accuracies of four read mapping programs, RNA-Seq reads, both spliced and non-spliced.
Sn = good_align/reads_count.
Sp = good_align/AllAlign.
50bp | 76bp | 100bp | ||||
Sp | Sn | Sp | Sn | Sp | Sn | |
ReadsMap | 0.95571 | 0.99725 | 0.96732 | 0.99759 | 0.97411 | 0.99680 |
TopHat | 0.92268 | 0.92268 | 0.94996 | 0.98643 | 0.95528 | 0.91894 |
STAR | 0.89171 | 0.94056 | 0.90403 | 0.94007 | 0.89864 | 0.93096 |
PASS | 0.89005 | 0.91547 | 0.88750 | 0.90603 | 0.86458 | 0.87765 |
Table 4. Reads without introduced mutations and insertions/deletions (unspliced reads from the previous set)
ReadsMap v 1.9.0 (version for genomic reads)
Length, bp | Quantity | Aligned
(Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec |
100 | 1 020 477 | 1020477 (1.00000) |
1020477 (1.0000) |
1020424 | 0,96978 | 0,99995 | 0,98463 | 0,98475 | 548.03 |
76 | 1 473 886 | 1473886 (1.0000) |
1530231 | 1473831 | 0,96314 | 0,99996 | 0,98120 | 0,98138 | 452.56 |
50 | 2 486 387 | 2486336 (0.99998) |
2711089 | 2485726 | 0,91687 | 0,99973 | 0,95651 | 0,95740 | 980.28 |
BWA v 0.7.12
Length, bp | Quantity | Aligned (Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec* |
100 | 1 020 477 | 1019620 (0.99916) |
1019620 | 1009030 | 0,98961 | 0,98878 | 0,98919 | 0,98919 | 347.29 |
76 | 1 473 886 | 1473210 (0.99954) |
1473210 | 1455150 | 0,98774 | 0,98729 | 0,98751 | 0,98751 | 322.25 |
50 | 2 486 387 | 2485349 (0.99958) |
2485349 | 2441869 | 0,98251 | 0,98210 | 0,98230 | 0,98230 | 297.23 |
* Time without indexing of reference genome, which took additional 65 sec.
Bowtie
Length, bp | Quantity | Aligned (Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec* |
100 | 1 020 477 | 1020244 (0.99977) |
1020244 | 1009052 | 0,98903 | 0,98880 | 0,98891 | 0,98891 | 179.26 | 76 | 1 473 886 | 1473158 (0.99951) |
1473158 | 1454956 | 0,98764 | 0,98716 | 0,98740 | 0,98740 | 156.86 | 50 | 2 486 387 | 2477883 (0.99658) |
2477883 | 2434210 | 0,98237 | 0,97901 | 0,98069 | 0,98069 | 158.74 |
Table 5. Reads with introduced mismatches (mutations) and insertions/deletions of up to four bp
ReadsMap v 1.9.0 (version for genomic reads)
% of mutations | Quantity | Aligned (Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec* |
0,5 | 18 363 068 | 18362950
(0,99999) |
19254664 | 18338766 | 0,95243 | 0.99868 | 0.97501 | 0.97528 | 13208,16 |
1 | 18 363 276 | 18234432 (0,99298) |
19120356 | 17004096 | 0,88932 | 0.92598 | 0.90728 | 0.90746 | 17901,33 |
2 | 18 368 502 | 18242134 (0,99312) |
19142304 | 14515623 | 0,75830 | 0.79025 | 0.77395 | 0.77411 | 29625,92 |
3 | 18 361 496 | 18216744 (0,99212) |
19124318 | 15170384 | 0,79325 | 0.82621 | 0.80939 | 0.80956 | 39638,09 |
4 | 18 365 644 | 18150184 (0,98827) |
19058762 | 12310677 | 0,64593 | 0.67031 | 0.65789 | 0.65801 | 45485,20 |
5 | 18 361 920 | 17985918 (0,97952) |
18886312 | 12444053 | 0,65889 | 0.67771 | 0.66817 | 0.66823 | 46113,67 |
6 | 18 364 062 | 17672136 (0,96232) |
18581674 | 11741815 | 0,63190 | 0.63939 | 0.63562 | 0.63563 | 43155,83 |
7 | 18 369 140 | 17148746 (0,93356) |
18017920 | 11163611 | 0,61958 | 0.60774 | 0.61360 | 0.61363 | 38197,16 |
8 | 18 367 384 | 16367454 (0,89112) |
17205394 | 10175198 | 0,59140 | 0.55398 | 0.57208 | 0.57238 | 31221,26 |
9 | 18 373 472 | 15330374 (0,83438) |
16132718 | 9380973 | 0,58149 | 0.51057 | 0.54373 | 0.54488 | 26240,48 |
10 | 18 371 406 | 14010072 (0,76260) |
14742026 | 8015146 | 0,54369 | 0.43628 | 0.48410 | 0.48703 | 23297,59 |
BWA v 0.7.12
% of mutations | Quantity | Aligned (Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec* |
0,5 | 18 363 068 | 18362950 (0,99999) |
18277290 | 17836216 | 0.97587 | 0.97131 | 0.97358 | 0.97359 | 1861.88 |
1 | 18 363 276 | 18234432 (0,99298) |
16352253 | 16352253 | 0.91169 | 0.89049 | 0.90097 | 0.90103 | 2193.43 |
2 | 18 368 502 | 18242134 (0,99312) |
16981036 | 16981036 | 0.77650 | 0.71785 | 0.74602 | 0.74660 | 2715.63 |
3 | 18 361 496 | 18216744 (0,99212) |
15184617 | 12332850 | 0.81219 | 0.67167 | 0.73528 | 0.73860 | 2885.90 |
4 | 18 365 644 | 18150184 (0,98827) |
12798496 | 8450889 | 0.66030 | 0.46015 | 0.54235 | 0.55121 | 2808.80 |
5 | 18 361 920 | 17985918 (0,97952) |
10210712 | 6876296 | 0.67344 | 0.37449 | 0.48132 | 0.50219 | 2744.36 |
6 | 18 364 062 | 17672136 (0,96232) |
7757125 | 5013381 | 0.64629 | 0.27300 | 0.38386 | 0.42004 | 2455.39 |
7 | 18 369 140 | 17148746 (0,93356) |
5650432 | 3577670 | 0.63317 | 0.19477 | 0.29790 | 0.35117 | 2117.16 |
8 | 18 367 384 | 16367454 (0,89112) |
3960788 | 2389041 | 0.60317 | 0.13007 | 0.21399 | 0.28010 | 1748.40 |
9 | 18 373 472 | 15330374 (0,83438) |
2697972 | 1602255 | 0.59387 | 0.08720 | 0.15207 | 0.22756 | 1455.65 |
10 | 18 371 406 | 14010072 (0,76260) |
1783120 | 990818 | 0.55567 | 0.05393 | 0.09832 | 0.17311 | 1222.717 |
* Time without indexing of reference genome, which took additional 65 sec.
Bowtie
% of mutations | Quantity | Aligned (Percent) |
Align N | Good | Sp | Sn | F1-Score | G-Measure | Time,sec* |
0,5 | 18 363 068 | 18362950 (0,99999) |
18258558 | 17769803 | 0.97323 | 0.96769 | 0.97045 | 0.97046 | 2449.11 |
1 | 18 363 276 | 18234432 (0,99298) |
18054372 | 16276544 | 0.88636 | 0.88636 | 0.88636 | 0.88636 | 2569.75 |
2 | 18 368 502 | 18242134 (0,99312) |
17355459 | 13272528 | 0.76475 | 0.72257 | 0.74306 | 0.74336 | 2543.34 |
3 | 18 361 496 | 18216744 (0,99212) |
16230075 | 12951743 | 0.79801 | 0.70538 | 0.74884 | 0.75027 | 2446.93 |
4 | 18 365 644 | 18150184 (0,98827) |
9582292 | 9582292 | 0.64740 | 0.52175 | 0.57782 | 0.58119 | 2277.48 |
5 | 18 361 920 | 17985918 (0,97952) |
13129087 | 8662673 | 0.65981 | 0.47177 | 0.55017 | 0.55792 | 2102.27 |
6 | 18 364 062 | 17672136 (0,96232) |
11373687 | 7194946 | 0.63260 | 0.39179 | 0.48389 | 0.49784 | 1854.86 |
7 | 18 369 140 | 17148746 (0,93356) |
9596446 | 5946099 | 0.61961 | 0.32370 | 0.42524 | 0.44785 | 1701.03 |
8 | 18 367 384 | 16367454 (0,89112) |
7885925 | 4663468 | 0.59137 | 0.25390 | 0.35527 | 0.38749 | 1528.09 |
9 | 18 373 472 | 15330374 (0,83438) |
6337211 | 3684500 | 0.58141 | 0.20053 | 0.29821 | 0.34145 | 1350.17 |
10 | 18 371 406 | 14010072 (0,76260) |
4935904 | 2681500 | 0.54326 | 0.14596 | 0.23010 | 0.28159 | 1199.94 |
ReadsMap is a free for academic usage. Please contact us at softberry@softberry.com for commercial license.