SoftBerry - readsmap-i

ReadsMap: Program For Mapping Short Sequencing Reads

QUICK START
DESCRIPTION
COMMANDS AND OPTIONS
CONFIGURATION PARAMETERS
PRELIMINARY DATA PROCESSING
MAPPING ACCURACY
LICENSE AND CITATION

DESCRIPTION

ReadsMap is a program for high-accuracy mapping of large sets of short sequencing reads. It can be used to align both genomic and RNA-Seq reads.
ReadsMap is a fast short-read aligner that maps large sets of short DNA sequences. Multiple processors can optionally be used to achieve greater alignment speed. At the initial stage, we map "exonic" reads that show a high-quality, uninterrupted alignment to the genomic sequence. This step is expected to map most of the reads to the genome, leaving a group of "non-mapped" reads small enough to be subjected to a more thorough analysis. At the second stage, we use a modified variant of our EST_MAP program to align these "non-mapped" reads using splice site matrices, producing very accurate gapped alignments. These reads identify potential exon-intron boundaries.

QUICK START

In the following example command line, single-end RNA-Seq reads (file reads.fa) are mapped to chromosomes (chr1.fa, chr2.fa) of the same organism, allowing deletions/insertions of up to 4 nucleotides. The results are saved in the subfolder ./da in SAM format.

 ./bin/runReadsMap.pl --chr:chr1.fa,chr2.fa  --reads:reads.fa --max_indel:4 --spliced --wrkpath:./da --sam

In this example command line, paired-end genomic reads (file reads.fa) are mapped to chromosomes (chr1.fa, chr2.fa) of the same organism, allowing deletions/insertions of up to four nucleotides. The results are saved in the subfolder ./da in SAM format. The average distance between the external ends of a pair is 200, and the standard deviation of this distance is 20.0.

./bin/runReadsMap.pl --chr:chr1.fa,chr2.fa  --reads:reads.fa --paired --max_indel:4 --nosplice --wrkpath:./da --sam --peAv:200 --peSd:20
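
When ReadsMap is launched from a script, the two command lines above can be assembled programmatically. The following is only a sketch: the helper name build_readsmap_command and its arguments are hypothetical, and it uses only the options that appear in the examples of this section.

```python
def build_readsmap_command(chromosomes, reads, workdir, max_indel=4,
                           spliced=False, paired=False,
                           pe_av=None, pe_sd=None):
    """Assemble an argument list for runReadsMap.pl (hypothetical helper).

    Options follow the "--name:value" syntax used in the examples above.
    """
    cmd = [
        "./bin/runReadsMap.pl",
        "--chr:" + ",".join(chromosomes),   # comma-separated FASTA files
        "--reads:" + reads,
    ]
    if paired:
        cmd.append("--paired")
    cmd.append(f"--max_indel:{max_indel}")
    # The examples use --spliced for RNA-Seq reads and --nosplice for genomic reads.
    cmd.append("--spliced" if spliced else "--nosplice")
    cmd.append("--wrkpath:" + workdir)
    cmd.append("--sam")
    if paired and pe_av is not None:
        cmd.append(f"--peAv:{pe_av}")
    if paired and pe_sd is not None:
        cmd.append(f"--peSd:{pe_sd}")
    return cmd


# Paired-end genomic example from this section.
command = build_readsmap_command(["chr1.fa", "chr2.fa"], "reads.fa", "./da",
                                 paired=True, pe_av=200, pe_sd=20)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the ReadsMap folder.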

COMMANDS AND OPTIONS

Input parameters:

--chr: - list of chromosomes (contigs), comma separated. File format – FASTA.

 --chr_list: - File containing the names of the chromosome (contig) files, as an alternative to --chr:.
The filenames are separated by line feeds, one filename per line.

 --reads: - File with reads. File format - MULTI FASTA.
 --masked - invoke masking (ignore lowercase letters).

 --wrkpath:path - Path to the folder where alignment results are kept.

 --paired - use the pairing information of the reads. In our tests, taking pairing into account increased specificity (the proportion of correct alignments among all alignments) but somewhat decreased sensitivity (the ratio of correct alignments to the total number of reads). By default, this parameter is disabled.
 Paired reads must be in sequential order (even/odd).

Parameters of paired reads:

 --peAv:X          Average distance between external ends of a pair for PE type pairs (default 200.0).

 --peSd:X          Standard deviation of this distance (default 20.0).

 --sigma:X         Number of standard deviations used to calculate the maximum and minimum allowed distances (default 5.0).
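
The three parameters above presumably combine into a window of allowed pair distances. The sketch below assumes the window is symmetric, peAv ± sigma × peSd; this formula is an interpretation of the option descriptions, not a confirmed description of ReadsMap internals, and the function name is hypothetical.

```python
def allowed_distance_range(pe_av=200.0, pe_sd=20.0, sigma=5.0):
    """Return (minimum, maximum) allowed distance between the external ends of a pair.

    Assumption: the bounds are pe_av - sigma * pe_sd and pe_av + sigma * pe_sd.
    """
    half_width = sigma * pe_sd
    return pe_av - half_width, pe_av + half_width


# With the documented defaults the assumed window is 100 to 300.
low, high = allowed_distance_range()
print(low, high)
```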

Alignment parameters:

 --max_indel:X Set the maximal length of deletions/insertions.

 --covSkew:X  Maximal deviation from the best coverage: alignments with quality not worse than 1-X of the best will be kept (will pass the filter).
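
The --covSkew filter can be illustrated with a small sketch. It assumes each candidate alignment of a read carries a single numeric quality (coverage) value and that "not worse than 1-X of the best" means quality >= (1 - X) × best; the data layout and the function name are hypothetical.

```python
def filter_by_cov_skew(alignments, cov_skew=0.01):
    """Keep alignments whose quality is at least (1 - cov_skew) of the best one.

    alignments: list of (name, quality) tuples for one read (assumed layout).
    """
    if not alignments:
        return []
    best = max(quality for _, quality in alignments)
    threshold = (1.0 - cov_skew) * best
    return [(name, quality) for name, quality in alignments if quality >= threshold]


candidates = [("chr1:1000", 1.0), ("chr2:5000", 0.995), ("chr1:9000", 0.98)]
# With X = 0.01 the threshold is 0.99, so the third candidate is dropped.
print(filter_by_cov_skew(candidates, cov_skew=0.01))
```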

Output options:

 --cvt - Convert previously calculated *.da files into desired output format (alignment is not performed, only conversion).

 --sam  - convert the result to SAM format.

 --sbl  - convert the result to Sbl-like format (text format of alignment).

 --map  - convert the result into file with mapping coordinates.

 --stat:fname - create file with some descriptive statistics.

 --sites   - create  *sites files for FGENESH.

 --j:x    - number of threads to use during the alignments calculation.

CONFIGURATION PARAMETERS

In the run_align.conf file you can change the default settings for some parameters:

NTHREADS    7  - number of threads to use during the alignments calculation.
@MAXINDEL    4 - maximal length of insertions/deletions.
@COVSKEW  0.01 - maximal deviation from best coverage.

@USE_SAMTOOLS 1   - Use samtools to generate VCF file.
@SAMTOOLS $BIN/external/samtools   - set path to samtools ($BIN is ReadsMap folder)
@BCFTOOLS $BIN/external/bcftools  - set path to bcftools ($BIN is ReadsMap folder)

# This variant is valid for version 1.2
@SAMTOOLS_VCF_CMD  mpileup -Bguf    - command line for samtools
@BCFTOOLS_VCF_CMD  call -cv -       - command line for bcftools

PRELIMINARY DATA PROCESSING

File(s) with chromosomal sequence in FASTA format and a file with read sequences in MULTIFASTA format are used as input.
Sequences of paired reads must be in order (even/odd). Preliminary data preparation can be performed using fastq2_2_fasta.pl script, as described in fastq2_2_fasta.txt file.
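
If the mates of paired reads are stored in two separate FASTA files, they have to be merged so that the two reads of each pair follow each other. The sketch below is not a replacement for fastq2_2_fasta.pl; it assumes that "even/odd" order means mate 1 immediately followed by mate 2, and that both inputs list the pairs in the same order. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def interleave_mates(mates1, mates2):
    """Return FASTA lines with mate 1 and mate 2 of each pair on consecutive records."""
    first = list(read_fasta(mates1))
    second = list(read_fasta(mates2))
    if len(first) != len(second):
        raise ValueError("both inputs must contain the same number of reads")
    out = []
    for (h1, s1), (h2, s2) in zip(first, second):
        out.extend([">" + h1, s1, ">" + h2, s2])
    return out


merged = interleave_mates([">r1/1", "ACGT", ">r2/1", "GGCC"],
                          [">r1/2", "TTAA", ">r2/2", "CCGG"])
print("\n".join(merged))
```

With real data, the two arguments can be open file objects, and the returned lines can be written to the file passed to --reads:.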

MAPPING ACCURACY

Table 1. Specification of sets of test reads.
MRNA-simulated reads
Potentially spliced high-homology reads with no insertions/deletions, simulating the errors of an Illumina sequencer.

Length of reads	Number of reads	Spliced reads (crossing exon junctions)	Parameters of reads
50bp	2 979 624	492 743 (16.5%)	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	1 960 300	485 857 (24.8%)	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
100bp	1 489 796	469 319 (33.3%)	insert size = 300 bp, standard deviation = 30 bp, coverage = 40

Table 2. Specification of sets of test reads.
Genome-simulated reads
Non-spliced reads with increasing frequency of mismatches (mutations) and a limited number of insertions/deletions of up to 4 bp.

Length of reads	Number of reads	% of mutations	% InDel	Parameters of reads
76bp	18 363 068	0.5	0.002	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 363 276	1	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 368 502	2	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 361 496	3	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 365 644	4	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 361 920	5	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 364 062	6	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 369 140	7	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 367 384	8	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 373 472	9	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40
76bp	18 371 406	10	0.02	insert size = 200 bp, standard deviation = 20 bp, coverage = 40

Table 3. Combined table of accuracies of four read mapping programs on RNA-Seq reads, both spliced and non-spliced.
Sn (sensitivity) = good_align/reads_count.
Sp (specificity) = good_align/AllAlign.

	50bp		76bp		100bp
	Sp	Sn	Sp	Sn	Sp	Sn
ReadsMap	0.95571	0.99725	0.96732	0.99759	0.97411	0.99680
TopHat	0.92268	0.92268	0.94996	0.98643	0.95528	0.91894
STAR	0.89171	0.94056	0.90403	0.94007	0.89864	0.93096
PASS	0.89005	0.91547	0.88750	0.90603	0.86458	0.87765

Table 4. Reads without introduced mutations and insertions/deletions (unspliced reads from the previous set)

ReadsMap v 1.9.0 (version for genomic reads)

Length, bp	Quantity	Aligned (Percent)	Align N	Good	Sp	Sn	F1-Score	G-Measure	Time,sec
100	1 020 477	1020477 (1.00000)	1020477 (1.0000)	1020424	0.96978	0.99995	0.98463	0.98475	548.03
76	1 473 886	1473886 (1.0000)	1530231	1473831	0.96314	0.99996	0.98120	0.98138	452.56
50	2 486 387	2486336 (0.99998)	2711089	2485726	0.91687	0.99973	0.95651	0.95740	980.28

BWA v 0.7.12

Length, bp	Quantity	Aligned (Percent)	Align N	Good	Sp	Sn	F1-Score	G-Measure	Time,sec*
100	1 020 477	1019620 (0.99916)	1019620	1009030	0.98961	0.98878	0.98919	0.98919	347.29
76	1 473 886	1473210 (0.99954)	1473210	1455150	0.98774	0.98729	0.98751	0.98751	322.25
50	2 486 387	2485349 (0.99958)	2485349	2441869	0.98251	0.98210	0.98230	0.98230	297.23

* Time without indexing of reference genome, which took additional 65 sec.

Bowtie

Length, bp	Quantity	Aligned (Percent)	Align N	Good	Sp	Sn	F1-Score	G-Measure	Time,sec*
100	1 020 477	1020244 (0.99977)	1020244	1009052	0.98903	0.98880	0.98891	0.98891	179.26
76	1 473 886	1473158 (0.99951)	1473158	1454956	0.98764	0.98716	0.98740	0.98740	156.86
50	2 486 387	2477883 (0.99658)	2477883	2434210	0.98237	0.97901	0.98069	0.98069	158.74
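
The Sp and Sn columns of Table 4 can be recomputed from the count columns using the definitions given with Table 3, taking Good as good_align, Align N as AllAlign and Quantity as reads_count. The sketch below additionally assumes that F1-Score is the harmonic mean and G-Measure the geometric mean of Sp and Sn; these are the usual definitions, but this document does not state them explicitly. The function name is hypothetical.

```python
import math


def mapping_metrics(good, align_n, quantity):
    """Return (Sp, Sn, F1, G) from the count columns of Table 4."""
    sp = good / align_n                # Sp = good_align / AllAlign
    sn = good / quantity               # Sn = good_align / reads_count
    f1 = 2.0 * sp * sn / (sp + sn)     # assumed: harmonic mean of Sp and Sn
    g = math.sqrt(sp * sn)             # assumed: geometric mean of Sp and Sn
    return sp, sn, f1, g


# BWA, 100 bp row: Good = 1009030, Align N = 1019620, Quantity = 1020477.
sp, sn, f1, g = mapping_metrics(1009030, 1019620, 1020477)
print(f"Sp={sp:.4f} Sn={sn:.4f} F1={f1:.4f} G={g:.4f}")
```

For this row the computed values agree with the tabulated ones to about four decimal places; the last digit can differ depending on how the published values were rounded.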

Table 5. Reads with introduced mismatches (mutations) and insertions/deletions of up to four bp