Many programs are currently available for correcting errors in sequencing reads; they are based on different principles and optimized for different tasks. Still, all of them fall far short of the theoretical limit on improving read quality. Repeat sequences that differ by only a few nucleotides pose an especially difficult problem: such repeats are routinely a source of erroneous corrections. Since repeats comprise up to 40% of eukaryotic genomes, it is very important to improve the performance of correction algorithms on such sequences.
In this paper we describe a new error correction method based on clustering alignments of homologous reads. The method is implemented in the ReadsClean program, which is designed for correcting reads from the Illumina HiSeq sequencer. We compared ReadsClean with other read cleaning programs that several publications have recognized as the best. Our sequence assembly tests on actual and simulated sequencing reads show that ReadsClean achieves superior results.
| Program | Corrected errors | Errors remaining after correction | Missed errors | Errors replaced by other errors | Introduced errors |
|---|---|---|---|---|---|
| ReadsClean | 1,737,048 (99.9939%) | 209 (0.0120%) | 60 (0.0035%) | 46 (0.0026%) | 103 (0.0059%) |
| RACER | 1,734,832 (99.87%) | 22,644 (1.30%) | 2,175 (0.13%) | 147 (0.01%) | 20,322 (1.17%) |
| SGA | 1,733,562 (99.79%) | 4,550 (0.26%) | 3,552 (0.20%) | 40 (0.002%) | 958 (0.06%) |
| Musket | 1,731,964 (99.70%) | 5,620 (0.32%) | 5,013 (0.29%) | 177 (0.01%) | 430 (0.02%) |
| Karect | 1,736,858 (99.98%) | 784 (0.05%) | 217 (0.01%) | 79 (0.005%) | 488 (0.03%) |
Table 1. Error correction accuracy on simulated reads from the E. coli genome, initially containing 1,737,154 errors (0.936% of nucleotides). The table shows the actual numbers of errors and their percentages relative to the initial number of errors.
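The columns of Table 1 appear to follow simple bookkeeping: every initial error is either corrected, missed, or replaced by another error, and the errors remaining after correction are the sum of missed, replaced, and introduced errors. This relationship is inferred from the numbers in the table rather than stated by the authors, and the function name `summarize` is hypothetical. A minimal Python sketch that reproduces the ReadsClean row under that assumption:

```python
# Initial number of errors in the simulated E. coli reads (from the Table 1 caption).
INITIAL_ERRORS = 1_737_154


def summarize(corrected, missed, replaced, introduced, initial=INITIAL_ERRORS):
    """Derive the 'remaining' column and percentages relative to the initial errors.

    Assumption (inferred from the table, not stated in the text):
      initial   = corrected + missed + replaced
      remaining = missed + replaced + introduced
    """
    if corrected + missed + replaced != initial:
        raise ValueError("counts do not add up to the initial number of errors")
    remaining = missed + replaced + introduced

    def pct(n):
        # Percentage of the initial number of errors.
        return 100.0 * n / initial

    return {
        "remaining": remaining,
        "corrected_pct": pct(corrected),
        "remaining_pct": pct(remaining),
        "introduced_pct": pct(introduced),
    }


# Counts taken from the ReadsClean row of Table 1.
row = summarize(corrected=1_737_048, missed=60, replaced=46, introduced=103)
print(row["remaining"])                 # 209
print(f'{row["corrected_pct"]:.4f}%')   # 99.9939%
print(f'{row["remaining_pct"]:.4f}%')   # 0.0120%
```

The same check can be run on the other rows of the table; note that introduced errors count toward the remaining errors but not toward the initial total.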
| Program | Corrected errors | Errors remaining after correction | Missed errors | Errors replaced by other errors | Introduced errors |
|---|---|---|---|---|---|
| ReadsClean | 33,915,305 (99.8171%) | 138,043 (0.4063%) | 56,415 (0.1660%) | 5,734 (0.0169%) | 75,894 (0.2234%) |
| RACER | 31,242,161 (91.95%) | 22,157,432 (65.21%) | 2,588,218 (7.62%) | 147,075 (0.43%) | 19,422,139 (57.16%) |
| SGA | 31,511,057 (92.74%) | 3,294,087 (9.69%) | 2,436,166 (7.17%) | 30,231 (0.09%) | 827,690 (2.44%) |
| Musket | 30,533,489 (89.86%) | 4,151,370 (12.22%) | 3,354,877 (9.87%) | 89,088 (0.26%) | 707,405 (2.08%) |
| Karect | 32,573,973 (95.87%) | 1,676,124 (4.93%) | 1,359,854 (4.00%) | 43,627 (0.13%) | 272,643 (0.80%) |
Table 2. Error correction accuracy on simulated reads from human (Homo sapiens) chromosome 14, initially containing 33,977,454 errors (0.962% of all nucleotides).
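| Parameter | Cleaning program | SGA (with filtering) | Velvet | SOAPdenovo | SPAdes |
|---|---|---|---|---|---|
| Percent coverage | No cleaning | 70.328 | 65.937 | 75.443 | 78.512 |
| Percent coverage | ReadsClean | 71.513 | 68.253 | 76.517 | 79.140 |
| Percent coverage | RACER | 65.873 | 65.372 | 74.768 | 77.758 |
| Percent coverage | SGA | 69.679 | 66.637 | 76.079 | 78.889 |
| Percent coverage | Musket | 70.061 | 65.365 | 75.741 | 78.727 |
| Percent coverage | Coral | 69.739 | 66.723 | 73.467 | 76.224 |
| Percent coverage | Karect | 70.992 | 67.527 | 75.598 | 79.020 |
| NGA50 | No cleaning | 2,028 | 1,844 | 2,604 | 5,401 |
| NGA50 | ReadsClean | 2,304 | 2,291 | 3,248 | 10,211 |
| NGA50 | RACER | 1,434 | 1,633 | 2,490 | 5,211 |
| NGA50 | SGA | 1,971 | 1,911 | 3,118 | 9,067 |
| NGA50 | Musket | 2,064 | 1,616 | 2,892 | 8,393 |
| NGA50 | Coral | 2,082 | 1,984 | 2,545 | 5,293 |
| NGA50 | Karect | 2,271 | 2,114 | 2,980 | 10,048 |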
Table 3. Genome assembly results for actual human Chr14 reads after error correction.
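The text does not define NGA50. The sketch below assumes the definition commonly used by assembly evaluation tools: contigs are split into blocks that align to the reference, and NGA50 is the largest block length L such that blocks of length L or longer together cover at least half of the reference. The function name `nga50` and the block lengths are hypothetical illustrations, not data from this study.

```python
def nga50(aligned_block_lengths, reference_length):
    """Return NGA50 under the assumed definition, or None if it is undefined.

    aligned_block_lengths: lengths of contig blocks aligned to the reference
    reference_length: total length of the reference sequence
    """
    covered = 0
    # Walk through the blocks from longest to shortest, accumulating coverage.
    for length in sorted(aligned_block_lengths, reverse=True):
        covered += length
        if covered >= reference_length / 2:
            return length
    # The aligned blocks cover less than half of the reference.
    return None


# Toy example with made-up block lengths and a reference of length 200.
print(nga50([50, 40, 30, 20, 10], 200))  # 30
```

Under this definition a higher value means that longer correctly aligned blocks account for half of the reference, so larger NGA50 values in Table 3 indicate more contiguous assemblies.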
For licensing information, please contact softberry@softberry.com