ReadsClean: a new approach to error correction of sequencing reads based on alignment clustering.

Multiple programs are currently available for correcting errors in sequencing reads; they rest on different principles and are optimized for different tasks. Still, all of them fall far short of the theoretical limit of read quality improvement. Repeat sequences that differ by only a few nucleotides pose an especially difficult problem: such repeats are a routine source of erroneous corrections. Since repeats make up as much as 40% of eukaryotic genomes, improving the performance of correction algorithms on such sequences is very important.

In this paper we describe a new error correction method based on clustering alignments of homologous reads. The method is implemented in the ReadsClean program, which is designed to correct reads from the Illumina HiSeq sequencer. We compared ReadsClean with other read-cleaning programs that several publications recognize as the best. Our sequence assembly tests on actual and simulated sequencing reads show that ReadsClean achieves superior results.
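The general idea of correcting reads within clusters, rather than across the whole pile of aligned reads, can be illustrated with a toy sketch. It is not ReadsClean's actual algorithm, which this text does not detail. The function names, the `min_support` threshold, and the assumption that all reads are already aligned to the same window without gaps are hypothetical choices made only for illustration.

```python
from collections import Counter, defaultdict

def consensus(reads):
    """Column-wise majority base over equally long, gap-free aligned reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def variant_columns(reads, min_support=2):
    """Columns where the second most common base is backed by several reads.

    Assumption: random sequencing errors rarely agree with each other, so a
    well-supported minority base marks a true difference between repeat copies.
    """
    cols = []
    for i, col in enumerate(zip(*reads)):
        top = Counter(col).most_common(2)
        if len(top) == 2 and top[1][1] >= min_support:
            cols.append(i)
    return cols

def correct_by_clusters(reads, min_support=2):
    """Group reads by their bases at variant columns, then correct each read
    to the consensus of its own group (hypothetical toy procedure)."""
    cols = variant_columns(reads, min_support)
    clusters = defaultdict(list)
    for r in reads:
        clusters[tuple(r[i] for i in cols)].append(r)
    group_consensus = {key: consensus(members) for key, members in clusters.items()}
    return [group_consensus[tuple(r[i] for i in cols)] for r in reads]

# Two repeat copies that differ at position 4 (A vs. T), each read with
# at most one simulated sequencing error.
reads = [
    "ACGTACGTAC",  # copy A
    "ACGTACGTAC",  # copy A
    "ACGAACGTAC",  # copy A, error at position 3
    "ACGTACGTAC",  # copy A
    "ACGTTCGTAC",  # copy B
    "ACGTTCGTAC",  # copy B
    "ACGTTCGTAG",  # copy B, error at position 9
]

naive = [consensus(reads)] * len(reads)   # one consensus for the whole pile
clustered = correct_by_clusters(reads)    # one consensus per cluster
```

In this toy data the single global consensus rewrites the three copy B reads into copy A, which is the kind of erroneous correction on near-identical repeats described above, while the per-cluster consensus fixes both sequencing errors and keeps the two copies distinct. A sequencing error that falls exactly on a variant column would still place a read in the wrong cluster, so the sketch is an illustration of the principle only.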

| Program | Corrected errors | Errors remaining after correction | Missed errors | Errors replaced by other errors | Introduced errors |
|---|---|---|---|---|---|
| ReadsClean | 1,737,048 (99.9939%) | 209 (0.0120%) | 0.0035% | 0.0026% | 103 (0.0059%) |
| RACER | 1,734,832 (99.87%) | 22,644 (1.30%) | 2,175 (0.13%) | 147 (0.01%) | 20,322 (1.17%) |
| SGA | 1,733,562 (99.79%) | 4,550 (0.26%) | 3,552 (0.20%) | 0.002% | 958 (0.06%) |
| Musket | 1,731,964 (99.70%) | 5,620 (0.32%) | 5,013 (0.29%) | 177 (0.01%) | 430 (0.02%) |
| Karect | 1,736,858 (99.98%) | 784 (0.05%) | 217 (0.01%) | 0.005% | 488 (0.03%) |

Table 1. Error correction accuracy on simulated reads from the E. coli genome, initially containing 1,737,154 errors (0.936% of nucleotides). The table shows actual numbers and, in parentheses, the percentage of initial errors.
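The error categories in the table can be made concrete with a short sketch. The per-base definitions below are an assumption inferred from the column names, not taken from the ReadsClean documentation, and `classify_errors` is a hypothetical helper. They are at least consistent with the reported counts: in the RACER row, missed + replaced + introduced = 2,175 + 147 + 20,322 = 22,644 remaining errors, and corrected + missed + replaced = 1,734,832 + 2,175 + 147 = 1,737,154 initial errors.

```python
def classify_errors(truth, raw, corrected):
    """Count error categories per base for one read.

    truth, raw and corrected are equally long strings: the error-free
    sequence, the simulated read, and the read after correction.
    The category definitions are assumptions inferred from the table headers.
    """
    counts = {"corrected": 0, "missed": 0, "replaced": 0, "introduced": 0}
    for t, r, c in zip(truth, raw, corrected):
        if r != t:                       # the raw read had an error here
            if c == t:
                counts["corrected"] += 1     # error fixed
            elif c == r:
                counts["missed"] += 1        # error left untouched
            else:
                counts["replaced"] += 1      # error changed to another wrong base
        elif c != t:
            counts["introduced"] += 1        # correct base made wrong
    counts["initial"] = counts["corrected"] + counts["missed"] + counts["replaced"]
    counts["remaining"] = counts["missed"] + counts["replaced"] + counts["introduced"]
    return counts

truth     = "ACGTACGTAC"
raw       = "ACGAACCTAG"   # errors at positions 3, 6 and 9
corrected = "CCGTACCTAT"   # 3 fixed, 6 missed, 9 replaced, 0 newly introduced
result = classify_errors(truth, raw, corrected)
```

Dividing each count by `result["initial"]` gives the percentages of initial errors that the table reports.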

| Program | Corrected errors | Errors remaining after correction | Missed errors | Errors replaced by other errors | Introduced errors |
|---|---|---|---|---|---|
| ReadsClean | 33,915,305 (99.8171%) | 138,043 (0.4063%) | 56,415 (0.1660%) | 5,734 (0.0169%) | 75,894 (0.2234%) |
| RACER | 31,242,161 (91.95%) | 22,157,432 (65.21%) | 2,588,218 (7.62%) | 147,075 (0.43%) | 19,422,139 (57.16%) |
| SGA | 31,511,057 (92.74%) | 3,294,087 (9.69%) | 2,436,166 (7.17%) | 30,231 (0.09%) | 827,690 (2.44%) |
| Musket | 30,533,489 (89.86%) | 4,151,370 (12.22%) | 3,354,877 (9.87%) | 89,088 (0.26%) | 707,405 (2.08%) |
| Karect | 32,573,973 (95.87%) | 1,676,124 (4.93%) | 1,359,854 (4.00%) | 43,627 (0.13%) | 272,643 (0.80%) |

Table 2. Error correction accuracy on simulated reads from human (Homo sapiens) chromosome 14, initially containing 33,977,454 errors (0.962% of all nucleotides). The table shows actual numbers and, in parentheses, the percentage of initial errors.

| Parameter | Cleaning program | SGA (with filtering) | Velvet | SOAPdenovo | SPAdes |
|---|---|---|---|---|---|
| Percent coverage | No cleaning | 70.328 | 65.937 | 75.443 | 78.512 |
| | ReadsClean | 71.513 | 68.253 | 76.517 | 79.140 |
| | RACER | 65.873 | 65.372 | 74.768 | 77.758 |
| | SGA | 69.679 | 66.637 | 76.079 | 78.889 |
| | Musket | 70.061 | 65.365 | 75.741 | 78.727 |
| | Coral | 69.739 | 66.723 | 73.467 | 76.224 |
| | Karect | 70.992 | 67.527 | 75.598 | 79.020 |
| NGA50 | No cleaning | 2,028 | 1,844 | 2,604 | 5,401 |
| | ReadsClean | 2,304 | 2,291 | 3,248 | 10,211 |
| | RACER | 1,434 | 1,633 | 2,490 | 5,211 |
| | SGA | 1,971 | 1,911 | 3,118 | 9,067 |
| | Musket | 2,064 | 1,616 | 2,892 | 8,393 |
| | Coral | 2,082 | 1,984 | 2,545 | 5,293 |
| | Karect | 2,271 | 2,114 | 2,980 | 10,048 |

Table 3. Genome assembly results for actual human Chr14 reads after error correction. Columns list the assemblers used.
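For readers unfamiliar with the NGA50 metric, the sketch below shows the NG50-style calculation behind it. NGA50 is commonly defined (for example in QUAST) as NG50 computed over aligned blocks, that is, contigs split at misassemblies with unaligned parts removed, rather than over raw contig lengths. The text does not state which tool produced the values in the table, so the function name `ng50` and the input lengths are illustrative assumptions, and the alignment step that yields the blocks is not shown.

```python
def ng50(block_lengths, genome_length):
    """Length L such that blocks of length >= L cover at least half of the
    reference genome. Returns None if the blocks cover less than half.

    Passing aligned block lengths instead of contig lengths gives an
    NGA50-style value.
    """
    half = genome_length / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None

blocks = [50, 40, 30, 20, 10]          # hypothetical aligned block lengths
value = ng50(blocks, genome_length=200)    # 50 + 40 + 30 = 120 >= 100
too_short = ng50(blocks, genome_length=400)  # total 150 < 200
```

Because the threshold is tied to the reference length and not to the assembly size, values obtained with different assemblers and cleaning programs on the same reference are directly comparable.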

For licensing information, please contact softberry@softberry.com