
OligoZip

OligoZip: A Tool for Reconstructing Sequences Using Next-Generation Sequencing Data

The Softberry OligoZip tool processes short reads from next-generation sequencing machines, such as Solexa/Illumina and similar, and provides effective solutions to the following tasks:
  • De novo reconstruction of genomic sequences;
  • Reconstruction of sequences based on a reference genome from the same or a closely related species;
  • Mutation profiling and SNP discovery in a given set of genes;
  • Analysis of transcriptome sequence data, with estimates of gene expression levels and identification of the gene structure of expressed splice variants.

OligoZip general algorithm

This section describes the algorithm for ab initio genome assembly from data produced by next-generation sequencing machines (Illumina/Solexa, etc.). In this description, a group of assembled reads is called a "cluster", unused reads are called "free reads", and the set of unused reads is called the "free reads pool". The algorithm begins with an empty cluster.
1) First, the cluster accepts the top read from the free reads pool {P} (which at the beginning of the calculation contains all reads). The consensus of this single-read cluster coincides with the sequence of the read.
2) Next, a hashing technique is used to construct a subset of the free pool consisting of reads that share some similarity with the consensus.
3) For each read of this subset, an attempt is made to add it to the cluster. A read is included only if its alignment to the consensus meets certain criteria.
4) After the second (and every subsequent) read is added to the cluster, the cluster's consensus is recalculated. Reads that entered the cluster are removed from the free pool.
This process (1 to 4) iterates as long as the free pool contains reads similar enough to the consensus to be included in the current cluster. The cluster consensus is then output as a contig sequence.
The next clusters (and contigs) are assembled in the same way. They are formed from the pool of reads that remained unused after the previous clusters were assembled.
The process stops when either no free reads remain or all remaining free reads have already been tested as cluster-initiating reads.
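
The loop described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the OligoZip implementation: reads are error-free and all on the same strand, the "certain criteria" of step 3 are replaced by an exact-match overlap of at least min_overlap bases, and the consensus is simply extended rather than recomputed by voting. The names kmers, try_merge, assemble, k and min_overlap are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) for every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def try_merge(consensus, read, shift, min_overlap):
    """Place read at offset `shift` relative to the consensus start.
    Return the extended consensus if the overlap is long enough and
    matches exactly (an assumed acceptance criterion), else None."""
    start = max(0, shift)
    end = min(len(consensus), shift + len(read))
    if end - start < min_overlap:
        return None
    if consensus[start:end] != read[start - shift:end - shift]:
        return None
    return read[:max(0, -shift)] + consensus + read[end - shift:]


def assemble(reads, k=8, min_overlap=12):
    """Greedy cluster-by-cluster assembly; returns a list of contigs."""
    index = defaultdict(list)              # hash: k-mer -> [(read id, offset)]
    for rid, read in enumerate(reads):
        for off, kmer in kmers(read, k):
            index[kmer].append((rid, off))

    free = set(range(len(reads)))          # the free reads pool
    contigs = []
    for seed in range(len(reads)):         # each free read may start a cluster
        if seed not in free:
            continue
        free.discard(seed)
        consensus = reads[seed]            # step 1: single-read cluster
        grew = True
        while grew:                        # repeat steps 2-4 while the cluster grows
            grew = False
            candidates = set()             # step 2: reads sharing a k-mer
            for c_off, kmer in kmers(consensus, k):
                for rid, r_off in index.get(kmer, ()):
                    if rid in free:
                        candidates.add((rid, c_off - r_off))
            left_added = 0                 # bases prepended during this pass
            for rid, shift in sorted(candidates):   # step 3: try each read
                if rid not in free:
                    continue
                placed = shift + left_added
                merged = try_merge(consensus, reads[rid], placed, min_overlap)
                if merged is None:
                    continue
                left_added += max(0, -placed)
                consensus = merged         # step 4: update consensus
                free.discard(rid)
                grew = True
        contigs.append(consensus)          # cluster consensus becomes a contig
    return contigs
```

Calling assemble(reads) on overlapping, error-free fragments of a repeat-free sequence should return that sequence as a single contig. Real data would additionally need reverse-complement handling, mismatch-tolerant alignment and a voting consensus.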

The clustering algorithm produces sequence contigs either ab initio or using a reference genome (see the algorithm scheme).

When using OligoZip, please cite:

Vorobyev D., Seledtsov I., Solovyev V. De novo assembling next generation sequences. http://linux5.softberry.com/cgi-bin/berry/programs/OligoZip

The set of programs of ab initio assembly:

Testing the system

(1) De novo sequence reconstruction was tested by assembling several phage and bacterial genomes and was shown to have superior clustering power compared to an earlier published approach (Bioinformatics, 2007, 23(4):500-501). Simulated error-free 25-mers of bacteriophage φX174 and of coronavirus SARS TOR2 were assembled perfectly, and on the Haemophilus influenzae genome, contigs assembled by OligoZip were almost twice as long as those assembled by the published SSAKE software.

(2) To test reconstruction of bacterial sequences using a reference genome, we assembled the genomic sequence of Methanopyrus kandleri TAG11 on the known genome of Methanopyrus kandleri AV19. Solexa reads, about 6 million each for AV19 and TAG11, were produced by the sequencing lab of the Harvard Partners HealthCare Center for Genetics and Genomics. The AV19 genome itself was assembled perfectly, with one extra contig that turned out to be the genome of phage φX174. TAG11 reads were assembled into several hundred contigs. Similar results were achieved on five other Methanopyrus strains. The following link shows the alignment of a fragment of one of the assembled contigs to the reference genome in the Softberry Genome Comparison Viewer.

Annotation of the aligned parts of the reference and assembled genomes by the automatic FGENESB pipeline produces similar results (see link), indicating that no ORF distortions such as frameshifts or premature termination codons were introduced during sequence assembly.

(3) As an example of SNP finding using Solexa reads, we took reads from a population sample of a fragment of the human EPS8 gene with a known C->T substitution, marked with an asterisk on the linked sequence. Approximately 40% of the 268 reads mapped to this fragment support the occurrence of the C->T SNP. This demonstrates the feasibility of using OligoZip for SNP discovery on Solexa data.
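
The "fraction of reads supporting the substitution" in example (3) can be illustrated with a minimal Python sketch. It assumes reads have already been placed on the reference without gaps and that each read is given as a (start, sequence) pair with a 0-based start; the function name allele_fractions and the toy reads are made up for illustration and are not EPS8 data or OligoZip output.

```python
from collections import Counter


def allele_fractions(aligned_reads, site):
    """Return {base: fraction} among the reads that cover `site`.

    aligned_reads: iterable of (start, sequence), start being the 0-based
    reference position of the first base (gap-free placement assumed).
    """
    counts = Counter()
    for start, seq in aligned_reads:
        if start <= site < start + len(seq):
            counts[seq[site - start]] += 1
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


# Toy reference GATTACAGCT with a C at position 5; two of the five
# covering reads carry T at that position (synthetic data).
toy_reads = [
    (0, "GATTACAG"),
    (2, "TTATAGCT"),
    (3, "TACAGC"),
    (4, "ATAGCT"),
    (5, "CAGCT"),
    (6, "AGCT"),   # does not cover position 5
]
fractions = allele_fractions(toy_reads, site=5)
```

A site where a non-reference base reaches a substantial fraction of the covering reads, as T does in the toy data, would be a SNP candidate; any threshold for "substantial" is a choice the source does not specify.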

Tests of OligoZip assembly on Arabidopsis Chromosome 1.

The sequence of Arabidopsis Chr1 (about 30 Mb) was randomly fragmented into 350-bp reads. To test how coverage and sequence length affect assembly speed and quality, we started with a 5-Mb chromosome fragment at low coverage and then increased both the coverage and the length of the fragment.

Speed was measured on a single 2.2-GHz dual-core processor running Linux (a fairly standard computer). A computer farm and/or more powerful processors would allow faster and larger-scale assembly.

a) Increasing coverage of the 5-Mb fragment from 10x to 20x decreased the number of clusters (contigs) from 114 to 12 and increased computing time from 16 min to 4 hours. Assembled fragments covered ~90% of the sequence, and 99% of the sequence was covered by well-assembled contigs together with fragments of several chimeric contigs.

NOTE: Chromosome 1 has 15 large gaps represented as poly-N tracts: five gaps of about 60 bp each and 10 gaps of thousands of Ns each, as well as other short disruptive runs of Ns.

b) 10-Mb fragment, 10x coverage: 34 min, 215 contigs; entire Chromosome 1 (30 Mb), 10x coverage: 2 hours, 1003 contigs.

The number of contigs is roughly proportional to the length of the assembled fragment. At 20x coverage, a decrease of about 90% in the number of contigs should be expected, with their average length about 10 times larger (about 620,000 bp); such contigs would contain ~100 genes on average, a very good size for reliable annotation.
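
A read set of this kind can be simulated with a short Python sketch. The source says only that the sequence was "randomly fragmented"; uniform start positions, error-free forward-strand reads, and the names reads_needed and simulate_reads are assumptions made for illustration.

```python
import random


def reads_needed(genome_len, read_len, coverage):
    """Number of reads so that reads * read_len is about coverage * genome_len."""
    return round(coverage * genome_len / read_len)


def simulate_reads(genome, read_len=350, coverage=10, seed=0):
    """Sample error-free reads of fixed length at uniform random positions."""
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)              # fixed seed for reproducibility
    last_start = len(genome) - read_len
    n = reads_needed(len(genome), read_len, coverage)
    starts = (rng.randrange(last_start + 1) for _ in range(n))
    return [genome[s:s + read_len] for s in starts]
```

Under these assumptions, 10x coverage of a 5-Mb fragment with 350-bp reads corresponds to reads_needed(5_000_000, 350, 10), that is, roughly 143,000 reads.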
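
The proportionality claim can be checked against the figures quoted in a) and b) with a few lines of Python. The numbers are taken from the text above; the variable names are illustrative.

```python
# (fragment length in Mb, number of contigs) at 10x coverage, from a) and b)
runs_10x = [(5, 114), (10, 215), (30, 1003)]

# Contigs per megabase for each run
contigs_per_mb = [round(contigs / length_mb, 1) for length_mb, contigs in runs_10x]

# Relative decrease in contig count for the 5-Mb fragment, 10x -> 20x (114 -> 12)
decrease_pct = round(100 * (114 - 12) / 114, 1)

print(contigs_per_mb, decrease_pct)
```

The per-megabase values stay within the same order of magnitude across the three fragment sizes, and the 10x to 20x decrease for the 5-Mb fragment is close to the 90% quoted.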

Table 1. Results of ab initio assembly of a 5-Mb fragment of Arabidopsis Chromosome 1:

oligos     ai   chimeras     length_of_ai_contigs        glued   length_of_glued_contigs      cov_good  cov_all  time
coverage                     min     max      average            min     max      average

10x        110  6 ( 5.5%)    538     225588   45492.6    23      2943    726676   217326.9    89.41     99.99    ~10 min
15x        13   2 (15.4%)    355     1354730  384577.9   9       4823    1521262  555394.0    72.38     100.00   ~1 hour 40 min
20x        8    1 (12.5%)    4812    1861885  624837.6   6       4812    2543555  833030.2    98.41     100.00   ~3 hours 30 min


ai -       ab initio assembled contigs
chimeras - chimeric contigs (different parts of a contig map to different regions of 5 Mb chromosome fragment)
glued -    ai contigs that overlap on reference genome and have sticky ends can be glued together
cov_good - coverage of 5 Mb genome fragment by "good" contigs (those that are not chimeras) - most "good"
           contigs align perfectly to the genome fragment; some contigs have small problems, but still >99% of their
           length aligns very well within the same region of the chromosome fragment and covers >99% of that region
cov_all -  coverage of 5 Mb genome fragment by all alignments, including alternative alignments, of all contigs,
           including chimeras

Minimum number of contigs that cover the following fraction of the 5-Mb fragment sequence:

coverage   >80%   >90%   >95%
10x        50     67     79
15x        6      7      8
20x        3      4      5
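
A count of this kind can be obtained by taking contigs from longest to shortest until the requested fraction is exceeded. The Python sketch below assumes that contigs do not overlap, so that covered length is simply the sum of contig lengths; the source does not state how its figures were computed, and the function name and toy lengths are hypothetical.

```python
def min_contigs_for_fraction(contig_lengths, target_length, fraction):
    """Smallest number of contigs (longest first) whose summed length
    exceeds fraction * target_length; None if the fraction is never reached."""
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if covered > fraction * target_length:
            return count
    return None


# Toy contig lengths for a 100-bp target (synthetic numbers)
toy_lengths = [9, 40, 6, 30, 15]
counts = [min_contigs_for_fraction(toy_lengths, 100, f) for f in (0.80, 0.90, 0.95)]
```

With overlapping or chimeric contigs, the sum of lengths overestimates true coverage, so a real calculation would need the aligned positions of each contig on the reference.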

Table 2. Results of ab initio assembly of the entire Arabidopsis Chromosome 1:

oligos     ai   chimeras     length_of_ai_contigs        glued   length_of_glued_contigs      cov_good  cov_all  time
coverage                     min     max      average            min     max      average

10x        840  38 ( 4.5%)   350     314116   35597.83   331     350     1826464  90209.18    88.18     99.91    ~2 hours
15x        306  33 (10.8%)   351     2932495  97628.63                                                           ~18 hours

Minimum number of contigs that cover the following fraction of the entire Chromosome 1:

coverage   >80%   >90%   >95%
10x        271    372    461
15x        33     45     61

We can see that 15x coverage produces contigs with an average size of about 100 kb and a maximum of about 3 Mb. Contig size is limited by the occurrence of repeats, which can also create long chimeras. Assembly can be improved, and many chimeras resolved, by using mate pairs, ESTs and gene prediction. It should also be kept in mind that Chromosome 1 contains 15 large gaps (poly-N tracts), as mentioned above.