Comparison of De Novo Assemblies of the Escherichia coli Outbreak

The E. coli EAHEC outbreak in Germany has been an opportunity to compare the currently available sequencing technologies with respect to data quality.

In terms of N50 contig size and number of contigs/scaffolds, the best assembly quality so far was achieved with the long-read technologies on the market: Roche’s GS Junior sequencing and Pacific Biosciences’ PacBio RS sequencing. For the further comparison I will therefore focus on the data from these two long-read technologies. But first, please have a look at the sequencing layouts:

Sequencing layout PacBio RS (Source: Pacific Biosciences):

First library: Standard sequencing library (200-fold coverage)
Second library: Circular consensus sequencing library (35-fold coverage)
Sequencing: 56 SMRT cells

Sequencing layout Roche GS Junior (Source: UK Health Protection Agency, HPA):

First library: Shotgun library
Second library: Long paired end library (LPE, 8 kbp insert length)
Sequencing: Three Roche GS Junior runs (25-fold coverage)
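
The fold-coverage figures in the two layouts above are simply the total number of sequenced bases divided by the genome size. Here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration, and the genome size of 5.3 Mbp is only an assumed placeholder, since the post does not state one.

```python
def fold_coverage(total_bases, genome_size):
    """Average fold coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_needed(coverage, genome_size):
    """Total bases that must be sequenced to reach a target fold coverage."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return coverage * genome_size


# ASSUMPTION: placeholder genome size in bp, not a figure from this post.
ASSUMED_GENOME_SIZE = 5_300_000

# Bases required for the coverage values quoted in the layouts.
for label, coverage in [("GS Junior, three runs", 25),
                        ("PacBio CCS library", 35),
                        ("PacBio standard library", 200)]:
    total = bases_needed(coverage, ASSUMED_GENOME_SIZE)
    print(f"{label}: {coverage}-fold needs about {total / 1e6:.0f} Mbp")
```

Under that assumed genome size, the gap between 25-fold and 200-fold coverage is a gap between roughly 130 Mbp and more than 1 Gbp of raw sequence.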

Comparison of the results

The de novo assembly comprises 33 contigs with PacBio RS sequencing and 13 scaffolds with Roche GS Junior. The N50 contig size of the PacBio approach is 402 kbp, and the N50 scaffold size of the Roche 454 approach is 968 kbp. The two de novo assemblies based on Illumina MiSeq and Ion Torrent PGM data have so far yielded a higher number of contigs and considerably shorter N50 contig sizes (95 kbp and 50 kbp, respectively).
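
For readers less familiar with the metric: the N50 is the length of the contig (or scaffold) at which, going from the longest sequence downwards, the running total first reaches half of the total assembly length. Here is a minimal sketch in Python. The function name `n50` and the toy lengths are made up for illustration and are not taken from any of the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sort the lengths from longest to shortest and return the length at
    which the running sum first reaches half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("need at least one length")


# Toy example with invented lengths (in kbp), total = 1100.
toy_lengths = [400, 300, 200, 100, 50, 50]
print(n50(toy_lengths))  # 400 + 300 = 700 >= 550, so the N50 is 300
```

Because scaffolds join several contigs across gaps, a scaffold N50 is generally larger than the contig N50 of the same assembly, which is worth keeping in mind when reading the figures above.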

These data once again show that de novo sequencing strictly needs long reads. The advantage that PacBio's very long reads bring to scaffolding (2900 bp on average, with 5% of reads longer than 5100 bp) is matched in the other approach by sequencing the LPE library.

It is important to mention that the PacBio assembly was generated with reads from a not-yet-released chemistry (planned for quarter 4). In contrast, the Roche 454 assembly did not include the long FLX+ chemistry reads that will become available for the GS FLX by the end of the month. In our experience, a read length of 650 – 750 bp will have some additional positive effect on the number of scaffolds and the N50 scaffold size.

Most striking to me is that as many as 56 SMRT cells were needed to generate the PacBio assembly with its high consensus accuracy of 99.998 %. The standard library was sequenced at such high coverage in order to increase the number of very long reads, and the circular consensus sequencing library was used to further correct errors arising from the still-low single-read accuracy.
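
To put the consensus accuracy into a more familiar unit, it can be converted into a Phred-scaled quality value, Q = -10 · log10(error rate). Here is a minimal sketch in Python. The function name is made up for illustration, and the calculation only restates the 99.998 % figure quoted above. It says nothing about how that accuracy was measured.

```python
import math


def phred_from_accuracy(accuracy):
    """Convert an accuracy (fraction between 0 and 1) to a Phred quality."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    return -10.0 * math.log10(error_rate)


consensus_accuracy = 0.99998  # the 99.998 % quoted for the PacBio assembly
consensus_q = phred_from_accuracy(consensus_accuracy)
errors_per_mbp = (1.0 - consensus_accuracy) * 1_000_000

print(f"Phred quality: Q{consensus_q:.0f}")
print(f"Expected errors per Mbp: {errors_per_mbp:.0f}")
```

A consensus accuracy of 99.998 % corresponds to roughly Q47, or about 20 expected errors per million consensus bases.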