Tag Archives: assembly

Transcriptome assemblers put to the test

Next-generation sequencing produces millions to billions of reads, and the interpretation of these reads relies on bioinformatic tools.

Especially for de novo assemblies of genomes or transcriptomes, the results can vary depending on the quality of the assembly.

In a recent publication, Shorash Amin and his co-workers sequenced the transcriptome of the non-model gastropod Nerita melanotragus with the Ion PGM. Afterwards they assembled the reads with different software packages and compared the quality of the resulting transcriptome assemblies (Amin et al.).

Oases, Trinity, Velvet and Geneious Pro were the four de novo transcriptome assemblers used in this study. The assemblers were compared on parameters such as contig length, N50 statistics, and BLAST and annotation success.
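For readers who have not met the N50 statistic before, here is a minimal Python sketch of how it is commonly computed: the contigs are sorted from longest to shortest, and the N50 is the length of the contig at which the running sum reaches half of the total assembly length. The contig lengths in the example are made up for illustration and are not taken from the publication.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to at least half
        # of the total assembly length defines the N50.
        if running >= half_total:
            return length
    return 0


# Made-up contig lengths in bp, for illustration only.
example_lengths = [1700, 900, 600, 400, 300, 100]
example_n50 = n50(example_lengths)
print(example_n50)
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is often reported next to the longest contig and the total number of contigs.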

The longest contig was created with the Oases assembler (1700 bp), and overall Trinity and Oases delivered much better results than the de novo assembly of Ion PGM reads with Velvet or Geneious Pro.

Furthermore, mapping to a reference genome showed that Ion PGM transcriptome sequencing and subsequent de novo assembly with either Trinity or Oases generate reliable and accurate results.

Read the complete publication here.

How to benefit from our superior LJDs on the MiSeq

With the update of our MiSeq system to 250 bp reads, genome sequencing on this system becomes even more important. But long reads and huge data output are not the only prerequisites for a great de novo assembly result.

What is missing?

Paired-end libraries that span gaps and repetitive structures can improve de novo genome assemblies tremendously. Our proprietary long jumping distance libraries (LJDs) are perfectly suited for scaffolding on Illumina sequencing devices. In contrast to other paired-end libraries (such as Illumina mate pair libraries), our LJD library preparation involves an adaptor-guided ligation of the genomic fragments. This different preparation protocol offers the following advantages:

  • No hybrid reads – a unique sequence identifies the crossover points
  • No shotgun pairs – less than 1% of all LJD reads are shotgun paired-end reads
  • Distinct insert sizes – we prepare LJDs with 3, 8, 20 or even 40 kbp insert size
  • Span large repeats – large and complex repeats up to 40 kbp can be resolved

Mapped reads: All reads from a 3 kbp LJD library (grey) are aligned to a reference sequence. Two LJD read pairs are highlighted (blue + black); their measured insert sizes are 3107 bp and 3002 bp, respectively.


Why should I combine MiSeq long reads and LJDs?

The new features of the MiSeq (250 bp reads; data output up to 8 Gbp) enable a combined, cost-efficient approach in which shotgun and LJD libraries are sequenced in one run. The MiSeq output is sufficient to sequence several bacterial genomes or single fungal genomes (up to 60 Mbp) with appropriate coverage.

  • Longer reads – more sequence information to correctly map the reads onto your contigs
  • Short delivery time – due to the shorter run time compared to the HiSeq 2000
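The coverage statement above can be checked with simple arithmetic: mean coverage is roughly the sequenced bases divided by the genome size. The Python sketch below uses the 8 Gbp output and the 60 Mbp genome size from the text. The bacterial genome size (5 Mbp) and the number of pooled genomes (10) are assumptions made only for illustration, and the calculation ignores quality filtering and the split of the run between shotgun and LJD libraries.

```python
def mean_coverage(output_bp, genome_size_bp, n_genomes=1):
    """Approximate mean coverage when a run is shared equally by n_genomes."""
    return output_bp / (genome_size_bp * n_genomes)


MISEQ_OUTPUT_BP = 8_000_000_000  # "up to 8 Gbp", as stated in the text

# One fungal genome of 60 Mbp (the upper limit mentioned in the text).
fungal_coverage = mean_coverage(MISEQ_OUTPUT_BP, 60_000_000)

# Ten bacterial genomes of 5 Mbp each (assumed values, not from the text).
bacterial_coverage = mean_coverage(MISEQ_OUTPUT_BP, 5_000_000, n_genomes=10)

print(f"fungal: {fungal_coverage:.1f}x, bacterial: {bacterial_coverage:.1f}x")
```

Under these assumptions both cases end up well above 100-fold raw coverage, which leaves room to divide the run between a shotgun and an LJD library.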

Read more about our long jumping distance libraries on our website

Creating the Perfect Genome Assembly

Dr. George Weinstock from the Genome Institute at Washington University presents in a webinar how to create the perfect genome assembly using the optical mapping system from OpGen Inc.

What is Optical Mapping?

Whole Genome Mapping (WGM) using the OpGen Argus technology delivers high-resolution, ordered whole genome restriction maps from single DNA molecules. To obtain such a restriction map, it is crucial to isolate long DNA fragments (200 kb and longer) and to capture the DNA on a solid phase. The DNA is then digested, and the restriction cleavage sites become visible as gaps when the DNA is viewed under a fluorescence microscope. This optical map is then converted into digital data, the so-called “single molecule restriction maps” (see video below). The software MapSolver enables the following analysis options (see details in the analysis video):

  • Perform Genome Comparisons
  • Identify Motifs, Annotate Features, and view in silico sequence data
  • Perform Sequence Placement
  • Create Similarity Clusters

 Video about step 3: How to scan and assemble single molecule restriction maps (SMRM)

Recently we gained access to this innovative technology and are now able to combine our Next Generation Sequencing service with the WGM technology. The combination of NGS and WGM can be used to order the contigs from a next generation sequencing project against the optical map scaffold. This method can substantially improve sequence assemblies. If you are interested in a combined or stand-alone project for WGM, please do not hesitate to contact us.
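To give an idea of what placing a contig against a restriction map involves, the following Python sketch performs an in silico digest: it finds every occurrence of a recognition site in a sequence and reports the resulting fragment lengths, which could then be compared with the fragment sizes of an optical map. This is only an illustration of the principle and not how MapSolver works internally. The toy sequence is made up, and BamHI (site GGATCC, cut after the first G) is used purely as an example enzyme.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Return the ordered fragment lengths after cutting at every site.

    cut_offset is the position of the cut within the recognition site.
    Only the given strand is searched, which is sufficient for
    palindromic sites such as GGATCC.
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    # Fragment length = distance between neighbouring cut positions.
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Made-up 24 bp "contig" containing two GGATCC sites.
toy_contig = "AAAAGGATCCTTTTTTGGATCCAA"
toy_fragments = in_silico_digest(toy_contig, "GGATCC", cut_offset=1)
print(toy_fragments)
```

The ordered list of fragment lengths is the in silico counterpart of a restriction map. Real contigs would of course yield fragments in the kilobase range rather than a few bases.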

We look forward to discussing WGM in detail with you.

RAD-Seq – A brief technical overview

Some time ago I introduced a new approach that combines restriction site associated DNA marker genotyping (RAD) with next generation sequencing technology. Originally this method was developed for microarray platforms. However, the combination of RAD and NGS (Illumina), resulting in RAD sequencing (RAD-Seq), enabled massively parallel and multiplexed sample sequencing. RAD-Seq is becoming more and more powerful and has the potential to revolutionize agrigenomics, because one can discover and screen thousands of SNPs and genotype large populations in a high-throughput manner at the same time. The following section gives a short technical overview of how this can be accomplished:

Genomic DNA of each sample is digested in parallel with a chosen restriction enzyme, and a specific P1 adapter is ligated to the restriction fragments. Each sample thereby receives an individual P1 adapter containing a sample-specific molecular identifier (barcode) and Illumina adapter sequences (the forward amplification primer site and the Illumina sequencing primer site). If multiplexing is desired, the adapter-ligated fragments of several samples can now be pooled. The level of multiplexing depends on the number of different P1 adapters used. In a further step the RAD pool is sheared, size-selected and ligated to a second adapter (P2). The P2 adapter is a divergent “Y” adapter containing the reverse amplification primer sites. It is designed such that fragments lacking the P1 adapter cannot be amplified. This guarantees that only fragments containing both a P1 and a P2 adapter are selectively and robustly enriched during the subsequent amplification step. The overall length of the RAD tags available for further analysis depends mainly on the size selection step and the sequencing run mode (single vs. paired end).
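Because the barcode sits in the P1 adapter, pooled reads can be assigned back to their samples after sequencing. The Python sketch below shows the idea under simplifying assumptions: each read starts with the barcode, directly followed by the bases left over from the restriction site, and only exact matches are accepted. The barcodes, the site remnant and the reads are hypothetical placeholders, not real adapter sequences.

```python
def demultiplex(reads, barcode_to_sample, site_remnant):
    """Assign reads to samples by their leading barcode.

    A read is accepted only if it starts with a known barcode that is
    directly followed by the expected restriction site remnant.
    The barcode is trimmed from accepted reads.
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode + site_remnant):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched, or the site remnant was missing.
            unassigned.append(read)
    return by_sample, unassigned


# Hypothetical barcodes, site remnant and reads, for illustration only.
barcodes = {"AACC": "sample_A", "GGTT": "sample_B"}
pooled_reads = [
    "AACCTGCAGGACGTACGT",
    "GGTTTGCAGGTTTTAAAA",
    "CCCCTGCAGGACGT",   # unknown barcode
    "AACCAAAAAAAA",     # known barcode but no site remnant
]
assigned, rejected = demultiplex(pooled_reads, barcodes, "TGCAGG")
print(assigned)
print(rejected)
```

Production tools usually also tolerate a sequencing error in the barcode and check base qualities. The sketch leaves this out to keep the principle visible.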


Metagenomic Assembly: The Big Challenge

Microbial communities are increasingly analyzed by direct sequencing of DNA from environmental samples. The aim is to study the microbial composition under different conditions and to identify novel organisms. The bottleneck is the assembly of the metagenome reads.

Why are these assemblies so challenging? One important reason is the highly heterogeneous character of microbial environmental samples. Furthermore, the abundance of the member species differs remarkably. While some species are highly abundant, others (often the unknown and therefore very interesting ones) are present at a very low level. To obtain contigs from the low-abundance species, very deep sequencing is needed, and these species can often only be assembled in a highly fragmented manner. Another challenge for metagenomic assemblies is populations of closely related species. As their genomes are highly similar, the assembly software generates hybrid contigs from those closely related species.

Despite, or probably because of, these challenges, I see a lot of effort in the field going into improving the underlying assembly tools. Currently, procedures such as clustering large contigs based on tetranucleotide frequency and coverage are applied. The clustered contigs are afterwards ordered by mapping them to a related genome. With this approach, the first bacterial draft genomes from, e.g., cow rumen and soil metagenomes have been published in Science and Nature.
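To make the idea of a tetranucleotide signature more concrete, here is a minimal Python sketch that counts all overlapping 4-mers of a contig and turns them into relative frequencies. Two such profiles can then be compared with a simple distance measure. This is a simplified illustration: published binning procedures typically also merge reverse-complement 4-mers, normalize the profiles and include coverage, and the Euclidean distance used here is just one possible choice.

```python
from collections import Counter
from itertools import product

ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 possible 4-mers.

    Windows containing characters other than A, C, G or T are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    total = sum(counts[k] for k in ALL_TETRAMERS)
    if total == 0:
        return {k: 0.0 for k in ALL_TETRAMERS}
    return {k: counts[k] / total for k in ALL_TETRAMERS}


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two tetranucleotide profiles."""
    return sum((profile_a[k] - profile_b[k]) ** 2 for k in ALL_TETRAMERS) ** 0.5


# Made-up toy sequences; real contigs should be several kbp long.
profile_1 = tetranucleotide_frequencies("ACGTACGT")
profile_2 = tetranucleotide_frequencies("TTTTTTTTTT")
print(profile_distance(profile_1, profile_2))
```

Contigs that originate from the same genome tend to have similar profiles, so a small distance (together with similar coverage) is a hint that they belong in the same cluster.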

Evolving SAM/BAM Format for De Novo Assemblies

I guess any bioinformatician working with NGS data is familiar with the SAM/BAM format, which has evolved into a widely accepted standard in the NGS community. Actually, many software packages (e.g. mapping and visualization tools like BWA, Tablet or IGV) have adopted this open file format to cope with the large amounts of data produced by e.g. read mappings. The benefit of an accepted data format is the easy accessibility of structured information: in this case, the two open source libraries samtools and Picard provide functionality to access the data.
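To illustrate how structured this information is, the following Python sketch splits a single SAM alignment line into its eleven mandatory, tab-separated fields plus any optional tags. It is only meant to show the layout of a record. For real work the libraries mentioned above are the better choice, and the example line is made up.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),      # 1-based leftmost mapping position
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],        # optional TAG:TYPE:VALUE fields
    )


# Made-up alignment line for illustration.
example_line = "read1\t0\tcontig1\t101\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0\n"
record = parse_sam_line(example_line)
is_unmapped = bool(record.flag & 0x4)  # flag bit 0x4 marks unmapped reads
print(record.rname, record.pos, record.cigar, is_unmapped)
```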

However, the field of de novo assemblies still lacks an open and widely adopted data format, and existing assembly formats like ACE, CAF or AMOS are not easily convertible into one another.

Now, initiated by a recent blog entry by Nick Loman (Sept 19, 2011), an intensive discussion has started in the community (samtools-devel mailing list) on how to adapt the SAM/BAM standard for de novo assemblies as well. The intention is to modify the existing format (v1.4) so that the reference sequence can also be incorporated into the SAM/BAM file. Open questions include, for example, how to integrate a padded (gapped) reference sequence (as already implemented in the ACE format). Peter Cock’s blog provides some nice figures depicting the problem of visualizing the read-to-consensus alignment using a padded reference sequence.

Actually, a few days ago (Sept 27, 2011), the SAM/BAM specification was already updated in the repository and, according to Peter Cock, “this is on track to be part of the SAM Format Specification v1.5”.

Wow, I am really impressed (again) how fast things change and develop in NGS research. I am looking forward to my first SAM/BAM assembly!

Comparison of De Novo Assemblies of the Escherichia coli Outbreak

The E. coli EAHEC outbreak in Germany has been an opportunity to compare currently available sequencing technologies with respect to data quality.

In terms of N50 contig size and number of contigs/scaffolds, the best assembly quality so far was achieved using the long read technologies on the market: Roche’s GS Junior sequencing and Pacific Biosciences’ PacBio RS sequencing. For the further comparison I will therefore focus on the data from these two long read technologies. But first, please have a look at the sequencing layouts:

Sequencing layout PacBio RS (Source: Pacific Biosciences):

First library: Standard sequencing library (200-fold coverage)
Second library: Circular consensus sequencing library (35-fold coverage)
Sequencing: 56 SMRT cells

Sequencing layout Roche GS Junior (Source: UK Health Protection Agency HPA):

First library: Shotgun library
Second library: Long paired end library (LPE, 8 kbp insert length)
Sequencing: Three Roche GS Junior runs (25-fold coverage)

Comparison of the results

The de novo assembly comprises 33 contigs with PacBio RS sequencing and 13 scaffolds with Roche GS Junior. The N50 contig size of the PacBio sequencing approach is 402 kbp, and the N50 scaffold size of the Roche 454 sequencing approach is 968 kbp. The de novo assemblies with Illumina MiSeq and Ion Torrent PGM data have so far both yielded a higher number of contigs and considerably shorter N50 contig sizes (95 kbp and 50 kbp, respectively).

These data once again show that de novo sequencing strictly needs long reads. The advantage that PacBio’s very long reads (2900 bp on average, with 5% longer than 5100 bp) provide for scaffolding is balanced in the other approach by sequencing the LPE library.

It is important to mention that the PacBio assembly was generated with reads from a not yet released chemistry (planned for quarter 4). In contrast, the Roche 454 assembly did not contain the long FLX+ chemistry reads that will become available for the GS FLX by the end of the month. In our experience, a read length of 650 – 750 bp will have some additional positive effect on the number of scaffolds and the N50 scaffold size.

Most striking for me is that as many as 56 SMRT cells were needed to generate the PacBio assembly with the high consensus accuracy of 99.998 %. The standard library was sequenced at that high coverage in order to increase the number of very long reads, and the circular consensus sequencing library was employed to further correct errors resulting from the still low single-read accuracy.