Tag Archives: Bioinformatics

Transcriptome assemblers put to the test

Next-generation sequencing produces millions to billions of reads, and the interpretation of these reads relies on bioinformatics tools.

Especially for de novo assemblies of genomes or transcriptomes, the results can vary depending on the quality of the assembly.

In a recent publication, Shorash Amin and his co-workers sequenced the transcriptome of the non-model gastropod Nerita melanotragus with the Ion PGM. Afterwards they used different software packages and compared the quality of the resulting transcriptome assemblies (Amin et al.).

Oases, Trinity, Velvet and Geneious Pro were the four de novo transcriptome assemblers used in this study. The assemblers were compared on parameters such as contig length, N50 statistics, BLAST and annotation success.
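For readers who have not met the N50 statistic before: it is the contig length at which, going from the longest contig downwards, at least half of all assembled bases are covered. Below is a minimal Python sketch of that calculation. The function name and the example lengths are made up for illustration and are not taken from the publication.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Walk through the contigs from longest to shortest
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of
        # the total assembly size defines the N50
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))  # prints 8
```

A higher N50 usually points to a more contiguous assembly, but it says nothing about correctness, which is presumably why the study also looked at BLAST and annotation success.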

The longest contig was created with the Oases assembler (1700 bp), and overall Trinity and Oases delivered much better results than the de novo assembly of Ion PGM reads with Velvet or Geneious Pro.

Furthermore, the mapping to a reference genome showed that Ion PGM transcriptome sequencing and subsequent de novo assembly with either Trinity or Oases generate reliable and accurate results.

Read the complete publication here.

A major upgrade of SAMTools: CRAM format to reduce NGS data load

SAMTools, one of the most popular NGS sequence analysis tools, has recently been upgraded by computer scientists at the Wellcome Trust Sanger Institute. SAMTools is a set of utilities that allow the manipulation of alignments in the SAM/BAM format. SAM stands for Sequence Alignment/Map format, whereas BAM is simply the binary form of SAM. SAM can be seen as the worldwide standard for storing large nucleotide sequence alignments.

SAMTools 1.0, the revised version of the free program suite, now offers researchers improved handling of their sequencing data. In addition to the existing SAM and BAM file formats, SAMTools now supports the new CRAM format. Basically, CRAM files are alignment files, just like BAM files, except that their size is reduced by 10-30%. Even greater compression, up to 100-fold, can be achieved in the “lossy” mode, which still preserves the most important information. The storage savings that CRAM offers were achieved by incorporating data compression techniques developed cooperatively by the Sanger Institute and the EMBL-European Bioinformatics Institute.
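If you want to try the new format yourself, converting an existing BAM file to CRAM is done with `samtools view`. Here is a small Python sketch that only builds the command line. The file names are placeholders, and the sketch assumes that samtools 1.0 or later is installed and that you have the reference FASTA the reads were aligned against.

```python
def bam_to_cram_command(bam_path, reference_fasta, cram_path):
    """Build a samtools command that converts a BAM file to CRAM.

    -C  write CRAM output
    -T  reference FASTA used for the reference-based compression
    -o  output file
    """
    return ["samtools", "view", "-C", "-T", reference_fasta,
            "-o", cram_path, bam_path]


# Placeholder file names, for illustration only
cmd = bam_to_cram_command("sample.bam", "reference.fa", "sample.cram")
print(" ".join(cmd))
# To actually run the conversion:
#   import subprocess; subprocess.run(cmd, check=True)
```

The reference is needed because CRAM stores differences to the reference rather than the full read sequences, so keep the same FASTA available for decoding later.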

“This major rebuild of SAMTools reflects our commitment to supporting the global use of sequencing data,” says Dr Richard Durbin, Head of Computational Genomics at the Sanger Institute. “Genome science worldwide relies on fast and efficient data analysis and storage, and SAMTools 1.0 fulfills this need by supporting new sequencing and analysis technologies”. Dr. John Marshall from the Sanger Institute is highly optimistic that the widespread uptake of the new format will lead to lower data storage costs on a global scale (complete article).

I am curious how the new format is going to be adopted by the genomics community. By the way, did you know that SAMTools has been downloaded more than 225,000 times?

Update on NGS and Clinical Validation

There is an increasing demand for the development of regulated next-generation sequencing based diagnostic tests. The review that I would like to draw your attention to thoroughly discusses the challenges and issues that arise when developing NGS-based diagnostic tests or even companion diagnostics (CDx). The experts from the Merck Research Laboratories take everything into account, from the choice of platform and bioinformatics through to the regulatory approval process.

Have a read, it’s really worth it!


Data analysis – still a bottleneck!

With so many NGS machines in the field, we produce tremendous amounts of sequencing data every day. However, at the end of the day, all the data have to be analyzed and interpreted. In many cases, this step is still a bottleneck.

Please check the video below, an interview on this topic with Lex Nederbragt, bioinformatician at the Norwegian High-Throughput Sequencing Centre in Oslo. He discusses the fact that the available analysis tools do not fully meet the needs of researchers. In this context, he also discusses the use of open-source and commercial software tools.

Lex Nederbragt discussing software bottlenecks and lack of flexible reference genomes from NGS Perspectives on Vimeo.

There is more than one bottleneck in NGS

The NGS Perspectives blog recently published a great survey (sponsored by QIAGEN) about the biggest bottlenecks researchers face when using NGS technology. 26% of the 924 participants voted for the complexity of data analysis. From my point of view, the challenge with data analysis has only just begun, because the sequencers out there produce more and more data in a single run. High-end software solutions are therefore a prerequisite for further use of these machines.

What is the primary sequencing work done in your lab?



Also interesting: one of the survey questions asked which applications people run on their NGS instruments. The answers show that more and more scientists use NGS for dedicated purposes, such as learning more about the genes expressed in a sample or about mutations and the presence of specific genes or gene panels.



Visit NGS Perspectives to view or download the complete survey.

Clinical Genomics Using NGS Approach

Video by Cambridge Healthtech about the impact of next-generation sequencing on clinical genomics.

Nazneen Aziz of the College of American Pathologists and Konrad J. Karczewski of Stanford University talk about analytical and bioinformatics standards, personal analysis, challenges and interpretation services.

Whose genome has been sequenced? Nasuia deltocephalinicola

The human genome comprises more than 3 billion base pairs and encodes more than 20,000 protein-coding genes. For genomes like this, high-throughput sequencers such as the HiSeq 2000 are a revelation. In this article we talk about the smallest genome sequenced so far, for which sequencing with the MiSeq is more than sufficient. The main role of this small symbiont (Nasuia deltocephalinicola), together with the symbiont Sulcia muelleri, is to provide 10 essential amino acids to the pest insect Macrosteles quadrilineatus (Bennett et al.).

What was sequenced?

10 phloem-feeding pest insects (Macrosteles quadrilineatus), including the obligate symbionts Nasuia deltocephalinicola & Sulcia muelleri.

Sequencing strategy: Whole genome sequencing

  1. Libraries & Sequencing: 2x 250 bp paired-end sequencing using the Illumina MiSeq platform
  2. Data output: 12,000 contigs (> 500 bp) including reads from the pest insect and Sulcia muelleri; MIRA and Velvet assembly revealed two large scaffolds for Nasuia (~102 kb & ~12 kb)
  3. Bioinformatics: Many tools have been used, including Velvet for initial read assembly; SOAP2 to map the symbiont-derived reads to the Velvet contigs and MIRA for re-assembly of isolated symbiont reads

The biggest challenge with this genome mixture was definitely the bioinformatic analysis. Over several cycles of mapping and assembly, the reads that belong to one organism need to be filtered out of the remaining reads. But this labour-intensive approach revealed the smallest bacterial genome yet sequenced (112 kb).

Read the complete publication here.

Recently Launched Tools for Genomic Sequencing

The cost of DNA sequencing has decreased tremendously in recent years. New technologies and better methods are behind this rapid drop in prices.
At the same time, the field of sequencing is being pushed forward by

  • methods to enrich nucleic acid samples,
  • kits that simplify library preparation from a variety of samples, and
  • services to assist the researcher with all aspects of sequencing.

Read more at the Nature Product Focus; the article was published in Nature on 26 September 2013.

Whose genome has been sequenced? Thunnus orientalis

When it comes to sea life, everyone knows what sharks or whales look like and how they behave. Sadly, I think little is known about tuna, which is more or less only known as a delicious meal. So it’s all the more pleasant to see that the recent de novo genome sequencing approach of Nakamura et al. aims to learn more about the predatory behaviour of tuna and not about breeding or cultivation (Nakamura et al.). With this genome sequencing project of Thunnus orientalis, the scientists could show that tuna has some unique tactics for catching its prey.

What was sequenced?

The diploid genome of a wild-caught male Pacific bluefin tuna (T. orientalis) was sequenced.

Sequencing strategy: Whole genome sequencing

  1. Hybrid approach: Roche 454 GS FLX Titanium & Illumina GAIIx
  2. Libraries: Shotgun & paired-end libraries on Roche 454 & paired-end libraries on Illumina GAIIx
  3. Read output: 31.9 million 454 reads, including 4.9 million long paired-end reads (11.9x coverage) & 229.7 million Illumina paired-end reads (43x coverage)
  4. Data output: 192,169 contigs (> 500 bp) that could be assembled in 16,802 scaffolds (> 2 kb), totaling 740.3 Mb (= 92.5% of the estimated genome size (~ 800 Mb))
  5. Bioinformatics: Roche 454 read assembly with Newbler (Version 2.5) followed by mapping of the paired-end Illumina reads with Bowtie (Version 0.12.7).
    Note: 7,259 nucleotide mismatches & 312,851 short indels could be eliminated by mapping the Illumina reads onto the scaffolds with BWA (Version 0.5.9)
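The numbers in the list above are easy to sanity-check. The following Python sketch recomputes the assembled fraction of the genome and, as a back-of-the-envelope exercise, the mean read length implied by a given coverage. The function names are made up for this post, and the second calculation assumes that coverage was computed against the ~800 Mb genome size estimate, which is an assumption and not stated above.

```python
def assembled_fraction(assembled_mb, estimated_genome_mb):
    """Fraction of the estimated genome covered by the assembly."""
    return assembled_mb / estimated_genome_mb


def implied_mean_read_length(coverage, genome_size_bp, n_reads):
    """Mean read length (bp) implied by coverage = reads * length / genome size."""
    return coverage * genome_size_bp / n_reads


# 740.3 Mb assembled out of an estimated ~800 Mb
print(f"{assembled_fraction(740.3, 800.0):.1%}")  # prints 92.5%

# Assumption: the 43x Illumina coverage refers to the ~800 Mb estimate
print(round(implied_mean_read_length(43, 800e6, 229.7e6)), "bp per Illumina read (rough estimate)")
```

The same coverage relation can be turned around to estimate how many reads a planned project needs for a target depth.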

Sequencing strategy: Transcriptome analysis

  1. Libraries & Sequencing: Normalized cDNA libraries have been sequenced with the Roche 454 FLX Titanium Instrument
  2. Read output: 3.8 million 454 reads
  3. Data output: 5,741 full-length cDNA sequences
  4. Bioinformatics: Assembly was performed using Newbler (Version 2.5)

From a sequencing strategy point of view, this publication shows again that the hybrid approach combining Roche 454 long-read technology with Illumina short-read technology is one of the most widely used techniques for de novo genome sequencing (Hybrid assemblies).

From a scientific point of view, this publication shows that tuna has the most RH2 paralogs among the fishes studied and that three of these genes are mutated compared to the others. According to Nakamura et al., these changes might be responsible for tuna’s remarkable ability to detect blue-green contrasts and thus to measure the distance to prey in the blue pelagic ocean.

Read the complete publication here.

Periodic Table of Bioinformatics

I came across this interesting “Elements of Bioinformatics”, which categorizes and arranges bioinformatics tools in the format of a periodic table. It is certainly not exhaustive, but it is very useful (and fun!) as an overview of the available tools. In addition, you will find the year each tool was published in the upper right corner, so the table also offers a historical perspective on bioinformatics!

Check out http://elements.eaglegenomics.com/