Samba In The World Of NGS

Today I was reading a publication about sequencing error profiles in Ion Torrent PGM data when I came upon a detail in the PGM sequencing workflow that I find funny and interesting at the same time, and that I want to share with you.

You may know that the sequencing method of the Ion Torrent PGM is quite similar to that of the Roche 454 devices. In both technologies, beads that hold the clonally amplified template with appropriate sequencing adaptors are loaded onto a plate with millions of wells. The loading is performed in a way that ensures that most wells are loaded with a single bead (the size of the wells does not allow two beads per well). In the next step, dNTPs are flowed over the surface in a predetermined order, with only one type of nucleotide at a time. Washing steps occur before the next dNTP is flowed over the surface. The way the incorporation of the nucleotide is measured is the substantial difference between the two technologies:

With the Roche 454 technology, the pyrophosphate released by the polymerization event triggers an enzymatic cascade that finally generates light. The light intensity is proportional to the number of nucleotides that were incorporated (if any) and is detected by the camera of the system.

In contrast, the Ion Torrent PGM measures pH rather than light to detect incorporation events. A single proton is released for every dNTP incorporated during the flow, which changes the pH in the respective well, and an ion sensor measures the pH change.

The Roche system (as well as the first generation of the PGM) cycles the 4 dNTPs in a step-wise fashion, simply repeating the sequence TACG over and over. With the second-generation PGM, this 4-base cycle was changed to a 32-base cycle (TACGTACGTCTGAGCATCGATCGATGTACAGC), called the Samba sequence. The sequence starts with two of the same 4-nucleotide repeats, but after these some nucleotides recur with a period shorter than four. According to Bragg et al., this modification was implemented to improve the synchrony of the clonal templates, which facilitates more accurate base calling. Unfortunately, the Samba sequence is not optimized for read length as the original sequence was. It remains to be seen whether Ion Torrent (now owned by Thermo Fisher) will make further modifications to the Samba sequence in order to balance the accuracy and the read length of the system.
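To make the flow principle concrete, here is a minimal sketch of an idealized flowgram: for each flowed nucleotide it reports how many bases would be incorporated. It assumes a noise-free signal and perfectly synchronized templates, which real instruments do not have, and the function and variable names are my own and not part of any vendor software.

```python
SAMBA = "TACGTACGTCTGAGCATCGATCGATGTACAGC"  # 32-flow cycle quoted in the text
STEPWISE = "TACG"                            # original 4-flow cycle


def flowgram(read, flow_cycle, max_flows=1000):
    """Return (nucleotide, incorporated_count) for each flow.

    `read` is the strand being synthesized, written 5'->3'.
    Idealized model: every template in the well extends in perfect sync.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_cycle[flow % len(flow_cycle)]
        count = 0
        # A homopolymer run is incorporated completely within a single flow.
        while pos < len(read) and read[pos] == base:
            count += 1
            pos += 1
        signals.append((base, count))
        flow += 1
    return signals


if __name__ == "__main__":
    example = "TTAGGCATCCGA"  # arbitrary illustrative sequence
    for name, cycle in (("step-wise", STEPWISE), ("Samba", SAMBA)):
        print(name, len(flowgram(example, cycle)), "flows")
```

Comparing the number of flows each cycle needs for the same read gives a rough feeling for the trade-off between flow order and read length mentioned above.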

New Bid From Roche For Illumina?

The analyst and sequencing community is currently divided on whether to believe the rumors of a new bid from Roche to buy Illumina. The source of the controversy is an article in the Swiss newspaper L’Agefi, which reported at the end of December that Roche and Illumina might have agreed to a deal for Roche to acquire Illumina. Since Illumina turned down Roche’s original bid in January, continued interest from Roche has been reported several times, but the report from L’Agefi also mentions concrete amounts for the bid. According to the newspaper, the acquisition might take place at $66 per share, valuing the deal at about $8.14 billion in total.

The offer is $15 per share higher than the previous offer of $51 in April last year. According to the analyst Devia Ferreiro of Oppenheimer, the new bid is definitely at a level that might lead to a final deal.

With Roche holding only about 9% of the NGS market, and next generation sequencing likely to become an important clinical diagnostic tool in the coming years, the strategic focus of Roche must be to get better access to the NGS market and to take NGS into clinical practice. The acquisition of the NGS market leader Illumina would be an optimal starting point.

We’ll see if the rumors are built on a solid foundation within the next two weeks: The Swiss newspaper L’Agefi reported that the announcement might come during the first half of January.

Hybrid De Novo Genome Assemblies

What do you expect from a bacterial or fungal de novo genome sequencing project?

Typical answers we get from our customers:

  • Easy handling of the data
  • Data suitable for high quality annotation
  • Resolution of structural rearrangements
  • High consensus accuracy
  • High cost-efficiency

All these requirements can be fulfilled when combining Roche GS FLX++ and Illumina data. The long Roche FLX++ reads of up to 1100 bp give much longer contigs than Illumina reads alone. For scaffolding, and to be able to resolve structural rearrangements, we sequence shotgun (SG) and LJD libraries with Illumina technology. Adding Illumina reads keeps the overall costs at a reasonable level. Furthermore, the Illumina reads correct the Roche sequencing errors at homopolymer sites and therefore enable us to build a consensus sequence with high accuracy.

The superiority of such a hybrid assembly quickly becomes apparent when looking at the following results of one of our proof-of-concept studies. In this de novo project, we sequenced a fungal genome of about 30 Mbp and approx. 57% GC content. Using the hybrid strategy, we obtained only 10 chromosome-sized scaffolds (see figure below) of up to 8.3 Mbp. Remarkably, these 10 scaffolds represent nearly all of the genetic information present: they make up 99.6% of all scaffold sequence information.

Such results enable easy data handling and are an excellent starting point for annotation and for studying gene content and rearrangements.

Sequencing strategy: SG library with FLX++ (approx. 10-fold coverage); SG and LJD libraries of 3 kbp, 8 kbp and 20 kbp on Illumina HiSeq 2000 with the 2x 100 bp module.

150 bp, 250 bp and next year 300 bp:
Illumina keeps the competition on the go

Illumina is currently in the midst of the MiSeq sequencer updates. The software update, the new flowcells and the new sequencing chemistry enable runs with outputs of around 8 Gbp and 250 bp read length. The first updates reached Europe just recently, and only a few days ago our own MiSeq received the update.

That’s not the end of the story for Illumina. Just a week ago, they announced the next update. In the second half of 2013, Illumina is planning to offer another MiSeq update that will increase the output to 15 Gbp. They achieve this tremendous output for a benchtop device by increasing the read length to 300 bp and resolving about 25 million clusters on the flowcell.
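The announced figures can be checked with simple arithmetic. The sketch below assumes a paired-end run, i.e. two reads per cluster; that assumption is mine, but it is the one under which 25 million clusters and 300 bp reads add up to the quoted 15 Gbp. The function name is purely illustrative.

```python
def run_output_bp(clusters, read_length_bp, reads_per_cluster=2):
    """Total sequenced bases for one run (paired-end assumed by default)."""
    return clusters * read_length_bp * reads_per_cluster


# 25 million clusters, 2 x 300 bp
print(run_output_bp(25_000_000, 300) / 1e9, "Gbp")  # 15.0 Gbp
```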

Considering the intense competition with Life Tech’s Proton and Ion Torrent sequencers, Illumina needs to steadily improve the specs of its sequencing devices. In March, Life Tech plans to increase the output of their Proton sequencer to around 36 Gbp. That’s still more than the new MiSeq upgrade can deliver, but one also has to consider the differences in read length. While the MiSeq will be able to produce 300 bp reads soon thereafter, the Ion Proton generates reads of 100 to 150 bp. And the difference is even more remarkable when the sequencing on the MiSeq is performed with the paired-end module – an approach that is not possible with Life Tech’s devices. By using library insert sizes of around 450 – 500 bp, the two overlapping reads can be merged into a single consensus read of about that size.
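How two overlapping paired-end reads become one longer read can be sketched in a few lines. This is a deliberately simplified illustration: it requires an exact match in the overlap and ignores base qualities, whereas real read-merging tools tolerate mismatches and weigh qualities. All names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read and a reverse read (as sequenced) into one read.

    Tries the longest possible overlap first and returns None when no exact
    overlap of at least `min_overlap` bases is found.
    """
    read2_fwd = reverse_complement(read2)
    longest = min(len(read1), len(read2_fwd))
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == read2_fwd[:overlap]:
            return read1 + read2_fwd[overlap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGCTTAGGCCATATCGGATCCA"     # 30 bp toy insert
    forward = fragment[:20]                          # 20 bp read from one end
    reverse = reverse_complement(fragment[-20:])     # 20 bp read from the other end
    print(merge_pair(forward, reverse, min_overlap=5) == fragment)
```

For 2x 300 bp reads and inserts of 450 – 500 bp, the expected overlap is 2 x 300 minus the insert size, so roughly 100 – 150 bp.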

In my opinion, the Illumina MiSeq is at the forefront of the race, and if Illumina’s plan works out, it will still be there in 2013. But we all know how fast-moving the NGS market is. So let’s see what’s coming!

PacBio RS Data to Validate SNPs Called from Illumina Sequencing?

Would you have thought that PacBio RS sequences, with a single-read error rate of about 15%, can outdo MiSeq reads in validating variants called by WGS or exome sequencing? Personally, I wouldn’t have thought so. But a study from the Broad Institute published a few days ago clearly shows that they can.

Variants called in projects that aim at variant analysis definitely need validation, both to determine the rate at which the mutations have been correctly called and to confirm the specific reported changes. Currently used techniques like Sequenom genotyping and Sanger sequencing have substantial drawbacks, such as the need for manual interpretation or low data throughput. For that reason, Carneiro and his colleagues studied the power of PacBio RS and MiSeq data as validation tools and compared the results with each other.

They generated amplicons covering 98 variants called in the 1000 Genomes Project and sequenced the PCR products with both instruments, PacBio RS and MiSeq. Using PacBio RS data 96 out of the 98 variants could be correctly genotyped, whereas the MiSeq correctly genotyped only 93 sites. The explanation of the authors is quite simple: The completely random distribution of errors across the reads can overcome the low read accuracy problem if sufficient coverage is applied.
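The authors’ argument can be illustrated with a small binomial calculation. The sketch below assumes that errors are independent between reads and counts a site as miscalled only if at least half of the covering reads are wrong at that position. This is a simplified, conservative model of my own (it does not require the wrong reads to agree with each other, and it is not the genotyping method used in the study).

```python
from math import comb


def majority_error_probability(coverage, read_error=0.15):
    """Probability that at least half of `coverage` reads are wrong at a site.

    Assumes independent errors with a per-read, per-site error probability
    of `read_error` (0.15 is the single-read error rate quoted in the text).
    """
    threshold = (coverage + 1) // 2  # smallest count that is at least half
    return sum(
        comb(coverage, k) * read_error**k * (1 - read_error) ** (coverage - k)
        for k in range(threshold, coverage + 1)
    )


if __name__ == "__main__":
    for depth in (1, 5, 10, 30, 60):
        print(depth, majority_error_probability(depth))
```

Under these assumptions the probability drops quickly with depth, which is the intuition behind "sufficient coverage" in the explanation above. Systematic, non-random errors would not average out this way.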

Manual checking of the sites that were miscalled using the PacBio dataset revealed that one of the two miscalls was due to a reference bias (the true variation is hidden). Such bias is introduced by alignment parameters in which the gap open penalty is lower than the base mismatch penalty, so that a true mismatch can be hidden behind gaps. The high error rate of PacBio RS reads makes these parameters necessary.

However, Carneiro told GenomeWeb that the researchers are now using a different aligner that was developed at the Broad Institute. This aligner re-aligns the reads using different parameters and therefore reduces the problem to a great extent.

For me, the study shows that there is potential for PacBio RS sequencing. Nevertheless, like the variants themselves, this result needs to be validated. Furthermore, I think that the value of the study also needs to be seen in relation to the sequencing costs for both instruments. While the consumable prices for both techniques are in a similar range, the several-fold higher cost of the PacBio RS instrument makes a remarkable difference.

Base Modification Detection with Pacific Biosciences

After the launch of the new C2 chemistry for PacBio RS sequencing with longer read length, it has been quiet around Pacific Biosciences for a while. However, a few days ago they attracted attention again by launching new analysis software that indicates base modifications in the sequencing data. And from what I hear and read about these techniques, the epigenetics market could really be a great success story for Pacific Biosciences.

As PacBio’s SMRT sequencing observes the DNA polymerization in real time, it makes it possible not only to decode the sequence, but also to study the kinetic characteristics of the process. The kinetics of base incorporation are characteristically changed by the presence of a modified base in the template strand and can therefore be used to distinguish between different base modifications. Different modifications result in different signatures (or fingerprints) that vary in signal magnitude and in the length of the region over which the kinetics are altered.

I think that the study of base modifications with PacBio RS has several advantages compared to experiments like methyl-Seq or bisulfite sequencing. On the one hand, PacBio RS sequencing is a direct detection method, where no enzymatic restriction or bisulfite conversion has to be applied upfront. On the other hand – and this is the most important advantage for me – the PacBio RS system makes it possible to distinguish a wide spectrum of base modifications, which has not been possible so far.

Unfortunately, the recently launched software is not yet able to distinguish the different types of modifications; it only flags positions where modifications are present. However, the company has shown proof-of-principle data and has already stated that the information needed to discriminate between modifications will be incorporated into future releases of the software. Moreover, the company provides a Technical Note on its motif identification tool for bacterial methylomes.

Expression Profiling Without the Need for a Reference Genome

Interested in expression profiling, but you are working with a non-model organism?

A very elegant way to do this is to (1) generate long cDNA contigs with NGS technologies that serve as a reference transcriptome and (2) perform expression profiling by mapping Illumina HiSeq 2000 derived short reads of each sample back onto the reference. As only one read is generated per transcript, down- and up-regulated genes can easily be identified by counting the sequence hits.
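The counting step can be sketched as follows. The example assumes that mapping has already been done and that each mapped read is represented simply by the ID of the reference transcript it hit; the normalization to counts per million and the pseudocount are common conventions that I add here for illustration, not details taken from the study discussed below. All names are hypothetical.

```python
from collections import Counter
from math import log2


def count_hits(mapped_transcript_ids):
    """Count how many reads mapped to each reference transcript."""
    return Counter(mapped_transcript_ids)


def log2_fold_changes(counts_a, counts_b, pseudocount=1.0):
    """log2(sample B / sample A) per transcript, normalized by library size.

    The pseudocount avoids division by zero for transcripts that were
    observed in only one of the two samples.
    """
    total_a = max(sum(counts_a.values()), 1)
    total_b = max(sum(counts_b.values()), 1)
    changes = {}
    for transcript in set(counts_a) | set(counts_b):
        cpm_a = (counts_a.get(transcript, 0) + pseudocount) / total_a * 1e6
        cpm_b = (counts_b.get(transcript, 0) + pseudocount) / total_b * 1e6
        changes[transcript] = log2(cpm_b / cpm_a)
    return changes


if __name__ == "__main__":
    control = count_hits(["t1"] * 10 + ["t2"] * 10)
    treated = count_hits(["t1"] * 30 + ["t2"] * 10)
    for transcript, change in sorted(log2_fold_changes(control, treated).items()):
        print(transcript, round(change, 2))
```

A real analysis would use replicates and a proper statistical test instead of a bare fold change.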

This approach was used by Mutasa-Göttgens et al., 2012 to analyze targets involved in bolting and flowering in sugar beet. Understanding the regulation of vernalization-induced bolting and the change towards the reproductive phase is of high importance because bolting and flowering considerably reduce the sugar content.

To generate the reference transcriptome of the shoot apex, a normalized random-primed cDNA library was prepared and sequenced on Illumina HiSeq 2000 with the single-read module and 100 bp read length. De novo assembly yielded a total of 225,000 unique transcripts, 53,000 of which represent large transcripts (>500 bp and up to >8,700 bp). For quantitative comparison, we prepared digital gene expression (DGE) libraries for the research group from samples that were subjected to vernalization and/or phytohormone treatment. The libraries were sequenced on Illumina HiSeq 2000 and the reads were mapped onto the transcriptome reference sequence.

Bioinformatics analysis identified (amongst others) a potential regulator of vernalization, and therefore an interesting breeding target for the sugar beet crop.

In my opinion, this study is an excellent example of how to combine the strengths of the different available RNA-Seq libraries most effectively. The normalized random-primed library allows unbiased site-directed sequencing. Furthermore, the normalization process levels high and low expressed transcripts, which allows accurate identification of low expressed genes and considerably facilitates de novo assembly with short-read technology. The DGE library, in contrast, produces only one tag per transcript, thus allowing much deeper resolution than the mRNA-Seq approach from Illumina, which generates reads that cover the whole transcript.

Meanwhile, with new NGS libraries available, one would rather use a 3’-fragment library instead of the DGE library. At similar cost, this library type offers longer sequence information (100 bp versus 17 bp) and, in consequence, higher mapping accuracy and fewer non-mappable reads.

You will find more information regarding this combined approach including the 3’-fragment library for read counting in the following Application Note.

Expression Profiling with 3’-Libraries

Last week’s blog article was about expression profiling with mRNA-Seq libraries and about the sequencing depth this protocol requires. But there are other possibilities for expression profiling, and today I especially want to highlight the 3’-fragment library protocol.

The big advantage of this protocol is that it provides a much higher resolution than mRNA-Seq does. The reason is that in mRNA-Seq the average transcript is represented by approx. 10-25 reads that cover the whole transcript, while with the 3’-fragment protocol only one read is generated per transcript. The reads derived from a 3’-end library map to the 3’-end of the transcripts, and expression differences are easily determined by just counting the reads that map to a specific reference transcript.

The 10-25-fold higher resolution comes along with considerably reduced project costs, as 10-25-fold less sequencing is required to obtain a similar depth of analysis. Or in other words: when analyzing the same number of samples per channel, the 10-25-fold higher resolution allows the scientist to look even at very low expressed genes with reliable statistical evidence.

Of course, the mRNA-Seq protocol is needed if other analyses are to follow, like the study of alternative transcripts or fusion genes. But this is a completely different story anyway, as these applications need an even higher sequencing depth than expression profiling with mRNA-Seq requires.

In conclusion, I think it is definitely worth evaluating this protocol when planning an expression profiling experiment. And we would be delighted if you shared your thoughts on this with us and the other blog readers.

Expression Profiling and Sequencing Depth

The majority of scientists performing expression studies use the mRNA-Seq protocol (random-primed cDNA synthesis after fragmentation of polyA-purified transcripts) and sequence the fragments with Illumina technology. When planning the experiment, the question of sequencing depth immediately arises. For all of you interested in an answer, I want to share the recommendations on sequencing depth from experts in the field of transcriptome sequencing, published by GenomeWeb.

You can see that the recommendations vary between 10 million single-end reads and hundreds of millions of reads, depending on the exact need. And it is really tough for the experts to give a concrete number. Please keep in mind that about 80-85% of the transcript molecules in a typical transcriptome come from only a few highly expressed transcripts, whereas the majority of transcripts are present in only a few copies. For straight gene expression analysis, the interviewed scientists usually use around 20 – 30 million reads per sample. But when your aim is to look at really low expressed genes, like some transcription factors, you definitely have to apply a higher sequencing depth. The very same is true when transcript isoforms or fusion genes are to be analyzed. For these applications, the required sequencing depth can be as much as a full channel per sample.

So, enjoy reading their comments!

16S Amplicon Experiments: Which Platform to Choose?

Since 2010, several studies have been published that analyze microbial community composition by amplicon sequencing on the Illumina Genome Analyzer (GA). However, direct adaptation of these protocols for sequencing on the HiSeq 2000 – the currently predominant Illumina sequencer – is not possible, as the two systems use different basecalling pipelines. Therefore, amplicon sequencing on the Illumina HiSeq 2000 is still left to very experienced users, and only a few publications on it are available.

Meanwhile, Illumina has introduced the MiSeq as the optimal platform for this kind of project. In this context, they have published an application note presenting sequencing of the V4 region of 16S rRNA genes on the MiSeq system.

And I totally agree that the MiSeq is a very good tool for these studies. For me, the most important advantages of the MiSeq layout in comparison to the sequencing on Illumina HiSeq 2000 are as follows:

  • Shorter turnaround time: The sequencing run itself takes a bit more than one full day, while a HiSeq 2000 run takes up to 12 days.
  • More informational content: By overlapping two paired-end reads of 150 bp, full-length reads of about 250 bp can be generated.
  • Potential for even longer reads: Illumina has announced a read length of 250 bp for the end of the year. Merged reads of up to 450 bp should then be possible.

Nevertheless, Roche GS FLX+ sequencing is still able to generate much longer reads, with an average of up to 500-600 bp. The long read length provides a deeper insight into the microbiome of interest or, more precisely, higher classification efficiency down to species level. However, Roche sequencing comes with higher costs per base, so whether read length or sequencing depth is the most important factor will always be a decision based on the individual experiment.