Tag Archives: Bioinformatics

Whose genome has been sequenced? Emiliania huxleyi

Dressing up by pulling carbon dioxide out of the water – this is the speciality of the coccolithophore Emiliania huxleyi. Using carbon dioxide, E. huxleyi makes microscopic disks of calcite, with which it clothes itself (about.com). Its carbon fixation accounts for ~20% of the carbon fixation in some systems, which is really impressive. Read and her colleagues used one strain from the South Pacific to investigate the global distribution and the heterogeneity of the genome of this coccolithophore (Read et al.). Among other things, they revealed that “this organism is unusually diverse and has a huge genome with a large number of ‘optional’ genes. This kind of ‘pan genome’ has not previously been found outside the bacteria” (Alden, about.com).

What was sequenced?

A batch culture of the diploid strain Emiliania huxleyi CCMP1516 from the South Pacific

Sequencing strategy: Whole genome sequencing

  1. Libraries: 3 libraries (insert sizes: 3 kbp, 8 kbp, 20-40 kbp). The majority was sequenced using the ABI 3730 XL
  2. Read output: 3,910,095 whole genome shotgun reads (10x coverage)
  3. Data output: 6,995 scaffolds of the final nuclear genome (excluding mitochondrial, chloroplast and eukaryotic scaffolds), where 321 large scaffolds harbor 70% of the total sequence
  4. Bioinformatics: Analysis of prokaryotic only scaffolds with total lengths greater than 100 kb -> Genome assembly with Arachne
    Note: All contigs and scaffolds < 4 kb in length were excluded from the final assembly due to the high GC content (65%) and the large number of repetitive regions in E. huxleyi
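To make the two quantities in the note above a bit more tangible, here is a minimal sketch of a length cutoff and a GC content calculation. The function names, the scaffold names and the toy sequences are my own assumptions for illustration; this is not the pipeline used by the authors.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_length(scaffolds, min_len=4000):
    """Keep only scaffolds that are at least min_len bases long."""
    return {name: seq for name, seq in scaffolds.items() if len(seq) >= min_len}


# Toy input: hypothetical scaffold names and made-up sequences.
toy = {
    "scaffold_1": "GC" * 2500,     # 5,000 bp, passes the 4 kb cutoff
    "scaffold_2": "GCGCAT" * 500,  # 3,000 bp, dropped by the 4 kb cutoff
}

kept = filter_by_length(toy)
print(sorted(kept))
print(round(gc_content(toy["scaffold_2"]), 2))
```

The 4 kb default mirrors the cutoff mentioned in the note; everything else is made up.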

2nd whole genome sequencing approach:

  1. Libraries: 13 shotgun libraries for 13 different strains using Illumina HiSeq sequencing (3 strains deeply sequenced and 10 strains moderately sequenced)
  2. Read output: ~36 x 10^9 reads per strain (strains 1-3) -> 265-352x coverage, and ~27 x 10^6 reads per strain (strains 4-13) -> 14-29x coverage
  3. Data output: total scaffold lengths: 98-117 Mb (strain 1-3) & 49 – 76.5 Mb (strain 4-13)
  4. Bioinformatics: De novo genome analysis using CLC Genomics & BLASTn for comparison of the deeply sequenced strains

Sequencing strategy: Transcriptome analysis

  1. Libraries: 4 cDNA libraries corresponding to different development stages and growth conditions were prepared and sequenced using the ABI 3730
  2. Data output (filtered): 30,569 genes  (these genes cover 40% of the genome)
  3. Bioinformatics: Genome annotation and alignment using BLAST and BLAT

I think one of the most interesting facts about this study is that they used Sanger sequencing for a large part of the project. According to their comparisons with, for example, the Illumina data, the scaffold completeness of the Sanger data is estimated at 96%. And although it seems that Sanger sequencing might also be suitable for small genomes, for me the question remains whether a hybrid NGS approach combining Roche GS FLX++ and Illumina HiSeq might have made the project easier.

Read the complete publication here.

Whose Genome Has Been Sequenced? – Recent posts:

Whose genome has been sequenced? Anas platyrhynchos

Starting with the great deal of attention given to bird flu in 2005, a potential influenza epidemic is discussed in the media nearly every year. This leads to greater awareness of influenza research projects. Ducks are a well-suited research tool for influenza viruses. Ducks harbor nearly all hemagglutinin (HA) and neuraminidase subtypes, and the harm to the ducks is often negligible.
Huang and his research team have now sequenced the duck genome to search for defense mechanisms in ducks against influenza viruses (Huang et al., Nature Genetics).

What was sequenced?

A 10-week-old female Beijing duck (Anas platyrhynchos)

Sequencing strategy: Whole genome sequencing

  1. Libraries: 8 shotgun libraries and 5 mate-pair libraries (insert sizes: SG lib 185 – 530 bp,  mate-pair lib 2 – 10 kb), (50 bp reads) using the Illumina GA Solexa technology
    note: sequencing method according to the de novo Panda Genome project
  2. Read output: >77 Gb of paired-end reads (~ 64x coverage)
  3. Data output: 78,487 scaffolds with a contig N50 length of 26 kbp and a scaffold N50 length of 1.2 Mbp; total covered length of 1.1 Gb (~ 95% of the genome)
  4. Bioinformatics: Genome assembly using SOAPdenovo
  5. Additional comparative studies with the duck genetic and physical map resulted in 47 superscaffolds which contained 225 scaffolds and spanned 289 Mbp
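As a quick sanity check on the numbers above, fold coverage is simply the sequenced bases divided by the genome size. The sketch below uses only figures from this post; the function name is my own, and deriving the genome size from "1.1 Gb is ~95% of the genome" is my assumption, not a number taken from the paper.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is covered by reads."""
    return total_bases / genome_size


assembled_length = 1.1e9   # total covered length reported above
assembled_fraction = 0.95  # "~95% of the genome"
genome_size = assembled_length / assembled_fraction  # assumption: roughly 1.16 Gb

sequenced_bases = 77e9     # ">77 Gb of paired-end reads"
coverage = fold_coverage(sequenced_bases, genome_size)
print(f"estimated genome size: {genome_size / 1e9:.2f} Gb")
print(f"estimated coverage: {coverage:.0f}x")
```

Under these assumptions the estimate lands in the mid-60s, the same range as the ~64x stated in the list.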

Transcriptome analysis

  1. Libraries: Infected as well as control duck transcriptomes were sequenced using the Roche GS FLX instruments. In addition, cDNA libraries were sequenced using the Illumina GA instrument
  2. Data output after bioinformatics: 319,996 contigs with an average length of 307 bp
  3. Bioinformatics: Illumina transcriptome mapping and assembly was performed using SOAPaligner and SOAPdenovo software. Re-assembly together with 454 data was performed using Phrap software

The intensive study of the duck genome using de novo genome and transcriptome sequencing approaches helped to identify significant changes in the genetic pattern compared to other bird species: “the duck genome […] includes genes that are not present in the other three species whose genomes have been sequenced.” (Huang et al.)

I think it’s quite an interesting approach to learn more about a virus and its infectivity by studying the interaction between host and virus.

Read the complete publication here.

Whose Genome Has Been Sequenced? Theobroma cacao L.

I suppose there is no human being on the planet who does not know chocolate. “The tropical Theobroma cacao tree has been cultivated for at least three thousand years. Its earliest documented use is around 1100 BC (wiki.org).”

The latest de novo genome sequencing publication about a cacao plant focusses on the Theobroma cacao L. Matina 1-6 clone, which is the most commonly cultivated type of cacao worldwide (Motamayor et al.). Although a first draft of this clone was already published in 2010, the authors aim for an improved version of the genome to identify candidate genes regulating traits.

What was sequenced?

Leaves from Theobroma cacao L. Matina 1-6 clone; haploid genome size ~0.5 Gbp

Sequencing strategy: Whole genome sequencing plus BAC & fosmid end sequencing

  1. Libraries: shotgun and 8 long paired-end (LPE) libraries (insert sizes: 3 kbp, 6 kbp, 8 kbp) on the Roche GS FLX; three fosmid libraries and three independent BAC libraries with Sanger sequencing
  2. Read output: > 32 million reads
  3. Data output: 711 scaffolds with a total scaffold length of 346 Mbp with a contig N50 length of 84.4 kbp and a scaffold N50 length of 34.4 Mbp
  4. Bioinformatics: Among other tools, Arachne, Megablast and blastx were used for genome assembly
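Since contig and scaffold N50 values come up in every one of these posts, here is a minimal sketch of how an N50 is computed from a list of sequence lengths. The function name and the toy lengths are my own; the lengths are made up and are not taken from the cacao assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that all sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


# Toy example with made-up contig lengths (total = 290, half = 145).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 80 + 70 = 150 >= 145, so the N50 is 70
```

A larger N50 means that more of the assembly sits in long pieces, which is why a scaffold N50 of 34.4 Mbp reads as a sign of a well-connected assembly.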

Gene annotation and orthology analysis

  1. Libraries: long normalised libraries sequenced on the Roche GS FLX and short-paired reads libraries sequenced on the Illumina platform
  2. Read output: ~ 7 M reads from the Roche and ~ 1 billion reads from the Illumina sequencing
  3. Bioinformatics: Transcriptome assembly using the NCBI TSA within BioProject 51633 & final refining using PASA. Further tools were used for marker identification and comparison to other plant species

As further analyses, re-sequencing as well as qPCR expression analysis were performed to finally report a “high-quality sequence and annotation of T. cacao L. and demonstrate its utility in identifying candidate genes regulating traits.” (Motamayor et al.)

From my point of view this is a highly complex study using a comprehensive range of sequencing technologies. This shows once more that more than one sequencing strategy is needed to fully characterise a genome and start interpreting its secrets.

Read the complete publication here.

Acquisitions And Rumours

It has been a while since I last wrote about acquisitions. But that does not mean that nothing happened – quite the opposite: the NGS business is so dynamic that I am not sure which news will already be outdated one day later.

But now it might be time to gather all the news in this blog to at least list the latest mergers, acquisitions and rumours in the field of NGS:


1. QIAGEN acquired Ingenuity for $105M

QIAGEN, one of the market leaders in Sample & Assay Technologies, is now building up a branch in Next Generation Sequencing. The first step was the acquisition of Intelligent Bio-Systems in 2012. The launch of the upcoming sequencing device is scheduled for mid-2013. The acquisition of Ingenuity now seems to be the last piece of the jigsaw for a complete NGS workflow from sample preparation to complete data analysis (see PR QIAGEN). I am really confident regarding the sample preparation and the data analysis. But some doubts remain with respect to the NGS device – at least I have never heard about it before…

2. Life Tech – in great demand

Just two days ago an article in GenomeWeb revealed that two other bidders for Life Tech were Roche and Sigma-Aldrich. The rumours I had heard so far only said that Roche was interested in IonTorrent to push its own NGS business. According to the respective GenomeWeb article, the Thermo – Life deal is anticipated to be completed in early 2014.

3. Roche – expanding or downsizing?

But although some rumours say that Roche is still interested in IonTorrent, it might also be that they will shift their focus, especially since Roche has downsized its efforts in the Applied Science business. According to the announcement, Roche will integrate these products with other units, and they also stopped the collaboration with DNAe to develop a semiconductor sequencing platform. Maybe because a new development might take too long. Maybe because the deal for IonTorrent is under way…

And while writing this summary I remembered again why these updates are so difficult to phrase: I can’t shake the feeling that something new, something more interesting is already close to publication.

DNA as Digital Data Storage – New Ways for using NGS?

While the data output and quality of Next Generation Sequencing are continually increasing, the cost per base is steadily dropping. A survey from the National Human Genome Research Institute (NHGRI) shows that the cost decline even outpaces Moore’s law. Due to this trend, new doors for research are opening that may not have been regarded as realistic in the past.

For example, over the past years, several approaches have been made to use DNA as a means of storing information. In a study recently published online in Science, scientists developed a strategy to encode and read digital information using DNA synthesis and Next Generation Sequencing systems.

An HTML document containing more than 50,000 words, 11 JPG images, and a JavaScript program was encoded in DNA by synthesizing nearly 55,000 oligonucleotides on high-fidelity microarrays. The information stored in the oligonucleotide library was later “read” by Illumina sequencing.
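To get a feeling for what "encoding digital information in DNA" can mean, here is a minimal sketch that maps every two bits of a file to one base and back. Please note that this 2-bits-per-base mapping and the function names are my own assumptions for illustration; this is not the encoding scheme used in the study, and it ignores practical issues such as splitting the data into short oligonucleotides, addressing them, and handling synthesis or sequencing errors.

```python
BASES = "ACGT"  # hypothetical mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode(data):
    """Translate bytes into a DNA string, four bases per byte."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def decode(dna):
    """Translate a DNA string produced by encode() back into bytes."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


message = b"<html>"
dna = encode(message)
print(dna)
print(decode(dna) == message)
```

In this toy scheme one byte always becomes four bases, so decoding is just the reverse lookup.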

According to the authors, DNA is a very useful medium for long-term storage of information: DNA is very stable over many years and allows data storage at very high density and in small volumes. The senior author, Kosuri, told InSequence that they used only some 50 ng of oligonucleotides to store the information of this HTML document! Kosuri admitted that the study cost several thousand dollars. However, if Next Generation Sequencing continues to develop at the same speed as today, new applications such as using DNA for (long-term) data storage may become a feasible option.

So let us see what is coming next!

Creating the Perfect Genome Assembly

In a webinar, Dr. George Weinstock from the Genome Institute at Washington University presents how to create the perfect genome assembly using the optical mapping system from OpGen Inc.

The singing mouse

Next Generation Sequencing (NGS) is transforming today’s genomic research and is used in numerous applied areas from clinical diagnostics to academic research. In Texas, USA, Dr. Steven Phelps and his research team recently used NGS to discover a gene which allows mice to communicate by singing a song. I have to admit it sounds more like screaming than singing to me. But Phelps and his team found out that a gene called FOXP2 is responsible for this way of communication.

Phelps uses next-generation sequencing to decipher how FOXP2 interacts with DNA to regulate the function of other genes. The process involves reading tiny fragments of overlapping DNA so that the entire sequence can be deduced. It is a procedure that generates massive amounts of data that only the processing power of a supercomputer can handle, said O’Connell (Source: www.tacc.utexas.edu). So data handling & storage is still one of the biggest challenges when performing Next Generation Sequencing projects. But now take the chance and listen to the song of this little mouse.
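The idea of deducing a sequence from overlapping fragments can be shown with a very small toy. The sketch below merges reads that are already given in left-to-right order and are error-free; both are strong simplifying assumptions of mine, and real assemblers have to deal with unordered reads, sequencing errors and repeats. The function names and the reads are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(reads, min_len=3):
    """Merge error-free reads that are already in left-to-right order."""
    contig = reads[0]
    for read in reads[1:]:
        k = overlap(contig, read, min_len)
        if k == 0:
            raise ValueError("no sufficient overlap with read " + read)
        contig += read[k:]  # append only the part that is new
    return contig


toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(merge_in_order(toy_reads))
```

With millions of reads instead of three, it becomes clear why this step needs serious computing power.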

What is Optical Mapping?

Whole Genome Mapping (WGM) using the OpGen Argus technology delivers high-resolution, ordered whole genome restriction maps from single DNA molecules. To obtain such a restriction map, it is crucial to isolate long DNA fragments (200 kb in size and longer) and to capture the DNA on a solid phase. Afterwards the DNA is digested, which reveals the restriction cleavage sites as gaps when the DNA is visualized with a fluorescence microscope. This optical map is then converted into digital data, the so-called “single molecule restriction maps” (see video below). The software MapSolver enables the following analysis options (see details in the analysis video):

  • Perform Genome Comparisons
  • Identify Motifs, Annotate Features, and view in silico sequence data
  • Perform Sequence Placement
  • Create Similarity Clusters
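To compare sequence data with a restriction map, the sequence is first digested "in silico", which gives an ordered list of fragment lengths. Here is a minimal sketch of that idea. The function name and the toy sequence are my own, the enzyme (BamHI, recognition site GGATCC, cutting after the first G) is only an illustrative choice, and none of this reflects how MapSolver works internally.

```python
def in_silico_digest(seq, site="GGATCC", cut_offset=1):
    """Return the ordered fragment lengths after cutting seq at every site.

    cut_offset is the position of the cut within the recognition site.
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Made-up 24 bp toy sequence with two recognition sites.
toy_seq = "AAAA" + "GGATCC" + "TTTTTT" + "GGATCC" + "CC"
fragments = in_silico_digest(toy_seq)
print(fragments)
```

The ordered fragment lengths are the in silico counterpart of the gaps seen under the microscope, and this pattern is what can be matched between a contig and the optical map.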

 Video about step 3: How to scan and assemble single molecule restriction maps (SMRM)

Recently we gained access to this innovative technology and are now able to combine our Next Generation Sequencing Service with the WGM technology. The combination of NGS and WGM can be used to order the contigs from a next generation sequencing project against the optical map scaffold. This method can greatly improve sequencing assemblies. If you are interested in a combined or stand-alone project for WGM, please do not hesitate to contact us.

We look forward to discussing WGM in detail with you.

Combined Expertise to Offer Complete Genome Sequencing and Analysis Project Services

Eurofins MWG Operon and Integrated Genomics have announced a cooperation agreement to combine their expertise in sequencing and analysis services for microbial, fungal and algal organisms. The goal of the cooperation is to provide customers “one-stop-shopping” for complete genome projects that delivers analysis results from the raw extracted DNA. Read the press release.