Tag Archives: de novo sequencing

How to handle variants in a reference genome

When talking about genome sequencing, the Human Genome Project is one of the best-known projects. “Building” a reference genome that helps to identify disease-causing mutations is only one of many goals for the human reference genome.

But I am sure that all of you have already asked yourselves: how can a reference genome even exist? There are more than 7 billion people on earth, with many different characteristics among them. So how can one human reference serve all of mankind?

The Global Alliance, led by David Haussler, recently won a $1 million grant to create a graph model of the human genome (BioTechniques). The graph model should help to visualise variants as alternate pathways. In this way, a more comprehensive picture of “naturally occurring variants” and disease-causing variants might be gained. To support this approach, they got access to 300 complete human genome sequences from the Broad Institute in Cambridge.
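
To make the idea of variants as alternate pathways a bit more concrete, here is a minimal Python sketch. It is not the Global Alliance's model or file format; the node names, the toy sequences and both helper functions are made up for illustration. A single-base variant becomes a small "bubble" with two branches, and every path through the graph spells one possible sequence.

```python
from collections import defaultdict


def build_variation_graph(edges):
    """Return an adjacency list for a directed sequence graph."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph


def enumerate_paths(graph, labels, node, end):
    """Yield the sequence spelled by every path from `node` to `end`."""
    if node == end:
        yield labels[node]
        return
    for nxt in graph[node]:
        for tail in enumerate_paths(graph, labels, nxt, end):
            yield labels[node] + tail


# Hypothetical toy graph: n2 and n3 are the two alleles of one SNP.
labels = {"n1": "ACGT", "n2": "A", "n3": "G", "n4": "TTCA"}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]

graph = build_variation_graph(edges)
haplotypes = sorted(enumerate_paths(graph, labels, "n1", "n4"))
print(haplotypes)  # ['ACGTATTCA', 'ACGTGTTCA']
```

A linear reference would have to pick one of the two sequences; the graph keeps both, which is the point of the approach described above.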

From my point of view this is a great idea, and I hope it helps to pave the way for handling and interpreting the massive amounts of sequencing data in the near future!

Read the complete article at BioTechniques.com

 

Whose genome has been sequenced? Brassica napus

Brassica napus, also known as oilseed rape, was formed more than 7000 years ago by allopolyploidy (chromosome doubling from two Brassica species). Of course the genome mutated further, and it is known today that during this evolution some genes were preserved and further “improved” (e.g. oil biosynthesis genes), whereas others were lost over the course of time (e.g. glucosinolate genes).

Chalhoub et al. have now sequenced the genome, because it can help to “provide insights into allopolyploid evolution and its relationship with crop domestication and improvement” (Chalhoub et al.).

What was sequenced?

Young fresh leaves from the Brassica napus French homozygous winter line “Darmor-bzh”.

Sequencing strategy: Whole genome sequencing

  1. Libraries & Sequencing:
    Roche GS FLX: ~ 70 Million reads, Average Read length: ~ 368 bp, Genome coverage: 21.2 %
    Sanger BAC Seq: 141k reads, Read length: 650 bp; Genome coverage: 0.1%
    Illumina HiSeq:  ~375 Million reads, Read length: 36, 76, 108 and 150 bp, Genome coverage: 53.9%
  2. Data output: 44,146 contigs and 20,702 scaffolds
  3. Results: A final assembly of 849.7 Mb (using SOAP and Newbler) with 89% nongapped sequences.

After assembly, the genome was compared with those of related species (e.g. B. rapa and B. oleracea), which helped to find several interesting genes and gene variants that improve our understanding of its evolution.

Read the complete publication here.

Whose Genome Has Been Sequenced? – Recent posts:

Whose Genome Has Been Sequenced? Belgica antarctica

Extreme conditions require extreme actions, and this is what the midge Belgica antarctica has done. The midge lives exclusively in the Antarctic and, in order to survive, shrank its genome to the smallest possible size. As of today, this is the smallest insect genome that has been sequenced.

Kelley et al. have now sequenced the genome of Belgica antarctica with the aim of learning more about how insects in general can adapt to the most extreme conditions.

What was sequenced?

Two fourth-instar larvae (Belgica antarctica) collected near Palmer Station, Antarctica.

Sequencing strategy: Whole genome sequencing & RNA-sequencing

  1. Libraries & Sequencing: 1 channel 2x 100 bp Illumina HiSeq 2000 (SG library (400 bp insert)) and one SMRT-cell of a 10 kb fragment library on PacBio RSII (P4 DNA Polymerase)
  2. Data output: 92 M paired-end reads from the shotgun sequencing with Illumina. These resulted in 5,422 contigs. Using the paired-end RNA-Seq data the number of contigs has been reduced to 5,064. Genome coverage with Illumina sequencing ~ 100x.
  3. Results: The total genome is ~ 99 Mbp.
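
The ~ 100x coverage figure can be sanity-checked with the usual back-of-the-envelope formula: coverage = number of reads × read length / genome size. The short Python sketch below does this under one loud assumption: that the 92 M figure counts individual 100 bp reads. If it counts read pairs instead, the estimate doubles. The function name is my own.

```python
def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Expected average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Numbers from the list above; treating 92 M as single reads is an assumption.
cov = mean_coverage(n_reads=92_000_000, read_length_bp=100, genome_size_bp=99_000_000)
print(f"{cov:.1f}x")  # 92.9x
```

That lands close to the reported ~ 100x, so the numbers in the post are at least consistent with each other under this reading.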

For the PacBio sequencing a second larva was used. But due to the low input of genomic DNA, the PacBio data yielded only a modest improvement in the assembly. This underlines the need for a long-read sequencing technology that works with low amounts of input DNA.

The de novo sequencing of the midge Belgica antarctica revealed that the small genome size is achieved by a reduction in repeats, TEs and intron size.

Read the complete publication here.

Whose genome has been sequenced? Aquila chrysaetos

Every day an unimaginable amount of NGS data is generated. Nevertheless, the number of avian genomes that have been sequenced so far is still quite small (Doyle et al., table 1). Doyle et al. added one more avian genome to this list: the “Golden Eagle”, Aquila chrysaetos.

What was sequenced?

A male golden eagle (Aquila chrysaetos canadensis) captured in the southern Sierra Nevada.

Sequencing strategy: Whole genome sequencing

  1. Libraries & Sequencing: 1 channel 2x 100 bp SG paired-end sequencing and 1 channel 2x 100 bp mate-paired sequencing using the Illumina HiSeq platform
  2. Data output: 68.4 Gb of raw data (25.3 Gb from the SG and 43.1 Gb from the mate-pair library). Total genome size (incl mtDNA) ~ 1.28 Gbp. Overall genome coverage ~ 40x. Longest scaffold: 11,517,212 bp
  3. Results: The mtDNA genome is characterised by 13 protein-coding genes, 2 rRNAs and 23 tRNAs. The annotation produced a total of 16,571 predicted nuclear genes.

Besides the nuclear genome, Doyle et al. could also assemble the complete mitochondrial genome. Furthermore, they found ~ 800,000 novel polymorphisms. These polymorphisms can now help to define markers that are involved in carnivory or other biological processes.

Read the complete publication here.

Whose genome has been sequenced? Nasuia deltocephalinicola

The human genome comprises more than 3 billion base pairs and encodes more than 20,000 protein-coding genes. For genomes like this, high-throughput sequencers like the HiSeq 2000 are a revelation. In this article we talk about the smallest genome sequenced so far; here, sequencing with the MiSeq is more than sufficient. The main role of this small symbiont (Nasuia deltocephalinicola), together with the symbiont Sulcia muelleri, is to provide 10 essential amino acids to the pest insect Macrosteles quadrilineatus (Bennett et al.).

What was sequenced?

10 phloem-feeding pest insects (Macrosteles quadrilineatus) including the obligate symbionts Nasuia deltocephalinicola & Sulcia muelleri.

Sequencing strategy: Whole genome sequencing

  1. Libraries & Sequencing: 2x 250 bp paired-end sequencing using the Illumina MiSeq platform
  2. Data output: 12,000 contigs (> 500 bp) including reads from the pest insect and Sulcia muelleri; MIRA and Velvet assembly revealed two large scaffolds for Nasuia (~102 kb & ~12 kb)
  3. Bioinformatics: Many tools have been used, including Velvet for initial read assembly; SOAP2 to map the symbiont-derived reads to the Velvet contigs and MIRA for re-assembly of isolated symbiont reads

The biggest challenge with this genome mixture was definitely the bioinformatic analysis. Over several cycles of mapping and assembly, the reads that belong to one organism need to be separated from the remaining reads. But this labour-intensive approach revealed the smallest bacterial genome yet sequenced (112 kb).

Read the complete publication here.

Whose genome has been sequenced? Thunnus orientalis

When talking about sea life, everyone knows what sharks or whales look like and how they behave. Sadly, I think little is known about tuna. Tuna is more or less only known as a delicious meal. So it’s all the more pleasant to see that the recent de novo genome sequencing approach of Nakamura et al. aims to learn more about the predatory behaviour of tuna and not about breeding or cultivation (Nakamura et al.). With this genome sequencing project of Thunnus orientalis, the scientists could show that tuna harbors some unique tactics to catch its prey.

What was sequenced?

The diploid genome of a wild-caught male Pacific bluefin tuna (T. orientalis) was sequenced.

Sequencing strategy: Whole genome sequencing

  1. Hybrid approach: Roche 454 GS FLX Titanium & Illumina GAIIx
  2. Libraries: Shotgun & paired-end libraries on Roche 454 & paired-end libraries on Illumina GAIIx
  3. Read output: 31.9 million 454 reads, including 4.9 million long paired-end reads (11.9x coverage) & 229.7 million Illumina paired-end reads (43x coverage)
  4. Data output: 192,169 contigs (> 500 bp) that could be assembled in 16,802 scaffolds (> 2 kb), totaling 740.3 Mb (= 92.5% of the estimated genome size (~ 800 Mb))
  5. Bioinformatics: Roche 454 read assembly with Newbler (Version 2.5) followed by mapping of the paired-end Illumina reads with Bowtie (Version 0.12.7).
    Note: 7,259 nucleotide mismatches & 312,851 short indels could be eliminated by mapping the Illumina reads onto the scaffolds with bwa (Version 0.5.9)

Sequencing strategy: Transcriptome analysis

  1. Libraries & Sequencing: Normalized cDNA libraries have been sequenced with the Roche 454 FLX Titanium Instrument
  2. Read output: 3.8 million 454 reads
  3. Data output: 5,741 full-length cDNA sequences
  4. Bioinformatics: Assembly was performed using Newbler (Version 2.5)

From the sequencing strategy point of view this publication shows again that the hybrid approach of the Roche 454 long read technology and the Illumina short read technology is one of the most used techniques for de novo genome sequencing (Hybrid assemblies).

From a scientific point of view, this publication could show that tuna has the most RH2 paralogs among the fishes studied and that three of these genes are mutated compared to the others. According to Nakamura et al., these changes might be responsible for the tuna’s remarkable ability to detect blue-green contrasts and therefore to measure the distance to prey in the blue pelagic ocean.

Read the complete publication here.

De Novo Genome Sequencing of Sunflower Via RAD-Seq

Dear NGS Expert Blog reader,

As part of our ongoing series of posts on RAD sequencing, I wanted to share some results from a recently published study describing the use of RAD-Seq for high throughput SNP development in Helianthus annuus (Sunflower).

Sunflower is one of the leading oilseed and confectionery crops in North America, with an annual crop mass of approximately 1 billion kilograms and an economic value over 720 million USD. Despite the economic importance of sunflower, relatively modest genomic resources exist for molecular genetic and marker assisted breeding applications.

To accelerate genomics resource development in sunflower, Floragenex was tasked with rapidly identifying a large set of single nucleotide polymorphism (SNP) markers in North American sunflower through the use of RAD sequencing. The end goal was to translate those markers into a downstream genotyping assay, which could be used for high-throughput applications such as linkage and association mapping.

Some highlights of this study:

  • RAD-Seq was used to rapidly construct over 15.1 Mb of de novo sunflower genomic sequence, comparable in size to a small eukaryotic transcriptome.
  • There were over 94,000 putative SNP markers identified from analysis of six sunflower lines sequenced via RAD-Seq.
  • 16,467 of these variants were incorporated into an Illumina Infinium Genotyping Array.

The above study elegantly demonstrates how RAD is an incredibly efficient marker discovery tool. From just under half a lane of Illumina data (44M 2x80bp reads), a marker resource of over 16 thousand high quality variants could be rapidly generated and deployed for breeding applications.

The full article, entitled “De novo sequencing of sunflower genome for SNP discovery using RAD (Restriction site Associated DNA) approach” can be found on BMC Genomics.

As a co-author on the publication, I would be happy to answer any of your questions on this paper, so don’t hesitate to post them. For my next NGS blog entry, I’ll be showing you some interesting publication trends seen with RAD sequencing.

Cheers,
Rick Nipper
President, Floragenex

Whose genome has been sequenced? Emiliania huxleyi

Dressing up by pulling carbon dioxide out of the water: this is the speciality of the coccolithophore Emiliania huxleyi. Using carbon dioxide, E. huxleyi makes microscopic disks of calcite, with which it clothes itself (about.com). This carbon fixation accounts for ~ 20% of carbon fixation in some systems, which is really impressive. Read and her colleagues used one strain from the South Pacific to investigate the global distribution and the heterogeneity of the genome of this coccolithophore (Read et al.). Amongst other things, they could reveal that “this organism is unusually diverse and has a huge genome with a large number “optional” genes. This kind of “pan genome” has not previously be found outside the bacteria” (Alden, about.com).

What was sequenced?

A batch culture of the diploid strain Emiliania huxleyi CCMP1516 from the South Pacific

Sequencing strategy: Whole genome sequencing

  1. Libraries: 3 libraries (insert sizes: 3 kbp, 8 kbp, 20-40 kbp). The majority was sequenced using the ABI 3730 XL
  2. Read output: 3,910,095 whole genome shotgun reads (10x coverage)
  3. Data output: 6,995 scaffolds of the final nuclear genome (excluding mitochondrial, chloroplast and eukaryotic scaffolds), where 321 large scaffolds harbor 70% of the total sequence
  4. Bioinformatics: Analysis of prokaryotic only scaffolds with total lengths greater than 100 kb -> Genome assembly with Arachne
    Note: All contigs and scaffolds < 4 kb in length were excluded from the final assembly due to the high GC content (65%) and large amount of repetitive region in E. huxleyi
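
The two quantities in the note, a minimum length cut-off and GC content, are easy to compute yourself. The following Python sketch is only an illustration and not the pipeline used in the study; the function names and the toy sequences are made up, and only the 4 kb threshold is taken from the note above.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_by_length(seqs, min_len):
    """Keep only sequences that are at least `min_len` bases long."""
    return {name: s for name, s in seqs.items() if len(s) >= min_len}


# Hypothetical toy scaffolds: 6,000 bp, 2,000 bp and 6,000 bp long.
toy = {
    "scaf_a": "GC" * 3000,
    "scaf_b": "GGCAT" * 400,
    "scaf_c": "ATGC" * 1500,
}

kept = filter_by_length(toy, min_len=4000)  # 4 kb cut-off as in the note
for name, seq in kept.items():
    print(name, len(seq), f"GC = {gc_content(seq):.0%}")
```

Here scaf_b is dropped because it is shorter than 4 kb, while the two longer scaffolds are kept regardless of their GC content.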

2nd whole genome sequencing approach:

  1. Libraries: 13 shotgun libraries for 13 different strains using Illumina HiSeq sequencing (3 strains deeply sequenced and 10 strains moderately sequenced)
  2. Read output: ~ 36 x 10^9 reads per strain (strain 1-3) -> 265-352x coverage and ~ 27 x 10^6 reads per strain (strain 4-13) -> 14-29x coverage
  3. Data output: total scaffold lengths: 98-117 Mb (strain 1-3) & 49 – 76.5 Mb (strain 4-13)
  4. Bioinformatics: De novo genome analysis using CLC Genomics & BLASTn for comparison of the deeply sequenced strains

Sequencing strategy: Transcriptome analysis

  1. Libraries: 4 cDNA libraries corresponding to different development stages and growth conditions were prepared and sequenced using the ABI 3730
  2. Data output (filtered): 30,569 genes  (these genes cover 40% of the genome)
  3. Bioinformatics: Genome annotation and alignment using BLAST and BLAT

I think one of the most interesting facts from this study is that they used Sanger sequencing for a large part of the project. According to their comparisons with, for example, the Illumina data, the scaffold completeness of the Sanger data is estimated at 96%. And although it seems that Sanger sequencing might also be suitable for small genomes, for me the question remains whether a hybrid NGS approach consisting of Roche GS FLX++ and Illumina HiSeq might have made the project easier.

Read the complete publication here.

Whose genome has been sequenced? Anas platyrhynchos

Starting with the great deal of attention paid to bird flu in 2005, a potential influenza epidemic is discussed in the media nearly every year. This leads to greater awareness of influenza research projects. Ducks are a well-suited research tool for influenza viruses: they harbor nearly all hemagglutinin (HA) and neuraminidase subtypes, and the harm to the ducks is often negligible.
Huang and his research team have now sequenced the duck genome to search for defense mechanisms in ducks against influenza viruses (Huang et al., Nature Genetics).


What was sequenced?

A 10-week-old female Beijing duck (Anas platyrhynchos)

Sequencing strategy: Whole genome sequencing

  1. Libraries: 8 shotgun libraries and 5 mate-pair libraries (insert sizes: SG lib 185 – 530 bp,  mate-pair lib 2 – 10 kb), (50 bp reads) using the Illumina GA Solexa technology
    note: sequencing method according to the de novo Panda Genome project
  2. Read output: >77 Gb of paired-end reads (~ 64x coverage)
  3. Data output: 78,487 scaffolds with a contig N50 length of 26 kbp and a scaffold N50 length of 1.2 Mbp; total covered length of 1.1 Gb (~ 95% of the genome)
  4. Bioinformatics: Genome assembly using SOAPdenovo
  5. Additional comparative studies with the duck genetic and physical map resulted in 47 superscaffolds which contained 225 scaffolds and spanned 289 Mbp
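
For readers who are not familiar with the N50 values quoted above: the N50 is the length of the contig (or scaffold) at which, going from the longest to the shortest, half of the total assembly length has been reached. A minimal Python sketch with made-up toy lengths (not the duck data) looks like this:

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths
print(n50(toy_lengths))  # 70
```

In the toy example the total is 300; the two longest contigs (80 + 70) already reach 150, so the N50 is 70. A scaffold N50 of 1.2 Mbp therefore means that half of the assembled sequence sits in scaffolds of at least 1.2 Mbp.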

Transcriptome analysis

  1. Libraries: Infected as well as control duck transcriptomes were sequenced using the Roche GS FLX instruments. In addition cDNA-libraries  were sequenced using the Illumina GA instrument
  2. Data output after BI: 319,996 contigs with an average length of 307 bp
  3. Bioinformatics: Illumina transcriptome mapping and assembly was performed using SOAPaligner and SOAPdenovo software. Re-assembly together with 454 data was performed using Phrap software

The intensive study of the duck genome using de novo genome and transcriptome sequencing approaches helped to identify significant changes in the genetic pattern compared to other bird species: “the duck genome […] includes genes that are not present in the other three species whose genomes have been sequenced.” (Huang et al.)

I think it’s quite an interesting approach to learn more about a virus and its infectivity by studying the interaction between host and virus.

Read the complete publication here.

Whose Genome Has Been Sequenced? Hevea brasiliensis

All of us have done experiments in the lab at least once, so everyone has come across latex gloves. And more and more of us have developed some kind of latex allergy.

According to Rahman et al. “these allergies are triggered by certain proteins present in Hevea-derived natural rubber (NR). […] Hevea brasiliensis (Willd.) Muell.-Arg., also known as Pará rubber tree, is the primary commercial source for natural rubber (NR) production” (in total nearly 11 million tons in 2011 for all 2,500 rubber tree species).

Although rubber is used in > 50,000 products worldwide, this is the first de novo genome sequencing approach. So far only transcriptome studies have been performed, which miss the non-coding regions of the genome.

What was sequenced?

Young leaves of Hevea brasiliensis RRIM 600. Genome size: ~ 2.15 Gb; 18 chromosomes

De novo sequencing strategy:

  1. Libraries: shotgun and mate-pair libraries (insert size: 500 bp) on HiSeq 2000; LPE libraries (insert sizes: 8 kb and 20 kb) on Roche GS FLX; Paired-end library (insert size: 2 kb) on SOLiD
  2. Coverage of all sequencing strategies together: ~ 43x (after filtering repeat-matching reads: ~ 13x = 27.86Gb)
  3. Data output: 143 scaffolds (total 1,119 Mb with N50 = 2,972 bp)
  4. Bioinformatics: CLC Workbench & Newbler assembler using different input data and different assembling strategies

Transcriptome sequencing strategy:

  1. Libraries: cDNA libraries
  2. Sequencing with Illumina HiSeq and Roche/454
  3. Bioinformatics: CLC Workbench assembler for the Illumina reads and Newbler for combining Roche and Illumina reads.

This de novo genome sequencing approach revealed that ~ 78% of the genome consists of repetitive regions. This study helps to improve breeding of H. brasiliensis by allowing marker-assisted selection to further increase disease resistance and minimize allergenicity.

Read the complete publication here.
