
Evolving SAM/BAM Format for De Novo Assemblies

I guess any bioinformatician working with NGS data is familiar with the SAM/BAM format, which has evolved into a widely accepted standard in the NGS community. Many software packages (e.g. mapping and visualization tools such as BWA, Tablet or IGV) have adopted this open file format to cope with the large amounts of data produced by, for example, read mapping. The benefit of an accepted data format is easy access to structured information: in this case, two open-source libraries, samtools and Picard, provide functionality to access the data.
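To make "structured information" a bit more concrete, here is a minimal sketch of reading one SAM alignment line in plain Python. It does not use samtools or Picard; `SamRecord`, `parse_sam_line` and the example read are made up for illustration, and the field names follow the eleven mandatory columns as they are named in the SAM specification.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory, tab-separated fields of a SAM alignment line."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (or contig) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; optional tags after column 11 are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


# Hypothetical read aligned to a hypothetical contig.
example = "read1\t0\tcontig1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar)
```

For real data you would of course use one of the libraries mentioned above rather than splitting lines by hand.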

However, the field of de novo assembly still lacks an open and widely adopted data format. As a result, existing assembly formats like ACE, CAF or AMOS are not easily converted into one another.

Now, prompted by a recent blog entry by Nick Loman (Sept 19, 2011), an intensive discussion has started in the community (samtools-devel mailing list) on how to adapt the SAM/BAM standard for de novo assemblies as well. The intention is to modify the existing format (v1.4) so that the reference sequence can also be stored in the SAM/BAM file. One open question, for example, is how to integrate a padded (gapped) reference sequence (as already implemented in the ACE format). Peter Cock’s blog provides some nice figures depicting the problem of visualizing the read-to-consensus alignment using a padded reference sequence.
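The padded-reference problem boils down to keeping two coordinate systems in sync. The following is only a rough sketch of that idea, assuming pads are written as `*` (as in ACE consensus sequences); the function names and the toy sequence are hypothetical and not taken from the SAM specification or from the mailing list discussion.

```python
PAD = "*"  # assumption: pads are marked with '*', as in ACE consensus sequences


def strip_pads(padded_ref: str) -> str:
    """Return the unpadded (gap-free) reference sequence."""
    return padded_ref.replace(PAD, "")


def padded_to_unpadded(padded_ref: str) -> list:
    """For every padded index, give the 0-based unpadded index (None at a pad)."""
    mapping = []
    unpadded_pos = 0
    for base in padded_ref:
        if base == PAD:
            mapping.append(None)
        else:
            mapping.append(unpadded_pos)
            unpadded_pos += 1
    return mapping


def unpadded_to_padded(padded_ref: str, unpadded_pos: int) -> int:
    """Find the 0-based padded index of a 0-based unpadded position."""
    seen = -1
    for index, base in enumerate(padded_ref):
        if base != PAD:
            seen += 1
            if seen == unpadded_pos:
                return index
    raise IndexError("unpadded position lies beyond the reference")


# Toy consensus with a single pad where some read carries an insertion.
padded = "AC*GT"
print(strip_pads(padded))
print(padded_to_unpadded(padded))
print(unpadded_to_padded(padded, 2))
```

Every base after a pad shifts by one position between the two systems, which is why a viewer has to know whether the stored coordinates are padded or unpadded.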

Actually, a few days ago (Sept 27, 2011), the SAM/BAM specification was already updated in the repository and, according to Peter Cock, “this is on track to be part of the SAM Format Specification v1.5”.

Wow, I am really impressed (again) by how fast things change and develop in NGS research. I am looking forward to my first SAM/BAM assembly!