Microbial communities are increasingly analyzed by directly sequencing DNA from environmental samples. The aims are to study the microbial composition under different conditions and to identify novel organisms. The bottleneck is the assembly of the metagenomic reads.
Why are these assemblies so challenging? One important reason is the highly heterogeneous character of microbial environmental samples. Furthermore, the abundance of the member species differs remarkably. While some species are highly abundant, others – often the unknown and therefore most interesting ones – are present at very low levels. To obtain contigs from low-abundance species, very deep sequencing is needed, and even then these species can often be assembled only in a highly fragmented manner. Another challenge for metagenomic assembly is populations of closely related species: because their genomes are highly similar, the assembly software generates hybrid contigs from them.
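To make the depth problem concrete, here is a back-of-the-envelope sketch. It assumes the simple relation that a species' expected coverage is the total number of sequenced bases, times the fraction of reads originating from that species, divided by its genome size. The function names and all numbers (10 Gbp of sequence, a 4 Mbp genome, read fractions of 50% and 0.1%) are illustrative assumptions, not values from any particular study.

```python
def expected_coverage(total_bases, read_fraction, genome_size):
    """Average coverage of one species, given the share of reads it contributes."""
    return total_bases * read_fraction / genome_size


def bases_needed(target_coverage, read_fraction, genome_size):
    """Total sequencing output required to reach a target coverage for that species."""
    return target_coverage * genome_size / read_fraction


if __name__ == "__main__":
    total = 10e9   # hypothetical run: 10 Gbp of metagenomic sequence
    genome = 4e6   # hypothetical 4 Mbp bacterial genome

    # A dominant species contributing half of all reads is covered very deeply ...
    print(expected_coverage(total, 0.5, genome))
    # ... while a rare species contributing 0.1% of reads gets only a few-fold coverage.
    print(expected_coverage(total, 0.001, genome))
    # Total output needed to bring the rare species up to 20-fold coverage.
    print(bases_needed(20, 0.001, genome))
```

Under these assumptions the rare species ends up at only about 2.5-fold coverage, and reaching 20-fold for it would take on the order of 80 Gbp in total, which illustrates why low-abundance genomes tend to stay fragmented.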
Despite, or probably because of, these challenges, I see a lot of effort in the field going into improving the underlying assembly tools. Currently, procedures such as clustering large contigs based on tetranucleotide frequency and coverage are applied. The clustered contigs are then ordered by mapping them to a related genome. With this approach, the first bacterial draft genomes from, e.g., cow rumen and soil metagenomes have been published in Science and Nature.
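The clustering idea can be sketched in a few lines. This is a minimal illustration, not the procedure used in the published studies: it assumes each contig is described by its 256 normalized tetranucleotide frequencies plus a weighted log10 coverage term, and it groups contigs by simple single-linkage clustering with a fixed distance threshold. The weight `cov_weight`, the `threshold`, the contig names, and the toy sequences are all hypothetical; real contigs would be several kilobases long and real tools use more robust statistics.

```python
from itertools import product
from math import log10, sqrt

# All 256 possible tetranucleotides, mapped to a position in the feature vector.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetra_freq(seq):
    """Return the 256 normalized tetranucleotide frequencies of a sequence."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def features(seq, coverage, cov_weight=0.5):
    """Composition vector extended by a weighted log10 coverage term (assumed weighting)."""
    return tetra_freq(seq) + [cov_weight * log10(coverage)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(contigs, threshold=0.2):
    """Single-linkage clustering: contigs closer than `threshold` end up together.

    `contigs` maps a contig name to a (sequence, coverage) tuple.
    """
    names = list(contigs)
    feats = {name: features(*contigs[name]) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(feats[a], feats[b]) < threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


if __name__ == "__main__":
    toy_contigs = {
        "contig_a": ("ACGT" * 50, 20.0),   # same composition and similar coverage ...
        "contig_b": ("ACGT" * 40, 22.0),   # ... as contig_a
        "contig_c": ("GGCC" * 50, 21.0),   # different composition
        "contig_d": ("ACGT" * 50, 200.0),  # same composition, very different coverage
    }
    for group in cluster(toy_contigs):
        print(group)
```

In this toy input, only `contig_a` and `contig_b` agree in both composition and coverage, so they are the only pair expected to share a cluster. Using coverage alongside composition is what allows two populations with near-identical tetranucleotide profiles but different abundance to be kept apart.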