Contributed by Shawn Higdon
You are working tirelessly in the lab to generate pure cultures of microbes, meaning that each well-isolated colony consists of a single organism. You finally have what appears to be a pure isolate and wish to sequence the microorganism’s genome. After running the isolate through the sequencing pipeline and assembling the genome, you find that what was surely a pure isolate may, in fact, be multiple microbes living closely together…or it could be contamination.
As a graduate student working with whole genome sequencing data from many banked microbial isolates, this is a situation that I now face on the road to my Ph.D. A common finding is that the draft genome assembly for a given isolate is greater than 10 Mbp, larger than most single bacterial genomes, which I interpret as a strong indication that the isolate is either a co-culture or a pure isolate that became contaminated. My challenge is determining which is actually the case.
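The size check described above is easy to script. What follows is a minimal sketch, not my actual pipeline: it assumes the draft assembly is available as FASTA-formatted text, and the names `read_fasta_lengths`, `looks_oversized`, and `SIZE_THRESHOLD_BP` are made up for illustration. The 10 Mbp cutoff is only the rule of thumb mentioned above, not an established standard.

```python
from typing import Dict, Iterable

# Rule-of-thumb cutoff from the text (10 Mbp); an assumption, not a standard.
SIZE_THRESHOLD_BP = 10_000_000


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {contig_name: sequence_length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            parts = line[1:].split()
            name = parts[0] if parts else f"unnamed_{len(lengths)}"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def looks_oversized(lengths: Dict[str, int],
                    threshold_bp: int = SIZE_THRESHOLD_BP) -> bool:
    """True if the summed contig length exceeds the threshold."""
    return sum(lengths.values()) > threshold_bp


# Tiny in-memory example; for a real assembly, pass an open file handle instead.
example = [">contig_1 demo", "ACGTACGT", "ACGT", ">contig_2", "GGCC"]
lengths = read_fasta_lengths(example)
print(sum(lengths.values()), looks_oversized(lengths))
```

For a real assembly, something like `with open(path) as fh: lengths = read_fasta_lengths(fh)` would do, where `path` points to your own contigs file. A flag from this check only says that the assembly is unusually large; it cannot tell a co-culture from contamination.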
My initial approach, which a colleague recently described as “round-about,” has been to assemble paired-end 150 bp (PE150) Illumina reads for each isolate, map the reads back to the assembly, and then group the contigs into bins that should, in theory, each represent the whole genome of a distinct organism. These contig bins can then be classified independently of one another using software such as Sourmash. An alternative approach was recently proposed to me: use the CheckM software suite from Donovan Parks et al. This is something I plan to look into moving forward.
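To give a feel for why mapping reads back to the assembly helps, here is a toy sketch of the idea behind coverage-based binning. It is not the method used by any real binning tool, which typically combine coverage with sequence composition. It assumes you already have a mean read depth per contig from the mapping step; the function name `split_by_coverage`, the `min_fold_gap` parameter, and the depth values are all hypothetical.

```python
from typing import Dict, List


def split_by_coverage(coverage: Dict[str, float],
                      min_fold_gap: float = 2.0) -> List[List[str]]:
    """Split contigs into one or two groups at the largest fold-change
    gap in mean coverage. Toy illustration only."""
    ordered = sorted(coverage, key=lambda name: coverage[name])
    if len(ordered) < 2:
        return [ordered]

    best_idx, best_fold = 0, 1.0
    for i in range(len(ordered) - 1):
        low = coverage[ordered[i]]
        high = coverage[ordered[i + 1]]
        if low > 0:
            fold = high / low
        else:
            # Zero-depth neighbour: infinite gap only if the next contig has depth.
            fold = float("inf") if high > 0 else 1.0
        if fold > best_fold:
            best_idx, best_fold = i, fold

    # No gap large enough: treat everything as a single bin.
    if best_fold < min_fold_gap:
        return [ordered]
    return [ordered[:best_idx + 1], ordered[best_idx + 1:]]


# Hypothetical mean depths for four contigs from one isolate.
depths = {"c1": 48.0, "c2": 52.0, "c3": 11.0, "c4": 9.5}
bins = split_by_coverage(depths)
print(bins)
```

The intuition is that two organisms growing at different abundances in the same culture tend to yield contigs with different read depths, so a clear gap in depth hints at more than one genome. Each resulting group of contigs could then be classified on its own.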