Pan-Genome Analysis with Roary

Contributed by Shawn Higdon

Microbial genomics often begins with the isolation of microbes from environmental samples, usually with the intent of generating a pure culture composed of a single strain. While many isolation events yield such cultures, large-scale isolation efforts often produce collections of banked isolates that are rife with cultures sharing high levels of genomic identity but differing by subtle variations (i.e., different genotypes). Although subtle, these genotypic differences in many cases contribute directly or indirectly to an observed phenotype. Identifying the genetic differences present in select isolates is therefore of great interest.

Once the isolates in question have been subjected to whole-genome sequencing, the DNA sequence reads are assembled into contiguous sequences (contigs) that contain the information associated with individual genes (popular assembly programs include MEGAHIT and SPAdes). These contigs are then scanned for genomic features, including RNA and protein-coding genes. A great way to identify the features of a microbial genome assembly is the program Prokka, which provides genome annotation in multiple output formats, making it versatile as input to downstream analyses.
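As a rough illustration of the annotation step, here is a minimal Python sketch that builds one Prokka command per assembly. It only constructs the command and does not run it. The file names, the `annotations` output folder, and the helper name `prokka_command` are hypothetical placeholders; the sketch assumes Prokka is installed and relies only on its `--outdir`, `--prefix`, and `--cpus` options.

```python
from pathlib import Path


def prokka_command(assembly, out_root="annotations", cpus=4):
    """Build (but do not run) a Prokka command for one assembly FASTA.

    `out_root` and the naming scheme are illustrative assumptions,
    not anything prescribed by Prokka itself.
    """
    sample = Path(assembly).stem  # "isolate_01" from "isolate_01.fasta"
    return [
        "prokka",
        "--outdir", f"{out_root}/{sample}",  # one output folder per isolate
        "--prefix", sample,                  # output files are named <sample>.*
        "--cpus", str(cpus),
        str(assembly),
    ]


if __name__ == "__main__":
    # Hypothetical assembly paths, used only to show the resulting commands.
    for fasta in ["assemblies/isolate_01.fasta", "assemblies/isolate_02.fasta"]:
        print(" ".join(prokka_command(fasta)))
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`. Using the isolate name as the prefix means every isolate ends up with its own `.gff` file in its own folder, which keeps the annotations easy to collect for later steps.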

Getting back to the genetic features that distinguish strains among prokaryotic isolates with high levels of genomic similarity: Roary is a program that provides a solid strategy for pan-genome analysis of bacterial genomes, identifying the accessory genes within a collection of isolates. In my opinion, something that makes Roary great is that its developers accommodate researchers who already use Prokka for genome annotation by letting them feed Prokka's GFF3 output files directly into the Roary pipeline. Given GFF3 files from the isolates in question, Roary constructs a pan-genome and a core genome for the population under investigation and reports which genes are present in some, but not all, isolate genomes. This strategy is useful for identifying genes of interest, such as virulence genes that may lead to outbreaks of infectious disease. In fact, the Applied Bioinformatics course at UC Davis (BIT150) taught a lab session on this subject in the Fall of 2017, and will likely continue to do so this year. Check out the course website for details!
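To make the core versus accessory distinction concrete, here is a small Python sketch that splits genes into the two groups from a tab-delimited presence/absence table. It assumes a layout like Roary's `gene_presence_absence.Rtab` output (a `Gene` column followed by one 0/1 column per isolate). The gene and isolate names are made up for illustration, and "core" here means present in every isolate, which is stricter than Roary's default 99% threshold.

```python
import csv
import io


def split_pan_genome(rtab_text):
    """Split genes into core (in all isolates) and accessory (in some).

    Expects tab-delimited text: a header row of 'Gene' plus isolate
    names, then one row per gene with 1 (present) or 0 (absent).
    """
    reader = csv.reader(io.StringIO(rtab_text), delimiter="\t")
    isolates = next(reader)[1:]  # drop the leading 'Gene' label
    core, accessory = [], {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        gene, calls = row[0], row[1:]
        present = [iso for iso, call in zip(isolates, calls) if call == "1"]
        if len(present) == len(isolates):
            core.append(gene)
        elif present:
            accessory[gene] = present  # remember which isolates carry it
    return core, accessory


# Made-up table for illustration only.
EXAMPLE = (
    "Gene\tisolate_A\tisolate_B\tisolate_C\n"
    "gyrB\t1\t1\t1\n"
    "toxA\t1\t0\t0\n"
    "nifH\t0\t1\t1\n"
)

if __name__ == "__main__":
    core, accessory = split_pan_genome(EXAMPLE)
    print("core:", ", ".join(core))
    for gene, carriers in accessory.items():
        print(f"{gene} present only in: {', '.join(carriers)}")
```

For a real run, the contents of the table could be read from the Roary output folder with `open(path).read()` and passed to the same function. Keeping the list of carrier isolates for each accessory gene makes it easy to ask which isolates share a gene of interest.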


