Contributed by Shawn Higdon
My graduate research focuses on nitrogen-fixing bacteria that inhabit a terrestrial environment rich in complex carbohydrates: the aerial root exudate (mucilage) of an isolated maize landrace. As part of this work, I have set out to characterize the genomes of cultured bacteria derived from the mucilage. This includes identifying and annotating bacterial gene products that correspond to plant-associated functions, specifically those related to atmospheric nitrogen fixation, solubilization of inorganic phosphate, and carbohydrate utilization.
One tool that I have relied on heavily to identify these genes is the profile hidden Markov model (pHMM). While many database resources are available for pHMM-based sequence analysis, the repositories I used initially were TIGRFAMs from the J. Craig Venter Institute and the comprehensive Pfam database, which catalogs pHMMs associated with protein domains. Although these resources cover a broad range of biological functions in prokaryotic systems, they are not tailored to the annotation of proteins involved in complex carbohydrate metabolism. When I discovered this, I began searching online for a resource suited to identifying carbohydrate-active enzymes (CAZymes) and stumbled across dbCAN.
dbCAN is a collection of pHMMs that were generated in coordination with the Carbohydrate Active Enzymes Database (CAZy). The first iteration of dbCAN was hosted online by Le Huang and the Yanbin Yin lab at the University of Georgia. The pHMM library was made available for download for use with the hmmscan program of HMMER. Since the initial release, dbCAN has reached its second version and now exists as both a standalone tool available for download on GitHub (https://github.com/linnabrown/run_dbcan) and a meta-server that is accessible through an internet browser. The second version, dbCAN2, integrates CAZyme annotation across multiple bioinformatics tools, including HMMER, DIAMOND, Hotpep, SignalP, Prodigal and FragGeneScan. This combines pHMM scanning with sequence alignment-based annotation calls, which makes dbCAN a more robust annotation tool for CAZymes. Additionally, the researchers added libraries for transcription factors (Pfam) and membrane transporters (TCDB) associated with carbohydrate metabolism. Bringing these tools together in one CAZyme annotation pipeline creates a bioinformatic resource that is well suited to exploring the metabolic capabilities of microbial genomes.
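To make the hmmscan step more concrete, here is a minimal Python sketch of how per-domain hits could be filtered after running hmmscan with its `--domtblout` option against a dbCAN pHMM library. It is not the parser that ships with dbCAN. The function name `parse_domtblout`, the `DomainHit` record, and the default cutoffs (independent E-value below 1e-15, HMM coverage above 0.35) are assumptions chosen for illustration, so check them against the dbCAN documentation before relying on them.

```python
from typing import Iterable, List, NamedTuple


class DomainHit(NamedTuple):
    """One per-domain hit from an hmmscan --domtblout table (illustrative record)."""
    family: str          # pHMM name, e.g. "GH13"
    protein: str         # query protein identifier
    i_evalue: float      # independent E-value of this domain
    hmm_coverage: float  # fraction of the pHMM covered by the alignment
    ali_from: int        # alignment start on the protein
    ali_to: int          # alignment end on the protein


def parse_domtblout(lines: Iterable[str],
                    max_evalue: float = 1e-15,   # assumed cutoff
                    min_coverage: float = 0.35   # assumed cutoff
                    ) -> List[DomainHit]:
    """Keep domain hits that pass the E-value and HMM coverage cutoffs."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete domtblout row
        family = fields[0]
        if family.endswith(".hmm"):
            family = family[:-4]
        hmm_len = int(fields[2])      # tlen: length of the target pHMM
        i_evalue = float(fields[12])  # i-Evalue column
        hmm_from, hmm_to = int(fields[15]), int(fields[16])
        coverage = (hmm_to - hmm_from + 1) / hmm_len
        if i_evalue < max_evalue and coverage > min_coverage:
            hits.append(DomainHit(family, fields[3], i_evalue, coverage,
                                  int(fields[17]), int(fields[18])))
    return hits
```

A table produced with something like `hmmscan --domtblout hits.domtblout <dbCAN HMM file> proteins.faa` (file names here are placeholders) could then be passed in as `parse_domtblout(open("hits.domtblout"))`. Filtering on both E-value and HMM coverage is a design choice that discards short, partial matches that might otherwise be reported as a CAZyme family.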
As a final note, the research group has also incorporated a Python script called CGCFinder, which uses protein sequences in FASTA format and scaffold position information (GFF files) to identify CAZyme Gene Clusters (CGCs) within a microbial genome. As an added utility, the meta-server includes a Python script for visualizing the CGCs. In my opinion, based on these developments, dbCAN has become essential for life science research on microbial metabolism when time efficiency and ease of use are priorities.
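The idea behind combining annotations with gene positions can be sketched in a few lines of Python. This is a simplified illustration, not the CGCFinder algorithm itself. The `Gene` record, the `find_clusters` function, the `max_gap` parameter, and the rule that a cluster needs at least one CAZyme plus one transporter (TC) or transcription factor (TF) are all assumptions made for the example.

```python
from itertools import groupby
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    """A gene with its position (e.g. from a GFF file) and an optional annotation."""
    contig: str
    start: int
    gene_id: str
    label: Optional[str]  # "CAZyme", "TC", "TF", or None if unannotated


def _keep_if_valid(candidate: List[Gene], clusters: List[List[str]]) -> None:
    """Store a candidate only if it has a CAZyme plus a TC or TF (assumed rule)."""
    labels = {g.label for g in candidate}
    if "CAZyme" in labels and labels & {"TC", "TF"}:
        clusters.append([g.gene_id for g in candidate])


def find_clusters(genes: List[Gene], max_gap: int = 2) -> List[List[str]]:
    """Group annotated genes separated by at most `max_gap` unannotated genes.

    Returns the IDs of the annotated genes in each cluster; the unannotated
    genes lying between them are not listed.
    """
    clusters: List[List[str]] = []
    ordered = sorted(genes, key=lambda g: (g.contig, g.start))
    for _, contig_genes in groupby(ordered, key=lambda g: g.contig):
        current: List[Gene] = []  # annotated genes in the open candidate
        gap = 0                   # unannotated genes since the last annotated one
        for gene in contig_genes:
            if gene.label is None:
                gap += 1
                continue
            if current and gap > max_gap:
                _keep_if_valid(current, clusters)
                current = []
            current.append(gene)
            gap = 0
        _keep_if_valid(current, clusters)  # close the candidate at contig end
    return clusters
```

Given a list of `Gene` records built from an annotation table and a GFF file, `find_clusters(genes)` returns one list of gene IDs per candidate cluster. Counting intervening genes rather than base pairs is a design choice for the sketch; a distance threshold in base pairs would work the same way with `start` coordinates.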