Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

WeimerMicroLab’s latest manuscript by DJ Darwin R. Bandoy & Bart C. Weimer is out! Click here to read “Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease.”

 

Posted in uncategorized | Comments Off on Biological machine learning combined with bacterial population genomics reveals common and rare allelic variants of genes to cause disease

Machine Learning How to Culture UnCulturables?

Contributed by Shawn Higdon

How much of the microbial biodiversity on planet Earth has mankind managed to grow in the laboratory? I think the number thrown around as a rough approximation in early biology classes is 5 percent. Five percent, right? Or is it one percent? One percent sounds more realistic…or is it one tenth of one percent? I’m not sure if we will ever truly be able to answer this question with absolute certainty, but this number is likely to be up for debate depending on the parties involved.

A fascinating approach to the problem of devising microbial culturing strategies was brought to my attention when I met with Titus Brown several months ago. As it turns out, Adrian Viehweger and his colleagues at the Friedrich Schiller University of Jena in Germany thought that it might be a good idea to think of protein sequences in microbial genomes in a way that is similar to how we view words in a text document. Essentially, they’ve developed an incredible way to adapt the Word2Vec algorithm for deep learning applications in genome biology! What do they call this tool? Nanotext, of course! The group’s preprint is currently available on bioRxiv and is a great read. I find it particularly amazing that they are able to show how Nanotext can be used to predict a culture medium for metagenome-assembled genomes.

While I view this as a powerful tool for developing culture media tailored to growing and storing previously uncultured microbes that we can see in microbiome samples through DNA sequencing, I see the Nanotext approach as having additional utility for describing the functions of proteins that are assigned the oh-so-popular annotation of “hypothetical protein”. Is it possible for two protein sequences to have the same domains present but display them in different configurations, or “architectures”? I see Nanotext being used to overcome hurdles in functional annotation associated with sequence-alignment-based approaches in the near future!
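To make the words-in-a-document analogy a little more concrete, here is a minimal Python sketch of how a genome could be turned into Word2Vec-style training pairs. It assumes each protein has already been reduced to a domain identifier; the identifiers and the `skipgram_pairs` helper are illustrative and are not taken from the Nanotext source code. It shows only the skip-gram preprocessing step, not the training of a model.

```python
def skipgram_pairs(domains, window=2):
    """Return (center, context) pairs for every domain within `window` positions."""
    pairs = []
    for i, center in enumerate(domains):
        lo = max(0, i - window)
        hi = min(len(domains), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, domains[j]))
    return pairs


# Hypothetical genome written as an ordered list of domain "words".
genome = ["PF00005", "PF00528", "PF01547", "PF00072"]
pairs = skipgram_pairs(genome, window=1)

# Two hypothetical proteins with the same domains in a different order.
protein_a = ("PF00072", "PF00486")
protein_b = ("PF00486", "PF00072")
same_content = set(protein_a) == set(protein_b)   # identical domain content
same_architecture = protein_a == protein_b        # order ("architecture") differs
```

The last three lines illustrate the question raised above: a comparison that ignores order treats the two proteins as identical, while an order-aware comparison does not.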

Here are links to the Nanotext GitHub documentation/source code and the bioRxiv preprint…

https://github.com/phiweger/nanotext

https://www.biorxiv.org/content/10.1101/524280v2

…Enjoy!


Preprint is Out!

Contributed by Darwin Bandoy, DVM

We are proud to announce the preprint publication of our paper wherein we identified a misclassified species of Hungatella from the Human Microbiome Project. This project demonstrates the power of pangenome analysis to resolve species identity by clustering.
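For readers curious what "resolving species identity by clustering" can look like in practice, here is a rough Python sketch. It is not the analysis from our preprint: the gene sets, isolate names, and the 0.5 threshold are all made up, and it simply groups genomes by Jaccard distance on gene presence/absence with single-linkage merging.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the fraction of shared genes among all genes seen in either genome."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_genomes(gene_sets, threshold=0.5):
    """Single-linkage grouping: genomes closer than `threshold` share a cluster."""
    parent = {name: name for name in gene_sets}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g1, g2 in combinations(gene_sets, 2):
        if jaccard_distance(gene_sets[g1], gene_sets[g2]) < threshold:
            parent[find(g1)] = find(g2)

    clusters = {}
    for name in gene_sets:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical gene content per genome (gene family identifiers).
genomes = {
    "isolate_A": {"g1", "g2", "g3", "g4"},
    "isolate_B": {"g1", "g2", "g3", "g5"},
    "isolate_C": {"g1", "g7", "g8", "g9"},
}
clusters = cluster_genomes(genomes)
```

A genome that carries one species label but lands in a different cluster from the other members of that species is a candidate for misclassification.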

The impact of the misclassification is amplified because numerous microbiome papers use taxonomic classifiers that depend on reference species. We are baking more interesting papers and will release them in the coming weeks!

The preprint is here:

https://www.medrxiv.org/content/10.1101/19000489v1


Qualifying Exam Preparation for PhD students in Life Sciences and Biotechnology

Contributed by Darwin Bandoy, DVM

I am still on cloud nine after passing my qualifying exam, and I am pretty sure most PhD students are curious about how I prepared. So, I will describe what I did.

  1. Read, read, read. I read all available materials for my topic (genomics of Campylobacter). This means getting the lay of the land, which translates to surveying the most current literature. For me, this started with the most recent review article on Campylobacter in Nature Reviews Microbiology. This step saved me a lot of time, plus the graphics enabled me to grasp the major concepts immediately.
  2. Talk, talk, talk. I met with my adviser (Dr. Weimer) every day, for at least one hour each time. In those meetings, we resolved almost all the difficult questions. I never hesitated to ask him when I didn’t know the answer. Those meetings were very productive, as most of the questions we covered were later asked by the qualifying exam committee.
  3. Know yourself, know your enemy. This is a classic Art of War mantra, and it is very useful when preparing for the qualifying exam. While your qualifying exam committee is not your enemy, it pays to know them. I met with them several times and asked what types of questions to expect and what minimum knowledge was expected of me. They also gave me constructive advice, which I integrated into my proposal.
  4. Practice for every situation. I learned this from a presentation workshop at the Royal Academy of Dramatic Art in London. When I presented my final pitch for Leaders in Innovation, something went wrong with the PowerPoint presentation, but since I had practiced for all situations, I went ahead calmly. People say my calmness was pivotal to my winning first place. We do not practice for perfection; we practice for every situation. I learned in the workshop that it pays to simulate various audience reactions, anticipate that variability, and prepare for it. I also used my refrigerator to diagram my talk.

Comprehensive Carbohydrate Active Enzyme Analysis with meta-dbCAN

Contributed by Shawn Higdon

The focus of my graduate research is the investigation of nitrogen-fixing bacteria inhabiting a terrestrial environment that is rich in complex carbohydrates: the aerial root exudate (mucilage) of an isolated maize landrace. Within this arc, I have embarked on a quest to characterize the genomes of cultured bacteria derived from the mucilage. This includes the identification and annotation of bacterial gene products that correspond to plant-associated functionalities. Specifically, I am interested in gene products with functions that relate to atmospheric nitrogen fixation, solubilization of inorganic phosphate, and carbohydrate utilization.

One tool that I have relied heavily upon for the identification of these genes is the profile hidden Markov model (pHMM). While there are many database resources available for sequence analysis using pHMMs, the primary repositories that I used initially were TIGRFAMs from the J. Craig Venter Institute and the comprehensive Pfam database, which catalogs pHMMs associated with protein domains. While these resources cover the overall breadth of biological functions in prokaryotic systems quite expansively, they are not particularly tailored to the annotation of proteins involved in complex carbohydrate metabolism. When I discovered this, I began searching online for a resource suited to identifying carbohydrate active enzymes (CAZymes) and stumbled across dbCAN.

dbCAN is a collection of pHMMs that were generated in coordination with the Carbohydrate Active Enzymes Database (CAZy). The first iteration of dbCAN was hosted online by Le Huang and the Yanbin Yin lab at the University of Georgia. The pHMM library was made available for download for use with the hmmscan program of HMMER. Since the initial release, dbCAN has reached its second version and now exists as both a standalone tool available for download on GitHub (https://github.com/linnabrown/run_dbcan) and a meta-server that is accessible through an internet browser (https://github.com/linnabrown/run_dbcan). The second version of dbCAN, dbCAN2, integrates CAZyme annotation using multiple bioinformatics tools that include HMMER, Diamond, Hotpep, SignalP, Prodigal and FragGeneScan. This allows for both pHMM scanning and sequence-alignment-based annotation calls, which increases the robustness of dbCAN as an annotation tool for CAZymes. Additionally, the researchers added libraries for transcription factors (Pfam) and membrane transporters (TCDB) associated with carbohydrate metabolism. Combining these tools in one CAZyme annotation pipeline truly creates a bioinformatic resource that is ideal for delving into the metabolic capabilities of microbial genomes.
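If you run hmmscan yourself against a pHMM library, the hits still need to be filtered before they are useful. Below is a small Python sketch that reads the whitespace-delimited table written by hmmscan's `--tblout` option and keeps hits below an E-value cutoff. The example rows, the protein names, and the 1e-5 cutoff are made up for illustration, and this is not how the dbCAN pipeline itself parses results.

```python
def parse_tblout(text, max_evalue=1e-5):
    """Keep hits from hmmscan --tblout text whose full-sequence E-value passes the cutoff."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Columns: target name, accession, query name, accession, E-value, score, bias, ...
        target, query = fields[0], fields[2]
        evalue, score = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:
            hits.append({"family": target, "protein": query,
                         "evalue": evalue, "score": score})
    return hits


# Made-up example output with one strong hit and one weak hit.
example = """\
# target name  accession  query name  accession  E-value  score  bias  ...
GH13.hmm   -  protein_001  -  1.2e-50  170.3  0.1  2.0e-50  169.6  0.1  1.0  1  0  0  1  1  1  1  -
CBM48.hmm  -  protein_001  -  3.0e-03   12.1  0.0  4.1e-03   11.7  0.0  1.1  1  0  0  1  1  0  0  -
"""
hits = parse_tblout(example)
```

The cutoff is a parameter on purpose; how strict to be depends on the library and on how many false positives you can tolerate.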

As a final note, the research group has also incorporated a Python script called CGC finder, which uses protein sequences in FASTA format and scaffold position information (GFF files) to identify CAZyme Gene Clusters (CGCs) within the microbial genome. As an added utility, the meta-server includes a Python script that allows for visualization of the CGCs. In my opinion, based on these developments, using dbCAN for life science research related to microbial metabolism has become essential if time efficiency and ease of use are priorities.
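To give a feel for what finding a gene cluster from positional information involves, here is a simplified Python sketch. It is not the CGC finder algorithm: I assume genes are already listed in scaffold order with a label of `CAZyme`, `TC` (transporter), `TF` (transcription factor) or `None`, and I assume a cluster is a run of labeled genes separated by at most `max_gap` unlabeled genes that contains at least one CAZyme plus one TC or TF. The rule, the labels, and the gene names are all assumptions for illustration.

```python
def find_clusters(genes, max_gap=2):
    """genes: list of (gene_id, label) in scaffold order; returns lists of labeled gene ids."""
    clusters, current, gap = [], [], 0

    def close(run):
        # Keep the run only if it holds a CAZyme plus a transporter or regulator.
        labels = {label for _, label in run}
        if "CAZyme" in labels and labels & {"TC", "TF"}:
            clusters.append([gene_id for gene_id, _ in run])

    for gene_id, label in genes:
        if label is not None:
            current.append((gene_id, label))
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:
                close(current)
                current, gap = [], 0
    if current:
        close(current)
    return clusters


# Hypothetical scaffold: g2 and g4 sit close together, g8 is isolated.
scaffold = [
    ("g1", None), ("g2", "TC"), ("g3", None), ("g4", "CAZyme"),
    ("g5", None), ("g6", None), ("g7", None), ("g8", "CAZyme"),
]
clusters = find_clusters(scaffold)
```

In a real analysis the gene order and coordinates would come from the GFF file and the labels from the annotation step.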


Post Conference Reflections: Bay Area Microbial Pathogenesis Symposium

Contributed by Darwin Bandoy, DVM

I presented a poster at BAMPS (Bay Area Microbial Pathogenesis Symposium) on my work using machine learning for bacterial genome-wide association studies. First of all, it was radically different from most posters, which focus on defining a mechanism for a single gene, while my work is about extracting insights at the population scale. While the two methods are complementary, few people are integrating population-scale genomics to derive mechanistic insights. What was interesting is that few people cared about abortion and Campylobacter; I was visited because of the machine learning. This suggests a strategy for presenting one’s work: use methods that can be broadly applied and illustrate them with a specific example. Thinking broadly and working specifically is a good way to be relevant to more people.


Analyzing RNAseq Data with DESeq2

Contributed by Shawn Higdon

RNA sequencing experiments are extremely powerful in terms of the amount of information that is capable of being revealed for a given experimental system. While performing these experiments demands passing through several hurdles in the lab that require a high degree of technical skill, careful planning, and seemingly flawless execution, analyzing the data flying off the DNA sequencer is equally complex. The course of my internship in the Weimer Lab revolves heavily around interpreting large files of RNAseq data, and one tool that has been making this analysis just a little bit sweeter is DESeq2 (http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).

DESeq2 is an open source software package written in R and distributed through Bioconductor (https://bioconductor.org/packages/release/bioc/html/DESeq2.html). As perhaps the most active primary developer, Michael Love has put extensive care into generating high-quality documentation for the use of DESeq2 in RNAseq data analysis. One of my favorite features is the companion package tximport, which provides versatility for scientists who have taken different bioinformatic approaches to transcript quantification. Tximport allows R users to read in their transcript count information in a relatively straightforward manner that provides streamlined entry to DESeq2 for differential expression analysis. An excellent, in-depth vignette on how to use tximport is available (https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html), and it is also covered briefly in the DESeq2 vignette.
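The core idea behind moving from transcript-level to gene-level counts is simple enough to sketch in a few lines. The Python below is only an illustration of that idea: tximport itself is an R package and also handles things like transcript length, which this sketch ignores. The transcript names, counts, and the `summarize_to_gene` helper are hypothetical.

```python
from collections import defaultdict


def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for transcript, count in tx_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue  # transcript is missing from the mapping, so skip it
        gene_counts[gene] += count
    return dict(gene_counts)


# Hypothetical estimated counts for one sample and a transcript-to-gene map.
tx_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0, "tx4": 1.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_counts = summarize_to_gene(tx_counts, tx2gene)
```

Note that `tx4` is silently dropped here because it has no gene assignment; in a real workflow that is worth checking rather than ignoring.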

My last post to this blog described ultrafast transcript quantification using Salmon, and DESeq2 is the next logical step in moving forward with RNAseq data analysis in a timely fashion. Tximport provides an efficient bridge for getting Salmon output into R for DESeq2 analysis. In addition, determining differentially expressed genes with DESeq2 has been simplified to the point where only a few commands need to be run in R. I must admit, when I knew nothing about how to use DESeq2, I had this idea in my mind that it was going to be extremely complicated and require extensive R coding. After having gone through the vignette, my anxiety was laid to rest…at least on this front. Another great thing about DESeq2 is that the developers have provided integration with ggplot2, which makes plotting visualizations of the data a breeze! I plan to learn more and more about the many facets of RNAseq data analysis, and something tells me that using DESeq2 along the way is going to be extremely helpful…
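One thing those few commands take care of behind the scenes is normalizing for sequencing depth. DESeq2 does this with a median-of-ratios approach, and the Python sketch below shows the arithmetic on a toy count table. It is an illustration of the idea only, not DESeq2's own code, and the gene names and counts are invented.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of gene -> list of raw counts (one per sample)."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue  # genes with a zero count have no usable geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor is the median ratio of a sample to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy table: the second sample was sequenced twice as deeply as the first.
toy_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],
}
factors = size_factors(toy_counts)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in toy_counts.items()}
```

Dividing each sample's counts by its size factor puts the samples on a comparable scale before any testing for differential expression.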

 

 


Metagenome wide association studies: data mining the microbiome

Contributed by Darwin Bandoy, DVM

Increased sequencing capability offers an unprecedented look into the microbiome and its association with complex diseases. Associations are mined to determine how the microbiome changes with metabolic conditions like diabetes and obesity, as well as with chronic diseases such as colorectal cancer and rheumatoid arthritis. While early studies determined microbial community membership, more current approaches determine functional upregulation or depletion. The current challenge is mostly bioinformatic, as the data dimensionality is high and the technical interpretation is difficult. This is where machine learning is working beautifully and elegantly, and I am currently working on making this more intuitive to use.


Pangenome guided pharmacophore modelling of enterohemorrhagic Escherichia coli sdiA

Contributed by Darwin Bandoy, DVM

I am proud to present my published work in the journal F1000Research. This journal stands out for its open peer review platform: the article is published first, after a basic editorial screening, and peer review follows thereafter. I used a novel application of the pangenome to guide pharmacophore modelling of enterohemorrhagic Escherichia coli sdiA. This is a preview of innovative uses of population genomics to derive functional insights.

The article is linked here:

Bandoy DD. Pangenome guided pharmacophore modeling of enterohemorrhagic Escherichia coli sdiA [version 1; referees: awaiting peer review]. F1000Research 2019, 8:33 (https://doi.org/10.12688/f1000research.17620.1)

 


Fast Transcript Quantification with Salmon

Contributed by Shawn Higdon

Interpretation of sequencing data from experiments targeting gene expression in any biological system requires a series of computational and analytical steps. After going through the motions of pre-processing sequencing reads into a state of high quality (trimming low-quality base calls and removing adapter sequences that slip through the cracks), the next step is to generate count information for each transcript that was present within each sample’s cDNA sequencing library. In earlier times, the standard approach for this step was to use sequence alignment-based bioinformatics software to “map” each sequencing read to a reference genome for the organism under investigation, followed by quantification of transcript counts based on the mapping results. While this approach is indeed reliable, caveats include heavy computational requirements, lengthy runtime, and the generation of mapping files that tend to occupy large volumes of disk space (storage).
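As a small illustration of the trimming step mentioned above, the Python sketch below removes low-quality base calls from the 3' end of a single read. It assumes Phred+33 quality encoding and a cutoff of Q20, both of which are my choices for the example; dedicated trimming tools use more sophisticated strategies, and the read shown is made up.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim trailing bases whose Phred score is below `min_q` (Phred+33 assumed)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
read_seq = "ACGTACGT"
read_qual = "IIIIII5#"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
```

Only the trailing end is examined here, so a low-quality base in the middle of a read would be left in place.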

Within the past five years, an alternative approach to transcript quantification has emerged through the work of Rob Patro and his colleagues, who have developed software that allows for much faster transcript quantification using a method that ditches the alignment-based approach altogether. Through the development and release of programs such as Sailfish (Patro, Mount, and Kingsford 2014) and Salmon (Patro et al. 2017), generating transcript-level abundance and count estimates is now achievable in significantly shorter time periods without the drawback of consuming large amounts of storage space. Rather than mapping to a reference genome, the transcriptome can now be used in its place for transcript quantification, should one be fortunate enough to have this resource available for the system under investigation.

Salmon is well supported within the scientific community focusing on gene expression analysis, which is evidenced by its ease of installation and strong documentation. Salmon can easily be installed using Anaconda through the Anaconda cloud (https://anaconda.org/bioconda/salmon) and several vignettes have been posted online in order to help biologists move along the road of Big Sequencing Data analysis towards answering some of our greatest questions!
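For anyone who wants to script a Salmon run, here is a minimal Python sketch that builds the `salmon index` and `salmon quant` command lines and reads the resulting `quant.sf` table. The file and directory names are placeholders, the thread count is arbitrary, and the example table is invented; the sketch only constructs the commands, so Salmon does not need to be installed to try it.

```python
import csv
import io


def salmon_index_cmd(transcriptome_fasta, index_dir):
    """Command to build a Salmon index from a transcriptome FASTA file."""
    return ["salmon", "index", "-t", transcriptome_fasta, "-i", index_dir]


def salmon_quant_cmd(index_dir, reads_1, reads_2, out_dir, threads=4):
    """Command to quantify paired-end reads; '-l A' asks Salmon to infer the library type."""
    return ["salmon", "quant", "-i", index_dir, "-l", "A",
            "-1", reads_1, "-2", reads_2, "-p", str(threads), "-o", out_dir]


def read_quant_sf(text):
    """Map transcript name to estimated read count from quant.sf contents."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


# Placeholder paths; replace with your own files.
index_cmd = salmon_index_cmd("transcripts.fa", "salmon_index")
quant_cmd = salmon_quant_cmd("salmon_index", "sample_R1.fastq.gz",
                             "sample_R2.fastq.gz", "quant/sample")

# Invented quant.sf contents with the standard column headers.
example_quant = ("Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
                 "tx1\t1500\t1300.0\t12.5\t40.0\n"
                 "tx2\t900\t700.0\t3.1\t5.0\n")
num_reads = read_quant_sf(example_quant)
```

Each command list can be handed to `subprocess.run(cmd, check=True)` once Salmon is installed, and the `quant.sf` file appears inside the chosen output directory.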
