Bridging the Gap Between Biologists and Their Data
Contributed by Marina Becker
In the age of next-generation sequencing (NGS), more and more researchers face the daunting task of analyzing massive and often unmanageable quantities of data. The pursuit of “big data” in biology has been accompanied by a vacuum in post-sequencing analysis, where researchers struggle to make sense of the data they have collected effectively and efficiently. The by-hand methods used to prepare samples for sequencing often do not translate once the results come back. In fact, larger data sets often require a larger investment of computational power than manpower in their early stages, and many researchers have trouble with that transition.
Scripting is often a great first step. Parallelizing and automating analysis can make a huge difference not only in processing time but also in the consistency of results. Learning command-line constructs such as for loops can help you automate simple tasks and build skills that carry over to other programming languages and analysis scenarios. Cluster computing is also a great way to learn automation and how to use large amounts of computing resources efficiently.
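As a rough illustration of what automating a simple task can look like, here is a minimal sketch that applies the same step to every file in a folder, working on several files at once. The file names, the helper names (count_reads, count_reads_in_dir) and the four-lines-per-record FASTQ layout are assumptions made for the example, not part of any particular pipeline.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_reads(fastq_path):
    """Count records in one FASTQ file, assuming four lines per record."""
    with open(fastq_path) as handle:
        return sum(1 for _ in handle) // 4


def count_reads_in_dir(directory, pattern="*.fastq", workers=4):
    """Apply count_reads to every matching file, several files at a time."""
    paths = sorted(Path(directory).glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_reads, paths))
    return {path.name: n for path, n in zip(paths, counts)}


# Demo on two tiny made-up files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    record = "@read\nACGT\n+\nIIII\n"
    Path(tmp, "sample_a.fastq").write_text(record * 3)
    Path(tmp, "sample_b.fastq").write_text(record * 5)
    demo_counts = count_reads_in_dir(tmp)

print(demo_counts)
```

Because every file goes through exactly the same function, the results stay consistent from sample to sample. Threads suit a file-reading task like this one; for heavier computation, a process pool or a cluster scheduler would likely be a better fit.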
While learning a programming language is not a trivial task, picking your poison wisely can make the process go much more smoothly. Every programmer will likely give a different spiel on the pros and cons of their chosen language, but as a person new to programming in the biological context, I highly recommend Python as a solid beginner’s choice. Python is a user-friendly language with a human-readable syntax, which eases the transition from thinking in problems and phrases to thinking programmatically and in lines of code.
In addition, Python features useful extensions and libraries that are prebuilt to do many of the tasks you might need in order to analyze your data. In particular, Biopython makes great use of the Python programming language and caters to many of the common needs of biologists. Tasks such as file parsing, look-ups in large data structures and conversions between data/file types are simple to learn and use in this language.
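To make those three tasks concrete, here is a small sketch that parses FASTA text, looks up a record by name and converts the records to a tab-separated table. It uses only the standard library so that it runs without installing anything; Biopython's Bio.SeqIO module covers parsing and format conversion far more thoroughly. The function names (parse_fasta, to_tsv) and the sequences are made up for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the identifier.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_tsv(records):
    """Convert {identifier: sequence} into tab-separated text."""
    return "\n".join(f"{name}\t{len(seq)}\t{seq}" for name, seq in records.items())


# Made-up example data; a real script would open a file instead.
example = io.StringIO(">geneA first made-up record\nATGC\nGGTA\n>geneB\nTTAACC\n")

records = dict(parse_fasta(example))  # parsing into a dictionary
gene_a = records["geneA"]             # look-up by identifier
table = to_tsv(records)               # conversion to another format

print(gene_a)
print(table)
```

Storing the records in a dictionary means a look-up by identifier takes roughly the same time whether the file holds ten sequences or ten million, provided the data fits in memory.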
It will always be critically important for biologists to lay hands on their work and to interpret it at the human level. However, the ultimate pursuit of a better understanding of biology depends on our ability as scientists to embrace the need for computing.
There are a number of resources that can help get you started in programmatic analysis…
For help learning a command line interface and getting started with scripting in general:
For getting started with Python in particular:
For Python tools specific to biologists: