Bioinformatics, Genomes, EC2, and Hadoop
I think it is really interesting to see how breakthroughs and process improvements in one scientific or technical discipline can drive that discipline forward while also enabling progress in other seemingly unrelated disciplines.
The bioinformatics field is rife with examples of this pattern. Declining hardware costs, cloud computing, the ability to do parallel processing, and algorithmic advances have driven down the cost and time of gene sequencing by multiple orders of magnitude in the space of a decade or two. Processing that was once measured in years and megabucks is now measured in hours and dollars.
My colleague Deepak Singh pointed out a number of recent AWS-related developments in this space:
JCVI Cloud Bio-Linux
Built on top of a 64-bit Ubuntu distribution, the JCVI Cloud Bio-Linux gives scientists the ability to launch EC2 instances chock-full of the latest bioinformatics packages including BLAST (Basic Local Alignment Search Tool), glimmer (Microbial Gene-Finding System), hmmer (Biosequence Analysis Using Profile Hidden Markov Models), phylip (Phylogeny Inference Package), rasmol (Molecular Visualization), genespring (statistical analysis, data mining, and visualization tools), clustalw (general purpose multiple sequence alignment), the Celera Assembler (de novo whole-genome shotgun DNA sequence assembler), and the NIH EMBOSS utilities. The Celera Assembler can be used to assemble entire bacterial genome sequences on Amazon EC2 today!
There’s a getting-started guide for the JCVI AMI. Graphical and command-line bioinformatics tools can be launched from a shell window connected to a running instance of the AMI.
CloudBurst is described as a “new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics.”
In layman’s terms, CloudBurst uses Hadoop to implement a linearly scalable search tool. Once loaded with a reference genome, it maps the “short reads” (snippets of sequenced DNA approximately 30 base pairs long) to one or more locations on the reference genome. Think of it as a very advanced form of string matching, with support for partial matches, insertions, deletions, and subtle differences. This is a highly parallelizable operation; CloudBurst reduces operations involving millions of short reads from hours to minutes when run on a large-scale cluster of EC2 instances.
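To make the idea a bit more concrete, here is a toy, single-machine sketch of map/reduce-style read mapping. It is not CloudBurst's code and does not use Hadoop. It assumes a simple seed-and-extend approach that tolerates mismatches only (no insertions or deletions), and every name in it (SEED_LEN, map_reference, map_read, shuffle, reduce_seed, map_reads) is made up for illustration. The map step emits short seeds from the reference and from each read, the shuffle step groups them by seed, and the reduce step checks each candidate location by counting mismatches.

```python
from collections import defaultdict

SEED_LEN = 4  # toy seed length (assumption); real tools use longer seeds


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_reference(reference, seed_len=SEED_LEN):
    """Map step: emit (seed, ("ref", offset)) for every seed in the reference."""
    for i in range(len(reference) - seed_len + 1):
        yield reference[i:i + seed_len], ("ref", i)


def map_read(read_id, read, seed_len=SEED_LEN):
    """Map step: emit (seed, ("read", id, offset)) for non-overlapping seeds."""
    for j in range(0, len(read) - seed_len + 1, seed_len):
        yield read[j:j + seed_len], ("read", read_id, j)


def shuffle(pairs):
    """Group values by key, as a map/reduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values, reference, reads, max_mismatches):
    """Reduce step: extend each shared seed into a full-length comparison."""
    ref_offsets = [v[1] for v in values if v[0] == "ref"]
    read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
    for ref_pos in ref_offsets:
        for read_id, read_off in read_hits:
            read = reads[read_id]
            start = ref_pos - read_off  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            mismatches = hamming(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches


def map_reads(reference, reads, max_mismatches=1):
    """Run map, shuffle, and reduce in memory; return sorted alignments."""
    pairs = list(map_reference(reference))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read))
    alignments = set()  # the same alignment can be found through several seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, reference, reads, max_mismatches))
    return sorted(alignments)


if __name__ == "__main__":
    reference = "GATTACAGGCTTAACCGT"
    reads = {
        "r1": "TACAGGCT",  # exact copy of reference[3:11]
        "r2": "GATTAACC",  # reference[8:16] with one base changed
        "r3": "TTTTTTTT",  # does not occur in the reference
    }
    for read_id, start, mismatches in map_reads(reference, reads):
        print(read_id, start, mismatches)
```

Each seed group can be reduced independently of the others, which is why this kind of work spreads so naturally across a cluster. The sketch relies on a pigeonhole argument: an 8-base read cut into two 4-base seeds can have at most one mismatch and still keep one seed intact, so max_mismatches should stay below the number of seeds per read.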
You can read more about CloudBurst in the research paper. This paper includes benchmarks of CloudBurst on EC2 along with performance and scaling information.
Crossbow was built to do “Whole Genome Resequencing in the Clouds.” It combines Bowtie for ultra-fast short read alignment and SOAPsnp for sequence assembly and high quality SNP calling. The Crossbow home page claims that it can sequence an entire genome in an afternoon on EC2, for less than $250. Crossbow is so new that the papers and the code distribution are still a little ways off. There’s a lot of good information in the Crossbow poster.
Michael Schatz (the principal author of CloudBurst and a co-author of Crossbow) wrote a really interesting note on Hadoop for Computational Biology. He states that “CloudBurst is just the beginning of the story, not the end,” and endorses the Map/Reduce model for processing 100+GB datasets. I will echo Mike’s conclusion to wrap up this somewhat long post:
I really learned a lot while putting this post together and I hope that you will learn something by reading it. If you are using EC2 in a bioinformatics context, I’d love to hear from you. Leave a comment or send me some mail.