AWS Government, Education, & Nonprofits Blog

In Pursuit of a 1 Hour, $10 Genome Annotation

Hundreds of scientists at the Smithsonian Institution study just about every kind of life on Earth, from animals and plants to fungi and bacteria. Since the initial publication of the Human Genome Project in 2001, DNA sequencing technology has become more efficient and cost-effective, making it possible for individual biodiversity scientists to generate genome resources for their organisms of interest. These genomes can be the gateway to new research questions that were previously unanswerable.

Biodiversity genomics scientists face special challenges, however, because they seek to understand genomes that range dramatically in size and complexity (e.g., some plant genomes are more than 10 times larger than the human genome). These scientists need agile software and hardware solutions that can be updated frequently to reflect the ever-increasing data behind algorithms and models.

To tackle these challenges, the Smithsonian’s Office of the Chief Information Officer recently established a Data Science Team, including Dr. Rebecca Dikow and Dr. Paul Frandsen. Part of their mission is to implement solutions that will accelerate science and lower the barrier to entry for genomics research, not only for Smithsonian scientists but for biodiversity researchers in general. Although many large institutions have computing resources available to their researchers, jobs are subject to queue limits, and operating a high-performance computing cluster carries significant costs. In addition, many smaller research institutions and universities may not have access to such resources.

Dikow and Frandsen are collaborating with AWS and Intel to improve a critical part of the genome analysis pipeline: annotation. Genome annotation is the process of identifying the locations of genes and other genomic features and determining their function. It is the first step in downstream applications of genomic data.

“Cloud technologies are a natural choice for annotation because different parts of a genome assembly (contigs or scaffolds) can be annotated in parallel, with the results being knitted together in a final step,” said Dikow. “The ability to scale up to many instances for brief periods will make annotation fast while remaining inexpensive.”
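The pattern Dikow describes, annotating scaffolds independently and knitting the results together at the end, can be illustrated with a minimal Python sketch. This is not the team's pipeline: `annotate_scaffold` is a hypothetical stand-in that only reports each scaffold's length and GC count, and a local thread pool stands in for a fleet of cloud instances.

```python
from concurrent.futures import ThreadPoolExecutor


def split_fasta(text):
    """Split FASTA text into (scaffold_name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def annotate_scaffold(record):
    """Hypothetical placeholder for a real annotation tool."""
    name, seq = record
    gc = sum(seq.upper().count(base) for base in "GC")
    return {"scaffold": name, "length": len(seq), "gc": gc}


def annotate_genome(fasta_text, workers=4):
    """Scatter scaffolds across workers, then gather results in input order."""
    records = split_fasta(fasta_text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate_scaffold, records))
```

Because each scaffold is processed without reference to the others, the number of workers can grow with the number of scaffolds, which is the property that makes short bursts of many instances attractive.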

The Smithsonian’s Data Science Team is implementing existing annotation pipelines, such as MAKER (Cantarel et al., 2008) and WQ_MAKER (Thrasher et al., 2012), as well as developing its own using the workflow engine Toil. Toil supports the Common Workflow Language (CWL), which will allow the tools developed to be modular, portable, and scalable across thousands of AWS instances.

What makes these pipelines complex is the need to process each genome scaffold with multiple software tools in turn and to keep track of thousands of intermediate files and any failed tasks. The team has successfully implemented the first step in the annotation pipeline in Toil, which includes masking genome repeats with RepeatMasker (Smit et al., 2015), across 10 c3.xlarge instances. As they continue to make progress in the coming months, their code will be available on GitHub.
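To give a sense of the bookkeeping involved, here is a minimal Python sketch of tracking per-scaffold outputs and failed tasks with a bounded number of retries. It is an illustration only and does not use Toil's API; `run_with_retries`, `step`, and `max_attempts` are hypothetical names, and `step` stands in for any tool invocation such as a repeat-masking run.

```python
def run_with_retries(tasks, step, max_attempts=3):
    """Run step(task) for each task, retrying failures.

    Returns (outputs, failed): outputs maps each successful task to its
    result, and failed maps each task that never succeeded to a note
    describing its last error.
    """
    outputs, failed = {}, {}
    pending = list(tasks)
    for attempt in range(1, max_attempts + 1):
        still_failing = []
        for task in pending:
            try:
                outputs[task] = step(task)
                failed.pop(task, None)  # clear any earlier failure note
            except Exception as err:
                failed[task] = f"attempt {attempt}: {err}"
                still_failing.append(task)
        pending = still_failing
        if not pending:
            break
    return outputs, failed
```

A workflow engine takes over this kind of record keeping, so that a single failed scaffold can be rerun without repeating the work already completed for the others.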

Rebecca Dikow presented the team’s progress at the first Global Biodiversity Genomics “BioGenomics” conference, held in Washington, DC, and hosted by the Smithsonian Institution. The conference gathered more than 300 genome and biodiversity scientists and focused on the methods and analysis of biodiverse genome data. There was great interest in the annotation pipeline currently under development. For more detail, see the slide deck presented at the conference, “Improving genome annotation strategies for biodiverse species using cloud technologies.”