AWS Public Sector Blog
Drug Discovery and Biomarkers Development on the Human Gut Microbiome Using AWS Batch and Nextflow
A guest post by Francesco Strozzi – Head of Bioinformatics, Enterome
The gut microbiome plays a critical role in building our immune system at birth and provides lifelong, natural protection. To fully explore and characterize the role of the human gut microbiome, Enterome uses several approaches, including the latest genome sequencing technologies, to reconstruct microbial genomes and to quantify the abundance of different species and microbial genes in the gut across large cohorts of patients. Current high-throughput sequencing technologies produce tens of millions of DNA sequences for each biological sample, and the human gut microbiome is estimated to contain hundreds of species and several million unique bacterial genes that can be identified and analyzed. Enterome’s mission is to translate all of this information into knowledge that can be applied to advanced clinical and drug discovery programs.
Processing such large and heterogeneous datasets requires massive computing and storage resources and proper tools and strategies to manage the complexity of the analysis. To tackle this complexity, Enterome adopts innovative approaches using the AWS Cloud. Enterome uses AWS Batch with Nextflow to manage and scale thousands of computing jobs for a single analysis. Nextflow allows the orchestration of large and complex workflows for automation, traceability, and reproducibility of the analysis pipelines. These tools enable fast and efficient processing and mining of the human gut microbiome data.
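To make the fan-out of "thousands of computing jobs for a single analysis" concrete, here is a minimal sketch of how one AWS Batch job request per sample could be assembled. In practice Nextflow's AWS Batch executor handles job submission itself, so this is only an illustration of the idea. The queue name, job definition name, sample IDs, and the `SAMPLE_ID` environment variable are hypothetical placeholders, not Enterome's actual configuration.

```python
# Illustrative sketch only: queue, job definition, and sample IDs are
# hypothetical placeholders, not Enterome's real setup.

def build_job_requests(sample_ids, job_queue, job_definition):
    """Return one AWS Batch SubmitJob request (as a plain dict) per sample."""
    requests = []
    for sample_id in sample_ids:
        requests.append({
            "jobName": f"quantify-{sample_id}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            # Pass the sample to process through the container environment.
            "containerOverrides": {
                "environment": [{"name": "SAMPLE_ID", "value": sample_id}],
            },
        })
    return requests


requests = build_job_requests(
    ["S001", "S002", "S003"],   # hypothetical sample IDs
    "example-spot-queue",       # hypothetical job queue
    "example-quantify:1",       # hypothetical job definition
)
print(len(requests), "job requests prepared")
```

Each dictionary uses the parameter names of the AWS Batch SubmitJob API, so with credentials configured it could be passed to a boto3 Batch client as `client.submit_job(**request)`. No call to AWS is made in the sketch itself.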
Typical workflows executed on AWS Batch at Enterome include the metagenomics quantification pipeline, which estimates the abundance of known and novel microbial species present in the human gut microbiome, starting from high-throughput sequencing data. In this workflow, raw sequencing data is processed, filtered, and then searched against Enterome’s human gut microbiome gene catalogues. This allows precise profiling of each microbial gene and species to identify those that correlate with disease progression or with response to a treatment.
The construction of the human gut microbiome gene catalogues is another type of workflow where AWS Batch is used with Nextflow. To build such catalogues, the sequencing data from hundreds or thousands of human gut microbiome samples are processed to reconstruct the maximum number of bacterial genomic sequences and to predict the possible genes as accurately as possible. Once this collection of genes is complete, the workflow performs an extensive annotation to assign each gene to a known or novel bacterial species and to characterize its biological functions.
The process is computationally intensive but important. This information underpins every program at Enterome, as it forms the basis for our understanding of the composition and role of the human gut microbiome. It opens up the possibility of identifying druggable targets and novel candidate molecules.
To complete these computational tasks and workflows, 50 to 200 Amazon Elastic Compute Cloud (Amazon EC2) instances spin up for each data analysis. Enterome does not need to run infrastructure continuously; we spin up instances when we need them and shut them down when we do not. With EC2 Spot Instances, a typical data analysis can cost as little as a few dollars per sample and takes four to six hours to complete. With AWS and Nextflow, complete, automated workflows can be executed and analyzed in parallel in just a few hours, saving time and resources. AWS Batch has simplified running scientific workloads on the cloud at scale.
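As a rough back-of-the-envelope illustration of these figures, the sketch below computes the range of instance-hours implied by 50 to 200 instances running for four to six hours. The per-sample cost function is included only to show the arithmetic: the Spot price, instance count, and number of samples passed to it are hypothetical values chosen for illustration, not numbers from Enterome.

```python
def instance_hours(instances, hours):
    """Total compute consumed, in instance-hours."""
    return instances * hours


def cost_per_sample(instances, hours, price_per_instance_hour, samples):
    """Average cost per sample for one analysis run."""
    return instance_hours(instances, hours) * price_per_instance_hour / samples


# Range implied by the figures in the text: 50-200 instances, 4-6 hours.
low = instance_hours(50, 4)
high = instance_hours(200, 6)
print(f"Instance-hours per analysis: {low} to {high}")

# Hypothetical inputs (assumptions, not Enterome data): 100 instances for
# 5 hours at 0.10 USD per instance-hour, processing 25 samples.
example = cost_per_sample(100, 5, 0.10, 25)
print(f"Example cost per sample: {example:.2f} USD")
```

Because instances exist only for the duration of an analysis, the cost scales with instance-hours actually used rather than with the size of a permanently provisioned cluster.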
One of the key features of Nextflow is the ability to resume entire workflows so that completed jobs are not re-executed. We can efficiently manage workflow interruptions or introduce changes with minimal impact on cost and time, while pipeline progress and results remain consistent.
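The idea behind resuming can be sketched in a few lines: each task is identified by a hash of its name and inputs, and a task whose hash is already in the cache is skipped. This is a simplified illustration of the concept, not Nextflow's actual implementation. The task names, sample IDs, and the in-memory `cache` dictionary are hypothetical.

```python
import hashlib
import json


def task_key(name, inputs):
    """Derive a stable cache key from the task name and its inputs."""
    payload = json.dumps({"task": name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_workflow(tasks, cache):
    """Run only tasks whose key is not cached; return names of tasks executed."""
    executed = []
    for name, inputs, func in tasks:
        key = task_key(name, inputs)
        if key not in cache:
            cache[key] = func(inputs)  # do the work and remember the result
            executed.append(name)
    return executed


def filter_reads(inputs):
    # Stand-in for a real processing step (hypothetical).
    return f"filtered-{inputs['sample']}"


cache = {}
tasks = [
    ("filter_reads", {"sample": "S001"}, filter_reads),
    ("filter_reads", {"sample": "S002"}, filter_reads),
]

first_run = run_workflow(tasks, cache)    # both tasks execute
second_run = run_workflow(tasks, cache)   # resumed: nothing re-executes

# Adding a new sample triggers only the new task.
tasks.append(("filter_reads", {"sample": "S003"}, filter_reads))
third_run = run_workflow(tasks, cache)
print(len(first_run), len(second_run), len(third_run))
```

With Nextflow itself, the equivalent behavior is requested by adding the `-resume` option to `nextflow run`.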
Finally, both AWS Batch and Nextflow natively support Docker containers, which encapsulate existing tools and analysis pipelines so that they can be reused. In Enterome’s case, this further simplifies the transition from small development environments to large production environments.
Technologies such as AWS Batch and Nextflow allow Enterome to focus on data analysis rather than on infrastructure management or workflow execution. Enterome can more effectively develop innovative therapeutic approaches for microbiome-related diseases while being free from on-premises infrastructure costs and computing limitations.