AWS Public Sector Blog

Tracking global antimicrobial resistance among pathogens using Nextflow and AWS

The Centre for Genomic Pathogen Surveillance (CGPS) is based at the Wellcome Genome Campus, Cambridge, and the Big Data Institute, University of Oxford, in the United Kingdom. Much of its work involves collaborating with laboratories around the world to enhance genomic surveillance through big data, engineering, training, and genomic capacity building. Ultimately, the Centre hopes to enable the linking and real-time interpretation of data globally to track pathogens and antimicrobial resistance at an affordable cost. Spikes in research costs are a common challenge for laboratories. To mitigate its own costs, and particularly those of its partners in low- and middle-income countries, the team explored the pay-as-you-go infrastructure of the Amazon Web Services (AWS) Cloud.

The team developed a comprehensive process with minimal startup steps that uses a dynamic web application to make it simpler for researchers to scale up Nextflow-based data pipelines, which they use for genome sequencing analysis in the AWS Cloud. Nextflow allows researchers to build scalable workflows using software containers to better interpret, track, and identify various pathogens. Because Nextflow is free, open source software, researchers can adapt pipelines to their needs and prototype them on a small desktop computer. Coupled with AWS, Nextflow pipelines can scale up to run hundreds or even thousands of jobs in parallel.

These changes eased the workflow for some of the Centre's partners, such as the institutes in Colombia, India, the Philippines, and Nigeria that participate in the NIHR-funded Global Health Research Unit: Genomic Surveillance of Antimicrobial Resistance. Some of these institutes may lack the computing resources and DevOps expertise necessary to set up and run downstream analytics.

Using AWS CloudFormation and AWS Batch with Nextflow, Dr. Anthony Underwood, bioinformatics implementation manager for the Centre for Genomic Pathogen Surveillance, and Ben Taylor, senior software engineer, built a solution with their colleagues that allows a new pipeline to be deployed with just a few commands (instructions for using this software can be found online). Their template eliminates the need for deep technical knowledge of the command line and reduces the risk of running up unexpectedly high bills. End users can now start a pipeline and monitor it from a web page, and they can proactively set limits to manage costs. With the addition of AWS Lambda, compute is faster, and researchers keep their costs down by paying for storage and compute in Amazon Elastic Compute Cloud (Amazon EC2) only while the pipeline is processing data.
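To make the cost reasoning concrete, here is a minimal sketch of the arithmetic behind setting a limit. It assumes the limit takes the form of a ceiling on the number of vCPUs running at once (AWS Batch compute environments have a maximum vCPU setting, though whether the CGPS template uses it this way is an assumption). The class, the function names, the workload figures, and the price are all hypothetical placeholders, not real AWS rates or part of the CGPS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeCap:
    """Hypothetical limits a lab might set before launching a pipeline."""
    max_vcpus: int              # ceiling on vCPUs allowed to run at once
    price_per_vcpu_hour: float  # placeholder rate, NOT a real AWS price


def workload_vcpu_hours(jobs: int, vcpus_per_job: int, hours_per_job: float) -> float:
    """Total compute a pipeline run consumes, in vCPU-hours."""
    return jobs * vcpus_per_job * hours_per_job


def pay_as_you_go_cost(total_vcpu_hours: float, cap: ComputeCap) -> float:
    """Cost depends on compute consumed, not on how parallel the run was."""
    return total_vcpu_hours * cap.price_per_vcpu_hour


def max_spend_per_hour(cap: ComputeCap) -> float:
    """Upper bound on hourly spend when every allowed vCPU is busy."""
    return cap.max_vcpus * cap.price_per_vcpu_hour


def min_wall_clock_hours(total_vcpu_hours: float, cap: ComputeCap) -> float:
    """Shortest possible run time for the workload under the cap."""
    return total_vcpu_hours / cap.max_vcpus


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 jobs, 2 vCPUs each, 30 minutes each.
    cap = ComputeCap(max_vcpus=64, price_per_vcpu_hour=0.05)
    work = workload_vcpu_hours(jobs=1000, vcpus_per_job=2, hours_per_job=0.5)
    print(f"workload:           {work:.0f} vCPU-hours")
    print(f"estimated cost:     {pay_as_you_go_cost(work, cap):.2f}")
    print(f"max spend per hour: {max_spend_per_hour(cap):.2f}")
    print(f"minimum run time:   {min_wall_clock_hours(work, cap):.3f} hours")
```

Under these assumptions, raising the cap shortens the run without changing the estimated compute cost, while lowering it bounds how quickly a misconfigured pipeline could accumulate charges.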

When developing some of the pipelines, Dr. Underwood says he managed to reduce the learning curve for the global collaborating labs. “In a matter of a day, people can be up and running. When I first came to the Sanger Institute, although I had a login to our high-performance computing (HPC) environment, I didn’t use it. I was trying to put myself in the shoes of someone who doesn’t have access to HPC. The final goal of our project was to have four functioning labs that are completely independent of CGPS. We want them to be able to carry on doing what they’re doing, and also provide their experiences of laboratory and computational setup and training to enable replication to other labs.”

With their solution, the CGPS and collaborating labs can now use whole genome sequencing technologies and analytics to take their research and public health reporting to a degree of specificity that was not possible with traditional microbiology. Dr. Underwood explains, “We now have an extra level of discrimination. We can track where the pathogens are coming from – whether they’re from within or outside of a hospital, region, or country. We can interpret the data forensically by viewing the pathogen genome ‘fingerprints’ and determine the genetic loci responsible for antimicrobial resistance and whether they are present within highly transmissible genomic elements, such as plasmids. This helps us to see if infection with bacteria resistant to treatment is due to a single clone or if there has been exchange of antimicrobial resistance between isolates and even species. Our primary goal is to track antimicrobial resistance not just at a local level, but in the global context.”

While the solution is still in the prototyping stage, Dr. Underwood and his team are working to lay the foundation for replicating and scaling this process. Dr. Underwood says, “Our future hope is that this allows completely remote setup and support of computing infrastructure and analytical pipelines, so that we never have to go into a server room. We want to provide a set of instructions so that other labs can get up and running themselves. This would not be possible if it was all on-premises HPC, but it is possible with the cloud.”