Standardizing quantification of expression data at Corteva Agriscience with Nextflow and AWS Batch
Authored by Anand Venkatraman, Bioinformatics Associate Research Scientist at Corteva Agriscience, and Srinivasarao Annapareddi, Cloud DevOps Engineer at Corteva Agriscience. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
Data analysis in biological research today presents some interesting conundrums and challenges, including a rapidly increasing number and complexity of analytical methods, and many implementations of major algorithms and tools that do not scale well. As a result, reproducing the results of a pipeline or workflow can be challenging given the number of components, each having its own set of parameters, dependencies, supporting files, and installation requirements.
Corteva Agriscience is an agriscience company completely dedicated to agriculture, with the purpose of enriching the lives of those who produce and those who consume by ensuring progress for generations to come. At Corteva Agriscience, expression analysis continues to increase in complexity and scale, while facing those challenges for data analysis in biological research. This led to the creation of the Standardized Corteva Quantification Pipeline. By providing best practices for standardized quantification data, this pipeline takes a crucial first step toward mitigating and overcoming most of the data chaos from expression analysis while simultaneously catering to the needs of subject matter experts and downstream data management strategies.
Standardized Corteva Quantification Pipeline for Expression: Implementation with Nextflow and AWS Batch
Given these challenges and complexities, there were two possible paths for implementing the Standardized Corteva Quantification Pipeline:
- Maintain on-premises infrastructure with a large pool of compute resources that is always available. (The drawbacks are that as much as 90% of the capacity might sit unused or underutilized at times, while in some seasons demand might exceed what the existing infrastructure can deliver.)
- Spin instances up and down in the cloud on demand.
Given how quickly and how often expression data was needed, it became increasingly clear to us that the most viable solution was to spin instances up and down in the cloud on demand. Corteva Agriscience uses AWS Batch for many projects; it is a set of batch management capabilities that enables you to easily and efficiently run hundreds or thousands of batch computing jobs on AWS. We wanted a solution that builds on AWS Batch without duplicating the features and processes it already provides. Nextflow’s support for AWS Batch with Spot Instances fit this scenario well, because Nextflow provides features that extend AWS Batch functionality in several ways.
The team’s decision to use Nextflow with AWS Batch as the solution for standardizing quantification data for expression was primarily based on these four (of many) salient features:
- Nextflow reduces AWS Batch configuration work by automatically taking care of the required job definitions and job requests as needed.
- Nextflow spins up the required computing instances, scaling up and down the number and composition of the instances to best accommodate the actual workload resource needs at any given point in time.
- Nextflow combines the auto-scaling ability provided by AWS Batch with the use of Spot Instances to bring about large savings in cost, time, and resources.
- Nextflow can reschedule failed jobs automatically, providing a fault-tolerant environment.
Standardized Corteva Quantification Pipeline for Expression: Bioinformatics tools, AWS compute environment, and architecture
The standardized quantification pipeline for expression is written in Nextflow and uses these bioinformatics software programs: fastqc, bbtools, fastp, salmon, tximport, and MultiQC. The underlying compute environment on AWS can scale up to 1024 vCPUs using a combination of r4.8xlarge, r5.8xlarge, r5d.8xlarge, and r5a.8xlarge instance types, depending on the compute or memory needs of each bioinformatics process within the workflow. The architecture implemented with Nextflow and AWS Batch + Amazon EC2 Spot Instances is depicted in Figure 1.
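To put the 1024 vCPU ceiling in perspective, the short calculation below estimates how many instances of each listed type fit under it. The per-instance vCPU and memory figures come from the public EC2 instance specifications as we understand them and are assumptions to verify against current AWS documentation; the helper names are hypothetical.

```python
# Per-instance figures are assumptions taken from public EC2 specifications;
# verify them against current AWS documentation before relying on them.
INSTANCE_SPECS = {
    "r4.8xlarge": {"vcpus": 32, "memory_gib": 244},
    "r5.8xlarge": {"vcpus": 32, "memory_gib": 256},
    "r5d.8xlarge": {"vcpus": 32, "memory_gib": 256},
    "r5a.8xlarge": {"vcpus": 32, "memory_gib": 256},
}

MAX_VCPUS = 1024  # ceiling stated for the compute environment


def max_instances(instance_type, max_vcpus=MAX_VCPUS):
    """Number of whole instances of one type that fit under the vCPU ceiling."""
    return max_vcpus // INSTANCE_SPECS[instance_type]["vcpus"]


def max_memory_gib(instance_type, max_vcpus=MAX_VCPUS):
    """Aggregate memory if the ceiling were filled with a single instance type."""
    return max_instances(instance_type, max_vcpus) * INSTANCE_SPECS[instance_type]["memory_gib"]


if __name__ == "__main__":
    for name in INSTANCE_SPECS:
        print(name, max_instances(name), "instances,", max_memory_gib(name), "GiB")
```

Under these assumed figures, every listed type has the same vCPU count, so the ceiling corresponds to the same number of instances whichever type AWS Batch selects; only the aggregate memory differs.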
Figure 1: Nextflow + AWS Batch architecture for Standardized Corteva Quantification Pipeline for Expression
Notable parts of the architecture are the Scheduler Batch Node, the Executor Batch Node, and the EC2 Batch template.
The Scheduler Batch Node is an EC2 instance launched by AWS Batch to run the main Nextflow process, which schedules the workflow processes. It is important that this node use On-Demand Instances so that the main scheduling process is not interrupted. To enable this, we created an AWS Batch compute environment and job queue to use just for scheduling nodes using CloudFormation like the following:
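As a minimal sketch, and not necessarily the exact template used at Corteva, the Python below builds a CloudFormation template for a dedicated On-Demand compute environment and job queue. CloudFormation accepts JSON as well as YAML, so the template is assembled as a plain dictionary and serialized with `json.dumps`. The logical IDs, the vCPU limits, the queue priority, and the function parameters are all assumptions for illustration.

```python
import json


def scheduler_template(subnet_ids, security_group_ids,
                       service_role_arn, instance_profile_arn):
    """Build a CloudFormation template (as a dict) for the scheduler queue."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # Hypothetical logical ID; "EC2" requests On-Demand capacity,
            # so the main Nextflow process is not subject to Spot interruption.
            "SchedulerComputeEnvironment": {
                "Type": "AWS::Batch::ComputeEnvironment",
                "Properties": {
                    "Type": "MANAGED",
                    "ServiceRole": service_role_arn,
                    "ComputeResources": {
                        "Type": "EC2",
                        "MinvCpus": 0,
                        "MaxvCpus": 16,  # assumed limit, sized for scheduler jobs only
                        "InstanceTypes": ["optimal"],
                        "Subnets": subnet_ids,
                        "SecurityGroupIds": security_group_ids,
                        "InstanceRole": instance_profile_arn,
                    },
                },
            },
            # Queue used only for jobs that run the main Nextflow process.
            "SchedulerJobQueue": {
                "Type": "AWS::Batch::JobQueue",
                "Properties": {
                    "Priority": 100,  # assumed value
                    "State": "ENABLED",
                    "ComputeEnvironmentOrder": [
                        {
                            "Order": 1,
                            "ComputeEnvironment": {"Ref": "SchedulerComputeEnvironment"},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder identifiers; substitute real values from your account.
    template = scheduler_template(
        ["subnet-EXAMPLE"], ["sg-EXAMPLE"],
        "arn:aws:iam::111122223333:role/ExampleBatchServiceRole",
        "arn:aws:iam::111122223333:instance-profile/ExampleEcsInstanceProfile",
    )
    print(json.dumps(template, indent=2))
```

Keeping the scheduler in its own compute environment means its capacity type and limits can be set independently of the nodes that run the workflow processes.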
Workflow processes run on Executor Batch Nodes, which can use EC2 Spot Instances. Again, we created a dedicated compute environment and job queue just for executor nodes using a CloudFormation snippet like the following:
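A comparable sketch for the executor side, again an illustration rather than Corteva's actual template, is shown below as Python that assembles a JSON-serializable CloudFormation template. The instance types and the 1024 vCPU ceiling come from the description above; the logical IDs, bid percentage, queue priority, and function parameters are assumptions.

```python
import json

# Instance types named in this post; everything else is an illustrative assumption.
EXECUTOR_INSTANCE_TYPES = ["r4.8xlarge", "r5.8xlarge", "r5d.8xlarge", "r5a.8xlarge"]


def executor_template(subnet_ids, security_group_ids, service_role_arn,
                      instance_profile_arn, spot_fleet_role_arn,
                      launch_template_id=None, bid_percentage=100):
    """Build a CloudFormation template (as a dict) for the Spot executor queue."""
    compute_resources = {
        "Type": "SPOT",  # run workflow processes on Spot capacity
        "BidPercentage": bid_percentage,  # max price as a percentage of On-Demand
        "SpotIamFleetRole": spot_fleet_role_arn,
        "MinvCpus": 0,  # scale to zero when no jobs are queued
        "MaxvCpus": 1024,
        "InstanceTypes": EXECUTOR_INSTANCE_TYPES,
        "Subnets": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        "InstanceRole": instance_profile_arn,
    }
    if launch_template_id is not None:
        # Optionally attach a custom launch template for node provisioning.
        compute_resources["LaunchTemplate"] = {
            "LaunchTemplateId": launch_template_id,
            "Version": "$Latest",
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ExecutorComputeEnvironment": {
                "Type": "AWS::Batch::ComputeEnvironment",
                "Properties": {
                    "Type": "MANAGED",
                    "ServiceRole": service_role_arn,
                    "ComputeResources": compute_resources,
                },
            },
            "ExecutorJobQueue": {
                "Type": "AWS::Batch::JobQueue",
                "Properties": {
                    "Priority": 10,  # assumed value
                    "State": "ENABLED",
                    "ComputeEnvironmentOrder": [
                        {
                            "Order": 1,
                            "ComputeEnvironment": {"Ref": "ExecutorComputeEnvironment"},
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder identifiers; substitute real values from your account.
    print(json.dumps(executor_template(
        ["subnet-EXAMPLE"], ["sg-EXAMPLE"],
        "arn:aws:iam::111122223333:role/ExampleBatchServiceRole",
        "arn:aws:iam::111122223333:instance-profile/ExampleEcsInstanceProfile",
        "arn:aws:iam::111122223333:role/ExampleSpotFleetRole",
    ), indent=2))
```

Because a Spot interruption only affects individual workflow tasks, which Nextflow can reschedule, the executor environment can accept that risk in exchange for lower cost.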
Executor Batch Nodes also need some minimal provisioning to work with Nextflow. For this, we used a custom launch template like the following:
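The sketch below illustrates one way such a launch template could be expressed; it is not necessarily what Corteva deployed. It assumes the provisioning step is to make the AWS CLI available on the host so that Nextflow tasks can stage files to and from Amazon S3, which is a common requirement but an assumption here. AWS Batch documents that launch-template user data must be in MIME multi-part archive format, so the script is wrapped with Python's standard `email` package and base64-encoded. The logical ID, template name, and script contents are hypothetical.

```python
import base64
import json
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical provisioning script: installs the AWS CLI on the host.
PROVISION_SCRIPT = """#!/bin/bash
yum install -y unzip
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o /tmp/awscliv2.zip
unzip -q /tmp/awscliv2.zip -d /tmp
/tmp/aws/install
"""


def build_user_data(script):
    """Wrap a shell script in a MIME multi-part archive and base64-encode it."""
    archive = MIMEMultipart()  # produces Content-Type: multipart/mixed
    archive.attach(MIMEText(script, "x-shellscript"))
    return base64.b64encode(archive.as_bytes()).decode("ascii")


def launch_template(script=PROVISION_SCRIPT, name="nextflow-executor-example"):
    """Build a CloudFormation template (as a dict) with one EC2 launch template."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "ExecutorLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateName": name,
                    "LaunchTemplateData": {"UserData": build_user_data(script)},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(launch_template(), indent=2))
```

The resulting launch template can then be referenced from the executor compute environment so that every instance AWS Batch launches for workflow processes is provisioned the same way.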