AWS for Industries

Standardizing quantification of expression data at Corteva Agriscience with Nextflow and AWS Batch

Authored by Anand Venkatraman, Bioinformatics Associate Research Scientist at Corteva Agriscience, and Srinivasarao Annapareddi, Cloud DevOps Engineer at Corteva Agriscience. The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Data analysis in biological research today presents some interesting conundrums and challenges, including a rapidly increasing number and complexity of analytical methods, and many implementations of major algorithms and tools that do not scale well. As a result, reproducing the results of a pipeline or workflow can be difficult, given the number of components, each with its own set of parameters, dependencies, supporting files, and installation requirements.

Corteva Agriscience is an agriscience company completely dedicated to agriculture, with the purpose of enriching the lives of those who produce and those who consume by ensuring progress for generations to come. At Corteva Agriscience, expression analysis continues to increase in complexity and scale while facing these same data analysis challenges. This led to the creation of the Standardized Corteva Quantification Pipeline. By providing best practices for standardized quantification data, this pipeline takes a crucial first step toward taming the data chaos of expression analysis while catering to the needs of subject matter experts and downstream data management strategies.

Standardized Corteva Quantification Pipeline for Expression: Implementation with Nextflow and AWS Batch

Given these challenges and complexities, there were two possible paths for implementing the Standardized Corteva Quantification Pipeline:

  1. Maintain always-on on-premises infrastructure with a large pool of compute resources, knowing that 90% of the capacity might sit idle at times, while seasonal peaks in demand could still exceed what the infrastructure can deliver.
  2. Spin instances up and down in the cloud on demand.

Given the speed and scale at which expression data was needed, it became increasingly clear to us that the most viable solution was spinning instances up and down in the cloud on demand. Corteva Agriscience uses AWS Batch for many projects; it is a set of batch management capabilities that enables you to easily and efficiently run hundreds or thousands of batch computing jobs on AWS. We wanted a solution that builds on AWS Batch without duplicating the features and processes it already provides. Nextflow’s support for AWS Batch + Spot Instances fit this scenario perfectly, as Nextflow extends AWS Batch functionality in several ways.

The team’s decision to use Nextflow with AWS Batch as the solution for standardizing quantification data for expression was primarily based on these four (of many) salient features, illustrated in the configuration sketch after this list:

  1. Nextflow spares you AWS Batch configuration steps by automatically creating the required job definitions and submitting job requests as needed.
  2. Nextflow spins up the required computing instances, scaling up and down the number and composition of the instances to best accommodate the actual workload resource needs at any given point in time.
  3. Nextflow combines the auto-scaling provided by AWS Batch with the use of Spot Instances to bring about huge savings in cost, time, and resources.
  4. Nextflow can reschedule failed jobs automatically, providing a truly fault-tolerant environment.
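
A minimal nextflow.config sketch of how these features come together is shown below. The queue name, region, S3 bucket, and retry count are illustrative assumptions, not our production values:

nextflow-config

// Minimal sketch of the AWS Batch executor setup (illustrative values).
process {
    executor      = 'awsbatch'       // Nextflow creates Batch job definitions and submits jobs
    queue         = 'ExecutorQueue'  // Batch job queue backed by the Spot compute environment
    errorStrategy = 'retry'          // automatically reschedule failed jobs (e.g., Spot reclamations)
    maxRetries    = 3
}

aws.region = 'us-east-1'

// Task inputs and outputs are staged through an S3 work directory.
workDir = 's3://my-expression-bucket/nextflow-work'   // hypothetical bucket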

Standardized Corteva Quantification Pipeline for Expression: Bioinformatics tools, AWS compute environment, and architecture

The standardized quantification pipeline for expression, written in the Nextflow DSL, uses these bioinformatics software programs: FastQC, BBTools, fastp, Salmon, tximport, and MultiQC. The underlying compute environment on AWS can scale up to 1,024 vCPUs using a combination of r4.8xlarge, r5.8xlarge, r5d.8xlarge, and r5a.8xlarge instance types, depending on the compute or memory needs of each bioinformatics process within the workflow.
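
To match each step to an appropriate instance class, resources can be requested per process in Nextflow. The selector block below is a hedged illustration: the process names mirror the tools above, and the CPU and memory figures are placeholder assumptions rather than the pipeline's actual settings:

nextflow-config

// Illustrative per-process resource requests (values are assumptions).
process {
    withName: 'fastqc' { cpus = 4;  memory = '8 GB'   }
    withName: 'fastp'  { cpus = 8;  memory = '16 GB'  }
    withName: 'salmon' { cpus = 32; memory = '240 GB' }  // lands on an r5.8xlarge-class node
}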

The architecture implemented with Nextflow and AWS Batch + Amazon EC2 Spot Instances is depicted in Figure 1.

Figure 1: Nextflow + AWS Batch architecture for the Standardized Corteva Quantification Pipeline for Expression

Notable parts of the architecture are the Scheduler Batch Node, the Executor Batch Node, and the EC2 launch template.

The Scheduler Batch Node is an EC2 instance launched by AWS Batch to run the main Nextflow process, which schedules workflow processes. It is important that this node use On-Demand Instances so that the main scheduling process is not interrupted. To enable this, we created an AWS Batch compute environment and job queue dedicated to scheduler nodes, using a CloudFormation snippet like the following:

scheduler-batch-node

{
    "SchedulerCE": {
        "Type": "AWS::Batch::ComputeEnvironment",
        "Properties": {
            "Type": "MANAGED",
            "ServiceRole": {
                "Ref": "BatchServiceRole"
            },
            "ComputeEnvironmentName": {
                "Ref": "SchedulerComputeEnv"
            },
            "ComputeResources": {
                "MinvCpus": 0, "MaxvCpus": 128,  "DesiredvCpus": 0,
                "SecurityGroupIds": [ { "Ref": "BatchSecurityGroup" } ],
                "Type": "EC2",
                "Subnets": [
                    {
                        "Fn::ImportValue": { "Fn::Sub": "PrivateSubnetOneV1"  }
                    },
                    {
                        "Fn::ImportValue": {"Fn::Sub": "PrivateSubnetTwoV1"   }
                    }
                ],
                "ImageId":      { "Ref": "SchedulerAMI"   },
                "InstanceRole": { Ref": "InstanceRole" },
                "InstanceTypes": [ "optimal"  ],
                "Ec2KeyPair": { "Ref": "KeyName"  }
            },
            "State": "ENABLED"
        }
    },
    "SchedulerQueue": {
        "Type": "AWS::Batch::JobQueue",
        "Properties": {
            "ComputeEnvironmentOrder": [
                {
                    "Order": 1,
                    "ComputeEnvironment": {
                        "Ref": "SchedulerCE"
                    }
                }
            ],
            "State": "ENABLED",
            "Priority": 1,
            "JobQueueName": {
                "Ref": "SchedulerQueue"
            }
        }
    },
    "Scheduler": {
        "Type": "AWS::Batch::JobDefinition",
        "Properties": {
            "Type": "container",
            "JobDefinitionName": {
                "Ref": "SchedulerJobDef"
            },
            "ContainerProperties": {
                "Memory": 1024, "Privileged": true, "JobRoleArn": { "Ref": "JobRoleARN" },
                "ReadonlyRootFilesystem": false,
                "Vcpus": 1,
                "Image": { "Ref": "SchedulerImage"  }
            }
        }
    }
}

Workflow processes run on Executor Batch Nodes, which can use EC2 Spot Instances. Again, we created a dedicated compute environment and job queue just for executor nodes, using a CloudFormation snippet like the following:

executor-batch-node

{
    "ExecutorCE": {
        "Type": "AWS::Batch::ComputeEnvironment",
        "Properties": {
            "Type": "MANAGED",
            "ServiceRole": {
                "Ref": "BatchServiceRole"
            },
            "ComputeEnvironmentName": {
                "Ref": "ExecutorComputeEnv"
            },
            "ComputeResources": {
                "SpotIamFleetRole": {
                    "Ref": "SpotFleetRole"
                },
                "BidPercentage": 80,
                "MinvCpus": 0, "MaxvCpus": 2000,  "DesiredvCpus": 0,
                "SecurityGroupIds": [
                    {
                        "Ref": "BatchSecurityGroup"
                    }
                ],
                "Type": "spot",
                "Subnets": [
                    {
                        "Fn::ImportValue": {
                            "Fn::Sub": "PrivateSubnetOneV1"
                        }
                    },
                    {
                        "Fn::ImportValue": {
                            "Fn::Sub": "PrivateSubnetFiveV1"
                        }
                    }
                ],
                "ImageId": {
                    "Ref": "imageID"
                },
                "InstanceRole": {
                    "Ref": "InstanceRole"
                },
                "InstanceTypes": [
                "r5.large", "r4.large", "r5.8xlarge", "r4.8xlarge", “r5a.8xlarge", "r5d.8xlarge"
                ],
                "Ec2KeyPair": {
                    "Ref": "KeyName"
                }
            },
            "State": "ENABLED"
        }
    },
    "ExecutorQueue": {
        "Type": "AWS::Batch::JobQueue",
        "Properties": {
            "ComputeEnvironmentOrder": [
                {
                    "Order": 1,  "ComputeEnvironment": { "Ref": "ExecutorCE"  }
                }
            ],
            "State": "ENABLED",
            "Priority": 1,
            "JobQueueName": {
                "Ref": "ExecutorQueue"
            }
        }
    }
}

Executor Batch nodes also need some minimal provisioning to work with Nextflow. For this, we used a custom launch template like the following:

ec2_batch_template

{
    "Resources": {
        "EC2LaunchTemplate": {
            "Type": "AWS::EC2::LaunchTemplate",
            "Properties": {
                "LaunchTemplateName": {
                    "Fn::Join": [
                        "-",
                        [
                            {
                                "Ref": "EC2LaunchtemplateName"
                            },
                            {
                                "Fn::Select": [
                                    2,
                                    {
                                        "Fn::Split": [
                                            "/",
                                            {
                                                "Ref": "AWS::StackId"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    ]
                },
                "LaunchTemplateData": {
                    "InstanceMarketOptions": {
                        "MarketType": "spot"
                    },
                    "BlockDeviceMappings": [
                        {
                            "Ebs": {
                                "DeleteOnTermination": true, "VolumeSize": 80,
                                "VolumeType": "gp2"
                            },
                            "DeviceName": "/dev/xvda"
                        }
                    ],
                    "UserData": {
                        "Fn::Base64": {
                            "Fn::Sub": "MIME-Version: 1.0\nContent-Type: multipart/mixed; boundary=\"==BOUNDARY==\"\n\n--==BOUNDARY==\nContent-Type: text/cloud-config; charset=\"us-ascii\"\n\npackages:\n- wget\n\nruncmd:\n- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh\n- bash /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh -b -f -p /home/ec2-user/miniconda\n- /home/ec2-user/miniconda/bin/conda install -c conda-forge -y awscli\n- rm /home/ec2-user/Miniconda3-latest-Linux-x86_64.sh\n\n--==BOUNDARY==--\n"
                        }
                    }
                }
            }
        }
    },
    "Outputs": {
        "LaunchTemplateId": {
            "Description": "NFS EC2 Launch Template ID for AWS Batch use ",
            "Value": {
                "Ref": "EC2LaunchTemplate"
            }, "Export": {"Name": "batch-ec2-launchtemplate"} 
      } 
    }
}
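
The AWS CLI that this template installs under /home/ec2-user/miniconda is what Nextflow invokes on each executor node to stage task data to and from S3. As a minimal sketch, the corresponding nextflow.config entry points at that path:

nextflow-config

// Tell Nextflow where the launch template installed the AWS CLI.
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'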

Conclusions

The Standardized Corteva Quantification Pipeline implemented with Nextflow and AWS Batch + EC2 Spot Instances has resulted in significant savings in time and resources, as one can spin up a multitude of data analytics jobs on an as-needed basis without maintaining always-on infrastructure. The pipeline has given Corteva researchers worldwide a best-practices platform to quantify expression data at scale, letting them focus on data analysis rather than on computational scalability and on-premises infrastructure costs.

After deployment, this pipeline continues to be heavily used to generate large datasets in a time-critical manner. Data generated by the pipeline feeds into an in-house expression data repository. The importance of this pipeline is underscored by the fact that it serves as a crucial component for researchers developing innovative ways to assess regulatory and genic regions of genomes.

The Nextflow + AWS Batch solution we developed for the Standardized Corteva Quantification Pipeline will serve as a prototype for developing similar computationally intensive genomics pipelines (e.g., annotation or assembly) of varying complexities that are well positioned to be orchestrated in Nextflow. These efforts will let researchers answer important biological and computational questions while the deployed pipelines gain fault tolerance, automation, and reproducibility. The prototype will also guide the migration of legacy bioinformatics workflows to Nextflow, helping overcome maintenance and reproducibility issues in those pipelines.

Next Steps

To learn more about the technologies in this blog, see Nextflow on AWS Batch and AWS Biotech Blueprint with Nextflow.

Anand Venkatraman

Anand Venkatraman (PhD Bioinformatics and AWS 3X Certified) is a Bioinformatics Associate Research Scientist at Corteva Agriscience with 12 years of diverse experience as a scientific and technical lead in the bioinformatics and scientific computing domain. He is an avid photographer who loves traveling in his spare time and exploring unsung destinations off the beaten track.

Srinivasarao Annapareddi

Srinivasarao Annapareddi (Srini) is a Cloud & DevOps Engineer at Corteva Agriscience. Srini is an AWS Certified Solutions Architect Associate with 8 years of experience in cloud infrastructure and administration, and an expert in data migration and automation for all things cloud and on-premises. In his spare time, Srini enjoys home-cooked food, his favorite being Hyderabadi Dum Biryani. Srini is always trying to learn new things and loves sharing his professional knowledge.