biotech blueprint

Nextflow

Workflow orchestration for genomics analysis on AWS

This Quick Start deploys a genomics analysis environment on the Amazon Web Services (AWS) Cloud, using Nextflow to create and orchestrate analysis workflows and AWS Batch to run the workflow processes.

Nextflow is an open-source workflow framework and domain-specific language (DSL) for Linux, developed by the Comparative Bioinformatics group at the Barcelona Centre for Genomic Regulation (CRG). The tool enables you to create complex, data-intensive workflow pipeline scripts, and simplifies the implementation and deployment of genomics analysis workflows in the cloud.

This Quick Start is for teams or individuals who manage informatics infrastructure and genomics analysis for a biotech company.

The Quick Start deploys Nextflow into the infrastructure set up by the Biotech Blueprint core Quick Start. If you would rather deploy into an existing virtual private cloud (VPC) or create a new standalone VPC, follow the Genomics Workflows on AWS instructions instead. If you're new to AWS or don't already have a strong VPC architecture, we recommend that you first use the Biotech Blueprint core Quick Start to set up the landing zone for future AWS usage. That landing zone is automatically configured for identity management, access control, encryption key management, network configuration, logging, alarms, partitioned environments, and built-in compliance auditing to help meet your security and compliance requirements.


This Quick Start was developed by AWS solutions architects.

  •  What you'll build
    This Quick Start sets up the following environment in a preclinical VPC:

    • In the public subnet, an optional Jupyter notebook in Amazon SageMaker that is integrated with an AWS Batch environment.
    • In the private application subnets, an AWS Batch compute environment for managing Nextflow job definitions and queues, and for running Nextflow jobs. The AWS Batch containers run in an Auto Scaling group and have Nextflow installed and configured.
    • Because there are no databases required for Nextflow, this Quick Start does not deploy anything into the private database (DB) subnets created by the Biotech Blueprint core Quick Start.
    • An Amazon Simple Storage Service (Amazon S3) bucket to store your Nextflow workflow scripts, input and output files, and working directory.  

    For more information about the preclinical VPC and other infrastructure components, see the Biotech Blueprint core Quick Start.
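
To illustrate how the AWS Batch environment and the S3 bucket relate to each other, here is a minimal sketch that renders a Nextflow configuration file pointing Nextflow at an AWS Batch queue and an S3 working directory. The settings `process.executor`, `process.queue`, `workDir`, and `aws.region` are standard Nextflow configuration options; the bucket name, queue name, and the `work` prefix are hypothetical placeholders, and the configuration that the Quick Start actually installs in its containers may differ.

```python
def render_nextflow_config(bucket: str, queue: str, region: str) -> str:
    """Return nextflow.config text that targets AWS Batch and an S3 work directory.

    All argument values are placeholders for illustration; use the resource
    names created by your own deployment.
    """
    lines = [
        "// Sketch only: values are assumptions, not the Quick Start defaults.",
        "process.executor = 'awsbatch'",    # run each process as an AWS Batch job
        f"process.queue = '{queue}'",       # AWS Batch job queue to submit to
        f"workDir = 's3://{bucket}/work'",  # intermediate files are kept in S3
        f"aws.region = '{region}'",         # Region of the Batch environment
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names, shown only to demonstrate the call.
    print(render_nextflow_config("example-nextflow-bucket", "example-queue", "us-east-1"))
```

Keeping the working directory in the same bucket as the scripts and results mirrors the single-bucket layout described above, but a separate prefix or bucket would work equally well.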

  •  How to deploy
    To deploy Nextflow on AWS, follow the instructions in the deployment guide. The deployment process includes these steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com, and sign in to your account.
    2. If you haven’t already deployed the Biotech Blueprint core Quick Start, do so now.
    3. Launch the Quick Start. The deployment takes about 10 minutes. The Quick Start is available in the following AWS Regions: US East (N. Virginia), US West (Oregon), and EU (Ireland).
    4. Run a sample Nextflow script.
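
The last step could be sketched as follows, assuming you submit the sample script as an AWS Batch job through boto3. The `submit_job` call and its `jobName`, `jobQueue`, `jobDefinition`, and `containerOverrides` parameters belong to the AWS Batch API; the helper names, the job queue and job definition names, and the idea of passing the script's S3 URI as the container command are assumptions for illustration. The deployment guide describes the actual procedure. The Region check reflects the three Regions listed in step 3.

```python
# Region codes for US East (N. Virginia), US West (Oregon), and EU (Ireland).
SUPPORTED_REGIONS = {"us-east-1", "us-west-2", "eu-west-1"}


def build_submit_job_request(job_name, job_queue, job_definition, script_uri, region):
    """Build keyword arguments for the AWS Batch submit_job call.

    The queue, job definition, and script URI are hypothetical; take the
    real values from the resources created by your deployment.
    """
    if region not in SUPPORTED_REGIONS:
        raise ValueError(f"Region {region!r} is not in the list for this Quick Start")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Assumption: the container accepts the script location as its command.
        "containerOverrides": {"command": [script_uri]},
    }


def submit(request, region):
    """Submit the job and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder works without it

    client = boto3.client("batch", region_name=region)
    return client.submit_job(**request)["jobId"]


if __name__ == "__main__":
    request = build_submit_job_request(
        "nextflow-sample",
        "example-queue",
        "example-nextflow-job-definition",
        "s3://example-nextflow-bucket/scripts/sample.nf",
        "us-east-1",
    )
    print(request)
```

Separating the request builder from the network call lets you inspect or log the request before anything is submitted.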
  •  Cost and licenses
    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation template for this Quick Start includes configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you will be using. Prices are subject to change.

    Tip: After you deploy the Quick Start, we recommend that you enable the AWS Cost and Usage Report to track costs associated with the Quick Start. This report delivers billing metrics to an S3 bucket in your account. It provides cost estimates based on usage throughout each month and finalizes the data at the end of the month. For more information about the report, see the AWS documentation.

    Nextflow is free, open-source software that is distributed under the Apache 2.0 license.