AWS Storage Blog

Using Amazon FSx for Lustre for Genomics Workflows on AWS

Genomics datasets are getting larger year-over-year. Combining data from research initiatives across the globe, and having the ability to process it quickly, has been identified by large bioinformatics and genomics communities as an enabling mechanism for making significant scientific discoveries. Collaborating on a global scale requires data storage solutions that are globally accessible, highly available, and allow high-performance data processing.

Amazon FSx for Lustre combines the ease of use of a high-performance POSIX-compliant shared file system with the industry-leading scalability and data availability of Amazon Simple Storage Service (Amazon S3). Amazon FSx for Lustre works natively with Amazon S3 and transparently presents S3 objects as files, making it easy for you to process cloud datasets with high-performance file systems and write results back to S3. Having data stored in Amazon S3 enables downstream analysis with analytics and machine learning services such as Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker.

In this blog post, I show you how you can easily use Amazon FSx for Lustre to simplify and accelerate a genomics workflow on AWS.

Genomics Workflows

Genomics workflows are typically composed of multiple command line tools designed to work with files. That is, they take files like FASTQs or BAMs as inputs and generate files like TSVs/CSVs or VCFs as outputs.

The genomics workflow we are going to use is a secondary analysis pipeline. This particular pipeline processes raw whole genome sequences into variants – differences in the sequence relative to a standard reference – using a set of containerized tools.

In previous blog posts, we described how to build and run such workflows in a scalable manner using AWS Batch and AWS Step Functions. Our reference architecture uses S3 as durable storage for data inputs and outputs, and uses the AWS CLI to stage data into and out of compute jobs backed by appropriately sized EBS volumes.

While this is the most cost-effective way to manage large amounts of data (often >100 GB per sample), it can introduce delays in the workflow while data is transferred into and out of compute nodes. It also requires tooling that can interact with S3. With containerized tools, that means incorporating the AWS CLI into the container image. If you have numerous pre-built containerized tools, or rely on public sources, this can mean many images to rebuild and manage.
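For illustration, the staging pattern inside such a container typically looks something like the following. This is a sketch only: the bucket names, file names, and tool invocation are hypothetical placeholders, not part of the reference architecture.

# Stage inputs from S3 to instance-local (EBS-backed) scratch space
aws s3 cp s3://my-genomics-datalake/samples/NIST7035_R1.fastq.gz .
aws s3 cp s3://my-genomics-datalake/samples/NIST7035_R2.fastq.gz .

# Run the tool against the local files
bwa mem reference.fasta NIST7035_R1.fastq.gz NIST7035_R2.fastq.gz > NIST7035.sam

# Stage outputs back to S3
aws s3 cp NIST7035.sam s3://my-genomics-results/NIST7035/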

Most genomics and bioinformatics tools are built to work with files on a POSIX-compliant file system. In addition, it is common for data inputs or intermediate files to be used more than once in a genomics workflow. For example, a reference genome bundle (typically ~10 GB in size) is used both to align raw sample reads and to call variants, two different steps of a workflow. With a shared file system mounted across workflow steps, you can easily access data that was staged or generated in earlier steps.

Amazon FSx for Lustre for genomics workflows

Here we can use Amazon FSx for Lustre as a performant shared file system mounted on all of the compute instances that AWS Batch launches for the workflow. Amazon FSx for Lustre also provides transparent synchronization of data with S3, effectively acting as a caching layer in front of S3 that is accessible as a parallel POSIX file system.

Architecture for using Amazon FSx for Lustre for genomics workflows

POSIX metadata enhancements to Amazon FSx for Lustre and the Data Repository Task API also make it easier and cheaper to process S3 data at high speeds for a broad set of workloads. This includes workloads that require processing large numbers of small files, like streaming ticker data and financial transactions in the financial services industry. More relevantly, it can be useful for workloads that require access controls on sensitive data, such as whole human DNA sequences in the healthcare and life sciences industry.

For our genomics workflow we’re going to create three filesystems:

The first is for reference data – that is, the hg38 reference human genome. This uses a public S3 bucket contributed by the Broad Institute and hosted by the AWS Public Datasets program.

The second is for source data. Here I’m using a couple of raw sequence files (FASTQs) that have been reduced in size for demonstration purposes and placed in a public bucket. In a real-world scenario, you would likely use a private bucket for this – that is, a bucket your organization uses as a genomics data lake.

The third and final filesystem is for working data. This is used to store outputs generated by the workflow. It is associated with a private S3 bucket created just for this demo. In a real-world scenario, this bucket could be one associated with your research group.

To provision the storage and compute infrastructure needed, I have an AWS CloudFormation template that creates:

  • An Amazon VPC to isolate compute resources
  • An Amazon S3 bucket for workflow outputs
  • The Amazon FSx for Lustre filesystems we need for reference, source, and output data
  • An AWS Batch Job Queue and Compute Environment to execute workflow steps
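As a rough sketch, deploying this template might look like the following. The stack name, template file name, and parameter values here are hypothetical; the parameter keys match those referenced in the template snippets below.

aws cloudformation create-stack \
    --stack-name fsx-genomics-workflow \
    --template-body file://fsx-genomics-resources.template.yaml \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
        ParameterKey=S3ReferencePath,ParameterValue=s3://<reference-bucket>/hg38 \
        ParameterKey=S3SourceDataPath,ParameterValue=s3://<source-bucket>/fastq \
        ParameterKey=S3WorkingPath,ParameterValue=s3://<working-bucket>/results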

Looking at the Amazon FSx for Lustre resources in the template…

Resources:
    ...
  FSxLustreReferenceFileSystem:
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: LUSTRE
      LustreConfiguration:
        ImportPath: !Ref S3ReferencePath
      StorageCapacity: !Ref LustreStorageCapacity
      SecurityGroupIds:
        - !Ref EC2BatchSecurityGroup
      SubnetIds:
        - !Sub "${VpcStack.Outputs.PrivateSubnet1AID}"
  
  FSxLustreSourceDataFileSystem:
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: LUSTRE
      LustreConfiguration:
        ImportPath: !Ref S3SourceDataPath
      StorageCapacity: !Ref LustreStorageCapacity
      SecurityGroupIds:
        - !Ref EC2BatchSecurityGroup
      SubnetIds:
        - !Sub "${VpcStack.Outputs.PrivateSubnet1AID}"

  FSxLustreWorkingFileSystem:
    Type: AWS::FSx::FileSystem
    Properties:
      FileSystemType: LUSTRE
      LustreConfiguration:
        ImportPath: !Ref S3WorkingPath
        ExportPath: !Ref S3WorkingPath
      StorageCapacity: !Ref LustreStorageCapacity
      SecurityGroupIds:
        - !Ref EC2BatchSecurityGroup
      SubnetIds:
        - !Sub "${VpcStack.Outputs.PrivateSubnet1AID}"
    ...

Notice that only the “Working” filesystem has an export path, which is the same as the import path. Thus, the “Working” filesystem is the only one that can have Amazon FSx for Lustre write data back to S3. This also means that the other two filesystems for Reference and Source data are read-only from an S3 perspective.
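Once the stack is up, a quick way to confirm how each file system is linked to S3 is to inspect its data repository configuration; for example (the file system ID below is a placeholder):

aws fsx describe-file-systems \
    --file-system-ids fs-0123456789abcdef0 \
    --query "FileSystems[].LustreConfiguration.DataRepositoryConfiguration"

The output for the “Working” file system should show an ExportPath matching its ImportPath.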

Next, let’s look at the AWS Batch resources.

Here, we’ve created only one job queue and associated it with a single compute environment. The compute environment uses Spot Instances, which help maximize compute cost savings.

Resources:
    ...
  BatchSpotComputeEnv:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      ComputeEnvironmentName: !Sub 
        - spot-${StackGuid}
        - StackGuid: !Select [ 2, !Split [ "/", !Ref "AWS::StackId" ]]
      ServiceRole: !GetAtt IAMBatchServiceRole.Arn
      Type: MANAGED
      State: ENABLED
      ComputeResources:
        MinvCpus: 2
        DesiredvCpus: 2
        MaxvCpus: 256
        LaunchTemplate:
          LaunchTemplateId: !Ref EC2LaunchTemplate
        InstanceRole: !GetAtt IAMBatchInstanceProfile.Arn
        InstanceTypes:
          - optimal
        SecurityGroupIds:
          - !Ref EC2BatchSecurityGroup
        SpotIamFleetRole: !GetAtt IAMBatchSpotFleetRole.Arn
        Subnets:
          - !Sub "${VpcStack.Outputs.PrivateSubnet1AID}"
          - !Sub "${VpcStack.Outputs.PrivateSubnet2AID}"
        Type: SPOT
        Tags:
          Name: !Sub
            - batch-spot-worker-${StackGuid}
            - StackGuid: !Select [ 2, !Split [ "/", !Ref "AWS::StackId" ]]

  BatchDefaultQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: !Sub
        - default-${StackGuid}
        - StackGuid: !Select [ 2, !Split [ "/", !Ref "AWS::StackId" ]]
      Priority: 1
      State: ENABLED
      ComputeEnvironmentOrder:
        - Order: 1
          ComputeEnvironment: !Ref BatchSpotComputeEnv
    ...

The compute environment also uses an Amazon EC2 launch template with a UserData script that provisions each instance it launches by installing the Lustre client and mounting our Amazon FSx for Lustre file systems.

Resources:
    ...
  EC2LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Join ["-", [!Ref LaunchTemplateNamePrefix, !Select [2, !Split ["/", !Ref "AWS::StackId" ]]]]
      LaunchTemplateData:
        ...
        UserData:
          Fn::Base64: !Sub |-
            MIME-Version: 1.0
            Content-Type: multipart/mixed; boundary="==BOUNDARY=="

            --==BOUNDARY==
            Content-Type: text/cloud-config; charset="us-ascii"

            packages:
            - lustre-client
            - amazon-ssm-agent

            runcmd:
            - start amazon-ssm-agent
            - mkdir -p /scratch/reference /scratch/data /scratch/working
            - mount -t lustre ${FSxLustreReferenceFileSystem}.fsx.${AWS::Region}.amazonaws.com@tcp:/fsx /scratch/reference
            - mount -t lustre ${FSxLustreSourceDataFileSystem}.fsx.${AWS::Region}.amazonaws.com@tcp:/fsx /scratch/data
            - mount -t lustre ${FSxLustreWorkingFileSystem}.fsx.${AWS::Region}.amazonaws.com@tcp:/fsx /scratch/working

            --==BOUNDARY==--

In this case, all of our Amazon FSx for Lustre file systems are mounted to paths in /scratch. Specifically:

  • /scratch/reference for our reference data filesystem
  • /scratch/data for our raw input data filesystem
  • /scratch/working for our workflow output data filesystem
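If you connect to one of the instances launched by the compute environment (for example, with SSM Session Manager, since the SSM agent is installed in the UserData above), a quick sanity check of the mounts could look like this:

# list the Lustre mounts created by the launch template UserData
mount -t lustre

# check capacity and usage of each file system
df -h /scratch/reference /scratch/data /scratch/working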

Let’s create the tooling for the workflow.

The workflow is a simple one that uses:

  • BWA-mem to align sequence data
  • SAMtools to sort and index the alignment
  • BCFtools to make variant calls
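Conceptually, these steps boil down to commands like the following, run against the shared /scratch paths. This is a sketch for orientation only: the thread count, file names, and exact options are assumptions, and the actual containers wrap these calls in their own entrypoint scripts.

# align reads to the reference (file names are illustrative)
bwa mem -t 8 /scratch/reference/Homo_sapiens_assembly38.fasta \
    /scratch/data/NIST7035_R1.fastq.gz /scratch/data/NIST7035_R2.fastq.gz \
    > /scratch/working/NIST7035.sam

# sort and index the alignment
samtools sort -o /scratch/working/NIST7035.bam /scratch/working/NIST7035.sam
samtools index /scratch/working/NIST7035.bam

# call variants
bcftools mpileup -f /scratch/reference/Homo_sapiens_assembly38.fasta \
    /scratch/working/NIST7035.bam \
  | bcftools call -mv -o /scratch/working/NIST7035.vcf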

To build the workflow, I have a CloudFormation template that creates:

  • Container images and ECR image repositories for the workflow tools
  • AWS Batch Job Definitions for the workflow tools
  • An AWS Step Functions State Machine for our workflow

In AWS Batch, let’s look at the Job Definition for one of the tools. The important part to note is that we’ve created a volume mount specification for the job that points to the scratch location on the host – where all our Amazon FSx filesystems are mounted.

{
    "jobDefinitionName": "bwa",
    "type": "container",
    "parameters": {
        "command": "mem",
        "reference_name": "Homo_sapiens_assembly38",
        "sample_id": "NIST7035",
        "input_path": "./data"
    },
    "containerProperties": {
        "image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/bwa:aws",
        "vcpus": 8,
        "memory": 64000,
        "command": [
            "Ref::command",
            "Ref::reference_name",
            "Ref::sample_id",
            "Ref::input_path"
        ],
        "volumes": [
            {
                "host": {
                    "sourcePath": "/scratch"
                },
                "name": "scratch"
            }
        ],
        "mountPoints": [
            {
                "containerPath": "/scratch",
                "sourceVolume": "scratch"
            }
        ]
    }
}
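The JSON above is in the shape accepted by the AWS Batch RegisterJobDefinition API, so if you were registering it by hand instead of through the CloudFormation template, a call like the following would work (the file name is a placeholder):

aws batch register-job-definition --cli-input-json file://bwa-job-definition.json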

The Step Functions State Machine we are using is a simple linear workflow that chains bwa-mem, samtools, and bcftools.

At the end of the workflow, we have an extra task – one that creates a Data Repository Task to sync the generated results back to S3.

"ExportToDataRepository": {
    "Type": "Task",
    "InputPath": "$",
    "ResultPath": "$.fsx.data_repository_task.result",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "export-to-data-repository",
        "JobDefinition": "${BatchJobDefFSxDataRepositoryTask}",
        "JobQueue.$": "$.defaults.queue",
        "ContainerOverrides": {
            "Vcpus": 2,
            "Memory": 8000,
            "Environment": [
                { "Name": "WORKFLOW_NAME", "Value.$": "$$.StateMachine.Name" },
                { "Name": "EXECUTION_ID", "Value.$": "$$.Execution.Id" }
            ]
        }
    },
    "End": true
}

The AWS Batch Job Definition for this task:

BatchJobDefFSxDataRepositoryTask:
    Type: AWS::Batch::JobDefinition
    Properties:
      JobDefinitionName: fsx-data-repo-task
      Type: container
      ContainerProperties:
        Image: amazonlinux:2
        Vcpus: 2
        Memory: 8000
        Command:
          - /opt/miniconda/bin/aws
          - fsx
          - create-data-repository-task
          - --type
          - EXPORT_TO_REPOSITORY
          - --file-system-id
          - !Ref FSxLustreWorkingFileSystem
          - --paths
          - /scratch/working/$WORKFLOW_NAME/$EXECUTION_ID
          - --report
          - !Sub "Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=${S3WorkingPath}"
        Privileged: True
        Volumes:
          - Host:
              SourcePath: /scratch
            Name: scratch
          - Host:
              SourcePath: /opt/miniconda
            Name: aws-cli
        MountPoints:
          - ContainerPath: /scratch
            SourceVolume: scratch
          - ContainerPath: /opt/miniconda
            SourceVolume: aws-cli

This job definition bind mounts the AWS CLI (installed on the host instance via Miniconda) into an Amazon Linux 2 container and issues a create-data-repository-task API call against the working data file system, expressed here as the container’s run command.
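While the export runs, you can follow its progress from your workstation with a call along these lines (the file system ID is a placeholder):

aws fsx describe-data-repository-tasks \
    --filters Name=file-system-id,Values=fs-0123456789abcdef0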

To run the workflow, we can use the AWS CLI locally to start an execution of the state machine:

aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:region:account:stateMachine:<state-machine-name> \
    --cli-input-json file://inputs.json
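The start-execution call returns an execution ARN, which you can poll to follow the workflow’s progress, for example:

aws stepfunctions describe-execution \
    --execution-arn arn:aws:states:region:account:execution:<state-machine-name>:<execution-name> \
    --query "status"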

When the workflow completes (it takes about 10 minutes), there should be a set of files in the S3 bucket used as the data repository for the “working” Amazon FSx for Lustre file system. At that point, the Amazon FSx for Lustre file systems are no longer needed and can be deleted.
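Since all three file systems were created by the CloudFormation template, the simplest cleanup is to delete the stack; you can also delete an individual file system directly. Both commands below use placeholder names and IDs.

# delete an individual file system
aws fsx delete-file-system --file-system-id fs-0123456789abcdef0

# or tear down the entire demo stack
aws cloudformation delete-stack --stack-name fsx-genomics-workflow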

Considerations and closing

In the example above, we used three Amazon FSx for Lustre file systems as temporary scratch storage for data coming from reference S3 buckets and data generated by the workflow.

Cost-wise, using the smallest Amazon FSx for Lustre filesystem capacity (1.2 TB) for all three filesystems comes to about $0.12 for the workflow:

$0.000194 per GB-hr * 1200 GB * 3 filesystems * (10/60) hr = $0.1164

In comparison, the same workflow takes at least 15 minutes to complete when staging data directly to and from S3. Using a 1-TB EBS volume as scratch space for each step of the workflow, and accounting for the longer time to completion, would cost about $0.04:

$0.10 per GB-month / 720 hr per month * 1000 GB * (15/60) hr = $0.035

Real-world workflows typically take 8–12 hours to complete, have more steps, and use more data, so the cost difference between using Amazon FSx for Lustre and self-managed EBS volumes may be smaller.

From a workflow development standpoint, Amazon FSx for Lustre accelerates and enables running genomics workflows on AWS in two ways:

  1. You can continue to use tooling that assumes a POSIX filesystem, making it easier to migrate existing workflows to the cloud.
  2. You can also leverage the durability and availability of S3 for data storage without needing to update your tooling containers.

To learn more, visit the Amazon FSx for Lustre webpage.

Lee Pang

Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science developing tools ranging from high throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.