Analyzing Genomic Data using Amazon Genomics CLI and Amazon SageMaker

Dr. Arlind Nocaj, Solutions Architect, Olivia Choudhury, PhD, Senior Partner Solutions Architect, Healthcare & Life Sciences, Lee Pang, PhD, Principal Dev Advocate – Genomics, Anuj Guruacharya, PhD, Health care and life sciences data scientist

Introduction

Whole genomes are a key to developing precision therapeutics and targeted health strategies. To date, the cost of generating a whole human genome has fallen to under $1000, making the technology more widely accessible and feasible for large-scale population sequencing initiatives to adopt. As a result, the volume of data has grown exponentially in the past few years. Public repositories like the NCBI Sequence Read Archive and the European Nucleotide Archive host many petabytes of data combined. Given the incredible size of genomics datasets, customers need tooling and workflows that leverage the cloud to scale efficiently and cost effectively.

The challenge faced by many customers who are new to AWS is deploying and managing the resources they need to run genomics workflows and derive insights from them. Amazon Genomics CLI (AGC) helps solve some of these challenges. It is an open-source tool for genomics and life science customers that simplifies and automates the deployment of cloud infrastructure, providing an easy-to-use command line interface to quickly set up and run genomics workflows on AWS, specified by languages like Workflow Description Language (WDL). By removing the heavy lifting from setting up and running genomics workflows in the cloud, software developers and researchers can automatically provision, configure, and scale cloud resources to enable faster and more cost-effective population-level genetics studies, drug discovery cycles, and more.

In this blog post, we demonstrate how to leverage AGC and Amazon SageMaker to analyze large-scale exome sequences and derive meaningful insights. As an example, we use variant calling to identify single base pair mutations and visualize the resulting relationships to build new hypothesis for research.

We use the bioinformatics workflow manager Nextflow, which can deploy workflows on various execution platforms and support underlying workflow dependencies. To mitigate the overhead of manually setting up compute environments and batch queues, we deploy AGC to run existing Nextflow workflows from nf-core without major modification. nf-core is an open-source repository of publicly available genomics pipelines built using Nextflow. As an additional step, we use Amazon SageMaker Notebook instances to interactively analyze and explore the resulting data from genomics workflows. Alternatively, you can also use Amazon SageMaker Studio, a web-based integrated development environment (IDE) for advanced analytics, including machine learning.

Figure 1: Architecture for executing genomics workflows with Amazon Genomics CLI (AGC). AGC automates deployment of cloud infrastructure to quickly set up and run genomics workflows (Nextflow, Snakemake, Cromwell, miniWDL) on AWS. Users can interactively analyze the results of these workflows using Amazon SageMaker Notebooks.

Prerequisite

In the AWS account you want to perform the workflows, you need to make sure to start a SageMaker notebook instance and install AGC within the notebook instance, either by following the getting started guide or by using the sample notebook from this blog.

You can use SageMaker notebook instances for all steps as they allow integration of markdown comments, terminal commands, and R/python commands in a single interface. All steps can be performed as cell commands within the sample notebooks from this blog.

Workflow

The overall workflow consists of the following steps:

Preprocessing: Since most genomics workflows require data to be in FASTQ format, we need to convert the exome sequences from the Sequence Read Archive (SRA) to FASTQ format.
Variant Calling: To identify relevant mutations that differ by a single basepair (SNP mutations), we use the Sarek pipeline. Sarek is a workflow designed to detect variants on whole genome or targeted sequencing data. It generates Variant Call Format (VCF) files containing details of these mutations.
Analyzing Mutations: We perform filtering of the VCF files for the genes of interest and create a larger data-frame that combines all samples for visualization. They can be used for downstream machine learning analytics, if needed.

Step1 – Preprocessing and setting up using AGC

Files in the SRA public database are stored in the SRA format. To be usable for genomic processing, we need to convert it to FASTQ format. A simple approach for conversion is to use existing genomics workflows, like fetchngs from nf-core. It offers multiple mechanisms to download SRA data and convert them to FASTQ format using fasterq-dump. Execution of these workflows can be outsourced to scalable cloud resources using AGC, which uses AWS Batch for the workflow execution. This means you can perform the preprocessing without having to manually use larger Amazon Elastic Compute Cloud (Amazon EC2) instances and storage.

To call the fetchngs pipeline for our list of SRA IDs, we create a fetchngs folder within the AGC workflows folder.

The inputs.json and MANIFEST.json define how the pipeline will be called with AGC.

agc-project
├── agc-project.yaml
└── workflows
    └── fetchngs
       ├── MANIFEST.json
       ├── ids.txt
       └── inputs.json

MANIFEST.json defines the main workflow URL and inputs.json to be used with the input parameters. To run fetchngs with the existing AGC version, we adjusted the Nextflow version to match the current Nextflow version supported by AGC. As the time of this blog, AGC supports NextFlow version 21.04.3. Either the user can fork the nf-core/fetchngs public repository and replace the nextflow.config file to point to a different version of Nextflow or the user can use the version 1.4 of nf-core/fetchngs that uses NextFlow version 21.04.3 as we have done below.

{
"mainWorkflowURL": "https://github.com/nf-core/fetchngs/tree/e1cb675c04392bcc799cc58bc19120904828b01f",
"inputFileURLs": ["inputs.json"]
}

inputs.json defines the parameters for the fetchngs workflow as described by the usage docs. In our case, we create and copy the file ids.txt, which contains a list of SRA IDs, to S3 and set input to its path. In addition, we store the output files in the S3 output directory.

{
"input": "s3://roda3/ids.txt",
"outdir": "s3://roda3/fetchngs_fastq"
}

After setting up AGC, as described in its getting-started tutorial, we deploy the context

agc context deploy —context bigMemCtx

We then run the fetchngs workflow.

agc workflow run fetchngs —context bigMemCtx

Notice that we do not have to set up any complex environments. AGC deploys it, based on the context defined in `agc-project.yaml`. Only the required instances defined in the YAML are utilized to run the workflow in a distributed manner. The overall structure of `agc-project.yaml` can be reviewed in the AGC documentation. All genomics workflow-related computational resources are shut down when the workflow completes or a maximum duration has passed, as specified by the default time directive in the nf-core scripts.

Step2: Variant Calling using AGC

We use the Sarek pipeline, offered through Allele Frequency Community (AFC), to perform variant calling.

By default, this pipeline executes the following tools:

Sequencing quality control (FastQC)
Map Reads to Reference (BWA mem)
Mark Duplicates (GATK MarkDuplicatesSpark)
Base (Quality Score) Recalibration (GATK BaseRecalibrator, GATK ApplyBQSR)
Preprocessing quality control (samtools stats)
Preprocessing quality control (Qualimap bamqc)
Overall pipeline run summaries (MultiQC)

Similar to Step 1, we create a folder under workflows to define how Sarek should be invoked.

agc_project/
├── agc-project.yaml
└── workflows
    ├── sarek
    │ ├── inputs.json
    │ ├── input.tsv
    │ └── MANIFEST.json

The input.tsv (Tab Separated Value (TSV) file) defines a list of input files, in FASTQ format, which should be used to perform the variant calling.

A TSV files was created such that it would be useful for both the mapping steps and the variant calling steps.

Table 1: A TSV file needs to be created to enable variant calling. It is used by the NextFlow script to identify tumor vs normal samples. It also provides location of the files and naming convention for the samples.

The inputs.json defines the actual parameters for the Sarek workflow. An extensive list of available parameters can be found in the Sarek usage documentation.

{
  "input": "s3://roda3/input.tsv",
  "outdir": "s3://roda3/results_variant_calling_3samples",
  "tools": "Mutect2,snpEff,VEP"
}

All these parameters get propagated automatically to Nextflow when starting the workflow with AGC. We copy the TSV file to S3 and trigger the workflow with AGC.

!aws s3 cp workflows/sarek/input.tsv s3://roda3/
!agc workflow run sarek —context bigMemCtx

To track the logs of the distributed workflow, we run:

!agc logs engine —context bigMemCtx

The Sarek workflow contains many subsequent steps which need to be invoked. Since each step has different compute and memory requirements, it automatically provisions resources. Multiple small steps can be executed on the same Amazon EC2 instances. You can also easily utilize spot instances by activating a switch in the `agc-project.yaml` file. For variant calling, we test the somatic mutation pipeline with the tool Mutec2. The output of Mutec2 is a filtered VCF file, which is used for the analytics step.

After starting the Sarek workflow with multiple input files, you can easily view the utilization of EC2 instances using Amazon CloudWatch. As shown in Figure 2, there is a high CPU usage for a particular step after the genome files have been downloaded onto the instances.

Figure 2: Amazon CloudWatch dashboard for observing resource utilization after starting Sarek genomics workflow using AGC.

Step3: Analyzing Mutations using Amazon SageMaker

SageMaker can be set up using the instructions provided by SageMaker documentation. It supports both Python and R kernels for Jupyter notebooks. For the purpose of demonstration, we use the R kernel. The R user guide provides an introduction to SageMaker and R. The large and diverse set of powerful tools and packages like ggplot or visNetwork enables rapid iterations for analytics. Note that instead of using Jupyter notebooks, you can also use the fully managed RStudio notebooks within SageMaker for Python kernels. Input for this step is the set of VCF files which contain information about the mutations. We perform cleaning and filtering of the VCF files using R in the Jupyter notebook. We concatenate the resulting dataframes to analyze and visualize them together.

We create a sample script that cleans and filters the VCF file for the BRCA2 gene, which is a human tumor suppressor gene found in all humans. Its protein is responsible for repairing DNA. Each identified SNP mutation of the BRCA2 gene is listed along with the associated sample number. The Jupyter notebook is available in our GitHub repo.

Figure 3: Cleaning up the VCF files to extract SNPs for genes of interest can be performed in SageMaker. We can obtain a table of all genes of interest, SNPs, and the sample ID the genes are extracted from.

As the first step of downstream analysis, customers often want to study the mutations detected for each sample. You can visualize the detected mutations together with the samples using network visualization. This gives insights into the most relevant mutations related to a particular cancer type.

We use the visNetwork package for R to visualize this dataset, which contains relational data for multiple patient samples and their gene mutations. As shown in Figure 4, some mutations, denoted by red nodes, are common among all patients. Although for the purpose of demonstration, we provide a simple network of a small dataset, custom Graph neural net models on billions of variants can be implemented using SageMaker. In Figure 4, the samples and their SNP mutations are shown such that the mutations that are repeated among multiple samples have a densely connected links. The visualization can be used as a starting point for building new hypothesis for research.

Figure 4: Multiple samples from different breast cancer patients (blue nodes, labelled with prefix “SAMPLE”) and the detected BRCA2 mutations (red nodes labelled with prefix with “rs”). The cluster of red nodes at the center denote the most common mutations.

Costs

We can categorize the incurred cost into two categories. The first category comprises the cost of utilizing AWS resources needed to run the genomics workflow, as defined in steps 1 and 2. These resources are provisioned and deleted automatically through AGC. The second category is related to notebook usage time for the analytics in step 3.

AGC automatically tags resources so that you can use tools like AWS Cost & Usage Report to understand the costs related to genomics data analysis across multiple AWS services. Make sure to activate the AGC-tags like `agc-project` and `agc-context` under Billing – Cost Allocation Tag so that the costs can be split up in AWS Cost Explorer.

Workflow costs

Step 1:

We measure the runtime for two samples SRR7890830 and SRR7890831. The download and conversion from SRA to FASTQ format for these two files using the public SRA FTP took 1 hr 14 mins. The cost of conversion for these two files is $0.46, which can be obtained in the Cost Explorer by using AGC-defined resource tags.

Figure 5: Preprocessing of two SRA files to convert to FASTQ format incurred $0.46 using AGC. It did not require setting up and maintaining cloud resources or HPC clusters.

Step 2:

For the given two samples, the overall runtime is 7 hr 45 mins. The overall cost, which can be viewed using the specific AGC-context tag in Cost Explorer is $16.52 . This is obtained using the spot flag in AGC `requestSpotInstances: true` . A snippet from the execution time of each subprocess of the Sarek flow is shown in Figure 6.

Figure 6: Preview from the execution of nf-core Sarek workflow for mutation extraction using AGC. AGC sets up the required configurations in AWS Batch so that you can run existing workflows, e.g. Nextflow, in which each process is provisioned the requested resources when needed.

Step 3:

This analytics step is iterative in nature and requires only smaller instances. The cost depends on the actual time a notebook is being used by a developer or data scientist. For example, the cost incurred by a ml.t2.medium instance, typically used by SageMaker notebooks, is $1.11, when used for 8 hours a day for 3 days.

Running the workflow on your projects

We provide two Jupyter notebooks with all the steps and snippets you need to be able to run the same or similar pipelines for your use case. The main steps include:

Set up a notebook instance in SageMaker
Check out our GitHub repository: https://github.com/aws-samples/amazon-genomics-cli-sagemaker-sample-code
Make sure to adjust the agc-project.yaml to point to your S3 buckets
Use the provided notebook ‘step01_step02_preprocess_variant-calling’ for the heavy genomics workflows. The code inside these notebooks can also be executed on AWS Cloud9 or an Amazon EC2 instance with very lightweight memory and compute setup.
Use the provided notebook ‘step_03_Analytics_r’ to perform analytics and visualization using R.

Note that you can also use AGC inside Cloud9 IDE or on an EC2 instance with access to execute shell commands.

Cleaning up

To avoid incurring future charges, shut down the SageMaker notebook instances that were running and also delete the genomic files that were downloaded into S3 buckets. In addition, you should run agc account deactivate to make sure that all its resources are being removed.

Conclusion

In this blog, we present a solution based on Amazon Genomics CLI and Amazon SageMaker, which automatically leverages the scalability and cost-effectiveness of AWS Batch infrastructure while providing a powerful IDE interface for advanced analytics.

Using this solution, you can easily run Nextflow workflows from nf-core on the given exome sequences and interactively analyze the outputs using Amazon SageMaker. Here, we demonstrate building a simple edge-node network to visualize relationships in our data. However, Amazon SageMaker ML capabilities supports more complex analysis for customers wanting to develop customized graph neural networks to model relationships from billions of records.

References:

Garcia M, Juhos S, Larsson M et al. Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 2; peer review: 2 approved] F1000Research 2020, 9:63 doi: 10.12688/f1000research.16665.2.
Mercer, T.R., Xu, J., Mason, C.E. et al. The Sequencing Quality Control 2 study: establishing community standards for sequencing in precision medicine. Genome Biol 22, 306 (2021). https://doi.org/10.1186/s13059-021-02528-3

AWS HPC Blog