AWS HPC Blog
Analyzing Genomic Data using Amazon Genomics CLI and Amazon SageMaker
Dr. Arlind Nocaj, Solutions Architect, Olivia Choudhury, PhD, Senior Partner Solutions Architect, Healthcare & Life Sciences, Lee Pang, PhD, Principal Dev Advocate – Genomics, Anuj Guruacharya, PhD, Health care and life sciences data scientist
Introduction
Whole genomes are a key to developing precision therapeutics and targeted health strategies. To date, the cost of generating a whole human genome has fallen to under $1000, making the technology more widely accessible and feasible for large-scale population sequencing initiatives to adopt. As a result, the volume of data has grown exponentially in the past few years. Public repositories like the NCBI Sequence Read Archive and the European Nucleotide Archive host many petabytes of data combined. Given the incredible size of genomics datasets, customers need tooling and workflows that leverage the cloud to scale efficiently and cost effectively.
The challenge faced by many customers who are new to AWS is deploying and managing the resources they need to run genomics workflows and derive insights from them. Amazon Genomics CLI (AGC) helps solve some of these challenges. It is an open-source tool for genomics and life science customers that simplifies and automates the deployment of cloud infrastructure, providing an easy-to-use command line interface to quickly set up and run genomics workflows on AWS, specified in languages like Workflow Description Language (WDL). Because AGC removes the heavy lifting of setting up and running genomics workflows in the cloud, software developers and researchers can automatically provision, configure, and scale cloud resources to enable faster and more cost-effective population-level genetics studies, drug discovery cycles, and more.
In this blog post, we demonstrate how to leverage AGC and Amazon SageMaker to analyze large-scale exome sequences and derive meaningful insights. As an example, we use variant calling to identify single base pair mutations and visualize the resulting relationships to build new hypotheses for research.
We use the bioinformatics workflow manager Nextflow, which can deploy workflows on various execution platforms and handle the underlying workflow dependencies. To mitigate the overhead of manually setting up compute environments and batch queues, we deploy AGC to run existing Nextflow workflows from nf-core without major modification. nf-core is an open-source repository of publicly available genomics pipelines built using Nextflow. As an additional step, we use Amazon SageMaker Notebook instances to interactively analyze and explore the resulting data from the genomics workflows. Alternatively, you can use Amazon SageMaker Studio, a web-based integrated development environment (IDE) for advanced analytics, including machine learning.
Prerequisites
In the AWS account where you want to run the workflows, start a SageMaker notebook instance and install AGC within it, either by following the getting started guide or by using the sample notebook from this blog.
You can use SageMaker notebook instances for all steps because they allow you to combine markdown comments, terminal commands, and R/Python commands in a single interface. All steps can be performed as cell commands within the sample notebooks from this blog.
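For reference, installing AGC from the notebook terminal looks roughly like the sketch below. The download URL is a placeholder (take the current link from the AGC getting started guide), and the install script name reflects the release layout at the time of writing, so treat this as an outline rather than exact commands.

```bash
# Sketch: install AGC inside the SageMaker notebook instance and activate it.
# The release URL is a placeholder; use the link from the AGC getting started guide.
curl -sLo amazon-genomics-cli.zip "<AGC-release-zip-url>"
unzip -o amazon-genomics-cli.zip
./amazon-genomics-cli/install.sh   # installs the agc binary for the current user
agc --help                         # confirm the CLI is available
agc account activate               # one-time setup of core AGC resources in this account/region
```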
Workflow
The overall workflow consists of the following steps:
- Preprocessing: Since most genomics workflows require data to be in FASTQ format, we need to convert the exome sequences from the Sequence Read Archive (SRA) to FASTQ format.
- Variant Calling: To identify relevant mutations that differ by a single basepair (SNP mutations), we use the Sarek pipeline. Sarek is a workflow designed to detect variants on whole genome or targeted sequencing data. It generates Variant Call Format (VCF) files containing details of these mutations.
- Analyzing Mutations: We filter the VCF files for the genes of interest and build a combined data frame across all samples for visualization. The combined data can also be used for downstream machine learning analytics, if needed.
Step 1 – Preprocessing and setting up using AGC
Files in the SRA public database are stored in the SRA format. To be usable for genomic processing, we need to convert them to FASTQ format. A simple approach is to use existing genomics workflows, like fetchngs from nf-core, which offers multiple mechanisms to download SRA data and convert it to FASTQ format using fasterq-dump. Execution of these workflows can be offloaded to scalable cloud resources using AGC, which uses AWS Batch for the workflow execution. This means you can perform the preprocessing without having to manually manage larger Amazon Elastic Compute Cloud (Amazon EC2) instances and storage.
To call the `fetchngs` pipeline for our list of SRA IDs, we create a `fetchngs` folder within the AGC workflows folder. The `inputs.json` and `MANIFEST.json` files define how the pipeline will be called with AGC.
agc-project
├── agc-project.yaml
└── workflows
└── fetchngs
├── MANIFEST.json
├── ids.txt
└── inputs.json
`MANIFEST.json` defines the main workflow URL and the `inputs.json` file that supplies the input parameters. To run fetchngs with the existing AGC version, we adjusted the Nextflow version to match the Nextflow version currently supported by AGC. At the time of writing, AGC supports Nextflow version 21.04.3. You can either fork the nf-core/fetchngs public repository and replace the nextflow.config file to point to a different version of Nextflow, or use version 1.4 of nf-core/fetchngs, which uses Nextflow 21.04.3, as we have done below.
{
"mainWorkflowURL": "https://github.com/nf-core/fetchngs/tree/e1cb675c04392bcc799cc58bc19120904828b01f",
"inputFileURLs": ["inputs.json"]
}
`inputs.json` defines the parameters for the `fetchngs` workflow, as described in its usage docs. In our case, we create the file `ids.txt`, which contains a list of SRA IDs, copy it to S3, and set `input` to its path (an example of creating and staging this file follows the JSON below). The `outdir` parameter specifies the S3 directory where the output files are stored.
{
"input": "s3://roda3/ids.txt",
"outdir": "s3://roda3/fetchngs_fastq"
}
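For example, the ID list can be created and staged from the notebook as follows. This is a sketch: the bucket name is a placeholder for your own bucket, and the two accessions are the samples used later in this post.

```bash
# Create a list of SRA run accessions (one per line) and stage it in S3.
# Replace <your-bucket> with the bucket referenced in inputs.json.
cat > ids.txt <<'EOF'
SRR7890830
SRR7890831
EOF
aws s3 cp ids.txt s3://<your-bucket>/ids.txt
```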
After setting up AGC, as described in its getting-started tutorial, we deploy the context:
agc context deploy --context bigMemCtx
We then run the fetchngs workflow:
agc workflow run fetchngs --context bigMemCtx
Notice that we do not have to set up any complex environments. AGC deploys them based on the context defined in `agc-project.yaml`, and only the instance types defined in the YAML are used to run the workflow in a distributed manner. The overall structure of `agc-project.yaml` can be reviewed in the AGC documentation. All genomics workflow-related computational resources are shut down when the workflow completes or a maximum duration has passed, as specified by the default time directive in the nf-core scripts.
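For orientation, the context portion of `agc-project.yaml` might look roughly like the sketch below. The instance types are placeholders, and the field names reflect the AGC project schema at the time of writing, so verify them against the AGC documentation.

```yaml
# Illustrative context definition (not the exact file used in this post).
contexts:
  bigMemCtx:
    requestSpotInstances: true   # run workflow jobs on EC2 Spot to reduce cost
    instanceTypes:               # instance types AWS Batch may choose from (placeholders)
      - r5.2xlarge
      - r5.4xlarge
    engines:
      - type: nextflow
        engine: nextflow
```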
Step 2: Variant Calling using AGC
We use the Sarek pipeline, offered through nf-core, to perform variant calling.
By default, this pipeline executes the following tools:
- Sequencing quality control (FastQC)
- Map Reads to Reference (BWA mem)
- Mark Duplicates (GATK MarkDuplicatesSpark)
- Base (Quality Score) Recalibration (GATK BaseRecalibrator, GATK ApplyBQSR)
- Preprocessing quality control (samtools stats)
- Preprocessing quality control (Qualimap bamqc)
- Overall pipeline run summaries (MultiQC)
Similar to Step 1, we create a folder under workflows to define how Sarek should be invoked.
agc-project/
├── agc-project.yaml
└── workflows
├── sarek
│ ├── inputs.json
│ ├── input.tsv
│ └── MANIFEST.json
The `input.tsv` file (a tab-separated values (TSV) file) defines the list of input files, in FASTQ format, to be used for variant calling. The TSV file is structured so that it can be used for both the mapping and the variant calling steps; an illustrative layout is shown below.
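As an illustration, a FASTQ-based `input.tsv` for a tumor/normal pair might look like the following, with tab-separated columns for subject, sex, tumor/normal status (0 = normal, 1 = tumor), sample, lane, and the paired FASTQ files. The identifiers and S3 paths are placeholders, and the exact column layout should be checked against the Sarek usage documentation for the version you run.

```
SUBJECT_01	XX	0	SAMPLE_NORMAL	L001	s3://<your-bucket>/fetchngs_fastq/normal_R1.fastq.gz	s3://<your-bucket>/fetchngs_fastq/normal_R2.fastq.gz
SUBJECT_01	XX	1	SAMPLE_TUMOR	L001	s3://<your-bucket>/fetchngs_fastq/tumor_R1.fastq.gz	s3://<your-bucket>/fetchngs_fastq/tumor_R2.fastq.gz
```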
The inputs.json defines the actual parameters for the Sarek workflow. An extensive list of available parameters can be found in the Sarek usage documentation.
{
"input": "s3://roda3/input.tsv",
"outdir": "s3://roda3/results_variant_calling_3samples",
"tools": "Mutect2,snpEff,VEP"
}
All these parameters get propagated automatically to Nextflow when starting the workflow with AGC. We copy the TSV file to S3 and trigger the workflow with AGC.
!aws s3 cp workflows/sarek/input.tsv s3://roda3/
!agc workflow run sarek --context bigMemCtx
To track the logs of the distributed workflow, we run:
!agc logs engine --context bigMemCtx
The Sarek workflow contains many subsequent steps that need to be invoked. Since each step has different compute and memory requirements, AGC automatically provisions appropriate resources, and multiple small steps can be executed on the same Amazon EC2 instances. You can also easily utilize Spot Instances by activating a switch in the `agc-project.yaml` file. For variant calling, we test the somatic mutation pipeline with the tool Mutect2. The output of Mutect2 is a filtered VCF file, which is used for the analytics step.
After starting the Sarek workflow with multiple input files, you can easily view the utilization of EC2 instances using Amazon CloudWatch. As shown in Figure 2, there is a high CPU usage for a particular step after the genome files have been downloaded onto the instances.
Step 3: Analyzing Mutations using Amazon SageMaker
SageMaker can be set up using the instructions provided in the SageMaker documentation. It supports both Python and R kernels for Jupyter notebooks. For the purpose of demonstration, we use the R kernel. The R user guide provides an introduction to SageMaker and R. The large and diverse set of powerful packages like ggplot2 or visNetwork enables rapid iteration for analytics. Note that instead of Jupyter notebooks, you can also use the fully managed RStudio on Amazon SageMaker. Input for this step is the set of VCF files that contain information about the mutations. We clean and filter the VCF files using R in the Jupyter notebook and concatenate the resulting data frames to analyze and visualize them together.
We create a sample script that cleans and filters the VCF file for the BRCA2 gene, a tumor suppressor gene whose protein is responsible for repairing DNA. Each identified SNP mutation of the BRCA2 gene is listed along with the associated sample number. The Jupyter notebook is available in our GitHub repo.
As the first step of downstream analysis, customers often want to study the mutations detected for each sample. You can visualize the detected mutations together with the samples using network visualization. This gives insights into the most relevant mutations related to a particular cancer type.
We use the visNetwork package for R to visualize this dataset, which contains relational data for multiple patient samples and their gene mutations. As shown in Figure 4, some mutations, denoted by red nodes, are common among all patients. Although we provide a simple network of a small dataset for the purpose of demonstration, custom graph neural network models covering billions of variants can be implemented using SageMaker. In Figure 4, the samples and their SNP mutations are laid out such that mutations repeated among multiple samples appear densely connected. The visualization can be used as a starting point for building new hypotheses for research.
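The full analysis lives in the notebook in our GitHub repo; the R sketch below only illustrates the general idea. It assumes the filtered VCFs have already been collapsed into a data frame of (sample, mutation) pairs; the data frame, column names, and mutation labels are hypothetical placeholders, not the exact code or data from the repo.

```r
# Sketch: build a sample-to-mutation network with visNetwork.
# `snps` stands in for the combined data frame produced from the filtered VCFs;
# the values below are illustrative placeholders.
library(dplyr)
library(visNetwork)

snps <- data.frame(
  sample   = c("SAMPLE_01", "SAMPLE_01", "SAMPLE_02", "SAMPLE_02"),
  mutation = c("chr13:32315474_A>G", "chr13:32332592_C>T",
               "chr13:32315474_A>G", "chr13:32379913_G>A")
)

# One node per sample and per mutation; mutations are colored red as in Figure 4.
nodes <- bind_rows(
  data.frame(id = unique(snps$sample),   group = "sample"),
  data.frame(id = unique(snps$mutation), group = "mutation")
) %>% mutate(label = id)

# One edge per mutation detected in a sample.
edges <- data.frame(from = snps$sample, to = snps$mutation)

visNetwork(nodes, edges) %>%
  visGroups(groupname = "sample",   color = "lightblue") %>%
  visGroups(groupname = "mutation", color = "red") %>%
  visLegend()
```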
Costs
We can categorize the incurred cost into two categories. The first category comprises the cost of utilizing AWS resources needed to run the genomics workflow, as defined in steps 1 and 2. These resources are provisioned and deleted automatically through AGC. The second category is related to notebook usage time for the analytics in step 3.
AGC automatically tags resources so that you can use tools like AWS Cost & Usage Report to understand the costs related to genomics data analysis across multiple AWS services. Make sure to activate the AGC tags like `agc-project` and `agc-context` under Billing – Cost Allocation Tags so that the costs can be split up in AWS Cost Explorer.
Workflow costs
Step 1:
We measure the runtime for two samples, SRR7890830 and SRR7890831. The download and conversion from SRA to FASTQ format for these two files using the public SRA FTP took 1 hr 14 mins. The cost of conversion for these two files is $0.46, which can be obtained in Cost Explorer by using the AGC-defined resource tags.
Step 2:
For the given two samples, the overall runtime is 7 hr 45 mins. The overall cost, which can be viewed using the specific AGC-context tag in Cost Explorer, is $16.52. This was obtained using the spot flag `requestSpotInstances: true` in AGC. A snippet of the execution time of each subprocess of the Sarek workflow is shown in Figure 6.
Step 3:
This analytics step is iterative in nature and requires only smaller instances. The cost depends on the actual time a notebook is used by a developer or data scientist. For example, the cost incurred by an ml.t2.medium instance, typically used by SageMaker notebooks, is $1.11 when used for 8 hours a day for 3 days (24 hours of notebook time in total).
Running the workflow on your projects
We provide two Jupyter notebooks with all the steps and snippets you need to run the same or similar pipelines for your use case. The main steps include:
- Set up a notebook instance in SageMaker
- Check out our GitHub repository: https://github.com/aws-samples/amazon-genomics-cli-sagemaker-sample-code (see the snippet after this list)
- Make sure to adjust the `agc-project.yaml` to point to your S3 buckets
- Use the provided notebook `step01_step02_preprocess_variant-calling` for the heavy genomics workflows. The code inside these notebooks can also be executed on AWS Cloud9 or an Amazon EC2 instance with a very lightweight memory and compute setup.
- Use the provided notebook `step_03_Analytics_r` to perform analytics and visualization using R.
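From a terminal on the notebook instance (or on Cloud9/EC2), checking out the sample code looks like the following. Whether `agc-project.yaml` sits at the repository root is an assumption, so adjust the path to wherever it lives in the repo.

```bash
# Clone the sample code and open the two notebooks in JupyterLab.
git clone https://github.com/aws-samples/amazon-genomics-cli-sagemaker-sample-code.git
cd amazon-genomics-cli-sagemaker-sample-code
# Edit agc-project.yaml (wherever it sits in the repo) to reference your own S3 bucket
# before deploying a context or running the workflows.
```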
Note that you can also use AGC from the AWS Cloud9 IDE or from an Amazon EC2 instance, anywhere you have access to execute shell commands.
Cleaning up
To avoid incurring future charges, shut down the SageMaker notebook instances that were running and delete the genomic files that were downloaded into your S3 buckets. In addition, you should run `agc account deactivate` to make sure that all of its resources are removed.
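A minimal cleanup sketch from the notebook terminal might look like the following. The bucket and prefixes are placeholders, and destroying the deployed context before deactivating the account is a tidiness step rather than a strict requirement.

```bash
# Tear down the per-context compute infrastructure, then remove the core AGC resources.
agc context destroy bigMemCtx
agc account deactivate

# Delete intermediate and result files you no longer need (paths are placeholders).
aws s3 rm s3://<your-bucket>/fetchngs_fastq --recursive
aws s3 rm s3://<your-bucket>/results_variant_calling_3samples --recursive
```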
Conclusion
In this blog, we present a solution based on Amazon Genomics CLI and Amazon SageMaker, which automatically leverages the scalability and cost-effectiveness of AWS Batch infrastructure while providing a powerful IDE interface for advanced analytics.
Using this solution, you can easily run Nextflow workflows from nf-core on the given exome sequences and interactively analyze the outputs using Amazon SageMaker. Here, we demonstrate building a simple node-and-edge network to visualize relationships in our data. However, Amazon SageMaker's ML capabilities support more complex analyses for customers who want to develop customized graph neural networks to model relationships across billions of records.
References:
- Garcia M, Juhos S, Larsson M et al. Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 2; peer review: 2 approved] F1000Research 2020, 9:63 doi: 10.12688/f1000research.16665.2.
- Mercer, T.R., Xu, J., Mason, C.E. et al. The Sequencing Quality Control 2 study: establishing community standards for sequencing in precision medicine. Genome Biol 22, 306 (2021). https://doi.org/10.1186/s13059-021-02528-3