AWS for Industries

Announcing Amazon Genomics CLI (Preview)

Today, we are excited to announce the preview availability of Amazon Genomics CLI, a tool that lets genomics and life science customers process genomics data at petabyte scale on AWS, enabling population-level genetic studies, faster drug discovery, and more. In this post, we take a brief look at how Amazon Genomics CLI lets you provision, configure, and scale cloud resources in minutes, with just a few commands, to run genomics workflows on AWS. For access to Amazon Genomics CLI, sign up for the preview.

Studies published in 2015 and 2017 estimated that between 60 million and 2 billion people would have their individual genomes sequenced by 2025, producing data at a rate of exabytes per day worldwide. Scientific researchers around the world are creating datasets like these to develop deeper insights into the mechanisms of disease, find novel drug targets, and study population-scale genetic traits. The more data you have, the deeper the insights you can generate, which is a fundamental principle behind population sequencing programs like UK Biobank and All of Us. Sequencing technology has also improved at a rate that outpaces Moore’s law: it now costs well under $1,000 to generate a personal genome, and sequencing is rapidly becoming a diagnostic tool in the clinic. In short, a lot of genomics data is being produced, and there is an ever-growing need to process and analyze it at scale.

A crucial step in genomics data analysis is converting the raw data (typically short-read sequencing data generated by machines from Illumina) into formats that list unique genetic characteristics. Although this sounds simple, it requires many steps, such as alignment, quality control (QC), recalibration, and variant calling, each with different computational needs. This process, called secondary analysis, can be run at higher scale and in less time using the cloud and the diversity of compute it offers, reducing the time to useful insights like variant identification and disease diagnosis.

In practice, however, customers find it hard to run secondary analysis in the cloud. These analyses use a variety of tools that must be orchestrated as a specific sequence of steps, or a workflow. To facilitate developing, sharing, and running workflows, the genomics and bioinformatics communities have developed specialized workflow definition languages like WDL, Nextflow, CWL, and Snakemake. Getting these workflows running on AWS was previously a challenge, and we made things easier with reference architectures like Cromwell on AWS and Nextflow on AWS, which customers can use as a starting point to build their own custom solutions.

However, many of our customers want something that removes the undifferentiated heavy lifting of both launching the infrastructure they need and running the workflows they already have on hand. Amazon Genomics CLI addresses these needs by further simplifying and automating the deployment of the required cloud resources, and by providing an easy-to-use command line to quickly set up and run genomics workflows on AWS.

To get started with Amazon Genomics CLI, you define a project config that lists the workflows you want to run. This looks like:

---
name: MyProject
workflows:
  myWorkflow:
    type: wdl
    sourceURL: workflows/my-workflow.wdl
...
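As a rough illustration of the structure above, the following Python sketch checks that a project config has the fields shown in the example: a `name`, and a `type` and `sourceURL` for each workflow. It is not part of Amazon Genomics CLI. The function name `check_project_config` is hypothetical, the config is written as a plain Python dict rather than parsed from YAML, and treating these fields as required is an assumption based only on the example in this post.

```python
from typing import Any


def check_project_config(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a project config; empty means OK.

    Assumption: the fields used in the example (name, workflows, type,
    sourceURL) are all required. The post does not state this explicitly.
    """
    problems: list[str] = []

    # Top-level keys seen in the example: a project name and a workflow map.
    if not config.get("name"):
        problems.append("missing project name")

    workflows = config.get("workflows")
    if not isinstance(workflows, dict) or not workflows:
        problems.append("no workflows defined")
        return problems

    # Each workflow entry in the example carries a type and a sourceURL.
    for wf_name, spec in workflows.items():
        if not isinstance(spec, dict):
            problems.append(f"{wf_name}: entry is not a mapping")
            continue
        for key in ("type", "sourceURL"):
            if not spec.get(key):
                problems.append(f"{wf_name}: missing {key}")
    return problems


# The example config from this post, written as the equivalent Python dict.
project = {
    "name": "MyProject",
    "workflows": {
        "myWorkflow": {"type": "wdl", "sourceURL": "workflows/my-workflow.wdl"},
    },
}

print(check_project_config(project))  # prints []
```

A check like this can catch a misspelled key before any cloud resources are involved. It says nothing about whether the workflow file itself is valid.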

Amazon Genomics CLI is designed to run the existing workflows you have today with minimal modification. If your workflow is written in a language Amazon Genomics CLI supports, and the data is in S3, you should be good to go.

To run workflows, Amazon Genomics CLI uses “contexts”. Contexts encapsulate and automate time-consuming tasks like configuring and deploying workflow engines, creating data access policies, and tuning compute clusters for operation at scale. To start the default context that comes with Amazon Genomics CLI, run:

$ agc context start default

When the default context is fully deployed, you can run a workflow in this context with:

$ agc workflow run myWorkflow
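If you prefer to script these two steps, one possible approach is a small Python wrapper that runs them in order and stops at the first failure. This is a sketch, not an official tool. The helper `run_steps` and its `dry_run` switch are hypothetical. It also assumes that `agc context start` returns only once the context is fully deployed and that `agc` signals failure with a non-zero exit code, neither of which this post states.

```python
import subprocess

# The two commands shown in this post, in the order they must run.
STEPS = [
    ["agc", "context", "start", "default"],
    ["agc", "workflow", "run", "myWorkflow"],
]


def run_steps(steps, dry_run=True):
    """Run each command in order, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed, so the
    sketch can be tried on a machine that does not have agc installed.
    Returns the list of commands that were processed.
    """
    processed = []
    for cmd in steps:
        print("+ " + " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so the workflow is not submitted if the context fails to start.
            subprocess.run(cmd, check=True)
        processed.append(cmd)
    return processed


if __name__ == "__main__":
    run_steps(STEPS)
```

Calling `run_steps(STEPS, dry_run=False)` would actually invoke the commands. The default only echoes them.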

That’s about all it takes to run genomics workflows on AWS with Amazon Genomics CLI.

We’re excited about Amazon Genomics CLI and hope you are too. We are making Amazon Genomics CLI available to customers interested in giving it a test drive to provide us with valuable feedback. If that’s you, please sign up for the preview!

Summary

Amazon Genomics CLI is a tool for genomics and life science customers to process raw genomics and biological data in the cloud, at petabyte scale. It makes it easy for software developers and researchers to quickly provision, configure, and scale cloud resources to run genomics workflows, and it is now available as part of a private preview program. To access Amazon Genomics CLI as part of our preview program, visit Amazon Genomics CLI Preview.

Lee Pang

Lee is a Principal Bioinformatics Architect with the Health AI services team at AWS. He has a PhD in Bioengineering and over a decade of hands-on experience as a practicing research scientist and software engineer in bioinformatics, computational systems biology, and data science, developing tools ranging from high-throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.

Taha Kass-Hout

Dr. Taha Kass-Hout is Director of Machine Learning and Chief Medical Officer at Amazon Web Services, and leads our Health AI strategy and efforts, including Amazon Comprehend Medical and Amazon HealthLake. He works with teams at Amazon responsible for developing the science, technology, and scale for COVID-19 lab testing, including Amazon’s first FDA authorization for testing our associates, which is now offered to the public for at-home testing. A physician and bioinformatician, Taha served two terms under President Obama, including as the first Chief Health Informatics Officer at the FDA. During his time as a public servant, he pioneered the use of emerging technologies and the cloud (the CDC’s electronic disease surveillance), and established widely accessible global data sharing platforms: openFDA, which enabled researchers and the public to search and analyze adverse event data, and precisionFDA (part of the Presidential Precision Medicine Initiative).