Genomic Analysis with Hail on Amazon EMR and Amazon Athena
UPDATE: Hail version 0.2 (stable) represents over a year of great work by the community. It includes new capabilities, numerous fixes and improved performance. We’ve updated the CloudFormation template in this post to allow you to launch an Amazon EMR cluster with Hail v0.2.
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are compressed in blocks, so you don’t need to recompress—you can just rename the file extension to “.bgz” from “.gz” . Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
Open your browser, and navigate to http://www.internationalgenome.org/data-portal/data-collection/phase-3. In the Samples box, choose Download the list. Then copy the file to the S3 bucket location where you saved the VCF files from the previous step.
Exploring the genomic data
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Output tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the Jupyter Notebook that is provided to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
For information about the Hail API, visit the Python API Documentation page.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In .
Next, run the subsequent two cells, which imports the Hail module into PySpark and initiates the Hail context.
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your virtual dedicated server (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation bucket s3://your_output_bucket/hail_data/sample_annotations/ so that it can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
For the sample annotations
And for variant annotations:
Querying annotations with Athena
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.
To summarize, this post demonstrated the ease in which you can install, configure, and use Hail, an open source highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on-demand in a serverless environment using Amazon Athena.
If you found this post useful, be sure to check out Interactive Analysis of Genomic Datasets Using Amazon Athena and Building High-Throughput Genomics Batch Workflows on AWS: Introduction (Part 1 of 4).
About the Author
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.