Interactive Analysis of Genomic Datasets Using Amazon Athena

Aaron Friedman is a Healthcare and Life Sciences Solutions Architect with Amazon Web Services

The genomics industry is in the midst of a data explosion. Due to the rapid drop in the cost to sequence genomes, genomics is now central to many medical advances. When your genome is sequenced and analyzed, raw sequencing files are processed in a multi-step workflow to identify where your genome differs from a standard reference. Your variations are stored in a Variant Call Format (VCF) file, which is then combined with other individuals to enable population-scale analyses. Many of these datasets are publicly available, and an increasing number are hosted on AWS as part of our Open Data project.

To mine genomic data for new discoveries, researchers in both industry and academia build complex models to analyze populations at scale. When building models, they first explore the datasets-of-interest to understand what questions the data might answer. In this step, interactivity is key, as it allows them to move easily from one question to the next.

Recently, we launched Amazon Athena as an interactive query service to analyze data on Amazon S3. With Amazon Athena there are no clusters to manage and tune, no infrastructure to setup or manage, and customers pay only for the queries they run. Athena is able to query many file types straight from S3. This flexibility gives you the ability to interact easily with your datasets, whether they are in a raw text format (CSV/JSON) or specialized formats (e.g. Parquet). By being able to flexibly query different types of data sources, researchers can more rapidly progress through the data exploration phase for discovery. Additionally, researchers don’t have to know nuances of managing and running a big data system. This makes Athena an excellent complement to data warehousing on Amazon Redshift and big data analytics on Amazon EMR.

In this post, I discuss how to prepare genomic data for analysis with Amazon Athena as well as demonstrating how Athena is well-adapted to address common genomics query paradigms. I use the Thousand Genomes dataset hosted on Amazon S3, a seminal genomics study, to demonstrate these approaches. All code that is used as part of this post is available in our GitHub repository.

Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.

Select, aggregate, annotate query pattern in genomics

Genomics researchers may ask different questions of their dataset, such as:

What variations in a genome may increase the risk of developing disease?
What positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the genomic reference?
What variations in a genome influence how an individual may respond to a specific drug treatment?
Does a group of individuals contain a higher frequency of a genomic variant known to alter response to a drug relative to the general population?

All these questions, and more, can be generalized under a common query pattern I like to call “Select, Aggregate, Annotate”. Some of our genomics customers, such as Human Longevity, Inc., routinely use this query pattern in their work.

In each of the above queries, you execute the following steps:

SELECT: Specify the cohort of individuals meeting certain criteria (disease, drug response, age, BMI, entire population, etc.).

AGGREGATE: Generate summary statistics of genomic variants across the cohort that you selected.

ANNOTATE: Assign meaning to each of the variants by joining on known information about each variant.

Dataset preparation

Properly organizing your dataset is one of the most critical decisions for enabling fast, interactive analyses. Based on the query pattern I just described, the table representing your population needs to have the following information:

A unique sample ID corresponding to each sample in your population
Information about each variant, specifically its location in the genome as well as the specific deviation from the reference
Information about how many times in a sample a variant occurs (0, 1, or 2 times) as well as if there are multiple variants in the same site. This is known as a genotype.

The extract, transform, load (ETL) process to generate the appropriate data representation has two main steps. First, you use ADAM, a genomics analysis platform built on top of Spark, to convert the variant information residing a VCF file to Parquet for easier downstream analytics, in a process similar to the one described in the Will Spark Power the Data behind Precision Medicine? post. Then, you use custom Python code to massage the data and select only the appropriate fields that you need for analysis with Athena.

First, spin up an EMR cluster (version 5.0.3) for the ETL process. I used a c4.xlarge for my master node and m4.4xlarges with 1 TB of scratch for my core nodes.

After you SSH into your master node, clone the git repository. You can also put this in as a bootstrap action when spinning up your cluster.

sudo yum –y install git
git clone https://github.com/awslabs/aws-big-data-blog.git

You then need to install ADAM and configure it to run with Spark 2.0. In your terminal, enter the following:

# Install Maven
wget http://apache.claz.org/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvzf apache-maven-3.3.9-bin.tar.gz
export PATH=/home/hadoop/apache-maven-3.3.9/bin:$PATH

# Install ADAM
git clone https://github.com/bigdatagenomics/adam.git
cd adam
sed -i 's/2.6.0/2.7.2/' pom.xml
./scripts/move_to_spark_2.sh
export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=256m"
mvn clean package -DskipTests

export PATH=/home/hadoop/adam/bin:$PATH

Now that ADAM is installed, enter the working directory of the repo that you first cloned, which contains the code relevant to this post. Your next step is to convert a VCF containing all the information about your population.

For this post, you are going to focus on a single region in the genome, defined as chromosome 22, for all 2,504 samples in the Thousand Genomes dataset. As chromosome 22 was the first chromosome to be sequenced as part of the Human Genome Project, it is your first here as well. You can scale the size of your cluster depending on whether you are looking at just chromosome 22, or the entire genome. In the following lines, replace mytestbucket with your bucket name of choice.

PARQUETS3PATH=s3://<mytestbucket>/thousand_genomes/grch37/chr22.parquet/
cd ~/aws-big-data-blog/aws-blog-athena-genomics/etl/thousand_genomes/
chmod +x ./convert_vcf.sh
./convert_vcf.sh $PARQUETS3PATH

After this conversion completes, you can reduce your Parquet file to only the fields you need.

TRIMMEDS3PATH=s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/
spark-submit --master yarn --executor-memory 40G --deploy-mode cluster create_trimmed_parquet.py --input_s3_path $PARQUETS3PATH --output_s3_path $TRIMMEDS3PATH

Annotation data preparation

For this post, use the ClinVar dataset, which is an archive of information about variation in genomes and health. While Parquet is a preferred format for Athena relative to raw text, using the ClinVar TXT file demonstrates the ability of Athena to read multiple file types.

In a Unix terminal, execute the following commands (replace mytestbucket with your bucket name of choice):

ANNOPATH=s3://<mytestbucket>/clinvar/

wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

# Need to strip out the first line (header)
zcat variant_summary.txt.gz | sed '1d' > temp ; mv -f temp variant_summary.trim.txt ; gzip variant_summary.trim.txt

aws s3 cp variant_summary.trim.txt.gz $ANNOPATH --sse

Creating Tables using Amazon Athena

In the recent Amazon Athena – Interactive SQL Queries for Data in Amazon S3 post, Jeff Barr covered how to get started with Athena and navigate to the console. Athena uses the Hive DDL when you create, alter, or drop a table. This means you can use the standard Hive syntax for creating tables. These commands can also be found in the aws-blog-athena-genomics GitHub repo under sql/setup.

First, create your database:

CREATE DATABASE demo;

Next, define your table with the appropriate mappings from the previous ETL processes. Be sure to change to your appropriate bucket name.

For your population data:

If you want to skip the ETL process described above for your population data, you can use the following path for your analysis: s3://aws-bigdata-blog/artifacts/athena_genomics.parquet/ – be sure to substitute it in for the LOCATION below.

CREATE EXTERNAL TABLE demo.samplevariants
(
  alternateallele STRING
  ,chromosome STRING
  ,endposition BIGINT
  ,genotype0 STRING
  ,genotype1 STRING
  ,referenceallele STRING
  ,sampleid STRING
  ,startposition BIGINT
)
STORED AS PARQUET
LOCATION ' s3://<mytestbucket>/thousand_genomes/grch37_trimmed/chr22.parquet/';

For your annotation data:

CREATE EXTERNAL TABLE demo.clinvar
(
    alleleId STRING
    ,variantType STRING
    ,hgvsName STRING
    ,geneID STRING
    ,geneSymbol STRING
    ,hgncId STRING
    ,clinicalSignificance STRING
    ,clinSigSimple STRING
    ,lastEvaluated STRING
    ,rsId STRING
    ,dbVarId STRING
    ,rcvAccession STRING
    ,phenotypeIds STRING
    ,phenotypeList STRING
    ,origin STRING
    ,originSimple STRING
    ,assembly STRING
    ,chromosomeAccession STRING
    ,chromosome STRING
    ,startPosition INT
    ,endPosition INT
    ,referenceAllele STRING
    ,alternateAllele STRING
    ,cytogenetic STRING
    ,reviewStatus STRING
    ,numberSubmitters STRING
    ,guidelines STRING
    ,testedInGtr STRING
    ,otherIds STRING
    ,submitterCategories STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://<mytestbucket>/clinvar/';

You can confirm that both tables have been created by choosing the eye icon next to a table name in the console. This automatically runs a query akin to SELECT * FROM table LIMIT 10. If it succeeds, then your data has been loaded successfully.

o_athena_genomes_1

Applying the select, aggregate, annotate paradigm

In this section, I walk you through how to use the query pattern described earlier to answer two common questions in genomics. You examine the frequency of different genotypes, which is a distinct combination of all fields in the samplevariants table with the exception of sampleid. These queries can also be found in the aws-blog-athena-genomics GitHub repo under sql/queries.

Population drug response

What small molecules/drugs are most likely to affect a subpopulation of individuals (ancestry, age, etc.) based on their genomic information?

In this query, assume that you have some phenotype data about your population. In this case, also assume that all samples sharing the pattern “NA12” are part of a specific demographic.

In this query, use sampleid as your predicate pushdown. The general steps are:

Filter by the samples in your subpopulation
Aggregate variant frequencies for the subpopulation-of-interest
Join on ClinVar dataset
Filter by variants that have been implicated in drug-response
Order by highest frequency variants

To answer this question, you can craft the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.rsid
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants
        WHERE sampleid LIKE 'NA12%'))
JOIN demo.clinvar cv
ON sv.chromosome = cv.chromosome
    AND sv.startposition = cv.startposition - 1
    AND sv.endposition = cv.endposition
    AND sv.referenceallele = cv.referenceallele
    AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
    AND cv.clinicalsignificance LIKE '%response%'
    AND sampleid LIKE 'NA12%'
GROUP BY  sv.chromosome
          ,sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,cv.rsid
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

Enter the query into the console and you see something like the following:

o_athena_genomes_2

When you inspect the results, you can quickly see (often in a matter of seconds!) that this population of individuals has a high frequency of variants associated with the metabolism of debrisoquine.

o_athena_genomes_3

Quality control

Are you systematically finding locations in your reference that are called as variation? In other words, what positions in the genome have abnormal levels of variation, suggesting issues in quality of sequencing or errors in the reference? Are any of these variants implicated in clinically? If so, could this be a false positive clinical finding?

In this query, the entire population is needed so the SELECT predicate is not used. The general steps are:

Aggregate variant frequencies for the entire Thousand Genomes population
Join on the ClinVar dataset
Filter by variants that have been implicated in disease
Order by the highest frequency variants

This translates into the following query:

SELECT
    count(*)/cast(numsamples AS DOUBLE) AS genotypefrequency
    ,cv.clinicalsignificance
    ,cv.phenotypelist
    ,sv.chromosome
    ,sv.startposition
    ,sv.endposition
    ,sv.referenceallele
    ,sv.alternateallele
    ,sv.genotype0
    ,sv.genotype1
FROM demo.samplevariants sv
CROSS JOIN
    (SELECT count(1) AS numsamples
    FROM
        (SELECT DISTINCT sampleid
        FROM demo.samplevariants))
    JOIN demo.clinvar cv
    ON sv.chromosome = cv.chromosome
        AND sv.startposition = cv.startposition - 1
        AND sv.endposition = cv.endposition
        AND sv.referenceallele = cv.referenceallele
        AND sv.alternateallele = cv.alternateallele
WHERE assembly='GRCh37'
        AND cv.clinsigsimple='1'
GROUP BY  sv.chromosome ,
          sv.startposition
          ,sv.endposition
          ,sv.referenceallele
          ,sv.alternateallele
          ,sv.genotype0
          ,sv.genotype1
          ,cv.clinicalsignificance
          ,cv.phenotypelist
          ,numsamples
ORDER BY  genotypefrequency DESC LIMIT 50

As you can see in the following screenshot, the highest frequency results have conflicting information, being listed both as potentially causing disease and being benign. In genomics, variants with a higher frequency are less likely to cause disease. Your quick analysis allows you to discount the pathogenic clinical significance annotations of these high genotype frequency variants.

o_athena_genomes_4

Summary

I hope you have seen how you can easily integrate datasets of different types and values, and combine them to quickly derive new meaning from your data. Do you have a dataset you want to explore or want to learn more about Amazon Athena? Check out our Getting Started Guide and happy querying!

If you have questions or suggestions, please comment below.

About the author

Dr. Aaron Friedman a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with our ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s out hiking or learning a new thing to cook.

Analyzing Data in S3 using Amazon Athena