AWS for Industries
Unlock Powerful Genomic Insights with AWS HealthOmics Analytics and Amazon EMR
Analyzing large-scale genomic variant data just got easier with the latest release of Amazon EMR and its integration with AWS HealthOmics. HealthOmics Analytics provides tabular access to genetic variant data sets and annotations, empowering researchers and data scientists to uncover valuable insights through Amazon Athena queries and now Amazon EMR jobs.
While Amazon Athena excels at quick, interactive querying of data using SQL, Amazon EMR offers a more flexible big data platform for processing vast amounts of data. EMR leverages popular frameworks like Apache Spark, allowing you to go beyond the interactive model of Athena and gain greater flexibility and control over your compute environment and analytics workflow. This enables you to tackle complex, multi-stage data processing workflows via cost-effective EMR jobs. By combining HealthOmics Analytics Stores with the power of EMR, you can unlock a wide range of use cases, including genotype-phenotype association analysis, population-scale variant analysis, and integration with a diverse ecosystem of bioinformatics tools such as Hail, ADAM, GATK, and more. HealthOmics leverages AWS Lake Formation to ensure secure and centrally managed governance of your genomic data, giving you complete control over who can access it, including via EMR jobs. This blog post will guide you through the initial configuration and setup of EMR to help you get started with your first genomic data query.
At a high level
Today, we’ll be running a Spark job on EMR (1). EMR integrates seamlessly with Lake Formation to secure credentials and authenticate access to your genomic data (2). We will write our queries against our HealthOmics Analytics store in the AWS Glue Data Catalog (3), which pulls the secured data from HealthOmics (4).
Initial setup
This blog will assume you’ve created a HealthOmics Variant and/or Annotation store and have imported genomic and/or annotation data into it already. If you haven’t, see the HealthOmics Analytics Documentation to get started.
There are a few initial steps required to enable your EMR cluster to access HealthOmics Analytics data through AWS Lake Formation. You will need a role permissive enough to interact with EMR, IAM, and Lake Formation.
Configuring Permissions in IAM
The following steps modify the default EMR roles. In your own workflow you can use any roles you’d like, as long as they are given the correct IAM and Lake Formation permissions.
You can create the default EMR roles using the AWS Command Line Interface (AWS CLI):
aws emr create-default-roles --region <AWS_REGION>
This creates two IAM roles, EMR_EC2_DefaultRole and EMR_DefaultRole, which EMR will use.
EMR_EC2_DefaultRole
This role is assumed by all of the EC2 instances in the EMR cluster. Use it to grant the permissions needed to interact with other AWS services and resources as part of data processing and management tasks. In addition to the default AmazonElasticMapReduceforEC2Role managed policy it should already have, we will add the following inline policy to this role:
- Replace <AWS_ACCOUNT> with your AWS Account ID
- Replace <AWS_REGION> with the appropriate AWS Region
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LakeFormationPermission",
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    },
    {
      "Sid": "DefaultDBPermissions",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateDatabase"
      ],
      "Resource": [
        "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:catalog",
        "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:database/default"
      ]
    }
  ]
}
Breaking it down:
- The LakeFormationPermission statement grants the role the ability to request temporary credentials from Lake Formation to access the underlying data.
- The DefaultDBPermissions statement gives your cluster the ability to initialize with the default database. EMR’s catalog integration always checks whether the default database exists, and this statement allows the role to create it if necessary. If a default database already exists in Lake Formation, you can instead grant EMR_EC2_DefaultRole Describe permission on it and leave out this statement.
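If you prefer to apply this with the AWS CLI, here is a minimal sketch; it assumes you have saved the policy JSON above to a local file named emr-ec2-healthomics-policy.json (a file name chosen here for illustration):

# Attach the inline policy above to the EC2 instance role used by the cluster
aws iam put-role-policy \
  --role-name EMR_EC2_DefaultRole \
  --policy-name HealthOmicsAnalyticsAccess \
  --policy-document file://emr-ec2-healthomics-policy.json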
EMR_DefaultRole
This role is assumed by the Amazon EMR service to help manage resources and perform various actions on your behalf during the lifecycle of an EMR cluster. It should already be configured with the permissions needed for this blog.
Lake Formation
Create a data lake admin
If you have not already done so, you will need to create a data lake administrator. This allows you to create the resource links and grants required to access your data from EMR.
In the Lake Formation Console:
- In the left-side navigation panel, expand Administration
- Under Administration, choose Administrative roles and tasks
- On the page you will see a Data lake administrators box, choose Add
- Add an IAM role you can assume (e.g. the role you are currently using in the console) and Confirm
- You should now see that IAM role in the Data lake administrators list
- If that role is different from the one you are currently using, log into the console with that role
Here, a role named Admin is being made a data lake administrator, but you can use any role you’d like.
Change the default permissions model
This is necessary to let Lake Formation handle permissions on newly created resource links. Disable the use of IAM access controls for new databases and tables, and revoke the IAMAllowedPrincipals permission for database creators.
- In the left-side navigation panel, expand Administration
- Under Administration, choose Data Catalog settings
- On the page you will see a Default permissions for newly created databases and tables section
- Uncheck and Save the following options:
- Use only IAM access control for new databases
- Use only IAM access control for new tables in new databases
Application integration for full table access
In the Lake Formation Console:
- In the left-side navigation, expand Administration
- Under Administration, choose Application integration settings.
- On the Application integration settings page, choose the checkbox to Allow external engines to access data in Amazon S3 locations with full table access
This will authorize the EMR query engine to request credentials to query against the underlying data.
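If you prefer to script these Lake Formation settings, the sketch below uses the Lake Formation CLI. The admin role ARN is a placeholder, and the field names reflect the PutDataLakeSettings API as we understand it; note that put-data-lake-settings replaces the entire settings object, so merge the values below with the output of get-data-lake-settings rather than overwriting it blindly.

# Inspect the current settings first
aws lakeformation get-data-lake-settings --region <AWS_REGION>

# Sketch: add a data lake admin, remove the IAM-only default permissions,
# and allow external engines (such as EMR) full table access.
# Merge this JSON with your existing settings before applying.
aws lakeformation put-data-lake-settings \
  --region <AWS_REGION> \
  --data-lake-settings '{
    "DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<AWS_ACCOUNT>:role/Admin"}],
    "CreateDatabaseDefaultPermissions": [],
    "CreateTableDefaultPermissions": [],
    "AllowFullTableExternalDataAccess": true
  }'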
Take Note of the HealthOmics Analytics Store Amazon S3 Path
To find your HealthOmics Analytics Store Amazon S3 Path:
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, look for your HealthOmics Analytics store
- Copy the S3 path listed under the Amazon S3 Path column. Save this S3 path; we will use it later in the EMR cluster!
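If you’d rather look this up from the command line, here is a quick sketch that lists each Glue database with its location; the Amazon S3 Path column in the Lake Formation console should correspond to the database’s LocationUri, which is an assumption worth verifying against the console.

# List database names and S3 locations; look for your HealthOmics Analytics store
aws glue get-databases --region <AWS_REGION> \
  --query 'DatabaseList[].{Name:Name,Location:LocationUri}' --output table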
Create a Data Lake Resource Link
Now that we’ve configured the Lake Formation settings and created our Lake Formation admin, we can create a resource link to the HealthOmics Analytics store and grant the EMR_EC2_DefaultRole role permission to access it from EMR. If you previously used Amazon Athena to query your HealthOmics Analytics Store, this is the same process.
To learn more about how resource links work, see how resource links work in Lake Formation.
To create a resource link (a CLI sketch follows these steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose your HealthOmics Analytics store
- Under Actions, choose Create resource link
- Choose a Resource Link Name; this is the name of the database we will use in EMR
- Note: This name must be a compliant SQL database name. To avoid having to escape the name in your queries, we suggest using only lowercase letters and underscores
- Leave everything else to its pre-set value
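The console steps above can also be expressed as a single Glue CreateDatabase call with a TargetDatabase, which is how resource links are represented in the Data Catalog. A sketch, using variants_rl as an illustrative resource link name and a placeholder for the store’s database name:

# Sketch: create a resource link named variants_rl pointing at the
# HealthOmics Analytics store's database in this account's catalog
aws glue create-database --region <AWS_REGION> --database-input '{
  "Name": "variants_rl",
  "TargetDatabase": {
    "CatalogId": "<AWS_ACCOUNT>",
    "DatabaseName": "<HEALTHOMICS_STORE_DATABASE_NAME>"
  }
}'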
Grant EMR_EC2_DefaultRole Describe permissions on the Resource Link Database
Recall that when we created the default EMR roles, we set the lakeformation:GetDataAccess permission on EMR_EC2_DefaultRole. Since we will use this role to access our data through Lake Formation, we need to grant it Describe permission on the Resource Link.
To grant Describe on the Resource Link (an equivalent CLI call is sketched after these steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose the Resource Link you created in the previous step
- Under Actions, choose Grant
- Choose the EMR_EC2_DefaultRole to add
- Under Resource link permissions, choose Describe
- Select Grant to save
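The same grant can be made with the Lake Formation CLI; a sketch, with the account, Region, and resource link name as placeholders:

# Sketch: grant Describe on the resource link database to EMR_EC2_DefaultRole
aws lakeformation grant-permissions --region <AWS_REGION> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<AWS_ACCOUNT>:role/EMR_EC2_DefaultRole \
  --permissions DESCRIBE \
  --resource '{"Database": {"Name": "<MY_RESOURCE_LINK>"}}'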
Grant EMR_EC2_DefaultRole Select and Describe permissions on the Resource Link Table
Similar to the previous step, we are going to grant another set of permissions on the resource link. This time we will grant permissions on the HealthOmics Analytics store table that our Resource Link refers to.
To grant Select and Describe on the Resource Link table (again, a CLI sketch follows the steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose the Resource Link you created
- Under Actions, choose Grant
- Choose the EMR_EC2_DefaultRole to add
- Under Tables, choose All Tables or the table that matches the name of your HealthOmics Analytics Store or Variant Cohort
- Under Table permissions, choose Select and Describe
- Select Grant to save
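As a CLI sketch, this grant targets the table (or all tables) in the HealthOmics Analytics store’s database behind the resource link; placeholders as before:

# Sketch: grant Select and Describe on all tables in the target database.
# Replace TableWildcard with "Name": "<TABLE_NAME>" to grant on a single table.
aws lakeformation grant-permissions --region <AWS_REGION> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<AWS_ACCOUNT>:role/EMR_EC2_DefaultRole \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"DatabaseName": "<HEALTHOMICS_STORE_DATABASE_NAME>", "TableWildcard": {}}}'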
These are all the permissions required to access the data from EMR. You should be able to verify your permissions for EMR_EC2_DefaultRole on the Permissions page.
At the minimum, you should have:
- Describe permission on the Resource Link Database
- Describe permission on the HealthOmics Analytics Store Table
- Select permission on all the columns in the HealthOmics Analytics Store Table
EMR Cluster Setup
We are now ready to set up EMR and query HealthOmics Analytics data.
Create an EMR Cluster
There are several ways to create and configure an EMR cluster; here we will create one with the smallest footprint required to access our HealthOmics Analytics data.
At the minimum, we will need:
- Amazon EMR Release emr-6.13.0 or greater
- Spark 3.4.1 or greater
- For AWS Glue Data Catalog settings, ensure Use for Spark table metadata is enabled
- A Software Settings configuration for the iceberg-defaults classification with iceberg.enabled set to true
To create a cluster in the console (an equivalent AWS CLI sketch follows these steps):
- In the left-side navigation panel, expand EMR on EC2 and choose Clusters
- On the page you will see a Clusters box, choose Create Cluster
- Name your Cluster (e.g. omics-analytics-cluster)
- Select an Amazon EMR Release greater than or equal to emr-6.13.0
- Under Application Bundle, choose Custom
- Choose Spark 3.4.1 or greater
- Under AWS Glue Data Catalog settings, choose Use for Spark table metadata
- Under Software Settings, set the iceberg-defaults config
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
- Make sure to set an EC2 Key pair so you can SSH into the cluster
- Under Identity and Access Management (IAM) roles, set the default EMR roles we created earlier
- Under Amazon EMR Service Role, choose EMR_DefaultRole
- Under EC2 instance profile for Amazon EMR, choose EMR_EC2_DefaultRole. Recall that EMR_EC2_DefaultRole is the role we granted database and table access to in Lake Formation earlier
- Click Create Cluster
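For reference, here is a minimal AWS CLI sketch that approximates the console setup above. The instance type, count, subnet, and key name are illustrative, and the spark-hive-site classification is what the Use for Spark table metadata checkbox configures, as we understand it; adjust everything for your environment.

aws emr create-cluster \
  --name omics-analytics-cluster \
  --release-label emr-6.13.0 \
  --applications Name=Spark \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=emrKey,SubnetId=<SUBNET_ID> \
  --instance-type m5.xlarge \
  --instance-count 1 \
  --configurations '[
    {"Classification": "iceberg-defaults",
     "Properties": {"iceberg.enabled": "true"}},
    {"Classification": "spark-hive-site",
     "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}
  ]' \
  --region <AWS_REGION>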
SSH into your EMR Cluster
Here are a few helpful reminders when SSHing into EC2
- Ensure the EC2 Security Group attached to your EC2 Instance allows for inbound SSH connections (port 22) from your computer’s IP Address
- Set read-only permissions on the .pem file associated with your SSH key, e.g. chmod 400 emrKey.pem
- You can grab the SSH command from the Cluster’s detail page
You should now be able to SSH into the Primary Node of your EMR Cluster.
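The command takes the general form below; the key file name and the primary node’s public DNS are placeholders from your own cluster:

ssh -i emrKey.pem hadoop@<PRIMARY_NODE_PUBLIC_DNS>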
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip ~]$
Querying Your HealthOmics Analytics Store
You can use any of the Spark commands available to you; here we’ll make a query using the Spark SQL shell. In all cases the HealthOmics Analytics table is read-only, but you may create derivative tables and store them in your account.
- Replace <WAREHOUSE_LOCATION> with the full HealthOmics Analytics Store Amazon S3 Path
- This is the HealthOmics Analytics Store Amazon S3 Path retrieved from earlier in this blog, use it now or refer back to the Lake Formation section on how to retrieve it
- Replace <AWS_ACCOUNT> with your AWS Account ID
- Replace <AWS_REGION> with the region your HealthOmics Analytics Store and EMR cluster reside in.
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=<WAREHOUSE_LOCATION> \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.glue.lakeformation-enabled=true \
--conf spark.sql.catalog.my_catalog.glue.account-id=<AWS_ACCOUNT> \
--conf spark.sql.catalog.my_catalog.client.region=<AWS_REGION> \
--conf spark.sql.catalog.my_catalog.client.factory=org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory
You should now be able to run any of the following queries in the SparkSQL shell.
- Use the catalog name you defined in the configuration; this blog uses my_catalog
- Replace <MY_RESOURCE_LINK> with the name of your Database Resource Link
- Replace <TABLE_NAME> with your HealthOmics Analytics Store name or Variant Cohort you’d like to query
Describe the Table
spark-sql (default)> DESCRIBE my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME>;
Result
importjobid string
contigname string
start bigint
end bigint
names array<string>
referenceallele string
alternatealleles array<string>
qual double
filters array<string>
splitfrommultiallelic boolean
attributes map<string,string>
phased boolean
calls array<int>
genotypelikelihoods array<double>
phredlikelihoods array<int>
alleledepths array<int>
conditionalquality int
spl array<int>
depth int
ps int
sampleid string
information map<string,string>
annotations struct<vep:array<struct<allele:string,consequence:array<string>,impact:string,symbol:string,gene:string,feature_type:string,feature:string,biotype:string,exon:struct<rank:string,total:string>,intron:struct<rank:string,total:string>,hgvsc:string,hgvsp:string,cdna_position:string,cds_position:string,protein_position:string,amino_acids:struct<reference:string,variant:string>,codons:struct<reference:string,variant:string>,existing_variation:array<string>,distance:string,strand:string,flags:array<string>,symbol_source:string,hgnc_id:string,extras:map<string,string>>>>
Time taken: 1.291 seconds, Fetched 23 row(s)
Count the number of variants where the Allele Frequency (AF score) is > 0.5
spark-sql (default)> SELECT count(*) FROM my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME> where attributes['AF'] > 0.5;
Result
32
Time taken: 8.763 seconds, Fetched 1 row(s)
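As another example, since contigname is part of the schema shown above, you can summarize variant counts per contig; the query is illustrative and the output will depend on your data:

spark-sql (default)> SELECT contigname, count(*) AS variant_count FROM my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME> GROUP BY contigname ORDER BY variant_count DESC;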
Wrapping up
With the release of EMR 6.13.0, HealthOmics Analytics data is now available in your EMR cluster, unlocking more use cases and greater scale for genomic analytics. You should now be all set to run your HealthOmics Analytics Spark workloads on AWS. There are several ways to deploy EMR, such as on EC2, EKS, or Outposts; check out the EMR pricing page to estimate the cost for your particular use case. There are no additional costs for querying the data in your HealthOmics Analytics Store. Have fun, and though your EMR cluster should auto-terminate, don’t forget to clean up your resources when you’re finished: on the EMR console, select omics-analytics-cluster (or whatever name you gave your cluster) and then choose Terminate.