AWS for Industries
Unlock Powerful Genomic Insights with AWS HealthOmics Analytics and Amazon EMR
Analyzing large-scale genomic variant data just got easier with the latest release of Amazon EMR and its integration with AWS HealthOmics. HealthOmics Analytics provides tabular access to genetic variant data sets and annotations, empowering researchers and data scientists to uncover valuable insights through Amazon Athena queries and now Amazon EMR jobs.
While Amazon Athena excels at quick, interactive querying of data using SQL, Amazon EMR offers a more flexible big data platform for processing vast amounts of data. EMR leverages popular frameworks like Apache Spark, allowing you to go beyond the interactive model of Athena and gain greater flexibility and control over your compute environment and analytics workflow. This enables you to tackle complex, multi-stage data processing workflows via cost-effective EMR jobs. By combining HealthOmics Analytics Stores with the power of EMR, you can unlock a wide range of use cases, including genotype-phenotype association analysis, population-scale variant analysis, and integration with a diverse ecosystem of bioinformatics tools such as Hail, ADAM, GATK, and more. HealthOmics leverages AWS Lake Formation to ensure secure and centrally managed governance of your genomic data, giving you complete control over who can access it, including via EMR jobs. This blog post will guide you through the initial configuration and setup of EMR to help you get started with your first genomic data query.
At a high level
Today, we’ll be running a Spark job on EMR (1). EMR integrates seamlessly with Lake Formation to secure credentials and authenticate access to your genomic data (2). We will write our queries against our HealthOmics Analytics store in the AWS Glue Data Catalog (3), which pulls the secured data from HealthOmics (4).
Initial setup
This blog will assume you’ve created a HealthOmics Variant and/or Annotation store and have imported genomic and/or annotation data into it already. If you haven’t, see the HealthOmics Analytics Documentation to get started.
There are a few initial steps required to enable your EMR cluster to access HealthOmics Analytics data through AWS Lake Formation. You will need a role permissive enough to interact with EMR, IAM, and Lake Formation.
Configuring Permissions in IAM
The following steps modify the default EMR roles. In your own workflow you can use any roles you’d like, as long as they are given the correct IAM and Lake Formation permissions.
You can create the default EMR roles using the AWS Command Line Interface (AWS CLI):
aws emr create-default-roles --region <AWS_REGION>
This creates two IAM roles, EMR_EC2_DefaultRole and EMR_DefaultRole, which EMR will use.
EMR_EC2_DefaultRole
This role is assumed by all of the EC2 instances in the EMR cluster. Use it to grant the permissions needed to interact with other AWS services and resources as part of data processing and management tasks. In addition to the default AmazonElasticMapReduceforEC2Role managed policy it should already have, we will add the following inline policy to this role:
- Replace <AWS_ACCOUNT> with your AWS Account ID
- Replace <AWS_REGION> with the appropriate AWS Region
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LakeFormationPermission",
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    },
    {
      "Sid": "DefaultDBPermissions",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateDatabase"
      ],
      "Resource": [
        "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:catalog",
        "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT>:database/default"
      ]
    }
  ]
}
Breaking it down:
- The LakeFormationPermission statement grants the role the ability to request temporary credentials from Lake Formation to access the underlying data.
- The DefaultDBPermissions statement gives your cluster the ability to initialize with the default database. EMR’s catalog integration always checks whether the default database exists, and this statement allows the role to create it if necessary. If a default database already exists in Lake Formation, you can instead grant EMR_EC2_DefaultRole Describe permission on it and leave out this statement.
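If you prefer to apply this with the AWS CLI, here is a minimal sketch; it assumes you have saved the policy JSON above to a local file named emr-ec2-healthomics-policy.json (a file name chosen here for illustration):

# Attach the inline policy above to the EC2 instance role used by the cluster
aws iam put-role-policy \
  --role-name EMR_EC2_DefaultRole \
  --policy-name HealthOmicsAnalyticsAccess \
  --policy-document file://emr-ec2-healthomics-policy.json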
EMR_DefaultRole
This role is assumed by the Amazon EMR service to help manage resources and perform various actions on your behalf during the lifecycle of an EMR cluster. It should already be configured with the permissions needed for this blog.
Lake Formation
Create a data lake admin
If you have not already done so, you will need to create a data lake administrator. This allows you to create the resource links and grants required to access your data from EMR.
In the Lake Formation Console:
- In the left-side navigation panel, expand Administration
- Under Administration, choose Administrative roles and tasks
- On the page you will see a Data lake administrators box, choose Add
- Add an IAM role you can assume (e.g. the role you are currently using in the console) and Confirm
- You should now see that IAM role in the Data lake administrators list
- If that role is different from the one you are currently using, log into the console with that role
Here, a role named Admin is being made a data lake administrator, but you can use any role you’d like.
Change the default permissions model
This is necessary to let Lake Formation handle permissions on newly created resource links. Disable the use of IAM access controls for new databases and tables, and revoke the IAMAllowedPrincipals permission for database creators.
- In the left-side navigation panel, expand Administration
- Under Administration, choose Data Catalog settings
- On the page you will see a Default permissions for newly created databases and tables section
- Uncheck and Save the following options:
- Use only IAM access control for new databases
- Use only IAM access control for new tables in new databases
Application integration for full table access
In the Lake Formation Console:
- In the left-side navigation, expand Administration
- Under Administration, choose Application integration settings.
- On the Application integration settings page, choose the checkbox to Allow external engines to access data in Amazon S3 locations with full table access
This will authorize the EMR query engine to request credentials to query against the underlying data.
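If you prefer to script these Lake Formation settings, the sketch below uses the Lake Formation CLI. The admin role ARN is a placeholder, and the field names reflect the PutDataLakeSettings API as we understand it; note that put-data-lake-settings replaces the entire settings object, so merge the values below with the output of get-data-lake-settings rather than overwriting it blindly.

# Inspect the current settings first
aws lakeformation get-data-lake-settings --region <AWS_REGION>

# Sketch: add a data lake admin, remove the IAM-only default permissions,
# and allow external engines (such as EMR) full table access.
# Merge this JSON with your existing settings before applying.
aws lakeformation put-data-lake-settings \
  --region <AWS_REGION> \
  --data-lake-settings '{
    "DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<AWS_ACCOUNT>:role/Admin"}],
    "CreateDatabaseDefaultPermissions": [],
    "CreateTableDefaultPermissions": [],
    "AllowFullTableExternalDataAccess": true
  }'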
Take Note of the HealthOmics Analytics Store Amazon S3 Path
To find your HealthOmics Analytics Store Amazon S3 Path:
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, look for your HealthOmics Analytics store
- Copy the S3 path listed under the Amazon S3 Path column. Save this S3 path; we will use it later in the EMR cluster!
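If you’d rather look this up from the command line, here is a quick sketch that lists each Glue database with its location; the Amazon S3 Path column in the Lake Formation console should correspond to the database’s LocationUri, which is an assumption worth verifying against the console.

# List database names and S3 locations; look for your HealthOmics Analytics store
aws glue get-databases --region <AWS_REGION> \
  --query 'DatabaseList[].{Name:Name,Location:LocationUri}' --output table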
Create a Data Lake Resource Link
Now that we’ve configured the Lake Formation settings and created our Lake Formation admin, we can create a resource link to the HealthOmics Analytics store and grant the EMR_EC2_DefaultRole role permission to access it from EMR. If you previously used Amazon Athena to query your HealthOmics Analytics Store, this is the same process.
To learn more about how resource links work, see how resource links work in Lake Formation.
To create a resource link (a CLI sketch follows these steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose your HealthOmics Analytics store
- Under Actions, choose Create resource link
- Choose a Resource Link Name; this is the name of the database we will use in EMR
- Note: This name must be a compliant SQL database name. To avoid having to escape the name in your queries, we suggest using only lowercase letters and underscores
- Leave everything else to its pre-set value
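The console steps above can also be expressed as a single Glue CreateDatabase call with a TargetDatabase, which is how resource links are represented in the Data Catalog. A sketch, using variants_rl as an illustrative resource link name and a placeholder for the store’s database name:

# Sketch: create a resource link named variants_rl pointing at the
# HealthOmics Analytics store's database in this account's catalog
aws glue create-database --region <AWS_REGION> --database-input '{
  "Name": "variants_rl",
  "TargetDatabase": {
    "CatalogId": "<AWS_ACCOUNT>",
    "DatabaseName": "<HEALTHOMICS_STORE_DATABASE_NAME>"
  }
}'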
Grant EMR_EC2_DefaultRole Describe permissions on the Resource Link Database
Recall that when we created the default EMR roles, we set the lakeformation:GetDataAccess permission on EMR_EC2_DefaultRole. Since we will use this role to access our data through Lake Formation, we need to grant it Describe permission on the Resource Link.
To grant Describe on the Resource Link (an equivalent CLI call is sketched after these steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose the Resource Link you created in the previous step
- Under Actions, choose Grant
- Choose the EMR_EC2_DefaultRole to add
- Under Resource link permissions, choose Describe
- Select Grant to save
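The same grant can be made with the Lake Formation CLI; a sketch, with the account, Region, and resource link name as placeholders:

# Sketch: grant Describe on the resource link database to EMR_EC2_DefaultRole
aws lakeformation grant-permissions --region <AWS_REGION> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<AWS_ACCOUNT>:role/EMR_EC2_DefaultRole \
  --permissions DESCRIBE \
  --resource '{"Database": {"Name": "<MY_RESOURCE_LINK>"}}'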
Grant EMR_EC2_DefaultRole Select and Describe permissions on the Resource Link Table
Similar to the previous step, we are going to grant another set of permissions on the resource link. This time we will grant permissions on the HealthOmics Analytics store table that our Resource Link refers to.
To grant Select and Describe on the Resource Link table (again, a CLI sketch follows the steps):
- Ensure you’re using the Lake Formation Admin Role previously set
- In the left-side navigation panel, expand Data Catalog
- Under Data Catalog, choose Databases
- On the page you will see a Databases box, choose the Resource Link you created
- Under Actions, choose Grant
- Choose the EMR_EC2_DefaultRole to add
- Under Tables, choose All Tables or the table that matches the name of your HealthOmics Analytics Store or Variant Cohort
- Under Table permissions, choose Select and Describe
- Select Grant to save
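As a CLI sketch, this grant targets the table (or all tables) in the HealthOmics Analytics store’s database behind the resource link; placeholders as before:

# Sketch: grant Select and Describe on all tables in the target database.
# Replace TableWildcard with "Name": "<TABLE_NAME>" to grant on a single table.
aws lakeformation grant-permissions --region <AWS_REGION> \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<AWS_ACCOUNT>:role/EMR_EC2_DefaultRole \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"DatabaseName": "<HEALTHOMICS_STORE_DATABASE_NAME>", "TableWildcard": {}}}'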
These are all the permissions required to access the data from EMR. You should be able to verify your permissions for EMR_EC2_DefaultRole on the Permissions page.
At the minimum, you should have:
- Describe permission on the Resource Link Database
- Describe permission on the HealthOmics Analytics Store Table
- Select permission on all the columns in the HealthOmics Analytics Store Table
EMR Cluster Setup
We are now ready to set up EMR and query HealthOmics Analytics data.
Create an EMR Cluster
There are several ways to create and configure an EMR cluster; here we will create one with the smallest footprint required to access our HealthOmics Analytics data.
At the minimum, we will need:
- Amazon EMR Release emr-6.13.0 or greater
- Spark 3.4.1 or greater
- For AWS Glue Data Catalog settings, ensure Use for Spark table metadata is enabled
- A Software Settings configuration for the iceberg-defaults classification with iceberg.enabled set to true
To create a cluster in the console (an equivalent AWS CLI sketch follows these steps):
- In the left-side navigation panel, expand EMR on EC2 and choose Clusters
- On the page you will see a Clusters box, choose Create Cluster
- Name your Cluster (e.g. omics-analytics-cluster)
- Select an Amazon EMR Release greater than or equal to emr-6.13.0
- Under Application Bundle, choose Custom
- Choose Spark 3.4.1 or greater
- Under AWS Glue Data Catalog settings, choose Use for Spark table metadata
- Under Software Settings, set the iceberg-defaults config
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
- Make sure to set an EC2 Key pair so you can SSH into the cluster
- Under Identity and Access Management (IAM) roles, set the default EMR roles we created earlier
- Under Amazon EMR Service Role, choose EMR_DefaultRole
- Under EC2 instance profile for Amazon EMR, choose EMR_EC2_DefaultRole. Recall that EMR_EC2_DefaultRole is the role we granted database and table access to in Lake Formation earlier
- Click Create Cluster
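For reference, here is a minimal AWS CLI sketch that approximates the console setup above. The instance type, count, subnet, and key name are illustrative, and the spark-hive-site classification is what the Use for Spark table metadata checkbox configures, as we understand it; adjust everything for your environment.

aws emr create-cluster \
  --name omics-analytics-cluster \
  --release-label emr-6.13.0 \
  --applications Name=Spark \
  --service-role EMR_DefaultRole \
  --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=emrKey,SubnetId=<SUBNET_ID> \
  --instance-type m5.xlarge \
  --instance-count 1 \
  --configurations '[
    {"Classification": "iceberg-defaults",
     "Properties": {"iceberg.enabled": "true"}},
    {"Classification": "spark-hive-site",
     "Properties": {"hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}
  ]' \
  --region <AWS_REGION>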
SSH into your EMR Cluster
Here are a few helpful reminders when SSHing into EC2
- Ensure the EC2 Security Group attached to your EC2 Instance allows for inbound SSH connections (port 22) from your computer’s IP Address
- Set read-only permissions on the .pem file associated with your SSH key, e.g. chmod 400 emrKey.pem
- You can grab the SSH command from the Cluster’s detail page
You should now be able to SSH into the Primary Node of your EMR Cluster.
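The command takes the general form below; the key file name and the primary node’s public DNS are placeholders from your own cluster:

ssh -i emrKey.pem hadoop@<PRIMARY_NODE_PUBLIC_DNS>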
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip ~]$
Querying Your HealthOmics Analytics Store
You can use any of the Spark commands available to you; here we’ll make a query using the Spark SQL shell. In all cases the HealthOmics Analytics table is read-only, but you may create derivative tables and store them in your account.
- Replace <WAREHOUSE_LOCATION> with the full HealthOmics Analytics Store Amazon S3 Path
- This is the HealthOmics Analytics Store Amazon S3 Path retrieved from earlier in this blog, use it now or refer back to the Lake Formation section on how to retrieve it
- Replace <AWS_ACCOUNT> with your AWS Account ID
- Replace <AWS_REGION> with the region your HealthOmics Analytics Store and EMR cluster reside in.
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=<WAREHOUSE_LOCATION> \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.glue.lakeformation-enabled=true \
--conf spark.sql.catalog.my_catalog.glue.account-id=<AWS_ACCOUNT> \
--conf spark.sql.catalog.my_catalog.client.region=<AWS_REGION> \
--conf spark.sql.catalog.my_catalog.client.factory=org.apache.iceberg.aws.lakeformation.LakeFormationAwsClientFactory
You should now be able to run any of the following queries in the SparkSQL shell.
- Use the catalog name you defined in the configuration; this blog uses my_catalog
- Replace <MY_RESOURCE_LINK> with the name of your Database Resource Link
- Replace <TABLE_NAME> with your HealthOmics Analytics Store name or Variant Cohort you’d like to query
Describe the Table
spark-sql (default)> DESCRIBE my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME>;
Result
importjobid string
contigname string
start bigint
end bigint
names array<string>
referenceallele string
alternatealleles array<string>
qual double
filters array<string>
splitfrommultiallelic boolean
attributes map<string,string>
phased boolean
calls array<int>
genotypelikelihoods array<double>
phredlikelihoods array<int>
alleledepths array<int>
conditionalquality int
spl array<int>
depth int
ps int
sampleid string
information map<string,string>
annotations struct<vep:array<struct<allele:string,consequence:array<string>,impact:string,symbol:string,gene:string,feature_type:string,feature:string,biotype:string,exon:struct<rank:string,total:string>,intron:struct<rank:string,total:string>,hgvsc:string,hgvsp:string,cdna_position:string,cds_position:string,protein_position:string,amino_acids:struct<reference:string,variant:string>,codons:struct<reference:string,variant:string>,existing_variation:array<string>,distance:string,strand:string,flags:array<string>,symbol_source:string,hgnc_id:string,extras:map<string,string>>>>
Time taken: 1.291 seconds, Fetched 23 row(s)
Count the number of variants where the Allele Frequency (AF score) is > 0.5
spark-sql (default)> SELECT count(*) FROM my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME> where attributes['AF'] > 0.5;
Result
32
Time taken: 8.763 seconds, Fetched 1 row(s)
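As another example, since contigname is part of the schema shown above, you can summarize variant counts per contig; the query is illustrative and the output will depend on your data:

spark-sql (default)> SELECT contigname, count(*) AS variant_count FROM my_catalog.<MY_RESOURCE_LINK>.<TABLE_NAME> GROUP BY contigname ORDER BY variant_count DESC;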
Wrapping up
With the release of EMR 6.13.0, HealthOmics Analytics data is now available in your EMR cluster, unlocking more use cases and greater scale for genomic analytics. You should now be all set to run your HealthOmics Analytics Spark workloads on AWS. There are several ways to deploy EMR, such as on EC2, EKS, or Outposts; check out the EMR pricing page to estimate the cost for your particular use case. There are no additional costs for querying the data in your HealthOmics Analytics Store. Have fun, and though your EMR cluster should auto-terminate, don’t forget to clean up your resources when you’re finished: on the EMR console, select omics-analytics-cluster (or whatever name you gave your cluster) and then choose Terminate.