How can I use Hive and Spark on Amazon EMR to query an AWS Glue Data Catalog that's in a different AWS account?

Last updated: 2020-08-31

I want to access and query another account's AWS Glue Data Catalog using Apache Hive and Apache Spark in Amazon EMR.

Short description

In your Hive and Spark configurations, add the property "aws.glue.catalog.separator": "/". You can configure this property on a new cluster or on a running cluster.

Note: This solution is valid on Amazon EMR releases 5.28.0 and later. The EMR cluster and AWS Glue Data Catalog must be in the same Region.

Resolution

Set up cross-account access to AWS Glue, if you haven't already done so. Be sure that the Amazon Simple Storage Service (Amazon S3) bucket that the AWS Glue tables point to is configured for cross-account access. For more information, see How can I set up cross-account access for EMRFS?
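
Cross-account access to the Data Catalog is typically granted with an AWS Glue resource policy in the account that owns the catalog. The following is a minimal sketch, assuming that account 111122223333 owns the catalog and the demodb database from the examples below, that the EMR cluster runs in a hypothetical account 555555555555, and that both are in us-east-1; adjust the account IDs, Region, database, and actions for your environment:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::555555555555:root"
            },
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:database/demodb",
                "arn:aws:glue:us-east-1:111122223333:table/demodb/*"
            ]
        }
    ]
}

You can attach a policy like this with the aws glue put-resource-policy command in the catalog owner's account.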

Then, set the aws.glue.catalog.separator property to / for Hive and Spark:

On a new cluster

Add a configuration object similar to the following when you launch the cluster:

[
    {
        "Classification": "hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/"
        }
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/"
        }
    }    
]
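
For example, if you save this configuration as configurations.json, you can pass it to the AWS CLI when you launch the cluster. This is a sketch; the cluster name, release label, instance settings, and roles are placeholders for your own values:

aws emr create-cluster \
    --name "cross-account-glue-catalog" \
    --release-label emr-5.31.0 \
    --applications Name=Hive Name=Spark \
    --configurations file://configurations.json \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 3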

On a running cluster

  1. Open the Amazon EMR console.
  2. In the cluster list, under Name, choose the active cluster that you want to reconfigure.
  3. Open the cluster details page for the cluster and choose the Configurations tab.
  4. In the Filter dropdown list, select the instance group that you want to reconfigure.
  5. In the Reconfigure dropdown list, choose Edit in table.
  6. In the configuration classification table, choose Add configuration, and then enter the following:
    For Classification: hive-site
    For Property: aws.glue.catalog.separator
    For Value: /

    For Classification: spark-hive-site
    For Property: aws.glue.catalog.separator
    For Value: /
  7. Select Apply this configuration to all active instance groups, and then choose Save changes.
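
You can also apply the same reconfiguration from the AWS CLI with the aws emr modify-instance-groups command. This is a sketch; j-XXXXXXXXXXXXX and ig-XXXXXXXXXXXXX are placeholders for your cluster ID and instance group ID (aws emr describe-cluster returns both):

aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups '[
        {
            "InstanceGroupId": "ig-XXXXXXXXXXXXX",
            "Configurations": [
                {
                    "Classification": "hive-site",
                    "Properties": {"aws.glue.catalog.separator": "/"}
                },
                {
                    "Classification": "spark-hive-site",
                    "Properties": {"aws.glue.catalog.separator": "/"}
                }
            ]
        }
    ]'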

Query the tables in the other AWS account

To query a table that's in a different AWS account, specify the account number in the query. The account number is the same as the catalog ID. For example, to query demodb.tab1 in account 111122223333 in Hive:

select * from `111122223333/demodb.tab1` limit 5;
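
For example, you can run the same query noninteractively from the cluster's master node with the hive CLI (a usage sketch based on the example above):

hive -e 'select * from `111122223333/demodb.tab1` limit 5;'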

Spark example (run this in a spark-submit script or in a notebook):

spark.sql("select * from `111122223333/demodb`.tt1 limit 5").show()

You can also join tables across two catalogs. Hive example:

select * from `111122223333/demodb.tab1` t1 inner join `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2;

Spark example (run this in a spark-submit script or in a notebook):

spark.sql("select * from `111122223333/demodb.tab1` t1 inner join `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2").show()

Alternatively, pass the property with the --conf option when you start spark-submit or pyspark, or as a notebook shell command. Example:

pyspark --conf spark.hadoop.aws.glue.catalog.separator="/"
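
The same property works with spark-submit. A minimal sketch, where my_query_script.py is a hypothetical application name:

spark-submit --conf spark.hadoop.aws.glue.catalog.separator="/" my_query_script.py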
