How can I use Hive and Spark on Amazon EMR to query an AWS Glue Data Catalog that's in a different AWS account?

Last updated: 2020-08-31

I want to access and query another account's AWS Glue Data Catalog using Apache Hive and Apache Spark in Amazon EMR.

Short description

In your Hive and Spark configurations, add the property "aws.glue.catalog.separator": "/". You can configure this property on a new cluster or on a running cluster.

Note: This solution is valid on Amazon EMR releases 5.28.0 and later. The EMR cluster and AWS Glue Data Catalog must be in the same Region.

Resolution

Set up cross-account access to AWS Glue, if you haven't already done so. Be sure that the Amazon Simple Storage Service (Amazon S3) bucket that the AWS Glue tables point to is configured for cross-account access. For more information, see How can I set up cross-account access for EMRFS?
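
Cross-account access to the Data Catalog is typically granted with an AWS Glue resource policy in the account that owns the catalog. The following is a minimal sketch, assuming that account 111122223333 owns the catalog and the demodb database from the examples below, that the EMR cluster runs in a hypothetical account 555555555555, and that both are in us-east-1; adjust the account IDs, Region, database, and actions for your environment:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::555555555555:root"
            },
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:catalog",
                "arn:aws:glue:us-east-1:111122223333:database/demodb",
                "arn:aws:glue:us-east-1:111122223333:table/demodb/*"
            ]
        }
    ]
}

You can attach a policy like this with the aws glue put-resource-policy command in the catalog owner's account.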

Then, set the aws.glue.catalog.separator property to / for Hive and Spark:

On a new cluster

Add a configuration object similar to the following when you launch the cluster:

[
    {
        "Classification": "hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/"
        }
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "aws.glue.catalog.separator": "/"
        }
    }    
]
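
For example, if you save this configuration as configurations.json, you can pass it to the AWS CLI when you launch the cluster. This is a sketch; the cluster name, release label, instance settings, and roles are placeholders for your own values:

aws emr create-cluster \
    --name "cross-account-glue-catalog" \
    --release-label emr-5.31.0 \
    --applications Name=Hive Name=Spark \
    --configurations file://configurations.json \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 3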

On a running cluster

  1. Open the Amazon EMR console.
  2. In the cluster list, under Name, choose the active cluster that you want to reconfigure.
  3. Open the cluster details page for the cluster and choose the Configurations tab.
  4. In the Filter dropdown list, select the instance group that you want to reconfigure.
  5. In the Reconfigure dropdown list, choose Edit in table.
  6. In the configuration classification table, choose Add configuration, and then enter the following:
    For Classification: hive-site
    For Property: aws.glue.catalog.separator
    For Value: /

    For Classification: spark-hive-site
    For Property: aws.glue.catalog.separator
    For Value: /
  7. Select Apply this configuration to all active instance groups, and then choose Save changes.
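
You can also apply the same reconfiguration from the AWS CLI with the aws emr modify-instance-groups command. This is a sketch; j-XXXXXXXXXXXXX and ig-XXXXXXXXXXXXX are placeholders for your cluster ID and instance group ID (aws emr describe-cluster returns both):

aws emr modify-instance-groups --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups '[
        {
            "InstanceGroupId": "ig-XXXXXXXXXXXXX",
            "Configurations": [
                {
                    "Classification": "hive-site",
                    "Properties": {"aws.glue.catalog.separator": "/"}
                },
                {
                    "Classification": "spark-hive-site",
                    "Properties": {"aws.glue.catalog.separator": "/"}
                }
            ]
        }
    ]'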

Query the tables in the other AWS account

To query a table that's in a different AWS account, specify the account number in the query. The account number is the same as the catalog ID. For example, to query demodb.tab1 in account 111122223333 in Hive:

select * from `111122223333/demodb.tab1` limit 5;
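
For example, you can run the same query noninteractively from the cluster's master node with the hive CLI (a usage sketch based on the example above):

hive -e 'select * from `111122223333/demodb.tab1` limit 5;'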

Spark example (run this in a spark-submit script or in a notebook):

spark.sql("select * from `111122223333/demodb`.tt1 limit 5").show()

You can also join tables across two catalogs. Hive example:

select * from `111122223333/demodb.tab1` t1 inner join `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2;

Spark example (run this in a spark-submit script or in a notebook):

spark.sql("select * from `111122223333/demodb.tab1` t1 inner join `444455556666/demodb.tab2` t2 on t1.col1 = t2.col2").show()

Alternatively, pass the property with the --conf option when you start spark-submit or pyspark, or as a notebook shell command. Example:

pyspark --conf spark.hadoop.aws.glue.catalog.separator="/"
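
The same property works with spark-submit. A minimal sketch, where my_query_script.py is a hypothetical application name:

spark-submit --conf spark.hadoop.aws.glue.catalog.separator="/" my_query_script.py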
