AWS Big Data Blog

Cross-account AWS Glue Data Catalog access with Amazon Athena

Many AWS customers use a multi-account strategy. A centralized AWS Glue Data Catalog is important to minimize the amount of administration related to sharing metadata across different accounts. This post introduces capability that allows Amazon Athena to query a centralized Data Catalog across different AWS accounts.

Overview of solution

In late 2019, AWS introduced the ability to Connect Amazon Athena to your Apache Hive Metastore. Using this functionality, you can also configure Athena to point to a Data Catalog in a different account. The Hive Metastore functionality uses an AWS Lambda function to federate queries to your Data Catalog of choice. You can use this same functionality to proxy catalog queries to a Data Catalog in a different account.

The following diagram shows necessary components used in two different accounts and the flow between them for cross account Glue Data Catalog access using Athena:

In this walkthrough, you create the Lambda function in the same account in which you run Athena queries (Account B). You grant Cross-Account Access to the Lambda function using a resource policy, which allows the function in Account B to query the Data Catalog in Account A. Users in Account B must have access to the Amazon S3 resources that the tables point to and have access to execute the Lambda function. For more information about implementing the Lambda function, see the Github repo.

This post also provides an AWS CloudFormation stack to create the Lambda function and a read-only IAM role for the function. This post uses data from the Registry of Open Data on AWS. You don’t need to worry about ensuring cross-account access to the data on S3.

Prerequisites

For this walkthrough, you need to have the following prerequisites:

  • An AWS account (Account B from the diagram) with the ability to create IAM roles, Lambda resources, and run Athena queries
  • A second AWS account (Account A) where you can create a Data Catalog
  • AWS CLI admin access to AWS Glue in Account A

As part of this post, you create a read-only IAM role (which the CloudFormation stack launches) in Account B for the Lambda function in Account B to access the Data Catalog in Account A.

This role has two policies attached to it. The first policy provides the Lambda function read-only access to the specified resources in the cross account Data Catalog. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "glue:BatchGetPartition",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<account-id-A>:catalog",
                "arn:aws:glue:us-east-1:<account-id>:database/<database-name>",
                "arn:aws:glue:us-east-1:<account-id-A>:table/<database-name>/<table-name>"
            ],
            "Effect": "Allow"
        }
    ]
}

The second policy provides the Lambda function write permissions to CloudWatch Logs. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

You also create the following additional resources:

  • A Lambda function (which the CloudFormation stack launches) in Account B that proxies Data Catalog requests
  • Cross-account execution access granted to the Lambda function by adding a policy to the Data Catalog in Account A
  • A sample table in the Account A

The CloudFormation stack in this post creates the Lambda function in Account B and grants the necessary access by attaching above two policies. Click this button to deploy the CloudFormation template in us-east-1 region from the CloudFormation console.

Please input values for parameters to launch stack as described below:

  • GlueDataCatalogAccountID: Determine the account ID from the account (Account A) that serves as your centralized Data Catalog and fill that in as a value
  • Region: us-east-1
  • DatabaseName: You are giving Lambda function access to this database in cross-account Glue Catalog. For the purpose of this blog, enter opendata
  • TableName: You are giving Lambda function access to this table in above specified database in cross-account Glue Catalog. For the purpose of this blog, enter * . This will give access to all tables in above specified database

Now, Hit Next button until you see button to create stack. When the stack is finished provisioning, you are ready to begin the walkthrough.

Note: If you want to allow Lambda function to access more databases and tables, you can edit the inline IAM policy AWSGlueReadOnlyAccess attached to the IAM Role created by the CloudFormation stack to add more resources.

Granting cross-account access

After you create a Lambda function in the account in which you want to run Athena queries accessing the cross-account Data Catalog, you need to register the Lambda function as a data source inside of Athena.

Registering the central Data Catalog

First, create a workgroup to access the preview functionality. Then follows steps on connect Athena to an Apache Hive Metastore. On the Connection details page, for Lambda function, select the Lambda function that the CloudFormation stack created.  It should look like <StackName>-AthenaCrossAccountLambdaFunc-<StackID>. Name your Data Catalog centraldata.

Granting cross-account execution access

Attach a policy that grants access to the IAM role in Account B to read databases and tables in Account A’s Data Catalog. Run the following command in Account A, but make the following changes:

  • Replace the principal with the value of CrossAccountPrincipal in the Outputs tab of the CloudFormation stack (from Account B).
  • Replace <ACCOUNT_ID> with the AWS account ID of Account A.

In the resource section, replace <database-name> with the database’s name from the cross-account Catalog that you intend to query. This post uses opendata.

aws glue put-resource-policy --policy-exists-condition NOT_EXIST --policy-in-json '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables"
      ],
      "Principal": {"AWS": [
        "<REPLACE_WITH_CLOUDFORMATION_CrossAccountPrincipal_VALUE>"
      ]},
      "Resource": [
        "arn:aws:glue:us-east-1:<ACCOUNT_ID_A>:catalog",
        "arn:aws:glue:us-east-1:<ACCOUNT_ID_A>:database/<database-name>",
        "arn:aws:glue:us-east-1:<ACCOUNT_ID_A>:table/<database-name>/*"
      ]
    }
  ]
}'

The preceding code attaches a policy to the Data Catalog if there is no existing policy already attached. If there is one and you wish to overwrite it, remove --policy-exists-condition NOT_EXIST from the code. It replaces the existing policy with the value of –policy-in-json. You can also edit the policy directly in the AWS Glue console to merge your existing policy with the value of --policy-in-json.

Executing queries across accounts

Now that you’ve created your Lambda function and registered it in Athena, you can run queries across accounts.

Create a sample table of Amazon Customers Reviews from the Registry of Open Data on AWS that you can use to query in Account B. Complete the following steps:

  1. Log in to Account A and open the Athena console.
  2. Run the following three queries, one at a time:
    • To create the database for this example, run the following query:CREATE DATABASE opendata
    • To create the sample table, run the following query:
      CREATE EXTERNAL TABLE amazon_reviews_parquet(
        marketplace string, 
        customer_id string, 
        review_id string, 
        product_id string, 
        product_parent string, 
        product_title string, 
        star_rating int, 
        helpful_votes int, 
        total_votes int, 
        vine string, 
        verified_purchase string, 
        review_headline string, 
        review_body string, 
        review_date bigint, 
        year int)
      PARTITIONED BY (product_category string)
      ROW FORMAT SERDE 
        'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
      STORED AS INPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
      OUTPUTFORMAT 
        'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION
        's3://amazon-reviews-pds/parquet/'
    • To add partitions to the table, run the following query:MSCK REPAIR TABLE amazon_reviews_parquet
  3. Log in to the Athena account (Account B).
  4. From the Athena console, enter the following query:SELECT * FROM centraldata.opendata.amazon_reviews_parquet LIMIT 10;

The following screenshot shows the 100 random rows resulted from the above query:

This query runs in the Athena account but accesses the centralized Data Catalog in your primary account. Athena then retrieves the data from S3 and processes the query in the Athena account. You can even join data across accounts or create derivative tables in one account using data from another.

Creating a derivative table in Account B from the data in Account A

For this post, assume you want to create a derivative dataset from the Amazon Reviews table to determine if customers leaving helpful reviews in one product category also leave reviews in other categories.

Using the CREATE TABLE AS SELECT functionality in Athena to create a new table in the account that you’re running your Athena queries in, using the data from the centralized Data Catalog. You can create a table in your default database that finds customers that have written over 1,000 reviews in the Toys product category. See the following code:

CREATE TABLE default.helpful_reviewers
WITH (
  format='PARQUET'
) AS
SELECT customer_id,
    SUM(helpful_votes) AS total_helpful_votes,
    SUM(total_votes) AS total_votes,
    AVG(star_rating) AS avg_star_rating,
    SUM(total_votes)-SUM(helpful_votes) AS vote_differential,
    ROUND(SUM(helpful_votes)/CAST(SUM(total_votes) AS double),4) AS pct_helpful
FROM centraldata.opendata.amazon_reviews_parquet 
WHERE product_category='Toys' 
GROUP BY 1
HAVING SUM(total_votes) > 1000
ORDER BY 2 DESC

After you run the query, you should see the message Query successful. You can now review the resulting data. Run following query:

SELECT * FROM default.helpful_reviewers LIMIT 100;

The following screenshot shows the 10 random rows resulted from the above query on default.helpful_reviewers:

The results show that some people leave numerous reviews. You can now run queries over this derivative dataset in your account, which can be more efficient because you’ve already done some aggregation on the data that’s relevant to your research.

Joining tables across accounts

You can also look at what other product categories have reviews from the same people. The following query selects data from the table in Account B and joins it with the original data in Account A:

SELECT customer_id,
         product_category,
         COUNT(*) AS count
FROM centraldata.opendata.amazon_reviews_parquet
WHERE customer_id IN 
    (SELECT customer_id
    FROM default.helpful_reviewers)
        AND product_category != 'Toys'
GROUP BY  1, 2
ORDER BY  3 DESC

The following screenshot shows the result of above query which consists of customers that have written over 1,000 reviews in the Toys product category, product categories other than “toy” and number of reviews left by a customer in a specific product category:

You can see that people that left reviews in the Toys category also left numerous reviews in the Books and Video categories.

Cleaning up

To avoid incurring future charges, unregister the Data Catalog entry from your Athena account and delete the CloudFormation stack.

Limitations

While this is an effective way to share a Data Catalog across one or more accounts, there are the following limitations to this approach:

  • Authorization – The Lambda function executes as the IAM role that you created as part of the CloudFormation template, so any IAM users or roles querying this Data Catalog across accounts can see the same metadata from the Data Catalog that the IAM role can access. Access policies on S3 still take effect.
  • Read-only – The current implementation only implements the necessary functions for read-only access because it is assumed that a central team also manages the centralized Data Catalog.

Conclusion

This post showed how to use the Athena External Hive Metastore functionality to query a Data Catalog across AWS accounts. You also created derivative datasets in your account and joined data between these two accounts. The code for the Lambda function is open source; you are welcome to take a look and contribute to the GitHub repo.

 


About the Author

Pathik Shah is a big data architect for Amazon EMR at AWS.