AWS Partner Network (APN) Blog

Fully Managed Data Governance with Amazon EMR Integration with Apache Ranger and Privacera

By Don Bosco Durai, CTO – Privacera
By Changbin Gong, Sr. Solutions Architect – AWS
By Navnit Shukla, Solutions Architect – AWS
By Varun Rao Bhamidimarri, Sr. Manager – AWS

Apache Ranger is an open-source project that provides authorization and audit capabilities for big data applications like Apache Hive, PrestoDB, Trino, and Apache Kafka.

Starting with Amazon EMR 5.32, Amazon EMR includes Ranger plugins for SparkSQL, Amazon Simple Storage Service (Amazon S3), and Apache Hive to integrate with Apache Ranger 2.0 and enable authorization and audit capabilities.

Users can set up a multi-tenant EMR cluster, use Kerberos for user authentication, use Apache Ranger 2.0 for authorization, and configure fine-grained data access policies for databases, tables, columns, and S3 objects.

For this integration, the Apache Ranger server needs to be self-deployed and managed separately outside the EMR cluster. This can introduce challenges for customers who do not have the resources or expertise to manage the Ranger server themselves.

PrivaceraCloud is a fully-managed software-as-a-service (SaaS) data access governance solution that works with Apache Ranger’s integration with Amazon EMR.

Privacera is an AWS Partner that provides security and privacy tools for enterprises to secure and govern user access to databases and datastores in the cloud. PrivaceraCloud reduces the burden of self-managing Apache Ranger by providing Ranger as a hosted service. It provides centralized management of data access, authorization policies, and auditing.

This post outlines how Amazon EMR can integrate with PrivaceraCloud to provide a fully-managed data governance solution.

Solution Overview

The Apache Ranger plugins for Amazon EMR can be configured to use the policies from PrivaceraCloud by setting two properties: the Ranger Admin URL for policies and the Solr URL for audits.

Additionally, multiple EMR clusters can point to the same PrivaceraCloud account, so customers need only configure policies once for consistency across multiple EMR clusters. You can configure EMR to use PrivaceraCloud as the external Apache Ranger Policy Server.

The following diagram shows a logical architecture for Amazon EMR integration with PrivaceraCloud.

Figure 1 – PrivaceraCloud-Amazon EMR Spark cluster architecture.

Now, let’s walk through the process of integrating Amazon EMR with PrivaceraCloud. First, you need to set up a certificate in AWS Secrets Manager, and set up an AWS Identity and Access Management (IAM) role for EMR and Apache Ranger integration.

Next, configure Ranger policies for Amazon S3 and Spark and Hive. You can also set up audit logs in PrivaceraCloud. After you complete these prerequisites, you can set up your EMR resources and integrate them with PrivaceraCloud.

Prerequisites

Before getting started, you must complete the following prerequisites.

Set Up a Certificate in AWS Secrets Manager

Amazon EMR native Ranger integration requires mutual Transport Layer Security (TLS) between the Ranger plugins and the Privacera Ranger Admin. The certificates must be uploaded to AWS Secrets Manager, and the Amazon Resource Names (ARNs) of the resulting secrets are specified in the EMR security configuration.

You’ll need to download the Ranger Admin Plugin Cert and Ranger Client KeyPair from the Privacera Ranger Admin server, as shown below.

Figure 2 – Download Ranger Admin Certificate and Ranger Client KeyPair from PrivaceraCloud.

After download, follow these instructions to upload the certificate and key pair in AWS Secrets Manager.
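
As a rough sketch of the upload step, the snippet below builds the request parameters for storing the downloaded PEM material as secrets. The helper `build_secret_request`, the secret names, and the placeholder PEM strings are hypothetical, and storing the PEM text as a plaintext secret string is an assumption to check against the linked instructions. The resulting dictionaries match the keyword arguments of the boto3 Secrets Manager `create_secret` call, which is left commented out so the sketch runs without AWS credentials.

```python
def build_secret_request(name: str, pem_text: str) -> dict:
    """Build keyword arguments for the Secrets Manager create_secret call.

    Assumption: the PEM material is stored as a plaintext SecretString.
    """
    if "-----BEGIN" not in pem_text or "-----END" not in pem_text:
        raise ValueError(f"{name}: content does not look like PEM text")
    return {"Name": name, "SecretString": pem_text.strip() + "\n"}


# Placeholder PEM text; in practice, read the files downloaded from PrivaceraCloud.
admin_cert_pem = "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----"
client_keypair_pem = (
    "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n"
    "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----"
)

# Hypothetical secret names, one per downloaded artifact.
secret_requests = [
    build_secret_request("emr/ranger-admin-cert", admin_cert_pem),
    build_secret_request("emr/ranger-client-keypair", client_keypair_pem),
]

for request in secret_requests:
    print("would create secret:", request["Name"])
    # With AWS credentials configured, each request could be sent like this:
    # import boto3
    # arn = boto3.client("secretsmanager").create_secret(**request)["ARN"]
```

The ARN returned for each secret is the value that the EMR security configuration refers to later.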

Set Up an IAM Role for Amazon EMR and Apache Ranger Integration

Follow these instructions to set up the required IAM roles.

Configure Ranger Policies for Amazon S3 and Spark and Hive

On the Privacera Ranger Admin server, configure the Hive and Amazon S3 policies. Spark and Hive use the same privacera_hive (service type – hive) policy definition. For Amazon S3, set up policies under the service named privacera_s3.

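A sample policy for Spark is shown in Figure 3. Because Spark and Hive share the privacera_hive policy definition, the figure shows a Hive policy. Note that only the “Select” permission is supported for Spark because the EMR record server currently supports only reads.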

Figure 3 – Sample Ranger policy for Hive in Privacera Ranger Admin Server.
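
For readers who prefer to see the policy as data, here is a minimal sketch of how a select-only policy could be expressed in the Apache Ranger policy JSON model. The database, table, column, and user names are hypothetical, and `build_select_policy` is an illustrative helper, not part of any Privacera or AWS tooling. This post configures policies through the Privacera Ranger Admin UI; whether your PrivaceraCloud account accepts the same JSON through a REST endpoint is an assumption to verify.

```python
import json


def build_select_policy(service, database, table, columns, users):
    """Return a Ranger-style policy dict granting only the select access type."""
    return {
        "service": service,  # Ranger service (policy repository) name
        "name": f"{database}.{table} select",
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "users": list(users),
                # Only select is granted, matching the read-only record server.
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


# Hypothetical resource and user names for illustration only.
policy = build_select_policy(
    service="privacera_hive",
    database="sales",
    table="orders",
    columns=["order_id", "amount"],
    users=["analyst1"],
)
print(json.dumps(policy, indent=2))
```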

A sample policy for Amazon S3, meanwhile, looks like this:

Figure 4 – A Sample Ranger Policy for Amazon S3 in Privacera Ranger Admin Server.

Set Up Audit Logs in Privacera

To set up audit logs, go to PrivaceraCloud and copy the audit setup script URL. You will use this URL later in an EMR bootstrap action.

Figure 5 – Get audit setup script URL from PrivaceraCloud.

Setting Up Your Resources

Step 1: Configure Amazon EMR Security Configuration

Create a new EMR security configuration that contains the Kerberos server and Ranger integration details. This security configuration is later attached to the EMR cluster.

Follow these instructions to set up the EMR security configuration. For the Admin server address, use the Ranger Admin mTLS URL, as shown in the following figure. Use privacera_hive and privacera_s3 as the service definition names.

Figure 6 – Get Ranger mTLS URL from PrivaceraCloud.
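
To make the shape of this step concrete, the sketch below assembles a security configuration document as a Python dictionary. The key names follow the EMR security configuration format for Ranger as we understand it, so confirm them against the linked instructions. The admin URL, account ID, role names, and secret ARNs are placeholders, `build_security_configuration` is a hypothetical helper, and audit-related settings are omitted. The boto3 `create_security_configuration` call is commented out so the sketch runs without AWS credentials.

```python
import json


def build_security_configuration(admin_url, admin_secret_arn, client_secret_arn,
                                 plugin_role_arn, other_services_role_arn):
    """Return an EMR security configuration dict with Kerberos and Ranger settings."""

    def plugin(app, repository):
        # One entry per EMR Ranger plugin, pointing at its Ranger service name.
        return {
            "App": app,
            "ClientSecretARN": client_secret_arn,
            "PolicyRepositoryName": repository,
        }

    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {"TicketLifetimeInHours": 24},
            }
        },
        "AuthorizationConfiguration": {
            "RangerConfiguration": {
                "AdminServerURL": admin_url,
                "RoleForRangerPluginsARN": plugin_role_arn,
                "RoleForOtherAWSServicesARN": other_services_role_arn,
                "AdminServerSecretARN": admin_secret_arn,
                "RangerPluginConfigurations": [
                    plugin("Spark", "privacera_hive"),  # Spark reuses the Hive policies
                    plugin("Hive", "privacera_hive"),
                    plugin("EMRFS-S3", "privacera_s3"),
                ],
            }
        },
    }


# Every value below is a placeholder, not a real endpoint or ARN.
config = build_security_configuration(
    admin_url="https://ranger-mtls.example.com",
    admin_secret_arn="arn:aws:secretsmanager:us-east-1:111122223333:secret:emr/ranger-admin-cert",
    client_secret_arn="arn:aws:secretsmanager:us-east-1:111122223333:secret:emr/ranger-client-keypair",
    plugin_role_arn="arn:aws:iam::111122223333:role/ExampleRangerPluginRole",
    other_services_role_arn="arn:aws:iam::111122223333:role/ExampleOtherServicesRole",
)
config_json = json.dumps(config, indent=2)
print(config_json)

# With AWS credentials configured, the document could be registered like this:
# import boto3
# boto3.client("emr").create_security_configuration(
#     Name="privacera-ranger-security-config", SecurityConfiguration=config_json)
```

Because Spark and Hive share one policy definition, both plugin entries point at privacera_hive, while the EMRFS plugin points at privacera_s3.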

Step 2: Set Up Amazon EMR Cluster

Follow these instructions to set up a new EMR cluster. To set up audits, follow the instructions in the section “Create EMR Cluster” in the Privacera and EMR Ranger integration guide to add a new EMR bootstrap action.
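
If you script cluster creation, the request could look roughly like the sketch below, which builds keyword arguments for the boto3 EMR `run_job_flow` call. The cluster name, instance types, role names, Kerberos realm, password, and S3 script path are all placeholders, and `build_cluster_request` is a hypothetical helper. How the Privacera audit setup script is wrapped into a bootstrap action is defined in the Privacera and EMR Ranger integration guide; the single script path used here is only an assumption.

```python
def build_cluster_request(security_configuration, bootstrap_script_path, kdc_admin_password):
    """Return keyword arguments for the boto3 EMR run_job_flow call."""
    return {
        "Name": "privacera-ranger-demo",
        "ReleaseLabel": "emr-5.32.0",  # Ranger plugins are included starting with EMR 5.32
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "privacera-audit-setup",
                "ScriptBootstrapAction": {"Path": bootstrap_script_path, "Args": []},
            }
        ],
        # Name of the security configuration created in Step 1.
        "SecurityConfiguration": security_configuration,
        "KerberosAttributes": {
            "Realm": "EC2.INTERNAL",
            "KdcAdminPassword": kdc_admin_password,
        },
        # Placeholder role names; use the IAM roles from the prerequisites.
        "JobFlowRole": "ExampleEmrEc2InstanceProfile",
        "ServiceRole": "ExampleEmrServiceRole",
    }


cluster_request = build_cluster_request(
    security_configuration="privacera-ranger-security-config",
    bootstrap_script_path="s3://example-bucket/bootstrap/privacera-audit-setup.sh",
    kdc_admin_password="replace-me",
)
print(sorted(cluster_request))

# With AWS credentials configured, the cluster could be launched like this:
# import boto3
# cluster_id = boto3.client("emr").run_job_flow(**cluster_request)["JobFlowId"]
```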

When setup is complete, you can use the Privacera Ranger Admin server to configure authorization policies and view the audit logs.

Figure 7 – Configure authorization policies and audit logs in Privacera Ranger Admin server.

Limitations

Existing Amazon EMR application limitations apply to this solution.

Although Hive policy definitions are reused for Spark, all existing EMR record server plugin limitations still apply. Note that the Spark plugin doesn’t support column masking or row-level filters, and SparkSQL “INSERT INTO” and “INSERT OVERWRITE” overrides aren’t supported.

For more information, see Apache Ranger plugin limitations.

Cleaning Up

When you finish testing this solution, keep in mind you are charged for any resources that remain running. Follow the instructions in the Privacera and EMR Ranger integration guide to clean up the related resources created by this solution.

Conclusion

This post provides a solution to integrate the Amazon EMR Apache Ranger plugins with PrivaceraCloud. It extends Apache Ranger capabilities with PrivaceraCloud and provides fully automated data access governance for AWS environments.

For more information, see the following resources:



Privacera – AWS Partner Spotlight

Privacera is an AWS Partner that provides security and privacy tools for enterprises to secure and govern user access to databases and datastores in the cloud.
