Fully Managed Data Governance with Amazon EMR Integration with Apache Ranger and Privacera
By Don Bosco Durai, CTO – Privacera
By Changbin Gong, Sr. Solutions Architect – AWS
By Navnit Shukla, Solutions Architect – AWS
By Varun Rao Bhamidimarri, Sr. Manager – AWS
Apache Ranger is an open-source project that provides authorization and audit capabilities for big data applications like Apache Hive, PrestoDB, Trino, and Apache Kafka.
Starting with Amazon EMR 5.32, Amazon EMR includes Ranger plugins for SparkSQL, Amazon Simple Storage Service (Amazon S3), and Apache Hive to integrate with Apache Ranger 2.0 and enable authorization and audit capabilities.
Users can set up a multi-tenant EMR cluster, use Kerberos for user authentication, use Apache Ranger 2.0 for authorization, and configure fine-grained data access policies for databases, tables, columns, and S3 objects.
For this integration, the Apache Ranger server needs to be self-deployed and managed separately outside the EMR cluster. This can introduce challenges for customers who do not have the resources or expertise to manage the Ranger server themselves.
PrivaceraCloud is a fully-managed software-as-a-service (SaaS) data access governance solution that works with Apache Ranger’s integration with Amazon EMR.
Privacera is an AWS Partner that provides security and privacy tools for enterprises to secure and govern user access to databases and datastores in the cloud. PrivaceraCloud reduces the burden of self-managing Apache Ranger by providing Ranger as a hosted service. It provides centralized management of data access, authorization policies, and auditing.
This post outlines how Amazon EMR can integrate with PrivaceraCloud to provide a fully-managed data governance solution.
You can configure EMR to use PrivaceraCloud as the external Apache Ranger policy server. Because multiple EMR clusters can point to the same PrivaceraCloud account, you need to configure policies only once to keep them consistent across clusters.
The following diagram shows a logical architecture for Amazon EMR integration with PrivaceraCloud.
Figure 1 – PrivaceraCloud-Amazon EMR Spark cluster architecture.
Now, let’s walk through the process of integrating Amazon EMR with PrivaceraCloud. First, you need to set up a certificate in AWS Secrets Manager, and set up an AWS Identity and Access Management (IAM) role for EMR and Apache Ranger integration.
Next, configure Ranger policies for Amazon S3 and Spark and Hive. You can also set up audit logs in PrivaceraCloud. After you complete these prerequisites, you can set up your EMR resources and integrate them with PrivaceraCloud.
Before getting started, you must complete the following prerequisites.
Set Up a Certificate in AWS Secrets Manager
Amazon EMR native Ranger integration requires mutual Transport Layer Security (TLS) between the Ranger plugins and the Privacera Ranger Admin. The certificates must be uploaded to AWS Secrets Manager, and their Amazon Resource Names (ARNs) are specified later in the EMR security configuration.
You’ll need to download the Ranger Admin Plugin Cert and Ranger Client KeyPair from the Privacera Ranger Admin server, as shown below.
Figure 2 – Download Ranger Admin Certificate and Ranger Client KeyPair from PrivaceraCloud.
After download, follow these instructions to upload the certificate and key pair in AWS Secrets Manager.
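If you prefer to script the upload, the following is a minimal sketch using boto3. It assumes the files downloaded from PrivaceraCloud are PEM text; the secret names, file names, and Region in the usage comment are hypothetical placeholders, not names required by EMR or Privacera.

```python
from pathlib import Path


def build_secret_request(name: str, pem_path: str) -> dict:
    """Build the keyword arguments for Secrets Manager's create_secret call."""
    pem_text = Path(pem_path).read_text()
    # Basic sanity check: PEM files contain "-----BEGIN ..." markers.
    if "-----BEGIN" not in pem_text:
        raise ValueError(f"{pem_path} does not look like a PEM file")
    return {"Name": name, "SecretString": pem_text}


def upload_secret(name: str, pem_path: str, region: str) -> str:
    """Create the secret and return its ARN (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.create_secret(**build_secret_request(name, pem_path))
    return response["ARN"]


# Hypothetical usage (names and paths are placeholders):
# admin_arn = upload_secret("ranger-admin-cert", "ranger-admin-cert.pem", "us-east-1")
# client_arn = upload_secret("ranger-client-keypair", "ranger-client-keypair.pem", "us-east-1")
```

Keep the returned ARNs; they are the values the EMR security configuration refers to.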
Set Up an IAM Role for Amazon EMR and Apache Ranger Integration
Follow these instructions to set up the required IAM roles.
Configure Ranger Policies for Amazon S3 and Spark and Hive
On the Privacera Ranger Admin server, configure the Hive and Amazon S3 policies. Spark and Hive will use the same privacera_hive (service type – hive) policy definition. On Amazon S3, set up a policy named privacera_s3.
A sample policy for Spark, which shares the Hive policy definition, is shown in Figure 3. Only the “Select” permission is supported, because the EMR record server currently supports only reads.
Figure 3 – Sample Ranger policy for Hive in Privacera Ranger Admin Server.
A sample policy for Amazon S3, meanwhile, looks like this:
Figure 4 – A Sample Ranger Policy for Amazon S3 in Privacera Ranger Admin Server.
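For readers who think in terms of policy documents rather than screenshots, here is a rough sketch of how a read-only policy like the one in Figure 3 could be expressed in Apache Ranger's JSON policy model. The database, table, column, and user names are hypothetical, and this post does not cover whether or how PrivaceraCloud accepts such a document through an import or API, so treat it purely as an illustration of the policy's shape.

```python
import json


def build_select_policy(service, name, database, table, columns, users):
    """Return a Ranger-style policy dict granting read-only (select) access."""
    return {
        "service": service,  # Ranger service (policy repository) name
        "name": name,
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "users": list(users),
                # Only "select" is granted, matching the read-only
                # restriction described above.
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


# Hypothetical example: let user "analyst1" read two columns of sales.orders.
policy = build_select_policy(
    service="privacera_hive",
    name="sales-orders-read",
    database="sales",
    table="orders",
    columns=["order_id", "order_date"],
    users=["analyst1"],
)
print(json.dumps(policy, indent=2))
```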
Set Up Audit Logs in Privacera
To set up audit logs, go to PrivaceraCloud and copy the audit setup script URL. You will use this URL later in an EMR bootstrap action.
Figure 5 – Get audit setup script URL from PrivaceraCloud.
Setting Up Your Resources
Step 1: Configure Amazon EMR Security Configuration
Create a new EMR security configuration that contains the Kerberos server and Ranger integration details. You will attach this security configuration to the EMR cluster.
Follow these instructions to set up the EMR security configuration. For the Admin server address, use the Ranger Admin mTLS URL, as shown in the following figure. Use privacera_hive and privacera_s3 as the service definition names.
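As an illustration of what this step produces, the sketch below assembles a security configuration document and shows how it could be registered with boto3. The JSON field names reflect the EMR security configuration format for Ranger as we understand it and should be checked against the linked instructions; the cluster-dedicated KDC, the 24-hour ticket lifetime, and all ARNs and URLs passed in are assumptions or placeholders. Audit destination settings are left out of this sketch.

```python
import json


def build_security_configuration(admin_url, admin_cert_secret_arn,
                                 client_keypair_secret_arn,
                                 plugin_role_arn, user_role_arn):
    """Return an EMR security configuration dict with Kerberos and Ranger."""

    def plugin(app, repository):
        # One entry per Ranger plugin; the repository is the Ranger service name.
        return {
            "App": app,
            "ClientSecretARN": client_keypair_secret_arn,
            "PolicyRepositoryName": repository,
        }

    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                # Assumption: a KDC dedicated to the cluster.
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {"TicketLifetimeInHours": 24},
            }
        },
        "AuthorizationConfiguration": {
            "RangerConfiguration": {
                "AdminServerURL": admin_url,  # the Ranger Admin mTLS URL
                "RoleForRangerPluginsARN": plugin_role_arn,
                "RoleForOtherAWSServicesARN": user_role_arn,
                "AdminServerSecretARN": admin_cert_secret_arn,
                "RangerPluginConfigurations": [
                    # Spark and Hive share the privacera_hive definition.
                    plugin("Spark", "privacera_hive"),
                    plugin("Hive", "privacera_hive"),
                    plugin("EMRFS-S3", "privacera_s3"),
                ],
            }
        },
    }


def create_security_configuration(name, config, region):
    """Register the configuration with EMR (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    emr.create_security_configuration(
        Name=name, SecurityConfiguration=json.dumps(config)
    )
    return name
```

Keeping the builder separate from the AWS call lets you print and review the JSON before anything is created in your account.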
Figure 6 – Get Ranger mTLS URL from PrivaceraCloud.
Step 2: Set Up Amazon EMR Cluster
Follow these instructions to set up a new EMR cluster. To set up audits, follow the instructions in the section “Create EMR Cluster” in the Privacera and EMR Ranger integration guide to add a new EMR bootstrap action.
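To show how the pieces fit together, here is a hedged sketch of a cluster launch request built for boto3's run_job_flow call. It assumes the audit setup script has been made available at an S3 location, since EMR bootstrap actions typically reference scripts by S3 path; the Privacera guide defines how the script is actually obtained and invoked. The instance types and counts, the Kerberos realm, and the bootstrap action name are illustrative choices, not requirements.

```python
def build_cluster_request(name, security_configuration, audit_script_s3_path,
                          service_role, instance_profile, kdc_admin_password,
                          release_label="emr-5.32.0"):
    """Return keyword arguments for the EMR run_job_flow API call."""
    # This sketch only accepts scripts staged in Amazon S3.
    if not audit_script_s3_path.startswith("s3://"):
        raise ValueError("bootstrap script path must be an s3:// URI")
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # Ranger plugins need EMR 5.32 or later
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [  # illustrative sizing only
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "privacera-audit-setup",  # hypothetical name
                "ScriptBootstrapAction": {"Path": audit_script_s3_path,
                                          "Args": []},
            }
        ],
        # Name of the security configuration created in Step 1.
        "SecurityConfiguration": security_configuration,
        "KerberosAttributes": {
            "Realm": "EC2.INTERNAL",  # example realm; adjust for your setup
            "KdcAdminPassword": kdc_admin_password,
        },
        "ServiceRole": service_role,
        "JobFlowRole": instance_profile,
    }


def launch_cluster(request, region):
    """Start the cluster and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

In practice, read the KDC admin password from a secure source rather than hard-coding it in a script.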
When setup is complete, you can use the Privacera Ranger Admin server to configure authorization policies and view the audit logs.
Figure 7 – Configure authorization policies and audit logs in Privacera Ranger Admin server.
Existing Amazon EMR application limitations apply to this solution.
Although Hive policy definitions are reused for Spark, all existing EMR record server plugin limitations still apply. Note that the Spark plugin doesn’t support column masking or row-level filters, and SparkSQL “INSERT INTO” and “INSERT OVERWRITE” statements aren’t supported.
For more information, see Apache Ranger plugin limitations.
When you finish testing this solution, keep in mind you are charged for any resources that remain running. Follow the instructions in the Privacera and EMR Ranger integration guide to clean up the related resources created by this solution.
This post provides a solution for integrating Amazon EMR’s Apache Ranger support with PrivaceraCloud. It extends Apache Ranger capabilities with PrivaceraCloud and provides a fully automated data access governance solution for AWS environments.
For more information, see the following resources:
- Introducing Amazon EMR integration with Apache Ranger
- Privacera and Amazon EMR Apache Ranger integration
Privacera – AWS Partner Spotlight
Privacera is an AWS Partner that provides security and privacy tools for enterprises to secure and govern user access to databases and datastores in the cloud.