Posted On: Jan 8, 2021

Amazon EMR now natively integrates with Apache Ranger, allowing you to define, enforce, and audit fine-grained data access control. With this feature, you can define and enforce 1/ database, table, and column level authorization policies for Apache Spark and Apache Hive users to access data through Hive Metastore, and 2/ prefix and object level authorization policies when accessing data in Amazon S3 via the Amazon EMR File System (EMRFS). Audit logs are captured in Amazon CloudWatch.

Apache Ranger is an open-source tool to enable, monitor, and manage comprehensive data security across the Hadoop platform. Previously, you could use Apache Ranger to enforce fine-grained authorization on data in HDFS with Apache Hive, as described in this blog post. Now this native integration enables additional capabilities. You can define three types of authorization policies on the Apache Ranger Policy Admin server: table, column, and row level authorization for Apache Hive; table and column level authorization for Apache Spark; and prefix and object level authorization for Amazon S3. Amazon EMR automatically installs and configures the corresponding Apache Ranger plugins on the cluster. These Ranger plugins sync with the Policy Admin server for authorization policies, enforce data access control, and send auditing events to Amazon CloudWatch Logs.
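As a rough illustration of what a column level policy might look like, the sketch below builds a policy payload shaped like the policy model used by the Apache Ranger public REST API. It is a minimal sketch under stated assumptions, not the configuration that Amazon EMR itself uses: the service name, database, table, column, and user names are all hypothetical, and the exact fields accepted depend on your Ranger version and on the service definition registered with the Policy Admin server. The code only constructs and prints the payload; it does not contact any server.

```python
import json


def build_column_policy(service_name, database, table, columns, users):
    """Build a Ranger-style policy dict granting SELECT on specific columns.

    All argument values are illustrative; the field layout is an assumption
    based on the Ranger policy model and should be checked against your
    Ranger version and service definition.
    """
    return {
        "service": service_name,                 # hypothetical Ranger service name
        "name": f"{database}.{table}-select",    # human-readable policy name
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "users": list(users),
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


# Hypothetical example: allow one analyst to read two columns of one table.
policy = build_column_policy(
    service_name="hive-example-service",
    database="sales",
    table="orders",
    columns=["order_id", "amount"],
    users=["analyst1"],
)
print(json.dumps(policy, indent=2))
```

A payload like this would typically be submitted to the Policy Admin server or entered through its web UI; the plugins on the cluster then pick up the policy on their next sync.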

Here are some considerations and limitations to review before you enable Apache Ranger integration on Amazon EMR. 1/ Row-level authorization and data masking policies are currently only supported with Apache Hive. 2/ The EMR Ranger-Spark plugin enforces fine-grained authorization when reading and writing data using the Spark API with Java, Scala, R, and PySpark. However, writing data using Spark SQL on Ranger-enabled clusters is currently not supported; only reading data using Spark SQL is supported. 3/ This native integration supports selected applications like Apache Zeppelin and Hue. For a full list of supported applications, see Supported Applications.

Amazon EMR native integration with Apache Ranger is available in the following AWS Regions: US East (N. Virginia and Ohio), US West (N. California and Oregon), Europe (Frankfurt, Ireland, London, Paris, Milan, and Stockholm), Canada (Central), Asia Pacific (Mumbai, Seoul, Singapore, Hong Kong, Tokyo, and Sydney), South America (São Paulo), Middle East (Bahrain), and Africa (Cape Town).

To get started, see the following list of resources:

• Amazon EMR Management Guide: Integrating Amazon EMR with Apache Ranger
• AWS Big Data Blog post: Introducing Amazon EMR integration with Apache Ranger