AWS Big Data Blog

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

by Varun Rao Bhamidimarri | on | | Comments

Varun Rao is a Big Data Architect for AWS Professional Services

Role-based access control (RBAC) is an important security requirement for multi-tenant Hadoop clusters. Enforcing this across always-on and transient clusters can be hard to set up and maintain.

Imagine an organization that has an RBAC matrix using Active Directory users and groups. They would like to manage it on a central security policy server and enforce it on all Hadoop clusters that are spun up on AWS. This policy server should also store access and audit information for compliance needs.

In this post, I provide the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.

Apache Ranger

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, like NameNode and HiveServer2.

Architecture

Using the setup in the following diagram, multiple EMR clusters can sync policies with a standalone security policy server. The idea is similar to a shared Hive metastore that can be used across EMR clusters.

EMRRanger_1

Walkthrough

In this walkthrough, three users—analyst1, analyst2, and admin1—are set up for the initial authorization, as shown in the following diagram. Using the Ranger Admin UI, I show how to modify these access permissions. These changes are propagated to the EMR cluster and validated through Hue.

o_EMRRanger_2

To manage users/groups/credentials, use Simple AD, a managed directory service offered by AWS Directory Service. You create a Windows EC2 instance to join the SimpleAD domain and load users/groups using a PowerShell script. Then, set up the security policy server (Ranger) and create and configure the EMR cluster. Test the security policies and then update them.

Prerequisites

The following steps assume that you have a VPC with at least two subnets, with NAT configured for private subnets. Also, verify that DNS Resolution (enableDnsSupport) and DNS Hostnames (enableDnsHostnames) are set to Yes on the VPC. The EC2 instance created in the steps below can be used as bastion if launched in a public subnet. If no public subnets are selected, you will need a bastion host or a VPN connection to login to the windows instance and access Web UI links (Hue, Ranger).

I have created AWS CloudFormation templates for each step and a nested CloudFormation template for single-click deployment launch_stack. If you use this nested Cloudformation template, skip to the “Testing the cluster” step after the stack has been successfully created.

To create each component individually, follow the steps below.

IMPORTANT: The templates use hard-coded username and passwords, and open security groups. They are not intended for production use without modification.

Setting up a SimpleAD server

Using this CloudFormation template, set up a SimpleAD server. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_1_1

CloudFormation output:

EMRRanger_Grid2

NOTE: SimpleAD creates two servers for high availability. For the following steps, you can use either of the two IP addresses.

Creating a Windows EC2 instance

To manage the SimpleAD server, set up a Windows instance. It is used to load LDAP users required to test the access policies. On instance startup, a PowerShell script is executed automatically to load users (analyst1, analyst2, admin1).

Using this CloudFormation template, set up this Windows instance. Select a public subnet if you want to use this as a bastion host to access Web UI (Hue, Ranger). To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_3_2

You can specify either of the two SimpleAD IP addresses.

CloudFormation output:

EMRRanger_Grid4

Once stack creation is complete, Remote desktop into this instance using the SimpleAD username (EmrSimpleAD\Administrator) and password (Password@123) before moving to the next step.

NOTE: The instance initialization is longer than usual because of the SimpleAD Join and PowerShell scripts that need to be executed after the join.

Setting up the Ranger server

Now that SimpleAD has been created and the users loaded, you are ready to set up the security policy server (Ranger). This runs on a standard Amazon Linux instance and Ranger is installed and configured on startup.

Using this CloudFormation template, set up the Ranger server. To launch the stack directly through the console, use launch_stack. It takes the following parameters:

EMRRanger_5_1

CloudFormation output:

EMRRanger_Grid6

NOTE: The Ranger server syncs users with SimpleAD and enables LDAP authentication for the Admin UI. The default Ranger Admin password is not changed.

Creating an EMR cluster

Finally, it’s time to create the EMR cluster and configure it with the required plugins. You can use the AWS CLI or CloudFormation to create and configure the cluster. EMR security configurations are not currently supported by CloudFormation.

Using the AWS CLI to create a cluster

aws emr create-cluster --applications Name=Hive Name=Spark Name=Hue --tags 'Name=EMR-Security' \
--release-label emr-5.0.0 \
--ec2-attributes 'SubnetId=<subnet-xxxxx>,InstanceProfile=EMR_EC2_DefaultRole,KeyName=<key name>' \
--service-role EMR_DefaultRole \
--instance-count 4 \
--instance-type m3.2xlarge \
--log-uri '<s3 location for logging>' \
--name 'SecurityPOCCluster' --region us-east-1 \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/scripts/download-scripts.sh","Args":["s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Name":"Download scripts"}]' \
--steps '[{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/updateHueLdapUrl.sh","<ip address of simple ad server>"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"UpdateHueLdapServer"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-policies.sh","<ranger host ip>","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger/inputdata"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPolicies"},{"Args":["spark-submit","--deploy-mode","cluster","--class","org.apache.spark.examples.SparkPi","/usr/lib/spark/examples/jars/spark-examples.jar","10"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"SparkStep"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/install-hive-hdfs-ranger-plugin.sh","<ranger host ip>","0.6","s3://aws-bigdata-blog/artifacts/aws-blog-emr-ranger"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"InstallRangerPlugin"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/loadDataIntoHDFS.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"LoadHDFSData"},{"Args":["/mnt/tmp/aws-blog-emr-ranger/scripts/emr-steps/createHiveTables.sh","us-east-1"],"Type":"CUSTOM_JAR","MainClass":"","ActionOnFailure":"CONTINUE","Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar","Properties":"","Name":"CreateHiveTables"}]' \
--configurations '[{"Classification":"hue-ini","Properties":{},"Configurations":[{"Classification":"desktop","Properties":{},"Configurations":[{"Classification":"auth","Properties":{"backend":"desktop.auth.backend.LdapBackend"},"Configurations":[]},{"Classification":"ldap","Properties":{"bind_dn":"binduser","trace_level":"0","search_bind_authentication":"false","debug":"true","base_dn":"dc=corp,dc=emr,dc=local","bind_password":"Bind@User123","ignore_username_case":"true","create_users_on_login":"true","ldap_username_pattern":"uid=<username>,cn=users,dc=corp,dc=emr,dc=local","force_username_lowercase":"true","ldap_url":"ldap://<ip address of simple ad server>","nt_domain":"corp.emr.local"},"Configurations":[{"Classification":"groups","Properties":{"group_filter":"objectclass=*","group_name_attr":"cn"},"Configurations":[]},{"Classification":"users","Properties":{"user_name_attr":"sAMAccountName","user_filter":"objectclass=*"},"Configurations":[]}]}]}]}]'

The LDAP-related configuration for HUE is passed using the --configurations option. For more information, see Configure Hue for LDAP Users and the EMR create-cluster CLI reference.

Using a CloudFormation template to create a cluster

This step requires some Hue configuration changes in the CloudFormation template. The IP address of the LDAP server (SimpleAD) needs to be updated.

  1. Open the template in CloudFormation Designer. For more information about how to modify a CloudFormation template, see Walkthrough: Use AWS CloudFormation Designer to Modify a Stack’s Template.
  2. Choose EMRSampleCluster.
  3. On the Properties section, update the value of ldap_url with the IP address of the SimpleAD server:
    "ldap_url": "ldap://<change it to the SimpleAD IP address>",
  4. On the Designer toolbar, choose Validate template to check for syntax errors in your template.
  5. Choose Create Stack.
  6. Update Stack name and the stack parameters.

CloudFormation parameters:

EMRRanger_7_1

CloudFormation output:

EMRRanger_Grid8

EMR steps are used to perform the following:

  • Install and configure Ranger HDFS and Hive plugins
  • Use the Ranger REST API to update repository and authorization policies.
    NOTE: This step needs to be executed the first time. New clusters do not need to include this step action.
  • Create Hive tables (tblAnalyst1 and tblAnalyst2) and copy sample data.
  • Create HDFS folders (/user/analyst1 and /user/analyst2) and copy sample data.
  • Run a SparkPi job using the spark submit action to verify the cluster setup.

To validate that all the step actions were executed successfully, view the Step section for the EMR cluster.

o_EMRRanger_3

NOTE: Cluster creation can take anywhere between 10-15 minutes.

Testing the cluster

Congratulations! You have successfully configured the EMR cluster with the ability to manage authorization policies, using Ranger. How do you know if it actually works? You can test this by accessing HDFS files and running Hive queries.

Using HDFS

Log in to Hue (URL: http://<master DNS or IP>:8888) as “analyst1” and try to delete a file owned by “analyst2”. For more information about how to access Hue, see Launch the Hue Web Interface. The Windows EC2 instance created in the previous steps can be used to access this without having to setup a SSH tunnel.

  1. Log in as user “analyst1” (password: Hadoop@Analyst1).
  2. Browse to the /user/analyst2 HDFS directory and move the file “football_coach_position.tsv” to trash.
  3. You should see a “Permission denied” error, which is expected.
    o_EMRRanger_4

Using Hive queries

Using the HUE SQL Editor, execute the following query.

These queries use external tables, and Hive leverages EMRFS to access the data stored in S3. Because HiveServer2 (where Hue is submitting these queries) is checking with Ranger to grant or deny before accessing any data in S3, you can create fine-grained SQL-based permissions for users even though there is a single EC2 role specified for the cluster (which is used by all requests the cluster makes to S3). For more information, see Additional Features of Hive on Amazon EMR.

SELECT * FROM default.tblanalyst1

This should return the results as expected. Now, run the following query:

SELECT * FROM default.tblanalyst2

You should see the following error:

o_EMRRanger_5

This makes sense. User analyst1 does not have table SELECT permissions on table tblanalyst2.

User analyst2 (default password: Hadoop@Analyst2) should see a similar error when accessing table tblanalyst1. User admin1 (default password: Hadoop@Admin1) should be able to run both queries.

Updating the security policies

You have verified that the policies are being enforced. Now, let’s try to update them.

  1. Log in to the Ranger Admin UI server
    • URL: http:://<ip address of the ranger server>:6080/login.jsp
    • Default admin username/password: admin/admin.
  2. View all the Ranger Hive policies by selecting “hivedev”
    o_EMRRanger_6
  3. Select the policy named “Analyst2Policy”
  4. Edit the policy by adding “analyst1” user with “select” permissions for table “tblanalyst2”
    EMRRanger_7
  5. Save the changes.

This policy change is pulled in by the Hive plugin on the EMR cluster. Give it at least 60 seconds for the policy refresh to happen.

Go back to Hue to test if this change has been propagated.

  1. Log back in to the Hue UI as user “analyst1” (see earlier steps).
  2. In the Hive SQL Editor, run the query that failed earlier:
    SELECT * FROM default.tblanalyst2

This query should now run successfully.

o_EMRRanger_8

Audits

Can you now find those who tried to access the Hive tables and see if they were “denied” or “allowed”?

  1. Log back in to the Ranger UI as admin (see earlier steps).
    URL: http://<ip address of the ranger server>:6080/login.jsp
  2. Choose Audit and filter by “analyst1”.
    • Analyst1 was denied SELECT access to the tblanalyst2 table.
      o_EMRRanger_9
    • After the policy change, the access was granted and logged.
      o_EMRRanger_10

The same audit information is also stored in SOLR for performing more complex and full test searches. The SOLR instance is installed on the same instance as the Ranger server.

  • Open Solr UI:
    http://<ip-address-of-ranger-server>:8983/solr/#/ranger_audits/query
  • Perform a document search
    o_EMRRanger_11

Direct URL: http:// <ip-address-of-ranger-server>:8983/solr/ranger_audits/select?q=*%3A*&wt=json&indent=true

Conclusion

In this post, I walked through the steps required to enable authorization and audit capabilities on EMR using Apache Ranger, with a centrally managed security policy server. I also covered the steps to automate this using CloudFormation templates.

Stay tuned for more posts about security on EMR. If you have questions or suggestions, please comment below.

For information about other EMR security aspects, see Jeff Barr’s posts:


About the author


varun_90Varun Rao is a Big Data Architect for AWS Professional Services.
He works with enterprise customers to define data strategy in the cloud. In his spare time, he tries to keep up with his 2-year-old.

 

 


Related

Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations

security_config