AWS Big Data Blog

Using LDAP via AWS Directory Service to Access and Administer Your Hadoop Environment

Erik Swensson is a Solutions Architect with AWS

In this post you will learn how to leverage a Lightweight Directory Access Protocol (LDAP) service via AWS Directory Service to authenticate and define permissions for users and administrators of Amazon EMR, Amazon’s hosted Hadoop service. A centralized LDAP repository for authentication and authorization lets you more easily manage one username and password across your organization. You can segregate duties, ensure that you are enforcing the same password policies throughout your organization, and have just one place to enable or disable access that is easily auditable. To accomplish this we will walk through the following steps:

  1. Create an Active Directory-compatible LDAP service using AWS Directory Service.
  2. Create users in that directory for Hadoop administration and another user for Hadoop use.
  3. Demonstrate how to create and administer Hadoop environments in the AWS console as the LDAP-backed Hadoop administrator user.
  4. Demonstrate how to use Pig and Hive as LDAP-authenticated users through the Hadoop User Experience (Hue) interface on Amazon EMR.

Creating an AWS Directory Service

  1. Log into the AWS console.
  1. Go to Directory Service under Administration & Security.
  1. Click Set up directory.

  1. Next, choose the directory type. If you already have an Active Directory on premises you can select AD Connector to integrate with it.  We will select Simple AD, which creates a new Active Directory Domain in AWS.   To do this, click the blue Create Simple AD button.

  1. Enter the details of the AD you want to create. You can put your own values here or use the ones in the screen capture. Remember this information. You’ll need it for the following steps. Click Next Step.

  1. Review the settings and click Create Simple AD.
  1. Once your Simple AD is created, view the details in the console. Note that the two DNS addresses shown in the box below belong to two Amazon Elastic Compute Cloud (Amazon EC2) instances that have been created in your Virtual Private Cloud (VPC) and act as the domain controllers. Record these IP addresses. You’ll need them for a future step.

Managing an AWS Directory Service

Now we’ll need a Windows 2008 machine that can manage the directory service and create users. We will create two users:

  • HadoopAdmin –  A user who can use the AWS console to create and modify Amazon EMR clusters
  • HadoopUser – A user who can use Hive and Pig via Hue

To create users in AD, you need a server that has the AD Users and Computers tool.  You have the option of managing your new Simple AD from an existing Windows machine or you can launch a new Windows machine to manage the directory. Be aware that the domain controller instances created for your Simple AD are protected by Security Groups. You may need to review and modify the Security Group settings if you would like to access these instances from outside their VPC. From this Windows machine, use the following instructions to add it to the domain and create these users.

  1. Remote Desktop into the Windows server that you can add to this Domain.
  1. Go to Network Connections > Ethernet Properties > Internet Protocol Version 4 Properties. Select Use the following DNS server addresses and enter the values of the addresses found in Step 7 above for your Simple AD and click OK.

  1. Now we need to join this computer to the AWS Directory Service Domain.  Go to System > System Properties. Click Change to provide the domain this computer will be a member of (the domain entered in Step 5 above). You can also modify the computer’s workgroup or name to something more meaningful here if you would like, but it’s not required.  Click OK.

  1. You will be asked for a username and password to complete this action. Use Administrator as the username and use the password you entered in Step 5 above.  Click OK.   You should then receive a message saying “Welcome to the Domain.” Click OK and reboot your machine to finish joining the Domain.
  1. After reboot, log into that machine using DomainName\Administrator and the password you used in the previous step. (The domain name was set in Step 5 above while creating the directory. In our example it was TESTDOMAIN\Administrator.)

Note: This is the domain administrator password, not the Windows password.

  1. Open Active Directory Users and Computers to manage the domain by going to Administrative Tools > Active Directory Users and Computers. If you do not have Active Directory Tools installed on your machine, follow these instructions to install them.
  1. In the left pane, double-click the domain. Click Users and add two users: one named HadoopAdmin and one named HadoopUser, as shown below. Click Next.

  1. Type a password for each new user and record them somewhere. You’ll need them later when interacting with Hadoop. Uncheck User must change password at next login, click Next, and then click Finish for both HadoopAdmin and HadoopUser.

Granting Your LDAP HadoopAdmin User Access to the AWS Console

We now have our Simple AD directory and users created and need to authorize HadoopAdmin to launch and administer Amazon EMR Clusters. The following steps will grant the HadoopAdmin user the ability to use the AWS console.

  1. Log into the AWS Console as a user with administrative rights over AWS Directory Service and AWS Identity and Access Management (IAM).
  1. Go to the AWS Directory Service and click the Directory Service we created above.
  1. Type in an access URL and click the Create an Access URL button as shown below.  This is the URL where users in your LDAP will be authenticated when attempting to access the AWS Management Console.

  1. Click Continue.
  1. In the Apps & Services section, click Manage Access next to the AWS Management Console.
  1. In the pop-up box, click Enable Access.
  1. In the next window, click New Role and select the Create New Role button on the next page.

  1. Select any Template. (We will change the permissions to those needed for running Amazon EMR.) For example, choose PowerUser and click Next.
  1. Click View Details, change the IAM role to Create a new IAM Role, and change the Role Name to EmrAdmin.
  1. Click View Policy Document.
  1. Click Edit to overwrite the policy and paste in the Policy below. This Policy allows access to launch and manage Amazon EMR clusters.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:*",
        "ec2:*",
        "cloudwatch:*",
        "s3:*",
        "sdb:*",
        "iam:PassRole",
        "iam:ListRoles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
  1. Click Allow. To launch and modify Amazon EMR clusters, you need admin rights on Amazon EMR, Amazon EC2, Amazon CloudWatch, and Amazon S3. This is the minimum privilege policy that Amazon EMR requires for this use case. AWS documentation provides more information about IAM Policies.
  1. Select the users or groups you want to grant EmrAdmin access to, such as HadoopAdmin. A dropdown box will appear.
  1. Select HadoopAdmin as shown below and click Next Step. In this case we are simply granting a user access, but you can also use LDAP groups here to more easily manage Hadoop Administrators. Any new Hadoop administrators  added to the LDAP group get this access automatically.

  1. Click Create Role Assignment to finish the process.  You will see the following confirming that the user was granted EmrAdmin access.

  1. To prove this works, launch an Amazon EMR cluster as the HadoopAdmin user. In a browser, go to the Access URL created in Step 3 with /console/ appended (e.g. https://awsbigdatablogtest.awsapps.com/console/).
  1. Type the username HadoopAdmin and the password that you gave to your HadoopAdmin and click Sign In.

  1. In the upper-right corner, make sure you are in the region you want to use and confirm that you are signed in as EmrAdmin/HadoopAdmin. You can validate that you can access AWS services with this account but cannot manage anything in IAM.
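
The policy pasted in the role-creation steps above grants every action on several services. As a rough sketch that is not part of the original walkthrough, the Python below parses that policy document and lists the service prefixes that receive full `*` access, which can be a starting point when reviewing or tightening the role later. The helper name `wildcard_services` is hypothetical, and the check only looks at `Allow` statements.

```python
import json

# The policy document from the steps above, reproduced as a string.
POLICY_JSON = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "elasticmapreduce:*",
        "ec2:*",
        "cloudwatch:*",
        "s3:*",
        "sdb:*",
        "iam:PassRole",
        "iam:ListRoles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
"""


def wildcard_services(policy):
    """Return the sorted service prefixes that are granted "service:*"."""
    services = set()
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement object is also valid
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action string is also valid
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            if name == "*":
                services.add(service)
    return sorted(services)


# json.loads raises ValueError if the pasted document is not valid JSON.
policy = json.loads(POLICY_JSON)
print(wildcard_services(policy))
```

Running a check like this before pasting the policy into the console also catches stray commas or quotes introduced while editing.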

Now you can use the HadoopAdmin user to do some Hadoop/EMR administration that uses Hue and LDAP. The JSON below is the LDAP configuration we want to give to Amazon EMR, and it needs to be saved to a file. Before saving it, replace the base_dn, ldap_url, bind_dn, and bind_password values with the details of your own directory. Then upload the file to Amazon Simple Storage Service (Amazon S3). In the next step you will refer to it in the bootstrap action that configures Hadoop, using the argument --hue-config=s3://locationOfYourJSON/filename.json.

Note: We are using the Administrator as the bind user. In reality, this is not a good security practice. You should create a limited access read-only bind user for this. However, for simplicity we have used Administrator.

Note: Also for the sake of simplicity, we are using the IP address of one of the directory’s domain controllers, provided when we created our Simple AD, as the ldap_url. In reality you would want to use a DNS name with both servers behind it and a health check to avoid having a single point of failure. This can be done with Amazon Route 53.

{
  "hue": {
    "ldap": {
      "ldap_servers": {
        "mytestdomain": {
          "base_dn": "DC=testdomain,DC=awsbigdatablog,DC=com",
          "ldap_url": "ldap://10.0.11.9",
          "search_bind_authentication": "true",
          "bind_dn": "CN=Administrator,CN=users,DC=testdomain,DC=awsbigdatablog,DC=com",
          "bind_password": "YOURPASSWORD",
          "groups": {
            "group_name_attr": "cn",
            "group_filter": "objectclass=group"
          },
          "users": {
            "user_filter": "objectclass=user",
            "user_name_attr": "sAMAccountName"
          }
        }
      }
    }
  }
}
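
Because base_dn and bind_dn both derive from the domain name, the file can also be generated rather than edited by hand. The following is a minimal Python sketch under that assumption; the helper names (`domain_to_base_dn`, `build_hue_ldap_config`, `hue_config_argument`) are hypothetical, and the values passed in are the example values from this post, so substitute your own.

```python
import json


def domain_to_base_dn(domain):
    """Turn a DNS-style domain name into a DC=...,DC=... distinguished name."""
    return ",".join("DC=" + part for part in domain.split("."))


def build_hue_ldap_config(server_name, domain, ldap_host, bind_user, bind_password):
    """Build the same structure as the JSON above for one LDAP server entry."""
    base_dn = domain_to_base_dn(domain)
    return {
        "hue": {
            "ldap": {
                "ldap_servers": {
                    server_name: {
                        "base_dn": base_dn,
                        "ldap_url": "ldap://" + ldap_host,
                        "search_bind_authentication": "true",
                        "bind_dn": "CN={},CN=users,{}".format(bind_user, base_dn),
                        "bind_password": bind_password,
                        "groups": {
                            "group_name_attr": "cn",
                            "group_filter": "objectclass=group",
                        },
                        "users": {
                            "user_filter": "objectclass=user",
                            "user_name_attr": "sAMAccountName",
                        },
                    }
                }
            }
        }
    }


def hue_config_argument(s3_location):
    """Format the bootstrap argument that points Amazon EMR at the uploaded file."""
    return "--hue-config=" + s3_location


# Example values taken from this post; the password is a placeholder.
config = build_hue_ldap_config(
    server_name="mytestdomain",
    domain="testdomain.awsbigdatablog.com",
    ldap_host="10.0.11.9",
    bind_user="Administrator",
    bind_password="YOURPASSWORD",
)
print(json.dumps(config, indent=2))
print(hue_config_argument("s3://locationOfYourJSON/filename.json"))
```

The printed JSON can be saved to a file and uploaded to Amazon S3. Avoid keeping a real bind password in a script that is checked into source control.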

Go to the Amazon EMR page in the AWS Console and select Launch Cluster. Note the following:

  • We are not selecting a key pair, so there will be no direct access to the Hadoop environment via SSH. This is optional but adds to the segregation of duties: the HadoopAdmin cannot access the system or perform functions within Hadoop, which should be done by the HadoopUser. However, if you want to use SSH tunneling to connect to Hue, add a key pair here.
  • Select the VPC where you installed your Directory Service. This is not required, but if you choose another VPC you must ensure that VPC can talk to your directory server and that your security groups allow that communication.
  • The Hue application should be configured for LDAP integration referring to the file created and stored in Amazon S3 in the step above.
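
If the cluster runs in a different VPC or behind restrictive security groups, it can help to confirm that the directory is reachable before launching. The sketch below is one possible check, not part of the original walkthrough: it tries a plain TCP connection to each domain controller on port 389, the standard LDAP port used by ldap:// URLs, and returns the first one that answers. The function name `first_reachable` is hypothetical, and a successful connection only shows that the network path is open, not that LDAP binds will succeed.

```python
import socket


def first_reachable(hosts, port=389, timeout=3.0):
    """Return the first host that accepts a TCP connection on port, else None."""
    for host in hosts:
        try:
            # create_connection resolves the host and honors the timeout.
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            # Covers refused connections, timeouts, and name resolution errors.
            continue
    return None
```

Run it from a machine inside the VPC you plan to use for the cluster, passing the two DNS addresses you recorded when creating the Simple AD, for example `first_reachable(["10.0.11.9", "<second address>"])`. A result of None suggests a routing or security group problem.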

Click Create Cluster. You have now launched an Amazon EMR cluster running Hue with LDAP integration while signed in as HadoopAdmin, a user that is authorized by an LDAP service. Note that the HadoopUser we created cannot log into the console and perform these actions; only HadoopAdmin can.

Using Hue with LDAP Users

We now have a cluster up and running with Hue and LDAP integration.  You can now log into the Hue user interface by either SSH tunneling to the Amazon EMR master node or opening the security group on port 8888 to the master of the Amazon EMR cluster you just launched.
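
For the SSH tunneling option, the cluster must have been launched with a key pair. As a hedged sketch, the Python below assembles a standard OpenSSH local port forward (`-L`) to the Hue port. The master node hostname and key file name are placeholders, the login name `hadoop` is the usual user on Amazon EMR nodes, and the helper name `hue_tunnel_command` is hypothetical.

```python
import shlex


def hue_tunnel_command(master_dns, key_file, local_port=8888, remote_port=8888, user="hadoop"):
    """Build an ssh argument list that forwards local_port to Hue on the master node."""
    return [
        "ssh",
        "-i", key_file,  # private key of the key pair chosen at launch
        "-N",            # do not run a remote command, only forward ports
        "-L", "{}:localhost:{}".format(local_port, remote_port),
        "{}@{}".format(user, master_dns),
    ]


# Placeholder values; replace with your master node's public DNS name and key file.
cmd = hue_tunnel_command("your-master-public-dns", "mykey.pem")
print(" ".join(shlex.quote(part) for part in cmd))
```

With a tunnel like this running, Hue would be reachable at http://localhost:8888 in a local browser, and port 8888 does not need to be opened in the security group.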

  1. After setting up the tunneling or direct access via the security group, open a browser and point it to the master node on port 8888. In the graphic below, mytestdomain is automatically populated. This shows that authentication runs against the domain we created in the steps above.

  1. Enter HadoopUser and the password you set for HadoopUser to log into Hue.
  1. You can see that HadoopUser can use Hue to analyze data sets within Hadoop.

Note:  At this point we are authenticating with the Directory Service which will enforce password policy and be a centralized place to remove or add users. However, right now we have not yet achieved full separation of duties. Currently, anyone in the base_dn provided in the Hue application configuration can access Hue. We need to authorize some users and groups and prohibit others.

  1. In the upper-right corner, go to HadoopUser > Manage Users.

  1. Click Sync LDAP users/groups to sync all users and groups from your directory. For a larger directory, you can instead click the button next to it to Add/Sync individual users and grant them access.

You will now have three options at your disposal to grant and revoke access: Users, Groups, and Permissions.

You can grant users specific permissions to specific areas such as user management, particular tools (such as allowing access to Pig but not Hive) or make the user/group active/inactive.   To achieve the segregation, authorization and authentication requirements your organization needs, learn more about Hue permissions and authorization.

  1. After authorizing and granting permissions, you can have your LDAP users log into Hue and access Pig and Hive.

Conclusion

This post has shown you that administering Hadoop on Amazon EMR can be easily done with an LDAP service that uses the AWS Directory Service. Additionally, using Hadoop to run Pig and Hive queries can be managed by an LDAP service through AWS Directory Service. This allows enterprises to integrate their existing LDAP environments with AWS to provide a single source of authorization and authentication. Users and groups can now be managed in one centralized location with policies you can store and audit. Your big data environments can remain agile on top of AWS while complying with enterprise security in a simple and cost-effective manner.

If you have questions or suggestions, please leave a comment below.