Securely Analyze Data from Another AWS Account with EMRFS
Sometimes, data to be analyzed is spread across buckets owned by different accounts. In order to ensure data security, appropriate credentials management needs to be in place. This is especially true for large enterprises storing data in different Amazon S3 buckets for different departments. For example, a customer service department may need access to data owned by the research department, but the research department needs to provide that access in a secure manner.
This aspect of securing the data can become quite complicated. Amazon EMR uses an integrated mechanism to supply user credentials for access to data stored in S3. When you use an application (Hive, Spark, etc.) on EMR to read or write a file to or from an S3 bucket, the S3 API call needs to be signed by proper credentials to be authenticated.
Usually, these credentials are provided by the EC2 instance profile that you specify during cluster launch. What if the EC2 instance profile credentials are not enough to access an S3 object, because that object requires a different set of credentials?
This post shows how you can use a custom credentials provider to access S3 objects that cannot be accessed by the default credentials provider of EMRFS.
EMRFS and EC2 instance profiles
When an EMR cluster is launched, it needs an IAM role to be specified as the Amazon EC2 instance profile. An instance profile is a container that is used to pass the permissions contained in an IAM role when the EC2 instance is starting up. The IAM role essentially defines the permissions for anyone who assumes the role.
In the case of EMR, the IAM role contained in the instance profile has permissions to access other AWS services such as Amazon S3, Amazon CloudWatch, Amazon Kinesis, etc. This role obtains temporary credentials via the EC2 instance metadata service and provides them to the application that needs to access other AWS services.
For example, when a Hive application on EMR needs to read input data from an S3 bucket (where the S3 bucket path is specified by the s3:// URI), it invokes a default credentials provider function of EMRFS. The provider in turn obtains the temporary credentials from the EC2 instance profile and uses those credentials to sign the S3 GET request.
Custom credentials providers
In certain cases, the credentials obtained by the default credentials provider might not be enough to sign requests to an S3 bucket that you do not have permission to access. Maybe the bucket has a different owner, or restrictive bucket policies that allow access only to a specific IAM user or role.
In situations like this, you have other options that allow access to the data. You could modify the S3 bucket policy to allow access to your IAM user, but this might be a security risk. A better option is to implement a custom credentials provider for EMRFS to ensure that your S3 requests are signed by the correct credentials. A custom credentials provider ensures that only a configured EMR cluster has access to the data in S3, and it provides much better control over who can access the data.
Configuring a custom credentials provider for EMRFS
Create a credentials provider by implementing both the AWSCredentialsProvider interface (from the AWS SDK for Java) and the Hadoop Configurable interface for use with EMRFS when it makes calls to Amazon S3.
Each implementation of AWSCredentialsProvider can choose its own strategy for loading credentials, depending on the use case. You can load credentials using the AWS STS AssumeRole API action, or read them from a Java properties file if you want to make API calls using the credentials of a specific IAM user. Then, package your custom credentials provider in a JAR file, upload the JAR file to your EMR cluster, and specify the class name by setting fs.s3.customAWSCredentialsProvider in the emrfs-site configuration classification.
Update: It is now possible to use the EMR EC2 instance profile to assume a role in order to access an S3 bucket in a different account. As an alternative in the following walkthrough, you can replace the IAM user “data_analyst” with the EMR EC2 instance profile: use the InstanceProfileCredentialsProvider class of the AWS SDK for Java to obtain temporary credentials from the instance profile, and then use those credentials in the STS AssumeRole call to access an S3 bucket in another account.
Suppose you would like to analyze data stored in an S3 bucket owned by the research department of your company, which has its own AWS account. You can launch an EMR cluster in your account and use EMRFS to access data stored in the bucket owned by the research department.
For this example, the two accounts of your company are:
- Research: firstname.lastname@example.org (Account ID: 123456789012)
- Your department: email@example.com (Account ID: 111222333444)
Also, you have an IAM user called “data_analyst” in email@example.com. This user should be granted cross-account access to read data from the bucket called “research-data” in firstname.lastname@example.org. To enable cross-account access, follow these procedures:
- Configure IAM inside account firstname.lastname@example.org
- Configure IAM inside account email@example.com
Configure IAM inside account firstname.lastname@example.org
- Sign in to the IAM console.
- Choose Roles, Create New Role
- Enter a name for the role, such as “demo-role”
- Expand the Role for Cross-Account Access section and select role type Provide access between AWS accounts you own.
- Add email@example.com as the account from which IAM users can access firstname.lastname@example.org. This can be done by specifying the AWS account ID for email@example.com. The account ID can be obtained from the My Account page in the AWS Management Console.
- On the Attach Policy page, choose Next Step
- Review the details and choose Create Role
- In the left navigation, choose Policies, Create Policy
- Choose Create Your Own Policy and name the policy demo-role-policy, where the policy document is:
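A minimal sketch of such a policy, granting read access to the research-data bucket (the exact action list is an assumption; add write actions if your analysis needs to write results back):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::research-data",
        "arn:aws:s3:::research-data/*"
      ]
    }
  ]
}
```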
- In the left navigation pane, choose Roles, open the demo-role role, and choose Attach Policy. From the list of displayed policies, select demo-role-policy (which you just created) and attach the policy
- To configure the trust relationship for the role, choose Trust Relationships, Edit Trust Relationship
- On the Edit Trust Relationship page, the policy document should specify the IAM user data_analyst who can assume this role
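A sketch of the trust policy, allowing the data_analyst user in account 111222333444 (email@example.com) to assume the role:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111222333444:user/data_analyst"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```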
Configure IAM inside account email@example.com
Attach an IAM policy to the user “data_analyst” that allows the user to assume the IAM role “demo-role” created in the account firstname.lastname@example.org.
- Sign in to the IAM console
- Choose Policies, Create Policy
- Choose Create Your Own Policy and name the policy “data-analyst-policy”, where the policy document is:
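A sketch of this policy, allowing sts:AssumeRole on the demo-role role in the research account (123456789012):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/demo-role"
    }
  ]
}
```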
- Choose Create Policy
- Select the policy just created, choose Policy actions, Attach
- For Principal entity, select the user “data_analyst” and choose Attach policy
Implement the custom credentials provider
After the IAM configurations are finished, you implement a custom credentials provider to enable the EMR cluster to access objects stored in the S3 bucket “research-data”.
The following is sample code for the custom credentials provider that reads the IAM user data_analyst credentials from a Java properties file. If the bucket URI points to a bucket in the firstname.lastname@example.org account, the provider then assumes the role to obtain temporary credentials. On the other hand, if the bucket URI points to a bucket in the user’s own account, the credentials are read from the EC2 instance profile for the EMR cluster.
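Here is a minimal sketch of such a provider, not the exact code from the original post: the class name, the hard-coded role ARN and bucket name, the session name, and reading the properties file from the JAR's classpath are assumptions made for this walkthrough. EMRFS can construct the provider with the URI of the S3 object being accessed, which is what lets the provider switch credentials per bucket.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Properties;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// No package declaration, so the full class name configured in
// emrfs-site is simply "MyAWSCredentialsProvider".
public class MyAWSCredentialsProvider implements AWSCredentialsProvider, Configurable {

    // Assumptions for this walkthrough: the research bucket name and the
    // ARN of the role created in the research account.
    private static final String RESEARCH_BUCKET = "research-data";
    private static final String ROLE_ARN = "arn:aws:iam::123456789012:role/demo-role";

    private final URI uri;
    private Configuration conf;

    // EMRFS invokes this constructor, passing the URI of the S3 object
    // being accessed along with the Hadoop configuration.
    public MyAWSCredentialsProvider(URI uri, Configuration conf) {
        this.uri = uri;
        this.conf = conf;
    }

    @Override
    public AWSCredentials getCredentials() {
        if (uri != null && uri.toString().contains(RESEARCH_BUCKET)) {
            // Cross-account access: sign the STS call with the data_analyst
            // user's keys, then assume the role in the research account to
            // obtain temporary credentials.
            BasicAWSCredentials userCreds = loadUserCredentials();
            AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.standard()
                    .withCredentials(new AWSStaticCredentialsProvider(userCreds))
                    .build();
            Credentials temp = sts.assumeRole(new AssumeRoleRequest()
                    .withRoleArn(ROLE_ARN)
                    .withRoleSessionName("emrfs-cross-account")).getCredentials();
            return new BasicSessionCredentials(temp.getAccessKeyId(),
                    temp.getSecretAccessKey(), temp.getSessionToken());
        }
        // Buckets in your own account: fall back to the EC2 instance profile.
        return InstanceProfileCredentialsProvider.getInstance().getCredentials();
    }

    // Reads the data_analyst keys from Credentials.properties, which is
    // packaged inside the provider JAR so it is present on every node.
    private BasicAWSCredentials loadUserCredentials() {
        Properties props = new Properties();
        try (InputStream in = getClass().getClassLoader()
                .getResourceAsStream("Credentials.properties")) {
            props.load(in);
        } catch (IOException e) {
            throw new RuntimeException("Unable to read Credentials.properties", e);
        }
        return new BasicAWSCredentials(props.getProperty("accessKey"),
                props.getProperty("secretKey"));
    }

    @Override
    public void refresh() { }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}
```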
As you might have noticed in the code above, the IAM user data_analyst credentials are read from the file “Credentials.properties”, where the contents of that file are:
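The file follows the standard Java properties format. The values below are AWS's documented example keys, used here as placeholders; replace them with the data_analyst user's actual access key pair:

```properties
accessKey=AKIAIOSFODNN7EXAMPLE
secretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```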
Now, compile and package the code into a JAR file to be used by your EMR cluster for cross-account S3 access. Use the following steps:
- SSH into the master node of a running EMR cluster, navigate to the home directory of the hadoop user, create a folder called MyAWSCredentialsProvider, and add the provider source file (MyAWSCredentialsProvider.java) and the Credentials.properties file to that folder
- In the folder, execute the following commands to compile and package the Java code into a JAR file:
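A sketch of those commands; the classpath entries point to where the AWS SDK for Java and Hadoop JARs typically live on an EMR node, so adjust them for your release:

```shell
# Compile against the AWS SDK for Java and Hadoop JARs on the cluster
javac -cp "/usr/share/aws/aws-java-sdk/*:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*" MyAWSCredentialsProvider.java

# Package the class files together with the properties file
jar cf MyAWSCredentialsProvider.jar *.class Credentials.properties
```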
- Upload the generated JAR file to your S3 bucket:
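For example (the bucket and folder names are placeholders; use a bucket your account owns):

```shell
aws s3 cp MyAWSCredentialsProvider.jar s3://mybucket/myfolder/
```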
After you have executed the steps above, create a bootstrap action script (configure_emrfs_lib.sh) whose contents are as follows:
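A sketch of the script, assuming the JAR was uploaded to the placeholder path used earlier; /usr/share/aws/emr/auxlib is the EMRFS auxiliary library folder:

```shell
#!/bin/bash
# configure_emrfs_lib.sh -- copy the custom credentials provider JAR into
# the EMRFS auxiliary library folder on every node of the cluster.
set -e
sudo mkdir -p /usr/share/aws/emr/auxlib
sudo aws s3 cp s3://mybucket/myfolder/MyAWSCredentialsProvider.jar /usr/share/aws/emr/auxlib/
```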
Now, launch a new EMR cluster. Specify the full class name of the credentials provider by setting fs.s3.customAWSCredentialsProvider in the emrfs-site configuration classification. Add a custom bootstrap action (configure_emrfs_lib.sh), which copies the JAR file to the auxiliary library folder of EMRFS on all nodes of the cluster.
Using AWS CLI, run the following command to launch an EMR cluster configured with a custom credentials provider. The configuration also includes a bootstrap action to place your custom JAR file in the relevant directory.
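A sketch of that command; the cluster name, key pair, instance settings, release label, and S3 paths are placeholder assumptions, while the emrfs-site classification carries the provider class name:

```shell
aws emr create-cluster \
  --name "emrfs-custom-credentials-demo" \
  --release-label emr-5.9.0 \
  --applications Name=Hive Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=mykeypair \
  --instance-type m4.large \
  --instance-count 3 \
  --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.customAWSCredentialsProvider":"MyAWSCredentialsProvider"}}]' \
  --bootstrap-actions Name="Configure EMRFS lib",Path="s3://mybucket/myfolder/configure_emrfs_lib.sh"
```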
Now that you have a cluster running with the capability to access data in the research account, you can use the various data processing applications available on EMR, such as Hive, Pig, and Spark, or frameworks like MapReduce on YARN, to analyze the data.
There are some applications on EMR — like Presto and Oozie — that do not use EMRFS to interact with S3, so you will not be able to use these applications in this scenario.
In this post, I explained how EMRFS obtains credentials to sign API calls to S3. I covered how you can implement a custom credentials provider for EMRFS to access objects in an S3 bucket that otherwise could not be accessed using the default credentials provider. I also demonstrated how to configure cross-account S3 API access and use EMRFS to provide custom credentials. This enables your big data applications running on EMR to access data stored in an S3 bucket belonging to a different account.
For more information about using the EMRFS custom credentials provider, see Create an AWSCredentialsProvider for EMRFS.
If you have questions or suggestions, please leave a comment below.
About the Author
Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys going for camping and exploring different restaurants in the Seattle area.