AWS Big Data Blog

Secure Amazon EMR with Encryption

In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Due to the highly critical nature of the workloads, the enterprises implement certain organization/industry wide policies and certain regulatory or compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.

A common requirement within such policies is about encrypting data at-rest and in-flight. Amazon EMR uses “security configurations” to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.

You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.

o_Amazon_EMR_Encryption_1

In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.

Data at rest

  • Data residing on Amazon S3—S3 client-side encryption with EMR
  • Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key System (LUKS)

Data in transit

  • Data in transit from EMR to S3, or vice versa—S3 client side encryption with EMR
  • Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
  • Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption

Encryption walkthrough

For this post, you create a security configuration that implements encryption in transit and at rest. To achieve this, you create the following resources:

  • KMS keys for LUKS encryption and S3 client-side encryption for data exiting EMR to S3
  • SSL certificates to be used for MapReduce shuffle encryption
  • The environment into which the EMR cluster is launched. For this post, you launch EMR in private subnets and set up an S3 VPC endpoint to get the data from S3.
  • An EMR security configuration

All of the scripts and code snippets used for this walkthrough are available on the aws-blog-emrencryption GitHub repo.

Generate KMS keys

For this walkthrough, you use AWS KMS, a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data and disks.

You generate two KMS master keys, one for S3 client-side encryption to encrypt data going out of EMR and the other for LUKS encryption to encrypt the local disks. The Hadoop MapReduce framework uses HDFS. Spark uses the local file system on each slave instance for intermediate data throughout a workload, where data could be spilled to disk when it overflows memory.

To generate the keys, use the kms.json AWS CloudFormation script.  As part of this script, provide an alias name, or display name, for the keys. An alias must be in the “alias/aliasname” format, and can only contain alphanumeric characters, an underscore, or a dash.

o_Amazon_EMR_Encryption_2

After you finish generating the keys, the ARNs are available as part of the outputs.

o_Amazon_EMR_Encryption_3

Generate SSL certificates

The SSL certificates allow the encryption of the MapReduce shuffle using HTTPS while the data is in transit between nodes.

o_Amazon_EMR_Encryption_4

For this walkthrough, use OpenSSL to generate a self-signed X.509 certificate with a 2048-bit RSA private key that allows access to the issuer’s EMR cluster instances. This prompts you to provide subject information to generate the certificates.

Use the cert-create.sh script to generate SSL certificates that are compressed into a zip file. Upload the zipped certificates to S3 and keep a note of the S3 prefix. You use this S3 prefix when you build your security configuration.

Important

This example is a proof-of-concept demonstration only. Using self-signed certificates is not recommended and presents a potential security risk. For production systems, use a trusted certification authority (CA) to issue certificates.

To implement certificates from custom providers, use the TLSArtifacts provider interface.

Build the environment

For this walkthrough, launch an EMR cluster into a private subnet. If you already have a VPC and would like to launch this cluster into a public subnet, skip this section and jump to the Create a Security Configuration section.

To launch the cluster into a private subnet, the environment must include the following resources:

  • VPC
  • Private subnet
  • Public subnet
  • Bastion
  • Managed NAT gateway
  • S3 VPC endpoint

As the EMR cluster is launched into a private subnet, you need a bastion or a jump server to SSH onto the cluster. After the cluster is running, you need access to the Internet to request the data keys from KMS. Private subnets do not have access to the Internet directly, so route this traffic via the managed NAT gateway. Use an S3 VPC endpoint to provide a highly reliable and a secure connection to S3.

o_Amazon_EMR_Encryption_5

In the CloudFormation console, create a new stack for this environment and use the environment.json CloudFormation template to deploy it.

As part of the parameters, pick an instance family for the bastion and an EC2 key pair to be used to SSH onto the bastion. Provide an appropriate stack name and add the appropriate tags. For example, the following screenshot is the review step for a stack that I created.

o_Amazon_EMR_Encryption_6

After creating the environment stack, look at the Output tab and make a note of the VPC ID, bastion, and private subnet IDs, as you will use them when you launch the EMR cluster resources.

o_Amazon_EMR_Encryption_7

Create a security configuration

The final step before launching the secure EMR cluster is to create a security configuration. For this walkthrough, create a security configuration with S3 client-side encryption using EMR, and LUKS encryption for local volumes using the KMS keys created earlier. You also use the SSL certificates generated and uploaded to S3 earlier for encrypting the MapReduce shuffle.

o_Amazon_EMR_Encryption_8

Launch an EMR cluster

Now, you can launch an EMR cluster in the private subnet. First, verify that the service role being used for EMR has access to the AmazonElasticMapReduceRole managed service policy. The default service role is EMR_DefaultRole. For more information, see Configuring User Permissions Using IAM Roles.

From the Build an environment section, you have the VPC ID and the subnet ID for the private subnet into which the EMR cluster should be launched. Select those values for the Network and EC2 Subnet fields. In the next step, provide a name and tags for the cluster.

o_Amazon_EMR_Encryption_9

The last step is to select the private key, assign the security configuration that was created in the Create a security configuration section, and choose Create Cluster.

o_Amazon_EMR_Encryption_10

Now that you have the environment and the cluster up and running, you can get onto the master node to run scripts. You need the IP address, which you can retrieve from the EMR console page. Choose Hardware, Master Instance group and note the private IP address of the master node.

o_Amazon_EMR_Encryption_11

As the master node is in a private subnet, SSH onto the bastion instance first and then jump from the bastion instance to the master node. For information about how to SSH onto the bastion and then to the Hadoop master, open the ssh-commands.txt file. For more information about how to get onto the bastion, see the Securely Connect to Linux Instances Running in a Private Amazon VPC post.

After you are on the master node, bring your own Hive or Spark scripts. For testing purposes, the GitHub /code directory includes the test.py PySpark and test.q Hive scripts.

Summary

As part of this post, I’ve identified the different phases where data needs to be encrypted and walked through how data in each phase can be encrypted. Then, I described a step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration. As part of this walkthrough, you also secured the data by launching your cluster in a private subnet within a VPC, and used a bastion instance for access to the EMR cluster.

If you have questions or suggestions, please comment below.


About the Author

sai_90Sai Sriparasa is a Big Data Consultant for AWS Professional Services. He works with our customers to provide strategic & tactical big data solutions with an emphasis on automation, operations & security on AWS. In his spare time, he follows sports and current affairs.

 

 

 


Related

Implementing Authorization and Auditing using Apache Ranger on Amazon EMR

EMRRanger_1