AWS Big Data Blog

Process Encrypted Data in Amazon EMR with Amazon S3 and AWS KMS

Russell Nash is a Solutions Architect with AWS.

Amo Abeyaratne, a Big Data consultant with AWS, also contributed to this post.

One of the most powerful features of Amazon EMR is its close integration with Amazon S3 through EMRFS, which allows you to take advantage of many S3 features, including support for S3 client-side and server-side encryption. A recent release added EMR support for S3 server-side encryption with AWS KMS keys (SSE-KMS), alongside the already supported SSE-S3 (S3-managed keys) and S3 client-side encryption with KMS keys or custom key providers.

In this post, I show how easy it is to create a master key in KMS, encrypt data either client-side or server-side, upload it to S3, and have EMR seamlessly read and write that encrypted data to and from S3 using the master key that you created.

Encryption: AWS KMS, CSE, and SSE

AWS KMS is a centralized key management service which allows you to create, rotate, log, and control access to keys that are used for encrypting your data. It protects your keys by using hardware security modules (HSMs) and provides a very cost-effective solution by allowing you to pay only for what you use. KMS integrates with other AWS services using envelope encryption, which is described succinctly in the KMS Developer Guide.
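The envelope pattern can be sketched locally with OpenSSL: a random data key encrypts the payload, and a master key encrypts the data key. This is illustrative only; with KMS, the master key never leaves the service, and the master-key operations happen server-side inside an HSM. All file names here are made up for the sketch.

```shell
# Toy envelope encryption with OpenSSL -- illustrative only.
echo "example payload" > plain.txt
openssl rand -hex 32 > data.key                 # random data encryption key
openssl enc -aes-256-cbc -pbkdf2 -pass file:data.key -in plain.txt -out data.enc
openssl rand -hex 32 > master.key               # stands in for the KMS master key
openssl enc -aes-256-cbc -pbkdf2 -pass file:master.key -in data.key -out data.key.enc
# Store data.enc and data.key.enc together; to decrypt, first recover the data key
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:master.key -in data.key.enc -out data.key.dec
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:data.key.dec -in data.enc -out plain.dec
```

Only the small encrypted data key round-trips through the master key; the bulk data is encrypted once with the data key, which is what makes the pattern cheap at scale.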

KMS can be used to manage the keys for both S3 client-side and server-side encryption. For more information, see How Amazon S3 uses AWS KMS.

The main difference between the two is where encryption and decryption are performed. This matters because although both CSE and SSE can protect data in transit using Transport Layer Security (TLS), certain applications must meet compliance requirements by also encrypting data at rest before it leaves the corporate network.

The following diagrams illustrate where encryption and decryption are performed for SSE and CSE.

Figure 1. Server-Side encryption – Location of operations

Server-side encryption uses S3 for the encrypt/decrypt operations.

Figure 2. Client-Side encryption – Location of operations

Client-side encryption, as the name suggests, uses whatever the ‘client’ to S3 is for the encryption and decryption tasks. When using CSE with EMR, your cluster becomes the client and performs the required operations when reading and writing data to S3.

Encryption tutorial

In this tutorial, you’ll learn how to:

  1. Create a master key in KMS.
  2. Load two data files into S3, one using CSE and the other using SSE.
  3. Launch two EMR clusters configured for CSE and SSE.
  4. Access data in S3 from the EMR clusters.

Create a master key in KMS

Use the AWS Management Console to create a master key in KMS. This is covered in detail in the KMS Developer Guide, but a summarized version follows. Note that KMS is a regional service, so make sure you create your key in the same region as your S3 bucket and EMR cluster.

  • Go to the IAM section of the AWS Management Console.
  • On the left navigation pane, choose Encryption Keys.
  • Filter for the region in which to create the key.
  • Choose Create Key.
  • Go through the four steps, making sure that in Step 3 you provide key usage permissions for:
    • The user or role that uploads the file to S3
    • The EMR_EC2_DefaultRole to allow EMR to use the key

Load the files into S3 — SSE files

To load data files for SSE-KMS, first make sure that Signature Version 4 is being used for your requests. Follow the Specifying Signature Version in Request Authentication instructions, which explain how to ensure that this is the case.
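For the AWS CLI, one way to force Signature Version 4 (assuming you are using the default profile) is:

```shell
# Ensure the CLI signs S3 requests with Signature Version 4, required for SSE-KMS
aws configure set default.s3.signature_version s3v4
```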

In the Encryption Keys section of the IAM console, find the key-id for the master key you just created and use it in the S3 cp command below. This instructs S3 to encrypt the file at rest and contact KMS for the master key that corresponds to the key-id. For more detail on what’s going on under the covers, see the Encrypt Your Amazon Redshift Loads with Amazon S3 and AWS KMS post.

aws s3 cp flight_data.gz s3://redshiftdata-kmsdemo/flight_data_sse.gz --sse aws:kms --sse-kms-key-id abcdefg1-697a-413c-a023-1e43b53e5392
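You can confirm the object landed encrypted with a head-object call; for an SSE-KMS object the response includes the encryption algorithm and key ARN (bucket and key names here match the cp command above):

```shell
aws s3api head-object --bucket redshiftdata-kmsdemo --key flight_data_sse.gz
# The response should include fields such as:
#   "ServerSideEncryption": "aws:kms",
#   "SSEKMSKeyId": "arn:aws:kms:...:key/abcdefg1-697a-413c-a023-1e43b53e5392"
```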

Load the files into S3 — CSE files

Using S3 client-side encryption involves a little more work because you are responsible for the encryption. This tutorial uses the AWS SDK for Java because it includes the AmazonS3EncryptionClient, which lets you easily encrypt and upload your file.

A colleague of mine has kindly provided the following Java code as an example; it takes as input the S3 bucket name, S3 object key, KMS master key id, AWS region, and source file name.

package S3Ecopy;

import java.io.File;
import java.io.FileInputStream;
import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3EncryptionClient;
import com.amazonaws.services.s3.model.CryptoConfiguration;
import com.amazonaws.services.s3.model.KMSEncryptionMaterialsProvider;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class S3ecopy {

    private static AmazonS3EncryptionClient encryptionClient;

    public static void main(String[] args) throws Exception {
        if (args.length != 5) {
            System.out.println("syntax: s3ecopy <bucket-name> <object-key> <kms-key-id> <aws-region> <source-file>");
            return;
        }

        String bucketName  = args[0];
        String objectKey   = args[1];
        String kmsCmkId    = args[2];
        Regions awsRegion  = Regions.valueOf(args[3]);
        File srcFile       = new File(args[4]);

        // The materials provider tells the client which KMS master key to use
        // when requesting data keys for envelope encryption.
        KMSEncryptionMaterialsProvider materialProvider =
            new KMSEncryptionMaterialsProvider(kmsCmkId);

        // The encryption client encrypts the object locally before upload.
        encryptionClient = new AmazonS3EncryptionClient(
                new ProfileCredentialsProvider(), materialProvider,
                new CryptoConfiguration().withKmsRegion(awsRegion))
            .withRegion(Region.getRegion(awsRegion));

        try {
            System.out.println("uploading file: " + srcFile.getPath());
            encryptionClient.putObject(new PutObjectRequest(bucketName, objectKey,
                    new FileInputStream(srcFile), new ObjectMetadata()));
        } catch (AmazonClientException ace) {
            System.out.println("Caught an AmazonClientException, which means the client " +
                    "encountered an internal error while trying to communicate with S3, " +
                    "such as not being able to access the network.");
            System.out.println("Error Message: " + ace.getMessage());
        }
    }
}

After this is compiled and you call it with the correct parameters, the code sends a request to KMS; in response, KMS returns a randomly generated data encryption key that the code uses to encrypt the data file. In addition, KMS provides an encrypted version of the data key that is uploaded with the data object and stored as metadata.

java -jar S3Ecopy.jar emr-demo-data/cse flight_data_cse.gz abcdefg1-697a-413c-a023-1e43b53e5392 AP_SOUTHEAST_2 flight_data.gz
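Because CSE happens before upload, S3 only ever sees ciphertext; the encrypted data key travels as user metadata on the object. Assuming the object ends up at s3://emr-demo-data/cse/flight_data_cse.gz, a head-object call should show the client-side crypto metadata that the Java encryption client writes:

```shell
aws s3api head-object --bucket emr-demo-data --key cse/flight_data_cse.gz
# The "Metadata" section should include keys such as:
#   "x-amz-key-v2"   (the encrypted data key)
#   "x-amz-matdesc"  (the material description, including the KMS key id)
#   "x-amz-cek-alg"  (the content encryption algorithm)
```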

Launch EMR clusters

To configure your EMR clusters to use CSE or SSE through the CLI, add EMRFS parameters to the create-cluster command; if you’re using the console, configure it under Advanced Options, Step 4 – Security.

It’s worth noting what the EMRFS encryption configurations mean for reading and writing data encrypted using the different methods, assuming of course that the cluster has permissions to use the KMS master key.

The table below shows that SSE data can be read by an EMR cluster regardless of the EMRFS configuration, while CSE data can only be read if the cluster has been configured for CSE. The EMRFS configuration also dictates how data is written to S3.

  EMRFS encryption setting | Can read SSE data | Can read CSE data | Writes to S3 as
  ServerSide (SSE)         | Yes               | No                | SSE
  ClientSide (CSE)         | Yes               | Yes               | CSE

Now launch your two clusters, starting with the one for SSE:

aws emr create-cluster --release-label emr-4.5.0 \
--name SSE-Cluster \
--applications Name=Hive \
--ec2-attributes KeyName=mykey \
--region ap-southeast-2 \
--use-default-roles \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryption.kms.keyId=abcdefg1-697a-413c-a023-1e43b53e5392]

The command for the CSE cluster is very similar but with a change to the emrfs parameter:

aws emr create-cluster --release-label emr-4.5.0 \
--name CSE-Cluster \
--applications Name=Hive \
--ec2-attributes KeyName=mykey \
--region ap-southeast-2 \
--use-default-roles \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--emrfs Encryption=ClientSide,ProviderType=KMS,KMSKeyId=abcdefg1-697a-413c-a023-1e43b53e5392
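Once a cluster is up, you can check which encryption mode EMRFS picked up by inspecting emrfs-site.xml on the master node (the path below is where EMR 4.x releases keep it; the property names are assumptions based on the CSE settings):

```shell
# Run on the cluster master node:
grep -A1 "fs.s3.cse" /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
# A CSE-KMS cluster should show fs.s3.cse.enabled set to true and
# fs.s3.cse.kms.keyId set to your master key id
```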

Access data in S3 from EMR

The advantage of baking the SSE or CSE parameters into your EMRFS configuration is that any operation that accesses S3 data through EMRFS can read and write that data seamlessly, with no concern for the type of encryption or the key management; EMR, S3, and KMS handle that automatically.

Note that the encryption settings in EMRFS apply only to applications that use it to interface with S3; for example, Presto does not use EMRFS, so you would have to enable encryption through the PrestoS3Filesystem.

Hive does use EMRFS, so it’s used here to illustrate reading from S3.

SSH into the cluster and drop into the Hive shell.

$ ssh -i mykey.pem hadoop@<EMR-CLUSTER-MASTER-DNS>
$ hive

Create a Hive table pointing to the S3 location of your data.

hive> CREATE EXTERNAL TABLE FLIGHTS_SSE(
FL_DATE TIMESTAMP,
AIRLINE_ID INT,
ORIGIN_REGION String,
ORIGIN_DIVISION STRING,
ORIGIN_STATE_NAME STRING,
ORIGIN_STATE_ABR STRING,
ORIGIN_AP STRING,
DEST_REGION STRING,
DEST_DIVISION STRING,
DEST_STATE_NAME STRING,
DEST_STATE_ABR STRING,
DEST_AP STRING,
DEP_DELAY DECIMAL(8, 2),
ARR_DELAY DECIMAL(8, 2),
CANCELLED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION 's3://emr-demo-data/sse';

If you get an error from KMS when you create the table, saying that you don’t have access to the key, then check that you’ve given usage permissions to EMR_EC2_DefaultRole on your KMS master key.

Now you can query the table.

hive> select origin_region, cancelled, count(*) from flights_sse group by origin_region,cancelled;
Midwest		f	211141
Midwest		t	14544
Northeast	f	126801
Northeast	t	10167
South		f	336676
South		t	11643
West		f	280969
West		t	8059

To test the CSE configuration, log in to your CSE-Cluster and point your Hive table to the location of your CSE data file, which in this example is emr-demo-data/cse.

Conclusion

In this post, I’ve shown you how to configure your EMR clusters so that they can read and write either Amazon S3 client-side or Amazon S3 server-side encrypted data seamlessly with EMRFS. There’s no additional cost for using encryption with S3 or EMR, and KMS has a free tier of 20,000 requests per month; if you have an encryption requirement for your EMR data, you can easily set it up and try it out.

Note that the encryption I’ve talked about in this post covers data in S3. If you need to encrypt data in HDFS, see Transparent Encryption in HDFS in the EMR documentation.

If you have questions or suggestions, please leave a comment below.

———————————

Related

Encrypt Your Amazon Redshift Loads with Amazon S3 and AWS KMS
