AWS Big Data Blog

Encrypt Your Amazon Redshift Loads with Amazon S3 and AWS KMS

Russell Nash is a Solutions Architect with AWS

Have you been looking for a straightforward way to encrypt your Amazon Redshift data loads? Have you wondered how to safely manage the keys and where to perform the encryption?

In this post, I will walk through a solution that meets these requirements by showing you how to easily encrypt your data loads into Amazon Redshift from end to end, using the server-side encryption features of Amazon S3 coupled with the AWS Key Management Service (AWS KMS).

Tutorial: Encrypting your data loads end to end

Below is a high-level view of the process.

High-level view of the process

The major steps in this process are:

  1. You create a master key in AWS KMS.
  2. You load the data into S3.
  3. S3 retrieves the key from KMS and encrypts the data.
  4. You run the COPY command in Amazon Redshift to load the data from S3.

You’ll need the following AWS pieces in place:

  • An AWS Identity and Access Management (IAM) user with the PowerUserAccess policy attached. Note that in a production system you would be more granular in controlling access to master keys.
  • An S3 bucket.
  • The AWS Command Line Interface (AWS CLI) installed and configured.
  • A running Amazon Redshift cluster.
  • A data file for loading into Amazon Redshift.

 

Step 1: Create a master key in AWS KMS

Before we start using the AWS Key Management Service, let’s quickly discuss what it is. Encryption requires keys to allow you to perform the encrypt and decrypt operations on your data. How you manage these keys is important, because you need to be able to generate, store, rotate, and control access to them. AWS KMS is a key management service that allows you to do this in a secure and cost-effective way. KMS is integrated into many AWS services, including Amazon Redshift and IAM, which makes it very easy to use.

As with most AWS services, we could interact with KMS using the console. But here, we are going to use the command line.

Note: KMS is a regional service, so make sure you are connecting to it in the same region as your S3 bucket.

First, let’s create the key. The description option isn’t required, but it’s good practice to include it.

$ aws kms create-key --description "For Redshift SSE loads”

You then get back some metadata about the new key.

{
    "KeyMetadata": {
        "KeyId": "abcdefg1-413c-a023-1e43b53e5392", 
        "Description": "For Redshift SSE loads", 
        "Enabled": true, 
        "KeyUsage": "ENCRYPT_DECRYPT", 
        "KeyState": "Enabled", 
        "CreationDate": 1456091123.452, 
        "Arn": "arn:aws:kms:ap-southeast-2:012345678999:key/abcdefg1-697a-413c-a023-1e43b53e5392", 
        "AWSAccountId": "012345678999"
    }
}

Add an alias to the key

This step is also optional, but again it’s good practice to add an alias to the key, to later identify the key you’ve used.

aws kms create-alias --alias-name alias/Redshift-SSE --target-key-id abcdefg1-697a-413c-a023-1e43b53e5392

 

Step 2: Load the data into S3

Before we upload our data file into S3, there are two other points to discuss.

Using TLS for encryption in transit – The data will be encrypted at rest in S3. To help make sure that the object is also secure in transit from your network to S3, the AWS CLI, SDK and API use Transport Layer Security (TLS), the successor to Secure Sockets Layer (SSL), when communicating with AWS services.

Using AWS Signature Version 4 – For added security, KMS requires that all requests for uploading or accessing objects use AWS Signature Version 4. In certain AWS regions, S3 will only accept Version 4, and the AWS SDKs and CLI will therefore use that by default in those regions. However, in other regions they will default to Version 2. Follow the instructions in the S3 documentation for specifying the signature version, which explain how to ensure that Version 4 is being used.

Once the key signature is set correctly for the AWS CLI, you should see the following in your AWS CLI config file, either under your default or custom profile. The CLI config file is typically located at ~/.aws/config on Linux, OS X, or Unix, or at C:UsersUSERNAME .awsconfig on Windows.

[default]
s3 =
    signature_version = s3v4

Now let’s upload our file to S3 with the cp command, specifying that we want to use server-side encryption (SSE) with KMS and also specifying the key ID of the master key to use.

aws s3 cp flight_data.gz s3://redshiftdata-kmsdemo/flight_data_sse.gz --sse aws:kms --sse-kms-key-id abcdefg1-697a-413c-a023-1e43b53e5392

 

Step 3: Retrieve the key from KMS and encrypt the data in S3

One reason customers choose server-side encryption is that it uses S3 to perform the encryption and decryption. The main benefits of this approach are that you don’t have to provide any extra compute resources for those operations. It also makes interacting with other services like Amazon Redshift seamless.

Although S3 handles the process of server-side encryption transparently, it’s interesting to know what’s going on under the covers.

When you upload an object into S3 and specify SSE as we did in step 2, a unique 256-bit key (a data key) is generated and the data is encrypted using that key with AES-256.

The data key is then encrypted using a master key, which S3 retrieves from KMS using the master key ID you supplied in the s3 cp command.

The encrypted data key is then stored with the encrypted data in the S3 bucket.

If you look at the details of the object in the S3 console, you see that it’s marked as being server-side encrypted with the alias of the KMS master key we used.

Note: You can specify a condition in the S3 bucket policy to ensure that only objects requesting server-side encryption can be PUT into that bucket.

 

Step 4: Run the COPY command in Amazon Redshift to load data from S3

If your target table doesn’t already exist, create it.

CREATE TABLE FLIGHT_DATA
(
	FL_DATE TIMESTAMP,
	AIRLINE_ID INTEGER,
	ORIGIN_REGION VARCHAR(10),
	ORIGIN_DIVISION VARCHAR(20),
	ORIGIN_STATE_NAME CHAR(15),
	ORIGIN_STATE_ABR CHAR(2),
	ORIGIN_AP CHAR(3),
	DEST_REGION VARCHAR(10),
	DEST_DIVISION VARCHAR(20),
	DEST_STATE_NAME CHAR(15),
	DEST_STATE_ABR CHAR(2),
	DEST_AP CHAR(3),
	DEP_DELAY NUMERIC(8, 2),
	ARR_DELAY NUMERIC(8, 2),
	CANCELLED BOOLEAN
)
DISTSTYLE EVEN
SORTKEY
(
	FL_DATE
);

Then run the COPY command.

COPY FLIGHT_DATA FROM 's3://redshiftdata-kmsdemo/flight_data_sse.gz' CREDENTIALS 'aws_iam_role=arn:aws:iam::012345678999:role/Redshift-Loader' GZIP DELIMITER ',';

We don’t need to make any reference to encryption in the COPY command, because S3 sees that the file is encrypted using SSE-KMS and communicates with KMS automatically to retrieve the correct master key. It then uses this key to decrypt the data key and in turn uses the data key to decrypt the data and provide it to the COPY command for loading. Communication between Amazon Redshift and S3 uses TLS, so the data is always encrypted in transit during loading.

You can now attach IAM roles to Amazon Redshift so that when you run commands like COPY and UNLOAD you can assume a role rather than use the credentials for an individual IAM user. We’ve used a role in our COPY command above, and you can get all the details on authorizing COPY and UNLOAD operations using IAM roles from the documentation.

For added security, the IAM user or role that is running the COPY command needs not only S3 read permission on the file but also permission to use the key within KMS. For the purposes of following this post you can attach the PowerUserAccess policy to the user or role but in production you would grant access for certain KMS keys to specific users and roles. For more information on KMS permissions, have a look at the section of the KMS documentation that discusses controlling access to customer master keys.

To illustrate the key points in this post, I have used only one data file for my COPY command. But the best practice for loading Amazon Redshift is to split your input data across a number of files, ideally a multiple of the number of slices in your cluster.

If all goes well with the COPY, we should get confirmation of the number of rows loaded, and we can then query the table.

INFO:  Load into table 'flight_data' completed, 1000000 record(s) loaded successfully.
COPY
select origin_ap, dest_ap, count(*) from flight_data group by 1,2 order by 3 desc limit 10;
 origin_ap | dest_ap | count 
-----------+---------+-------
 LAX       | LAS     |  3057
 PHX       | LAX     |  2996
 LAX       | PHX     |  2989
 SFO       | LAX     |  2984
 LAX       | SFO     |  2978
 LAS       | LAX     |  2963
 MSP       | ORD     |  2561
 ORD       | MSP     |  2557
 EWR       | ORD     |  2555
 ORD       | EWR     |  2489
(10 rows)

That’s the end of the walk-through. However, I want to mention a couple of other items of interest regarding encryption with Amazon Redshift.

Other options for key management

In this post, we used KMS for the management of the master key that S3 uses for server-side encryption.

There are a couple of other options for key management, which are described in more detail in the S3 documentation section that discusses protecting data with S3 server-side encryption.

SSE-S3: With this option, S3 manages the master key for you.

SSE-C: The C stands for customer in this option, which allows you to supply your own master key. Amazon Redshift does not support loading objects that are encrypted in this way.

Server-side vs. client-side encryption

Although server-side encryption uses S3 to perform the encryption and decryption process, you can also use client-side encryption (CSE), which means that you perform the encryption in your network before uploading the data to S3. This approach is discussed in more detail in the Amazon Redshift documentation on using client-side encryption for loading.

Cluster encryption vs data load encryption

Here we’ve covered encrypting your data loads, and you also have the option to encrypt the data at rest within the Amazon Redshift cluster. This approach is covered by cluster encryption, which is easy to set up and well-documented in the Amazon Redshift Cluster Management Guide.

These two aspects of encryption are completely separate, with no dependency on each other. You can set up either, neither, or both. If you do have both, the two types of encryption will use different keys.

Conclusion

In this post, I’ve shown you how to easily encrypt your Amazon Redshift data loads from end to end, using server-side encryption and the AWS Key Management Service. There are no additional S3 charges for using server-side encryption. Also, KMS has a free tier that covers 20,000 requests per month, so please try it out.

I have focused on loading data into a cluster, but if you need to move encrypted data between clusters, have a look at the Amazon Redshift Unload/Copy utility. The utility uses AWS KMS to do exactly that.

If you have questions or suggestions please add a comment below.

——————–

Related

Agile Analytics with Amazon Redshift

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.