AWS Big Data Blog
Encrypt Your Amazon Redshift Loads with Amazon S3 and AWS KMS
Russell Nash is a Solutions Architect with AWS
Have you been looking for a straightforward way to encrypt your Amazon Redshift data loads? Have you wondered how to safely manage the keys and where to perform the encryption?
In this post, I will walk through a solution that meets these requirements by showing you how to easily encrypt your data loads into Amazon Redshift from end to end, using the server-side encryption features of Amazon S3 coupled with the AWS Key Management Service (AWS KMS).
Tutorial: Encrypting your data loads end to end
At a high level, the major steps in this process are:
- You create a master key in AWS KMS.
- You load the data into S3.
- S3 retrieves the key from KMS and encrypts the data.
- You run the COPY command in Amazon Redshift to load the data from S3.
You’ll need the following AWS pieces in place:
- An AWS Identity and Access Management (IAM) user or role with the "PowerUserAccess" managed policy attached. Note that in a production system you would be more granular and restrictive in controlling access to master keys.
- An S3 bucket.
- The AWS Command Line Interface (AWS CLI) installed and configured.
- A running Amazon Redshift cluster.
- A data file for loading into Amazon Redshift.
- The Amazon Redshift documentation has details on preparing a data file for ingest with COPY.
- If you want to follow my example exactly, you can download my data file.
Step 1: Create a master key in AWS KMS
Before we start using the AWS Key Management Service, let's quickly discuss what it is. Encryption requires keys to perform the encrypt and decrypt operations on your data. How you manage these keys matters, because you need to be able to generate, store, rotate, and control access to them. AWS KMS lets you do all of this in a secure and cost-effective way, and it is integrated with many AWS services, including Amazon Redshift and IAM, which makes it very easy to use.
As with most AWS services, we could interact with KMS using the console. But here, we are going to use the command line.
Note: KMS is a regional service, so make sure you are connecting to it in the same region as your S3 bucket.
First, let’s create the key. The description option isn’t required, but it’s good practice to include it.
aws kms create-key --description "For Redshift SSE loads"
You then get back some metadata about the new key.
{ "KeyMetadata": { "KeyId": "abcdefg1-413c-a023-1e43b53e5392", "Description": "For Redshift SSE loads", "Enabled": true, "KeyUsage": "ENCRYPT_DECRYPT", "KeyState": "Enabled", "CreationDate": 1456091123.452, "Arn": "arn:aws:kms:ap-southeast-2:012345678999:key/abcdefg1-697a-413c-a023-1e43b53e5392", "AWSAccountId": "012345678999" } }
Add an alias to the key
This step is also optional, but again it's good practice to add an alias so that you can easily identify the key later.
aws kms create-alias --alias-name alias/Redshift-SSE --target-key-id abcdefg1-697a-413c-a023-1e43b53e5392
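If you want to verify that the alias now points at the new key, KMS lets you describe the key by its alias:

aws kms describe-key --key-id alias/Redshift-SSE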
Step 2: Load the data into S3
Before we upload our data file into S3, there are two other points to discuss.
Using TLS for encryption in transit – The data will be encrypted at rest in S3. To help ensure that the object is also secure in transit from your network to S3, the AWS CLI, SDKs, and APIs use Transport Layer Security (TLS), the successor to Secure Sockets Layer (SSL), when communicating with AWS services.
Using AWS Signature Version 4 – For added security, all requests that upload or access objects protected by KMS must be signed with AWS Signature Version 4. In certain AWS regions, S3 only accepts Version 4, so the AWS SDKs and CLI use it by default there; in other regions, however, they default to Version 2. Follow the instructions in the S3 documentation for specifying the signature version to ensure that Version 4 is being used.
Once the signature version is set correctly for the AWS CLI, you should see the following in your AWS CLI config file, under either your default or a custom profile. The config file is typically located at ~/.aws/config on Linux, OS X, or Unix, or at C:\Users\USERNAME\.aws\config on Windows.
[default]
s3 =
    signature_version = s3v4
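If you prefer not to edit the config file by hand, the AWS CLI can write the same setting for you:

aws configure set default.s3.signature_version s3v4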
Now let’s upload our file to S3 with the cp command, specifying that we want to use server-side encryption (SSE) with KMS and also specifying the key ID of the master key to use.
aws s3 cp flight_data.gz s3://redshiftdata-kmsdemo/flight_data_sse.gz --sse aws:kms --sse-kms-key-id abcdefg1-697a-413c-a023-1e43b53e5392
Step 3: Retrieve the key from KMS and encrypt the data in S3
One reason customers choose server-side encryption is that S3 performs the encryption and decryption for you. The main benefits of this approach are that you don't have to provision any extra compute resources for those operations, and that interaction with other services such as Amazon Redshift remains seamless.
Although S3 handles the process of server-side encryption transparently, it’s interesting to know what’s going on under the covers.
When you upload an object into S3 and specify SSE as we did in step 2, a unique 256-bit key (a data key) is generated and the data is encrypted using that key with AES-256.
The data key is then encrypted using a master key, which S3 retrieves from KMS using the master key ID you supplied in the s3 cp command.
The encrypted data key is then stored with the encrypted data in the S3 bucket.
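If you're curious, you can reproduce the same envelope pattern yourself with the KMS GenerateDataKey API, which returns a plaintext data key together with a copy of that key encrypted under your master key. This is purely an illustration of what S3 does on your behalf:

aws kms generate-data-key --key-id abcdefg1-697a-413c-a023-1e43b53e5392 --key-spec AES_256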
If you look at the details of the object in the S3 console, you see that it’s marked as being server-side encrypted with the alias of the KMS master key we used.
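You can confirm the same thing from the command line with the s3api head-object command, whose response includes a ServerSideEncryption value of aws:kms along with the SSEKMSKeyId that was used:

aws s3api head-object --bucket redshiftdata-kmsdemo --key flight_data_sse.gz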
Note: You can specify a condition in the S3 bucket policy to ensure that only objects requesting server-side encryption can be PUT into that bucket.
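As a minimal sketch of such a policy, using the demo bucket name from this walkthrough, the following deny statement rejects any PUT request that doesn't ask for SSE-KMS (see the S3 documentation for the authoritative pattern):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedPuts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::redshiftdata-kmsdemo/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            }
        }
    ]
}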
Step 4: Run the COPY command in Amazon Redshift to load data from S3
If your target table doesn’t already exist, create it.
CREATE TABLE FLIGHT_DATA (
    FL_DATE           TIMESTAMP,
    AIRLINE_ID        INTEGER,
    ORIGIN_REGION     VARCHAR(10),
    ORIGIN_DIVISION   VARCHAR(20),
    ORIGIN_STATE_NAME CHAR(15),
    ORIGIN_STATE_ABR  CHAR(2),
    ORIGIN_AP         CHAR(3),
    DEST_REGION       VARCHAR(10),
    DEST_DIVISION     VARCHAR(20),
    DEST_STATE_NAME   CHAR(15),
    DEST_STATE_ABR    CHAR(2),
    DEST_AP           CHAR(3),
    DEP_DELAY         NUMERIC(8, 2),
    ARR_DELAY         NUMERIC(8, 2),
    CANCELLED         BOOLEAN
)
DISTSTYLE EVEN
SORTKEY (FL_DATE);
Then run the COPY command.
COPY FLIGHT_DATA
FROM 's3://redshiftdata-kmsdemo/flight_data_sse.gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::012345678999:role/Redshift-Loader'
GZIP
DELIMITER ',';
INFO:  Load into table 'flight_data' completed, 1000000 record(s) loaded successfully.
COPY
A quick query confirms that the data is in place:

select origin_ap, dest_ap, count(*)
from flight_data
group by 1, 2
order by 3 desc
limit 10;
 origin_ap | dest_ap | count
-----------+---------+-------
 LAX       | LAS     |  3057
 PHX       | LAX     |  2996
 LAX       | PHX     |  2989
 SFO       | LAX     |  2984
 LAX       | SFO     |  2978
 LAS       | LAX     |  2963
 MSP       | ORD     |  2561
 ORD       | MSP     |  2557
 EWR       | ORD     |  2555
 ORD       | EWR     |  2489
(10 rows)
That’s the end of the walk-through. However, I want to mention a couple of other items of interest regarding encryption with Amazon Redshift.
Other options for key management
In this post, we used KMS for the management of the master key that S3 uses for server-side encryption.
There are a couple of other options for key management, which are described in more detail in the S3 documentation section that discusses protecting data with S3 server-side encryption.
SSE-S3: With this option, S3 manages the master key for you.
SSE-C: The C stands for customer in this option, which allows you to supply your own master key. Amazon Redshift does not support loading objects that are encrypted in this way.
Server-side vs. client-side encryption
Although server-side encryption uses S3 to perform the encryption and decryption process, you can also use client-side encryption (CSE), which means that you perform the encryption in your network before uploading the data to S3. This approach is discussed in more detail in the Amazon Redshift documentation on using client-side encryption for loading.
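For reference, a client-side encrypted load looks broadly like the following sketch, based on the COPY syntax in the Amazon Redshift documentation: you supply the symmetric master key in the credentials string and add the ENCRYPTED option. The object name and the base64 key value here are placeholders:

COPY FLIGHT_DATA
FROM 's3://redshiftdata-kmsdemo/flight_data_cse.gz'
CREDENTIALS 'aws_iam_role=arn:aws:iam::012345678999:role/Redshift-Loader;master_symmetric_key=<base64-encoded-256-bit-key>'
GZIP
DELIMITER ','
ENCRYPTED;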
Cluster encryption vs. data load encryption
Here we've covered encrypting your data loads, but you also have the option to encrypt the data at rest within the Amazon Redshift cluster itself. That capability, cluster encryption, is easy to set up and is well documented in the Amazon Redshift Cluster Management Guide.
These two aspects of encryption are completely separate, with no dependency on each other. You can set up either, neither, or both. If you do have both, the two types of encryption will use different keys.
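For example, cluster encryption with a KMS master key can be requested when the cluster is created. The --encrypted and --kms-key-id options below are the relevant CLI flags, while the cluster identifier, node type, and credentials are placeholders for this sketch:

aws redshift create-cluster --cluster-identifier demo-encrypted --cluster-type single-node --node-type dc1.large --master-username admin --master-user-password <Password1> --encrypted --kms-key-id abcdefg1-697a-413c-a023-1e43b53e5392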
Conclusion
In this post, I’ve shown you how to easily encrypt your Amazon Redshift data loads from end to end, using server-side encryption and the AWS Key Management Service. There are no additional S3 charges for using server-side encryption. Also, KMS has a free tier that covers 20,000 requests per month, so please try it out.
I have focused on loading data into a cluster, but if you need to move encrypted data between clusters, have a look at the Amazon Redshift Unload/Copy utility. The utility uses AWS KMS to do exactly that.
If you have questions or suggestions please add a comment below.