AWS Big Data Blog

Building Multi-AZ or Multi-Region Amazon Redshift Clusters

Erik Swensson is an AWS Solutions Architect.

AWS Solutions Architect Patrick Shumate also contributed to this post.

This post explores customer options for building multi-region or multi-availability zone (AZ) clusters. By default, Amazon Redshift has excellent tools to back up your cluster via snapshot to Amazon Simple Storage Service (Amazon S3). These snapshots can be restored in any AZ in that region or transferred automatically to other regions for disaster recovery. Amazon Redshift can even prioritize data being restored from Amazon S3 based on the queries running against a cluster that is still being restored.

However, sometimes requirements demand a faster Recovery Point Object/Recovery Time Object (RPO/RTO) of a full-scale operational Amazon Redshift cluster. Additionally, Amazon Redshift parallelizes queries across the different nodes of a cluster, but there may be circumstances when you want to allow more concurrent queries than one cluster can provide. You can achieve this by deploying two (or more) identical, independent parallel Amazon Redshift clusters. This design requires all database updates to be performed on all Amazon Redshift clusters.

There are many ways to ensure all writes occur to all clusters, but we will look at one possible way: using Amazon Kinesis. Amazon Kinesis is a durable ordered ingestion service where you can put messages and have one or more Kinesis-enabled applications retrieve and process those messages. In this case, all database modifications and updates would be sent to Amazon Kinesis and there would be an Amazon Kinesis-enabled subscriber for each identical cluster. The data flow diagram would look like this:

The source of each data load and modification query puts all update queries into Amazon Kinesis. In the image above there are three parallel clusters in two different regions, but you can choose any region/AZ combination. There will be one Amazon Kinesis subscriber for each parallel Amazon Redshift cluster. The subscriber will be in the same availability zone as the Redshift cluster it serves and will be in an AutoScaling group with a minimum and a maximum of one. That way, any failure of that subscriber is replaced to allow update processing to continue on where it left off.

As an update SQL command is written to Kinesis, each subscriber reads it and applies it to the Amazon Redshift cluster it serves. Ensure that updates can get to an Amazon Redshift cluster only through this process so that they all stay in sync. From an analysis standpoint, we will use Amazon Route 53, a fully managed DNS service, to do health checks and latency-based routing to determine which cluster will be accessed by users and tools to perform analysis and run read-only queries on the system.

Configuring the Amazon Redshift Cluster

Let’s walk through the configuration. For demonstration purposes, we will use just two Amazon Redshift clusters in the same region, but you could modify these steps to add more regions and parallel clusters.

Assumptions:

  • To keep this demo simple, the AWS CloudFormation template will launch in a Default VPC or EC2-Classic. Best practice would be to have each Amazon Redshift DB and Amazon Kinesis reader in a private subnet.
  • We are doing multi-AZ; to do multi-region you’d need to modify the Amazon Kinesis endpoint used by the Amazon Kinesis-enabled subscribers to fit the specific setup required for your database.
  • In this example, we keep things as simple as possible. There are many optimizations that could be made based off your company’s requirements and database environment.

IMPORTANT:

  • If you are updating a parallel or single Amazon Redshift cluster with a frequency greater than one minute, read the AWS Big Data Blog post that explains best practices for micro-batch loading on Amazon Redshift.

Directions:

  1. First, create two identical Amazon Redshift clusters in different AZs with the same database name, administrative username, and password. You can start with a small, empty database to try out the process. However, if you were taking an existing cluster and enhancing it to span multiple AZ’s or regions, you would just read-only the database and restore a snapshot to another database to have a starting point for your two identical parallel clusters.
  1. Get the Amazon Redshift end-point, database name and Availability Zone of each cluster, which you’ll need as an input in future steps. To get this, go to the AWS console > Redshift > Clusters, and select both clusters. The dashboard has the information:

  1. Launch this AWS CloudFormation template. It creates the following:
  • Amazon Kinesis Stream that ingests all modification queries.
  • Amazon DynamoDB table to track where each query process in case a processor fails and must be recreated.
  • IAM role that can read from the Amazon Kinesis stream and read and write to the DynamoDB tables created above.
  • Two Amazon Kinesis subscribers instances in AutoScaling groups to read from Kinesis and apply update queries to the Redshift clusters. It also deploys them. The Amazon Kinesis subscriber application code can be found here. These instances use the IAM role to gain access to read from the Amazon Kinesis stream.  It uses the Amazon Redshift username, password and endpoints to connect to the clusters to apply the updates as they come in via Amazon Kinesis. This subscriber code is just an example and could be easily modified to accommodate a more complex database environment.
  • Security group around the subscriber instances to allow only SSH port 22 access for administration from an IP address you provide as a parameter.

Now that you have a system for keeping the Amazon Redshift clusters in sync, it’s time to modify the database. We start by creating a table. Then we’ll load data into that table from a public data set in Amazon S3. Amazon Redshift can load data directly from Amazon S3 in parallel, so a best practice for data loading into Amazon Redshift is to write to Amazon S3 first. With this configuration we are only putting load SQL statements and updates through Amazon Kinesis, not actual data.

To write into the Amazon Kinesis stream, we use the Amazon Command Line Interface (CLI) . With CLI you can do this by running the following command to create a customer table with eight columns:

aws kinesis put-record –stream-name <YOUR KINESIS STREAM NAME> –data “$(echo “CREATE TABLE customer (c_custkey int8 NOT NULL, c_name varchar(25) NOT NULL, c_address varchar(40) NOT NULL, c_nationkey int4 NOT NULL, c_phone char(15) NOT NULL, c_acctbal numeric(12,2) NOT NULL, c_mktsegment char(10) NOT NULL, c_comment varchar(117) NOT NULL);”|base64 –wrap=0)” –partition-key “Source1”

You must modify this command and add the Amazon Kinesis stream name from the AWS CloudFormation template. You can do this by checking output from the AWS CloudFormation template in the console, selecting the stack that we just created, and then going to the Outputs tab.

Next, we load data into the Amazon Redshift clusters using the COPY command, which pulls the data from Amazon S3 into the database. We use a public data set that is already accessible, so you must replace your own access-key and secret-access-key in the following COPY statement:

aws kinesis put-record –stream-name <YOUR-KINESIS-STREAM-NAME> –data “$(echo “copy customer from ‘s3://redshift-demo/tpc-h/100lzo/customer/customer.tbl.’ CREDENTIALS ‘aws_access_key_id=;aws_secret_access_key=<YOUR SECRET ACCESS KEY>’ lzop delimiter ‘|’ COMPUPDATE ON;”|base64 –wrap=0)” –partition-key “Source1”

For the purposes of demonstration and to keep this simple, we pass the keys all the way from the source to Amazon Redshift. In a real-world production scenario, you would inject the keys by modifying the Amazon Kinesis subscriber application so that you wouldn’t have the keys stored at each source. As always, limit the access-key and secret–access-key to least privilege, so have only read privileges from required Amazon S3 locations and no other privileges. In this case, the data is stored on a public Amazon S3 bucket for simplicity.

These put queries into the Amazon Kinesis stream, which are then picked up by the Amazon Kinesis-enabled subscriber applications and applied to each Amazon Redshift cluster. To validate this proces, log into both of the cluster end points directly with a SQL client.  AWS provides step-by-step instructions for doing this. Once connected, you can run ‘SELECT COUNT(*) FROM CUSTOMER;’ to confirm that the data is populated in both clusters.

Accessing your Redshift Clusters for Analysis

Now that we are confident the clusters are being updated and synchronized, it is time to give our data analysts and business intelligence tools the ability to query the system. To do this, you can point the tools to a DNS record stored in Route 53 that is aware of both clusters. If they are in the same AZ, you’d probably want to do an evenly weighted round-robin similar to the following Clouldformation template

(Note: This AWS CloudFormation template will not apply since ‘example.com’ is not a valid domain name. You’d need to modify this to your own domain).

{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Resources": {
    "dnsaz1clusterexamplecom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": "example.com.",
        "RecordSets": [
          {
            "Name": "az1cluster.example.com.",
            "Type": "CNAME",
            "TTL": "300",
            "ResourceRecords": [
              "az1cluster.cbjhdm0le2vh.us-east-1.redshift.amazonaws.com"
            ]
          }
        ]
      }
    },
    "dnsaz2clusterexamplecom": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": "example.com.",
        "RecordSets": [
          {
            "Name": "az2cluster.example.com.",
            "Type": "CNAME",
            "TTL": "300",
            "ResourceRecords": [
              "az2cluster.cbjhdm0le2vh.us-east-1.redshift.amazonaws.com"
            ]
          }
        ]
      }
    },
    "dnsmyparallelclusterexamplecomt": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": "example.com.",
        "RecordSets": [
          {
            "Name": "myparallelcluster.example.com.",
            "Type": "CNAME",
            "SetIdentifier": "t",
            "Weight": "50",
            "ResourceRecords": [],
            "AliasTarget": {
              "HostedZoneId": "Z1HN6T29675QII",
              "DNSName": "az2cluster.example.com"
            }
          }
        ]
      }
    },
    "dnsmyparallelclusterexamplecomy": {
      "Type": "AWS::Route53::RecordSetGroup",
      "Properties": {
        "HostedZoneName": "example.com.",
        "RecordSets": [
          {
            "Name": "myparallelcluster.example.com.",
            "Type": "CNAME",
            "SetIdentifier": "y",
            "Weight": "50",
            "ResourceRecords": [ ],
            "AliasTarget": {
              "HostedZoneId": "Z1HN6T29675QII",
              "DNSName": "az1cluster.example.com"
            }
          }
        ]
      }
    }
  },
  "Description": "Evenly Weighted DNS entry myparallelcluster.example.com to two identical Redshift Clusters"
}

In this example, the DNS record myparallelcluster.example.com is evenly weighted to az1.cluster.example.com and az2cluster.example.com, which resolve to our Amazon Redshift cluster endpoints.

In addition to this weighting, create a Route 53 Health Check on each one of these clusters. That way, if something happens to them they are removed from the DNS entry and all future requests are funneled to the healthy clusters. AWS provides simple instructions for creating health checks in Route 53.  The status check can be as simple as validating that the TCP port is responding. You could also set up a more sophisticated health check that validates off a website. This website could designate the cluster as unhealthy if it becomes out of sync for any reason, in which case Route53 would stop sending requests to it until it is healthy again.  If your clusters become out of sync for some reason, you must decide the best way to get them back in sync. If it’s just a failed query you might be able to re-run that query. If it’s a more complicated problem, you could re-import a table to sync them up.  Deciding whether to re-sync depends on the circumstances and the size of data.

If you are building a multi-region cluster, employ latency-based routing so that a user is routed to the best cluster possible. Some BI tools can point at just one database and don’t move over automatically the way this article describes; check the requirements of the tool you’re using.

In all cases, you want to make sure that only analysis queries are done through this method. Any modification to the database must go through Amazon Kinesis to ensure that it is applied to all parallel clusters.

Conclusion

By making Amazon Kinesis the ingestion point for all modification queries and using Amazon Route 53 to deliver users the closest healthy database, we have implemented an architecture that keeps multiple Redshift clusters synchronized.  This gives you the power to create a cost-effective, highly resilient, durable, and geographically disperse petabyte-scale data warehouse.

If you have questions or suggestions, please comment below.

Do more with Amazon Redshift!

Best Practices for Micro-batch Loading on Amazon Redshift

 

 

Using Amazon Redshift to Analyze your ELB Traffic Logs

 

 

Using Attunity Cloudbeam at UMUC to Replicate Data to Amazon RDS and Amazon Redshift