AWS Big Data Blog

Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation

Amazon Redshift Federated Query allows you to combine the data from one or more Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon S3 data lake.

This post shows you how to set up Aurora PostgreSQL and Amazon Redshift with a 10 GB TPC-H dataset, and Amazon Redshift Federated Query using AWS CloudFormation. For more information about using Federated Query, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. You can use the environment you set up in this post to experiment with various use cases in the preceding post.

Benefits of using CloudFormation templates

The standard workflow of setting up Amazon Redshift Federated Query involves six steps. For more information, see Querying Data with Federated Query in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create and allows you to automate the creation of AWS resources to support Amazon Redshift Federated Query. You can further enhance this template to become the single source of truth for your infrastructure.

A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

Architecture overview

The following diagram illustrates the solution architecture.

The CloudFormation templates provision the following components in the architecture:

  • VPC
  • Subnets
  • Route tables
  • Internet gateway
  • Amazon Linux Bastion host
  • Secrets
  • Aurora PostgreSQL cluster with TPC-H dataset preloaded
  • Amazon Redshift cluster with TPC-H dataset preloaded
  • Amazon Redshift IAM role with required permissions

Prerequisites

Before you create your resources in AWS CloudFormation, you must complete the following prerequisites:

  • Have an IAM user with sufficient permissions to interact with the AWS Management Console and related AWS services. Your IAM permissions must also include access to create IAM roles and policies via the CloudFormation template.
  • Create an Amazon EC2 key pair in the us-east-1 Region. Make sure that you save the private key; this is the only time you can do so. You use this key pair as an input parameter when you set up the CloudFormation stack.

Setting up the resources with AWS CloudFormation

This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.

To create these resources, complete the following steps:

  1. Sign in to the console.
  2. Choose the us-east-1 Region in which to create the stack.
  3. Choose Launch Stack:
  4. Choose Next.This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template from within the console.
  5. For Stack name, enter a stack name.
  6. For Session, leave as the default.
  7. For ec2KeyPair, choose the key pair you created earlier.
  8. Choose Next.
  9. On the next screen, choose Next.
  10. Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create.Stack creation can take up to 45 minutes.
  12. After the stack creation is complete, in the Outputs section, record the value of the key for the following components, which you use in a later step:
  • AuroraClusterEndpoint
  • AuroraSecretArn
  • RedshiftClusterEndpoint
  • RedshiftClusterRoleArn

You are now ready to log in to both the Aurora PostgreSQL and Amazon Redshift cluster and run some basic commands to test them.

Logging in to the clusters using the Amazon Linux Bastion host

The following steps assume that you use a computer with an SSH client to connect to the Bastion host. For more information about connecting using various clients, see Connect to Your Linux Instance.

  1. Move the private key of the EC2 key pair (that you saved previously) to a location on your SSH client, where you are connecting to the Amazon Linux Bastion host.
  2. Change the permission of the private key using the following code, so that it’s not publicly viewable:chmod 400 <private key file name; for example, bastion-key.pem>
  3. On the Amazon EC2 console, choose Instances.
  4. Choose the Amazon Linux Bastion host that the CloudFormation stack created.
  5. Choose Connect.
  6. Copy the value for SSHCommand.
  7. On the SSH client, change the directory to the location where you saved the EC2 private key, and paste the SSHCommand value.
  8. On the console, open the Secrets Manager dashboard.
  9. Choose the secret secretAuroraMasterUser-*.
  10. Choose Retrieve secret value.
  11. Record the password under Secret key/value, which you use to log in to the Aurora PostgreSQL cluster.
  12. Choose the secret SecretRedshiftMasterUser.
  13. Choose Retrieve secret value.
  14. Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
  15. Log in to both the Aurora PostgreSQL and Amazon Redshift database using PSQL Client.The CloudFormation template has already set up PSQL Client binaries on the Amazon Linux Bastion host.
  16. Enter the following code in the command prompt of the Bastion host (substitute <RedshiftClusterEndpoint> with the value from the AWS CloudFormation output):psql -h <RedshiftClusterEndpoint> -d dev -p 5439 -U fqdemo
  17. When prompted, enter the database user password you recorded earlier.
  18. Enter the following SQL command:
    select "table" from svv_table_info where schema='public';

    You should see the following eight tables as the output:

    dev=# select "table" from svv_table_info where schema='public';
     table   
    ----------
     orders
     customer
     region
     nation
     supplier
     part
     lineitem
     partsupp
    (8 rows)
  19. Launch another command prompt session of the Bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):psql -h <AuroraClusterEndpoint> -d dev -p 5432 -U awsuser
  20. When prompted, enter the database user password you recorded earlier.
  21. Enter the following SQL command:
    select tablename from pg_tables where schemaname='public';

    You should see the following eight tables as the output:

    dev=# select tablename from pg_tables where schemaname='public';
     tablename 
    -----------
     region
     nation
     lineitem
     orders
     part
     supplier
     partsupp
     customer
    (8 rows)

Completing Federated Query setup

The final step is to create an external schema to connect to the Aurora PostgreSQL instance. The following example code creates an external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS pg 
FROM POSTGRES 
DATABASE 'dev' 
SCHEMA 'public' 
URI '<AuroraClusterEndpoint>' 
PORT 5432 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

  • URI – AuroraClusterEndpoint value from the CloudFormation stack outputs. Value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com
  • IAMRoleRedshiftClusterRoleArn value from the CloudFormation stack outputs. Value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>
  • SecretARNAuroraSecretArn value from the CloudFormation stack outputs. Value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>: secret:secretAuroraMasterUser-<randomcharacter>

Testing Federated Query

Now that you have set up Federated Query, you can start testing the feature using the TPC-H dataset that was preloaded into both Aurora PostgreSQL and Amazon Redshift.

The following query shows the parts and supplier relationship. Tables PARTSUPP and PART are stored in Amazon Redshift, and the SUPPLIER table in the subquery is from Aurora PostgreSQL:

SELECT TOP 10 P_BRAND,
       P_TYPE,
       P_SIZE,
       COUNT(DISTINCT PS_SUPPKEY) AS SUPPLIER_CNT
FROM PARTSUPP,
     PART
WHERE P_PARTKEY = PS_PARTKEY
AND   P_BRAND <> 'Brand#23'
AND   P_TYPE NOT LIKE 'MEDIUM ANODIZED%'
AND   P_SIZE IN (1,32,33,46,7,42,21,40)
AND   PS_SUPPKEY NOT IN (SELECT S_SUPPKEY
                         FROM pg.SUPPLIER
                         WHERE S_COMMENT LIKE '%Customer%Complaints%')
GROUP	BY P_BRAND,
         P_TYPE,
         P_SIZE
ORDER	BY SUPPLIER_CNT DESC,
         P_BRAND,
         P_TYPE,
         P_SIZE;

The following query shows the order priority by combining ORDERS table data from Amazon Redshift and Aurora PostgreSQL. This demonstrates the use case of live data query from an OLTP source federated with historical data on a data warehouse:

SELECT O_ORDERPRIORITY,
       COUNT(*) AS ORDER_COUNT
FROM (SELECT O_ORDERPRIORITY
      FROM ORDERS o
      WHERE O_ORDERDATE < '1997-07-01'       AND O_ORDERDATE >= CAST(DATE '1997-07-01' - INTERVAL '3 months' AS DATE)
      UNION ALL
      SELECT O_ORDERPRIORITY
      FROM pg.ORDERS o
      WHERE O_ORDERDATE >= '1997-07-01'
      AND   O_ORDERDATE < CAST(DATE '1997-07-01' +INTERVAL '1 day' AS DATE))
GROUP	BY O_ORDERPRIORITY
ORDER	BY O_ORDERPRIORITY;

You can continue to experiment with the dataset and explore the three main use cases from the post, Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query.

Deleting the CloudFormation stack

When you are finished, delete the CloudFormation stack; some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose the stack you launched in this walkthrough. The stack must be currently running.
  3. In the stack details pane, choose Delete.
  4. Choose Delete stack.

Summary

This post showed you how to automate the creation of an Aurora PostgreSQL and Amazon Redshift cluster preloaded with the TPC-H dataset, the prerequisites of the new Amazon Redshift Federated Query feature using AWS CloudFormation, and a single manual step to complete the setup. The post also provided some example federated queries using the TPC-H dataset, which you can use to accelerate your learning and adoption of the new features. You can continue to modify the CloudFormation templates from this post to support your business needs.

View a demo of Federated Query to learn more about the feature:

If you have any questions or suggestions, please leave a comment.

 


About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.

 

 

 

Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.