AWS Big Data Blog
Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation
Amazon Redshift Federated Query allows you to combine the data from one or more Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL databases with data already in Amazon Redshift. You can also combine such data with data in an Amazon S3 data lake.
This post shows you how to set up Aurora PostgreSQL and Amazon Redshift with a 10 GB TPC-H dataset, and Amazon Redshift Federated Query using AWS CloudFormation. For more information about using Federated Query, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. You can use the environment you set up in this post to experiment with various use cases in the preceding post.
Benefits of using CloudFormation templates
The standard workflow of setting up Amazon Redshift Federated Query involves six steps. For more information, see Querying Data with Federated Query in Amazon Redshift. With a CloudFormation template, you can condense these manual procedures into a few steps listed in a text file. The declarative code in the file captures the intended state of the resources to create and allows you to automate the creation of AWS resources to support Amazon Redshift Federated Query. You can further enhance this template to become the single source of truth for your infrastructure.
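To illustrate that declarative style, the following is a minimal, hypothetical template fragment (not the full template this post launches) that defines one of the resources the stack creates, an IAM role that Amazon Redshift can assume:

```yaml
# Illustrative fragment only; the actual template in this post defines
# many more resources (VPC, subnets, clusters, secrets, and so on).
Resources:
  RedshiftClusterRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: redshift.amazonaws.com
            Action: sts:AssumeRole
```

Because the template captures the intended state rather than a sequence of console clicks, deploying the same stack again in another Region or account produces the same resources.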
A CloudFormation template acts as an accelerator. It helps you automate the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.
Architecture overview
The following diagram illustrates the solution architecture.
The CloudFormation templates provision the following components in the architecture:
- VPC
- Subnets
- Route tables
- Internet gateway
- Amazon Linux Bastion host
- Secrets
- Aurora PostgreSQL cluster with TPC-H dataset preloaded
- Amazon Redshift cluster with TPC-H dataset preloaded
- Amazon Redshift IAM role with required permissions
Prerequisites
Before you create your resources in AWS CloudFormation, you must complete the following prerequisites:
- Have an IAM user with sufficient permissions to interact with the AWS Management Console and related AWS services. Your IAM permissions must also include access to create IAM roles and policies via the CloudFormation template.
- Create an Amazon EC2 key pair in the us-east-1 Region. Make sure that you save the private key; this is the only time you can do so. You use this key pair as an input parameter when you set up the CloudFormation stack.
Setting up the resources with AWS CloudFormation
This post provides a CloudFormation template as a general guide. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.
To create these resources, complete the following steps:
- Sign in to the console.
- Choose the us-east-1 Region in which to create the stack.
- Choose Launch Stack. This automatically launches AWS CloudFormation in your AWS account with the template and prompts you to sign in as needed. You can view the CloudFormation template from within the console.
- Choose Next.
- For Stack name, enter a stack name.
- For Session, leave as the default.
- For ec2KeyPair, choose the key pair you created earlier.
- Choose Next.
- On the next screen, choose Next.
- Review the details on the final screen and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create. Stack creation can take up to 45 minutes.
- After the stack creation is complete, in the Outputs section, record the values of the following keys, which you use in a later step:
AuroraClusterEndpoint
AuroraSecretArn
RedshiftClusterEndpoint
RedshiftClusterRoleArn
You are now ready to log in to both the Aurora PostgreSQL and Amazon Redshift clusters and run some basic commands to test them.
Logging in to the clusters using the Amazon Linux Bastion host
The following steps assume that you use a computer with an SSH client to connect to the Bastion host. For more information about connecting using various clients, see Connect to Your Linux Instance.
- Move the private key of the EC2 key pair (that you saved previously) to a location on the computer from which you connect to the Amazon Linux Bastion host.
- Change the permission of the private key using the following code, so that it’s not publicly viewable:
chmod 400 <private key file name; for example, bastion-key.pem>
- On the Amazon EC2 console, choose Instances.
- Choose the Amazon Linux Bastion host that the CloudFormation stack created.
- Choose Connect.
- Copy the value for SSHCommand.
- On the SSH client, change to the directory where you saved the EC2 private key, and paste the SSHCommand value.
- On the console, open the Secrets Manager dashboard.
- Choose the secret secretAuroraMasterUser-*.
- Choose Retrieve secret value.
- Record the password under Secret key/value, which you use to log in to the Aurora PostgreSQL cluster.
- Choose the secret SecretRedshiftMasterUser.
- Choose Retrieve secret value.
- Record the password under Secret key/value, which you use to log in to the Amazon Redshift cluster.
- Log in to both the Aurora PostgreSQL and Amazon Redshift databases using the psql client. The CloudFormation template has already set up the psql client binaries on the Amazon Linux Bastion host.
- Enter the following code in the command prompt of the Bastion host (substitute <RedshiftClusterEndpoint> with the value from the AWS CloudFormation output):
psql -h <RedshiftClusterEndpoint> -d dev -p 5439 -U fqdemo
- When prompted, enter the database user password you recorded earlier.
- Enter the \dt meta-command at the psql prompt to list the tables.
You should see the eight TPC-H tables in the output: customer, lineitem, nation, orders, part, partsupp, region, and supplier.
- Launch another command prompt session of the Bastion host and enter the following code (substitute <AuroraClusterEndpoint> with the value from the AWS CloudFormation output):
psql -h <AuroraClusterEndpoint> -d dev -p 5432 -U awsuser
- When prompted, enter the database user password you recorded earlier.
- Enter the \dt meta-command at the psql prompt to list the tables.
You should see the same eight TPC-H tables in the output: customer, lineitem, nation, orders, part, partsupp, region, and supplier.
Completing Federated Query setup
The final step is to create an external schema to connect to the Aurora PostgreSQL instance. The following example code creates an external schema statement that you need to run on your Amazon Redshift cluster to complete this step:
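The statement uses the standard CREATE EXTERNAL SCHEMA … FROM POSTGRES syntax. The following is a sketch that assumes an external schema named apg and the default dev database and public schema used elsewhere in this post; substitute the placeholders with your stack output values:

```sql
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg
FROM POSTGRES
DATABASE 'dev' SCHEMA 'public'
URI '<AuroraClusterEndpoint>'
PORT 5432
IAM_ROLE '<RedshiftClusterRoleArn>'
SECRET_ARN '<AuroraSecretArn>';
```

Run this on the Amazon Redshift cluster; afterward, tables in the Aurora PostgreSQL public schema are queryable as apg.<tablename>.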
Use the following parameters:
- URI – The AuroraClusterEndpoint value from the CloudFormation stack outputs. The value is in the format <stackname>-cluster.<randomcharacter>.us-east-1.rds.amazonaws.com
- IAMRole – The RedshiftClusterRoleArn value from the CloudFormation stack outputs. The value is in the format arn:aws:iam::<accountnumber>:role/<stackname>-RedshiftClusterRole-<randomcharacter>
- SecretARN – The AuroraSecretArn value from the CloudFormation stack outputs. The value is in the format arn:aws:secretsmanager:us-east-1:<accountnumber>:secret:secretAuroraMasterUser-<randomcharacter>
Testing Federated Query
Now that you have set up Federated Query, you can start testing the feature using the TPC-H dataset that was preloaded into both Aurora PostgreSQL and Amazon Redshift.
The following query shows the parts and supplier relationship. The PARTSUPP and PART tables are stored in Amazon Redshift, and the SUPPLIER table in the subquery is from Aurora PostgreSQL:
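A sketch of such a query, assuming the external schema is named apg as in the setup step above (column names follow the standard TPC-H schema):

```sql
-- PARTSUPP and PART live in Amazon Redshift; the subquery reads
-- SUPPLIER from Aurora PostgreSQL through the external schema.
SELECT p.p_partkey, p.p_name, s.s_name, ps.ps_availqty
FROM part p
JOIN partsupp ps ON p.p_partkey = ps.ps_partkey
JOIN (SELECT s_suppkey, s_name FROM apg.supplier) s
  ON ps.ps_suppkey = s.s_suppkey
LIMIT 10;
```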
The following query shows the order priority by combining ORDERS table data from Amazon Redshift and Aurora PostgreSQL. It demonstrates the use case of federating a live-data query from an OLTP source with historical data in a data warehouse:
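One way to sketch this, again assuming the external schema is named apg, is to union the two ORDERS tables and aggregate the combined rows:

```sql
-- Live ORDERS rows from Aurora PostgreSQL are unioned with the
-- historical ORDERS rows already in Amazon Redshift, then aggregated.
SELECT o_orderpriority, COUNT(*) AS order_count
FROM (
  SELECT o_orderpriority FROM orders      -- historical, in Amazon Redshift
  UNION ALL
  SELECT o_orderpriority FROM apg.orders  -- live, from Aurora PostgreSQL
) combined
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
```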
You can continue to experiment with the dataset and explore the three main use cases from the post, Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query.
Deleting the CloudFormation stack
When you are finished, delete the CloudFormation stack; some of the AWS resources in this walkthrough incur a cost if you continue to use them. Complete the following steps:
- On the AWS CloudFormation console, choose Stacks.
- Choose the stack you launched in this walkthrough. The stack must be currently running.
- In the stack details pane, choose Delete.
- Choose Delete stack.
Summary
This post showed you how to use AWS CloudFormation to automate the creation of Aurora PostgreSQL and Amazon Redshift clusters preloaded with the TPC-H dataset, along with the prerequisites for the Amazon Redshift Federated Query feature, leaving only a single manual step to complete the setup. The post also provided example federated queries using the TPC-H dataset, which you can use to accelerate your learning and adoption of the feature. You can continue to modify the CloudFormation templates from this post to support your business needs.
View a demo of Federated Query to learn more about the feature.
If you have any questions or suggestions, please leave a comment.
About the Authors
BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.
Srikanth Sopirala is a Sr. Specialist Solutions Architect focused on Analytics at AWS. He is passionate about helping customers build scalable data and analytics solutions in the cloud.