AWS Big Data Blog

Cross-account streaming ingestion for Amazon Redshift

As AWS’s fast, petabyte-scale cloud data warehouse delivering the best price-performance, Amazon Redshift makes it simple and cost-effective to analyze all your data using standard SQL, your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools quickly and securely. Tens of thousands of customers use Amazon Redshift to analyze exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics without having to manage the data warehouse infrastructure. You can also gain up to three times better price performance with Amazon Redshift than other cloud data warehouses.

We are continuously innovating and releasing new features of Amazon Redshift for our customers, enabling the implementation of a wide range of data use cases and meeting requirements with performance and scale. One of the features recently announced is Amazon Redshift Streaming Ingestion for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), which lets you experience performance at scale by ingesting real-time streaming data. Amazon Redshift with Kinesis Data Streams is fully managed and runs your streaming applications without requiring infrastructure management. You can use SQL to connect to and directly ingest data from multiple Kinesis data streams simultaneously with low latency and high bandwidth, allowing you to derive insights in seconds instead of minutes.

Previously, loading data from a streaming service like Kinesis Data Streams into Amazon Redshift included several steps. These included connecting the stream to an Amazon Kinesis Data Firehose and waiting for Kinesis Data Firehose to stage the data in Amazon Simple Storage Service (Amazon S3), using various-sized batches at varying-length buffer intervals. After this, Kinesis Data Firehose triggered a COPY command to load the data from Amazon S3 to a table in Amazon Redshift.

Rather than including preliminary staging in Amazon S3, streaming ingestion provides low-latency, high-speed ingestion of stream data from Kinesis Data Streams into an Amazon Redshift materialized view.

In this post, we walk through cross-account Amazon Redshift streaming ingestion by creating a Kinesis data stream in one account, and generating and loading streaming data into Amazon Redshift in a second account within the same Region using role chaining.

Solution overview

The following diagram illustrates our solution architecture.

We demonstrate the following steps to perform cross-account streaming ingestion for Amazon Redshift:

  1. Create a Kinesis data stream in Account-1.
  2. Create an AWS Identity and Access Management (IAM) role in Account-1 to read the data stream using AWS best practices around applying least privileges permissions.
  3. Create an Amazon Redshift – Customizable IAM service role in Account-2 to assume the IAM role.
  4. Create an Amazon Redshift cluster in Account-2 and attach the IAM role.
  5. Modify the trust relationship of the Kinesis Data Streams IAM role in order to access the Amazon Redshift IAM role on its behalf.
  6. Create an external schema using IAM role chaining.
  7. Create a materialized view for high-speed ingestion of stream data.
  8. Refresh the materialized view and start querying.

Account-1 setup

Complete the following steps in Account-1:

  1. Create a Kinesis data stream called my-data-stream. For instructions, refer to Step 1 in Set up streaming ETL pipelines.
  2. Send records to this data stream from an open-source API that continuously generates random user data. For instructions, refer to Steps 2 and 3 in Set up streaming ETL pipelines.
  3. To verify if the data is entering the stream, navigate to the Amazon Kinesis -> Data streams -> my-data-stream -> Monitoring tab.
  4. Find the PutRecord success – average (Percent) and PutRecord – sum (Bytes) metrics to validate record ingestion.

    Next, we create an IAM policy called KinesisStreamPolicy in Account-1.
  5. On the IAM console, choose Policies in the navigation pane.
  6. Choose Create policy.
  7. Create a policy called KinesisStreamPolicy and add the following JSON to your policy (provide the AWS account ID for Account-1):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadStream",
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:DescribeStream"
                ],
                "Resource": "arn:aws:kinesis:*:<Account-1>:stream/*"
            },
            {
                "Sid": "ListStream",
                "Effect": "Allow",
                "Action": [
                    "kinesis:ListStreams",
                    "kinesis:ListShards"
                ],
                "Resource": "*"
            }
        ]
    }
  8. In the navigation pane, choose Roles.
  9. Choose Create role.
  10. Select AWS service and choose Kinesis.
  11. Create a new role called KinesisStreamRole.
  12. Attach the policy KinesisStreamPolicy.

Account-2 setup

Complete the following steps in Account-2:

  1. Sign in to the Amazon Redshift console in Account-2.
  2. Create an Amazon Redshift cluster.
  3. On the IAM console, choose Policies in the navigation pane.
  4. Choose Create policy.
  5. Create a policy RedshiftStreamPolicy and add the following JSON (provide the AWS account ID for Account-1):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "StmtStreamRole",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": "arn:aws:iam::<Account-1>:role/KinesisStreamRole"
            }
        ]
    }
  6. In the navigation pane, choose Roles.
  7. Choose Create role.
  8. Select AWS service and choose Redshift and Redshift customizable.
  9. Create a role called RedshiftStreamRole.
  10. Attach the policy RedshiftStreamPolicy to the role.

Set up trust relationship

To set up the trust relationship, complete the following steps:

  1. Sign in to the IAM console as Account-1.
  2. In the navigation pane, choose Roles.
  3. Edit the IAM role KinesisStreamRole and modify the trust relationship (provide the AWS account ID for Account-2):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<Account-2>:role/RedshiftStreamRole"
                },
                "Action": "sts:AssumeRole"
            }        
        ]
    }

Set up streaming ingestion

To set up streaming ingestion, complete the following steps:

  1. Sign in to the Amazon Redshift console as Account-2.
  2. Launch the Query Editor v2 or your preferred SQL client and run the following statements to access the data stream my-data-stream in Account-1.
  3. Create an external schema using role chaining (replace the IAM role ARNs, separated by a comma without any spaces around it):
    CREATE EXTERNAL SCHEMA schema_stream
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::<Account-2>:role/RedshiftStreamRole,arn:aws:iam::<Account-1>:role/KinesisStreamRole';
  4. Create a materialized view to consume the stream data and store stream records in semi-structured SUPER format:
    CREATE MATERIALIZED VIEW my_stream_vw AS
        SELECT approximatearrivaltimestamp,
        partitionkey,
        shardid,
        sequencenumber,
        json_parse(data) as payload    
        FROM schema_stream."my-data-stream";
  5. Refresh the view, which triggers Amazon Redshift to read from the stream and load data into the materialized view:
    REFRESH MATERIALIZED VIEW my_stream_vw;
  6. Query data in the materialized view using the dot notation:
    SELECT payload.name.first, payload.name.last, payload.name.title,
    payload.dob.date as dob, payload.cell, payload.location.city, payload.email
    FROM my_stream_vw;

You can now view the results, as shown in the following screenshot.

Conclusion

In this post, we discussed how to set up two different AWS accounts to enable cross-account Amazon Redshift streaming ingestion. It’s simple to get started and you can perform rich analytics on streaming data, right within Amazon Redshift using existing familiar SQL.

For information about how to set up Amazon Redshift streaming ingestion using Kinesis Data Streams in a single account, refer to Real-time analytics with Amazon Redshift streaming ingestion. To establish Redshift streaming ingestion from an Amazon MSK cluster which belongs to a different account, complete the configuration mentioned in Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK.


About the authors

Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.

Raks KhareRaks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.