AWS Big Data Blog

JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink

by Tony Gibbs

Tony Gibbs is a Solutions Architect with AWS

(Update: This blog post has been translated into Japanese)

When it comes to choosing a SQL-based database in AWS, there are many options. Sometimes it can be difficult to know which one to choose. For example, when would you use Amazon Aurora instead of Amazon RDS PostgreSQL or Amazon Redshift? To answer this question, you must first understand the nature of the data workload and then evaluate other factors such as the quantity of data and query access patterns.

The design and capabilities of the different AWS services mean that each service has different strengths and excels at different workloads. Picking the right tool for the job therefore often involves tradeoffs. But sometimes you don’t want to compromise.

This post explains how to use two services together—Amazon Redshift and Amazon RDS PostgreSQL—to avoid tradeoffs when choosing between a columnar data store and a row-based data store.

Amazon Redshift

Amazon Redshift is a high-performance, petabyte-scale data warehouse service that excels at online analytical processing (OLAP) workloads. Databases such as RDS PostgreSQL or Amazon Aurora typically store terabytes of data, and they excel at online transaction processing (OLTP) workloads.

Amazon Redshift uses a columnar architecture, which means the data is organized by columns on disk instead of row-by-row as in the OLTP approach. Columnar architecture offers advantages when querying a subset of the columns in a table by greatly reducing I/O. And because the data is stored by column, it can be highly compressed, which further reduces I/O and allows more data to be stored and quickly queried.

RDS PostgreSQL uses a row-based architecture, which offers advantages when the workload is selecting, inserting, updating or deleting a small number of rows at a time, which is typical for OLTP.

Amazon Redshift also uses a massively parallel processing (MPP), shared-nothing architecture. This means that Amazon Redshift is designed to use all of the computing resources across many machines (called nodes) even when executing a single query. This provides excellent performance for analytical queries across a large number of rows. In contrast, most OLTP databases only use a subset of resources on one machine to process each query. This difference in architecture means that most OLTP databases can handle more concurrent queries because each query is typically less resource-intensive than those in Amazon Redshift.

Linking the high-performance power of Amazon Redshift with the feature-richness of RDS PostgreSQL is an attractive proposition because the two systems complement each other so well. How is it possible to link these two systems? An RDS PostgreSQL database is not an MPP database, but it does have features that enable multiple instances to be linked to one another.

Interestingly, Amazon Redshift was originally forked from PostgreSQL, which is why PostgreSQL drivers and the libpq API work with Amazon Redshift. The combination of PostgreSQL’s linking features and Amazon Redshift’s compatibility lets the two systems be connected. This connection enables PostgreSQL to issue queries and Amazon Redshift to return the results to PostgreSQL for processing.

Combining Amazon Redshift and RDS PostgreSQL provides the following benefits:

  • Materialized views for cached copies of data that work well for high-concurrency dashboards
  • High-concurrency, partitioned aggregate tables with block range indexes (BRIN)
  • Procedural Language/PostgreSQL (PL/pgSQL) user-defined functions that can query Amazon Redshift by using dynamic SQL
  • Post-Amazon Redshift transformation, such as returning result sets as JSON

The diagram above shows how the connections flow between the end users and the databases. You can also connect directly to Amazon Redshift if needed. In that case, consider configuring pgbouncer-rr on an Amazon EC2 instance to simplify management of the two connections. The diagram below illustrates this solution:

For further reading, check out Bob Strahan’s blog post Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL.

RDS PostgreSQL includes two extensions for executing queries remotely. The first extension is the PostgreSQL foreign-data wrapper, postgres_fdw. The postgres_fdw module enables the creation of external tables, which can be queried in the same way as local native tables. However, the query is not currently executed entirely on the remote side, because postgres_fdw doesn’t push down aggregate functions or LIMIT clauses. When you perform an aggregate query through an external table, all the data is pulled into PostgreSQL for the aggregation step. This is unacceptably slow for any meaningful number of rows.
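To make the limitation concrete, here is a minimal sketch of the postgres_fdw approach, assuming the foreign_server definition created in the setup steps later in this post and the sample sales table; the foreign table name and column subset are illustrative:

```sql
-- Sketch only: assumes foreign_server already exists (see setup below).
CREATE FOREIGN TABLE sales_remote (
    sellerid  int,
    pricepaid decimal,
    saletime  timestamp
) SERVER foreign_server
  OPTIONS (schema_name 'public', table_name 'sales');

-- This query looks local, but because postgres_fdw does not push down
-- aggregates, every matching row is pulled into PostgreSQL before the
-- sum() runs:
SELECT sellerid, sum(pricepaid)
FROM sales_remote
GROUP BY sellerid;
```

This is why the second extension, described next, is the better fit for aggregate-heavy workloads.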

The second extension is dblink, which includes a function also called dblink. The dblink function allows the entire query to be pushed to Amazon Redshift. This lets Amazon Redshift do what it does best—query large quantities of data efficiently and return the results to PostgreSQL for further processing.

Installation and setup

To set up this solution:

  1. Launch an Amazon Redshift cluster.
  2. Launch an RDS PostgreSQL (9.5+) instance in the same Availability Zone as the cluster in Step 1.
  3. Configure the VPC security group for the Amazon Redshift cluster to allow an incoming connection from the RDS PostgreSQL endpoint.
  4. Optional: load the Amazon Redshift sample data to run the queries included in this post.
  5. Connect to the RDS PostgreSQL instance, and then run the following SQL code, replacing the <placeholders> with the values from your own instances:
CREATE EXTENSION postgres_fdw;
CREATE EXTENSION dblink;
CREATE SERVER foreign_server
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host '<amazon_redshift_ip>', port '<port>', dbname '<database_name>', sslmode 'require');
CREATE USER MAPPING FOR <rds_postgresql_username>
        SERVER foreign_server
        OPTIONS (user '<amazon_redshift_username>', password '<password>');

For more information, see dblink in the PostgreSQL documentation.

Basic querying

The dblink function requires you to pass in the SQL statement as a string and define the result set, as shown in the following query:

SELECT *
FROM dblink('foreign_server',$REDSHIFT$
    SELECT sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= '2008-01-01' 
    AND saletime < '2008-02-01' 
    GROUP BY sellerid 
    ORDER BY sales DESC
$REDSHIFT$) AS t1 (sellerid int, sales decimal);

In this example:

  • The dblink function accepts the server connection (‘foreign_server’) that was created in the previous step.
  • The SQL query is passed in as a dollar-quoted string ($REDSHIFT$). Dollar quoting makes the query easier to read and write.
  • The dollar quotes are tagged REDSHIFT to highlight the SQL that will be sent to Amazon Redshift.
  • The query results are a record set for which you must specify a name and column datatypes (AS t1 (sellerid int, sales decimal)). This enables further joining and processing.

The partial result set from this query is:

After the results from the query are returned, PostgreSQL can do further processing such as transforming the results to JSON, as in the following example:

SELECT array_to_json(array_agg(t1)) FROM (
    SELECT *
    FROM
        dblink('foreign_server',$REDSHIFT$
             SELECT sellerid, sum(pricepaid) sales
             FROM sales 
             WHERE saletime >= '2008-01-01' 
             AND saletime < '2008-02-01' 
             GROUP BY sellerid 
             ORDER BY sales DESC    
$REDSHIFT$) AS t1 (sellerid int, sales decimal)
) t1;

Querying with views

To make the SQL less cumbersome to use, these queries can also be expressed as views. The following view wraps the first basic query:

CREATE OR REPLACE VIEW v_sales AS
SELECT *
FROM dblink ('foreign_server',$REDSHIFT$ 
    SELECT sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= '2008-01-01' 
    AND saletime < '2008-02-01' 
    GROUP BY sellerid 
    ORDER BY sales DESC
$REDSHIFT$) AS t1 (sellerid int, sales decimal);

Querying the view returns the same results as before:

SELECT * from v_sales;
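Because the view behaves like any other PostgreSQL relation, the Redshift results can be joined to local data for further processing. The following is a hypothetical example; the local seller_details table and its columns are assumed for illustration and are not part of the sample data:

```sql
-- Hypothetical: enrich the aggregated Redshift results with a local
-- PostgreSQL lookup table (seller_details is an assumed table).
SELECT s.sellerid, d.sellername, s.sales
FROM v_sales s
JOIN seller_details d ON d.sellerid = s.sellerid
ORDER BY s.sales DESC;
```

Note that each query against the view still issues the underlying dblink query to Amazon Redshift.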

Querying with user-defined functions

Another approach is to query the data using user-defined functions (UDFs). In contrast to views, UDFs let you specify parameters at run time. In the previous example, the date range was hard-coded in the view. With a UDF, you can query an arbitrary datetime range. The following PL/pgSQL code creates the UDF in PostgreSQL:

CREATE OR REPLACE FUNCTION get_sales(_startdate timestamp, _enddate timestamp)
RETURNS TABLE (sellerid int, sales decimal) AS
$BODY$
DECLARE
remote_sql TEXT;
BEGIN
remote_sql = FORMAT( '
    SELECT sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= %L AND saletime < %L 
    GROUP BY sellerid 
    ORDER BY sales DESC
', _startdate, _enddate);
RETURN QUERY 
SELECT *
FROM dblink('foreign_server', remote_sql) 
AS t1  (sellerid int, sales decimal);
END;
$BODY$
LANGUAGE plpgsql VOLATILE;

In PostgreSQL, the function can be executed in a query:

SELECT * FROM get_sales ('2008-01-01', '2008-02-01');

The query returns the expected result set.

Caching data with materialized views

In the case of frequently accessed data, it may be better to use a materialized view. Materialized views cache the query results so that subsequent reads skip the re-computation. This makes them ideal for caching a small amount of frequently requested data, as in dashboards.

The following materialized view counts the number of users who like sports, grouped by state. The DDL to create this materialized view is as follows:

CREATE MATERIALIZED VIEW v_users_likes_by_state AS
SELECT *
FROM dblink('foreign_server',$REDSHIFT$
        SELECT state, sum(likesports::int) sports_like_count
        FROM users 
        GROUP BY state
$REDSHIFT$) AS t1 (state text, sports_like_count int);

When the materialized view is created, the query is issued against Amazon Redshift. When a query is issued against the materialized view, no query is sent to Amazon Redshift; instead, the results are returned directly from PostgreSQL. Querying a materialized view is the same as querying a regular view, as shown in the following example:

SELECT * FROM v_users_likes_by_state;

Materialized views hold a cache of data that can become stale. The following SQL statement refreshes the data and re-issues the original query that created the materialized view:

REFRESH MATERIALIZED VIEW v_users_likes_by_state;

For more information, see Materialized Views in the PostgreSQL documentation.

To refresh the materialized view at regular intervals, you can use AWS Lambda. The Node.js code to refresh the materialized view is as follows:

var pg = require("pg");

exports.handler = function(event, context) {
    var conn = "postgres://username:password@host:port/dbname";
    var client = new pg.Client(conn);
    client.connect(function(err) {
        if (err) {
            // Stop here so the query callback never runs on a dead connection.
            context.fail("Failed to connect: " + err);
            return;
        }
        client.query('REFRESH MATERIALIZED VIEW v_users_likes_by_state', function (err, result) {
            if (err) {
                context.fail("Failed to run query: " + err);
                return;
            }
            client.end();
            context.succeed("Successfully Refreshed.");
        });
    });
};

Lambda requires the pg module, which can be installed using the following command:

npm install pg

For more information about creating the Lambda deployment file, see Creating a Deployment Package (Node.js).

Copying data from Amazon Redshift to RDS PostgreSQL

When there is a larger quantity of data, it might be better to use the dblink function to copy the data into PostgreSQL tables instead of using materialized views. This is useful when large quantities of new data arrive and only the latest data needs to be copied, because refreshing a materialized view re-runs the original query and copies all of the data from the beginning.

The SQL to create the table:

CREATE TABLE sales_summary (
   saletime timestamp,
   sellerid int,
   sales decimal
);

PostgreSQL uses indexes to optimize reads, and the block range index (BRIN), introduced in PostgreSQL 9.5, is an appropriate index type for ordered timestamps. The SQL to create the index is:

CREATE INDEX idx_sales_summary_brin
   ON sales_summary
   USING BRIN (saletime);

The following query shows how to insert data into the tables using the dblink function:

INSERT INTO sales_summary
SELECT *
FROM dblink('foreign_server',$REDSHIFT$
    SELECT date_trunc('hours', saletime) AS ts, sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= '2008-01-01' 
    AND saletime < '2008-02-01' 
    GROUP BY ts, sellerid 
    ORDER BY sales 
$REDSHIFT$) AS t1 (saletime timestamp, sellerid int, sales decimal);
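The statement above loads a fixed date range. To copy only the latest data on subsequent runs, the remote query can be built dynamically, following the same FORMAT pattern used in the UDF section. The following is a sketch under the assumption that sales_summary holds hour-truncated timestamps; the most recent (possibly partial) hour is reloaded each time:

```sql
-- Sketch of an incremental load: fetch only rows at or after the latest
-- hour already copied into sales_summary.
DO $$
DECLARE
    remote_sql text;
    last_ts    timestamp;
BEGIN
    SELECT coalesce(max(saletime), '1900-01-01') INTO last_ts
    FROM sales_summary;

    remote_sql = FORMAT('
        SELECT date_trunc(''hours'', saletime) AS ts, sellerid, sum(pricepaid) sales
        FROM sales
        WHERE saletime >= %L
        GROUP BY ts, sellerid
    ', last_ts);

    -- Remove the most recent (possibly partial) hour, then reload from there.
    DELETE FROM sales_summary WHERE saletime >= last_ts;

    INSERT INTO sales_summary
    SELECT *
    FROM dblink('foreign_server', remote_sql)
    AS t1 (saletime timestamp, sellerid int, sales decimal);
END $$;
```

Deleting and reloading the boundary hour keeps the summary consistent even if new sales arrived within that hour after the previous run.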

Conclusion

You can use the dblink extension to connect to Amazon Redshift and leverage PostgreSQL functionality. This allows you to cache frequently queried small data sets, with two choices: materialized views with refreshes, or copying data into tables. For querying non-cached data, there are also two choices: regular views or UDFs that take parameters. There are likely many other uses for the dblink extension with Amazon Redshift, such as PostGIS or LDAP support in PostgreSQL (Amazon EC2 only), but they are beyond the scope of this post.

If you have questions or suggestions, please comment below.

————————————-

Related

Real-time in-memory OLTP and Analytics with Apache Ignite on AWS

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.