Federated query support for Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL – Part 1

PostgreSQL is one of the most widely used database engines and is supported by a very large and active community. It’s a viable open-source option to use compared to many commercial databases, which require heavy license costs. Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition are AWS managed offerings that take away the heavy lifting required for setting up the platform, configuring high availability, monitoring, and much more. This allows DBAs to spend more time on business-critical problems like doing schema design early on or query tuning.

In 2003, a new specification called SQL/MED (“SQL Management of External Data”) was added to the SQL standard. It’s a standardized way of handling access to remote objects from SQL databases. In 2011, PostgreSQL 9.1 was released with read-only support of this standard, and in 2013, write support was added with PostgreSQL 9.3. The implementation of this concept in PostgreSQL was called foreign data wrappers (FDW). Several FDWs are available that help connect PostgreSQL Server to different remote data stores, ranging from other SQL database engines to flat files. However, most FDWs are independent open-source projects implemented as Postgres extensions, and not officially supported by the PostgreSQL Global Development Group.

In this post, we discuss one FDW implementation that comes with PostgreSQL source as a contrib extension module called postgres_fdw. postgres_fdw allows you to implement federated query capability to interact with any remote PostgreSQL-based database, both managed and self-managed on Amazon Elastic Compute Cloud (Amazon EC2) or on premises. This is available in all present versions supported for Amazon RDS for PostgreSQL and Aurora PostgreSQL. The following diagram illustrates this architecture.

Use cases

In this post, we primarily focus on two use cases to give an overview on the capability. However, you can easily extended the solution for other federated query use cases.

When working with independent software vendors (ISVs), we occasionally see them offering Amazon RDS for PostgreSQL and Aurora PostgreSQL in a multi-tenant setup in which they use one database per customer and a shared database within a single instance. Federated query capability implemented via FDW allows you to pull data from a shared database to other databases as needed.

An organization could have multiple systems catering to different departments. For example, a payroll database has employee salary information. This data maybe required by the HR and tax systems to calculate hike or decide tax incurred, respectively. One solution for such a problem is to copy the salary information in both the HR and Tax systems. However, this kind of duplication may lead to problems, like ensuring data accuracy, extra storage space incurred, or double writes. FDWs avoid duplication while providing access to required data that resides in a different foreign database. The following diagram illustrates this architecture.

Furthermore, in today’s world of purpose-built databases you host hot or active data in Amazon RDS for PostgreSQL and Aurora PostgreSQL, you have separate data warehouse solutions like Amazon Redshift for data archival. Without federated query support from an active database, you have to stream the active data to the data warehouse on an almost near-real-time basis to run analytical queries, which requires extra efforts and costs to set up a data pipeline and an additional overhead to the data warehouse. With federated query capability, you can derive insights while joining data at the transactional database from within itself as well as a data warehouse like Amazon Redshift. The following diagram illustrates this architecture.

For more information about querying data in your Aurora PostgreSQL or Amazon RDS for PostgreSQL remote server from Amazon Redshift as the primary database, see Querying data with federated queries in Amazon Redshift.

Prerequisites

Before getting started, make sure you have the following prerequisites:

A primary Aurora PostgreSQL or Amazon RDS for PostgreSQL instance as your source machine.
A remote PostgreSQL-based instance with information like username, password, and database name. You can use any of the following databases:
- Aurora PostgreSQL
- Amazon RDS for PostgreSQL
- Amazon Redshift
- Self-managed PostgreSQL on Amazon EC2
- On-premises PostgreSQL database
Network connectivity between the primary and remote database. The remote database can be a different database within the same primary database instance, a separate database instance, Amazon Redshift cluster within the same or different Amazon Virtual Private Cloud (Amazon VPC), or even on-premises PostgreSQL-based database servers that can be reached using VPC peering or AWS managed VPN or AWS Direct Connect. The remote PostgreSQL-based database instance or Amazon Redshift cluster can be in a same or different AWS account with established network connectivity in place. For more information, see How can I troubleshoot connectivity to an Amazon RDS DB instance that uses a public or private subnet of a VPC?
Tables that you query in the foreign (remote) server. For this post, we create one table in source database pdfdwsource and one table in target database pdfdwtarget.

For pdfdwsource, create a salary table with the columns emailid and salary with dummy content. See the following code:

[ec2-user@ip-172-31-15-24 ~]$ psql -h pgfdwsource.xxxx.us-west-2.rds.amazonaws.com -d pgfdwsource -U pgfdwsource -w
SET
Expanded display is on.
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help
pgfdwsource=> create table salary(emailid varchar, salary int);
CREATE TABLE
pgfdwsource=> insert into salary values('tom@customer1.com',10000);
INSERT 0 1
pgfdwsource=> insert into salary values('harry@customer1.com',100000);
INSERT 0 1
pgfdwsource=> insert into salary values('jeff@customer1.com',1000000);
INSERT 0 1
pgfdwsource=> \d salary
          Table "public.salary"
 Column  |       Type        | Modifiers 
---------+-------------------+-----------
 emailid | character varying | 
 salary  | integer           | 

pgfdwsource=> select * from salary;
       emailid       | salary  
---------------------+---------
 tom@customer1.com   |   10000
 harry@customer1.com |  100000
 jeff@customer1.com  | 1000000

For pdfdwtarget, create a table corresponding to customer1 with the columns id, name, emailid, projectname, and contactnumber with dummy content. See the following code:

[ec2-user@ip-172-31-15-24 ~]$ psql -h pgfdwtarget.xxxx.us-west-2.rds.amazonaws.com -d pgfdwtarget -U pgfdwtarget
SET
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

pgfdwtarget=> create table customer1( id int, name varchar, emailid varchar, projectname varchar, contactnumber bigint);
CREATE TABLE
pgfdwtarget=> insert into customer1 values(1,'Tom','tom@customer1.com','Customer1.migration',328909432);
INSERT 0 1
pgfdwtarget=> insert into customer1 values(2,'Harry','harry@customer1.com','Customer1.etl',2328909432);
INSERT 0 1
pgfdwtarget=> insert into customer1 values(3,'Jeff','jeff@customer1.com','Customer1.infra',328909432);
INSERT 0 1
pgfdwtarget=> \d customer1
           Table "public.customer1"
    Column     |       Type        | Modifiers 
---------------+-------------------+-----------
 id            | integer           | 
 name          | character varying | 
 emailid       | character varying | 
 projectname   | character varying | 
 contactnumber | bigint            | 

pgfdwtarget=> select * from customer1;
 id | name  |       emailid       |     projectname     | contactnumber 
----+-------+---------------------+---------------------+---------------
  1 | Tom   | tom@customer1.com   | Customer1.migration |     328909432
  2 | Harry | harry@customer1.com | Customer1.etl       |    2328909432
  3 | Jeff  | jeff@customer1.com  | Customer1.infra     |     328909432
(3 rows)

In both tables, the column emailid is common and can be used to derive insights into salary corresponding to customer1 employee information.

Now let’s see this in action.

Configuring your source instance, foreign server, user mapping, and foreign table

All the steps in this section are performed after logging in with the role pgfdwsource into the primary database instance pgfdwsource.

Connecting to the source instance

You connect to your source instance with a master user or via a normal user that has rds_superuser permissions. You can use client tools like psql or pgAdmin. For more information, see Connecting to a DB instance running the PostgreSQL database engine. Create the extension postgres_fdw with CREATE EXTENSION:

pgfdwsource=> \conninfo        
You are connected to database "pgfdwsource" as user "pgfdwsource" on host "pgfdwsource.xxxx.us-west-2.rds.amazonaws.com" at port "5432".
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
pgfdwsource=> \du pgfdwsource 
                         List of roles
  Role name  |          Attributes           |    Member of    
-------------+-------------------------------+-----------------
 pgfdwsource | Create role, Create DB       +| {rds_superuser}
             | Password valid until infinity | 
					
pgfdwsource=> create extension postgres_fdw;
CREATE EXTENSION

pgfdwsource=> \dx
                               List of installed extensions
     Name     | Version |   Schema   |                    Description                     
--------------+---------+------------+----------------------------------------------------
 plpgsql      | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgres_fdw | 1.0     | public     | foreign-data wrapper for remote PostgreSQL servers
(2 rows)

Creating a foreign server

We use CREATE SERVER to create our foreign (remote) server mapping as the PostgreSQL-based server from which we pull the data. A foreign server typically encapsulates connection information that an FDW uses to access an external data resource. It uses the same connection options as libpq. SSLMODE ‘require’ makes sure that the data is encrypted in transit. See the following code:

pgfdwsource=> create server my_fdw_target Foreign Data Wrapper postgres_fdw OPTIONS (DBNAME 'pgfdwtarget', HOST 'pgfdwtarget.xxxx.us-west-2.rds.amazonaws.com', SSLMODE 'require');
CREATE SERVER

If the master user (or user with rds_superuser) is creating the foreign server, then other users need usage access to this server. See the following code:

GRANT USAGE ON FOREIGN SERVER my_fdw_target TO normal_user;

If you want to grant access to normal users to create a foreign server, you need to grant usage on the extension itself from the master user:

GRANT USAGE ON FOREIGN DATA WRAPPER postgres_fdw TO normal_user;

Creating user mapping

CREATE USER MAPPING defines a mapping of a user to a foreign server. In the following code, when the user pgfdwsource is connecting to the server my_fdw_target (remote database), they use the login information provided in the user mapping. Therefore, postgres_fdw uses the user pgfdwtarget to connect to the remote database.

pgfdwsource=> CREATE USER MAPPING FOR pgfdwsource SERVER my_fdw_target OPTIONS (user 'pgfdwtarget', password 'test1234');
CREATE USER MAPPING

Creating a foreign table

CREATE FOREIGN TABLE creates a foreign table in the source database that is mapped to the table in the foreign server. It creates a table reference in the local database of a table residing in the remote database. The name of this table reference is defined as the foreign table. For example, in the following code, foreign table customer1_fdw points to the table customer1, which resides in the remote database. Now, whenever you need to query data from table customer1 in the remote database, the user in local database queries using the foreign table customer1_fdw.

pgfdwsource=> create foreign table customer1_fdw( id int, name varchar,
emailid varchar, projectname varchar, contactnumber bigint) server
my_fdw_target OPTIONS( TABLE_NAME 'customer1');
CREATE FOREIGN TABLE

Furthermore, the name of the foreign table or table reference can be different from the name of the table in the remote database. This flexibility can be useful when you want to mask the name of the table in remote database.

Running federated queries with a PostgreSQL FDW

FDWs (postgres_fdw) allow you to query (SELECT, INSERT, UPDATE, and DELETE) a remote PostgreSQL-based database (Amazon RDS for PostgreSQL, Aurora PostgreSQL, or Amazon Redshift). For this post, we focus on the SELECT query functionality.

With the preceding steps complete, you can now query the table customer1, which resides in the Amazon RDS for PostgreSQL or Aurora PostgreSQL instance pgfdwtarget.xxxx.us-west-2.rds.amazonaws.com in the database pgfdwtarget, from within the instance pgfdwsource.xxxx.us-west-2.rds.amazonaws.com without the need for replication or any kind of pipeline. You can also run join queries to perform aggregates (and more) to derive insights. See the following code:

pgfdwsource=> \conninfo
You are connected to database "pgfdwsource" as user "pgfdwsource" on host "pgfdwsource.xxxx.us-west-2.rds.amazonaws.com" at port "5432".
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)

-- running select * on foreign table customer1 mapped to name customer1_fdw in source instance.

pgfdwsource=> select * from customer1_fdw;
 id | name  |       emailid       |     projectname     | contactnumber 
----+-------+---------------------+---------------------+---------------
  1 | Tom   | tom@customer1.com   | Customer1.migration |     328909432
  2 | Harry | harry@customer1.com | Customer1.etl       |    2328909432
  3 | Jeff  | jeff@customer1.com  | Customer1.infra     |     328909432
(3 rows)

-- running select * on foreign table customer1 mapped to name customer1_fdw in source instance while joining it with salary table to derive salary insights.

pgfdwsource=> select * from salary s inner join customer1_fdw c on s.emailid=c.emailid;
       emailid       | salary  | id | name  |       emailid       |     projectname     | contactnumber 
---------------------+---------+----+-------+---------------------+---------------------+---------------
 tom@customer1.com   |   10000 |  1 | Tom   | tom@customer1.com   | Customer1.migration |     328909432
 harry@customer1.com |  100000 |  2 | Harry | harry@customer1.com | Customer1.etl       |    2328909432
 jeff@customer1.com  | 1000000 |  3 | Jeff  | jeff@customer1.com  | Customer1.infra     |     328909432
(3 rows)

In the following code, only the data for a particular project is queried. The project name details are stored in the remote server. In the explain plan of the query, the WHERE clause has been pushed down to the remote server. This makes sure that only the data required to realize the result of the query is retrieved from the remote server and not the entire table. This is one of the optimizations that postgres_fdw offers.

pgfdwsource=> select * from salary s inner join customer1_fdw c on s.emailid=c.emailid where c.projectname like 'Customer1.etl';

       emailid       | salary | id | name  |       emailid       |  projectname  | contactnumber 
---------------------+--------+----+-------+---------------------+---------------+---------------
 harry@customer1.com | 100000 |  2 | Harry | harry@customer1.com | Customer1.etl |    2328909432
(1 row)

pgfdwsource=> explain verbose select * from salary s inner join customer1_fdw c on s.emailid=c.emailid where c.projectname like 'Customer1.etl';
                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=100.00..118.98 rows=1 width=144)
   Output: s.emailid, s.salary, c.id, c.name, c.emailid, c.projectname, c.contactnumber
   Join Filter: ((s.emailid)::text = (c.emailid)::text)
   ->  Seq Scan on public.salary s  (cost=0.00..1.03 rows=3 width=36)
         Output: s.emailid, s.salary
   ->  Materialize  (cost=100.00..117.83 rows=3 width=108)
         Output: c.id, c.name, c.emailid, c.projectname, c.contactnumber
         ->  Foreign Scan on public.customer1_fdw c  (cost=100.00..117.81 rows=3 width=108)
               Output: c.id, c.name, c.emailid, c.projectname, c.contactnumber
               Remote SQL: SELECT id, name, emailid, projectname, contactnumber FROM public.customer1 WHERE ((projectname ~~ 'Customer1.etl'::text))
(10 rows)

You can pull data from a different database within the same instance or from a different instance, as seen in the preceding examples. You can also pull data from your PostgreSQL-based data warehouse like Amazon Redshift for a quick query on archived data. Because the data is pulled in real time from a foreign database, you can achieve this without needing to set up any kind of replication or reserving space in the source database.

Cleaning up

You can follow the steps in this section to clean up the resources created in previous steps:

Drop the foreign table customer1_fdw:

pgfdwsource=> drop foreign table customer1_fdw;
DROP FOREIGN TABLE

Drop the user mapping that maps the database user to the foreign server user:

pgfdwsource=> drop user mapping for pgfdwsource SERVER my_fdw_target;
DROP USER MAPPING

Drop the foreign server definition that provides the local PostgreSQL server with the foreign (remote) server connection information:
```
pgfdwsource=> drop server my_fdw_target;
DROP SERVER
```

Drop the extension postgres_fdw:

pgfdwsource=> drop extension postgres_fdw;
DROP EXTENSION

Common error messages

As is the case with any other setup, you may encounter some common issues while setting up postgres_fdw, like connection problems or permission issues. The following section lists some common error messages that you may encounter and also suggests common culprits for those errors, which should point you in the right direction to mitigate them.

Network connectivity – The following error message can appear when trying to query the foreign table customer1_fdw from the source database. A connection timed out error message appears when there is no network connectivity between the source and target databases. Common culprits are incorrectly configured security groups, Network Access Control List (NACL), or route table.
```
pgfdwsource=> select * from customer1_fdw;
ERROR:  could not connect to server "my_fdw_target"
DETAIL:  could not connect to server: Connection timed out
	Is the server running on host "pgfdwtarget.xxxx.us-west-2.rds.amazonaws.com" (172.31.30.166) and accepting
	TCP/IP connections on port 5432?
```

DNS resolution – The following error message may appear when trying to query the foreign table customer1_fdw from the source database. This is a classic case of DNS resolution failure (issues in /etc/resolv.conf) for the endpoint. The culprit is the incorrect endpoint entered while creating the foreign server.
```
pgfdwsource=> select * from customer1_fdw;
ERROR:  could not connect to server "my_fdw_target"
DETAIL:  could not translate host name "pgfdwtarget.xx.us-west-2.rds.amazonaws.com" to address: Name or service not known
```

Table not present – In this scenario, the table notpresent was not in the target database instance. Therefore, the error message ERROR: relation "public.notpresent" does not exist The common culprit is a non-existent or dropped table.

pgfdwsource=> create foreign table customer2_fdw( id int, name varchar, emailid varchar, projectname varchar, contactnumber bigint) server my_fdw_target OPTIONS( TABLE_NAME 'notpresent');
CREATE FOREIGN TABLE
pgfdwsource=> select * from customer2_fdw; 
ERROR:  relation "public.notpresent" does not exist

CONTEXT:  remote SQL command: SELECT id, name, emailid, projectname, contactnumber FROM public.notpresent

User permission – In this scenario, a permission denied error occurs when setting up user mapping, saying that you don’t have permission over the schema test. Common culprits are lack of select permission on the schema or table for the user specified in user mappings.

pgfdwsource=> create foreign table customer1_fdw( id int, name varchar, emailid varchar, projectname varchar, contactnumber bigint) server my_fdw_target OPTIONS( SCHEMA_NAME 'test', TABLE_NAME 'newtable');
CREATE FOREIGN TABLE
pgfdwsource=> select * from customer1_fdw;

ERROR:  permission denied for schema test

Additional capabilities

In addition to SELECT, postgres_fdw allows you to run UPDATE, INSERT, and DELETE on a foreign table. The following code updates the foreign table customer1_fdw (note Foreign Update in the query plan):

pgfdwsource=> explain update customer1_fdw set name='Jeff' where name='Jerry';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Update on customer1_fdw  (cost=100.00..119.73 rows=4 width=114)
   ->  Foreign Update on customer1_fdw  (cost=100.00..119.73 rows=4 width=114)
(2 rows)

pgfdwsource=> select * from customer1_fdw;
 id | name  |       emailid       |     projectname     | contactnumber 
----+-------+---------------------+---------------------+---------------
  1 | Tom   | tom@customer1.com   | Customer1.migration |     328909432
  2 | Harry | harry@customer1.com | Customer1.etl       |    2328909432
  3 | Jerry | jeff@customer1.com  | Customer1.infra     |     328909432
(3 rows)

pgfdwsource=> update customer1_fdw set name='Jeff' where name='Jerry';
UPDATE 1
pgfdwsource=> select * from customer1_fdw;
 id | name  |       emailid       |     projectname     | contactnumber 
----+-------+---------------------+---------------------+---------------
  1 | Tom   | tom@customer1.com   | Customer1.migration |     328909432
  2 | Harry | harry@customer1.com | Customer1.etl       |    2328909432
  3 | Jeff  | jeff@customer1.com  | Customer1.infra     |     328909432
(3 rows)

postgres_fdw come with some small optimizations, such as FDW maintaining the table schema locally. As a result, rather than doing a SELECT * to pull all the data from a table in remote server, the FDW sends query WHERE clauses to the remote server to run, and doesn’t retrieve table columns that aren’t needed for the current query. It can also push down joins, aggregates, sorts, and more.

postgres_fdw is quite powerful; this post covers just the basic functionality. For more information about these optimizations, see Appendix F. Additional Supplied Modules.

Summary

This post illustrated what you can achieve with federated query capability and postgres_fdw in Amazon RDS for PostgreSQL or Aurora PostgreSQL to query data from a PostgreSQL-based remote server, both managed and self-managed on Amazon EC2 or on premises.

We encourage you to use the postgres_fdw extension of community PostgreSQL in your Amazon RDS for PostgreSQL and Aurora PostgreSQL environments to query data from multiple databases in remote PostgreSQL-based servers like Amazon Redshift, Amazon RDS for PostgreSQL, Aurora PostgreSQL, self-managed PostgreSQL servers on Amazon EC2 or on premises, or from the same server in real time without the need for setting up ongoing replication or a data pipeline.

In our next post, we’ll show you how to work with other FDW extensions that let you use your Amazon RDS for PostgreSQL database instance to query MySQL-compatible engines, Microsoft SQL Server and Sybase engines that support the tabular data stream (TDS) protocol, and Oracle databases

About the Authors

Vibhu Pareek is a Solutions Architect at AWS. He joined AWS in 2016 and specializes in providing guidance on cloud adoption through the implementation of well architected, repeatable patterns and solutions that drive customer innovation. He has keen interest in open source databases like PostgreSQL. In his free time, you’d find him spending hours on automobile reviews, enjoying a game of football or engaged in pretentious fancy cooking.

Gaurav Sahi is a Principal Solutions Architect based out of Bangalore, India and has rich experience across embedded, telecom, broadcast & digital video and cloud technologies. In addition to helping customers in their transformation journey to cloud, his current passion is to explore and learn AI/ML services. He likes to travel, enjoy local food and spend time with family and friends in his free time.

AWS Database Blog