AWS Database Blog

How to set up a single pgpool endpoint for reads and writes with Amazon Aurora PostgreSQL

Amazon Aurora provides endpoints for the primary DB instance (the cluster endpoint) and for Read Replicas (the reader endpoint). Aurora updates the cluster endpoint automatically so that it always points to the primary instance. The reader endpoint load balances read operations across all available Read Replicas.

Amazon Aurora Replicas typically have less than 100 ms of replication lag. If your application can tolerate reads that lag the primary by that much, it can use both the cluster and reader endpoints to take advantage of a horizontally scaled database (Figure 1).

Figure 1: Current architecture with the application deciding which endpoint to use

However, managing two database endpoints, one for reads and one for writes, adds complexity to the application. In this post, I show how to use pgpool to provide a single PostgreSQL-compatible Aurora endpoint that automatically directs write traffic to the cluster endpoint and read traffic to the reader endpoint (Figure 2).

Figure 2: Proposed solution based on pgpool middleware

Architecture

Pgpool is a BSD-licensed middleware that sits between a PostgreSQL database and the database clients. In this example, we deploy it using the architecture in Figure 3.

Figure 3: Using pgpool as middleware that provides a single endpoint for a PostgreSQL-compatible Amazon Aurora cluster

The Amazon Aurora cluster consists of one primary instance and two Aurora Read Replicas that are deployed across two Availability Zones and two private subnets. The cluster has a security group that allows ingress only from the pgpool instances.

Pgpool is deployed in an Auto Scaling group that maintains a single active instance, so a failed instance is replaced automatically. Like the database, pgpool is isolated in private subnets, with a security group that permits access only from allowed Classless Inter-Domain Routing (CIDR) blocks.

The access subnets host a Network Load Balancer that provides a consistent endpoint for pgpool, so your database endpoint doesn't change if the pgpool instance fails and the Auto Scaling group creates a new one.

Deploying with AWS CloudFormation

The AWS CloudFormation templates that are used in this example are located in this GitHub repository. The example uses several nested AWS CloudFormation templates to deploy the virtual private cloud (VPC) infrastructure, the security groups, the Aurora cluster, and the pgpool middleware. Using nested stacks lets you break up a large stack into several reusable components. If you aren’t familiar with AWS CloudFormation, review the AWS CloudFormation documentation.
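
Each nested template is wired into a parent template as an AWS::CloudFormation::Stack resource, and outputs from one nested stack can be passed as parameters to another. A minimal sketch of the pattern (the resource names, S3 URLs, and parameter names here are placeholders, not the ones used in the repository):

```yaml
Resources:
  VpcStack:
    Type: "AWS::CloudFormation::Stack"
    Properties:
      TemplateURL: https://s3.amazonaws.com/my-bucket/vpc.yaml
  AuroraStack:
    Type: "AWS::CloudFormation::Stack"
    Properties:
      TemplateURL: https://s3.amazonaws.com/my-bucket/aurora.yaml
      Parameters:
        # Feed an output of the VPC stack into the Aurora stack
        VpcId: !GetAtt VpcStack.Outputs.VpcId
```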

For complete instructions on deploying the templates, see the README file on GitHub. The following sections contain some of the highlights.

Creating the Amazon Aurora cluster

The following AWS CloudFormation snippet shows how to create a three-node Amazon Aurora cluster once the VPC infrastructure and security groups are in place. Amazon Aurora designates one of the three nodes as the primary and the other two as Read Replicas.

  DBAuroraCluster:
    Type: "AWS::RDS::DBCluster"
    Properties:
      DatabaseName: !Ref DatabaseName
      Engine: aurora-postgresql
      MasterUsername: !Ref DatabaseUser
      MasterUserPassword: !Ref DatabasePassword
      VpcSecurityGroupIds:
        - !Ref DBFirewall
      Tags:
        - Key: Project
          Value: !Ref ProjectTag
  DBAuroraOne:
    Type: "AWS::RDS::DBInstance"
    Properties:
      DBClusterIdentifier: !Ref DBAuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: !Ref DbInstanceSize
      Tags:
        - Key: Project
          Value: !Ref ProjectTag
  DBAuroraTwo:
    Type: "AWS::RDS::DBInstance"
    Properties:
      DBClusterIdentifier: !Ref DBAuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: !Ref DbInstanceSize
      Tags:
        - Key: Project
          Value: !Ref ProjectTag
  DBAuroraThree:
    Type: "AWS::RDS::DBInstance"
    Properties:
      DBClusterIdentifier: !Ref DBAuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: !Ref DbInstanceSize
      Tags:
        - Key: Project
          Value: !Ref ProjectTag
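
When the cluster is up, pgpool needs the two endpoints that Amazon Aurora assigns. The cluster template can surface them as outputs; a sketch (the output names are arbitrary, while Endpoint.Address and ReadEndpoint.Address are the attributes that AWS::RDS::DBCluster exposes):

```yaml
Outputs:
  AuroraClusterEndpoint:
    Description: Cluster (writer) endpoint, used as pgpool backend 0
    Value: !GetAtt DBAuroraCluster.Endpoint.Address
  AuroraReaderEndpoint:
    Description: Reader endpoint, used as pgpool backend 1
    Value: !GetAtt DBAuroraCluster.ReadEndpoint.Address
```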

Deploying pgpool

The blog post Using pgpool and Amazon ElastiCache for query caching with Amazon Redshift is a good reference for deploying pgpool on AWS. In addition, the pgpool documentation contains a section that is specifically for Amazon Aurora.

In the AWS CloudFormation template that deploys pgpool, you'll set up the Network Load Balancer and an Auto Scaling group. The launch configuration for the Auto Scaling group deploys and configures pgpool using the AWS CloudFormation cfn-init tool.

First, you need to install a few packages and unpack the pgpool installation file:

yum groupinstall -y "Development Tools" && yum install -y postgresql-devel
wget "https://www.pgpool.net/download.php?f=pgpool-II-3.7.2.tar.gz" -O /tmp/pgpool-II-3.7.2.tar.gz
tar zxf /tmp/pgpool-II-3.7.2.tar.gz -C /opt

Next, build pgpool, and create the log and PID directories:

cd /opt/pgpool-II-3.7.2
./configure && make && make install
mkdir -p /var/run/pgpool && mkdir -p /var/log/pgpool && chmod -R 777 /var/run/pgpool && chmod -R 777 /var/log/pgpool

Amazon Aurora uses MD5 authentication, so you need to register your master database user and password in pgpool's local authentication file:

/usr/local/bin/pg_md5 -m -u ${DatabaseUser} ${DatabasePassword}
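
The entry this writes to pool_passwd has the form username:md5<hash>. Assuming it follows PostgreSQL's MD5 scheme (the MD5 of the password concatenated with the username, prefixed with md5, which is what the MD5 authentication protocol requires), you can reproduce the stored format with stock tools. The credentials below are placeholders, not values from this post:

```shell
# Reproduce the pool_passwd entry format: username:md5<md5(password || username)>
# (dbadmin/secret123 are made-up placeholder credentials)
user=dbadmin
pass=secret123
hash="md5$(printf '%s%s' "$pass" "$user" | md5sum | cut -d' ' -f1)"
echo "$user:$hash"
```

The line pg_md5 writes for your own master user should match this shape.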

You also need to configure MD5 authentication in /usr/local/etc/pool_hba.conf.

host    all         all         0.0.0.0/0               md5

Finally, register pgpool as a service, start it, and use chkconfig so that it starts automatically at boot.

Configuring pgpool

Most of the pgpool configuration is done in /usr/local/etc/pgpool.conf. The following table shows the important settings.

Setting                     Value                               Notes
listen_addresses            '*'                                 Allow incoming connections on all interfaces.
backend_hostname0           The Amazon Aurora cluster endpoint  Write traffic goes to node 0.
backend_port0               5432                                Amazon Aurora PostgreSQL uses port 5432 by default.
backend_flag0               ALWAYS_MASTER                       Don't let pgpool try to fail over.
backend_hostname1           The Amazon Aurora reader endpoint   Read traffic is load balanced to node 1.
backend_port1               5432                                Amazon Aurora PostgreSQL uses port 5432 by default.
enable_pool_hba             on                                  Amazon Aurora requires this for authentication.
pool_passwd                 'pool_passwd'                       Set the location of the authentication file.
ssl                         on                                  Amazon Aurora uses Secure Sockets Layer (SSL) connections.
replication_mode            off                                 Pgpool's native replication is not used.
load_balance_mode           on                                  Distribute read queries across backends.
master_slave_mode           on                                  Treat node 0 as the writer and node 1 as a reader.
master_slave_sub_mode       stream                              The backends use streaming replication.
sr_check_period             0                                   Disable pgpool's streaming replication checks.
health_check_*                                                  Configure with the master account credentials.
fail_over_on_backend_error  off                                 Let Amazon Aurora, not pgpool, handle failover.
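
Taken together, the relevant portion of /usr/local/etc/pgpool.conf looks roughly like the following. The two hostnames are placeholders for your cluster's endpoints, 9999 is pgpool's default listening port, and Aurora PostgreSQL itself listens on port 5432 by default:

```ini
listen_addresses = '*'
port = 9999
backend_hostname0 = 'mycluster.cluster-abc123.us-east-1.rds.amazonaws.com'
backend_port0 = 5432
backend_flag0 = 'ALWAYS_MASTER'
backend_hostname1 = 'mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com'
backend_port1 = 5432
enable_pool_hba = on
pool_passwd = 'pool_passwd'
ssl = on
replication_mode = off
load_balance_mode = on
master_slave_mode = on
master_slave_sub_mode = 'stream'
sr_check_period = 0
fail_over_on_backend_error = off
# health_check_* and backend_weight* settings are omitted in this sketch
```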

Testing the configuration

Now you can issue a few SQL statements and confirm that pgpool is directing traffic as expected. Start by creating a table and inserting a few rows:

CREATE TABLE distributors (
    did     integer PRIMARY KEY,
    name    varchar(40)
);

INSERT INTO distributors (did, name) VALUES (5, 'XYZ Widgets');
INSERT INTO distributors (did, name) VALUES (6, 'ABC Widgets');
INSERT INTO distributors (did, name) VALUES (7, 'QRS Widgets');

Checking in /var/log/pgpool.log, you can see that these statements are directed to the Amazon Aurora primary (node 0):

DB node id: 0 backend pid: 18703 statement: INSERT INTO distributors (did, name) VALUES (7, 'QRS Widgets')

Next, issue a query:

select * from distributors;

Checking in /var/log/pgpool.log, you can see that this statement is directed to the Amazon Aurora reader endpoint (node 1):

DB node id: 1 backend pid: 19793 statement: select * from distributors
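
If a particular read must not be load balanced, for example because it has to see a write that may not have replicated yet, pgpool lets you pin a SELECT to the primary by prefixing it with a special comment:

```sql
/*NO LOAD BALANCE*/ SELECT * FROM distributors;
```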

As a final test, confirm that pgpool handles SELECT statements embedded in a transaction:

BEGIN;
    INSERT INTO distributors (did, name) VALUES (10, '10');
    INSERT INTO distributors (did, name) VALUES (11, '11');
    SELECT * from distributors where did > 9;
COMMIT;

These statements should all go to the primary node, because the embedded SELECT queries rows that were written earlier in the same transaction. The log again shows that pgpool behaves as expected.

DB node id: 0 backend pid: 22614 statement: INSERT INTO distributors (did, name) VALUES (10, '10')
DB node id: 0 backend pid: 22614 statement: INSERT INTO distributors (did, name) VALUES (11, '11')
DB node id: 0 backend pid: 22614 statement: SELECT * from distributors where did > 9

Of course, you should run a full regression test with your own database clients before using this configuration in production. If the scalability of a single pgpool instance concerns you, consider a clustered pgpool deployment in which the Auto Scaling group allows multiple instances and scales based on load.

Strive for simplicity

In this post, we saw how to use pgpool to provide a single endpoint for an Amazon Aurora cluster that automatically directs read traffic to the reader endpoint. You can use this technique to simplify how your application handles connections to Amazon Aurora.

If you need more advanced routing options, such as the ability to route database connections based on custom logic, check out a solution called pgbouncer-rr. With custom logic, you can handle more advanced routing such as sending a small percentage of queries to a newer schema for canary testing.

In the future, keep an eye on Amazon Aurora Serverless, which scales the database backend automatically without requiring the configuration of replicas. This option is still in preview and does not yet support PostgreSQL compatibility.


About the Author

Randy DeFauw is a principal solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.