AWS Big Data Blog
Using pgpool and Amazon ElastiCache for Query Caching with Amazon Redshift
NOTE: Amazon Redshift now supports Result Caching for sub-second response for repeat queries. For more information, please review this “What’s New” post.
————————–
Felipe Garcia and Hugo Rozestraten are Solutions Architects for Amazon Web Services
In this blog post, we’ll use a real customer scenario to show you how to create a caching layer in front of Amazon Redshift using pgpool and Amazon ElastiCache.
Almost every application, no matter how simple, uses some kind of database. Because SQL queries are so pervasive, a lack of coordination between end users or applications often results in the same query being executed over and over. This redundancy wastes resources that could be allocated to other tasks.
For example, BI tools and applications consuming data from Amazon Redshift are likely to issue common queries. You can cache some of them to improve the end-user experience and reduce contention in the database. In fact, when you use good data modeling and classification policies, you can even save some money by reducing your cluster size.
What is caching?
In computing, a cache is a hardware or software component that stores data so future requests for that data can be served faster. The data stored in a cache might be the result of an earlier computation or the duplicate of data stored elsewhere. A cache hit occurs when the requested data is found in a cache; a cache miss occurs when it is not. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store. The more requests served from the cache, the faster the system performs.
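To make the hit/miss distinction concrete, here is a toy sketch in Bash (illustration only, not part of the setup; the `genes` table and the `run_query` helper are hypothetical). Results are stored in an associative array keyed by the query text, analogous to how pgpool keys cached results:

```shell
#!/usr/bin/env bash
# Toy cache (illustration only): results live in an associative array
# keyed by the query text. A real system would query the database on a miss.
declare -A CACHE

run_query() {
  local query="$1"
  if [[ -n "${CACHE[$query]+x}" ]]; then
    echo "cache hit: ${CACHE[$query]}"    # served from the cache
  else
    local result="result-of(${query})"    # stand-in for a slow database call
    CACHE[$query]="$result"
    echo "cache miss: $result"            # computed once, then stored
  fi
}

run_query "select count(*) from genes"    # first call: miss
run_query "select count(*) from genes"    # second call: hit
```

The second call never touches the (simulated) database, which is exactly the saving a caching layer provides.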
Customer case: laboratory analysis
In a clinical analysis laboratory, a small team of 6 to 10 scientists (geneticists, doctors, and biologists) query around 2 million lines of genetic code looking for specific genetic modifications. The genes next to a modified gene are also of interest because they can confirm a disease or disorder.
The scientists simultaneously analyze one DNA sample and then hold a meeting to discuss their findings and reach a conclusion.
A Node.js web application contains the logic and issues the queries against Amazon Redshift. Using the web application connected directly to Amazon Redshift, the team of scientists experienced latencies of around 10 seconds. When the architecture was modified to use pgpool, the scientists were able to run the same queries in less than 1 second (in other words, 10 times faster).
Introducing pgpool
Pgpool is software that sits between your database clients and your database server(s). It acts as a reverse proxy, receiving connections from clients and forwarding them to the database servers. Originally written for PostgreSQL, pgpool offers other interesting features besides caching: connection pooling, replication, load balancing, and queuing of excess connections. Although we didn’t explore these features, we suspect they can be used with Amazon Redshift because of the compatibility between PostgreSQL and Amazon Redshift.
Pgpool can run in an Amazon EC2 instance or in your on-premises environment. For example, you might have a single EC2 instance for dev and test and a fleet of EC2 instances with Elastic Load Balancing and Auto Scaling in production.
The clinical analysis laboratory in our use case used the psql command line client and a Node.js application to issue queries against Amazon Redshift, and both worked as expected. However, we strongly recommend that you test pgpool with your PostgreSQL client before making any changes to your architecture.
Taking a look at the pgpool caching feature
The pgpool caching feature is disabled by default. It can be configured in two ways:
- On-memory (shmem)
- This is the default method if you enable the cache and make no other changes. It’s slightly faster than Memcached and easier to configure and maintain. On the other hand, in high-availability scenarios you tend to waste memory and database processing, because the cache is kept per server: each query must be processed by the database at least once for every server before it is cached there. For example, in a pgpool cluster with four servers, if you expect a 20 GB cache, you must provision four m3.xlarge instances and pay for the cache four times over.
- Memcached (memcached)
- In this method, the cache is maintained outside the server. The advantage is that the caching storage layer (Memcached) is decoupled from the cache processing layer (pgpool). This means you won’t waste server memory or database processing time, because each query is processed only once and its result is cached externally in Memcached.
- You can run Memcached anywhere, but we suggest you use Amazon ElastiCache with Memcached. Amazon ElastiCache detects and replaces failed nodes automatically, thereby reducing the overhead associated with self-managed infrastructures. It provides a resilient system that mitigates the risk of overloaded databases, which slow website and application load times.
Caching queries with pgpool
The following flow chart shows how query caching works with pgpool:
The following diagram shows the minimum architecture required to install and configure pgpool for a dev/test environment:
The following diagram shows the recommended minimum architecture for a production environment:
Prerequisites
For the steps in this post, we will use the AWS Command Line Interface (AWS CLI). If you want to follow along on your Mac, Linux, or Microsoft Windows machine, make sure you have the AWS CLI installed. To learn how, see Installing the AWS Command Line Interface.
Steps for installing and configuring pgpool
1. Setting up the variables:
IMAGEID=ami-c481fad3
KEYNAME=<set your key name here>
The IMAGEID variable is set to use an Amazon Linux AMI from the US East (N. Virginia) region.
Set the KEYNAME variable to the name of the EC2 key pair you will use. This key pair must have been created in the US East (N. Virginia) region.
If you will use a region other than US East (N. Virginia), update IMAGEID and KEYNAME accordingly.
2. Creating the EC2 instance:
aws ec2 create-security-group --group-name PgPoolSecurityGroup --description "Security group to allow access to pgpool"

MYIP=$(curl eth0.me -s | awk '{print $1"/32"}')

aws ec2 authorize-security-group-ingress --group-name PgPoolSecurityGroup --protocol tcp --port 5432 --cidr $MYIP
aws ec2 authorize-security-group-ingress --group-name PgPoolSecurityGroup --protocol tcp --port 22 --cidr $MYIP

INSTANCEID=$(aws ec2 run-instances \
  --image-id $IMAGEID \
  --security-groups PgPoolSecurityGroup \
  --key-name $KEYNAME \
  --instance-type m3.medium \
  --query 'Instances[0].InstanceId' \
  | sed "s/\"//g")

aws ec2 wait instance-status-ok --instance-ids $INSTANCEID

INSTANCEIP=$(aws ec2 describe-instances \
  --filters "Name=instance-id,Values=$INSTANCEID" \
  --query "Reservations[0].Instances[0].PublicIpAddress" \
  | sed "s/\"//g")
3. Creating the Amazon ElastiCache cluster:
aws ec2 create-security-group --group-name MemcachedSecurityGroup --description "Security group to allow access to Memcached"

aws ec2 authorize-security-group-ingress --group-name MemcachedSecurityGroup --protocol tcp --port 11211 --source-group PgPoolSecurityGroup

MEMCACHEDSECURITYGROUPID=$(aws ec2 describe-security-groups \
  --group-names MemcachedSecurityGroup \
  --query 'SecurityGroups[0].GroupId' \
  | sed "s/\"//g")

aws elasticache create-cache-cluster \
  --cache-cluster-id PgPoolCache \
  --cache-node-type cache.m3.medium \
  --num-cache-nodes 1 \
  --engine memcached \
  --engine-version 1.4.5 \
  --security-group-ids $MEMCACHEDSECURITYGROUPID

aws elasticache wait cache-cluster-available --cache-cluster-id PgPoolCache
4. Accessing the EC2 instance through SSH, and then updating and installing packages:
ssh -i <your pem file goes here> ec2-user@$INSTANCEIP
sudo yum update -y
sudo yum group install "Development Tools" -y
sudo yum install postgresql-devel libmemcached libmemcached-devel -y
5. Downloading the pgpool source code tarball:
curl -L -o pgpool-II-3.5.3.tar.gz http://www.pgpool.net/download.php?f=pgpool-II-3.5.3.tar.gz
6. Extracting and compiling the source:
tar xvzf pgpool-II-3.5.3.tar.gz
cd pgpool-II-3.5.3
./configure --with-memcached=/usr/include/libmemcached-1.0
make
sudo make install
7. Making a copy of the sample conf that comes with pgpool to create our own pgpool.conf:
sudo cp /usr/local/etc/pgpool.conf.sample /usr/local/etc/pgpool.conf
8. Editing pgpool.conf:
Using your editor of choice, open /usr/local/etc/pgpool.conf, and then find and set the following parameters:
- Set listen_addresses to *.
- Set port to 5432.
- Set backend_hostname0 to the endpoint address of your Amazon Redshift cluster.
- Set backend_port0 to 5439.
- Set memory_cache_enabled to on.
- Set memqcache_method to memcached.
- Set memqcache_memcached_host to your ElastiCache endpoint address.
- Set memqcache_memcached_port to your ElastiCache endpoint port.
- Set log_connections to on.
- Set log_per_node_statement to on.
- Set pool_passwd to ''.
The modified parameters in your config file should look like this:
listen_addresses = '*'
port = 5432
backend_hostname0 = '<your redshift endpoint goes here>'
backend_port0 = 5439
memory_cache_enabled = on
memqcache_method = 'memcached'
memqcache_memcached_host = '<your memcached endpoint goes here>'
memqcache_memcached_port = 11211
log_connections = on
log_per_node_statement = on
9. Setting up permissions:
sudo mkdir /var/run/pgpool
sudo chmod u+rw,g+rw,o+rw /var/run/pgpool
sudo mkdir /var/log/pgpool
sudo chmod u+rw,g+rw,o+rw /var/log/pgpool
10. Starting pgpool:
pgpool -n
pgpool is now listening on port 5432:
2016-06-21 16:04:15: pid 18689: LOG: Setting up socket for 0.0.0.0:5432
2016-06-21 16:04:15: pid 18689: LOG: Setting up socket for :::5432
2016-06-21 16:04:15: pid 18689: LOG: pgpool-II successfully started. version 3.5.3 (ekieboshi)
11. Testing the setup:
Now that pgpool is running, we will configure our Amazon Redshift client to point to the pgpool endpoint instead of the Amazon Redshift cluster endpoint. To get the endpoint address, you can use the console or the CLI to retrieve the public IP address of the EC2 instance or you can just print the value we stored in the $INSTANCEIP variable.
psql -h <pgpool endpoint address> -p 5432 -U <redshift username>
The first time we run the query, we see the following information in the pgpool log:
2016-06-21 17:36:33: pid 18689: LOG: DB node id: 0 backend pid: 25936 statement: select
s_acctbal,
s_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 5
and p_type like '%TIN'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
The first line in the log shows that the query is running directly on the Amazon Redshift cluster, so this is a cache miss. Executed against the database, the query took 6814.595 ms to return the results.
If we run this query again, with the same predicates, we see a different result in the logs:
2016-06-21 17:40:19: pid 18689: LOG: fetch from memory cache
2016-06-21 17:40:19: pid 18689: DETAIL: query result fetched from cache. statement:
select
s_acctbal,
s_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 5
and p_type like '%TIN'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'AFRICA'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
As the first two lines of the log show, we are now retrieving the results from the cache, so this is a cache hit. The difference is huge: the query took only 247.719 ms, roughly 27 times faster than in the previous scenario.
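For reference, the exact ratio of the two timings reported above can be checked with a one-line calculation:

```shell
# Ratio of the cache-miss timing to the cache-hit timing reported above
awk 'BEGIN { printf "%.1fx faster\n", 6814.595 / 247.719 }'
# prints "27.5x faster"
```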
Understanding pgpool caching behavior
Pgpool uses the text of your SELECT query as the key for the fetched results.
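One consequence worth noting: because the key is derived from the query text, two queries that differ only in whitespace or letter case are cached as separate entries. A small sketch of the idea (illustrative only; `key_for` is our stand-in, not a pgpool function):

```shell
# The cache key is derived from the query text, so textually different
# (but semantically identical) queries produce different keys.
key_for() { printf '%s' "$1" | md5sum | awk '{print $1}'; }

k1=$(key_for "SELECT * FROM supplier WHERE s_suppkey = 1;")
k2=$(key_for "select * from supplier where s_suppkey = 1;")
if [ "$k1" = "$k2" ]; then echo "same entry"; else echo "separate entries"; fi
# prints "separate entries"
```

Normalizing query text in your application (consistent casing and spacing) therefore improves the hit rate.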
Caching behavior and invalidation can be configured in a couple ways:
- Auto invalidation
- By default, memqcache_auto_cache_invalidation is set to on. When you update a table in Amazon Redshift through pgpool, the related cache entries in pgpool are invalidated.
- Expiration
- memqcache_expire defines, in seconds, how long a result should stay in the cache. The default value is 0, which means cached results never expire.
- Black list and white list
- white_memqcache_table_list
- Comma-separated list of tables that should be cached. Regular expressions are accepted.
- black_memqcache_table_list
- Comma-separated list of tables that should not be cached. Regular expressions are accepted.
- Bypassing Cache
- /* NO QUERY CACHE */
- If you add the comment /* NO QUERY CACHE */ to your query, the query bypasses the pgpool cache and fetches the result from the database.
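The expiration setting described above (memqcache_expire) can be pictured with a small sketch (illustrative only; the variable and function names are ours, not pgpool's):

```shell
#!/usr/bin/env bash
# Sketch of memqcache_expire semantics: entries older than EXPIRE seconds
# are treated as misses; EXPIRE=0 would mean entries never expire.
EXPIRE=1
declare -A VAL TS

cache_put() { VAL[$1]="$2"; TS[$1]=$(date +%s); }

cache_get() {
  local now=$(date +%s)
  if [[ -z "${VAL[$1]+x}" ]]; then
    echo "MISS"                                      # never cached
  elif (( EXPIRE != 0 && now - ${TS[$1]} >= EXPIRE )); then
    echo "MISS"                                      # cached, but expired
  else
    echo "${VAL[$1]}"                                # fresh entry
  fi
}

cache_put q1 "cached result"
cache_get q1          # fresh: prints the cached result
sleep 2
cache_get q1          # older than EXPIRE seconds: prints MISS
```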
If pgpool can’t reach the cache (because of name resolution or routing issues, for example), it falls back to the database endpoint and doesn’t use any cache.
Conclusion
It is easy to implement a caching solution using pgpool with Amazon Redshift and Amazon ElastiCache. This solution can significantly improve the end-user experience and alleviate the load on your cluster.
This post shows just one example of how pgpool and this caching architecture can help you. To learn more about the pgpool caching feature, see the pgpool documentation.
Happy querying (and caching, of course). If you have questions or suggestions, please leave a comment below.
Related
Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL